Friday, July 19th, 2024, was an unforgettable day, albeit for all the wrong reasons.
A global technical outage linked to the cybersecurity firm CrowdStrike grounded flights, crashed enterprise Windows-based systems, and underscored the fact that the dependencies within the world’s IT infrastructure are so fragile and critical that a small bug can bring it to its knees.
Although the event’s impacts echoed the effects of a widespread ransomware attack, CrowdStrike CEO George Kurtz was quick to reassure the public that the outage was unrelated to a security breach. In fact, the cause was almost trivial: a small bug in a content update. Regardless, the interruption highlighted the relationship between cybersecurity and global business…and not in a positive way.
Before we take a closer look, I think it’s important to call out the IT heroes at CrowdStrike (as well as those who work at the tens of thousands of affected organizations) who worked so hard and so quickly to restore operations and service. We see you, and we appreciate you.
“Ministry of No”
Reflecting on my nearly two decades of experience as a CSO and in other security leadership roles at British Telecom (BT), I have learned the lesson that “security mustn’t get in the way of the business.” I have no doubt that many of my fellow CISOs and CSOs have been receiving similar feedback (in harsh or worried tones) from their management and organizations since Friday, July 19.
Once perceived as obstacles to agility, CISOs have worked hard to promote the understanding of security as an enabler of the business, especially in our fully digitized enterprises. After all, business interruptions due to security incidents can be disastrous to reputation and customer trust. CISOs have long promoted the mantra of “security and resilience by design” to help CIOs, CTOs and boards understand that security risks, and the controls that mitigate them, must always be considered when planning business applications and services. In so doing, we have succeeded in changing the perception of the “Ministry of No” to the “Ministry of How.”
“Ministry of Slow”
Another security “public-relations problem” has been the painful transition from an environment where security tools and monitoring were bolted-on afterthoughts to one where they are inherent components of core applications and services. This “security retrofit” stage changed the perception of the CISO from the “Ministry of No” to the “Ministry of Slow.” Necessary security reviews and longer design periods, as well as the overhead of deploying security tooling on endpoints, servers, etc., raised concerns about agility and performance. Of course, in most environments, these burdens could have been avoided if security had been built into business infrastructure from the outset.
At BT, we had a very forward-thinking CIO and CISO who both recognized that the security tooling was part of the fabric of the IT and networks. They made sure that security was considered and managed along with the deployment and support of all enterprise infrastructure. A similar approach would have been a core strength on July 19.
“Ministry of Woe”
In the minds of many, the CrowdStrike incident transformed the “Ministry of Slow” into the “Ministry of Woe,” as, once again, security got in the way of the business. But I see it differently.
“Ministry of Pro(active)”
In reference to security and IT and network infrastructure, I have always said, “there is no 100%.” The gold standard in IT is “five nines” (5x9s, or 99.999% availability). CrowdStrike is investigating how the remaining 0.001% (in this case, a flawed update) made it into production. I won’t speculate about that here.
This incident reminds us of the grave importance of ongoing management of updates across an enterprise. We must examine the safety net of quality controls that we expect from our vendors and our own IT support teams. To prevent the “Ministry of Woe,” we must plan defensively by embracing the “Ministry of Pro(active).”
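To put the 99.999% figure in concrete terms, here is a minimal sketch of the arithmetic. It assumes availability is measured over a 365-day year; the function name `allowed_downtime_minutes` is purely illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year (assumption)


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime budget implied by an availability target."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return MINUTES_PER_YEAR * unavailable_fraction


# Compare three common availability targets.
for target in (99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% availability -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a budget of a little over five minutes of downtime per year, which is why a single flawed update can consume many years of that allowance in one day.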
Here’s what that might look like:
- Primary cohort. Define a “primary cohort” (a small set of devices and hosts drawn from each part of the organization) to use to test and confirm the success of updates. Ensure that these devices have a support wrap around them. Because they remain part of the general population in the environment, plan for a manual emergency response to restore them to operation if an update fails.
- Phases. Plan out a set of phases (beyond the primary cohort) for deployment and operational confirmation.
- N-1. Adopting an N-1 approach (preserving the last version of the software before the new release) is a prudent practice. However, be careful here. When it comes to security tooling, an N-1 approach could leave systems that have not been updated vulnerable to threats.
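The cohort-and-phases idea above can be sketched in a few lines of code. This is an illustration under stated assumptions, not a description of any vendor's deployment tooling: `Phase`, `staged_rollout`, `apply_update`, and `is_healthy` are hypothetical names, and the host names are made up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Phase:
    name: str
    hosts: List[str]


def staged_rollout(
    phases: List[Phase],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    max_failure_rate: float = 0.0,
) -> List[str]:
    """Deploy phase by phase and halt once a phase exceeds the failure limit.

    Every host in a phase is updated and checked before the phase is judged.
    Returns the names of the phases that completed within the limit.
    """
    completed = []
    for phase in phases:
        failures = 0
        for host in phase.hosts:
            apply_update(host)
            if not is_healthy(host):
                failures += 1
        failure_rate = failures / len(phase.hosts) if phase.hosts else 0.0
        if failure_rate > max_failure_rate:
            break  # stop here so later phases never receive the update
        completed.append(phase.name)
    return completed


# Hypothetical environment: the primary cohort goes first, then wider phases.
phases = [
    Phase("primary-cohort", ["hq-01", "plant-01"]),
    Phase("phase-1", ["hq-02", "hq-03"]),
    Phase("phase-2", ["hq-04", "plant-02", "plant-03"]),
]
updated: List[str] = []
faulty = {"hq-03"}  # simulate one host that fails its post-update check

completed = staged_rollout(
    phases,
    apply_update=updated.append,
    is_healthy=lambda host: host not in faulty,
)
print(completed)  # prints ['primary-cohort']
print(updated)    # phase-2 hosts are absent because the rollout halted
```

In this sketch the failure in phase 1 stops the rollout before phase 2, so the blast radius is limited to the hosts already touched. A real process would add the manual recovery plan and the N-1 trade-off described in the list.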
While the July 19 incident was not a security issue, there were serious security implications.
It was a hugely disruptive IT event — one that attackers could quickly leverage for advantage. There are two key vectors at play here:
- Phishing and fraud. Since the CrowdStrike event, there has been a rapid and significant spike in fake support sites and associated phishing. All employees must be on their guard and follow CrowdStrike’s advice to use only their direct, validated support services and report anything suspicious.
- Exploiting defense weakness. Affected organizations may be weighing up stability vs. re-enabling CrowdStrike. They may think that it’s a good idea to disable this protection “as things settle down,” in service of a period of operational stability. This would leave them wide open to attack (attackers will definitely be looking for unprotected targets). This is not a time for organizations to forget their sound reasoning in initially deploying CrowdStrike. The damage inflicted by a cyber-attack would far outweigh the disruption they experienced, since recovery of brand and reputation would have long-lasting costs.
More broadly, cybersecurity for the modern enterprise is based on an ecosystem of overlapping and compensating controls that maintain the security posture. Maintaining that posture is dependent on operational tools that are updated with the latest threat intelligence.
As threats and attacks emerge, visibility and velocity of response are key. They are fueled by up-to-the-minute threat intelligence immediately bonded to the enterprise’s telemetry and defenses. This requires a different approach to security and the operations platform — one that Anomali has pioneered and is bringing to customers, helping them achieve their security mission.
Once again, let’s appreciate the huge efforts of all the IT and security teams who got their organizations and, let’s face it, the world back on their feet. As the philosopher and writer George Santayana wrote in 1905, “Those who cannot remember the past are condemned to repeat it.” Let’s take the time to reconsider our resilience and put proactive measures in place. Let’s not allow the aftermath or sense of panic to disrupt the calm and considered approach that has built out our security controls and protection. Let us redouble our efforts to have them treated as an integral part of our IT and network fabric. Finally, let us stay united as a team to deliver business performance and achieve the security mission.