Only a handful of times in history has a single piece of code managed to instantly wreck computer systems worldwide. The Slammer worm of 2003. Russia’s Ukraine-targeted NotPetya cyberattack. North Korea’s self-spreading ransomware WannaCry. But the ongoing digital catastrophe that rocked the internet and IT infrastructure around the globe over the past 12 hours appears to have been triggered not by malicious code released by hackers, but by the software designed to stop them.
Two internet infrastructure disasters collided on Friday to produce disruptions around the world in airports, train systems, banks, health care organizations, hotels, television stations, and more. On Thursday night, Microsoft’s cloud platform Azure experienced a widespread outage. By Friday morning, the situation turned into a perfect storm when the security firm CrowdStrike released a flawed software update that sent Windows computers into a catastrophic reboot spiral. A Microsoft spokesperson tells WIRED that the two IT failures are unrelated.
The cause of one of those two disasters, at least, has become clear: buggy code pushed out as an update to CrowdStrike’s Falcon monitoring product, essentially an antivirus platform that runs with deep system access on “endpoints” like laptops, servers, and routers to detect malware and suspicious activity that could indicate compromise. Falcon requires permission to update itself automatically and regularly, since CrowdStrike is constantly adding detections to the system to defend against new and evolving threats. The downside of this arrangement, though, is the risk that this system, which is meant to enhance security and stability, could end up undermining it instead.
“It's the biggest case in history. We’ve never had a worldwide workstation outage like this,” says Mikko Hyppönen, the chief research officer at cybersecurity company WithSecure. Around a decade ago, Hyppönen says, widespread outages were more common due to the spread of worms or trojans. More recently, global outages have happened on the “server side” of systems, meaning outages often stem from cloud providers such as Amazon Web Services, internet cable cuts, or authentication and DNS issues.
CrowdStrike CEO George Kurtz said on Friday that the issues were caused by a “defect” in code the company released for Windows. Mac and Linux systems were not affected. “The issue has been identified, isolated and a fix has been deployed,” Kurtz said in a statement, adding the problems were not the result of a cyberattack. In an interview with NBC, Kurtz apologized for the disruption and said it may take some time for things to be back to normal.
The widespread Windows outages have been linked to a software update from cybersecurity giant CrowdStrike. Cybersecurity officials say the issues are believed to stem not from a malicious cyberattack but from a misconfigured or corrupted update that CrowdStrike pushed out to its customers.
In a more detailed update Friday evening, CrowdStrike wrote in a blog post that the root cause of the crash had been a single configuration file pushed as an update to Falcon. The update was specifically aimed at changing how Falcon inspects “named pipes” in Windows, a feature that allows software to send data between processes on the same machine or with other computers on the local network. CrowdStrike says the configuration file update was aimed at allowing Falcon to catch a new method that hackers were using for communication between their malware on victim machines and command-and-control servers. “The configuration update triggered a logic error that resulted in an operating system crash,” the post reads.
Security and IT analysts searching for the root cause of the gargantuan outage had initially thought that it must be related to a “kernel driver” update to CrowdStrike’s Falcon software, due in part to the fact that the file that caused the crash ended in .sys, the file extension kernel drivers use. Kernel drivers are the software components that allow applications to interact with Windows at its deepest level, the core of the operating system known as its kernel. That highly sensitive level of access is necessary for security software, so that it can run prior to any malicious software installed on the system and access any part of the system where hackers might seek to plant their code. As malware has improved and evolved, it has pushed defense software to require constant connection and more extensive control.
That deeper access also introduces a far higher possibility that security software—and updates to that software—will crash the whole system, says Matthieu Suiche, head of detection engineering at the security firm Magnet Forensics. He compares running malicious code detection software at the kernel level of an operating system to “open-heart surgery.”
CrowdStrike noted in its blog post that despite the fact that the configuration file that caused the crash ended in the .sys file extension, it was not in fact a kernel driver. Yet it does appear that the configuration file was used by the driver and altered its functionality in a way that caused it to crash, says Costin Raiu, who worked at Russian security software firm Kaspersky for 23 years and led its threat intelligence team before leaving the company last year. During his years at Kaspersky, Raiu says, driver updates for Windows software were closely scrutinized and tested for weeks before they were pushed out. In this case, he suggests the configuration file may have been a far less scrutinized update that was nonetheless able to change the way the kernel driver functioned and thus cause the crash. “It’s surprising that with the extreme attention paid to drivers, this still happened,” says Raiu. “One simple driver can bring down everything. Which is what we saw here.”
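To make Raiu's point concrete, here is a minimal Python sketch of how a data-only update can break the code that consumes it. Nothing here reflects CrowdStrike's actual file format or driver: the `Rule` class, its fields, and the function names are all hypothetical, and the example only assumes that a trusted interpreter reads rules from a less scrutinized configuration file.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Rule:
    """One detection rule loaded from a content update (hypothetical format)."""
    name: str
    field_index: int  # which field of an observed event the rule inspects
    pattern: str      # substring that marks the event as suspicious


def evaluate(rule: Rule, event_fields: List[str]) -> bool:
    """Trusting interpreter: assumes every rule fits the event layout."""
    # An index past the end of the list raises IndexError here. In Python that
    # is a catchable exception; a comparable fault in kernel-mode code can
    # crash the whole operating system.
    return rule.pattern in event_fields[rule.field_index]


def validate(rule: Rule, field_count: int) -> None:
    """Reject a rule that the interpreter could not evaluate safely."""
    if not 0 <= rule.field_index < field_count:
        raise ValueError(
            f"rule {rule.name!r}: field_index {rule.field_index} is outside "
            f"the {field_count} fields the interpreter provides"
        )


def safe_evaluate(rule: Rule, event_fields: List[str]) -> bool:
    """Defensive interpreter: a bad rule is skipped instead of crashing."""
    try:
        validate(rule, len(event_fields))
    except ValueError:
        return False
    return evaluate(rule, event_fields)
```

In this sketch the interpreter code never changes, yet a rule with an out-of-range `field_index` is enough to make `evaluate` fail, which is why the same validation would ideally run both before an update ships and again when it is loaded.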
Microsoft requires developers to get its approval for kernel driver updates, which entails the company’s own careful inspection process. But Microsoft wouldn’t necessarily require any such approval for a configuration file. A Microsoft spokesperson told WIRED that the “CrowdStrike update was responsible for bringing down a number of IT systems globally,” and added that “Microsoft does not have oversight into updates that CrowdStrike makes in its systems.”
Raiu adds that, even so, CrowdStrike is far from the only security firm to trigger Windows crashes. Updates to Kaspersky and even Windows’ own built-in antivirus software Windows Defender have caused similar Blue Screen of Death crashes in years past, he notes. “Every security solution on the planet has had their CrowdStrike moments,” Raiu says. “This is nothing new but the scale of the event.”
Cybersecurity authorities around the world have issued alerts about the disruption, but have similarly been quick to rule out any nefarious activity by hackers. “The NCSC assesses that these have not been caused by malicious cyber attacks,” Felicity Oswald, CEO of the UK’s National Cyber Security Center, said. Officials in Australia have come to the same conclusion.
Nevertheless, the impact has been sweeping and dramatic. Around the world, the outages have been spiraling as companies, public bodies, and IT teams race to fix bricked machines, which involves manually taking machines through a series of corrective steps, including rebooting. In the UK, Israel, and Germany, health care services and hospitals saw systems that they use to communicate with patients disrupted, and canceled some appointments. Emergency services in the US using 911 have reportedly had problems with their lines too. In the earliest hours of the outages, some TV stations, including Sky News in the UK, stopped live news broadcasts.
Global air travel has been one of the most impacted sectors so far. Huge lines formed at airports around the world, with one airport in India using handwritten boarding passes. In the US, Delta, United, and American Airlines grounded all flights at least temporarily, with a dramatic graphic showing air traffic plummeting above the US.
The catastrophic situation reflects the fragility and deep interconnectedness of the internet. Numerous security practitioners told WIRED that they anticipated or even worked with clients to attempt to protect against a scenario where defense software itself caused cascading failures as a result of malicious exploitation or human error, as is the case with CrowdStrike. “This is an incredibly powerful illustration of our global digital vulnerabilities and the fragility of core internet infrastructure,” says Ciaran Martin, a professor at the University of Oxford and the former head of the UK’s National Cyber Security Center.
The ability of one update to trigger such massive disruption still puzzles Raiu. According to Gartner, a market research firm, CrowdStrike accounts for 14 percent of the security software market by revenue, meaning its software is on a wide array of systems. Raiu suggests that the Falcon update must have triggered crashes in other parts of web infrastructure, which could have multiplied the disaster. “CrowdStrike is big, but it can’t be this big,” Raiu says. “Airports, critical infrastructure, hospitals. It cannot be just CrowdStrike everywhere. I suspect we’re seeing a combination of factors, a cascading effect, a chain reaction.”
Hyppönen, from WithSecure, says his “guess” is that the issues may have happened due to “human error” in the update process. “An engineer at CrowdStrike is having a really bad day,” he says. Hyppönen suggests that CrowdStrike could have shipped software different to what they had been testing or mixed up files, or there could’ve been a combination of different factors. “Software like this has to go through extensive testing,” Hyppönen says. “That's what we do. That's what CrowdStrike, of course, does. You have to be really careful about what you ship, which is tough to do because security software is updated very frequently.”
While many of the impacts of the outage are ongoing and still unraveling, the nature of the problem means that individually impacted machines may need to be rebooted manually rather than through an automated process. “It could be some time for some systems that just automatically won’t recover,” CrowdStrike CEO Kurtz told NBC.
The company’s initial “workaround” guidance for dealing with the incident says Windows machines should be booted into safe mode, a specific file should be deleted, and the machines should then be rebooted. “The fixes we’ve seen so far mean that you have to physically go to every machine, which will take days, because it’s millions of machines around the world which are having the problem right now,” says Hyppönen from WithSecure.
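The file-deletion step of that workaround could be scripted roughly as follows once a machine is up in safe mode. This is a sketch under stated assumptions, not CrowdStrike's tooling: the function name is made up, and the directory and file pattern are deliberately left as parameters, because the real values have to come from the vendor's guidance.

```python
from pathlib import Path
from typing import List


def remove_faulty_files(directory: Path, pattern: str, dry_run: bool = True) -> List[Path]:
    """Delete files in `directory` that match `pattern` and return the matches.

    Both arguments are placeholders: take the real location and file name
    from the vendor's guidance, not from this sketch.
    """
    matches = sorted(p for p in directory.glob(pattern) if p.is_file())
    if not dry_run:
        for path in matches:
            path.unlink()  # the reboot afterwards is still a manual step
    return matches
```

The `dry_run` default is deliberate: an administrator can first list what would be deleted and only then rerun with `dry_run=False`. It does not remove the need Hyppönen describes to reach each affected machine, since the script can only run after the machine has been booted into safe mode.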
As system administrators race to contain the fallout, the larger existential question of how to prevent another, similar crisis looms large.
“People may now demand changes in this operating model,” says Jake Williams, vice president of research and development at the cybersecurity consultancy Hunter Strategy. “For better or worse, CrowdStrike has just shown why pushing updates without IT intervention is unsustainable.”
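One form the changed operating model Williams describes could take is a staged rollout, where an update reaches a small group of machines first and stops spreading if they become unhealthy. The Python sketch below is illustrative only. The ring layout, the `apply_update` callback, and the failure threshold are assumptions, and it does not describe how CrowdStrike or any other vendor actually ships updates.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def staged_rollout(
    rings: Sequence[Sequence[str]],
    apply_update: Callable[[str], bool],
    max_failure_rate: float = 0.05,
) -> Tuple[List[str], Optional[int]]:
    """Push an update one ring at a time, stopping at the first unhealthy ring.

    Returns the hosts that received the update and the index of the ring that
    halted the rollout (None if every ring stayed healthy).
    """
    updated: List[str] = []
    for index, ring in enumerate(rings):
        failures = 0
        for host in ring:
            updated.append(host)
            if not apply_update(host):  # False means the host failed its health check
                failures += 1
        if ring and failures / len(ring) > max_failure_rate:
            return updated, index  # later rings never receive the update
    return updated, None
```

With rings such as a two-machine canary group followed by the wider fleet, an update that fails on every host would stop after the canaries, leaving the rest of the fleet untouched. The trade-off is speed: new detections reach the last ring later than they would with a single global push.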
Update 7/19/2024, 11am ET: Added comment from Microsoft saying that the Azure outage and the CrowdStrike issue are unrelated.
Update 7/19/2024, 12:30pm ET: Added further comment from Microsoft about its lack of oversight of CrowdStrike's updates.
Update 7/19/2024, 3:45pm ET: Updated to clarify that Amazon Web Services was not impacted by the CrowdStrike update, according to the company.
Update 7/20/2024, 9:30am ET: In a technical explanation released on Friday evening, CrowdStrike clarified that the issue causing the global IT crash was due to a problem with a configuration file that uses the .sys file extension also used by kernel drivers. However, the company clarified that it was not a kernel driver itself. We've updated the piece with the new technical details.
Jul 20, 2024
How CrowdStrike's "Routine" Update Crashed the World's Computers
Putting the technological explanation aside, computer updates sent out automatically without human oversight were the root cause of the CrowdStrike global tech meltdown.
But don't expect an easy fix: Microsoft and others don't want to increase human oversight because it might reduce their profit margins and their ability to maneuver as they see fit. JL
Lily Newman and colleagues report in Wired: