CrowdStrike recently published the results of a technical root cause analysis (RCA) of the July 19, 2024, incident that crashed millions of computer systems worldwide.
The report identifies several factors that contributed to the incident and presents mitigating actions. As the root cause, it cites an out-of-bounds memory read that made the EDR sensor fail, which in turn crashed the Windows operating system.
The root cause analysis presents findings and remedies, which form the basis for actions now underway, as listed in the RCA executive summary:
Update Content Configuration System test procedures. This work has been completed. This includes upgraded tests for Template Type development, with automated tests for all existing Template Types. Template Types are part of the sensor and contain predefined fields for threat detection engineers to leverage in Rapid Response Content.
Add additional deployment layers and acceptance checks for the Content Configuration System. This work has been completed with an updated deployment ring process, ensuring Template Instances pass successive deployment rings before rollout into production.
Provide customers additional control over the deployment of Rapid Response Content updates. New capabilities have been implemented and deployed to our cloud that allow customers to control how Rapid Response Content is deployed, with additional functionality planned for the future.
Prevent the creation of problematic Channel 291 files. Validation for the number of input fields has been implemented to prevent this issue from happening.
Implement additional checks in the Content Validator. Additional checks are planned for release into production by August 19, 2024.
Enhance bounds checking in the Content Interpreter for Rapid Response Content in Channel File 291. Bounds checking was added on July 25, 2024, with general availability expected August 9, 2024. These fixes are being backported to all Windows sensor versions 7.11 and above through a sensor software hotfix release.
Engage two independent third-party software security vendors to conduct further review of the Falcon sensor code and end-to-end quality control and release processes. This work has begun and will be ongoing as part of our focus on security and resilience by design.
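The deployment ring process mentioned in the list above can be pictured as a gated pipeline: an update moves to the next ring only after passing the acceptance check of the current one. The following is a minimal sketch of that idea, not CrowdStrike's implementation. The `Ring` class, the `promote` function, the ring names, and the acceptance checks are all hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """One deployment ring (hypothetical model, not taken from the RCA)."""
    name: str
    # Hypothetical acceptance check: returns True if the update looks healthy here.
    accepts: Callable[[str], bool]


def promote(update_id: str, rings: List[Ring]) -> List[str]:
    """Roll an update through the rings in order.

    Stops at the first ring whose acceptance check fails and returns
    the names of the rings the update passed.
    """
    passed = []
    for ring in rings:
        if not ring.accepts(update_id):
            break  # halt the rollout; later rings never receive the update
        passed.append(ring.name)
    return passed


# Illustrative rings; the names and checks are assumptions.
rings = [
    Ring("internal", lambda update: True),
    Ring("early-adopters", lambda update: update != "bad-update"),
    Ring("production", lambda update: True),
]

good_path = promote("good-update", rings)
bad_path = promote("bad-update", rings)
```

In this sketch `good_path` contains all three ring names, while `bad_path` contains only `"internal"`, because the faulty update is stopped before it reaches production. The value of such a gate depends entirely on how well each acceptance check reflects real-world conditions.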
While these actions and the specific mitigative measures in the RCA report are important, they may not be sufficient. The reason? The root cause may not be purely technical.
The Scope of the RCA Was Limited and Reductive
The findings reported in the RCA considered only technical, proximal causes. Proximal (or first-order) causes are those closest to the event, and they often do not explain what initiated the causal chain leading to the incident. Identifying that chain requires looking beyond first-order causes and considering non-technical factors.
The analysis failed to answer:
What were the prior conditions or actions that created the opportunity for an unmitigated high-risk software change to be deployed to customers?
This question can only be answered by taking a systems perspective that includes both technical and non-technical factors. Based on what has been reported, this comprehensive analysis has yet to be conducted.
The Root Cause Was Likely Not Technical
A statement from the report offers a clue about the nature of the root cause and why the primary cause may not be purely technical:
"In summary, it was the confluence of these issues that resulted in a system crash: the mismatch between the 21 inputs validated by the Content Validator versus the 20 provided to the Content Interpreter, the latent [emphasis added] out-of-bounds read issue in the Content Interpreter, and the lack of a specific test for non-wildcard matching criteria in the 21st field."
The word "latent" stands out. The vulnerability that led to the incident already existed, lying dormant and waiting for a software change to expose it. This form of risk propagation, described by James Reason and illustrated by his well-known Swiss cheese model, is not new.
Studies of other complex systems have shown that the greatest risk is usually not a failure in a single component, but rather the existence of smaller latent vulnerabilities that, as a result of ongoing changes, align to allow risk to materialize.
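To make the idea of a latent defect concrete, here is a minimal sketch based only on the details in the RCA quote: a validator that checks 21 fields, an interpreter that receives 20 inputs, and a 21st field that is harmless while it holds a wildcard. All function names and the `"*"` wildcard marker are assumptions made for illustration; this is not the sensor's actual code. Note also that Python raises an `IndexError` on an out-of-range index, whereas native code would read whatever memory lies past the end of the array.

```python
WILDCARD = "*"         # hypothetical marker meaning "match anything"
EXPECTED_FIELDS = 21   # number of fields the validator checks (per the RCA quote)
PROVIDED_INPUTS = 20   # number of inputs the interpreter receives (per the RCA quote)


def validate(criteria):
    """Hypothetical validator: checks only the number of criteria."""
    return len(criteria) == EXPECTED_FIELDS


def interpret_unchecked(criteria, inputs):
    """Hypothetical interpreter with no bounds check."""
    for i, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue  # wildcard fields never read their input
        if inputs[i] != criterion:  # out of range when i >= len(inputs)
            return False
    return True


def interpret_checked(criteria, inputs):
    """Same logic, but refuses to read beyond the inputs supplied."""
    for i, criterion in enumerate(criteria):
        if criterion == WILDCARD:
            continue
        if i >= len(inputs):
            return False  # reject instead of reading out of bounds
        if inputs[i] != criterion:
            return False
    return True


inputs = ["value"] * PROVIDED_INPUTS

# Wildcard in the 21st field: the defect exists but stays dormant.
dormant = [WILDCARD] * EXPECTED_FIELDS
# Non-wildcard in the 21st field: the same defect is now exposed.
triggering = [WILDCARD] * (EXPECTED_FIELDS - 1) + ["value"]

dormant_ok = interpret_unchecked(dormant, inputs)
try:
    interpret_unchecked(triggering, inputs)
    crashed = False
except IndexError:
    crashed = True
checked_result = interpret_checked(triggering, inputs)
```

Both instances pass `validate`, yet only the second one makes the unchecked interpreter fail. Nothing in the interpreter changed between the two runs; a change in the content alone was enough to line up the holes.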
This raises further questions:
What other latent vulnerabilities reside within the software?
What future changes will cause these latent risks to breach the defenses?
What is causing the vulnerabilities to be introduced in the first place?
Why was the software change process ineffective at identifying and mitigating potential risks?
The last question suggests an alternative root cause:
A failure to effectively manage the risk introduced by planned software changes.
A Software Engineering Failure
In high-risk sectors such as chemical processing, nuclear energy, and medical devices, safety is paramount, and change is considered a significant source of risk. That's why managing change is regulated to ensure organizations take necessary steps to protect the public, employees, assets, and the environment.
It's also why we have process, safety, industrial, and quality engineers to design safety into system processes and ensure it is managed effectively throughout the life cycle of facilities, products, or services.
What appears to be missing in software engineering, and in this incident in particular, is that same level of concern for safety.
The Need to Dig Deeper
While the out-of-bounds read error may have caused the sensor failure, it is arguably not the root cause.
To discover the true root cause, we must look beyond technical considerations and proximal causes. We need to dig deeper into the systems that create conditions for vulnerabilities to emerge and lie dormant, waiting for future changes to expose them.
There is much to learn from other high-risk domains that can be applied to the practice of software development. However, this knowledge can be applied effectively only once the real root causes are uncovered. Only then will preventive and mitigating risk measures be truly effective.