The recent global IT outage caused by a CrowdStrike update offers some interesting lessons. While we could blame upstream vendors, why can certain organisations recover within hours while others may need weeks? Why were the most heavily regulated industries (health, finance, airlines) hit the hardest?
I answered these questions in the opening panel on "Regulating for Resilience" at the annual ACCC/AER Regulatory Conference https://lnkd.in/gAwh7CAG. My answers also draw on insights from CSIRO’s Data61 and its all-hazards (cyber, software/hardware supply chain, disaster) scientific approach to achieving resilience, exemplified by our Critical Infrastructure Protection and Resilience (CIPR) mission https://lnkd.in/g58xk6SH.
1. Staged rollouts and robust rollback plans are essential for the resilience of every organisation. Although any release or update should be thoroughly tested, bugs and issues can always slip through. In a staged rollout, an update is initially released to a small number of machines, observed, and only then released to more machines. This can catch obvious failures before they have wider impact. However, it may not mitigate failure modes that only emerge in large-scale rollouts due to cumulative effects and complex interactions. To be resilient against such large-scale emergent failures, there must be a robust plan to roll back to earlier versions or fall back to alternatives. Even if some complex fixes and rollback plans cannot be known beforehand, the infrastructure should be capable of deploying newly devised, automated fixes and complex rollbacks.
2. Compliance can't be bought. One interesting phenomenon with this outage is that it impacted highly regulated industries the most. There is a reason for that. Highly regulated industries mandate rigorous security postures and practices, and require organisations to demonstrate compliance. Doing so is often costly, especially for organisations that lack the right expertise. Thus, there is a tendency to buy compliance, and CrowdStrike has been very effective at helping organisations demonstrate compliance through its software. Some organisations feel that by purchasing compliance-by-design solutions they tick the security compliance box, without realising the gaps that may still exist in their resilience infrastructure: their own org-specific testing environments, staged rollouts, default rollback/fallback plans, and infrastructure amenable to automating new fixes and specialised rollbacks when incidents occur.
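To make the staged rollout and rollback idea from point 1 concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of how CrowdStrike or any particular deployment tool works: the function `staged_rollout`, its parameters, and the simulated `deploy`, `healthy`, and `rollback` callbacks are all hypothetical names introduced for this example.

```python
from typing import Callable, List, Sequence


def staged_rollout(
    hosts: Sequence[str],
    stages: Sequence[float],          # cumulative fractions, e.g. (0.1, 0.5, 1.0)
    deploy: Callable[[str], None],    # hypothetical: installs the update on a host
    healthy: Callable[[str], bool],   # hypothetical: post-update health check
    rollback: Callable[[str], None],  # hypothetical: restores the previous version
    max_failure_rate: float = 0.0,
) -> bool:
    """Deploy in widening stages; roll everything back if health degrades."""
    if not hosts:
        return True
    updated: List[str] = []
    for fraction in stages:
        # Number of hosts that should be updated once this stage completes.
        target = max(1, int(len(hosts) * fraction))
        batch = list(hosts[len(updated):target])
        for host in batch:
            deploy(host)
        updated.extend(batch)
        # Observe every host updated so far before widening the rollout.
        failures = sum(1 for host in updated if not healthy(host))
        if failures / len(updated) > max_failure_rate:
            # Default rollback plan: revert every updated host, newest first.
            for host in reversed(updated):
                rollback(host)
            return False
    return True


# Simulated fleet (an assumption for illustration): "host-3" breaks on version "v2".
hosts = [f"host-{i}" for i in range(10)]
versions = {host: "v1" for host in hosts}
touched: List[str] = []


def deploy(host: str) -> None:
    versions[host] = "v2"
    touched.append(host)


def rollback(host: str) -> None:
    versions[host] = "v1"


def healthy(host: str) -> bool:
    return not (host == "host-3" and versions[host] == "v2")


ok = staged_rollout(hosts, (0.1, 0.5, 1.0), deploy, healthy, rollback)
print(ok, touched)
```

In this simulation the rollout stops at the second stage: only the first five hosts ever receive the update, and all of them are reverted. As noted in point 1, a check like this only catches failures visible at small scale, so the rollback path has to keep working even when a problem emerges after the final stage.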
Many of these insights are discussed in our book DevOps: A Software Architect’s Perspective and in our upcoming book Architecture and DevOps for AI Systems, which deals with AI-specific challenges. The lessons also apply to responsible AI, especially if you approach AI risks not just through prevention but also through resilience design. Achieving a perfect-by-design system is almost impossible for complex AI, which makes robust monitoring and fallback/rollback mechanisms a necessity.