How Microsoft Resolved the Major CrowdStrike Outage

Microsoft Faces Major Incident: The CrowdStrike Outage Explained

In a shocking turn of events early Friday morning, crash reports began pouring in as engineers within Microsoft realized the gravity of the situation. Millions of Windows machines were hitting the notorious Blue Screen of Death (BSOD), taking down critical servers and PCs around the world.

Understanding the Severity of the Incident

Microsoft promptly categorized the incident as a “severity zero,” internally referred to as sev0. This is the most urgent classification for incidents impacting Microsoft products or services. Sev0 incidents are exceedingly rare. When one is declared, on-call engineers are notified and expected to act immediately, often in the middle of the night.

The Role of CrowdStrike

The situation was further complicated by the involvement of CrowdStrike, a third-party cybersecurity firm. On July 19th at 12:09 AM ET, CrowdStrike released an update that inadvertently crashed approximately 8.5 million Windows PCs and knocked them offline. Although the error did not originate with Microsoft, it quickly became a significant problem for the tech giant.

Impact on Microsoft and Its Customers

This incident particularly affected what Microsoft identifies as its “pri0 customers,” which include large organizations with critical infrastructure that rely heavily on uninterrupted service. Companies with essential operations were left scrambling to address the fallout from this unexpected outage.

The Response: Collaboration and Communication

Given the severity of the outage, Microsoft had to stay in constant communication with CrowdStrike engineers. The urgency of the situation also required collaboration across company lines, including outreach to cloud rivals like Amazon and Google.

The Aftermath

As the dust settles on this unexpected event, both CrowdStrike and Microsoft are faced with the challenge of restoring normal operations while evaluating the causes and implications of such a widespread failure.

Key Takeaways

The incident highlights the vulnerability of interconnected systems, where third-party updates can disrupt operations on a massive scale.
Effective communication and cross-company collaboration are essential during critical outages.
Understanding the classification of incidents like sev0 is vital for recognizing the urgency required in tech incident management.

Conclusion

Though CrowdStrike's update was the catalyst for the outage, the incident serves as a learning opportunity for both firms, emphasizing the importance of robust testing and communication protocols to avoid similar situations in the future.