On February 22nd, a massive service interruption in AT&T cellular services affected subscribers across the nation. Although outage-report volumes were in the hundreds of thousands, that is likely just the tip of the iceberg. What lies beneath is a massive number of subscribers who experienced issues but didn’t or couldn’t report them, along with services that ride on cellular networks (e.g., tracking services, PoS terminals, etc.). The outage lasted approximately 11 hours; based on the impacts of similar past outages on things such as financial transactions and supply chain disruptions, we estimate the impact to the US economy at $500 million. Here’s what we know happened and what happens next:
A mundane network change caused the massive outage. AT&T has officially released a statement that attributes the outage to “… the application and execution of an incorrect process used as we were expanding our network, not a cyber attack …” (AT&T, February 22, 2024, 6:46 p.m. CT). What’s the big deal? For most of us in IT, cellular technologies have served as backup underlay connectivity for wide-area networks, making the impact minimal. But for some enterprises, cellular connectivity is the lifeline of core business functions such as operations (e.g., field and fleet operations, asset tracking and management) or sales (e.g., payment terminals, kiosks, etc.). In these circumstances, an outage like this can be devastating.
There will be investigations and significant cost to AT&T … and ultimately its customers. A chain of events will unfold following the outage, starting with AT&T submitting the official root-cause report to the FCC. In parallel, US government agencies will support efforts to rule out any possible cyberattack. Customer rebates and credits will start to flow, and so will lawsuits from consumers and businesses alike. AT&T will implement process and technology improvements addressing the root cause(s), and the FCC will be forced to review its rules. If we use the July 8, 2022, Rogers outage in Canada as a guide, adjusting for outage duration and population proportions, we estimate that AT&T will see as much as $1.5 billion in impact. The remediation investment could be bundled into a multiyear improvement plan, as Rogers did (C$10 billion over three years); if AT&T puts together such a plan, we expect it to be in the vicinity of USD $20 billion to $30 billion. Customers will likely see the result of this in higher costs, similar to what Rogers subscribers experienced a few months after the outage.
That’s not great news for anyone. It is important to remember that networks will always have outages and performance degradations; that’s a matter of physics, human intervention, and technology complexity. What made this newsworthy was that a major carrier that enterprises and citizens depend on went down. For these reasons, carriers are held to the highest standards, often with SLAs of five-nines (99.999%) availability. That means being unavailable for no more than about 5 minutes and 15 seconds per year. Being down for 11 hours … that’s a new ballpark. What are the key lessons for carriers and IT leaders from this unfortunate event?
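To put that availability math in perspective, here is a small sketch that derives the downtime budget implied by an N-nines SLA. The loop and the comparison to an 11-hour outage are illustrative; actual carrier SLA terms vary by contract.

```python
# Sketch: downtime budget implied by an availability SLA.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (non-leap year)

def downtime_budget(nines: int) -> float:
    """Allowed downtime in seconds per year for an N-nines SLA."""
    availability = 1 - 10 ** (-nines)   # e.g., 5 nines -> 0.99999
    return SECONDS_PER_YEAR * (1 - availability)

for n in range(2, 6):
    print(f"{n} nines: {downtime_budget(n) / 60:.2f} minutes/year")

# 5 nines allows ~5.26 minutes/year; an 11-hour outage (39,600 s)
# blows through that budget by a factor of roughly 125.
```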
IT leaders must revisit their end-device wireless connectivity capabilities. Especially for companies that rely on single-carrier cellular connectivity, it may be time to reconsider that approach and ask whether other technologies might better serve your needs: for example, multi-SIM/eSIM redundant carrier connectivity, or multiple wireless connectivity options in your end devices such as satellite, LoRa, Sigfox, or even WiFi (see the failover sketch below). But there’s more to learn here. As much as we hold carriers to higher standards, we can try to avoid their mistakes …
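As a rough illustration of the device-side redundancy we mean, here is a minimal sketch of prioritized failover across uplinks. The interface names are hypothetical, and check_health() simulates a primary-carrier outage; a real device would probe a known endpoint through each interface via its modem or OS APIs.

```python
# Minimal sketch: prefer the highest-priority healthy uplink,
# falling back through redundant carriers and other radios.
from dataclasses import dataclass

@dataclass
class Uplink:
    name: str
    priority: int  # lower value = preferred path

def check_health(link: Uplink) -> bool:
    """Placeholder probe; here we simulate the primary carrier down."""
    return link.name != "carrier_a_esim"

def select_uplink(links: list[Uplink]) -> Uplink | None:
    for link in sorted(links, key=lambda l: l.priority):
        if check_health(link):
            return link
    return None  # all paths down: queue transactions locally

uplinks = [
    Uplink("carrier_a_esim", 0),  # primary cellular carrier
    Uplink("carrier_b_esim", 1),  # second carrier via eSIM profile
    Uplink("wifi0", 2),
    Uplink("satellite0", 3),
]
active = select_uplink(uplinks)
print(f"Active path: {active.name if active else 'none (store and forward)'}")
```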
All networking orgs must accelerate monitoring, visibility, observability, and AI investments. As noted above, networks will always have outages and performance degradations. However, networking teams aren’t known for diligent planning ahead and proactive resilience measures. For example, network monitoring solutions are usually an afterthought: only after an issue arises, especially when the root cause can’t be found, will networking teams invest in a monitoring solution. Part of the issue is a lack of budget for fundamentals versus flashy new concepts such as autonomous networks, intent-based networking (IBN), and networking-as-a-service. But that approach is nothing more than taping over a crack in an airplane wing and must be phased out. Uptime and fast remediation are essential for customer experience. This makes network automation, performance management (including visibility, observability, and AIOps), fast analytics for root-cause analysis, and system-wide improvements via AI essential. Automation and AI won’t eliminate all outages, but they can help uncover and avoid many outages and performance degradations while running simulations before changes or issues occur.
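To make “proactive” concrete, here is a minimal sketch of the simplest building block behind many AIOps-style alerts: flagging metric samples that deviate sharply from a rolling baseline. The window size, z-score threshold, and synthetic latency data are all illustrative assumptions.

```python
# Sketch: rolling z-score anomaly detection on a latency stream.
from collections import deque
from statistics import mean, stdev

WINDOW, Z_THRESHOLD = 30, 3.0

def detect(samples):
    window = deque(maxlen=WINDOW)
    for t, latency_ms in enumerate(samples):
        if len(window) == WINDOW:
            mu, sigma = mean(window), stdev(window)
            if sigma > 0 and abs(latency_ms - mu) / sigma > Z_THRESHOLD:
                yield t, latency_ms  # alert before users report issues
        window.append(latency_ms)

baseline = [20.0 + (i % 5) for i in range(60)]  # ~20-24 ms, normal
spike = [180.0, 200.0, 210.0]                   # sudden degradation
for t, v in detect(baseline + spike):
    print(f"t={t}: latency {v} ms deviates sharply from baseline")
```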
Advanced companies, like carriers, should seek out advanced practices. The expectations for large enterprises, especially carriers, are even higher. It is no longer enough to invest fully in the items above; they need to push into advanced practices like businesswide networking fabrics, simulations/digital twins, real-time event communication, etc. Why are these so important? Past segmented networks were discrete components, manually controlled, with changes occurring at each network point sequentially over a long period. With businesswide networking fabrics controlled by software, one change can occur across hundreds if not thousands of devices simultaneously, which pushes the need to run scenarios through digital twins to understand the full scope of a change (network config changes, updates, upgrades) before it occurs. Carriers should accelerate the adoption of these technologies, much as the aerospace and aircraft industry runs simulations before building components, aircraft, or rockets.
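To make the digital-twin point concrete, here is a toy sketch of “simulate before you ship”: apply a fleet-wide config change to an in-memory copy of production and check invariants before any real device is touched. The device model, change, and checks are illustrative assumptions, not a real twin platform.

```python
# Toy sketch: validate a fleet-wide change against a twin first.
from copy import deepcopy

fleet = {f"edge-{i}": {"mtu": 1500, "bgp_enabled": True} for i in range(1000)}

def proposed_change(cfg):
    cfg["mtu"] = 9000  # e.g., a jumbo-frame rollout

def validate(twin) -> list[str]:
    errors = []
    for name, cfg in twin.items():
        if not cfg["bgp_enabled"]:
            errors.append(f"{name}: routing disabled")
        if not 576 <= cfg["mtu"] <= 9216:
            errors.append(f"{name}: MTU {cfg['mtu']} out of range")
    return errors

twin = deepcopy(fleet)            # the "digital twin" of production
for cfg in twin.values():
    proposed_change(cfg)

if errors := validate(twin):
    print(f"Blocked rollout: {len(errors)} invariant violations")
else:
    print("Simulation clean: safe to stage the change")
```

The value is in the ordering: the one-change-hits-thousands-of-devices risk is absorbed by the twin, not by live subscribers.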
Engage with us via an inquiry call by emailing [email protected].