AWS Outage October 2025: Inside the Massive Cloud Failure That Shook the Internet

October 20, 2025

At approximately midnight Pacific Time on October 20, 2025, a significant portion of the internet stopped working. Not from a cyberattack, not from a natural disaster, but from problems at a cluster of data centers in Northern Virginia owned by Amazon Web Services. Within minutes, Snapchat users couldn’t send messages, Fortnite players were kicked from matches, Duolingo learners lost their streaks, and financial apps like Robinhood and Venmo became inaccessible. The cascading failure exposed a fundamental fragility in modern digital infrastructure: when AWS goes down, much of the internet goes with it.

The outage, which began around 8:00 AM UK time (roughly midnight Pacific), affected dozens of major platforms simultaneously—a pattern that immediately pointed to a common infrastructure failure rather than isolated technical problems. Downdetector, the crowd-sourced outage tracking service, lit up with tens of thousands of reports across an unprecedented breadth of services. Gaming platforms, social media apps, productivity tools, financial services, smart home devices, and even Amazon’s own retail operations all experienced disruptions at the same moment.

Amazon acknowledged the issue on its AWS Service Health Dashboard, confirming “increased error rates” and latency problems affecting Amazon DynamoDB (AWS’s NoSQL database service) and Amazon Elastic Compute Cloud (EC2), the virtual computing backbone that powers thousands of companies’ applications. The terse technical language understated the real-world impact: for millions of users, favorite apps simply stopped working with no explanation.

This incident marks the latest in a recurring pattern that raises uncomfortable questions about internet infrastructure. How did we arrive at a point where problems at facilities in a single geographic region can disrupt global services? What does this concentration of digital power mean for resilience, competition, and the future of cloud computing? And why does this keep happening?

The invisible empire: How AWS became the internet’s foundation

To understand why this outage mattered so much, you need to grasp Amazon Web Services’ extraordinary reach into the digital economy. AWS isn’t just a cloud provider—it’s the infrastructure layer beneath vast swaths of the modern internet.

Amazon launched AWS in 2006, initially offering simple storage and computing services that developers could rent by the hour. The value proposition was revolutionary: instead of buying and maintaining physical servers, companies could rent computing power on demand, paying only for what they used. This “cloud computing” model eliminated massive capital expenses and allowed startups to scale from zero to millions of users without building data centers.

Nearly two decades later, AWS has become a $100+ billion annual business serving millions of customers across virtually every industry. The company operates massive data center facilities in 33 geographic “regions” worldwide, each containing multiple isolated “availability zones” for redundancy. These facilities house millions of servers running sophisticated software that abstracts physical hardware into virtual resources customers can provision in minutes.

AWS’s market position is extraordinary: the company controls approximately 32% of the global cloud infrastructure market as of 2025, well ahead of Microsoft Azure (roughly 23%) and Google Cloud (roughly 11%), and nearly as much as the two combined. This dominance stems from first-mover advantage, aggressive investment in infrastructure, competitive pricing, and a comprehensive service catalog offering over 200 distinct products covering everything from basic computing to advanced artificial intelligence.

The platform serves a Who’s Who of digital companies. Netflix streams video to 250+ million subscribers using AWS infrastructure. Airbnb manages millions of property listings and bookings on AWS. Epic Games hosts Fortnite’s 230 million players on AWS servers. Financial technology companies like Robinhood and Coinbase process billions in transactions through AWS. Educational platforms like Duolingo and Canvas reach hundreds of millions of learners via AWS. Even competitors use AWS: Apple reportedly spends over $30 million monthly on AWS services.

This ecosystem extends far beyond well-known consumer apps. Government agencies, healthcare providers, financial institutions, manufacturers, retailers, and media companies run critical operations on AWS. The platform handles everything from medical records and financial transactions to supply chain logistics and content delivery. AWS has become embedded infrastructure—invisible to end users but essential to daily life.

The Northern Virginia concentration amplifies this dependency. AWS’s us-east-1 region, located in data centers across Northern Virginia, serves as the company’s oldest and largest deployment. Industry estimates suggest us-east-1 handles 35-40% of all AWS traffic globally. Many companies initially deployed applications in us-east-1 and never migrated, creating enormous concentration of workloads in this single geographic region.

Geography matters because cloud computing isn’t actually “in the cloud”—it runs on physical servers in physical buildings consuming massive amounts of electricity and generating enormous heat. Data centers require stable power, redundant network connectivity, favorable weather conditions, and proximity to internet backbone infrastructure. Northern Virginia became an internet hub due to its location between major East Coast cities, abundant land, relatively mild climate, and competitive electricity pricing. The region hosts not just AWS but also facilities from Microsoft, Google, and numerous other providers.

What went wrong: Anatomy of the October 20th failure

While Amazon has not yet released a detailed post-mortem explaining the root cause, the company’s status page and observable symptoms reveal the outage’s technical character.

The failure originated in core AWS services: specifically Amazon DynamoDB and Amazon EC2, both hosted in the us-east-1 region. These services form foundational layers of AWS architecture. DynamoDB provides a managed NoSQL database service that many applications rely on for storing and retrieving data at massive scale with low latency. EC2 delivers the virtual computing instances where applications actually run. When these services experience problems, cascading effects ripple through the entire ecosystem.

Amazon’s acknowledgment of “increased error rates” suggests the infrastructure didn’t completely fail but rather degraded significantly. Applications attempting to read from or write to DynamoDB databases received errors or experienced severe timeouts. EC2 instances may have encountered problems communicating with other AWS services or faced networking issues that prevented normal operation.
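What that degradation looks like from the customer side is error responses and timeouts rather than a clean, obvious failure. Below is a minimal sketch of the defensive pattern many teams lean on, assuming Python with boto3 and a hypothetical table and key; the retryable error codes listed are DynamoDB’s standard transient errors, and the backoff schedule is illustrative.

```python
import time

import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")

def get_item_with_backoff(table, key, max_attempts=5):
    """Read one item, retrying transient DynamoDB errors with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return dynamodb.get_item(TableName=table, Key=key)
        except ClientError as err:
            code = err.response["Error"]["Code"]
            # Retry only the error classes that are usually transient.
            if code in ("InternalServerError", "ThrottlingException",
                        "ProvisionedThroughputExceededException"):
                time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before trying again
                continue
            raise  # anything else is a real bug, not a blip
    raise RuntimeError(f"DynamoDB still failing after {max_attempts} attempts")

# Hypothetical usage:
# get_item_with_backoff("user-profiles", {"user_id": {"S": "12345"}})
```

Backoff like this absorbs brief blips, but during a multi-hour regional event it mostly converts hard errors into slow errors, which is why outages like this one remain so visible to end users.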

This pattern resembles previous major AWS outages. A July 2024 incident lasted nearly seven hours due to a failure in Amazon Kinesis Data Streams, another foundational service. A December 2021 outage stemmed from network device problems causing elevated API error rates. A September 2021 disruption originated from stuck I/O issues in Elastic Block Store (EBS), the service providing persistent storage for EC2 instances. A November 2020 outage started with API errors in Kinesis that cascaded through dozens of dependent services.

The common thread across these incidents: AWS’s highly interconnected architecture amplifies failures. AWS architected its platform with thousands of microservices that communicate continuously. When a foundational service like DynamoDB or Kinesis experiences problems, the effects propagate to other services that depend on it. An API Gateway needs to check DynamoDB for configuration. A Lambda function needs to write logs to a service backed by Kinesis. CloudWatch monitoring requires database storage for metrics. The dependencies create a web where a single failure point can trigger cascading problems.
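The shape of that cascade is easy to see in a toy model. The sketch below uses plain Python and a made-up dependency graph loosely mirroring the examples above; it is not AWS’s real service topology, just an illustration of how one failed foundation marks everything that transitively depends on it.

```python
# Toy dependency graph: each service maps to the services it calls.
# Names and edges are illustrative, not AWS's actual architecture.
DEPENDENCIES = {
    "api_gateway": ["dynamodb"],               # checks DynamoDB for configuration
    "lambda":      ["kinesis"],                # ships logs through a Kinesis-backed pipeline
    "cloudwatch":  ["dynamodb"],               # stores metrics in a database
    "checkout":    ["api_gateway", "lambda"],  # a customer-facing application
    "dynamodb":    [],
    "kinesis":     [],
}

def affected_services(failed, deps=DEPENDENCIES):
    """Return every service that transitively depends on a failed one."""
    impacted = set(failed)
    changed = True
    while changed:
        changed = False
        for service, calls in deps.items():
            if service not in impacted and impacted.intersection(calls):
                impacted.add(service)
                changed = True
    return impacted

print(affected_services({"dynamodb"}))
# {'dynamodb', 'api_gateway', 'cloudwatch', 'checkout'} -- one failure, four casualties
```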

During the October 20th incident, customers attempting to access the AWS Management Console—the web interface for managing cloud resources—reportedly faced errors, suggesting the outage affected even AWS’s own administrative infrastructure. This meta-problem makes outages harder to debug and communicate about: if the systems for updating the status page are themselves broken, how do you tell customers what’s happening?

The us-east-1 region’s age contributes to these fragility issues. As AWS’s first and largest region, us-east-1 contains legacy architecture dating back to AWS’s early years. The region has been extended and modified countless times to accommodate growth. This accretion of systems, each built with different assumptions and technologies, creates complexity that makes reliability harder to maintain.

Amazon’s engineering culture emphasizes rapid innovation and new feature development—strengths that built AWS’s market lead but potentially at the cost of investment in unglamorous infrastructure hardening. The company faces constant pressure to add new services, support new use cases, and maintain competitive pricing. These priorities can overshadow efforts to refactor aging systems or reduce architectural complexity.

The ripple effect: Who got hurt and how badly

The October 20th outage’s impact extended across multiple sectors, affecting hundreds of millions of users and countless businesses.

Gaming and entertainment suffered immediate, visible disruptions. Roblox, with over 70 million daily active users, went down entirely—a disaster for a platform where users expect 24/7 access to play with friends and participate in virtual economies. Fortnite players were kicked from matches mid-game. Epic Games Store users couldn’t purchase games or download content. PlayStation Network experienced connectivity problems affecting online multiplayer. Riot Games’ League of Legends, Supercell’s Clash Royale and Clash of Clans, Rainbow Six Siege, Dead by Daylight, VRChat, and Rocket League all reported issues. For gaming companies, every minute of downtime translates to frustrated players and lost revenue from in-game purchases.

Social media and communication platforms went dark. Snapchat, used by over 400 million people daily, experienced widespread outages preventing users from sending messages or viewing content—particularly problematic for a platform where “streaks” (consecutive days of communication) hold social currency among young users. The outage disrupted personal communication and affected businesses that use Snapchat for marketing and customer engagement.

Financial services faced critical disruptions with particularly serious implications. Robinhood, the stock trading platform with millions of active traders, became inaccessible—potentially preventing users from executing time-sensitive trades or managing positions during volatile market hours. Venmo users couldn’t send or receive payments, disrupting the peer-to-peer payment flows that millions rely on for splitting bills and transferring money. Chime, the mobile banking app serving millions of Americans, experienced problems preventing customers from accessing accounts or making payments. Coinbase, the largest U.S. cryptocurrency exchange, reported connectivity issues affecting users’ ability to trade digital assets whose prices can swing dramatically minute-to-minute.

For financial services, outages create existential risks. Traders unable to close positions during adverse market movements can suffer substantial losses. Consumers unable to make urgent payments face real-world consequences from missed rent deadlines to bounced checks. These companies face potential regulatory scrutiny, customer lawsuits, and lasting damage to trust—the currency financial businesses trade on.

Productivity and education tools disrupted work and learning. Canva, the design platform used by millions of businesses and individuals, went offline—halting ongoing projects and preventing users from accessing work. Duolingo’s outage interrupted language learners’ daily practice, breaking “streaks” that motivate consistent engagement. Canvas by Instructure, used by thousands of educational institutions for course management, experienced problems—potentially disrupting online classes, assignment submissions, and grade access during the academic year.

Amazon’s own ecosystem wasn’t spared. Amazon.com experienced problems, affecting the e-commerce giant’s retail operations. Amazon Prime Video users couldn’t stream content. Alexa, Amazon’s voice assistant embedded in millions of homes, struggled to respond to commands—a particularly visible failure that made the outage tangible even for non-technical users. Ring, Amazon’s home security camera system, had connectivity issues, preventing users from monitoring their homes remotely and undermining the security promise that justifies the product’s existence.

The simultaneous failure of Amazon’s consumer-facing services alongside AWS’s cloud infrastructure revealed an uncomfortable truth: Amazon built its entire business on the same foundation it rents to others. When that foundation cracks, everything suffers together.

The scale of individual impact multiplies into societal disruption. A gamer missing a Fortnite session experiences mild inconvenience. Millions of users unable to access entertainment simultaneously represents cultural disruption. A trader missing one trade opportunity faces individual loss. Thousands of traders locked out during market hours creates market distortion. One student missing an assignment deadline deals with personal consequences. An entire school unable to access learning materials disrupts education systemically.

The economic toll, while not yet quantified for this specific incident, likely reaches into hundreds of millions of dollars in lost revenue, productivity, and reputation damage. Previous comparable AWS outages have been estimated to cost affected companies $100+ million collectively in direct and indirect losses.

Why this keeps happening: The structural problems with cloud concentration

The October 20th outage isn’t an anomaly—it’s part of a pattern revealing systemic issues with how we’ve built modern internet infrastructure.

AWS has experienced significant outages with alarming regularity. In December 2021, two major disruptions within a week took down huge swaths of the internet, each lasting multiple hours. A July 2022 power outage in us-east-2 affected services for several hours. A September 2021 incident in us-east-1 lasted eight hours due to EBS storage problems. November 2020 saw a Kinesis failure cascade through dozens of services. September 2015 brought DynamoDB disruptions in us-east. The pattern extends backward through AWS’s entire history: major outages in 2013, 2012, 2011, 2010, and 2009.

These incidents share common characteristics: they often originate in us-east-1, stem from problems in foundational services, cascade through dependent systems, last multiple hours, and affect large portions of the internet simultaneously. The frequency suggests not just bad luck but structural vulnerabilities in how AWS architected and operates its infrastructure.

Technical complexity creates fragility at scale. Modern cloud platforms comprise millions of lines of code, thousands of microservices, complex networking configurations, distributed databases, sophisticated orchestration systems, and elaborate monitoring and management tools. This complexity exists necessarily—cloud platforms must solve genuinely hard technical problems around resource allocation, fault isolation, security, networking, and state management at unprecedented scale.

But complexity breeds failure modes. A configuration change in one system triggers unexpected behavior in another. A software bug lies dormant until specific traffic patterns expose it. A race condition appears only under high load. An optimization in one service creates resource contention affecting another. Cascading failures propagate through dependency chains in ways that are difficult to predict or prevent.

The “shared fate” problem intensifies concentration risk. When companies build applications entirely on AWS infrastructure—using EC2 for computing, DynamoDB for storage, Lambda for serverless functions, S3 for files, and CloudWatch for monitoring—they create comprehensive dependency on a single provider. AWS promotes this tight integration as a feature: services work seamlessly together with integrated security, simplified management, and optimized performance.

But this integration creates shared failure modes. An outage affecting core AWS services simultaneously disrupts everything built on top. Companies running “multi-region” architectures for redundancy still face problems if the failure occurs in shared control plane infrastructure or global services. The very integration that makes AWS convenient to use makes it difficult to build truly independent fallback systems.
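One common mitigation is reading from more than one region and failing over between them. The sketch below (Python with boto3, assuming a hypothetical DynamoDB global table named “orders” replicated to both regions) shows the idea and its limit: if the problem sits in a global service or shared control plane, every regional client fails the same way.

```python
import boto3
from botocore.exceptions import ClientError

# One client per region; assumes a global table replicated to both (hypothetical setup).
REGIONAL_CLIENTS = [
    boto3.client("dynamodb", region_name="us-east-1"),  # primary
    boto3.client("dynamodb", region_name="us-west-2"),  # secondary
]

def read_with_regional_failover(key):
    """Try each region in turn. This only helps when the failure is regional."""
    last_error = None
    for client in REGIONAL_CLIENTS:
        try:
            return client.get_item(TableName="orders", Key=key)
        except ClientError as err:
            last_error = err  # a failure in shared or global infrastructure hits both
    raise last_error
```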

Economic and organizational factors push companies toward concentrated risk. Building truly redundant, multi-cloud architectures is expensive and complex. Companies must maintain parallel infrastructure across multiple providers, manage different APIs and tools, ensure data consistency across systems, handle complicated failover logic, and staff teams with expertise in multiple platforms. Many companies, particularly startups with limited engineering resources, make the calculated bet that AWS’s 99.95% uptime SLA provides acceptable reliability at far lower cost than building elaborate redundancy.
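The arithmetic behind that bet is worth doing explicitly, because even a generous-sounding SLA permits hours of downtime a year. Exact SLA percentages vary by AWS service; the figures below are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

for uptime in (0.9999, 0.9995, 0.999, 0.99):
    allowed_downtime = (1 - uptime) * HOURS_PER_YEAR
    print(f"{uptime:.2%} uptime still allows ~{allowed_downtime:.1f} hours down per year")

# 99.95% works out to roughly 4.4 hours of permitted downtime annually --
# about the length of a single bad afternoon in us-east-1.
```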

This calculation makes sense for individual companies but creates systemic fragility. When thousands of companies make the same risk calculation, a single infrastructure provider’s failure affects society-wide services simultaneously. The market doesn’t efficiently price systemic risk—individual companies optimize for their own cost-benefit trade-offs without accounting for the social cost of widespread simultaneous failure.

The “too critical to fail” problem emerges. AWS has become so embedded in critical infrastructure that its reliability matters at a societal level—yet AWS operates as a private company primarily accountable to shareholders rather than the public. The company faces minimal regulatory oversight regarding reliability, transparency about outages, or preparation for resilience. When AWS experiences outages, it provides terse technical updates but rarely detailed post-mortems explaining root causes or committed improvements.

This arrangement creates moral hazard. AWS captures enormous economic value from operating as critical infrastructure but doesn’t bear full responsibility for the social costs when that infrastructure fails. Companies have limited alternative options due to AWS’s market dominance and the switching costs of moving workloads. Users of applications built on AWS have no choice at all—they’re exposed to concentration risk whether they realize it or not.

What it means: Implications for businesses, regulators, and the future

The October 20th incident, viewed alongside the pattern of recurring AWS outages, reveals tensions that will shape digital infrastructure’s evolution.

For businesses, the calculus around cloud dependency needs revisiting. The conventional wisdom has been that AWS’s scale, expertise, and investment in infrastructure make it more reliable than what individual companies could build. Historical data supports this for normal operations—AWS’s overall uptime exceeds what most companies achieve running their own data centers. But the frequency of significant outages affecting large portions of AWS’s customer base suggests the risk profile has shifted.

Companies should honestly assess their tolerance for and exposure to cloud outages. For consumer entertainment apps, multi-hour outages several times per year might represent acceptable risk—frustrating but not existential. For financial services platforms, healthcare systems, critical infrastructure control systems, or emergency services, the risk calculation changes dramatically. Lives, livelihoods, and essential services are at stake.

Multi-cloud and hybrid approaches offer theoretical protection but practical challenges. Building applications that can seamlessly fail over between AWS, Azure, and Google Cloud sounds prudent but requires enormous investment. The cloud providers use different APIs, different networking models, different security frameworks, and different operational paradigms. Maintaining truly parallel infrastructure essentially means building and operating everything twice—doubling costs and complexity.

A more pragmatic approach involves tiered strategies. Critical functionality might warrant true multi-cloud redundancy. Less critical features might accept occasional outages. Companies should maintain “break glass” contingency plans—simplified fallback modes that preserve core functionality even if full features are unavailable. They should avoid deep dependency on proprietary AWS services that create lock-in, preferring open-source alternatives that could migrate to other providers.
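A “break glass” fallback can be as simple as serving last-known-good data when the primary store is unreachable. A minimal sketch, assuming Python with boto3, a hypothetical “profiles” table, and an in-process dictionary standing in for whatever local cache a real system would use:

```python
import boto3
from botocore.exceptions import ClientError

dynamodb = boto3.client("dynamodb", region_name="us-east-1")
LAST_KNOWN_GOOD = {}  # stands in for a cache refreshed on every successful read

def get_profile(user_id):
    """Return (profile, freshness); serve stale data rather than an error page."""
    key = {"user_id": {"S": user_id}}
    try:
        item = dynamodb.get_item(TableName="profiles", Key=key).get("Item")
        if item is not None:
            LAST_KNOWN_GOOD[user_id] = item  # keep the fallback copy warm
        return item, "live"
    except ClientError:
        # Break glass: degraded but usable beats fully down.
        return LAST_KNOWN_GOOD.get(user_id), "stale"
```

The specific store matters less than the principle: the degraded path is designed and rehearsed in advance rather than improvised in the middle of an outage.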

From a public policy perspective, AWS’s market concentration and critical role invite regulatory attention. The telecommunications industry faced similar issues decades ago, leading to common carrier regulations requiring network reliability, service transparency, and customer protections. Financial infrastructure faces extensive regulation and oversight. Electrical grids operate under public utility frameworks. The argument that cloud computing deserves similar scrutiny grows stronger as these platforms become more critical to economic and social functioning.

Potential regulatory approaches might include mandatory outage reporting and transparency, reliability standards with enforcement mechanisms, incident investigation by independent authorities, interoperability requirements to reduce switching costs, and possibly structural separation between infrastructure operations and competing services. Any such regulation faces significant technical complexity and risk of stifling innovation—but leaving critical infrastructure entirely to market forces has demonstrated weaknesses.

The technology industry’s response matters as much as regulation. Open standards for cloud interoperability would reduce lock-in and enable realistic multi-cloud strategies. Better tools for chaos engineering—deliberately introducing failures to test resilience—could help companies identify and fix vulnerabilities before users experience them. Architectural patterns that embrace partial failure rather than assuming full reliability could make systems more robust.
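Chaos engineering does not require sophisticated tooling to start paying off; even a crude fault injector wrapped around a dependency call, run in a test environment, will reveal whether the fallback path actually works. A toy illustration in plain Python, with names invented for the example:

```python
import random

def with_injected_faults(call, failure_rate=0.2, exc=ConnectionError):
    """Wrap a dependency call so a fraction of invocations fail on purpose."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise exc("injected fault")  # simulate the dependency misbehaving
        return call(*args, **kwargs)
    return wrapper

# Hypothetical usage: wrap an outbound call in a staging environment and verify
# that callers degrade gracefully instead of crashing.
# flaky_read = with_injected_faults(some_client_call, failure_rate=0.5)
```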

AWS itself should face pressure for greater transparency about outages, more aggressive investment in infrastructure reliability, clearer communication about risk trade-offs, and perhaps separation of its newest, most complex services from the core infrastructure that so much depends on. The company’s profit margins on AWS remain high—it can afford investment in resilience if pushed to prioritize it.

For individual users, the lesson is humbling: we’ve built a world where our daily lives—communication, entertainment, work, education, financial transactions, home security—depend on infrastructure we don’t control, barely understand, and have no alternatives to. This dependency isn’t inherently wrong, but it comes with fragility we’ve only begun to recognize.

The October 20th outage joins a growing catalog of incidents—Facebook’s October 2021 BGP routing failure, the CrowdStrike update that crashed millions of Windows systems in July 2024, the Fastly CDN outage in June 2021—demonstrating that digital infrastructure concentration creates cascading, society-wide impacts when failures occur.

The uncomfortable truth: This will happen again

As this article publishes, AWS engineers are presumably working to restore full service and will eventually publish a post-mortem explaining what went wrong and what changes they’ll implement to prevent recurrence. That post-mortem will detail specific technical failures, outline remediation steps, and likely express regret for customer impact.

But the uncomfortable truth is this will happen again. Not necessarily identical technical failures—AWS presumably will fix the specific issues that caused the October 20th outage. But new failures will emerge from the irreducible complexity of operating infrastructure at this scale. Configuration changes will trigger unexpected cascades. Software bugs will lurk in code paths rarely executed. Hardware will fail in novel ways. Human errors will occur despite careful procedures.

AWS’s architecture, shaped by early decisions and accumulated technical debt, carries inherent fragility that can’t be eliminated through incremental improvements. True transformation would require massive refactoring efforts that would slow new feature development, potentially creating competitive disadvantage. The economic incentives don’t align with taking systems offline for months of architectural overhaul when they work most of the time.

The concentration of so much internet infrastructure in so few hands—primarily AWS, but also Microsoft Azure, Google Cloud, and a handful of other large providers—creates systemic risk that market forces alone won’t solve. Companies individually make rational decisions to use cloud providers, accepting occasional outages as the cost of avoiding infrastructure management. But collectively these decisions create society-level vulnerability to single points of failure.

We’ve built a house of cards on a foundation we assume is rock-solid, because most of the time it is. When that foundation shifts, we’re reminded that “the cloud” is just someone else’s computer—and computers fail.

The October 20th outage will eventually be resolved. Services will return to normal. Most people will forget it happened until the next time. But the fundamental question remains unanswered: how should society build critical digital infrastructure that millions depend on? Should it remain primarily in private hands, optimized for profit and innovation? Should it face greater regulation and oversight? Should we accept recurring outages as the price of convenient, affordable cloud services? Or should we demand better reliability even if it means higher costs and slower innovation?

These questions matter more each year as digital services penetrate deeper into daily life. The October 20th AWS outage is just the latest wake-up call—one of many we’ve received, and surely not the last. Whether we’ll actually wake up and address the structural issues it reveals remains to be seen.

