On September 16, 2025, Sendcloud experienced a complete outage of our shipping panel and API services that lasted 40 minutes, from 11:19 to 11:59 CEST. During this time, our customers were unable to access the Sendcloud panel, create shipping labels, or use our API endpoints to manage their shipments.
The incident originated from a database schema change that caused table locks, but was significantly amplified by our recent infrastructure upgrade to an asynchronous server gateway interface (ASGI). This combination created a cascading failure that required us to shut down and restart our entire platform to restore service.
We regret any inconvenience caused by the temporary unavailability of our services. This post provides a detailed analysis of what happened and the steps we're implementing to prevent similar incidents.
Time (CEST) Description
11:00 Routine deployment begins, including a database schema change to modify the payment account structure
11:19 Database migration executes, creating locks on the invoices table; the panel and API become unavailable (IMPACT START)
11:23 The engineering team begins rolling back to the previous version
11:31 Incident formally declared after the rollback shows no improvement
11:56 Decision made to perform a complete platform shutdown due to cascading database connection overload
11:59 Platform restart completed; services fully restored (IMPACT END)
12:19 Incident officially closed after monitoring confirms stability
The immediate trigger was a database schema migration that modified the structure of our payment accounts table. This migration required updating foreign key constraints on our invoices table, which temporarily locked the table for approximately 45 seconds.
Under normal circumstances, this brief lock would have caused minimal impact. However, our recent infrastructure upgrade to ASGI (Asynchronous Server Gateway Interface) fundamentally changed how our system handles database connections during disruptions.
Unlike our previous synchronous setup (WSGI), where a fixed pool of blocking workers naturally capped the number of concurrent requests, ASGI continues accepting HTTP requests even when the database is unavailable, opening a new database connection for each request. As customers' systems automatically retried failed requests, our platform accumulated over 50,000 queued database connections through our RDS proxy.
This created a catastrophic feedback loop: failed requests triggered automatic retries from customers' systems, each retry opened yet another database connection, and the growing backlog slowed responses further, which in turn triggered still more retries.
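To make that mechanism concrete, the following is a small, self-contained asyncio simulation of the pattern. The client behaviour, timeouts, and numbers are all invented for illustration; it models the amplification effect, not our actual services or traffic.

```python
"""
Illustrative asyncio simulation of the retry amplification described above.
Every name, timeout, and number here is invented; this models the pattern,
not our actual services.
"""
import asyncio

pending_connections = 0  # connection attempts currently queued "at the database"


async def open_db_connection(db_available: asyncio.Event) -> None:
    """Stand-in for connect(): hangs for as long as the table lock is held."""
    global pending_connections
    pending_connections += 1
    try:
        await db_available.wait()  # never set during the simulated incident
    finally:
        pending_connections -= 1


async def handle_request(db_available: asyncio.Event) -> None:
    """ASGI-style handler: the request is accepted, then waits on a fresh connection.
    shield() models the queued connection attempt outliving the client that gave up on it."""
    await asyncio.shield(open_db_connection(db_available))


async def client(db_available: asyncio.Event, retries: int, timeout: float) -> None:
    """A customer's integration: short timeout, immediate retry on failure."""
    for _ in range(retries):
        try:
            await asyncio.wait_for(handle_request(db_available), timeout)
        except asyncio.TimeoutError:
            pass  # each retry queues yet another connection attempt


async def main() -> None:
    db_available = asyncio.Event()  # stays unset: the "lock" is never released here
    clients = [asyncio.create_task(client(db_available, 10, 0.05)) for _ in range(200)]
    for step in range(1, 4):
        await asyncio.sleep(0.2)
        print(f"after {step * 0.2:.1f}s: {pending_connections} queued connection attempts")
    for task in clients:
        task.cancel()


if __name__ == "__main__":
    asyncio.run(main())
```

Running it shows the number of queued connection attempts climbing steadily even though no individual client ever has more than one request in flight.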
Customer Impact: For roughly 40 minutes, customers could not log in to the Sendcloud panel, create shipping labels, or manage their shipments through the API.
Services Affected: The Sendcloud panel and all API endpoints.
Data Integrity:
Our engineering team responded immediately when monitoring systems detected the service degradation. Initial efforts focused on rolling back the deployment that included the problematic migration. However, the rollback was ineffective because the root cause had evolved beyond the original database lock.
The critical insight came when we realized that our infrastructure's new asynchronous architecture was amplifying the problem. With tens of thousands of connection attempts queued in our database proxy, even removing the original cause couldn't restore service—the system was trapped in a death spiral of failing connection attempts.
This led to the difficult decision to perform a complete platform shutdown and cold restart, which cleared the connection backlog and restored normal operation.
The incident resulted from the intersection of three factors:
Our payment account table migration required constraint recreation on a large invoices table, causing a brief but impactful lock. While the migration itself completed in under a minute, the lock affected one of our most frequently accessed tables.
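For readers interested in the mechanics, the sketch below shows one standard PostgreSQL technique for keeping such locks short when adding or recreating a foreign key: add the constraint as NOT VALID, which skips the full-table scan, then validate it in a separate step that does not block reads or writes. The table, column, constraint, and connection names are hypothetical, and this is an illustration of the technique rather than the exact migration we ran or the specific fix we have adopted.

```python
"""
Hedged sketch: recreating a foreign key on a large, busy PostgreSQL table
while keeping the lock window short. All identifiers and the DSN are
hypothetical; this is not the actual Sendcloud migration.
"""
import psycopg2

DSN = "dbname=app user=app host=localhost"  # placeholder connection string

# Step 1: NOT VALID adds the constraint without scanning existing rows, so the
# lock it takes on the invoices table is held only momentarily.
ADD_NOT_VALID = """
ALTER TABLE invoices
    ADD CONSTRAINT invoices_payment_account_fk
    FOREIGN KEY (payment_account_id) REFERENCES payment_accounts (id)
    NOT VALID;
"""

# Step 2: validation scans existing rows under a weaker lock that does not
# block concurrent reads or writes on the invoices table.
VALIDATE = "ALTER TABLE invoices VALIDATE CONSTRAINT invoices_payment_account_fk;"

conn = psycopg2.connect(DSN)
conn.autocommit = True  # run each DDL statement in its own short transaction
try:
    with conn.cursor() as cur:
        # If the migration cannot get its lock within 3 seconds, fail and retry
        # later instead of queueing behind long-running queries.
        cur.execute("SET lock_timeout = '3s';")
        cur.execute(ADD_NOT_VALID)
        cur.execute(VALIDATE)
finally:
    conn.close()
```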
Our recent infrastructure upgrade to ASGI improved performance under normal conditions, but altered system behaviour during failure scenarios. ASGI's non-blocking architecture meant it continued to accept requests during the database lock, rapidly multiplying connection attempts rather than failing quickly.
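As a rough illustration of what failing quickly can look like in an asynchronous stack, the sketch below caps in-flight database work and sheds the overflow immediately instead of letting each additional request open another connection. The names, the limit, and the load-shedding policy are assumptions made for this post, not a description of our production code or of the fix we shipped.

```python
"""
Hedged sketch: load shedding in an async request path, so that requests beyond
a fixed budget of in-flight database work fail immediately instead of opening
more connections. Names, limits, and the placeholder query are invented.
"""
import asyncio

MAX_IN_FLIGHT_DB_CALLS = 100  # assumed per-process budget, created at startup
db_slots = asyncio.Semaphore(MAX_IN_FLIGHT_DB_CALLS)


class DatabaseOverloaded(Exception):
    """Translate this into an HTTP 503 at the view or middleware layer."""


async def run_query(sql: str) -> list:
    """Placeholder for the real database call."""
    await asyncio.sleep(0.01)
    return []


async def guarded_query(sql: str) -> list:
    # When every slot is taken the database is already saturated, so shed the
    # request now rather than queueing yet another connection attempt.
    if db_slots.locked():
        raise DatabaseOverloaded("database concurrency budget exhausted")
    async with db_slots:
        return await run_query(sql)
```

Rejected requests can then be returned as an HTTP 503 with a Retry-After header, so well-behaved clients back off instead of retrying immediately.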
Our RDS database proxy, designed to manage connection pooling efficiently, became a bottleneck when overwhelmed with 50,000+ simultaneous connection attempts. The proxy's generous timeout settings, appropriate for normal operations, prevented rapid failure of queued connections during the incident.
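Timeout budgets are the other half of that equation. Purely as an illustration of client-side settings on our side of the proxy, the sketch below uses asyncpg as an example driver and configures a pool so that establishing a connection, waiting for a free one, and running a query each give up within seconds. The DSN, pool sizes, and timeout values are placeholders, not our RDS Proxy configuration.

```python
"""
Illustrative timeout configuration for an asyncpg connection pool. The DSN,
pool sizes, and timeout values are placeholders, not our RDS Proxy settings.
"""
import asyncpg


async def create_db_pool():
    return await asyncpg.create_pool(
        dsn="postgresql://app@db-proxy.internal/app",  # placeholder address
        min_size=5,
        max_size=50,           # hard cap on connections this process can open
        timeout=5.0,           # give up on establishing a connection after 5 s
        command_timeout=10.0,  # give up on an individual query after 10 s
    )


async def fetch_invoice_ids(pool):
    # acquire() takes its own timeout, so a request fails fast when every
    # pooled connection is busy instead of joining an unbounded queue.
    async with pool.acquire(timeout=2.0) as conn:
        return await conn.fetch("SELECT id FROM invoices LIMIT 10")
```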
We're implementing several changes to prevent similar incidents:
Immediate Actions (Completed):
Short-term Improvements (In Progress):
Long-term Architectural Changes (Planned):
This incident highlighted how infrastructure improvements can introduce new failure modes that aren't immediately apparent. Our move to ASGI improved normal-case performance but changed how our system responded during database disruptions.
The experience emphasized the importance of understanding cascade failure patterns in distributed systems and the need for infrastructure changes to include comprehensive failure mode analysis.
Most importantly, it demonstrated that effective incident response requires not just identifying the immediate trigger but understanding how system interactions can amplify problems beyond their original scope.
We're committed to transparency about our infrastructure and the continuous improvement of our service reliability. This incident provides valuable insights that will make our platform more resilient.
We've already begun implementing the technical improvements outlined above and are conducting additional reviews of our other recent infrastructure changes to identify similar potential failure modes.
We understand that reliable shipping infrastructure is important to your business operations, and we are dedicated to continuously improving that reliability. Thank you for your patience as we continue to strengthen our platform.
For any questions about this incident or its impact on your shipments, please don't hesitate to reach out to our support team.