Unable to access Sendcloud Panel and use Sendcloud API

Incident Report for Sendcloud

Postmortem

A deep dive into Sendcloud's September 16, 2025, panel and API outage

On September 16, 2025, Sendcloud experienced a complete outage of our shipping panel and API services that lasted 40 minutes, from 11:19 to 11:59 CEST. During this time, our customers were unable to access the Sendcloud panel, create shipping labels, or use our API endpoints to manage their shipments.

The incident originated from a database schema change that caused table locks, but was significantly amplified by our recent infrastructure upgrade to an asynchronous server gateway interface (ASGI). This combination created a cascade failure that required us to completely shut down and restart our entire platform to restore service.

We regret any inconvenience caused by the temporary unavailability of our services. This post provides a detailed analysis of what happened and the steps we're implementing to prevent similar incidents.

Timeline

Time (CEST)  Description

11:00  Routine deployment begins, including a database schema change to modify the payment account structure
11:19  Database migration executes, creating locks on the invoices table. Panel and API become unavailable (IMPACT START)
11:23  The engineering team begins a rollback to the previous version
11:31  Incident formally declared as the rollback shows no improvement
11:56  Decision made to perform a complete platform shutdown due to cascading database connection overload
11:59  Platform restart completed, services fully restored (IMPACT END)
12:19  Incident officially closed after monitoring confirms stability

What happened

The immediate trigger was a database schema migration that modified the structure of our payment accounts table. This migration required updating foreign key constraints on our invoices table, which temporarily locked the table for approximately 45 seconds.

Under normal circumstances, this brief lock would have caused minimal impact. However, our recent infrastructure upgrade to ASGI (Asynchronous Server Gateway Interface) fundamentally changed how our system handles database connections during disruptions.

Unlike our previous synchronous setup (WSGI), our ASGI stack keeps accepting HTTP requests even when the database is unavailable, opening a new database connection for each request. As customers' systems automatically retried failed requests, our platform accumulated over 50,000 queued database connections through our RDS proxy.

This created a catastrophic feedback loop:

  • Application pods crashed due to database connection failures
  • Kubernetes automatically restarted crashed pods
  • Each restarting pod required database queries to initialize
  • These initialization queries joined the massive connection queue
  • The growing queue caused more pods to fail, perpetuating the cycle
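
To make this dynamic concrete, the minimal asyncio sketch below (illustrative only, not our production code) shows how an asynchronous server keeps accepting work while the database is locked: every request awaits its own connection attempt, nothing bounds how many attempts are in flight, and client retries keep adding more.

```python
import asyncio

# Purely illustrative sketch (not our production code) of the dynamic above:
# an async server keeps accepting requests while the database is locked, every
# request awaits its own connection attempt, and nothing bounds how many
# attempts are in flight at once.

in_flight = 0


async def acquire_connection():
    """Stand-in for a connection attempt stuck behind the table lock."""
    global in_flight
    in_flight += 1
    try:
        await asyncio.sleep(120)  # mimics a generous proxy timeout
    finally:
        in_flight -= 1


async def handle_request():
    await acquire_connection()  # no limit on concurrent attempts


async def retry_storm(retries: int) -> None:
    # Client retries keep creating handlers; the event loop schedules them all.
    tasks = [asyncio.create_task(handle_request()) for _ in range(retries)]
    await asyncio.sleep(1)  # give every task a chance to start
    print(f"connection attempts in flight: {in_flight}")
    for task in tasks:
        task.cancel()
    await asyncio.gather(*tasks, return_exceptions=True)


if __name__ == "__main__":
    asyncio.run(retry_storm(50_000))
```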

Impact assessment

Customer Impact:

  • Complete inability to access the Sendcloud panel
  • All API endpoints returned 5xx errors
  • Shipping label creation and order management were unavailable
  • Duration: 40 minutes of total service unavailability

Services Affected:

  • Sendcloud shipping panel (Web interface)
  • All API endpoints
  • Order processing and label generation
  • Customer integrations and webhooks

Data Integrity:

  • No data loss occurred during the incident
  • All orders and shipments created before the outage remained intact
  • No financial transactions were affected

Our response

Our engineering team responded immediately when monitoring systems detected the service degradation. Initial efforts focused on rolling back the deployment that included the problematic migration. However, the rollback was ineffective because the root cause had evolved beyond the original database lock.

The critical insight came when we realized that our infrastructure's new asynchronous architecture was amplifying the problem. With tens of thousands of connection attempts queued in our database proxy, even removing the original cause couldn't restore service—the system was trapped in a death spiral of failing connection attempts.

This led to the difficult decision to perform a complete platform shutdown and cold restart, which cleared the connection queue and immediately restored normal operation.

Root cause analysis

The incident resulted from the intersection of three factors:

  1. Database Migration Design

Our payment account table migration required constraint recreation on a large invoices table, causing a brief but impactful lock. While the migration itself completed in under a minute, the lock affected one of our most frequently accessed tables.
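
For illustration, assuming a Django and PostgreSQL stack (neither is stated elsewhere in this report, and the table, column, and app names below are placeholders), a lock-minimising version of this kind of constraint change could add the foreign key as NOT VALID, a near-instant metadata change, and validate it in a separate step that does not block normal reads or writes on the invoices table:

```python
# Hypothetical lock-minimising migration, assuming a Django + PostgreSQL stack
# (neither is stated in this report); table, column, and app names are
# placeholders. The foreign key is added as NOT VALID, a near-instant change,
# and validated separately without blocking normal reads or writes.
from django.db import migrations


class Migration(migrations.Migration):
    atomic = False  # run each statement in its own transaction

    dependencies = [
        ("payments", "0042_previous_migration"),  # hypothetical predecessor
    ]

    operations = [
        migrations.RunSQL(
            sql=(
                "ALTER TABLE invoices "
                "ADD CONSTRAINT invoices_payment_account_fk "
                "FOREIGN KEY (payment_account_id) "
                "REFERENCES payment_accounts (id) NOT VALID;"
            ),
            reverse_sql=(
                "ALTER TABLE invoices "
                "DROP CONSTRAINT invoices_payment_account_fk;"
            ),
        ),
        migrations.RunSQL(
            sql="ALTER TABLE invoices VALIDATE CONSTRAINT invoices_payment_account_fk;",
            reverse_sql=migrations.RunSQL.noop,
        ),
    ]
```

The trade-off is that the change no longer runs in a single transaction, so it needs its own failure handling, but the invoices table stays available while the constraint is validated.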

  2. ASGI Framework Behaviour

Our recent infrastructure upgrade to ASGI improved performance under normal conditions, but altered system behaviour during failure scenarios. ASGI's non-blocking architecture meant the platform continued to accept requests during the database lock, letting connection attempts pile up rather than failing fast.
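
A simple way to restore fail-fast behaviour in an ASGI stack, and the shape of the backpressure mechanism described under Immediate Actions below, is to bound how many requests may wait for the database at once and reject the rest immediately. The middleware below is an illustrative sketch, not our production code; the concurrency limit and the 503 response are placeholder choices.

```python
import asyncio

# Illustrative ASGI middleware (not our production code) that bounds how many
# HTTP requests may be in flight at once, and therefore how many database
# connections the application can try to open. The limit of 100 and the 503
# response are placeholder choices.


class DatabaseBackpressureMiddleware:
    def __init__(self, app, max_concurrent_requests: int = 100):
        self.app = app
        self.semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        if self.semaphore.locked():
            # Every slot is taken: fail fast instead of opening yet another
            # database connection that would only join the queue.
            await send({
                "type": "http.response.start",
                "status": 503,
                "headers": [(b"content-type", b"text/plain"), (b"retry-after", b"5")],
            })
            await send({"type": "http.response.body", "body": b"Service temporarily overloaded"})
            return

        async with self.semaphore:
            await self.app(scope, receive, send)
```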

  3. Connection Pool Amplification

Our RDS database proxy, designed to manage connection pooling efficiently, became a bottleneck when overwhelmed with 50,000+ simultaneous connection attempts. The proxy's generous timeout settings, appropriate for normal operations, prevented rapid failure of queued connections during the incident.
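
For reference, the borrow timeout on an RDS Proxy target group can be tightened with a call along these lines (a hypothetical sketch using boto3; the proxy name, target group, and region are placeholders, and the specific change is listed under Immediate Actions below). ConnectionBorrowTimeout caps how long a request may wait for a pooled connection before it fails, so a locked database sheds load instead of queueing it:

```python
import boto3

# Hypothetical sketch of tightening the borrow timeout on an RDS Proxy target
# group; the proxy name, target group, and region are placeholders, not our
# real resources. ConnectionBorrowTimeout caps how long a request may wait for
# a pooled connection before it fails.
rds = boto3.client("rds", region_name="eu-west-1")

rds.modify_db_proxy_target_group(
    DBProxyName="example-db-proxy",
    TargetGroupName="default",
    ConnectionPoolConfig={
        "ConnectionBorrowTimeout": 30,  # seconds; down from the 120-second default
    },
)
```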

Prevention and improvements

We're implementing several changes to prevent similar incidents:

Immediate Actions (Completed):

  • Implemented backpressure mechanisms in ASGI applications to limit concurrent database connections
  • Reduced RDS proxy connection timeout settings from 120 seconds to 30 seconds to prevent queue buildup
  • Enhanced database migration review process to identify potential high-impact table modifications

Short-term Improvements (In Progress):

  • Developing automated rollback mechanisms that can force deployment reversions even when new pods fail to start
  • Creating specialized runbooks for database lock incident response
  • Implementing connection circuit breakers to prevent cascade failures during database unavailability
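
The connection circuit breaker in the last item could take roughly the following shape (an illustrative sketch; the failure threshold and recovery window are placeholder values): after a run of consecutive connection failures the breaker opens and callers fail immediately, and once a cooling-off period has passed a single probe call is allowed through to check whether the database has recovered.

```python
import time

# Illustrative circuit breaker for database calls (not our production code);
# the failure threshold and recovery window are placeholder values.


class CircuitOpenError(Exception):
    """Raised while the breaker is open and calls are rejected immediately."""


class ConnectionCircuitBreaker:
    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failure_count = 0
        self.opened_at = None  # timestamp when the breaker opened, if it has

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.recovery_timeout:
                raise CircuitOpenError("database circuit is open; failing fast")
            self.opened_at = None  # recovery window elapsed: allow one probe call

        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self.opened_at = time.monotonic()  # open (or re-open) the breaker
            raise
        else:
            self.failure_count = 0  # a healthy call resets the breaker
            return result
```

Wrapping connection acquisition in breaker.call(...) turns a stalled database into fast, bounded failures that retries and load balancers can absorb, instead of a growing queue of blocked requests.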

Long-term Architectural Changes (Planned):

  • Designing graceful degradation patterns for ASGI applications during database unavailability
  • Implementing request shedding at the load balancer level during database incidents
  • Establishing more sophisticated migration impact assessment processes

What we learned

This incident highlighted how infrastructure improvements can introduce new failure modes that aren't immediately apparent. Our move to ASGI improved normal-case performance but changed how our system responded during database disruptions.

The experience emphasized the importance of understanding cascade failure patterns in distributed systems and the need for infrastructure changes to include comprehensive failure mode analysis.

Most importantly, it demonstrated that effective incident response requires not just identifying the immediate trigger but understanding how system interactions can amplify problems beyond their original scope.

Moving forward

We're committed to transparency about our infrastructure and the continuous improvement of our service reliability. This incident provides valuable insights that will make our platform more resilient.

We've already begun implementing the technical improvements outlined above and are conducting additional reviews of our other recent infrastructure changes to identify similar potential failure modes.

We understand that reliable shipping infrastructure is important to your business operations, and we are dedicated to continuously improving that reliability. Thank you for your patience as we continue to strengthen our platform.

For any questions about this incident or its impact on your shipments, please don't hesitate to reach out to our support team.

Posted Sep 19, 2025 - 10:57 CEST

Resolved

This incident has been resolved.
Posted Sep 16, 2025 - 13:29 CEST

Monitoring

Services are now operational. We are actively monitoring system performance and functionality to ensure everything continues to run smoothly. Thank you for your understanding and patience during this period.
Posted Sep 16, 2025 - 12:08 CEST

Update

We are continuing to investigate this issue.
Posted Sep 16, 2025 - 11:45 CEST

Update

We are currently investigating an issue preventing users from accessing the Sendcloud Panel and using the Sendcloud API.
Posted Sep 16, 2025 - 11:41 CEST

Investigating

We are currently investigating an issue preventing users from accessing the Sendcloud Panel.
Posted Sep 16, 2025 - 11:37 CEST
This incident affected: Sendcloud (Sendcloud Panel, Sendcloud website, Sendcloud Email Services, Sendcloud Service points, Tracking page, Sendcloud Chat, Sendcloud Phone, Sendcloud Payments, Sendcloud Infrastructure, Sendcloud Returns, Sendcloud Analytics, Dynamic Checkout Form, Sendcloud Print Settings, Shipping Intelligence (Tracey), Dynamic Checkout API, Sendcloud Insurance) and Affected market (GLOBAL, AT, BE, DE, ES, FR, IT, NL, UK).