Elevated Error Rates
Incident Report for Estated
Postmortem

Incident Summary

Beginning around 9:27 AM UTC on October 5, 2020, we experienced a severe increase in failed requests to our V4 property endpoint (https://apis.estated.com/v4/property). While looking into this, we were alerted to a cyclic failover of our authentication system’s databases, which prevented V4 requests from getting past our authentication layer. During our investigation, AWS notified us of an RDS Aurora outage impacting a subset of customers in the us-west-2 region, which is where our authentication system’s databases are hosted.

Timeline

The events leading up to the outage and through its restoration are as follows:

  1. A snapshot of our database was created at 8:58 AM UTC
  2. Replication lag on a read-replica instance began to spike at 9:32 AM UTC
  3. Our writer instance was deemed failed and the failover mechanism kicked in at 9:35 AM UTC
  4. Our team tried to manually resolve the cyclic failover
  5. Failover continued to cycle between the readers and the previous writer until 3:00 PM UTC

Impact

This outage took our entire V4 property API down for approximately seven hours, though we were still able to process some requests intermittently. All customers who use the V4 API were affected. A number of users emailed us to report that they were experiencing an issue with our API, and we thank you for your diligence in doing so. Incidents are normally reported on our status page (https://status.estated.com), so we advise subscribing to it to receive updates as they become available.

If you observe any discrepancy between your number of remaining API calls prior to the outage and immediately after it was resolved, please contact support (support@estated.com) so we can look into this for you.

Detection

Our team detected this outage at 9:32 AM UTC, when our internal metrics and alerting systems reported that both our authorizer layer and V4 property API were experiencing issues.

Outcome

As a result of our team’s investigation, we were able to restore full access to our V4 property API by 4:30 PM UTC. AWS pushed a notification indicating that they were working to resolve the outage that impacted us. By 5:50 PM UTC, AWS had advised us that the issue on their end was resolved and service was back to normal.
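
As a rough cross-check of the times in this report, the short Python sketch below computes the span between the first failed requests (9:27 AM UTC) and the restoration of full access (4:30 PM UTC), and converts a PDT time of the kind used in our status updates into UTC. The variable names are illustrative only, and PDT is assumed to be a fixed UTC-7 offset.

```python
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: PDT treated as a fixed offset of UTC-7.
PDT = timezone(timedelta(hours=-7), "PDT")

# Times taken from this report (October 5, 2020).
errors_began = datetime(2020, 10, 5, 9, 27, tzinfo=UTC)
access_restored = datetime(2020, 10, 5, 16, 30, tzinfo=UTC)

# Span between the first failed requests and full restoration.
duration = access_restored - errors_began
print("Outage duration:", duration)

# A status update time given in PDT, converted to UTC for comparison.
normal_since_pdt = datetime(2020, 10, 5, 9, 35, tzinfo=PDT)
normal_since_utc = normal_since_pdt.astimezone(UTC)
print("Error rates normal since:", normal_since_utc.strftime("%H:%M UTC"))
```

The computed span is 7 hours 3 minutes, in line with the "approximately seven hours" stated under Impact, and 9:35am PDT corresponds to 16:35 UTC.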

Next Steps

As a result of this outage, our team is working with AWS to determine what steps can be taken to reduce the impact an event of this type can have on our customers and to minimize any downtime.

Posted Oct 05, 2020 - 20:16 UTC

Resolved
We have received confirmation from AWS that the problem is fixed, and this issue is now resolved.
Posted Oct 05, 2020 - 20:10 UTC
Monitoring
Error rates have returned to normal since 9:35am PDT. We are still working to identify the underlying cause of the outage, and are monitoring for further issues.
Posted Oct 05, 2020 - 17:53 UTC
Identified
We have identified an issue with one of our databases being stuck in a failover loop, due to an underlying problem with AWS infrastructure. We are working to resolve this issue and restore connectivity to our API as soon as possible.
Posted Oct 05, 2020 - 16:35 UTC
Investigating
We are experiencing high error rates with our V4 API, starting around 2:25am PDT. We are investigating the cause.
Posted Oct 05, 2020 - 14:47 UTC