Beginning around 9:27 AM UTC on October 5, 2020, we experienced a severe increase in failed requests to our V4 property endpoint (https://apis.estated.com/v4/property). While responding, we were alerted to a cyclic failover of our authentication system's databases, which prevented all V4 requests from passing our authentication layer. During our investigation, AWS notified us of an RDS Aurora outage impacting a subset of customers in the us-west-2 region, which is where our authentication system's databases are hosted.
The events leading up to the outage and through its restoration are as follows:
This outage took our entire V4 property API down for approximately seven hours, though we were still able to process some requests intermittently. All customers who use the V4 API were affected. A number of users emailed us to report that they were experiencing an issue with our API, and we thank you for your diligence in doing so. We normally report issues on our status page (https://status.estated.com), so we advise you to subscribe to it to receive updates as they become available.
If you observe any discrepancy between your number of API calls remaining before the outage and the number remaining immediately after it was resolved, please contact support (support@estated.com) so we can look into it for you.
Our team detected this outage at 9:32 AM UTC, roughly five minutes after requests began to fail. Our internal metrics and alerting systems notified us that both our authorizer layer and V4 property API were experiencing issues.
Following our team's investigation, we restored full access to our V4 property API by 4:30 PM UTC. AWS also pushed a notification indicating that they were working to resolve the outage that impacted us. By 5:50 PM UTC, AWS had advised us that the issue on their end was resolved and service was back to normal.
As a result of this outage, our team is working with AWS to determine what steps we can take to reduce the impact an event of this type has on our customers and to keep any future downtime to a minimum.