Email sending delay
Incident Report for HOA Express
Postmortem

For roughly two hours on the morning of October 22nd, a portion of emails sent through our platform were queued but not sent by our email sending partner, SendGrid. Email sending is a critical component of our infrastructure that thousands of communities rely on to send time-sensitive information, so we understand the impact any interruption to the service can have.

We’re writing this postmortem to document what happened, how it happened, and what we’re doing about it. All times are in Central Time, where HOA Express’s headquarters are located.

What happened

At 7:41 AM, we received an alert from SendGrid’s consumer trust team stating that a suspicious link had been noticed in an outbound email and that they had immediately stopped delivering emails sent by our account. The alert did not identify which account they were referring to. We have multiple SendGrid accounts for different purposes and development environments, so it took our team a few minutes to identify the account in question, and a few more minutes to understand the impact on our platform.

By 8:04 AM, our product team had published an incident to our public status page to be transparent about the issue and the impact.

Meanwhile, our team opened a dialogue with SendGrid’s team to understand the issue and to resume email sending operations as soon as possible. SendGrid’s responses, unfortunately, were concerningly slow and shed little additional light on what occurred. Even now, we have not received any details about the suspicious link: how it was noticed, who sent it, or who received it.

Around 8:15 AM, we started working to provision a backup SendGrid account and to divert affected messages through it. At 8:22 AM, thousands of emails began routing through this new account. At first, this workaround resolved the issue.

Unfortunately, because we are a high-volume sender, the large and sudden influx of emails to a new account triggered SendGrid’s undocumented internal protection systems, and by 8:30 AM, we noticed these emails were being queued but not delivered. At the time we weren’t sure why, and it wasn’t until much later in the day that we were made aware of their throttling mechanism for new accounts.
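
One general way to lower the risk of tripping this kind of protection is to ramp volume on a new account gradually instead of diverting everything at once. The sketch below is illustrative only: the class name, the starting cap, the growth factor, and the ceiling are all hypothetical, and they do not reflect SendGrid’s actual thresholds, which are not documented.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class WarmupSplitter:
    """Caps how many messages per interval go to a newly created account.

    All numbers here are hypothetical placeholders, not provider limits.
    """
    initial_cap: int = 100      # messages allowed in the first interval
    growth: float = 2.0         # cap multiplier applied each interval
    max_cap: int = 100_000      # ceiling once the account is warmed up

    def cap_for(self, interval: int) -> int:
        """Cap for a zero-based interval counted from when the account went live."""
        return min(self.max_cap, int(self.initial_cap * self.growth ** interval))

    def split(self, pending: int, interval: int) -> Tuple[int, int]:
        """Return (send_now, hold_back) for `pending` queued messages."""
        send_now = min(pending, self.cap_for(interval))
        return send_now, pending - send_now


if __name__ == "__main__":
    splitter = WarmupSplitter()
    for interval in range(4):
        print(interval, splitter.split(5000, interval))
```

Messages held back would stay in the platform’s own queue, or go to a different provider, until the cap grows enough to absorb them.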

We continued our dialogue with SendGrid’s team, and at 9:43 AM we were informed that SendGrid would resume delivering emails through the affected account. We then stopped diverting emails to the newly created backup account.

Next steps

SendGrid is owned by Twilio, a publicly traded $44 billion company, and delivers over 70 billion emails per year to more than half of the world’s email addresses. They’re relied upon by companies like Uber, eBay, Walmart, Spotify, Airbnb, Glassdoor, Intuit, and over 80,000 other companies.

We ourselves have been a SendGrid customer for nearly a decade, spend tens of thousands of dollars on their service, and deliver tens of millions of emails annually through them. Yet SendGrid’s team made no attempt to contact us about the suspicious link their system flagged before disabling email delivery wholesale. They then took over two hours to resolve the matter, with poor communication along the way.

Make no mistake, we are deeply concerned by SendGrid’s handling of this situation. Our team has already begun evaluating alternative email sending partners, and we’re still communicating with SendGrid’s team to better understand how this happened and how their processes can be improved to avoid future situations like this.

In the coming weeks, we intend to either build confidence that this will never happen again on SendGrid’s service, or we will terminate our long-standing relationship with SendGrid. We’re also investigating ways to further separate emails, like routing emails sent by free/trial communities separately from emails sent by paying communities, which may alleviate the impact of this type of situation should it happen again.
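
As a rough illustration of the separation idea, the sketch below routes each message to a sending account based on the community’s plan, so that a suspension of one account leaves the other unaffected. The plan names, account labels, and the `pick_account` function are hypothetical and do not describe our actual implementation.

```python
from enum import Enum
from typing import FrozenSet, Optional


class Plan(Enum):
    FREE = "free"
    TRIAL = "trial"
    PAID = "paid"


# Hypothetical account labels; a real system would map these to credentials.
ACCOUNT_FOR_PLAN = {
    Plan.FREE: "unpaid-pool",
    Plan.TRIAL: "unpaid-pool",
    Plan.PAID: "paid-pool",
}


def pick_account(plan: Plan, suspended: FrozenSet[str] = frozenset()) -> Optional[str]:
    """Return the sending account for a community's plan.

    Returns None when that account is suspended, signalling that the
    message should be held for retry rather than moved to another pool.
    """
    account = ACCOUNT_FOR_PLAN[plan]
    return None if account in suspended else account


if __name__ == "__main__":
    down = frozenset({"unpaid-pool"})
    print(pick_account(Plan.PAID, down))   # paid traffic keeps flowing
    print(pick_account(Plan.TRIAL, down))  # trial traffic is held
```

In this sketch, held messages are deliberately not rerouted to the paid pool, since moving suspect traffic there could put that account at risk too.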

Posted Oct 22, 2020 - 16:23 CDT

Resolved
Since the underlying issue has been resolved, and has been for quite some time, we're going to be resolving this incident on our status page. We're still waiting for further communication from SendGrid's team regarding the small number of remaining emails stuck in a "processing" state. An incident postmortem will follow.
Posted Oct 22, 2020 - 15:15 CDT
Update
A small number of emails are still waiting to be sent on SendGrid's end; we're awaiting further information from their team.
Posted Oct 22, 2020 - 11:55 CDT
Update
Our monitoring systems are showing that newly sent emails are now being delivered without delay. A portion of emails sent before the resolution was reached are still queued to be sent, and we're working with SendGrid's team to ensure they're delivered as soon as possible.
Posted Oct 22, 2020 - 09:56 CDT
Monitoring
We've resolved the issue with SendGrid's team. We expect delayed emails to be delivered soon. Our team is still actively monitoring the situation.
Posted Oct 22, 2020 - 09:47 CDT
Update
We're continuing to see delayed emails, even with the workaround in place. Our team is working to resolve the issue as soon as possible.
Posted Oct 22, 2020 - 09:15 CDT
Update
We've applied a workaround for emails sent from this moment going forward, and are continuing to work with our email sending partner on a resolution to the delayed emails sent over the last ~30 minutes.
Posted Oct 22, 2020 - 08:34 CDT
Identified
We've identified an issue with our email sending partner, SendGrid, and are working with them to resolve it. Emails may be delayed during this time.
Posted Oct 22, 2020 - 08:04 CDT
This incident affected: Third parties.