For about two hours on the morning of October 22nd, a portion of the emails sent through our platform were queued but not delivered by our email sending partner, SendGrid. Email sending is a critical component of our infrastructure that thousands of communities rely on to send time-sensitive information, so we understand the impact any interruption to the service can have.
We’re writing this postmortem to document what happened, how it happened, and what we’re doing about it. All times are in Central Time, where HOA Express’s headquarters are located.
At 7:41 AM, we received an alert from SendGrid’s consumer trust team stating that a suspicious link had been noticed in an outbound email, and that they had immediately stopped delivering emails sent by our account. The alert did not identify the account they were referring to. We have multiple accounts with SendGrid for different purposes and development environments, so it took our team a few minutes to identify the account in question, and a few more minutes to understand the impact on our platform.
By 8:04 AM, our product team had published an incident to our public status page to be transparent about the issue and the impact.
Meanwhile, our team opened a dialogue with SendGrid’s team to understand the issue and to work with them to resume email sending operations as soon as possible. SendGrid’s responses, unfortunately, were concerningly slow and shed little additional light on what occurred. Even now, we have not received any details about the suspicious link: how it was noticed, who sent it, or who it was sent to.
Around 8:15 AM, we started working to provision a backup SendGrid account and to divert affected messages through this new account. At 8:22 AM, thousands of emails began routing through this new account. At first, this workaround resolved the issue.
Unfortunately, because we are a high-volume sender, this large and sudden influx of emails to a new account triggered SendGrid’s undocumented internal protection systems, and by 8:30 AM we noticed these emails were being queued but not delivered. At the time we weren’t sure why, and it wasn’t until much later in the day that we were made aware of their throttling mechanism for new accounts.
We continued our dialogue with SendGrid’s team, and at 9:43 AM we were informed that SendGrid would resume delivering emails through the affected account. At that point, we stopped diverting emails to the newly created backup account.
SendGrid, owned by Twilio, is a publicly traded $44 billion company that delivers over 70 billion emails per year to more than half of the world’s email addresses. They’re relied upon by companies like Uber, eBay, Walmart, Spotify, Airbnb, Glassdoor, Intuit, and over 80,000 other companies.
We ourselves have been a customer of SendGrid’s for nearly a decade, spend tens of thousands of dollars on their service, and deliver tens of millions of emails annually through them. Yet SendGrid’s team made no attempt to contact us about the suspicious link their system noticed before disabling email delivery wholesale. They then took over two hours to resolve the matter, with poor communication along the way.
Make no mistake, we are deeply concerned by SendGrid’s handling of this situation. Our team has already begun evaluating alternative email sending partners, and we’re still communicating with SendGrid’s team to better understand how this happened and how their processes can be improved to avoid future situations like this.
In the coming weeks, we intend either to build confidence that this will never happen again on SendGrid’s service or to terminate our long-standing relationship with SendGrid. We’re also investigating ways to further separate emails, like routing emails sent by free/trial communities separately from emails sent by paying communities, which may reduce the impact of this type of situation should it happen again.
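To make the separation idea concrete, here is a minimal Python sketch of routing by community plan. It is an illustration only: the names `Plan`, `SenderAccount`, and `pick_account`, and the two-account layout, are hypothetical and are not taken from any existing implementation or from SendGrid’s API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Plan(Enum):
    """Hypothetical community plan types."""
    FREE = "free"
    TRIAL = "trial"
    PAID = "paid"


@dataclass(frozen=True)
class SenderAccount:
    """Hypothetical stand-in for one sending account at an email provider."""
    name: str
    suspended: bool = False


def pick_account(
    plan: Plan,
    paid_account: SenderAccount,
    unpaid_account: SenderAccount,
) -> Optional[SenderAccount]:
    """Choose the sending account for a community's plan.

    Paying communities use one account; free and trial communities share
    another. Returns None when the chosen account is suspended, meaning
    the message should be held rather than sent.
    """
    account = paid_account if plan is Plan.PAID else unpaid_account
    return None if account.suspended else account


# Example: the free/trial account is suspended, the paid one is not.
paid = SenderAccount("paid-communities")
unpaid = SenderAccount("free-and-trial-communities", suspended=True)

for plan in Plan:
    chosen = pick_account(plan, paid, unpaid)
    print(plan.value, "->", chosen.name if chosen else "hold")
```

With this kind of split, a suspension triggered by an email from a free or trial community would hold only that group’s messages, while emails from paying communities would keep flowing through their own account.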