May 21, 2026 5:32:11 PM

Availability incident, May 2026 — what happened and what we changed

TL;DR

Two separate problems hit at the same time, and the combination is what produced the user-visible outages.

The first was on the search engine side. A regular Referee Finder query completes in around 0.3 seconds on our search cluster. It was possible to configure Referee Finder settings so that the request hit a very slow codepath deep inside the search engine and took 200+ seconds. A handful of those queries was enough to pin every node in the cluster at 100% CPU. And requests kept arriving! When a Referee Finder page fails with a timeout on the Cloudflare side, the user reloads the page to try again, which sends the same slow query once more. After a few such attempts we had stretches of up to 25 minutes of 100% CPU on the search cluster.

The second was on the web server side, and unrelated. A bug in our web server software caused worker containers to occasionally deadlock during a regular restart. During the same time window we had an unusual surge of bot traffic that kept triggering it.

These two problems amplified each other. While the search cluster was saturated, worker threads on our web servers piled up waiting for it. Subsequent requests to any page queued up and couldn't be handled; our infrastructure decided the web servers were unhealthy and started restarting them. Because of the web server bug, those restarts hit many containers at the same time. Users saw "no available server" responses, and Cloudflare served its "Bad gateway" page on top.

We had a cumulative 1.5-hour outage from the search engine misconfiguration, and 5-6 hours of intermittent errors because of the high bot load. That is within our SLA, but it is no longer the "better availability than AWS S3" record we had held for a couple of years before that.

What looked like one problem

The root cause was hard to pin down at first, because the early CPU spikes on the search cluster coincided with a very large increase in bot traffic. We hadn't seen traffic like that before, so we initially attributed the spikes to load. We tightened request limits and bot protection, and scaled up the number of web servers to handle what looked like additional demand.

That turned out to be only half-right. The bot traffic was not actually causing the search CPU spikes, as those came entirely from the Referee Finder query shape we describe below. But the bot traffic was real, and *was* driving a separate failure on the web server side; we just couldn't tell the two apart yet.

Catching the actual offender

In the search engine's own logs we found one attribute that consistently appeared in the slow requests. A hint, but not enough to act on. We added more logging on the application side: log the full body of every Referee Finder request with that attribute *before* sending it to the search engine. We already had after-request logging, but it never fired for the slow requests, because each of them was killed by a cluster restart before it could complete.
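The difference between logging before and after the call can be sketched in a few lines. This is a minimal illustration, not our production code: the attribute name `suspect_attribute`, the logger name, and the `send` callable are all placeholders.

```python
import json
import logging

logger = logging.getLogger("referee_finder.search")

# Placeholder name: the real attribute is not named in this post.
SUSPECT_ATTRIBUTE = "suspect_attribute"


def send_with_pre_logging(request_body, send):
    """Log the full request body BEFORE dispatching it to the search engine.

    `send` is any callable that performs the actual search request.
    Logging first means the body is recorded even if the request is
    later killed and the after-request logging never runs.
    """
    if SUSPECT_ATTRIBUTE in request_body:
        logger.warning(
            "pre-send search request: %s",
            json.dumps(request_body, sort_keys=True),
        )
    return send(request_body)
```

If `send` raises or never returns, the log line is already written, which is exactly the property the after-request logging lacked.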

The next day we had another couple of spikes, but this time it was straightforward. We correlated the new application logs with the 100% CPU windows and found the pattern: a rarely used author-ID field had not been configured for fast indexed lookup. When someone in Referee Finder settings selected "own works" together with a large author group — in this case, roughly 15,000 authors — the search engine had to scan almost every article in the corpus one by one to apply that filter.
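To show why the missing fast-lookup setting matters, here is a generic sketch of the two ways to apply an author filter. It does not describe the internals of our search engine; the data layout (a dict of article ID to author IDs) and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def filter_by_scan(articles, author_ids):
    """Slow path: visit every article and test its authors against the group."""
    wanted = set(author_ids)
    return {
        doc_id
        for doc_id, authors in articles.items()
        if wanted.intersection(authors)
    }


def build_author_index(articles):
    """Build an inverted index once: author ID -> set of article IDs."""
    index = defaultdict(set)
    for doc_id, authors in articles.items():
        for author in authors:
            index[author].add(doc_id)
    return index


def filter_by_index(index, author_ids):
    """Fast path: one lookup per author, never touching unrelated articles."""
    matches = set()
    for author in author_ids:
        matches |= index.get(author, set())
    return matches
```

The scan does work proportional to the size of the whole corpus on every query, while the indexed version does work proportional to the size of the author group and the articles that actually match.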

We audited every field for a similar problem, found one more case in grant search, and fixed both. Once the search cluster was healthy again, we re-ran the original problem query: 370 ms, down from over three minutes. No further spikes since.

What we found afterwards

With the search fix in place, we turned to the two other things we'd uncovered along the way.

First, the bot traffic was real and sustained. We now see more than 24 million requests per day on prophy.ai, and during spikes we were holding 1,000+ requests per second for an hour at a time, almost entirely from bots crawling article pages, author profiles, and search results.
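For a sense of scale, the figures above can be put side by side. The inputs are the lower bounds quoted in the text ("more than 24 million", "1,000+"), so the results are rough lower bounds too.

```python
SECONDS_PER_DAY = 24 * 60 * 60

daily_requests = 24_000_000  # "more than 24 million requests per day"
spike_rps = 1_000            # "1,000+ requests per second"
spike_seconds = 60 * 60      # "for an hour at a time"

average_rps = daily_requests / SECONDS_PER_DAY
spike_requests = spike_rps * spike_seconds
spike_share_of_day = spike_requests / daily_requests

print(f"average rate: {average_rps:.0f} rps")                   # average rate: 278 rps
print(f"spike vs average: {spike_rps / average_rps:.1f}x")      # spike vs average: 3.6x
print(f"one spike hour: {spike_share_of_day:.0%} of the day")   # one spike hour: 15% of the day
```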

Second, our web server containers were restarting in ways we couldn't fully explain: they were simply dying under high load with no clear reason.

Closing the public surface

There were no offending IPs to block, because the load was widely distributed, and blocking whole IP ranges would break the service for legitimate users. So we decided to close all pages to anonymous users. We previously kept article pages, limited author profiles, and limited search available without login, but the bot load made that unsustainable. Now everything except the landing pages requires a login.

We also made sure the login page works when served entirely from Cloudflare's cache, so we pay no server-side cost for it. Today more than 70% of requests to prophy.ai are served from the cache directly to bots.

Bots, of course, still remember the URLs of the pages they used to crawl, so we continue to see large volumes of requests hitting our origin and getting redirected to login pages. The traffic is lighter and doesn't hit the DB or the search engine, and redirects are cheap. But the rate was still enough to keep pressure on our web servers, and they continued to restart under bot spikes during the night. During our night! Our customers in New Zealand were just coming to work and were not happy to see intermittent outages.

Fixing the web server bug

We could have masked the symptom by adding more web servers and more workers, but we wanted to find the actual cause. We tracked it down to a bug in the web server version we were running: under sustained load, workers could deadlock during a routine restart. When that happened, the container could no longer serve requests. Our healthcheck marked it unhealthy within the standard 120-second timeout, and our infrastructure restarted it. At 1,000+ requests per second this hit many containers at pretty much the same time. Hence the "Bad gateway" pages!

We found the issue already filed in the upstream tracker, with a fix in a newer release. We reproduced the bug locally, then on staging under synthetic load at 500 rps, where the failure consistently showed up within 10 minutes. We applied the update, repeated the same synthetic load on staging, and the failure stopped. We shipped the fix to production, and watched it closely for 24 hours, including through a real overnight bot spike, before calling the fix done.

Since then we've had 26+ hours between deploys with zero spurious container restarts. Our status page agrees: response times have stayed below 250 ms for the past 24 hours, with none of the 3-second, 10-second, or timeout spikes we were seeing during the incident.

Jitter on worker restarts

We also addressed a related failure mode. Our web server workers were restarting on a fixed schedule, which is fine on its own, but when the workers are all processing very fast, very similar requests in round-robin order, they tend to reach the same "restart me now" condition at very nearly the same moment. Under millions of nearly identical requests, this turns into a synchronized restart across the whole worker pool.

We added jitter to the worker restart procedure, so workers no longer line up for the restart.
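The idea can be sketched as follows. This is an illustration under assumptions, not our actual configuration: it assumes the restart condition is a per-worker request count, and the function names and numbers are made up for the example.

```python
import random


def restart_threshold(base_requests, max_jitter, rng=random):
    """Return the request count after which one worker should restart.

    Each worker draws its own offset in [0, max_jitter], so workers that
    handle requests at the same pace no longer hit the limit together.
    """
    return base_requests + rng.randint(0, max_jitter)


def restart_thresholds(worker_count, base_requests, max_jitter, seed=None):
    """Draw an independent restart threshold for every worker in the pool."""
    rng = random.Random(seed)
    return [
        restart_threshold(base_requests, max_jitter, rng)
        for _ in range(worker_count)
    ]
```

With `max_jitter=0` every worker gets the same threshold, which is the synchronized behaviour described above. With a non-zero jitter the restarts spread out over a window, so only a fraction of the pool is restarting at any one moment.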

What we're doing differently going forward

Schema audit: we're reviewing every attribute referenced by IN(...) or CONTAINS search query clauses to make sure each one has the right fast-lookup setting. The check is now part of our schema-review checklist.
Web-server health monitoring: we've added alerts on request queue depth, spurious container restarts, and worker health, so we hear about this failure mode directly, instead of through Cloudflare's "Bad gateway" pages.
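
The schema check in the first item could be automated along these lines. This is a hypothetical sketch: the schema layout, the `fast_lookup` flag, and the field names are invented for illustration and do not reflect our search engine's real configuration format.

```python
FILTER_OPERATORS = {"IN", "CONTAINS"}


def audit_schema(schema, clauses):
    """Return the fields used in IN/CONTAINS clauses without fast lookup.

    schema:  dict of field name -> settings dict (hypothetical layout).
    clauses: iterable of (field, operator) pairs extracted from queries.
    A field missing from the schema is flagged as well.
    """
    flagged = {
        field
        for field, operator in clauses
        if operator.upper() in FILTER_OPERATORS
        and not schema.get(field, {}).get("fast_lookup", False)
    }
    return sorted(flagged)


# Example with made-up fields.
schema = {
    "author_id": {"fast_lookup": False},
    "journal_id": {"fast_lookup": True},
    "grant_id": {},
}
clauses = [
    ("author_id", "IN"),
    ("journal_id", "in"),
    ("grant_id", "CONTAINS"),
    ("title", "MATCH"),
]
print(audit_schema(schema, clauses))  # ['author_id', 'grant_id']
```

Running a check like this in CI or during schema review turns the checklist item into something that fails loudly instead of relying on someone remembering to look.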

Acknowledgements

We apologize for the disruption these incidents caused, particularly to time-sensitive Referee Finder workflows.