Not seeing the wood for the trees

04 May 2020

The way Flathub's infrastructure works is not complicated by current standards, but there are enough moving parts to make debugging transient issues tricky.

When a user starts a download, Flatpak connects to a CDN provided by Fastly. The CDN connects to one of two front servers: VPSes acting as caching load balancers/proxies in front of hub, the main server exposing the ostree repositories and publishing new builds with flat-manager. The builds themselves happen on Buildbot, another VPS. All HTTP servers are nginx. No magic involved; boring is an advantage for infrastructure.
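For debugging, it helps to know which of these layers answered a given request. A quick spot check like the one below does the job; the URL and the header names are assumptions about a fairly standard Fastly-fronted setup, not anything specific to Flathub:

```python
import requests

# Fetch a small file from the repo and print the headers that reveal which
# cache layer served it. The URL and header names are assumptions about a
# typical Fastly-fronted setup.
resp = requests.get("https://dl.flathub.org/repo/config", timeout=30)
print(resp.status_code)

for header in ("Via", "X-Served-By", "X-Cache", "X-Cache-Hits"):
    print(f"{header}: {resp.headers.get(header)}")
```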

One long-standing issue was random 503 Service Unavailable errors, which cause Flatpak to abort installation and update operations. At first glance, a slow HDD in the main repository server was the culprit, meaning that downloading an object not cached by Fastly or the front servers could have triggered it. It also affected the time needed to publish new builds, in extreme cases pushing it to over an hour. The load balancers were also misconfigured to cache error responses for 60 seconds, inflating the error rate. Eventually, the entire server was replaced thanks to a donation from Mythic Beasts in January this year.

A friend of mine, along with some users, used to complain about slow download speeds from Flathub. I had never seen it happen at home, as my Internet connection is slow, so I shrugged it off.

Flatpak downloads involve a lot of HTTP requests, since each file in the application maps to a file in the ostree repository. Server-side Keep-Alive helps, but there is still a performance penalty compared to downloading a single big file. Flatpak can mitigate it with static deltas, generated either between two revisions or from scratch for new installations.
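To get a feel for why this matters, here is a rough and unscientific comparison between opening a fresh connection for every request and reusing a single keep-alive connection; the URL is an assumption, any small file served by the repo will do:

```python
import time
import requests

# Rough comparison of per-request connections versus a reused keep-alive
# connection. The URL is an assumption; any small repo file works.
URL = "https://dl.flathub.org/repo/config"
N = 50

start = time.monotonic()
for _ in range(N):
    requests.get(URL, timeout=30)       # new TCP/TLS handshake every time
print("fresh connection per request:", time.monotonic() - start)

start = time.monotonic()
with requests.Session() as session:     # keep-alive: one connection, reused
    for _ in range(N):
        session.get(URL, timeout=30)
print("reused keep-alive connection: ", time.monotonic() - start)
```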

It turned out there was a bug causing Flatpak to ignore from-scratch deltas altogether. For example, installing LibreOffice involves making 5515 GET requests, versus only 128 with static delta support working properly. Who could have known!

At the beginning of April, multiple issues popped up around the same time. PagerDuty was poking me a few times a day about high CPU usage on the server hosting Buildbot. The Buildbot home page was not displaying recently finished builds at all. New commits, pull requests and manual build triggers were not reliably causing new builds to start. The number of 503 errors increased so much that I finally saw them myself.

Since our Buildbot was a few releases behind, it looked like a good idea to start there and hope it would fix all related problems. After the upgrade, I looked into the home page issue. The browser's network monitor revealed a request to the backend timing out. The same request executed directly on the server also timed out, yet everything worked fine in my local environment. A few print calls later it became apparent that the frontend had been requesting all 17000 builds that had ever run, which also explained the high CPU usage. Adding limit=50 to the request brought it down to 2-4 seconds. Even though the same request needed less than a second on the server itself, that sounded reasonable enough to consider it fixed.
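For the curious, the fix boils down to passing Buildbot's collection parameters in the REST request. The hostname and exact parameters below are assumptions for the sake of the sketch:

```python
import requests

# Buildbot's REST API accepts collection parameters such as limit and order,
# so the frontend only needs the most recent builds instead of the full
# history. The hostname is an assumption.
BASE = "https://buildbot.flathub.org/api/v2/builds"

# What the frontend was effectively asking for: every build ever run.
# requests.get(BASE, timeout=60)

# What it actually needs: the 50 most recent builds.
recent = requests.get(
    BASE, params={"limit": 50, "order": "-buildid"}, timeout=60
).json()["builds"]
print(len(recent), "builds returned")
```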

Any new comment, pull request, or code change causes GitHub to send an HTTP POST payload to the configured URL. Unfortunately, GitHub is overly optimistic about the reliability of the receiving endpoint and provides no way to query the delivery status or request a resubmission, other than manually inspecting it in a web browser and clicking the redeliver button next to each payload. I considered writing a middleware that would retry failed requests and expose statistics for monitoring and alerting, but it would be merely a workaround. As I could not find anything about failed submissions in the HTTP server logs on the front servers, I deployed nginx on the Buildbot server and reconfigured GitHub to submit events there instead. It helped.
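For the record, the middleware I had in mind would have been little more than a forwarding proxy with retries, something along these lines; the endpoint, port, backoff and counter are all made up for the sketch:

```python
import threading
import time

import requests
from flask import Flask, Response, request

app = Flask(__name__)

# Where Buildbot's GitHub change hook would listen; host, port and path are
# assumptions made for this sketch.
BUILDBOT_HOOK = "http://127.0.0.1:8010/change_hook/github"
failed_deliveries = 0  # trivial stand-in for "statistics for monitoring"


def deliver(payload, headers, retries=5):
    """Forward one webhook payload, retrying with backoff if Buildbot is down."""
    global failed_deliveries
    for attempt in range(retries):
        try:
            resp = requests.post(BUILDBOT_HOOK, data=payload, headers=headers, timeout=10)
            if resp.status_code < 500:
                return
        except requests.RequestException:
            pass
        time.sleep(2 ** attempt)
    failed_deliveries += 1


@app.route("/webhook", methods=["POST"])
def forward_webhook():
    # Accept the payload right away so GitHub never sees a failure, then
    # deliver it to Buildbot in the background.
    headers = {k: v for k, v in request.headers if k.startswith("X-")}
    headers["Content-Type"] = request.content_type
    threading.Thread(target=deliver, args=(request.get_data(), headers)).start()
    return Response(status=202)
```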

At this point, it was clear there was something wrong with either the CDN or the load balancers. However, there was no trace of 503 errors in the log files on the front servers. Fortunately, the log format used by Fastly can be configured to include the exact connection error by adding %{resp.response}V to the format string. The majority of errors turned out to be first byte timeouts, which is why nginx never logged them.
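With the response text in the logs, a few lines of Python are enough to tally the failure reasons. The snippet assumes a format string that ends with the status code and %{resp.response}V separated by a tab, which is not necessarily what Flathub uses:

```python
from collections import Counter

# Tally error reasons from the shipped Fastly logs, assuming a format string
# ending in '%>s\t%{resp.response}V' (status code, tab, response text).
reasons = Counter()
with open("fastly.log") as log:
    for line in log:
        prefix, _, reason = line.rstrip("\n").rpartition("\t")
        if prefix.endswith("503"):
            reasons[reason] += 1

for reason, count in reasons.most_common():
    print(f"{count:6d}  {reason}")
```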

When distributions started upgrading Flatpak to versions with functional static delta downloads, it became too much for our caching proxies and their spinning rust. The spike in traffic caused download errors and affected both webhook delivery from GitHub and the time needed to return recent builds to the frontend, as the front servers were the single entry point to all services. After exchanging a few messages with Mythic Beasts, they were replaced with SSD-backed machines, and the error rate dropped almost to zero.

Lessons learned: response time is an important metric to track and alert on, especially when 90% of what you do is serving files. It is not enough to monitor just the part behind the CDN, as the CDN itself may have trouble connecting. Finally, it is easy not to see the wood for the trees. Sometimes everything is connected, and it may be a better choice to find the common denominator rather than trying to fix each problem separately.
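As a small postscript, even a crude report over the nginx access logs would have surfaced the slowdown early. This sketch assumes log_format was extended so that $request_time is the last field on every line:

```python
import statistics

# Crude latency report from an nginx access log, assuming $request_time is
# the last field of every line.
times = []
with open("access.log") as log:
    for line in log:
        try:
            times.append(float(line.rsplit(maxsplit=1)[-1]))
        except (ValueError, IndexError):
            continue

if times:
    times.sort()
    p95 = times[int(len(times) * 0.95)]
    print(f"median={statistics.median(times):.3f}s  p95={p95:.3f}s  max={times[-1]:.3f}s")
```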