Opened 7 years ago

Closed 6 years ago

#54247 closed defect (fixed)

buildbot buildmaster is sometimes very slow to respond

Reported by: ryandesign (Ryan Carsten Schmidt)
Owned by: admin@…
Priority: Normal
Milestone:
Component: server/hosting
Version:
Keywords:
Cc:
Port:

Description

The buildbot buildmaster is sometimes extremely slow to respond. We use nginx as a reverse proxy in front of the buildmaster, and buildmaster's slowness sometimes exceeds nginx's default proxy timeouts, causing nginx to respond with a 502 Bad Gateway or 504 Gateway Time-out error. Not only does this cause developers to be unable to use the buildbot web interface at times, it also prevents GitHub from successfully delivering push notifications, so some ports might not be getting built.

Not sure whether the notifications aren't getting to the buildmaster, or whether the buildmaster just isn't responding in time for GitHub to consider it a success, but I suspect the latter.

I'm not sure whether the culprit is proxy_connect_timeout, proxy_read_timeout, proxy_send_timeout, or a combination of these, but I've increased them all, so we shouldn't see any more 502 or 504 errors; requests should complete if given enough time. This hasn't helped GitHub deliver its push notifications, though: GitHub seems to have its own built-in timeout for sending them.

I've gone through the entire history of GitHub push notifications up to now and redelivered those that failed, so any ports that previously weren't built because of this should now be built.

I suspect the cause is that the RAID the buildmaster is on is too slow. It seems quick enough when tested with Blackmagic Disk Speed Test, but I have a feeling that the buildmaster workload plus the periodic syncing via rsync makes random disk accesses slow. When the web interface is responding slowly, accessing the server over ssh also feels sluggish.

The server does have an unused Apple SSD. I could try moving the buildmaster folder to the SSD; that would be fairly easy to do and would let us know whether disk speed is a factor. Another option is to move the entire OS and everything except the rsync directory to the SSD. That's more difficult to do and will involve more downtime, and the reason I hadn't done so initially is that I wanted the assurance of a RAID for our critical infrastructure. Some parts of macOS Server, such as the Caching Server, also seem particularly unhappy about storing their data on a disk that is not the startup volume, and default back to storing their data on the startup volume after a restart. But the caching service uses less disk space than I had anticipated so there should be room to store that data on the SSD.

Change History (5)

comment:1 Changed 7 years ago by neverpanic (Clemens Lang)

Note that even though GitHub marks some delivered webhooks as failed, they may actually have been processed. That's what we've seen with Trac, for the most part.

In the long run, we should probably have a very simple caching service that accepts the GitHub webhooks, writes them to a database, and makes sure they get successfully delivered to the buildmaster (or Trac) with larger timeouts. That would also allow us to restart those services without losing webhook events.
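As a rough illustration of that idea, here is a minimal store-and-forward sketch in Python. It is not an existing MacPorts service: the `WebhookQueue` class, the `events` table, the target names, and the `http_deliver` helper are all hypothetical, and SQLite is only assumed as the database. Each incoming payload is written to disk first; a separate `flush` pass then retries delivery with a generous timeout and keeps anything that fails.

```python
import sqlite3
import time
import urllib.request
from typing import Callable


class WebhookQueue:
    """Persist webhook payloads and retry delivery until it succeeds (sketch)."""

    def __init__(self, db_path: str = ":memory:") -> None:
        # Hypothetical schema; a real deployment would use a file path,
        # not an in-memory database, so events survive a restart.
        self.db = sqlite3.connect(db_path)
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS events ("
            " id INTEGER PRIMARY KEY AUTOINCREMENT,"
            " target TEXT NOT NULL,"
            " payload BLOB NOT NULL,"
            " received_at REAL NOT NULL,"
            " delivered_at REAL)"
        )
        self.db.commit()

    def accept(self, target: str, payload: bytes) -> int:
        """Store an event immediately so the sender can be answered quickly."""
        cur = self.db.execute(
            "INSERT INTO events (target, payload, received_at) VALUES (?, ?, ?)",
            (target, payload, time.time()),
        )
        self.db.commit()
        return cur.lastrowid

    def pending(self) -> list:
        """Return undelivered events, oldest first."""
        return self.db.execute(
            "SELECT id, target, payload FROM events"
            " WHERE delivered_at IS NULL ORDER BY id"
        ).fetchall()

    def flush(self, deliver: Callable[[str, bytes], bool]) -> int:
        """Try to deliver every pending event; return how many succeeded.

        After the first failure for a target, later events for that target
        are left for the next pass so that per-target ordering is preserved.
        """
        delivered = 0
        failed_targets = set()
        for event_id, target, payload in self.pending():
            if target in failed_targets:
                continue
            try:
                ok = deliver(target, payload)
            except Exception:
                ok = False  # treat any error (timeout, HTTP error) as a retry
            if not ok:
                failed_targets.add(target)
                continue
            self.db.execute(
                "UPDATE events SET delivered_at = ? WHERE id = ?",
                (time.time(), event_id),
            )
            self.db.commit()
            delivered += 1
        return delivered


def http_deliver(url: str, payload: bytes, timeout: float = 300.0) -> bool:
    """Hypothetical delivery function: POST the payload with a long timeout."""
    request = urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return 200 <= response.status < 300
```

In this sketch the front end would call `accept()` and answer GitHub right away, while a background loop calls `flush(http_deliver)` periodically. A real service would also need to store and forward the original request headers, which are omitted here.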

comment:2 in reply to: description; Changed 7 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign:

The server does have an unused Apple SSD. I could try moving the buildmaster folder to the SSD; that would be fairly easy to do and would let us know whether disk speed is a factor. Another option is to move the entire OS and everything except the rsync directory to the SSD. That's more difficult to do and will involve more downtime, and the reason I hadn't done so initially is that I wanted the assurance of a RAID for our critical infrastructure.

After running a disk speed test while the server was idle, and again while the server was doing nothing other than upgrading ports using binaries, I'm convinced the main problem is a drastic drop in disk speed during concurrent disk access. I'll be moving everything except the rsync data to the SSD, which should make most tasks on the server much faster. I need to rewrite the launchd plists and configuration files to anticipate the path to the rsync data changing.
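The idle-versus-busy comparison described above can be approximated with a small script. The following Python sketch is not the tool used in this ticket (that was Blackmagic Disk Speed Test); `make_test_file` and `random_read_latency` are hypothetical helpers that time small reads at random offsets on the volume under test. The operating system's page cache can hide real disk latency, so the test file would need to be much larger than RAM, or the cache otherwise defeated, for the numbers to be meaningful.

```python
import os
import random
import tempfile
import time


def make_test_file(directory: str, size_bytes: int = 8 * 1024 * 1024) -> str:
    """Create a file of random bytes on the volume that holds `directory`."""
    fd, path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as handle:
        handle.write(os.urandom(size_bytes))
        handle.flush()
        os.fsync(handle.fileno())  # make sure the data has reached the disk
    return path


def random_read_latency(path: str, block_size: int = 4096,
                        reads: int = 200, seed: int = 0) -> dict:
    """Time `reads` block-sized reads at random offsets; return seconds."""
    blocks = max(1, os.path.getsize(path) // block_size)
    rng = random.Random(seed)
    samples = []
    # buffering=0 avoids Python-level buffering; the OS cache still applies.
    with open(path, "rb", buffering=0) as handle:
        for _ in range(reads):
            offset = rng.randrange(blocks) * block_size
            start = time.perf_counter()
            handle.seek(offset)
            handle.read(block_size)
            samples.append(time.perf_counter() - start)
    samples.sort()
    p95_index = min(len(samples) - 1, int(len(samples) * 0.95))
    return {
        "median": samples[len(samples) // 2],
        "p95": samples[p95_index],
        "max": samples[-1],
    }
```

Running `random_read_latency` once while the server is idle and once while ports are being upgraded, against a test file on the RAID and then on the SSD, would give a rough sense of how much concurrent access degrades random reads on each volume.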

Some parts of macOS Server, such as the Caching Server, also seem particularly unhappy about storing their data on a disk that is not the startup volume, and default back to storing their data on the startup volume after a restart. But the caching service uses less disk space than I had anticipated so there should be room to store that data on the SSD.

Caching Server is no longer part of Server now and is part of High Sierra, so I'll be moving that service from the buildmaster to a VM running High Sierra.

comment:3 in reply to: 2; Changed 7 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign:

Caching Server is no longer part of Server now and is part of High Sierra, so I'll be moving that service from the buildmaster to a VM running High Sierra.

Caching Server cannot be used when High Sierra is running as a VM. Thanks, Apple.

comment:4 in reply to: 2; Changed 6 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign:

I'll be moving everything except the rsync data to the SSD, which should make most tasks on the server much faster. I need to rewrite the launchd plists and configuration files to anticipate the path to the rsync data changing.

This is done. The rsync data and mprsyncup tmp directories remain on the RAID; the buildbot data, the OS, and everything else are now on the SSD.

comment:5 Changed 6 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

This does seem to be a lot faster, and is good enough for now.
