Opened 4 years ago

Closed 4 years ago

#60112 closed defect (fixed)

10.13 buildbot worker is down

Reported by: ryandesign (Ryan Carsten Schmidt)
Owned by: admin@…
Priority: Normal
Milestone:
Component: server/hosting
Version:
Keywords:
Cc:
Port:

Description (last modified by ryandesign (Ryan Carsten Schmidt))

One of our VMware hosts has suffered an SSD failure, as a result of which the 10.6-i386, 10.8, 10.11 and 10.13 workers are down and will need to be restored from backups. There may be separate issues involved in bringing each of these back online so I'll make separate tickets. I plan to restore them to spare hard disks temporarily until I can get a new SSD.

Change History (7)

comment:1 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Description: modified

comment:2 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Component: ports → server/hosting
Owner: set to admin@…

The OS has been reinstalled using an old High Sierra installer I had on hand. Upon selecting the Time Machine backup to restore it, it says the OS should be updated first, so it's doing that now.

comment:3 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Oddly, after restoring from the backup, the VM won't boot. It doesn't get past the Apple logo with the progress bar, which fills in very slowly but then never disappears.

comment:4 Changed 4 years ago by fhgwright (Fred Wright)

I sometimes see that problem here (VMware Fusion 8.5.10, OSX 10.9.5, Mac Pro 5,1), particularly with 10.13. Once it's been frozen long enough that I'm convinced that it's really stuck and not just slow, I do a restart from the VMware menu, and it usually comes right up. It seems to be less prone to happening if I avoid booting multiple VMs simultaneously (i.e., stagger the startups across the VMs).

BTW, one should never rely on SSDs for primary storage. Not only is flash storage inherently less reliable than magnetic storage, but flash reliability is one of the few technological parameters that's actually been getting *worse* with "advances", rather than better. And a system whose primary purpose is to run builds is probably going to stress the flash write limits a lot.

comment:5 in reply to: 4 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to fhgwright:

> I sometimes see that problem here (VMware Fusion 8.5.10, OSX 10.9.5, Mac Pro 5,1), particularly with 10.13. Once it's been frozen long enough that I'm convinced that it's really stuck and not just slow, I do a restart from the VMware menu, and it usually comes right up. It seems to be less prone to happening if I avoid booting multiple VMs simultaneously (i.e., stagger the startups across the VMs).

This does not seem to be what's happening here. We've never had any intermittent problems booting this or the other VMs when the disk was working. And now, following the restoration from backup, it is also not intermittently failing to boot; it is always failing to boot. We use VMware ESXi and it does stagger VM startups, but that's also not applicable here as the problem happens even when booting only this single VM.

> BTW, one should never rely on SSDs for primary storage.

I'd like to keep this ticket specific to any issues related to getting the 10.13 worker back online. Your observation is a more general critique of our infrastructure and I'd like to move that to its own ticket; I filed #60178.

comment:6 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

I downloaded a fresh copy of Install macOS High Sierra on a machine running Mojave, because downloading the High Sierra installer on a machine running High Sierra does not give you the full installer. I made a new iso file from it and installed that to a new VM. The installation went more smoothly and the restoration from Time Machine succeeded, to the point that the launchd plist for the buildbot workers activated and a build started while installation was finishing up. But then the VM restarted and wouldn't finish starting up, as before.
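For anyone repeating this, here is a rough sketch of one common way to turn the installer app into an iso file. It is not necessarily the exact procedure used for this worker. The helper name `iso_build_commands`, the `/tmp` output directory, the `HighSierra` base name, the `8g` image size and the `installer_build` mount point are all assumptions made for illustration. It also assumes that `createinstallmedia` renames the target volume after the installer app. The helper only builds the `hdiutil` and `createinstallmedia` command lines so they can be reviewed before being run by hand on a Mac.

```python
import shlex
from pathlib import PurePosixPath


def iso_build_commands(installer_app, out_dir="/tmp", name="HighSierra", size="8g"):
    """Return the commands (as argument lists) for building an ISO from an installer app.

    All defaults are illustrative assumptions, not values taken from the ticket.
    """
    app = PurePosixPath(installer_app)
    out = PurePosixPath(out_dir)
    dmg = out / f"{name}.dmg"  # hdiutil create appends .dmg to the -o path
    build_volume = PurePosixPath("/Volumes/installer_build")  # hypothetical mount point
    # Assumption: createinstallmedia renames the volume after the installer app,
    # e.g. "Install macOS High Sierra".
    final_volume = PurePosixPath("/Volumes") / app.stem
    return [
        # 1. Create an empty disk image large enough to hold the installer.
        ["hdiutil", "create", "-o", str(out / name), "-size", size,
         "-layout", "SPUD", "-fs", "HFS+J"],
        # 2. Mount the empty image at a known mount point.
        ["hdiutil", "attach", str(dmg), "-noverify", "-mountpoint", str(build_volume)],
        # 3. Write the installer onto the mounted volume.
        ["sudo", str(app / "Contents/Resources/createinstallmedia"),
         "--volume", str(build_volume), "--nointeraction"],
        # 4. Unmount the (now renamed) installer volume.
        ["hdiutil", "detach", str(final_volume)],
        # 5. Convert the image to a CD/DVD master, which hdiutil names .cdr.
        ["hdiutil", "convert", str(dmg), "-format", "UDTO", "-o", str(out / name)],
        # 6. Rename .cdr to .iso so the hypervisor recognises it.
        ["mv", str(out / f"{name}.cdr"), str(out / f"{name}.iso")],
    ]


if __name__ == "__main__":
    # Print the commands for review instead of executing them.
    for cmd in iso_build_commands("/Applications/Install macOS High Sierra.app"):
        print(shlex.join(cmd))
```

Printing the commands rather than running them keeps the sketch safe to try on any machine. The image size may need adjusting to fit the installer.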

Booting in Safe Mode, however, worked, and showed a message that migration was complete. Xcode would not install its additional components, saying "The package 'MobileDeviceDevelopment.pkg' is untrusted", similar to what happened on the El Capitan worker due to the certificate that expired in October 2019. I will download a fresh copy of Xcode 9.4.1 which should have packages signed with newer certificates.

comment:7 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

This worker is back online now. It'll take several days to work through the backlog of pending builds.

It's on a hard disk for now, which will likely be a bit slower. I'll try to buy a new SSD soon to get it back to the usual speed.
