Opened 3 years ago

Closed 3 years ago

#62257 closed defect (fixed)

10.15, 10.9, 10.6 x86_64 buildbot workers are offline

Reported by: gorticus (Jason Mitchell)
Owned by: admin@…
Priority: Normal
Milestone:
Component: server/hosting
Version:
Keywords:
Cc: chrstphrchvz (Christopher Chavez), Dave-Allured (Dave Allured), FranklinYu (Franklin Yu)
Port:

Description

The 10.15 base and ports builders and watcher appear to be offline; see https://build.macports.org/builders

From https://build.macports.org/buildslaves,

  • base last heard from: about 7 days ago (2021-Feb-03 04:25:12)
  • ports last heard from: about 7 days ago (2021-Feb-03 03:34:46)
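As a rough illustration of how the "last heard from" timestamps above translate into an offline check, here is a minimal Python sketch. The function names, the one-day threshold, and the reference time passed as `now` are assumptions made for illustration; none of this is part of buildbot or the MacPorts infrastructure, and time zones are ignored.

```python
from datetime import datetime, timedelta

# Timestamp layout used in the "last heard from" lines quoted above,
# e.g. "2021-Feb-03 04:25:12". %b assumes English month abbreviations.
STAMP_FORMAT = "%Y-%b-%d %H:%M:%S"


def silence_duration(last_heard: str, now: datetime) -> timedelta:
    """Return how long a worker has been silent as of `now`."""
    return now - datetime.strptime(last_heard, STAMP_FORMAT)


def is_stale(last_heard: str, now: datetime,
             threshold: timedelta = timedelta(days=1)) -> bool:
    """Flag a worker whose silence exceeds `threshold` (hypothetical default)."""
    return silence_duration(last_heard, now) > threshold
```

Calling `is_stale("2021-Feb-03 04:25:12", now)` with a reference time a week later would flag the base worker as offline under these assumptions.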

Change History (12)

comment:1 Changed 3 years ago by chrstphrchvz (Christopher Chavez)

Cc: chrstphrchvz added

comment:2 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Summary: 10.15 buildbot worker is offline → 10.15, 10.9, 10.6 x86_64 buildbot workers are offline

I took the 10.15, 10.9 and 10.6 x86_64 workers offline after Josh reported to me that he observed build failures that indicated disk problems. I determined that the SSD that these workers are stored on has failed. A new SSD has been ordered and installed and I'm trying to copy the virtual disks from the old SSD to the new one, but am encountering I/O errors. Once I've exhausted the possibilities for saving the old disks, the remaining course of action would be to create new virtual disks on the new SSD, reinstall the OS, and restore from Time Machine backup. This is what we did with the other two hosts when their SSDs died last year but it has certain disadvantages so I'd like to save the existing virtual disks if at all possible.

comment:3 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

10.6 x86_64 is back up and working through the backlog of builds. CCC was able to clone most of the files. A couple dozen could not be read. One was a file installed in /opt/bblocal by a port; deactivating and reactivating the port brought the file back. Another was part of the git clone of base; I deleted the clone; it will be recreated automatically next time a base commit comes in. The remaining unreadable files were various MacPorts .tbz2 archives; I uninstalled those ports; they will be reinstalled automatically when needed.
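For anyone wanting to locate unreadable files on a failing disk in a similar situation, the following is a minimal Python sketch, not what CCC does internally and not the procedure used here. It walks a directory tree, reads every regular file to the end, and records the paths that raise an I/O error. The function name and chunk size are assumptions for illustration.

```python
import os
from typing import List


def find_unreadable(root: str, chunk_size: int = 1 << 20) -> List[str]:
    """Return a sorted list of regular files under `root` that fail to read."""
    unreadable = []
    # Note: os.walk skips directories it cannot list unless an onerror
    # callback is supplied, so unlistable directories are not reported here.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            try:
                with open(path, "rb") as fh:
                    # Read the whole file in chunks so that a bad block
                    # anywhere in the file surfaces as an OSError.
                    while fh.read(chunk_size):
                        pass
            except OSError:
                unreadable.append(path)
    return sorted(unreadable)
```

The resulting list could then be compared against what each port installs to decide which ports to deactivate and reactivate or uninstall, as described above.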

comment:4 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

10.9 is back up. CCC 4.1.23 could clone most of the files. Several MacPorts .tbz2 archives couldn't be read but unfortunately neither could the registry.db; without that, I had no choice but to delete /opt/local entirely. Buildbot will have to reinstall ports as needed. Some files within com.apple.IconServices cache directories could not be read; I deleted the entire com.apple.IconServices directories and the OS will recreate them. A couple of files within Xcode.app and one system font couldn't be read; I restored them from another 10.9 disk.

comment:5 Changed 3 years ago by l2dy (Zero King)

Is the 10.15 builder hard to recover? IMHO, the three most recent versions of macOS should be prioritized, because most users have upgraded to these and they still receive security updates from Apple.

comment:6 Changed 3 years ago by jmroot (Joshua Root)

Bear in mind that Ryan has not been able to work on this at all for days due to power issues. I'm sure we all would like to have the 10.15 builder running again as soon as possible.

comment:7 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Now that I have power again I can work on this again. Today I prioritized getting the buildbot system up and running again with everything except for the 10.15 worker. Maybe tomorrow I will work on getting 10.15 back online.

I worked on transferring the earlier OS versions first because reinstalling earlier OS versions is more difficult. (Installer certificates expire, making OS and Xcode installation more tedious.) If the old SSD failed completely at some point during the recovery process, I wanted to have the difficult-to-restore OS versions done via cloning and leave the easy-to-restore OS versions to be done by Time Machine. Fortunately the old SSD hasn't failed completely yet and is still letting me read most data so hopefully I can do the 10.15 restore by cloning as well.

I already had other VMs running 10.6 and 10.9 on network storage that I could boot to, run a contemporaneous version of CCC, and clone the disk from the failed SSD to the new one. I don't have an existing other VM for 10.15. I plan to make one, then do the cloning. Cloning from a different OS version might work but I'm not sure so I'd rather play it safe, and it would be good to have that other 10.15 VM anyway for other reasons.

comment:8 in reply to:  7 Changed 3 years ago by l2dy (Zero King)

Replying to ryandesign:

I worked on transferring the earlier OS versions first because reinstalling earlier OS versions is more difficult. (Installer certificates expire, making OS and Xcode installation more tedious.) If the old SSD failed completely at some point during the recovery process, I wanted to have the difficult-to-restore OS versions done via cloning and leave the easy-to-restore OS versions to be done by Time Machine. Fortunately the old SSD hasn't failed completely yet and is still letting me read most data so hopefully I can do the 10.15 restore by cloning as well.

Thanks for the explanation and your hard work recovering everything. Is it possible to slim down the VM images and make a backup of that? Some extra storage would save you from such tedious work. SSD failures would eventually happen again after all.

comment:9 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

I'm happy to discuss backup strategy but maybe let's do it on the infra list rather than in this unrelated ticket.

comment:10 Changed 3 years ago by Dave-Allured (Dave Allured)

Cc: Dave-Allured added

comment:11 Changed 3 years ago by FranklinYu (Franklin Yu)

Cc: FranklinYu added

comment:12 Changed 3 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

After the weather-related outage and getting everything else online again, we had an unrelated buildmaster outage which was more important to address first. Now that that's fixed and I've turned my attention back to getting the 10.15 worker online, I found that the SSD had failed completely by now; the server froze when trying to access it, and then wouldn't boot if it was installed. So I've restored 10.15 from Time Machine backups and it's working through the backlog of builds now.
