Opened 9 years ago

Closed 8 years ago

#47976 closed defect (fixed)

Snow Leopard buildbot builder has been offline since May 18

Reported by: ryandesign (Ryan Carsten Schmidt)
Owned by: admin@…
Priority: High
Milestone:
Component: server/hosting
Version:
Keywords:
Cc: mojca (Mojca Miklavec), Veence (Vincent), ctreleaven (Craig Treleaven), bgilbert (Benjamin Gilbert), dliessi (Davide Liessi), danielluke (Daniel J. Luke), mkae (Marko Käning), basil.nikityuk@…, neverpanic (Clemens Lang), dbevans (David B. Evans)
Port:

Description

This was originally reported to the Mac OS Forge admin email address, but some emails between Keith and myself may be getting lost, so I want to report the issue here where we can hopefully track and resolve it successfully.


Mon May 18 02:04:51 PDT 2015: Ports build 35778 (trying to rebuild dmapd after r136473) failed with an exception because the builder went offline.

Mon May 18 16:53:38 PDT 2015: The builder has not been attempting any builds since the exception was noted. I reported this to Mac OS Forge.

Mon May 18 21:10:20 PDT 2015: Keith restarted the server, after which the builder resumed trying to do builds. Ports build 35779 failed because it could not write the portlist because the SQLite registry database was corrupted. All subsequent builds to date have failed for the same reason.

Tue May 19 09:30:46 PDT 2015: Keith responded that the server had been displaying a dialog box asking to be restarted, and that he had restarted it.


Tue May 19 17:14:44 PDT 2015: Ben Gilbert reported the problem on macports-dev.

Tue May 19 20:16:54 PDT 2015: I responded to Ben letting him know the above, Cc'ing Mac OS Forge. I wondered if the server had run out of disk space. I wondered if running a port command with sudo—such as sudo port installed—would repair the SQLite database corruption; I asked Keith to try it. I asked Keith to clarify the restart dialog box he had seen.

Thu May 21 14:53:36 PDT 2015: I reminded Mac OS Forge, forwarding my message from May 19. I added that SQLite can repair some corruption if it has write access to the database, which it would have with sudo, whereas generating the port list happens without sudo.


Sat May 23 13:32:55 PDT 2015: Dave Evans reported the problem on macports-dev.

Sat May 23 15:58:24 PDT 2015: I reminded Mac OS Forge, forwarding my message from May 21.


Wed May 27 23:03:22 PDT 2015: I reminded Mac OS Forge in a new email.

Thu May 28 13:56:04 PDT 2015: Keith asked if there was still a problem after restarting the server.

Thu May 28 14:31:28 PDT 2015: I said yes, a problem still exists, forwarding my message from May 23.


At this time I still want to know:

  • What was the nature of the dialog box the server was displaying about needing a restart? Was it a kernel panic or some other situation?
  • Did the server run out of disk space? If so, can we get 5-10GB additional disk space?
  • Does running sudo port installed fix the problem? If not, we need to come up with another solution.
  • Were my forwarded and Cc'd emails to Mac OS Forge perhaps not delivered, or discarded by a spam filter? If so, can we fix this so that communication to Mac OS Forge can occur unimpeded?

Thanks.

Change History (37)

comment:1 Changed 9 years ago by mojca (Mojca Miklavec)

Cc: mojca@… added

Cc Me!

comment:2 Changed 9 years ago by ctreleaven (Craig Treleaven)

Cc: ctreleaven@… added

Cc Me!

comment:3 Changed 9 years ago by bgilbert (Benjamin Gilbert)

Cc: bgilbert@… added

Cc Me!

comment:4 Changed 9 years ago by dliessi (Davide Liessi)

Cc: davide.liessi@… added

Cc Me!

comment:5 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Keith, could you please let us know your status regarding this issue? Thanks.

comment:6 Changed 9 years ago by Veence (Vincent)

Now, everything seems to be completely stuck.

comment:7 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Cc: vince@… added

Vince: that's an unrelated issue that needs to be addressed separately; see #48025.

comment:8 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

To update everyone Cc'd on this ticket: Keith and Joshua and I have been communicating by email, and the problem is that the people who had previously been tasked with maintaining the MacPorts infrastructure at Mac OS Forge have all left, and Keith is still trying to get up to speed while also handling other responsibilities. We will try to provide guidance as needed and hopefully can make some progress on this issue soon.

comment:9 Changed 9 years ago by danielluke (Daniel J. Luke)

Cc: dluke@… added

Cc Me!

comment:10 Changed 9 years ago by mkae (Marko Käning)

Cc: mk@… added

Cc Me!

comment:11 Changed 9 years ago by mkae (Marko Käning)

This buildbot seems to be fine by now, no!?

comment:12 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

No; the problem remains as described. Check out the log of writing the portlist for the latest build.

Last edited 9 years ago by ryandesign (Ryan Carsten Schmidt)

comment:13 Changed 9 years ago by mkae (Marko Käning)

Oh, ok, "malformed database disk image" sounds bad. "Idle" doesn't mean anything good then. Got it. :)

comment:14 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Keith: we are still waiting on this issue. Can you please let us know what happens when you run

sudo /opt/local/bin/port -d installed

on the Snow Leopard builder? (I'm adding -d to the command to display all debugging info.)
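
For anyone who wants to look at the registry database without going through port, a read-only integrity check is one possible approach. The following is only a sketch in Python and was not run on the builder: the function name check_sqlite_integrity is made up for illustration, and the registry path mentioned below assumes a default /opt/local prefix. It only reports corruption; it does not attempt any repair.

```python
import sqlite3
from pathlib import Path


def check_sqlite_integrity(db_path):
    """Return the messages produced by PRAGMA integrity_check.

    A healthy database yields ["ok"]. Any failure to open or read the
    file is returned as a single "error: ..." entry instead of raising.
    """
    # Open read-only through a file: URI so the check cannot modify the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    try:
        conn = sqlite3.connect(uri, uri=True)
    except sqlite3.Error as exc:
        return [f"error: {exc}"]
    try:
        rows = conn.execute("PRAGMA integrity_check").fetchall()
        return [row[0] for row in rows]
    except sqlite3.DatabaseError as exc:
        # Raised, for example, when the file is not a valid SQLite database.
        return [f"error: {exc}"]
    finally:
        conn.close()
```

Assuming the default layout, calling check_sqlite_integrity("/opt/local/var/macports/registry/registry.db") as a user who can read that file would list whatever problems SQLite detects, or a single "ok" entry if it finds none.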

comment:15 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Priority: Normal → High

Keith: could you please update us on this issue?

comment:16 Changed 9 years ago by basil.nikityuk@…

Cc: basil.nikityuk@… added

Cc Me!

comment:17 Changed 9 years ago by neverpanic (Clemens Lang)

Cc: cal@… added

Any updates on this? Mac OS Forge Admins?

comment:18 Changed 9 years ago by dbevans (David B. Evans)

Cc: devans@… added

Cc Me!

comment:19 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Keith, this service has now been offline for 58 days. Could you please respond? We are still waiting for you to connect to the server tensix-slave.macports.org, run the command:

sudo /opt/local/bin/port -d installed

and tell us what happens. Thanks.

comment:20 Changed 9 years ago by basil.nikityuk@…

Hi, please answer: what is the status of this ticket?

comment:21 in reply to:  description Changed 9 years ago by keith_dart@…

Replying to ryandesign@…:

This was originally reported to the Mac OS Forge admin email address, but some emails between Keith and myself may be getting lost, so I want to report the issue here where we can hopefully track and resolve it successfully.

I just restarted it again. I see that the buildbot is running and that there is about 23 GB of free space. I don't see any obvious problem. It does seem very slow, however. That may be a hypervisor host issue. Try it now and we'll see if that works.

comment:22 in reply to:  21 Changed 9 years ago by dbevans (David B. Evans)

Replying to keith_dart@…:

Replying to ryandesign@…:

This was originally reported to the Mac OS Forge admin email address, but some emails between Keith and myself may be getting lost, so I want to report the issue here where we can hopefully track and resolve it successfully.

I just restarted it again. I see that the buildbot is running and that there is about 23 GB of free space. I don't see any obvious problem. It does seem very slow, however. That may be a hypervisor host issue. Try it now and we'll see if that works.

The database issue remains. While the waterfall display indicates success, each build is actually failing during sync with nothing productive thereafter.

See https://build.macports.org/builders/buildports-snowleopard-x86_64/builds/37549/steps/sync/logs/stdio.

comment:23 in reply to:  21 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to keith_dart@…:

I just restarted it again. I see that the buildbot is running and that there is about 23 GB of free space. I don't see any obvious problem. It does seem very slow, however. That may be a hypervisor host issue. Try it now and we'll see if that works.

Restarting the server won't fix the corruption of the MacPorts registry. To try to fix that, as I said above, I'd like you to run this command:

sudo /opt/local/bin/port -d installed

comment:24 Changed 9 years ago by basil.nikityuk@…

Hi, any progress on this? The service has been offline for about five months. Any admins? Seriously, guys, can the database corruption not be fixed? Or is there filesystem degradation or a hardware malfunction?

comment:25 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Josh suggested that what we probably need to do is restore the tensix-slave virtual machine to a snapshot taken before the MacPorts registry corruption began. The corruption began on May 18, 2015, so we would need a snapshot from before that date.

After getting the VM back up and running with the old snapshot, its disk size should be increased so that the corruption doesn't happen again.

comment:26 Changed 9 years ago by mojca (Mojca Miklavec)

Does the same hold for Lion?

comment:27 in reply to:  26 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to mojca@…:

Does the same hold for Lion?

Yes, but that's being tracked separately in #48486.

comment:28 Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

The Snow Leopard builder has been taken offline while Keith and I work on this.

comment:29 in reply to:  description Changed 9 years ago by ryandesign (Ryan Carsten Schmidt)

As we did for the Lion builder in #48486, we removed the software directory and the corrupted registry.db.

Replying to ryandesign@…:

Mon May 18 02:04:51 PDT 2015: Ports build 35778 (trying to rebuild dmapd after r136473) failed with an exception because the builder went offline.

Noting that the last attempted build was for dmapd, and that all subsequent builds failed before trying to install anything, all the now-unregistered files on the Snow Leopard builder should be for dmapd's dependencies, so we force-installed dmapd. While this runs, we are taking a break and will resume later.

When we resume, we should be able to just "sudo port deactivate active", then remove the .mp_* files, then, assuming we see no other unregistered files, turn the buildslave on again.

comment:30 Changed 8 years ago by ctreleaven (Craig Treleaven)

Update?

comment:31 Changed 8 years ago by mojca (Mojca Miklavec)

Snow Leopard is no longer an exception as (almost) all the buildbots are down by now (#49483).

comment:32 in reply to:  29 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign@…:

As we did for the Lion builder in #48486, we removed the software directory and the corrupted registry.db.

Replying to ryandesign@…:

Mon May 18 02:04:51 PDT 2015: Ports build 35778 (trying to rebuild dmapd after r136473) failed with an exception because the builder went offline.

Noting that the last attempted build was for dmapd, and that all subsequent builds failed before trying to install anything, all the now-unregistered files on the Snow Leopard builder should be for dmapd's dependencies, so we force-installed dmapd. While this runs, we are taking a break and will resume later.

When we resume, we should be able to just "sudo port deactivate active", then remove the .mp_* files, then, assuming we see no other unregistered files, turn the buildslave on again.

This failed because we had used a working copy of dports that was current at the time of our attempt on October 15. I backdated the working copy to r136473 to match the last failed build. I then force-installed dmapd, and when that succeeded but still didn't seem to eliminate all the unregistered files, I force-installed rdepof:dmapd as well, to also install the build dependencies that hadn't been installed before. After deactivating all ports and deleting the .mp_* files, this seemed to leave a fairly clean prefix, so I re-enabled the builders.

The base builder didn't have much to do and was quickly done, finding a build problem which I reported in #49753. The ports builder is working through the forced builds and committed revisions from October 15 onward. This will probably take days to complete. Once it does, I'll start a build for all ports, in order to build those ports changed between May 18 and October 15, and also to reupload any older binary packages that were inadvertently purged from the packages server some weeks ago.
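
The step of looking for leftover .mp_* files during the cleanup described above could be sketched roughly as follows. This is an illustration in Python, not the procedure actually used on the builder: the function name find_mp_backups is hypothetical, and the "*.mp_*" pattern is an assumption taken from the ".mp_*" wording in this ticket. The sketch only lists matching files and deletes nothing.

```python
import fnmatch
import os


def find_mp_backups(prefix):
    """Return a sorted list of files under prefix whose names match *.mp_*.

    Assumption: files moved aside by a forced install or activation carry
    a ".mp_" marker in their file name, as the ticket's ".mp_*" suggests.
    """
    matches = []
    for dirpath, _dirnames, filenames in os.walk(prefix):
        for name in filenames:
            if fnmatch.fnmatch(name, "*.mp_*"):
                matches.append(os.path.join(dirpath, name))
    return sorted(matches)
```

Running find_mp_backups("/opt/local") (assuming the default prefix) and reviewing the result before removing anything would keep the deletion step a deliberate, separate action.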

comment:33 Changed 8 years ago by dbevans (David B. Evans)

It appears that when there are a number of forced builds queued, the builder only builds the most recent one and skips the rest. So in this case it only did a forced build of p5-path-tiny which I submitted today and the rest are gone. Not a big problem since the build all should pick up what was skipped.

Thanks for all your efforts in getting this buildbot back up and working!! It will help debug several outstanding 10.6 issues that I (and others I'm sure) have had a hard time diagnosing otherwise.

comment:34 in reply to:  33 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to devans@…:

It appears that when there are a number of forced builds queued, the builder only builds the most recent one and skips the rest. So in this case it only did a forced build of p5-path-tiny which I submitted today and the rest are gone. Not a big problem since the build all should pick up what was skipped.

Oh great. Well the previous builds are looking normal, so I've started the build of all ports.

comment:35 Changed 8 years ago by mojca (Mojca Miklavec)

I guess this is solved now?

comment:36 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

I still want a complete build of all ports. The previous attempt ended prematurely, possibly due to this issue.

comment:37 Changed 8 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

But that's a separate issue that deserves its own ticket: #49854
