Opened 4 years ago

Closed 4 years ago

#60590 closed defect (fixed)

rsync.macports.org and other services at FAU are down

Reported by: herbygillot (Herby Gillot)
Owned by: admin@…
Priority: High
Milestone:
Component: server/hosting
Version:
Keywords:
Cc:
Port:

Description

It appears that MacPorts mirrors may be down. Seeing the following when trying to selfupdate:

$ sudo port -d selfupdate
DEBUG: Copying /Users/herby/Library/Preferences/com.apple.dt.Xcode.plist to /opt/local/var/macports/home/Library/Preferences
DEBUG: MacPorts sources location: /opt/local/var/macports/sources/rsync.macports.org/macports/release/tarballs
--->  Updating MacPorts base sources using rsync
DEBUG: system: /usr/bin/rsync -rtzvl --delete-after rsync://rsync.macports.org/macports/release/tarballs/base.tar /opt/local/var/macports/sources/rsync.macports.org/macports/release/tarballs
rsync: failed to connect to rsync.macports.org: Operation timed out (60)
rsync error: error in socket IO (code 10) at /AppleInternal/BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-54.120.1/rsync/clientserver.c(106) [receiver=2.6.9]
Command failed: /usr/bin/rsync -rtzvl --delete-after rsync://rsync.macports.org/macports/release/tarballs/base.tar /opt/local/var/macports/sources/rsync.macports.org/macports/release/tarballs
Exit code: 10

Exit code 10 for rsync means Error in socket I/O.
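For reference, here is a minimal Python sketch of how a wrapper script might turn rsync exit codes into readable messages. The table holds only a handful of codes from the current rsync man page (older builds such as 2.6.9 may not define all of them), and the name `describe_rsync_exit` is made up for illustration.

```python
# A few exit codes from the rsync(1) man page; this is not an exhaustive list.
RSYNC_EXIT_CODES = {
    0: "Success",
    1: "Syntax or usage error",
    5: "Error starting client-server protocol",
    10: "Error in socket I/O",
    12: "Error in rsync protocol data stream",
    23: "Partial transfer due to error",
    30: "Timeout in data send/receive",
    35: "Timeout waiting for daemon connection",
}


def describe_rsync_exit(code: int) -> str:
    """Return a human-readable description for an rsync exit code."""
    return RSYNC_EXIT_CODES.get(code, f"Unknown rsync exit code {code}")
```

Calling `describe_rsync_exit(10)` gives the "Error in socket I/O" message seen in the log above.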

Also, CI is failing immediately for PRs on GitHub due to failure to connect to the mirrors:

https://dev.azure.com/macports/macports-ports/_build/results?buildId=7308&view=logs&j=572c5e49-83d5-5271-390a-e6dc77f89c6b&t=0d6ffe6c-e69c-549e-b152-719f8b1b1603&l=19

rsync: failed to connect to rsync.macports.org: Operation timed out (60)
rsync error: error in socket IO (code 10) at /BuildRoot/Library/Caches/com.apple.xbs/Sources/rsync/rsync-52.200.1/rsync/clientserver.c(106) [receiver=2.6.9]

Change History (33)

comment:1 Changed 4 years ago by jmroot (Joshua Root)

Yes, ftp.fau.de, which is the origin server for the CDN, is down. It's being worked on but is out of our control.

comment:2 Changed 4 years ago by herbygillot (Herby Gillot)

Got it. Is there a place I could read more about how the MacPorts mirrors and infra work? Where does most of the communication & coordination on this kind of thing happen? IRC?

comment:3 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Cc: ryandesign removed

wiki:Mirrors lists our mirrors. wiki:Mirroring explains how to set up a mirror. There is a private macports-mirrors mailing list which I would like to use for communication among our mirror administrators. We have not yet used this mailing list because we have not yet announced its existence to all of the mirror administrators.

We have a private macports-infra mailing list where the MacPorts infrastructure team can discuss such things. This mailing list receives email for the "admin" mail alias so it is notified about any tickets filed in the "server/hosting" Trac component.

comment:4 in reply to:  2 Changed 4 years ago by neverpanic (Clemens Lang)

Replying to herbygillot:

Where does most of the communication & coordination on this kind of thing happen? IRC?

In this specific instance, there wasn't much coordination, since it's a hardware issue, which occurred without planning.

comment:5 Changed 4 years ago by breiter (Brian Reiter)

Thanks, the mirroring information is helpful. I didn't know that I could manually set a selfupdate rsync server in macports.conf and a list of mirrors in sources.conf.

comment:6 Changed 4 years ago by jmroot (Joshua Root)

That will work but those mirrors will be a bit out of date until the master one is fixed. If you need the very latest changes then wiki:howto/SyncingWithGit is the way to go.

comment:7 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Are we back up and running now?

$ rsync rsync://nue.de.rsync.macports.org/macports/
drwxr-xr-x            238 2019/10/27 21:08:41 .
drwxr-xr-x        305,116 2020/06/04 10:07:02 distfiles
drwxr-xr-x        886,040 2020/06/03 14:40:24 packages
drwxr-xr-x            204 2020/06/04 10:07:02 release
drwxr-xr-x            136 2008/06/05 20:13:58 trunk

comment:8 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

I guess not... the above makes it look like their server is back online, but they haven't tried to get new content from the private server today. The last connection was 2020/06/03 15:08:09 UTC.

comment:9 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Ok and the http://ftp.fau.de web site says:

The machine has a faulty RAID controller. Until a replacement part arrives, bad performance and crashes are to be expected. Mirrors will NOT be updated for now. Sorry for the inconvenience.

comment:10 Changed 4 years ago by herbygillot (Herby Gillot)

Anything to be done about having such a single point of failure in MacPorts infra?

comment:11 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

What would you suggest?

We primarily sync our ports over rsync. That requires an rsync server. If the server goes down, then you cannot connect to it. Logical. You can't remove that single point of failure.

You can configure your MacPorts to use a different rsync server or to use git, if you want to change what the single point of failure is. GitHub goes down from time to time too. We can't guarantee 100% uptime. Nobody can.

We have a CDN distributing our distfiles and packages. It pulls from FAU as the public master. It was our understanding that we had configured the CDN to continue to deliver its last copy of the files even if the master was down or returned error codes. Rainer has told me that is not happening. I'm asking for clarification.

comment:12 Changed 4 years ago by danielluke (Daniel J. Luke)

Is there another server we could be rsync'ing the private server to as well? Can the CDN be configured with a 'backup' origin (so if/when the origin goes down it will continue to serve)?

comment:13 in reply to:  12 ; Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to danielluke:

Is there another server we could be rsync'ing the private server to as well?

The private server is passive. It does not push its content to other servers; other servers pull content from it. Because the private server has limited upstream bandwidth, we have arranged for FAU to be the only server that pulls from it. Other mirrors pull from FAU, via the hostname rsync-origin.macports.org.

We could temporarily change rsync-origin.macports.org to point to the private server.

Can the CDN be configured with a 'backup' origin (so if/when the origin goes down it will continue to serve)?

The CDN can only be configured to use one origin server, as far as I know. When the origin server is down or has errors, it is supposed to serve up its cached content.
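To make the "serve cached content when the origin has errors" idea concrete, here is a rough Python sketch of that behavior. It is not how the CDN is actually implemented or configured; the class name `StaleOnErrorCache` and the injected `fetch_origin` callable are hypothetical, and the sketch contacts the origin on every request (no TTL) to keep it short.

```python
from typing import Callable, Dict


class StaleOnErrorCache:
    """Cache that falls back to the last good copy when the origin fails."""

    def __init__(self, fetch_origin: Callable[[str], bytes]) -> None:
        # fetch_origin is a hypothetical callable that raises OSError on failure.
        self._fetch_origin = fetch_origin
        self._cache: Dict[str, bytes] = {}

    def get(self, path: str) -> bytes:
        try:
            body = self._fetch_origin(path)
        except OSError:
            if path in self._cache:
                return self._cache[path]  # origin failed: serve last good copy
            raise  # never fetched before, so there is nothing stale to serve
        self._cache[path] = body
        return body
```

Note that a file which was never requested while the origin was healthy still fails, which is one way an origin outage can show through a cache.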

We could switch the CDN to use our private server as the origin. We can see whether we have enough upstream bandwidth to support that.

comment:14 Changed 4 years ago by danielluke (Daniel J. Luke)

Yeah, I was suggesting that we have FAU and maybe one other mirror rsync from the private server. We could reconfigure the CDN to use the 'other' mirror as origin (manually, I guess if they don't offer a way to failover). One way to handle that would be to repoint DNS as you suggest. I guess if we're going to have to manually reconfigure, we could configure a second site to only rsync from the private server if FAU is down.

Do you know what the current disk space + bandwidth requirements are? Do we need to recruit one or more sites that could act as CDN origins?

comment:15 in reply to:  13 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign:

We could temporarily change rsync-origin.macports.org to point to the private server.

I have made this change.

comment:16 in reply to:  14 ; Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to danielluke:

Yeah, I was suggesting that we have FAU and maybe one other mirror rsync from the private server.

We could do that. It would help make distfiles and packages available in the event that FAU is down, since MacPorts tries multiple servers when it cannot find distfiles or packages. It would not help much for the ports tree, since users must manually configure MacPorts for which rsync server they want to use. Users would have to know which of our other mirrors is the "second master" that you're proposing.
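The "tries multiple servers" behavior can be pictured with the following Python sketch. It is only an illustration of ordered fallback, not the code in MacPorts base; `fetch_with_fallback` and the injected `fetch` callable are hypothetical names.

```python
from typing import Callable, Iterable, Optional


def fetch_with_fallback(
    mirrors: Iterable[str],
    path: str,
    fetch: Callable[[str], bytes],
) -> bytes:
    """Try each mirror in order and return the first successful response."""
    last_error: Optional[Exception] = None
    for base in mirrors:
        url = f"{base.rstrip('/')}/{path.lstrip('/')}"
        try:
            return fetch(url)
        except OSError as error:  # urllib.error.URLError is a subclass of OSError
            last_error = error
    raise RuntimeError(f"all mirrors failed for {path}") from last_error
```

This works for distfiles and packages because any mirror holding the file is acceptable; the ports tree differs because a single rsync source is configured up front.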

Our private server's upstream bandwidth was a lot more limited when we first made our arrangement with FAU in late 2016. Having multiple servers connect to the private server regularly might not have worked back then but might work now. And as I said I already changed it so that all mirrors are connecting to the private server now, so we'll see how that goes.

We could reconfigure the CDN to use the 'other' mirror as origin (manually, I guess if they don't offer a way to failover). One way to handle that would be to repoint DNS as you suggest.

We only have the arrangement with FAU that allows us to use them as our CDN origin. We would have to contact the other mirror administrators and ask them if such an arrangement would be possible with them as well.

I guess if we're going to have to manually reconfigure, we could configure a second site to only rsync from the private server if FAU is down.

So you're suggesting that this one "second master" should modify their mirroring script so that it checks if FAU is up, if it is then mirror from FAU, and if not then mirror from the private server? What criteria would they use to determine when FAU is up? Right now their rsync server is responding, but according to their web site their server is unreliable due to the RAID controller failure so it may respond sometimes and other times not.

Do you know what the current disk space + bandwidth requirements are?

Disk space hovers around 1TB. Currently a little less because I recently ran the cleanup script. Sometimes more if I haven't run the cleanup script for a while. I don't know how much bandwidth they should expect to be used.

Do we need to recruit one or more sites that could act as CDN origins?

Such a need has not arisen until now. We could certainly ask around.

comment:17 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

In 8707a0074a60d9f813a927c9140361566e388225/macports-ports (master):

bootstrap.sh: Get PortIndex from rsync-origin

See: #60590

comment:18 in reply to:  13 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Replying to ryandesign:

We could switch the CDN to use our private server as the origin. We can see whether we have enough upstream bandwidth to support that.

I've made this change.

comment:19 Changed 4 years ago by herbygillot (Herby Gillot)

Another question is, would it make sense to have ping/port/service monitoring on the rsync origin servers (and I suppose for the private server now)?

If the CDNs stop serving when the origin is down, I'd imagine we'd def want some sort of monitoring + alerting on the origins.

comment:20 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Possibly that would be useful. I've wanted to have something set up to monitor the buildbot workers too, since they sometimes crash. But nothing has been done on that front yet.

comment:21 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Summary: Macports mirrors are down? → rsync.macports.org and other services at FAU are down
Version: 2.6.2

comment:22 in reply to:  20 Changed 4 years ago by herbygillot (Herby Gillot)

Replying to ryandesign:

Possibly that would be useful. I've wanted to have something set up to monitor the buildbot workers too, since they sometimes crash. But nothing has been done on that front yet.

Datadog seems to have allowances for open source projects. Here's the inquiry form for that:

https://www.datadoghq.com/partner/open-source/

Probably makes sense to use a 3rd-party service so that there's less infrastructure to run/worry about, especially in the case of monitoring & alerting.

comment:23 in reply to:  16 ; Changed 4 years ago by danielluke (Daniel J. Luke)

Replying to ryandesign:

Replying to danielluke:

Yeah, I was suggesting that we have FAU and maybe one other mirror rsync from the private server.

We could do that. It would help make distfiles and packages available in the event that FAU is down, since MacPorts tries multiple servers when it cannot find distfiles or packages. It would not help much for the ports tree, since users must manually configure MacPorts for which rsync server they want to use. Users would have to know which of our other mirrors is the "second master" that you're proposing.

We could solve that with changes to base or by updating the DNS record to point to only active servers (there are providers which offer this as part of DNS load balancing).

We could reconfigure the CDN to use the 'other' mirror as origin (manually, I guess if they don't offer a way to failover). One way to handle that would be to repoint DNS as you suggest.

We only have the arrangement with FAU that allows us to use them as our CDN origin. We would have to contact the other mirror administrators and ask them if such an arrangement would be possible with them as well.

Yes, that's the suggestion - that we get (at least) one other mirror that could act as our origin.

I guess if we're going to have to manually reconfigure, we could configure a second site to only rsync from the private server if FAU is down.

So you're suggesting that this one "second master" should modify their mirroring script so that it checks if FAU is up, if it is then mirror from FAU, and if not then mirror from the private server? What criteria would they use to determine when FAU is up? Right now their rsync server is responding, but according to their web site their server is unreliable due to the RAID controller failure so it may respond sometimes and other times not.

The key in that sentence was 'manually reconfigure' - this would only be necessary in the case where private server bandwidth constraints prevent a second mirror from syncing. Something like checking for the 'freshness' of the files on the primary + some flap dampening would work (even in this case), but it would be nicer if we didn't need to be extra clever with failover logic.
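A rough Python sketch of what "freshness + flap dampening" could look like: the source only changes after several consecutive checks disagree with the current choice, so a server that responds intermittently does not cause constant switching. The class name, the labels passed in, and the threshold of 3 are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class DampenedSelector:
    """Pick a sync source, switching only after several consistent checks."""

    primary: str
    fallback: str
    threshold: int = 3  # consecutive contrary checks needed to switch (arbitrary)
    using_primary: bool = True
    _streak: int = 0

    def observe(self, primary_fresh: bool) -> str:
        """Record one freshness check of the primary; return the source to use."""
        if primary_fresh == self.using_primary:
            self._streak = 0  # check agrees with the current choice
        else:
            self._streak += 1
            if self._streak >= self.threshold:
                self.using_primary = primary_fresh
                self._streak = 0
        return self.primary if self.using_primary else self.fallback
```

A single stale or fresh reading in between resets the streak, which is the dampening part.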

Do you know what the current disk space + bandwidth requirements are?

Disk space hovers around 1TB. Currently a little less because I recently ran the cleanup script. Sometimes more if I haven't run the cleanup script for a while. I don't know how much bandwidth they should expect to be used.

Do we need to recruit one or more sites that could act as CDN origins?

Such a need has not arisen until now. We could certainly ask around.

Probably best if one of our existing mirrors can add this - if not, I can double-check and see if it'll be OK for me to do this on a box I have co-located on a gige connection - it would be helpful to get some traffic statistics to determine if I can or not (alternatively, I may have another box available in a different location in a few months with different operating constraints that could work - but it won't help us now).

comment:24 in reply to:  23 ; Changed 4 years ago by neverpanic (Clemens Lang)

Replying to herbygillot:

Another question is, would it make sense to have ping/port/service monitoring on the rsync origin servers (and I suppose for the private server now)?

FYI, I run a cronjob that notifies me if the ports.tar on rsync.macports.org gets outdated, and forward this to the list whenever this happens. Haven't done this for this specific outage, since I figured everybody already knows.

This doesn't cover the packages and distfiles servers, but since it's all the same machine anyway, it should be good enough.
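A staleness check of this kind could be sketched in Python as below. This is not the actual cronjob; it assumes listing lines shaped like the `rsync rsync://…/macports/` output earlier in this ticket, treats the timestamp as UTC for simplicity, and uses an arbitrary 6-hour limit. The function names are made up.

```python
from datetime import datetime, timedelta, timezone


def parse_rsync_listing_time(line: str) -> datetime:
    """Extract the modification time from one line of rsync listing output.

    Expected shape: "<perms> <size> YYYY/MM/DD HH:MM:SS <name>".
    The timestamp is assumed to be UTC, which is a simplification.
    """
    fields = line.split()
    stamp = f"{fields[2]} {fields[3]}"
    parsed = datetime.strptime(stamp, "%Y/%m/%d %H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def is_stale(
    line: str,
    now: datetime,
    max_age: timedelta = timedelta(hours=6),  # arbitrary limit for illustration
) -> bool:
    """Return True if the listed entry is older than max_age at time `now`."""
    return now - parse_rsync_listing_time(line) > max_age
```

A cron wrapper would run the listing, pass the relevant line to `is_stale`, and send mail when it returns True.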

Replying to danielluke:

Probably best if one of our existing mirrors can add this - if not, I can double-check and see if it'll be OK for me to do this on a box I have co-located on a gige connection - it would be helpful to get some traffic statistics to determine if I can or not.

The mirror at FAU provides some traffic statistics when it's up. For example, here's the traffic of a random day I've picked: http://ftp.fau.de/cgi-bin/show-ftp-stats.cgi?statstype=1&datum=2020-05-20&orderby=mirrorname&orderdir=desc&submit=Go%21 (scroll to the four entries for MacPorts)

You can also generate graphs using the form at the top of the page: http://ftp.fau.de/cgi-bin/show-ftp-stats.cgi?statstype=2&what=bytes&mirrorname=macports%2Fpackages&timespan=-1&graphsize=large&submit=Go%21

We also do get statistics from our CDN, but I don't think I have access to that.

comment:25 Changed 4 years ago by neverpanic (Clemens Lang)

Throwing out random other ideas of things that would avoid this in the future: Switch MacPorts base syncing to HTTP downloads and ports syncing to Git in MacPorts base, add git mirrors and code that will fall back to those if syncing fails.

We could also look into using services such as bintray to host our distfiles and packages.

comment:26 in reply to:  25 Changed 4 years ago by herbygillot (Herby Gillot)

Replying to neverpanic:

Throwing out random other ideas of things that would avoid this in the future: Switch MacPorts base syncing to HTTP downloads and ports syncing to Git in MacPorts base, add git mirrors and code that will fall back to those if syncing fails.

We could also look into using services such as bintray to host our distfiles and packages.

I like these ideas. Maybe have the port client use an HTTP HEAD request, perhaps additionally with checksums sent in the headers by the server so that the client can avoid downloading base if it doesn't need to.
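As a sketch of the HEAD idea in Python: the client compares a checksum advertised in a response header against the local copy and skips the download when they match. The header name `X-Checksum-Sha256` is purely hypothetical (no MacPorts server is known to send it), and the function names are made up.

```python
import hashlib
import urllib.request
from typing import Mapping, Optional

# Hypothetical header name, used here only for illustration.
CHECKSUM_HEADER = "X-Checksum-Sha256"


def needs_download(headers: Mapping[str, str], local_data: Optional[bytes]) -> bool:
    """Return True unless the advertised checksum matches the local copy."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    remote = normalized.get(CHECKSUM_HEADER.lower())
    if remote is None or local_data is None:
        return True  # nothing to compare against, so download to be safe
    return hashlib.sha256(local_data).hexdigest() != remote.strip().lower()


def head_headers(url: str, timeout: float = 10.0) -> Mapping[str, str]:
    """Issue a HEAD request and return the response headers as a dict."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return dict(response.headers.items())
```

Usage would be along the lines of `needs_download(head_headers(url), local_bytes)`, with `url` pointing at wherever base ends up being published.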

Additionally it would be nice to have perhaps something like status.macports.org that shows the status of origin/mirror hosts.


comment:27 Changed 4 years ago by herbygillot (Herby Gillot)

I would also recommend following up with the Datadog Open Source program. The agent is something that can probably be installed on the current machines to keep an eye on disk usage, CPU, service health (like rsyncd), probably bandwidth and more and alert on these in an easy way without requiring extra work or infra.

comment:28 in reply to:  24 Changed 4 years ago by danielluke (Daniel J. Luke)

Replying to neverpanic:

Replying to danielluke:

Probably best if one of our existing mirrors can add this - if not, I can double-check and see if it'll be OK for me to do this on a box I have co-located on a gige connection - it would be helpful to get some traffic statistics to determine if I can or not.

The mirror at FAU provides some traffic statistics when it's up. For example, here's the traffic of a random day I've picked: http://ftp.fau.de/cgi-bin/show-ftp-stats.cgi?statstype=1&datum=2020-05-20&orderby=mirrorname&orderdir=desc&submit=Go%21 (scroll to the four entries for MacPorts)

You can also generate graphs using the form at the top of the page: http://ftp.fau.de/cgi-bin/show-ftp-stats.cgi?statstype=2&what=bytes&mirrorname=macports%2Fpackages&timespan=-1&graphsize=large&submit=Go%21

These graphs are almost helpful. I really care about p95 utilization though as that is what affects the transit pricing for my servers (and I don't see how to get that from these statistics/graphs).
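For context, p95 utilization is commonly computed from periodic bandwidth samples (often taken every 5 minutes): sort them, ignore the top 5%, and take the highest remaining value. A small Python sketch using the nearest-rank method follows; providers differ in the exact method, so this is an assumption, as is the function name.

```python
from typing import Sequence


def p95(samples_mbps: Sequence[float]) -> float:
    """95th percentile by nearest rank: the smallest sample such that at
    least 95% of all samples are less than or equal to it."""
    if not samples_mbps:
        raise ValueError("need at least one sample")
    ordered = sorted(samples_mbps)
    # Integer form of ceil(0.95 * n), avoiding floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

This is why daily byte totals are not enough: short bursts that fall in the top 5% of samples do not affect the result, while sustained load does.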

comment:29 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

There are too many things happening in this ticket. Individual suggestions for infrastructure improvements should be filed in separate tickets. This ticket is about the current hardware degradation at the FAU server and will be resolved when their new hardware is in place.

FAU continues to mirror our content, but not as frequently as before the hardware problem. They said they do not see a need, from a stability or load standpoint on their server, to move rsync.macports.org away from their server. Updates to users who are using that server will just be more delayed than usual for now.

There have been reports of 504 errors (e.g. #60602 and some pull request CI build logs) when downloading from the CDN now that the CDN is set to get files from the private server. As I suspected, the private server's upstream bandwidth is insufficient to deliver the volume of requested files before the CDN times out and delivers an error. Therefore I've set distfiles-origin and packages-origin back to the FAU server.

comment:30 Changed 4 years ago by neverpanic (Clemens Lang)

I've created #60608 to track updating MacPorts base via HTTP downloads.

comment:31 Changed 4 years ago by herbygillot (Herby Gillot)

Should there be another ticket for monitoring, or a "status.macports.org" page that provides insight on buildbot, origin and mirrors health?

comment:32 Changed 4 years ago by neverpanic (Clemens Lang)

I guess status.macports.org would require some server to actually deliver this page, which we currently don't have. We could try to find a service to redirect status.macports.org to our twitter feed, which I guess is the thing that will keep working when the server is down. Do you want to file a ticket for that?

comment:33 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Resolution: fixed
Status: new → closed

The FAU server's hardware has been repaired and is back to normal so I'll close this.

I'll leave rsync-origin.macports.org pointing to the private server for now since it seems to be able to handle that amount of load.

Feel free to file other tickets for the status page and other ideas mentioned above.
