Opened 4 years ago

Closed 4 years ago

#60662 closed defect (fixed)

psql: FATAL: remaining connection slots are reserved for non-replication superuser connections

Reported by: ryandesign (Ryan Carsten Schmidt) Owned by: admin@…
Priority: Normal Milestone:
Component: server/hosting Version:
Keywords: Cc: neverpanic (Clemens Lang)
Port:

Description

Are we still running postgresql on Braeburn? I know ports.php was retired and the php-based web site was modified to no longer connect to the database. Are there any remaining needs for it?

The buildbot task that updates the database is now failing:

https://build.macports.org/builders/jobs-portindex/builds/15052/steps/psql/logs/stdio

Could not create directory '/var/empty/.ssh'.
psql: FATAL:  remaining connection slots are reserved for non-replication superuser connections

If we no longer need that task, we can remove it.

Change History (9)

comment:1 Changed 4 years ago by raimue (Rainer Müller)

Cc: neverpanic added

The portindex data is definitely used by prbot to map email addresses to github maintainers and for the openmaintainer/nomaintainer labels to pull requests on GitHub.

With the recent upgrade, we changed the deployment of Trac to use gunicorn instead of mod_wsgi in apache2. That has a different worker/threading model. It appears that we have more open database connections now than we had previously. This error was also seen on some HTTP requests for Trac.

@neverpanic, sounds like we need to increase the maximum number of DB connections (as we discussed privately earlier) or should we reduce the number of gunicorn workers?

comment:2 Changed 4 years ago by neverpanic (Clemens Lang)

I have the feeling that trac got more responsive since that change, so personally I'd do a bit of everything:

  • reduce the number of gunicorn workers
  • increase the number of PostgreSQL connections
  • reduce the number of requests a single gunicorn worker serves before restart

@Ryan: We're very aware of the connection limit issue, since we're also getting it for the cronjobs. In general, our current setup seems to work better than the previous one, at least judging from the continuity of the memory usage, so there'll be a few days of tuning to get it right eventually.

comment:3 Changed 4 years ago by neverpanic (Clemens Lang)

@Ryan: We limited the number of requests served by a single gunicorn worker instance to 1000 requests before a new worker is started. This seems to have improved the situation as evident from our monitoring: https://p.dnnr.de/qOJl19wWQiaIDWnC.png

Did you see any further such issues after Tuesday 20:00 UTC?

In any case, I've also reduced the number of gunicorn workers to 14 now, and changed to restarting the servers every 500 requests. Let's see whether this reduces the number of open connections even further without compromising on Trac's performance.

Last edited 4 years ago by neverpanic (Clemens Lang) (previous) (diff)

comment:4 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

Thanks, I'll let you know if I see the error again.

comment:5 Changed 4 years ago by neverpanic (Clemens Lang)

Some of the cronjobs did trigger the problem again yesterday…

I've made a few more changes to attempt to mitigate the issue.

comment:6 Changed 4 years ago by neverpanic (Clemens Lang)

I did some more debugging, using

sudo -u postgres psql -c "select pid, backend_start, query_start, state_change from pg_stat_activity where usename = 'trac' order by pid asc, state_change asc;" | cat

which gives me a list of currently open connections and their last state change. This shows that most of the connections are actually getting used regularly, and Trac just keeps them open in its connection pool for the next client. I think I now know what was happening here:

I have now specified a connection pool size of 6 connections per server instance and increased the worker count to 15. That should give us 90 database connections at most and leave some room for cronjobs.

comment:7 Changed 4 years ago by ryandesign (Ryan Carsten Schmidt)

A couple days ago I saw errors when trying to load web pages from ports.macports.org. I think they were the same error we're talking about here but I didn't write it down. Does it connect to the same database server or was that unrelated?

comment:8 Changed 4 years ago by neverpanic (Clemens Lang)

It connects to the same database server, but with a different user. The limit of 100 connections is per user, so it should not have affected ports.macports.org, unless that also creates a lot of connections (our monitoring says it doesn't).

The webapp averages at under one concurrent connection over the last week, and its maximum were 5 open connections, so I think that must have been something else.

comment:9 Changed 4 years ago by neverpanic (Clemens Lang)

Resolution: fixed
Status: newclosed

I think we have this covered now: https://p.dnnr.de/slpCR-NxOpHoag8D.png

Note: See TracTickets for help on using tickets.