504 Gateway timeouts and performance degradation after losing PostgreSQL nodes
Issue

- Intermittent 504 Gateway Timeout errors in the GitLab UI
- Overall UI slowness and degraded performance
- In Sidekiq, the enqueued and failed counts increase rapidly when viewing the background jobs
- Health checks for the Rails load balancer alternate between success and failure. When the health check endpoints are queried manually, they may be slow but can still succeed.
Environment

- Impacted offerings: GitLab Self-Managed on a 3K reference architecture or larger
Cause

GitLab uses database load balancing to distribute read traffic across multiple PostgreSQL nodes in a round-robin fashion. Database load balancing is configured via the gitlab_rails['db_load_balancing'] setting on all Rails/Sidekiq nodes.

If one of the PostgreSQL nodes goes offline and failover handling does not kick in, the Rails/Sidekiq nodes cannot reliably connect to the database: the round-robin rotation keeps selecting the offline PostgreSQL node, and those connection attempts eventually time out, leading to the issues described above.
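For reference, a typical gitlab_rails['db_load_balancing'] entry in /etc/gitlab/gitlab.rb lists the read replicas under a hosts key; the IPs below are placeholders, not values from this environment:

```ruby
# /etc/gitlab/gitlab.rb -- example only; replace the placeholder IPs
# with your own PostgreSQL read replica hosts.
gitlab_rails['db_load_balancing'] = {
  'hosts' => ['10.0.0.21', '10.0.0.22', '10.0.0.23']
}
```

Every host in this list receives read traffic in turn, which is why a single offline entry degrades the whole fleet.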
Resolution

Note: All steps below must be performed on all Rails and Sidekiq nodes to ensure consistent configuration across the GitLab environment.

- Check the database load balancing logs to identify offline hosts:

  ```shell
  sudo tail -f /var/log/gitlab/gitlab-rails/database_load_balancing.log
  ```

- In the /etc/gitlab/gitlab.rb file, check the gitlab_rails['db_load_balancing'] setting:

  - Confirm that all PostgreSQL hosts are online and can be reached on the port defined in gitlab_rails['db_port']:

    ```shell
    # Replace HOST and PORT with your database host IP and port (default: 5432)
    nc -zv HOST PORT
    ```

  - Remove any unresponsive host IPs from gitlab_rails['db_load_balancing'].

- Run `sudo gitlab-ctl reconfigure` to apply the change.

- Verify the fix by:

  - Monitoring the database load balancing logs (/var/log/gitlab/gitlab-rails/database_load_balancing.log) for errors
  - Checking that 504 errors have stopped and that the GitLab UI is responsive
  - Checking that the background job queues are draining
If you restore the offline PostgreSQL node or add new ones, remember to update gitlab_rails['db_load_balancing']
on all Rails and Sidekiq nodes to make full use of the database load balancing.
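When several replicas are configured, the nc check above can be wrapped in a small loop to probe them all at once. This is a sketch: check_pg_hosts is a hypothetical helper name, and the IPs in the usage example are placeholders for the hosts listed in gitlab_rails['db_load_balancing'].

```shell
# Sketch: probe each PostgreSQL host on a given port with nc.
# check_pg_hosts is a hypothetical helper; replace the example IPs
# with the hosts from your gitlab_rails['db_load_balancing'] setting.
check_pg_hosts() {
  port="$1"; shift
  for host in "$@"; do
    # -z: scan without sending data; -w 3: three-second timeout
    if nc -z -w 3 "$host" "$port" 2>/dev/null; then
      echo "$host:$port reachable"
    else
      echo "$host:$port UNREACHABLE"
    fi
  done
}

# Example usage (placeholder IPs):
# check_pg_hosts 5432 10.0.0.21 10.0.0.22 10.0.0.23
```

Any host reported UNREACHABLE should be removed from the load balancing list (or restored) before traffic is expected to stabilize.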
Additional information
The following logs indicate database connectivity issues:
- Rails and Sidekiq nodes:

  ```plaintext
  # /var/log/gitlab/gitlab-rails/database_load_balancing.log
  {"severity":"WARN","time":"2024-11-04T22:16:39.921Z","correlation_id":"01JBWKE96VARZHG2K5SZQQ80VH","event":"host_offline","message":"Host is offline after replica status check","db_host":"x.x.x.x","db_port":null}
  ```
- Rails nodes:

  ```plaintext
  # /var/log/gitlab/puma/puma_stderr.log
  source=rack-timeout id=01JBWJJD0H1Z21BVFP5XG0GVV1 timeout=60000ms service=60000ms state=timed_out at=error

  # /var/log/gitlab/gitlab-workhorse/current
  {"correlation_id":"01JBWJJD0H1Z21BVFP5XG0GVV1","duration_ms":9795,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2024-11-04T21:51:29Z","uri":"/api/v4/user"}
  ```