500 errors in GitLab UI with "no space left on device" in Postgres logs
Description
- GitLab UI returns 500 errors
- PostgreSQL data file storage is at 100% usage
- `/var/opt/gitlab/postgresql/data/pg_wal/` takes up large amounts of storage
- PostgreSQL cannot start up, with the error: `FATAL: could not write to file "pg_wal/some_filename": No space left on device`
Environment
- Geo PostgreSQL replication is used
- Impacted offerings:
  - GitLab Self-Managed
Solution
- Remove the inactive replication slot (see the sketch after this list)
- Decide how to handle replication going forward:
  - If it is no longer required, remove that Geo site.
  - If it is still required, re-initiate the replication process to recreate the replication slot correctly.

The used storage space should go down quickly afterwards without further intervention.
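As a rough sketch of the first step, the inactive slot can be listed and dropped from a psql session on the PostgreSQL node holding the data. This assumes an Omnibus installation where `gitlab-psql` is available; the slot name `geo_secondary_example` is a placeholder, replace it with the `slot_name` reported by the first query.

```shell
# List replication slots; an inactive slot shows active = f
sudo gitlab-psql -d gitlabhq_production -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"

# Drop the inactive slot (placeholder name, substitute the real slot_name from the output above)
sudo gitlab-psql -d gitlabhq_production -c "SELECT pg_drop_replication_slot('geo_secondary_example');"
```

PostgreSQL must be running for these queries to work, so if the service cannot start at all, some space on the mount may need to be freed or added first.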
Cause
The mount where PostgreSQL stores its data files (`/var/opt/gitlab` by default) is at 100% usage, so PostgreSQL can't start up.
If a replication slot is inactive, the WAL files in `pg_wal` corresponding to the slot are retained forever (or until the slot becomes active again). This causes continuous disk usage growth.
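To see this in practice, the following checks (assuming the default Omnibus paths shown in this article) compare the size of `pg_wal` with the replication slot state; `restart_lsn` marks the oldest WAL position the slot still needs, so an inactive slot retains WAL from that point onward:

```shell
# Size of the WAL directory on the default data path
sudo du -sh /var/opt/gitlab/postgresql/data/pg_wal/

# Usage of the mount holding the PostgreSQL data files
df -h /var/opt/gitlab

# Replication slot state; active = f with an old restart_lsn means WAL is being retained
sudo gitlab-psql -d gitlabhq_production -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
```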
Additional Information
- The default PostgreSQL data file storage is `/var/opt/gitlab`.
- The error `FATAL: could not write to file "pg_wal/some_filename": No space left on device` can be found in `/var/log/gitlab/postgresql/current` by default. It also shows up when running `gitlab-ctl tail postgresql`.
- Running `gitlab-ctl status` will show whether PostgreSQL is in a crash loop or down. If it is up, PostgreSQL will show a runtime of only a few seconds, much shorter than the other services:
run: geo-postgresql: 172305s; run: log: (pid 8636) 6303702s
run: gitaly: 172304s; run: log: (pid 8616) 6303702s
run: gitlab-exporter: 172303s; run: log: (pid 8618) 6303702s
run: gitlab-kas: 172292s; run: log: (pid 8610) 6303702s
run: gitlab-workhorse: 172292s; run: log: (pid 8622) 6303702s
run: logrotate: 3090s; run: log: (pid 8615) 6303702s
run: nginx: 172291s; run: log: (pid 8612) 6303702s
run: node-exporter: 172291s; run: log: (pid 8621) 6303702s
run: postgres-exporter: 172290s; run: log: (pid 8634) 6303702s
run: postgresql: 15s; run: log: (pid 8601) 6303702s
run: prometheus: 172289s; run: log: (pid 8611) 6303702s
run: puma: 172289s; run: log: (pid 8623) 6303702s
run: redis: 172289s; run: log: (pid 8614) 6303702s
run: redis-exporter: 172288s; run: log: (pid 8607) 6303702s
run: sidekiq: 172289s; run: log: (pid 8606) 6303702s
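For reference, a quick way to confirm the error in the log without tailing it (default Omnibus log path assumed):

```shell
# Search the current PostgreSQL log for the out-of-space error
sudo grep "No space left on device" /var/log/gitlab/postgresql/current
```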