Releasing PostgreSQL table locks held by autovacuum using manual vacuum
Description
Activity on your GitLab instance is failing because locks cannot be acquired on one or more tables.
Most commonly, this will cause database migrations to fail during an upgrade.
Errors include:
PG::NoActiveSqlTransaction: ERROR: LOCK TABLE can only be used in transaction blocks
PG::QueryCanceled: ERROR: canceling statement due to statement timeout (ActiveRecord::QueryCanceled)
PG::LockNotAvailable: ERROR: canceling statement due to lock timeout
Environment
Impacted offerings:
- GitLab Self-Managed
Diagnosis
Identify that autovacuum is the cause using a database console session.
- Look to see if autovacuum is running:
  select pid,datname,query,backend_type from pg_stat_activity where query like '%VACUUM%';
- If the migration that's failing is trying to modify a table that autovacuum is active on, it may be blocked from taking a lock that it needs on that table.
- Shut down Sidekiq and Puma to eliminate locks created by Rails.
- Try to re-run the migration (sudo gitlab-rake db:migrate).
- Look at what locks are granted (granted: true) and which are not (see also the optional blocking-session query after this list):
  SELECT pid, locktype, mode, granted, relation::regclass FROM pg_locks;
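If it is unclear from pg_locks which session is blocking which, the following optional query (not part of the original steps; it assumes PostgreSQL 9.6 or later, which provides pg_blocking_pids()) maps each blocked session to the session holding the lock it is waiting on:
-- assumes PostgreSQL 9.6+ (pg_blocking_pids); lists blocked sessions and their blockers
SELECT blocked.pid AS blocked_pid,
       blocked.query AS blocked_query,
       blocking.pid AS blocking_pid,
       blocking.query AS blocking_query
FROM pg_stat_activity AS blocked
JOIN LATERAL unnest(pg_blocking_pids(blocked.pid)) AS b(pid) ON true
JOIN pg_stat_activity AS blocking ON blocking.pid = b.pid;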
Solution
Start a database console session in screen or tmux.
The operation will take some time, and you need the session to be protected from SSH session timeouts, network interruptions, etc.
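As a minimal sketch, assuming a Linux package (Omnibus) installation and the default database name, the console can be opened inside the screen or tmux session with:
# assumes the Linux package install; gitlabhq_production is the default database name
sudo gitlab-psql -d gitlabhq_production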
We will use the public.ci_builds table as an example in subsequent instructions. Determine which table should be vacuumed from your analysis of the failing migration and from the table that the autovacuum process is working on.
SET statement_timeout = 0;
SET maintenance_work_mem = '1GB';
VACUUM (ANALYZE, VERBOSE, FREEZE, INDEX_CLEANUP ON) public.ci_builds;
This VACUUM will now wait because it needs the same lock that autovacuum has.
In a second database console:
- Identify the PID of the VACUUM public.ci_builds (to prevent wraparound) process; an optional query to narrow the search follows this list.
- Instruct PostgreSQL to terminate that process via the PID value (example: 12345).
select pid,datname,query,backend_type from pg_stat_activity where query like '%VACUUM%';
select pg_terminate_backend(12345);
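To reduce the risk of terminating the wrong session (for example, your own manual VACUUM), this optional narrower query can be used; it is a sketch that assumes PostgreSQL 10 or later for the backend_type column:
-- assumes PostgreSQL 10+; lists only autovacuum workers touching the example table
select pid, query from pg_stat_activity
where backend_type = 'autovacuum worker' and query like '%ci_builds%';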
autovacuum must not be ended any other way, such as with kill on the Linux command line.
Issuing the command on the console ensures there are no surprises for PostgreSQL.
If certain background processes are killed by Linux (such as by the out-of-memory (OOM) killer), PostgreSQL will panic and shut down.
autovacuum will restart after being terminated. Your VACUUM command will have reached the front of the queue for the table lock, so autovacuum now cannot do anything, and will itself queue for a lock.
The output of VACUUM will look like this:
INFO: aggressively vacuuming "public.ci_builds"
INFO: launched 2 parallel vacuum workers for index vacuuming (planned: 2)
INFO: scanned index "index_ci_builds_on_commit_id_artifacts_expired_at_and_id" to remove 49 row versions
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
INFO: scanned index ...
-- multiple indexes scanned.
INFO: table "ci_builds": removed 49 dead item identifiers in 4 pages
DETAIL: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
-- this is the vacuum on the table itself done, it then vacuums the indexes
INFO: index "index_ci_builds_on_commit_id_artifacts_expired_at_and_id" now contains 35 row versions in 2 pages
DETAIL: 0 index row versions were removed.
0 index pages were newly deleted.
0 index pages are currently deleted, of which 0 are currently reusable.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
INFO: index ...
-- followed by the other indexes
INFO: table "ci_builds": found 0 removable, 125 nonremovable row versions in 5 out of 2298 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 420154
Skipped 0 pages due to buffer pins, 2293 frozen pages.
CPU: user: 0.01 s, system: 0.01 s, elapsed: 0.08 s.
INFO: aggressively vacuuming "pg_toast.pg_toast_40619"
INFO: table "pg_toast_40619": found 0 removable, 0 nonremovable row versions in 0 out of 0 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 420154
Skipped 0 pages due to buffer pins, 0 frozen pages.
CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s.
-- now it does the ANALYZE
INFO: analyzing "public.ci_builds"
INFO: "ci_builds": scanned 2298 of 2298 pages, containing 76477 live rows and 0 dead rows; 76477 rows in sample, 76477 estimated total rows
VACUUM
The command prints out VACUUM when it completes.
If VACUUM releases the table lock, autovacuum will grab it. VACUUM will then terminate with a message like canceling statement due to lock timeout.
If so, repeat the process in the two sessions: starting VACUUM and then terminating autovacuum.
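One optional variation, not part of the original procedure: disable the lock timeout in the session running the manual VACUUM so that it waits for the lock instead of being cancelled.
-- optional: 0 disables the lock timeout for this session only
SET lock_timeout = 0;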
Cause
Autovacuum has triggered. It has taken locks on a table to perform housekeeping. It is triggered by a large number of changes occurring on a table, which could have been caused by database migrations, so earlier migrations in a GitLab upgrade may cause later migrations to fail.
A manual workaround is needed for such a situation because:
- A (to prevent wraparound) instance of Autovacuum will not back off. If you cancel it, it will restart automatically; a query to check wraparound age follows this list.
- Autovacuum is throttled to avoid disrupting normal operations. During an emergency this is not desirable; the operation needs to be completed as soon as possible.
- The GitLab instance may be configured with too little working memory for Autovacuum to complete work on large tables.
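To see how close tables are to the wraparound threshold that triggers this aggressive autovacuum, a query along these lines can help (a supplementary sketch, not from the original article):
-- tables with the oldest frozen transaction ID are the most urgent wraparound candidates
SELECT relname, age(relfrozenxid) AS xid_age
FROM pg_class
WHERE relkind = 'r'
ORDER BY age(relfrozenxid) DESC
LIMIT 10;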
Additional information
It is possible that the PostgreSQL database doesn't have enough working memory (maintenance_work_mem) configured for Autovacuum to efficiently housekeep the largest tables. In extreme cases, it never completes and will run indefinitely.
PostgreSQL should be configured to provide more working memory specifically for operations like autovacuum.
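Before changing anything, the values currently in effect can be checked from a database console, for example:
-- autovacuum_work_mem defaults to -1, meaning maintenance_work_mem is used
SHOW maintenance_work_mem;
SHOW autovacuum_work_mem;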
For packaged GitLab or Docker, modify gitlab.rb and apply with gitlab-ctl reconfigure:
# allow autovacuum to always use 1GB of work mem
postgresql['maintenance_work_mem'] = "1GB"
# log autovacuum activity so it can be debugged if needed.
postgresql['log_autovacuum_min_duration'] = "0"
All database nodes (Geo primary and secondary, and any Patroni replicas) should be reconfigured.
AWS RDS is deployed with an appropriately increased maintenance_work_mem. Other cloud vendors and deployments of PostgreSQL may also require this change.
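For an externally managed PostgreSQL that is not configured through gitlab.rb, the equivalent settings would go in postgresql.conf; this is a sketch and the values should be tuned for your deployment:
# sketch for postgresql.conf; reload PostgreSQL after changing
maintenance_work_mem = '1GB'
log_autovacuum_min_duration = 0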