PostgreSQL – Autovacuum
Sometime in the 2000s, the PostgreSQL developers confronted a major shortcoming in the storage design of their relational database management system with respect to disk space and transaction speed. The UPDATE query was becoming an expensive routine: instead of overwriting a row in place, UPDATE wrote a complete new version of the row and left the old version behind, which meant that the size of a table was not bound by any limit. Similarly, deleting a row only MARKED it as deleted while the actual data remained untouched, a behaviour that, incidentally, also makes later data forensics possible.
This may sound familiar, since it is what present-day file systems and data-recovery software rely on: deleted data remains intact on the disk in its raw form but is merely hidden from the interface. In PostgreSQL's case, however, keeping old row versions is not an accident of the design. Older, still-running transactions may need to see the database as it was when they started, so discarding old versions immediately would compromise transactional integrity. What was needed was a way to reclaim only those row versions that no transaction could ever see again. The Postgres team therefore introduced the VACUUM command, which literally vacuumed out the dead rows. VACUUM, however, was a manual process, and with the several parameters involved it was tedious to run by hand. Hence, autovacuum was developed.
What is Autovacuum?
Autovacuum is a daemon, or background utility process, that PostgreSQL provides to perform a regular clean-up of redundant data in the database. It does not require the user to issue vacuuming manually; instead, its behaviour is defined in the postgresql.conf file. To access this file, navigate to the data directory in your terminal and then open the file in a suitable editor.
>> cd "C:\Program Files\PostgreSQL\13\data"
>> notepad postgresql.conf
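The current autovacuum settings can also be inspected from within psql itself, without opening the file at all. A quick sketch:

```sql
-- List every autovacuum-related setting and its current value
SELECT name, setting
FROM pg_settings
WHERE name LIKE 'autovacuum%';

-- Or check just the master switch
SHOW autovacuum;
```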
The autovacuum utility reclaims the blocks of data that were marked deleted: it first removes the dead/outdated tuples and then records the freed space in the free-space map, so that queued transactions know where any updates or insertions may be placed in the table. This is in stark contrast to the old procedure, in which a transaction would blindly append a new row of data with the same identifying elements and updated attributes while the space held by the old version was never reused. The benefits of autovacuuming are quite evident:
- Storage space is reused rather than growing without bound.
- Free-space-map accuracy is improved. The FSM is a tree-structured map, kept per relation/table, that records how much free space each page has available.
- Unlike an ill-timed manual vacuum, autovacuum runs are scheduled and throttled, so they are far less time- and resource-intensive for foreground work.
- Plain (auto)vacuum does not place exclusive locks on tables; only a FULL vacuum does, blocking transactions that want to access the table.
- Table bloating is prevented. Bloat is the condition in which a table grows to an enormous size because most of it is filled with unnecessary, invalid row versions.
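Whether autovacuum is keeping up can be checked from the statistics views. A sketch, using the standard pg_stat_user_tables view:

```sql
-- Dead-tuple counts and the last time autovacuum touched each table;
-- a persistently large n_dead_tup hints that autovacuum is falling behind
SELECT relname,
       n_live_tup,
       n_dead_tup,
       last_autovacuum
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
```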
One way of monitoring table sizes before and after transactions is simply to execute the following query in the shell after connecting to a specific database:
postgresql=# SELECT pg_size_pretty(pg_relation_size('table_name'));
Consider a table storing the accounts of customers at a toll booth. The size of this table is then given by the query above.
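As a self-contained sketch of the experiment, where the table definition and data are purely illustrative:

```sql
-- An illustrative toll-booth accounts table
CREATE TABLE toll_accounts (
    account_id  serial PRIMARY KEY,
    vehicle_no  text,
    balance     numeric
);

INSERT INTO toll_accounts (vehicle_no, balance)
SELECT 'KA-' || g, 100.0
FROM generate_series(1, 10000) AS g;

-- Size before any churn
SELECT pg_size_pretty(pg_relation_size('toll_accounts'));

-- Delete half the rows; the dead tuples still occupy space,
-- so the reported size does not drop immediately
DELETE FROM toll_accounts WHERE account_id % 2 = 0;
SELECT pg_size_pretty(pg_relation_size('toll_accounts'));

-- After a vacuum the space becomes reusable by new rows,
-- though the file does not necessarily shrink on disk
VACUUM toll_accounts;
```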
If transactions ensue, this query may be run again to demonstrate the change in the table's size. Disproportionate growth would suggest that autovacuuming is failing, for instance when the size keeps increasing even though no transaction has any use for the outdated row versions.
Since autovacuum is a background utility, it is turned on by default. Keep in mind, however, that its defaults were chosen quite a while back and are therefore conservative, i.e., set according to the hardware and software of that era. Modern applications call for reviewing these parameters and adjusting them proportionately. We shall have a quick look at the key parameters:
- autovacuum: set to ‘on’ by default, so it need not be enabled explicitly in the configuration.
- autovacuum_naptime: set to 1min (60s) by default; it is the duration between consecutive autovacuum wakeups on a database.
- autovacuum_max_workers: the maximum number of worker processes that may be vacuuming simultaneously after each ‘naptime’ wakeup.
- autovacuum_vacuum_scale_factor: usually set to 0.2, meaning autovacuum conducts a clean-up of a table only once about 20% of its rows have been changed/updated.
- autovacuum_vacuum_threshold: a precautionary floor; vacuuming happens only if at least a set number of changes (50 by default) have been made to the table, which keeps tiny tables from being vacuumed constantly.
- autovacuum_analyze_scale_factor: governs the analyzing utility that keeps regular statistics of tables during transactions. If set to 0.1, analysis runs only once 10% of the table has been changed by inserts, updates, or deletes.
- autovacuum_analyze_threshold: similar to autovacuum_vacuum_threshold, although here the action performed is analysis. It is performed only if a minimum of 50 changes has been made by the transactions.
These parameters are tuned according to how frequently transactions affect the database and how large it is, or is expected to grow. A table is vacuumed once its number of dead tuples exceeds autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor × (number of rows in the table). So, if transactions occur at high rates, autovacuum_max_workers may be increased, or autovacuum_vacuum_scale_factor may be decreased when transactions are not expected to demand older row versions. Additionally, the developer may adjust the analysis parameters to formulate better query techniques.
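A sketch of what an adjusted configuration might look like. The numbers below are illustrative, not recommendations, and the per-table override targets the hypothetical toll_accounts table:

```sql
-- postgresql.conf style settings (global, shown here as comments):
-- autovacuum = on
-- autovacuum_naptime = 30s
-- autovacuum_max_workers = 6
-- autovacuum_vacuum_scale_factor = 0.05
-- autovacuum_vacuum_threshold = 50

-- Per-table overrides for a write-heavy table, set from SQL
-- without touching the global defaults:
ALTER TABLE toll_accounts
    SET (autovacuum_vacuum_scale_factor = 0.02,
         autovacuum_analyze_scale_factor = 0.05);
```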
This brings us to question the idea behind analyzing tables so often.
Which tables require analysis?
Analysis may be performed manually or simply by keeping autovacuum turned on. It gathers statistical information about the database that helps both the query planner and the developer enhance efficiency. Essentially, analysis provides the following information:
- A list of the most common values in a specific column of a relation/table. For some columns this is of no use: a unique identifier, by definition, cannot be expected to repeat within a table.
- A histogram of the data distribution within each column, i.e., how its values are spread across their range. The planner relies on these histograms to estimate how many rows a given condition will match.
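Both pieces of information can be inspected directly in the built-in pg_stats view; a sketch, where the table and column names are illustrative:

```sql
-- Most common values, their frequencies, and the value histogram
-- gathered by the last (auto)analysis of the column
SELECT most_common_vals,
       most_common_freqs,
       histogram_bounds
FROM pg_stats
WHERE tablename = 'toll_accounts'
  AND attname   = 'balance';
```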
Now to the pertinent question: which tables actually require analysis? With autovacuum on, most tables are analyzed automatically. Nevertheless, the explicit ANALYZE command is typically issued for the following reasons:
- When UPDATE activity does not seem to touch certain columns. It can happen that statistics are needed for columns that ongoing transactions leave unchanged, so the automatic analysis may never refresh them.
- To keep a tab on tables in which the rate at which updates occur is itself of interest.
- To understand which aspects of the data are least prone to changes, so as to establish a pattern.
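Issuing the command explicitly is straightforward; a sketch, where the table and column names are illustrative:

```sql
-- Re-gather statistics for the whole table
ANALYZE toll_accounts;

-- Or restrict the work to specific columns that automatic
-- analysis may not have refreshed recently
ANALYZE VERBOSE toll_accounts (balance);
```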
Yet another question that may be posed is: how do we shortlist the tables on which ANALYZE must be issued separately?
There exists a simple rule of thumb: analysis on a table pays off as long as the minimum or maximum values of its columns are prone to change. For example, a table recording vehicle speeds measured by a velocity gun is bound to have its maximum value change; there, the analysis will yield something conclusive.