Assistant Engineer
Assistant Engineer
  • UID622
  • Fans3
  • Follows0
  • Posts52

[PostgreSQL Development]Troubleshooting problems caused by inconsistent pg_resetxlog versions

More Posted time:Sep 19, 2016 16:45 PM
When the redo log of a database is corrupt, or a control file is corrupt, the database may fail to start up.
This issue may be caused by some problem in the block device storing pg_xlog or pg_control files.
So what should we do when the xlog or control file is corrupt?
The database will write data into the control file when it is closed normally, and redo log is used to recover the database after the database crashes. If the database was normally closed, it does not need to be recovered from the redo log.
PostgreSQL provides a tool to generate or edit the control file to erase a specified pg_xlog.
It can be used when the database fails to start up normally because of a corrupt control file or pg_xlog.
After you use the tool and the database is started up successfully, make sure to import the control file to a new cluster after the logic export of it.
During export, it is suggested you set to skip error blocks (by setting zero_damaged_pages=true), because under such circumstances, nine out of ten that blocks are damaged.
Meanwhile, because pg_resetxlog will rewrite the control file, if you are using a pg_resetxlog of a different version with the database, it will generate a control file of the pg_resetxlog version, with a series of problems following.
Content of control file
We can know the definition of control file content by viewing the header file of the control file.

Version of control file
/* Version identifier for this pg_control format */
#define PG_CONTROL_VERSION      942

Checkpoint information of control file
 * Body of CheckPoint XLOG records.  This is declared here because we keep
 * a copy of the latest one in pg_control for possible disaster recovery.
 * Changing this struct requires a PG_CONTROL_VERSION bump.
typedef struct CheckPoint
        XLogRecPtr      redo;                   /* next RecPtr available when we began to
                                                                 * create CheckPoint (i.e. REDO start point) */
        TimeLineID      ThisTimeLineID; /* current TLI */
        TimeLineID      PrevTimeLineID; /* previous TLI, if this record begins a new
                                                                 * timeline (equals ThisTimeLineID otherwise) */
        bool            fullPageWrites; /* current full_page_writes */
        uint32          nextXidEpoch;   /* higher-order bits of nextXid */
        TransactionId nextXid;          /* next free XID */
        Oid                     nextOid;                /* next free OID */
        MultiXactId nextMulti;          /* next free MultiXactId */
        MultiXactOffset nextMultiOffset;        /* next free MultiXact offset */
        TransactionId oldestXid;        /* cluster-wide minimum datfrozenxid */
        Oid                     oldestXidDB;    /* database with minimum datfrozenxid */
        MultiXactId oldestMulti;        /* cluster-wide minimum datminmxid */
        Oid                     oldestMultiDB;  /* database with minimum datminmxid */
        pg_time_t       time;                   /* time stamp of checkpoint */
        TransactionId oldestCommitTsXid;        /* oldest Xid with valid commit
                                                                                 * timestamp */
        TransactionId newestCommitTsXid;        /* newest Xid with valid commit
                                                                                 * timestamp */

         * Oldest XID still running. This is only needed to initialize hot standby
         * mode from an online checkpoint, so we only bother calculating this for
         * online checkpoints and only when wal_level is hot_standby. Otherwise
         * it's set to InvalidTransactionId.
        TransactionId oldestActiveXid;
} CheckPoint;

Data of control file
 * Contents of pg_control.
 * NOTE: try to keep this under 512 bytes so that it will fit on one physical
 * sector of typical disk drives.  This reduces the odds of corruption due to
 * power failure midway through a write.

typedef struct ControlFileData
} ControlFileData;

You can output the control file content through the pg_controldata command and the source code is as follows.
$ export PGDATA=/home/digoal/pgdata
$ pg_controldata
pg_control version number:            942
Catalog version number:               201510051
Database system identifier:           6318621837015461309
Database cluster state:               in production
pg_control last modified:             Sun 14 Aug 2016 06:54:51 PM CST
Latest checkpoint location:           42/E20000E0
Prior checkpoint location:            42/E2000028
Latest checkpoint's REDO location:    42/E20000E0
Latest checkpoint's REDO WAL file:    0000000100000042000000E2
Latest checkpoint's TimeLineID:       1
Latest checkpoint's PrevTimeLineID:   1
Latest checkpoint's full_page_writes: on
Latest checkpoint's NextXID:          0/3896508417
Latest checkpoint's NextOID:          10000
Latest checkpoint's NextMultiXactId:  1
Latest checkpoint's NextMultiOffset:  0
Latest checkpoint's oldestXID:        3800764117
Latest checkpoint's oldestXID's DB:   1
Latest checkpoint's oldestActiveXID:  0
Latest checkpoint's oldestMultiXid:   1
Latest checkpoint's oldestMulti's DB: 13294
Latest checkpoint's oldestCommitTsXid:0
Latest checkpoint's newestCommitTsXid:0
Time of latest checkpoint:            Sun 14 Aug 2016 06:54:51 PM CST
Fake LSN counter for unlogged rels:   0/1
Minimum recovery ending location:     0/0
Min recovery ending loc's timeline:   0
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
wal_level setting:                    minimal
wal_log_hints setting:                off
max_connections setting:              100
max_worker_processes setting:         8
max_prepared_xacts setting:           0
max_locks_per_xact setting:           64
track_commit_timestamp setting:       off
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Size of a large-object chunk:         2048
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Data page checksum version:           0

What if pg_resetxlog version and database cluster version are inconsistent
If the database cluster version is 9.5, and you used another version of pg_resetxlog to modify its control file, what will happen?
For example, you use pg_resetxlog of version 9.2 to set the pg_control file of a cluster of version 9.5.
/data/temp/pgdata/pgsql9.2/bin/pg_resetxlog -D /data/temp/pgdata/main
pg_resetxlog: pg_control exists but is broken or unknown version; ignoring it
Guessed pg_control values:

If you enable the -f option to forcefully flash the pg_control file, the version number will change to 9.2. At this time, it will fail if you try to start the database with the version 9.5 file.
But will it succeed if we start the database with version 9.2?
Of course no. Because other places in the database have also recorded the version number, namely $PGDATA and the PG_VERSION files corresponding to the data file directories of the database.
This file won’t be rewritten by pg_resetxlog, as it can be used to trace the true version.

The database will fail to start up if the control file is forcibly reset by pg_resetxlog of a different version.
What should we do?
The solution is simple. Use the pg_resetxlog of the same version of PG_VERSION to regenerate the control file.
Calculation of pg_resetxlog parameters
There are several pg_resetxlog parameters, with each used to set the next xid, next oid, next multi-xact xid and so on in the control file.
These values have fixed algorithms and security scopes. For example, you cannot set a transaction ID smaller than the actual created transaction. Otherwise, it will cause invisible data generated by transaction before the transaction ID is consumed.
For calculation of the security value, see descriptions of pg_resetxlog.
As a matter of fact, it is about inferring next xid using the file name in xlog. You can use members and offsets in pg_multixact to deduce the oldest multixact id, next multi-xact and so on.
The -o, -x, -e, -m, -O, -c and -l options allow the next OID, next transaction ID, next transaction ID's epoch, next and oldest multitransaction ID, next multitransaction offset, oldest and newest transaction IDs for which
       the commit time can be retrieved, and WAL starting address values to be set manually. These are only needed when pg_resetxlog is unable to determine appropriate values by reading pg_control. Safe values can be determined as

       ·   A safe value for the next transaction ID (-x) can be determined by looking for the numerically largest file name in the directory pg_clog under the data directory, adding one, and then multiplying by 1048576. Note that
           the file names are in hexadecimal. It is usually easiest to specify the option value in hexadecimal too. For example, if 0011 is the largest entry in pg_clog, -x 0x1200000 will work (five trailing zeroes provide the
           proper multiplier).

       ·   A safe value for the next multitransaction ID (first part of -m) can be determined by looking for the numerically largest file name in the directory pg_multixact/offsets under the data directory, adding one, and then
           multiplying by 65536. Conversely, a safe value for the oldest multitransaction ID (second part of -m) can be determined by looking for the numerically smallest file name in the same directory and multiplying by 65536. As
           above, the file names are in hexadecimal, so the easiest way to do this is to specify the option value in hexadecimal and append four zeroes.

       ·   A safe value for the next multitransaction offset (-O) can be determined by looking for the numerically largest file name in the directory pg_multixact/members under the data directory, adding one, and then multiplying
           by 52352. As above, the file names are in hexadecimal. There is no simple recipe such as the ones above of appending zeroes.

       ·   A safe value for the oldest transaction ID for which the commit time can be retrieved (first part of -c) can be determined by looking for the numerically smallest file name in the directory pg_commit_ts under the data
           directory. Conversely, a safe value for the newest transaction ID for which the commit time can be retrieved (second part of -c) can be determined by looking for the numerically greatest file name in the same directory.
           As above, the file names are in hexadecimal.

       ·   The WAL starting address (-l) should be larger than any WAL segment file name currently existing in the directory pg_xlog under the data directory. These names are also in hexadecimal and have three parts. The first part
           is the “timeline ID” and should usually be kept the same. For example, if 00000001000000320000004A is the largest entry in pg_xlog, use -l 00000001000000320000004B or higher.

               pg_resetxlog itself looks at the files in pg_xlog and chooses a default -l setting beyond the last existing file name. Therefore, manual adjustment of -l should only be needed if you are aware of WAL segment files
               that are not currently present in pg_xlog, such as entries in an offline archive; or if the contents of pg_xlog have been lost entirely.

       ·   There is no comparably easy way to determine a next OID that's beyond the largest one in the database, but fortunately it is not critical to get the next-OID setting right.

       ·   The transaction ID epoch is not actually stored anywhere in the database except in the field that is set by pg_resetxlog, so any value will work so far as the database itself is concerned. You might need to adjust this
           value to ensure that replication systems such as Slony-I and Skytools work correctly — if so, an appropriate value should be obtainable from the state of the downstream replicated database.

       The -n (no operation) option instructs pg_resetxlog to print the values reconstructed from pg_control and values about to be changed, and then exit without modifying anything. This is mainly a debugging tool, but can be
       useful as a sanity check before allowing pg_resetxlog to proceed for real.

After you have the correct value, you can set the control file through the pg_resetxlog.
How to view the correct database version from $PGDATA
Apart from pg_controldata, you can also get the correct database version by viewing the PG_VERSION file.
Using pg_resetxlog of a different version from the data file may cause corruption of control file for the data file.  And the database may fail to be started.
Solution: use pg_resetxlog of the same version of the data file to regenerate the control file and enable -f for compulsory execution.
For calculation method of pg_resetxlog parameter security value, see the reference manual of pg_resetxlog.