×
Community Blog Case Analysis of PostgreSQL Transaction Rollback

Case Analysis of PostgreSQL Transaction Rollback

This article focuses on the case analysis of a recent client report for PostgreSQL.

By Xie Guiqi (Yuanhong)

Background

Recently, a client reported that the RDS PostgreSQL could not be written. The error message is listed below:

postgres=# select * from test;
 id 
----
(0 rows)

postgres=# insert into test select 1;
ERROR:  database is not accepting commands to avoid wraparound data loss in database "xxxx"
HINT:  Stop the postmaster and vacuum that database in single-user mode.
You might also need to commit or roll back old prepared transactions.

After RDS engineers intervened, the problem was solved.

XID Principles

Definition of XID

Transaction ID (XID) is the transaction number inside PostgreSQL. Each transaction is assigned an XID, and the transaction number is incremented in sequence. Each tuple header in PostgreSQL data holds the XID for inserting or deleting the tuple. Then, the kernel constructs a consistent read of the database through this XID. When the transaction isolation level is repeatable, let's assume there are two transactions, namely xid1=200 and xid2=201. Only the tuple of t_xmin <= 200 is visible in XID1, and the tuple of t_xmin > 200 is invisible.

typedef uint32 TransactionId;  /* Transaction number definition, 32-bit unsigned integer */

typedef struct HeapTupleFields
{
  TransactionId t_xmin;    /* Insert the transaction number of the tuple */
  TransactionId t_xmax;    /* Delete or lock the transaction number of the tuple */

    /*** Other properties are omitted ***/
} HeapTupleFields;

struct HeapTupleHeaderData
{
  union
  {
    HeapTupleFields t_heap;
    DatumTupleFields t_datum;
  }      t_choice;

    /*** Other properties are omitted ***/
};

Distribution Mechanism of XID

According to the structure above, XID is a 32-bit unsigned integer that ranges from 0 to 2^32-1. What about transactions that exceed 2^32-1? After exceeding 2^32-1, XID will be allocated from the beginning. It is also proved through the source code:

// Invalid transaction number
#define InvalidTransactionId    ((TransactionId) 0)
// Boot the transaction number, which is used in the database initialization process (BKI execution)
#define BootstrapTransactionId    ((TransactionId) 1)
// Frozen transaction numbers are used to represent very old tuples, which are older than all normal transaction numbers (that is, visible)
#define FrozenTransactionId      ((TransactionId) 2)
// The first normal transaction number
#define FirstNormalTransactionId  ((TransactionId) 3)
// The lower 32 bits of the FullTransactionId are used as unsigned integers to generate XIDs
#define XidFromFullTransactionId(x)    ((uint32) (x).value)

static inline void
FullTransactionIdAdvance(FullTransactionId *dest)
{
  dest->value++;
  while (XidFromFullTransactionId(*dest) < FirstNormalTransactionId)
    dest->value++;
}

FullTransactionId
GetNewTransactionId(bool isSubXact)
{
    /*** Omit ***/
  full_xid = ShmemVariableCache->nextFullXid;
  xid = XidFromFullTransactionId(full_xid);
    /*** Omit ***/
  FullTransactionIdAdvance(&ShmemVariableCache->nextFullXid);
    /*** Omit ***
  return full_xid;
}

static void
AssignTransactionId(TransactionState s)
{
    /*** Omit ***/
  s->fullTransactionId = GetNewTransactionId(isSubXact);
  if (!isSubXact)
    XactTopFullTransactionId = s->fullTransactionId;
    /*** Omit ***/
}

TransactionId
GetTopTransactionId(void)
{
  if (!FullTransactionIdIsValid(XactTopFullTransactionId))
    AssignTransactionId(&TopTransactionStateData);
  return XidFromFullTransactionId(XactTopFullTransactionId);
}

As you can see, the new transaction number is stored in the shared variable cache: ShmemVariableCache->nextFullXid. After each transaction number is issued, adjust its value upward and skip the three special values above. The three special buttons are 0, 1, and 2. The features are shown in the code comments above.

Rollback Mechanism of XID

As mentioned earlier, XID is like a ring. It restarts from 3 after being allocated to 2^32-1. How does the kernel compare the sizes of two transactions? For example, XID has gone through such a process 3-> 2^32-1 -> 5, so how does the kernel know the transaction 5 is behind 2^32-1? Let's look at the code again:

/*
 * TransactionIdPrecedes --- is id1 logically < id2?
 */
bool
TransactionIdPrecedes(TransactionId id1, TransactionId id2)
{
  /*
   * If either ID is a permanent XID then we can just do unsigned
   * comparison.  If both are normal, do a modulo-2^32 comparison.
   */
  int32    diff;

  if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
    return (id1 < id2);

  diff = (int32) (id1 - id2);
  return (diff < 0);
}

As you can see, the kernel uses a more ingenious method. (int32) (id1-id2) < 0: The value range of 32-bit signed integers is -2 ^ 31 to 231-1, The value obtained by 5-(232-1) is larger than 2 ^ 31-1, so converting to int32 will make the value a negative number. However, there is a problem here. The latest transaction number-oldest transaction number must be less than 2^31. Once it is greater than 2^31, it will be rolled back, causing the data generated by the old transaction to be invisible to the new transaction.

Rollback Prevention of XID

As mentioned above, the latest transaction number minuses oldest transaction number must be less than 2^31. Otherwise, rollback will occur, and the data generated by the old transaction will not be visible to the new transaction. How does the kernel avoid this problem?

The kernel periodically updates the XID of the tuple generated by the old transaction to FrozenTransactionId (to 2) to recycle the XID, while the tuple whose XID is 2 is visible to all transactions. This process is called XID freezing. This way, the XID can be recycled to ensure |the latest transaction number minuses the oldest transaction number| < 2^31.

In addition to automatic kernel freezing and recycling of XID, we can manually freeze and recycle XID using commands or SQL:

  • Query the age of the database or table. Database age refers to the latest transaction number minuses oldest transaction number in the database, and table age refers to the latest transaction number minuses oldest transaction number in the table.
# View the age of each database
SELECT datname, age(datfrozenxid) FROM pg_database;

# The age sorting of each table in the database
SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by age desc; 

# View the age of one table
select oid::regclass,age(relfrozenxid) from pg_class where oid='schema name.Table name'::regclass::oid;
  • Manually freeze and recycle SQL for XID of the tuple of a table:
vacuum freeze table name;
  • Manually freeze and recycle the command of the XID of all tables in a database:
vacuumdb -d database name -freeze --jobs=30 -h connection string -p port number -U database Owner

The freezing and recycling process is a heavy I/O operation. During this process, the kernel will describe all pages of the table and then update the t_xmin field of the tuple that meets the requirements to 2. Therefore, this process needs to be performed at the low peak of business to avoid affecting the business.

There are three kernel parameters related to the freezing and recycling process: vacuum_freeze_min_age, vacuum_freeze_table_age, and autovacuum_freeze_max_age. If you are interested in them, you can search for relative information.

Solutions

Problem Analysis

Based on the principle analysis above, we know the latest transaction number minuses the oldest transaction number = 2^31-1000000. Therefore, when there are only one million available XIDs, the kernel will prohibit instance writing and report an error saying, “The database is not accepting commands to avoid wraparound data loss in database. At this time, you must connect to "xxxx" in the prompt to freeze the table and recycle more XIDs.”

void
SetTransactionIdLimit(TransactionId oldest_datfrozenxid, Oid oldest_datoid)
{
  TransactionId xidVacLimit;
  TransactionId xidWarnLimit;
  TransactionId xidStopLimit;
  TransactionId xidWrapLimit;
  TransactionId curXid;

  Assert(TransactionIdIsNormal(oldest_datfrozenxid));

  /*
     * xidWrapLimit = the oldest transaction number + 0x7FFFFFFF. Once the current transaction number reaches xidWrapLimit, it will be rolled back
   */
  xidWrapLimit = oldest_datfrozenxid + (MaxTransactionId >> 1);
  if (xidWrapLimit < FirstNormalTransactionId)
    xidWrapLimit += FirstNormalTransactionId;

  /*
     * Once the current transaction number reaches xidStopLimit, the instance will not be writable, and the XID of the 1000000 will be reserved for vacuum
     * Vacuum each table needs to occupy one XID
   */
  xidStopLimit = xidWrapLimit - 1000000;
  if (xidStopLimit < FirstNormalTransactionId)
    xidStopLimit -= FirstNormalTransactionId;

  /*
     * Once the current transaction number reaches xidWarnLimit, it will be received continuously
     * WARNING:  database "xxxx" must be vacuumed within 2740112 transactions
   */
  xidWarnLimit = xidStopLimit - 10000000;
  if (xidWarnLimit < FirstNormalTransactionId)
    xidWarnLimit -= FirstNormalTransactionId;

  /*
     * The force autovacuums will be triggered once the current transaction number reaches xidVacLimit
   */
  xidVacLimit = oldest_datfrozenxid + autovacuum_freeze_max_age;
  if (xidVacLimit < FirstNormalTransactionId)
    xidVacLimit += FirstNormalTransactionId;

  /* Grab lock for just long enough to set the new limit values */
  LWLockAcquire(XidGenLock, LW_EXCLUSIVE);
  ShmemVariableCache->oldestXid = oldest_datfrozenxid;
  ShmemVariableCache->xidVacLimit = xidVacLimit;
  ShmemVariableCache->xidWarnLimit = xidWarnLimit;
  ShmemVariableCache->xidStopLimit = xidStopLimit;
  ShmemVariableCache->xidWrapLimit = xidWrapLimit;
  ShmemVariableCache->oldestXidDB = oldest_datoid;
  curXid = XidFromFullTransactionId(ShmemVariableCache->nextFullXid);
  LWLockRelease(XidGenLock);

  /* Log the info */
  ereport(DEBUG1,
      (errmsg("transaction ID wrap limit is %u, limited by database with OID %u",
          xidWrapLimit, oldest_datoid)));

  /*
     * If the current transaction number>=oldest transaction number+autovacuum_freeze_max_age 
    * Trigger autovacuum to clean up the oldest database. If multiple databases meet the requirements, clean up them according to the age sorting of database. 
   * Mark autovacuum again after the end of the current autovacuum by setting the flag bit
     */
  if (TransactionIdFollowsOrEquals(curXid, xidVacLimit) &&
    IsUnderPostmaster && !InRecovery)
    SendPostmasterSignal(PMSIGNAL_START_AUTOVAC_LAUNCHER);

  /* Give an immediate warning if past the wrap warn point */
  if (TransactionIdFollowsOrEquals(curXid, xidWarnLimit) && !InRecovery)
  {
    char     *oldest_datname;

    if (IsTransactionState())
      oldest_datname = get_database_name(oldest_datoid);
    else
      oldest_datname = NULL;

    if (oldest_datname)
      ereport(WARNING,
          (errmsg("database \"%s\" must be vacuumed within %u transactions",
              oldest_datname,
              xidWrapLimit - curXid),
           errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
               "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
    else
      ereport(WARNING,
          (errmsg("database with OID %u must be vacuumed within %u transactions",
              oldest_datoid,
              xidWrapLimit - curXid),
           errhint("To avoid a database shutdown, execute a database-wide VACUUM in that database.\n"
               "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
  }
}

bool
TransactionIdFollowsOrEquals(TransactionId id1, TransactionId id2)
{
  int32    diff;
  if (!TransactionIdIsNormal(id1) || !TransactionIdIsNormal(id2))
    return (id1 >= id2);

  diff = (int32) (id1 - id2);
  return (diff >= 0);
}

FullTransactionId
GetNewTransactionId(bool isSubXact)
{
    /*** Omit ***/
  full_xid = ShmemVariableCache->nextFullXid;
  xid = XidFromFullTransactionId(full_xid);

  if (TransactionIdFollowsOrEquals(xid, ShmemVariableCache->xidVacLimit))
  {
    TransactionId xidWarnLimit = ShmemVariableCache->xidWarnLimit;
    TransactionId xidStopLimit = ShmemVariableCache->xidStopLimit;
    TransactionId xidWrapLimit = ShmemVariableCache->xidWrapLimit;
    Oid      oldest_datoid = ShmemVariableCache->oldestXidDB;

        /*** Omit ***/
    if (IsUnderPostmaster &&
      TransactionIdFollowsOrEquals(xid, xidStopLimit))
    {
      char     *oldest_datname = get_database_name(oldest_datoid);

      /* complain even if that DB has disappeared */
      if (oldest_datname)
        ereport(ERROR,
            (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
             errmsg("database is not accepting commands to avoid wraparound data loss in database \"%s\"",
                oldest_datname),
             errhint("Stop the postmaster and vacuum that database in single-user mode.\n"
                 "You might also need to commit or roll back old prepared transactions, or drop stale replication slots.")));
            /*** Omit ***/
    }
        /*** Omit ***/
  }
    /*** Omit ***/
}

Location of the Problem

# View the age of each database
SELECT datname, age(datfrozenxid) FROM pg_database;

# The age sorting of each table in the database
SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by age desc; 

# View the age of one table
select oid::regclass,age(relfrozenxid) from pg_class where oid='schema name. Table name':regclass::oid;

Solutions to the Problems

  1. Find the oldest database through the first SQL above. The database age refers to |the latest transaction number minuses the oldest transaction number in the database|.
  2. Find the oldest table through the second SQL above and then execute. Vacuum freeze table name on the table to freeze and recycle the old transaction number in the table. The table age refers to |the latest transaction number minuses the oldest transaction number in the table|.
  3. O&M Scripts
  • Single-Process Shell Scripts
# Perform vacuum freeze on the top 50 oldest tables in the specified database

for cmd in `psql -U username-p port number -h connection string -d database name -c "SELECT 'vacuum freeze '||c.oid::regclass||';' as vacuum_cmd FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by greatest(age(c.relfrozenxid),age(t.relfrozenxid)) desc offset 50 limit 50;" | grep -v vacuum_cmd  | grep -v row | grep vacuum`; do
    psql  -U username -p port number -h connection string -d database name -c "$cmd"
done
  • Multi-Process Python Scripts
from multiprocessing import Pool
import psycopg2

args = dict(host='pgm-bp10xxxx.pg.rds.aliyuncs.com', port=5432, dbname='database name',
            user='username', password='password')

def vacuum_handler(sql):
    sql_str = "SELECT c.oid::regclass as table_name, greatest(age(c.relfrozenxid),age(t.relfrozenxid)) as age FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by age desc limit 10; "
    try:
        conn = psycopg2.connect(**args)
        cur = conn.cursor()
        cur.execute(sql)
        conn.commit()
        cur = conn.cursor()
        cur.execute(sql_str)
        print cur.fetchall()
        conn.close()
    except Exception as e:
        print str(e)

# Perform vacuum freeze on the top oldest 1000 tables in the specified database, and execute 32 processes concurrently
def multi_vacuum():
    pool = Pool(processes=32)
    sql_str = "SELECT 'vacuum freeze '||c.oid::regclass||';' as vacuum_cmd FROM pg_class c LEFT JOIN pg_class t ON c.reltoastrelid = t.oid WHERE c.relkind IN ('r', 'm') order by greatest(age(c.relfrozenxid),age(t.relfrozenxid)) desc limit 1000;";
    try:
        conn = psycopg2.connect(**args)
        cur = conn.cursor()
        cur.execute(sql_str)
        rows = cur.fetchall()
        for row in rows:
            cmd = row['vacuum_cmd']
            pool.apply_async(vacuum_handler, (cmd ))
        conn.close()
        pool.close()
        pool.join()
    except Exception as e:
        print str(e)


multi_vacuum()

Note:

The vacuum freeze scans all pages of the table and updates them. It is a heavy I/O operation. The concurrency must be controlled during the operation. Otherwise, it is very easy to hang up instances.

0 0 0
Share on

ApsaraDB

197 posts | 17 followers

You may also like

Comments