本文将介绍对PENDING任务进行适当处理的实践方法。

背景信息

新的DDL任务引擎启用时,当DDL执行失败或者被意外中断后,对应的DDL任务会处于PENDING待处理的状态。此时必须对该PENDING状态进行合适的任务处理,才能解除PENDING状态并恢复正常访问,否则后续的DDL将会被禁止执行并报错。

处理原则

  • 您可以通过SHOW [FULL] DDL语句查看DDL任务的信息和失败原因(即REMARK字段记录的异常信息)。
  • 您也可以参考如下常用的处理方式(仅供参考,请根据实际情况选择最适合的方式):
    • 分析失败原因,修复或排除导致失败的因素(例如是由于数据问题导致的任务失败,则您可以通过去重等方式订正数据。如果是由于其它约束导致的失败,请确认是否能够去掉约束等)。修复完成后,使用RECOVER DDL恢复该PENDING任务。
    • 如果导致失败的因素无法解除,并且DDL因失败而不能真正执行,您可以使用REMOVE DDL删除任务(务必确认DDL没有真正执行才可删除,否则可能会造成状态不一致),删除后恢复可访问状态。
    • 如果您想直接删除DDL任务失败的表(比如表中无数据,可以直接删除重建),您可以使用REMOVE DDL删除任务,然后再执行DROP TABLE IF EXISTS删除表(务必确认表中无数据,或者数据可以丢弃,并且DROP TABLE一定要指定IF EXISTS语法,确保强制删除)。

示例

如下示例展示了对处于PENDING状态的DDL任务进行处理的过程。

  1. 建表时没有指定主键,并且插入了带有重复值的数据行(ID=1有两行数据):
    mysql> create table test_pending (id int not null, age int) dbpartition by hash(id);
    Query OK, 0 rows affected (0.33 sec)
    mysql> insert into test_pending values(1,10),(1,20),(2,20),(3,30);
    Query OK, 4 rows affected (0.10 sec)
    mysql> select * from test_pending order by id;
    +------+------+
    | id   | age  |
    +------+------+
    |    1 |   10 |
    |    1 |   20 |
    |    2 |   20 |
    |    3 |   30 |
    +------+------+
    4 rows in set (0.10 sec)
  2. 之后想为上述ID加上主键约束,但由于表中已有数据违反了唯一性约束,因此DDL执行失败:
    mysql> alter table test_pending add primary key (id);
    ERROR 4636 (HY000): [f5be83373466000][10.81.69.55:3306][ddltest]ERR-CODE: [TDDL-4636][ERR_DDL_JOB_ERROR] Not all physical operations have been done successfully: expected 9,
    but done 8. Caused by: 1062:DDLTEST_1562056402230OYMK_7WW7_0001:Duplicate entry '1' for key 'PRIMARY' on `test_pending`;.
                
  3. 通过SHOW FULL DDL语句查看任务状态和失败原因,发现其中一个物理表中的数据有重复值导致了物理DDL执行失败:
    mysql> show full ddl\G
    *************************** 1. row ***************************
              JOB_ID: 1106733441212637184
       PARENT_JOB_ID: 0
              SERVER: 1:102:10.81.69.55
       OBJECT_SCHEMA: ddltest
         OBJECT_NAME: test_pending
     NEW_OBJECT_NAME:
            JOB_TYPE: ALTER_TABLE
               PHASE: EXECUTE
               STATE: PENDING
            PROGRESS: 77%
          START_TIME: 2019-09-06 17:17:55.002
            END_TIME: 2019-09-06 17:17:55.273
    ELAPSED_TIME(MS): 271
            DDL_STMT: alter table test_pending add primary key (id)
              REMARK: ERR-CODE: [TDDL-4636][ERR_DDL_JOB_ERROR] Not all physical operations have been done successfully: expected 9, but done 8. Caused by: 1062:DDLTEST_1562056402
    230OYMK_7WW7_0001:Duplicate entry '1' for key 'PRIMARY' on `test_pending`;.
                
    REMARK字段中的详细信息解释如下
    • Not all physical operations have been done successfully: expected 9, but done 8.:该逻辑表的DDL涉及到9个物理DDL的执行,完成了8个,有1个失败了,这个失败的物理DDL导致整个逻辑DDL失败,任务被置于 PENDING状态。
    • Caused by: 1062:DDLTEST_1562056402 230OYMK_7WW7_0001:Duplicate entry '1' for key 'PRIMARY' on 'test_pending';:失败的根源在于DDLTEST_1562056402 230OYMK_7WW7_0001物理库的test_pending物理表中有ID字段的重复数据1,导致无法添加主键约束。
  4. 此时逻辑表处于不一致的状态:
    mysql> check table test_pending;
    +----------------------------------------+-------+----------+-------------------------------------------------------------------------------------------------------------+
    | TABLE                                  | OP    | MSG_TYPE | MSG_TEXT                                                                                                    |
    +----------------------------------------+-------+----------+-------------------------------------------------------------------------------------------------------------+
    | ddltest_1562056402230oymk.test_pending | check | Error    | Table 'DDLTEST_1562056402230OYMK_7WW7_0001.test_pending' find incorrect columns 'id', please recreate table |
    +----------------------------------------+-------+----------+-------------------------------------------------------------------------------------------------------------+
    1 row in set (0.04 sec)
                
  5. 执行其它DDL也会被禁止,返回错误信息:
    mysql> drop table test_pending;
    ERROR 4644 (HY000): [f5beae39d466000][10.81.69.55:3306][ddltest]ERR-CODE: [TDDL-4644][ERR_PENDING_DDL_JOB_EXISTS] Another DDL job '1106733441212637184' with operation 'ALTER_
    TABLE' is pending on ddltest.test_pending in ddltest. Please use SHOW DDL to check it, and then recover or rollback it using RECOVER DDL or ROLLBACK DDL, or just remove it us
    ing REMOVE DDL if you confirm that the pending job can be discarded.
                
  6. 接下来,根据前面所述的常见的处理方式,您可以选择以下几种方式继续进行处理(分别展示它们的效果):
    • 去重(删除重复的数据)后,恢复DDL任务,继续完成添加主键约束的操作。
      1. 删除重复数据(根据业务需要,仅保留一条数据),删除数据操作可以通过DRDS执行,也可以根据报错信息,直接连接到DRDS后端的RDS物理库中操作:
        mysql> delete from test_pending where id=1 and age=20;
        Query OK, 1 row affected (0.07 sec)
        mysql> select * from test_pending order by id;
        +------+------+
        | id   | age  |
        +------+------+
        |    1 |   10 |
        |    2 |   20 |
        |    3 |   30 |
        +------+------+
        3 rows in set (0.02 sec)
                                    
      2. 确认表中已经没有重复数据后,恢复之前PENDING的DDL任务的执行(恢复成功,完成的任务被自动清理,主键添加成功):
        mysql> recover ddl 1106733441212637184;
        Query OK, 0 rows affected (1.28 sec)
        mysql> show full ddl\G
        Empty set (0.00 sec)
        mysql> show create table test_pending\G
        *************************** 1. row ***************************
               Table: test_pending
        Create Table: CREATE TABLE `test_pending` (
          `id` int(11) NOT NULL,
          `age` int(11) DEFAULT NULL,
          PRIMARY KEY (`id`),
          KEY `auto_shard_key_id` (`id`) USING BTREE
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8 dbpartition by hash(`id`)
        1 row in set (0.02 sec)
        mysql> check table test_pending;
        +----------------------------------------+-------+----------+----------+
        | TABLE                                  | OP    | MSG_TYPE | MSG_TEXT |
        +----------------------------------------+-------+----------+----------+
        | ddltest_1562056402230oymk.test_pending | check | status   | OK       |
        +----------------------------------------+-------+----------+----------+
        1 row in set (0.10 sec)
                    
    • 直接删除任务,然后删除表(测试数据可以丢弃),后续再根据需要重新创建该表。
      mysql> remove ddl 1106733441212637184;
      Query OK, 1 row affected (0.02 sec)
      mysql> drop table if exists test_pending;
      Query OK, 0 rows affected (0.44 sec)
      mysql> show tables like 'test_pending';
      Empty set (0.01 sec)