案例:openGauss/postgreSQL 数据库手动清理膨胀Heap Bloat (dead tup)

发表于:2023年11月22日 17点40分 0 阅读 1评论 3点赞

前段时间整理过一篇《有哪些技术可以减少PostgreSQL/openGauss数据库的存储空间?》,记录过postgresql系数据库出现的膨胀表(索引也一样)可能会导致数据库空间浪费,在openGauss中发现存在一个现象,比如对一张几千万行的table做过千万级大事务更新或平时更新比例较多时,autovacuum的清理并不理想,导致出现几十倍的空间膨胀,记录一则处理案例。

背景:
from Pavan Deolasee PGCon 2008

PostgreSQL uses MVCC (Multi Version Concurrency Control) for transaction semantics
The good things:
Readers don’t wait for writers
Writer doesn’t wait for readers
Highly concurrent access and no locking overhead
The bad things:
Multiple versions of a row are created
The older, dead versions can not be easily removed because indexes don’t have visibility information
Maintenance overhead to reduce table/index bloat

查看TOP table

从pg_stat_user_tables或pg_stat_all_tables的n_dead_tup统计信息排序,工,或使用pg_stat_get_dead_tuples funtion。更多在手册中可以找到。

SELECT pg_stat_get_live_tuples(c.oid) AS n_live_tup
     , pg_stat_get_dead_tuples(c.oid) AS n_dead_tup
FROM   pg_class c;

 SELECT relid,schemaname,relname,n_live_tup,n_dead_tup,last_autovacuum,last_analyze,last_data_changed
   FROM pg_stat_all_tables
   WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
   order by n_dead_tup desc limit 10 ;

n_dead_tup :dead tup是已删除的行,dead tup可以通过 VACUUM FULL(或者如果幸运的话,普通 VACUUM)或类似的表操作来删除,从而相应地缩小物理大小。每当删除或更新一行时,旧行版本对于该事务提交后开始的所有其他事务都变得不可见 ( 使用默认的 READ COMMITTED 事务隔离)。 一旦不再有未提交的旧事务,该行就完全死亡。 这是 PostgreSQL 的 MVCC 模型处理并发所必需的。

TOAST对象

 schemaname |         relname          | n_dead_tup 
------------+--------------------------+------------
 pg_toast   | pg_toast_part_1806609843 | 2959872769
 pg_toast   | pg_toast_part_1806609840 | 2034090877
 pg_toast   | pg_toast_1806609253      |   76601524
 pg_toast   | pg_toast_1806603161      |   44095962
 sssssss    | xxxxxxxxxxxxxxx          |   16379804
 pg_toast   | pg_toast_1806604865      |   10065268

select relname,relfilenode,reltoastrelid from pg_class where oid=1806609253;

select relname,parttype,parentid,rangenum,partstrategy from pg_partition where oid=1806609843;

这里又涉及了TOAST对象,TOAST 是“ The Oversized-Attribute Storage Technique ”的缩写, relname的命名是pg_toast_<OID>,如果是分区表对应的是pg_toast_part_<OID>,PG 不允许一行数据跨页(BLOCK)存储,那么对于超长的行数据(如超过默认的8KB block size),PG 就会启动 TOAST ,具体就是采用压缩和切片的方式。如果启用了切片,实际数据存储在另一张系统表的多个行中,这张表就叫 TOAST 表,这种存储方式叫行外存储.

分区表大小

select relname,pg_size_pretty(pg_partition_size('anbob_com_table',relname)),parentid 
from  pg_partition 
where parentid=1806609735 and parttype='p' order by 1;
        relname         | pg_size_pretty |  parentid  
------------------------+----------------+------------
 anbob_com_table_201901 | 13 GB          | 1806609735
 anbob_com_table_201904 | 13 GB          | 1806609735
 anbob_com_table_201907 | 14 GB          | 1806609735
 anbob_com_table_201910 | 11 GB          | 1806609735
 anbob_com_table_202001 | 5623 MB        | 1806609735
 anbob_com_table_202004 | 10 GB          | 1806609735
 anbob_com_table_202007 | 11 GB          | 1806609735
 anbob_com_table_202010 | 8572 MB        | 1806609735
 anbob_com_table_202101 | 6607 MB        | 1806609735
 anbob_com_table_202104 | 9830 MB        | 1806609735
 anbob_com_table_202107 | 11 GB          | 1806609735
 anbob_com_table_202110 | 9228 MB        | 1806609735
 anbob_com_table_202201 | 5831 MB        | 1806609735
 anbob_com_table_202204 | 9721 MB        | 1806609735
 anbob_com_table_202207 | 9428 MB        | 1806609735
 anbob_com_table_202210 | 8163 MB        | 1806609735
 anbob_com_table_202301 | 7432 MB        | 1806609735
 anbob_com_table_202304 | 429 GB         | 1806609735  <<<<<<
 anbob_com_table_202307 | 550 GB         | 1806609735  <<<<<<
(19 rows)

Note:
可见202304分区是我们迁移到openGauss系后的时间,后面膨胀非常严重。相比前几个月大了近40多倍,可以count有效记录并没有业务量级的增长。

auto vacuum日志

2023-09-13 14:38:40 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0  0 [VACUUM] LOG:  normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622657833, FreezeLimit 3, freezeTableLimit 3
2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0  0 [VACUUM] LOG:  "anbob_com_table": found 0 removable, 0 nonremovable row versions in 0 out of 276268 pages
2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0  0 [BACKEND] LOG:  automatic vacuum of table "anbob_com.anbob_com_table": could not (re)acquire exclusive lock for truncate scan
2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0  0 [BACKEND] LOG:  automatic vacuum of table "anbob_com.anbob_com_table": index scans: 0
2023-09-13 14:39:05 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0  0 [VACUUM] LOG:  normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622657833, FreezeLimit 3, freezeTableLimit 3

2023-09-13 14:43:30 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0  0 [VACUUM] LOG:  "anbob_com_table": found 2475 removable, 251395 nonremovable row versions in 76579 out of 278221 pages
2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0  0 [BACKEND] LOG:  automatic vacuum of table "anbob_com.anbob_com_table": could not (re)acquire exclusive lock for truncate scan
2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0  0 [BACKEND] LOG:  automatic vacuum of table "anbob_com.anbob_com_table": index scans: 1
2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0  0 [VACUUM] LOG:  normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622684051, FreezeLimit 3, freezeTableLimit 3

控制文件

$ pg_controldata 
pg_control version number:            923
Catalog version number:               201611171
Database system identifier:           13553184229945870278
Database cluster state:               in archive recovery
pg_control last modified:             Wed 13 Sep 2023 02:59:28 PM CST
Latest checkpoint location:           2298/BCF7FA88
Prior checkpoint location:            2298/BBBF7C40
Latest checkpoint's REDO location:    2298/BBBDFE28
Latest checkpoint's TimeLineID:       1
Latest checkpoint's full_page_writes: off
Latest checkpoint's NextXID:          622807106
Latest checkpoint's NextOID:          3423305728
Latest checkpoint's NextMultiXactId:  177
Latest checkpoint's NextMultiOffset:  351
Latest checkpoint's oldestXID:        13017
Latest checkpoint's oldestXID's DB:   15935
Latest checkpoint's oldestActiveXID:  622804693
Latest checkpoint's remove lsn:          0/2294BA
Time of latest checkpoint:            Wed 13 Sep 2023 02:58:38 PM CST
Minimum recovery ending location:     2298/C3905A98
Backup start location:                0/0
Backup end location:                  0/0
End-of-backup record required:        no
Current wal_level setting:            logical
Current max_connections setting:      3050
Current max_prepared_xacts setting:   3000
Current max_locks_per_xact setting:   256
Maximum data alignment:               8
Database block size:                  8192
Blocks per segment of large relation: 131072
WAL block size:                       8192
Bytes per WAL segment:                16777216
Maximum length of identifiers:        64
Maximum columns in an index:          32
Maximum size of a TOAST chunk:        1996
Date/time type storage:               64-bit integers
Float4 argument passing:              by value
Float8 argument passing:              by value
Database system TimeLine:             30

交换分区

create table anbob_com.anbob_com_table_p07 as select * from anbob_com.anbob_com_table partition (anbob_com_table_202307)

-- add not null 
-- create index same as orginal table.

alter table anbob_com.anbob_com_table exchange partition (anbob_com_table_202304) with table anbob_com.anbob_com_table_p04 without validation  ;

\dt+ anbob_com.anbob_com_table_p07
                                             List of relations
 Schema  |        Name         | Type  |  Owner  |  Size  |             Storage              | Description 
---------+---------------------+-------+---------+--------+----------------------------------+-------------
 anbob_com | anbob_com_table_p07 | table | anbob_com | 552 GB | {orientation=row,compression=no} | 

truncate table anbob_com.anbob_com_table_p07;

drop table anbob_com.anbob_com_table_p07;

Note:
确认历史月份分区为静态数据后,可以使用交换分区的方式收缩物理大小,因为vacuum无法收回,vacuum full(类似oracle的alte table move)同样无法online要加锁影响业务并且需要一倍的空间放临时表,对于当前在线业务且空间紧张,交换分区优选。但交换分区的alter table需要排他锁,如果有堵塞用户可以临时kill。

select pg_cancel_backend(locker_pid) from dba.ase where locked_query like 'alter table anbob_com.anbob_com_table%';

注意: 当表truncate ,drop 都e有可能不会在OS 层释放空间存储,trunate还可能找不到文件句柄。 mogdb是线程模式,在主进程下可以找deleted 句柄,但无法释放, 需要重启DB,或等使用的线程退出。

延迟备库
以上操作注意做好备份,确认在数据一致后再trucate table或drop table. 我们有配置延迟同步备库,可以防范误操作,但是如果备库意外crash,启动备库里如果使用gs_ctl start -D 没有指定database role, 会直接启动数据库到primary role激活, failover无法再回退,记住增加 -M standby,这是一个很危险的涉及, postgresql不会出现,因为它是有配置文件静态. 控制文件显示的不一定是replay最新的读的xlog,建议使用函数查询pg_last_xlog_replay_location或 pg_last_xact_replay_timestamp(). 对于延迟备库无法指定要应用到的截止位置, 如果做了误操作,需要使用延迟备库找回,需要有主库操作的时间或位置,采用提前move走xlog的方式,或使用select pg xlog replay_pause暂停应用,pg_xlog_replay_resume恢复已暂停的应用手动逐渐释放,但可能速度不可控.

对于备库应用到的位置应该是参考控制文件,控制文件修改有严格的校验,但我们有一个非常规的方法和工具,测试可以修改控制文件中的位置,后期等待验证恢复误激活(failover)的standby备库.

— over —

{{c.name}} {{c.create_time|simymdhm}} {{c.like_num}}
{{c.content}}