前段时间整理过一篇《有哪些技术可以减少PostgreSQL/openGauss数据库的存储空间?》,记录过postgresql系数据库出现的膨胀表(索引也一样)可能会导致数据库空间浪费,在openGauss中发现存在一个现象,比如对一张几千万行的table做过千万级大事务更新或平时更新比例较多时,autovacuum的清理并不理想,导致出现几十倍的空间膨胀,记录一则处理案例。
背景:
from Pavan Deolasee PGCon 2008
PostgreSQL uses MVCC (Multi Version Concurrency Control) for transaction semantics
The good things:
Readers don’t wait for writers
Writer doesn’t wait for readers
Highly concurrent access and no locking overhead
The bad things:
Multiple versions of a row are created
The older, dead versions can not be easily removed because indexes don’t have visibility information
Maintenance overhead to reduce table/index bloat
查看TOP table
从pg_stat_user_tables或pg_stat_all_tables的n_dead_tup统计信息排序,工,或使用pg_stat_get_dead_tuples
funtion。更多在手册中可以找到。
SELECT pg_stat_get_live_tuples(c.oid) AS n_live_tup , pg_stat_get_dead_tuples(c.oid) AS n_dead_tup FROM pg_class c; SELECT relid,schemaname,relname,n_live_tup,n_dead_tup,last_autovacuum,last_analyze,last_data_changed FROM pg_stat_all_tables WHERE schemaname NOT IN ('pg_catalog', 'information_schema') order by n_dead_tup desc limit 10 ;
n_dead_tup :dead tup是已删除的行,dead tup可以通过 VACUUM FULL(或者如果幸运的话,普通 VACUUM)或类似的表操作来删除,从而相应地缩小物理大小。每当删除或更新一行时,旧行版本对于该事务提交后开始的所有其他事务都变得不可见 ( 使用默认的 READ COMMITTED 事务隔离)。 一旦不再有未提交的旧事务,该行就完全死亡。 这是 PostgreSQL 的 MVCC 模型处理并发所必需的。
TOAST对象
schemaname | relname | n_dead_tup ------------+--------------------------+------------ pg_toast | pg_toast_part_1806609843 | 2959872769 pg_toast | pg_toast_part_1806609840 | 2034090877 pg_toast | pg_toast_1806609253 | 76601524 pg_toast | pg_toast_1806603161 | 44095962 sssssss | xxxxxxxxxxxxxxx | 16379804 pg_toast | pg_toast_1806604865 | 10065268 select relname,relfilenode,reltoastrelid from pg_class where oid=1806609253; select relname,parttype,parentid,rangenum,partstrategy from pg_partition where oid=1806609843;
这里又涉及了TOAST对象,TOAST 是“ The Oversized-Attribute Storage Technique ”的缩写, relname的命名是pg_toast_<OID>,如果是分区表对应的是pg_toast_part_<OID>,PG 不允许一行数据跨页(BLOCK)存储,那么对于超长的行数据(如超过默认的8KB block size),PG 就会启动 TOAST ,具体就是采用压缩和切片的方式。如果启用了切片,实际数据存储在另一张系统表的多个行中,这张表就叫 TOAST 表,这种存储方式叫行外存储.
分区表大小
select relname,pg_size_pretty(pg_partition_size('anbob_com_table',relname)),parentid from pg_partition where parentid=1806609735 and parttype='p' order by 1; relname | pg_size_pretty | parentid ------------------------+----------------+------------ anbob_com_table_201901 | 13 GB | 1806609735 anbob_com_table_201904 | 13 GB | 1806609735 anbob_com_table_201907 | 14 GB | 1806609735 anbob_com_table_201910 | 11 GB | 1806609735 anbob_com_table_202001 | 5623 MB | 1806609735 anbob_com_table_202004 | 10 GB | 1806609735 anbob_com_table_202007 | 11 GB | 1806609735 anbob_com_table_202010 | 8572 MB | 1806609735 anbob_com_table_202101 | 6607 MB | 1806609735 anbob_com_table_202104 | 9830 MB | 1806609735 anbob_com_table_202107 | 11 GB | 1806609735 anbob_com_table_202110 | 9228 MB | 1806609735 anbob_com_table_202201 | 5831 MB | 1806609735 anbob_com_table_202204 | 9721 MB | 1806609735 anbob_com_table_202207 | 9428 MB | 1806609735 anbob_com_table_202210 | 8163 MB | 1806609735 anbob_com_table_202301 | 7432 MB | 1806609735 anbob_com_table_202304 | 429 GB | 1806609735 <<<<<< anbob_com_table_202307 | 550 GB | 1806609735 <<<<<< (19 rows)
Note:
可见202304分区是我们迁移到openGauss系后的时间,后面膨胀非常严重。相比前几个月大了近40多倍,可以count有效记录并没有业务量级的增长。
auto vacuum日志
2023-09-13 14:38:40 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0 0 [VACUUM] LOG: normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622657833, FreezeLimit 3, freezeTableLimit 3 2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0 0 [VACUUM] LOG: "anbob_com_table": found 0 removable, 0 nonremovable row versions in 0 out of 276268 pages 2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0 0 [BACKEND] LOG: automatic vacuum of table "anbob_com.anbob_com_table": could not (re)acquire exclusive lock for truncate scan 2023-09-13 14:39:02 CST [47987247023872]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=65015502.3849,xid=0 0 [BACKEND] LOG: automatic vacuum of table "anbob_com.anbob_com_table": index scans: 0 2023-09-13 14:39:05 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0 0 [VACUUM] LOG: normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622657833, FreezeLimit 3, freezeTableLimit 3 2023-09-13 14:43:30 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0 0 [VACUUM] LOG: "anbob_com_table": found 2475 removable, 251395 nonremovable row versions in 76579 out of 278221 pages 2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0 0 [BACKEND] LOG: automatic vacuum of table "anbob_com.anbob_com_table": could not (re)acquire exclusive lock for truncate scan 2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0 0 [BACKEND] LOG: automatic vacuum of table "anbob_com.anbob_com_table": index scans: 1 2023-09-13 14:43:34 CST [47994093831936]: user=anbob,db=anbob_com,app=dn_6001,client=localhost,sid=650158d6.3848,xid=0 0 [VACUUM] LOG: normally vacuum rel "anbob_com.anbob_com_table" freeze 2000000000 OldestXmin 622684051, FreezeLimit 3, freezeTableLimit 3
控制文件
$ pg_controldata pg_control version number: 923 Catalog version number: 201611171 Database system identifier: 13553184229945870278 Database cluster state: in archive recovery pg_control last modified: Wed 13 Sep 2023 02:59:28 PM CST Latest checkpoint location: 2298/BCF7FA88 Prior checkpoint location: 2298/BBBF7C40 Latest checkpoint's REDO location: 2298/BBBDFE28 Latest checkpoint's TimeLineID: 1 Latest checkpoint's full_page_writes: off Latest checkpoint's NextXID: 622807106 Latest checkpoint's NextOID: 3423305728 Latest checkpoint's NextMultiXactId: 177 Latest checkpoint's NextMultiOffset: 351 Latest checkpoint's oldestXID: 13017 Latest checkpoint's oldestXID's DB: 15935 Latest checkpoint's oldestActiveXID: 622804693 Latest checkpoint's remove lsn: 0/2294BA Time of latest checkpoint: Wed 13 Sep 2023 02:58:38 PM CST Minimum recovery ending location: 2298/C3905A98 Backup start location: 0/0 Backup end location: 0/0 End-of-backup record required: no Current wal_level setting: logical Current max_connections setting: 3050 Current max_prepared_xacts setting: 3000 Current max_locks_per_xact setting: 256 Maximum data alignment: 8 Database block size: 8192 Blocks per segment of large relation: 131072 WAL block size: 8192 Bytes per WAL segment: 16777216 Maximum length of identifiers: 64 Maximum columns in an index: 32 Maximum size of a TOAST chunk: 1996 Date/time type storage: 64-bit integers Float4 argument passing: by value Float8 argument passing: by value Database system TimeLine: 30
交换分区
create table anbob_com.anbob_com_table_p07 as select * from anbob_com.anbob_com_table partition (anbob_com_table_202307) -- add not null -- create index same as orginal table. alter table anbob_com.anbob_com_table exchange partition (anbob_com_table_202304) with table anbob_com.anbob_com_table_p04 without validation ; \dt+ anbob_com.anbob_com_table_p07 List of relations Schema | Name | Type | Owner | Size | Storage | Description ---------+---------------------+-------+---------+--------+----------------------------------+------------- anbob_com | anbob_com_table_p07 | table | anbob_com | 552 GB | {orientation=row,compression=no} | truncate table anbob_com.anbob_com_table_p07; drop table anbob_com.anbob_com_table_p07;
Note:
确认历史月份分区为静态数据后,可以使用交换分区的方式收缩物理大小,因为vacuum无法收回,vacuum full(类似oracle的alte table move)同样无法online要加锁影响业务并且需要一倍的空间放临时表,对于当前在线业务且空间紧张,交换分区优选。但交换分区的alter table需要排他锁,如果有堵塞用户可以临时kill。
select pg_cancel_backend(locker_pid) from dba.ase where locked_query like 'alter table anbob_com.anbob_com_table%';
注意: 当表truncate ,drop 都e有可能不会在OS 层释放空间存储,trunate还可能找不到文件句柄。 mogdb是线程模式,在主进程下可以找deleted 句柄,但无法释放, 需要重启DB,或等使用的线程退出。
延迟备库
以上操作注意做好备份,确认在数据一致后再trucate table或drop table. 我们有配置延迟同步备库,可以防范误操作,但是如果备库意外crash,启动备库里如果使用gs_ctl start -D 没有指定database role, 会直接启动数据库到primary role激活, failover无法再回退,记住增加 -M standby,这是一个很危险的涉及, postgresql不会出现,因为它是有配置文件静态. 控制文件显示的不一定是replay最新的读的xlog,建议使用函数查询pg_last_xlog_replay_location或 pg_last_xact_replay_timestamp(). 对于延迟备库无法指定要应用到的截止位置, 如果做了误操作,需要使用延迟备库找回,需要有主库操作的时间或位置,采用提前move走xlog的方式,或使用select pg xlog replay_pause暂停应用,pg_xlog_replay_resume恢复已暂停的应用手动逐渐释放,但可能速度不可控.
对于备库应用到的位置应该是参考控制文件,控制文件修改有严格的校验,但我们有一个非常规的方法和工具,测试可以修改控制文件中的位置,后期等待验证恢复误激活(failover)的standby备库.
— over —