PostgreSQL primary/standby switchover

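Preface

In a two-node PostgreSQL high-availability architecture, when the primary goes down the HA system promotes the standby to be the new primary so that service continues. The original primary can then either be wiped and rebuilt as a new standby, or demoted in place to a standby.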

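Trigger methods

With PostgreSQL hot standby, when the primary fails the standby has to be promoted to take over. There are two ways to do this:

Set trigger_file in the standby's recovery.conf. It names a trigger file, and the standby is promoted as soon as that file exists.
Run pg_ctl promote on the standby.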

File trigger

If trigger_file is set in the standby's recovery.conf, for example:

trigger_file = '/var/lib/pgsql/10/data/trigger_standby'

then creating the file /var/lib/pgsql/10/data/trigger_standby promotes the standby.

$ touch /var/lib/pgsql/10/data/trigger_standby

Command trigger

Running pg_ctl promote on the standby promotes it.

$ pg_ctl promote
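
The two trigger methods above can be wrapped in one helper. This is a minimal sketch, not part of the original procedure: the function name promote_standby and its arguments are hypothetical, and it assumes pg_ctl is on the PATH of the postgres OS user.

```shell
# promote_standby DATA_DIR [TRIGGER_FILE]   (hypothetical helper)
# With a trigger file path: create the file named by trigger_file in
# recovery.conf and let the standby notice it.
# Without one: ask pg_ctl to promote the instance in DATA_DIR.
promote_standby() {
    data_dir=$1
    trigger_file=${2:-}
    if [ -n "$trigger_file" ]; then
        touch "$trigger_file"
    else
        pg_ctl promote -D "$data_dir"
    fi
}
```

For example, promote_standby /var/lib/pgsql/10/data uses the command trigger, while promote_standby /var/lib/pgsql/10/data /var/lib/pgsql/10/data/trigger_standby uses the file trigger.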

Primary/standby switchover

State indicators

The pg_stat_replication system view, the pg_controldata command, the process list, and the built-in function pg_is_in_recovery() can each be used to tell whether an instance is a primary or a standby.

A primary reports in production

$ pg_controldata | grep 'Database cluster state'
Database cluster state:               in production

A standby reports in archive recovery

$ pg_controldata | grep 'Database cluster state'
Database cluster state:               in archive recovery
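
For scripting, the pg_controldata check can be reduced to a single word. The sketch below is an illustration only: cluster_role is a hypothetical name, and it assumes pg_controldata prints English messages (for example when run with LC_ALL=C). Any state other than the two shown above, such as shut down, is reported as unknown.

```shell
# cluster_role: read pg_controldata output on stdin and print
# primary, standby, or unknown.
cluster_role() {
    state=$(sed -n 's/^Database cluster state:[[:space:]]*//p')
    case "$state" in
        "in production")       echo primary ;;
        "in archive recovery") echo standby ;;
        *)                     echo unknown ;;
    esac
}
```

Usage would be pg_controldata -D /var/lib/pgsql/10/data | cluster_role. On a running instance, psql -At -c 'SELECT pg_is_in_recovery()' gives the same answer from SQL: t on a standby, f on a primary.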

Shut down the primary

Run pg_ctl stop on the primary to simulate a crash.

$ pg_ctl stop
waiting for server to shut down.... done
server stopped

The standby's log now reports errors because it can no longer reach the primary

$ tail -f /var/lib/pgsql/10/data/log/postgresql-Tue.log
2019-08-06 18:00:19.399 CST [21594] FATAL:  could not connect to the primary server: could not connect to server: Connection refused
		Is the server running on host "192.168.1.2" and accepting
		TCP/IP connections on port 5432?
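
Rather than watching the log, the standby host can probe the primary directly with pg_isready, which reports reachability through its exit status. A small sketch follows; primary_status is a hypothetical wrapper name, and the port defaults to 5432 as in the log above.

```shell
# primary_status HOST [PORT]: map the pg_isready exit status to a word.
#   0 = accepting connections, 1 = rejecting (e.g. still starting up),
#   2 = no response, anything else = the check itself could not run.
primary_status() {
    host=$1
    port=${2:-5432}
    rc=0
    pg_isready -q -h "$host" -p "$port" || rc=$?
    case "$rc" in
        0) echo accepting ;;
        1) echo rejecting ;;
        2) echo unreachable ;;
        *) echo error ;;
    esac
}
```

After the pg_ctl stop above, primary_status 192.168.1.2 would be expected to print unreachable.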

Promote the standby

Run pg_ctl promote on the standby to promote it

$ pg_ctl promote
waiting for server to promote.... done
server promoted

Once promoted, the former standby accepts writes as well as reads, and its recovery.conf is renamed to recovery.done.

Its state now reads in production

$ pg_controldata | grep 'Database cluster state'
Database cluster state:               in production

A write succeeds, confirming that the instance is read-write

postgres=# create database test;
CREATE DATABASE
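
When promotion is triggered by a trigger file there is no "server promoted" message to wait on, so a script may want to poll until the instance has left recovery. The following is a sketch under assumptions: wait_until_primary is a hypothetical name, and it assumes psql can connect locally as the postgres user without a password prompt.

```shell
# wait_until_primary [ATTEMPTS]: poll once per second until
# pg_is_in_recovery() returns f. Returns 0 on success, 1 on timeout.
wait_until_primary() {
    attempts=${1:-30}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -X skips ~/.psqlrc, -At prints the bare value (t or f)
        in_recovery=$(psql -X -At -U postgres -d postgres \
            -c 'SELECT pg_is_in_recovery()' 2>/dev/null) || in_recovery=
        if [ "$in_recovery" = "f" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep 1
    done
    return 1
}
```

A caller could then write wait_until_primary 60 && echo "promotion complete".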

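Rebuild the standby

Once the failed original primary has been repaired, attach it to the new primary as a standby.

Preparation

Edit pg_hba.conf on the new primary and add entries for the original primary (192.168.1.2). Skip this step if they already exist.

host all all 192.168.1.2/32 trust
host replication repuser 192.168.1.2/32 md5

Run pg_ctl reload to apply the change

$ pg_ctl reload
server signaled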
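
The "skip if it already exists" rule for pg_hba.conf can be automated. This is an illustrative sketch: ensure_hba_line is a hypothetical helper, and it compares whole lines literally, so an equivalent entry with different spacing would not be detected.

```shell
# ensure_hba_line HBA_FILE LINE: append LINE unless an identical line
# is already present. Prints "present" or "added".
ensure_hba_line() {
    hba_file=$1
    line=$2
    # -x: match the whole line, -F: treat LINE as a fixed string
    if grep -qxF -- "$line" "$hba_file"; then
        echo present
    else
        printf '%s\n' "$line" >> "$hba_file"
        echo added
    fi
}
```

For example: ensure_hba_line /var/lib/pgsql/10/data/pg_hba.conf 'host replication repuser 192.168.1.2/32 md5', followed by pg_ctl reload when the result is added.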

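Create the archive directory pg_archive on the new primary. Skip this step if it already exists.

$ mkdir /var/lib/pgsql/10/pg_archive

Create the password file .pgpass on the new standby. Skip this step if it already exists. The file must have mode 0600 or libpq ignores it, and an entry applies to streaming replication connections only if its database field is replication or *.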

$ pwd
/var/lib/pgsql
$ cat .pgpass
192.168.1.3:5432:postgres:repuser:repuser
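
Creating the password file with the right permissions can be scripted. A minimal sketch, assuming the hypothetical helper name write_pgpass; it appends without checking for duplicates.

```shell
# write_pgpass FILE ENTRY: append ENTRY (host:port:database:user:password)
# to FILE and make sure the file is readable by its owner only.
write_pgpass() {
    pgpass_file=$1
    entry=$2
    # The subshell keeps the restrictive umask from leaking to the caller.
    ( umask 077; printf '%s\n' "$entry" >> "$pgpass_file" )
    chmod 600 "$pgpass_file"
}
```

An entry meant for streaming replication would look like write_pgpass ~/.pgpass '192.168.1.3:5432:replication:repuser:repuser'.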

Confirm the following parameters in the new standby's postgresql.conf

hot_standby = on
max_standby_streaming_delay = 30s
wal_receiver_status_interval = 10s
hot_standby_feedback = on
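
While the instance is stopped, these parameters can be checked by reading the file directly. The sketch below is deliberately simple and makes several assumptions: conf_value is a hypothetical helper, it only understands unquoted name = value lines, and it ignores include directives and postgresql.auto.conf. On a running instance, SHOW hot_standby; in psql is the authoritative check.

```shell
# conf_value CONF_FILE NAME: print the last uncommented value assigned
# to NAME, or nothing if NAME is not set in the file.
conf_value() {
    conf_file=$1
    name=$2
    sed -n "s/^[[:space:]]*${name}[[:space:]]*=[[:space:]]*\([^#[:space:]]*\).*/\1/p" \
        "$conf_file" | tail -n 1
}
```

For example, conf_value /var/lib/pgsql/10/data/postgresql.conf hot_standby should print on for the configuration listed above.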

Wipe and rebuild the standby

Delete the original primary's data

$ rm -rf /var/lib/pgsql/10/data/*

Copy the data from the new primary

$ pg_basebackup -h 192.168.1.3 -U repuser -D /var/lib/pgsql/10/data/ -X stream -P
Password:
40584/40584 kB (100%), 1/1 tablespace

Copy recovery.conf.sample or recovery.done to recovery.conf

$ cp /usr/pgsql-10/share/recovery.conf.sample /var/lib/pgsql/10/data/recovery.conf

Update recovery.conf

standby_mode = on
primary_conninfo = 'host=192.168.1.3 port=5432 user=repuser password=repuser application_name=standby002'
recovery_target_timeline = 'latest'
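
Since only these three settings are needed, the file can also be generated instead of copied and edited. A sketch, assuming a hypothetical helper named write_recovery_conf and a connection string that contains no single quotes:

```shell
# write_recovery_conf DATA_DIR CONNINFO: write a minimal recovery.conf
# with the three settings shown above.
write_recovery_conf() {
    data_dir=$1
    conninfo=$2
    cat > "$data_dir/recovery.conf" <<EOF
standby_mode = on
primary_conninfo = '$conninfo'
recovery_target_timeline = 'latest'
EOF
    # The connection string may carry a password, so keep it private.
    chmod 600 "$data_dir/recovery.conf"
}
```

For example: write_recovery_conf /var/lib/pgsql/10/data 'host=192.168.1.3 port=5432 user=repuser password=repuser application_name=standby002'. Alternatively, passing -R to pg_basebackup writes a minimal recovery.conf as part of the base backup.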

Start the new standby

$ pg_ctl start

Demote in place to a standby

Copy recovery.conf.sample or recovery.done to recovery.conf

$ cp /usr/pgsql-10/share/recovery.conf.sample /var/lib/pgsql/10/data/recovery.conf

Update recovery.conf

standby_mode = on
primary_conninfo = 'host=192.168.1.3 port=5432 user=repuser password=repuser application_name=standby002'
recovery_target_timeline = 'latest'

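Synchronize the timeline. Because data changed after the switchover, the old primary and the new primary may now be on different timelines, so run pg_rewind before starting the instance. pg_rewind requires the target instance to be cleanly shut down, and either wal_log_hints = on or data checksums enabled on it.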

$ pg_rewind --target-pgdata=/var/lib/pgsql/10/data --source-server='host=192.168.1.3 port=5432 user=postgres dbname=postgres' -P
connected to server
source and target cluster are on the same timeline
no rewind required
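
The pg_rewind preconditions can be checked from pg_controldata output before running it. This is a sketch under assumptions: ctl_field and rewind_prereqs are hypothetical helpers, the output is assumed to be in English, and the field labels used are those printed by pg_controldata in PostgreSQL 10.

```shell
# ctl_field NAME: extract one "NAME:   value" field from pg_controldata
# text on stdin.
ctl_field() {
    sed -n "s/^$1:[[:space:]]*//p"
}

# rewind_prereqs: read the target's pg_controldata output on stdin.
# Prints "ok" and returns 0 when the target is cleanly shut down and has
# wal_log_hints or data checksums; otherwise prints a reason, returns 1.
rewind_prereqs() {
    ctl=$(cat)
    state=$(printf '%s\n' "$ctl" | ctl_field 'Database cluster state')
    hints=$(printf '%s\n' "$ctl" | ctl_field 'wal_log_hints setting')
    sums=$(printf '%s\n' "$ctl" | ctl_field 'Data page checksum version')
    case "$state" in
        "shut down"*) ;;
        *) echo "target not cleanly shut down"; return 1 ;;
    esac
    if [ "$hints" = "on" ] || [ "${sums:-0}" != "0" ]; then
        echo ok
    else
        echo "need wal_log_hints or data checksums"
        return 1
    fi
}
```

Usage on the old primary would be pg_controldata -D /var/lib/pgsql/10/data | rewind_prereqs.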

Start the new standby

$ pg_ctl start
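
Whichever route was taken, the result can be confirmed from the new primary through pg_stat_replication. The sketch below is illustrative: standby_state is a hypothetical helper, and it assumes psql can reach the new primary as the postgres user.

```shell
# standby_state PRIMARY_HOST APP_NAME: print the replication state that
# the primary reports for the standby named APP_NAME, or "missing" if
# no such standby is connected.
standby_state() {
    primary_host=$1
    app=$2
    psql -X -At -F '|' -h "$primary_host" -U postgres -d postgres \
        -c 'SELECT application_name, state FROM pg_stat_replication' |
        awk -F '|' -v app="$app" '
            $1 == app { print $2; found = 1 }
            END { if (!found) print "missing" }'
}
```

With the settings used in this article, standby_state 192.168.1.3 standby002 should print streaming once the rebuilt standby has caught up.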