Check For PostgreSQL Replication Delay/Lag - Part 2

Master:
SELECT pg_current_xlog_location();
Slave:
SELECT pg_last_xlog_receive_location();
By comparing these two values you can see whether the servers are in sync.
The problem, yet again, is that if streaming replication fails, both of these
functions can keep returning the same values and you could still end up
thinking the replication is working. You also need to query both the
master and the slave to monitor this, which is not that easy in
monitoring systems, and you still don't have the actual lag in seconds,
so you would still need to run the first query.
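To illustrate, such a check can be scripted from a monitoring host (a minimal sketch; the hostnames are placeholders and passwordless access, e.g. via .pgpass, is assumed):
$ MASTER=$(psql -h master.example.com -At -c "SELECT pg_current_xlog_location()")
$ SLAVE=$(psql -h slave.example.com -At -c "SELECT pg_last_xlog_receive_location()")
$ [ "$MASTER" = "$SLAVE" ] && echo "in sync at $MASTER" || echo "master at $MASTER, slave received up to $SLAVE"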

Check For PostgreSQL Replication Delay/Lag

This can be achieved by comparing pg_last_xlog_receive_location() and
pg_last_xlog_replay_location() on the slave: if they are the same, the query
returns 0; otherwise it falls back to the lag-in-seconds query:
SELECT
CASE
WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location() THEN 0
ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())::INTEGER
END
AS replication_lag;
This query is all good, but the problem is that it is not safe. If for some
reason the master stops sending transaction logs, this query will continue
to return 0 and you will think the replication is working, when it is not.
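One way to guard against that misleading 0 (my own suggestion, not part of the original text) is to also confirm on the master that a WAL sender is actually streaming:
SELECT application_name, client_addr, state
FROM pg_stat_replication
WHERE state = 'streaming';
An empty result here means no standby is streaming, so any lag value computed on the slave should be treated as stale.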

Check Replication Delay/Lag In PostgreSQL - Part 1

I used to check replication delay/lag by running the following query on the
slave:
SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))::INT;
This query works great and gives you the lag in seconds. The problem is
that if the master is not active, the result is meaningless:
pg_last_xact_replay_timestamp() stops advancing, so the apparent lag grows
even though there is nothing left to replay. So you need to first check
whether the two servers are in sync and, if they are, return 0.

Monitoring Approach For Streaming Replication With Hot Standby In PostgreSQL 9.3

The people using PostgreSQL and the Streaming Replication feature seem to ask many of the same questions:
1. How best to monitor Streaming Replication?
2. What is the best way to check the replication status?
3. Are there alternatives, when monitoring on the Standby, to using the pg_stat_replication view on the Master?
4. How should I calculate replication lag-time, in seconds, minutes, etc.?
In light of these commonly asked questions, I thought a blog would help. The following are some methods I’ve found to be useful.
Monitoring is critical for large infrastructure deployments where you use Streaming Replication for:
1. Disaster recovery
2. High availability
3. Load balancing, when using Streaming Replication with Hot Standby
PostgreSQL has some building blocks for replication monitoring; the following are some important
functions and views which can be used for monitoring the replication:
1. pg_stat_replication view on master/primary server.
This view helps in monitoring the standby on Master. It gives you the following details:
pid:              Process ID of the WAL sender process
usesysid:         OID of the user used for streaming replication
usename:          Name of the user used for streaming replication
application_name: Application name of the connection to the master
client_addr:      Address of the standby/streaming replication client
client_hostname:  Hostname of the standby
client_port:      TCP port on which the standby communicates with the WAL sender
backend_start:    Time at which the standby connected to the master
state:            Current WAL sender state, e.g. streaming
sent_location:    Last transaction log location sent to the standby
write_location:   Last transaction log location written to disk on the standby
flush_location:   Last transaction log location flushed to disk on the standby
replay_location:  Last transaction log location replayed on the standby
sync_priority:    Priority of the standby server for being chosen as the synchronous standby
sync_state:       Sync state of the standby (async or sync)
e.g.:
postgres=# select * from pg_stat_replication ;
-[ RECORD 1 ]----+---------------------------------
pid              | 1114
usesysid         | 16384
usename          | repuser
application_name | walreceiver
client_addr      | 172.17.0.3
client_hostname  |
client_port      | 52444
backend_start    | 15-MAY-14 19:54:05.535695 -04:00
state            | streaming
sent_location    | 0/290044C0
write_location   | 0/290044C0
flush_location   | 0/290044C0
replay_location  | 0/290044C0
sync_priority    | 0
sync_state       | async
2. pg_is_in_recovery(): Function which tells whether the standby is still in recovery mode or not.
e.g.:
postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)
3. pg_last_xlog_receive_location: Function which gives the location of the last transaction log received from the master by streaming replication and written to disk on the standby.
e.g.:
postgres=# select pg_last_xlog_receive_location();
 pg_last_xlog_receive_location
-------------------------------
 0/29004560
(1 row)
4. pg_last_xlog_replay_location: Function which gives the location of the last transaction log replayed during recovery. An example is given below:
postgres=# select pg_last_xlog_replay_location();
 pg_last_xlog_replay_location
------------------------------
 0/29004560
(1 row)
5. pg_last_xact_replay_timestamp: This function gives the timestamp of the last transaction replayed during recovery. Below is an example:
postgres=# select pg_last_xact_replay_timestamp();
 pg_last_xact_replay_timestamp
----------------------------------
 15-MAY-14 20:54:27.635591 -04:00
(1 row)
Above are some important functions/views, which are already available in PostgreSQL for monitoring the streaming replication.
So, the logical next question is, “What’s the right way to monitor the Hot Standby with Streaming Replication on Standby Server?”
If you have Hot Standby with Streaming Replication, the following are the points you should monitor:
1. Check if your Hot Standby is in recovery mode or not:
For this you can use the pg_is_in_recovery() function.
2. Check whether Streaming Replication is working or not.
An easy way of doing this is to check the pg_stat_replication view on the Master/Primary. This view shows a row per standby on the master only while Streaming Replication is working.
3. Check whether Streaming Replication has stopped and the Hot Standby is recovering from archived WAL files.
For this, the DBA can either use the PostgreSQL log file or the following functions provided in PostgreSQL 9.3:
pg_last_xlog_replay_location();
pg_last_xact_replay_timestamp();
4. Check how far off is the Standby from Master.
There are two ways to monitor lag for Standby.
  i. Lag in bytes: to calculate the lag in bytes, use the pg_stat_replication view on the master together with the pg_xlog_location_diff function, e.g.:
pg_xlog_location_diff(pg_stat_replication.sent_location, pg_stat_replication.replay_location)
which gives the lag in bytes (a complete query is sketched after this list).
ii. Lag in seconds: the following is the SQL most people use to find the lag in seconds:
SELECT CASE WHEN pg_last_xlog_receive_location() = pg_last_xlog_replay_location()
             THEN 0
           ELSE EXTRACT (EPOCH FROM now() - pg_last_xact_replay_timestamp())
      END AS log_delay;
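A complete form of the byte-lag query from point 4.i could look like this (a sketch; the column alias is mine). Run it on the master:
SELECT application_name, client_addr,
       pg_xlog_location_diff(sent_location, replay_location) AS replay_lag_bytes
FROM pg_stat_replication;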
Including the above into your repertoire can give you good monitoring for PostgreSQL.
In a future post I will include a script that can be used for monitoring the Hot Standby with PostgreSQL streaming replication.

Replication synchronous_commit Options Available In PostgreSQL

Options available
Allowed values up to Postgres version 9.5 were "on", "remote_write", "local" and "off", and with 9.6 "remote_apply" joins the party. Short plain-text descriptions of all of these could read as below. NB! The latter three values in the list below (local, remote_write, remote_apply) are effective only in synchronous streaming replication mode, and fall back to "on" behavior when no replicas have been listed in the synchronous_standby_names parameter.
* on - Transaction commit always waits until the data is really flushed to the transaction log (aka WAL or XLOG), making sure the transaction is really persisted. In synchronous streaming replication mode, the replica needs to do the same.
* off - Commit can be acknowledged to the calling client before the transaction is actually flushed to the transaction log, making it possible to lose some recent (<1s by default) allegedly-committed transactions in case of a server crash. Synchronous replication is ignored. This is one of the best-known Postgres performance tweaks.
* local - Forces the "on" behavior (guaranteed data flush) only on the master node. Usually set as a session parameter, for example to speed up non-critical data imports on the master.
* remote_write - Provides a weaker guarantee than "on": the transaction counts as committed if the master does a guaranteed flush and the replica just gets a write confirmation from the operating system (prone to replica corruption if the replica machine crashes before the data hits storage).
* remote_apply - Provides the strongest replica consistency: the commit doesn't return before the replica flushes and also applies the data. Clients can query either the master or the slave and will see exactly the same data.
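As a quick illustration (my own sketch, not from the original; 'standby1' is a hypothetical application_name), the replication-related values only take effect once a synchronous standby is named on the master:
synchronous_standby_names = 'standby1'
synchronous_commit = remote_write
The durability trade-off can also be made per session, for example for a non-critical bulk load:
SET synchronous_commit = off;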

Setup Streaming Replication Step By Step

                 Though many know about streaming replication, I would like to start with the prerequisites and a short introduction to this replication method, as this blog is for beginners :-).
Prerequisites:
1. The same version of the PostgreSQL database must be installed on both servers.
2. Configure password-less ssh authentication for the "postgres" user.
3. The production server must operate with WAL archiving enabled, by setting archive_mode and archive_command in the postgresql.conf file.
4. The production and standby servers must always have connectivity between them to transfer the archived WAL files from production.
5. Set up your standby host's environment and directory structure exactly the same as your primary's.
Introduction:
It is an asynchronous mechanism, so the standby server lags behind the master. But unlike other replication methods, this lag is very short and can be as little as a single transaction, depending on network speed, database activity, and streaming replication settings. Also, the load on the master for each slave is minimal, allowing a single master to support dozens of slaves. The feature has been included since PostgreSQL 9.0: a second database instance (normally on a separate server) replays the primary's binary log, while the standby server can also accept read-only queries.
Here are the practical steps with necessary commands:
1. Connect to Master and create a “replication” user with replication privilege.
$ psql
Password for user postgres:
psql.bin (9.2.1) Type "help" for help.
postgres=# create user replication with replication password '';
2. We need to change some parameters in postgresql.conf and authentication in pg_hba.conf, which are located at /opt/PostgreSQL92/data/ on the Master. Set up connections and authentication so that the standby server can successfully connect to the replication pseudo-database on the primary.
$ $EDITOR postgresql.conf
listen_addresses = '*'
$ $EDITOR pg_hba.conf
#The standby server must have superuser access privileges.
host replication replication 10.176.0.0/16 md5
3. Set up the streaming replication related parameters on the primary server.
$EDITOR postgresql.conf
#To enable read-only queries on a standby server, wal_level must be set to "hot_standby". But you can choose "archive" if you never connect to the server in standby mode.
wal_level = hot_standby
#Set the maximum number of concurrent connections from the standby servers.
max_wal_senders = 5
#To prevent the primary server from removing the WAL segments required for the standby server before shipping them, set the minimum number of segments retained in the pg_xlog directory.
wal_keep_segments = 32
#Enable WAL archiving on the primary to an archive directory accessible from the standby. If wal_keep_segments is a high enough number to retain the WAL segments required for the standby server, this may not be necessary.
archive_mode    = on
archive_command = 'cp %p /var/PG92_Archives/%f && scp %p [email protected]:/%f'
Note: Restart the cluster after modifying the above parameters in postgresql.conf file.
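For example (assuming the data directory shown above; the fast shutdown mode is my choice):
$ pg_ctl -D /opt/PostgreSQL92/data restart -m fast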
4. Restart postgres on the primary server and check whether the parameters took effect.
postgres=# show archive_command ;
      archive_command
-----------------------------
 cp %p /var/PG92_Archives/%f
(1 row)
postgres=# show archive_mode ;
 archive_mode
--------------
 on
(1 row)
postgres=# show wal_level ;
  wal_level
-------------
 hot_standby
(1 row)
postgres=# show max_wal_senders ;
 max_wal_senders
-----------------
 5
(1 row)
postgres=# show wal_keep_segments ;
 wal_keep_segments
-------------------
 32
(1 row)
5. Make a base backup of the Master server's data directory.
$ psql -c "SELECT pg_start_backup('label', true)"
$ cp -r /opt/PostgreSQL92/data/* backup/
$ psql -c "SELECT pg_stop_backup()"
-- tar the backup directory and move it to the standby location.
$ tar -cvzf backup.tar backup/
$ scp backup.tar [email protected]:/opt/PostgreSQL92/
6. Move the slave data directory contents to another location, untar the backup file, and copy its contents to the slave data directory.
7. Set up replication-related parameters, connections and authentication in the standby server like the primary, so that the standby can work as a primary after failover.
8. Enable read-only queries on the standby server. But if wal_level is archive on the primary, leave hot_standby unchanged (i.e., off).
$ $EDITOR postgresql.conf
hot_standby = on
9. Create a recovery command file in the standby server; the following parameters are required for streaming replication.
$ $EDITOR recovery.conf
# Specifies whether to start the server as a standby. In streaming
# replication, this parameter must be set to on.
standby_mode          = 'on'
# Specifies a connection string which is used for the standby server to
# connect with the primary.
primary_conninfo      = 'host=10.176.112.188 port=5432 user=replication password= application='
# Specifies a trigger file whose presence should cause streaming
# replication to end (i.e., failover). ===> Do not create this file now;
# create it only when you want to fail over.
trigger_file = ''
# Specifies a command to load archive segments from the WAL archive. If
# wal_keep_segments is a high enough number to retain the WAL segments
# required for the standby server, this may not be necessary. But
# a large workload can cause segments to be recycled before the standby
# is fully synchronized, requiring you to start again from a new base
# backup.
restore_command = 'cp %f "%p"'
10. Start postgres in the standby server. It will start streaming replication and you will see log messages like below:
LOG:  entering standby mode
LOG:  consistent recovery state reached at 0/1D000078
LOG:  record with zero length at 0/1D000078
LOG:  streaming replication successfully connected to primary
11. You can calculate the replication lag by comparing the current WAL write location on the primary with the last WAL location received/replayed by the standby. These can be retrieved using the pg_current_xlog_location function on the primary side and the pg_last_xlog_receive_location or pg_last_xlog_replay_location function on the standby, respectively.
$ psql -c "SELECT pg_current_xlog_location()" -h 192.168.0.10
(primary host)
 pg_current_xlog_location
--------------------------
 0/2000000
(1 row)
$ psql -c "select pg_last_xlog_receive_location()" -h 192.168.0.20
(standby host)
 pg_last_xlog_receive_location
-------------------------------
 0/2000000
(1 row)
$ psql -c "select pg_last_xlog_replay_location()" -h 192.168.0.20
(standby host)
 pg_last_xlog_replay_location
------------------------------
 0/2000000
(1 row)
12. Other ways to check streaming replication: the easiest is to run "select now() - pg_last_xact_replay_timestamp();" on the slave side. The pg_last_xact_replay_timestamp() function gives the timestamp of the last transaction replayed during recovery, i.e. the time at which the commit or abort WAL record for that transaction was generated on the primary. If no transactions have been replayed during recovery, this function returns NULL. Otherwise, while recovery is still in progress the value increases monotonically; if recovery has completed, it remains static at the value of the last transaction applied during that recovery. When the server has been started normally without recovery, the function returns NULL. You can try some operation on the Master and then check the function output.
If you want to check the delay manually, then follow the steps below:
Step 1: Create a table on the primary using the commands below.
create table stream_delay (tstamp timestamp without time zone);
insert into stream_delay select now();
Step 2: Schedule the below command on the primary to execute every minute from a cron job.
update stream_delay set tstamp = now();
Step 3: Verify the delay on the slave by selecting from the stream_delay table.
It should show the last time the row was updated on the primary. The difference between this timestamp and the current timestamp of the slave server shows the time delay between primary and slave.
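For step 3, the query on the slave can simply be (a minimal sketch):
select now() - tstamp as replication_delay from stream_delay;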
You can also check the progress of streaming replication by using the ps command. The displayed LSNs indicate the byte position in the xlogs that the standby server has written up to.
[primary] $ ps -ef | grep sender
postgres  6879  6831  0 10:31 ?        00:00:00 postgres: wal sender
process postgres 127.0.0.1(44663) streaming 0/2000000
[standby] $ ps -ef | grep receiver
postgres  6878  6872  1 10:31 ?        00:00:00 postgres: wal receiver
process   streaming 0/2000000
Thanks all. Kindly let me know if I missed anything.

Automatic failover using Replication Manager 2.0 on PostgreSQL 9.3.5


      In PostgreSQL high availability (Streaming Replication/Hot Standby), one thing requires human interference and has no automation: in the event of a master database crash, initiating the failover procedure (trigger-file creation or the pg_ctl promote command) on the standby. The core of PostgreSQL has no built-in functionality to identify a failure of the master and notify the standby. Thus, we need automated tools to take over the manual failover work, or we have to dive into scripting land and write our own script to do it.

Today, we have very good external tools to handle automatic failover, like Replication Manager (repmgr), EDB Failover Manager (EFM), pgHA and HandyRep. Thanks to all of them for filling the gap of automatic failover in PostgreSQL.

In this post, I am demonstrating Replication Manager on a single node (localhost) on RHEL 6.5 with PostgreSQL 9.3.5. To achieve an easy and clear understanding of the concept, I have compiled repmgr against the EnterpriseDB One Click Installer (a pre-built binary package) instead of PG source.

To compile repmgr, we need to install a few mandatory dependency packages (gcc, postgresql-devel, libxslt-devel, pam-devel, libopenssl-devel, krb5-devel and libedit-devel) using yum or rpm. After installing the dependencies, download repmgr 2.0 from here, put pg_config in your PATH, and start compiling.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ type pg_config
pg_config is hashed (/opt/PostgreSQL/9.3/bin/pg_config)

export PATH=/opt/PostgreSQL/9.3/bin:$PATH
export LD_LIBRARY_PATH=/opt/PostgreSQL/9.3/lib:$LD_LIBRARY_PATH

tar xvzf repmgr-2.0.tar.gz
cd repmgr
make USE_PGXS=1
make USE_PGXS=1 install

Mostly, repmgr compiles smoothly without any hiccups if all the dependency packages are installed. Since I am compiling against PG pre-built binaries, there may be diverse variants of libraries between the pre-built bundle and the rpm packages, which might throw some compilation errors, like the one you see here:

/lib64/libldap_r-2.4.so.2: undefined reference to `ber_sockbuf_io_udp'
collect2: ld returned 1 exit status
make: *** [repmgrd] Error 1

To fix this, inspect the library in question, /lib64/libldap_r-2.4.so.2:
[root@localhost repmgr-2.0]# cd /lib64/
[root@localhost lib64]# ls -l libldap*

lrwxrwxrwx. 1 root root     20 Dec  8 09:23 libldap-2.4.so.2 -> libldap-2.4.so.2.5.6
-rwxr-xr-x. 1 root root 317200 Apr 29  2013 libldap-2.4.so.2.5.6
lrwxrwxrwx. 1 root root     22 Dec  8 09:23 libldap_r-2.4.so.2 -> libldap_r-2.4.so.2.5.6
-rwxr-xr-x. 1 root root 335264 Apr 29  2013 libldap_r-2.4.so.2.5.6

OK, there are two copies; retain one and unlink the other.
[root@localhost lib64]# unlink libldap_r-2.4.so.2

Now clean the previous compilation with the command "make USE_PGXS=1 clean", repeat from step 1, and it will compile without any issues. Similarly, you can fix other library-related errors. After installation you can find the repmgr binaries, libraries and .sql files in the PostgreSQL locations.

[postgres@localhost:/opt/PostgreSQL/9.3/bin~]$ ls repmgr*     (Two utility commands)
repmgr  repmgrd

[postgres@localhost:/opt/PostgreSQL/9.3/lib/postgresql~]$ ls rep*  
repmgr_funcs.so

[postgres@localhost:/opt/PostgreSQL/9.3/share/postgresql/contrib~]$ ls
repmgr_funcs.sql  repmgr.sql  uninstall_repmgr_funcs.sql  uninstall_repmgr.sql

We are all set to set up automatic failover with the super-simple Replication Manager. As a first step we need streaming replication configured (refer to the wiki), which I have already done on my localhost between ports 5432 (Master) and 5433 (Standby), so let's use them. You can also try building the standby using the repmgr STANDBY CLONE command, as sketched below. Refer to the repmgr documentation for more details.
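For reference, a STANDBY CLONE invocation would look roughly like this (a sketch under my assumptions about paths, users and flags; check the repmgr 2.0 documentation for the exact options):
$ repmgr -D /opt/PostgreSQL/9.3/data_slave -d postgres -p 5432 -U postgres --verbose standby clone 127.0.0.1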

Step 1. Enable the repmgr libraries on both PostgreSQL instances; they are required for repmgr's backend functions.
Master Data Directory : /opt/PostgreSQL/9.3/data
Standby Data Directory: /opt/PostgreSQL/9.3/data_slave

Edit $PGDATA/postgresql.conf 
shared_preload_libraries = 'repmgr_funcs'

[postgres@localhost:/opt/PostgreSQL/9.3~]$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data start

[postgres@localhost:/opt/PostgreSQL/9.3~]$ psql -p 5432 -c "show shared_preload_libraries;"
 shared_preload_libraries
--------------------------
 repmgr_funcs
(1 row)

Step 2. Repmgr needs a repmgr.conf file for each node; since we are working on localhost, we need to keep each node's repmgr.conf in a separate directory.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ pwd
/opt/PostgreSQL/9.3
[postgres@localhost:/opt/PostgreSQL/9.3~]$ mkdir -p repmgr/master repmgr/slave

Step 3. Create repmgr.conf file for Master(5432) and Standby(5433) in the directories we created in Step 2.

[postgres@localhost:/opt/PostgreSQL/9.3/repmgr/master~]$ pwd
/opt/PostgreSQL/9.3/repmgr/master
[postgres@localhost:/opt/PostgreSQL/9.3/repmgr/master~]$ more repmgr.conf
cluster=test
node=1
node_name=master
conninfo='host=127.0.0.1 port=5432 dbname=postgres'
pg_bindir=/opt/PostgreSQL/9.3/bin
master_response_timeout=60
reconnect_attempts=6
reconnect_interval=10
failover=automatic
promote_command='/opt/PostgreSQL/9.3/repmgr/auto_failover.sh'


[postgres@localhost:/opt/PostgreSQL/9.3/repmgr/slave~]$ pwd
/opt/PostgreSQL/9.3/repmgr/slave
[postgres@localhost:/opt/PostgreSQL/9.3/repmgr/slave~]$ more repmgr.conf
cluster=test
node=2
node_name=slave
conninfo='host=127.0.0.1 port=5433 dbname=postgres'
pg_bindir=/opt/PostgreSQL/9.3/bin
master_response_timeout=60
reconnect_attempts=6
reconnect_interval=10
failover=automatic
loglevel=DEBUG
promote_command='/opt/PostgreSQL/9.3/repmgr/auto_failover.sh'

What you mainly have to observe in repmgr.conf is "master_response_timeout", which is the total wait duration in seconds before declaring that the master has disappeared. Within that duration, 6 reconnect attempts are made at 10-second intervals. If there is no response from the master within the "master_response_timeout" duration, automatic failover takes place via the promote_command script. The script consists of the standby promotion steps; the one I created for this setup is shared below.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ more repmgr/auto_failover.sh
#!/bin/bash
echo "Promoting Standby at `date '+%Y-%m-%d %H:%M:%S'`" >>/tmp/repsetup.log
/opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data_slave promote >>/tmp/repsetup.log
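One detail not shown above: the script must be executable for repmgrd to run it (a step I am adding here; adjust the path to yours):
$ chmod +x /opt/PostgreSQL/9.3/repmgr/auto_failover.sh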

Step 4. Register Master and Standby node with repmgr using “repmgr” utility.
[postgres@localhost:/opt/PostgreSQL/9.3~]$ repmgr -f repmgr/master/repmgr.conf --verbose master register >/tmp/repsetup.log 2>&1

Master register logs (/tmp/repsetup.log):

[2015-01-12 01:28:55] [INFO] repmgr connecting to master database
[2015-01-12 01:28:55] [INFO] repmgr connected to master, checking its state
[2015-01-12 01:28:55] [INFO] master register: creating database objects inside the repmgr_test schema
[2015-01-12 01:28:55] [DEBUG] master register: CREATE SCHEMA repmgr_test
[2015-01-12 01:28:55] [DEBUG] master register: CREATE TABLE repmgr_test.repl_nodes (
[2015-01-12 01:28:55] [DEBUG] master register: CREATE TABLE repmgr_test.repl_monitor (
[2015-01-12 01:28:55] [DEBUG] master register: CREATE VIEW repmgr_test.repl_status AS
[2015-01-12 01:28:55] [DEBUG] master register: CREATE INDEX idx_repl_status_sort ON repmgr_test.repl_monitor (last_monitor_time, standby_node)
[2015-01-12 01:28:55] [DEBUG] master register: INSERT INTO repmgr_test.repl_nodes (id, cluster, name, conninfo, priority) VALUES (1, 'test', 'master', 'host=127.0.0.1 port=5432 dbname=postgres', 0)
[2015-01-12 01:28:55] [NOTICE] Master node correctly registered for cluster test with id 1 (conninfo: host=127.0.0.1 port=5432 dbname=postgres)
Opening configuration file: repmgr/master/repmgr.conf

Standby:
[postgres@localhost:/opt/PostgreSQL/9.3~]$ repmgr -f repmgr/slave/repmgr.conf --verbose standby register >>/tmp/repsetup.log 2>&1

Standby register logs (/tmp/repsetup.log):

[2015-01-12 01:30:37] [INFO] repmgr connecting to standby database
[2015-01-12 01:30:37] [INFO] repmgr connected to standby, checking its state
[2015-01-12 01:30:37] [DEBUG] standby register: SELECT 1 FROM pg_namespace WHERE nspname = 'repmgr_test'
[2015-01-12 01:30:37] [INFO] repmgr connecting to master database
[2015-01-12 01:30:37] [INFO] finding node list for cluster 'test'
[2015-01-12 01:30:37] [INFO] checking role of cluster node 'host=127.0.0.1 port=5432 dbname=postgres'
[2015-01-12 01:30:37] [INFO] repmgr connected to master, checking its state
[2015-01-12 01:30:37] [INFO] repmgr registering the standby
[2015-01-12 01:30:37] [DEBUG] standby register: INSERT INTO repmgr_test.repl_nodes(id, cluster, name, conninfo, priority) VALUES (2, 'test', 'slave', 'host=127.0.0.1 port=5433 dbname=postgres', 0)
[2015-01-12 01:30:37] [INFO] repmgr registering the standby complete
[2015-01-12 01:30:37] [NOTICE] Standby node correctly registered for cluster test with id 2 (conninfo: host=127.0.0.1 port=5433 dbname=postgres)
Opening configuration file: repmgr/slave/repmgr.conf

Looking at the logs you can easily see that repmgr creates its own schema in the database, named "repmgr_$CLUSTER", with some tables/views/functions in it, and inserts a row with replication-lag details into the repl_monitor table. Repmgr has a utility called "repmgrd", a daemon that monitors master availability from the standby node. It is also a management and monitoring daemon that watches the cluster status and can trigger standby promotion. We need to start the "repmgrd" daemon process after registering the master/standby nodes with repmgr.
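As an aside, you can inspect what was registered directly (the column names come from the INSERT statements in the logs above):
postgres=# SELECT id, cluster, name, conninfo FROM repmgr_test.repl_nodes;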

Step 5. Let's start the repmgr daemon process to watch the master. In our case the standby is not running on a separate node, so we start the daemon with the repmgr/slave/repmgr.conf file.
repmgrd -f repmgr/slave/repmgr.conf --verbose --monitoring-history >>/tmp/repsetup.log 2>&1 &

Logs (/tmp/repsetup.log)

[2015-01-12 01:42:13] [INFO] repmgrd Connecting to database 'host=127.0.0.1 port=5433 dbname=postgres'
[2015-01-12 01:42:13] [INFO] repmgrd Connected to database, checking its state
[2015-01-12 01:42:13] [INFO] repmgrd Connecting to primary for cluster 'test'
[2015-01-12 01:42:13] [INFO] finding node list for cluster 'test'
[2015-01-12 01:42:13] [INFO] checking role of cluster node 'host=127.0.0.1 port=5432 dbname=postgres'
[2015-01-12 01:42:13] [INFO] repmgrd Checking cluster configuration with schema 'repmgr_test'
[2015-01-12 01:42:13] [INFO] repmgrd Checking node 2 in cluster 'test'
[2015-01-12 01:42:13] [INFO] Reloading configuration file and updating repmgr tables
[2015-01-12 01:42:13] [INFO] repmgrd Starting continuous standby node monitoring
[2015-01-12 01:42:14] [DEBUG] standby_monitor: INSERT INTO repmgr_test.repl_monitor VALUES(1, 2, '2015-01-12 09:42:14.457287+00'::timestamp with time zone, '2015-01-12 09:42:13.950172+00'::timestamp with time zone, '2/C84DAB08', '2/C84DAB08', 0, 0)

Step 6. Monitor the nodes registered with repmgr and the running daemon process.
[postgres@localhost:/opt/PostgreSQL/9.3~]$ psql -p 5432 -d postgres -xc "SELECT * FROM repmgr_test.repl_status"

-[ RECORD 1 ]-------------+------------------------------
primary_node              | 1
standby_node              | 2
standby_name              | slave
last_monitor_time         | 2015-01-12 09:42:29.514056+00
last_wal_primary_location | 2/C84DB7A8
last_wal_standby_location | 2/C84DB7A8
replication_lag           | 0 bytes
replication_time_lag      | 00:04:01.960772
apply_lag                 | 0 bytes
communication_time_lag    | 00:03:59.45349

[postgres@localhost:/opt/PostgreSQL/9.3~]$ repmgr -f repmgr/master/repmgr.conf cluster show
[2015-01-12 01:47:00] [INFO] repmgr connecting to database
Role      | Connection String
* master  | host=127.0.0.1 port=5432 dbname=postgres
  standby | host=127.0.0.1 port=5433 dbname=postgres

Step 7. Let's simulate a failover scenario by taking the master down. I am doing it in a rather stupid way, by killing the postmaster process. Please don't do the same in production if you want to stay in the same company for a long time 🙂

[postgres@localhost:/opt/PostgreSQL/9.3~]$ kill `head -n1 data/postmaster.pid`
Or you can take the master down safely:
[postgres@localhost:/opt/PostgreSQL/9.3~]$ pg_ctl -D /opt/PostgreSQL/9.3/data stop -mf

Now check the logs to see how repmgr has promoted the standby:
[2015-01-12 02:14:11] [WARNING] Can't stop current query: PQcancel() -- connect() failed: Connection refused
[2015-01-12 02:14:11] [WARNING] repmgrd: Connection to master has been lost, trying to recover... 10 seconds before failover decision

[2015-01-12 02:14:21] [ERROR] repmgrd: We couldn't reconnect for long enough, exiting...
[2015-01-12 02:14:21] [DEBUG] repmgrd: there are 2 nodes registered
[2015-01-12 02:14:21] [DEBUG] repmgrd: node=1 conninfo="host=127.0.0.1 port=5432 dbname=postgres" witness=false
[2015-01-12 02:14:21] [ERROR] Connection to database failed: could not connect to server: Connection refused
        Is the server running on host "127.0.0.1" and accepting
        TCP/IP connections on port 5432?
[2015-01-12 02:14:21] [DEBUG] repmgrd: node=2 conninfo="host=127.0.0.1 port=5433 dbname=postgres" witness=false
[2015-01-12 02:14:21] [DEBUG] Total nodes counted: registered=2, visible=1
[2015-01-12 02:14:21] [DEBUG] XLog position of node 2: log id=2 (2), offset=3360733048 (C850B778)
[2015-01-12 02:14:21] [DEBUG] Last XLog position of node 2: log id=2 (2), offset=3360733048 (C850B778)
[2015-01-12 02:14:26] [INFO] repmgrd: This node is the best candidate to be the new primary, promoting...
[2015-01-12 02:14:26] [DEBUG] promote command is: "/opt/PostgreSQL/9.3/repmgr/auto_failover.sh"
Promoting Standby at 2015-01-12 02:14:26
server promoting
[2015-01-12 02:14:29] [INFO] repmgrd Checking cluster configuration with schema 'repmgr_test'
[2015-01-12 02:14:29] [INFO] repmgrd Checking node 2 in cluster 'test'
[2015-01-12 02:14:29] [INFO] Reloading configuration file and updating repmgr tables
[2015-01-12 02:14:29] [INFO] repmgrd Starting continuous primary connection check

Perfect: the repmgr daemon recognised the master failure, and before promoting the standby via the "auto_failover.sh" script it properly verified the WAL location as well. Let's verify whether the standby was promoted:

[postgres@localhost:/opt/PostgreSQL/9.3~]$ psql -p 5433 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
(1 row)

Incredible: without any human intervention, the manual failover procedure was taken care of by repmgr. You can rebuild the dead master as a standby either with repmgr itself or by following the switchback procedure shown in my previous post. You can set this up across two nodes the same way, by executing the standby steps on the standby node instead of localhost.