Relational Database Technologies: 2014

Monday, December 29, 2014

Implementing Switchover/Switchback in PostgreSQL 9.3.

This post educates sophisticated DBA's on how to setup graceful Switchover and Switchback environment in PostgreSQL high availability. Firstly, thanks to patch authors Heikki and Fujii for making Switchover/Switchback easier in PostgreSQL 9.3.(Pardom me if I missed other names).

Let me attempt to illustrate it in short prior to these patches, all of you know Standby's are critical components in achieving fast and safe disaster recovery. In PostgreSQL, recovery concept majorly deals with timelines to identify a series of WAL segments before and after the PITR or promotion of Standby to avoid overlapping of WAL segments. Timeline ID are associated with WAL segment file names(Eg:- In $PGDATA/pg_xlog/0000000C000000020000009E segment "0000000C" is timeline ID). In Streaming Replication both Primary and Slave will follow the same timeline ID, however when Standby gets promotion as new master by Switchover it bumps the timeline ID and old Primary refuses to restart as Standby due to timeline ID difference and throw error message as:

FATAL:  requested timeline 10 is not a child of this server's history
DETAIL:  Latest checkpoint is at 2/9A000028 on timeline 9, but in the history of the requested timeline, the server forked off from that timeline at 2/99017E68.

Thus, a new Standby has to be built from scratch, if the database size is huge then a longer time to rebuild and for this period newly promoted Primary will be running without Standby. There's also other issue like, when Switchover happens Primary does clean shutdown, Walsender process sends all outstanding WAL records to the standby but it doesn't wait for them to be replicated before it exits. Walreceiver fails to apply those outstanding WAL records as it detects closure of connection and exits.

Today, with two key software updates in PostgreSQL 9.3, both of the issues addressed very well by authors and now Streaming Replication Standby's follow a timeline switch consistently. We can now seamlessly and painlessly switch the duties between Primary and Standby by just restarting and majorly reducing rebuild time of Standby.

Note: Switchover/Switchback not possible if WAL Archives are not accessible to both servers and in Switchover process Primary database must do clean shutdown(normal or fast mode).

To demo, lets start with setup of Streaming Replication(wiki to setup SR) which I have configured in my local VM between two clusters (5432 as Primary and 5433 as Standby) sharing a common WAL archives location, because both clusters should have complete access of sequence of WAL archives. Look at the snapshot shared below with setup details and current timeline ID for better understanding of concept.

At this stage everyone must have a solid understanding that Switchover and Switchback are planned activities. Now SR setup in place we can exchange the duties of primary and standby as shown below:

Switchover steps:

Step 1. Do clean shutdown of Primary[5432] (-m fast or smart)

[postgres@localhost:/~]$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data stop -mf
waiting for server to shut down.... done
server stopped

Step 2. Check for sync status and recovery status of Standby[5433] before promoting it:

[postgres@localhost:/opt/PostgreSQL/9.3~]$  psql -p 5433 -c 'select pg_last_xlog_receive_location() "receive_location",
pg_last_xlog_replay_location() "replay_location",
pg_is_in_recovery() "recovery_status";'
 receive_location | replay_location | recovery_status
------------------+-----------------+-----------------
 2/9F000A20       | 2/9F000A20      | t
(1 row)

Standby in complete sync. At this stage we are safe to promote it as Primary.
Step 3. Open the Standby as new Primary by pg_ctl promote or creating a trigger file.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ grep trigger_file data_slave/recovery.conf
trigger_file = '/tmp/primary_down.txt'
[postgres@localhost:/opt/PostgreSQL/9.3~]$ touch /tmp/primary_down.txt

[postgres@localhost:/opt/PostgreSQL/9.3~]$ psql -p 5433 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 f
(1 row)

In Logs:  
2014-12-29 00:16:04 PST-26344-- [host=] LOG:  trigger file found: /tmp/primary_down.txt
2014-12-29 00:16:04 PST-26344-- [host=] LOG:  redo done at 2/A0000028
2014-12-29 00:16:04 PST-26344-- [host=] LOG:  selected new timeline ID: 14
2014-12-29 00:16:04 PST-26344-- [host=] LOG:  restored log file "0000000D.history" from archive
2014-12-29 00:16:04 PST-26344-- [host=] LOG:  archive recovery complete
2014-12-29 00:16:04 PST-26342-- [host=] LOG:  database system is ready to accept connections
2014-12-29 00:16:04 PST-31874-- [host=] LOG:  autovacuum launcher started

Standby has been promoted as master and a new timeline followed which you can notice in logs.
Step 4. Restart old Primary as standby and allow to follow the new timeline by passing "recovery_target_timline='latest'" in $PGDATA/recovery.conf file.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ cat data/recovery.conf
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=localhost port=5433 user=postgres'
restore_command = 'cp /opt/PostgreSQL/9.3/archives93/%f %p'
trigger_file = '/tmp/primary_131_down.txt'
[postgres@localhost:/opt/PostgreSQL/9.3~]$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data start
server starting

If you go through recovery.conf its very clear that old Primary trying to connect to 5433 port as new Standby pointing to common WAL Archives location and started.

In Logs:
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  database system was shut down at 2014-12-29 00:12:23 PST
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  restored log file "0000000E.history" from archive
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  entering standby mode
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  restored log file "0000000D00000002000000A0" from archive
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  restored log file "0000000D.history" from archive
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  consistent recovery state reached at 2/A0000090
2014-12-29 00:21:17 PST-32315-- [host=] LOG:  record with zero length at 2/A0000090
2014-12-29 00:21:17 PST-32310-- [host=] LOG:  database system is ready to accept read only connections
2014-12-29 00:21:17 PST-32325-- [host=] LOG:  started streaming WAL from primary at 2/A0000000 on timeline 14

Step 5. Verify the new Standby status.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ psql -p 5432 -c "select pg_is_in_recovery();"
 pg_is_in_recovery
-------------------
 t
(1 row)

Cool, without any re-setup we have brought back old Primary as new Standby.

Switchback steps:

Step 1. Do clean shutdown of new Primary [5433]:

[postgres@localhost:/opt/~]$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data_slave stop -mf
waiting for server to shut down.... done
server stopped

Step 2. Check for sync status of new Standby [5432] before promoting.
Step 3. Open the new Standby [5432] as Primary by creating trigger file or pg_ctl promote.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ touch /tmp/primary_131_down.txt

Step 4. Restart stopped new Primary [5433] as new Standby.

[postgres@localhost:/opt/PostgreSQL/9.3~]$ more data_slave/recovery.conf
recovery_target_timeline = 'latest'
standby_mode = on
primary_conninfo = 'host=localhost port=5432 user=postgres'
restore_command = 'cp /opt/PostgreSQL/9.3/archives93/%f %p'
trigger_file = '/tmp/primary_down.txt'

[postgres@localhost:/opt/PostgreSQL/9.3~]$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data_slave start
server starting

You can verify the logs of new Standby.

In logs:
[postgres@localhost:/opt/PostgreSQL/9.3/data_slave/pg_log~]$ more postgresql-2014-12-29_003655.log
2014-12-29 00:36:55 PST-919-- [host=] LOG:  database system was shut down at 2014-12-29 00:34:01 PST
2014-12-29 00:36:55 PST-919-- [host=] LOG:  restored log file "0000000F.history" from archive
2014-12-29 00:36:55 PST-919-- [host=] LOG:  entering standby mode
2014-12-29 00:36:55 PST-919-- [host=] LOG:  restored log file "0000000F.history" from archive
2014-12-29 00:36:55 PST-919-- [host=] LOG:  restored log file "0000000E00000002000000A1" from archive
2014-12-29 00:36:55 PST-919-- [host=] LOG:  restored log file "0000000E.history" from archive
2014-12-29 00:36:55 PST-919-- [host=] LOG:  consistent recovery state reached at 2/A1000090
2014-12-29 00:36:55 PST-919-- [host=] LOG:  record with zero length at 2/A1000090
2014-12-29 00:36:55 PST-914-- [host=] LOG:  database system is ready to accept read only connections
2014-12-29 00:36:55 PST-929-- [host=] LOG:  started streaming WAL from primary at 2/A1000000 on timeline 15
2014-12-29 00:36:56 PST-919-- [host=] LOG:  redo starts at 2/A1000090

Very nice, without much time we have switched the duties of Primary and Standby servers. You can even notice the increment of the timeline IDs from logs for each promotion.

Like others all my posts are part of knowledge sharing, any comments or corrections are most welcome. :)

--Raghav

Saturday, December 13, 2014

Switchover/Switchback in Slony-I while upgrading PostgreSQL major versions 8.4.x/9.3.x

Every new release of PostgreSQL comes with a packed of exciting features. To benefit new features, database server should be upgraded. Choosing traditional upgrade paths like pg_dump/pg_restore or pg_upgrade requires a significant downtime of application. Today, if you are looking for minimum downtime upgrade path among major PostgreSQL versions with perfect rollback plan, then it will be accomplished by asynchronous Slony-I replication. Since Slony-I (know more about it here) has the capability to replicate between different PostgreSQL versions,OS and bit architectures easily, so upgrades are doable without requiring a substantial downtime. In addition, it has a consistent switchover and switchback functionality in its design.

IMO, while doing major version upgrades there should be a proper fallback plan because just in case application turn out to buggy or fail to perform well on upgraded version, then we should be able to rollback to older version immediately. Slony-I provides such functionality in the way of switchback. This post demonstrates, minimum downtime upgradation including switchover/switchback steps.

Before going to demo, one important step to be noted, earlier to PG 9.0.x version bytea datatype columns use to store data in ESCAPE format and later version its in HEX format. While performing switchback (newer version to older version), this kind of bytea format differences are not support by Slony-I hence ESCAPE format should be maintained through out upgrade duration, else you may encounter with an error:

ERROR  remoteWorkerThread_1_1: error at end of COPY IN: ERROR:  invalid input syntax for type bytea
CONTEXT:  COPY sl_log_1, line 1: "1     991380  1       100001  public  foo I       0       {id,500003,name,"A         ",b,"\\\\x41"}"
ERROR  remoteWorkerThread_1: SYNC aborted

To fix, no changes required on PG 8.4.x but on PG 9.3.5 bytea_output parameter should be set from HEX to ESCAPE as shown. We can set it at cluster-level ($PGDATA/postgresql.conf) or user-level (ALTER TABLE...SET), I have preferred to go with user-level changes.

slavedb=# alter user postgres set bytea_output to escape;
ALTER ROLE

Lets proceed with upgrade steps. Below are my two versions server details used in this demo, change it accordingly as per your server setup if you are trying:

Origin Node (Master/Primary are called as Origin)                     Subscriber Node (Slave/Secondary are called as Subscriber)
-------------------------------------------------                     ----------------------------------------------------------
Host IP     : 192.168.22.130                                          192.168.22.131
OS Version  : RHEL 6.5 64 bit                                         RHEL 6.5 64 bit 
PG Version  : 8.4.22 (5432 Port)                                      9.3.5 (5432 Port)
Slony Vers. : 2.2.2                                                   2.2.2
PG Binaries : /usr/local/pg84/bin                                     /opt/PostgreSQL/9.3/
Database    : masterdb                                                slavedb 
PK Table    : foo(id int primary key, name char(20), image bytea)     ...restore PK tables structure from Origin...

For simple understanding and easy implementation, I have divided demo in three sections

1. Compiling for Slony-I binaries against PostgreSQL versions
2. Creating Replication Scripts and executing
3. Testing Switchover/Switchback.

1. Compiling for Slony-I binaries against PostgreSQL version
Download Slony-I sources from here, and perform source installation against PostgreSQL binaries on Origin and Subscriber nodes.

On Origin Node:
# tar -xvf slony1-2.2.2.tar.bz2
# cd slony1-2.2.2
./configure --with-pgbindir=/usr/local/pg84/bin 
            --with-pglibdir=/usr/local/pg84/lib 
            --with-pgincludedir=/usr/local/pg84/include 
            --with-pgpkglibdir=/usr/local/pg84/lib/postgresql 
            --with-pgincludeserverdir=/usr/local/pg84/include/postgresql/
make 
make install

On Subscriber Node: (assuming PG 9.3.5 installed)
# tar -xvf slony1-2.2.2.tar.bz2
# cd slony1-2.2.2
./configure --with-pgconfigdir=/opt/PostgreSQL/9.3/bin 
            --with-pgbindir=/opt/PostgreSQL/9.3/bin 
            --with-pglibdir=/opt/PostgreSQL/9.3/lib 
            --with-pgincludedir=/opt/PostgreSQL/9.3/include 
            --with-pgpkglibdir=/opt/PostgreSQL/9.3/lib/postgresql 
            --with-pgincludeserverdir=/opt/PostgreSQL/9.3/include/postgresql/server/ 
            --with-pgsharedir=/opt/PostgreSQL/9.3/share
make 
make install

2. Creating Replication Scripts and executing
To setup replication, we need create few scripts that take care of replication including switchover/switchback.

1. initialize.slonik - This script holds the Origin/Subscriber nodes connection information.
2. create_set.slonik - This script holds all the Origin PK Tables that replicate to Subscriber Node.
3. subscribe_set.slonik - This script starts replicating sets data to Subscriber Node.
4. switchover.slonik - This script helps to move control from Origin to Subscriber.
5. switchback.slonik - This script helps to fallback control from Subscriber to Origin.

Finally, two more startup scripts "start_OriginNode.sh" and "start_SubscriberNode.sh" that starts slon processes according to the binaries compiled on Origin/Subscriber Nodes.

Download all scripts from here.

Here's the sample data on Origin Node(8.4.22) in Foo Table with a column of bytea datatype, that we will replicate it to Subscriber Node(9.3.5) with the help of scripts created.

masterdb=# select * from foo;
 id |         name         | image
----+----------------------+-------
  1 | Raghav               | test1
  2 | Rao                  | test2
  3 | Rags                 | test3
(3 rows)

Lets call the scripts one by one to setup replication. REMEMBER ALL SLONIK SCRIPT SHOULD BE EXECUTED ON ORIGIN NODE ONLY, EXCEPT "start_OriginNode.sh" AND "start_SubscriberNode.sh" THAT SHOULD BE EXECUTED INDIVIDUALLY.

-bash-4.1$ slonik initalize.slonik
-bash-4.1$ slonik create_set.slonik
create_set.slonik:13: Set 1 ...created
create_set.slonik:16: PKey table *** public.foo *** added.
-bash-4.1$ sh start_OriginNode.sh      
-bash-4.1$ sh start_SubscriberNode.sh  //ON SUBSCRIBER NODE   
-bash-4.1$ slonik subscribe_set.slonik

After successful execution of above script, you can notice the data on Origin(masterdb) has replicated to Subscriber(slavedb). Also not allowing any DML operation on Subscriber node:

slavedb=# select * from foo;
 id |         name         | image
----+----------------------+-------
  1 | Raghav               | test1
  2 | Rao                  | test2
  3 | Rags                 | test3
(3 rows)

slavedb=# insert into foo values (4,'PG-Experts','Image2');
ERROR:  Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0

Cool... We have moved data to newer version of PostgreSQL 9.3.5. At this stage if you feel all data have replicated to Subscriber Node, then you can do switchover.

3. Testing Switchover/Switchback.

Let's switchover to latest version with the script and try inserting data on Subscriber/Origin Nodes.

-bash-4.1$ slonik switchover.slonik
switchover.slonik:8: Set 1 has been moved from Node 1 to Node 2

slavedb=# insert into foo values (4,'PG-Experts','Image2');
INSERT 0 1

masterdb=# select * from foo ;
 id |         name         | image
----+----------------------+-------
  1 | Raghav               | test1
  2 | Rao                  | test2
  3 | Rags                 | test3
  4 | PG-Experts           | Image2
(4 rows)

masterdb=# insert into foo values (5,'PG-Experts','Image3');
ERROR:  Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0

Perfect... This is what we are looking, now slavedb(Subscriber Node) running PG 9.3.5 version accepting data and masterdb(Origin Node) receiving the slavedb data. Also its rejecting DMLs executed on masterdb.

Slony-I Logs shows the origin/subscriber node id movements at the time of Switchover:

2014-12-12 04:55:06 PST CONFIG moveSet: set_id=1 old_origin=1 new_origin=2
2014-12-12 04:55:06 PST CONFIG storeListen: li_origin=1 li_receiver=2 li_provider=1
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: update provider configuration
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: helper thread for provider 1 terminated
2014-12-12 04:55:06 PST CONFIG remoteWorkerThread_1: disconnecting from data provider 1
...
...
2014-12-12 04:55:11 PST INFO   start processing ACCEPT_SET
2014-12-12 04:55:11 PST INFO   ACCEPT: set=1
2014-12-12 04:55:11 PST INFO   ACCEPT: old origin=1
2014-12-12 04:55:11 PST INFO   ACCEPT: new origin=2
2014-12-12 04:55:11 PST INFO   ACCEPT: move set seq=5000006393
2014-12-12 04:55:11 PST INFO   got parms ACCEPT_SET

If you encounter any issues at this stage, you can switchback to older version. After switchback you can continue with Older version until your application or other issues fixed. This is the perfect rollback plan without wasting much of time in case of issues after switchover..

-bash-4.1$ slonik switchback.slonik
switchback.slonik:8: Set 1 has been moved from Node 2 to Node 1

slavedb=# insert into foo values (5,'PG-Experts','Image3');
ERROR:  Slony-I: Table foo is replicated and cannot be modified on a subscriber node - role=0

masterdb=# insert into foo values (5,'PG-Experts','Image3');
INSERT 0 1

slavedb=# select * from foo ;
 id |         name         | image
----+----------------------+-------
  1 | Raghav               | test1
  2 | Rao                  | test2
  3 | Rags                 | test3
  4 | PG-Experts           | Image2
  5 | PG-Experts           | Image3
(5 rows)

Very Nice...!!! Is this not the exact rollback with minimum downtime ? Yes, its a perfect switching between nodes without missing a transaction.

Logs showing the switchback from Subscriber to Origin Node:

2014-12-12 04:58:45 PST CONFIG moveSet: set_id=1 old_origin=2 new_origin=1
2014-12-12 04:58:45 PST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: update provider configuration
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: helper thread for provider 2 terminated
2014-12-12 04:58:45 PST CONFIG remoteWorkerThread_2: disconnecting from data provider 2
2014-12-12 04:58:46 PST CONFIG storeListen: li_origin=2 li_receiver=1 li_provider=2
...
...
2014-12-12 04:58:47 PST INFO   start processing ACCEPT_SET
2014-12-12 04:58:47 PST INFO   ACCEPT: set=1
2014-12-12 04:58:47 PST INFO   ACCEPT: old origin=2
2014-12-12 04:58:47 PST INFO   ACCEPT: new origin=1
2014-12-12 04:58:47 PST INFO   ACCEPT: move set seq=5000006403
2014-12-12 04:58:47 PST INFO   got parms ACCEPT_SET
2014-12-12 04:58:48 PST CONFIG moveSet: set_id=1 old_origin=2 new_origin=1

By this time you might have noticed, none of the transactions are lost during switching operation between PostgreSQL versions. Only downtime might be your application to start/stop for connecting to Origin and Subscriber nodes, but whereas Origin/Subscriber nodes are never taken down they are just up and running.

Remember, the method shown here is not only useful for upgrades however its the same method in Slony-I for moving between Nodes.

Thank you for your patience :). Hope this post helps you to upgrade PostgreSQL with minimum downtime using Slony-I including proper rollback plan.

--Raghav

Thursday, November 20, 2014

How to replicate only INSERTs not DELETEs/UPDATEs on Slony Slave Node ?

In the first place, we need to know about why such requirement needed. IMO, its absolutely a business necessity to maintain some kind of historical data on the target database(Slave Node). Especially, out of multiple slave nodes one of the slave node to retain the very first form of the data when it initially written into the database.

To accomplish this requirement, we should come up with some kind of filters like TRIGGERs/RULEs on Slave Node so that it avoids relaying DELETE and UPDATE statements. Since we are dealing with Slony-I, it doesn't have such built-in mechanism to filter DML's while replaying them on slave node though it has gathered all events from the Master node.(AFAIK Mysql,Oracle,SQL Server do support filters).

To get this straight, traditional Slony-I way maintains uniqueness of rows across all the nodes with its core concept of tables must have primary keys. In such architecture design, its hard to exclude DELETE/UPDATE statements, take an example of primary key column "orderid" of "orders" table has a first INSERT statement with value 100 and its been replicated as first form on filtered Slave Node. Later a DELETE statement executed for "orderid=100" and deleted row, now if any INSERT or UPDATE statement attempts to use the "orderid=100" then Slave node hits with duplicate key violation and it simple break the replication.

ERROR:  duplicate key value violates unique constraint "reptest_pkey"
DETAIL:  Key (id)=(2) already exists.
CONTEXT:  SQL statement "INSERT INTO "public"."reptest" ("id", "name") VALUES ($1, $2);"
.....
or
....
CONTEXT:  SQL statement "UPDATE ONLY "public"."reptest" SET "id" = $1 WHERE "id" = $2;"
2014-11-17 23:18:53 PST ERROR  remoteWorkerThread_1: SYNC aborted

Thus, implementing rule not an issue yet one should be extremely cautious when its in place. In reality however applying these filters on Slony-I slave node are very fragile, especially application/developer should always keep this in mind any duplicate entry of row by INSERT OR UPDATE could break the replication.

As DML rules not possible alone with Slony-I, we can make use of PostgreSQL CREATE RULE...ON DELETE/ON UPDATE DO INSTEAD NOTHING and apply that RULE on table by ALTER TABLE...ENABLE REPLICA RULE to void DELETE/UPDATE statement. Using this option takes a lot of discipline, so you can ensure your application and staff members really follow these rules.

To continue with steps, you should have slony setup, on the off chance that you need to setup up you can refer to my past post here.

Steps on Slave Node (Master DB: postgres, Slave DB: demo, Port: 5432):

1. Stop slon daemons
2. Create ON DELETE and ON UPDATE DO INSTEAD NOTHING rule

demo=# CREATE RULE void_delete AS ON DELETE TO reptest DO INSTEAD NOTHING;
CREATE RULE
demo=# CREATE RULE void_update AS ON UPDATE TO reptest DO INSTEAD NOTHING;
CREATE RULE

3. Apply RULE on table

demo=# ALTER TABLE reptest ENABLE REPLICA RULE void_delete;
ALTER TABLE
demo=# ALTER TABLE reptest ENABLE REPLICA RULE void_update ;
ALTER TABLE

4. Start Slon daemons

Now, you can notice below that UPDATE/DELETE has no impact on Slave Node:

postgres=# delete from reptest where id =2;
DELETE 1
postgres=# update reptest set id=2 where id=1;
UPDATE 1

--On Master
postgres=# select * from reptest ;
 id |    name
----+------------
  2 | A
(1 row)

--On Slave
demo=# select * from reptest ;
 id |    name
----+------------
  1 | A
  2 | C
(2 rows)

If INSERT statement executed with value 1 then it will break the replication. Be noted...!!

Remember, there other ways to full-fill this request like dblinks, Triggers like BEFORE DELETE...return NULL value from function, but I believe the most efficient way would be to use RULE/ENABLE REPLICA RULE when you are working with Slony replication.

By now you might have read many blogs on Logical Decoding Replication slots new feature in PostgreSQL 9.4, hope in future it might include the concept of filter DMLs on Slave.

Thank you for visiting.

--Raghav

Thursday, June 5, 2014

Utilising caching contrib's pg_prewarm and pg_hibernator in PostgreSQL 9.4.

Numerous DBA's (counting me), put questions all the time to PostgreSQL hackers/developers/architects on mailing list:

Q1. Does PG has the ability to cache/warm a relation ?
Q2. Is it possible to return to prior state of cache where it was left before shutting down the database server because of maintenance ?

In earlier releases of PostgreSQL, there in no chance of warming a relation or storing a cache states, but from PostgreSQL 9.4 onwards each of the above queries(Q1,Q2) addressed with two contrib modules pg_prewarm and pg_hibernator. Despite the very fact that they're distinctive in practicality, however the combination appears to be extremely viable and useful in future for DBA's. In short about contrib's:

pg_prewarm contrib (Author: Robert Haas), provides the capability to load a relation data into OS buffer cache or PG buffer cache. It has the functionality of first or last block number to prewarm. (Note: It has no special protection on pre-warmed data from cache eviction and also if database instance restarted then re-warm needed on the relations).

pg_hibernator contrib (Author: Gurjeet Singh), provides the capability to automatically save the list of shared buffer contents to disk on database shutdown, and automatically restores the buffers on database startup, much the same as save/restore a snapshot of shared_buffers. It make use PG 9.3 module to register "background worker process" and spawns two process "Buffer Saver", "Buffer Reader" for save/restore. Interestingly, with a little hack, pg_hibernator can also allow standby slave to start serving queries with full speed with same contents of master, will see that in a minute :).

Lastly, we need pg_buffercache module to look inside the current contents of PostgreSQL shared_buffers. This module helps to understand what percentage buffer's occupied by a relation.

Let's put all these contrib's into play and see how they serve the purpose of two questions(Q1,Q2). Am going to use a table 'foo' of size 885MB on my local VM, along with a standard pg_buffercache query.

SELECT c.relname,
       count(*) AS buffers
FROM pg_class c
INNER JOIN pg_buffercache b ON b.relfilenode=c.relfilenode AND c.relname='foo'
INNER JOIN pg_database d ON (b.reldatabase=d.oid AND d.datname=current_database())
GROUP BY c.relname
ORDER BY 2 DESC LIMIT 10;

Usage of pg_prewarm contrib and warming 'foo' table.

postgres=# create extension pg_prewarm;
CREATE EXTENSION 
postgres=# \dt+
                    List of relations
 Schema | Name | Type  |  Owner   |  Size  | Description
--------+------+-------+----------+--------+-------------
 public | foo  | table | postgres | 885 MB |
(1 row)
postgres=# select pg_prewarm('foo');
 pg_prewarm
------------
     113278
(1 row)
--pg_buffercache query output
 relname | buffers
---------+---------
 foo     |  113278
(1 row)

Very simple and straightforward usage of pg_prewarm with a output of blocks warmed in shared_buffers for relation 'foo'. From pg_buffercache query output we can evaluate it that there are 113278 (113278 * 8 / 1024 = 884MB ) buffers of 8KB block size of relation 'foo' which matches with pg_prewarm output. Here, if Postgres server restarts because of some reason, shared_buffers are empty and DBA's need to re-warm again to come back to past warm stage. For a single table, re-warm is always simple except for a group of tables its agony.

At this point, we can make use of pg_hibernator contrib, because it has the flexibility to save the shared_buffer's contents and restore it back at startup. Let's enable pg_hibernator/pg_prewarm together and run a similar exercise by simply including one step of restart and see if the cache state return back as is or not. Am not going to cover installation of pg_hibernator, because on git its very well described, however I would directly jump to implementation part and start the server with pg_hibernator.

postgres 24623     1  0 02:06 pts/4    00:00:00 /usr/local/pgpatch/pg/bin/postgres -D /usr/local/pgpatch/pg/data_10407
postgres 24627 24623  0 02:06 ?        00:00:00 postgres: logger process
postgres 24631 24623  0 02:06 ?        00:00:00 postgres: checkpointer process
postgres 24632 24623  0 02:06 ?        00:00:00 postgres: writer process
postgres 24633 24623  0 02:06 ?        00:00:00 postgres: wal writer process
postgres 24634 24623  0 02:06 ?        00:00:00 postgres: autovacuum launcher process
postgres 24635 24623  0 02:06 ?        00:00:00 postgres: archiver process
postgres 24636 24623  0 02:06 ?        00:00:00 postgres: stats collector process
postgres 24637 24623  0 02:06 ?        00:00:00 postgres: bgworker: Buffer Saver
postgres 24638 24623 11 02:06 ?        00:00:01 postgres: bgworker: Block Reader 2

In database server logs at startup time:

-bash-4.1$ more postgresql-2014-06-02_083033.log
LOG:  database system was shut down at 2014-06-02 08:13:00 PDT
LOG:  starting background worker process "Buffer Saver"
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started

Since, its first time pg_hibernator in play, you can see two process and also logs with some information regarding start of "Buffer Saver". Now, lets prewarm relation 'foo' and restart the server, later check the buffer status whether pg_hibernator filled the buffer back where it was left.

-bash-4.1$ psql -p 10407
psql (9.4beta1)
Type "help" for help.

postgres=# select pg_prewarm('foo');
 pg_prewarm
------------
     113278
(1 row)

--pg_buffercache query output
 relname | buffers
---------+---------
 foo     |  113278
(1 row)
postgres=# \q

-bash-4.1$ /usr/local/pgpatch/pg/bin/pg_ctl -D /usr/local/pgpatch/pg/data_10407 stop
waiting for server to shut down.... done
server stopped

-bash-4.1$ ls -l $PGDATA/pg_hibernator/
total 12
-rw------- 1 postgres postgres  160 Jun  3 01:41 1.global.save
-rw------- 1 postgres postgres  915 Jun  3 01:41 2.postgres.save  

-bash-4.1$ /usr/local/pgpatch/pg/bin/pg_ctl -D /usr/local/pgpatch/pg/data_10407 start
server starting

We have restarted the database server, lets examine the logs

-bash-4.1$ more postgresql-2014-06-03_020601.log
LOG:  database system was shut down at 2014-06-03 02:05:57 PDT
LOG:  starting background worker process "Buffer Saver"
LOG:  database system is ready to accept connections
LOG:  autovacuum launcher started
LOG:  registering background worker "Block Reader 2"
LOG:  starting background worker process "Block Reader 2"
LOG:  Block Reader 2: restored 113433 blocks
LOG:  Block Reader 2: all blocks read successfully
LOG:  worker process: Block Reader 2 (PID 24638) exited with exit code 1
LOG:  unregistering background worker "Block Reader 2"
LOG:  registering background worker "Block Reader 1"
LOG:  starting background worker process "Block Reader 1"
LOG:  Block Reader 1: restored 20 blocks
LOG:  Block Reader 1: all blocks read successfully
LOG:  worker process: Block Reader 1 (PID 24664) exited with exit code 1
LOG:  unregistering background worker "Block Reader 1"

So, "Buffer Reader" has restored blocks of 113433 + 20, out of which 113278 belongs to relation 'foo'. Great, lets connect and see.

-bash-4.1$ psql -p 10407
psql (9.4beta1)
Type "help" for help.

--pg_buffercache query output
 relname | buffers
---------+---------
 foo     |  113278
(1 row)

Cool... pg_hibernator has brought back the cache warmed state without DBA's interference.

Another good thing about pg_hibernator, a newly created standby can have the same shared buffer contents as the master, so that the standby can start serving queries at full speed. To do this exercise, while taking a backup of $PGDATA directory, I have passed SIGTERM to "Buffer Saver" process so that it writes the current state shared_buffers content to disk($PGDATA/pg_hibernator directory) and then followed with standby setup.

postgres 24637 24623  0 02:06 ?        00:00:00 postgres: bgworker: Buffer Saver
postgres 24653 15179  0 02:06 ?        00:00:01 postgres: wal receiver process   streaming 1/6A000A10
postgres 24654 24623  0 02:06 ?        00:00:00 postgres: wal sender process postgres ::1(65011) streaming 1/6A000A10

After setup, my slave started with same content of primary

-bash-4.1$ psql -p 10477
psql (9.4beta1)
Type "help" for help.

postgres=# select pg_is_in_recovery();
 pg_is_in_recovery
-------------------
 t
(1 row)

--pg_buffercache query output
 relname | buffers
---------+---------
 foo     |  113278
(1 row)

Thanks to both the authors for a wonderful extension on caching.

--Raghav

Monday, May 19, 2014

Few areas of improvements in PostgreSQL 9.4

With the beta release of PostgreSQL 9.4, DBA's have been given some cool features like pg_prewarm, JSONB, ALTER SYSTEM, Replication Slots and many more. Out of numerous architectural level features presented in this version, likewise there are other few minor enhancements those I have attempted to cover in this blog.

pg_stat_activity view included two new columns (backend_xid/backend_min) to track the transaction id information. pg_stat_activity.backend_xid column covers the id of top-level transaction currently begin executed and pg_stat_activity.backend_xmin column covers the information of minimal running XID. Check out below two query outputs executed in two different situations, first one show the hierarchal information of the transaction id in backend_xmin column of sessions trying to acquire lock(table/Row) on same row, whereas other one just an independent transactions happening without disturbing the same row. This kind of a information help user to know more about the transactions when waiting queries found in the database.

postgres=# select pid,backend_xid,backend_xmin,query from pg_stat_activity where pid<>pg_backend_pid();
  pid  | backend_xid | backend_xmin |           query
-------+-------------+--------------+---------------------------
 22351 |        1905 |         1904 | insert into a values (1);
   785 |        1904 |              | insert into a values (1);
 12796 |             |         1904 | truncate  a;
 12905 |             |         1904 | delete from a ;

postgres=# select pid,backend_xid,backend_xmin,query from pg_stat_activity where pid<>pg_backend_pid();
  pid  | backend_xid | backend_xmin |            query
-------+-------------+--------------+-----------------------------
 22351 |             |              | insert into foo values (1);
   785 |        1900 |              | insert into foo values (1);
(2 rows)

New clauses in CREATE TABLESPACE/ALTER TABLESPACE as "with" and "move" options respectively. Similarly, meta command \db+ to give detailed information about the parameters set for a particular TABLESPACE using "with" option.

postgres=# \h create tablespace
Command:     CREATE TABLESPACE
Description: define a new tablespace
Syntax:
CREATE TABLESPACE tablespace_name
    [ OWNER user_name ]
    LOCATION 'directory'
    [ WITH ( tablespace_option = value [, ... ] ) ]

Example:

postgres=# create tablespace t1 location '/usr/local/pgpatch/pg/ts' with (seq_page_cost=1,random_page_cost=3); 
CREATE TABLESPACE

postgres=# \db+
                                                    List of tablespaces
    Name    |  Owner   |         Location         | Access privileges |               Options                | Description
------------+----------+--------------------------+-------------------+--------------------------------------+-------------
 pg_default | postgres |                          |                   |                                      |
 pg_global  | postgres |                          |                   |                                      |
 t1         | postgres | /usr/local/pgpatch/pg/ts |                   | {seq_page_cost=1,random_page_cost=3} |
(3 rows)

New system functions to give information on type regclass,regproc,regprocedure,regoper,regoperator and regtype. For all the types, new functions are to_regclass(), to_regproc(), to_regprocedure(), to_regoper(), to_regoperator() and to_regtype().

Example:
select to_regclass('pg_catalog.pg_class'),to_regtype('pg_catalog.int4'),to_regprocedure('pg_catalog.abs(numeric)'),to_regproc('pg_catalog.now'),to_regoper('pg_catalog.||/');
 to_regclass | to_regtype | to_regprocedure | to_regproc | to_regoper
-------------+------------+-----------------+------------+------------
 pg_class    | integer    | abs(numeric)    | now        | ||/
(1 row)

New "-g" option in command line utility CREATEUSER to specify role membership.

-bash-4.1$ createuser -g rw -p 10407 r1 
-bash-4.1$ psql -p 10407
psql (9.4beta1) Type "help" for help.

postgres=# \dg
                             List of roles
 Role name |                   Attributes                   | Member of
-----------+------------------------------------------------+-----------
 postgres  | Superuser, Create role, Create DB, Replication | {}
 r1        |                                                | {rw}

pg_stat_all_tables view, has a new column "n_mod_since_analyze", which highlights on the number of rows has been modified since the table was last analyzed. Below outputs brief about the "n_mod_since_analyze" column changes, first time manual analyze executed and after sometime autovacuum invoked on the table, in this duration we can figure out how many rows effected with different catalog update calls.

postgres=# analyze a;
ANALYZE
postgres=# select relname,last_autoanalyze,last_analyze,n_mod_since_analyze from pg_stat_all_tables where relname='a';
 relname | last_autoanalyze |         last_analyze          | n_mod_since_analyze
---------+------------------+-------------------------------+---------------------
 a       |                  | 2014-05-03 02:09:51.002006-07 |                   0
(1 row)

postgres=# insert into a values(generate_series(1,100));
INSERT 0 100
postgres=# select relname,last_autoanalyze,last_analyze,n_mod_since_analyze from pg_stat_all_tables where relname='a';
 relname | last_autoanalyze |         last_analyze          | n_mod_since_analyze
---------+------------------+-------------------------------+---------------------
 a       |                  | 2014-05-03 02:09:51.002006-07 |                 100
(1 row)

postgres=# truncate a;
TRUNCATE TABLE
postgres=# select relname,last_autoanalyze,last_analyze,n_mod_since_analyze from pg_stat_all_tables where relname='a';
 relname | last_autoanalyze |         last_analyze          | n_mod_since_analyze
---------+------------------+-------------------------------+---------------------
 a       |                  | 2014-05-03 02:09:51.002006-07 |                 100
(1 row)

postgres=# select relname,last_autoanalyze,last_analyze,n_mod_since_analyze from pg_stat_all_tables where relname='a';
 relname |       last_autoanalyze        |         last_analyze          | n_mod_since_analyze
---------+-------------------------------+-------------------------------+---------------------
 a       | 2014-05-03 02:14:21.806912-07 | 2014-05-03 02:09:51.002006-07 |                   0
(1 row)

pg_stat_archiver, its a new view introduced to track all WALs generated and it also captures failed WAL's count. If you are from Oracle then this one is like "ARCHIVE LOG LIST".

postgres=# select * from pg_stat_archiver ;
-[ RECORD 1 ]------+------------------------------
 archived_count     | 167
 last_archived_wal  | 00000001000000000000009B
 last_archived_time | 2014-05-02 20:42:36.230998-07
 failed_count       | 75
 last_failed_wal    | 000000010000000000000012
 last_failed_time   | 2014-05-01 12:09:57.087644-07
 stats_reset        | 2014-04-30 19:02:01.288521-07

pg_stat_statements, extension module has a new column queryid to track the internal hash code, computed from the statement's parse tree.

postgres=# select queryid,query from pg_stat_statements;
  queryid   |               query
------------+------------------------------------
 1144716789 | select * from pg_stat_statements ;
(1 row)

Thank you.

--Raghav

Tuesday, April 29, 2014

"WARNING: Mismatch found between sl_table and pg_class." in Slony-I

WARNING: Mismatch found between sl_table and pg_class. Slonik command REPAIR CONFIG may be useful to rectify this.
2014-04-26 07:32:54 PDT FATAL slon_node_health_check() returned false - fatal health problem!
REPAIR CONFIG may be helpful to rectify this problem

You see this WARNING message in logs and immediate stop of replication, if Slony has observed a mismatch of pg_class.oid and sl_table.tabreloid of a replicating table in a node. Because, by architecture slony holds all replicating objects OID information in its catalogs captured at configure time from pg_class.oid.

In which case pg_class.oid != sl_table.tabreloid ?

Most cases, a node moved its place using pg_dump/pg_restore by causing objects OID to change.

To mimic the above WARNING message, I have used two node replication setup between two database on same cluster[5432] for few tables. (Refer here on how to setup Slony replication). Here's the current OID information on slave node(demo database) for one of the object 'dtest':

demo=# select oid,relfilenode,relname from pg_class where relname='dtest';
  oid  | relfilenode | relname
-------+-------------+---------
 26119 |       26119 | detest
(1 row)
demo=# select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26119 | dtest
(1 row)

Ok, 'dtest' OID 26119 stored in slony catalog in sl_table.tabreloid.(Slony schema _rf). Take the logical backup and restore of same demo database simply to change the object OID like below: (Remember, Slon process are stopped at this moment)

-bash-4.1$ pg_dump -Fc -p 5432 -U postgres demo >/tmp/demo93.dmp
-bash-4.1$ psql -c "alter database demo rename to demo_bk;"
-bash-4.1$ psql -c "create database demo;"
-bash-4.1$ pg_restore -Fc -p 5432 -U postgres -d demo /tmp/demo93.dmp
-bash-4.1$ psql -c "select oid,relfilenode,relname from pg_class where relname='dtest';"
  oid  | relfilenode | relname
-------+-------------+---------
 26640 |       26640 | dtest
(1 row)
-bash-4.1$ psql -c "select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';"
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26119 | dtest
(1 row)

Now, pg_class.oid of 'dtest' has changed to 26640 whereas sl_table.tab_reloid still reflects the old OID 26119. At this stage if we start slon process it essentially stops with WARNING message on mismatch of OID by running a query pg_class.oid = sl_table.tabreloid. On returning false result it won't move ahead until its fixed. We can also call the function slon_node_health_check() explicitly for verification :

demo=# select _rf.slon_node_health_check();
WARNING:  table [id,nsp,name]=[1,a,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[2,dtest,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[3,movepage,public] - sl_table does not match pg_class/pg_namespace
WARNING:  Mismatch found between sl_table and pg_class.  Slonik command REPAIR CONFIG may be useful to rectify this.
 slon_node_health_check
------------------------
 f
(1 row)

We can fix this in two ways.

Using Slonik command line utility with preamble script REPAIR CONFIG or
Using Slony catalog function updatereloid() within psql terminal.

Method 1: Create preamble script as below and execute with slonik command. I would be using second method, its just for reference.

demo=# \o /tmp/repair_conf.slonik
demo=# select 'REPAIR CONFIG ( SET ID = '||set_id||', EVENT NODE = 1 );' FROM _rf.sl_set;
demo=# \o

Add nodes information at the beginning of the file "/tmp/repair_conf.slonik"

cluster name = rf;
node 1 admin conninfo = 'host=localhost dbname=postgres user=postgres port=5432 password=postgres';
node 2 admin conninfo = 'host=localhost dbname=demo  user=postgres port=5432 password=postgres';

 REPAIR CONFIG ( SET ID = 1, EVENT NODE = 2 );
 REPAIR CONFIG ( SET ID = 2, EVENT NODE = 2 );
 REPAIR CONFIG ( SET ID = 3, EVENT NODE = 2 );

-bash-4.1$ slonik /tmp/repair_conf.slonik

Method 2: Pass the table-set id and node information to a function:

demo=# select _rf.updatereloid(tab_set,2) from _rf.sl_table ;   
 updatereloid
--------------
            1
            1
            1
(3 rows)

Cool, lets check the OID information now on slave node (demo database) from pg_class and _slonycatalog.sl_table

-bash-4.1$  psql -d demo -c "select oid,relfilenode,relname from pg_class where relname='dtest';"
  oid  | relfilenode | relname
-------+-------------+---------
 26119 |       26119 | dtest
(1 row)

-bash-4.1$ psql -d demo -c "select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';"
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26119 | dtest
(1 row)

After the update, slony will begin syncing without any issues.
Thanks to Slony-I team.

--Raghav

Tuesday, April 22, 2014

Faster statistics update after upgrade using "vacuumdb --analyze-in-stages" in PostgreSQL 9.4

As all of you know after upgrading the database server from one version to other major version, ANALYZE command should be executed to update the pg_catalogs on newly populated data. On a huge upgraded database, its a challenge for the application to gain its performance back without updating the statistics. In PostgreSQL 9.4, an option "vacuumdb --analyze-in-stages" will make this work faster to produce usable statistics required by the optimizer. It runs in three stages with different configuration settings(default_statistics_target/vacuum_cost_delay) to analyze the database.

If the database followed any of the up-gradation procedure like pg_dump/pg_restore or pg_upgrade, then its recommended to use "vacuumdb --analyze-in-stages"

Sample output:

bash-4.1$ /usr/local/pgpatch/pg/bin/vacuumdb -p 9999 --analyze-in-stages -d tester
Generating minimal optimizer statistics (1 target)
Generating medium optimizer statistics (10 targets)
Generating default (full) optimizer statistics

Sunday, April 6, 2014

While performing PITR, would it be possible to Pause/Resume in PostgreSQL ?

Yes, truly possible and handled smartly by PostgreSQL. To demo this, first I need to take after the standard technique of Point in Time Recovery in PostgreSQL. Various Books/Articles/Blogs demoed extremely well by extraordinary authors, hence am not going into details of how to do it, however, heading off directly to the subject i.e., how to pause while recovering with same technique. Arguably, I put forth a mathematical expression out of PITR as "PITR = (Last Filesystem Backup(LFB) + WAL Archives generated after LFB + Un-Archived WAL's in current $PGDATA/pg_xlogs)". For better understanding, I have put this into graph, in light of the fact that it clear the thought more: (Sorry, this blog is bit long, unknowingly it happened while going in details of the concept)

PITR steps,which am going to follow with slight changes that I talk about soon:

Step 1. Restore the most recent File System-level backup(FSB) to any location where recovery is planned to perform.
Step 2. If FSB is tar,then untar it, and clean the pg_xlog directory leaving archive_status. If backup has excluded this directory, then create the empty pg_xlog directory in FSB.
Step 3. Copy un-archived WAL's from crashed cluster $PGDATA/pg_xlog into $FSB/pg_xlog (Step 2)
Step 4. Delete the postmaster.pid from FSB directory.
Step 5. Create recovery.conf file in FSB directory.
Step 6. Start the cluster (FSB).

We should put question, when pausing the recovery required ?. Maybe, to prevent multiple base restorations or roll-forward recovery but check in between or rollback a particular tables data or interest to see how far it has recovered :). Remember, pause in recovery means, its allowing to connect while recovering. To outline this, I have reproduced a situation in chart of a particular table rows improvement until to a mishap.

From above diagram, its agreeable a DEMO table rows were 10,00,000 when file system-level backup($PGDATA) taken and 40,00,000 rows before crash. In my local VM, I have made the situation on groundwork of TIME instead of date.

Pre-Requisite:
1. File System-Level Backup when DEMO tables having 10,00,000 rows.
2. From that point forward, WAL Archives before crash where DEMO table having 40,00,000 rows.
3. WAL Archives Location: /opt/PostgreSQL/9.3/archives.
4. Data Directory : /opt/PostgreSQL/9.3/data (PGDATA)
5. Backup Location : /opt/PostgreSQL/9.3/backups

Keep in mind, working with pause recovery need compulsory changes on main cluster($PGDATA) "wal_level" set to "hot_standby" and on recovery cluster(file system-level backup) "hot_standby" set to "ON". I have made these changes to main cluster, restarted the cluster to take effect and initiated the backup. If you don't mind make a note it simply a demo, so my WAL archives might not be gigantic number's as they are in few numbers. I have listed WAL archives too here, which were generated from the time of backup to crash.

-bash-4.1$ psql -c "select count(*), now() from demo;"
  count  |              now
---------+-------------------------------
 1000000 | 2014-04-04 15:06:04.036928-07
(1 row)

-bash-4.1$ pg_basebackup -D /opt/PostgreSQL/9.3/backup/data_pitr          -- I have my $PGDATA, $PGUSER, $PGPORT set, so its a straight command in my case
NOTICE:  pg_stop_backup complete, all required WAL segments have been archived

Current state of WAL archives and $PGDATA/pg_xlog

-bash-4.1$ ls -lrth /opt/PostgreSQL/9.3/archives
-rw------- 1 postgres postgres 16M Apr  4 16:01 00000001000000000000001C
-rw------- 1 postgres postgres 16M Apr  4 16:01 00000001000000000000001D
-rw------- 1 postgres postgres 289 Apr  4 16:06 00000001000000000000001E.000000C8.backup
-rw------- 1 postgres postgres 16M Apr  4 16:06 00000001000000000000001E

-bash-4.1$ ls -lrth /opt/PostgreSQL/9.3/data/pg_xlog | tail -4
-rw------- 1 postgres postgres 289 Apr  4 16:06 00000001000000000000001E.000000C8.backup
-rw------- 1 postgres postgres 16M Apr  4 16:06 00000001000000000000001E
-rw------- 1 postgres postgres 16M Apr  4 16:06 00000001000000000000001F
drwx------ 2 postgres postgres 4.0K Apr  4 16:13 archive_status

Fine now, we have the backup copy, lets INSERT few records in three parts by noting the time, so it will help to pause recovery and additionally see the WAL's produced from the time of FSB.

-bash-4.1$ psql -c "insert into demo values (generate_series(1,1000000));"
INSERT 0 1000000
-bash-4.1$ psql -c "select count(*),now() from demo;"
  count  |              now
---------+-------------------------------
 2000000 | 2014-04-04 16:06:34.941615-07
(1 row)
-bash-4.1$ psql -c "insert into demo values (generate_series(1,1000000));"
INSERT 0 1000000
-bash-4.1$ psql -c "select count(*),now() from demo;"
  count  |              now
---------+-------------------------------
 3000000 | 2014-04-04 16:10:31.136725-07
(1 row)
-bash-4.1$ psql -c "insert into demo values (generate_series(1,1000000));"
INSERT 0 1000000
-bash-4.1$ psql -c "select count(*),now() from demo;"
  count  |              now
---------+-------------------------------
 4000000 | 2014-04-04 16:13:00.136725-07
(1 row)

Check the number of WAL's produced during the INSERT.

-bash-4.1$ ls -lrth /opt/PostgreSQL/9.3/archives
-rw------- 1 postgres postgres 289 Apr  4 16:06 00000001000000000000001E.000000C8.backup
-rw------- 1 postgres postgres 16M Apr  4 16:06 00000001000000000000001E
-rw------- 1 postgres postgres 16M Apr  4 16:06 00000001000000000000001F
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000020
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000021
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000022
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000023
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000024
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000025
-rw------- 1 postgres postgres 16M Apr  4 16:06 000000010000000000000026
-rw------- 1 postgres postgres 16M Apr  4 16:10 000000010000000000000027
-rw------- 1 postgres postgres 16M Apr  4 16:10 000000010000000000000028
-rw------- 1 postgres postgres 16M Apr  4 16:10 000000010000000000000029
-rw------- 1 postgres postgres 16M Apr  4 16:10 00000001000000000000002A
-rw------- 1 postgres postgres 16M Apr  4 16:13 00000001000000000000002B

Assume at this point mishap happened and you have to do recovery using FSB + WAL archives + Unarchived WAL's(if any). During recovery, I want to pause three time to see each recovery of 20,00,000, 30,00,000 and 40,00,000 rows of DEMO table by connecting to the database in READ-ONLY mode. For each resume of recovery need a restart of recovery cluster by bumping to new timeline in recovery.conf/recovery_target_time. Also,in $FSB/postgresql.conf, we have to set hot_standby=on. Here's my recovery.conf file:

-bash-4.1$ more recovery.conf
pause_at_recovery_target = true
#recovery_target_time = '2014-04-04 16:06:34'   # For 2 lakh records
#recovery_target_time = '2014-04-04 16:10:31'   # For 3 lakh records
#recovery_target_time = '2014-04-04 16:13:00'   # For 4 lakh records
restore_command = 'cp /opt/PostgreSQL/9.3/archives/%f %p'

Let's start recovery for 20,00,000 records:

-bash-4.1$ /opt/PostgreSQL/9.3/bin/pg_ctl -D /opt/PostgreSQL/9.3/data_pitr/ start
server starting

Now in logs:

-bash-4.1$ more postgresql-2014-04-04_162524.log
2014-04-04 16:25:24 PDT-24187---[] LOG:  starting point-in-time recovery to 2014-02-06 18:48:56-08
2014-04-04 16:25:24 PDT-24187---[] LOG:  restored log file "00000001000000000000001E" from archive
2014-04-04 16:25:24 PDT-24187---[] LOG:  redo starts at 0/1E0000C8
2014-04-04 16:25:24 PDT-24187---[] LOG:  consistent recovery state reached at 0/1E000190
2014-04-04 16:25:24 PDT-24185---[] LOG:  database system is ready to accept read only connections
2014-04-04 16:25:24 PDT-24187---[] LOG:  restored log file "00000001000000000000001F" from archive
2014-04-04 16:25:24 PDT-24187---[] LOG:  restored log file "000000010000000000000020" from archive
2014-04-04 16:25:25 PDT-24187---[] LOG:  restored log file "000000010000000000000021" from archive
2014-04-04 16:25:25 PDT-24187---[] LOG:  restored log file "000000010000000000000022" from archive
2014-04-04 16:25:25 PDT-24187---[] LOG:  recovery stopping before commit of transaction 1833, time 2014-04-04 16:06:23.893487-07
2014-04-04 16:25:25 PDT-24187---[] LOG:  recovery has paused
2014-04-04 16:25:25 PDT-24187---[] HINT:  Execute pg_xlog_replay_resume() to continue

Cool, see in logs it has paused and a smart HINT asking to resume. Here, if the recovery was satisfactory, you can resume it by calling "select pg_xlog_replay_resume();"(You can check it out). Lets not resume now, however check the number of rows recovered by connecting to the server.

-bash-4.1$ psql -c "select count(*),pg_is_in_recovery() from demo;"
  count  | pg_is_in_recovery
---------+-------------------
 2000000 | t
(1 row)

Good, it has reached to the point and paused where I requested. Lets move one more step ahead for recovering 30,00,000 rows. Now, set the next timeline in recovery.conf/recovery_target_time and restart the cluster.

2014-04-04 16:28:40 PDT-24409---[] LOG:  restored log file "00000001000000000000002A" from archive
2014-04-04 16:28:40 PDT-24409---[] LOG:  recovery stopping before commit of transaction 1836, time 2014-04-04 16:10:40.141175-07
2014-04-04 16:28:40 PDT-24409---[] LOG:  recovery has paused
2014-04-04 16:28:40 PDT-24409---[] HINT:  Execute pg_xlog_replay_resume() to continue.

-bash-4.1$ psql -c "select count(*),pg_is_in_recovery() from demo;"
  count  | pg_is_in_recovery
---------+-------------------
 3000000 | t
(1 row)

Nice..., let's give the last attempt to pause at 40,00,000 rows.

2014-04-04 20:09:07 PDT-4723---[] LOG:  restored log file "00000001000000000000002B" from archive
cp: cannot stat `/opt/PostgreSQL/9.3/archives/00000001000000000000002C': No such file or directory
2014-04-04 20:09:07 PDT-4723---[] LOG:  redo done at 0/2B0059A0
2014-04-04 20:09:07 PDT-4723---[] LOG:  last completed transaction was at log time 2014-04-04 16:11:12.264512-07
2014-04-04 20:09:07 PDT-4723---[] LOG:  restored log file "00000001000000000000002B" from archive
2014-04-04 20:09:07 PDT-4723---[] LOG:  restored log file "00000002.history" from archive
2014-04-04 20:09:07 PDT-4723---[] LOG:  restored log file "00000003.history" from archive
2014-04-04 20:09:07 PDT-4723---[] LOG:  restored log file "00000004.history" from archive
cp: cannot stat `/opt/PostgreSQL/9.3/archives/00000005.history': No such file or directory
2014-04-04 20:09:07 PDT-4723---[] LOG:  selected new timeline ID: 5
cp: cannot stat `/opt/PostgreSQL/9.3/archives/00000001.history': No such file or directory
2014-04-04 20:09:07 PDT-4723---[] LOG:  archive recovery complete
2014-04-04 20:09:08 PDT-4721---[] LOG:  database system is ready to accept connections
2014-04-04 20:09:08 PDT-4764---[] LOG:  autovacuum launcher started

-bash-4.1$ psql -c "select count(*),pg_is_in_recovery() from demo;"
  count  | pg_is_in_recovery
---------+-------------------
 4000000 | f
(1 row)

Oops, what happened, why it has not paused and what its complaining?. Keep in mind, if no WAL archives present at the time of recovery_target_time then it won't pause and expect as it has arrived to the last point and open the database for READ/WRITE. In logs, without much stretch make out it was hunting down for file "00000001000000000000002C" which's not available, because at that time cluster has crashed. Some may not acknowledge this behaviour but its fact and makes sense when no WAL archives present then there's no reason for pausing the recovery. If at all required to pause even after no WAL archives, then make use of standby_mode='on' (HOT_STANDBY), in this method it won't come out of recovery but wait for WAL Archives.

Hope it was useful.

--Raghav