Tuesday, April 29, 2014

"WARNING: Mismatch found between sl_table and pg_class." in Slony-I

WARNING: Mismatch found between sl_table and pg_class. Slonik command REPAIR CONFIG may be useful to rectify this.
2014-04-26 07:32:54 PDT FATAL slon_node_health_check() returned false - fatal health problem!
REPAIR CONFIG may be helpful to rectify this problem
You see this WARNING in the logs, followed by an immediate stop of replication, when Slony detects a mismatch between pg_class.oid and sl_table.tab_reloid for a replicated table on a node. By design, Slony stores the OID of every replicated object in its own catalogs, captured from pg_class.oid at configuration time.
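A quick way to spot such a mismatch yourself is to join the Slony catalog against pg_class. This is only a sketch, assuming the cluster schema is _rf as in the setup used below:

-- Replicated tables whose stored OID no longer matches pg_class
-- (assumes the Slony cluster schema is _rf)
SELECT t.tab_id,
       t.tab_nspname || '.' || t.tab_relname AS table_name,
       t.tab_reloid                          AS oid_in_sl_table,
       c.oid                                 AS oid_in_pg_class
  FROM _rf.sl_table t
  LEFT JOIN pg_catalog.pg_namespace n ON n.nspname = t.tab_nspname
  LEFT JOIN pg_catalog.pg_class c
         ON c.relname = t.tab_relname
        AND c.relnamespace = n.oid
 WHERE c.oid IS DISTINCT FROM t.tab_reloid;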

In which cases does pg_class.oid != sl_table.tab_reloid?

In most cases, the node has been moved or rebuilt using pg_dump/pg_restore, which causes the object OIDs to change.

To reproduce the WARNING message above, I used a two-node replication setup between two databases in the same cluster [port 5432] for a few tables. (Refer here on how to set up Slony replication.) Here is the current OID information on the slave node (demo database) for one of the objects, 'dtest':
demo=# select oid,relfilenode,relname from pg_class where relname='dtest';
  oid  | relfilenode | relname
-------+-------------+---------
 26119 |       26119 | dtest
(1 row)
demo=# select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26119 | dtest
(1 row)
Ok, the 'dtest' OID 26119 is stored in the Slony catalog in sl_table.tab_reloid (Slony schema _rf). Now take a logical backup and restore of the same demo database, simply to change the object OIDs, as below. (Remember, the slon processes are stopped at this point.)
-bash-4.1$ pg_dump -Fc -p 5432 -U postgres demo >/tmp/demo93.dmp
-bash-4.1$ psql -c "alter database demo rename to demo_bk;"
-bash-4.1$ psql -c "create database demo;"
-bash-4.1$ pg_restore -Fc -p 5432 -U postgres -d demo /tmp/demo93.dmp
-bash-4.1$ psql -d demo -c "select oid,relfilenode,relname from pg_class where relname='dtest';"
  oid  | relfilenode | relname
-------+-------------+---------
 26640 |       26640 | dtest
(1 row)
-bash-4.1$ psql -d demo -c "select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';"
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26119 | dtest
(1 row)
Now the pg_class.oid of 'dtest' has changed to 26640, whereas sl_table.tab_reloid still holds the old OID 26119. At this stage, if we start the slon process, it stops straight away with the WARNING message about the OID mismatch, because its health check runs a query verifying pg_class.oid = sl_table.tab_reloid; on a false result it will not move ahead until the mismatch is fixed. We can also call the function slon_node_health_check() explicitly for verification:
demo=# select _rf.slon_node_health_check();
WARNING:  table [id,nsp,name]=[1,a,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[2,dtest,public] - sl_table does not match pg_class/pg_namespace
WARNING:  table [id,nsp,name]=[3,movepage,public] - sl_table does not match pg_class/pg_namespace
WARNING:  Mismatch found between sl_table and pg_class.  Slonik command REPAIR CONFIG may be useful to rectify this.
 slon_node_health_check
------------------------
 f
(1 row)
We can fix this in two ways.

  1. Using the slonik command-line utility with a preamble script and REPAIR CONFIG, or
  2. Using the Slony catalog function updatereloid() from a psql terminal.

Method 1: Create a preamble script as below and execute it with the slonik command. (I will be using the second method; this one is shown just for reference.)
demo=# \o /tmp/repair_conf.slonik
demo=# select 'REPAIR CONFIG ( SET ID = '||set_id||', EVENT NODE = 2 );' FROM _rf.sl_set;
demo=# \o

Add the node connection information at the beginning of the file "/tmp/repair_conf.slonik":

cluster name = rf;
node 1 admin conninfo = 'host=localhost dbname=postgres user=postgres port=5432 password=postgres';
node 2 admin conninfo = 'host=localhost dbname=demo  user=postgres port=5432 password=postgres';

 REPAIR CONFIG ( SET ID = 1, EVENT NODE = 2 );
 REPAIR CONFIG ( SET ID = 2, EVENT NODE = 2 );
 REPAIR CONFIG ( SET ID = 3, EVENT NODE = 2 );

-bash-4.1$ slonik /tmp/repair_conf.slonik
Method 2: Pass each table's set ID and the node ID to the function:
demo=# select _rf.updatereloid(tab_set,2) from _rf.sl_table ;   
 updatereloid
--------------
            1
            1
            1
(3 rows)
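Since updatereloid() works per replication set, the query above actually calls it once for every table. Calling it once per set has the same effect and is a bit tidier; a sketch, assuming the usual updatereloid(set_id, only_on_node) signature:

demo=# select _rf.updatereloid(set_id, 2) from _rf.sl_set;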
Cool, let's check the OID information now on the slave node (demo database), from pg_class and the Slony catalog _rf.sl_table:
-bash-4.1$  psql -d demo -c "select oid,relfilenode,relname from pg_class where relname='dtest';"
  oid  | relfilenode | relname
-------+-------------+---------
 26640 |       26640 | dtest
(1 row)

-bash-4.1$ psql -d demo -c "select tab_id,tab_reloid,tab_relname from _rf.sl_table where tab_relname='dtest';"
 tab_id | tab_reloid | tab_relname
--------+------------+-------------
      2 |      26640 | dtest
(1 row)
After the update, Slony begins syncing again without any issues.
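To confirm that the slave is catching up, the sl_status view on the origin node is handy; a sketch, assuming the same _rf cluster schema and that node 1 (the postgres database) is the origin:

postgres=# select st_origin, st_received, st_lag_num_events, st_lag_time from _rf.sl_status;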
Thanks to the Slony-I team.

--Raghav

2 comments:

Anonymous said...

>>> Take the logical backup and restore of same demo database <<<<<

There are other potential issues with taking a logical backup and restore of a Slony node and expecting the restored node to continue from where the backed-up node left off.

Specifically:
1. If replication in the cluster progressed after the backup was taken, restoring the backup essentially makes the node time-travel backwards, which can confuse other nodes.
2. If you're restoring your logical backup to a different Postgres cluster (data directory), you need to make sure that the XID epoch on the target system is AHEAD of the system you took the backup from. Otherwise it is possible for new XIDs on the new database to be earlier than XIDs that were marked as committed pre-backup. This confuses Slony (see the quick check sketched below).
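For point 2, one rough way to compare the positions of the two clusters is txid_current(), whose high 32 bits carry the XID epoch. Just a sketch; run it on both the source and the target before pointing Slony at the restored node:

-- txid_current() returns (epoch << 32) | xid, so the two parts can be split out
SELECT txid_current() >> 32        AS xid_epoch,
       txid_current() & 4294967295 AS xid_within_epoch;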

It is fine to take a logical backup and restore it to another system as long as you uninstall Slony on the restored system, but trying to make the node resume where it left off in replication is not recommended. If that is what you are trying to do, the CLONE PREPARE/CLONE FINISH slonik commands might be closer to what you want.

Steve

Raghavendra said...

Thank you Steve for your comment.

[1], [2]: Agreed.

I faced this OID mismatch issue in real time on one of our customer environments when pg_dump/pg_restore was executed on a slave node. Luckily, at that time neither the application nor the slon processes were running to advance XIDs, and no transactions caused the cluster to progress, so the sync caught up without any issue.

As you said, CLONE PREPARE/CLONE FINISH is the right approach; however, I didn't get why one would uninstall Slony on the restored system?

--Raghav

