Relational Database Technologies: May 2013

Friday, May 31, 2013

GET DIAGNOSTICS with COPY statement in Pl/pgsql function

Now GET DIAGNOSTIC will return the number of rows processed by COPY statement in a Pl/Pgsql function.
COPY statement in Pl/Pgsql Function:

CREATE OR REPLACE FUNCTION public.copy_data(fname text) RETURNS integer
AS 
$$
declare
    copy_qry text;
    cnt integer;
Begin
copy_qry := 'copy t from'||quote_literal(fname)||' with CSV HEADER;';
Execute copy_qry;
GET DIAGNOSTICS cnt = ROW_COUNT;
return cnt;
end;
$$ Language plpgsql;

Previous release:

-bash-4.1$ psql
psql.bin (9.2.3)
Type "help" for help.

postgres=# select copy_data('/usr/local/pg93beta/t_load.csv');
 copy_data
-----------
         0
(1 row)

In PostgreSQL 9.3

-bash-4.1$ ./psql -p 5555
psql (9.3beta1)
Type "help" for help.

postgres=# select copy_data('/usr/local/pg93beta/t_load.csv');
 copy_data
-----------
         3
(1 row)
or
-bash-4.1$ ./psql -p 5555 -c "select copy_data('/usr/local/pg93beta/t_load.csv');"
 copy_data
-----------
         3
(1 row)

Thanks to author, it looks simple but very effective when working with loading the data using scripts and want to know the number of rows processed by the COPY statement.

--Raghav

Thursday, May 16, 2013

Disk page checksums to detect filesystem failures in PostgreSQL 9.3Beta 1

New feature introduced in PostgreSQL 9.3Beta 1 i.e. "Disk page checksums". Thanks to author Ants Aasama and Simon Riggs, Jeff Davis,Greg Smith.

In earlier releases, if there's any data corruption block on disk it was silently ignored until any pointer arrives on it or some wrong results shown by the queries. Now, data corruption detected beforehand by triggering WARNING message instead of silently using or waiting for hit on the corrupted block.

Disk page checksums feature implementation is unique, its not plug-able like EXTENSIONs its selectable feature. That's if you need your database should be under monitory umbrella of data corruption then it should be enabled at the time of cluster initialization not on existing or running cluster. Below's the example how it works.

Initialize the cluster with checksums:

initdb [OPTION]... [DATADIR]
   ........
   -k, --data-checksums      use data page checksums
   
initdb -D data_directory -k

Now, any data corruption found will be notified as below:

postgres=# select * from corruption_test;
WARNING:  page verification failed, calculated checksum 63023 but expected 48009
ERROR:  invalid page in block 0 of relation base/12896/16409

In earlier version,just an error message.

postgres=# select * from corruption_test where id=1;
ERROR:  invalid page header in block 0 of relation base/12870/18192

That's cool right....

So, how do you know whether disk page checksums enabled on the cluster or not ?
As of now, there's no pg_catalog to store such information or any files created in the $PGDATA directory, only pg_control file will hold that information. Using pg_controldata utility you can know about it.

$ export PGDATA=/usr/local/pg93beta/data
$ pg_controldata 
....
....
....
Data page checksum version:           1

Some points on Disk page checksums:
1. Temp tables are excluded from checksums checks.
2. There's performance overhead if checksums enabled as per the PG documentation.
3. Once enabled checksums on a cluster cannot be rolled back.

Thanks
Raghav

Tuesday, May 14, 2013

What if, import file (txt/csv) having "BOM-ed UTF-8" encoding?

So what is "UTF-8 BOM" mean ? its byte order mark for UTF-8, some bytes (0xEF,0xBB,0xBF) are added at the start of the file to indicate that the file having unicode characters in it. BOM Characters "ï»¿9".

As per Unicode documentation, the presence of BOM in file are useless, because it causes problems with non-BOM-aware software's to identify or parse the leading characters having at the start. Same has been quoted at the bottom of the Wikipedia page:

Related errors in PostgreSQL:
ERROR: invalid input syntax for integer: "ï»¿9" (in psql-client)
SQL state: 22P02 (in PgAdmin-III)

Test case & fix on Windows:
Sample file "state_data.txt" created in NOTEPAD with unicode characters in it:

9,Karnataka,कर्नाटक
10,Kerala,केरळा

Table to import data:

create table states(state_code int, state_name char(30), state_in_hindi text);

Error:

postgres=# copy test from 'c:/Pgfile/state_data.txt' with delimiter ',' CSV;
ERROR:  invalid input syntax for integer: "ï»¿9"
CONTEXT:  COPY test, line 1, column state_code: "ï»¿9"

To fix, I have used a tool "bomremover.exe" to remove leading characters from a file as its on windows, if its on linux, then there are many tips & tricks available on net to wipe BOM from a utf-8 format file.

Tool Download link and usage:
http://www.mannaz.at/codebase/utf-byte-order-mark-bom-remover/

Eg:-
C:\Pgfile>bomremover.exe . *
Added '.\state_data.txt' to processing list.
Press enter to process all files in the list. (1 files in total)

Processing file '.\state_data.txt'...
Finished. Press Enter to Exit

After running bomremover.exe on file, re-run COPY command which will succeed to import data.

 state_code | state_name | State_name_in_hindi
------------+------------+---------------------
          9 | Karnataka  | αñòαñ░αÑ<8d>αñ¿αñ╛αñƒαñò
         10 | Kerala     | αñòαÑçαñ░αñ│αñ╛
(2 rows)

Some of the editors, avoids default saving text with UTF8-BOM:
- Windows - Notepad++ (In Notepade default BOM enabled)
- Linux - VI
- Mac - TextEdit

--Raghav