How did it start?
You are a big data engineer. You manage a production Data Warehouse on EMC GreenPlum where analysts cannot run ad hoc queries, but they love to explore data. You have a Hadoop analytics sandbox cluster with several terabytes of storage, and analysts love to explore data there. So there is a need for a quick and elegant solution to bring data from a large table, on a daily basis, from the GreenPlum RDBMS to Hadoop. You are not a traditional Java programmer, but you love SQL. Sqoop comes to the rescue. Here is the Sqoop user guide:
http://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html
Here is how I have done it using SQL, Python, Sqoop and Hive:
1. There is a large daily partitioned table in GreenPlum (about 200M rows per day).
Since the data is partitioned daily on GreenPlum, I need an equivalent 1-1 mapping in Hive so that I can bring the data over daily via Sqoop to HDFS and access it through a daily partitioned Hive table.
GreenPlum table:
Test
Col1 : numeric
Col2 : varchar
Col3 : char
Col4 : int
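For reference, here is the column type mapping between the GreenPlum table above and the Hive table created in the next step, written out as a tiny Python snippet just for illustration (the numeric-to-bigint choice assumes col1 holds whole numbers; more on data type issues in a later post):
#!/usr/bin/env python
# Illustrative only: the GreenPlum-to-Hive type mapping used for the test table.
gp_to_hive = {
    "numeric": "bigint",   # col1; assumes the numeric column holds whole numbers
    "varchar": "string",   # col2
    "char":    "string",   # col3
    "int":     "int",      # col4
}
for gp_type, hive_type in gp_to_hive.items():
    print "%-8s -> %s" % (gp_type, hive_type)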
2. Create an external table in Hive. I prefer an external table for the initial Hive table because if anybody drops the table accidentally, the data is still there on HDFS and you can just recreate the Hive table (and re-add its partitions, as sketched after the note below).
Here I create a Hive daily partitioned table:
create external table test (
Col1 bigint,
Col2 string,
Col3 string,
Col4 int
)
partitioned by (day string)
row format delimited
fields terminated by ','
lines terminated by '\n'
STORED AS TEXTFILE
location '/hadoop/analytics/test/';
Note: /hadoop/analytics/test/ is an HDFS path, not a regular local directory.
So you need to use
$ hadoop fs -ls /hadoop/analytics/test/
to browse its contents.
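As an aside, if the external table ever does get dropped, the data stays on HDFS and you only need to rerun the DDL above and re-register the partitions. A minimal sketch of re-adding them by scanning the HDFS directory (it only prints the hive commands; the table and path names are the ones used above):
#!/usr/bin/env python
# Re-register Hive partitions for the external table by scanning its HDFS location.
# This only prints the hive commands; pipe the output to a shell to run them.
import re
import subprocess

hdfs_path = "/hadoop/analytics/test/"
listing = subprocess.check_output(["hadoop", "fs", "-ls", hdfs_path])
for line in listing.splitlines():
    match = re.search(r"day=(\d{4}-\d{2}-\d{2})", line)
    if match:
        day = match.group(1)
        print "hive -e \"alter table test add partition (day = '%s');\"" % day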
3. Here is a Python script which generates the commands to bring daily data from GreenPlum to HDFS and add the Hive partition.
#!/usr/bin/env python
import datetime
from datetime import date

# Pull yesterday's data: the window [yesterday, today) covers exactly one day.
delta = datetime.timedelta(days=1)
startdate = date.today() - delta
enddate = date.today()

# Command templates; the %s placeholders are filled in below.
# \$CONDITIONS must reach Sqoop literally, so the $ is escaped for the shell.
cmd = "sqoop import --connect jdbc:postgresql://myserver.com:5432/gpdb --username testuser --password testuser --query \"select col1,col2,col3,col4 from test where dt >= '%s' and dt < '%s' and \$CONDITIONS\" --split-by col1 --target-dir /hadoop/analytics/test/day=%s/"
part_cmd = "hive -e \"alter table test add partition (day = '%s');\""

d = startdate
while d < enddate:
    d1 = d          # start of the day being imported
    d += delta
    d2 = d          # exclusive upper bound (the next day)
    sqoop_cmd = cmd % (d1, d2, d1)
    print sqoop_cmd
    add_part_cmd = part_cmd % (d1)
    print add_part_cmd
4. When you run this script, it will generate a Sqoop import command and a Hive command like these:
sqoop import --connect jdbc:postgresql://myserver.com:5432/gpdb --username testuser --password testuser --query "select col1,col2,col3,col4 from test where dt >= '2013-01-01' and dt < '2013-01-02' and \$CONDITIONS" --split-by col1 --target-dir /hadoop/analytics/test/day=2013-01-01/
hive -e "alter table test add partition (day = '2013-01-01');"
5. You can save the generated commands in a file and execute it as a shell script.
Each day you should see a directory created on HDFS, such as /hadoop/analytics/test/day=2013-01-01, and the alter table command will create the corresponding Hive partition (day=2013-01-01).
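Alternatively, instead of saving the printed commands to a file, the script could run them itself. A minimal sketch of that variation, assuming the same connection details and command templates as the script above; subprocess runs each command through the shell:
#!/usr/bin/env python
# Variation on the script above: execute each generated command instead of only printing it.
import datetime
import subprocess
from datetime import date

delta = datetime.timedelta(days=1)
startdate = date.today() - delta
enddate = date.today()

cmd = "sqoop import --connect jdbc:postgresql://myserver.com:5432/gpdb --username testuser --password testuser --query \"select col1,col2,col3,col4 from test where dt >= '%s' and dt < '%s' and \$CONDITIONS\" --split-by col1 --target-dir /hadoop/analytics/test/day=%s/"
part_cmd = "hive -e \"alter table test add partition (day = '%s');\""

d = startdate
while d < enddate:
    d1 = d
    d += delta
    d2 = d
    # Run the Sqoop import first; only add the Hive partition if the import succeeded.
    if subprocess.call(cmd % (d1, d2, d1), shell=True) == 0:
        subprocess.call(part_cmd % (d1,), shell=True)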
After the Hadoop environment is set, type hive to get a Hive prompt, then you can view the Hive partitions with this command:
hive>show partitions test;
day=2013-01-01
6. Finally, you can select data from Hive:
hive> select count(*) from test where day='2013-01-01';
This count should exactly match your RDBMS table count for that day.
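If you want to automate that check, one option is to compare the two counts from Python. A rough sketch, assuming the psycopg2 driver is installed; the host, database, user and password are the placeholder values from the sqoop command above, and the Hive count is taken via hive -e:
#!/usr/bin/env python
# Compare the daily row count in GreenPlum with the count in the Hive partition.
# Assumes psycopg2 is installed; connection details below are placeholders.
import subprocess
from datetime import date, timedelta
import psycopg2

day = date(2013, 1, 1)
next_day = day + timedelta(days=1)

# Count on the GreenPlum side (GreenPlum speaks the PostgreSQL protocol).
conn = psycopg2.connect(host="myserver.com", port=5432, dbname="gpdb",
                        user="testuser", password="testuser")
cur = conn.cursor()
cur.execute("select count(*) from test where dt >= %s and dt < %s",
            (str(day), str(next_day)))
gp_count = cur.fetchone()[0]
conn.close()

# Count on the Hive side for the matching partition (-S keeps the output quiet).
out = subprocess.check_output(
    ["hive", "-S", "-e", "select count(*) from test where day='%s';" % day])
hive_count = int(out.strip().split()[-1])

print "GreenPlum: %d  Hive: %d  match: %s" % (gp_count, hive_count,
                                              gp_count == hive_count)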
Comment:
You can use this technique to do any type of partitioning. If your data is sharded based on mod 32 or any other sharding scheme, you can implement partitioning accordingly. In Hive, partitioning means a corresponding directory should exist on HDFS, as I have shown above.
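For example, with a mod-32 sharding scheme you could partition the Hive table by shard instead of by day. A rough sketch of what the generated commands might look like; the shard partition column and the modulo expression are hypothetical, just to show the shape:
#!/usr/bin/env python
# Hypothetical variation: partition by shard (col1 mod 32) instead of by day.
# The Hive table would be created with "partitioned by (shard string)" and the
# HDFS layout becomes /hadoop/analytics/test/shard=<n>/.
cmd = "sqoop import --connect jdbc:postgresql://myserver.com:5432/gpdb --username testuser --password testuser --query \"select col1,col2,col3,col4 from test where col1 %% 32 = %d and \$CONDITIONS\" --split-by col1 --target-dir /hadoop/analytics/test/shard=%d/"
part_cmd = "hive -e \"alter table test add partition (shard = '%d');\""

for shard in range(32):
    print cmd % (shard, shard)
    print part_cmd % shard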
In the next blog I will talk about some data type issues between GreenPlum and Hive and how you can take care of them using Hive.