
Hive SerDe and Pig LoadFunc for reading Sqoop sequence files

Review Request #1670 - updated 3 years ago

James Grant Reviewers
sqoop
SQOOP-171
None sqoop
Add new classes that can be used in Hive and Pig, together with the Sqoop-generated FieldMappable class, to read Sqoop-generated sequence files.

To use the SerDe in Hive, first add the Sqoop jar and the jar that Sqoop generated when you dumped the table, then create the table. An example is below.


ADD JAR /path/to/sqoop-1.x.x.jar;
ADD JAR /path/to/table_name.jar;
CREATE EXTERNAL TABLE table_name (
  id INT,
  name STRING
)
ROW FORMAT SERDE 'com.cloudera.sqoop.contrib.FieldMappableSerDe'
WITH SERDEPROPERTIES (
  "fieldmappable.classname" = "name.of.FieldMappable.generated.by.sqoop"
)
STORED AS SEQUENCEFILE
LOCATION "hdfs://hdfs.server/path/to/sequencefile";
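
The column list in the CREATE TABLE statement has to line up with the fields of the Sqoop-generated class. As a rough illustration only (this is not the code in this patch), the sketch below shows one way a record's field map could be turned into a row ordered by the declared columns. It assumes the generated class exposes its values through a getFieldMap() method returning a Map<String, Object>. The FieldMappable interface here is a local stand-in so the example compiles on its own, and FieldMapRowSketch and toRow are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class FieldMapRowSketch {

    /** Local stand-in for the Sqoop interface (assumption: it exposes getFieldMap()). */
    public interface FieldMappable {
        Map<String, Object> getFieldMap();
    }

    /**
     * Hypothetical helper: returns the record's values in the order of the
     * declared table columns. A column with no matching field yields null.
     */
    public static List<Object> toRow(FieldMappable record, List<String> columns) {
        Map<String, Object> fields = record.getFieldMap();
        List<Object> row = new ArrayList<>(columns.size());
        for (String column : columns) {
            row.add(fields.get(column));
        }
        return row;
    }

    public static void main(String[] args) {
        // A fake record whose field order differs from the table definition.
        FieldMappable record = () -> {
            Map<String, Object> m = new LinkedHashMap<>();
            m.put("name", "example");
            m.put("id", 42);
            return m;
        };
        // Columns in the order declared in the CREATE TABLE example above.
        System.out.println(toRow(record, Arrays.asList("id", "name"))); // prints [42, example]
    }
}
```

Looking fields up by name rather than by position means the order of the columns in the table definition does not need to match the order in which the generated class stores them.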


To use the Pig LoadFunc, register the same two jars and then define the LoadFunc. An example is below.

REGISTER /path/to/sqoop-1.x.x.jar
REGISTER /path/to/table_name.jar
DEFINE FieldMappableLoadFunc com.cloudera.sqoop.contrib.FieldMappableLoadFunc();

dataset = LOAD 'hdfs://hdfs.server/path/to/sequencefile'
  USING FieldMappableLoadFunc
  AS (
    id,
    name
  );

These classes are not far from the proof-of-concept phase: there are almost certainly rough edges, and there are no unit tests yet. The Hive SerDe has been tested with 'real' queries on a fairly large dataset (0.5 billion rows, about 45GB), which gives some confidence that there are no obvious memory leaks. The Pig LoadFunc has not been tested much beyond confirming that it works and returns the correct results.
Posted 3 years ago (March 29th, 2011, 4:16 a.m.)
com.cloudera.sqoop.contrib may not be the correct package. It was chosen quickly after moving the classes from fm.last.sqoop. Suggestions for a better location are welcome.