Hive SerDe and Pig LoadFunc for reading Sqoop sequence files
Review Request #1670 - updated 3 years ago
Add new classes that can be used in Hive and Pig, along with the Sqoop-generated FieldMappable class, to access Sqoop-generated sequence files.

To use the SerDe in Hive, first add the Sqoop jar and the jar that Sqoop generated when you dumped the table, then create the table. An example is below.

    ADD JAR /path/to/sqoop-1.x.x.jar;
    ADD JAR /path/to/table_name.jar;

    CREATE EXTERNAL TABLE table_name (
      id INT,
      name STRING
    )
    ROW FORMAT SERDE 'com.cloudera.sqoop.contrib.FieldMappableSerDe'
    WITH SERDEPROPERTIES (
      "fieldmappable.classname" = "name.of.FieldMappable.generated.by.sqoop"
    )
    STORED AS SEQUENCEFILE
    LOCATION "hdfs://hdfs.server/path/to/sequencefile";

To use the Pig LoadFunc, register the same two jars and then define the LoadFunc. An example is below.

    REGISTER /path/to/sqoop-1.x.x.jar
    REGISTER /path/to/table_name.jar
    DEFINE FieldMappableLoadFunc com.cloudera.sqoop.contrib.FieldMappableLoadFunc();
    dataset = LOAD 'hdfs://hdfs.server/path/to/sequencefile' USING FieldMappableLoadFunc AS ( id, name );
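For context, the sequence file and the table jar referred to above might be produced by a Sqoop import along the following lines. This is only a sketch and not part of this change: the JDBC URL, the directories and the class name com.example.TableName are placeholders chosen for illustration. The class name passed to --class-name is presumably the value to supply as "fieldmappable.classname", and the generated jar should end up under the --bindir directory.

```shell
# Hypothetical import: connection string, directories and class name are placeholders.
# --as-sequencefile writes the table as a SequenceFile of Sqoop-generated records.
# --class-name sets the name of the generated record class.
# --bindir sets the output directory for the compiled classes and jar.
sqoop import \
  --connect jdbc:mysql://db.server/dbname \
  --table table_name \
  --as-sequencefile \
  --target-dir /path/to/sequencefile \
  --class-name com.example.TableName \
  --bindir /path/to/jars
```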
These classes are not far past the proof-of-concept phase: there are almost certainly rough edges, and there are not yet any unit tests. The Hive SerDe has been tested with 'real' queries on a fairly large dataset (0.5 billion rows, about 45GB), which gives some confidence that there aren't any obvious memory leaks. The Pig LoadFunc hasn't been tested much beyond confirming that it works and returns the correct results.