
Hadoop Sequence Files and a Use Case on Reading only the keys

Posted on January 26, 2014 by Chitra Ramanathan

To enjoy all the benefits of a Hadoop-based big data system, the first step is to get the data into HDFS, a step known as the ingestion process. Hadoop does not work very well with a lot of small files, i.e., files smaller than a typical HDFS block, because keeping track of huge numbers of small files creates a memory overhead on the NameNode. Also, every map task processes a block of data at a time; a map task with too little data to process is inefficient, and starting up many such map tasks is an overhead in itself.

To solve this problem, Sequence files are used as a container to store the small files. Sequence files are flat files containing key/value pairs. A very common pattern when designing ingestion systems is to use Sequence files as containers, storing file-related metadata (filename, path, creation time, etc.) as the key and the file contents as the value.
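
For illustration, a minimal ingestion sketch along these lines might look like the following; the metadata layout (a simple name|size|mtime string in a Text key), the BytesWritable value, and the class name are assumptions made for this example, not a prescribed format.

// Packs small local files into one sequence file on HDFS,
// keyed by file metadata, with the raw contents as the value.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path out = new Path(args[0]);               // e.g. a container.seq path on HDFS

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, out, Text.class, BytesWritable.class);
        try {
            for (int i = 1; i < args.length; i++) { // remaining args: local files to ingest
                File f = new File(args[i]);
                byte[] contents = Files.readAllBytes(f.toPath());
                // Key: file metadata; value: raw file contents
                Text key = new Text(f.getName() + "|" + f.length() + "|" + f.lastModified());
                writer.append(key, new BytesWritable(contents));
            }
        } finally {
            writer.close();
        }
    }
}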

More on Sequence Files and Input Splits 

A sequence file has a header which contains the key and value class names, the version, the file format, user metadata about the file, and a sync marker to denote the end of the header. The header is followed by the records, each of which holds a key/value pair along with its lengths. A sequence file can have three different formats: an uncompressed format, a record-compressed format where only the values are compressed, and a block-compressed format where multiple records are compressed together in blocks. Sync markers are written periodically throughout the file (roughly every couple of kilobytes by default) and always fall on record boundaries.
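
For example, the header fields can be inspected with the accessors on the SequenceFile.Reader class; a small sketch (the input path here is just a placeholder):

// Prints what the header of a sequence file records.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class SeqFileHeaderInfo {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path(args[0]);
        SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
        try {
            System.out.println("key class:        " + reader.getKeyClassName());
            System.out.println("value class:      " + reader.getValueClassName());
            System.out.println("compressed:       " + reader.isCompressed());
            System.out.println("block compressed: " + reader.isBlockCompressed());
            System.out.println("metadata:         " + reader.getMetadata().getMetadata());
        } finally {
            reader.close();
        }
    }
}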

Sequence files are splittable, with each map task processing one split containing one or more key/value pairs. Each call to the map() method in the Mapper retrieves the next key and value in the corresponding split. Even when a split boundary falls in the middle of a record, the sequence file reader keeps reading past the end of the split until it reaches a sync marker, ensuring that each record is read in whole; the reader for the next split skips ahead to the first sync marker after its start, so no record is processed twice.

Working through a use case: Read only the keys and generate a report

Let's take a file of size 650 MB that gets ingested into HDFS as a sequence file, with the key being the file metadata and the value being the file contents. My goal is to read all the metadata, i.e., all the keys, and generate a report. Let's assume a block size (split size) of 128 MB and that the sequence file contains only the above input file. So the size of the sequence file is approximately 650 MB (assuming that the headers are negligible), and 6 splits, or 6 map tasks, would be invoked on this file. We can visualize it as follows.

[Figure: the 650 MB sequence file laid out across the six 128 MB splits, with Record 1 spanning all of them]

Note that the above diagram represents the uncompressed and record compressed sequence file formats. The Block Compressed format has a different structure, but the idea is the same. Even though 6 map tasks are invoked to process this sequence file, the first map task would read the whole of Record 1 until it finds a sync marker, while the other map tasks would see that there is no record to read and do nothing.

I started off using the SequenceFileInputFormat [1] class to read the input files, writing the keys from the Mapper with null in place of the values, and doing additional processing in the reducer to generate the report. When I profiled the memory usage of the map tasks, I noticed that a lot of memory was being spent reading the value bytes, which are never used in the job.
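
A rough sketch of that first approach, assuming Text keys and BytesWritable values in the sequence file and omitting the report-building reducer; the class names here are illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KeyReportJob {

    // Emits each key with a NullWritable value. The framework still
    // deserializes the whole value into the BytesWritable argument even
    // though the mapper never touches it, which is where the memory goes.
    public static class KeyOnlyMapper
            extends Mapper<Text, BytesWritable, Text, NullWritable> {
        @Override
        protected void map(Text key, BytesWritable value, Context context)
                throws java.io.IOException, InterruptedException {
            context.write(key, NullWritable.get());
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "key report");
        job.setJarByClass(KeyReportJob.class);
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(KeyOnlyMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(NullWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}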

To optimize memory usage, I wrote a custom InputFormat and a custom RecordReader that read the keys but skip the value portion of each record, and the memory usage was much lower. When lots of such large input files are ingested, the savings in memory are significant. The nextKeyValue() method in the RecordReader would look like:

@Override
public boolean nextKeyValue() throws IOException, InterruptedException {
    if (!more) {
        return false;
    }
    long pos = reader.getPosition();
    // Read only the next key; the value bytes are left unread because
    // getCurrentValue() is never called on the underlying reader.
    more = reader.next(key);
    if (!more || (pos >= end && reader.syncSeen())) {
        // Either the file is exhausted or we have passed the end of the
        // split and crossed a sync marker, so this reader is done.
        more = false;
        key = null;
        value = null;
    }
    return more;
}
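
For reference, here is one way the custom pieces could fit together. The class names (KeyOnlySequenceFileInputFormat, KeyOnlyRecordReader), the Text key type, and the NullWritable value type are assumptions made for this sketch rather than the post's exact code; the structure mirrors Hadoop's own SequenceFileRecordReader.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;

// InputFormat that plugs the key-only reader into the job.
public class KeyOnlySequenceFileInputFormat
        extends SequenceFileInputFormat<Text, NullWritable> {
    @Override
    public RecordReader<Text, NullWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) throws IOException {
        return new KeyOnlyRecordReader();
    }
}

// RecordReader that reads keys and never deserializes the values.
class KeyOnlyRecordReader extends RecordReader<Text, NullWritable> {
    private SequenceFile.Reader reader;
    private Text key = new Text();
    private NullWritable value = NullWritable.get();
    private long start;
    private long end;
    private boolean more = true;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        Configuration conf = context.getConfiguration();
        Path path = fileSplit.getPath();
        reader = new SequenceFile.Reader(path.getFileSystem(conf), path, conf);
        end = fileSplit.getStart() + fileSplit.getLength();
        if (fileSplit.getStart() > reader.getPosition()) {
            // Skip ahead to the first sync marker inside this split.
            reader.sync(fileSplit.getStart());
        }
        start = reader.getPosition();
        more = start < end;
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Same key-skipping logic as the method shown above.
        if (!more) {
            return false;
        }
        long pos = reader.getPosition();
        more = reader.next(key);
        if (!more || (pos >= end && reader.syncSeen())) {
            more = false;
            key = null;
        }
        return more;
    }

    @Override
    public Text getCurrentKey() { return key; }

    @Override
    public NullWritable getCurrentValue() { return value; }

    @Override
    public float getProgress() throws IOException {
        if (end == start) {
            return 0.0f;
        }
        return Math.min(1.0f, (reader.getPosition() - start) / (float) (end - start));
    }

    @Override
    public void close() throws IOException {
        if (reader != null) {
            reader.close();
        }
    }
}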

[Figure: memory usage graphs for the map tasks]

More Use cases

At this point we have written our own InputFormat and RecordReader classes, but we are still using the SequenceFile.Reader [2] class to read the underlying bytes. Writing a custom version of the SequenceFile.Reader opens up a lot of possibilities in the way the key/value bytes are handled. One such use case is the ability to stream the value a few bytes at a time, as opposed to reading the entire value into memory, to improve the memory usage further.

References

[1] http://hadoop.apache.org/docs/current/api/org/apache/hadoop/mapreduce/lib/input/SequenceFileInputFormat.html

[2] http://hadoop.apache.org/docs/r1.1.2/api/org/apache/hadoop/io/SequenceFile.Reader.html
