
FAQ On Converting Many Small Files To A Sequence File In HDFS (Doc ID 1954143.1)

Last updated on DECEMBER 17, 2014

Applies to:

Big Data Appliance Integrated Software - Version 2.0.1 and later
Linux x86-64

Purpose

Storing a large number of small files in HDFS is inefficient for processing and also burdens the NameNode, which must keep metadata for every file. This FAQ provides common questions and answers on how to convert many small files into a larger sequence file and how to access it.
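
The idea of packing many small files into one key/value container can be sketched as follows. This is only a conceptual illustration in plain Python, not the Hadoop SequenceFile on-disk format (which adds a header, sync markers, and optional compression) and not the method described in this document's answers. The function names `pack_small_files` and `read_records` and the simple length-prefixed record layout are assumptions made for the example. In Hadoop itself the corresponding class is `org.apache.hadoop.io.SequenceFile`, where a common pattern is to use the original file name as the key and the file contents as the value.

```python
import struct
from pathlib import Path
from typing import Iterator, Tuple


def pack_small_files(src_dir: Path, container: Path) -> int:
    """Pack every regular file in src_dir into one container file.

    Hypothetical record layout (NOT the Hadoop SequenceFile format):
    4-byte key length, 4-byte value length (both big-endian),
    then the key (original file name, UTF-8) and the value (file bytes).
    Returns the number of files packed.
    """
    count = 0
    with container.open("wb") as out:
        # Sort so the record order is deterministic.
        for path in sorted(src_dir.iterdir()):
            if not path.is_file():
                continue
            key = path.name.encode("utf-8")
            value = path.read_bytes()
            out.write(struct.pack(">II", len(key), len(value)))
            out.write(key)
            out.write(value)
            count += 1
    return count


def read_records(container: Path) -> Iterator[Tuple[str, bytes]]:
    """Yield (file name, file bytes) pairs back out of the container."""
    with container.open("rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break  # end of container
            key_len, value_len = struct.unpack(">II", header)
            key = f.read(key_len).decode("utf-8")
            value = f.read(value_len)
            yield key, value
```

In this sketch the container should live outside `src_dir` so that it is not packed into itself. Listing the stored file names is then a matter of iterating over `read_records(container)` and printing each key, which mirrors the way keys in a sequence file can carry the names of the original small files.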

Questions and Answers



In this Document
Purpose
Questions and Answers
 In which use cases is the sequence file option better than the HAR file option, and what are some reasons for not using the sequence file option?
 What is the best way to convert the small XML files into sequence files? Is there a utility or command that does the conversion?
 Can we use all tools (such as Hive, Pig, Java, and Impala) on the converted sequence files?
 If I have an external Hive table pointing to the XML folder, will the same table work after changing the location and storage type to sequence file?
 Will the keys (file names) be the original small individual XML file names or the larger merged file names?
 How do I retrieve the file names associated with a sequence file? Is there a command-line utility, or do I have to write a MapReduce program?
