File compression brings two major benefits: it reduces the space needed to store files, and it speeds up data transfer across the network or to or from disk. When dealing with large volumes of data, both of these savings can be significant, so it pays to carefully consider how to use compression in Hadoop.

If an input file is compressed, fewer bytes are read in from HDFS, which means less time spent reading data; this time saving benefits job execution. Compressed input files are decompressed automatically as they are read by MapReduce, which uses the filename extension to determine which codec to use: a file ending in .gz, for example, is identified as gzip-compressed and read with GzipCodec.

Often we need to store job output as history files. If the amount of output per day is large and history results must be kept for future use, the accumulated results will take up a great deal of HDFS space, even though the history files may be read only rarely. It is therefore worth compressing the output before storing it on HDFS.

Even if your MapReduce application reads and writes uncompressed data, it may benefit from compressing the intermediate output of the map phase. Since the map output is written to disk and transferred across the network to the reducer nodes, using a fast compressor such as LZO or Snappy can yield performance gains simply because the volume of data to transfer is reduced.

gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. bzip2 is a freely available, patent-free, high-quality data compressor.
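Compressing the intermediate map output is a configuration change rather than a code change. A minimal sketch, assuming Hadoop 2.x-era property names and that the Snappy native libraries are installed on the cluster, would add something like this to `mapred-site.xml` (or pass the same properties per-job):

```xml
<!-- Compress intermediate map output before it is spilled to disk
     and shuffled to the reducers. -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<!-- Use a fast codec such as Snappy for the intermediate data. -->
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```

Because the intermediate data is short-lived, a fast codec with a modest compression ratio (Snappy, LZO) is usually a better fit here than a slower, tighter one like bzip2.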
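The space saving is easy to see with Python's standard-library `gzip` module, which uses the same DEFLATE algorithm mentioned above. This is just an illustration of the principle, not Hadoop code; the sample log line is invented, and repetitive data like this compresses especially well:

```python
import gzip

# Repetitive, structured text -- the kind of data DEFLATE handles well.
data = ("2024-01-01 INFO request served in 12ms\n" * 1000).encode("utf-8")

compressed = gzip.compress(data)

print(f"original:   {len(data)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ratio:      {len(compressed) / len(data):.2%}")
```

Fewer bytes on disk also means fewer bytes to ship across the network, which is where the job-execution time saving comes from.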
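The extension-based codec selection can be sketched in a few lines of Python. The `CODECS` table and the `open_maybe_compressed` helper below are hypothetical names for illustration; in Hadoop itself this lookup is done by `CompressionCodecFactory`:

```python
import gzip
import bz2
import lzma

# Hypothetical extension-to-codec table mirroring the idea behind
# Hadoop's CompressionCodecFactory: the filename suffix picks the codec.
CODECS = {
    ".gz": gzip.open,
    ".bz2": bz2.open,
    ".xz": lzma.open,
}

def open_maybe_compressed(path):
    """Pick a decompressing opener by filename extension, falling back
    to plain open() when no known compression suffix matches."""
    for ext, opener in CODECS.items():
        if path.endswith(ext):
            return opener(path, "rt")
    return open(path, "rt")
```

A file named `part-00000.gz` would be opened through `gzip.open` and read as transparently decompressed text, just as MapReduce hands decompressed records to the mapper.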