Hadoop provides VLongWritable and VIntWritable for storing long and int values. Both use a variable-length, zero-compressed encoding, which reduces the space the data occupies. Roughly, the algorithm spends one byte recording how many bytes the value needs and then writes the bytes of the value itself. A long is therefore no longer a fixed 8 bytes; depending on its magnitude it is stored in 1 to 9 bytes, which can save a lot of space. Encoding a value i works as follows:
- For -112 <= i <= 127, a single byte holds the value itself. Outside that range, the first byte of the encoding records the sign of i and how many value bytes follow, and the bytes of i come after it;
- If i is positive, the first byte b lies between -113 and -120, and i is stored in -(112 + b) value bytes, so 1 to 8 bytes can be represented;
- If i is negative, the first byte b lies between -121 and -128, and i is stored in -(120 + b) value bytes, again 1 to 8 bytes. For negative numbers the Hadoop implementation first takes the one's complement (flips all bits) and then encodes the result, as the worked example below shows.
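For example, to encode 9999 = 0x270F: the value needs 2 bytes, so the first byte is -(112 + 2) = -114 (0x8E) and the full encoding is the 3 bytes 8e 27 0f. To encode -9999: flipping all bits gives 9998 = 0x270E, which also needs 2 bytes, so the first byte is -(120 + 2) = -122 (0x86) and the encoding is 86 27 0e.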
The algorithm is not complicated and fairly easy to follow, but the implementation has a few small tricks. Hadoop's WritableUtils implements it as follows:
/**
 * Serializes a long to a binary stream with zero-compressed encoding.
 * For -112 <= i <= 127, only one byte is used with the actual value.
 * For other values of i, the first byte value indicates whether the
 * long is positive or negative, and the number of bytes that follow.
 * If the first byte value v is between -113 and -120, the following long
 * is positive, with number of bytes that follow are -(v+112).
 * If the first byte value v is between -121 and -128, the following long
 * is negative, with number of bytes that follow are -(v+120). Bytes are
 * stored in the high-non-zero-byte-first order.
 *
 * @param stream Binary output stream
 * @param i Long to be serialized
 * @throws java.io.IOException
 */
public static void writeVLong(DataOutput stream, long i) throws IOException {
  if (i >= -112 && i <= 127) {
    stream.writeByte((byte)i);
    return;
  }

  int len = -112;
  if (i < 0) {
    i ^= -1L; // take one's complement: flip all bits of the negative value
    len = -120;
  }

  // count how many bytes the (now non-negative) value needs
  long tmp = i;
  while (tmp != 0) {
    tmp = tmp >> 8;
    len--;
  }

  // first byte: -113..-120 for positive values, -121..-128 for negative ones
  stream.writeByte((byte)len);

  // turn the marker back into the number of value bytes (1..8)
  len = (len < -120) ? -(len + 120) : -(len + 112);

  // write the value bytes, highest non-zero byte first
  for (int idx = len; idx != 0; idx--) {
    int shiftbits = (idx - 1) * 8;
    long mask = 0xFFL << shiftbits;
    stream.writeByte((byte)((i & mask) >> shiftbits));
  }
}

/**
 * Reads a zero-compressed encoded long from input stream and returns it.
 * @param stream Binary input stream
 * @throws java.io.IOException
 * @return deserialized long from stream.
 */
public static long readVLong(DataInput stream) throws IOException {
  byte firstByte = stream.readByte();
  int len = decodeVIntSize(firstByte);
  if (len == 1) {
    // single-byte case: the byte is the value itself
    return firstByte;
  }
  long i = 0;
  for (int idx = 0; idx < len - 1; idx++) {
    // reassemble the value, high-order byte first
    byte b = stream.readByte();
    i = i << 8;
    i = i | (b & 0xFF);
  }
  // for negative values, undo the one's complement taken when writing
  return (isNegativeVInt(firstByte) ? (i ^ -1L) : i);
}

/**
 * Parse the first byte of a vint/vlong to determine the number of bytes
 * @param value the first byte of the vint/vlong
 * @return the total number of bytes (1 to 9)
 */
public static int decodeVIntSize(byte value) {
  if (value >= -112) {
    return 1;
  } else if (value < -120) {
    return -119 - value;
  }
  return -111 - value;
}
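readVLong also calls one more small helper, isNegativeVInt, to recover the sign from the first byte. It is not listed above; in WritableUtils it is essentially a range check matching the byte ranges described earlier (reproduced here for completeness):

// negative values are either a single negative byte (-1..-112)
// or a multi-byte value whose first byte is in -121..-128
public static boolean isNegativeVInt(byte value) {
  return value < -120 || (value >= -112 && value < 0);
}

Putting the pieces together for the bytes 8e 27 0f: decodeVIntSize sees the first byte 0x8E (-114) and returns -111 - (-114) = 3, readVLong assembles the remaining two bytes into 0x270F = 9999, and isNegativeVInt(-114) is false, so the value is returned as-is.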
We can also write a bit of code ourselves to verify this:
package com.hainiubl.hadoop.demo;

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.util.StringUtils;

public class VLongDemo {

  // convert a Writable into its serialized bytes
  public static byte[] writable2Bytes(Writable writable) throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream dataOut = new DataOutputStream(out);
    writable.write(dataOut);
    dataOut.close();
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    List<Writable> list = new ArrayList<Writable>();
    list.add(new LongWritable(100L));
    list.add(new VLongWritable(100L));
    list.add(new VLongWritable(9999L));
    list.add(new VLongWritable(9999999999L));

    byte[] bytes;
    for (Writable w : list) {
      bytes = writable2Bytes(w); // serialize to bytes
      System.out.println(w.getClass().getSimpleName() + ":" + w
          + ",hex:" + StringUtils.byteToHexString(bytes) + ",len:" + bytes.length);
    }
  }
}
Running it prints:
LongWritable:100,hex:0000000000000064,len:8
VLongWritable:100,hex:64,len:1
VLongWritable:9999,hex:8e270f,len:3
VLongWritable:9999999999,hex:8b02540be3ff,len:6
As the output shows, VLongWritable drops the leading zero bytes that a plain LongWritable pads with, saving space.
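As a further check, here is a minimal round-trip sketch that calls WritableUtils.writeVLong and readVLong directly (the same methods listed above; the class name VLongRoundTrip is just for this example), including a negative value:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

import org.apache.hadoop.io.WritableUtils;
import org.apache.hadoop.util.StringUtils;

public class VLongRoundTrip {
  public static void main(String[] args) throws IOException {
    long[] values = {100L, 9999L, -9999L, 9999999999L};
    for (long v : values) {
      // encode with WritableUtils.writeVLong
      ByteArrayOutputStream bos = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bos);
      WritableUtils.writeVLong(out, v);
      out.close();
      byte[] bytes = bos.toByteArray();

      // decode again and confirm we get the original value back
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
      long decoded = WritableUtils.readVLong(in);
      System.out.println(v + " -> hex:" + StringUtils.byteToHexString(bytes)
          + ", len:" + bytes.length + ", decoded:" + decoded);
    }
  }
}

For -9999 this should print the 3 bytes 86270e and decode back to -9999, matching the hand-worked example earlier.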