HBase Filter Comparators: Principles and Source Code Walkthrough

Keywords: Big Data HBase Java

**Foreword:** The previous article, HBase Filter Overview, briefly introduced the composition and class hierarchy of HBase Filter. This article focuses on the comparators used by HBase Filter, which are essential low-level knowledge for learning it. The source code in this article is based on HBase 1.1.2.2.6.5.0-292 (HDP version).

All comparator implementation classes in HBase inherit from the parent class ByteArrayComparable, which implements the Comparable interface; comparators with different purposes differ only in how they implement the compareTo() method inherited from the parent class.

Here are the seven comparators that HBase Filter provides by default.

1. BinaryComparator

**Introduction:** Binary comparator, used to compare the specified byte array in lexicographic order.

Let's start with a small example:

import org.apache.hadoop.hbase.filter.BinaryComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class BinaryComparatorDemo {

    public static void main(String[] args) {

        BinaryComparator bc = new BinaryComparator(Bytes.toBytes("bbb"));

        int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
        System.out.println(code1); // 0
        int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
        System.out.println(code2); // 1
        int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
        System.out.println(code3); // -1
        int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
        System.out.println(code4); // -4
        int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
        System.out.println(code5); // -3
    }
}

From the output, the comparison rules of the comparator are as follows (the comparator's own value is the left operand, the argument is the right operand):

  • If the first bytes of the two strings differ, the method returns the difference between their ASCII codes
  • If the first bytes are the same, the following bytes are compared in turn until a difference is found, and the ASCII code difference of the differing bytes is returned
  • If the two strings differ in length and all bytes that can be compared are equal, the length difference between the two strings is returned

Only the sign of a non-zero result should be relied on: as the source code shows, implementation 1 returns -1 or 1 instead of the exact difference when the mismatch is found inside a multi-byte word.

Take a look at the source code of the compareTo() method that implements the above rules.

Implementation 1:

static enum UnsafeComparer implements Bytes.Comparer<byte[]> {
INSTANCE;
....
public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
	if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
		return 0;
	} else {
		int minLength = Math.min(length1, length2);
		int minWords = minLength / 8;
		long offset1Adj = (long)(offset1 + BYTE_ARRAY_BASE_OFFSET);
		long offset2Adj = (long)(offset2 + BYTE_ARRAY_BASE_OFFSET);
		int j = minWords << 3;

		int offset;
		for(offset = 0; offset < j; offset += 8) {
			long lw = theUnsafe.getLong(buffer1, offset1Adj + (long)offset);
			long rw = theUnsafe.getLong(buffer2, offset2Adj + (long)offset);
			long diff = lw ^ rw;
			if (diff != 0L) {
				return lessThanUnsignedLong(lw, rw) ? -1 : 1;
			}
		}

		offset = j;
		int b;
		int a;
		if (minLength - j >= 4) {
			a = theUnsafe.getInt(buffer1, offset1Adj + (long)j);
			b = theUnsafe.getInt(buffer2, offset2Adj + (long)j);
			if (a != b) {
				return lessThanUnsignedInt(a, b) ? -1 : 1;
			}

			offset = j + 4;
		}

		if (minLength - offset >= 2) {
			short sl = theUnsafe.getShort(buffer1, offset1Adj + (long)offset);
			short sr = theUnsafe.getShort(buffer2, offset2Adj + (long)offset);
			if (sl != sr) {
				return lessThanUnsignedShort(sl, sr) ? -1 : 1;
			}

			offset += 2;
		}

		if (minLength - offset == 1) {
			a = buffer1[offset1 + offset] & 255;
			b = buffer2[offset2 + offset] & 255;
			if (a != b) {
				return a - b;
			}
		}

		return length1 - length2;
	}
}
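The idea behind implementation 1 is to compare eight bytes per step instead of one. The following is a minimal sketch of that idea using only the JDK (`ByteBuffer` and `Long.compareUnsigned`) instead of `sun.misc.Unsafe`. The class name `WordCompareSketch` and the helper `ascii` are made up for illustration; this is not HBase's code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; WordCompareSketch is a hypothetical name, not an HBase class.
final class WordCompareSketch {

    // Unsigned lexicographic comparison that consumes 8 bytes per step where possible.
    static int compare(byte[] a, byte[] b) {
        int minLength = Math.min(a.length, b.length);
        int wordEnd = (minLength / 8) * 8;     // bytes covered by whole 8-byte words
        ByteBuffer left = ByteBuffer.wrap(a);  // ByteBuffer is big-endian by default
        ByteBuffer right = ByteBuffer.wrap(b);

        for (int i = 0; i < wordEnd; i += 8) {
            long lw = left.getLong(i);
            long rw = right.getLong(i);
            if (lw != rw) {
                // Big-endian words compared as unsigned longs order the same way as their bytes.
                return Long.compareUnsigned(lw, rw) < 0 ? -1 : 1;
            }
        }

        // Tail of fewer than 8 bytes: plain byte-by-byte comparison.
        for (int i = wordEnd; i < minLength; i++) {
            int x = a[i] & 0xFF;
            int y = b[i] & 0xFF;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // Mismatch inside the first 8-byte word: only the sign is reported.
        System.out.println(compare(ascii("aaaaaaaab"), ascii("aaaaaaazb"))); // -1
        // Mismatch in the tail: the byte difference is reported.
        System.out.println(compare(ascii("bbb"), ascii("bbf"))); // -4
    }
}
```

As in the source above, a mismatch found inside a word yields only -1 or 1, while a mismatch in the trailing bytes yields the byte difference.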

Implementation 2:

static enum PureJavaComparer implements Bytes.Comparer<byte[]> {
	INSTANCE;

	private PureJavaComparer() {
	}

	public int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
		if (buffer1 == buffer2 && offset1 == offset2 && length1 == length2) {
			return 0;
		} else {
			int end1 = offset1 + length1;
			int end2 = offset2 + length2;
			int i = offset1;

			for(int j = offset2; i < end1 && j < end2; ++j) {
				int a = buffer1[i] & 255;
				int b = buffer2[j] & 255;
				if (a != b) {
					return a - b;
				}

				++i;
			}

			return length1 - length2;
		}
	}
}
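Both implementations treat bytes as unsigned: the byte-level comparisons mask each byte with `& 255` before subtracting. Java's `byte` type is signed, so without the mask a byte such as 0x80 would sort before 0x7f. A small sketch of the difference follows; `UnsignedByteSketch` is a hypothetical class name used only here.

```java
// Illustrative sketch: shows why the source masks bytes with & 255 before comparing.
final class UnsignedByteSketch {

    // Signed difference: bytes 0x80..0xFF are treated as negative numbers.
    static int signedDiff(byte x, byte y) {
        return x - y;
    }

    // Unsigned difference, using the same masking as the compareTo() source above.
    static int unsignedDiff(byte x, byte y) {
        return (x & 255) - (y & 255);
    }

    public static void main(String[] args) {
        byte high = (byte) 0x80; // 128 when read as unsigned, -128 as a Java byte
        byte low = (byte) 0x7f;  // 127 either way
        System.out.println(signedDiff(high, low));   // -255
        System.out.println(unsignedDiff(high, low)); // 1
    }
}
```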

Implementation 1 is an optimized version of implementation 2; both come from the Bytes class. HBase prefers implementation 1 and falls back to implementation 2 if loading it throws an exception, as follows:

public static int compareTo(byte[] buffer1, int offset1, int length1, byte[] buffer2, int offset2, int length2) {
	return Bytes.LexicographicalComparerHolder.BEST_COMPARER.compareTo(buffer1, offset1, length1, buffer2, offset2, length2);
}
...
...

static final String UNSAFE_COMPARER_NAME = Bytes.LexicographicalComparerHolder.class.getName() + "$UnsafeComparer";
static final Bytes.Comparer<byte[]> BEST_COMPARER = getBestComparer();
static Bytes.Comparer<byte[]> getBestComparer() {
	try {
		Class<?> theClass = Class.forName(UNSAFE_COMPARER_NAME);
		Bytes.Comparer<byte[]> comparer = (Bytes.Comparer)theClass.getEnumConstants()[0];
		return comparer;
	} catch (Throwable var2) {
		return Bytes.lexicographicalComparerJavaImpl();
	}
}
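The load-by-name-with-fallback pattern used by getBestComparer() can be tried out in isolation. The sketch below is a simplified illustration, not HBase code: `BestComparerSketch` and the class name `example.hypothetical.FastComparer` are invented, and the latter deliberately does not exist so that the fallback path is taken.

```java
import java.util.Comparator;

// Illustrative sketch of "try the fast implementation, fall back on any failure".
final class BestComparerSketch {

    // Hypothetical class name; it is not on the classpath, so loading it fails.
    static final String FAST_IMPL_NAME = "example.hypothetical.FastComparer";

    // The always-available fallback implementation.
    static Comparator<String> fallback() {
        return Comparator.naturalOrder();
    }

    @SuppressWarnings("unchecked")
    static Comparator<String> getBest(String className) {
        try {
            Class<?> theClass = Class.forName(className);
            return (Comparator<String>) theClass.getDeclaredConstructor().newInstance();
        } catch (Throwable t) {
            // Throwable rather than Exception: class loading and static
            // initialization can fail with an Error as well.
            return fallback();
        }
    }

    public static void main(String[] args) {
        Comparator<String> comparer = getBest(FAST_IMPL_NAME);
        System.out.println(comparer.compare("a", "b") < 0); // true, via the fallback
    }
}
```

Resolving the choice once and storing it in a static final field, as BEST_COMPARER does, means the reflection cost is paid only at class initialization.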

2. BinaryPrefixComparator

**Introduction:** Binary prefix comparator; it only checks whether the prefix of the value is the same as the specified byte array.

Let's start with a small example:

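import org.apache.hadoop.hbase.filter.BinaryPrefixComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class BinaryPrefixComparatorDemo {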

    public static void main(String[] args) {

        BinaryPrefixComparator bc = new BinaryPrefixComparator(Bytes.toBytes("b"));

        int code1 = bc.compareTo(Bytes.toBytes("bbb"), 0, 3);
        System.out.println(code1); // 0
        int code2 = bc.compareTo(Bytes.toBytes("aaa"), 0, 3);
        System.out.println(code2); // 1
        int code3 = bc.compareTo(Bytes.toBytes("ccc"), 0, 3);
        System.out.println(code3); // -1
        int code4 = bc.compareTo(Bytes.toBytes("bbf"), 0, 3);
        System.out.println(code4); // 0
        int code5 = bc.compareTo(Bytes.toBytes("bbbedf"), 0, 6);
        System.out.println(code5); // 0
        int code6 = bc.compareTo(Bytes.toBytes("ebbedf"), 0, 6);
        System.out.println(code6); // -3
    }
}

This comparator is only a slight variation on BinaryComparator, as the following code makes clear:

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.compareTo(this.value, 0, this.value.length, value, offset, this.value.length <= length ? this.value.length : length);
}

Compare this with the corresponding BinaryComparator method:

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.compareTo(this.value, 0, this.value.length, value, offset, length);
}

The only difference is the last argument: BinaryPrefixComparator passes min(this.value.length, length) instead of length, so at most this.value.length bytes of the incoming value take part in the comparison. The subsequent byte-by-byte comparison therefore covers only the prefix.
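To see the effect of clamping the length, here is a minimal pure-Java sketch that puts the two variants side by side. `PrefixCompareSketch` and its method names are hypothetical, and `compare` is a simplified copy of the pure-Java logic shown earlier, not the HBase class itself.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch: full comparison versus prefix comparison.
final class PrefixCompareSketch {

    // Unsigned lexicographic comparison over explicit ranges.
    static int compare(byte[] b1, int o1, int l1, byte[] b2, int o2, int l2) {
        int end1 = o1 + l1;
        int end2 = o2 + l2;
        for (int i = o1, j = o2; i < end1 && j < end2; i++, j++) {
            int a = b1[i] & 0xFF;
            int b = b2[j] & 0xFF;
            if (a != b) {
                return a - b;
            }
        }
        return l1 - l2;
    }

    // BinaryComparator style: the whole value takes part.
    static int fullCompare(byte[] expected, byte[] value) {
        return compare(expected, 0, expected.length, value, 0, value.length);
    }

    // BinaryPrefixComparator style: the value's length is clamped to the prefix length.
    static int prefixCompare(byte[] expected, byte[] value) {
        int len = Math.min(expected.length, value.length);
        return compare(expected, 0, expected.length, value, 0, len);
    }

    static byte[] ascii(String s) {
        return s.getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        System.out.println(fullCompare(ascii("b"), ascii("bbbedf")));   // -5 (length difference)
        System.out.println(prefixCompare(ascii("b"), ascii("bbbedf"))); // 0 (prefix matches)
        System.out.println(prefixCompare(ascii("bbb"), ascii("bb")));   // 1 (value shorter than prefix)
    }
}
```

The last call shows that a value shorter than the prefix never compares as equal, because the length difference is still returned.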

3. BitComparator

**Introduction:** Bitwise comparator; it compares using the AND, OR, and XOR operations provided by BitwiseOp. The result is either 1 or 0, and only EQUAL and NOT_EQUAL are supported.

Let's start with a small example:

import org.apache.hadoop.hbase.filter.BitComparator;

public class BitComparatorDemo {

    public static void main(String[] args) {

        // Bitwise OR on equal-length arrays: bytes are compared from the last one to the first. If every OR result is 0, return 1; otherwise return 0.
        BitComparator bc1 = new BitComparator(new byte[]{0,0,0,0}, BitComparator.BitwiseOp.OR);
        int i = bc1.compareTo(new byte[]{0,0,0,0}, 0, 4);
        System.out.println(i); // 1
        // Bitwise AND on equal-length arrays: bytes are compared from the last one to the first. If every AND result is 0, return 1; otherwise return 0.
        BitComparator bc2 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.AND);
        int j = bc2.compareTo(new byte[]{0,1,0,1}, 0, 4);
        System.out.println(j); // 1
        // Bitwise XOR on equal-length arrays: bytes are compared from the last one to the first. If every XOR result is 0, return 1; otherwise return 0.
        BitComparator bc3 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
        int x = bc3.compareTo(new byte[]{1,0,1,0}, 0, 4);
        System.out.println(x); // 1
        // If the lengths differ, return 1 without comparing any bytes.
        BitComparator bc4 = new BitComparator(new byte[]{1,0,1,0}, BitComparator.BitwiseOp.XOR);
        int y = bc4.compareTo(new byte[]{1,0,1}, 0, 3);
        System.out.println(y); // 1
    }
}

The rules described in the comments above correspond to the following code:

public int compareTo(byte[] value, int offset, int length) {
	if (length != this.value.length) {
		return 1;
	} else {
		int b = 0;

		for(int i = length - 1; i >= 0 && b == 0; --i) {
			switch(this.bitOperator) {
			case AND:
				b = this.value[i] & value[i + offset] & 255;
				break;
			case OR:
				b = (this.value[i] | value[i + offset]) & 255;
				break;
			case XOR:
				b = (this.value[i] ^ value[i + offset]) & 255;
			}
		}

		return b == 0 ? 1 : 0;
	}
}

The core idea: apply the bitwise operation byte by byte, starting from the last byte, and exit the loop as soon as b != 0. A non-zero result means a match (return 0); if every result is 0, the method returns 1.
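A practical reading of this logic is that AND works as an "is any masked bit set?" test. The sketch below re-implements the loop in plain Java so it can be run without HBase; `BitCompareSketch` and its `Op` enum are hypothetical stand-ins for BitComparator and BitwiseOp, and the offset parameter is left out for brevity.

```java
// Illustrative sketch of the bitwise comparison loop; not the HBase class itself.
final class BitCompareSketch {

    enum Op { AND, OR, XOR }

    // Returns 0 (match) as soon as some byte pair yields a non-zero result, otherwise 1.
    static int compare(byte[] expected, Op op, byte[] value) {
        if (value.length != expected.length) {
            return 1; // different lengths never match
        }
        int b = 0;
        for (int i = value.length - 1; i >= 0 && b == 0; i--) {
            switch (op) {
                case AND:
                    b = (expected[i] & value[i]) & 0xFF;
                    break;
                case OR:
                    b = (expected[i] | value[i]) & 0xFF;
                    break;
                case XOR:
                    b = (expected[i] ^ value[i]) & 0xFF;
                    break;
            }
        }
        return b == 0 ? 1 : 0;
    }

    public static void main(String[] args) {
        byte[] mask = {0x04};
        System.out.println(compare(mask, Op.AND, new byte[]{0x06})); // 0: bit 0x04 is set
        System.out.println(compare(mask, Op.AND, new byte[]{0x03})); // 1: bit 0x04 is not set
        System.out.println(compare(new byte[]{1, 0, 1, 0}, Op.XOR, new byte[]{1, 0, 1, 0})); // 1: identical
        System.out.println(compare(new byte[]{1, 0, 1, 0}, Op.XOR, new byte[]{1, 0, 1, 1})); // 0: they differ
    }
}
```

Note that XOR reports a match (0) when the arrays differ, which is the opposite of what an equality check would return.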

4. LongComparator

**Introduction:** A comparator dedicated to long values; it returns 0, -1, or 1. It was not mentioned in the previous overview, so it is added here.

Let's start with a small example:

import org.apache.hadoop.hbase.filter.LongComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class LongComparatorDemo {

    public static void main(String[] args) {
        LongComparator longComparator = new LongComparator(1000L);
        int i = longComparator.compareTo(Bytes.toBytes(1000L), 0, 8);
        System.out.println(i); // 0
        int i2 = longComparator.compareTo(Bytes.toBytes(1001L), 0, 8);
        System.out.println(i2); // -1
        int i3 = longComparator.compareTo(Bytes.toBytes(998L), 0, 8);
        System.out.println(i3); // 1
    }
}

The implementation of this comparator is quite simple:

public int compareTo(byte[] value, int offset, int length) {
	Long that = Bytes.toLong(value, offset, length);
	return this.longValue.compareTo(that);
}
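The decode-then-compare step can be reproduced with the JDK alone. The sketch below assumes the value is an 8-byte big-endian long, which is the layout `Bytes.toBytes(long)` produces; `LongCompareSketch` and its methods are hypothetical names. It also shows why a numeric comparator is needed for negative numbers: an unsigned byte-wise comparison would order -1 after 1.

```java
import java.nio.ByteBuffer;

// Illustrative sketch: numeric comparison of longs stored as 8 big-endian bytes.
final class LongCompareSketch {

    // Encodes a long as 8 big-endian bytes (ByteBuffer's default byte order).
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Decodes the stored bytes and compares numerically: returns -1, 0, or 1.
    static int compare(long expected, byte[] value, int offset, int length) {
        long that = ByteBuffer.wrap(value, offset, length).getLong();
        return Long.compare(expected, that);
    }

    // Unsigned difference of the first bytes, i.e. what a byte-wise comparator sees first.
    static int firstByteDiff(long left, long right) {
        return (toBytes(left)[0] & 0xFF) - (toBytes(right)[0] & 0xFF);
    }

    public static void main(String[] args) {
        System.out.println(compare(1000L, toBytes(1001L), 0, 8)); // -1
        System.out.println(compare(-1L, toBytes(1L), 0, 8));      // -1: numerically, -1 < 1
        System.out.println(firstByteDiff(-1L, 1L));               // 255: byte-wise, -1 sorts after 1
    }
}
```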

5. NullComparator

**Introduction:** Null comparator; it checks whether the current value is null. It returns 0 for null and 1 for non-null. Only EQUAL and NOT_EQUAL are supported.

Let's start with a small example:

import org.apache.hadoop.hbase.filter.NullComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class NullComparatorDemo {

    public static void main(String[] args) {
        NullComparator nc = new NullComparator();
        int i1 = nc.compareTo(Bytes.toBytes("abc"));
        int i2 = nc.compareTo(Bytes.toBytes(""));
        int i3 = nc.compareTo(null);
        System.out.println(i1); // 1
        System.out.println(i2); // 1
        System.out.println(i3); // 0
    }
}

The implementation of this comparator is also quite simple:

public int compareTo(byte[] value) {
	return value != null ? 1 : 0;
}

6. RegexStringComparator

**Introduction:** Regular expression comparator; it matches the value against a regular expression and supports only EQUAL and NOT_EQUAL. It returns 0 for a successful match and 1 for a failed match.

Let's start with a small example:

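import org.apache.hadoop.hbase.filter.RegexStringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class RegexStringComparatorDemo {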

    public static void main(String[] args) {
        RegexStringComparator rsc = new RegexStringComparator("abc");
        int abc = rsc.compareTo(Bytes.toBytes("abcd"), 0, 3);
        System.out.println(abc); // 0
        int bcd = rsc.compareTo(Bytes.toBytes("bcd"), 0, 3);
        System.out.println(bcd); // 1

        String check = "^([a-z0-9A-Z]+[-|\\.]?)+[a-z0-9A-Z]@([a-z0-9A-Z]+(-[a-z0-9A-Z]+)?\\.)+[a-zA-Z]{2,}$";
        RegexStringComparator rsc2 = new RegexStringComparator(check);
        int code = rsc2.compareTo(Bytes.toBytes("zpb@163.com"), 0, "zpb@163.com".length());
        System.out.println(code); // 0
        int code2 = rsc2.compareTo(Bytes.toBytes("zpb#163.com"), 0, "zpb#163.com".length());
        System.out.println(code2); // 1
    }
}

Its compareTo() method delegates to one of two engines, corresponding to two regular expression implementations: JAVA and JONI (the regular expression engine from the JRuby project). The default is RegexStringComparator.EngineType.JAVA, as follows:

public int compareTo(byte[] value, int offset, int length) {
	return this.engine.compareTo(value, offset, length);
}

public static enum EngineType {
	JAVA,
	JONI;

	private EngineType() {
	}
}

The concrete implementations simply run the regular expression matcher. The following is the JAVA EngineType implementation:

public int compareTo(byte[] value, int offset, int length) {
	String tmp;
	if (length < value.length / 2) {
		tmp = new String(Arrays.copyOfRange(value, offset, offset + length), this.charset);
	} else {
		tmp = new String(value, offset, length, this.charset);
	}

	return this.pattern.matcher(tmp).find() ? 0 : 1;
}
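One detail worth noticing in the JAVA engine is that it calls find() rather than matches(), so the pattern only has to occur somewhere in the value unless it is anchored with ^ and $. A minimal sketch with java.util.regex follows; `RegexCompareSketch` is a hypothetical name and UTF-8 is assumed as the charset.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Illustrative sketch: find() versus matches() semantics for a regex comparator.
final class RegexCompareSketch {

    // Same shape as the JAVA engine above: 0 if the pattern occurs anywhere in the value.
    static int compareFind(Pattern pattern, byte[] value, int offset, int length) {
        String tmp = new String(value, offset, length, StandardCharsets.UTF_8);
        return pattern.matcher(tmp).find() ? 0 : 1;
    }

    // For contrast only: 0 only if the whole value matches the pattern.
    static int compareMatches(Pattern pattern, byte[] value, int offset, int length) {
        String tmp = new String(value, offset, length, StandardCharsets.UTF_8);
        return pattern.matcher(tmp).matches() ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] value = "xxabcxx".getBytes(StandardCharsets.UTF_8);
        System.out.println(compareFind(Pattern.compile("abc"), value, 0, value.length));    // 0
        System.out.println(compareMatches(Pattern.compile("abc"), value, 0, value.length)); // 1
        System.out.println(compareFind(Pattern.compile("^abc$"), value, 0, value.length));  // 1
    }
}
```

This is why the e-mail pattern in the demo is wrapped in ^ and $: without the anchors, any value containing a valid address would match.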

JONI EngineType implementation:

public int compareTo(byte[] value, int offset, int length) {
	Matcher m = this.pattern.matcher(value);
	return m.search(offset, length, this.pattern.getOptions()) < 0 ? 1 : 0;
}

Both are straightforward and need no further explanation.

7. SubstringComparator

**Introduction:** Determines whether the provided substring appears in the value, ignoring case. It returns 0 if the value contains the substring and 1 otherwise; only EQUAL and NOT_EQUAL are supported.

Let's start with a small example:

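import org.apache.hadoop.hbase.filter.SubstringComparator;
import org.apache.hadoop.hbase.util.Bytes;

public class SubstringComparatorDemo {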

    public static void main(String[] args) {
        String value = "aslfjllkabcxxljsl";
        SubstringComparator sc = new SubstringComparator("abc");
        int i = sc.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i); // 0

        SubstringComparator sc2 = new SubstringComparator("abd");
        int i2 = sc2.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i2); // 1

        SubstringComparator sc3 = new SubstringComparator("ABC");
        int i3 = sc3.compareTo(Bytes.toBytes(value), 0, value.length());
        System.out.println(i3); // 0
    }
}

The implementation of this comparator is also quite simple:

public int compareTo(byte[] value, int offset, int length) {
	return Bytes.toString(value, offset, length).toLowerCase().contains(this.substr) ? 0 : 1;
}
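The compareTo() shown here lowercases only the incoming value, so for the "ABC" case in the demo to return 0 the substring itself must already be lowercase by the time it is stored. The sketch below assumes the constructor does that normalization; `SubstringCompareSketch` is a hypothetical stand-in written with the JDK only, and the use of Locale.ROOT and UTF-8 is this sketch's choice.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Illustrative sketch of a case-insensitive substring comparator; not the HBase class.
final class SubstringCompareSketch {

    private final String substr;

    SubstringCompareSketch(String substr) {
        // Assumption: the search term is lowercased once, up front.
        this.substr = substr.toLowerCase(Locale.ROOT);
    }

    // Returns 0 if the lowercased value contains the lowercased substring, otherwise 1.
    int compareTo(byte[] value, int offset, int length) {
        String text = new String(value, offset, length, StandardCharsets.UTF_8);
        return text.toLowerCase(Locale.ROOT).contains(substr) ? 0 : 1;
    }

    public static void main(String[] args) {
        byte[] value = "aslfjllkabcxxljsl".getBytes(StandardCharsets.UTF_8);
        System.out.println(new SubstringCompareSketch("ABC").compareTo(value, 0, value.length)); // 0
        System.out.println(new SubstringCompareSketch("abd").compareTo(value, 0, value.length)); // 1
    }
}
```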

This concludes the introduction to the seven comparators. Even if you are not interested in the source code, it is worth going through the small examples in this article to become familiar with each comparator's constructor and output, since they come up frequently when using HBase filters. Besides these seven comparators, you can also write a custom comparator.


Posted by Oni on Sat, 25 Apr 2020 06:53:54 -0700