How porting Lucene made me care about bit operations…

I am going to be honest… I haven’t touched binary operations since I attended a university assembly class about 20 years ago. So when I came across the writeVInt and readVInt methods from DataOutput and DataInput base classes I thought this would be a good to brush up. I lost a good few days because I did not consider the difference between arithmetic and logical shifts.

Unlike Java Rust does not have separate operators for arithmetic and logical operators. In java >> is arithmetic and >>> is logical. However in the rust documentation there was a footnote I completely glanced over stating.

** Arithmetic right shift on signed integer types, logical right shift on unsigned integer types.
https://doc.rust-lang.org/reference/expressions/operator-expr.html#arithmetic-and-logical-binary-operators

So what does this mean in practice is a signed data primitive like i8 when shifted gets a 0 or 1 based on it”s sign. So -127 (10000000) becomes -64 (11000000). So what is wrong with this logic? The implementation of variable length quantity in Lucene uses prefix 0’s to determine how many 7 bit bytes to write. For instance a typical VByte encoding is:

Value	byte 1	byte 2	byte3
0	00000000
1	00000001
2	00000010
127	01111111
128	10000000	00000001
129	10000001	00000001
130	10000010	00000001
16383	11111111	01111111
16384	10000000	10000000	00000001

So for positive or unsigned numbers this logic transferred over easily. However a negative number would cause an infinite loop. To work around this I flip the sign after shifting the bits for the first run. So it now looks like this.

loop step	value	binary
1	-2147483648	0b10000000000000000000000000000000
2	16777216	0b00000001000000000000000000000000
3	131072	0b00000000000000100000000000000000
4	1024	0b00000000000000000000010000000000
5	8	0b00000000000000000000000000001000

After spending all that time making negative numbers working with variable length integers I realized that use case may never be used. The documentation for those methods specifically states

Negative numbers are supported, but should be avoided.
https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/store/DataOutput.html

When I asked the lucene mailing list about this I got

They are fully supported, so you can write and read them.
The problem with negative numbers is that they need lot of (disk) space, because in two’s complement they have almost all bits set. The largest number is kinds of disk space is -1. Negative numbers appear in older index formats, so they can’t be prevented. Just take the comment as given: all is supported, but if you want to store negative numbers use a different encoding, e.g. zigzag.
http://mail-archives.apache.org/mod_mbox/lucene-java-user/202109.mbox/%3cBEFCBB73-0848-4EB6-80E8-249D3D14EE30@thetaphi.de%3e

Nevertheless I now know the difference between arithmetic and logical shifts in Rust.

Aaron Cohen

Thoughts on software development & life

How porting Lucene made me care about bit operations…

Leave a comment Cancel reply

Share this:

Related

Leave a comment Cancel reply