How porting Lucene made me care about bit operations…

I am going to be honest… I haven’t touched binary operations since I attended a university assembly class about 20 years ago. So when I came across the writeVInt and readVInt methods from DataOutput and DataInput base classes I thought this would be a good to brush up. I lost a good few days because I did not consider the difference between arithmetic and logical shifts.

Unlike Java Rust does not have separate operators for arithmetic and logical operators. In java >> is arithmetic and >>> is logical. However in the rust documentation there was a footnote I completely glanced over stating.

** Arithmetic right shift on signed integer types, logical right shift on unsigned integer types.

https://doc.rust-lang.org/reference/expressions/operator-expr.html#arithmetic-and-logical-binary-operators

So what does this mean in practice is a signed data primitive like i8 when shifted gets a 0 or 1 based on it”s sign. So -127 (10000000) becomes -64 (11000000). So what is wrong with this logic? The implementation of variable length quantity in Lucene uses prefix 0’s to determine how many 7 bit bytes to write. For instance a typical VByte encoding is:

Valuebyte 1byte 2byte3
000000000
100000001
200000010
12701111111
1281000000000000001
1291000000100000001
1301000001000000001
163831111111101111111
16384100000001000000000000001

So for positive or unsigned numbers this logic transferred over easily. However a negative number would cause an infinite loop. To work around this I flip the sign after shifting the bits for the first run. So it now looks like this.

loop stepvaluebinary
1-21474836480b10000000000000000000000000000000
2167772160b00000001000000000000000000000000
31310720b00000000000000100000000000000000
410240b00000000000000000000010000000000
580b00000000000000000000000000001000

After spending all that time making negative numbers working with variable length integers I realized that use case may never be used. The documentation for those methods specifically states

Negative numbers are supported, but should be avoided.

https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/store/DataOutput.html

When I asked the lucene mailing list about this I got

They are fully supported, so you can write and read them.
The problem with negative numbers is that they need lot of (disk) space, because in two’s complement they have almost all bits set. The largest number is kinds of disk space is -1. Negative numbers appear in older index formats, so they can’t be prevented. Just take the comment as given: all is supported, but if you want to store negative numbers use a different encoding, e.g. zigzag.

http://mail-archives.apache.org/mod_mbox/lucene-java-user/202109.mbox/%3cBEFCBB73-0848-4EB6-80E8-249D3D14EE30@thetaphi.de%3e

Nevertheless I now know the difference between arithmetic and logical shifts in Rust.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s