I am going to be honest… I haven’t touched binary operations since I attended a university assembly class about 20 years ago. So when I came across the writeVInt and readVInt methods from DataOutput and DataInput base classes I thought this would be a good to brush up. I lost a good few days because I did not consider the difference between arithmetic and logical shifts.
Unlike Java Rust does not have separate operators for arithmetic and logical operators. In java >>
is arithmetic and >>>
is logical. However in the rust documentation there was a footnote I completely glanced over stating.
** Arithmetic right shift on signed integer types, logical right shift on unsigned integer types.
https://doc.rust-lang.org/reference/expressions/operator-expr.html#arithmetic-and-logical-binary-operators
So what does this mean in practice is a signed data primitive like i8 when shifted gets a 0 or 1 based on it”s sign. So -127 (10000000) becomes -64 (11000000). So what is wrong with this logic? The implementation of variable length quantity in Lucene uses prefix 0’s to determine how many 7 bit bytes to write. For instance a typical VByte encoding is:
Value | byte 1 | byte 2 | byte3 |
0 | 00000000 | ||
1 | 00000001 | ||
2 | 00000010 | ||
127 | 01111111 | ||
128 | 10000000 | 00000001 | |
129 | 10000001 | 00000001 | |
130 | 10000010 | 00000001 | |
16383 | 11111111 | 01111111 | |
16384 | 10000000 | 10000000 | 00000001 |
So for positive or unsigned numbers this logic transferred over easily. However a negative number would cause an infinite loop. To work around this I flip the sign after shifting the bits for the first run. So it now looks like this.
loop step | value | binary |
1 | -2147483648 | 0b10000000000000000000000000000000 |
2 | 16777216 | 0b00000001000000000000000000000000 |
3 | 131072 | 0b00000000000000100000000000000000 |
4 | 1024 | 0b00000000000000000000010000000000 |
5 | 8 | 0b00000000000000000000000000001000 |
After spending all that time making negative numbers working with variable length integers I realized that use case may never be used. The documentation for those methods specifically states
Negative numbers are supported, but should be avoided.
https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/store/DataOutput.html
When I asked the lucene mailing list about this I got
They are fully supported, so you can write and read them.
http://mail-archives.apache.org/mod_mbox/lucene-java-user/202109.mbox/%3cBEFCBB73-0848-4EB6-80E8-249D3D14EE30@thetaphi.de%3e
The problem with negative numbers is that they need lot of (disk) space, because in two’s complement they have almost all bits set. The largest number is kinds of disk space is -1. Negative numbers appear in older index formats, so they can’t be prevented. Just take the comment as given: all is supported, but if you want to store negative numbers use a different encoding, e.g. zigzag.
Nevertheless I now know the difference between arithmetic and logical shifts in Rust.