Lessons from porting Lucene: Equality

Java and Rust take two different approaches to evaluating equality of objects. Yes on the surface they are trying to accomplish the same thing but the differences may result in double the methods. First let’s get a foundation in equality.

For the purpose of this article there are three types of equality.

  • Reflexive – An object is “equal to” another object. (eg. x==x)
  • Symmetric – An object is “equal to” another object in both directions (eg. x==y && y==x)
  • Transitive – If an object is “equal to” a second object and that object is “equal to” a third object then the first and the third are “equal to” (eg. x==y && y==z then x==z)

Now both Java and Rust break down comparing just symmetric/transitive and all three. This is sometimes referred to as shallow and deep comparisons. Shallow (or flat) compares the values identity and deep compares the fields/values. Java and Rust’s approach to deep comparisons are the same. So when porting from Java to Rust the “Equals”method logic can directly be ported to “PartialEq”, However shallow comparisons are where they differ. Yes they both do shallow, memcmp-like, pointer-comparing versions. The first difference is Java also leverages a overridable hashCode method to provide a unique identifier for an object. Second to that is Rust will fall back onto deep comparisons (partialEQ) in some scenarios. For instance Floating point NaN can’t be equal.

So what does that mean for my porting process? Because I want to maintain some level of backwards compatibility I need to port the hashCode() method even though it is useless to Rust.

How porting Lucene made me care about bit operations…

I am going to be honest… I haven’t touched binary operations since I attended a university assembly class about 20 years ago. So when I came across the writeVInt and readVInt methods from DataOutput and DataInput base classes I thought this would be a good to brush up. I lost a good few days because I did not consider the difference between arithmetic and logical shifts.

Unlike Java Rust does not have separate operators for arithmetic and logical operators. In java >> is arithmetic and >>> is logical. However in the rust documentation there was a footnote I completely glanced over stating.

** Arithmetic right shift on signed integer types, logical right shift on unsigned integer types.


So what does this mean in practice is a signed data primitive like i8 when shifted gets a 0 or 1 based on it”s sign. So -127 (10000000) becomes -64 (11000000). So what is wrong with this logic? The implementation of variable length quantity in Lucene uses prefix 0’s to determine how many 7 bit bytes to write. For instance a typical VByte encoding is:

Valuebyte 1byte 2byte3

So for positive or unsigned numbers this logic transferred over easily. However a negative number would cause an infinite loop. To work around this I flip the sign after shifting the bits for the first run. So it now looks like this.

loop stepvaluebinary

After spending all that time making negative numbers working with variable length integers I realized that use case may never be used. The documentation for those methods specifically states

Negative numbers are supported, but should be avoided.


When I asked the lucene mailing list about this I got

They are fully supported, so you can write and read them.
The problem with negative numbers is that they need lot of (disk) space, because in two’s complement they have almost all bits set. The largest number is kinds of disk space is -1. Negative numbers appear in older index formats, so they can’t be prevented. Just take the comment as given: all is supported, but if you want to store negative numbers use a different encoding, e.g. zigzag.


Nevertheless I now know the difference between arithmetic and logical shifts in Rust.

The most carrot pasta I have ever made.

During the year of Covid I started a garden. I’ve had plants in pots before but this was the first time I had a section of land dedicated to plants. It turned out to be a great education opportunity for my son. However now I need to actually do something with what I grew. I was amazed how much plant there was above the carrot and wanted to make a meal which used it.


  • 1+ carrots with top. Enough to make 2 cups of greens once the stems are removed.
  • zucchini (optional)
  • 2 cups of baby spinach
  • 3-4 cloves of garlic.
  • 1 cup of roasted unsalted cashews
  • olive oil
  • Salt & pepper
  • 1lb of pasta (penne works well)
  • Parmesan to taste


  • Preheat oven to 425
  • Separate carrot(s) from top and wash/peel
  • Slice carrots and zucchini (optional) to somewhat equal size
  • Toss carrots with olive oil, salt, and pepper.
  • Place on sheet pan and put in the oven for about 20 minutes
  • Wash and separate the greens from the tough stems. Also remove anything that looks…ugly
  • Put greens, spinach, garlic, roasted unsalted cashews, and 1 cup of olive oil into a blender.
  • Pulse until smooth. Depending on your blender you may want to add the olive oil in parts.
  • Cook pasta until al-dente
  • Drain most the water but not all. roughly half a cup
  • Combine all and serve with parmesan.

Porting Lucene: Iteration 1 – Decisions about how to start porting…

So time to make a list of files to port. Lets see how many java files there are…

find lucene |grep ".java" |wc -l

Hmm…. OK let’s try focusing on Test cases instead…

find lucene |grep ".java" |grep Test |wc -l

A bit better but not great. Let’s focus on the first thing you need to create an index…

find lucene/core/src/test/org/apache/lucene/store |grep ".java" |grep Test |wc -l

Looking good….Oh wait…. and then there are dependencies

 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * See the License for the specific language governing permissions and
 * limitations under the License.
package org.apache.lucene.store;

import static org.junit.Assert.*;

import com.carrotsearch.randomizedtesting.RandomizedTest;
import com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom;
import com.carrotsearch.randomizedtesting.generators.RandomBytes;
import com.carrotsearch.randomizedtesting.generators.RandomNumbers;
import com.carrotsearch.randomizedtesting.generators.RandomPicks;
import com.carrotsearch.randomizedtesting.generators.RandomStrings;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.IOUtils.IOConsumer;
import org.junit.Test;

public abstract class BaseDataOutputTestCase extends RandomizedTest {
protected abstract T newInstance();

So org.apache.lucene.util is too big to implement at once.

find lucene/core/src/test/org/apache/lucene/util/ |grep ".java" |grep Test |wc -l

So in theory implementing the rest of core should result in the util package getting fully implemented. Now Lucene also has it’s own Test Framework.

Test Frameworks

So here is the conundrum. The problem with JNI is you are living in two domains (JavaVM & System).It is very tempting to use the Java test cases as a source of truth for the Rust code behavior. The key problem is when it goes wrong is the problem in the java code, JNI, or the rust code? The alternative is to have two sets of books. port the tests to rust and run both. This essentially doubles the amount of work. However should save an incredible amount of time when tests fail. The resulting logic should be

Rust test failsJUnit test failsNext steps
YesYesFocus on changing the code to make the Rust test pass.
YesNoUpdate the Rust tests to match the logic of the JUnit Test
NoYesEnsure the logic in the rust test match the JUnit test. If so focus on the JNI glue code.

Grill + Veggies + Pasta

Two of our new favorites summer pandemic activities come together in this dish which is gardening and grilling. Grilling because I got a new grill and I intend to use it. We also joined our community garden and now have more squash than we can count.


Pasta: Penne works best but others like rigatoni or elbow macaroni will work to.


  • 1-2 large Zucchini
  • 1-2 large Summer squash
  • 1 lb Asparagus
  • 1 ln Bell peppers
  • Kosher salt & pepper to taste


  • 1 package of goat cheese
  • 1/3 cup olive oil
  • 3 Tbsp balsamic vinegar
  • 2 Tbsp mayonnaise
  • 1/2 Tbsp Dijon mustard
  • 1 clove garlic, minced
  • 1/2 tsp dried basil
  • 1/2 tsp salt
  • Pepper to taste


  • Warm up the grill to 400-500 degrees F
  • Toss asparagus with olive oil, salt & pepper
  • Cut squash evenly and toss with olive oil, salt, & pepper.
  • Place veggies on grill. You may want to do this in batches. Asparagus and peppers are usually done first. I usually rotate them every 2 minutes.squash is usually closer to 4 depending on thickness.
  • Cook past to slightly Al dente. Reserve 1/3-1/2 cup of pasta water and drain the rest.
  • Combine reserved water with the rest of the sauce ingredients minus the goat cheese.
  • Roughly chop veggies.
  • Combine pasta, sauce, & veggies.
  • Crumble in goat cheese and combine.

Porting Lucene: Iteration 1 – Project Setup

So I thought project setup would be the easiest item to complete. Turns out the due diligence was far greater than I thought. Why? In short there are tools that have best practices for a project built in like Gradle or Cargo. However they just didn’t quite fit what I needed. Letls look at those use-cases.

Little to no setup required

This is something I feel very strongly about. This project should make contributing absolutely frictionless. You should not need a specific IDE or require changing versions of system libraries.

Don’t introduce yet another language if you don’t have to

Choosing the best tool for a task is important. However there is a certain amount of overhead with each language. Some require additional tooling and may not be widely known. So reusing languages that are used elsewhere in the project is preferable. 

Option #1: Gradle

This should be a no brainer. Lucene uses Gradle. I’m porting Lucene. I should use Gradle. However while I am porting Lucene to learn I want to make this easy for others to adopt. Gradle brings in either Groovy or Kotlin.

Option #2: Cargo

Cargo is the Rust package manager. Cargo downloads your Rust package‘s dependencies, compiles your packages, makes distributable packages, and uploads them to crates.io, the Rust community’s package registry

OK… That I stole from Cargo’s guide. Cargo is going to be needed for building/managing the rust components. However we really need a level of orchestration above it.

Option #3: Maven

Maven for the longest time was the work horse for most Java development. Unlike Gradle Maven can be done using xml exclusively. This then becomes a discussion of project management via configuration or code. Both have their time/place. As a project becomes increasingly complex code becomes preferable over configuration.

Option #4: Make

There is something to be said for keeping it simple. Make is typically included in every distribution and shell scripting provides the capabilities I need. While shell scripting is yet another language it is already pulled in by Docker. This makes the most sense.


A mixture of Option #2, #3, & #4 seems like the best course. Due to the expected complexity of the project Make seems like the best option as it has the least amount of dependencies. Maven & Cargo can then be used for Java & Rust sub-components respectively.

Kubernetes useful tricks: Creating a secret from a file in an image

Recently I had an interesting problem. The product I was working on needed to create a secret from a file. Now on one hand this is an easy thing as you can just have a job run within an image

kubectl create secret generic mysecret --from-file=./file.txt

Ah… However Kubernetes command line client is only compatible within one minor version of the Kubernetes api server. So if you want to support all major version for Redhat Openshift currently supported you have to support Kubernetes 1.11 to 1.21. So to get around this problem you have to curl against the kubapi server directly. Here is a sample script I created to demonstrate this method.

apiVersion: batch/v1
kind: Job
  name: readfiletosecret
  description: "Example of reading a file to a kubernetes secret."
      serviceAccountName: account-with-secret-create-priv
      - name: local
        emptyDir: {}
      - name: get-file
        image: registry.access.redhat.com/ubi8/ubi-minimal:latest
        - "/bin/sh"
        - "-c"
        - name: UPLOAD_FILE_PATH
          value: "/root/buildinfo/content_manifests/ubi8-minimal-container*.json"
        - |
          cat $UPLOAD_FILE_PATH
          cp -vf $UPLOAD_FILE_PATH /work/
        - name: local
          mountPath: /work
      - name: create-secret
        image: registry.access.redhat.com/ubi8/ubi-minimal:latest
        - "/bin/sh"
        - "-c"
        - |
          ls /work/
          export CONTENT=$(cat /work/* | base64 )
          echo $CONTENT
          #Set auth info
          export SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
          export NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
          export TOKEN=$(cat ${SERVICEACCOUNT}/token)
          export CACERT=${SERVICEACCOUNT}/ca.crt
          export APISERVER="https://kubernetes.default.svc"
          # Explore the API with TOKEN
          curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' -d @-  ${APISERVER}/api/v1/namespaces/$NAMESPACE/secrets <<EOF
            "kind": "Secret",
            "apiVersion": "v1",
            "metadata": {
              "name": "example"
            "data": {
              "file": "$CONTENT"
          rm /work/*
        - name: local
          mountPath: /work
      restartPolicy: Never
  backoffLimit: 4

Porting Lucene: Iteration 0

Iteration 0 is often used to create a product backlog or setup technical foundation (code repos, build pipelines, etc..). I frankly thought the concept was odd. In Agile you work to dates not scope. So to have an iteration that defines scope is odd. Thus I have a different take on it.

Iteration 0 is when you “start to know what you don’t” and with each subsequent iteration you learn more and the plan changes; So with that said it is time to prime the product backlog. Let’s start with defining goals.

Goal #1: Port Lucene from Java to Rust

While it seems straight forward we need to understand what Java parts are in the Lucene project. So let’s start start with a…

git clone https://github.com/apache/lucene.git
cd lucene
LICENSE		build.gradle	dev-docs	gradle		gradlew.bat	lucene		versions.lock
README.md	buildSrc	dev-tools	gradlew		help		settings.gradle	versions.props

OK, So right off the bat I can see it is a Gradle project. Gradle supports multiple languages out of the box and supports plugins to support others like Rust. Here we have our first decision… Do we port the Lucene library code or do we port the project? Let’s revisit that later after we are done exploring.

The interesting thing is there are a good number of files that are very specific to how Apache runs their projects and performs releases. For instance I haven’t seen an RDF document in probably 10 years yet here is a DOAP document defining the project and all the releases per Apache standards. The two items which I should focus on is the Lucene directory with all the java files and dev-tools/scripts/smokeTestRelease.py. The later will be useful for the next goal.

Goal #2: It should pass existing test scripts

So I think this is something which will set the project apart from other ports. Providing a way to reuse existing test scripts will ensure compatibility over time. However doing so will require the ability to do 2 things.

Leverage JNI from RUST

It has been a while since I’ve used JNI so this will be a good refresher. Based on existing documentation it seems this is done all the time for Android but still an experiment is required to ensure there are not any unforeseen gotchas.

The ability to sync Tests with the parent project

This is going to be a hard one. Reading through dev-tools/scripts/smokeTestRelease.py the Release verification is more than just Unit Tests. It also verifies digest mismatch, documentation, and missing metadata. Not all of these verifications will be applicable. For instance verifying the jar “Implementation-Vendor”metadata would not apply. So valid parts of this script will need to be ported and maintained.

At first glance majority of the verification is in the form of unit tests. In fact it looks like roughly 33% of all the Java source files are unit tests. That being said the Gradle build handles preparing data which may be used in those tests.

find lucene/ |grep ".java" |wc -l
find lucene/ |grep ".java" | grep test | wc -l

So we probably need a mechanism to

  • pull the latest project from Lucene
  • Clear out the non-test related Java source files
  • Update the build scripts to leverage an external JAR
  • Run the tests

Of course this means the code for the port needs to be managed separately from Lucene. This is turning into more of a complex project than I expected. Going to need to spend some time understanding a good project setup which will allow this. However it is currently the end of this iteration so that will have to wait for the next one.

Building a boat in the basement….

It goes without saying that 2020 was a tough year. While 2021 is getting better we are still not out of the woods yet. Sometimes the best way to get through it is to have a project that isn’t related to work but helps exercise the mind. Now I am not actually building a boat. My wife would kill me. However I am starting a project of similar ambition.

While granted I am a manager I still like to keep my development skills sharp. This is why I jump in to write code that helps my team reach their goals. This can range from everything from GO to Python. So for a personal project I don’t want to program in any of those.

Thus I am going to take a few open source projects and port them to Rust. I am gong document my experience and what works/doesn’t. Now I intent to do something a bit different for this port. I am going to use JNI to ensure the port is 100% compatible with the old one.

Choosing the right project

It goes without saying that there are plenty of open source projects. So here is the criteria I am using for selecting the right project to port.

  • It should be an established project which is not radically changing. This will make maintenance of the port easer over time.
  • It should have existing ports and the community should be open to new ports.
  • It should be large but not so large that it can never be completed.
  • It should be a project I am familiar with.

Based on this criteria I have decided to port the Lucene project. I have worked with Lucene on multiple projects back in the day. It will be good to revisit it and understand better how it works under the hood.

Applying Behavior Driven Development practices to infrastructure. 

Earlier this summer I worked with my team to apply Behavior Driven Development practices to infrastructure that we deployed our products to. Prior to this the DevOps team simply identified toil and implemented solutions. Unfortunately this meant the reasons behind the changes would get lost over time. Having GHERKIN files with your solution means that information would not get lost. Interesting enough we were able to reduce the total size of the source code considerably because some of the user stories were no longer relevant.

I wrote a tutorial based on this experience that can be found here.