How porting Lucene made me care about bit operations…

I am going to be honest… I haven’t touched binary operations since I attended a university assembly class about 20 years ago. So when I came across the writeVInt and readVInt methods from the DataOutput and DataInput base classes, I thought this would be a good opportunity to brush up. I lost a good few days because I did not consider the difference between arithmetic and logical shifts.

Unlike Java, Rust does not have separate operators for arithmetic and logical shifts. In Java >> is arithmetic and >>> is logical. In the Rust documentation, however, there is a footnote I completely glossed over stating:

** Arithmetic right shift on signed integer types, logical right shift on unsigned integer types.

https://doc.rust-lang.org/reference/expressions/operator-expr.html#arithmetic-and-logical-binary-operators
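
A minimal demonstration of what that footnote means in practice:

fn main() {
    let signed: i8 = -128;          // bit pattern 0b1000_0000
    let unsigned: u8 = 0b1000_0000; // same bit pattern, value 128

    // Signed types get an arithmetic shift: the sign bit is copied in.
    assert_eq!(signed >> 1, -64);   // 0b1100_0000
    // Unsigned types get a logical shift: zeros are shifted in.
    assert_eq!(unsigned >> 1, 64);  // 0b0100_0000
}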

So what does this mean in practice? When a signed primitive like i8 is shifted right, the vacated bits are filled with 0s or 1s based on its sign. So -128 (10000000) becomes -64 (11000000). So what is wrong with this logic? The implementation of variable-length quantities in Lucene relies on the leading 0s of the shifted value to determine how many 7-bit bytes to write. For instance, a typical VByte encoding is:

Value    byte 1      byte 2      byte 3
0        00000000
1        00000001
2        00000010
127      01111111
128      10000000    00000001
129      10000001    00000001
130      10000010    00000001
16383    11111111    01111111
16384    10000000    10000000    00000001

So for positive or unsigned numbers this logic transferred over easily. However, a negative number would cause an infinite loop: the arithmetic shift keeps filling the high bits with 1s, so the value never reaches 0. To work around this I flip the value over to unsigned after the first shift, making the remaining shifts logical (see the sketch after the table). So it now looks like this:

loop step    value          binary
1            -2147483648    0b10000000000000000000000000000000
2            16777216       0b00000001000000000000000000000000
3            131072         0b00000000000000100000000000000000
4            1024           0b00000000000000000000010000000000
5            8              0b00000000000000000000000000001000
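
In code, the workaround amounts to doing the loop over the unsigned reinterpretation of the value. Here is a minimal sketch (my names, not the actual port) of a Lucene-style writeVInt loop of 7-bit groups with a continuation bit:

// Casting the i32 to u32 makes every `>>` a logical shift, so the loop
// terminates even for negative values.
fn write_vint(value: i32, out: &mut Vec<u8>) {
    let mut v = value as u32; // reinterpret the bits; nothing is lost
    // While more than 7 significant bits remain, emit a continuation byte.
    while v & !0x7F != 0 {
        out.push(((v & 0x7F) | 0x80) as u8);
        v >>= 7; // logical shift: zeros fill the high bits
    }
    out.push(v as u8);
}

fn main() {
    let mut buf = Vec::new();
    write_vint(i32::MIN, &mut buf); // the value from the table above
    // Five bytes, matching the five loop steps shown above.
    assert_eq!(buf, vec![0x80, 0x80, 0x80, 0x80, 0x08]);
}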

After spending all that time making negative numbers work with variable-length integers, I realized that use case may never be exercised. The documentation for those methods specifically states:

Negative numbers are supported, but should be avoided.

https://lucene.apache.org/core/6_4_0/core/org/apache/lucene/store/DataOutput.html

When I asked the Lucene mailing list about this, I got:

They are fully supported, so you can write and read them.
The problem with negative numbers is that they need a lot of (disk) space, because in two’s complement they have almost all bits set. The largest number in terms of disk space is -1. Negative numbers appear in older index formats, so they can’t be prevented. Just take the comment as given: all is supported, but if you want to store negative numbers use a different encoding, e.g. zigzag.

http://mail-archives.apache.org/mod_mbox/lucene-java-user/202109.mbox/%3cBEFCBB73-0848-4EB6-80E8-249D3D14EE30@thetaphi.de%3e
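
For completeness, zigzag encoding (the alternative the reply suggests) interleaves negative and positive values so that small magnitudes stay small before the VInt step. A quick sketch of the standard transform:

// ZigZag maps 0, -1, 1, -2, 2, … to 0, 1, 2, 3, 4, … so that small
// negative numbers no longer need the maximum number of VInt bytes.
fn zigzag_encode(n: i32) -> u32 {
    ((n << 1) ^ (n >> 31)) as u32
}

fn zigzag_decode(z: u32) -> i32 {
    ((z >> 1) as i32) ^ -((z & 1) as i32)
}

fn main() {
    assert_eq!(zigzag_encode(-1), 1); // -1 as a raw VInt would take 5 bytes
    assert_eq!(zigzag_encode(1), 2);
    assert_eq!(zigzag_decode(zigzag_encode(-64)), -64);
}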

Nevertheless I now know the difference between arithmetic and logical shifts in Rust.

The most carrot pasta I have ever made.

During the year of Covid I started a garden. I’ve had plants in pots before, but this was the first time I had a section of land dedicated to plants. It turned out to be a great educational opportunity for my son. However, now I need to actually do something with what I grew. I was amazed at how much plant there was above the carrot and wanted to make a meal that used it.

Ingredients:

  • 1+ carrots with top. Enough to make 2 cups of greens once the stems are removed.
  • zucchini (optional)
  • 2 cups of baby spinach
  • 3-4 cloves of garlic.
  • 1 cup of roasted unsalted cashews
  • olive oil
  • Salt & pepper
  • 1lb of pasta (penne works well)
  • Parmesan to taste

Instructions

  • Preheat oven to 425 degrees F
  • Separate carrot(s) from top and wash/peel
  • Slice carrots and zucchini (optional) to somewhat equal size
  • Toss carrots with olive oil, salt, and pepper.
  • Place on sheet pan and put in the oven for about 20 minutes
  • Wash and separate the greens from the tough stems. Also remove anything that looks…ugly
  • Put greens, spinach, garlic, roasted unsalted cashews, and 1 cup of olive oil into a blender.
  • Pulse until smooth. Depending on your blender you may want to add the olive oil in parts.
  • Cook pasta until al dente
  • Drain most of the water but not all; reserve roughly half a cup
  • Combine everything and serve with Parmesan

Porting Lucene: Iteration 1 – Decisions about how to start porting…

So, time to make a list of files to port. Let’s see how many Java files there are…

find lucene |grep ".java" |wc -l
5502

Hmm…. OK let’s try focusing on Test cases instead…

find lucene |grep ".java" |grep Test |wc -l
1570

A bit better but not great. Let’s focus on the first thing you need to create an index…

find lucene/core/src/test/org/apache/lucene/store |grep ".java" |grep Test |wc -l
25

Looking good… Oh wait… and then there are dependencies:

package org.apache.lucene.store;

import static org.junit.Assert.*;

import com.carrotsearch.randomizedtesting.RandomizedTest;
import com.carrotsearch.randomizedtesting.Xoroshiro128PlusRandom;
import com.carrotsearch.randomizedtesting.generators.RandomBytes;
import com.carrotsearch.randomizedtesting.generators.RandomNumbers;
import com.carrotsearch.randomizedtesting.generators.RandomPicks;
import com.carrotsearch.randomizedtesting.generators.RandomStrings;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import org.apache.lucene.util.ArrayUtil;
import org.apache.lucene.util.IOUtils.IOConsumer;
import org.junit.Test;

public abstract class BaseDataOutputTestCase<T extends DataOutput> extends RandomizedTest {
  protected abstract T newInstance();

So org.apache.lucene.util is too big to implement at once.

find lucene/core/src/test/org/apache/lucene/util/ |grep ".java" |grep Test |wc -l
     100

So in theory, implementing the rest of core should result in the util package getting fully implemented. Now, Lucene also has its own test framework.

Test Frameworks

So here is the conundrum. The problem with JNI is that you are living in two domains (JVM & system). It is very tempting to use the Java test cases as the source of truth for the Rust code’s behavior. The key problem: when a test goes wrong, is the problem in the Java code, the JNI glue, or the Rust code? The alternative is to keep two sets of books: port the tests to Rust and run both. This essentially doubles the amount of work, but it should save an incredible amount of time when tests fail. The resulting logic should be:

Rust test fails    JUnit test fails    Next steps
Yes                Yes                 Focus on changing the code to make the Rust test pass.
Yes                No                  Update the Rust tests to match the logic of the JUnit test.
No                 Yes                 Ensure the logic in the Rust test matches the JUnit test. If so, focus on the JNI glue code.
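
To make “two sets of books” concrete, here is a hypothetical Rust twin of the kind of assertion a JUnit DataOutput test makes (all names are mine, and the helper reuses the write_vint sketch from the bit-operations post above):

fn write_vint(value: i32, out: &mut Vec<u8>) {
    let mut v = value as u32;
    while v & !0x7F != 0 {
        out.push(((v & 0x7F) | 0x80) as u8);
        v >>= 7;
    }
    out.push(v as u8);
}

#[cfg(test)]
mod tests {
    use super::write_vint;

    // Hypothetical twin of a JUnit check: values up to 127 fit in one byte.
    #[test]
    fn single_byte_values_take_one_byte() {
        for v in 0..=127 {
            let mut buf = Vec::new();
            write_vint(v, &mut buf);
            assert_eq!(buf, vec![v as u8]);
        }
    }
}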

Grill + Veggies + Pasta

Two of our favorite new summer pandemic activities come together in this dish: gardening and grilling. Grilling because I got a new grill and I intend to use it. We also joined our community garden and now have more squash than we can count.

Ingredients

Pasta: Penne works best, but others like rigatoni or elbow macaroni will work too.

Veggies:

  • 1-2 large Zucchini
  • 1-2 large Summer squash
  • 1 lb Asparagus
  • 1 lb Bell peppers
  • Kosher salt & pepper to taste

Sauce:

  • 1 package of goat cheese
  • 1/3 cup olive oil
  • 3 Tbsp balsamic vinegar
  • 2 Tbsp mayonnaise
  • 1/2 Tbsp Dijon mustard
  • 1 clove garlic, minced
  • 1/2 tsp dried basil
  • 1/2 tsp salt
  • Pepper to taste

Steps

  • Warm up the grill to 400-500 degrees F
  • Toss asparagus with olive oil, salt & pepper
  • Cut squash evenly and toss with olive oil, salt, & pepper.
  • Place veggies on grill. You may want to do this in batches. Asparagus and peppers are usually done first; I usually rotate them every 2 minutes. Squash is usually closer to 4 minutes depending on thickness.
  • Cook pasta to slightly al dente. Reserve 1/3-1/2 cup of pasta water and drain the rest.
  • Combine reserved water with the rest of the sauce ingredients minus the goat cheese.
  • Roughly chop veggies.
  • Combine pasta, sauce, & veggies.
  • Crumble in goat cheese and combine.

Porting Lucene: Iteration 1 – Project Setup

So I thought project setup would be the easiest item to complete. It turns out the due diligence was far greater than I expected. Why? In short, there are tools like Gradle or Cargo that have project best practices built in, but they just didn’t quite fit what I needed. Let’s look at those use cases.

Little to no setup required

This is something I feel very strongly about. This project should make contributing absolutely frictionless. You should not need a specific IDE or require changing versions of system libraries.

Don’t introduce yet another language if you don’t have to

Choosing the best tool for a task is important. However there is a certain amount of overhead with each language. Some require additional tooling and may not be widely known. So reusing languages that are used elsewhere in the project is preferable. 

Option #1: Gradle

This should be a no-brainer. Lucene uses Gradle. I’m porting Lucene. I should use Gradle. However, while I am porting Lucene to learn, I want to make this easy for others to adopt, and Gradle brings in either Groovy or Kotlin.

Option #2: Cargo

Cargo is the Rust package manager. Cargo downloads your Rust package’s dependencies, compiles your packages, makes distributable packages, and uploads them to crates.io, the Rust community’s package registry.

OK… that I stole from Cargo’s guide. Cargo is going to be needed for building and managing the Rust components, but we really need a level of orchestration above it.

Option #3: Maven

Maven was for the longest time the workhorse of most Java development. Unlike Gradle, Maven can be configured using XML exclusively. This then becomes a discussion of project management via configuration or via code. Both have their time and place; as a project becomes increasingly complex, code becomes preferable over configuration.

Option #4: Make

There is something to be said for keeping it simple. Make is typically included in every distribution, and shell scripting provides the capabilities I need. While shell scripting is yet another language, it is already pulled in by Docker. This makes the most sense.

Conclusion

A mixture of Options #2, #3, & #4 seems like the best course. Due to the expected complexity of the project, Make seems like the best orchestration option as it has the least amount of dependencies. Maven & Cargo can then be used for the Java & Rust sub-components respectively.

Kubernetes useful tricks: Creating a secret from a file in an image

Recently I had an interesting problem. The product I was working on needed to create a secret from a file inside an image. On one hand this is an easy thing, as you can just have a job run within an image:

kubectl create secret generic mysecret --from-file=./file.txt

Ah… however, the Kubernetes command-line client is only compatible within one minor version of the Kubernetes API server. So if you want to support every major version of Red Hat OpenShift currently supported, you have to support Kubernetes 1.11 to 1.21. To get around this problem you have to curl the kube-apiserver directly. Here is a sample script I created to demonstrate this method.

apiVersion: batch/v1
kind: Job
metadata:
  name: readfiletosecret
  annotations:
    description: "Example of reading a file to a kubernetes secret."
spec:
  template:
    spec:
      serviceAccountName: account-with-secret-create-priv
      volumes:
      - name: local
        emptyDir: {}
      initContainers:
      - name: get-file
        image: registry.access.redhat.com/ubi8/ubi-minimal:latest
        command:
        - "/bin/sh"
        - "-c"
        env:
        - name: UPLOAD_FILE_PATH
          value: "/root/buildinfo/content_manifests/ubi8-minimal-container*.json"
        args:
        - |
          cat $UPLOAD_FILE_PATH
          cp -vf $UPLOAD_FILE_PATH /work/
        volumeMounts:
        - name: local
          mountPath: /work
      containers:
      - name: create-secret
        image: registry.access.redhat.com/ubi8/ubi-minimal:latest
        command:
        - "/bin/sh"
        - "-c"
        args:
        - |
          ls /work/
          # Base64-encode the file; -w 0 disables line wrapping, which
          # would otherwise break the JSON payload below
          export CONTENT=$(cat /work/* | base64 -w 0)
          echo $CONTENT
          #Set auth info
          export SERVICEACCOUNT=/var/run/secrets/kubernetes.io/serviceaccount
          export NAMESPACE=$(cat ${SERVICEACCOUNT}/namespace)
          export TOKEN=$(cat ${SERVICEACCOUNT}/token)
          export CACERT=${SERVICEACCOUNT}/ca.crt
          export APISERVER="https://kubernetes.default.svc"
          # Explore the API with TOKEN
          curl --cacert ${CACERT} --header "Authorization: Bearer ${TOKEN}" -X POST -H 'Accept: application/json' -H 'Content-Type: application/json' -d @-  ${APISERVER}/api/v1/namespaces/$NAMESPACE/secrets <<EOF
          {
            "kind": "Secret",
            "apiVersion": "v1",
            "metadata": {
              "name": "example"
            },
            "data": {
              "file": "$CONTENT"
            }
          }
          EOF
          rm /work/*
        volumeMounts:
        - name: local
          mountPath: /work
      restartPolicy: Never
  backoffLimit: 4

Porting Lucene: Iteration 0

Iteration 0 is often used to create a product backlog or set up the technical foundation (code repos, build pipelines, etc.). I frankly thought the concept was odd. In Agile you work to dates, not scope, so having an iteration that defines scope seemed odd. Thus I have a different take on it.

Iteration 0 is when you “start to know what you don’t know,” and with each subsequent iteration you learn more and the plan changes. So with that said, it is time to prime the product backlog. Let’s start with defining goals.

Goal #1: Port Lucene from Java to Rust

While it seems straightforward, we need to understand what Java parts are in the Lucene project. So let’s start with a…

git clone https://github.com/apache/lucene.git
cd lucene
ls                                            
LICENSE		build.gradle	dev-docs	gradle		gradlew.bat	lucene		versions.lock
README.md	buildSrc	dev-tools	gradlew		help		settings.gradle	versions.props

OK, so right off the bat I can see it is a Gradle project. Gradle supports multiple languages out of the box and has plugins for others like Rust. Here we have our first decision… do we port the Lucene library code or do we port the project? Let’s revisit that after we are done exploring.

The interesting thing is there are a good number of files that are very specific to how Apache runs their projects and performs releases. For instance, I haven’t seen an RDF document in probably 10 years, yet here is a DOAP document defining the project and all its releases per Apache standards. The two items I should focus on are the lucene directory with all the Java files and dev-tools/scripts/smokeTestRelease.py. The latter will be useful for the next goal.

Goal #2: It should pass existing test scripts

So I think this is something that will set the project apart from other ports. Providing a way to reuse existing test scripts will ensure compatibility over time. However, doing so requires the ability to do two things.

Leverage JNI from Rust

It has been a while since I’ve used JNI, so this will be a good refresher. Based on existing documentation it seems this is done all the time for Android, but an experiment is still required to ensure there are not any unforeseen gotchas.
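
As a quick feasibility sketch using the community jni crate (the Java class and method names here are made up), exposing a Rust function to a Java test looks roughly like this:

// Rust side of a JNI call via the `jni` crate.
// Hypothetical Java side: class RustPort { static native int add(int a, int b); }
use jni::objects::JClass;
use jni::sys::jint;
use jni::JNIEnv;

#[no_mangle]
pub extern "system" fn Java_RustPort_add(
    _env: JNIEnv,
    _class: JClass,
    a: jint,
    b: jint,
) -> jint {
    a + b
}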

The ability to sync Tests with the parent project

This is going to be a hard one. Reading through dev-tools/scripts/smokeTestRelease.py, release verification is more than just unit tests. It also checks for digest mismatches, documentation, and missing metadata. Not all of these verifications will be applicable; for instance, verifying the jar “Implementation-Vendor” metadata would not apply. So the valid parts of this script will need to be ported and maintained.

At first glance the majority of the verification is in the form of unit tests. In fact, it looks like roughly 33% of all the Java source files are unit tests. That being said, the Gradle build handles preparing data which may be used in those tests.

find lucene/ |grep ".java" |wc -l
5505
find lucene/ |grep ".java" | grep test | wc -l
1844

So we probably need a mechanism to

  • Pull the latest project from Lucene
  • Clear out the non-test-related Java source files
  • Update the build scripts to leverage an external JAR
  • Run the tests

Of course this means the code for the port needs to be managed separately from Lucene. This is turning into a more complex project than I expected. I am going to need to spend some time understanding a project setup which will allow this. However, it is currently the end of this iteration, so that will have to wait for the next one.

Building a boat in the basement….

It goes without saying that 2020 was a tough year. While 2021 is getting better, we are still not out of the woods yet. Sometimes the best way to get through it is to have a project that isn’t related to work but helps exercise the mind. Now, I am not actually building a boat; my wife would kill me. However, I am starting a project of similar ambition.

While granted I am a manager, I still like to keep my development skills sharp. This is why I jump in to write code that helps my team reach their goals. This can range from Go to Python. So for a personal project I don’t want to program in any of those.

Thus I am going to take a few open source projects and port them to Rust. I am going to document my experience and what works and what doesn’t. Now, I intend to do something a bit different for this port: I am going to use JNI to ensure the port is 100% compatible with the old one.

Choosing the right project

It goes without saying that there are plenty of open source projects. So here is the criteria I am using for selecting the right project to port.

  • It should be an established project which is not radically changing. This will make maintenance of the port easier over time.
  • It should have existing ports and the community should be open to new ports.
  • It should be large but not so large that it can never be completed.
  • It should be a project I am familiar with.

Based on these criteria I have decided to port the Lucene project. I worked with Lucene on multiple projects back in the day. It will be good to revisit it and understand better how it works under the hood.

Applying Behavior Driven Development practices to infrastructure. 

Earlier this summer I worked with my team to apply Behavior Driven Development practices to the infrastructure we deployed our products to. Prior to this, the DevOps team simply identified toil and implemented solutions. Unfortunately this meant the reasons behind the changes would get lost over time. Having Gherkin files alongside your solution means that information does not get lost. Interestingly enough, we were able to reduce the total size of the source code considerably because some of the user stories were no longer relevant.

I wrote a tutorial based on this experience that can be found here.

Embracing Behavior Driven Development

Many years ago I worked on a project which became Rational Team Concert 1.0. The ability (via OSLC) to link all of the development assets together made life easier: I could easily click from requirements to test results. Today I spend the majority of my days in GitHub, which doesn’t have the same type of linkage. While linkage made my life easier, it did not mean the assets were in sync, which caused greater overhead. Recently I adopted BDD (Behavior Driven Development) and found myself using it for… everything.

Frankly, it just makes sense to use it for everything from JavaScript applications to infrastructure Ansible playbooks. All of your requirements sit in one place with your code, and it encourages better requirements. It sounds too good to be true, and unfortunately it can be a hard sell to others, especially since the main advertised use case for BDD is to help the business owner/requirements author, who doesn’t always have a strong presence on smaller projects.

I recall a few projects where I spent the majority of my time calling myself an architect while converting business requirements into development requirements & test cases. Frankly, it was like playing a game of telephone. In software development the best way to ensure requirements are met is to have fewer middlemen.

I have learned the hard way that documenting requirements is important, even if you think it is for disposable code. On one hand it forces you to think about what you are going to write, so you spend less time rewriting your code. On the other hand, projects have a habit of lasting far longer than they should. Your future self will thank you for documenting.

Better requirements

I started my IBM career in the Rational acquisition back in 2003: home of requirements, governance, testing, and traceability software. I have an entire book on gathering and writing requirements that I quote from more often than I should. Nevertheless, a good project manager, architect, designer, or anyone else in a requirements-gathering role is not always available for projects. So a simple language/framework like Gherkin that anyone can use is far better than nothing.

While I was a teaching assistant for the introduction to computer science class at Clark University, I taught students to outline preconditions and postconditions for each method before writing a line of code. Gherkin is essentially the same thing with Given, When, and Then: “Given” is your precondition, “When” is your action, and “Then” is your postcondition. You write them for each scenario of each feature.

Features

BDD documentation is different from other project-related documentation. It isn’t a substitute for a decisions document or design thinking outputs; those are all point-in-time documents. A BDD feature is a living document which outlines the current expectations for a solution’s specific feature.

Think about how a typical development project is managed. You have an agile story or change request for the solution to implement. Then over time you have additional stories or change requests that change that behavior. An archeological dig through documents, development assets, and meeting notes is required to grasp the current behavior.

The basic schema of a feature document is as follows:

Feature: <feature name>
    <Feature Description>

    Background:
        Given <precondition>
        And <precondition>

    Scenario: <scenario name>
        Given <precondition>
        When <action>
        Then <postcondition>

Now of course it can get far more complicated, but that is the basic gist. It is human readable and can be used to describe the features of a solution, component, or system role.

The Glue

More documentation is all good, but it isn’t code. Text only has impact if it can pass/fail code. That is where step code comes in. Depending on which language you are using, step code will look slightly different, but it will look something like this:

@given('text')
def setup_scenario_x(test_context):
    ...

Each step is a method matched to the feature document’s text, an action to perform, and a variable scoped to the test. Yes, this is essentially a form of unit test at the end of the day, but it provides very different insight.

End of the day

Until recently I was a born-again test-driven developer. I would translate my requirements into an architecture decision document, then into component specifications, then into tests, and lastly write my code. Over time this process proved less and less agile; constant change made it inflexible. The majority of my tests were written to ensure my code addressed null pointer exceptions and reached 100% coverage. While important, what is critical for a minimum viable product is just enough code to meet the business requirements.

For more information about BDD and a great framework to get you started, go to the Cucumber project.