devlog 2024-05-04, bit-representations for graph vertices

Lots of fun stuff today! It was rainy for a lot of the morning after my regatta, so I managed to cozy up and do some programming without getting too much sunshine-FOMO.

In the evening I did actual comind work, and the first part of the day was mostly fucking around with bit-representations for graph databases. I'll do the comind stuff first and then a brief recap of the graph database work.

comind work

I'm adding an endpoint to the server called examine. When you examine a thought, it is a way focusing your attention on a particular piece of content to uncover more information about it.Examinations are essentially detail views of thoughts.

There's a few ways I'm imagining this working. Examinations should show you

Directly linked thoughts, i.e. parents, children, replies, etc.
A brief description of the surrouding context, i.e. "your friends saw this and have been discussing why you would say something so stupid, in a loving way".
Suggested thoughts that are not linked but seem to be related.

I have (1) done, (3) is easy enough but not done, and (2) is where I spent most of the evening.

It turns out that contextualizing a "web" of thoughts for language models is not super easy, at least not from a prompt engineering side. I've settled for beginning to specify a text format for describing general context and relationships between thoughts.

When you examine a thought, I go and pull the "neighborhood" of thoughts surrouding your primary thought within some nmber of hops. For example, if I pulled in all thoughts within two hops of A and was looking at a link chain A -> B -> C -> D, I would pull in A, B, and C.

Then, I need to dump the content of each of those thoughts into a query to provide to the language model. The way this is starting to form is to provide the language model a new comind-specific markup language that feels very 1970.

Thoughts are wrapped in blocks that include some variables describing interrelationships and annotating specific content.

Here is the block for the primary thought. When you examine a thought, you are making that thought "primary".

Here's something I might write, and how it is rendered in the context prompt:

<BEGIN PRIMARY THOUGHT BY=cameron>
here's some notes i wrote about language models
<END PRIMARY THOUGHT>

Within the same prompt, you can optionally include context thoughts that the language model is instructed to review only if relevant to the primary thought. If the thought is in the neighborhood, a HOPS=[number] field is included to suggest to the model that it is more likely to be relevant.

<BEGIN THOUGHT HOPS=1 BY=cameron>
language models tend to hallucinate, so it's important to provide context that is relevant to the primary thought.
<END THOUGHT>

Your context window would basically be as full as possible of related thoughts, and then I leave it up to the model to determine what is most relevant for the answer. To determine what I actually want from the model, I found it to help by adding a REQUEST block:

<BEGIN REQUEST>
can you please tell me what i know about language models?
<END REQUEST>

This whole thing is vaguely hacky, but ideally I could specify a fairly robust markup language for describing context and relationships between thoughts, and then fine-tune models to understand the syntax. More on it later.

Anyway, the endpoint is kinda done but I broke the fuck out of the embedding system so all of comind is broken! Living the dream.

graph database work

I also fucked around with graph databases. Today's thing was to see how binary representations of entities, relationships, and properties work.

To rephrase – how the fuck do you actually store a node on disk in an efficient way? Neo4j stores nodes as fixed-length 15 byte chunks:

The first byte checks whether the record is in use or can be re-used. Apparently this is called a "tombstone", thanks to my brother Q for the tip.
The next four bytes contain the ID of the first relationship connected to the node.
The following four contain the ID of the first property connected to the node.
Five store labels references (i.e. Person, or Car), but they can be inlined apparently if there's a small number.
Final byte reserved for some future use.

I wanted to play around with this, so I wrote some code. My graph specification only has nodes for now, but I've made a few alternative choices:

A tombstone byte.
An N-byte ID for the node.
An N-byte ID for the relationship splay tree.
An N-byte ID for the property splay tree.
An N-byte ID for the location of the node's embedding vector, or a pointer to the splay tree if different embedding models are used.

IDs 3-5 are "pointers" in that they refer to positions in storage files containing another fixed-length record. More on the splay trees later, haven't figured them out yet.

N here is probably 128, which is big, but I'm still trying to figure out how you support ordered records that are compact and easy to handle in a distributed format. I settled on ULIDs, which are 128-bit and hardened to prevent collisions across distributed systems.

Here's how you specify the format of the storage system.

ID_TYPE is the type of the ID – how many bytes it takes to represent the ID. Here it's a 2-byte ID.
ID_SIZE_BYTES is the number of bytes in the ID.
NODE_RECORD_SIZE is the number of bytes in the node record – four IDs and a tombstone byte.
BLOCK_SIZE is the number of records per partition file. A partition file is just a file on disk with the format node.[partition].dat.

const ID_TYPE = UInt16
const ID_SIZE_BYTES = sizeof(ID_TYPE)
const NODE_RECORD_SIZE = 1 + ID_SIZE_BYTES * 4
const BLOCK_SIZE = 2 # Records per partition file

In a simpler version of this, say where we only have 1-bit IDs and a block size of 2, then the binary representation of each partition file looks like

10000 # First record, tombstone is set
00000 # Second record, tombstone is not set

and so on.

Anyway. Now we need to do some byte representation stuff. Most of the shit on your machine is bytes and thus so is the code, and working with raw bits and casting up to UInt128 or whatever all the time sucks. So I made ByteArray, which just wraps around a Vector{UInt8} and lets you treat it like a standard array but supports offset indexing.

struct ByteArray
    bits::Vector{UInt8}
end

# Standard interface implementations
Base.length(b::ByteArray) = length(b.bits) * 8
Base.firstindex(b::ByteArray) = 1
Base.lastindex(b::ByteArray) = length(b.bits) * 8
Base.getindex(b::ByteArray, i::Int) = b.bits[8i-1:8i]
Base.setindex(b::ByteArray, v::Vector{UInt8}, i::Int) = b.bits[8i-1:8i] = v

We now need to be able to convert an array of bytes to whatever the root ID type is, and vice versa. For example, if I have a UInt16 ID type, then I need to be able to convert to UInt8 values into a UInt16 ID.

id2bytes(x) = reinterpret(UInt8, [x])
bytes2id(x) = reinterpret(ID_TYPE, x) |> only # TODO not clear on why this returns a single vector

The ByteVector representation of an ID is just a vector of bytes, like so:

2-element Vector{UInt8}:
 0x41
 0x3b

but converts to the UInt16 ID type when you use bytes2id:

0x3b41

Let's bundle these up into a node record, which stores all the different IDs it tracks, as well as some constructors.

struct NodeRecord{BV}
    tombstone::UInt8
    id::BV
    relationships::BV
    properties::BV
    vector::BV
end

function NodeRecord(tombstone::Bool, id, relationships, properties, vector)
    NodeRecord(tombstone, id2bytes(id), id2bytes(relationships), id2bytes(properties), id2bytes(vector))
end

function NodeRecord(bytes::Vector{UInt8})
    tombstone = bytes[1]
    id = bytes[2]
    relationships = bytes[3]
    properties = bytes[4]
    vector = bytes[5]
    NodeRecord(tombstone, id, relationships, properties, vector)
end

We know which partition file the record belongs to by it's ID. Only BLOCK_SIZE records are stored in a partition file, and the IDs of the records in the file are sequential. If we have IDs 1,2,3, and 4 and a block size of 2, then 1 and 2 would be in partition 1, 3 and 4 would be in partition 2.

whichpartition is a function that takes an ID and returns the partition file it belongs to.

"""
    whichpartition(id)

Determine which partition file a given ID belongs to. Partitions are of size BLOCK_SIZE nodes.
An ID of 1 through BLOCK_SIZE belongs to partition 1, BLOCK_SIZE+1 through 2*BLOCK_SIZE belongs to partition 2, etc.
"""
function whichpartition(byteid)
    return div(bytes2id(byteid) - 1, BLOCK_SIZE) + 1
end

# Also get the name of the file we'll write/read to.
datafile(n::NodeRecord) = "node.$(whichpartition(n.id)).dat"

We also know that, within a binary file, we can skip ahead to where the record is if we know its ID. recordstart is a function that takes an ID and returns the start of the record within a partition file. In the above example, if I have a record with ID 3, then the record starts at the 6th byte of the file if records are five-bytes long. The whole file is 12 bytes.

"""
    recordstart(id)

Determine the start of a record within a partition file. Records are of size NODE_RECORD_SIZE,
so a record starts at a multiple of NODE_RECORD_SIZE. 
"""
function recordstart(byteid)
    return mod(bytes2id(byteid) - 1, BLOCK_SIZE) * NODE_RECORD_SIZE
end

Next we want to convert a NodeRecord to a byte array so we can write it to disk.

"""
    bytes(id)

Node storage contains a leading byte indicating whether the node is active.
It is followed by the 128-bit node ID.
Then are 128-bit IDs for each of the relationship, properties, and vector files.

`bytes` converts a `NodeRecord` to a byte array.
"""
function bytes(n::NodeRecord)
    return vcat(
        UInt8(n.tombstone),
        n.id,
        n.relationships,
        n.properties,
        n.vector
    )
end

This gets used here, where we write a NodeRecord to the partition file it belongs to:

function save(n::NodeRecord)
    # Get the partition file
    partition = whichpartition(n.id)
    filename = datafile(n)

    # If the partition file does not exist, create it. Fill the whole BLOCK_SIZE
    # with zeros to make the file size a multiple of BLOCK_SIZE.
    if !isfile(filename)
        open(filename, "w") do f
            write(f, zeros(UInt8, NODE_RECORD_SIZE * BLOCK_SIZE))
        end
    else
        # Check that the partition file is of the correct size
        if filesize(filename) != NODE_RECORD_SIZE * BLOCK_SIZE
            @error "Partition file is not the correct size" filename filesize(filename) NODE_RECORD_SIZE * BLOCK_SIZE
        end
    end

    # Write the record
    open(filename, append=true) do f
        seek(f, recordstart(n.id))
        write(f, bytes(n))
    end
end

To unpack this briefly:

Get the partition file to write to.
If the partition file does not exist, create it. Fill the whole BLOCK_SIZE with zeros to make the file size a multiple of BLOCK_SIZE.
Write the record to the partition file at its location (we go to the record start spot using seek).

Then, lastly, we can load the record back using

"""
    node_record(id)

Load a `NodeRecord` from the file system.
"""
function node_record(id)
    partition = whichpartition(id)
    filename = "node.$partition.dat"
    open(filename, "r") do f
        seek(f, recordstart(id))
        read(f, NODE_RECORD_SIZE)
    end
end

Voila!

Now, the following works:

julia> id1 = rand(UInt8, ID_SIZE_BYTES)
2-element Vector{UInt8}:
 0x63
 0xad

julia> id2 = rand(UInt8, ID_SIZE_BYTES)
2-element Vector{UInt8}:
 0x7a
 0x67

julia> id3 = rand(UInt8, ID_SIZE_BYTES)
2-element Vector{UInt8}:
 0x9f
 0xd2

julia> id4 = rand(UInt8, ID_SIZE_BYTES)
2-element Vector{UInt8}:
 0xf0
 0xab

julia> record = NodeRecord(UInt8(1), id1, id2, id3, id4)
NodeRecord{Vector{UInt8}}(0x01, UInt8[0x63, 0xad], UInt8[0x7a, 0x67], UInt8[0x9f, 0xd2], UInt8[0xf0, 0xab])

julia> record_bytes = bytes(record)
9-element Vector{UInt8}:
 0x01
 0x63
 0xad
 0x7a
 0x67
 0x9f
 0xd2
 0xf0
 0xab

julia> whichpartition(id1)
22194

julia> recordstart(id1)
0

julia> save(record)
9

julia> @test node_record(id1) == record_bytes
Test Passed

Cool. Nailed it. More to do here, excited to get into some more interesting relationship/property stuff!

– Cameron