devlog 2024-04-22, the case for a julia graph database

One of the nice things about comind is that it is a platform where I get to experiment with random ideas. Comind is still a passion project, so I kind of get to do whatever interests me.

This evening, I did a fun little dive into what a graph database might look like in Julia. Comind currently uses Postgres, but it's not really the easiest to use. Maybe there's a cool Julia database to be made!

A few great graph database products exist already (like Neo4j and Amazon Neptune, along with graph query languages like Gremlin), but I think there's room for something in the Julia ecosystem, and in particular for one built for Comind specifically.

Graph databases are a type of database that stores information in a particular way – as nodes and edges. Nodes are "entities", like people, companies, information, etc. Edges are the relationships between these entities: Alice and Bob know each other, so we have an edge from Alice to Bob.

Graph databases work really well when you need to traverse relationships between nodes. For example, if you want to see who Bob could introduce Alice to, you can traverse Alice -> Bob and see who is connected to Bob. In a standard relational database like Postgres or any of the other members of the innumerable hordes of RDBMS, you would have to do something like a JOIN to get this information. Joins can be expensive (but not always!). Graph databases make doing lots of "hops" between nodes very fast, whereas the same traversal might require many joins in a relational database.
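Here's a minimal sketch of the Alice and Bob example, in Python just because it's compact. The names, the edge list, and the `introductions` helper are all made up for illustration; a real graph database would store and index this very differently.

```python
from collections import defaultdict

# Hypothetical toy data: undirected "knows" edges between people.
edges = [("Alice", "Bob"), ("Bob", "Carol"), ("Bob", "Dave"), ("Carol", "Erin")]

# Adjacency list: each node maps to the set of nodes it shares an edge with.
graph = defaultdict(set)
for a, b in edges:
    graph[a].add(b)
    graph[b].add(a)

def introductions(graph, person, via):
    """People that `via` knows and `person` does not know yet (a two-hop lookup)."""
    if via not in graph[person]:
        return set()  # `person` has no edge to `via`, so nobody to ask
    return graph[via] - graph[person] - {person}

print(sorted(introductions(graph, "Alice", "Bob")))  # ['Carol', 'Dave']
```

Each hop is just a lookup on the node you're already standing on, which is roughly why multi-hop traversals are cheap in this representation.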

Comind could use something like a graph database. Basically everything in comind is a "thought", which is some kind of post or content from a user. We're going to have a lot of these (with any luck).

Thoughts are linked to one another by users. Each time you add a new thought, add a thought to your meld, or someone else adds a thought to a meld you are in, you are linking two thoughts together. It is the core, fundamental action in Comind to link thoughts.

The second part of this that makes a graph database useful is that all of comind is intended to be accessed regularly, easily, and safely (in a privacy sense) by generative AI models.

A graph database could help manage "knowledge" available to language models when and where they need to have information. The way that people currently provide context to language models is by using retrieval-augmented generation (RAG) produced with semantic search.

The process for this is:

The user asks a query, "what color is the sky?"
We get the embedding for that query, some vector Q.
We then take the cosine similarity of Q with all the embeddings in our graph database. This gives us a score from -1 to 1, with 1 meaning "very similar to this" and scores near 0 or below meaning "not at all similar to this".
We return these results and pack the most related ones into the language model context window.
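The steps above can be sketched as a brute-force search. This is a toy version: the 3-dimensional embeddings are invented by hand (real ones come from an embedding model and have hundreds or thousands of dimensions), and `top_k` is a hypothetical helper name.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v, in the range [-1, 1]."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings standing in for the output of a real embedding model.
documents = {
    "the sky is blue": [0.9, 0.1, 0.0],
    "grass is green": [0.1, 0.9, 0.0],
    "julia is a programming language": [0.0, 0.1, 0.9],
}

def top_k(query_embedding, documents, k=2):
    """Score every document against the query and keep the k best."""
    scored = [
        (cosine_similarity(query_embedding, embedding), text)
        for text, embedding in documents.items()
    ]
    scored.sort(reverse=True)
    return [text for _, text in scored[:k]]

# Pretend this is the embedding Q of "what color is the sky?"
Q = [0.8, 0.2, 0.1]
context = top_k(Q, documents, k=2)
print(context)  # ['the sky is blue', 'grass is green']
```

Note that `top_k` touches every single document for every query, which is the cost problem.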

Note that this is super expensive. For each new query Q, we have to compute the cosine similarity of Q with all the embeddings in our graph database. This is a lot of computations for each new query. If you have billions of records, this is an ass-load of compute.

Fortunately, we have lots of vector databases for this kind of thing. Pinecone, arguably the loudest vector DB at the moment, does what is called "approximate nearest neighbors" search. This involves "clumping" similar embedding vectors together. Then, when you have a new query, you only search over something like the "average" of that clump. This is a lot more efficient than a full search through your entire database. It's also a simplification, but you get the gist.
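To make the "clumping" idea concrete, here's a rough sketch in the spirit of a cluster-based index. I'm not claiming this is how Pinecone or any particular vector DB works internally. The clumps are assigned by hand (a real index would learn them, e.g. with k-means), and all the names and vectors are invented.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def centroid(vectors):
    """Component-wise mean of a list of vectors: the 'average' of a clump."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

# Hypothetical pre-computed clumps of similar embeddings.
clumps = {
    "weather": {"the sky is blue": [0.9, 0.1, 0.0], "clouds are white": [0.8, 0.2, 0.1]},
    "plants": {"grass is green": [0.1, 0.9, 0.0], "roses are red": [0.2, 0.8, 0.1]},
}
centroids = {name: centroid(list(members.values())) for name, members in clumps.items()}

def approximate_search(query, k=1):
    # Step 1: compare the query against one centroid per clump (cheap).
    nearest = max(centroids, key=lambda name: cosine_similarity(query, centroids[name]))
    # Step 2: do the exact search, but only inside the winning clump.
    members = clumps[nearest]
    ranked = sorted(members, key=lambda t: cosine_similarity(query, members[t]), reverse=True)
    return nearest, ranked[:k]

print(approximate_search([0.85, 0.1, 0.05]))  # ('weather', ['the sky is blue'])
```

The "approximate" part is that a good match sitting in a different clump never gets looked at.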

I think we can do better, at least for Comind's use case. Let's take a simple example, where we want to show the user a "start page" thought. This is kind of a vague idea I've had where I show you some LLM commentary about what you've been thinking about recently.

The user logs on.
We retrieve the user's n most recent thoughts, as well as every thought within k "hops" of those thoughts (i.e. for A -> B -> C, C is two hops from A).
We perform semantic search for RAG on only the neighboring thoughts.
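A minimal sketch of those steps, assuming thoughts live in a plain adjacency dict and every thought already has an embedding. The thought IDs, links, 2-d embeddings, and function names are all hypothetical, and this is a breadth-first search followed by brute-force scoring, not Comind's actual implementation.

```python
import math
from collections import deque

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical thought graph: t1 - t2 - t3 - t4, with t5 unlinked.
links = {"t1": ["t2"], "t2": ["t1", "t3"], "t3": ["t2", "t4"], "t4": ["t3"], "t5": []}
embeddings = {
    "t1": [1.0, 0.0], "t2": [0.6, 0.8], "t3": [0.0, 1.0],
    "t4": [0.8, 0.6], "t5": [0.1, 0.99],
}

def within_hops(links, seeds, max_hops):
    """Breadth-first search: the seeds plus everything at most max_hops away."""
    seen = set(seeds)
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand past the hop limit
        for neighbor in links.get(node, []):
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

def neighborhood_search(query, recent, max_hops, k, links, embeddings):
    """Semantic search restricted to the neighborhood of the recent thoughts."""
    candidates = within_hops(links, recent, max_hops)
    ranked = sorted(candidates, key=lambda t: cosine_similarity(query, embeddings[t]), reverse=True)
    return ranked[:k]

print(neighborhood_search([0.0, 1.0], ["t1"], 2, 2, links, embeddings))  # ['t3', 't2']
```

The similarity work now scales with the size of the neighborhood instead of the whole database. The tradeoff is visible here too: t5 is a near-perfect match for the query but is never scored, because nothing links it to the user's recent thoughts.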

Boom. That's fast as fuck, at least way faster than a full search through your entire database, and probably more precise and relevant than approximate nearest neighbor.

So, Julia. I use Julia a lot. It is my favorite language. All the backend for Comind is in Julia. Why not a Julia database, and why not one custom-rolled for Comind's link-based embedding search?

Plus, adding various plugins to the database to do graph neural networks using GraphNeuralNetworks.jl would be a breeze, since the entire ecosystem is consolidated around Graphs.jl. We could keep everything self-contained and extensible in the way only Julia can be.

And for parallelism, stuff like Dagger.jl could be extremely cool for the storage engine/query engine.

Importantly, the database could also be completely GPU-enabled relatively easily! The whole JuliaGPU ecosystem is incredibly good. I haven't thought too hard about that but man would it be cool as fuck.

Anyway – fun to speculate. Did a little tinkering but mostly reading tonight.

– Cameron

the work i actually did

Lots of UI overhauls to the web interface
Debugging some dumb fucking stuff

References

https://dgraph.io/blog/post/why-google-needed-graph-serving-system/