Building ANN-OY: a small vector index in C
Building ANN-OY: a small vector index in C
Every time I touch a vector DB I get the same feeling: half of this is just a tree of floats, why is it a whole product? So I decided to write my own one, in C, from scratch, to actually understand what is going on under the hood.
This code is actually quite old, I had written it around one year ago, but I recently reopened it to integrate sqlite metadata management and this made me remember how cool this algorithm was so I decided to write about it in my blog.
The result lives here. Its a tiny reimplementation of Spotify’s ANNOY (Approximate Nearest Neighbors Oh Yeah) algorithm, with a SQLite metadata layer glued on the side and a small CLI to drive it.
Why ANN-OY ?
There is a lot of ANN algorithms around (HNSW, IVF, PQ, ScaNN, ecc.) but I picked ANNOY mostly for 3 reasons:
- Its based on random projections and binary trees, so its conceptually very close to a normal KD-tree.
- Erik Bernhardsson, the author, has a really nice blog post describing it that you can basically follow line by line.
- It has a cool name, how many algorithms have OH Yeah in their name?
The idea is so simple yet sooo elegant: split the space recursively by random hyperplanes until each cell has few enough points, then at query time you go down to the cell where your query lives and search only inside it (plus a bit around). It’s just a tree.
The tree
The core data structure is a binary tree where:
- Leaf nodes hold up to
Kvectors (I useK = 16). - Internal nodes dont hold any vector at all, they only hold the normal of the hyperplane that splits their region and the indexed median used as the decision value.
typedef struct Node {
VectorPair **data; // list of VectorPair, only set on leafs
int size;
struct Node *left;
struct Node *right;
struct Node *parent;
float *normal; // normal vector of the splitting hyperplane
float indexedMedian; // decision value
} Node;
Every vector is wrapped in a VectorPair so we can carry a UUID around with it, the UUID will then later be used as a key in the SQLite metadata table.
The split itself is pretty straightforward. You pick two random points $a$ and $b$ from the current node, you take their difference as the normal vector
\[n = a - b\]then for every point $v_i$ in the node you compute the magnitude of its projection on $n$
\[p_i = \left\lVert \frac{n \cdot v_i}{n \cdot n} \, n \right\rVert\]sort them, take the median, and use that as decision value. Everything with projection below the median goes left, everything above goes right. If both halves still have more than K points, you recurse.
In code it looks like this (cleaned up a bit):
idxA = rand() % size;
idxB = rand() % size;
if (idxA == idxB) {
idxB = rand() % size; // if we get 2 times the same values o zmn sictik
}
a = data[idxA]->value;
b = data[idxB]->value;
normal = determineNormal(a, b, dataSize);
for (int i = 0; i < size; i++) {
projections[i] = computeProjection(normal, data[i]->value, dataSize);
}
median = calculateMedian(projections, size);
Adding a vector
Adding a vector is just walking down the tree like a normal BST:
if (self->data == NULL) {
// Internal node — go left or right
projection = computeProjection(self->normal, vector, dataSize);
if (projection > self->indexedMedian) {
addVector(self->right, vector, dataSize);
} else {
addVector(self->left, vector, dataSize);
}
} else {
// Leaf — append, and split if we overflow
...
if (self->size > K) {
addSplit(self, dataSize);
}
}
The only subtle part is when a leaf gets too big and we have to split it. Theres a small addVectorPair variant too, that takes an already created VectorPair instead of building a new one. I needed this so that the caller (the CLI mostly) can know the UUID before the vector is even inserted, otherwise you cant write the metadata row.
Searching
Searching is the part where ANN gets interesting. Naively, you would go down to the leaf where your query lands and return whats in there. But if you have K = 16 and you ask for topK = 50, theres simply not enough vectors in a single leaf.
The trick is to go down to your leaf, and then walk up a few levels and collect all the leafs hanging below the ancestor you reached. The deeper you go up, the more candidates you collect, but also the slower the search.
nodesToFind = (int) sqrt((topK * 2 / (float) K) * 1.25f) + 2;
// Here we define how many parent nodes to go up to be able to
// find topK vectors. Its a bit excessive ik.
The 1.25f is a slack factor to make sure we collect more than topK candidates, so that after re-ranking by cosine similarity we still have a good chance of finding the actual nearest neighbors. The +2 is just there to be excessive, but I prefer to overcollect a bit and then sort, rather then undercollect and miss a neighbor.
Once we have all the candidates we just compute cosine similarity against the query, sort by score, and return the top K:
for (int i = 0; i < (*size); i++) {
similarity[i] = cosineSimilarity(vectors[i]->value, vector, dataSize);
}
...
qsort(arr, *(size), sizeof *arr, compPairs);
The whole search is single threaded, which is fine for a learning project i guess.
Saving the tree to disk
A vector index that you have to rebuild every time you restart the process is pretty much useless, so I added a small binary serialization layer. The format is recursive and the simplest thing I could come up with:
- For every node we write a
markerint:0for leaf,1for internal. - For a leaf, we then write the number of vectors, and for each vector we dump the UUID (37 bytes) and the floats (
dataSize * sizeof(float)). - For an internal node, we write the median (1 float), the normal vector (
dataSizefloats), and then recursively write left and right child.
So basically we write the tree in a bin file!!
void serializeTree(Node *root, FILE *f, int dataSize) {
if (!root) return;
if (root->data) {
int marker = 0;
fwrite(&marker, sizeof(int), 1, f);
fwrite(&root->size, sizeof(int), 1, f);
for (int i = 0; i < root->size; i++) {
fwrite(root->data[i]->uuid, sizeof(char), LEN_UUID, f);
fwrite(root->data[i]->value, sizeof(float), dataSize, f);
}
} else {
int marker = 1;
fwrite(&marker, sizeof(int), 1, f);
fwrite(&root->indexedMedian, sizeof(float), 1, f);
fwrite(root->normal, sizeof(float), dataSize, f);
serializeTree(root->left, f, dataSize);
serializeTree(root->right, f, dataSize);
}
}
Deserializing is the symmetric operation, reading the marker first and then either reconstructing a leaf or an internal node and recursing. The only ANNOYing part is rewiring the parent pointers, since they are not in the file.
Storing the metadata
A vector by itself isnt very useful you almost always want to attach something to it, a document, a URL, a chunk of text, ecc. ANNOY itself only stores vectors, so the convention is to keep an external map from id -> metadata somewhere else.
I went with SQLite, because its embeddable, its single-file, and it has the nice property that I can run real SQL queries on the metadata without writing my own parser. Two small tables are enough:
CREATE TABLE collection_info (
key TEXT PRIMARY KEY,
value TEXT NOT NULL
);
CREATE TABLE vectors (
uuid TEXT PRIMARY KEY,
metadata TEXT
);
The collection_info table is a key/value bag that for now only contains the collection name and the vector dimensionality, but its a good place to put more stuff later on (ie. which distance metric was used, when the index was built, …)
The vectors table is what does the heavy lifting: every vector in the tree has a row here keyed on the same UUID that lives in the .bin file. The metadata column is just a TEXT, I keep it as a JSON string for simplicity.
The CLI
The whole thing is driven by a single CLI with five subcommands:
--create <collection> <vector_file> <metadata_file> -d <dim>
--search <collection> <vector_file> -k <topK>
--add <collection> <vector_file> -m <metadata_file>
--remove <collection> <uuid>
--update <collection> <uuid> <metadata_json>
A collection is just a pair of files: <name>.bin for the tree and <name>.db for the metadata. You build the index once with --create, then you can add new vectors with --add, delete metadata with --remove, update it with --update, and run searches with --search. Its very basic but its enough to test the whole pipeline end to end.
The vector file is plain text, one vector per line, floats separated by spaces. The metadata file is one JSON string per line, line i in the metadata file corresponds to vector i in the vector file. Its primitive but it makes scripting really easy from Python without having to bind anything.
Whats missing
Honestly a lot. The current TODO list is more or less:
- A proper Python C-extension wrapper so I dont have to shell out to the CLI from notebooks (this is the next thing I’ll do if I continue with this proj).
- Other distance metrics.
- Some form of ACID compliancy on the index file.
Also the code is, lets say, learning project quality. So lots of stuff.
Warning and final note
If you want to actually use a vector DB in production please use FAISS, hnswlib or my fav MILVUS they will be 100x faster and 100x more correct than this.
My repo: here.
Enjoy Reading This Article?
Here are some more articles you might like to read next: