Two places where SIMD matters in a graph+vector database, and one place where it doesn't.
SIMD (Single Instruction, Multiple Data) applies the same operation to multiple values simultaneously. On modern x86 CPUs, a 256-bit AVX2 register holds eight f32 values (or four u64), so one vector instruction does the work of eight scalar ones; 128-bit SSE registers hold half as many.
The key requirement: data must be contiguous in memory. SIMD cannot help with pointer-chasing through B-trees or linked lists — the CPU stalls waiting on cache misses, not arithmetic.
During HNSW search, every candidate node requires a distance computation against the query vector. For a 1536-dimensional embedding, that is 1536 multiply-add operations per comparison. This is the single hottest loop in vector search.
Loka uses explicit SIMD intrinsics for three distance functions:
| Function | AVX2 (per instruction) | SSE (per instruction) | Use |
|---|---|---|---|
| Dot product | 8 f32 (FMA) | 4 f32 | Cosine similarity (pre-normalized vectors) |
| Squared Euclidean | 8 f32 (FMA) | 4 f32 | Euclidean distance (avoids sqrt; ordering is preserved) |
| L2 norm | 8 f32 (FMA) | 4 f32 | Vector normalization at insert time |
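To make the first row concrete, here is a minimal sketch of an AVX2+FMA dot-product kernel. This is illustrative, not Loka's actual code: the name dot_avx2, the multiple-of-8 length assumption, and the store-based reduction are all simplifications.

```rust
use std::arch::x86_64::*;

/// Illustrative AVX2+FMA dot product: 8 f32 lanes per instruction.
/// Assumes a.len() == b.len() and a length that is a multiple of 8;
/// production code would process the remainder with a scalar tail.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..a.len()).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i)); // load 8 f32 from a
        let vb = _mm256_loadu_ps(b.as_ptr().add(i)); // load 8 f32 from b
        acc = _mm256_fmadd_ps(va, vb, acc);          // acc += va * vb, fused
    }
    // Horizontal reduction of the 8 partial sums.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```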
Runtime feature detection dispatches to the best available path, with a scalar fallback on non-x86 architectures.
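The dispatch itself can be as simple as the following sketch, which reuses the dot_avx2 kernel above. In a real crate the kernel would also be cfg-gated to x86_64, and the selected path would typically be cached in a function pointer rather than re-checked per call.

```rust
/// Illustrative runtime dispatch: choose the widest kernel the CPU supports.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // Safe to call: we just verified the required CPU features.
            return unsafe { dot_avx2(a, b) };
        }
    }
    // Scalar fallback: the only path on non-x86 architectures.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```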
When a SPARQL query hits a pseudo-table, the executor scans columnar arrays of u64 TermIds. Loka stores these as packed dense arrays (null values use a sentinel) and scans them with SIMD:
| Operation | AVX2 (per instruction) | SSE2 (per instruction) | Example pattern |
|---|---|---|---|
| Equality scan | 4 u64 | 2 u64 | ?x :city :Tokyo |
| Not-null scan | 4 u64 | 2 u64 | ?x :email ?e (property exists) |
| Range scan | cache-friendly scalar | cache-friendly scalar | FILTER(?age > 25) |
The packed layout (dense u64 array instead of Option<u64>) eliminates 8 bytes of overhead per element and ensures data is contiguous for SIMD loads.
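As an illustration of what an equality scan over such a packed column could look like (the scan_eq name, the u32 row indices, and the multiple-of-4 length assumption are hypothetical, not Loka's API):

```rust
use std::arch::x86_64::*;

/// Illustrative AVX2 equality scan over a packed TermId column:
/// compares 4 u64 lanes per instruction. Assumes the NULL sentinel
/// never equals a real TermId, and a length that is a multiple of 4
/// (a real scan would handle the tail with scalar code).
#[target_feature(enable = "avx2")]
unsafe fn scan_eq(column: &[u64], needle: u64, out: &mut Vec<u32>) {
    let target = _mm256_set1_epi64x(needle as i64);
    for i in (0..column.len()).step_by(4) {
        let v = _mm256_loadu_si256(column.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi64(v, target);
        // One bit per matching 64-bit lane, via the f64 movemask trick.
        let mut m = _mm256_movemask_pd(_mm256_castsi256_pd(eq)) as u32;
        while m != 0 {
            out.push(i as u32 + m.trailing_zeros());
            m &= m - 1; // clear the lowest set bit
        }
    }
}
```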
Regular triple pattern lookups traverse a B-tree or LSM tree, where the bottleneck is pointer-chasing through tree nodes rather than arithmetic comparison. The CPU is stalled on cache misses, not starved for comparison bandwidth, so SIMD cannot help: the data is not contiguous.
This is why pseudo-tables exist: they transform the pointer-chasing problem into a columnar scan problem, which is SIMD-friendly.
Computing cosine similarity directly requires two norms and a dot product per comparison (3 passes over the data). Loka follows Qdrant's pattern: normalize vectors at insert time, then use dot product for all similarity computations at search time. This is mathematically equivalent but requires only 1 pass instead of 3.
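A sketch of the insert-time half of that trick (plain scalar code here for clarity; the L2-norm kernel from the table above would compute the norm in SIMD):

```rust
/// Normalize once at insert time. For unit vectors, cos(a, b) = a · b,
/// so search-time similarity becomes a single dot product (one pass).
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let (mut a, mut b) = (vec![3.0f32, 4.0], vec![6.0f32, 8.0]);
    normalize(&mut a);
    normalize(&mut b);
    let cosine: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    assert!((cosine - 1.0).abs() < 1e-6); // parallel vectors: cosine of 1
}
```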