Two places where SIMD matters in a graph+vector database, and one place where it doesn't.
SIMD (Single Instruction, Multiple Data) applies the same operation to multiple values simultaneously. On modern x86 CPUs, a 256-bit AVX2 register holds eight f32 values (or four u64), so one vector instruction does the work of eight scalar ones; 128-bit SSE registers hold half as many.
The key requirement: data must be contiguous in memory. SIMD cannot help with pointer-chasing through B-trees or linked lists — the CPU stalls waiting on cache misses, not arithmetic.
During HNSW search, every candidate node requires a distance computation against the query vector. For a 1536-dimensional embedding, that is 1536 multiply-add operations per comparison. This is the single hottest loop in vector search.
Loka uses explicit SIMD intrinsics for three distance functions:
| Function | AVX2 (per instruction) | SSE (per instruction) | Use |
|---|---|---|---|
| Dot product | 8 f32 (FMA) | 4 f32 | Cosine similarity (pre-normalized vectors) |
| Squared Euclidean | 8 f32 (FMA) | 4 f32 | Euclidean distance (avoids sqrt; ordering is preserved) |
| L2 norm | 8 f32 (FMA) | 4 f32 | Vector normalization at insert time |
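To make the first row concrete, here is a minimal sketch of an AVX2+FMA dot-product kernel. This is illustrative, not Loka's actual code: the name dot_avx2, the multiple-of-8 length assumption, and the store-based reduction are all simplifications.

```rust
use std::arch::x86_64::*;

/// Illustrative AVX2+FMA dot product: 8 f32 lanes per instruction.
/// Assumes a.len() == b.len() and a length that is a multiple of 8;
/// production code would process the remainder with a scalar tail.
#[target_feature(enable = "avx2", enable = "fma")]
unsafe fn dot_avx2(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = _mm256_setzero_ps();
    for i in (0..a.len()).step_by(8) {
        let va = _mm256_loadu_ps(a.as_ptr().add(i)); // load 8 f32 from a
        let vb = _mm256_loadu_ps(b.as_ptr().add(i)); // load 8 f32 from b
        acc = _mm256_fmadd_ps(va, vb, acc);          // acc += va * vb, fused
    }
    // Horizontal reduction of the 8 partial sums.
    let mut lanes = [0.0f32; 8];
    _mm256_storeu_ps(lanes.as_mut_ptr(), acc);
    lanes.iter().sum()
}
```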
Runtime feature detection dispatches to the best available path, with a scalar fallback on non-x86 architectures.
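The dispatch itself can be as simple as the following sketch, which reuses the dot_avx2 kernel above. In a real crate the kernel would also be cfg-gated to x86_64, and the selected path would typically be cached in a function pointer rather than re-checked per call.

```rust
/// Illustrative runtime dispatch: choose the widest kernel the CPU supports.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") && is_x86_feature_detected!("fma") {
            // Safe to call: we just verified the required CPU features.
            return unsafe { dot_avx2(a, b) };
        }
    }
    // Scalar fallback: the only path on non-x86 architectures.
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}
```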
When a SPARQL query hits a pseudo-table, the executor scans columnar arrays of u64 TermIds. Loka stores these as packed dense arrays (null values use a sentinel) and scans them with SIMD:
| Operation | AVX2 (per instruction) | SSE2 (per instruction) | Example pattern |
|---|---|---|---|
| Equality scan | 4 u64 | 2 u64 | ?x :city :Tokyo |
| Not-null scan | 4 u64 | 2 u64 | ?x :email ?e (property exists) |
| Range scan | cache-friendly scalar | cache-friendly scalar | FILTER(?age > 25) |
The packed layout (dense u64 array instead of Option<u64>) eliminates 8 bytes of overhead per element and ensures data is contiguous for SIMD loads.
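As an illustration of what an equality scan over such a packed column could look like (the scan_eq name, the u32 row indices, and the multiple-of-4 length assumption are hypothetical, not Loka's API):

```rust
use std::arch::x86_64::*;

/// Illustrative AVX2 equality scan over a packed TermId column:
/// compares 4 u64 lanes per instruction. Assumes the NULL sentinel
/// never equals a real TermId, and a length that is a multiple of 4
/// (a real scan would handle the tail with scalar code).
#[target_feature(enable = "avx2")]
unsafe fn scan_eq(column: &[u64], needle: u64, out: &mut Vec<u32>) {
    let target = _mm256_set1_epi64x(needle as i64);
    for i in (0..column.len()).step_by(4) {
        let v = _mm256_loadu_si256(column.as_ptr().add(i) as *const __m256i);
        let eq = _mm256_cmpeq_epi64(v, target);
        // One bit per matching 64-bit lane, via the f64 movemask trick.
        let mut m = _mm256_movemask_pd(_mm256_castsi256_pd(eq)) as u32;
        while m != 0 {
            out.push(i as u32 + m.trailing_zeros());
            m &= m - 1; // clear the lowest set bit
        }
    }
}
```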
Regular triple pattern lookups traverse a B-tree or LSM tree, where the bottleneck is pointer-chasing through tree nodes rather than arithmetic comparison. The CPU is stalled on cache misses, not starved for comparison bandwidth, so SIMD cannot help: the data is not contiguous.
This is why pseudo-tables exist: they transform the pointer-chasing problem into a columnar scan problem, which is SIMD-friendly.
Computing cosine similarity directly requires two norms and a dot product per comparison (3 passes over the data). Loka follows Qdrant's pattern: normalize vectors at insert time, then use dot product for all similarity computations at search time. This is mathematically equivalent but requires only 1 pass instead of 3.
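A sketch of the insert-time half of that trick (plain scalar code here for clarity; the L2-norm kernel from the table above would compute the norm in SIMD):

```rust
/// Normalize once at insert time. For unit vectors, cos(a, b) = a · b,
/// so search-time similarity becomes a single dot product (one pass).
fn normalize(v: &mut [f32]) {
    let norm = v.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm > 0.0 {
        for x in v.iter_mut() {
            *x /= norm;
        }
    }
}

fn main() {
    let (mut a, mut b) = (vec![3.0f32, 4.0], vec![6.0f32, 8.0]);
    normalize(&mut a);
    normalize(&mut b);
    let cosine: f32 = a.iter().zip(&b).map(|(x, y)| x * y).sum();
    assert!((cosine - 1.0).abs() < 1e-6); // parallel vectors: cosine of 1
}
```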