The full narrative of how a lean RDF-star triplestore became a neuro-symbolic world-model engine. What worked, what didn't, and why each call got made.
The earlier framing was "lean RDF-star triplestore that handles vector queries natively." That's still true mechanically, but the purpose has shifted: the engine is now one half of a two-system composition.
Both halves expose the same SPARQL+ interface. The caller doesn't pick which system answered; federation is implicit, and the only trace of it is the provenance edges on the result. Canonical vision: world-model-thesis.md.
Product framing: what Ollama is to LLMs, Loka is to world models. Pull or train a world model locally; pluggable; agent-first; honest provenance. The "agent-first" stance was already baked into Loka; the world-model layer is what makes the whole project a thing you'd want to install rather than just a database.
For the propositionInferredFrom output schema, see fine-tuning-track.md.
RDF-star moved from "one feature among many" to load-bearing. It's how every kind of citation in the system is expressed:
| Verb | Used for |
|---|---|
| propositionInferredFrom | model-generated triple → context that informed it |
| wdt:P854 / wdt:P248 / wdt:P813 | external curated references (Wikidata) |
All use the identical <<S P O>> verb <source> shape. Wikidata's API distinguishes "qualifiers" from "references", but Loka collapses both into the same RDF-star annotation form because they're semantically the same thing.
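A minimal sketch of that shared shape, using the verbs from the table above; the entity, claim, and source values are hypothetical stand-ins, not rows from the actual corpus:

```python
# Illustrative only: one claim expands into a main triple plus RDF-star
# annotations that all share the same quoted-triple subject. A qualifier snak
# would use exactly the same shape as the reference snaks shown here.
main = ("wd:Q12345", "wdt:P31", "wd:Q5")            # the claim itself (hypothetical)

quoted = f"<< {main[0]} {main[1]} {main[2]} >>"      # quoted-triple subject

annotations = [
    (quoted, "wdt:P854", "<https://example.org/source>"),   # reference URL
    (quoted, "wdt:P813", '"2026-03-18"^^xsd:date'),         # retrieved date
]

for s, p, o in [main] + annotations:
    print(s, p, o, ".")
```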
Reserved namespace. Every predicate under http://loka.dev/provenance/ is system-internal. The world model
never sees, proposes, or emits one. Three layers of enforcement:
1. Corpus layer (preprocess.py) drops every row whose predicate matches the prefix (sketched below).
2. Query layer: FILTER NOT EXISTS << ?s ?p ?o >> propositionGenerated ?_g excludes inner generated triples at query time.
3. Inference layer (infer_with_citations.py) refuses to consider reserved-namespace predicates as candidates and refuses to emit one even if a downstream bug allowed it.
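A minimal sketch of the first two guards, assuming rows arrive as plain (subject, predicate, object) string tuples; this is not the actual preprocess.py code:

```python
# Corpus-layer guard: drop any row whose predicate lives in the reserved
# system-internal namespace. The row shape here is an assumption.
RESERVED_PREFIX = "http://loka.dev/provenance/"

def drop_reserved(rows):
    """Keep only rows whose predicate is outside the reserved namespace."""
    return [(s, p, o) for (s, p, o) in rows if not p.startswith(RESERVED_PREFIX)]

# Query-layer guard, reproduced as written in the devlog (Loka's SPARQL+ dialect):
EXCLUDE_GENERATED = "FILTER NOT EXISTS << ?s ?p ?o >> propositionGenerated ?_g"
```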
Names are deliberately verbose — propositionGeneratedFrom rather than generatedFrom — so a human scanning raw triples spots
them at a glance and accidental collision with real-world predicates is vanishingly unlikely.
Before: each Wikidata claim emitted one flat triple, dropping all qualifier and reference data.
After: each claim emits the main triple plus an RDF-star annotation per qualifier and per reference snak, all sharing the
<<S P O>> quoted-triple subject.
The Wikidata API rate-limit (1.5s per request) made BFS the bottleneck — 5M triples needed days at that rate.
Switched to streaming philippesaade/wikidata
(CC0, ~30M entities, JSON-shaped per-entity rows in parquet) via the HuggingFace datasets library.
Local-bandwidth-bound instead of API-bound.
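A sketch of the streaming path via the HuggingFace datasets library; the split name and per-row layout are assumptions, only the dataset id comes from the devlog:

```python
# Stream per-entity rows from the CC0 parquet dump instead of crawling the API.
from datasets import load_dataset

stream = load_dataset("philippesaade/wikidata", split="train", streaming=True)

for row in stream:
    entity = row  # one JSON-shaped Wikidata entity per row
    # ... convert its claims into the main triple + RDF-star annotations here,
    # then batch them into POST /triples
    break  # illustration only: stop after the first entity
```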
End state: 5,055,385 triples / 1,695,402 RDF-star annotations / 27,780 entities / 770 MB on-disk Loka store.
Initially every imported triple got <<S P O>> loka:propositionImportedFrom <wikidata.org/wiki/Q...>.
For a database where every row came from Wikidata, that's redundant noise — 22,593 rows after the first hour of re-walk (~46% of all annotations).
The actual provenance is already in Wikidata's own reference predicates. Dropped.
Previously hardcoded en/ja/de/fr/zh; now iterates every language in entity.labels and entity.descriptions.
The training preprocessor still filters to English, but the database keeps the multilingual richness for future work.
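A sketch of the all-languages label walk, assuming Wikidata's standard entity-JSON label shape; the parquet rows' exact schema may differ:

```python
# Iterate every language present on an entity rather than a hardcoded subset.
# Assumed shape: {"labels": {"en": {"language": "en", "value": "..."}}, ...}
def iter_labels(entity: dict):
    for lang, label in entity.get("labels", {}).items():
        yield lang, label["value"]
    # entity["descriptions"] follows the same {lang: {"value": ...}} layout
```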
The original BFS importer called Ollama (mxbai-embed-large) per entity. The world-model loop tokenizes English labels — vectors don't enter
the training corpus. Stripped from the importer. The HNSW index in loka-core stays — that's an engine feature, not specific to import.
- ?s ?p ?o queries occasionally return literal values in the predicate slot. RDF disallows literal predicates; this is invalid output from the executor. Filtered at preprocess (drops ~1% of rows on a 5M corpus). Real engine bug.
- POST /triples wedges after roughly every 5–6× growth in stored triples. Hit at ~174k and again at ~1M during the HF ingest. /health keeps responding, but /triples and SPARQL hang indefinitely until restart. Data is intact on disk; recoverable. Real engine bug.
A separate proto-layer bug was found and fixed mid-period: POST /triples was returning HTTP 400 for the entire batch when any RDF-star annotation's
inner triple already existed in persistent storage. The in-memory branch already discarded DuplicateTriple; only the persistent branch propagated.
Fixed at server.rs:935 and :962 so both branches handle duplicates the same way.
v0/v1/v2 were the early smoke-test checkpoints on a 6,300-triple shrine-only corpus. v3 onward use the 5M-triple HF-derived corpus.
| Model | Architecture | Corpus | Final ppl | Notes |
|---|---|---|---|---|
| v3 | 16M params (256 / 4 layers) | 779k label-substituted | 53.43 | Pre-cleanup. Misleadingly low ppl from memorising xmlschema decimal URI fragments. |
| v4 | 16M params (256 / 4 layers) | 757k cleaned | 92.48 | Higher ppl, better output. Numerical regression masks real-quality improvement. |
| v5 | 44M params (512 / 6 layers) | 757k cleaned | 84.85 | Bigger model. Beats v4's final ppl by epoch 4. Picks specific real-world tokens where v4 fell back to common connectors. |
The v3→v4 corpus cleanup:
- Strip ^^<datatype> suffixes from typed literals. Loka's SPARQL serialization embeds the datatype URI in the literal value string. Without stripping, datatype-URI fragments (xmlschema, decimal, org) reached the tokenizer as if they were entity content and dominated certain predictions.
- Drop rows with a literal in the ?p slot. RDF disallows literal predicates, so dropping is safe.

Pre-cleanup, a representative v3 sample read Abbas Mirza | has works in collection | 1 http www w3 org 2001 xmlschema decimal (confidence 0.93) — a memorization of literal-with-embedded-datatype-URI patterns.
After fixing the corpus: metropolitan museum of museum (confidence 0.43). The Met genuinely holds Abbas Mirza pieces.
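A minimal sketch of the two cleanup rules, assuming rows arrive as (subject, predicate, object) strings; the exact serialization Loka emits may differ from this pattern:

```python
import re

# e.g. "1"^^<http://www.w3.org/2001/XMLSchema#decimal>
DATATYPE_SUFFIX = re.compile(r"\^\^<[^>]*>\s*$")

def clean_row(s: str, p: str, o: str):
    if p.startswith('"'):               # literal in the ?p slot: invalid RDF, drop the row
        return None
    o = DATATYPE_SUFFIX.sub("", o)      # strip the embedded datatype URI from typed literals
    return s, p, o
```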
| Epoch | v4 (16M) ppl | v5 (44M) ppl |
|---|---|---|
| 1 | 1150.7 | 1528.7 |
| 2 | 196.0 | 147.3 |
| 3 | 133.5 | 104.2 |
| 4 | 100.7 | 90.7 |
| 5 | 92.5 | 84.85 |
v5 starts higher (epoch 1) — more parameters mean a harder optimisation landscape and slower initial convergence. It crosses under v4 at epoch 2 and pulls ahead from there. By epoch 4 it has already passed v4's final perplexity. Wall time on an RTX 4070 Laptop: 91 min for v5 vs 42 min for v4 (2.2× compute, 8% better final ppl).
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate during greedy decoding to
fillers like of of of of or museum museum.
We didn't retrain. We changed the decoder. Cumulative penalty: divide each repeated token's logit by
repetition_penalty^count, where count is the number of prior emissions of that token in this sequence.
Three emissions of of at penalty 3.0 multiply its divisor by 27, reliably dropping it below the per-token floor.
Same v4 checkpoint moves from university of of of of of of of (no penalty) to university of halle (cumulative penalty 3.0).
The model knew Halle — the decoder just couldn't stop emitting of long enough to reach it.
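A sketch of the cumulative penalty at decode time; the sign handling for negative logits follows the usual repetition-penalty convention and, like the names, is an assumption about the implementation:

```python
# Cumulative repetition penalty: scale each emitted token's score by
# penalty**count, where count is how many times it has already been emitted
# in this sequence. Sign handling ensures repeats always lose score whether
# the logit is positive or negative.
from collections import Counter

def apply_cumulative_penalty(logits, emitted: Counter, penalty: float = 3.0):
    for token_id, count in emitted.items():
        factor = penalty ** count
        if logits[token_id] > 0:
            logits[token_id] /= factor
        else:
            logits[token_id] *= factor
    return logits

# At each greedy step, update the counter with the chosen token:
# emitted.update([token_id]). By the third repeat the divisor has grown to
# 27 at penalty 3.0, matching the behaviour described above.
```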
"Loka" is a name for the engine. The project that's emerging (engine + corpus + trained world model + inference layer) needs its own identity. Loka is the name on Hugging Face. The GitHub repo will be renamed to match later.
tools/hf_snapshot.py pushes corpus + checkpoints to a single dataset repo
EmmaLeonhart/loka with each upload tagged as a snapshot revision (v3, v4, v5).
Each upload is a commit; tagged snapshots are pullable via revision="v4". LFS is handled transparently by huggingface_hub.
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(
repo_id="EmmaLeonhart/loka",
repo_type="dataset",
filename="checkpoints/wikidata_v5.pt",
revision="v5",
)
Brief period of production-readiness fixes after testing Loka against the ManuForge SDK consumer. Surfaced a small set of real issues:
- POST /triples had edge cases on parsing.

Bumped to v0.3.7 at the end of this round. The ManuForge integration also produced docs/AGENT_SETUP.md for AI-agent consumers.
loka-ffi crate wraps the engine in a C-compatible shared library so non-Rust consumers (Flutter, in particular) can embed the database in-process.
Studio uses dart:ffi to load loka_ffi.dll/.so/.dylib and runs the engine on a background thread sharing the same handle as the optional MCP server.
Two entry points: loka mcp (MCP + database, no GUI) and Loka Studio (GUI + database + optional MCP server, all one process).
Studio also auto-starts the server in serverless mode when launched, so the user never has to run loka serve manually.
Includes graph view (D3 then vis-network), HNSW health diagnostics, OWL/Turtle export, dark/light theme, persistent connection settings.
A non-trivial extension: every triple is conceptually contained in a temporal interval, and queries can ask "what was true at time T" or "what changed between T1 and T2" without reifying every statement individually. Implementation phases:
1. Temporal predicates (loka:assertedAt, loka:validFrom, loka:validTo), TSPO index.
2. AT_TIME and DURING query operators.
3. WORLD_STATE and TEMPORAL_DIFF.

Containment semantics use three-valued query logic (true / false / unknown). Design lives in docs/ontochronology.md.
Vector search operators: COSINE_SEARCH, EUCLID_SEARCH, DOTPRODUCT_SEARCH.

A consolidating release. Headlines: query planner, agent installer, Java SDK, Loka Studio first cut. All four SDKs (Go, Rust, Java, .NET) had endpoint mismatches caught and fixed. SDK publish workflow + integration test CI added.
Pseudo-tables landed in this window: columnar indexes with zonemap pruning and vectorized scans, on top of the standard SPO/POS/OSP indexes. Designed to make multi-hop subgraph queries (the kind RDF databases are typically slow at) competitive with property-graph databases.
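A conceptual sketch of zonemap pruning, not Loka's columnar implementation; the chunk layout and names are illustrative:

```python
# Each column chunk carries a min/max summary; a scan skips any chunk whose
# range cannot contain the value being matched, so most of the column is
# never touched on selective multi-hop lookups.
from dataclasses import dataclass

@dataclass
class Chunk:
    lo: int        # smallest interned ID stored in this chunk
    hi: int        # largest interned ID stored in this chunk
    values: list   # the column data itself

def scan(chunks: list[Chunk], target: int) -> list[tuple[int, int]]:
    hits = []
    for i, chunk in enumerate(chunks):
        if target < chunk.lo or target > chunk.hi:
            continue  # pruned: the zonemap proves this chunk can't match
        hits.extend((i, j) for j, v in enumerate(chunk.values) if v == target)
    return hits
```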
Released as v0.2.0 Developer Preview on 2026-03-18.
A single very productive day. Brought SPARQL coverage from "minimum viable" to roughly feature-complete for SPARQL 1.1 over RDF-star.
- Vec<bool> visited list switched to HashSet, cutting ~2s page-fault overhead at 200K+ HNSW nodes.
- PersistentStore (sled-backed) wired to the HTTP server with write-through; hydrate on startup.
- Vector distance functions: dot_product, squared_euclidean, l2_norm.
- /vectors/health for diagnostics.
- verify_consistency() and repair() for index integrity.
- FILTER NOT EXISTS / EXISTS
- ASK queries
- GROUP BY + aggregates (COUNT, SUM, AVG, MIN, MAX)
- BIND / VALUES
- Logical operators (&&, ||, !), comparison ops, type checks
- LANG() / LANGMATCHES()
- String functions (CONTAINS, STRSTARTS, STRENDS, REGEX)
- INSERT DATA / DELETE DATA (SPARQL Update)
- CONSTRUCT and DESCRIBE
- HAVING for GROUP BY filtering
- Property paths (+, *, ?, /)
- Subqueries (nested SELECT)
- DATATYPE(), STR(), COALESCE(), IF()
- Quoted-triple patterns in FILTER (<< ?s ?p ?o >>)
- loka import / export / info / install-agent CLI commands
- %%sparql cell magic

First Wikidata BFS import on this day: 439 entities, 16,084 triples, 439 vectors (1024-dim mxbai-embed-large), 0 errors. Later abandoned as the corpus base in favor of the HF parquet stream — the BFS rate limit made it impractical to scale.
Project began on 2026-03-13 with the cleanvibe scaffold. Within 24 hours: architecture docs, normalised loka-* workspace structure, loka-core and loka-hnsw foundations. Apache 2.0 license. CI workflow.
Borrowed patterns explicitly from Qdrant (HNSW: immutable GraphLayers for search, thread-local visited pools, per-node RwLock during construction) and Oxigraph (storage, IRI interning, RDF triplestore baseline).
By the end of the first 48 hours the project had: a working engine, a working SPARQL surface, a working vector layer, persistence, six SDKs, CI, a website, and a stress test passing at 1M scale. That set the pace for everything that came after.
The original ingest crawled Wikidata's API at the 1.5s rate limit. At that rate, the 5M-triple target works out to roughly 3.3 hours per million entries, with exponential queue growth from qualifier-discovered links on top. We pushed through ~870 entities, then switched to the HF parquet stream entirely. Lesson: rate-limited APIs are fine for discovery, not for bulk ingest. Streaming a frozen dump is always faster when one exists.
When all your data comes from one source, tagging every row with that source is redundant noise. 22,593 rows of propositionImportedFrom <wikidata.org> after one hour of import — ~46% of all RDF-star annotations. Lesson: provenance is most useful when the source is genuinely variable. Wikidata's own reference predicates already capture per-claim provenance; don't shadow them.
The first repetition-penalty implementation tracked emitted tokens in a set: each prior emission penalized the token's later probability once, regardless of how many times it had already been emitted. Penalty 1.5 didn't break loops because dominant tokens like of stayed dominant after one division. Lesson: for masked-position decoding, cumulative penalty (penalty^count) is the right knob — it's the only thing that defeats deeply-overweighted single tokens.
We didn't catch the embedded datatype-suffix leak until v3 had finished training. v3's perplexity (53) was numerically better than v4's (92) but qualitatively dramatically worse — v3 was memorising URI fragments. Lesson: perplexity is a proxy for content quality, not a measurement. Always sample-and-eyeball.
The user pasted an HF token in chat to enable an upload. The first use worked. The auto-classifier blocked the second use of the same token, correctly — re-transmitting a leaked credential is the pattern it's designed to stop. Lesson: never normalize credential pasting. huggingface-cli login writes to ~/.cache/huggingface/token and is the only path that doesn't leak the secret into chat or shell history.
The original Wikidata BFS importer called Ollama for a 1024-dim embedding per entity (~2s per call, plus the Wikidata 1.5s = ~3.5s/entity). Vectors didn't enter the training corpus. Removing them roughly doubled effective import throughput and dropped the daemon dependency. Lesson: if a side-effect doesn't pay for itself in the loop you're optimizing, take it out.
The POST /triples wedge: the persistent layer wedges after roughly every 5–6× growth in stored triples. We can't fix this in the importer; the workaround is automated stop-restart (sketched below). Lesson: bugs found at ingest scale are real engine bugs that a smaller test suite would never have caught. Keep them documented — the fix is downstream.
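A sketch of that stop-restart workaround, assuming the importer talks to the HTTP endpoints named above; the port, payload shape, and process-management calls are all illustrative, not the actual importer:

```python
# A wedged POST /triples surfaces as a timeout (while /health still answers),
# so the importer restarts the server and retries the batch. Data already on
# disk is intact, so a restart is safe.
import subprocess, time, requests

BASE = "http://localhost:8000"   # hypothetical port

def post_batch(batch, timeout=120):
    try:
        requests.post(f"{BASE}/triples", json=batch, timeout=timeout).raise_for_status()
    except requests.exceptions.Timeout:
        subprocess.run(["pkill", "-f", "loka serve"])    # stop the wedged server
        subprocess.Popen(["loka", "serve"])              # relaunch
        time.sleep(10)                                   # allow hydrate on startup
        requests.post(f"{BASE}/triples", json=batch, timeout=timeout).raise_for_status()
```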
This page is a website-readable rendering of DEVLOG.md.
Per-commit detail lives in git log; the live status of in-flight work lives in status.md.