The full narrative of how a lean RDF-star triplestore became a neuro-symbolic world-model engine. What worked, what didn't, and why each call got made.
The earlier framing was "lean RDF-star triplestore that handles vector queries natively." That's still true mechanically, but the purpose has shifted: the engine is now one half of a two-system composition.
Both halves expose the same SPARQL+ interface. The caller doesn't pick which system answered; federation is implicit, and the only trace of it is the provenance edges on the result. Canonical vision: world-model-thesis.md.
Product framing: what Ollama is to LLMs, Loka is to world models. Pull or train a world model locally; pluggable; agent-first; honest provenance. The "agent-first" stance was already baked into Loka; the world-model layer is what makes the whole project a thing you'd want to install rather than just a database.
For the propositionInferredFrom output schema, see fine-tuning-track.md.
RDF-star moved from "one feature among many" to load-bearing. It's how every kind of citation in the system is expressed:
| Verb | Used for |
|---|---|
| propositionInferredFrom | model-generated triple → context that informed it |
| wdt:P854 / wdt:P248 / wdt:P813 | external curated references (Wikidata) |
All use the identical <<S P O>> verb <source> shape. Wikidata's API distinguishes "qualifiers" from "references", but Loka collapses both into the same RDF-star annotation form because they're semantically the same thing.
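A minimal sketch of that shared shape, using the verbs from the table above; the entity, claim, and source values are hypothetical stand-ins, not rows from the actual corpus:

```python
# Illustrative only: one claim expands into a main triple plus RDF-star
# annotations that all share the same quoted-triple subject. A qualifier snak
# would use exactly the same shape as the reference snaks shown here.
main = ("wd:Q12345", "wdt:P31", "wd:Q5")            # the claim itself (hypothetical)

quoted = f"<< {main[0]} {main[1]} {main[2]} >>"      # quoted-triple subject

annotations = [
    (quoted, "wdt:P854", "<https://example.org/source>"),   # reference URL
    (quoted, "wdt:P813", '"2026-03-18"^^xsd:date'),         # retrieved date
]

for s, p, o in [main] + annotations:
    print(s, p, o, ".")
```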
Reserved namespace. Every predicate under http://loka.dev/provenance/ is system-internal. The world model
never sees, proposes, or emits one. Three layers of enforcement:
1. Corpus layer (preprocess.py) drops every row whose predicate matches the prefix (sketched below).
2. Query layer: FILTER NOT EXISTS << ?s ?p ?o >> propositionGenerated ?_g excludes inner generated triples at query time.
3. Inference layer (infer_with_citations.py) refuses to consider reserved-namespace predicates as candidates and refuses to emit one even if a downstream bug allowed it.
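A minimal sketch of the first two guards, assuming rows arrive as plain (subject, predicate, object) string tuples; this is not the actual preprocess.py code:

```python
# Corpus-layer guard: drop any row whose predicate lives in the reserved
# system-internal namespace. The row shape here is an assumption.
RESERVED_PREFIX = "http://loka.dev/provenance/"

def drop_reserved(rows):
    """Keep only rows whose predicate is outside the reserved namespace."""
    return [(s, p, o) for (s, p, o) in rows if not p.startswith(RESERVED_PREFIX)]

# Query-layer guard, reproduced as written in the devlog (Loka's SPARQL+ dialect):
EXCLUDE_GENERATED = "FILTER NOT EXISTS << ?s ?p ?o >> propositionGenerated ?_g"
```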
Names are deliberately verbose — propositionGeneratedFrom rather than generatedFrom — so a human scanning raw triples spots
them at a glance and accidental collision with real-world predicates is vanishingly unlikely.
Before: each Wikidata claim emitted one flat triple, dropping all qualifier and reference data.
After: each claim emits the main triple plus an RDF-star annotation per qualifier and per reference snak, all sharing the
<<S P O>> quoted-triple subject.
The Wikidata API rate-limit (1.5s per request) made BFS the bottleneck — 5M triples needed days at that rate.
Switched to streaming philippesaade/wikidata
(CC0, ~30M entities, JSON-shaped per-entity rows in parquet) via the HuggingFace datasets library.
Local-bandwidth-bound instead of API-bound.
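A sketch of the streaming path via the HuggingFace datasets library; the split name and per-row layout are assumptions, only the dataset id comes from the devlog:

```python
# Stream per-entity rows from the CC0 parquet dump instead of crawling the API.
from datasets import load_dataset

stream = load_dataset("philippesaade/wikidata", split="train", streaming=True)

for row in stream:
    entity = row  # one JSON-shaped Wikidata entity per row
    # ... convert its claims into the main triple + RDF-star annotations here,
    # then batch them into POST /triples
    break  # illustration only: stop after the first entity
```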
End state: 5,055,385 triples / 1,695,402 RDF-star annotations / 27,780 entities / 770 MB on-disk Loka store.
Initially every imported triple got <<S P O>> loka:propositionImportedFrom <wikidata.org/wiki/Q...>.
For a database where every row came from Wikidata, that's redundant noise — 22,593 rows after the first hour of re-walk (~46% of all annotations).
The actual provenance is already in Wikidata's own reference predicates. Dropped.
Previously hardcoded en/ja/de/fr/zh; now iterates every language in entity.labels and entity.descriptions.
The training preprocessor still filters to English, but the database keeps the multilingual richness for future work.
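A sketch of the all-languages label walk, assuming Wikidata's standard entity-JSON label shape; the parquet rows' exact schema may differ:

```python
# Iterate every language present on an entity rather than a hardcoded subset.
# Assumed shape: {"labels": {"en": {"language": "en", "value": "..."}}, ...}
def iter_labels(entity: dict):
    for lang, label in entity.get("labels", {}).items():
        yield lang, label["value"]
    # entity["descriptions"] follows the same {lang: {"value": ...}} layout
```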
The original BFS importer called Ollama (mxbai-embed-large) per entity. The world-model loop tokenizes English labels — vectors don't enter
the training corpus. Stripped from the importer. The HNSW index in loka-core stays — that's an engine feature, not specific to import.
- ?s ?p ?o queries occasionally return literal values in the predicate slot. RDF disallows literal predicates; this is invalid output from the executor. Filtered at preprocess (drops ~1% of rows on a 5M corpus). Real engine bug.
- POST /triples wedges after roughly every 5–6× growth in stored triples. Hit at ~174k and again at ~1M during the HF ingest. /health keeps responding, but /triples and SPARQL hang indefinitely until restart. Data is intact on disk; recoverable. Real engine bug.
A separate proto-layer bug was found and fixed mid-period: POST /triples was returning HTTP 400 for the entire batch when any RDF-star annotation's
inner triple already existed in persistent storage. The in-memory branch already discarded DuplicateTriple; only the persistent branch propagated.
Fixed at server.rs:935 and :962 so both branches handle duplicates the same way.
v0/v1/v2 were the early smoke-test checkpoints on a 6,300-triple shrine-only corpus. v3 onward use the 5M-triple HF-derived corpus.
| Model | Architecture | Corpus | Final ppl | Notes |
|---|---|---|---|---|
| v3 | 16M params (256 / 4 layers) | 779k label-substituted | 53.43 | Pre-cleanup. Misleadingly low ppl from memorising xmlschema decimal URI fragments. |
| v4 | 16M params (256 / 4 layers) | 757k cleaned | 92.48 | Higher ppl, better output. Numerical regression masks real-quality improvement. |
| v5 | 44M params (512 / 6 layers) | 757k cleaned | 84.85 | Bigger model. Beats v4's final ppl by epoch 4. Picks specific real-world tokens where v4 fell back to common connectors. |
The v3→v4 corpus cleanup:
- Strip ^^<datatype> suffixes from typed literals. Loka's SPARQL serialization embeds the datatype URI in the literal value string. Without stripping, datatype-URI fragments (xmlschema, decimal, org) reached the tokenizer as if they were entity content and dominated certain predictions.
- Drop rows with a literal in the ?p slot. RDF disallows literal predicates, so dropping is safe.

Pre-cleanup, a representative v3 sample read Abbas Mirza | has works in collection | 1 http www w3 org 2001 xmlschema decimal (confidence 0.93) — a memorization of literal-with-embedded-datatype-URI patterns.
After fixing the corpus: metropolitan museum of museum (confidence 0.43). The Met genuinely holds Abbas Mirza pieces.
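A minimal sketch of the two cleanup rules, assuming rows arrive as (subject, predicate, object) strings; the exact serialization Loka emits may differ from this pattern:

```python
import re

# e.g. "1"^^<http://www.w3.org/2001/XMLSchema#decimal>
DATATYPE_SUFFIX = re.compile(r"\^\^<[^>]*>\s*$")

def clean_row(s: str, p: str, o: str):
    if p.startswith('"'):               # literal in the ?p slot: invalid RDF, drop the row
        return None
    o = DATATYPE_SUFFIX.sub("", o)      # strip the embedded datatype URI from typed literals
    return s, p, o
```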
| Epoch | v4 (16M) ppl | v5 (44M) ppl |
|---|---|---|
| 1 | 1150.7 | 1528.7 |
| 2 | 196.0 | 147.3 |
| 3 | 133.5 | 104.2 |
| 4 | 100.7 | 90.7 |
| 5 | 92.5 | 84.85 |
v5 starts higher (epoch 1) — more parameters mean a harder optimisation landscape and slower initial convergence. It crosses under v4 at epoch 2 and pulls ahead from there. By epoch 4 it has already passed v4's final perplexity. Wall time on an RTX 4070 Laptop: 91 min for v5 vs 42 min for v4 (2.2× compute, 8% better final ppl).
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate during greedy decoding to
fillers like of of of of or museum museum.
We didn't retrain. We changed the decoder. Cumulative penalty: divide each repeated token's logit by
repetition_penalty^count, where count is the number of prior emissions of that token in this sequence.
Three emissions of of at penalty 3.0 multiply its divisor by 27, reliably dropping it below the per-token floor.
Same v4 checkpoint moves from university of of of of of of of (no penalty) to university of halle (cumulative penalty 3.0).
The model knew Halle — the decoder just couldn't stop emitting of long enough to reach it.
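A sketch of the cumulative penalty at decode time; the sign handling for negative logits follows the usual repetition-penalty convention and, like the names, is an assumption about the implementation:

```python
# Cumulative repetition penalty: scale each emitted token's score by
# penalty**count, where count is how many times it has already been emitted
# in this sequence. Sign handling ensures repeats always lose score whether
# the logit is positive or negative.
from collections import Counter

def apply_cumulative_penalty(logits, emitted: Counter, penalty: float = 3.0):
    for token_id, count in emitted.items():
        factor = penalty ** count
        if logits[token_id] > 0:
            logits[token_id] /= factor
        else:
            logits[token_id] *= factor
    return logits

# At each greedy step, update the counter with the chosen token:
# emitted.update([token_id]). By the third repeat the divisor has grown to
# 27 at penalty 3.0, matching the behaviour described above.
```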
"Loka" is a name for the engine. The project that's emerging (engine + corpus + trained world model + inference layer) needs its own identity. Loka is the name on Hugging Face. The GitHub repo will be renamed to match later.
tools/hf_snapshot.py pushes corpus + checkpoints to a single dataset repo
EmmaLeonhart/loka with each upload tagged as a snapshot revision (v3, v4, v5).
Each upload is a commit; tagged snapshots are pullable via revision="v4". LFS is handled transparently by huggingface_hub.
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(
repo_id="EmmaLeonhart/loka",
repo_type="dataset",
filename="checkpoints/wikidata_v5.pt",
revision="v5",
)
Brief period of production-readiness fixes after testing Loka against the ManuForge SDK consumer. Surfaced a small set of real issues:
- POST /triples had edge cases on parsing.

Bumped to v0.3.7 at the end of this round. The ManuForge integration also produced docs/AGENT_SETUP.md for AI-agent consumers.
loka-ffi crate wraps the engine in a C-compatible shared library so non-Rust consumers (Flutter, in particular) can embed the database in-process.
Studio uses dart:ffi to load loka_ffi.dll/.so/.dylib and runs the engine on a background thread sharing the same handle as the optional MCP server.
Two entry points: loka mcp (MCP + database, no GUI) and Loka Studio (GUI + database + optional MCP server, all one process).
Studio also auto-starts the server in serverless mode when launched, so the user never has to run loka serve manually.
Includes graph view (D3 then vis-network), HNSW health diagnostics, OWL/Turtle export, dark/light theme, persistent connection settings.
A non-trivial extension: every triple is conceptually contained in a temporal interval, and queries can ask "what was true at time T" or "what changed between T1 and T2" without reifying every statement individually. Implementation phases:
1. Temporal predicates (loka:assertedAt, loka:validFrom, loka:validTo), TSPO index.
2. AT_TIME and DURING query operators.
3. WORLD_STATE and TEMPORAL_DIFF.

Containment semantics use three-valued query logic (true / false / unknown). Design lives in docs/ontochronology.md.
Vector search operators: COSINE_SEARCH, EUCLID_SEARCH, DOTPRODUCT_SEARCH.

A consolidating release. Headlines: query planner, agent installer, Java SDK, Loka Studio first cut. All four SDKs (Go, Rust, Java, .NET) had endpoint mismatches caught and fixed. SDK publish workflow + integration test CI added.
Pseudo-tables landed in this window: columnar indexes with zonemap pruning and vectorized scans, on top of the standard SPO/POS/OSP indexes. Designed to make multi-hop subgraph queries (the kind RDF databases are typically slow at) competitive with property-graph databases.
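A conceptual sketch of zonemap pruning, not Loka's columnar implementation; the chunk layout and names are illustrative:

```python
# Each column chunk carries a min/max summary; a scan skips any chunk whose
# range cannot contain the value being matched, so most of the column is
# never touched on selective multi-hop lookups.
from dataclasses import dataclass

@dataclass
class Chunk:
    lo: int        # smallest interned ID stored in this chunk
    hi: int        # largest interned ID stored in this chunk
    values: list   # the column data itself

def scan(chunks: list[Chunk], target: int) -> list[tuple[int, int]]:
    hits = []
    for i, chunk in enumerate(chunks):
        if target < chunk.lo or target > chunk.hi:
            continue  # pruned: the zonemap proves this chunk can't match
        hits.extend((i, j) for j, v in enumerate(chunk.values) if v == target)
    return hits
```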
Released as v0.2.0 Developer Preview on 2026-03-18.
A single very productive day. Brought SPARQL coverage from "minimum viable" to roughly feature-complete for SPARQL 1.1 over RDF-star.
- Vec<bool> visited list switched to HashSet, cutting ~2s page-fault overhead at 200K+ HNSW nodes.
- PersistentStore (sled-backed) wired to the HTTP server with write-through; hydrate on startup.
- Vector distance functions: dot_product, squared_euclidean, l2_norm.
- /vectors/health for diagnostics.
- verify_consistency() and repair() for index integrity.
- FILTER NOT EXISTS / EXISTS
- ASK queries
- GROUP BY + aggregates (COUNT, SUM, AVG, MIN, MAX)
- BIND / VALUES
- Logical operators (&&, ||, !), comparison ops, type checks
- LANG() / LANGMATCHES()
- String functions (CONTAINS, STRSTARTS, STRENDS, REGEX)
- INSERT DATA / DELETE DATA (SPARQL Update)
- CONSTRUCT and DESCRIBE
- HAVING for GROUP BY filtering
- Property paths (+, *, ?, /)
- Subqueries (nested SELECT)
- DATATYPE(), STR(), COALESCE(), IF()
- Quoted-triple patterns in FILTER (<< ?s ?p ?o >>)
- loka import / export / info / install-agent CLI commands
- %%sparql cell magic

First Wikidata BFS import on this day: 439 entities, 16,084 triples, 439 vectors (1024-dim mxbai-embed-large), 0 errors. Later abandoned as the corpus base in favor of the HF parquet stream — the BFS rate limit made it impractical to scale.
Project began on 2026-03-13 with the cleanvibe scaffold. Within 24 hours: architecture docs, normalised loka-* workspace structure, loka-core and loka-hnsw foundations. Apache 2.0 license. CI workflow.
Borrowed patterns explicitly from Qdrant (HNSW: immutable GraphLayers for search, thread-local visited pools, per-node RwLock during construction) and Oxigraph (storage, IRI interning, RDF triplestore baseline).
By the end of the first 48 hours the project had: a working engine, a working SPARQL surface, a working vector layer, persistence, six SDKs, CI, a website, and a stress test passing at 1M scale. That set the pace for everything that came after.
The original ingest crawled Wikidata's API at the 1.5s rate limit. At that rate, the 5M-triple target works out to roughly 3.3 hours per million entries, with exponential queue growth from qualifier-discovered links on top. We pushed through ~870 entities, then switched to the HF parquet stream entirely. Lesson: rate-limited APIs are fine for discovery, not for bulk ingest. Streaming a frozen dump is always faster when one exists.
When all your data comes from one source, tagging every row with that source is redundant noise. 22,593 rows of propositionImportedFrom <wikidata.org> after one hour of import — ~46% of all RDF-star annotations. Lesson: provenance is most useful when the source is genuinely variable. Wikidata's own reference predicates already capture per-claim provenance; don't shadow them.
The first repetition-penalty implementation tracked emitted tokens in a set: each prior emission penalized the token's later probability once, regardless of how many times it had already been emitted. Penalty 1.5 didn't break loops because dominant tokens like of stayed dominant after one division. Lesson: for masked-position decoding, cumulative penalty (penalty^count) is the right knob — it's the only thing that defeats deeply-overweighted single tokens.
We didn't catch the embedded datatype-suffix leak until v3 had finished training. v3's perplexity (53) was numerically better than v4's (92) but qualitatively dramatically worse — v3 was memorising URI fragments. Lesson: perplexity is a proxy for content quality, not a measurement. Always sample-and-eyeball.
The user pasted an HF token in chat to enable an upload. The first use worked. The auto-classifier blocked the second use of the same token, correctly — re-transmitting a leaked credential is the pattern it's designed to stop. Lesson: never normalize credential pasting. huggingface-cli login writes to ~/.cache/huggingface/token and is the only path that doesn't leak the secret into chat or shell history.
The original Wikidata BFS importer called Ollama for a 1024-dim embedding per entity (~2s per call, plus the Wikidata 1.5s = ~3.5s/entity). Vectors didn't enter the training corpus. Removing them roughly doubled effective import throughput and dropped the daemon dependency. Lesson: if a side-effect doesn't pay for itself in the loop you're optimizing, take it out.
The POST /triples wedge: the persistent layer wedges after roughly every 5–6× growth in stored triples. We can't fix this in the importer; the workaround is automated stop-restart (sketched below). Lesson: bugs found at ingest scale are real engine bugs that a smaller test suite would never have caught. Keep them documented — the fix is downstream.
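A sketch of that stop-restart workaround, assuming the importer talks to the HTTP endpoints named above; the port, payload shape, and process-management calls are all illustrative, not the actual importer:

```python
# A wedged POST /triples surfaces as a timeout (while /health still answers),
# so the importer restarts the server and retries the batch. Data already on
# disk is intact, so a restart is safe.
import subprocess, time, requests

BASE = "http://localhost:8000"   # hypothetical port

def post_batch(batch, timeout=120):
    try:
        requests.post(f"{BASE}/triples", json=batch, timeout=timeout).raise_for_status()
    except requests.exceptions.Timeout:
        subprocess.run(["pkill", "-f", "loka serve"])    # stop the wedged server
        subprocess.Popen(["loka", "serve"])              # relaunch
        time.sleep(10)                                   # allow hydrate on startup
        requests.post(f"{BASE}/triples", json=batch, timeout=timeout).raise_for_status()
```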
This page is a website-readable rendering of DEVLOG.md.
Per-commit detail lives in git log; the live status of in-flight work lives in status.md.