Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus
implicit memory (a transformer trained on the same triples, plausible answers). Generated facts
write back into the store as RDF-star annotations with propositionInferredFrom citation edges.
Loka started as a lean RDF-star triplestore with native vector indexing. Over time the purpose
shifted: that engine is now one half of a neuro-symbolic world model. The other half is a small role-aware
transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both
expose the same SPARQL+ interface. A query reaches both systems, and the caller never picks which one answered;
the only tell is the propositionInferredFrom RDF-star edges that thread every model-generated triple back to the
context that informed it. The engine and the project as a whole both carry the Loka name.
The loop is closed. Generated triples land in the store flagged propositionGenerated true.
The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output.
Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
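A minimal sketch of that exclusion, assuming a standard SPARQL-over-HTTP query endpoint; the /sparql path and the exact query shape are assumptions, and the real extraction code may differ:

```python
import requests

# Illustrative corpus-extraction query: keep only triples that carry no
# propositionGenerated annotation, so model output never re-enters training.
QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { << ?s ?p ?o >> prov:propositionGenerated true }
}
"""
resp = requests.post(
    "http://localhost:3030/sparql",            # assumed query path on the local engine
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```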
When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it
writes a fixed-shape annotation block. The block's subject is the quoted generated triple; its predicates are the
provenance metadata plus one or more propositionInferredFrom edges whose objects are other quoted
triples, the cited pieces of context the prediction was conditioned on:
<S> <P> "X" .
<< <S> <P> "X" >> prov:propositionGenerated    "true"^^xsd:boolean .
<< <S> <P> "X" >> prov:propositionGeneratedBy  "wikidata_v5" .
<< <S> <P> "X" >> prov:propositionConfidence   "0.43"^^xsd:decimal .
<< <S> <P> "X" >> prov:propositionInferredFrom << <S> <existing_p1> <existing_o1> >> .
<< <S> <P> "X" >> prov:propositionInferredFrom << <S> <existing_p2> <existing_o2> >> .
# ... one propositionInferredFrom edge per cited context triple (default 10)
prov: expands to http://loka.dev/provenance/. This whole namespace is reserved:
the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that: corpus
stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary
triple is written. Verbose names (propositionInferredFrom, not inferredFrom) keep the rule
scannable by humans and collision-resistant against real-world predicates.
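The guard itself can be a one-liner applied at all three stages. A minimal sketch, assuming predicates arrive as expanded IRIs; the function name and call sites are illustrative, not the repo's actual code:

```python
RESERVED_PROVENANCE_NS = "http://loka.dev/provenance/"

def reject_reserved(predicate: str, stage: str) -> str:
    """Raise if a predicate falls inside the reserved provenance namespace.

    Applied at corpus extraction, candidate-predicate filtering, and emit time,
    so a reserved predicate can never reach the model or come back out of it.
    """
    if predicate.startswith(RESERVED_PROVENANCE_NS):
        raise ValueError(f"{stage}: reserved provenance predicate {predicate!r}")
    return predicate

# Example: an emit-time check on a model-proposed predicate.
reject_reserved("http://www.wikidata.org/prop/direct/P69", stage="emit")            # passes
# reject_reserved("http://loka.dev/provenance/propositionGenerated", stage="emit")  # raises
```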
Each propositionInferredFrom edge is still a transparent RDF-star row pointing at a
concrete context triple: auditable, filterable like any other generated triple, and often informative about what the
model's implicit reasoning was. The schema does the work; nothing more elaborate is needed.
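Because the citations are ordinary RDF-star rows, auditing them is a single query. A hedged example of what such an audit could look like, reusing the same prefixes as the extraction sketch above (the store's exact syntax expectations may differ):

```python
# List every generated triple with its confidence and each cited context triple.
AUDIT_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o ?confidence ?context WHERE {
  << ?s ?p ?o >> prov:propositionGenerated    true ;
                 prov:propositionConfidence   ?confidence ;
                 prov:propositionInferredFrom ?context .
}
ORDER BY DESC(?confidence)
"""
```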
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate
during greedy decoding into fillers like of of of of or museum museum. We don't fix this at
training time. We fix it at decode time:
Every emission of token t increments a per-token counter. At each later masked position we divide the logit
of every already-emitted token by repetition_penalty^count. Three emissions of "of" at penalty 3.0
raise its divisor to 27, reliably dropping it below the per-token floor and breaking the cascade. A genuinely needed
re-use can still win on its first repeat; only loops collapse.
| Subject / predicate | No penalty | Cumulative penalty 3.0 |
|---|---|---|
| Comtesse de Die / educated at | university of of of of of of of | university of halle (correctly identifies Halle, where she studied) |
| canton of Romilly / Commons category | canton of of sur sur | canton of (clean truncation) |
| Zudar / area | (didn't pass threshold) | 33 (numeric: the model picked up that area is a number) |
| Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces) |
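Below is a minimal, self-contained sketch of the cumulative penalty. It operates on softmax scores rather than raw logits so that division always shrinks the value; the names and the 0.05 floor are illustrative, not the values infer_with_citations.py actually uses:

```python
import torch
from collections import Counter

def penalize(probs: torch.Tensor, emitted: Counter, penalty: float = 3.0) -> torch.Tensor:
    """Divide each previously emitted token's score by penalty**count."""
    out = probs.clone()
    for tok, count in emitted.items():
        out[tok] = out[tok] / (penalty ** count)
    return out

# Toy illustration: token 0 ("of") has been emitted three times, so its score is
# divided by 3.0**3 = 27 and falls below a 0.05 per-token floor.
probs = torch.tensor([0.60, 0.25, 0.15])
emitted = Counter({0: 3})
penalized = penalize(probs, emitted)
print(penalized)                              # tensor([0.0222, 0.2500, 0.1500])
floor = 0.05
best = int(penalized.argmax())
print(best, bool(penalized[best] >= floor))   # 1 True -> the "of" loop is broken
```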
All checkpoints train on a 5,055,385-triple slice of Wikidata streamed from philippesaade/wikidata, with English labels substituted for QIDs/PIDs.
| Model | Architecture | Final ppl | Notes |
|---|---|---|---|
| wikidata_v3.pt | 16M params (256 / 4 layers) | 53.43 | Pre-cleanup corpus: misleadingly low ppl from memorising datatype-suffix tokens. |
| wikidata_v4.pt | 16M params (256 / 4 layers) | 92.48 | Cleaned corpus. Higher ppl, qualitatively much better content. |
| wikidata_v5.pt | 44M params (512 / 6 layers) | 84.85 | Bigger model. Beats v4's final ppl by epoch 4. Picks specific real-world tokens (halle, 33, kosmos 116) where v4 fell back to common connectors. |

Pull a snapshot from huggingface.co/datasets/EmmaLeonhart/loka:
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(
repo_id="EmmaLeonhart/loka",
repo_type="dataset",
filename="checkpoints/wikidata_v5.pt",
revision="v5",
)
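Continuing from the snippet above, one way to drop the downloaded file where the quickstart below expects it; the target path just mirrors the --checkpoint argument, so adjust it if your layout differs:

```python
import shutil
from pathlib import Path

# Copy the cached download to the path used by the quickstart's --checkpoint flag.
target = Path("training/checkpoints/wikidata_v5.pt")
target.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(ckpt, target)
```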
# 1. Engine
cargo build --release -p loka-cli
./target/release/loka serve --port 3030 &
# 2. Pull the prebuilt corpus snapshot (or rebuild via tools/wikidata_hf_import.py)
git clone https://huggingface.co/datasets/EmmaLeonhart/loka
# 3. Generative-citation inference (v5 + cumulative penalty)
python training/infer_with_citations.py \
--checkpoint training/checkpoints/wikidata_v5.pt \
--vocab training/data/vocab.json \
--endpoint http://localhost:3030 \
--max-subjects 50 \
--confidence 0.4 \
--repetition-penalty 3.0 \
--output training/data/generated_v5.nt
# add --post to write predictions back into the store
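In spirit, --post issues an insert for each accepted prediction plus its annotation block. A rough sketch of such a write-back, assuming a SPARQL Update endpoint at /update and using placeholder IRIs; the real script's endpoint path, batching, and serialisation may differ:

```python
import requests

# One prediction plus its annotation block, written as a SPARQL-star update.
# The IRIs are placeholders; the /update path is an assumption about the engine.
UPDATE = """
PREFIX prov: <http://loka.dev/provenance/>
INSERT DATA {
  <http://example.org/S> <http://example.org/P> "X" .
  << <http://example.org/S> <http://example.org/P> "X" >>
      prov:propositionGenerated    true ;
      prov:propositionGeneratedBy  "wikidata_v5" ;
      prov:propositionConfidence   0.43 ;
      prov:propositionInferredFrom << <http://example.org/S> <http://example.org/P1> <http://example.org/O1> >> .
}
"""
requests.post(
    "http://localhost:3030/update",
    data=UPDATE,
    headers={"Content-Type": "application/sparql-update"},
)
```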