Loka — a neuro-symbolic world model

Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus implicit memory (a transformer trained on the same triples, plausible answers). Generated facts write back into the store as RDF-star annotations with propositionInferredFrom citation edges.

RDF-star · SPARQL+ & SPARQL-star · Self-citing inference · v0.4.0 — World-model release

The pivot in one paragraph

Loka started as just the engine: a lean RDF-star triplestore with native vector indexing. Over time the purpose shifted, and that engine is now one half of a neuro-symbolic world model. The other half is a small role-aware transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both expose the same SPARQL+ interface. A query reaches both systems and the caller doesn't pick which one answers; the only tell is the propositionInferredFrom RDF-star edges that thread every model-generated triple back to the context that informed it. The engine retains the Loka name; the project as a whole is becoming Loka.

The two-system loop

┌───────────────────┐
│  Curated triples  │  (Wikidata, philippesaade/wikidata HF parquet)
│     (RDF-star)    │
└─────────┬─────────┘
          ▼
┌───────────────────┐            ┌──────────────────────┐
│    Loka store     │  ────────► │   Training corpus    │
│    (.sdb file)    │   SPARQL+  │ (label-substituted)  │
│                   │   SPARQL-  │                      │
│                   │    star    │                      │
└─────────▲─────────┘            └──────────┬───────────┘
          │                                 ▼
          │                      ┌──────────────────────┐
          │                      │      Role-aware      │
          │                      │     transformer      │
          │                      │  (16M / 44M params)  │
          │                      └──────────┬───────────┘
          │                                 ▼
          │                      ┌──────────────────────┐
          │                      │    Inference loop    │
          │                      │  + cumulative rep.   │
          │                      │    penalty decoder   │
          │                      │  + RDF-star write-   │
          │                      │    back to store     │
          │                      └──────────┬───────────┘
          │                                 ▼
          │                      ┌──────────────────────┐
          └──────────────────────┤ Generated triples +  │
                                 │ propositionInferred  │
                                 │      From edges      │
                                 └──────────────────────┘

The loop is closed. Generated triples land in the store flagged propositionGenerated true. The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output. Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
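
A minimal sketch of what that exclusion can look like, assuming the store answers SPARQL-star queries over HTTP at a /sparql path (the actual extraction tooling may phrase it differently):

import requests

# Hypothetical extraction query: pull asserted triples but skip anything the
# model wrote back, i.e. anything annotated propositionGenerated "true".
EXCLUDE_GENERATED = """
PREFIX prov: <http://loka.dev/provenance/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS {
    << ?s ?p ?o >> prov:propositionGenerated "true"^^xsd:boolean .
  }
}
"""

resp = requests.post(
    "http://localhost:3030/sparql",              # query path is an assumption
    data={"query": EXCLUDE_GENERATED},
    headers={"Accept": "application/sparql-results+json"},
)
rows = resp.json()["results"]["bindings"]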

Generative citation

When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it writes a fixed-shape annotation block. The block's subject is the quoted generated triple. Its objects include the metadata predicates plus one or more propositionInferredFrom edges whose object is another quoted triple — a cited piece of context the prediction was conditioned on:

<S> <P> "X" .
<<S P "X">>  prov:propositionGenerated     "true"^^xsd:boolean .
<<S P "X">>  prov:propositionGeneratedBy   "wikidata_v5" .
<<S P "X">>  prov:propositionConfidence    "0.43"^^xsd:decimal .
<<S P "X">>  prov:propositionInferredFrom  <<S existing_p1 existing_o1>> .
<<S P "X">>  prov:propositionInferredFrom  <<S existing_p2 existing_o2>> .
   # ...one per cited context triple (default 10)
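
Assembled programmatically, the block can be posted back as a SPARQL-star update along these lines; the /update path and the use of INSERT DATA with quoted triples are assumptions here, not details read from infer_with_citations.py:

import requests

PROV = "http://loka.dev/provenance/"
XSD  = "http://www.w3.org/2001/XMLSchema#"

def annotation_block(s, p, o, model_id, confidence, cited):
    """Render the primary triple plus its RDF-star annotation rows.
    `cited` is a list of (subject, predicate, object) context triples."""
    quoted = f'<< <{s}> <{p}> "{o}" >>'
    lines = [
        f'<{s}> <{p}> "{o}" .',
        f'{quoted} <{PROV}propositionGenerated> "true"^^<{XSD}boolean> .',
        f'{quoted} <{PROV}propositionGeneratedBy> "{model_id}" .',
        f'{quoted} <{PROV}propositionConfidence> "{confidence}"^^<{XSD}decimal> .',
    ]
    lines += [
        f'{quoted} <{PROV}propositionInferredFrom> << <{cs}> <{cp}> "{co}" >> .'
        for cs, cp, co in cited
    ]
    return "\n".join(lines)

block = annotation_block(
    "http://example.org/S", "http://example.org/P", "X",
    "wikidata_v5", 0.43,
    [("http://example.org/S", "http://example.org/p1", "o1")],
)
requests.post("http://localhost:3030/update",    # update path is an assumption
              data={"update": "INSERT DATA {\n" + block + "\n}"})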

prov: expands to http://loka.dev/provenance/. This whole namespace is reserved: the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that — corpus stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary triple is written. Verbose names (propositionInferredFrom, not inferredFrom) make the rule scannable by humans and collision-resistant against real-world predicates.
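
The guards are all variations on one predicate check; a minimal sketch of the emit-time version (names and error handling are illustrative, not the actual code):

RESERVED_NS = "http://loka.dev/provenance/"

def guard_predicate(predicate: str) -> str:
    """Emit-time guard: refuse to write a primary triple whose predicate falls
    inside the reserved provenance namespace. The same check screens candidate
    predicates during inference and strips rows at corpus-extraction time."""
    if predicate.startswith(RESERVED_NS):
        raise ValueError(f"reserved provenance predicate in model output: {predicate}")
    return predicate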

Why hallucinated citations aren't a blocker. A fabricated propositionInferredFrom edge is still a transparent RDF-star row pointing at a concrete context triple — auditable, filterable like any generated triple, often informative about what the model thinks the reasoning is. The schema does the work; we don't need elaborate guards.
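
Auditing that graph is just another query. For instance, a sketch of a SPARQL-star query that lists low-confidence generated triples together with the context they cite (the 0.5 cut-off is arbitrary; run it against the same endpoint as the extraction sketch above):

# Generated triples below a chosen confidence, each with its cited context triple.
AUDIT_GENERATED = """
PREFIX prov: <http://loka.dev/provenance/>
PREFIX xsd:  <http://www.w3.org/2001/XMLSchema#>
SELECT ?s ?p ?o ?conf ?cited WHERE {
  << ?s ?p ?o >> prov:propositionGenerated    "true"^^xsd:boolean ;
                 prov:propositionConfidence   ?conf ;
                 prov:propositionInferredFrom ?cited .
  FILTER(?conf < 0.5)
}
"""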

Cumulative repetition penalty

Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate during greedy decoding into fillers like "of of of of" or "museum museum". We don't fix this at training time. We fix it at decode time:

Every emission of token t increments a per-token counter. At each later masked position we divide the logit of every previously emitted token by repetition_penalty^count. Three emissions of "of" at penalty 3.0 push its divisor to 27 (3^3), reliably dropping it below the per-token floor and breaking the cascade. A genuinely needed re-use can still win on its first repeat; only loops collapse.
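
In decoder terms the bookkeeping is tiny. A numpy sketch of the idea as described, not the project's decoding code (the per-token floor and candidate masking are omitted):

import numpy as np
from collections import Counter

def penalized_argmax(logits: np.ndarray, emitted: Counter, penalty: float = 3.0) -> int:
    """Greedy step with a cumulative repetition penalty: every prior emission of a
    token divides its logit by `penalty` once more, i.e. by penalty ** count."""
    logits = np.array(logits, dtype=np.float64)
    for tok, count in emitted.items():
        # Assumes the logits being penalised are positive; a sign-aware variant
        # would multiply negative logits instead of dividing them.
        logits[tok] /= penalty ** count          # three emissions at 3.0 -> divisor 27
    choice = int(np.argmax(logits))
    emitted[choice] += 1                         # counter persists across positions
    return choice

The Counter persists across all masked positions of a single generation, which is what makes the penalty cumulative rather than per-step.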

Same checkpoint, different decoder

Subject / predicate                   | No penalty                                | Cumulative penalty 3.0
--------------------------------------|-------------------------------------------|-----------------------------------------------------------------------
Comtesse de Die / educated at         | university of of of of of of of           | university of halle (correctly identifies Halle, where she studied)
canton of Romilly / Commons category  | canton of of sur sur                      | canton of (clean truncation)
Zudar / area                          | (didn't pass threshold)                   | 33 (numeric — model picked up that area is a number)
Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces)

Trained checkpoints

All checkpoints train on a 5,055,385-triple slice of Wikidata streamed from philippesaade/wikidata, with English labels substituted for QIDs/PIDs. Pull a snapshot from huggingface.co/datasets/EmmaLeonhart/loka:

Model           | Architecture                 | Final ppl | Notes
----------------|------------------------------|-----------|--------------------------------------------------------------------------
wikidata_v3.pt  | 16M params (256 / 4 layers)  | 53.43     | Pre-cleanup corpus — misleadingly low ppl from memorising datatype-suffix tokens.
wikidata_v4.pt  | 16M params (256 / 4 layers)  | 92.48     | Cleaned corpus. Higher ppl, qualitatively much better content.
wikidata_v5.pt  | 44M params (512 / 6 layers)  | 84.85     | Bigger model. Beats v4's final ppl by epoch 4. Picks specific real-world tokens (halle, 33, kosmos 116) where v4 fell back to common connectors.

from huggingface_hub import hf_hub_download

ckpt = hf_hub_download(
    repo_id="EmmaLeonhart/loka",
    repo_type="dataset",
    filename="checkpoints/wikidata_v5.pt",
    revision="v5",
)

Run the full loop

# 1. Engine
cargo build --release -p loka-cli
./target/release/loka serve --port 3030 &

# 2. Pull the prebuilt corpus snapshot (or rebuild via tools/wikidata_hf_import.py)
git clone https://huggingface.co/datasets/EmmaLeonhart/loka

# 3. Generative-citation inference (v5 + cumulative penalty)
python training/infer_with_citations.py \
    --checkpoint training/checkpoints/wikidata_v5.pt \
    --vocab training/data/vocab.json \
    --endpoint http://localhost:3030 \
    --max-subjects 50 \
    --confidence 0.4 \
    --repetition-penalty 3.0 \
    --output training/data/generated_v5.nt
# add --post to write predictions back into the store

Read more

The engine remains Loka in code (loka-core, loka-hnsw, etc.). The project — engine + corpus + transformer + inference layer — is Loka.