Two systems sharing one query language. Explicit memory (an RDF-star triplestore, exact answers) plus
implicit memory (a transformer trained on the same triples, plausible answers). Generated facts
write back into the store as RDF-star annotations with propositionInferredFrom citation edges.
Loka started as a lean RDF-star triplestore with native vector indexing. Over time the purpose
shifted: that engine is now one half of a neuro-symbolic world model. The other half is a small role-aware
transformer trained from scratch on the same triples (with English labels substituted for opaque QIDs/PIDs). Both
expose the same SPARQL+ interface. A query reaches both systems, and the caller never picks which one answered;
the only tell is the propositionInferredFrom RDF-star edges that thread every model-generated triple back to the
context that informed it. The engine and the project as a whole both carry the Loka name.
The loop is closed. Generated triples land in the store flagged propositionGenerated true.
The next training-corpus extraction's SPARQL-star FILTER excludes them, so the model never trains on its own output.
Inference can be re-run repeatedly to grow the citation graph without polluting the training distribution.
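A minimal sketch of that exclusion, assuming a standard SPARQL-over-HTTP query endpoint; the /sparql path and the exact query shape are assumptions, and the real extraction code may differ:

```python
import requests

# Illustrative corpus-extraction query: keep only triples that carry no
# propositionGenerated annotation, so model output never re-enters training.
QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o WHERE {
  ?s ?p ?o .
  FILTER NOT EXISTS { << ?s ?p ?o >> prov:propositionGenerated true }
}
"""
resp = requests.post(
    "http://localhost:3030/sparql",            # assumed query path on the local engine
    data={"query": QUERY},
    headers={"Accept": "application/sparql-results+json"},
)
for row in resp.json()["results"]["bindings"]:
    print(row["s"]["value"], row["p"]["value"], row["o"]["value"])
```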
When the inference layer accepts a candidate (S, P) and emits a predicted object "X", it
writes a fixed-shape annotation block. The block's subject is the quoted generated triple; its predicates are the
provenance metadata plus one or more propositionInferredFrom edges whose objects are other quoted
triples, the cited pieces of context the prediction was conditioned on:
<S> <P> "X" .
<< <S> <P> "X" >> prov:propositionGenerated    "true"^^xsd:boolean .
<< <S> <P> "X" >> prov:propositionGeneratedBy  "wikidata_v5" .
<< <S> <P> "X" >> prov:propositionConfidence   "0.43"^^xsd:decimal .
<< <S> <P> "X" >> prov:propositionInferredFrom << <S> <existing_p1> <existing_o1> >> .
<< <S> <P> "X" >> prov:propositionInferredFrom << <S> <existing_p2> <existing_o2> >> .
# ... one propositionInferredFrom edge per cited context triple (default 10)
prov: expands to http://loka.dev/provenance/. This whole namespace is reserved:
the world model never sees, proposes, or emits one of its predicates. Three independent guards enforce that: corpus
stripping at extraction time, candidate-predicate filtering during inference, and an emit-time guard before each primary
triple is written. Verbose names (propositionInferredFrom, not inferredFrom) keep the rule
scannable by humans and collision-resistant against real-world predicates.
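The guard itself can be a one-liner applied at all three stages. A minimal sketch, assuming predicates arrive as expanded IRIs; the function name and call sites are illustrative, not the repo's actual code:

```python
RESERVED_PROVENANCE_NS = "http://loka.dev/provenance/"

def reject_reserved(predicate: str, stage: str) -> str:
    """Raise if a predicate falls inside the reserved provenance namespace.

    Applied at corpus extraction, candidate-predicate filtering, and emit time,
    so a reserved predicate can never reach the model or come back out of it.
    """
    if predicate.startswith(RESERVED_PROVENANCE_NS):
        raise ValueError(f"{stage}: reserved provenance predicate {predicate!r}")
    return predicate

# Example: an emit-time check on a model-proposed predicate.
reject_reserved("http://www.wikidata.org/prop/direct/P69", stage="emit")            # passes
# reject_reserved("http://loka.dev/provenance/propositionGenerated", stage="emit")  # raises
```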
Each propositionInferredFrom edge is still a transparent RDF-star row pointing at a
concrete context triple: auditable, filterable like any other generated triple, and often informative about what the
model's implicit reasoning was. The schema does the work; nothing more elaborate is needed.
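Because the citations are ordinary RDF-star rows, auditing them is a single query. A hedged example of what such an audit could look like, reusing the same prefixes as the extraction sketch above (the store's exact syntax expectations may differ):

```python
# List every generated triple with its confidence and each cited context triple.
AUDIT_QUERY = """
PREFIX prov: <http://loka.dev/provenance/>
SELECT ?s ?p ?o ?confidence ?context WHERE {
  << ?s ?p ?o >> prov:propositionGenerated    true ;
                 prov:propositionConfidence   ?confidence ;
                 prov:propositionInferredFrom ?context .
}
ORDER BY DESC(?confidence)
"""
```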
Masked-S/P/O training produces models that "know" the answer category (university, museum, https-URL) but degenerate
during greedy decoding into fillers like of of of of or museum museum. We don't fix this at
training time. We fix it at decode time:
Every emission of token t increments a per-token counter. At each later masked position we divide the logit
of every already-emitted token by repetition_penalty^count. Three emissions of "of" at penalty 3.0
raise its divisor to 27, reliably dropping it below the per-token floor and breaking the cascade. A genuinely needed
re-use can still win on its first repeat; only loops collapse.
| Subject / predicate | No penalty | Cumulative penalty 3.0 |
|---|---|---|
| Comtesse de Die / educated at | university of of of of of of of | university of halle (correctly identifies Halle, where she studied) |
| canton of Romilly / Commons category | canton of of sur sur | canton of (clean truncation) |
| Zudar / area | (didn't pass threshold) | 33 (numeric: the model picked up that area is a number) |
| Abbas Mirza / has works in collection | 1 http www w3 org 2001 xmlschema decimal | metropolitan museum of museum (the Met genuinely holds Abbas Mirza pieces) |
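Below is a minimal, self-contained sketch of the cumulative penalty. It operates on softmax scores rather than raw logits so that division always shrinks the value; the names and the 0.05 floor are illustrative, not the values infer_with_citations.py actually uses:

```python
import torch
from collections import Counter

def penalize(probs: torch.Tensor, emitted: Counter, penalty: float = 3.0) -> torch.Tensor:
    """Divide each previously emitted token's score by penalty**count."""
    out = probs.clone()
    for tok, count in emitted.items():
        out[tok] = out[tok] / (penalty ** count)
    return out

# Toy illustration: token 0 ("of") has been emitted three times, so its score is
# divided by 3.0**3 = 27 and falls below a 0.05 per-token floor.
probs = torch.tensor([0.60, 0.25, 0.15])
emitted = Counter({0: 3})
penalized = penalize(probs, emitted)
print(penalized)                              # tensor([0.0222, 0.2500, 0.1500])
floor = 0.05
best = int(penalized.argmax())
print(best, bool(penalized[best] >= floor))   # 1 True -> the "of" loop is broken
```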
All checkpoints train on a 5,055,385-triple slice of Wikidata streamed from philippesaade/wikidata, with English labels substituted for QIDs/PIDs.
| Model | Architecture | Final ppl | Notes |
|---|---|---|---|
| wikidata_v3.pt | 16M params (256 / 4 layers) | 53.43 | Pre-cleanup corpus: misleadingly low ppl from memorising datatype-suffix tokens. |
| wikidata_v4.pt | 16M params (256 / 4 layers) | 92.48 | Cleaned corpus. Higher ppl, qualitatively much better content. |
| wikidata_v5.pt | 44M params (512 / 6 layers) | 84.85 | Bigger model. Beats v4's final ppl by epoch 4. Picks specific real-world tokens (halle, 33, kosmos 116) where v4 fell back to common connectors. |

Pull a snapshot from huggingface.co/datasets/EmmaLeonhart/loka:
from huggingface_hub import hf_hub_download
ckpt = hf_hub_download(
repo_id="EmmaLeonhart/loka",
repo_type="dataset",
filename="checkpoints/wikidata_v5.pt",
revision="v5",
)
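Continuing from the snippet above, one way to drop the downloaded file where the quickstart below expects it; the target path just mirrors the --checkpoint argument, so adjust it if your layout differs:

```python
import shutil
from pathlib import Path

# Copy the cached download to the path used by the quickstart's --checkpoint flag.
target = Path("training/checkpoints/wikidata_v5.pt")
target.parent.mkdir(parents=True, exist_ok=True)
shutil.copy(ckpt, target)
```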
# 1. Engine
cargo build --release -p loka-cli
./target/release/loka serve --port 3030 &
# 2. Pull the prebuilt corpus snapshot (or rebuild via tools/wikidata_hf_import.py)
git clone https://huggingface.co/datasets/EmmaLeonhart/loka
# 3. Generative-citation inference (v5 + cumulative penalty)
python training/infer_with_citations.py \
--checkpoint training/checkpoints/wikidata_v5.pt \
--vocab training/data/vocab.json \
--endpoint http://localhost:3030 \
--max-subjects 50 \
--confidence 0.4 \
--repetition-penalty 3.0 \
--output training/data/generated_v5.nt
# add --post to write predictions back into the store
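In spirit, --post issues an insert for each accepted prediction plus its annotation block. A rough sketch of such a write-back, assuming a SPARQL Update endpoint at /update and using placeholder IRIs; the real script's endpoint path, batching, and serialisation may differ:

```python
import requests

# One prediction plus its annotation block, written as a SPARQL-star update.
# The IRIs are placeholders; the /update path is an assumption about the engine.
UPDATE = """
PREFIX prov: <http://loka.dev/provenance/>
INSERT DATA {
  <http://example.org/S> <http://example.org/P> "X" .
  << <http://example.org/S> <http://example.org/P> "X" >>
      prov:propositionGenerated    true ;
      prov:propositionGeneratedBy  "wikidata_v5" ;
      prov:propositionConfidence   0.43 ;
      prov:propositionInferredFrom << <http://example.org/S> <http://example.org/P1> <http://example.org/O1> >> .
}
"""
requests.post(
    "http://localhost:3030/update",
    data=UPDATE,
    headers={"Content-Type": "application/sparql-update"},
)
```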