Pseudo-Table Discovery

Characteristic Sets

When many nodes share the same set of predicates (for example, all Person nodes having name, age, and email), they form a "characteristic set." Such a group behaves like rows in a relational table, even though no table was ever declared.
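As a rough illustration of the idea, the sketch below groups subjects by the exact set of predicates they carry. It is a minimal example, not Loka's implementation: the function name `characteristic_sets` and the sample triples are hypothetical, and exact-set grouping is the simplest form of the concept (the Discovery Criteria below describe a looser, statistical grouping).

```python
from collections import defaultdict


def characteristic_sets(triples):
    """Group subjects by the exact set of predicates they carry.

    `triples` is an iterable of (subject, predicate, object) tuples.
    Returns a dict mapping frozenset(predicates) -> list of subjects.
    """
    predicates_by_subject = defaultdict(set)
    for subject, predicate, _obj in triples:
        predicates_by_subject[subject].add(predicate)

    subjects_by_set = defaultdict(list)
    for subject, predicates in predicates_by_subject.items():
        subjects_by_set[frozenset(predicates)].append(subject)
    return dict(subjects_by_set)


# Hypothetical sample data: two Person-like nodes and one Company-like node.
triples = [
    ("alice", "name", "Alice"), ("alice", "age", 34), ("alice", "email", "alice@example.org"),
    ("bob", "name", "Bob"), ("bob", "age", 27), ("bob", "email", "bob@example.org"),
    ("acme", "name", "Acme"), ("acme", "founded", 1999),
]
groups = characteristic_sets(triples)
```

Here alice and bob land in the same group because both carry exactly name, age, and email, while acme forms its own group.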

Loka discovers these groups automatically during the background maintenance cycle and materializes pseudo-tables: columnar indexes that accelerate the SQL-like portions of SPARQL execution.

Discovery Criteria

A group qualifies when:

A statistically significant cluster of nodes (p < 0.05 vs. random co-occurrence) shares 5+ properties
Each of those properties is held by ≥50% of the group
The group has enough members to justify the overhead (minimum 10 nodes)

Statistical significance testing filters out spurious clusters that appear by chance. Frequency-only thresholds would produce false positives on noisy data.
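The criteria above can be sketched as a simple check. This is an illustration under stated assumptions: the constants mirror the thresholds listed above, but the function names are hypothetical, and the p-value is taken as an input because the text does not specify which significance test is used.

```python
from collections import Counter

# Thresholds taken from the criteria listed above.
MIN_SHARED_PROPERTIES = 5   # "shares 5+ properties"
MIN_COVERAGE = 0.50         # "held by ≥50% of the group"
MIN_MEMBERS = 10            # "minimum 10 nodes"
MAX_P_VALUE = 0.05          # "p < 0.05 vs. random co-occurrence"


def core_properties(members):
    """Return the properties held by at least MIN_COVERAGE of the group.

    `members` maps each subject to the set of predicates it carries.
    """
    counts = Counter(p for predicates in members.values() for p in predicates)
    return {p for p, n in counts.items() if n / len(members) >= MIN_COVERAGE}


def qualifies(members, p_value):
    """Apply the three criteria; `p_value` comes from a separate,
    unspecified significance test (an assumption of this sketch)."""
    if len(members) < MIN_MEMBERS:
        return False
    if p_value >= MAX_P_VALUE:
        return False
    return len(core_properties(members)) >= MIN_SHARED_PROPERTIES


# Hypothetical group: 12 nodes sharing 5 properties, one node with a rare extra.
core = {"name", "age", "email", "phone", "city"}
people = {f"person{i}": set(core) for i in range(12)}
people["person0"].add("fax")
```

With this data, `qualifies(people, 0.01)` accepts the group, while the same group with a p-value of 0.20 is rejected even though its frequency thresholds are met.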

Segment-Level Zonemaps

Rows are stored in segments of ~2048 rows. Each segment maintains per-column statistics (min, max, null count, distinct count). When a query filter doesn't overlap a segment's min/max range, the entire segment is skipped without examining any rows.

Example: if a segment's maximum age is 30 and the query asks for ?age > 50, the segment is pruned. This is the same min/max zonemap technique that DuckDB uses, applied to RDF.
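The pruning rule can be sketched for a single column and a greater-than filter. The `Segment` class and function names below are hypothetical, the segment size is shrunk to 4 so the example stays readable, and null and distinct counts are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    values: list      # one column's values for the rows in this segment
    min_value: int    # zonemap statistics kept alongside the data
    max_value: int


def build_segments(values, segment_size=2048):
    """Split a column into fixed-size segments and record min/max per segment."""
    segments = []
    for start in range(0, len(values), segment_size):
        chunk = values[start:start + segment_size]
        segments.append(Segment(chunk, min(chunk), max(chunk)))
    return segments


def scan_greater_than(segments, threshold):
    """Return (matching values, number of segments actually scanned)."""
    matches, scanned = [], 0
    for segment in segments:
        if segment.max_value <= threshold:
            continue  # no row can satisfy the filter: skip the whole segment
        scanned += 1
        matches.extend(v for v in segment.values if v > threshold)
    return matches, scanned


# Hypothetical ages; the first segment has max 30, the second has max 70.
ages = [22, 30, 25, 18, 51, 64, 47, 70]
segments = build_segments(ages, segment_size=4)
matches, scanned = scan_greater_than(segments, 50)
```

For the filter ?age > 50, the first segment is skipped on its maximum alone, so only one of the two segments is read.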

Cliff Steepness: Data Health Metric

The distribution of property coverage across a pseudo-table reveals data quality:

Sharp cliff (steepness > 10): 10 properties at 100%, everything else at <10%. Clean, consistent schema.
Gradual slope (steepness 1–3): Properties spread across 20–80% coverage. Messy, inconsistent schema.
No cliff (steepness < 1): No clear core/tail separation. The pseudo-table may not be useful.

This metric is exposed through the health endpoint and Loka Studio, making pseudo-table discovery double as a data quality audit.

Mixed Resolution

When a query mixes pseudo-table columns with properties not in the pseudo-table, the planner uses a two-phase strategy:

1. Columnar phase: Resolve pseudo-table columns first (vectorized scan, fast)
2. Join phase: Use the bound subjects from step 1 to do targeted SPO lookups for remaining properties

The columnar phase produces bound subjects cheaply. The join phase is fast because the subject is already bound, so each lookup is a point lookup. Neither phase requires scanning the full triple store.
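A minimal sketch of the two phases follows. The data layout is an assumption made for illustration: the pseudo-table is modeled as a dict of parallel column lists, the SPO index as a dict keyed by (subject, predicate), and `resolve_mixed` is a hypothetical name. A plain Python loop stands in for the vectorized scan a real engine would use.

```python
def resolve_mixed(pseudo_table, spo_index, column, keep, extra_property):
    """Filter on a pseudo-table column, then fetch one non-table property.

    Phase 1 scans only the requested column; phase 2 does one point
    lookup per surviving subject instead of scanning all triples.
    """
    # Columnar phase: scan the column and collect the subjects that pass.
    bound_subjects = [
        subject
        for subject, value in zip(pseudo_table["subject"], pseudo_table[column])
        if keep(value)
    ]

    # Join phase: subject is already bound, so each lookup is keyed directly.
    results = []
    for subject in bound_subjects:
        for obj in spo_index.get((subject, extra_property), []):
            results.append((subject, obj))
    return results


# Hypothetical data: age lives in the pseudo-table, nickname does not.
pseudo_table = {
    "subject": ["alice", "bob", "carol"],
    "age": [34, 27, 61],
}
spo_index = {
    ("alice", "nickname"): ["Al"],
    ("bob", "nickname"): ["Bobby"],
    ("carol", "nickname"): ["Caz"],
}
rows = resolve_mixed(pseudo_table, spo_index, "age", lambda age: age > 30, "nickname")
```

Only alice and carol survive the age filter, so the join phase performs two point lookups and never touches bob's triples.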