Tutorials

Eighteen worked examples that go from your very first run all the way to audited pipelines and structural analysis. Each one starts with a real situation, then shows the code to handle it. Read top to bottom for a smooth ride, or skip to whichever box your problem fits in.

Start here

What is the engine, in one paragraph

The engine takes any kind of data (a sentence, a photo, a list of weather readings) and breaks it into the smallest repeating pieces it can find. It remembers each unique piece in a notebook called a ledger, then writes down the order it saw them in. From that, it can do useful things: tell you if two files are secretly the same, count how often each piece appears, spot when a pattern changes, or rebuild your original data exactly.

You do not need to be a math person to use it. The tutorials below show practical examples and explain what each line of code is for.

Three words you will see a lot

Bytes: the engine reads everything as a stream of bytes (numbers from 0 to 255). For text we use .encode("utf-8"); for files we use Path.read_bytes(); for numbers we pack them into bytes with bytes([1, 2, 3]).
Ledger: the engine's notebook. It is a single file on disk (you choose the name) that holds every piece the engine has seen, in the order it saw them. Open the same file later and the engine picks up where it left off.
Seed: a small number the engine returns every time you give it data. Same data, same seed. Different data, almost always a different seed. Think of it as a quick label for "here is what I saw."

Setup, once

Follow Local Python Wheel in the docs to install and activate the engine. After that, every tutorial below is just a plain Python script you can copy and run. Quick check that everything is wired up:

python -m ufm status
# → {"active": true, "tier": "standard", "days_remaining": 29, ...}

python -c "import ufm; print(ufm.VERSION)"
# → 3.0-rust

Stuck on setup? The docs page has a download command, version detection, install troubleshooting, and a recommended Python version. Come back here once status prints active: true.

1. Find the most-common letters in a poem

Beginner

The situation:you have a poem (or a paragraph, or a product review) and you want to know which letters and spaces show up most often. Maybe you are designing a phone keypad, or you want to know if your writing leans heavily on the letter "e".

# tutorial_01_letters.py
import ufm

poem = (
    "Tyger Tyger, burning bright,\n"
    "In the forests of the night;\n"
    "What immortal hand or eye,\n"
    "Could frame thy fearful symmetry?"
)

# The engine reads bytes, not strings, so we encode the text first.
data = poem.encode("utf-8")

# Open a ledger (a notebook on disk) and feed it our poem.
with ufm.InvariantIdentityEngine(storage_path="poem-ledger.bin") as eng:
    seed, status = eng.process(data)
    print(f"the engine called this poem: seed={seed} ({status})")

    # Pull the underlying notebook so we can ask questions of it.
    notebook = eng.ledger()

    # The most-common pieces, with how many times each appeared.
    top = notebook.top_n_primitives(5)
    print("\nfive most common pieces:")
    for piece_id, count in top:
        print(f"  piece #{piece_id}  appeared {count} times")

    print(f"\ntotal unique pieces:  {notebook.primitive_count}")
    print(f"total pieces written: {notebook.timeline_length}")

What you should see:

the engine called this poem: seed=4 (NOVELTY)

five most common pieces:
  piece #2   appeared 18 times
  piece #5   appeared 8 times
  piece #1   appeared 6 times
  piece #11  appeared 5 times
  piece #14  appeared 5 times

total unique pieces:  21
total pieces written: 121

For text, each "piece" is usually a single character (so the most-common piece is almost always the space character, then the letter "e" in English). The engine assigns piece numbers in the order it first sees them, so the IDs are not fixed across runs of different text.

Try changing one thing

Replace the poem with a paragraph from a book and watch the unique-piece count climb.
Run the script twice in a row. The second run prints REPLAY: the engine has already seen this exact poem.
Delete poem-ledger.bin between runs to start with a fresh notebook.

2. Save your work and pick up tomorrow

Beginner

The situation: you are slowly feeding a long list of documents into the engine. You do not want to start over every day. The ledger handles this for you: it saves to disk automatically, and you can reopen the same file as many times as you like.

# tutorial_02_notebook.py
import shutil
from pathlib import Path
import ufm

LEDGER = "diary-ledger.bin"

# Day 1: write down two thoughts.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    seed_a, _ = eng.process("Today I made a paper airplane.".encode("utf-8"))
    seed_b, _ = eng.process("It flew further than my brother's.".encode("utf-8"))
    print(f"day 1 seeds: {seed_a}, {seed_b}")
# When the "with" block ends, the engine saves the notebook to disk.

# Day 2: open the SAME file and add another thought.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    seed_c, _ = eng.process("I drew a map of my room.".encode("utf-8"))
    print(f"day 2 seed: {seed_c}")

    summary = eng.ledger_summary()
    print(f"the notebook now has {summary['timeline_length']} entries")

# Day 3: just look at what is in there. No new entries added.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    eng.ledger_summary()  # forces the engine to load the file.
    notebook = eng.ledger()
    print(f"unique pieces seen across all three days: {notebook.primitive_count}")
    print(f"top three: {notebook.top_n_primitives(3)}")

# Make a backup copy you can email or archive.
shutil.copyfile(LEDGER, "diary-backup.ufmr")
print(f"\nbackup written: {Path('diary-backup.ufmr').stat().st_size} bytes")

Things to know:

The storage_path argument is the notebook's filename. Pass the same name and you get the same notebook back.
You do not have to call save(). The engine handles it when the with block ends. You can call it manually if you want to save mid-script.
The notebook file is just a regular file. Copy it, email it, put it in version control, or zip it up for a backup.

Heads up: the notebook filename has to live inside your current folder. The engine refuses absolute paths and .. traversals, so pass a plain filename like "diary-ledger.bin" and you will be fine.

3. Tell if a file was tampered with

Beginner

The situation: someone hands you a file and claims it has not been changed since last Tuesday. How do you check? You could compare every byte, but that is slow and only works if you have the original. The engine offers a quicker check: ask it to ingest and rebuild the file. If the rebuild matches the original byte for byte, the engine returns True. If even one byte changed, it returns False.

# tutorial_03_tamper.py
import ufm

samples = [
    b"a tiny file",
    "characters with accents like é and ñ".encode("utf-8"),
    bytes(range(256)),                       # every byte value 0 to 255
    b"\x00" * 1024,                         # 1 KB of zero bytes
    b"\xff" * 1024,                         # 1 KB of 255s
]

with ufm.InvariantIdentityEngine(storage_path="check-ledger.bin") as eng:
    for index, data in enumerate(samples):
        # reconstruct() is the round-trip check. True means the engine can
        # rebuild your data exactly. False means something is off.
        ok = eng.reconstruct(data)
        size = len(data)
        verdict = "intact" if ok else "MISMATCH"
        print(f"sample {index}: {size:>5} bytes -> {verdict}")

What you should see:

sample 0:    11 bytes -> intact
sample 1:    37 bytes -> intact
sample 2:   256 bytes -> intact
sample 3:  1024 bytes -> intact
sample 4:  1024 bytes -> intact

Every sample comes back intact. If you ever see MISMATCH on a file the engine ingested, that is a real problem worth reporting.

Why it matters

Imagine you publish a contract and someone receives a copy weeks later. They run reconstruct() on their copy. If the engine rebuilds it exactly, the file is identical to the one you ingested. No silent edit slipped through.

4. Fingerprint a photo or any binary file

Everyday scenario

The situation: you have a folder of images and you want a quick label per file so you can spot duplicates. The engine gives every file a seed: a small number that depends on the file's contents. Two byte-identical files always get the same seed in the same notebook.

# tutorial_04_fingerprint.py
from pathlib import Path
import ufm

photo = Path("./family-photo.jpg")
data = photo.read_bytes()
print(f"loaded {photo.name} ({len(data):,} bytes)")

with ufm.InvariantIdentityEngine(storage_path="photos-ledger.bin") as eng:
    # First time the engine sees this file.
    seed_first, status_first = eng.process(data)
    print(f"first time:    seed={seed_first}  ({status_first})")

    # Same bytes again. The engine recognises it.
    seed_second, status_second = eng.process(data)
    same = seed_second == seed_first
    print(f"second time:   seed={seed_second}  ({status_second})  same as before: {same}")

    # Now flip a single bit (one byte) and ingest the edited copy.
    edited = bytearray(data)
    edited[0] ^= 0x01
    seed_edited, status_edited = eng.process(bytes(edited))
    different = seed_edited != seed_first
    print(f"after edit:    seed={seed_edited}  ({status_edited})  different now: {different}")

The first run prints NOVELTY (new file). The second prints REPLAY because the engine already has it. The edited copy almost always gets a different seed.

What about hashing?Tools like SHA-256 give a unique-but-meaningless string per file. The engine's seed plays the same role for "are these two files the same" checks, and the same notebook also lets you ask follow-up questions like "how similar are they?" (tutorial 6).

5. Catalog a list of small records

Everyday scenario

The situation: you have a list of small records (a daily reading from a sensor, a row in a spreadsheet, a product feature flag). You want each record to have its own ID and you want to know which values appear most often across the whole catalog.

The trick is to pack each record into a fixed number of bytes so the engine sees them all at the same scale. In the example below, each record is four bytes: temperature, humidity, wind, cloud cover.

# tutorial_05_records.py
import ufm

# Each record packs four small numbers (0 to 255) into four bytes.
# Field 1: temperature step (0 = -50 C ... 100 = 50 C)
# Field 2: humidity percent (0 to 100)
# Field 3: wind km/h (0 to 150)
# Field 4: cloud cover percent (0 to 100)
records = [
    bytes([72, 65,  8, 20]),   # warm, dry, light wind, mostly clear
    bytes([72, 65, 12, 25]),   # next day: very similar
    bytes([55, 90, 30, 95]),   # cooler, humid, windy, overcast
    bytes([60, 88, 28, 90]),
    bytes([58, 92, 35, 99]),
    bytes([72, 68, 10, 22]),   # sunny again
]

with ufm.InvariantIdentityEngine(storage_path="weather-ledger.bin") as eng:
    # process_batch is the bulk version of process(). One ID per record.
    results = eng.process_batch(records)
    seeds = [s for s, _ in results]
    print(f"per-day IDs: {seeds}")

    notebook = eng.ledger()
    print(f"unique values across all fields: {notebook.primitive_count}")
    print(f"five most common values:         {notebook.top_n_primitives(5)}")

    # Pull back any specific day by its ID.
    day_three = list(eng.replay(seeds[2]))
    print(f"day 3 bytes: {bytes(day_three[0]).hex()}")

Each day is one entry. The top_n_primitiveslist answers questions like "which humidity reading came up most?" because each byte stays its own unit. To pull back a specific day, save its seed and call replay(seed) later.

Why fixed-size records? If your records vary wildly in length (a tweet vs a novel), the engine still works, but the per-byte counts get harder to interpret. Packing each record into the same width keeps your reports easy to read.

6. See how similar two files are

Everyday scenario

The situation:you have two essays, two images, or two versions of a document, and you want a single number for "how much do these have in common?" The engine builds a notebook for each file, then reports how much overlap there is.

# tutorial_06_similar.py
from pathlib import Path
import ufm

def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]

essay_a = Path("./essay-original.txt").read_bytes()
essay_b = Path("./essay-revised.txt").read_bytes()

# Build one notebook per essay (in memory, no disk file needed).
notebook_a = ufm.ingest_raw(to_bits(essay_a), symbol_length_mode="auto_curve")
notebook_b = ufm.ingest_raw(to_bits(essay_b), symbol_length_mode="auto_curve")

result = ufm.ledger_compare(notebook_a, notebook_b)

# A number from 0 to 1. 1 means identical pieces; 0 means nothing in common.
similarity = result["jaccard"]
print(f"similarity:        {similarity:.2%}")
print(f"shared pieces:     {result['shared_primitives']}")
print(f"only in essay A:   {result['only_a']}")
print(f"only in essay B:   {result['only_b']}")

if similarity > 0.9:
    print("\nverdict: nearly identical")
elif similarity > 0.5:
    print("\nverdict: substantial overlap, lots of shared text")
elif similarity > 0.2:
    print("\nverdict: some common ideas")
else:
    print("\nverdict: mostly different")

Same idea works for any two files. Compare two photos to see how much of the same visual content they share. Compare two CSV exports to spot how much the numbers overlap. Compare two app log files to see if the system is doing the same thing day to day.

7. Spot fake differences (line endings, BOM)

Everyday scenario

The situation: a teammate sends you a file. Your diff tool screams that every line is different. You squint and the text looks the same. The likely culprit is a fake difference: the file uses Windows line endings instead of Linux ones, or it starts with an invisible byte-order mark.

The engine has a layer that recognises these "noise" differences and tells you whether the real content matches.

# tutorial_07_fake_differences.py
import ufm

# Same content, different line endings.
linux_text   = b"line one\nline two\nline three\n"
windows_text = b"line one\r\nline two\r\nline three\r\n"

# Same content, but the second has an invisible byte-order mark up front.
plain     = "ascii content".encode("utf-8")
with_bom  = b"\xef\xbb\xbf" + plain

pipeline = ufm.SemanticDecisionPipeline("noise-ledger.jsonl")

cases = [
    ("Linux vs Windows line endings", linux_text, windows_text),
    ("Plain vs invisible byte-order mark", plain, with_bom),
]
for label, source, target in cases:
    result = pipeline.run_with_policy(
        source, target,
        # Tell the pipeline which "fake" classes to forgive.
        enabled_noise_classes=["line_ending_crlf", "bom_utf8"],
        strict_allowlist=True,
    )
    print(f"--- {label} ---")
    if result["converges"]:
        print("  verdict: same content, fake difference (line endings or BOM)")
    else:
        print("  verdict: real difference, look closer")
    print(f"  classified deltas: {len(result['noise_units'])}")
    print(f"  audit fingerprint: {result['decision_hash'][:16]}...")

When converges is True, every byte-level difference was fully accounted for by the noise classes you allowed. The two files are the same in every way that matters. When it is False, there is a real difference the engine could not explain away.

What other "fake" classes exist? Quite a few: text encoding flips, base64 wrapping, JSON pretty-printing (whitespace-only), and more. The full list lives at the API endpoint /v1/noise/capabilities. Pass them via enabled_noise_classes.

8. Choose how the engine reads your data

Going deeper

The situation: the engine breaks your data into pieces of a chosen size. The default size is auto-picked, and that is usually the right move. But if your data has a known unit (one byte per ASCII character, two bytes per UTF-16 character, four bytes per packed number), you can tell the engine to use exactly that size and the analytics line up perfectly with your data.

# tutorial_08_chunk_size.py
import ufm

# A pattern that repeats every three bytes: "ABC" 100 times.
pattern = b"ABC" * 100

print(f"{'mode':<14} {'chunk size':>10} {'unique pieces':>14} {'novelty rate':>14}")
for mode in ["auto_curve", "entropy", "fixed8", "fixed16", "fixed24"]:
    info = ufm.ufm_signature(pattern, symbol_length_mode=mode)
    print(
        f"{mode:<14} "
        f"{info['symbol_length']:>10} "
        f"{info['primitive_count']:>14} "
        f"{info['discovery_rate']:>14.4f}"
    )

What you should see:

mode           chunk size  unique pieces   novelty rate
auto_curve             24              1         0.0033
entropy                 8              3         0.0100
fixed8                  8              3         0.0100
fixed16                16              6         0.0100
fixed24                24              1         0.0033

Notice what changed. auto_curve spotted the 24-bit repeating unit and concluded there is only ONE distinct piece. With fixed8 we forced the engine to read one byte at a time, so it found three distinct bytes (A, B, C). With fixed16 we forced two-byte chunks, which cut the pattern in awkward places.

Pick a mode by what your data looks like

You have no idea: use auto_curve. It scans the data and picks for you.
Plain English text or code: fixed8 matches one byte per character.
Records with a fixed field width: use Fixed(N) where N is the bit width of one record. Tutorial 5 uses this.
UTF-16 text or two-byte values: fixed16.

9. Find a hidden rhythm in numbers

Going deeper

The situation: you have a long list of measurements over time (heart rate every second, web traffic every hour, temperature every day). You suspect there is a repeating pattern. How often does it repeat?

The engine's notebook can answer that with acf(short for autocorrelation function). Think of it as a rhythm meter: feed in the data, ask "how strongly does each step look like the step a few back, a dozen back, a hundred back?" Big numbers at a particular distance mean the pattern repeats every that-many steps.

# tutorial_09_rhythm.py
import ufm

def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]

# A pretend hourly reading: a 24-step shape repeated for 50 days.
one_day = bytes(range(24))      # 24 different values, one per hour
fifty_days = one_day * 50

# Read one byte at a time so each hour is its own piece.
notebook = ufm.ingest_raw(to_bits(fifty_days), symbol_length_mode="fixed8")

rhythm = notebook.acf(60)        # check rhythm at distances 1 through 60.

# Sort by strength and print the top 5.
ranked = sorted(enumerate(rhythm, start=1), key=lambda pair: -pair[1])[:5]
print("strongest rhythms (distance, score):")
for distance, score in ranked:
    print(f"  every {distance:>2} steps -> score {score:.3f}")

What you should see:

strongest rhythms (distance, score):
  every 24 steps -> score 1.000
  every 48 steps -> score 1.000
  every 23 steps -> score 0.958
  every 25 steps -> score 0.958
  every 47 steps -> score 0.958

The engine found the 24-step rhythm (and its multiples 48, 72, ...) without being told to look for one. Replace the synthetic data with your own and the same code will find any rhythm that exists.

Real-world uses

Heart-rate readings from a wearable: detect a steady pulse vs a wandering one.
Web traffic logs: confirm the daily and weekly cycles before sizing your servers.
Manufacturing sensor: check that a machine cycle is still firing on schedule.

10. Find when a pattern changes

Going deeper

The situation:your data was steady, then something shifted. You want to know when. The engine's notebook has two tools for this: segments (long stretches where one piece dominates) and transitions (the boundaries where the mix changes).

# tutorial_10_changes.py
import ufm

def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]

# A signal that is calm for 200 steps, then jumpy for 200 steps, then calm again.
calm   = bytes([10, 11, 10, 11] * 50)        # repeats: 10, 11, 10, 11
jumpy  = bytes([200, 50, 230, 5] * 50)       # very different from "calm"
signal = calm + jumpy + calm                 # three regions back to back

notebook = ufm.ingest_raw(to_bits(signal), symbol_length_mode="fixed8")

segments = notebook.segments(window_size=64)
print(f"the engine found {len(segments)} stable region(s):")
for index, seg in enumerate(segments[:10]):
    print(f"  region {index}: starts at {seg['start']}, ends at {seg['end']}")

transitions = notebook.transitions(window_size=64, threshold=0.1)
print(f"\nshift points (first 10): {transitions[:10]}")

You should see roughly three segments and two shift points (around position 200 and around position 400). Those line up with where the calm and jumpy data join. Increase the window to make the engine more patient (it averages over a longer slice before deciding anything has changed). Decrease it to be more sensitive.

Real uses:spotting the moment a service started misbehaving, finding the day a customer's habits changed, flagging when a sensor stream switches mode (idle vs active).

11. Build a duplicate finder for a folder

Going deeper

The situation: you have a folder of files (notes, images, exports) and you suspect there are duplicates. The engine gives you one ID per file. Files that share an ID are very likely duplicates. (The next step, if you want to be sure, is the similarity check from tutorial 6.)

# tutorial_11_duplicates.py
from collections import defaultdict
from pathlib import Path
import ufm

folder = Path("./inbox")               # change to wherever your files live
files = sorted(folder.glob("*"))       # all files in that folder
files = [p for p in files if p.is_file()]
print(f"checking {len(files)} files...")

with ufm.InvariantIdentityEngine(storage_path="inbox-ledger.bin") as eng:
    seeds = []
    for path in files:
        seed, _ = eng.process(path.read_bytes())
        seeds.append(seed)

# Group filenames by ID. Any group with more than one filename is a possible duplicate set.
groups = defaultdict(list)
for path, seed in zip(files, seeds):
    groups[seed].append(path.name)

duplicates = {seed: names for seed, names in groups.items() if len(names) > 1}

if not duplicates:
    print("no duplicate IDs found.")
else:
    print(f"\nfound {len(duplicates)} group(s) of files sharing an ID:")
    for seed, names in duplicates.items():
        print(f"  ID {seed}:")
        for name in names:
            print(f"    - {name}")

Two files with the same ID are nearly always byte-identical. (There is a small chance two different-but-structurally-similar files share an ID; if you need a guarantee, run a similarity check or a plain byte compare on each pair before deleting anything.)

12. Ask Bob a question using your own AI

AI integration

The situation: the wheel ships with two helpers, Ben and Bob, that answer questions about UFM grounded in a sealed knowledge base. Both need a language model to write their replies. If you already have one (an Anthropic key, an OpenAI key, or a CLI like Claude Code on your machine), you can hand it to Ben or Bob without setting up a separate account.

First, write a tiny "bridge" script. Bob calls this script whenever it needs a sentence written. The script reads the request from standard input, calls your AI, and prints the answer.

# bridge.py: hand Bob over to Claude Code on your machine
import json, shutil, subprocess, sys, tempfile
from pathlib import Path

CLAUDE = shutil.which("claude")
if CLAUDE is None:
    print("claude CLI not on PATH", file=sys.stderr)
    sys.exit(2)

def split_messages(messages):
    sys_parts  = [m["content"] for m in messages if m.get("role") == "system"]
    user_parts = [
        f"{m['role'].upper()}: {m['content']}"
        for m in messages if m.get("role") != "system"
    ]
    return "\n\n".join(sys_parts), "\n\n".join(user_parts)

payload = json.loads(sys.stdin.read())
system_prompt, user_text = split_messages(payload["messages"])

with tempfile.TemporaryDirectory(prefix="ufm-bridge-") as tmp:
    sys_file = Path(tmp) / "system.txt"
    sys_file.write_text(system_prompt, encoding="utf-8")
    result = subprocess.run(
        [CLAUDE, "-p",
         "--system-prompt-file", str(sys_file),
         "--allowedTools", "",          # safety: no tools, just words
         "--output-format", "text"],
        input=user_text,
        capture_output=True, text=True, encoding="utf-8",
        cwd=tmp, timeout=180, check=False,
    )
    if result.returncode != 0:
        sys.stderr.write(result.stderr or "claude failed")
        sys.exit(result.returncode)
    print(json.dumps({"text": (result.stdout or "").strip(),
                      "model": "claude-code"}))

Then call Bob through the bridge:

# tutorial_12_bob.py
import ufm

# Tell Bob to use our bridge whenever it needs a reply written.
backend = ufm.SubprocessBackend(cmd=["python", "bridge.py"], timeout=180)
bob = ufm.BobPipeline(backend=backend)

answer = bob.query("What does the engine do that a checksum does not?",
                   mode="advisory", max_anchors=5)

print(f"gate:    {answer.gate_status}")  # PASS, WARN, or BLOCK
print(f"reply:\n{answer.response}\n")
print(f"evidence cited: {len(answer.evidence)} sources")

Bob runs the look-up and the safety check locally; the bridge only produces the natural-language wording. Same trick works for Ben: use ufm.BenSession(backend=backend) and session.ask(question).

Not on Windows or not using Claude Code? The same bridge pattern works with any CLI or any Python SDK. The wire format (JSON in, text or JSON out) is the same. See the subprocess backend docs for the full spec.

13. Get a quality score for every run

Power tools

The situation: a regulator, an auditor, or your own future self wants proof the engine processed the data correctly: replay actually round-tripped, the run was deterministic, no stage failed. The Universal Pipeline runs your data through seven stages and reports back on every one.

# tutorial_13_quality.py
import ufm

doc = b"the quick brown fox jumps over the lazy dog\n" * 100

pipeline = ufm.UniversalPipeline(storage_path="quality-ledger.bin")
result = pipeline.run(doc)

print(f"success:        {result['success']}")
print(f"seed:           {result['seed']}")
q = result["quality"]
print(f"replay_valid:   {q['replay_valid']}")
print(f"deterministic:  {q['deterministic']}")
print(f"discovery_rate: {q['discovery_rate']:.4f}")
print(f"reuse_ratio:    {q['reuse_ratio']:.4f}")

print("\nstages:")
for stage in result["stages_completed"]:
    flag = "OK" if stage["success"] else "FAIL"
    print(f"  {stage['stage']:<10} -> {flag}")

print(f"\nviolations: {result['violations']}")

Each quality field is independent. replay_valid means the bytes round-tripped exactly. deterministic means the same input produces the same seed every time. violations is empty when every stage passes; non-empty entries are what you have to write down before the run can be called clean.

When to use this over plain process: when the cost of being wrong is high and you want a record showing you checked. For day-to-day ingest with no audit need, plain process() is faster.

14. Keep a tamper-evident log of decisions

Power tools

The situation: you are making decisions in a loop (which support ticket to escalate, which file to flag) and you need a hash-chained log proving nothing was edited after the fact. The Decision Pipeline runs each request through four anti-drift gates and writes one row per decision into a JSON ledger. Each row includes the SHA-256 of the row before it.

# tutorial_14_decisions.py
import ufm

pipeline = ufm.DecisionPipeline("decisions-ledger.json")

requests = [
    "Summarise the customer support tickets from yesterday.",
    "Flag the ticket mentioning a refund.",
    "Summarise the customer support tickets from yesterday.",
]

for r in requests:
    result = pipeline.run(r)
    passed = sum(result["gates"].values())
    print(f"\nrequest: {r[:50]}...")
    print(f"  status:          {result['status']}")
    print(f"  tier:            {result['tier']}")
    print(f"  gates passed:    {passed} / 4")
    print(f"  hash (first 16): {result['decision_hash'][:16]}...")
    print(f"  side effects:    {result['side_effects']}")

The four gates check: (1) the response mentions the subject from the input; (2) no earlier identical request failed gates; (3) every substantive output token traces back to the input or the ledger; (4) the ledger row was successfully read back after writing. All four must pass for a status of ok. Note the third request is identical to the first, so its tier jumps to 3: the engine recognises prior context for the same subject.

The hash chain: each row in decisions-ledger.json contains the full SHA-256 of the previous row, so editing any past row breaks every later hash. Open the file and inspect it directly. It is plain JSON.

15. Profile a single file in one call

Power tools

The situation: you have one file and you want a small dictionary of structural numbers for it: how varied is its vocabulary, how often a piece repeats, how steep its frequency curve is. structural_profile gives you that in a single call. No notebook on disk, no setup.

# tutorial_15_profile.py
import ufm

text   = b"the quick brown fox jumps over the lazy dog " * 50
random = bytes(range(256)) * 8

for label, data in [("repeating text", text), ("random bytes", random)]:
    profile = ufm.structural_profile(data, symbol_width=16)
    print(f"\n{label}:")
    print(f"  vocabulary size:    {profile['v_size']}")
    print(f"  reuse:              {profile['reuse']:.4f}")
    print(f"  zipf slope:         {profile['s_zipf']:.4f}")
    print(f"  alpha:              {profile['alpha']:.4f}")
    print(f"  discovery integral: {profile['discovery_integral']}")

Five numbers describe the file at the chosen symbol_width. High reuse and low v_size mean a file with a lot of repetition. Low reuse and a near-flat zipf slope mean the file is more like noise. Same number of fields every call, so two profiles are easy to compare side by side.

16. Ask which chunk size the engine would pick

Power tools

The situation:Tutorial 8 showed how to override the chunk size. This one is the opposite: hand the engine some data and ask "given my data, what would you pick?" Useful when you suspect a hidden unit width and want the engine to confirm it.

# tutorial_16_chunk_finder.py
import ufm

def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]

samples = {
    "ABC repeating": b"ABC" * 200,
    "byte counter":  bytes(range(256)) * 4,
    "long phrase":   b"hello world! " * 200,
}

for label, data in samples.items():
    length, meta = ufm.find_optimal_symbol_length(to_bits(data))
    print(f"\n{label}:")
    print(f"  picked size:       {length} bits")
    print(f"  entropy at choice: {meta['entropy_at_selected']:.4f}")
    print(f"  selection mode:    {meta['mode']}")
    print(f"  sample bits used:  {meta['sample_bits_used']}")

The engine samples up to the first 100,000 bits and tries different widths. The width with the lowest entropy wins (lower means more structured). The selection rule is deterministic: same bytes give the same answer every time. ABC repeatingpicks 24 because each repeat is exactly three bytes; the long phrase picks 104 because the repeating unit is 13 bytes long ("hello world! ").

17. Compute a quick signature without a notebook

Power tools

The situation: you have a handful of independent inputs and you want a structural fingerprint for each, but you do not want a persistent ledger file on disk. ufm_signature and ufm_signature_batch compute one in memory and return.

# tutorial_17_signatures.py
import ufm

inputs = [
    b"hello world",
    b"hello world",                # exact duplicate
    b"hello, world",               # tiny tweak (added comma)
    b"completely different bytes",
]

# One at a time:
sig = ufm.ufm_signature(inputs[0])
print(f"single seed: {sig['seed']}, primitives: {sig['primitive_count']}")

# Batch (same shape, in input order):
sigs = ufm.ufm_signature_batch(inputs)
seeds = [s["seed"] for s in sigs]
print(f"\nbatch seeds: {seeds}")
print(f"input 0 == input 1: {seeds[0] == seeds[1]}")  # True (identical)
print(f"input 0 == input 2: {seeds[0] == seeds[2]}")  # False
print(f"input 0 == input 3: {seeds[0] == seeds[3]}")  # False

Identical bytes give the same seed inside one batch, no matter where they appear. Different bytes almost always give different seeds. Each result is a full nine-key dict with seed, signature, discovery_rate, reuse_ratio and the rest. Nothing is written to disk.

18. Watch the engine learn (discovery-rate convergence)

Power tools

The situation: you are streaming data into one ledger over time. How do you know when the engine has seen enough to recognise the structure of the stream? Watch the discovery rate. Early on, almost every chunk is new and the rate is high. As the notebook fills up, the rate falls. When it sits near zero, the engine is in steady-state recognition.

# tutorial_18_convergence.py
import ufm
import random

random.seed(42)
phrases = [
    b"alpha quick brown fox",
    b"beta slow walking cat",
    b"gamma still sitting dog",
]

with ufm.InvariantIdentityEngine(storage_path="convergence-ledger.bin") as eng:
    print(f"{'batch':>5}  {'discovery_rate':>15}  {'primitives':>11}")
    for i in range(8):
        batch = b" ".join(random.choices(phrases, k=20))
        eng.process(batch)
        s = eng.ledger_summary()
        print(f"{i:>5}  {s['discovery_rate']:>15.4f}  {s['primitive_count']:>11}")

First batch: almost every chunk is new, so the rate is high. By the last batch, almost everything is a known piece in a different order, so the rate is close to zero. That curve shape is the same for any structured stream. Use it to decide when a stream is "saturated" enough to drive a downstream metric off it.

Trick: if the rate stops falling after many batches and stays at, say, 5%, that is the engine telling you new structure is still arriving steadily. Either the stream is genuinely non-stationary, or the chunk size is too small to capture the repeating units.