Tutorials
Eighteen worked examples that go from your very first run all the way to audited pipelines and structural analysis. Each one starts with a real situation, then shows the code to handle it. Read top to bottom for a smooth ride, or skip to whichever box your problem fits in.
Start here
What is the engine, in one paragraph
The engine takes any kind of data (a sentence, a photo, a list of weather readings) and breaks it into the smallest repeating pieces it can find. It remembers each unique piece in a notebook called a ledger, then writes down the order it saw them in. From that, it can do useful things: tell you if two files are secretly the same, count how often each piece appears, spot when a pattern changes, or rebuild your original data exactly.
You do not need to be a math person to use it. The tutorials below show practical examples and explain what each line of code is for.
Three words you will see a lot
- Bytes: the engine reads everything as a stream of bytes (numbers from 0 to 255). For text we use .encode("utf-8"); for files we use Path.read_bytes(); for numbers we pack them into bytes with bytes([1, 2, 3]).
- Ledger: the engine's notebook. It is a single file on disk (you choose the name) that holds every piece the engine has seen, in the order it saw them. Open the same file later and the engine picks up where it left off.
- Seed: a small number the engine returns every time you give it data. Same data, same seed. Different data, almost always a different seed. Think of it as a quick label for "here is what I saw."
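The three conversions named in the Bytes entry look like this in plain Python. This is a minimal sketch that uses only the standard library and does not touch the engine; the file name sample.bin is made up for the demo.

```python
import tempfile
from pathlib import Path

# Text: pick an encoding explicitly. Non-ASCII characters take more than one byte.
text_bytes = "héllo".encode("utf-8")

# Numbers: bytes() accepts a list of integers from 0 to 255.
number_bytes = bytes([1, 2, 3])

# Files: read_bytes() returns exactly what is stored on disk.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.bin"  # hypothetical file, created just for this demo
    sample.write_bytes(number_bytes)
    file_bytes = sample.read_bytes()

print(len("héllo"), "characters ->", len(text_bytes), "bytes")
print(list(number_bytes))
print(file_bytes == number_bytes)
```

Whichever route you take, the result is a bytes object, which is the only input type the tutorials hand to the engine.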
Setup, once
Follow Local Python Wheel in the docs to install and activate the engine. After that, every tutorial below is just a plain Python script you can copy and run. Quick check that everything is wired up:
python -m ufm status
# → {"active": true, "tier": "standard", "days_remaining": 29, ...}
python -c "import ufm; print(ufm.VERSION)"
# → 3.0-rust
You are ready once status prints active: true.
1. Find the most-common letters in a poem
The situation: you have a poem (or a paragraph, or a product review) and you want to know which letters and spaces show up most often. Maybe you are designing a phone keypad, or you want to know if your writing leans heavily on the letter "e".
# tutorial_01_letters.py
import ufm
poem = (
    "Tyger Tyger, burning bright,\n"
    "In the forests of the night;\n"
    "What immortal hand or eye,\n"
    "Could frame thy fearful symmetry?"
)
# The engine reads bytes, not strings, so we encode the text first.
data = poem.encode("utf-8")
# Open a ledger (a notebook on disk) and feed it our poem.
with ufm.InvariantIdentityEngine(storage_path="poem-ledger.bin") as eng:
    seed, status = eng.process(data)
    print(f"the engine called this poem: seed={seed} ({status})")
    # Pull the underlying notebook so we can ask questions of it.
    notebook = eng.ledger()
    # The most-common pieces, with how many times each appeared.
    top = notebook.top_n_primitives(5)
    print("\nfive most common pieces:")
    for piece_id, count in top:
        print(f" piece #{piece_id} appeared {count} times")
    print(f"\ntotal unique pieces: {notebook.primitive_count}")
    print(f"total pieces written: {notebook.timeline_length}")
What you should see:
the engine called this poem: seed=4 (NOVELTY)
five most common pieces:
piece #2 appeared 18 times
piece #5 appeared 8 times
piece #1 appeared 6 times
piece #11 appeared 5 times
piece #14 appeared 5 times
total unique pieces: 21
total pieces written: 121
For text, each "piece" is usually a single character (so the most-common piece is almost always the space character, then the letter "e" in English). The engine assigns piece numbers in the order it first sees them, so the IDs are not fixed across runs of different text.
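If you want to sanity-check the ranking without the engine, plain Python can count bytes directly. This is a cross-check sketch built on collections.Counter, not the engine's method, and its totals need not equal the engine's piece counts, because the engine decides for itself what counts as a piece.

```python
from collections import Counter

poem = (
    "Tyger Tyger, burning bright,\n"
    "In the forests of the night;\n"
    "What immortal hand or eye,\n"
    "Could frame thy fearful symmetry?"
)

# Count every byte of the encoded poem.
counts = Counter(poem.encode("utf-8"))

# Show the five most frequent bytes as readable characters.
for byte_value, count in counts.most_common(5):
    print(f"{chr(byte_value)!r} appeared {count} times")
```

The space character comes out on top here too, which matches the rule of thumb above.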
Try changing one thing
- Replace the poem with a paragraph from a book and watch the unique-piece count climb.
- Run the script twice in a row. The second run prints REPLAY: the engine has already seen this exact poem.
- Delete poem-ledger.bin between runs to start with a fresh notebook.
2. Save your work and pick up tomorrow
The situation: you are slowly feeding a long list of documents into the engine. You do not want to start over every day. The ledger handles this for you: it saves to disk automatically, and you can reopen the same file as many times as you like.
# tutorial_02_notebook.py
import shutil
from pathlib import Path
import ufm
LEDGER = "diary-ledger.bin"
# Day 1: write down two thoughts.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    seed_a, _ = eng.process("Today I made a paper airplane.".encode("utf-8"))
    seed_b, _ = eng.process("It flew further than my brother's.".encode("utf-8"))
    print(f"day 1 seeds: {seed_a}, {seed_b}")
# When the "with" block ends, the engine saves the notebook to disk.
# Day 2: open the SAME file and add another thought.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    seed_c, _ = eng.process("I drew a map of my room.".encode("utf-8"))
    print(f"day 2 seed: {seed_c}")
    summary = eng.ledger_summary()
    print(f"the notebook now has {summary['timeline_length']} entries")
# Day 3: just look at what is in there. No new entries added.
with ufm.InvariantIdentityEngine(storage_path=LEDGER) as eng:
    eng.ledger_summary()  # forces the engine to load the file.
    notebook = eng.ledger()
    print(f"unique pieces seen across all three days: {notebook.primitive_count}")
    print(f"top three: {notebook.top_n_primitives(3)}")
# Make a backup copy you can email or archive.
shutil.copyfile(LEDGER, "diary-backup.ufmr")
print(f"\nbackup written: {Path('diary-backup.ufmr').stat().st_size} bytes")
Things to know:
- The storage_path argument is the notebook's filename. Pass the same name and you get the same notebook back.
- You do not have to call save(). The engine handles it when the with block ends. You can call it manually if you want to save mid-script.
- The notebook file is just a regular file. Copy it, email it, put it in version control, or zip it up for a backup.
- Keep .. traversals out of the path: pass a plain filename like "diary-ledger.bin" and you will be fine.
3. Tell if a file was tampered with
The situation: someone hands you a file and claims it has not been changed since last Tuesday. How do you check? You could compare every byte, but that is slow and only works if you have the original. The engine offers a quicker check: ask it to ingest and rebuild the file. If the rebuild matches the original byte for byte, the engine returns True. If even one byte changed, it returns False.
# tutorial_03_tamper.py
import ufm
samples = [
    b"a tiny file",
    "characters with accents like é and ñ".encode("utf-8"),
    bytes(range(256)),  # every byte value 0 to 255
    b"\x00" * 1024,  # 1 KB of zero bytes
    b"\xff" * 1024,  # 1 KB of 255s
]
with ufm.InvariantIdentityEngine(storage_path="check-ledger.bin") as eng:
    for index, data in enumerate(samples):
        # reconstruct() is the round-trip check. True means the engine can
        # rebuild your data exactly. False means something is off.
        ok = eng.reconstruct(data)
        size = len(data)
        verdict = "intact" if ok else "MISMATCH"
        print(f"sample {index}: {size:>5} bytes -> {verdict}")
What you should see:
sample 0:    11 bytes -> intact
sample 1:    38 bytes -> intact
sample 2:   256 bytes -> intact
sample 3:  1024 bytes -> intact
sample 4:  1024 bytes -> intact
Every sample comes back intact. If you ever see MISMATCH on a file the engine ingested, that is a real problem worth reporting.
Why it matters
Imagine you publish a contract and someone receives a copy weeks later. They run reconstruct() on their copy. If the engine rebuilds it exactly, the file is identical to the one you ingested. No silent edit slipped through.
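For comparison, here is the classic way to check a received copy against a published one: record a digest when you publish, then recompute it on the copy. This sketch uses hashlib from the standard library and is a different technique from reconstruct(); the contract text is invented for the demo.

```python
import hashlib

def digest(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

# At publish time: keep the digest somewhere the recipient can trust.
published = b"The parties agree to the terms below.\n"
recorded = digest(published)

# Weeks later: one faithful copy and one copy with a single letter changed.
received_ok = published
received_bad = published.replace(b"agree", b"agrex")

print("faithful copy matches:", digest(received_ok) == recorded)
print("edited copy matches:  ", digest(received_bad) == recorded)
```

Either approach needs something recorded at publish time (a digest here, the ledger in the engine's case) for the later check to mean anything.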
4. Fingerprint a photo or any binary file
The situation: you have a folder of images and you want a quick label per file so you can spot duplicates. The engine gives every file a seed: a small number that depends on the file's contents. Two byte-identical files always get the same seed in the same notebook.
# tutorial_04_fingerprint.py
from pathlib import Path
import ufm
photo = Path("./family-photo.jpg")
data = photo.read_bytes()
print(f"loaded {photo.name} ({len(data):,} bytes)")
with ufm.InvariantIdentityEngine(storage_path="photos-ledger.bin") as eng:
    # First time the engine sees this file.
    seed_first, status_first = eng.process(data)
    print(f"first time: seed={seed_first} ({status_first})")
    # Same bytes again. The engine recognises it.
    seed_second, status_second = eng.process(data)
    same = seed_second == seed_first
    print(f"second time: seed={seed_second} ({status_second}) same as before: {same}")
    # Now flip a single bit in the first byte and ingest the edited copy.
    edited = bytearray(data)
    edited[0] ^= 0x01
    seed_edited, status_edited = eng.process(bytes(edited))
    different = seed_edited != seed_first
    print(f"after edit: seed={seed_edited} ({status_edited}) different now: {different}")
The first call prints NOVELTY (new file). The second prints REPLAY because the engine already has it. The edited copy almost always gets a different seed.
5. Catalog a list of small records
The situation: you have a list of small records (a daily reading from a sensor, a row in a spreadsheet, a product feature flag). You want each record to have its own ID and you want to know which values appear most often across the whole catalog.
The trick is to pack each record into a fixed number of bytes so the engine sees them all at the same scale. In the example below, each record is four bytes: temperature, humidity, wind, cloud cover.
# tutorial_05_records.py
import ufm
# Each record packs four small numbers (0 to 255) into four bytes.
# Field 1: temperature step (0 = -50 C ... 100 = 50 C)
# Field 2: humidity percent (0 to 100)
# Field 3: wind km/h (0 to 150)
# Field 4: cloud cover percent (0 to 100)
records = [
    bytes([72, 65, 8, 20]),  # warm, dry, light wind, mostly clear
    bytes([72, 65, 12, 25]),  # next day: very similar
    bytes([55, 90, 30, 95]),  # cooler, humid, windy, overcast
    bytes([60, 88, 28, 90]),
    bytes([58, 92, 35, 99]),
    bytes([72, 68, 10, 22]),  # sunny again
]
with ufm.InvariantIdentityEngine(storage_path="weather-ledger.bin") as eng:
    # process_batch is the bulk version of process(). One ID per record.
    results = eng.process_batch(records)
    seeds = [s for s, _ in results]
    print(f"per-day IDs: {seeds}")
    notebook = eng.ledger()
    print(f"unique values across all fields: {notebook.primitive_count}")
    print(f"five most common values: {notebook.top_n_primitives(5)}")
    # Pull back any specific day by its ID.
    day_three = list(eng.replay(seeds[2]))
    print(f"day 3 bytes: {bytes(day_three[0]).hex()}")
Each day is one entry. The top_n_primitives list answers questions like "which humidity reading came up most?" because each byte stays its own unit. To pull back a specific day, save its seed and call replay(seed) later.
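Packing and unpacking a record by hand is easy to get wrong, so a pair of small helpers can help. This is a sketch under one assumption read off the field comments: the temperature step moves one degree Celsius per step, so step 0 is -50 C and step 100 is 50 C. The helper names pack_record and unpack_record are made up for illustration.

```python
def pack_record(temp_c: int, humidity: int, wind_kmh: int, cloud: int) -> bytes:
    """Pack one reading into four bytes: temperature step, humidity, wind, cloud cover."""
    step = temp_c + 50  # assumed scale: step 0 is -50 C, step 100 is 50 C
    # bytes() raises ValueError if any value falls outside 0..255.
    return bytes([step, humidity, wind_kmh, cloud])

def unpack_record(record: bytes) -> dict:
    """Turn four packed bytes back into named fields."""
    step, humidity, wind_kmh, cloud = record
    return {
        "temp_c": step - 50,
        "humidity": humidity,
        "wind_kmh": wind_kmh,
        "cloud": cloud,
    }

record = pack_record(22, 65, 8, 20)
print(record.hex())
print(unpack_record(record))
```

The packed value here is the same four bytes as the first record in the tutorial, bytes([72, 65, 8, 20]).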
6. See how similar two files are
The situation: you have two essays, two images, or two versions of a document, and you want a single number for "how much do these have in common?" The engine builds a notebook for each file, then reports how much overlap there is.
# tutorial_06_similar.py
from pathlib import Path
import ufm
def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]
essay_a = Path("./essay-original.txt").read_bytes()
essay_b = Path("./essay-revised.txt").read_bytes()
# Build one notebook per essay (in memory, no disk file needed).
notebook_a = ufm.ingest_raw(to_bits(essay_a), symbol_length_mode="auto_curve")
notebook_b = ufm.ingest_raw(to_bits(essay_b), symbol_length_mode="auto_curve")
result = ufm.ledger_compare(notebook_a, notebook_b)
# A number from 0 to 1. 1 means identical pieces; 0 means nothing in common.
similarity = result["jaccard"]
print(f"similarity: {similarity:.2%}")
print(f"shared pieces: {result['shared_primitives']}")
print(f"only in essay A: {result['only_a']}")
print(f"only in essay B: {result['only_b']}")
if similarity > 0.9:
    print("\nverdict: nearly identical")
elif similarity > 0.5:
    print("\nverdict: substantial overlap, lots of shared text")
elif similarity > 0.2:
    print("\nverdict: some common ideas")
else:
    print("\nverdict: mostly different")
Same idea works for any two files. Compare two photos to see how much of the same visual content they share. Compare two CSV exports to spot how much the numbers overlap. Compare two app log files to see if the system is doing the same thing day to day.
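The "jaccard" number is a standard measure: the count of pieces both sides share, divided by the count of pieces either side has. Here is a minimal sketch of that formula in plain Python. It assumes fixed-width chunks stand in for pieces, which is a simplification; the engine picks its own pieces, so its score for the same inputs may differ.

```python
def pieces(data: bytes, width: int = 1) -> set:
    """Split data into fixed-width chunks and keep the distinct ones."""
    return {data[i:i + width] for i in range(0, len(data), width)}

def jaccard(a: set, b: set) -> float:
    """Shared items divided by all items; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

left = pieces(b"abcd")
right = pieces(b"cdef")

print(f"shared: {sorted(left & right)}")
print(f"similarity: {jaccard(left, right):.2%}")
```

Two of the six distinct bytes are shared, so the score is one third. Note that the measure ignores order and repetition: it only asks which pieces exist on each side.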
7. Spot fake differences (line endings, BOM)
The situation: a teammate sends you a file. Your diff tool screams that every line is different. You squint and the text looks the same. The likely culprit is a fake difference: the file uses Windows line endings instead of Linux ones, or it starts with an invisible byte-order mark.
The engine has a layer that recognises these "noise" differences and tells you whether the real content matches.
# tutorial_07_fake_differences.py
import ufm
# Same content, different line endings.
linux_text = b"line one\nline two\nline three\n"
windows_text = b"line one\r\nline two\r\nline three\r\n"
# Same content, but the second has an invisible byte-order mark up front.
plain = "ascii content".encode("utf-8")
with_bom = b"\xef\xbb\xbf" + plain
pipeline = ufm.SemanticDecisionPipeline("noise-ledger.jsonl")
cases = [
    ("Linux vs Windows line endings", linux_text, windows_text),
    ("Plain vs invisible byte-order mark", plain, with_bom),
]
for label, source, target in cases:
    result = pipeline.run_with_policy(
        source, target,
        # Tell the pipeline which "fake" classes to forgive.
        enabled_noise_classes=["line_ending_crlf", "bom_utf8"],
        strict_allowlist=True,
    )
    print(f"--- {label} ---")
    if result["converges"]:
        print(" verdict: same content, fake difference (line endings or BOM)")
    else:
        print(" verdict: real difference, look closer")
    print(f" classified deltas: {len(result['noise_units'])}")
    print(f" audit fingerprint: {result['decision_hash'][:16]}...")
When converges is True, every byte-level difference was fully accounted for by the noise classes you allowed. The two files are the same in every way that matters. When it is False, there is a real difference the engine could not explain away.
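To see what "forgiving" these two noise classes means at the byte level, here is a hand-rolled version: strip a leading UTF-8 byte-order mark, turn CRLF into LF, then compare. This is a sketch of the idea in plain Python, not the pipeline's implementation, and it produces no audit trail; the function names are made up.

```python
UTF8_BOM = b"\xef\xbb\xbf"

def normalize(data: bytes) -> bytes:
    """Remove a leading UTF-8 BOM and convert Windows line endings to Linux ones."""
    if data.startswith(UTF8_BOM):
        data = data[len(UTF8_BOM):]
    return data.replace(b"\r\n", b"\n")

def same_content(a: bytes, b: bytes) -> bool:
    """True when the only differences are the two forgiven noise classes."""
    return normalize(a) == normalize(b)

linux_text = b"line one\nline two\nline three\n"
windows_text = b"line one\r\nline two\r\nline three\r\n"
plain = "ascii content".encode("utf-8")
with_bom = UTF8_BOM + plain

print(same_content(linux_text, windows_text))
print(same_content(plain, with_bom))
print(same_content(plain, b"ascii c0ntent"))
```

The first two comparisons come back True and the third False, because a changed letter is not one of the forgiven differences.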
8. Choose how the engine reads your data
The situation: the engine breaks your data into pieces of a chosen size. The default size is auto-picked, and that is usually the right move. But if your data has a known unit (one byte per ASCII character, two bytes per UTF-16 character, four bytes per packed number), you can tell the engine to use exactly that size and the analytics line up perfectly with your data.
# tutorial_08_chunk_size.py
import ufm
# A pattern that repeats every three bytes: "ABC" 100 times.
pattern = b"ABC" * 100
print(f"{'mode':<14} {'chunk size':>10} {'unique pieces':>14} {'novelty rate':>14}")
for mode in ["auto_curve", "entropy", "fixed8", "fixed16", "fixed24"]:
    info = ufm.ufm_signature(pattern, symbol_length_mode=mode)
    print(
        f"{mode:<14} "
        f"{info['symbol_length']:>10} "
        f"{info['primitive_count']:>14} "
        f"{info['discovery_rate']:>14.4f}"
    )
What you should see:
mode           chunk size  unique pieces   novelty rate
auto_curve             24              1         0.0033
entropy                 8              3         0.0100
fixed8                  8              3         0.0100
fixed16                16              6         0.0100
fixed24                24              1         0.0033
Notice what changed. auto_curve spotted the 24-bit repeating unit and concluded there is only ONE distinct piece. With fixed8 we forced the engine to read one byte at a time, so it found three distinct bytes (A, B, C). With fixed16 we forced two-byte chunks, which cut the pattern in awkward places.
Pick a mode by what your data looks like
- You have no idea: use auto_curve. It scans the data and picks for you.
- Plain English text or code: fixed8 matches one byte per character.
- Records with a fixed field width: use Fixed(N) where N is the bit width of one record. Tutorial 5 uses this.
- UTF-16 text or two-byte values: fixed16.
9. Find a hidden rhythm in numbers
The situation: you have a long list of measurements over time (heart rate every second, web traffic every hour, temperature every day). You suspect there is a repeating pattern. How often does it repeat?
The engine's notebook can answer that with acf (short for autocorrelation function). Think of it as a rhythm meter: feed in the data, ask "how strongly does each step look like the step a few back, a dozen back, a hundred back?" Big numbers at a particular distance mean the pattern repeats every that many steps.
# tutorial_09_rhythm.py
import ufm
def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]
# A pretend hourly reading: a 24-step shape repeated for 50 days.
one_day = bytes(range(24)) # 24 different values, one per hour
fifty_days = one_day * 50
# Read one byte at a time so each hour is its own piece.
notebook = ufm.ingest_raw(to_bits(fifty_days), symbol_length_mode="fixed8")
rhythm = notebook.acf(60) # check rhythm at distances 1 through 60.
# Sort by strength and print the top 5.
ranked = sorted(enumerate(rhythm, start=1), key=lambda pair: -pair[1])[:5]
print("strongest rhythms (distance, score):")
for distance, score in ranked:
    print(f" every {distance:>2} steps -> score {score:.3f}")
What you should see:
strongest rhythms (distance, score):
every 24 steps -> score 1.000
every 48 steps -> score 1.000
every 23 steps -> score 0.958
every 25 steps -> score 0.958
every 47 steps -> score 0.958
The engine found the 24-step rhythm (and its multiple 48) without being told to look for one. Replace the synthetic data with your own and the same code will surface whatever rhythm falls within the distances you ask acf to check.
Real-world uses
- Heart-rate readings from a wearable: detect a steady pulse vs a wandering one.
- Web traffic logs: confirm the daily and weekly cycles before sizing your servers.
- Manufacturing sensor: check that a machine cycle is still firing on schedule.
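The rhythm-meter idea can be tried by hand with a much simpler score: for each distance, what fraction of steps exactly equal the step that many places later? This is a match-based stand-in for autocorrelation, written in plain Python as a sketch. It is not the formula behind acf, so its scores at non-matching distances will not line up with the engine's.

```python
def match_scores(values, max_lag):
    """For each lag 1..max_lag, the fraction of positions where values[i] == values[i + lag]."""
    scores = []
    for lag in range(1, max_lag + 1):
        pairs = len(values) - lag
        matches = sum(1 for i in range(pairs) if values[i] == values[i + lag])
        scores.append(matches / pairs)
    return scores

one_day = list(range(24))   # 24 different values, one per hour
readings = one_day * 50     # the same day repeated 50 times

scores = match_scores(readings, 60)

# max() keeps the first of several equal scores, so this is the shortest best lag.
best_lag = max(range(1, 61), key=lambda lag: scores[lag - 1])
print(f"strongest rhythm: every {best_lag} steps (score {scores[best_lag - 1]:.3f})")
```

Exact matching only suits data whose values repeat exactly; for noisy measurements a correlation-based score is the more forgiving choice.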
10. Find when a pattern changes
The situation: your data was steady, then something shifted. You want to know when. The engine's notebook has two tools for this: segments (long stretches where one piece dominates) and transitions (the boundaries where the mix changes).
# tutorial_10_changes.py
import ufm
def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]
# A signal that is calm for 200 steps, then jumpy for 200 steps, then calm again.
calm = bytes([10, 11, 10, 11] * 50) # repeats: 10, 11, 10, 11
jumpy = bytes([200, 50, 230, 5] * 50) # very different from "calm"
signal = calm + jumpy + calm # three regions back to back
notebook = ufm.ingest_raw(to_bits(signal), symbol_length_mode="fixed8")
segments = notebook.segments(window_size=64)
print(f"the engine found {len(segments)} stable region(s):")
for index, seg in enumerate(segments[:10]):
    print(f" region {index}: starts at {seg['start']}, ends at {seg['end']}")
transitions = notebook.transitions(window_size=64, threshold=0.1)
print(f"\nshift points (first 10): {transitions[:10]}")
You should see roughly three segments and two shift points (around position 200 and around position 400). Those line up with where the calm and jumpy data join. Increase the window to make the engine more patient (it averages over a longer slice before deciding anything has changed). Decrease it to be more sensitive.
11. Build a duplicate finder for a folder
The situation: you have a folder of files (notes, images, exports) and you suspect there are duplicates. The engine gives you one ID per file. Files that share an ID are very likely duplicates. (The next step, if you want to be sure, is the similarity check from tutorial 6.)
# tutorial_11_duplicates.py
from collections import defaultdict
from pathlib import Path
import ufm
folder = Path("./inbox") # change to wherever your files live
files = sorted(folder.glob("*")) # all files in that folder
files = [p for p in files if p.is_file()]
print(f"checking {len(files)} files...")
with ufm.InvariantIdentityEngine(storage_path="inbox-ledger.bin") as eng:
    seeds = []
    for path in files:
        seed, _ = eng.process(path.read_bytes())
        seeds.append(seed)
# Group filenames by ID. Any group with more than one filename is a possible duplicate set.
groups = defaultdict(list)
for path, seed in zip(files, seeds):
    groups[seed].append(path.name)
duplicates = {seed: names for seed, names in groups.items() if len(names) > 1}
if not duplicates:
    print("no duplicate IDs found.")
else:
    print(f"\nfound {len(duplicates)} group(s) of files sharing an ID:")
    for seed, names in duplicates.items():
        print(f" ID {seed}:")
        for name in names:
            print(f" - {name}")
Two files with the same ID are nearly always byte-identical. (There is a small chance two different-but-structurally-similar files share an ID; if you need a guarantee, run a similarity check or a plain byte compare on each pair before deleting anything.)
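The plain byte compare mentioned above can be done with filecmp from the standard library. A minimal sketch: given the paths in one suspected group, return only the pairs whose contents really match. The helper name confirmed_pairs and the demo files are made up for illustration.

```python
import filecmp
import tempfile
from itertools import combinations
from pathlib import Path

def confirmed_pairs(paths):
    """Return (name, name) pairs whose file contents match byte for byte."""
    return [
        (a.name, b.name)
        for a, b in combinations(paths, 2)
        # shallow=False forces a real content comparison, not just size and timestamp.
        if filecmp.cmp(a, b, shallow=False)
    ]

# Demo: three throwaway files, two of them identical.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "a.txt").write_bytes(b"same bytes")
    (folder / "b.txt").write_bytes(b"same bytes")
    (folder / "c.txt").write_bytes(b"other bytes")
    pairs = confirmed_pairs(sorted(folder.glob("*")))

print(pairs)
```

Run it on each group the duplicate finder reports, and delete only files that appear in a confirmed pair.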
12. Ask Bob a question using your own AI
The situation: the wheel ships with two helpers, Ben and Bob, that answer questions about UFM grounded in a sealed knowledge base. Both need a language model to write their replies. If you already have one (an Anthropic key, an OpenAI key, or a CLI like Claude Code on your machine), you can hand it to Ben or Bob without setting up a separate account.
First, write a tiny "bridge" script. Bob calls this script whenever it needs a sentence written. The script reads the request from standard input, calls your AI, and prints the answer.
# bridge.py: hand Bob over to Claude Code on your machine
import json, shutil, subprocess, sys, tempfile
from pathlib import Path
CLAUDE = shutil.which("claude")
if CLAUDE is None:
    print("claude CLI not on PATH", file=sys.stderr)
    sys.exit(2)
def split_messages(messages):
    sys_parts = [m["content"] for m in messages if m.get("role") == "system"]
    user_parts = [
        f"{m['role'].upper()}: {m['content']}"
        for m in messages if m.get("role") != "system"
    ]
    return "\n\n".join(sys_parts), "\n\n".join(user_parts)
payload = json.loads(sys.stdin.read())
system_prompt, user_text = split_messages(payload["messages"])
with tempfile.TemporaryDirectory(prefix="ufm-bridge-") as tmp:
    sys_file = Path(tmp) / "system.txt"
    sys_file.write_text(system_prompt, encoding="utf-8")
    result = subprocess.run(
        [CLAUDE, "-p",
         "--system-prompt-file", str(sys_file),
         "--allowedTools", "",  # safety: no tools, just words
         "--output-format", "text"],
        input=user_text,
        capture_output=True, text=True, encoding="utf-8",
        cwd=tmp, timeout=180, check=False,
    )
if result.returncode != 0:
    sys.stderr.write(result.stderr or "claude failed")
    sys.exit(result.returncode)
print(json.dumps({"text": (result.stdout or "").strip(),
                  "model": "claude-code"}))
Then call Bob through the bridge:
# tutorial_12_bob.py
import ufm
# Tell Bob to use our bridge whenever it needs a reply written.
backend = ufm.SubprocessBackend(cmd=["python", "bridge.py"], timeout=180)
bob = ufm.BobPipeline(backend=backend)
answer = bob.query("What does the engine do that a checksum does not?",
                   mode="advisory", max_anchors=5)
print(f"gate: {answer.gate_status}")  # PASS, WARN, or BLOCK
print(f"reply:\n{answer.response}\n")
print(f"evidence cited: {len(answer.evidence)} sources")
Bob runs the look-up and the safety check locally; the bridge only produces the natural-language wording. Same trick works for Ben: use ufm.BenSession(backend=backend) and session.ask(question).
13. Get a quality score for every run
The situation: a regulator, an auditor, or your own future self wants proof the engine processed the data correctly: replay actually round-tripped, the run was deterministic, no stage failed. The Universal Pipeline runs your data through seven stages and reports back on every one.
# tutorial_13_quality.py
import ufm
doc = b"the quick brown fox jumps over the lazy dog\n" * 100
pipeline = ufm.UniversalPipeline(storage_path="quality-ledger.bin")
result = pipeline.run(doc)
print(f"success: {result['success']}")
print(f"seed: {result['seed']}")
q = result["quality"]
print(f"replay_valid: {q['replay_valid']}")
print(f"deterministic: {q['deterministic']}")
print(f"discovery_rate: {q['discovery_rate']:.4f}")
print(f"reuse_ratio: {q['reuse_ratio']:.4f}")
print("\nstages:")
for stage in result["stages_completed"]:
    flag = "OK" if stage["success"] else "FAIL"
    print(f" {stage['stage']:<10} -> {flag}")
print(f"\nviolations: {result['violations']}")
Each quality field is independent. replay_valid means the bytes round-tripped exactly. deterministic means the same input produces the same seed every time. violations is empty when every stage passes; non-empty entries are what you have to write down before the run can be called clean.
When to use this instead of plain process(): when the cost of being wrong is high and you want a record showing you checked. For day-to-day ingest with no audit need, plain process() is faster.
14. Keep a tamper-evident log of decisions
The situation: you are making decisions in a loop (which support ticket to escalate, which file to flag) and you need a hash-chained log proving nothing was edited after the fact. The Decision Pipeline runs each request through four anti-drift gates and writes one row per decision into a JSON ledger. Each row includes the SHA-256 of the row before it.
# tutorial_14_decisions.py
import ufm
pipeline = ufm.DecisionPipeline("decisions-ledger.json")
requests = [
"Summarise the customer support tickets from yesterday.",
"Flag the ticket mentioning a refund.",
"Summarise the customer support tickets from yesterday.",
]
for r in requests:
    result = pipeline.run(r)
    passed = sum(result["gates"].values())
    print(f"\nrequest: {r[:50]}...")
    print(f" status: {result['status']}")
    print(f" tier: {result['tier']}")
    print(f" gates passed: {passed} / 4")
    print(f" hash (first 16): {result['decision_hash'][:16]}...")
    print(f" side effects: {result['side_effects']}")
The four gates check: (1) the response mentions the subject from the input; (2) no earlier identical request failed gates; (3) every substantive output token traces back to the input or the ledger; (4) the ledger row was successfully read back after writing. All four must pass for a status of ok. Note the third request is identical to the first, so its tier jumps to 3: the engine recognises prior context for the same subject.
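The hash-chaining idea itself fits in a few lines. This is a minimal sketch with hashlib and json, assuming a made-up row layout (a "decision" field plus a "prev_hash" field); it is not the Decision Pipeline's file format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the very first row

def row_hash(row: dict) -> str:
    """SHA-256 of a row, serialised with sorted keys so the hash is stable."""
    return hashlib.sha256(json.dumps(row, sort_keys=True).encode("utf-8")).hexdigest()

def append_row(chain: list, decision: str) -> None:
    """Add a row that carries the hash of the row before it."""
    prev = row_hash(chain[-1]) if chain else GENESIS
    chain.append({"decision": decision, "prev_hash": prev})

def verify(chain: list) -> bool:
    """Walk the chain and confirm every row points at the true hash of its predecessor."""
    expected = GENESIS
    for row in chain:
        if row["prev_hash"] != expected:
            return False
        expected = row_hash(row)
    return True

chain = []
for text in ["escalate ticket 1", "flag ticket 2", "close ticket 3"]:
    append_row(chain, text)

intact = verify(chain)
chain[0]["decision"] = "close ticket 1"   # quietly rewrite history
tampered = verify(chain)
print(f"before edit: {intact}, after edit: {tampered}")
```

One limit of this bare version: an edit to the very last row has no later row to expose it, so the most recent hash also has to be kept somewhere outside the file.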
Each row in decisions-ledger.json contains the full SHA-256 of the previous row, so editing any past row breaks every later hash. Open the file and inspect it directly. It is plain JSON.
15. Profile a single file in one call
The situation: you have one file and you want a small dictionary of structural numbers for it: how varied is its vocabulary, how often a piece repeats, how steep its frequency curve is. structural_profile gives you that in a single call. No notebook on disk, no setup.
# tutorial_15_profile.py
import ufm
text = b"the quick brown fox jumps over the lazy dog " * 50
random = bytes(range(256)) * 8
for label, data in [("repeating text", text), ("random bytes", random)]:
    profile = ufm.structural_profile(data, symbol_width=16)
    print(f"\n{label}:")
    print(f" vocabulary size: {profile['v_size']}")
    print(f" reuse: {profile['reuse']:.4f}")
    print(f" zipf slope: {profile['s_zipf']:.4f}")
    print(f" alpha: {profile['alpha']:.4f}")
    print(f" discovery integral: {profile['discovery_integral']}")
Five numbers describe the file at the chosen symbol_width. High reuse and low v_size mean a file with a lot of repetition. Low reuse and a near-flat zipf slope mean the file is more like noise. Same number of fields every call, so two profiles are easy to compare side by side.
16. Ask which chunk size the engine would pick
The situation: Tutorial 8 showed how to override the chunk size. This one is the opposite: hand the engine some data and ask "given my data, what would you pick?" Useful when you suspect a hidden unit width and want the engine to confirm it.
# tutorial_16_chunk_finder.py
import ufm
def to_bits(data: bytes) -> list[int]:
    return [int(bit) for byte in data for bit in f"{byte:08b}"]
samples = {
"ABC repeating": b"ABC" * 200,
"byte counter": bytes(range(256)) * 4,
"long phrase": b"hello world! " * 200,
}
for label, data in samples.items():
    length, meta = ufm.find_optimal_symbol_length(to_bits(data))
    print(f"\n{label}:")
    print(f" picked size: {length} bits")
    print(f" entropy at choice: {meta['entropy_at_selected']:.4f}")
    print(f" selection mode: {meta['mode']}")
    print(f" sample bits used: {meta['sample_bits_used']}")
The engine samples up to the first 100,000 bits and tries different widths. The width with the lowest entropy wins (lower means more structured). The selection rule is deterministic: same bytes give the same answer every time. ABC repeating picks 24 because each repeat is exactly three bytes; the long phrase picks 104 because the repeating unit is 13 bytes long ("hello world! ").
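The "lowest entropy wins" rule can be tried by hand. This sketch cuts the data into chunks at each candidate width, measures the Shannon entropy of the chunk frequencies, and keeps the width with the lowest value. Several things here are assumptions: it only tries whole-byte widths from 1 to 16, it reads all the data, and it breaks ties by taking the narrowest width. The engine works in bits and its candidate list and tie-breaking are not shown in this tutorial.

```python
import math
from collections import Counter

def chunk_entropy(data: bytes, width: int) -> float:
    """Shannon entropy (in bits) of the complete width-byte chunks of data."""
    chunks = [data[i:i + width] for i in range(0, len(data) - width + 1, width)]
    total = len(chunks)
    counts = Counter(chunks)
    return sum((n / total) * math.log2(total / n) for n in counts.values())

def pick_width(data: bytes, max_width: int = 16) -> int:
    """Width in bytes with the lowest entropy; ties go to the narrowest width."""
    return min(range(1, max_width + 1), key=lambda w: (chunk_entropy(data, w), w))

for label, data in [("ABC repeating", b"ABC" * 200), ("long phrase", b"hello world! " * 200)]:
    width = pick_width(data)
    print(f"{label}: {width} bytes = {width * 8} bits")
```

When the width matches the repeating unit, every chunk is identical and the entropy is zero, which is why the two samples land on 3 bytes and 13 bytes.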
17. Compute a quick signature without a notebook
The situation: you have a handful of independent inputs and you want a structural fingerprint for each, but you do not want a persistent ledger file on disk. ufm_signature and ufm_signature_batch compute one in memory and return.
# tutorial_17_signatures.py
import ufm
inputs = [
b"hello world",
b"hello world", # exact duplicate
b"hello, world", # tiny tweak (added comma)
b"completely different bytes",
]
# One at a time:
sig = ufm.ufm_signature(inputs[0])
print(f"single seed: {sig['seed']}, primitives: {sig['primitive_count']}")
# Batch (same shape, in input order):
sigs = ufm.ufm_signature_batch(inputs)
seeds = [s["seed"] for s in sigs]
print(f"\nbatch seeds: {seeds}")
print(f"input 0 == input 1: {seeds[0] == seeds[1]}") # True (identical)
print(f"input 0 == input 2: {seeds[0] == seeds[2]}") # False
print(f"input 0 == input 3: {seeds[0] == seeds[3]}")  # False
Identical bytes give the same seed inside one batch, no matter where they appear. Different bytes almost always give different seeds. Each result is a full nine-key dict with seed, signature, discovery_rate, reuse_ratio and the rest. Nothing is written to disk.
18. Watch the engine learn (discovery-rate convergence)
The situation: you are streaming data into one ledger over time. How do you know when the engine has seen enough to recognise the structure of the stream? Watch the discovery rate. Early on, almost every chunk is new and the rate is high. As the notebook fills up, the rate falls. When it sits near zero, the engine is in steady-state recognition.
# tutorial_18_convergence.py
import ufm
import random
random.seed(42)
phrases = [
b"alpha quick brown fox",
b"beta slow walking cat",
b"gamma still sitting dog",
]
with ufm.InvariantIdentityEngine(storage_path="convergence-ledger.bin") as eng:
    print(f"{'batch':>5} {'discovery_rate':>15} {'primitives':>11}")
    for i in range(8):
        batch = b" ".join(random.choices(phrases, k=20))
        eng.process(batch)
        s = eng.ledger_summary()
        print(f"{i:>5} {s['discovery_rate']:>15.4f} {s['primitive_count']:>11}")
First batch: almost every chunk is new, so the rate is high. By the last batch, almost everything is a known piece in a different order, so the rate is close to zero. That curve shape is the same for any structured stream. Use it to decide when a stream is "saturated" enough to drive a downstream metric off it.
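The falling curve is easy to reproduce by hand. This sketch assumes the discovery rate is "distinct pieces seen so far divided by total pieces seen so far" and treats each byte as one piece; both are simplifications made for illustration, and the engine's own definition and piece size may differ.

```python
def discovery_rates(batches):
    """Cumulative distinct-bytes / total-bytes after each batch."""
    seen = set()
    total = 0
    rates = []
    for batch in batches:
        seen.update(batch)      # iterating bytes yields one integer per byte
        total += len(batch)
        rates.append(len(seen) / total)
    return rates

phrases = [
    b"alpha quick brown fox",
    b"beta slow walking cat",
    b"gamma still sitting dog",
]
# Nine batches that cycle through the three phrases.
batches = [phrases[i % 3] for i in range(9)]
rates = discovery_rates(batches)

for index, rate in enumerate(rates):
    print(f"{index:>5} {rate:>15.4f}")
```

Once all three phrases have been seen, no new bytes arrive and the rate keeps shrinking as the total grows, which is the saturation signal described above.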