A hidden pattern that gzip is blind to
Both datasets look identical from the outside. Only one has structure.
SPECTR AUC: 0.9847. Baseline gzip AUC: 0.5000.
Setup
We created two types of data: one with a hidden repeating pattern woven through it, and one that was randomly shuffled. Both had identical byte-level statistics — byte frequency, byte randomness score, everything. We then gave both to SPECTR and to gzip and asked: which dataset has the pattern? 40 pairs tested.
Why SPECTR detects this
SPECTR reads data at the level of individual bits, not bytes. It spotted the repeating pattern on first encounter and recognised it every time it reappeared, even though the pattern never aligned with byte edges. The structured dataset slowed down in discovery — it was recognising patterns rather than finding new ones — and that signal is what drives the result.
Why standard tools miss it
gzip works by finding repeated sequences of bytes. In this test, the pattern spans across byte edges and never appears as a byte-level repetition. gzip simply can't see it and scores no better than a coin flip.
| Metric | SPECTR | Baseline |
|---|---|---|
| SPECTR accuracy (AUC) | 0.9847 | - |
| gzip accuracy (AUC) | - | 0.5000 |
| Confidence interval (SPECTR) | [0.954, 1.000] | - |
| Confidence interval (gzip) | - | [0.500, 0.500] |
| Byte randomness difference | 0.000000 | 0.000000 |
| Test pairs | 40 | 40 |
| Data length per sequence | 32,000 bits | 32,000 bits |
40 pairs generated from a fixed seed. The same configuration produces identical numbers on any machine. Runnable with the Python wheel.