HyveGuard

Saga #01 — why rolling deploys self-quarantine the cluster

Launch-week field report. The kind of bug you only find by hitting it. Read this and you'll understand a chunk of why the deploy ceremony looks the way it does.

Symptom

You roll out a new hyveguard-bin to one node. Within five minutes the cluster's MPC layer falls below threshold and starts rejecting derive calls. Within fifteen minutes more nodes have self-quarantined. Within thirty you have a cluster of four nodes all convinced the others are compromised, and a sweaty operator wondering whether they should reach for the rollback or the off-switch.

Mechanism

The Merkle layer correctly observes that the freshly deployed node's hyveguard-bin hash differs from its peers'. Each peer notices the diff independently, and each raises a merkle_tamper trigger. Our quarantine engine counted each of those triggers as a separate strike, per peer, per cycle: with three peers all reporting the same diff, that's three strikes in a single five-minute cycle, which hits the three-strike threshold instantly. The deploying node self-quarantines on the spot.
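The faulty accounting can be boiled down to a few lines. This is an illustrative sketch, not HyveGuard's actual code; the class and threshold names are invented for the example:

```python
STRIKE_THRESHOLD = 3  # three strikes in the window => self-quarantine

class BuggyQuarantineEngine:
    """Counts one strike per observing peer -- the bug described above."""

    def __init__(self):
        self.strikes = {}        # file path -> accumulated strike count
        self.quarantined = False

    def on_peer_observation(self, path, peer_id):
        # BUG: every peer that reports the same diff in the same cycle
        # adds another strike for the same underlying event.
        self.strikes[path] = self.strikes.get(path, 0) + 1
        if self.strikes[path] >= STRIKE_THRESHOLD:
            self.quarantined = True

engine = BuggyQuarantineEngine()
# One deploy, one cycle: three peers each notice the same hash drift.
for peer in ("node-b", "node-c", "node-d"):
    engine.on_peer_observation("/usr/bin/hyveguard-bin", peer)

print(engine.quarantined)  # True -- quarantined within a single cycle
```

One legitimate deploy is indistinguishable, to this counter, from three independent tamper events.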

Fifteen minutes later the same thing happens to the next node you upgrade, because the second wave of upgraded nodes now disagrees with the un-upgraded ones. Cascade.

The Merkle detector was working exactly as designed. The accounting layer was wrong.

Fix

Three changes:

  1. Per-cycle trigger dedup. Collect all peer observations first, then call the quarantine evaluator at most once per drifted file per cycle. Multi-observer agreement is still enforced (by the DAG response engine), not by counting strikes from the same diff multiple times.
  2. Operator-controlled allowlist. A file at /etc/hyveheim/hyveguard-accepted-hashes can list both the old and the new hash of any binary during a planned rollout. A diff is suppressed only when both the local and the peer hash are in the allowlist; a swap to an unlisted hash is still flagged.
  3. Downgrade-detect flag on checkpoint entries. The minimum protocol version is part of the FROST-signed payload, so a man-in-the-middle stripping the post-quantum signature can't also edit the version field without invalidating the signature.
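Changes 1 and 2 can be sketched together. Again, names and shapes here are assumptions for illustration (the allowlist is modelled as an in-memory set standing in for /etc/hyveheim/hyveguard-accepted-hashes):

```python
STRIKE_THRESHOLD = 3

class FixedQuarantineEngine:
    """Dedups observations per cycle and honors the operator allowlist."""

    def __init__(self, allowlist):
        self.allowlist = allowlist
        self.strikes = {}        # file path -> accumulated strike count
        self.quarantined = False

    def run_cycle(self, observations):
        # observations: iterable of (path, local_hash, peer_hash, peer_id)
        drifted = set()
        for path, local, peer_hash, _peer in observations:
            if local == peer_hash:
                continue  # no drift at all
            # Planned rollout: both hashes pre-listed -> suppress the diff.
            if local in self.allowlist and peer_hash in self.allowlist:
                continue
            drifted.add(path)  # dedup: at most one entry per file per cycle

        # Evaluate at most once per drifted file per cycle.
        for path in drifted:
            self.strikes[path] = self.strikes.get(path, 0) + 1
            if self.strikes[path] >= STRIKE_THRESHOLD:
                self.quarantined = True

# Operator pre-lists the old and new binary hashes for the rollout.
engine = FixedQuarantineEngine(allowlist={"hash-v1", "hash-v2"})

# Three peers observe the same planned upgrade (v1 -> v2) in one cycle:
engine.run_cycle(
    ("/usr/bin/hyveguard-bin", "hash-v2", "hash-v1", p)
    for p in ("node-b", "node-c", "node-d")
)
print(engine.quarantined)  # False -- planned rollout, no strikes

# An attacker-swapped, unlisted hash still counts, but only once per cycle:
engine.run_cycle(
    ("/usr/bin/hyveguard-bin", "hash-evil", "hash-v1", p)
    for p in ("node-b", "node-c", "node-d")
)
print(engine.strikes["/usr/bin/hyveguard-bin"])  # 1, not 3
```

The attacker case still accrues strikes toward the threshold; it just takes three genuine cycles of disagreement, not three peers echoing one event.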

Lesson

Tamper detection that conflates "is the deployed code different from before" with "is someone tampering" punishes the maintainers more than the attackers. The cure is to give the legitimate operator an explicit channel for "yes, I'm doing this on purpose" — without ever hiding what's actually deployed from the cross-peer Merkle layer.

The other lesson: a cluster that auto-quarantines too aggressively isn't more secure. It's just brittle. Self-protection that takes the cluster down is itself a denial of service the attacker would have happily caused.

What this means for you

It means the cluster you're attacking right now does not lock itself out during routine ops. If you observe a deploy in progress (binary hash drift visible at the Merkle layer), that is not a window where the cluster is auto-paralysed and easier to compromise. The allowlist closes that window. Compromising the cluster requires you to make it do something it would not do under any legitimate operation — which is, broadly, the entire point.

Operator's note

If you're reading this and thinking "obviously the trigger should be deduplicated" — yes. We agree. We didn't see it until the cluster taught us. Sometimes the tutorial is the bug.