network-generation · shared utility

degree matchers

A shared top-up toolkit every generator can call after its main pass. Reads two edgelists (current graph + reference target), finds per-node residual deficits, and adds edges that close the gap. Four algorithm families, each with a cluster-preserving twin, and a stacking driver that lets one algorithm clean up what the previous one could not place.

Every generator in this project ends close to the input degree sequence, not exactly on it. Dedup, rewire, and outlier handling each strip a few stubs. The matcher closes the gap: pop the most-needy node, pick a partner that still has room, drop the edge in, repeat. Pick rules, tie-breaks, and what to do when nothing legal is left differ across the four algorithm families. Each family has a cluster-preserving twin that gates partner choice on a per-(block, block) edge budget. Algorithms also stack: a comma-separated list runs each in turn on the residual the previous one left behind.

input · the residual deficit

Where the deficit comes from. Re-roll for a fresh draw.

The deficit drawn here drives every walker on this page. Source: configuration-model pairing on the 20-node fixture's degree sequence, then drop self-loops and parallel edges (the same dedup pass the SBM generator runs). Hubs lose more stubs because collisions scale with degree squared; leaves rarely lose any. Re-roll redraws the shuffle and resets every walker.

20-node fixture · non-deficit nodes dimmed total: 0 · nodes with deficit: 0 · max: 0

input contract

Two edgelists in. Top-up edges out.

Two inputs sharing a node set: the current graph and a reference graph. The per-node residual deficit is

\[ r_u \;=\; \max\bigl(0, \,d_u^{\text{ref}} - d_u^{\text{input}}\bigr). \]

The matcher adds edges to drive \(d_u^{\text{out}} \to d_u^{\text{ref}}\). It only adds; never removes. Cluster-preserving steps additionally consume the reference clustering plus a derived per-(block, block) edge budget. Pipelines attach the clustering files automatically when any step in the stack is cluster-preserving.

when each generator calls it

EC-SBM-v1 uses greedy for byte-stable output against its reference implementation. EC-SBM-v2 and EC-SBM-v3 default to cluster_preserving_true_greedy. The standalone SBM generator defaults to the same. ABCD, ABCD+o, LFR, and nPSO leave the matcher off by default; their pipelines turn it on as an opt-in step, usually paired with the --remap mode covered at the bottom of the page (their output node IDs do not align with the input's).

the four families · overview

Different ways to drain the same deficit pile.

All four families share the same outer loop: pick a node with positive deficit, pick a partner that also still has room and is not already adjacent, add the edge, decrement both. They differ on the picking order, on whether the partner is chosen deterministically or by a weighted random draw, and on what they do when the loop stalls.

family	how it picks pairs	failure mode
greedy	Take the most-needy node. Take its first acceptable partner in id-ascending order. Repeat until that node's deficit is closed, then move on.	Silent gridlock: when no partner is left it drops the remaining stubs without logging.
true_greedy	Take the most-needy node. Take the partner that is itself currently most needy, breaking ties by id. Push the updated needs back into the priority order before the next pick.	Logs gridlock when no partner with positive deficit is left.
random_greedy	Pick the source node by a weighted random draw over residuals. Pick the partner the same way over valid targets.	Logs gridlock. Useful for measuring how much of greedy's output comes from the deterministic ordering.
rewire	Make one stub per missing edge per node, shuffle the pile, pair adjacent stubs. Self-loops, parallels, and duplicates queue for a 2-opt repair pass that swaps endpoints with a random valid edge until the queue is empty or a retry budget is hit.	Logs gridlock. Almost always resolves on realistic inputs; the repair pass is the slow path on hub-heavy graphs.

Greedy and true_greedy are deterministic; random_greedy and rewire sample fresh draws per seed. Plain true_greedy is the bare-CLI default; clustered generators (sbm, ec-sbm-v2, ec-sbm-v3) override to cluster_preserving_true_greedy, the practical default whenever a clustering is available.

Each family has a cluster-preserving twin. The twin runs the same outer loop but rejects any candidate whose (block, block) bucket is exhausted. The bucket budget is the per-(block, block) deficit: how many extra edges each (r, s) pair is expected to absorb if the residual stubs paired up uniformly at random. Built once from the reference clustering at frame start; decrements on every accept. Family sections below show each algorithm's walker followed by its twin's walker; the stacking section composes them.

Computation: total residual stubs in block \(r\) is \(\sigma_r = \sum_{u : b_u = r} r_u\). Expected new-edge count per bp under configuration-model pairing is \(\sigma_r \sigma_s / \sigma_{\text{tot}}\) for \(r \neq s\) and \((\sigma_r(\sigma_r - 1)) / (2\sigma_{\text{tot}})\) for \(r = s\), rounded with a floor of one on any bp the deficit can land in. The pipeline reads its bp budget straight off the reference edgelist on real inputs; this small fixture has no separate reference, so the page estimates the budget from the residual mass.

family · greedy

Take the most-needy node. Drain it against an id-sorted partner list.

Pop the node with the largest deficit. Build its candidate list (every other deficit-positive node that is not already adjacent), sort by id ascending, walk down adding edges until the source is drained or the list runs out. Move on to the next source. Source order, partner sort, and tie-break are fixed; output is deterministic. Stubs left when no partner is available drop without logging (silent gridlock).

step 0 · initial deficit step 0 / 0

click next to advance.

Caveat. Silent gridlock: dropped stubs leave no log line. The other three families log gridlock as a warning.

cluster-preserving twin

Same outer loop, plus one filter: candidates whose (block, block) bucket is empty drop out of the partner list. Each accepted edge decrements its bucket. Budget exhaustion is amplified by greedy's silent gridlock, so this twin can lose a noticeable number of stubs on inputs whose deficit lands in a few blocks. The walker captions name the bucket touched and the remaining budget.

step 0 · initial deficit step 0 / 0

click next to advance.

family · true_greedy

Take the most-needy node. Pair it with the partner that itself is most needy.

Same source pick as greedy. Partner choice is the deficit-positive candidate with the largest residual, ties broken by id ascending. Both endpoints' updated residuals go back into the priority order before the next pick. Greedy commits to one source until drained or stuck; true_greedy re-shuffles after every edge, so it always works on the currently neediest pair. On uneven deficits the gap closes with fewer dropped stubs. Gridlock is logged as a warning.

step 0 · initial deficit step 0 / 0

click next to advance.

cluster-preserving twin

Same priority-driven source and partner picks; the candidate list is filtered to the in-budget pairs first. Gridlock triggers when every remaining candidate's bucket is empty, and is logged. The dynamic re-shuffle saves more stubs than the cluster-preserving greedy does. Whatever the budget gate still refuses can be picked up by chaining a plain pass after this one (see the stack section below).

step 0 · initial deficit step 0 / 0

click next to advance.

family · random_greedy

Pick both endpoints by weighted random draw.

Source and partner are independent weighted draws over current deficits. Useful as a control: running across several seeds reveals which output features come from the input deficit and which come from greedy / true_greedy's tie-break rules. Gridlock is logged. Step count varies between seeds since the order in which nodes get starved shifts.

step 0 · initial deficit step 0 / 0

click next to advance.

cluster-preserving twin

Same weighted source and partner draws; partner draws are restricted to in-budget candidates. Each accept consumes one unit of budget. If the in-budget candidate set is empty for the chosen source, gridlock is logged and the source is abandoned.

step 0 · initial deficit step 0 / 0

click next to advance.

family · rewire

Make a stub per missing edge, shuffle, pair adjacent. Repair conflicts.

Configuration-model approach. Each node drops one stub per missing edge into a bag; shuffle, pair adjacent stubs. Self-loops, parallels, and intra-pairing duplicates queue for repair. Repair is 2-opt: pull a random valid edge from the placed pile, try the four endpoint recombinations, accept the swap if any pair of resulting edges is valid; otherwise requeue. Run for a fixed retry budget. Whatever still conflicts at the end is logged and dropped. The shuffle is the only RNG; dense inputs almost always settle on the first pass, hub-heavy inputs walk the queue.

step 0 · stubs ready, no pairs yet step 0 / 0

click next to advance.

cluster-preserving twin

Stubs are partitioned by (block, block) bucket up front and drawn per bucket independently, capped at the bucket budget. The 2-opt repair only swaps endpoints in ways that keep both resulting pairs inside their original buckets.

Same pair distribution as production; specific edges differ because the browser cannot share graph-tool's RNG.

step 0 · initial deficit step 0 / 0

click next to advance.

stackable matchers

Chain matchers. Each one cleans up what the previous left behind.

Stacks compose any of the eight algorithms in a comma-separated list. Each step runs on the residual the previous step left behind. A single name is a one-step stack. Two-step chains pair a bulk drain with a finishing pass; longer chains compose cluster-preserving and plain passes when both block structure and degree mass matter.

Each step uses its own random seed, derived from the matcher's overall seed, so randomness in one step does not bleed into the priorities of the next. The shared state across steps is the running graph itself plus the deficit dictionary, both updated as edges land. Cluster-preserving steps additionally share the per-(block, block) edge budget, which every cluster-preserving step in the chain decrements; non-cluster-preserving steps ignore the budget entirely. Putting a non-cluster-preserving step after a cluster-preserving one is the way to recover degree mass that the budget gate refused, at the cost of letting that final step place a few edges in buckets the cluster-preserving steps would have refused.

Build a stack

Pick algorithms one at a time. Each pick runs on whatever residual the previous picks left behind. back pops the most recent step and reverts its edges + budget. rand step rerolls the top stochastic frame; rand all rerolls every stochastic frame in the stack.

stack length 0 · bare input deficit picks: 0 · residual ·

stack is empty · current state is the bare input deficit.

A plain step after a cluster-preserving one trades budget fidelity for residual closure: the new edges land in buckets the gated steps already exhausted, so block-shape stats drift while the degree gap closes. Pick the gated step alone when block fidelity is the priority and the residual stubs are tolerable; chain a plain step at the end when degree exactness matters more.

case study · dnc

Pick the matcher off a graph that loses 29% of its edges to dedup.

The dnc network has 1,891 nodes, 5 reference clusters, and 10,429 reference edges. Run the standalone SBM generator with seed 1: graph-tool's stub-pair sampler hits enough collisions on the dense intra-cluster blocks that, after self-loops and duplicates are removed, only 7,438 edges survive. The matcher tops up 2,991 missing edges (a 29% deficit), almost entirely concentrated inside cluster interiors where the parallel-edge collisions cluster.

Four matcher choices, same SBM output, same seed:

matcher	final edges	deficit closed	cluster fidelity
true_greedy	10,429	100% (zero residual)	No bucket gate. Hubs deep inside one cluster reach across to outliers in another whenever no closer partner is left, so the block-pair edge counts drift away from the reference.
cluster_preserving_true_greedy (default)	10,155	90.8% (274 stubs stuck)	Bucket gate enforced strictly. Edges land where the reference says they should, but 274 stubs are abandoned because the buckets they would consume are already exhausted by the time their partner becomes the priority pick.
cluster_preserving_true_greedy, true_greedy	10,340	97.0% (89 stubs stuck)	Cluster-preserving pass first; plain pass picks up the residual. Recovers 185 of the 274 edges the gate refused. The 185 cleanup edges land in buckets the cluster-preserving step had used up, so block-pair fidelity is slightly looser than the gated-only run, but vastly closer to the reference than plain true_greedy.
cluster_preserving_rewire, cluster_preserving_true_greedy, true_greedy	10,347	97.1% (82 stubs stuck)	Same as the two-step stack, plus a stochastic bulk drain at the front. The 2-opt repair has more room to manoeuvre while the buckets are still mostly full, so the cluster-preserving true-greedy step has fewer hard cases to handle. Marginal extra recovery on dnc (7 more edges over the two-step stack).

Numbers from run_generator.sh --generator sbm on dnc, seed 1, with the four --degree-matcher values. SBM output is identical across all four runs (same graph-tool seed), so the differences are purely matcher behaviour.

the --remap option

Pair by degree rank for gens that emit fresh node IDs.

ABCD, ABCD+o, LFR, and nPSO emit integer node IDs unrelated to the input's. Diffing degrees against input IDs is meaningless. --remap pairs nodes by descending-degree rank: input rank \(i\) maps to output rank \(i\), and the deficit is computed on that pairing. The pairing is L1- and L2-optimal for matching two non-identical degree sequences (rearrangement inequality). The matcher still places edges in the input's ID space; no on-disk re-ID happens.

Cluster-preserving variants compose with --remap via two clustering inputs: --input-clustering labels the input-side blocks, --ref-clustering supplies the per-(block, block) edge budget. In direct mode (shared ID space) the two flags collapse to the same file.

illustration · toggle --remap to switch between id-pair (input id == output id) and rank-pair (rank \(i\) ↔ rank \(i\)). Input nodes left, sorted by input degree desc; output right. Crossings under id-pair show where the two sequences disagree on which IDs carry which degrees.

L1 distance · L2 distance ·

Use --remap whenever the generator's output IDs do not align with the input's. The simple-gen wrappers turn it on by default whenever they enable --match-degree. The SBM family already emits in the input's ID space, so it does not need it.