network-generation · ec-sbm-v2

EC-SBM v2

Edge-connected SBM, v2

Same per-cluster \(k\)-edge-connectivity floor as v1. Constructive-only core, one residual SBM over all blocks (including a combined outlier block), a block-preserving 2-opt rewire, and a block-aware degree matcher that logs gridlock instead of silently dropping stubs and gates every accepted edge on a per-block-pair budget.

v1 gets the right cluster guarantees, but the bookkeeping is messy. Stage 2 builds \(K_{k+1}\) cores and runs an SBM on the decremented probs + degree arrays; the SBM can sample a pair the core already placed, dedup drops one copy, and the block's intra-count leaks below the profile. Outliers get their own separate SBM through stage 3a. Stage 4's heap-greedy matcher silently drops stubs when a hub runs out of valid partners.

v2 keeps the \(K_{k+1}\) cores and rebuilds the rest. Stage 2 is constructive-only; nothing else samples during the core phase.

Stage 3 is a single residual SBM over every block (real clusters plus a combined outlier block). Invalid edges go to a block-preserving 2-opt rewire that holds both block-pair counts and every node's block membership constant. Stage 4 offers ten matcher algorithms: five global ones, plus five block-aware counterparts that gate every accepted edge on a per-block-pair budget read from the reference clustering. The default is the dynamic-heap greedy with the block-aware filter, which logs gridlock rather than hiding it.

Every edge is owned by exactly one stage, and nothing disappears without a log line. The per-cluster edge-connectivity proof is inherited verbatim from v1.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 singletons.

The 20-node synthetic this page builds on, drawn under the shared singleton view: C1 (8), C2 (6), C3 (4), plus outliers 19 + 20 each in their own 1-node cluster. EC-SBM v2 then forces outlier_mode=excluded in the next stage, dropping outliers from the profile-time partition; stage 3 folds them back as a combined block. This view is the universal baseline before either move.

input G · excluded mode · 3 clusters shared fixture

profiling · block assignment

Per-node block label. Outliers excluded now, folded back at stage 3.

Profile writes the same seven CSVs as v1: node_id, cluster_id, assignment, degree, edge_counts, mincut, and com. v2 also forces outlier_mode=excluded, so profile's tables cover only the clustered sub-network. The difference shows up later: at stage 3, v2 re-reads the empirical edgelist and folds every outlier into one combined pseudo-block, while v1 creates one singleton block per outlier.

input G after outlier_mode=excluded · 18 nodes across three clusters assignment.csv

why combined, not singleton

A size-1 cluster has no internal edges (no partner inside), so giving each outlier its own block adds rows and columns to the block matrix without filling any diagonal. Under combined, every outlier shares one block; the outlier-outlier diagonal cell captures the real edges they have among themselves, and clustered-outlier cells capture the cross edges. One block means one residual-SBM cell per relation, which is all v2's stage 3 needs.

profiling · degree sequence

Per-node target degree, post-exclusion.

Same as v1: degree.csv carries \(k_u\) counted on the post-exclusion graph. The constructive phase uses these numbers; stage 3 then bumps per-node out_degs by each node's outlier-incident count after re-reading the original edgelist.

input G · outliers excluded · hover a node or a degree row to link degree.csv

profiling · the block matrix

Edge count per block pair. 3x3 now, 4x4 at stage 3.

Same half-edge convention as v1 / SBM:

\[ e_{rr} \;=\; 2 \cdot \bigl| \{\, \{u, v\} \in E \,:\, b_u = b_v = r \,\} \bigr|. \]

At profile time v2's matrix is the same compact 3x3 as v1 (outliers excluded). Stage 3 then rebuilds a 4x4 by re-reading the original edgelist and adding the combined outlier block on the fourth row/column.

input G · hover an edge or a cell cell: (none)

profiling · the minimum edge cut

One new scalar per cluster: \(k(C)\).

Same profile addition as v1. mincut.csv records the minimum edge cut of each cluster's induced subgraph in the input, measured via pymincut's Nagamochi-Ono-Ibaraki algorithm with a bucket-queue heap. Singleton clusters get \(k = 0\).

Each cluster's mincut on this synthetic, highlighted in red-dashed on the graph below. C1 has a non-trivial internal cut: two subsets of 4 nodes joined by a thin bridge of 2 edges. C2 and C3 isolate a single low-degree vertex.

cluster	nodes	min deg in C	\(k(C)\)	one minimum cut
C1	8	3	2	{(1, 5), (4, 8)}, splitting {1,2,3,4} from {5,6,7,8}.
C2	6	2	2	{(9, 13), (12, 13)}.
C3	4	1	1	{(16, 18)}.

input G · red-dashed = one minimum edge cut per cluster mincut.csv

generation · the \(K_{k+1}\) core

Same \(K_{k+1}\) seed as v1. Nothing else samples here.

Stage 2 calls exactly the same gen_kec_core.generate_cluster as v1, with one change: v2 passes --no-sbm-overlay, so the constructive pass ends with the cores + attach-by-availability and nothing more. No gt.generate_sbm here, no overlay, no dedup. probs and deg carry exactly the budget the construction used.

The walker below is the same per-arrival walk used on the v1 page: cluster-major sort, phase-1 arrivals auto-attach, phase-2 arrivals greedy-pick \(k\) partners by residual int_deg desc. Realization 1 is byte-verified against the shipped Python at PYTHONHASHSEED=0. random step re-samples partner picks at the current arrival only, leaving every other arrival frozen; random all draws a fresh universe across every greedy arrival.

step 0 · stripped arrival 0 / 18

same proof, cleaner ledger

Per-cluster edge-connectivity \(\ge k(C)\) is still guaranteed by the \(K_{k+1}\) seed. What changes is that probs and deg are decremented by exactly the constructive contribution and nothing else; the residual the SBM sees next is honest.

generation · residual accounting

Compute what the SBM still owes each block.

Before the SBM call, prepare_sbm_inputs re-reads the original edgelist and rebuilds out_degs + probs on all four blocks (three real clusters + one combined outlier block). It subtracts the constructive contribution from out_degs and populates probs inter-cells from the original edge list. Finally it reconciles each diagonal against the block's residual stub budget:

\[ D_k \;=\; \sum_{u \,:\, b_u = k} \bigl(k^{\text{empirical}}_u - k^{\text{constructive}}_u\bigr), \quad E^{\text{inter}}_k \;=\; \sum_{s \neq k} \texttt{probs}[k, s]. \]

\(D_k\) is "residual half-edges block \(k\) still owes", \(E^{\text{inter}}_k\) is "inter-block half-edges block \(k\) is already committed to". The intra-block budget the SBM still needs to sample is

\[ \text{diff}_k \;=\; D_k \;-\; E^{\text{inter}}_k. \]

If \(\text{diff}_k \ge 0\), set \(\texttt{probs}[k, k] = \text{diff}_k\) (bump by 1 if odd, to satisfy graph-tool's even-half-edges-per-block rule, and add the bump back to one node's out_degs). If \(\text{diff}_k < 0\), the construction already over-spent the block's out-budget; add \(|\text{diff}_k|\) back to out_degs and set \(\texttt{probs}[k, k] = 0\) (the SBM can only add, never subtract).

On the synthetic, after the verified stage 2 run, the residual looks like:

block

\(D_k\)

\(E^{\text{inter}}_k\)

\(\text{diff}_k\)

OUT

C1 has already spent its full intra budget during construction, so the SBM sees 0 intra residual there. C2, C3, and OUT each still owe 2 half-edges (one intra edge per block). All four diffs are even on this synthetic, so no parity bump fires; the bump path triggers only on odd diffs (e.g. when an outlier-incident OO edge mismatches block-pair parity).

every edge owned by one stage

Constructive edges belong to stage 2. Residual intra + inter come from stage 3's SBM. Match-degree covers whatever the rewire cannot place. No cell is sampled twice, and nothing is double-counted because probs was decremented by exactly what stage 2 placed.

generation · residual SBM + block-preserving rewire

One SBM call. Block-preserving swaps repair invalid edges without losing the block count.

With the residual probs, out_degs, and the combined-outlier block assignment ready, v2 calls gt.generate_sbm(b, probs, out_degs, micro_ers=True, micro_degs=True) once. The sampler can still emit self-loops (two of a node's stubs pair together) and parallel edges (the same pair drawn twice). v1 dedupped both and accepted the degree loss. v2 tries to fix each one via a block-preserving 2-opt swap.

The swap: bucket every valid edge by its unordered block-pair \((A, B)\); for each invalid edge, pick a random valid edge in the same bucket and swap endpoints so each node stays in its home block. That preserves both the block-pair count and every node's block membership. Up to 10 retry passes with a stagnation detector. Edges still invalid after that are dropped with a WARN.

Step through; hit random to resample stage 3.

step 0 · residual SBM sampled step 0 / 3

core + attach residual SBM invalid edge (before rewire) partner (picked for swap)

invariant of the swap

If the invalid edge \((u, v)\) has block-pair \((A, B)\) with \(A \neq B\), the partner edge \((x, y)\) must also sit in bucket \((A, B)\). Swap the \(A\)-side of each edge with the \(B\)-side of the other: \((u_A, x_B)\) and \((x_A, u_B)\). Both nodes stay in their home block, so the block-pair count is invariant. For intra-block \((A = A)\) there are two valid swaps; the code flips a coin.

generation · match the residual degree

Block-aware top-up by default. Logs gridlock; does not silently drop.

After the residual SBM + rewire, some nodes may still be short of their target. v2 hands the leftovers to a block-aware top-up that spends edges out of a per-block-pair budget read from the reference clustering, rather than letting stubs leak across cluster boundaries. Anything still unplaced is logged, never silently dropped the way v1's plain greedy does. The viz below runs the simpler global sibling so the heap mechanics stay readable; see the degree matcher page for the full menu.

Toggle apply match_degree to watch the per-node deficit pile shrink. Click random for a different realization.

before match_degree · residual deficits visible placed: 0 · gridlocked: 0

clustered (core + attach) residual SBM match_degree residual deficit

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at seed 1, with the residual-SBM rewire and the block-aware top-up:

stat	input	v2 output	note
N	906	906	Exact.
edges	10,429	10,191	Off by 2.3%. The block-pair-aware top-up spends edges only where the reference \(e_{rs}\) has budget left; once every eligible block-pair is full or every viable partner is already a neighbour, leftover stubs are logged and the run stops.
mean degree	23.02	22.50	Follows the edge count.
global cc	0.548	0.555	A hair above the reference. The \(K_{k+1}\) cores carry the bulk of the triangles and the block-pair-aware top-up adds most of its edges where the input's \(e_{rs}\) concentrates, on the dense diagonals. v1 sits at 0.424 because its overlay dedup shaves intra-block density that v2's rewire keeps intact.
clusters	87	87	Profile-stage passthrough.

v1 and v2 share the same \(K_{k+1}\) cores edge-for-edge; everything else gets reshuffled. The flow chart below pairs each edge with the stage that placed it on each side.

edge provenance flow · v1 → v2 · ribbons join edges by their stage on each side. Top-to-bottom on each bar follows pipeline stage order: kec_core, then the SBM phase (overlay vs residual), then match_degree. Bottom segments are marginals: edges that exist on one side only.

v1 kec_core (2,202 edges) maps cleanly onto v2 kec_core. v1's SBM overlay (5,089) splits across all three v2 stages: about half lands in v2's residual SBM, a fifth re-emerges in match_degree, and the rest gets dropped. v1's separate outlier SBM (1,007) and v1's match_degree (2,127) get redistributed similarly. v2 also adds 3,460 edges that v1 never produced; v1 carried 3,694 edges that v2 doesn't.

why v2's global cc beats v1's on this input

Both versions place the same \(K_{k+1}\) cores. v1 then runs an SBM on the mutated probs and uses remove_parallel_edges for collisions; the sampler can land on core pairs, dedup drops one, and the intra-block density inside that block is shaved. v2's rewire keeps the block-pair count intact, so the core cliques' triangle structure stays in place and the global cc sits closer to the input's 0.548.

caveat: rewire can still strand a handful of edges

Rewire is not a complete repair. When no valid partner sits in the same bucket, an invalid edge stays invalid; after 10 retry passes or a stagnation break those unresolved edges are dropped with a warning. On dnc at this seed the total drop is tens of edges out of ten thousand: invisible in the stats table, present in the run log. The configuration-model rewire variant of stage 4 is lossy in the same way; the block-aware greedy default catches what rewire strands as long as an eligible partner is still in budget.