network-generation · ec-sbm-v1

EC-SBM v1

Edge-connected SBM, v1

SBM with a per-cluster edge-connectivity floor. Adds one scalar to SBM's contract: the minimum edge cut \(k(C)\) of each cluster. Hand-builds a \(K_{k+1}\) clique inside \(C\) so the output cluster survives any \(k-1\) edge removals.

Plain SBM freezes three things about the input: each node's block label, the per-pair edge count matrix, and the per-node degree. Triangles are not in the contract, and neither is edge-connectivity. A cluster can come out "connected" but one snip away from falling in half: the sampler is free to place intra-block edges on a handful of pairs while starving the rest.

EC-SBM adds one scalar per cluster. Measure the minimum edge cut \(k(C)\) of each cluster's induced subgraph in the input and promise the output cluster is at least that edge-connected. Before any SBM sampling, build a \(K_{k+1}\) seed clique on the top-degree nodes of \(C\). A complete graph on \(k+1\) vertices is \(k\)-edge-connected, so every edge added afterwards can only raise the connectivity, never drop it below \(k\).

This is v1, the first version of EC-SBM, published in Vu-Le et al. 2025. The gen side has four phases and two separate SBM calls. v2 is the rewrite: constructive-only seed, one SBM over all blocks, a block-preserving rewire, a logged true_greedy matcher.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 singletons.

The 20-node synthetic this page builds on, drawn under the shared singleton view: C1 (8), C2 (6), C3 (4), plus outliers 19 + 20 each in their own 1-node cluster. EC-SBM v1 then forces outlier_mode=excluded in the next stage, dropping outliers and their incident edges before profile counts anything; this view is the universal baseline before that exclusion.

input G · excluded mode · 3 clusters shared fixture

profiling · block assignment

Per-node block label. v1 forces outliers out.

Profile reads the empirical edgelist + clustering and writes seven CSVs. Two are lookups (node_id.csv, cluster_id.csv). Four carry the quantities the gen stages try to match, one per subsection below: assignment.csv, degree.csv, edge_counts.csv, mincut.csv. The seventh, com.csv, is the full (node, cluster) ledger and passes through to the final output unchanged.

v1 forces outlier_mode=excluded at profile time: any node not in a size-2+ cluster drops out, along with every edge incident to it. Those nodes come back later through a separate SBM call. v2 folds them into one combined pseudo-block instead; the rest of this page follows v1.

input G after outlier_mode=excluded · dashed rects enclose each block assignment.csv

profiling · degree sequence

Per-node target degree, counted on the post-exclusion graph.

The second vector is degree.csv: \(k_u\) per node. Because outliers were just dropped, the degrees counted here exclude outlier-incident edges (a node that had 3 neighbours, 2 clustered and 1 outlier, lands at \(k_u = 2\)). Hover a node to find its bar on the right, or a bar to find its node. Gen stages treat the sequence as a target, not a guarantee: dedup and match-degree gridlock can push the achieved degree below.

input G · outliers excluded · hover a node or a degree row to link degree.csv

profiling · the block matrix

Edge count per block pair, 3x3 on the synthetic.

The third input is edge_counts.csv. Read it as a half-edge count: how many half-edges from block \(r\) land in block \(s\). For \(r \neq s\) that coincides with the number of edges between the two blocks; for \(r = s\), each intra-block edge contributes twice (both halves inside \(r\)):

\[ e_{rr} \;=\; 2 \cdot \bigl| \{\, \{u, v\} \in E \,:\, b_u = b_v = r \,\} \bigr|. \]

With outliers excluded, v1's matrix is a compact 3x3. Hover an edge to find its cell, or a cell to light up every edge that contributes to the count.

input G · hover an edge or a cell cell: (none)

profiling · the minimum edge cut

One new scalar per cluster: \(k(C)\).

For each cluster \(C\), the profile computes the minimum edge cut of \(C\)'s induced subgraph in the input. An edge cut for \(C\) is a set of edges whose removal disconnects \(C\); \(k(C)\) is the smallest such set's size. Measured via pymincut's Nagamochi-Ono-Ibaraki algorithm with a bucket-queue heap ("noi" / "bqueue"). Singleton clusters get \(k = 0\) by definition. Writes one integer per cluster to mincut.csv.

Each cluster's mincut on this synthetic, highlighted in red-dashed on the graph below. C1 has a non-trivial internal cut: two subsets of 4 nodes joined by a thin bridge of 2 edges. C2 and C3 isolate a single low-degree vertex.

cluster	nodes	min deg in C	\(k(C)\)	one minimum cut
C1	8	3	2	{(1, 5), (4, 8)}, splitting {1,2,3,4} from {5,6,7,8}.
C2	6	2	2	{(9, 13), (12, 13)}.
C3	4	1	1	{(16, 18)}.

input G · red-dashed = one minimum edge cut per cluster mincut.csv

this is the only thing EC-SBM adds to SBM's contract

Assignment, degree, and block-counts are the SBM contract. EC-SBM reuses all three verbatim and adds one integer per cluster: \(k(C)\). Everything that follows exists to honour that number.

generation · the \(K_{k+1}\) core

Build a \(K_{k+1}\) seed clique. Attach the rest with \(k\) edges each.

Per cluster \(C\), sort nodes by target degree desc (id-asc tiebreak). Take the top \(k+1\) and wire them into a complete subgraph \(K_{k+1}\): that is \(\binom{k+1}{2}\) edges, with each node at degree exactly \(k\). A complete graph on \(k+1\) vertices is \(k\)-edge-connected, so any cluster that contains this \(K_{k+1}\) as a subgraph is \(k\)-edge-connected regardless of what the SBM layer does next.

Phase 2 walks the remaining cluster nodes in descending target-degree order. Each new node \(u\) attaches to exactly \(k\) already-processed partners, preferring the ones with the highest residual (remaining) target degree. The paper's equivalent rule is "make \(u\) adjacent to \(\min\{k, |N_0|\}\) partners with probability proportional to availability"; during the first \(k+1\) iterations \(|N_0| < k\), so the new vertex joins everything in \(N_0\), and the result of those iterations is the \((k+1)\)-clique.

When a partner's residual or the block-pair budget would block a required edge, ensure_edge_capacity inflates both by 1 so the edge can still go in:

\[ \texttt{int\_deg}[v] \mathrel{+}= 1, \quad \texttt{probs}[b_u, b_v] \mathrel{+}= 1. \]

Inflation is the only reason the output's degree sequence ends up approximate. The walker below mirrors gen_kec_core.generate_cluster arrival by arrival, in the same cluster-major order the shipped Python uses (clusters by size desc, nodes within each by post-exclusion target degree desc, id-asc tiebreak).

At each arrival the caption lists every already-processed candidate's residual int_deg with the picked partners marked. Phase-1 arrivals (those with \(|V_t| \leq k\)) auto-attach to every predecessor and grow the \(K_{k+1}\) seed; phase-2 arrivals greedy-pick \(k\) partners by residual desc. Realization 1 is byte-verified against the shipped Python at PYTHONHASHSEED=0. random step re-samples partner picks at the current arrival only, leaving every other arrival frozen; random all draws a fresh universe across every greedy arrival.

step 0 · stripped arrival 0 / 18

why \(K_{k+1}\) is the tight seed

A complete graph on \(k+1\) vertices has \(\binom{k+1}{2}\) edges, exactly \(k\) per vertex, and is \(k\)-edge-connected. Any smaller complete graph is not enough; any bigger one spends edges the profile would rather place elsewhere. v1 inflates the budget to protect the seed edges, not to shrink them.

generation · SBM overlay + dedup

Sample an SBM on the mutated residual, overlay the cores, drop duplicates.

After the constructive pass, probs and deg have been decremented by every edge the core + attach placed and inflated wherever the budget was too tight. v1 feeds those mutated arrays to graph-tool, overlays the constructive edges on top, and simplifies:

    g = gt.generate_sbm(b, probs.tocsr(), out_degs=deg,

        micro_ers=True, micro_degs=True, directed=False)

    g.add_edge_list(core_edges)

    gt.remove_parallel_edges(g); gt.remove_self_loops(g)

Step through the two phases and press random for a different SBM realization. The residual on this synthetic has 1 intra-C2 + 1 intra-C3 budget + 8 inter (4 C1-C2 + 2 C1-C3 + 2 C2-C3). Whether each intra sample lands on a fresh pair or duplicates a constructive edge depends on which slot the SBM picks; the dedup pass collapses any duplicates.

step 0 · constructive only step 0 / 2

core + attach residual SBM sample collision (duplicates core)

caveat: double-accounting

Two samplers writing into the same ledger. The constructive phase commits edges and decrements the budget. The SBM phase redraws on the decremented budget but can still sample the same pair the core already placed; remove_parallel_edges drops one copy. Net per collision: budget charged twice, edge kept once, block's intra count lands under profile. v2 replaces this with a single residual-SBM call and a block-preserving 2-opt rewire that swaps endpoints instead of dropping edges.

generation · the outlier rescue

Outliers get their own SBM pass.

The constructive + overlay phase never saw outliers because the profile excluded them. This phase re-reads the original (pre-exclusion) edgelist, flags every node that is in the edgelist but not in the clustering, and treats each outlier as a size-1 block of its own. It then samples an SBM on the outlier-incident edges only (--scope outlier-incident); the SBM does not touch clustered-side pairs.

The combine_edgelists.py merge is first-seen-wins with the constructive + overlay output as input 1. Two pure bands, undirected-deduped, written to a fresh edge.csv plus sources.json mapping each band to an inclusive 1-based row range.

core + overlay + outlier rescue orchid = outlier SBM band

singleton outlier blocks emit edges, not structure

Each outlier is its own 1-node block, so the SBM only has inter-block cells to populate. Outlier-outlier edges come out as a cell between two singletons; clustered-outlier edges come out as cells between a real cluster and a singleton. Nothing ever lands on a diagonal. Folding the outliers into one combined block instead gives v2's single-SBM picture.

generation · match the residual degree

Heap-greedy top-up. The deficit accumulates; match_degree drains it.

Dedup and the outlier pass together remove some edges from each node's degree. The final phase tops that shortfall back up with src/match_degree.py --algorithm greedy (hardcoded in v1):

Build a max-heap of (-residual, node_id) over nodes still short of their target.
Pop \(u\). Compute \(u\)'s non-neighbours (live nodes minus \(u\) and its current neighbours).
Drain partners from \(u\)'s non-neighbour set in id-ascending order (canonical sorts the candidate set explicitly so the result does not depend on Python's set hash order) until \(u\)'s residual hits zero or candidates run out. Each pair becomes an edge; decrement both sides.
If \(u\) ran out of partners first, the remaining stubs are silently dropped. No log line, no warning.

Toggle apply match_degree to watch the deficit pile (the faint-red tail on each bar) drain. Click random for a different matcher realization.

before match_degree · residual deficits visible placed: 0 · gridlocked: 0

clustered (core + overlay) outlier match_degree residual deficit

caveat: gridlock hides, the degree sequence leaks

A node with residual 3 and only one valid partner leaves the heap with residual 2. Those two stubs vanish without a log line. On a small synthetic you rarely see this. On dense inputs with long-tailed degree distributions, a handful of hubs silently come out short, and the degree-sequence target starts drifting below the profile's.

v2's --algorithm true_greedy drives this drop near zero and logs what it cannot place. See the match_degree page for the four other algorithms (greedy, random_greedy, rewire, hybrid) and the --remap mode for gens with fresh node IDs.

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at --seed 1:

stat	input	v1 output	note
N	906	906	Exact. Outliers folded back in through the outlier-rescue SBM after profile exclusion.
edges	10,429	10,425	Close. The match-degree phase fills almost all of the dedup shortfall.
mean degree	23.02	23.01	Follows the edge count.
global cc	0.548	0.424	Meaningfully higher than plain SBM (0.341): the \(K_{k+1}\) cores are dense on the biggest clusters.
clusters	87	87	Profile-stage passthrough (`drop_singleton_clusters` strips size-1 clusters, but dnc has none).

The 87 clusters cover 551 nodes; the remaining 355 are outliers that excluded drops in profile. Mincuts on this input skew small: 60 clusters have \(k = 1\), 9 have \(k = 2\), 8 have \(k = 3\). A handful of larger clusters carry most of the construction load: \(k = 49\) on the biggest (50-node) cluster, \(k = 18\) on the 36-node one, \(k = 14\) and \(k = 13\) on the next two.

mincut distribution · clusters with each \(k\) value on dnc. Small \(k\) dominates by count; the long tail is where the \(K_{k+1}\) construction spends most of its edges.

Provenance from sources.json, inclusive row ranges. The constructive + overlay band dominates; the outlier rescue is a thin strip; match_degree makes up the dedup shortfall.

edge provenance · inclusive 1-based row ranges from sources.json on the shipped v1 output.

why the cc jumps against plain SBM

The \(K_{k+1}\) core on the 50-node cluster alone is \(K_{50}\): \(\binom{50}{2} = 1225\) intra-edges packed on 50 nodes. Plain SBM on dnc lands at global cc 0.341; v1 lands at 0.424. The delta comes mostly from those cliques. The mincut distribution's long tail matters more than its count because the largest-\(k\) clusters carry most of the edges.

caveat: overlay dedup is invisible in the numbers

v1's output misses the input edge count by a thin margin and the global cc by 0.124 (0.548 input, 0.424 v1). How much of that shortfall is overlay-dedup versus match-degree gridlock? v1's logs do not say. v2's logs do, and they show overlay-dedup is the bigger share on dense inputs; gridlock is small when match-degree only has to fill a few stubs per hub.