network-generation · sbm

SBM

Degree-corrected stochastic block model

Freezes the block assignment, the per-pair edge counts, and the per-node degree sequence; the rest is random.

Given a network \(G\) with a block (or "community") structure, the goal is a synthetic twin that keeps the blocks and the per-node degrees while drawing every edge fresh. The stochastic block model picks the summaries of \(G\) that matter enough to freeze and randomises the rest.

Peixoto's micro-canonical degree-corrected SBM freezes three things: (1) each node's block label, (2) the edge count between every pair of blocks \(e_{rs}\), and (3) each node's degree \(k_i\). Any graph that meets those three constraints is equally likely under the model.

The catch is simplify. The sampler emits a multigraph where the constraints hit exactly; the simplify step then drops self-loops and parallel edges so the output is a simple graph. On dense blocks that loses a visible fraction of edges, and the three frozen quantities end up as upper bounds instead of targets hit.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 singletons.

The 20-node fixture this page builds on, drawn under the shared singleton view: C1 (8), C2 (6), C3 (4), plus outliers 19 and 20 each in their own one-node cluster. Five clusters in total. SBM's profile then folds those two singletons into a single __outliers__ pseudo-block (next section). This view is the baseline shared across pages.

input G · combined mode · 4 blocks shared fixture

profiling · block assignment

Three vectors and one matrix: the input contract.

Profiling reads the reference network and its clustering and produces three things the sampler will try to match: a block label per node, a degree per node, and an edge count per block-pair. Anything else (mean degree, triangle count, pairwise distances) is not in the contract.

Write \(b\) for the block assignment (\(b_u \in \{1, \ldots, B\}\) for node \(u\)), \(k\) for the degree sequence (\(k_u\) = degree of \(u\)), and \(e\) for the block-pair count matrix (\(e_{rs}\) defined below). The next three sections visualise \(b\), then \(k\), then \(e\).

Outliers ride on the assignment. By default, every unclustered node and every size-1 cluster is folded into a single pseudo-block labelled __outliers__, so the sampler never has to treat them as a special case. Outlier-outlier edges survive into the pseudo-block's diagonal.

input G · dashed rings enclose each block assignment.csv

profiling · degree sequence

How many edges each node has.

The second part of the contract is the per-node degree \(k_i\). The side panel below shows the sequence, one row per node sorted by degree (descending), coloured by block. Hover a node or a bar to link them. The edge count matrix is in the next section.

input G · hover a node or row to link degree.csv

profiling · the edge count matrix

Count edges per block-pair. The only structure the sampler sees.

Read \(e_{rs}\) as a stub (half-edge) count: how many half-edges from block \(r\) land in block \(s\). For \(r \neq s\), that is also the number of edges between the two blocks, and the matrix is symmetric \(e_{rs} = e_{sr}\). For \(r = s\), each intra-block edge has both of its half-edges inside \(r\), so it contributes twice:

\[ e_{rr} \;=\; 2 \cdot \bigl| \{\, \{u, v\} \in E \,:\, b_u = b_v = r \,\} \bigr|. \]

The matrix has one row and column per block, so on this fixture it is a compact 4x4. Each edge contributes to exactly one cell (diagonal) or to a pair of cells \((r, s), (s, r)\) (off-diagonal). Hover an edge to highlight its cell, or hover a cell to light up every edge contributing to its count. The sampler will not remember which specific edges were there, only that the cell owes that many.

input G · hover an edge or a cell cell: (none)

why this is the whole contract

Any two graphs with the same \(e_{rs}\) and the same degree sequence are equally likely under the degree-corrected SBM. The model cannot tell them apart, and the sampler spreads probability across them uniformly. SBM matches \(e_{rs}\) and \(k_i\) in distribution and reshuffles which specific pairs are connected, because specific pairs are not in the contract.

generation · sample the multigraph

Stub-matching per block-pair, micro-canonical.

Each node \(u\) has a target degree \(k_u\). Build one urn per block: for block \(r\), drop each node \(u\) with \(b_u = r\) into the urn \(k_u\) times. The urn is just a bag of half-edges; it does not remember which target block any particular half-edge was supposed to land in. The spokes around each node show its share of the urn.

For each block-pair \((r, s)\) with \(r \le s\), walk the edge count matrix in row-major order. The pair claims \(e_{rs}\) edges: for \(r = s\), pull two half-edges out of urn \(r\) per edge; for \(r \neq s\), pull one out of urn \(r\) and one out of urn \(s\). Sampling is uniform without replacement and every draw becomes an edge.

graph-tool accepts non-equality input

The graph-tool sampler is happy with a looser input than the identity above. The consistency check is the inequality \(\sum_{u \,:\, b_u = r} k_u \ge \sum_{s} e_{rs}\) (left = stubs available in block \(r\), right = edges incident to \(r\)). When the two sides are equal, stubs distribute the natural way. When they are not, some nodes return at a degree below the requested \(k\): graph-tool prioritises hitting \(e_{rs}\) over hitting \(k_u\).

Toy example: \(b = (1, 1)\), \(k = (3, 3)\), \(e = (4)\). Two nodes both asking for degree 3, four half-edges on the diagonal. The check \(3 + 3 \ge 4\) holds, but only strictly. Two of the six stubs go unused, so achieved degrees sum to 4 instead of 6 (most often both return at degree 2, occasionally one at 1 and the other at 3, never both at 3). The identities used on this page hit equality, which is why the runs above behave cleanly.

caveat: counts, not distinct pairs

The constraint is on counts, not distinct pairs. When a hub's stub lands next to one of its own other stubs in the same block (\(r = s\)), the sampler emits a self-loop. When two of a hub's stubs into block \(s\) pair with two stubs of another node in the same block, the sampler emits a parallel edge. Both are legal under the micro constraint, and the graph below flags both as red-dashed.

stub-matching ·

why not put each outlier in its own block?

The other option is to give every outlier its own size-1 block, in which case outlier-outlier edges come back exactly. The matrix below shows \(e\) on the 20-node fixture under that view; the bottom-right 2x2 corner (highlighted) is exactly the adjacency matrix on the outlier subgraph, and the sampler reproduces it bit-for-bit.

The cost: singleton blocks may carry over community structure that the reference clustering missed among the outliers. This project benchmarks community-detection algorithms; if hidden structure exists in the outlier subgraph and the synthetic faithfully replicates it, the ground truth still labels those nodes unclustered, so any algorithm that recovers the hidden communities looks wrong.

generation · simplify

Collapse parallels, drop self-loops. Degree sequence stops being exact.

Every generator in this project ships a simple graph, so the last step is to collapse each parallel edge between \((u, v)\) to one, and drop every self-loop on \(u\). Each dropped half-edge is a unit of degree the sampler can no longer account for, so the output degree sequence ends up at-or-below the target.

The widget below replays the drop. The side panel tracks target vs achieved degree per node; a row that shrinks after the drop lost a duplicate or a self-loop.

before simplify edges: ·

top-up · match-degree

Top-up closes most of the degree deficit.

Simplify leaves several nodes short of their target degree (red tails in the side panel above). The pipeline runs a top-up pass that adds new edges between under-degree nodes, picked in a deterministic, degree-greedy order. The default keeps the block structure honest by spending edges out of a per-block-pair budget read from the reference clustering, rather than letting stubs leak across cluster boundaries. Most of the deficit clears; a few stubs may stay unplaced when every eligible partner is already a neighbour. See the degree matcher page for the full menu and how each variant handles the leftovers.

before top-up edges: ·

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at --seed 1 (post top-up):

stat	input	sbm output	note
# nodes	906	904	The block-pair-aware top-up re-attaches two of the four post-simplify isolates; the other two stay isolated because no eligible partner remains in any block-pair the input had budget for.
# clusters	87	87	Passes through the input clustering, minus size-1 clusters (the input has none).
# edges	10,429	10,155	Sampler emits 7,438; the top-up adds 2,717 edges by spending the per-block-pair budget read from the reference \(e_{rs}\), and stops with 274 stubs gridlocked when no eligible partner remains in budget.
global cc	0.548	0.557	The top-up spends budget where the input had \(e_{rs}\) left over, so most new edges land on the dense diagonals where simplify also bit. Cc lifts from the sampler-only 0.341 to a hair above the reference.
char. time	40.41	42.26	Random-walk mixing time. Top-up edges follow the input's \(e_{rs}\) shape rather than running long cross-graph shortcuts, so mixing barely shifts from the sampler-only 33.14; see caveat 3 below.
pseudo-diameter	8	9	One hop wider than the input. Top-up edges follow the input's \(e_{rs}\) shape, so they don't shorten any cross-cluster paths and the sampler's 9 stays.

The edge count matrix on dnc has 696 non-zero cells over 88 blocks. The largest one sits on the diagonal of cluster C₂ (50 nodes, 2,450 half-edges, 1,225 intra-edges). Simplify lands hardest there: 894 half-edges collide, so 447 of those 1,225 intra-edges vanish. The outlier pseudo-block (355 nodes, 278 intra-edges before simplify) is next, losing 108.

Left: \(e_{rs}\) from the profile. Right: the same matrix recomputed on the output. Bottom: per-cell loss (input minus output), log-scaled so small cells still register. Blocks are sorted by size; the outlier pseudo-block (355 nodes) sits top-left. Hover any cell for its values.

input · \(e_{rs}\) from edge_counts.csv

output · \(e_{rs}\) recomputed on edge.csv

per-cell simplify loss · input minus output, log-scaled. Darker red = more edges lost in that cell.

edges in · edges out · lost ·

caveat 1: simplify loss clusters on dense blocks

The red heatmap above is the sampler story. Parallels collide where stubs concentrate, so loss clusters on the diagonal of dense blocks and on the outlier pseudo-block; most off-diagonal cells keep their edges. On dnc, the worst-hit cell alone loses 894 half-edges out of 2,450, and the outlier block loses 108 of its 278 intra-edges. The top-up reads the input's \(e_{rs}\) and spends edges only where a block-pair still has unfilled budget, so it primarily refills diagonal cells where loss concentrated; off-diagonal cells get partial patching where their reference \(e_{rs}\) left room.

caveat 2: global cc is not a target

The micro constraint pins \(e_{rs}\) and \(k_i\) but says nothing about triangles. The sampler alone delivers cc 0.341, well below input 0.548. The block-pair-aware top-up lifts cc to 0.557 because the input's \(e_{rs}\) concentrates on dense diagonals, so most new edges land where triangle probability is already high; this is a side effect, not a target, and on inputs whose simplify deficit sits off-diagonal or whose \(e_{rs}\) is flatter, cc can still miss by a wide margin. If triangles need a real handle, ec-sbm-v1 and ec-sbm-v2 add that structure on top.

caveat 3: top-up follows \(e_{rs}\), not random walks

Two SBM runs side by side, same seed, same input:

stat	input	without top-up	with top-up
# edges	10,429	7,438	10,155
char. time	40.41	33.14	42.26
pseudo-diameter	8	9	9

The sampler-only graph mixes near the input rate (33 vs 40) and is one hop wider because it is sparser. The top-up adds 2,717 edges to whichever block-pairs the reference \(e_{rs}\) still has budget for; that follows the input's block shape rather than picking long cross-graph shortcuts, so mixing time stays near the sampler-only rate (42 vs 33) and the diameter holds at 9. The trade is \(e_{rs}\)-faithful densification (cc lifts) for no cross-cluster shape change.