network-generation · abcd

ABCD

Artificial Benchmark for Community Detection

Output is a union of \(k+1\) independent random graphs: one per cluster plus a background graph over all nodes. The scalar \(\xi\) sets the share of each node's degree that goes to the background.

Given a network \(G\) with a clustering, the goal is a synthetic twin that keeps both the degree sequence and the cluster sizes verbatim while drawing every edge fresh. ABCD models the graph as a union of independent random graphs, one per cluster plus one global background, with a single scalar \(\xi\) trading degree share between the two.

Each node \(u\) has target degree \(d_u\). ABCD splits this degree: \(y_u = (1 - \xi) d_u\) funds the cluster graph of \(u\)'s home cluster; \(z_u = \xi d_u\) funds the background graph on \(V\).

A configuration model on each cluster graph produces cluster-internal edges; a configuration model on all \(z\)-stubs produces background edges that can pair any two nodes. The final graph is the union. At \(\xi = 0\) the output is disjoint cluster graphs; at \(\xi = 1\) it is a pure random graph with no signal about cluster membership. \(\xi\) is the only structural dial: triangles, clustering coefficient, and per-pair block counts are not in the contract.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 singletons.

The 20-node synthetic this page builds on: C1 (8), C2 (6), C3 (4), plus outliers 19 + 20. ABCD requires \(\sum_j s_j = n\), so our wrapper's outlier_mode=singleton wraps each unclustered node in its own 1-node cluster. The graph below shows the resulting 5-cluster partition; ABCD's later sections all operate on this view.

input G · singleton mode · 5 clusters shared fixture

profiling · degree sequence

Every node's degree, copied verbatim.

The profile is a deterministic function of the input; it does not change on re-roll. Profile writes three things. The first is degree.csv, the integer degree \(d_u\) per node, sorted desc. The Julia sampler reads the list unchanged and uses it as the exact degree target.

input G · hover a node or a degree row to link degree.csv

profiling · cluster sizes

One integer per cluster, sum to \(n\).

The second file is cluster_sizes.csv, an integer size per cluster, sorted size desc. The paper requires \(\sum_j s_j = n\), so every input node lands in exactly one cluster. Our wrapper satisfies this via outlier_mode=singleton: every unclustered input node becomes a 1-node pseudo-cluster (ID __outlier_<id>__), so the size list sums to \(n\). Real clusters sort to the top; singletons tail. The next two sections show how the sampler uses this list to assign vertices and draw intra-cluster edges; output cluster sizes match the input verbatim.

input G · dashed ring per cluster cluster_sizes.csv

profiling · the mixing scalar

One number: \(\xi\).

The third file is mixing_parameter.txt, a single float. Let \(E_{\text{out}}\) be the number of edges crossing cluster boundaries under the profile's clustering (singletons count, so outlier-incident edges are "inter" when the other endpoint is a different cluster). Then

\[ \xi \;=\; \frac{E_{\text{out}}}{|E|}. \]

The paper reads \(\xi\) as the expected fraction of each vertex's degree that the model routes to the background graph: \(\xi d_u\) background stubs, \((1 - \xi) d_u\) cluster-graph stubs. The realised inter-cluster fraction on output hits \(\xi\) only in expectation; on small graphs it drifts by a few percent seed-to-seed.

input G · red edges cross cluster boundaries (\(E_{\text{out}}\)) mixing_parameter.txt

ABCD's \(\xi\) vs LFR's \(\mu\): not the same number

Both measure "fraction of edges crossing clusters" on the same input graph, but via different reductions. ABCD's \(\xi\) is a stub-weighted ratio over all edges, \(E_{\text{out}} / |E|\). LFR's \(\mu\) is the mean of per-node ratios, \(\text{mean}_u(d^{\text{out}}_u / d_u)\). On skewed-degree inputs the two diverge: on this synthetic ABCD's \(\xi\) is · while LFR's \(\mu\) is ·. Do not swap them across models.

generation · assigning vertices to clusters

Big nodes to big clusters; small nodes fill the rest.

The sampler assigns each vertex to exactly one cluster before any edges are drawn. The paper frames this as a uniform draw from the admissible permutations; the Julia code uses a sequential proxy that walks vertices in descending \(d_u\) and at each step samples one cluster from the admissible set with probability proportional to its remaining capacity. A vertex \(u\) is admissible for cluster \(j\) iff \(x_u \le s_j - 1\), where \(x_u\) is the per-vertex bound below: the expected number of neighbours of \(u\) that will end up in \(u\)'s own cluster.

\[ x_u \;=\; \bigl\lceil (1 - \xi \phi) \, d_u \bigr\rceil, \qquad \phi \;=\; 1 - \sum_{j=1}^{k} \left(\frac{s_j}{n}\right)^{\!2}. \]

Vertices are processed by \(d_u\) desc, so admissibility only loosens as the walk proceeds: once a cluster fits a vertex, it fits every later vertex too. The sequential procedure is not equivalent to a uniform draw over admissible permutations: earlier high-\(d_u\) picks bias which clusters retain capacity for later vertices.

\(\phi\) is the probability that two nodes drawn uniformly at random fall in different clusters, with \(\phi \in [0, 1)\): one mega-cluster gives \(\phi = 0\), and evenly-sized clusters push \(\phi \to 1\). It enters the bound because not all of \(u\)'s \(\xi d_u\) background-routed half-edges actually leave the cluster; only the \(\phi\) fraction does, and the remaining \((1-\phi)\xi d_u\) lands back inside \(u\)'s own cluster by chance. \(u\)'s expected intra-degree is therefore \((1-\xi)d_u + (1-\phi)\xi d_u = (1-\xi\phi)d_u\), and \(x_u\) is the integer ceiling of that quantity.

The inequality \(\phi < 1\) is what keeps the post-split degree inside the cluster: since \(1 - \xi\phi \ge 1 - \xi\), the bound \(x_u \ge \lceil y_u \rceil \ge \hat y_u\), where \(y_u = (1-\xi)d_u\) is the intra-funded degree and \(\hat y_u\) is its integer realisation after the stochastic rounding the next stage performs. Combined with \(x_u \le s_j - 1\) under admissibility, this gives \(\hat y_u \le s_j - 1\), so the node's intra degree fits inside its cluster's other-member count.

singletons break admissibility for any positive-degree vertex

A singleton has \(s_j = 1\), so admissibility requires \(x_u \le 0\); but \(\lceil (1-\xi\phi) d_u \rceil \ge 1\) whenever \(d_u \ge 1\). On inputs with no degree-0 nodes, every positive-degree vertex fails admissibility for every singleton. When real clusters fill up before all small-degree vertices are placed, the sampler falls back to a leftover singleton, prints a warning, and moves on. The fallback violates the bound, so \(\hat y_u\) can exceed \(s_j - 1\); the cluster's configuration model then emits self-loops and parallels for the rewire stage to swap out. Edges the rewire cannot resolve drop, and the realised \(\xi\) drifts above target in proportion to how many singletons absorbed non-zero-degree vertices.

step 0 / 20

outliers in ABCD are a wrapper idea, not a model parameter

The paper has no notion of "outlier"; it takes a cluster-size list with \(\sum_j s_j = n\) and asks no further questions. Our wrapper handles outliers by wrapping each unclustered input node in its own 1-node pseudo-cluster so the sum identity holds, then strips the singletons out of the output community list at the end. ABCD+o is the extended model that promotes outliers into a proper model-level parameter \(s_0\).

generation · the degree split

Spend part of each degree on intra-cluster edges.

With the cluster assignment fixed, each vertex sets aside a portion of its degree for intra-cluster edges. The target portion is

\[ y_u \;=\; (1 - \xi) \, d_u, \]

and \(y_u\) funds the cluster graph \(G_{c(u)}\). Both that cluster graph and the later background graph are sampled by the configuration model (our default), which pairs integer stub counts exactly, rather than Chung-Lu, which only matches the degree sequence in expectation. Configuration-model pairing requires an integer \(\hat y_u\), so each cluster member rounds \(y_u\) to its floor and adds a Bernoulli bump with probability equal to the fractional part:

\[ \hat y_u \;=\; \lfloor y_u \rfloor + \mathbb{1}\{U_u < y_u - \lfloor y_u \rfloor\}, \qquad U_u \sim \text{Uniform}(0, 1). \]

The leader (the cluster's largest-degree member, lowest vertex id on ties) is handled separately. It takes the deterministic floor \(\lfloor y_\ell \rfloor\) and adds 0 or 1 by parity: the bump fires when the sum of non-leader \(\hat y_u\) is odd and the floor is even, or vice versa, so the cluster's total internal stub count ends up even. Without that fixup, the cluster's configuration model would be one stub short of a clean pairing.

The leftover degree, \(\hat z_u = d_u - \hat y_u\), is the portion routed to the background graph. The background pool inherits any rewire residue from the cluster phase too, but that comes later; here it is enough to read \(\hat z_u\) off as the complement of \(\hat y_u\).

background degree · total degree · target \(\xi\) · realised \(\sum \hat z / \sum w\) ·

generation · cluster pairing

Pair \(\hat y_u\) stubs inside each cluster.

Each cluster graph \(G_j\) is a configuration model on its members' \(\hat y_u\) stubs. Each node \(u\) with \(c(u) = j\) contributes \(\hat y_u\) stubs as spokes around the node; they shuffle within the cluster and pair adjacent. Self-loops and parallels are flagged solid red; the next stage rewires them.

A note on layout. This page splits the inner loop by op-type: this stage walks every cluster's pairs, then the next stage walks every cluster's rewires. The Julia code finishes each cluster's pair-then-rewire pass before moving on, so the kernel narrative is by-cluster. Output is the same; the by-op-type split keeps the pairing and rewire mechanisms each in their own focused walker.

stubs ready step 0 / 0

next: one pair · canvas arrows: jump cluster.

generation · cluster rewire

Cut the collisions, swap into valid edges.

Cluster-graph collisions (self-loops and parallels) move to a recycle list. Each recycle edge \(\{u_1, v_1\}\) is paired with a random other cluster edge \(\{u_2, v_2\}\) to make the swap (into \(\{u_1, u_2\}\) and \(\{v_1, v_2\}\), or \(\{u_1, v_2\}\) and \(\{v_1, u_2\}\)) and accepted if both new edges are simple. The rewire loop runs until the cluster graph is collision-free or runs out of moves; whatever is still on the recycle list at the end is dropped as an intra edge.

The dropped edges are not gone for good. For each leftover collision \(\{u, v\}\), one stub at each endpoint never paired inside the cluster. Those stubs are forwarded to the background pool, joining whatever the bg-split already allocated, and re-pair anonymously in the next stage. The specific intra edge is gone; the stubs survive and may form new bg edges with anyone.

pre-rewire (collisions visible) step 0 / 2

generation · background pairing

Pair background stubs across the whole vertex set.

The background graph \(G_0\) is one global pool: every node drops its background stubs into the same urn and the configuration model pairs them without regard for cluster membership. Two members of the same cluster can pair; the resulting edge counts as intra-cluster on the output even though it came from the background. A bg edge that duplicates an existing cluster edge flags for rewire alongside the usual self-loops and parallels.

background stubs ready step 0 / 0

generation · background rewire

Same swap, global pool now.

Same recycle-and-rewire loop as the cluster phase, but the swap partners are drawn from every bg edge instead of one cluster's edges. Each new half is checked against the existing bg edges and the cluster output: if it would land as a self-loop or duplicate either set, it stays on the recycle list and remains visible as a solid-red bad edge until a later op clears it. Anything still on the recycle list when the loop stops is forwarded to the cross-cluster final swap stage; nothing is dropped here.

pre-rewire (collisions visible) step 0 / 2

intra-cluster edges can come from either graph

The cluster graph \(G_j\) sees only its members' intra stubs, so every edge of \(G_j\) is intra-cluster by construction. But \(G_0\) pairs background stubs across the whole vertex set, and sometimes pairs two members of the same cluster. Those extra intra-cluster edges count toward the cluster's internal density; they are why the paper shows, for the global variant, a small upward drift of intra-cluster density with cluster size.

caveat: \(\xi\) is the only structural dial

ABCD does not target an edge-count matrix. If the input has a block pair \((r, s)\) with 200 crossing edges and \((r, t)\) with 20, the sampler only sees each node's background stubs and pools them globally; \((r, s)\) and \((r, t)\) end up with the count dictated by random pairing weighted by endpoint degrees, not the empirical figures. SBM and the EC-SBM family honour per-pair block counts; ABCD does not.

generation · cross-cluster final swap

One last swap pass over the union.

If the background rewire loop finishes with a non-empty recycle list, a final fallback stage runs. It pulls each leftover bad edge \(\{u_1, v_1\}\) and tries to swap it against a partner \(\{u_2, v_2\}\) drawn uniformly from the union of cluster output and background output (intra + inter pooled together). The same two swap variants apply (\(\{u_1, u_2\} + \{v_1, v_2\}\) or \(\{u_1, v_2\} + \{v_1, u_2\}\)). Each new half is checked: if it is a self-loop or duplicates an existing edge in the union, that half goes back on the recycle list; otherwise it is added to the union and the original partner is removed. The loop runs until the recycle list is empty or stops shrinking.

Whatever bad edges remain after the loop are dropped from the output. This is the only place in the pipeline where a residue actually leaves the model.

pre-final (carrying recycle from bg) step 0 / 0

generation · degree preservation

Output degree multiset almost matches input.

Stubs only truly leave the model at the cross-cluster final swap's drop-stale step. Cluster-rewire residue forwards to the bg pool; bg-rewire residue forwards to the cross-cluster final swap; only what the final swap cannot resolve is dropped.

reading the chart · nodes sorted by input \(d_u\) descending. Azure bar = input degree, mint bar = output degree from the current realisation. Red-dashed rectangle fills the gap between the two when they differ.

total abs diff (this seed) · matches exactly ·

mixing parameter · mint dashed = target \(\xi\) from the profile; azure tick = realised \(\xi = E_{\text{out}}/|E|\) on the current output. \(E_{\text{out}}\) counts edges whose endpoints sit in different LIVE clusters (singletons count as their own cluster).

target \(\xi\) · realised \(\xi\) · drift ·

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at --seed 1, every input outlier wrapped as its own singleton cluster:

stat	input	abcd output	note
# nodes	906	906	Every node appears in `edge.csv`; singleton outliers participate too.
# clusters	87	87	Preserved after `drop_singleton_clusters` strips the 355 outlier singletons.
# edges	10,429	10,150	Within 2.7%. Most edges are preserved; the cross-cluster final swap drops what no rewire stage could resolve.
global \(\xi\)	0.718	0.715	Hits the target within a stub. The target is high because 355 of 906 input nodes are outliers and each sits in its own 1-node pseudo-cluster, so every outlier-incident edge is inter by definition.
global cc	0.548	0.307	Not targeted. Configuration-model pairing inside a cluster distributes intra-edges uniformly across admissible pairs; triangles happen only by coincidence.
char. time	40.41	1.95	Random-walk mixing time. Background pairing wires the high-degree hubs across the whole vertex set, so the output mixes in roughly 1/20 of the input rate.
pseudo-diameter	8	5	Same hub effect: bg edges shortcut walks across cluster boundaries, so most pairs are within five hops.

The degree multiset and the cluster-size multiset pass through verbatim (modulo a handful of dropped edges in the cross-cluster final swap). Triangle-driven structure does not: \(cc\) lands at 0.307 against the 0.548 input. Random-walk dynamics shift in the same direction as SBM's top-up phase, but more sharply: bg pairing alone funds about \(\xi |E|\) cross-cluster edges, all drawn between high-degree hubs by stub probability, so the output is far better connected than the input even before any rewire.

input degree distribution, log-log · dnc has a heavy tail: max degree 368, next 312, 256, 225, 202; median 4. ABCD reuses this sequence verbatim, so the output's degree histogram is visually indistinguishable from the input's.

input cluster-size distribution, log-log · two piles. 355 singletons on the left (the outlier mass); 87 real clusters with sizes 2 to 52 on the right. ABCD under outlier_mode=singleton passes both piles through as one flat size list of length 442.

caveat 1: the cc gap is the price of "only \(\xi\)"

dnc's input global cc is 0.548. ABCD lands at 0.307. The gap is what "configuration model inside each cluster" produces on an input with structured triangles: the sampler has no way to preserve them. EC-SBM v2's \(K_{k+1}\) core and nPSO's hyperbolic embedding trade some of ABCD's tractability for a handle on clustering.

caveat 2: node identity is not preserved

The sampler sorts the profile's degrees by descending \(d_u\) and re-assigns them to clusters uniformly over admissible permutations. "Node 5" in the output is not node 5 in the input; the two are unrelated. What is preserved across every seed: the degree multiset (up to rewire drift) and the cluster size multiset (exact).