ABCD
Artificial Benchmark for Community Detection
Output is a union of \(k+1\) independent random graphs: one per cluster plus a background graph over all nodes. The scalar \(\xi\) sets the share of each node's degree that goes to the background.
Given a network \(G\) with a clustering, the goal is a synthetic twin that keeps both the degree sequence and the cluster sizes verbatim while drawing every edge fresh. ABCD models the graph as a union of independent random graphs, one per cluster plus one global background, with a single scalar \(\xi\) trading degree share between the two.
Each node \(u\) has target degree \(d_u\). ABCD splits this degree: \(y_u = (1 - \xi) d_u\) funds the cluster graph of \(u\)'s home cluster; \(z_u = \xi d_u\) funds the background graph on \(V\).
A configuration model on each cluster graph produces cluster-internal edges; a configuration model on all \(z\)-stubs produces background edges that can pair any two nodes. The final graph is the union. At \(\xi = 0\) the output is disjoint cluster graphs; at \(\xi = 1\) it is a pure random graph with no signal about cluster membership. \(\xi\) is the only structural dial: triangles, clustering coefficient, and per-pair block counts are not in the contract.
Shared input graph: 18 clustered + 2 singletons.
The 20-node synthetic this page builds on: C1 (8), C2 (6), C3 (4), plus outliers 19 + 20. ABCD requires \(\sum_j s_j = n\), so our wrapper's outlier_mode=singleton wraps each unclustered node in its own 1-node cluster. The graph below shows the resulting 5-cluster partition; ABCD's later sections all operate on this view.
Every node's degree, copied verbatim.
The profile is a deterministic function of the input; it does not change on re-roll. Profile writes three things. The first is degree.csv, the integer degree \(d_u\) per node, sorted desc. The Julia sampler reads the list unchanged and uses it as the exact degree target.
One integer per cluster, sum to \(n\).
The second file is cluster_sizes.csv, an integer size per cluster, sorted size desc. The paper requires \(\sum_j s_j = n\), so every input node lands in exactly one cluster. Our wrapper satisfies this via outlier_mode=singleton: every unclustered input node becomes a 1-node pseudo-cluster (ID __outlier_<id>__), so the size list sums to \(n\). Real clusters sort to the top; singletons tail. The next two sections show how the sampler uses this list to assign vertices and draw intra-cluster edges; output cluster sizes match the input verbatim.
One number: \(\xi\).
The third file is mixing_parameter.txt, a single float. Let \(E_{\text{out}}\) be the number of edges crossing cluster boundaries under the profile's clustering (singletons count, so outlier-incident edges are "inter" when the other endpoint is a different cluster). Then
The paper reads \(\xi\) as the expected fraction of each vertex's degree that the model routes to the background graph: \(\xi d_u\) background stubs, \((1 - \xi) d_u\) cluster-graph stubs. The realised inter-cluster fraction on output hits \(\xi\) only in expectation; on small graphs it drifts by a few percent seed-to-seed.
Both measure "fraction of edges crossing clusters" on the same input graph, but via different reductions. ABCD's \(\xi\) is a stub-weighted ratio over all edges, \(E_{\text{out}} / |E|\). LFR's \(\mu\) is the mean of per-node ratios, \(\text{mean}_u(d^{\text{out}}_u / d_u)\). On skewed-degree inputs the two diverge: on this synthetic ABCD's \(\xi\) is · while LFR's \(\mu\) is ·. Do not swap them across models.
Big nodes to big clusters; small nodes fill the rest.
The sampler assigns each vertex to exactly one cluster before any edges are drawn. The paper frames this as a uniform draw from the admissible permutations; the Julia code uses a sequential proxy that walks vertices in descending \(d_u\) and at each step samples one cluster from the admissible set with probability proportional to its remaining capacity. A vertex \(u\) is admissible for cluster \(j\) iff \(x_u \le s_j - 1\), where \(x_u\) is the per-vertex bound below: the expected number of neighbours of \(u\) that will end up in \(u\)'s own cluster.
\[ x_u \;=\; \bigl\lceil (1 - \xi \phi) \, d_u \bigr\rceil, \qquad \phi \;=\; 1 - \sum_{j=1}^{k} \left(\frac{s_j}{n}\right)^{\!2}. \]Vertices are processed by \(d_u\) desc, so admissibility only loosens as the walk proceeds: once a cluster fits a vertex, it fits every later vertex too. The sequential procedure is not equivalent to a uniform draw over admissible permutations: earlier high-\(d_u\) picks bias which clusters retain capacity for later vertices.
\(\phi\) is the probability that two nodes drawn uniformly at random fall in different clusters, with \(\phi \in [0, 1)\): one mega-cluster gives \(\phi = 0\), and evenly-sized clusters push \(\phi \to 1\). It enters the bound because not all of \(u\)'s \(\xi d_u\) background-routed half-edges actually leave the cluster; only the \(\phi\) fraction does, and the remaining \((1-\phi)\xi d_u\) lands back inside \(u\)'s own cluster by chance. \(u\)'s expected intra-degree is therefore \((1-\xi)d_u + (1-\phi)\xi d_u = (1-\xi\phi)d_u\), and \(x_u\) is the integer ceiling of that quantity.
The inequality \(\phi < 1\) is what keeps the post-split degree inside the cluster: since \(1 - \xi\phi \ge 1 - \xi\), the bound \(x_u \ge \lceil y_u \rceil \ge \hat y_u\), where \(y_u = (1-\xi)d_u\) is the intra-funded degree and \(\hat y_u\) is its integer realisation after the stochastic rounding the next stage performs. Combined with \(x_u \le s_j - 1\) under admissibility, this gives \(\hat y_u \le s_j - 1\), so the node's intra degree fits inside its cluster's other-member count.
A singleton has \(s_j = 1\), so admissibility requires \(x_u \le 0\); but \(\lceil (1-\xi\phi) d_u \rceil \ge 1\) whenever \(d_u \ge 1\). On inputs with no degree-0 nodes, every positive-degree vertex fails admissibility for every singleton. When real clusters fill up before all small-degree vertices are placed, the sampler falls back to a leftover singleton, prints a warning, and moves on. The fallback violates the bound, so \(\hat y_u\) can exceed \(s_j - 1\); the cluster's configuration model then emits self-loops and parallels for the rewire stage to swap out. Edges the rewire cannot resolve drop, and the realised \(\xi\) drifts above target in proportion to how many singletons absorbed non-zero-degree vertices.
The paper has no notion of "outlier"; it takes a cluster-size list with \(\sum_j s_j = n\) and asks no further questions. Our wrapper handles outliers by wrapping each unclustered input node in its own 1-node pseudo-cluster so the sum identity holds, then strips the singletons out of the output community list at the end. ABCD+o is the extended model that promotes outliers into a proper model-level parameter \(s_0\).
Spend part of each degree on intra-cluster edges.
With the cluster assignment fixed, each vertex sets aside a portion of its degree for intra-cluster edges. The target portion is
\[ y_u \;=\; (1 - \xi) \, d_u, \]and \(y_u\) funds the cluster graph \(G_{c(u)}\). Both that cluster graph and the later background graph are sampled by the configuration model (our default), which pairs integer stub counts exactly, rather than Chung-Lu, which only matches the degree sequence in expectation. Configuration-model pairing requires an integer \(\hat y_u\), so each cluster member rounds \(y_u\) to its floor and adds a Bernoulli bump with probability equal to the fractional part:
\[ \hat y_u \;=\; \lfloor y_u \rfloor + \mathbb{1}\{U_u < y_u - \lfloor y_u \rfloor\}, \qquad U_u \sim \text{Uniform}(0, 1). \]The leader (the cluster's largest-degree member, lowest vertex id on ties) is handled separately. It takes the deterministic floor \(\lfloor y_\ell \rfloor\) and adds 0 or 1 by parity: the bump fires when the sum of non-leader \(\hat y_u\) is odd and the floor is even, or vice versa, so the cluster's total internal stub count ends up even. Without that fixup, the cluster's configuration model would be one stub short of a clean pairing.
The leftover degree, \(\hat z_u = d_u - \hat y_u\), is the portion routed to the background graph. The background pool inherits any rewire residue from the cluster phase too, but that comes later; here it is enough to read \(\hat z_u\) off as the complement of \(\hat y_u\).
Pair \(\hat y_u\) stubs inside each cluster.
Each cluster graph \(G_j\) is a configuration model on its members' \(\hat y_u\) stubs. Each node \(u\) with \(c(u) = j\) contributes \(\hat y_u\) stubs as spokes around the node; they shuffle within the cluster and pair adjacent. Self-loops and parallels are flagged solid red; the next stage rewires them.
A note on layout. This page splits the inner loop by op-type: this stage walks every cluster's pairs, then the next stage walks every cluster's rewires. The Julia code finishes each cluster's pair-then-rewire pass before moving on, so the kernel narrative is by-cluster. Output is the same; the by-op-type split keeps the pairing and rewire mechanisms each in their own focused walker.
Cut the collisions, swap into valid edges.
Cluster-graph collisions (self-loops and parallels) move to a recycle list. Each recycle edge \(\{u_1, v_1\}\) is paired with a random other cluster edge \(\{u_2, v_2\}\) to make the swap (into \(\{u_1, u_2\}\) and \(\{v_1, v_2\}\), or \(\{u_1, v_2\}\) and \(\{v_1, u_2\}\)) and accepted if both new edges are simple. The rewire loop runs until the cluster graph is collision-free or runs out of moves; whatever is still on the recycle list at the end is dropped as an intra edge.
The dropped edges are not gone for good. For each leftover collision \(\{u, v\}\), one stub at each endpoint never paired inside the cluster. Those stubs are forwarded to the background pool, joining whatever the bg-split already allocated, and re-pair anonymously in the next stage. The specific intra edge is gone; the stubs survive and may form new bg edges with anyone.
Pair background stubs across the whole vertex set.
The background graph \(G_0\) is one global pool: every node drops its background stubs into the same urn and the configuration model pairs them without regard for cluster membership. Two members of the same cluster can pair; the resulting edge counts as intra-cluster on the output even though it came from the background. A bg edge that duplicates an existing cluster edge flags for rewire alongside the usual self-loops and parallels.
Same swap, global pool now.
Same recycle-and-rewire loop as the cluster phase, but the swap partners are drawn from every bg edge instead of one cluster's edges. Each new half is checked against the existing bg edges and the cluster output: if it would land as a self-loop or duplicate either set, it stays on the recycle list and remains visible as a solid-red bad edge until a later op clears it. Anything still on the recycle list when the loop stops is forwarded to the cross-cluster final swap stage; nothing is dropped here.
The cluster graph \(G_j\) sees only its members' intra stubs, so every edge of \(G_j\) is intra-cluster by construction. But \(G_0\) pairs background stubs across the whole vertex set, and sometimes pairs two members of the same cluster. Those extra intra-cluster edges count toward the cluster's internal density; they are why the paper shows, for the global variant, a small upward drift of intra-cluster density with cluster size.
ABCD does not target an edge-count matrix. If the input has a block pair \((r, s)\) with 200 crossing edges and \((r, t)\) with 20, the sampler only sees each node's background stubs and pools them globally; \((r, s)\) and \((r, t)\) end up with the count dictated by random pairing weighted by endpoint degrees, not the empirical figures. SBM and the EC-SBM family honour per-pair block counts; ABCD does not.
One last swap pass over the union.
If the background rewire loop finishes with a non-empty recycle list, a final fallback stage runs. It pulls each leftover bad edge \(\{u_1, v_1\}\) and tries to swap it against a partner \(\{u_2, v_2\}\) drawn uniformly from the union of cluster output and background output (intra + inter pooled together). The same two swap variants apply (\(\{u_1, u_2\} + \{v_1, v_2\}\) or \(\{u_1, v_2\} + \{v_1, u_2\}\)). Each new half is checked: if it is a self-loop or duplicates an existing edge in the union, that half goes back on the recycle list; otherwise it is added to the union and the original partner is removed. The loop runs until the recycle list is empty or stops shrinking.
Whatever bad edges remain after the loop are dropped from the output. This is the only place in the pipeline where a residue actually leaves the model.
Output degree multiset almost matches input.
Stubs only truly leave the model at the cross-cluster final swap's drop-stale step. Cluster-rewire residue forwards to the bg pool; bg-rewire residue forwards to the cross-cluster final swap; only what the final swap cannot resolve is dropped.
What you get on the shipped example.
Default run on dnc + sbm-flat-best+cc at --seed 1, every input outlier wrapped as its own singleton cluster:
| stat | input | abcd output | note |
|---|---|---|---|
| # nodes | 906 | 906 | Every node appears in edge.csv; singleton outliers participate too. |
| # clusters | 87 | 87 | Preserved after drop_singleton_clusters strips the 355 outlier singletons. |
| # edges | 10,429 | 10,150 | Within 2.7%. Most edges are preserved; the cross-cluster final swap drops what no rewire stage could resolve. |
| global \(\xi\) | 0.718 | 0.715 | Hits the target within a stub. The target is high because 355 of 906 input nodes are outliers and each sits in its own 1-node pseudo-cluster, so every outlier-incident edge is inter by definition. |
| global cc | 0.548 | 0.307 | Not targeted. Configuration-model pairing inside a cluster distributes intra-edges uniformly across admissible pairs; triangles happen only by coincidence. |
| char. time | 40.41 | 1.95 | Random-walk mixing time. Background pairing wires the high-degree hubs across the whole vertex set, so the output mixes in roughly 1/20 of the input rate. |
| pseudo-diameter | 8 | 5 | Same hub effect: bg edges shortcut walks across cluster boundaries, so most pairs are within five hops. |
The degree multiset and the cluster-size multiset pass through verbatim (modulo a handful of dropped edges in the cross-cluster final swap). Triangle-driven structure does not: \(cc\) lands at 0.307 against the 0.548 input. Random-walk dynamics shift in the same direction as SBM's top-up phase, but more sharply: bg pairing alone funds about \(\xi |E|\) cross-cluster edges, all drawn between high-degree hubs by stub probability, so the output is far better connected than the input even before any rewire.
dnc has a heavy tail: max degree 368, next 312, 256, 225, 202; median 4. ABCD reuses this sequence verbatim, so the output's degree histogram is visually indistinguishable from the input's.outlier_mode=singleton passes both piles through as one flat size list of length 442.dnc's input global cc is 0.548. ABCD lands at 0.307. The gap is what "configuration model inside each cluster" produces on an input with structured triangles: the sampler has no way to preserve them. EC-SBM v2's \(K_{k+1}\) core and nPSO's hyperbolic embedding trade some of ABCD's tractability for a handle on clustering.
The sampler sorts the profile's degrees by descending \(d_u\) and re-assigns them to clusters uniformly over admissible permutations. "Node 5" in the output is not node 5 in the input; the two are unrelated. What is preserved across every seed: the degree multiset (up to rewire drift) and the cluster size multiset (exact).