network-generation · abcd+o

ABCD+o

ABCD with outliers

An explicit outlier count \(s_0\). The sampler picks \(s_0\) eligible nodes uniformly at random, pins them to \(\hat y = 0\), and lets their full weight fund the background graph. OO edges are not forbidden; they become unlikely when \(s_0 / n\) is small.

Plain ABCD has no outlier concept. Our wrapper wraps each unclustered input node in its own 1-node pseudo-cluster (the singleton outlier mode), and the sampler accepts them through admissibility fallback. ABCD+o is the paper-level fix: a single integer parameter \(s_0\) so the Julia pseudo-code treats those \(s_0\) nodes as noise without pretending they form a community.

The sampler's outlier block sits at cluster_id = 1. Its members are drawn uniformly at random from every node satisfying an upper-degree bound the paper derives; once drawn, their weight split is pinned (\(\hat y_u = 0\), \(\hat z_u = d_u\)). All their stubs fund the background graph \(G_0\); none fund any cluster graph.

Across seeds the set of chosen outliers rotates, so two runs with the same inputs typically produce different outlier picks. The 233 input nodes that our profile marked as outliers need not show up at cluster_id = 1 in the output.

Two things the model does not do. It does not forbid outlier-outlier edges: two outliers' \(z\)-stubs can pair in the background graph. And the identity of input outliers does not survive; the wrapper's profile drops their OO edges at input time only to make the target \(\xi\) reachable, not to tell the sampler which specific nodes to label as outliers.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 outliers in one combined block (with the OO edge dashed).

The 20-node synthetic this page builds on: C1 (8), C2 (6), C3 (4), plus a 2-node outlier block (light red). ABCD+o's profile drops the OO edges by default, so the (19, 20) edge below is dashed-red as a visual cue: it disappears before stage 2 sees the input. The sampler then picks its own \(s_0\) outliers in stage 2; this view shows the input partition only.

input G · combined mode · 4 clusters · OO dashed shared fixture

profiling · degree sequence

Drop OO edges first, then read per-node degrees.

ABCD+o drops OO edges before reading degrees. The reason is target-stability: the background graph's uniform pairing produces OO edges at a rate of roughly \((s_0 / n)^2\), well below what an OO-dense input would have. Keeping those OO edges in profile would set a target \(\xi\) the sampler cannot reach, and the realised \(\xi\) would drift low. Dropping OO at profile time gives the sampler a target it can hit.

On our synthetic, the edge \((19, 20)\) is the only OO edge. Dropping it reduces \(d_{19}\) and \(d_{20}\) by one each. On outlier-heavy inputs the drop can zero an outlier's degree entirely, in which case the node has no stubs and disappears from the sampler's input.

input G · dashed OO edge dropped at profile time degree.csv

profiling · cluster sizes + n_outliers

Real clusters in one file, outlier count in another.

Profile emits cluster_sizes.csv with the real clusters only (sorted size desc), plus a separate n_outliers.txt holding \(s_0\). The wrapper's stage-2 concatenates them: the outlier count is prepended to the size list so the sampler receives

\[ \mathbf{s} \;=\; (s_0,\, s_1,\, s_2,\, \ldots,\, s_\ell), \]

where \(s_0\) is the outlier block size (always first, always cluster_id = 1 in the output), \(s_1 \ge s_2 \ge \ldots\) are the real cluster sizes. The paper requires \(s_0 + \sum_{j \ge 1} s_j = n\).

On our synthetic: real sizes \([8, 6, 4]\), \(s_0 = 2\), sampler sees \([2, 8, 6, 4]\).

input G · dashed ring encloses input outliers (sampler picks its own next stage) [2, 8, 6, 4]

profiling · the mixing scalar

One number: \(\xi\), computed after drop_oo on the combined-outlier partition.

Same definition as ABCD. Treat the input outliers as a single outlier cluster and drop their internal OO edges; OC edges stay and count as inter (they cross outlier ↔ real). Let \(|E^+|\) be the edge count after dropping OO and \(|E^+_{\text{out}}|\) the count of those edges that cross any cluster boundary:

\[ \xi \;=\; \frac{|E^+_{\text{out}}|}{|E^+|}. \]

Dropping OO is what makes the singletons-and-combined framings agree: every OC edge crosses a boundary either way (singleton-to-real or outlier-to-real), and OO is the only edge type whose intra/inter label depends on the framing. With OO removed, the count is invariant.

Sampler-side, \(\xi\) is still the paper's weight-split parameter: each non-outlier gets \(\hat y_u = (1-\xi) d_u\) cluster-graph stubs and \(\hat z_u = \xi d_u\) background stubs; outliers pin to \(\hat y_u = 0\). The realised inter-cluster fraction in output depends on which \(s_0\) nodes the sampler picks as outliers (hub-heavy picks lift it; low-degree picks depress it).

input G · red edges cross cluster boundaries on the post-drop_oo graph; OO edge faded mixing_parameter.txt

why drop_oo is on by default here (and only here)

Running the Julia sampler on our 20-node synthetic across 10 seeds at \(\xi = 0.308\), 8 of 10 runs produce a single OO edge despite no OO edges being explicitly allowed in the input profile; the other 2 land 0 OO edges. Drop_oo removes those edges from the target so the discrepancy lands where it should: on the realised side, not the profiled side.

ABCD+o's \(\xi\) vs LFR's \(\mu\): not the same number

Both measure "fraction of edges crossing clusters" on the same input graph, but via different reductions. ABCD+o's \(\xi\) is a stub-weighted ratio over post-drop_oo edges, \(E^+_{\text{out}} / |E^+|\). LFR's \(\mu\) is the mean of per-node ratios, \(\text{mean}_u(d^{\text{out}}_u / d_u)\), and LFR does not drop OO edges. On this synthetic ABCD+o's \(\xi\) is · while LFR's \(\mu\) is ·. Do not swap them across models.

generation · assigning vertices to clusters

Outliers first, then ABCD's cluster walk on whatever's left.

The sampler assigns each vertex to exactly one of \(\ell + 1\) blocks: the \(\ell\) real clusters plus the outlier block at cluster_id = 1. ABCD+o runs the assignment in two passes. The first pass picks \(s_0\) outliers uniformly from a degree-eligibility set; the second pass walks the remaining \(n - s_0\) vertices into the real clusters using the ABCD admissibility rule. A node of degree \(d_u\) is eligible for the outlier pick iff

\[ d_u \;\le\; \bar L + s_0 - \frac{\bar L \, s_0}{n} - 1, \qquad \bar L \;=\; \sum_{u \in V} \min(1, \xi d_u). \]

\(\bar L\) is the paper's lower bound on the count of nodes with at least one background-graph stub: every node with \(\xi d_u \ge 1\) contributes 1, every other contributes \(\xi d_u\) in expectation. The threshold caps eligibility at degrees small enough that the residual non-outlier graph stays graphic; on most inputs every node clears it. Outliers pin to \(\hat y_u = 0\) and \(\hat z_u = d_u\) at the next stage, so their full degree funds \(G_0\) and they never appear in any cluster graph.

The remaining \(n - s_0\) vertices walk in descending \(d_u\) exactly as in plain ABCD. A vertex \(u\) is admissible for cluster \(j\) iff \(x_u \le s_j - 1\), and the admissible cluster is sampled with probability proportional to its remaining capacity. The bound \(x_u\) and the cluster-mixing probability \(\phi\) carry the outlier carve-out:

\[ x_u \;=\; \bigl\lceil (1 - \xi \phi) \, d_u \bigr\rceil, \qquad \phi \;=\; 1 - \sum_{j \in [\ell]} \left(\frac{s_j}{n - s_0}\right)^{\!2} \frac{(n - s_0)\xi}{(n - s_0)\xi + s_0}. \]

Two factors carve the outlier block out of \(\phi\). The cluster proportion is \(s_j / (n - s_0)\), reading the partition over non-outliers only since outliers cannot land in any real cluster. The \((n - s_0)\xi / ((n - s_0)\xi + s_0)\) factor is the non-outlier share of the background-pool stubs: outliers dump their full degree there while non-outliers dump only \(\xi d_u\), so the non-outlier share of a random background match shrinks and \(\phi\) shrinks with it. Setting \(s_0 = 0\) recovers ABCD's \(\phi = 1 - \sum (s_j / n)^2\) and the original bound.

The outlier-block asymmetry survives downstream: eligibility runs on raw \(d_u\) rather than the cluster-fitting bound \(x_u\), so on hub-heavy inputs the threshold sits below the largest input degrees and outliers come from mid-degree nodes; on flatter inputs every node is eligible and a hub can land in the block. Either way every outlier's full degree funds \(G_0\), so the realised \(\xi\) drifts above target in proportion to how much aggregate degree the eligible draw absorbed.

step 0 / 0

low-\(\xi\) exhausts eligibility before \(s_0\) outliers can be drawn

\(\bar L\) is bounded above by \(\sum_u \xi d_u\), so \(\bar L \to 0\) as \(\xi \to 0\) and the eligibility bound \(\bar L + s_0 - \bar L s_0/n - 1 \to s_0 - 1\). The eligible set then collapses to the nodes with \(d_u \le s_0 - 1\), and unless that set has at least \(s_0\) members the sampler cannot fill the outlier block. In the extreme \(\xi = 0\), only outliers contribute background stubs at all; a graphic residual then requires every outlier's degree to fit below \(s_0\). It is the analog of ABCD's singleton trap, the corner of parameter space where the assignment runs out of legal moves.

node identity is not preserved across the assignment

The walker treats each block's slot list as anonymous: vertex 5 may land in C1 on one seed and C2 on the next, or in the outlier block on a third. Block sizes are exact (the kernel guarantees \(s_0 + \sum_j s_j = n\)); the per-node mapping is one draw from the admissible permutations. Re-roll changes which nodes sit where, including which nodes become outliers.

generation · the weight split

Outliers pin to \(\hat y = 0\). Non-outliers split and round like ABCD.

Fix the outlier pick from the previous stage. For non-outliers, \(y_u = (1 - \xi) d_u\) continuous and \(z_u = \xi d_u\); each non-outlier's \(\hat y_u\) is stochastically rounded to an integer, and the leader of each real cluster takes the deterministic floor plus a 0-or-1 parity bump so the cluster's \(\sum \hat y_u\) is even. For outliers, the sampler overrides: \(\hat y_u = 0\) and \(\hat z_u = d_u\).

reading the chart · one column per node, cluster-order on the x-axis. Cluster-coloured stack is \(\hat y_u\); red on top is \(\hat z_u\); dashed tick is the pre-rounding intra target \(y_u = (1 - \xi) d_u\); ★ marks the cluster's leader (deterministic floor + parity bump). Nodes picked as outliers stay solid red end-to-end and never carry the leader star.

background weight · total weight · target \(\xi\) · realised ·

generation · cluster pairing

Pair \(\hat y_u\) stubs inside each real cluster.

Each real cluster graph \(G_j\) (\(j \ge 2\)) runs a configuration model on its members' \(\hat y_u\) stubs. Each node \(u\) with home cluster \(j\) contributes \(\hat y_u\) stubs as spokes around the node; they shuffle within the cluster and pair adjacent. Self-loops and parallels are flagged solid red; the next stage rewires them.

The outlier "cluster" \(G_1\) is skipped because all its members have \(\hat y_u = 0\); they only contribute to the background graph two stages on.

stubs ready step 0 / 0

generation · cluster rewire

Cut the collisions, swap into valid edges.

Cluster-graph collisions (self-loops and parallels) move to a recycle list. Each recycle edge \(\{u_1, v_1\}\) is paired with a random other cluster edge \(\{u_2, v_2\}\) to make the swap (into \(\{u_1, u_2\}\) and \(\{v_1, v_2\}\), or \(\{u_1, v_2\}\) and \(\{v_1, u_2\}\)) and accepted if both new edges are simple. The rewire loop runs until the cluster graph is collision-free or runs out of moves; whatever is still on the recycle list at the end is dropped as an intra edge, and one stub at each endpoint is forwarded to the background pool.

pre-rewire step 0 / 0

generation · background pairing

Pair background stubs across the whole vertex set, outliers included.

The background graph \(G_0\) is one global pool: every node drops its \(\hat z_u\) stubs into the same urn and the configuration model pairs them without regard for cluster membership. Outlier stubs (\(\hat z_u = d_u\)) and non-outlier stubs (\(\hat z_u = \xi d_u\)) mix freely. A bg edge that duplicates an existing cluster edge flags for rewire alongside the usual self-loops and parallels.

stubs ready step 0 / 0

generation · background rewire

Same swap, global pool now.

Same recycle-and-rewire loop as the cluster phase, but the swap partners are drawn from every bg edge instead of one cluster's edges. Each new half is checked against existing bg edges and the cluster output: if it would land as a self-loop or duplicate either set, it stays on the recycle list and remains visible as a solid-red bad edge until a later op clears it. Anything still on the recycle list when the loop stops is forwarded to the cross-cluster final swap stage; nothing is dropped here.

pre-rewire step 0 / 0

OO edges are permitted by the model

Outlier stubs are \(s_0\) nodes' worth of background-graph stubs. For small \(s_0 / n\) and moderate \(\xi\), the global \(\hat z\)-stub pool is dominated by non-outliers, and outlier-outlier pairings happen only by chance. Empirically on our 20-node synthetic at \(\xi = 0.31\), the Julia sampler produces 0 or 1 OO edges per run; 8 of 10 seeds yield exactly 1 OO edge, the other 2 yield 0.

generation · cross-cluster final swap

One last swap pass over the union.

If the background rewire loop finishes with a non-empty recycle list, a final fallback stage runs. It pulls each leftover bad edge \(\{u_1, v_1\}\) and tries to swap it against a partner \(\{u_2, v_2\}\) drawn uniformly from the union of cluster output and background output (intra + inter pooled together, outliers' bg edges included). The same two swap variants apply (\(\{u_1, u_2\} + \{v_1, v_2\}\) or \(\{u_1, v_2\} + \{v_1, u_2\}\)). Each new half is checked: if it is a self-loop or duplicates an existing edge in the union, that half goes back on the recycle list; otherwise it is added to the union and the original partner is removed. The loop runs until the recycle list is empty or stops shrinking.

Bad edges still on the recycle list when the loop stops are dropped from the output. The cross-cluster final swap is the only stage that actually drops edges; every earlier rewire either succeeds or forwards its residue.

pre-final (carrying recycle from bg) step 0 / 0

generation · degree preservation

Output degree multiset almost matches input, outlier pin and all.

Stubs only truly leave the model at the cross-cluster final swap's drop-stale step. Cluster-rewire residue forwards to the bg pool; bg-rewire residue forwards to the cross-cluster final swap; only what the final swap cannot resolve is dropped. Outliers spend all their weight on the bg pool, so their output degree should still match \(d_u\) within the same residue tolerance as any other node.

reading the chart · nodes sorted by input \(d_u\) desc. Azure bar = input degree, mint bar = output degree from the current realisation. Red-dashed rectangle fills the gap between the two when they differ.

total abs diff (this realisation) · matches exactly ·

mixing parameter · mint dashed = target \(\xi\) from the profile (post-drop_oo); azure tick = realised \(\xi = E_{\text{out}}/|E|\) on the current output. \(E_{\text{out}}\) counts edges whose endpoints sit in different real clusters, with the model-picked cluster_id=1 outliers treated as their own block.

target \(\xi\) · realised \(\xi\) · drift ·

generation · the warning branch

The sampler warns when outlier stubs dominate the external pool.

At low \(\xi\) with a large outlier block, non-outliers contribute \(\sum_u \xi d_u\) background stubs while outliers contribute \(\sum_u d_u\) (the full outlier degree). When

\[ 2 \sum_{u \,:\, \mathrm{outlier}(u)} d_u \;>\; \sum_u \hat z_u, \]

outlier stubs are more than half the external pool, so two random stubs pair outlier-with-outlier more often than outlier-with-non-outlier. The Julia sampler flags this with the stderr message outlier nodes form a community. The Python wrapper greps stderr and branches: if the warning fires, keep the cluster_id = 1 rows in com.csv; otherwise strip them.

caveat: the branch affects com.csv semantics

Any downstream tool that reads com.csv without also reading run.log cannot tell whether an outlier got lifted into cluster_id = 1 or was silently dropped. Surface the warning status alongside any visualisation of ABCD+o output.

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at --seed 1 (\(s_0 = 355\), drop_oo = true):

stat	input	abcd+o output	note
N (in edge.csv)	906	673	233 input outliers had all their edges inside the outlier pool; drop_oo zeroes their degrees and they do not appear in `edge.csv`.
edges	10,429	10,070	Off by 3.4%. Most of the drop is profile-side (OO edges removed); the remainder is rewire attrition.
mean degree	23.02	29.93	Rises because the denominator (nodes with edges) shrinks faster than the numerator.
global \(\xi\)	0.710	0.710	Target \(\xi\) after drop_oo. Matches within a stub on `dnc` because \(\mu_0 \approx 1\) (small clusters dominate) so the paper's \(\xi = \mu / \mu_0\) correction is negligible.
global cc	0.548	0.307	Not targeted. Same cc gap as ABCD.
clusters	87	87	Warning silent on this run; cluster_id = 1 rows stripped. Cluster count matches input.
com.csv rows	551	551	Same 551 clustered input nodes. The 355 sampler-picked outliers are labelled nowhere.

input degree distribution, log-log · dnc has a heavy tail: max degree 368, next 312, 256, 225, 202; median 4. ABCD+o reuses the post-drop_oo sequence verbatim (233 outlier nodes go to zero degree and drop out), so the output's degree histogram on the surviving 673 nodes mirrors the input's.

input cluster-size distribution, log-log · two piles. 355 singletons on the left. ABCD+o folds them into one cluster_id=1 mega-block of size \(s_0\) instead of 355 singletons. The right pile is 87 real clusters with sizes 2 to 52, passed through verbatim as a flat size list of length 88.

caveat 1: output outliers are not input outliers

The 355 nodes labelled cluster_id = 1 in the sampler's output are picked by the sampler, uniformly from the eligible set. They are not the same 355 input nodes the profile marked as outliers. Our wrapper's drop_oo = true affects which degrees reach the sampler, not which identities the sampler labels as outliers. Two runs with the same seed match byte-for-byte (the sampler is deterministic given the seed); changing the seed reshuffles both outlier picks and intra-cluster assignments.

caveat 2: drop_oo can delete nodes

233 of 906 nodes vanish from edge.csv on dnc: all their edges were OO in the input, drop_oo zeroes their degrees, and the sampler has no stubs for them. Any downstream pipeline that expects one row per input node in the edge list will silently break on outlier-heavy inputs under ABCD+o.

caveat 3: the "community" warning is a heuristic

The stderr message outlier nodes form a community fires on a stub-count ratio, not any detector output. The wrapper's com.csv semantics change based on that substring match. On dnc the message is silent; com.csv has 551 rows and no cluster_id = 1. Always read run.log alongside com.csv.