network-generation · lfr

LFR

Lancichinetti–Fortunato–Radicchi benchmark

Fit one power law to the degree distribution, another to the cluster sizes. Resample both from scratch. Configuration model per cluster + global rewire to hit the per-node mixing parameter \(\mu\). The long-standing community-detection benchmark.

LFR's design choice is the opposite of ABCD's. ABCD takes your degree sequence and cluster sizes verbatim; LFR refuses to. It assumes the shapes are power laws and distils your input to two scalar exponents plus a handful of range parameters.

Stage 2 resamples new degrees from a truncated power law with the fitted exponent and range, and new cluster sizes from a second power law under LFR's size-floor constraints. The output graph shares statistics with the input but not identity: node 5 in the output is not node 5 in the input; cluster 3 is not cluster 3.

Mixing is controlled by a scalar \(\mu\) per node: LFR tries to arrange each node's neighbours so \(\mu_i = d^{\text{out}}_i / d_i \approx \mu\). A several-hundred-rewire loop reshuffles edge endpoints (never nodes) until each node's internal/external split is within a small tolerance of \(\mu\). The paper's Fig. 3 shows the resulting \(\mu_i\) distribution as bell-shaped around \(\mu\), not a delta.

LFR targets graphs with controlled degree + size distributions, sweepable across \(\mu\). ABCD fits when the input itself is the target. The two models share a configuration-model-plus-rewire backbone but differ on what counts as input.

profiling · outliers as singletons

Shared input graph: 18 clustered + 2 singletons.

The 20-node synthetic LFR profiles to derive its scalars: C1 (8), C2 (6), C3 (4), plus outliers 19 + 20 each in their own 1-node cluster. Under the wrapper's outlier_mode=singleton, the cluster-size list reads [8, 6, 4, 1, 1]; the size-1 entries are below LFR's hard \(c_{\min} = 3\) floor and will not survive the resample.

input G · singleton mode · 5 clusters shared fixture

profiling · degree distribution

Extract the shape; throw away the identity.

Profile writes degree.csv (per-node integer degrees) the same way ABCD does, but LFR's stage 2 reads only four derived scalars from it: \(N\), \(\langle k \rangle\), \(k_{\max}\), and the power-law exponent \(t_1\) from powerlaw.Fit with auto-selected \(x_{\min}\). The sequence itself is discarded.

On the synthetic the fit is coarse because only 20 points feed it; the curve below shows the degree histogram and the fitted \(k^{-t_1}\) line. On dnc the fit lands at \(t_1 \approx 1.45\) with auto \(x_{\min} = 1\), which is a poor fit by the strict-KS test but the best the powerlaw package can do.

reading the chart · dots are the input degree histogram (log-log axes); azure line is the fitted \(k^{-t_1}\) power law. The tail shape matters for the stage-2 resampler; the specific counts do not.

profiling · cluster sizes

A second power law, with a size floor.

Profile writes cluster_sizes.csv the same way ABCD does. Stage 2 derives \(c_{\min}, c_{\max}\), and the exponent \(t_2\) fitted with xmin = max(min(cluster_sizes), 3). The floor of 3 is a hard LFR requirement: the C++ binary refuses to build clusters smaller than 3. If the input has size-1 or size-2 clusters (such as the wrapper's singleton outliers), the floor absorbs them; the fitted exponent is estimated only from sizes \(\ge 3\).

On the synthetic: cluster_sizes = [8, 6, 4, 1, 1] → \(c_{\min} = 3\) (floor applies), \(c_{\max} = 8\), \(t_2 \approx 2.23\). On dnc: 442 entries ending in 355 singletons → \(c_{\min} = 3\), \(c_{\max} = 52\), \(t_2 \approx 2.01\).

reading the chart · dots are the input cluster-size histogram (log-log). Singletons (size 1) on the synthetic and on dnc appear in red; they are below the fit's \(x_{\min}\) floor and do not influence \(t_2\). The fit line starts at \(c_{\min}\).

profiling · the mixing scalar

One number: \(\mu\) (node-mean, not stub-weighted).

LFR writes mixing_parameter.txt, a single float. Unlike ABCD's stub-weighted \(\xi\), LFR uses the mean of per-node ratios:

\[ \mu \;=\; \frac{1}{|V|} \sum_{u \in V} \frac{d^{\text{out}}_u}{d_u}. \]

Each node contributes equally regardless of degree. A degree-1 node with one inter-cluster edge contributes \(1/1 = 1\); a degree-1000 hub with 500 inter-cluster edges contributes \(500/1000 = 0.5\). On skewed degrees the two conventions drift by a few percent. On dnc LFR's \(\mu = 0.686\) while ABCD's \(\xi = 0.718\); they are close but not equal.

N	·	Vertex count.
\(\langle k \rangle\)	·	Mean degree. Passed to the binary via `-k`.
\(k_{\max}\)	·	Maximum degree. Caps the truncated power-law sampler.
\(c_{\min}, c_{\max}\)	·	Size bounds, \(c_{\min}\) clipped at 3.
\(t_1\)	·	Degree power-law exponent from `powerlaw.Fit` auto \(x_{\min}\).
\(t_2\)	·	Cluster-size power-law exponent with \(x_{\min} = c_{\min}\).
\(\mu\)	·	Mean per-node out-ratio. This is the target the rewire loop chases.

LFR's \(\mu\) vs ABCD's \(\xi\): same phrase, different number

ABCD's \(\xi\) sums across stubs then divides: \(\xi = (\sum_u d^{\text{out}}_u) / (\sum_u d_u)\). LFR's \(\mu\) takes per-node ratios then averages. On symmetric or uniform-degree graphs the two match; on heavy-tail graphs like dnc they diverge by a few percent. Neither generator accepts the other's convention; do not swap them.

generation · resample degrees and cluster sizes

Draw fresh sequences from the fitted power laws.

The C++ binary samples a new degree sequence from \(P(k) \propto k^{-t_1}\) truncated to \([1, k_{\max}]\) with mean constrained to \(\langle k \rangle\), and a new cluster-size list from \(P(c) \propto c^{-t_2}\) with floor \(c_{\min}\) and sum constrained to \(N\). The upper cap \(c_{\max}\) is best-effort: when the fitted exponent is poor, the binary's truncation drifts and output sizes can exceed \(c_{\max}\). Both samples differ from the input.

Press re-roll sequences to draw a new pair. The bars below overlay input (azure) vs resampled (plum) for each rank position. On the synthetic the resampled degree sequence tends to have a flatter tail: with only 20 points, the rejection sampler has little room to reproduce the input's heavy hubs.

degree sequence · rank on x-axis (sorted degree desc), degree on y-axis. Azure = input; plum = resampled.

cluster sizes · rank on x-axis. The floor \(c_{\min} = 3\) caps the minimum; input singletons (size 1) do not survive.

generation · node-to-cluster assignment

Iterate until no homeless node remains.

The paper's step 4: walk the degree sequence and place each node in a random community whose remaining slot count exceeds the node's internal degree \(d^{\text{int}} = (1-\mu) d\). If the binary fails to place a node within \(3N\) retries, it calls change_community_size to reshuffle the cluster-size sequence and restarts the assignment from scratch. Most realistic parameter settings converge on the first attempt.

Step through the iterations below. Home position is random each time; re-roll draws a new permutation.

iter 0 · all nodes homeless step 0 / 4

generation · configuration model + \(\mu\) rewire

Configuration model to seed, targeted rewire to match \(\mu_i\).

With assignment fixed, the binary builds the edge set in two halves: a per-cluster configuration model on the internal-degree sub-sequence in each cluster, then a cross-cluster configuration model on the leftover external stubs. After the seed, a mate-rewire phase walks every same-cluster external edge and swaps an endpoint with a random outside node, dragging each node's \(\mu_i\) toward the target \(\mu\). Every swap preserves node degrees, only re-pairing endpoints.

The walker below decomposes the seed phase: each step pairs two stubs uniformly at random; collisions (self-loops, parallels) flag red-dashed and feed a 2-opt cleanup pass that swaps endpoints with a random valid edge. Highlighted nodes mark the up to 4 endpoints touched by each swap; everything else dims. The distribution of \(\mu_i\) at the end is bell-shaped around \(\mu\), not a delta: small-degree nodes have few possible ratios (degree-1 only 0 or 1; degree-2 only 0, 0.5, 1) so they cluster at discrete points. The bars below the walker plot the \(\mu_i\) distribution from a finalised sample; the slider moves the target \(\mu\), re-roll resamples the edge generation.

step 0 · stubs ready, no edges step 0 / 0

reading the chart · x-axis is \(\mu_i\), y-axis is count. Bars are the observed per-node ratios; the dashed vertical line is the target \(\mu\). On 20 nodes the distribution is jagged because each node contributes one of only a few possible rationals.

target \(\mu\) · mean achieved · within ±0.1 ·

Before the mate-rewire kicks in, each cluster runs its own configuration-model build via buildSubgraph: a deterministic multimap seed places intra-cluster edges first, then 10 randomization passes 4-way-swap the local adjacency, and finally a multi-edge rewire breaks any duplicates that surface when each cluster is merged into the global edge set. The walker below steps through every Phase-2 swap and every Phase-3 multi-edge rewire across all clusters in cluster order. Each step boldfaces the up to 4 endpoints touched, dims everything else, then animates the cut + place.

step 0 · per-cluster build init step 0 / 0

The mate-rewire phase is what turns LFR's μ from a target into the bell-shaped distribution above. After per-cluster internal edges are placed, the binary builds a global external-edge config model and runs a phase-3 loop: for any same-cluster external edge, pick a random outside node, do a 4-way swap that pulls one endpoint out of the cluster. The walker below steps op-by-op through that loop on the current sample. Per-cluster internal edges render as faded backdrop; mate edges still inside a cluster surface red-dashed; each step ghost-flashes the cuts before settling on the placements.

step 0 · mate-rewire init step 0 / 0

case study · dnc

What you get on the shipped example.

Default run on dnc + sbm-flat-best+cc at --seed 1 (\(N = 906\), \(\langle k \rangle = 23.02\), \(t_1 = 1.45\), \(t_2 = 2.01\), \(c_{\min} = 3\), \(c_{\max} = 52\), \(\mu = 0.686\)):

stat	input	lfr output	note
N	906	906	Exact match (CLI arg).
edges	10,429	10,370	Off by 0.6%. LFR's rewire + resample lose only a handful of edges.
mean degree	23.02	22.89	Close but not exact: the truncated power-law sampler hits the mean in expectation only.
max degree	368	<368	The input's heavy tail (368, 312, 256, 225, 202, ...) is often cut short: sampling from \(k^{-1.45}\) with \(k_{\max} = 368\) rarely draws the extreme quantiles without many tries.
\(\mu\)	0.686	~0.686	Hit via the rewire loop within the paper's tolerance; small per-node drift.
global cc	0.548	0.252	Not targeted.
clusters	87	10	LFR resamples cluster sizes from \(c^{-2.01}\) with floor 3 summing to 906. With \(t_2 = 2.01\) the sampler concentrates mass on the tail, so fewer larger clusters cover \(N\); 10 clusters for `dnc` is typical.
com.csv rows	551	906	Every node gets a cluster in LFR (no singletons to strip). Clustered count shoots up to all of \(N\) because the floor-3 constraint disallows 1-node clusters.

caveat 1: node IDs do not correspond

LFR's output uses node IDs 1 through \(N\). Those IDs do not correspond to the input's node IDs, nor do the cluster IDs correspond. If a downstream tool asks "is the output's cluster 7 the same as the input's cluster 7", the answer is no: LFR does not see the input's clustering, only its size distribution. Anything keyed on identity needs ABCD or SBM instead.

caveat 2: cluster count shifts

On dnc, input has 87 real clusters (plus 355 singletons the wrapper counts separately); LFR output has 10 clusters. This is not a bug: LFR samples \(c^{-t_2}\) with floor \(c_{\min}\) and sum fixed to \(N\). With \(t_2\) near 2, the sampler draws a handful of large clusters that sum to 906 quickly, leaving no room for the many small clusters the input had. The fitted \(c_{\max} = 52\) does not constrain the output here: empirically every output cluster on dnc lands in [67, 116], so the upper bound is best-effort, not enforced. LFR is the wrong tool when a specific cluster count is required.

caveat 3: outliers disappear

The wrapper's outlier_mode=singleton makes every unclustered input node a size-1 pseudo-cluster. LFR's floor \(c_{\min} = 3\) refuses them: their sizes don't survive the fit, and the output has no singletons at all. Outlier semantics are incompatible with LFR; they resolve as "every node is clustered" on LFR output.