Abstract. Cooperative multi-agent reinforcement learning (MARL) under lossy, bursty links often exhibits long-tail latency. Training-time stabilizers and learned communication are expensive and fragile under packet drops. Broadcast-Gain (BG) is a fixed-rate, neighbor-only overlay that adds a minimal control plane without modifying the base learner. Each agent broadcasts two bytes once per cycle: a signed residual of local pressure-progress and a compact meta tag. Receivers keep the freshest packets, form a confidence-weighted consensus, and use it to gate a simple phase scheduler. The overlay perturbs the policy only by shifting a single action logit (MOVE) with a clipped, distance-decayed push near the junction. Bandwidth is approximately 0.24 kbit/s per agent and compute is a few scalar operations per step. In a single-junction grid under heavy loss, BG reduces tail wait (p95) by 4.97 steps on a hard evaluation cell (N=120, drop=0.70, cycle=6), while increasing near-gate flow by +391.9 per 1k steps at negligible idle-red cost.
Communication constraints change the failure mode of cooperative MARL. Sparse rewards and delayed credit assignment increase gradient variance. Learned communication protocols often assume smooth and reliable channels. In networked control tasks, the cost of errors concentrates in rare, high-congestion events. A small execution-time control plane can target these events without retraining the policy.
BG follows this design. It sends a fixed two-byte message per agent per cycle, performs no backpropagation through the channel, and overlays an existing PPO+GAE policy.
The environment is a grid with two perpendicular corridors and a shared junction.
A clearance lock enforces a fixed number of steps after a crossing before another crossing is permitted.
Agents act each step using local observations and a two-action space: MOVE or WAIT.
The junction serves one axis at a time with a binary phase variable $S \in \{+1,-1\}$ that may switch at cycle boundaries.
Communication is neighbor-only and lossy. Each directed link drops packets with per-tick probability $p$. BG does not change the observation model, rewards, or training procedure of the base policy.
Each agent broadcasts a two-byte packet once per cycle to neighbors within a Manhattan radius. The packet contains a signed residual and a compact meta tag:
The residual $z_i$ summarizes local pressure-progress. A default construction quantizes a temporal-difference residual and uses $\mu$-law companding for robustness under 8-bit quantization. The meta byte packs an axis bit and a distance bin: $$m_i = (\mathrm{axis\ bit} \ll 7) \; | \; (\mathrm{dist\ bin} \;\&\; 0x7F).$$
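A minimal encoding sketch consistent with the description above follows; the function names, the `scale` normalizer, the convention that the axis bit is 1 for the $+1$ axis, and $\mu=255$ are assumptions rather than the project's published interface.

```python
import math

MU = 255.0  # mu-law parameter (assumed; standard 8-bit companding)

def encode_packet(residual, axis_bit, dist_bin, scale=1.0):
    """Pack one BG message into two bytes: mu-law residual byte + meta byte."""
    x = max(-1.0, min(1.0, residual / scale))
    # mu-law companding, then signed 8-bit quantization of the residual.
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    z_byte = int(round(y * 127)) & 0xFF
    # Meta byte: axis bit in the MSB, distance bin in the low 7 bits.
    m_byte = ((axis_bit & 0x1) << 7) | (dist_bin & 0x7F)
    return bytes([z_byte, m_byte])

def decode_packet(pkt, scale=1.0):
    """Recover (residual estimate, axis in {+1,-1}, distance bin) from two bytes."""
    z_byte, m_byte = pkt[0], pkt[1]
    s = z_byte - 256 if z_byte > 127 else z_byte      # signed 8-bit code
    q = s / 127.0
    x = math.copysign(math.expm1(abs(q) * math.log1p(MU)) / MU, q)
    axis = +1 if (m_byte >> 7) & 0x1 else -1
    dist_bin = m_byte & 0x7F
    return x * scale, axis, dist_bin
```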
Stop-gradient interface. BG does not learn a communication protocol. It uses fixed encoding, fixed aggregation, and fixed actuation. Gradients do not flow through the channel or through the scheduler.
Receivers keep only the freshest packet per sender under a short time-to-live. Let $\eta_j = \exp(-\Delta t_j/\tau_{\mathrm{fresh}})$ weight the staleness of sender $j$. After decompanding, the axis scores $Z_{+1}$ and $Z_{-1}$ are freshness-weighted sums of residuals over the unique fresh senders that support each axis.
A coverage term and a freshness average define an information weight $w_{\mathrm{info}} \in [0,1]$. A consensus score compares the two axis sums and is squashed for stability, yielding $w_{\mathrm{cons}}$.
$w_{\mathrm{cons}}$ increases with agreement and coverage and decays smoothly under packet loss. This scalar drives both the phase scheduler and the actuation strength near the junction.
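A receiver-side sketch follows; the time-to-live, the `n_expected` coverage normalizer, the product combination for $w_{\mathrm{info}}$, and the `tanh` squash for the consensus score are assumptions filling in details the description leaves open.

```python
import math

TAU_FRESH = 2.0   # freshness time constant tau_fresh, in cycles (assumed value)
TTL = 2           # packet time-to-live, in cycles (assumed value)

def aggregate(fresh_packets, n_expected, t_now):
    """Confidence-weighted consensus from the freshest packet per sender.
    fresh_packets: dict sender_id -> (residual, axis, t_recv), freshest only.
    n_expected:    neighbors expected within the broadcast radius."""
    Z = {+1: 0.0, -1: 0.0}
    eta_sum, n_fresh = 0.0, 0
    for residual, axis, t_recv in fresh_packets.values():
        dt = t_now - t_recv
        if dt > TTL:
            continue                          # drop stale packets
        eta = math.exp(-dt / TAU_FRESH)       # staleness weight eta_j
        Z[axis] += eta * residual
        eta_sum += eta
        n_fresh += 1
    coverage = n_fresh / max(n_expected, 1)
    freshness = eta_sum / max(n_fresh, 1)
    w_info = min(1.0, coverage * freshness)   # information weight in [0, 1]
    # Consensus: compare the axis sums, squash for stability.
    gap = abs(Z[+1] - Z[-1]) / (abs(Z[+1]) + abs(Z[-1]) + 1e-6)
    w_cons = w_info * math.tanh(gap)
    return Z, w_info, w_cons
```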
The scheduler updates the served axis $S$ at cycle boundaries. It enforces a minimum green duration and stretches it when information is weak. It switches when an advantage signal exceeds a confidence-scaled threshold or when a max-green limit is reached. A damping rule reduces oscillations after a wasted-clear cycle, defined as a cycle where the clearance lock prevents any crossing.
```python
# Fixed BG constants (set globally): lambda_stretch, theta0, theta1,
# max_green, min_green_min, delta_wc, beta_mg, min_green_tgt.
def update_scheduler(st, Z, w_info, x):
    """Cycle-boundary update of the served axis S.
    st: state with fields S in {+1, -1}, green_age, min_green0.
    Z:  axis scores Z[+1], Z[-1]; w_info: information weight in [0, 1];
    x:  crossings realized in the last cycle."""
    st.green_age += 1
    # Stretch the minimum green when information is weak.
    min_green_eff = st.min_green0 + lambda_stretch * (1 - w_info)
    if st.green_age < min_green_eff:
        return st.S                           # HOLD
    delta = Z[-st.S] - Z[st.S]                # advantage of the red axis
    thresh = theta0 + theta1 * (1 - w_info)
    if delta > thresh or st.green_age >= max_green:
        st.S = -st.S                          # switch served axis
        st.green_age = 0
    # Wasted-clear damping of the base minimum green.
    if x == 0:
        st.min_green0 = max(min_green_min, st.min_green0 - delta_wc)
    else:
        st.min_green0 = (1 - beta_mg) * st.min_green0 + beta_mg * min_green_tgt
    return st.S
```
BG perturbs the policy by modifying only the MOVE logit at execution time. Let $\ell^{(i)}_{\mathrm{move}}$ be the base MOVE logit for agent $i$. The overlay adds a clipped, distance-decayed term that depends on the served axis, consensus, and a green-only fairness accumulator.
Here $s_i \in \{+1,-1\}$ is the agent's axis, $d_i$ is its grid distance to the gate, and $\phi_i \in [0,1]$ is the green-only fairness accumulator, which increases with near-gate wait on green and decays otherwise. A hard-stop safety rule clamps $g^{(i)}_{\mathrm{add}}=-A$ when an agent is close to the gate on the red axis.
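One plausible form of the resulting shift, consistent with the description but with the exact combination of terms assumed rather than taken from the project, is sketched below (constants are placeholders).

```python
import math

LAMBDA_PUSH = 1.0    # push strength Lambda (placeholder value)
CLIP_A = 2.0         # clip level A (placeholder value)
TAU_DIST = 3.0       # distance-decay constant (placeholder value)
RED_STOP_DIST = 1    # hard-stop radius on the red axis (placeholder value)

def move_logit_shift(agent_axis, dist_to_gate, served_axis, w_cons, phi):
    """Clipped, distance-decayed additive term for the MOVE logit only."""
    # Hard-stop safety: clamp to -A close to the gate on the red axis.
    if agent_axis != served_axis and dist_to_gate <= RED_STOP_DIST:
        return -CLIP_A
    sign = 1.0 if agent_axis == served_axis else -1.0
    decay = math.exp(-dist_to_gate / TAU_DIST)
    g = sign * LAMBDA_PUSH * w_cons * decay * (1.0 + phi)   # phi: green-only fairness
    return max(-CLIP_A, min(CLIP_A, g))
```

At execution time this term is added to the base MOVE logit only, e.g. `logits[MOVE] += move_logit_shift(...)`; the WAIT logit and all training code remain untouched.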
Why only one logit. Restricting actuation to a single logit makes the overlay easy to reason about and bound. It also reduces the risk of unintended side effects on other actions.
BG changes the policy by shifting only one logit by $\delta$ and clipping $|\delta| \le A$. For any observation $o$, the per-state drift is bounded.
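The project's exact statements are not reproduced here; for a two-action softmax, the following standard bounds illustrate the kind of per-state guarantee a clipped single-logit shift gives (a sketch, not the project's theorem). Write $\pi(\mathrm{MOVE}\mid o) = \sigma(\ell_{\mathrm{move}} - \ell_{\mathrm{wait}})$ and $\pi_\delta(\mathrm{MOVE}\mid o) = \sigma(\ell_{\mathrm{move}} + \delta - \ell_{\mathrm{wait}})$ with $|\delta| \le A$. Because the log-normalizer moves by at most $|\delta|$, $$\bigl|\log \pi_\delta(a \mid o) - \log \pi(a \mid o)\bigr| \;\le\; A \quad \text{for } a \in \{\mathrm{MOVE},\mathrm{WAIT}\},$$ and since $\sup_x \lvert\sigma(x+\delta)-\sigma(x)\rvert = \tanh(\lvert\delta\rvert/4)$, $$\mathrm{TV}\bigl(\pi(\cdot\mid o),\,\pi_\delta(\cdot\mid o)\bigr) \;\le\; \tanh(A/4) \;\le\; A/4.$$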
These bounds formalize that BG is a small execution-time perturbation under the chosen clip level.
Each agent sends 16 bits per cycle. With environment step frequency $f_{\mathrm{step}}$ and cycle length $C$, the bitrate per agent is $$\mathrm{bps} = 16\,\frac{f_{\mathrm{step}}}{C}.$$ For example, $f_{\mathrm{step}}=60$ Hz and $C=4$ yields about 240 bps. Per-step compute includes decompanding, a few sums, and one logit add. The overlay does not add networks or attention layers.
Training uses PPO with generalized advantage estimation (GAE) and no BG. After training, the policy is frozen. Evaluation compares the frozen baseline to the same frozen policy with BG enabled. BG constants are fixed across all test cells.
Factor grid. Number of agents $N \in \{100,120,140\}$, per-tick packet drop $p \in \{0.60,0.65,0.70\}$, and cycle length $C \in \{3,5,6\}$. This yields 108 evaluation cells. Primary metric is near-gate tail wait p95 (lower is better). Secondary metrics include near-gate crossings per 1k steps (higher is better) and idle-red (lower is better).
PPO settings reported with the project: 2000 updates per seed, rollout length 2048, 32 minibatches, 4 epochs, Adam learning rate $3\times 10^{-4}$, $\gamma=0.99$, $\lambda=0.95$, clip 0.2, entropy and value coefficients 0.01 and 0.5, and gradient norm clip 0.5. Evaluation runs $4\times 10^5$ environment steps per seed per cell at 60 Hz.
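For reproduction these settings can be collected in a flat configuration; the key names below are illustrative, not the project's actual configuration schema.

```python
# Illustrative mapping of the reported PPO/GAE settings; key names are hypothetical.
PPO_CONFIG = {
    "updates_per_seed": 2000,
    "rollout_length": 2048,
    "num_minibatches": 32,
    "epochs_per_update": 4,
    "learning_rate": 3e-4,        # Adam
    "gamma": 0.99,                # discount
    "gae_lambda": 0.95,
    "clip_range": 0.2,
    "entropy_coef": 0.01,
    "value_coef": 0.5,
    "max_grad_norm": 0.5,
}
EVAL_STEPS_PER_SEED_PER_CELL = 400_000   # evaluated at 60 Hz
```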
The strongest reported evaluation cell uses $N=120$, drop $p=0.70$, and cycle length $C=6$. BG with TD residuals reduces tail p95 by 4.97 steps and increases near-gate flow by +391.9 per 1k steps relative to the frozen baseline. The RawEnt variant provides smaller gains on this cell.
| Variant (eval cell) | Signed Improvement Index | Tail p95 change (steps) | Near-gate change (/1k steps) |
|---|---|---|---|
| Frozen baseline | 0.000 | 0.00 | 0.0 |
| BG (TD) | 0.450 | -4.97 | +391.9 |
| BG (RawEnt) | 0.240 | -2.35 | +328.5 |
Across 108 evaluation cells, BG wins 78 (72%). Improvements concentrate at larger $N$ and longer cycle lengths. Very short cycles can be neutral or slightly negative. Under higher packet drop, BG degrades smoothly rather than failing abruptly.
A mechanism check on the training reference configuration (N=140, drop=0.70, cycle=6) reports a reallocation of attention and realized crossings toward the green phase, along with improved near-gate flow.
| Variant (train cell) | Attention-green share change (pp) | Realized-green share change (pp) | Near-gate change (/1k steps) |
|---|---|---|---|
| BG (TD) | +16.17 | +19.89 | +160.6 |
Two design parameters interact strongly with loss: neighborhood radius and packet staleness. Small radii and short time-to-live windows limit stale contradictions while preserving enough coverage for consensus. The project reports stable wins for radii in the range 2 to 3 and time-to-live of 1 to 2 cycles.
Actuation should be tuned conservatively. Increasing the push strength $\Lambda$ while keeping the clip $A$ moderate avoids oscillations. For short cycles, increasing minimum-green stretching and reducing switch aggressiveness can prevent premature flips under weak information.
The benchmark focuses on a single junction. Multi-junction settings introduce coupling between phases and may require multi-hop or hierarchical aggregation. The encoding is fixed and uses a single scalar residual. More complex tasks may benefit from adaptive rate, adaptive quantization, or uncertainty-driven time-to-live.
A practical extension pairs a learned short-horizon predictor with the BG gate at execution time. This keeps learning off the link while preserving robustness under bursty delivery.
In collaboration with NYU, Columbia, and Amazon.
Schulman, J., Moritz, P., Levine, S., Jordan, M. I., and Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
Schulman, J., Wolski, F., Dhariwal, P., Radford, A., and Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
Sukhbaatar, S., Szlam, A., and Fergus, R. (2016). Learning multiagent communication with backpropagation. NeurIPS.
Foerster, J. N., Assael, Y. M., de Freitas, N., and Whiteson, S. (2016). Learning to communicate with deep multi-agent reinforcement learning. NeurIPS.
Kim, D., Moon, S., Hostallero, D., Kang, W. J., Lee, T., Son, K., and Yi, Y. (2019). Learning to schedule communication in multi-agent reinforcement learning. ICLR.
This page summarizes the project "Broadcast-Gain: A 2-Byte, Stop-Gradient Control Plane to Trim Long-Tail Latency in Cooperative MARL". Update any environment specifics, hyperparameters, and results to match your repository logs if you reproduce the experiments.