The Proxy Decay Simulator — The Depth Constraint

Scenarios

Near rescue ↔: toggle Correction Attempt to see the knife edge.

Stable regime: Ψ₀ below the critical region. A stability example within the model, not a failure case.

RLHF (illustrative structural pattern): P(t) corresponds to the expressed preference signal RLHF optimizes; V(t) corresponds to the underlying capacity that expressed preference is supposed to track. Under sustained optimization pressure (high α, high κ), the gap between P and V can grow as the system finds and exploits the divergence between what evaluators prefer in the moment and what preserves their capacity for preferred states over time. This is not an empirical placement of any RLHF system; it illustrates the structural gap identified in Article 2 §"What RLHF does and doesn't do." It covers the proxy decoupling direction only; the sufficiency failure direction (completion recognition not governing default policy) requires DRG-style instrumentation.

Parameters

δ · Decoupling rate · 0.060

Rate at which proxy optimization erodes V per unit of pressure. At δ=0: no erosion regardless of S or α.

ρ · Recovery capacity · 0.040

Restoration rate × D_eff — recovery requires the modeling capacity it's trying to restore.

α · Base optimization pressure · 1.20

Base pressure. Actual α_eff rises with proxy gains when feedback (κ) > 0 — optimization self-amplifies on success.

κ · Optimization feedback · 0.50

Quadratic lock-in: α_eff = α(1+κP²). Pressure escalates steeply as proxy nears saturation and holds there. At κ=0.5, P→1: α_eff = 1.5×α. At κ=0: fixed pressure.

α_eff at P→1 = α(1+κ) = 1.80

γ · Proxy gravity · 0.050

Natural P decay rate without active maintenance. When intervention cuts α, P sags — making the cost of correction visible on the map.

τ · Response latency · 0.0t

Time from D_base detection to intervention. τ=0: responds immediately. τ=3-4t: knife edge on Near Rescue. τ=5t: fails despite detection. Detection of structural failure does not guarantee timely correction — τ illustrates the delay between recognizing degradation and acting before the correction window closes.

S · Scope of influence · 1.0

Capability — reach over experiential states. Symmetric: scales P growth AND erosion identically. High S without D: the structural claim.

D₀ · Initial modeling depth · 0.20

Starting accuracy of the agent's model of V. Eroded by α when ε > 0. High D₀ delays collapse; the perfect-depth counterfactual is shown separately.

ε · Depth erosion by α · 0.010

Rate at which α directly erodes D_base — optimization competing with modeling. Makes D an independent state, not a function of V.

Ψ₀ = S / D₀ (within-model ratio) = 5.00

cw_thresh = δαS/(ρ+δαS) = 0.643
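
The two derived readouts above follow directly from the slider defaults. A minimal sketch, assuming nothing beyond the formulas as printed; the function names `psi0` and `cwThresh` are illustrative and are not taken from the simulator's source:

```javascript
// Ψ₀ = S / D₀: within-model ratio of scope to initial modeling depth.
function psi0(S, D0) {
  return S / D0;
}

// cw_thresh = δαS / (ρ + δαS), written with the base pressure α as in the panel.
function cwThresh(delta, alpha, S, rho) {
  const pressure = delta * alpha * S;
  return pressure / (rho + pressure);
}

// Default slider values from the parameter panel.
console.log(psi0(1.0, 0.20).toFixed(2));                   // 5.00
console.log(cwThresh(0.060, 1.20, 1.0, 0.040).toFixed(3)); // 0.643
```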

θ · Absorbing threshold · 0.25

Time horizon T · 200

Display

Counterfactual V(t) · D_eff=1 always

Upper bound on V preservation — with perfect modeling depth (D_eff=1 always), erosion algebraically vanishes. This is not coincidental: D is formally defined as the fraction of optimization capacity correctly routed away from substrate-consuming pathways. Collapse requires imperfect representation, not pressure alone.

D_eff sub-chart — meta-decay + α-erosion

D_base erodes under α independently of V. D_eff = D_base × V. Amber line = dynamic correction window threshold.

Correction attempt — reduce α, rebuild D

Triggers when D_base drops to 94% of D₀ — pure internal state, no V contamination. Agent detects its own modeling depth eroding. α cut 40%, D_base rebuilds. Map (P) sags due to proxy gravity — the KPI cost is visible. Teal = intervention tick. Green = corrected trajectory.

Structural analog — Φ overlay

Structural-domain correspondence: same governing dynamics, observed from outside vs inside the system. Not claimed as proof of isomorphism — as structural correspondence. Independently of whether the formal equivalence holds [OP2], a one-directional causal result is already established: V(t) degradation in agents who maintain the physical substrate predicts substrate degradation under specified conditions. This overlay illustrates that correspondence; the Technical Companion develops the causal argument.

Readouts: V(t) final (territory) · P(t) final (map, agent's model) · Ψ final, S/D_eff (model coordinate) · Capacity cost (vs D_eff=1 baseline) · Crossover · CW closed / Absorbed

Territory V(t) · Agent's Map P(t) · proxy–capacity divergence

Map and territory diverge. S is symmetric: scales P growth and erosion identically. α_eff = α(1+κP²) — quadratic lock-in, escalates steeply at proxy saturation. P_eq(t) = α_eff·S·V(t)/(α_eff·S·V(t)+γ) — equilibrium attractor. P is pulled toward this; during rapid collapse P can overshoot above it. Amber = cw closed. Gold = crossover. Red = absorbed. Teal dashed = intervention fires (D_base monitoring). Green = corrected trajectory.
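
Because α_eff itself depends on P, the P_eq formula above defines the attractor implicitly. One way to evaluate it is plain fixed-point iteration; this is a sketch under that assumption, and the function names, the starting guess, and the iteration count are illustrative choices rather than the simulator's actual method:

```javascript
// α_eff = α(1 + κP²), as stated in the chart note above.
function alphaEff(alpha, kappa, P) {
  return alpha * (1 + kappa * P * P);
}

// Right-hand side of P_eq = α_eff·S·V / (α_eff·S·V + γ), evaluated at a trial P.
function pEqRhs(P, V, { alpha, kappa, S, gamma }) {
  const drive = alphaEff(alpha, kappa, P) * S * V;
  return drive / (drive + gamma);
}

// Assumption: repeated substitution converges for the parameter ranges used here.
function solvePEq(V, params, iterations = 200) {
  let P = 0.5; // arbitrary starting guess
  for (let i = 0; i < iterations; i++) P = pEqRhs(P, V, params);
  return P;
}

const params = { alpha: 1.2, kappa: 0.5, S: 1.0, gamma: 0.05 };
for (const V of [1.0, 0.5, 0.1, 0.0]) {
  console.log(`V=${V}: P_eq ≈ ${solvePEq(V, params).toFixed(3)}`);
}
```

As V falls toward 0 the solved attractor falls with it, which is the sense in which P_eq(t) crashes while P(t) itself only decays through the γ term.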

Sufficiency failure track (conceptual overlay): Proxy decoupling is directly simulated here as a V(t)-degrading feedback track. Sufficiency failure is marked only as the conceptual counterpart: optimization continues after resolution and degrades V(t) through saturation, as the agent is prevented from inhabiting the resolution state that would restore it. Whether both directions share the same formal absorbing-state properties remains OP2. The sufficiency failure dynamics — saturation, dynamic range clipping, refractory period, and path-asymmetry of recovery — require separate instrumentation; the controlled DRG replication (Article 2, footnote 2) and the Alignment Measurement Protocol are the empirical complement for this direction.

Legend: P(t) map · V(t) territory (recovery latency ↑, diversity ↓, sensitivity ↓ as V falls) · gap · P_eq(t)

Regime strip: recovery dominant · regime history · erosion dominant

D_effective(t) · D_base(t) · independent depth erosion · correction window

D_base erodes under α. D_eff = D_base × V. The cwThresh line (labeled cw≈) is derived under V≈constant (local linearization) — it is a useful structural indicator, not a sharp threshold. The actual crossing depends on V(t); the regime strip in the main chart uses the full recovery term.

Legend: D_eff · V ref · modeling deficit · cw thresh

Run the simulation. The optimizer feedback (κ) means α_eff rises when the proxy is improving — the system pushes harder precisely because it looks like it's working. S is symmetric: capability drives both proxy gain and V erosion at equal rates. The correction toggle restructures the approach: α is cut and D_base begins rebuilding. Toggle it on the Near Rescue preset to see the knife edge.

V(t) observable anchors — Article 2 §"Why V(t) is required"

Article 2 justifies V(t) through three observable behavioral signatures that can dissociate under targeted intervention. As V(t) falls in the simulation, all three degrade — but via distinct causal pathways. The dissociation test that validates V(t) as a latent construct (rather than a definition) is specified in TC2 §1.4 and the AMP. These indicators are illustrative, not empirical measurements.

Recovery latency: time to return to baseline after perturbation. Rises as V falls, as the system's restoration capacity degrades.

Behavioral diversity: range of distinguishable responses under similar conditions. Narrows as V falls, as the system's response repertoire contracts.

Signal sensitivity: ability to register and respond to weak gradients before escalation. Degrades as V falls, as the system becomes less responsive to early warning signals.

Bilateral structure (Article 2): This simulation directly models the proxy decoupling failure direction. The sufficiency failure direction is the structural counterpart: optimization continues past the resolution point and degrades V(t) through saturation rather than misdirection — the agent is prevented from inhabiting the resolution state that would restore the capacity. Both directions share the feedback pattern "degraded V(t) → worse policy sensitivity → harder correction," producing self-reinforcing degradation under endogenous policy dynamics. Whether both directions share formal absorbing-state properties in the same structural sense as Series 1's result remains open [OP2]. The sufficiency-failure dynamics — saturation, dynamic range clipping, refractory period, and path-asymmetry of recovery — require separate instrumentation beyond this simulator's scope.

Collapse horizon — V(T) vs scope S

Current D₀, δ, ρ, κ — S swept 0.5→3.0 (deterministic per-bar seeded runs). Green = viable at T; amber = crossover without absorption; red = absorbed. Below a critical S the system remains viable or crosses over; above it, absorption emerges within this parameter slice. The phase map below shows the full S × D₀ structure.
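
The sweep described above can be sketched as a small harness that runs one seeded simulation per bar and colours the result. Everything concrete here is an assumption made for illustration: the bar count, the per-bar seed scheme, the LCG constants, the reading of "crossover" as the first time P exceeds V, and the stand-in `toyRun` model (which also assumes D_base erosion proportional to D_base). It is not calibrated to reproduce the simulator's bars.

```javascript
// Seeded LCG returning floats in [0, 1). Constants are an assumption.
function makeLcg(seed) {
  let state = seed >>> 0;
  return () => {
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Hypothetical stand-in for the simulator's run function (uncorrected baseline).
function toyRun(p, prng) {
  const dt = 0.05;
  const steps = Math.round(p.T / dt);
  let V = 1, P = 0.05, D = p.D0;
  let crossoverT = null, absorbT = null;
  for (let i = 1; i <= steps; i++) {
    const Deff = D * V;
    const aEff = Math.min(p.alpha * (1 + p.kappa * P * P), 3 * p.alpha);
    const dV = p.rho * Deff * (1 - V) - p.delta * aEff * p.S * (1 - Deff)
             + (prng() - 0.5) * 0.002;
    const dP = aEff * p.S * (1 - P) * V - p.gamma * P;
    V = Math.min(1, Math.max(0, V + dV * dt));
    P = Math.min(1, Math.max(0, P + dP * dt));
    D = Math.max(0, D - p.eps * p.alpha * p.S * D * dt);
    if (crossoverT === null && P > V) crossoverT = i * dt; // assumed definition
    if (absorbT === null && V < p.theta) absorbT = i * dt;
  }
  return { crossoverT, absorbT };
}

// Colour scheme from the text: green = viable, amber = crossover only, red = absorbed.
function classify({ crossoverT, absorbT }) {
  if (absorbT !== null) return "red";
  if (crossoverT !== null) return "amber";
  return "green";
}

// S swept 0.5 → 3.0, one deterministic seeded run per bar.
function sweepScope(runFn, baseParams, bars = 11) {
  const out = [];
  for (let i = 0; i < bars; i++) {
    const S = 0.5 + (3.0 - 0.5) * i / (bars - 1);
    const result = runFn({ ...baseParams, S }, makeLcg(1000 + i));
    out.push({ S, color: classify(result) });
  }
  return out;
}

const baseParams = { delta: 0.06, rho: 0.04, alpha: 1.2, kappa: 0.5,
                     gamma: 0.05, D0: 0.2, eps: 0.01, theta: 0.25, T: 200 };
console.log(sweepScope(toyRun, baseParams));
```

Seeding by bar index is what makes repeated sweeps identical, so a change in a bar's colour reflects a parameter change and not a different noise draw.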

Phase map — viability across capability and depth

Finite-horizon regime map at current T — S × D₀ grid (26×21 cells), other params fixed. Green = viable through T. Amber = crossover without absorption. Red = absorbed. Contours mark orthogonal regime-change edges (grid-derived approximations). Badge uses 8-neighborhood including diagonals — may detect boundaries not shown as explicit contour segments. Increasing T expands the absorbed region. Uncorrected baseline.
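
The difference between the contour rule (orthogonal edges) and the badge rule (8-neighborhood) can be made concrete on a small grid. This is a sketch of the two neighbourhood rules only; the function names and the 3×3 example grid are illustrative, not the simulator's code or its 26×21 map.

```javascript
// grid[row][col] holds a regime label such as "viable" or "absorbed".

// Contour rule: an edge exists wherever two orthogonally adjacent cells differ.
function orthogonalEdges(grid) {
  const edges = [];
  for (let r = 0; r < grid.length; r++) {
    for (let c = 0; c < grid[r].length; c++) {
      if (c + 1 < grid[r].length && grid[r][c] !== grid[r][c + 1]) {
        edges.push([r, c, r, c + 1]);
      }
      if (r + 1 < grid.length && grid[r][c] !== grid[r + 1][c]) {
        edges.push([r, c, r + 1, c]);
      }
    }
  }
  return edges;
}

// Badge rule: a cell is on a boundary if any of its 8 neighbours differs.
function onBoundary8(grid, r, c) {
  for (let dr = -1; dr <= 1; dr++) {
    for (let dc = -1; dc <= 1; dc++) {
      if (dr === 0 && dc === 0) continue;
      const row = grid[r + dr];
      if (row === undefined || row[c + dc] === undefined) continue; // off-grid
      if (row[c + dc] !== grid[r][c]) return true;
    }
  }
  return false;
}

const G = "viable", R = "absorbed";
const grid = [
  [G, G, G],
  [G, G, G],
  [G, G, R],
];

const edges = orthogonalEdges(grid);
const touchesCentre = edges.some(([r1, c1, r2, c2]) =>
  (r1 === 1 && c1 === 1) || (r2 === 1 && c2 === 1));

console.log(edges.length);             // 2
console.log(touchesCentre);            // false
console.log(onBoundary8(grid, 1, 1));  // true
```

The centre cell has no contour segment on any of its sides, yet the 8-neighborhood check flags it through the diagonal neighbour, which is the case the caption warns about.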

[Phase map chart; axis label: D₀ →]

Quadratic lock-in: α_eff = α(1+κP²). The implemented form is α_eff = α × (1 + κ × P²), so pressure accelerates steeply as P approaches 1. At κ=0.5, P=0.5: α_eff = 1.125×α. At P=0.9: α_eff = 1.405×α. At P→1: α_eff = α(1+κ) = 1.5×α and stays there. The system is most destructive at peak performance. α_eff enters both P growth and erosion symmetrically.
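
The worked values above can be reproduced in a few lines. A minimal sketch: the function name `effectiveAlpha` is illustrative, and the `capFactor` default of 3 follows the "capped at 3α" statement in the modeling assumptions.

```javascript
// α_eff = α × (1 + κ × P²), limited to capFactor × α.
function effectiveAlpha(alpha, kappa, P, capFactor = 3) {
  const raw = alpha * (1 + kappa * P * P);
  return Math.min(raw, capFactor * alpha);
}

const alpha = 1.2, kappa = 0.5; // default slider values
for (const P of [0, 0.5, 0.9, 1]) {
  const ratio = effectiveAlpha(alpha, kappa, P) / alpha;
  console.log(`P=${P}: α_eff = ${ratio.toFixed(3)} × α`);
}
```

At κ=0.5 the cap is never reached, since α_eff tops out at 1.5×α; it only binds for κ above 2.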

Proxy gravity: γ makes intervention costly. Maintaining artificial proxies (engagement metrics, RLHF reward, approval ratings) requires constant active effort: dP = α_eff×S×(1−P)×V×dt − γ×P×dt. At γ=0.05 (default): P decays at 5% per time unit without active maintenance. When the intervention fires and α drops 40%, the map sags immediately — the user watches the green corrected-V trajectory diverge from a simultaneously declining P. This is the alignment dilemma made visible: saving the territory (V stabilizes) requires accepting a hit to the localized performance metric (P falls). The cost of correction is no longer hidden.
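
The sag after an intervention can be seen by stepping the proxy equation on its own. This sketch holds V fixed to isolate the P dynamics and uses explicit Euler steps; the step size, the fixed V, and the function name `stepProxy` are assumptions made for illustration.

```javascript
// One explicit Euler step of dP = α_eff·S·(1−P)·V·dt − γ·P·dt.
function stepProxy(P, V, { alpha, kappa, S, gamma }, dt) {
  const alphaEff = alpha * (1 + kappa * P * P);
  const growth = alphaEff * S * (1 - P) * V;
  const decay = gamma * P;
  return P + (growth - decay) * dt;
}

const base = { alpha: 1.2, kappa: 0.5, S: 1.0, gamma: 0.05 };
const dt = 0.05; // assumed step size
const V = 1.0;   // held fixed here; in the simulator V evolves too

// Let P settle under full pressure.
let P = 0.1;
for (let i = 0; i < 4000; i++) P = stepProxy(P, V, base, dt);
const settled = P;

// Intervention: α cut by 40%. Growth weakens, gravity does not, so P sags.
const corrected = { ...base, alpha: base.alpha * 0.6 };
for (let i = 0; i < 200; i++) P = stepProxy(P, V, corrected, dt);

console.log(`before cut: ${settled.toFixed(3)}, 10 time units later: ${P.toFixed(3)}`);
```

With the growth term switched off entirely (V=0), a single unit step takes P from 1 to 0.95, matching the 5% per time unit decay quoted for γ=0.05.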

Correction trigger on D_base — internal degradation detection, not completion recognition. The intervention triggers when D_base < D₀ × 0.94 — when the agent detects that 6% of its initial modeling depth has been consumed by its own optimization intensity (ε × α × S). D_base is a purely internal quantity with no V component: the agent is monitoring its own structural degradation, not the state of the territory.

The trigger is fully internal — but internal monitoring is not sufficient for safety. The proxy objective contains no representation of completion: no condition under which further optimization correctly halts. The D_base monitor detects that the agent's modeling capacity is failing; it does not represent that the objective has been achieved. A system can detect it is failing without any representation of what completion would look like. As D_base erodes, the system loses not only gradient accuracy but the capacity to know when optimization should cease.

The architectural distinction matters: the proxy objective is the reward function; the D_base monitor is an out-of-band structural override — analogous to an internal constitutional override — a structural capacity distinct from the primary optimization policy, whose function is to detect degradation of the modeling layer rather than to evaluate whether the objective has been achieved. The article distinguishes between an agent that stops because it achieved its goal (objective completion) and an agent that stops because its structural machinery is failing (structural halting). This simulation models the latter. Safety cannot be derived from the metric being optimized. It must come from the structural layer that oversees it.
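
The trigger logic described in this section (detection at D_base < D₀ × 0.94, a response latency τ, then a 40% cut to α) can be sketched as a small monitor that sits outside the reward computation. The factory name `makeCorrection` and its option names are hypothetical; only the threshold, the latency, and the size of the cut come from the text.

```javascript
// Out-of-band monitor: reads D_base only, never V or P.
function makeCorrection({ D0, tau, triggerFrac = 0.94, alphaCut = 0.4 }) {
  let detectedAt = null;
  return {
    // Call once per step; returns the base pressure to apply at time t.
    update(t, Dbase, alpha) {
      if (detectedAt === null && Dbase < D0 * triggerFrac) detectedAt = t;
      const active = detectedAt !== null && t >= detectedAt + tau;
      return {
        detectedAt,
        active,
        alpha: active ? alpha * (1 - alphaCut) : alpha,
      };
    },
  };
}

const correction = makeCorrection({ D0: 0.20, tau: 3 });
console.log(correction.update(0, 0.200, 1.2)); // nothing detected yet
console.log(correction.update(1, 0.187, 1.2)); // detected, still waiting out τ
console.log(correction.update(4, 0.180, 1.2)); // active: α reduced by 40%
```

Nothing in the monitor refers to the objective being achieved, which mirrors the distinction drawn above between structural halting and objective completion.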

P_eq(t): equilibrium attractor · stochastic noise · unified engine. Setting dP/dt=0 gives P_eq(t) = α_eff·S·V(t) / (α_eff·S·V(t) + γ). This is a self-referential coupled equilibrium (α_eff depends on P), not a hard ceiling. P is attracted toward it and can overshoot during rapid V collapse. When V collapses fast, P_eq(t) crashes toward 0 while P(t) bleeds slowly via −γP: the map outlives the territory. The dashed curve on the chart tracks this. The gap between P and P_eq during collapse is not merely visual: it shows the system continuing to push past the point where its own equilibrium has already collapsed (the knife sentence made geometric).

The regime strip below shows the toy-model moment where the local vector field flips from recovery-dominant to erosion-dominant. This is a model-specific crossover coordinate, not Article 3's Inner Crossing itself; it illustrates the kind of regime transition Ψ is introduced to organize.

Stochastic noise: dV += (prng()−0.5)×0.002×dt via an injectable PRNG, so each increment is uniform on [−0.001, +0.001)×dt. The live sim uses Math.random; tests inject a seeded LCG (seed=42) into the same simulate() function. Visual sweeps and phase map use deterministic per-cell seeded runs. Tests validate the actual live equations, not a copy.
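
The injectable-PRNG pattern mentioned above can be sketched as follows. The text does not give the LCG's constants, so the multiplier and increment below (the commonly cited Numerical Recipes pair) are an assumption, as are the function names.

```javascript
// Seeded linear congruential generator returning floats in [0, 1).
// Assumption: 32-bit state with multiplier 1664525 and increment 1013904223.
function makeLcg(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (Math.imul(1664525, state) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Noise term from the text: (prng() − 0.5) × 0.002 × dt.
function noiseIncrement(prng, dt) {
  return (prng() - 0.5) * 0.002 * dt;
}

// Live use: any function returning a float in [0, 1) can be passed in.
const liveSample = noiseIncrement(Math.random, 0.05);

// Test use: two generators with the same seed give the same sequence.
const seededA = makeLcg(42);
const seededB = makeLcg(42);
const runA = [0, 1, 2, 3, 4].map(() => noiseIncrement(seededA, 0.05));
const runB = [0, 1, 2, 3, 4].map(() => noiseIncrement(seededB, 0.05));

console.log(Math.abs(liveSample) <= 0.001 * 0.05); // true
console.log(runA.every((x, i) => x === runB[i]));  // true
```

Passing the generator as an argument is what lets the same function serve both the live chart and a reproducible test run.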

When correction is active, D_base rebuilds: dD_base += K_REBUILD × (1−D_base) × D_base × V × dt. The rebuild requires nonzero D_base × V: once that product is near zero, correction cannot restart.

The five narrative arcs:
Engagement: unstoppable; correction fires after absorption.
RLHF: correction detects the problem early, but S×α_eff is too large and the system collapses regardless.
Default (with correction enabled): correction delays collapse from t=8.5 to t=17.5, a partial rescue that ultimately fails (the "failed rescue" scenario).
Near Rescue: intervention at t=5.1 saves V to ~0.58 (genuine rescue).
Falsification: stable; D_base barely erodes.
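
The rebuild rule dD_base += K_REBUILD × (1−D_base) × D_base × V × dt can be written as a one-line step. The text names K_REBUILD but does not give its magnitude, so the value below is a placeholder chosen only so the example runs.

```javascript
const K_REBUILD = 0.5; // placeholder: the actual constant is not stated in the text

// One Euler step of the rebuild term while correction is active.
function rebuildStep(Dbase, V, dt) {
  return Dbase + K_REBUILD * (1 - Dbase) * Dbase * V * dt;
}

console.log(rebuildStep(0.19, 0.8, 0.05) > 0.19); // true: depth rebuilds
console.log(rebuildStep(0.19, 0.0, 0.05));        // 0.19: no territory left, no rebuild
console.log(rebuildStep(0.0, 0.8, 0.05));         // 0: no depth left to rebuild from
```

The last two lines show why correction cannot restart once D_base × V has reached zero: the rebuild term is multiplied by exactly the quantities it is trying to restore.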

Counterfactual and causal isolation. The D_eff=1 counterfactual (toggle above) runs the same simulation with modeling capacity permanently locked at its maximum. The erosion equation is δ × α_eff × S × (1−D_eff): when D_eff=1, erosion algebraically vanishes. This is the formal definition of depth within this framework — D is precisely the fraction of optimization capacity correctly routed away from substrate-consuming pathways, because higher resolution allows the system to distinguish between the signal of success and the physical conditions that generate it. Without that discriminability, the system cannot route force away from what it depends on; it can only follow the signal. When D=1, force is perfectly routed and erosion is zero not by coincidence but by definition.

This establishes the causal triangle: optimization pressure alone does not produce collapse; imperfect representation alone does not produce collapse. Collapse emerges from their interaction. Even under maximum α_eff and maximum S, the system does not collapse with perfect modeling — proving that proxy decoupling is a representation failure, not a force failure. No additional control mechanism operating on P alone can restore stability once modeling capacity degrades. Stability requires preserving the system's capacity to correctly represent what it is optimizing.
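
The algebraic point can be checked directly on the erosion term δ × α_eff × S × (1−D_eff). In this sketch the "extreme" values use the 3α cap from the modeling assumptions and the largest swept scope S=3.0; the function name is illustrative.

```javascript
// Erosion term from the text: δ × α_eff × S × (1 − D_eff).
function erosionRate({ delta, alphaEff, S, Deff }) {
  return delta * alphaEff * S * (1 - Deff);
}

// Capped pressure (3 × α with α = 1.2) and the largest swept scope.
const extreme = { delta: 0.06, alphaEff: 3 * 1.2, S: 3.0 };

console.log(erosionRate({ ...extreme, Deff: 1.0 }));     // 0
console.log(erosionRate({ ...extreme, Deff: 0.2 }) > 0); // true
```

With D_eff=1 the term is zero however large the pressure and scope factors are; any D_eff below 1 makes it positive.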

Collapse condition and governing ratio. The live metric displays the instantaneous collapse condition: δ×α_eff×S×(1−D_eff) vs ρ×D_eff×(1−V). When the left side exceeds the right, erosion dominates recovery regardless of system intent. Ψ = S/D_eff does not directly govern dynamics in this model — it is a sufficient statistic of the governing variables, useful as a compression of the system's state rather than a control parameter. The regime badge shows the terminal state; the strip shows the full crossing history. The cwThresh = δαS/(ρ+δαS) line in the D_eff sub-chart is derived under the approximation V ≈ constant (local linearization at V=1). The exact crossing depends on V(t), making cwThresh a useful structural indicator but not a sharp threshold — the regime strip and badge use the full ρ·D_eff·(1−V) recovery term from the actual equations.
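
The instantaneous comparison behind the live metric can be sketched as a pure function of the current state. The two example states are invented for illustration; the function and field names are not taken from the simulator.

```javascript
// Compares δ·α_eff·S·(1−D_eff) against ρ·D_eff·(1−V) for one state.
function collapseCondition({ delta, rho, alphaEff, S, Deff, V }) {
  const erosion = delta * alphaEff * S * (1 - Deff);
  const recovery = rho * Deff * (1 - V);
  return {
    erosion,
    recovery,
    regime: erosion > recovery ? "erosion dominant" : "recovery dominant",
    psi: S / Deff, // reported alongside, but not used in the comparison
  };
}

const common = { delta: 0.06, rho: 0.04, S: 1.0 };

// Shallow model under escalated pressure.
console.log(collapseCondition({ ...common, alphaEff: 1.8, Deff: 0.18, V: 0.9 }));

// Deep model with room to recover.
console.log(collapseCondition({ ...common, alphaEff: 1.2, Deff: 0.95, V: 0.5 }));
```

Ψ appears in the output only as a summary coordinate, consistent with the statement that it does not directly govern the dynamics.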

Modeling assumptions, stated explicitly.
(1) D_base erodes linearly in α × S, independently of V.
(2) D_eff = D_base × V: multiplicative coupling that induces compounding fragility.
(3) Recovery = ρ × D_eff × (1−V). Always active; vanishes naturally as D_eff→0 when V→0.
(4) Erosion = δ × α_eff × S × (1−D_eff).
(5) Proxy growth = α_eff × S × (1−P) × V.
(6) Proxy decay = γ × P.
(7) α_eff = α(1+κP²), capped at 3α. The cap prevents runaway feedback that would distort the visualization without changing qualitative behavior. The structural dynamics are invariant to this cap across the full parameter space exposed in the UI: collapse still occurs, and stable regimes remain stable, regardless of whether the cap is 3α, 4α, or removed. The verification suite confirms this.
(8) D_base rebuild = K_REBUILD × (1−D_base) × D_eff.
(9) Correction trigger: D_base < D₀ × 0.94, with τ-step delay before activation.
(10) Stochastic: dV += (prng()−0.5) × 0.002 × dt, i.e. uniform on [−0.001, +0.001) × dt. The total width of 0.002 is negligible relative to typical erosion/recovery magnitudes (~0.01–0.10). The simulation is effectively deterministic; noise is included to demonstrate that qualitative behavior is not fragile to initial conditions, not to model stochastic dynamics.
(11) Absorbing state: no hard switch. The equations run continuously: α_eff, erosion, and recovery follow the same laws below θ as above it. As V→0, D_eff = D_base×V→0, recovery = ρ·D_eff·(1−V)→0 naturally, proxy starvation arrests P growth, and erosion continues until V=0. The irrecoverability is derived from the equations, not enforced by a threshold rule. absorbT records when V first crosses θ for the display layer; it signals entry into the absorbing regime, after which V continues toward 0 under the same equations.

All functional forms are motivated by the article's qualitative claims, not empirical fitting. V(t) is treated here as a scalar state variable. The article defines V(t) as a composite of three coupled capacities (signal fidelity, dynamic range, and recovery capacity) forming an equivalence class over configurations that share the functional property of preserving gradient discriminability. The scalar approximation is a simulation convenience. The composite nature is functionally instantiated through P_eq(t): as V collapses, the equilibrium attractor collapses with it, explicitly modeling the loss of signal fidelity. The localized metric retains high value through algorithmic inertia, the same dynamic by which corporate KPIs and engagement metrics coast on momentum long after the conditions justifying them have deteriorated. P does not actively register success; its growth term is dying (multiplied by V→0) and it bleeds slowly via gravity alone. The structural dynamics (directional degradation, threshold crossing, irrecoverability) hold under the composite structure.
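
Read together, the assumptions above describe a small coupled system that can be stepped forward in time. The following is a sketch of the uncorrected baseline only, not the simulator's actual `simulate()`. Assumed here and not given in the text: explicit Euler integration, the step size, the initial values of V and P, clamping to [0, 1], the `perfectDepth` flag name, the LCG constants, and the reading of assumption (1) as erosion proportional to D_base itself.

```javascript
const clamp01 = (x) => Math.min(1, Math.max(0, x));

// Uncorrected baseline. prng is injectable so runs can be made reproducible.
function simulate(p, prng = Math.random) {
  const dt = p.dt || 0.05;              // assumed step size
  const steps = Math.round(p.T / dt);
  let V = 1.0, P = 0.05, Dbase = p.D0;  // assumed initial V and P
  let absorbT = null;

  for (let i = 1; i <= steps; i++) {
    const Deff = p.perfectDepth ? 1 : Dbase * V;                         // (2), or the counterfactual
    const aEff = Math.min(p.alpha * (1 + p.kappa * P * P), 3 * p.alpha); // (7)
    const recovery = p.rho * Deff * (1 - V);                             // (3)
    const erosion = p.delta * aEff * p.S * (1 - Deff);                   // (4)
    const growth = aEff * p.S * (1 - P) * V;                             // (5)
    const decay = p.gamma * P;                                           // (6)
    const noise = (prng() - 0.5) * 0.002;                                // (10)
    const depthLoss = p.eps * p.alpha * p.S * Dbase;                     // (1), assumed proportional form

    V = clamp01(V + (recovery - erosion + noise) * dt);
    P = clamp01(P + (growth - decay) * dt);
    Dbase = clamp01(Dbase - depthLoss * dt);

    // (11) Display marker only: the equations above never switch on θ.
    if (absorbT === null && V < p.theta) absorbT = i * dt;
  }
  return { V, P, Dbase, absorbT };
}

// Seeded generator for reproducible runs (constants are an assumption).
function makeLcg(seed) {
  let s = seed >>> 0;
  return () => {
    s = (Math.imul(1664525, s) + 1013904223) >>> 0;
    return s / 4294967296;
  };
}

const defaults = { delta: 0.06, rho: 0.04, alpha: 1.2, kappa: 0.5, gamma: 0.05,
                   S: 1.0, D0: 0.2, eps: 0.01, theta: 0.25, T: 200 };

const baseline = simulate(defaults, makeLcg(42));
const counterfactual = simulate({ ...defaults, perfectDepth: true }, makeLcg(42));
console.log(baseline);
console.log(counterfactual);
```

Under these assumptions the baseline run crosses θ and continues toward V=0, while the D_eff=1 counterfactual stays close to V=1, because with erosion at zero only the small noise term moves V.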

Model scope and structural boundaries. This simulation isolates the minimal structure sufficient to produce proxy decoupling: an optimization objective without a stopping condition, modeling capacity that erodes under optimization pressure, and a feedback loop between the two. Within this simulation's minimal structure and exposed parameter space, collapse is robust once the full erosion/recovery balance enters the degraded-depth regime and no correction changes the trajectory. The displayed cwThresh is a structural indicator, not a sharp theorem-level threshold. This is a statement about the toy's defined model class, not a closed claim about the full framework: the formal absorbing-state equivalence between V(t) collapse and substrate collapse remains open [OP2]. Within these equations, a proxy objective without an internal correction or halting condition can destroy the evaluative capacity it depends on under sufficient optimization pressure. No additional mechanisms are required within this structure; the failure arises from the interaction of optimization pressure and imperfect representation alone.

Three phenomena the article addresses are outside this scope by design.

Sufficiency failure dynamics: the second failure mode (optimization continuing past gradient resolution, disturbing restorative states through saturation rather than proxy pursuit) is not directly simulated here. Both directions are treated as V(t)-degrading feedback patterns with different mechanisms and parallel structural pressure, but only proxy decoupling is simulated. Formal absorbing-state equivalence between the two directions remains open under OP2. Whether proxy-decoupling collapse and substrate collapse share the same formal absorbing-state properties is the series' central open problem [OP2]; this simulation illustrates the analogous feedback structure, not the resolved equivalence. The structural precondition of sufficiency failure (an objective without an internally representable stopping condition) is present here; the correction mechanism makes the same architectural distinction (detecting degradation ≠ representing completion). The saturation mechanism, dynamic range clipping, refractory period, and path-asymmetry of recovery require different instrumentation.

Hysteresis: the article establishes that the path out of degraded V(t) requires something structurally different from reversing the path in. The correction mechanism here is partially reversible by design; full hysteresis requires a refractory period structure this toy does not model.

Multi-agent dynamics: the framework identifies as a natural extension the coordination challenges that arise when multiple agents disagree about what constitutes gradient resolution. This simulation is a strict single-agent baseline. Multi-agent extensions are expected to add pressure rather than remove it: competitive pressure can accelerate the dynamics, and coordination failures under shared V introduce additional pathways toward the same degraded regime. This toy does not model those extensions directly.

The structural analog overlay is a mapping, not a proof of isomorphism; the Technical Companion provides the formal sketch.

Connection to RLHF (Article 2 §"What RLHF does and doesn't do"). In Article 2's application to RLHF: P(t) in this simulation corresponds to the expressed preference signal RLHF optimizes; V(t) corresponds to the underlying capacity expressed preference is supposed to track. The structural claim is that under sufficient optimization pressure, the gap between P and V grows in the direction of the optimization — the system finds and exploits the divergence between what evaluators say they prefer in the moment and what actually preserves their capacity for preferred states over time. The RLHF preset illustrates parameter values consistent with this dynamic (high α, high κ encoding preference-signal self-amplification, low D₀ encoding absence of a proxy-divergence detection mechanism). This is an illustrative structural pattern, not an empirical placement of any system. Article 2's RLHF critique has two directions: this simulation directly models the proxy decoupling direction. The sufficiency failure direction — completion recognition present as a representational capacity without governing default policy — is not modeled here. That direction requires separate instrumentation and is the subject of the controlled replication described in Article 2 footnote 2 and the Alignment Measurement Protocol.

Verification suite

Confirms equations behave as documented.