smarm

Author	SHA1	Message	Date
smarm	d4839f1d81	feat(runtime,io): driver-enqueues + park/wake idle path — retire the wake pipe The swap (RFC 018). Schedulers no longer sleep on a shared level-triggered wake pipe — the herd source that made the default 8-thread config 7x slower than 2 threads (E1). They park on per-thread futex parkers via the coordination layer; IO backends become producers behind a two-call contract (make runnable, then the enqueue tail wakes exactly one parked scheduler). Deleted: the drain lock and the one-winner phase-1 drain; the shared completions VecDeque; the wake pipe fds, poll_wake, drain_wake_pipe, wake_scheduler, the FdReady/Blocking Completion enum; the 100us idle nap; the per-pop io.lock liveness read; io.rs's as_millis timeout truncation. Added: - enqueue wake tail (fixes the silent enqueue): wake_one_if_idle, a fence + one Relaxed mask load when everyone is busy — the pure-compute hot path pays almost nothing. - driver-enqueues: the pool thread stashes its result in the slot, decrements io_outstanding, unparks; the epoll thread removes+DELs the waiter under the waiters lock and unparks. Both reach the runtime via a Weak (no Arc cycle). The waiters map moves behind its own Arc<Mutex> so the epoll thread never takes the runtime io lock (teardown holds it while joining that thread). - io_outstanding / io_fd_waiters atomics: the termination verdict reads two atomics instead of taking io.lock on every pop. - timekeeper idle path: at most one parked scheduler holds the timer deadline (an expiry wakes one, not a herd); everyone else parks indefinitely and is woken by the enqueue tail. - busy-path timer due-check (ratified design point (a)): under saturation nobody parks and no timekeeper exists, yet due timers must still fire — one Relaxed load of the earliest-deadline snapshot per loop, clock read only when a timer is armed. Maintained under the timers mutex. - chain rule: a scheduler that pops with more work queued and a sibling parked wakes one, so surplus runs in parallel rather than behind it. tests/park_wake.rs pins the two new observable properties: timers fire under full scheduler saturation, and sub-ms sleeps are prompt (the as_millis truncation regression). Full suite + all loom models green; clippy --lib clean.	2026-07-24 09:12:12 +02:00
smarm	2854b560d6	feat(park): fenced producer fast path + earliest-deadline snapshot Two integration-driven amendments ahead of the runtime swap: wake_one_if_idle() realizes RFC 018's "empty-mask fast path is one relaxed load" soundly: a bare relaxed load is a lost-wake in the Dekker shape for the lock-free ring queues, so the producer publishes work, fences (SeqCst), then reads the mask Relaxed — paired with a matching fence between the consumer's bit-publish and its re-check in park(). The pure-compute hot path (mask 0) never takes the shared mask line exclusive; the RMW read stays on the rare chain-rule path only. Loom models 1/2 now drive the fenced pattern end to end. next_deadline is the earliest KNOWN timer deadline, independent of whether anyone is parked — which tk_armed cannot give: under saturation nobody parks, nobody arms, yet due timers must still fire (ratified design point (a): the busy-path due-check). Maintained under the timers mutex (note_deadline on insert — which also carries the timekeeper re-arm wake — refresh_deadline after pop/clear); read lock-free. deadline_due() costs one Relaxed load and a branch when no timer exists; the clock is read only when one does.	2026-07-24 09:12:12 +02:00
smarm	7b026cfe56	feat(park): scheduler coordination layer — parkers, idle mask, wake protocol (RFC 018) Schedulers get an IO-agnostic sleep/wake primitive of their own: one futex Parker per scheduler thread (permit semantics, std::thread::park shaped — closes the check-then-park race), an AtomicU64 idle mask with a set-bit → re-check → wait park protocol, wake_one (highest-bit LIFO, CAS-clear before unpark: exactly one wakeup per call by construction), wake_all for the terminal path, and the timekeeper role — at most one parked scheduler holds the timer deadline, with an atomic armed-deadline snapshot for the busy-path due-check and an insert-side re-arm wake. Deadlines travel as nanosecond timespecs end to end; the wake pipe's as_millis truncation is unrepresentable here. The Dekker publish/re-check shape is resolved by the same-location-RMW handshake (AcqRel), not SeqCst loads; loom verifies exactly this in four models (no-lost-wake, chain propagation, timekeeper handoff, termination), run with LOOM_MAX_PREEMPTIONS=3 — unbounded exploration is impractical for the looped models. Loom/non-Linux builds park on a Mutex+Condvar via sync_shim. Standalone until the runtime swap (next commit): nothing outside tests constructs a Coordinator yet, hence the temporary dead_code allow in lib.rs.	2026-07-24 09:12:12 +02:00
smarm	006a3283e7	chore(hooks): clippy gate falls back to a nix-shell toolchain Desktop migration: the home-manager rust here ships without the clippy component. Prefer an installed cargo-clippy; otherwise run clippy from an ephemeral nix-shell with a separate target dir (mixed-compiler artifacts are an E0514 hard error). MSRV keeps the shell's older toolchain a legitimate gate.	2026-07-24 09:12:12 +02:00
Markk116	8c764e9169	docs(monitor): user-facing rewrite of process monitors Lead with the user's problem (learn when another actor dies without it knowing you're watching), explain one-directional/one-shot semantics and contrast briefly with link without assuming link.rs has been read. Add a compiling doctest. Drop em-dashes. Correctness facts about registration/death races and demonitor-after-fire safety kept, reworded in plain terms and separated from the public item docs.	2026-07-24 08:44:56 +02:00
Markk116	41b9d6d056	docs(introspect): user-facing rewrite of runtime introspection Was the worst offender for external-context references (RFC 016 Chunk 1/4, DECISION D1/D2, RFC 003/011), all removed. Lead with the practical use cases (debugging, health checks, test assertions, dashboards) for snapshot()/actor_info()/tree(), and explain the per-actor-reads-not-a-world-freeze consistency model in plain terms instead of citing a decision log. Add a compiling doctest.	2026-07-24 08:44:56 +02:00
Markk116	dd845f22fe	docs(mutex): user-facing rewrite of the actor-blocking mutex Every public item was previously undocumented. Lead with why Mutex<T> exists (a channel/gen_server is overkill for plain shared state) and how it differs from std::sync::Mutex (parks the actor not the OS thread, every lock is timeout-bounded by default). Add a compiling doctest. Document new/lock/lock_timeout/try_lock/set_default_timeout/MutexGuard/LockTimeout/DEFAULT_TIMEOUT. Drop em-dashes; keep wake protocol mechanics as contributor-facing comments on private internals.	2026-07-24 08:44:56 +02:00
Markk116	36a0a9832d	docs(registry): user-facing rewrite of the name registry Replace the 'what changed' diff-against-a-prior-design framing with a plain explanation of what the registry is for (naming an actor so others can find and message it by name) and a compiling doctest (register/whereis/send/unregister). Cut all RFC/decision-number/bug-id references and em-dashes; move type-erasure and locking-discipline detail into an Implementation notes section for contributors.	2026-07-24 08:40:26 +02:00
Markk116	feda6517e5	docs(channel): user-facing rewrite of the MPSC channel primitive Lead with what a channel is and how to use it (compiling doctest for channel()/send/recv/close), before any internal rationale. Document every previously-undocumented public item (channel(), Sender, Receiver, SendError, RecvError). Move the RawMutex-vs-std::sync::Mutex rationale and lock-class discipline into an Implementation notes section. Drop em-dashes throughout.	2026-07-24 08:40:26 +02:00
Markk116	8625ae4c35	docs(scheduler): user-facing rewrite of the actor/spawn/run entry point Lead with what an actor is and how to start one with run()/spawn(), following gen_server.rs's example-first style. Add a compiling module doctest. Drop RFC references and em-dashes; keep internal mechanics (preemption gating, thread-local borrow rules) as plain contributor comments rather than public-facing doc prose.	2026-07-24 08:40:26 +02:00
Claude (sandbox)	d9addeba5e	test(causal): controller test no longer races the sweep's site snapshot run_experiments snapshots the site registry once at entry. stage-x is registered lazily (first causal_site! execution in the worker), so on a 1-core box the snapshot deterministically wins whenever a sibling test has already paid the tsc_hz calibration — stage-x missed the sweep and the summary assert tripped (silently, pre-propagation; the earlier cascade attribution was incomplete). The worker now signals after its first site entry and the test waits on it before starting the sweep. Jarred separately: the lazy-registration trap is a lib UX hazard worth a doc note or warm-up guidance.	2026-07-18 21:59:21 +00:00
Claude (sandbox)	d5a3ba1934	fix(causal): born current — slot reuse booked phantom park forgiveness A fresh/reused slot started with causal_delay=0 and causal_parked=true: the first resume then 'forgave' the entire monotone global backlog, once per spawn, booked as park forgiveness. Under close-mode conn churn (~95k spawns/s) that is millions of phantom forgiven ms per 700ms window, even in 0% cells — the books could not balance under actor churn while ka-mode stayed plausible (few spawns). Impacts were unaffected: the first resume always precedes the first check, so nobody ever spun. reset_counters now installs Coz's new-thread rule: causal_delay starts at the current global ledger and causal_parked starts false — a newborn neither owes nor is forgiven the process's history, and delay injected while it sits spawn-queued (runnable, not blocked) is owed and paid at its first check, the semantics audit_zero_pct_window_absorbs_leftover_debt pins. That test (born failing, unmasked by the run() propagation fix) and the new actor_churn_between_experiments_forgives_nothing regression test both go green.	2026-07-18 21:55:56 +00:00
Claude (sandbox)	7eae56a296	fix(runtime): a root actor panic escapes run() The trampoline caught the root's panic, recorded it as Outcome::Panic on the slot, and run() dropped the initial handle without reading it — every assert inside run(), the standard test-suite pattern, was silently vacuous (found live: a failing-first test passed). run() now reads the root outcome before the handle drop and resume_unwinds the payload after full teardown, so a caller's catch_unwind leaves the Runtime reusable; Exit and Stopped return normally. The payload message is printed before re-raising (the throw-site hook output was suppressed in-actor). Correct the two tests this unmasked, both born failing and never run: the select loser-arm test kept a closed arm in the set (the documented closed-arm rule: a closed arm reports ready forever — observe the disconnect and drop it); the send_after-to-dead test expected Ok(None) from a closed+empty channel (documented: Err(RecvError), which proves nothing-delivered even more strongly).	2026-07-18 21:52:50 +00:00
Claude (sandbox)	527f045e17	feat(causal): offcpu column + closed-books eff in the attrib probe The probe now prints eff+offcpu next to eff — attributed plus the counted runnable off-CPU gaps over ground-truth in-site time — the per-window check that the located mechanism accounts for the whole residual (~1.00 = books closed, no remaining silent loss). Audit line gains the offcpu column, same delta-terms convention as the lib renderer. Header doc rewritten from hypothesis to resolution.	2026-07-13 12:46:58 +00:00
Claude (sandbox)	0ee3fe7330	feat(causal): offcpu audit bucket — the @50 deficit located (RFC 007) GPU sweep decomposed the ~23ms/700ms @50 injection deficit: every ledger bucket is ~zero (drop park 0, discards 0, drop yield ~0.3ms), books balance at absorbed+forgiven = 4x injected in all 48 cells, and the 0%-cell contamination signature is absent. The probe pins the residual: eff 0.933-0.943, a constant 22-27µs missing per site entry = ~4.9 slice-expiry yields/entry x ~5.6µs runqueue wait. The "deficit" is runnable off-CPU time inside the site — wall time the probe's ground truth counts but on-CPU attribution correctly skips (Coz model: speeding the site's code does not shrink queue-wait). Measure-only reclassification, no behaviour change: a yield in the target site stashes (tsc, experiment epoch) on the slot; the next on_resume counts the gap into OFFCPU_IN_SITE_{CYCLES,N} (would-be delta terms, MAX_SAMPLE_CYCLES-capped) iff the epoch still matches and the word is live — a gap straddling end()/a same-word begin() (live in the probe's 50,50 schedule) is dropped, never a leaked cooldown. Parks excluded: blocked time is forgiveness territory. New offcpu column in render_ledger_audit; LedgerCounters and ExperimentResult grow the two fields. Fidelity footer now states the on-CPU basis (deliberate wording change to the pinned summary; the substring pin test still holds). +2 tests (counted gap; epoch straddle) + render assert.	2026-07-13 12:46:15 +00:00
Claude (sandbox)	9bfeb2c6a2	feat(causal): ledger-audit output in the pipeline demo and attrib probe - causal_pipeline: SMARM_CAUSAL_AUDIT=1 appends render_ledger_audit() after the summary; pinned summary format untouched. - causal_attrib_probe: per-pct audit line (absorbed/forgiven/drops/discards) under the existing eff line, so in-site-vs-attributed and the loss buckets land in one place for the sweep. 1-core smoke (work + wide): books balance — absorbed = 2x injected and forgiven = 2x injected, i.e. owed = injected x (N-1) with N=5 actors, zero outstanding. drop park = 0 even in wide mode (the bottleneck's queue is never empty, so its in-guard recv never parks); drop yield is ~1500 events but ~0.1ms per window, confirming slice-expiry yields sample at their own checkpoint. Deficit decomposition needs the parallel box.	2026-07-13 12:11:24 +00:00
Claude (sandbox)	a3be8f0977	feat(causal): ledger audit — decompose the @50 injection deficit (RFC 007) Measure-only counters for the deficit hunt (~23ms short per 700ms window at 50% on the bottleneck site; superlinear vs 25%). Nothing here changes injection or absorption; the sweep decides the fix. Buckets, windowed per cell into new ExperimentResult fields (audit snapshot taken at end() — spin/attribution freeze there, forgiveness does not): - spin_absorbed / park_forgiven: where owed delay actually went. Spin during a 0% cell is the baseline-contamination signature — leftover debt from a prior window being paid in a later one (checks gate on the experiment word, so cooldowns pay nothing and debt carries over). - drop_park / drop_yield (+counts): the deschedule path flushes no sample tail — an in-target-site park or yield silently loses [last sample -> now]; on_resume re-arms before the actor runs again. New on_deschedule hook in all three intent arms (real park; explicit/slice-expiry yield; requeued park counts as yield — it never blocked). Slice-expiry yields sample at the descheduling checkpoint, so a fat yield bucket points at explicit yield_now or requeued parks. - discard_overmax (+count, in would-be delta terms so columns compare against injected_cycles) / discard_unarmed: the attribute() clamps, previously silent. LedgerCounters + ledger_counters() expose cumulative totals (tests, run-level prints); render_ledger_audit() is the per-cell companion to render_summary, which stays byte-identical (pinned). ExperimentResult now derives Default so literals survive future audit-field growth. Tests: +7 (spin counted, forgiveness counted, in-site park drop, in-site yield drop, overmax discard, 0%-window leftover absorption — synthesized deterministically via inject_delay_cycles_for_test with no experiment active — and the audit render). 22/22 causal.	2026-07-13 12:07:19 +00:00
Claude (sandbox)	a2d0b7af18	feat(causal): fidelity footer in render_summary One unconditional footer line whenever there are results: "note: impacts are lower bounds — undershoot grows with speedup pct; rankings unaffected" — surfacing the RFC 007 Validation fidelity statement where users actually look, instead of only in the RFC. Wording pinned by the summary test.	2026-07-13 11:11:20 +00:00
Claude (sandbox)	a4647f368a	feat(causal): wall-anchored send_after — user-facing timer opt-out (RFC 007) send_after_wall / send_after_named_wall (+ Timers::insert_send_wall) arm a message-delivery timer that opts out of the RFC 007 virtual-time shift and fires at its raw deadline regardless of injected delay — the Send-reason sibling of sleep_wall, closing the jar item whose substrate `efbc254` landed. For deadlines that reflect the outside world (protocol timeouts, wall-clock schedules) rather than workload pacing. cancel_timer is anchor-agnostic and unchanged; without the feature the API exists and is identical to send_after. The gen_server timer layer (send_after_to, RFC 015 §5) deliberately stays virtual-only — an opt-out there means new options on the gen_server/statem timeout API, out of scope for now. Tests: wall send fires at raw deadline while a virtual sibling shifts; cancel on a wall send with debt outstanding; featureless delivery/cancel smokes through the public named API. 15/15 causal, 34/34 binaries both feature configs, lib clippy clean both.	2026-07-13 11:10:11 +00:00
Claude (sandbox)	fec760a3c0	docs(causal): record the reserve-shortfall verdict in the demo header Occupancy probe on 24 cores: δ = 0.3µs/item (0.1% of the serialized path); wide guard leaves the @50% cell unchanged. The +84-vs-+100 shortfall is controller-side (injected 327ms of the ideal 350ms over the 700ms window, plus ~3% real-rate dip during experiments), not unguarded stage time. urus's ~70µs/request remainder remains the guard-placement case; the occupancy probe discriminates the two.	2026-07-13 09:48:24 +00:00
Claude (sandbox)	d5b6a8f66f	feat(causal): pipeline demo modes for the reserve-shortfall experiment SMARM_CAUSAL_MODE selects the reserve stage's guard placement: work (default, unchanged) | wide (guard over recv+work+send, the whole serialized per-item path) | occupancy (no experiments; per-segment timing of reserve's loop at baseline, reporting the unguarded remainder δ and the impact ceiling it implies). Discriminates the two candidate explanations for the demo's +84-vs-+100 @50% shortfall: physical recv/send time outside the guard (occupancy sees δ≈30µs, wide recovers ~2x) vs. injection-side credit loss (occupancy sees δ≈0, wide caps at ~+84 too — guard cadence identical).	2026-07-13 09:40:27 +00:00
Claude (sandbox)	efbc254634	feat(causal): wall-anchored timers — controller windows keep fixed wall length (RFC 007) New timer anchor: insert_sleep_wall / scheduler::sleep_wall (exported) opt a Sleep entry out of the RFC 007 virtual-time shift, so it fires at its raw deadline regardless of injected delay. Featureless config is unchanged (the API exists but is identical to sleep). The causal controller's window/cooldown sleeps and the tsc_hz calibration sleep use it on the actor path (the OS-thread path was already wall). This fixes the controller's own sleeps dilating under its own injection — experiment windows stretched ~2x at 50% speedup (337ms -> 646ms injected/ window). Deltas were rate-normalized so results were unbiased; this fixes sweep cost, not bias. The general wall-anchored-timer-semantics jar item (user-facing opt-out) remains open; this lands the substrate. Test: wall_timer_ignores_injected_delay — wall entry fires at raw deadline while a virtual sibling in the same heap shifts. 13/13 causal, 34/34 binaries both feature configs.	2026-07-13 07:44:51 +00:00
Claude (sandbox)	04dbac1f4b	feat(causal): timer-heap virtual time — deadlines chase injected delay (RFC 007) Injected delays dilate virtual time for the workload, but timer deadlines stayed wall-anchored: a sleep or receive-timeout fired early in virtual terms, so timeout/retry behaviour sped up relative to the dilated world (v1 known gap #1). Every heap Entry now carries delay_stamp — the global delay ledger at (re-)queue time, cfg-gated on smarm-causal. pop_due converts any debt accrued since the stamp to wall time (tsc_hz) and shifts the effective deadline; a not-yet-due entry is re-queued at the shifted deadline with a fresh stamp, so it keeps chasing delay injected while it waits. seq is preserved across re-queues, keeping send_after cancellation identity intact (cancelled entries are discarded before any shift). Zero debt is byte-identical to the old path; peek_deadline may under-report, costing one spurious scheduler wake per injected chunk (documented). This also makes the park-gated resume credit correct rather than forgiving for sleepers: a sleeping actor now physically pays its debt by sleeping longer, so the on_resume fast-forward reflects real payment (sleeping_actor_pays_injected_delay pins this end-to-end through the runtime). New test hooks: inject_delay_cycles_for_test (deterministic ledger driver, eagerly TSC-calibrating so conversion never stalls a scheduler loop) and cycles_to_duration. The ledger is process-global, so the delta-sensitive causal tests now serialize on a shared test mutex — they were racy under the parallel test harness before this, in principle.	2026-07-13 07:31:08 +00:00
Claude (sandbox)	d496914d40	fix(causal): flush target-site samples at guard boundaries (RFC 007) Samples were taken only when maybe_preempt's cold block happened to fire in-site, so the interval between the last check and SiteGuard drop was discarded on every site entry. Measured live on the 24-core box: 22-29us lost per entry, a constant attribution efficiency of ~0.93-0.94, which under-reported every impact (+83.5% where theory says +100%; the observed shortfall fits 1/(1-pct*eff)-1 at both 25% and 50%). SiteGuard enter/drop now call site_transition(): leaving the target site flushes the pending interval into the ledger (sample-only, never spins, so safe under no-preempt regions); entering the target site re-arms the sample clock so pre-site time is never attributed (the symmetric over-attribution). Winner attribution is factored into attribute(), shared by the cold check and the flush, with the same interval clamps. Adds examples/causal_attrib_probe.rs (ground-truth in-site time vs ledger attribution, the probe that confirmed the leak) and the site_boundaries_flush_tail regression test (a site entry that never hits a cold check must still be attributed). Also gates causal_probe on smarm-causal in Cargo.toml - it never was, so featureless builds of the examples were broken.	2026-07-12 19:35:06 +00:00
Claude (sandbox)	2668f4018f	feat(causal): native causal profiling behind smarm-causal (RFC 007 v1) causal_site! scoped site guards per actor slot, progress! throughput points, and a Coz-style virtual-speedup engine hooked into maybe_preempt's amortized cold block: target-site samples grow a global delay ledger; bystanders spin-absorb their debt at the next causal check, with timeslice extension so injected delay is not charged against the slice. Resume credit (Coz's blocked-thread rule) is gated on a causal_parked slot bit set only by a real park: crediting on every resume made any yield-cadence actor delay-immune and every experiment inert (found live on a 24-core run — dead-flat deltas across all sites). Report normalization uses a measured TSC frequency (~50ms calibration on first use) instead of the crate-wide 3 GHz assumption, which uniformly inflated impact numbers on a 3.7 GHz box. impact_pct() is the machine-readable form of the summary for programmatic checks. examples/causal_pipeline.rs burns fixed work (calibrated LCG loop), not fixed wall time — a timed busy-wait absorbs injected delay into its own budget and reads as a no-op. Self-checking: exits nonzero if causal separation fails; skips the verdict below 4 cores. Validated on a 24-core box: reserve (true bottleneck) +29.3%@25/+83.5%@50; serialize and background-compaction ~0%. Known v1 gaps (jar): timer-heap deadlines unshifted, no-check!/no-alloc actors undelayable, multi-scheduler coherence best-effort Relaxed, off-CPU blame punted, Instant::now() uncorrected. Zero-cost with the feature off; clippy -D warnings clean both ways; full suite green with and without smarm-causal.	2026-07-12 19:10:04 +00:00
smarm-agent	1c90a4ef5e	fix(registry): by_name stores the full holder Pid — a dead name heals under slot reuse Root cause of soak20 signature 2 (refcount_test.exs 'watcher crash', 110235x fast {:error, :server_down} probes over the full await window): by_name mapped name -> slot index, so a name whose holder died (no stop path unregisters; prune is lazy) and whose slot was then re-tenanted read as live-held: register failed NameTaken{holder: <unrelated tenant>} (which the bridge macro's generated start() swallows -> start_server/1 reports :ok for a server that never came up), while name resolution reached the tenant's mailbox, missed on the message TypeId and failed fast WITHOUT pruning — the wedge self-sustained for the tenant's lifetime. Name-addressed send additionally judged liveness on the slot's current mailbox pid, so a same-typed tenant would have received the message (misdelivery) and a differently typed one a misleading NoChannel. Fix: by_name: HashMap<&'static str, Pid> — every reader judges the stored holder with the generation-checked live(), so a recycled slot's tenant no longer impersonates a dead holder, and every touch (register / whereis / resolve / send) prunes and heals a stale name. prune(index) becomes prune_holder(pid): names bound to the holder go; the mailbox goes only while still the holder's own (a tenant's replacement mailbox is left untouched). Introspection matches names to mailboxes by full pid, so a stale name never annotates a slot's new tenant. Deterministic regression test added first and shown to fail pre-fix (tests/stale_name_slot_reuse.rs: tiny slab forces re-tenanting; old slot (1,0) died, tenant (1,1) took the index; register -> NameTaken pre-fix). Post-fix it asserts the healed contract: whereis -> None (pruned), call -> ServerDown, re-register -> Ok. Suite 33 ok-binaries, clippy gate clean. In the wild the window opened at every splice_test teardown: Splice.terminate -> exit_server('subtree') left the name bound; width 20 raised the re-tenant probability. Downstream (smarm_beam): install_child's unregister-before-register workaround becomes dead code (removed there); the #[smarm_server] macro's swallowed register error becomes truthful idempotency (a NameTaken now really is a live holder).	2026-07-12 07:23:30 +00:00
smarm-agent	f6641cd266	runtime: name scheduler threads smarm-sched-{slot} Extra scheduler threads (slots 1..N-1) are now spawned via thread::Builder with the name smarm-sched-{slot}, so they are identifiable in /proc/<pid>/task//comm, stack dumps and debuggers. Thread 0 keeps its caller-given name (an embedder names the thread that calls run — smarm_beam names it smarm-runtime). A refused spawn still panics, matching the previous thread::spawn semantics. Motivation: the §17 scheduler-width knob in smarm_beam asserts the live* width by counting these named threads, and a 20-scheduler soak needs the threads tellable apart in wedge captures.	2026-07-11 19:17:52 +00:00
smarm-agent	0017c5b9a1	fix(runtime): consume wake-pipe bytes only under the drain lock Lost-wakeup: schedule_loop's phase-1 drain uses drain_lock.try_lock(), and try_lock losers skip the completion drain entirely. Both schedulers park on one shared wake pipe and, until now, drained ALL its bytes right after their idle poll_wake returned — outside the drain lock. A loser could therefore eat the byte announcing a completion the winner had not seen (the winner was already past drain_completions when the epoll thread pushed it), and both threads would park with the completion stranded. Because the bridge eventfd is registered EPOLLONESHOT, the kernel had already disarmed it at epoll_wait, so no later write could re-fire it: the runtime slept until an unrelated timer deadline forced another phase-1 pass. Fix: drain_wake_pipe() moves inside the drain guard, immediately before drain_completions(); the two post-poll drains in the Pop::Idle arms are removed. Producers push their completion before writing the byte, so a byte consumed under the guard always has its completion visible to the drain that follows. An unconsumed byte keeps the (level-triggered) idle poll returning instantly, so a try_lock loser spins briefly until the winner releases — it can no longer sleep through stranded work. Found via smarm_beam's ingress-cap drain barrier flaking under CPU load (5/25 loaded suite runs wedged; mid-wedge stacks showed both schedulers in poll_wake with an FdReady stranded and the eventfd disarmed). Post-fix: 60/60 loaded runs green, tight 5.8-6.8s timing band, no stall tail. Root-cause notes: smarm_beam outputs/flake-rootcause-egress-overload.md.	2026-07-11 16:13:46 +00:00
smarm-agent	6c2b7e91cf	channel: drop queued messages when the Receiver drops A queued Envelope::Call was stranded until the last Sender dropped, so a caller parked in gen_server::call was never released with ServerDown when a named server was request_stop'd — the registry's inbox Sender clone (lazy prune) kept the channel Arc, and the queued reply_tx, alive indefinitely. Receiver::Drop now drains the queue (items dropped after releasing the lock, since a reply_tx drop reaches a different channel's lock + the scheduler), restoring the documented ServerDown guarantee on every teardown path. Adds tests/stop_with_queued_call.rs: deterministic pure-smarm reproducer.	2026-06-24 20:53:27 +00:00
smarm-agent	3e9c33377c	gen_statem: disambiguate module/macro intra-doc links	2026-06-20 19:51:33 +00:00
smarm-agent	e54c67c431	supervisor: rewrite docs for users; relocate internals to items	2026-06-20 19:51:32 +00:00
smarm-agent	3e321eaaf3	pg: rewrite docs for users; relocate internals to items Reframe the module doc in the gen_server house style: lead with what a process group is and when to reach for one, the auto-eviction-on-death behaviour, groups-vs-registry, and a running-context note. Keep the runnable example; add an ignored dispatch/worker-pool example. Move implementation reasoning to where a maintainer stands: lock discipline onto the ProcessGroups store, the eager-cleanup rationale onto reap_group, the join finalize-race detail into join's body comment. Drop all RFC references and the stray phase marker, and lead the public read/select fn docs with what the caller gets. Demote the pub(crate) assert_type intra-doc link in pick_as to plain code, clearing a pre-existing broken-link warning.	2026-06-20 18:51:00 +00:00
smarm-agent	c415f14dd0	ci: deny unwrap_used/expect_used on the library target Add [lints.clippy] unwrap_used = "deny", expect_used = "deny" plus a tracked pre-commit hook running `cargo clippy --lib -- -D warnings`. Library code may not hide a panic behind unwrap/expect; panic!/unreachable! stay un-linted as the explicit sanctioned form. Gate is the library target only — tests and examples are not gated. A fresh clone must run `git config core.hooksPath .githooks` to enable the hook.	2026-06-20 17:47:44 +00:00
smarm-agent	33177a0c48	library + trace: rewrite panic sites as explicit match+panic Apply the same explicit match+panic shape to the library layer (channel, gen_server, gen_statem). Extend it to the smarm-trace-gated code that the default `cargo clippy --lib` does not see: the GLOBAL lock-poison sites in trace.rs and the current_pid sites inside te!() in channel.rs. Keep the current_pid match inside the te!() argument so non-trace builds evaluate nothing extra on the recv-wake hot path. const-init the trace thread-local.	2026-06-20 17:47:39 +00:00
smarm-agent	a875fa8285	core: rewrite panic sites as explicit match+panic Replace implicit unwrap()/expect() in the lock-ordered core with explicit match arms. Lock-poison sites use one uniform message ("smarm: <lock> lock poisoned (core corrupt): {e}"); invariant sites panic with a descriptive message naming the violated invariant. No behaviour change: each rewrite preserves the prior panic-on-bad-arm semantics. Also clears the accompanying clippy hygiene in these files (redundant_closure, len_without_is_empty, too_many_arguments, unnecessary_sort_by, missing_safety_doc, nonminimal_bool/unnecessary_unwrap).	2026-06-20 17:47:33 +00:00
smarm-agent	531571bfa5	gen_statem: postpone events for replay after a transition A `=> postpone` row (cast/call/info) defers the current event untouched, to be replayed after the next real transition. `handle` is now two-phase: a borrow-only postpone pre-pass that hands the event back as `Step::Postponed(ev)`, then the existing consuming `match (state, event)`. The loop owns a FIFO queue, drained in the new state ahead of further intake; a replayed event may postpone again. A postponed `call` keeps its Reply, so a later state answers it. `handle` returns `Step` (Postponed / Transitioned / Stayed) so the loop can see both deferral and transition without reading the state cell. `Resolution::Postpone` is removed: postpone is a pre-dispatch routing decision, not a consuming-dispatch outcome.	2026-06-20 13:02:15 +00:00
smarm-agent	acf67fef06	gen_statem: state and named timeouts, info events Add the two timeout flavours, both surfacing as ordinary events matched in on-state arms: - cx.state_timeout(d): fires a state_timeout event after d in the current state, auto-reset by the loop on every real transition. - cx.timeout(name, d): fires a timeout(name) event after d, surviving state changes, keyed by name, with cx.cancel_timeout(name). Both ride the existing timer min-heap via send_after_to onto a new per-loop system channel, selected above the inbox so a fire can't be starved by inbox traffic. A local-id stamp on each fire lets a reset/cancel that loses the race discard a stale fire. The macro grows an info: clause and folds three internal Ev variants (Info, StateTimeout, Timeout) alongside cast/call, with new row keywords info / state_timeout / timeout. Unmatched info silently drops (the gen_server default); state/named timeouts have no default, so a state that can see one must handle it or the match is non-exhaustive. Rename the hand-written expansion-target example fused -> expanded, retire the deprecated Switch demo machine (its round-trip / enter / panic-down coverage moves onto the timer machine), and refresh the macro docs to the door machine. cargo build --all-targets warning-free; cargo test green.	2026-06-20 12:22:45 +00:00
smarm	bfa513cd6d	doc: rework gen_server module docs completely	2026-06-20 13:41:27 +02:00
smarm-agent	0cf6b80396	gen_statem: GenStatem* type prefix + cleanup - rename StatemRef/StatemCallError/StatemSendError -> GenStatem* - move the inline unit test out of src; consolidate the Switch coverage onto a single macro-driven harness in tests/gen_statem.rs - drop the redundant hand-written Switch test machine and the two untracked rejected-direction probes (succ_enums, typed_edges) - rename examples statem_{fused,macro}.rs -> gen_statem_{fused,macro}.rs - strip RFC/chunk/spike provenance and fix the mislabeled "throwaway" example header and dead cross-references	2026-06-20 10:48:33 +00:00
smarm-agent	3e316066c3	gen_server: prefix public types with Gen (GenServerRef, GenServerCtx, GenServerName, GenServerBuilder, NamedGenServerBuilder)	2026-06-20 10:48:33 +00:00
smarm	f646c5cd72	cleanup some LLM crud	2026-06-20 12:22:33 +02:00
smarm	07867b91f6	doc: links and tweaks	2026-06-20 12:11:16 +02:00
smarm-agent	8d1605638e	statem: add gen_statem! authoring macro (RFC 017 chunk 1) Declarative macro_rules! that fuses the hand-written statem surface into one invocation: emits the unified event enum, the machine struct + state cell, start, the Machine impl (dispatch + stay/transition apply-tail), and the enter dispatch. User keeps the meaningful types, the per-state successor enums, and the free handler fns. Pure sugar: every safety property is a property of the emitted code, checked by rustc, so a declarative macro carries (almost) the proc-macro guarantee set: 1. forgotten (state,event) pair -> E0004 (total match, no injected _) 2. conflicting row -> unreachable_patterns (macro self-denies; HARD only in-crate, suppressed cross-crate by in_external_macro -- documented) 3. orphan handler -> dead_code (handlers are user free fns) 4. out-of-set target -> E0599 (per-state successor enums) Hygiene: bodies can't see the macro's self/cx, so the caller names them via `context(data, prev, cx)` (shared call-site hygiene). No separate transitions{} block: the match IS the table, successor enums ARE the per-state target sets (fused variant, diverges from RFC edge-lint). - examples/statem_macro.rs: Door machine via the macro (parallel to the hand-written examples/statem_fused.rs; diff the two to see the delta). - in-crate test exercises a machine + anchors the in-crate #2 guarantee.	2026-06-20 09:55:01 +00:00
smarm-agent	acc37c5fc9	RFC 017 chunk 1: gen_statem primitives (no macro yet) Runtime support layer for gen_statem, built against existing public API (channel + scheduler::spawn), sibling to gen_server. No macro: per review, build the primitives first and hand-write the Switch example to evaluate whether a statem! macro earns its place before committing to one. - src/statem.rs: Machine trait (on_start/handle), Resolution<S> with From<S>, Cx (on_unhandled), Reply<T> move-only reply handle, StatemRef (send/call), spawn + inbox loop. Real time only; Postpone + Cx timeout arming are in the type surface but not yet acted on (chunks 2-3). - examples/statem_switch.rs: the RFC Switch machine hand-written against the primitives, tagged USER vs MACRO to mark what a macro would generate. Asserts the RFC end-state (flips=1, enters=3). - tests/statem.rs: call/cast round-trip, enter-on-start/transition-not-stay, panicking-handler -> Down. Reply<T> included (the call helper needs a handle type; keeps the example true to the RFC surface) but isolated and trivially removable if we drop it.	2026-06-19 19:52:02 +00:00
smarm-agent	0d6fc970a7	scheduler: gate send_after_to runtime tests out of the loom build These three RFC-015 tests call run() (the real runtime), so under --cfg loom they construct loom atomics outside a loom::model block and panic with "cannot access Loom execution state from outside a Loom model". They are unit tests of runtime behavior, not state-machine models. Gate the module #[cfg(all(test, not(loom)))], matching the existing run_queue::tests precedent: they still run under normal cargo test, and the loom build is now 28/28 clean.	2026-06-19 14:48:23 +00:00
smarm-agent	e9c39bff46	roadmap: introspection & observability shipped as RFC 016 Chunks 1-4 all landed; mark the introspection block done (was 'Needs an RFC'), matching the send_after in-place SHIPPED treatment above it.	2026-06-19 10:20:37 +00:00
smarm-agent	2d15834b24	RFC 016 Chunk 4: observer example with ps-style + tree dump A runnable examples/observer.rs (required-features = ["observer"]) that stands up a named service + two parked workers, starts the observer, and renders a snapshot as a ps-style table and the parentage forest indented. The observer appears in its own dump, caught running while it serves the snapshot call — transport over the same read every consumer sees. Complements the runnable doctest already on observer::start.	2026-06-19 10:20:09 +00:00
smarm-agent	6df4cd4a0b	RFC 016 Chunk 4: observer gen_server (feature-gated) A thin GenServer consumer of the Chunk-1 read primitive — the live observer process (D12). ObserverRequest/ObserverReply are the wire contract (D11); the version rides along on the snapshot/tree payloads, which already carry SNAPSHOT_FORMAT_VERSION (D1). Behind the new `observer` Cargo feature, off by default (D10): the primitive stays always-on, only the transport is gated. Cast is Infallible, so the server takes no async traffic and handle_cast is statically unreachable. Gated integration test proves each verb relays exactly what the corresponding primitive returns (snapshot/tree/actor_info), incl. a forged-pid None and a live Parked classification.	2026-06-19 10:18:54 +00:00
smarm-agent	48d47c45c9	RFC 016 Chunk 2c: approximate per-actor time-budget (reductions-like) Accumulate on-CPU cycles per actor as ActorInfo.budget_cycles, behind the off-by-default budget-accounting feature (D6) — a reductions-style work metric for relative comparison across runs. - Approximate by design (per Mark): charge now - slice-start once at the yield point, reusing the timestamp reset_timeslice already sets, so one RDTSC per resume not two. Wake-slot resumes inherit the slice and so slightly over-attribute the chain's time to the woken actor — noise that averages out; we trade exactness for half the hot-path cost. - Field/ActorInfo member are unconditional (keeps the snapshot shape stable across the feature flag, D1); only the accumulation is gated, so default builds are byte-identical and pay nothing. Reads return 0 when off. Single-writer Relaxed like the other counters; reset in reset_counters (D7). Matrix: default + feature-on + rq-mpmc + rq-striped + release + trace all green; loom unaffected (feature off under --cfg loom).	2026-06-19 08:47:58 +00:00
smarm-agent	e93b3120ec	RFC 016 Chunk 2b: per-actor messages-received counter Tally each dequeued message against the receiving actor, surfaced as ActorInfo.messages_received (the 'is this actor draining slower than its mailbox fills' signal). - messages-received, not sent (D4): the receiver counts on its own thread, so it's a single-writer Relaxed load+store on a hot Slot AtomicU64, no atomic RMW (D5); reuses the stashed *const Slot from 2a. - Incremented at all six channel dequeue-success sites (recv, recv_timeout x2, recv_match, try_recv_match, try_recv) via preempt::note_message_received; no-op outside an actor (null slot). - Resets with overruns in reset_counters across the three lifecycle sites (D7). Matrix: debug default + rq-mpmc + rq-striped + release green; loom slot_state/run_queue models pass (the 3 send_after_to loom failures are pre-existing at `fc014c4` — plain #[test]s under --cfg loom, not Chunk 2).	2026-06-19 07:34:32 +00:00
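
Note on e93b3120ec (messages-received counter): the message describes a single-writer counter bumped with a Relaxed load plus store instead of an atomic read-modify-write. A minimal sketch of that pattern follows. It is not the crate's code: the `Slot` struct here is a hypothetical stand-in that models only the counter, and the method signatures are assumed for illustration. The names `messages_received`, `note_message_received` and `reset_counters` come from the commit message.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical stand-in for the per-actor slot; only the counter is modelled.
pub struct Slot {
    messages_received: AtomicU64,
}

impl Slot {
    pub fn new() -> Self {
        Slot { messages_received: AtomicU64::new(0) }
    }

    /// Single-writer bump. Assumes only the receiving actor's own thread
    /// calls this, so a plain load followed by a store cannot lose an
    /// update and no atomic RMW (fetch_add) is needed.
    pub fn note_message_received(&self) {
        let n = self.messages_received.load(Ordering::Relaxed);
        self.messages_received.store(n.wrapping_add(1), Ordering::Relaxed);
    }

    /// Reader side (for example a snapshot): may observe a slightly stale
    /// value, which is acceptable for a monitoring counter.
    pub fn messages_received(&self) -> u64 {
        self.messages_received.load(Ordering::Relaxed)
    }

    /// Zero the counter, as the commit says happens at the lifecycle sites.
    pub fn reset_counters(&self) {
        self.messages_received.store(0, Ordering::Relaxed);
    }
}
```

The load plus store pair is only correct under the single-writer assumption. With two concurrent writers an increment could be lost, and `fetch_add` would be required instead.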
