Update the documentation

Make preemption knobs configurable; fix unused-variable warnings
Add `Config::alloc_interval()` and `Config::timeslice_cycles()` so callers can tune preemption sensitivity at runtime. The values flow through `RuntimeInner` and are written into per-scheduler-thread locals via a new `configure_preempt()` call at thread startup, keeping the hot path free of cross-thread coherency traffic. Fix unused-variable warnings in channel.rs by inlining `current_pid()` directly into `te!` macro arguments — since the no-op macro arm never evaluates its argument, no binding is needed at the call site. Clean up a handful of dead imports exposed by the refactor.
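The two mechanisms described above can be sketched in isolation. This is a minimal illustration, not the crate's code: the `Config` struct, its field types and default values, the `preempt_limits()` helper, the `PID_CALLS` counter, and the body of `te!` are all assumptions made for the example. Only the names `alloc_interval`, `timeslice_cycles`, `configure_preempt`, `current_pid`, and `te!` come from the description.

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

/// Hypothetical stand-in for the crate's `Config`; fields and defaults are assumptions.
#[derive(Clone, Copy)]
struct Config {
    alloc_interval: u32,
    timeslice_cycles: u64,
}

impl Config {
    fn new() -> Self {
        Config { alloc_interval: 128, timeslice_cycles: 300_000 }
    }
    fn alloc_interval(mut self, n: u32) -> Self {
        self.alloc_interval = n;
        self
    }
    fn timeslice_cycles(mut self, n: u64) -> Self {
        self.timeslice_cycles = n;
        self
    }
}

thread_local! {
    // One copy per scheduler thread: the hot path reads these without
    // touching memory shared with other threads.
    static ALLOC_INTERVAL: Cell<u32> = Cell::new(0);
    static TIMESLICE_CYCLES: Cell<u64> = Cell::new(0);
}

/// Called once at thread startup to copy the shared config into thread-locals.
fn configure_preempt(cfg: Config) {
    ALLOC_INTERVAL.with(|c| c.set(cfg.alloc_interval));
    TIMESLICE_CYCLES.with(|c| c.set(cfg.timeslice_cycles));
}

/// Hypothetical helper: reads back the calling thread's values.
fn preempt_limits() -> (u32, u64) {
    (ALLOC_INTERVAL.with(|c| c.get()), TIMESLICE_CYCLES.with(|c| c.get()))
}

// Counts calls so the example can show the no-op macro arm never evaluates its argument.
static PID_CALLS: AtomicUsize = AtomicUsize::new(0);

#[allow(dead_code)]
fn current_pid() -> u64 {
    PID_CALLS.fetch_add(1, Ordering::Relaxed);
    0
}

// No-op arm (assumed shape): the tokens are discarded at expansion time,
// so no binding is needed at the call site and nothing is left unused.
macro_rules! te {
    ($($arg:tt)*) => {};
}

fn main() {
    let cfg = Config::new().alloc_interval(64).timeslice_cycles(150_000);
    let mut handles = Vec::new();
    for _ in 0..2 {
        handles.push(thread::spawn(move || {
            configure_preempt(cfg); // per-thread setup at startup
            preempt_limits()
        }));
    }
    for h in handles {
        assert_eq!(h.join().unwrap(), (64, 150_000));
    }

    te!(current_pid()); // expands to nothing; current_pid() is not called
    assert_eq!(PID_CALLS.load(Ordering::Relaxed), 0);
}
```

In this sketch a thread that never calls `configure_preempt()` keeps the zero initial values, which is why the call has to happen at the start of every scheduler thread rather than once globally.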
2026-05-25 22:14:07 +02:00 · 2026-05-25 21:52:16 +02:00 · 2026-05-25 13:04:58 +00:00 · 2026-05-25 13:04:54 +00:00 · 2026-05-25 13:04:50 +00:00 · 2026-05-24 07:03:45 +00:00
44 changed files with 9771 additions and 535 deletions
--- a/.gitignore
+++ b/.gitignore
@@ -1,2 +1,3 @@
-/target
+target
 Cargo.lock
 smarm_trace.json
--- a/Cargo.toml
+++ b/Cargo.toml
@@ -1,14 +1,18 @@
 [package]
 name = "smarm"
-version = "0.1.0"
+version = "0.3.0"
 edition = "2021"
 rust-version = "1.95"
 [features]
 smarm-trace = []
 [dependencies]
 libc = "0.2"
 [dev-dependencies]
-tokio = { version = "1", features = ["rt", "macros", "sync"] }
+libc = "0.2"
 tokio = { version = "1", features = ["rt", "rt-multi-thread", "macros", "sync", "time"] }
 [profile.dev]
 panic = "unwind"
@@ -21,3 +25,7 @@ codegen-units = 1
 [[bench]]
 name = "primes"
 harness = false
 [[bench]]
 name = "multi_scheduler"
 harness = false
--- a/README.md
+++ b/README.md
@@ -0,0 +1,100 @@
 # smarm
 > SMARM — Smarm, Marks Actor Runtime Machinery. A proof-of-concept green-thread actor runtime for Rust.
 Implements the core ideas in [`Architecture.md`](./docs/Architecture.md): green-thread actors on a
 shared heap, scheduled cooperatively, communicating only by `Send` messages.
 Erlang's isolation model without Erlang's copying GC, Rust's zero-copy
 ownership transfers without async's function colouring.
 The scheduler is multi-threaded — one OS thread per available CPU, all drawing
 from a shared run queue. The single-threaded `run()` entry point is kept as a
 convenience wrapper around `runtime::init(Config::exact(1)).run(f)`.
 ## What's here
 | Module       | What it does                                                           |
 |--------------|------------------------------------------------------------------------|
 | `stack`      | `mmap`'d growable stack with guard page; SIGSEGV on overflow           |
 | `context`    | `#[naked]` x86-64 context-switch shims, callee-saved regs only         |
 | `preempt`    | Allocator-driven preemption; `check!()` macro for no-alloc loops       |
 | `pid`        | `(index, generation)` PIDs; stale handles are detectable, not silent   |
 | `actor`      | Trampoline + `catch_unwind` boundary at the actor entry point          |
 | `scheduler`  | Run queue, slot table, spawn/join, parking, idle path                  |
 | `channel`    | Unbounded MPSC channel; `recv` parks the actor                         |
 | `mutex`      | `Mutex<T>` with mandatory timeout; FIFO waiters; parks the green thread |
 | `timer`      | Min-heap of `(deadline, reason)`; `Sleep` and `WaitTimeout` reasons    |
 | `io`         | `block_on_io` for blocking work; `wait_readable`/`wait_writable` + `read`/`write` via epoll |
 | `supervisor` | `Signal::Exit` / `Signal::Panic` delivered to a parent actor's mailbox |
 ## Quick taste
 ```rust
 use smarm::{run, spawn, channel};
 run(|| {
    let (tx, rx) = channel::<i64>();
    let h = spawn(move || {
        for _ in 0..3 {
            let v = rx.recv().unwrap();
            println!("got {v}");
        }
    });
    for v in 1..=3i64 {
        tx.send(v).unwrap();
    }
    h.join().unwrap();
 });
 ```
 ## Layout
 ```
 src/
  stack.rs context.rs preempt.rs pid.rs actor.rs
  scheduler.rs channel.rs mutex.rs timer.rs io.rs supervisor.rs
  lib.rs
 tests/
  per-module integration tests
 benches/
  primes.rs    fan-out/fan-in compute, vs tokio current_thread
 ```
 ## Building and running
 Standard Cargo. Requires Rust 1.95 or newer (the `#[naked]` attribute went stable
 in 1.88; we use a few unrelated post-1.88 features). x86-64 Linux only —
 ARM64 and macOS are on the deferred list because of the assembly shim and the
 epoll dependency.
 ```sh
 cargo test                # all tests
 cargo test --test mutex   # one module
 cargo bench               # primes benchmark vs tokio
 ```
 ## What's not here
 See the **Defer** section of `Architecture.md`. Deferred items include
 restart-intensity caps, `join!` for handle groups, stack growth via remap,
 hierarchical timer wheel, fd-wait timeouts, and `Signal::Timeout`. Each is a
 mechanism we know how to add; none belongs in this iteration.
 ## Docs
 | Document | What it covers |
 |---|---|
 | [`Architecture.md`](./docs/Architecture.md) | Design intent, runtime model, and deferred work |
 | [`smarm - Deep Dive.html`](./docs/smarm%20-%20Deep%20Dive.html) | Generated walkthrough of the system; good starting point |
 | [`BENCHMARKS_AND_TUNING.md`](./docs/BENCHMARKS_AND_TUNING.md) | Where smarm wins and loses vs tokio, preemption knob recommendations |
 | [`benchmarks.md`](./docs/benchmarks.md) | Raw benchmark results, methodology, and tuning experiment log |
 ## Contributing
 This is a personal proof-of-concept. There's no PR workflow — if you fork it
 and do something interesting, just send me an email. I'd genuinely like to
 hear about it.
 ---
 <sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>
--- a/benches/baseline-output/general.txt
+++ b/benches/baseline-output/general.txt
@@ -0,0 +1,44 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       7136 |       6929 |       8347
            smarm 1-thread |         1000 |       6979 |       6790 |       7364
      tokio current_thread |         1000 |        113 |        112 |        322
        tokio multi-thread |         1000 |        176 |        170 |        355
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      40079 |      39606 |      41913
            smarm 1-thread |       200000 |      40073 |      39298 |      43173
      tokio current_thread |       200000 |      14571 |      14430 |      14670
        tokio multi-thread |       200000 |      14044 |      13306 |      14432
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      19347 |      19185 |      19703
            smarm 1-thread |        33860 |      19461 |      19202 |      21172
      tokio current_thread |        33860 |      18616 |      18553 |      18987
        tokio multi-thread |        33860 |      18905 |      18755 |      19035
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      13731 |      13555 |      15545
            smarm 1-thread |         1000 |      14176 |      13870 |      14892
      tokio current_thread |         1000 |        828 |        788 |        939
        tokio multi-thread |         1000 |       3342 |       3233 |       3624
--- a/benches/baseline-output/multi_scheduler.txt
+++ b/benches/baseline-output/multi_scheduler.txt
@@ -0,0 +1,34 @@
 smarm multi-scheduler benchmarks
 available parallelism: 1 threads
 PRIME_N=400000, WORKERS=64, PING_ROUNDS=10000, SPAWN_COUNT=1000
 ================================================================================
  Fan-out/fan-in: count primes in [2, 400000) across 64 workers
 ================================================================================
               runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
     baseline (serial) |        33860 |      18581 |      18519 |      18905
   smarm single-thread |        33860 |      19467 |      19354 |      22082
        smarm 1-thread |        33860 |      19345 |      19287 |      19653
  tokio current_thread |        33860 |      18681 |      18591 |      18982
    tokio multi-thread |        33860 |      18948 |      18726 |      19212
 ================================================================================
  Ping-pong: 10000 round-trips between two actors
 ================================================================================
               runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
   smarm single-thread |        10000 |       2547 |       2473 |       2841
        smarm 1-thread |        10000 |       2546 |       2518 |       2702
  tokio current_thread |        10000 |       1221 |       1168 |       1366
    tokio multi-thread |        10000 |       1487 |       1316 |       2331
 ================================================================================
  Spawn throughput: 1000 actors spawned and joined
 ================================================================================
               runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
   smarm single-thread |         1000 |       8934 |       8066 |      12204
        smarm 1-thread |         1000 |       8102 |       8041 |      10849
  tokio current_thread |         1000 |        212 |        210 |        331
    tokio multi-thread |         1000 |        330 |        301 |        604
--- a/benches/baseline-output/primes.txt
+++ b/benches/baseline-output/primes.txt
@@ -0,0 +1,7 @@
 Counting primes in [2, 200000) across 16 workers, 5 iterations each
     runtime |    primes found |           median |             min |             max
 --------------------------------------------------------------------------------
    baseline | primes:  17984 | median:     7244 µs | min:     7231 µs | max:     7509 µs
       smarm | primes:  17984 | median:     7592 µs | min:     7505 µs | max:     8130 µs
       tokio | primes:  17984 | median:     7263 µs | min:     7225 µs | max:     9067 µs
--- a/benches/baseline-output/smarm_favored.txt
+++ b/benches/baseline-output/smarm_favored.txt
@@ -0,0 +1,40 @@
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         62 |         59 |        682
            smarm 1-thread |            1 |         71 |         61 |        210
      tokio current_thread |            1 |         22 |         22 |         23
        tokio multi-thread |            1 |         44 |         38 |         79
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     182177 |     180380 |     184410
      tokio current_thread |      1000000 |     138335 |     136097 |     141196
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      31473 |      28719 |      33113
      tokio current_thread |      1000000 |      51925 |      51205 |      53043
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     112306 |     109702 |     119859
            smarm 1-thread |        10000 |     114305 |     112030 |     121326
      tokio current_thread |        10000 |     151443 |     150949 |     153800
        tokio multi-thread |        10000 |     161344 |     160385 |     167573
--- a/benches/baseline-output/sweep/ai128_tc1200k.txt
+++ b/benches/baseline-output/sweep/ai128_tc1200k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8720 |       8526 |       9319
            smarm 1-thread |         1000 |       8662 |       8571 |       8991
      tokio current_thread |         1000 |        123 |        123 |        152
        tokio multi-thread |         1000 |        188 |        184 |        230
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      41530 |      41242 |      43501
            smarm 1-thread |       200000 |      41575 |      41187 |      43323
      tokio current_thread |       200000 |      15098 |      15020 |      15348
        tokio multi-thread |       200000 |      15900 |      15827 |      16012
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29573 |      29435 |      31647
            smarm 1-thread |        33860 |      29521 |      29453 |      29847
      tokio current_thread |        33860 |      28495 |      28441 |      30150
        tokio multi-thread |        33860 |      34384 |      34297 |      34745
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      17190 |      16994 |      17541
            smarm 1-thread |         1000 |      17078 |      16916 |      19139
      tokio current_thread |         1000 |        899 |        896 |       1000
        tokio multi-thread |         1000 |       4198 |       4116 |       4573
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     138556 |     136165 |     140947
            smarm 1-thread |        10000 |     140223 |     136325 |     146781
      tokio current_thread |        10000 |       2671 |       2622 |       2913
        tokio multi-thread |        10000 |       6004 |       4360 |      12576
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9051 |       8967 |      11152
            smarm 1-thread |       320000 |       9058 |       9008 |       9998
      tokio current_thread |       320000 |      17375 |      17131 |      18514
        tokio multi-thread |       320000 |      17955 |      17452 |      18508
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     156969 |     153124 |     167711
            smarm 1-thread |        10000 |     150638 |     146070 |     168286
      tokio current_thread |        10000 |      13823 |      13482 |      14796
        tokio multi-thread |        10000 |      15034 |      14425 |      15320
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30075 |      29707 |      30720
      tokio multi 1-thread |        33860 |      29060 |      28835 |      44378
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         86 |         79 |        130
            smarm 1-thread |            1 |         83 |         78 |        146
      tokio current_thread |            1 |         25 |         25 |         31
        tokio multi-thread |            1 |         49 |         46 |         85
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     190902 |     187600 |     194333
      tokio current_thread |      1000000 |     150279 |     148175 |     188184
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27687 |      27198 |      29555
      tokio current_thread |      1000000 |      54465 |      54048 |      55954
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     160308 |     154365 |     167009
            smarm 1-thread |        10000 |     158662 |     155458 |     168896
      tokio current_thread |        10000 |     267762 |     260876 |     294092
        tokio multi-thread |        10000 |     275097 |     269344 |     287681
--- a/benches/baseline-output/sweep/ai128_tc150k.txt
+++ b/benches/baseline-output/sweep/ai128_tc150k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8596 |       8491 |       8805
            smarm 1-thread |         1000 |       8552 |       8461 |       9003
      tokio current_thread |         1000 |        125 |        125 |        260
        tokio multi-thread |         1000 |        190 |        184 |        338
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      41885 |      41112 |      43292
            smarm 1-thread |       200000 |      42174 |      41063 |      43145
      tokio current_thread |       200000 |      15195 |      15010 |      15589
        tokio multi-thread |       200000 |      16037 |      15869 |      17057
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29872 |      29629 |      31596
            smarm 1-thread |        33860 |      29776 |      29528 |      30003
      tokio current_thread |        33860 |      28705 |      28605 |      30287
        tokio multi-thread |        33860 |      34655 |      34503 |      36596
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16898 |      16574 |      17386
            smarm 1-thread |         1000 |      16871 |      16677 |      18467
      tokio current_thread |         1000 |        897 |        857 |        991
        tokio multi-thread |         1000 |       4325 |       4228 |       4458
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     133462 |     129526 |     138685
            smarm 1-thread |        10000 |     130118 |     127633 |     142344
      tokio current_thread |        10000 |       2713 |       2608 |       2831
        tokio multi-thread |        10000 |       7367 |       4345 |      11741
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9077 |       8944 |       9287
            smarm 1-thread |       320000 |       9100 |       9033 |      10604
      tokio current_thread |       320000 |      17310 |      17122 |      18616
        tokio multi-thread |       320000 |      17484 |      17413 |      17748
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     140039 |     135577 |     145123
            smarm 1-thread |        10000 |     139931 |     135513 |     143841
      tokio current_thread |        10000 |      14524 |      14378 |      14564
        tokio multi-thread |        10000 |      15066 |      14677 |      15336
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29620 |      29511 |      31347
      tokio multi 1-thread |        33860 |      29046 |      28817 |      29687
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         94 |         79 |        371
            smarm 1-thread |            1 |        183 |         83 |        317
      tokio current_thread |            1 |         25 |         25 |         31
        tokio multi-thread |            1 |         54 |         41 |         71
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     189034 |     187674 |     192204
      tokio current_thread |      1000000 |     151106 |     149564 |     155601
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      26949 |      26838 |      30868
      tokio current_thread |      1000000 |      52984 |      52149 |      55141
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     145860 |     143015 |     152734
            smarm 1-thread |        10000 |     144550 |     141592 |     149247
      tokio current_thread |        10000 |     267500 |     265301 |     278751
        tokio multi-thread |        10000 |     275320 |     268986 |     286891
--- a/benches/baseline-output/sweep/ai128_tc300k.txt
+++ b/benches/baseline-output/sweep/ai128_tc300k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8469 |       8414 |       8717
            smarm 1-thread |         1000 |       8625 |       8479 |      10212
      tokio current_thread |         1000 |        124 |        123 |        175
        tokio multi-thread |         1000 |        194 |        184 |        317
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      41949 |      41419 |      43784
            smarm 1-thread |       200000 |      42005 |      41491 |      45224
      tokio current_thread |       200000 |      15139 |      15049 |      16352
        tokio multi-thread |       200000 |      15985 |      15931 |      16306
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29640 |      29515 |      31229
            smarm 1-thread |        33860 |      29777 |      29642 |      30056
      tokio current_thread |        33860 |      28704 |      28584 |      30317
        tokio multi-thread |        33860 |      34870 |      34569 |      35876
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      17098 |      16968 |      18688
            smarm 1-thread |         1000 |      16918 |      16736 |      17326
      tokio current_thread |         1000 |        915 |        882 |       1000
        tokio multi-thread |         1000 |       4371 |       4265 |       4834
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     127075 |     124760 |     130259
            smarm 1-thread |        10000 |     125976 |     125121 |     128728
      tokio current_thread |        10000 |       2703 |       2646 |       2807
        tokio multi-thread |        10000 |       7201 |       4267 |      12853
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9116 |       8985 |       9237
            smarm 1-thread |       320000 |       9062 |       8947 |      10648
      tokio current_thread |       320000 |      17380 |      17192 |      18363
        tokio multi-thread |       320000 |      17854 |      17554 |      18219
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     137944 |     132081 |     141862
            smarm 1-thread |        10000 |     143773 |     137448 |     153703
      tokio current_thread |        10000 |      14174 |      13751 |      15079
        tokio multi-thread |        10000 |      15244 |      14625 |      16700
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30832 |      30082 |      33360
      tokio multi 1-thread |        33860 |      29736 |      29321 |      29958
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         84 |         78 |        122
            smarm 1-thread |            1 |         90 |         79 |        157
      tokio current_thread |            1 |         25 |         25 |         31
        tokio multi-thread |            1 |         48 |         47 |         62
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     190830 |     188562 |     196621
      tokio current_thread |      1000000 |     151537 |     150038 |     165825
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27265 |      26969 |      29317
      tokio current_thread |      1000000 |      53894 |      53380 |      56189
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     145006 |     144092 |     149002
            smarm 1-thread |        10000 |     144417 |     142000 |     148224
      tokio current_thread |        10000 |     265376 |     260227 |     272279
        tokio multi-thread |        10000 |     277432 |     270860 |     283266
--- a/benches/baseline-output/sweep/ai128_tc600k.txt
+++ b/benches/baseline-output/sweep/ai128_tc600k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8721 |       8398 |       8994
            smarm 1-thread |         1000 |       8587 |       8440 |       8810
      tokio current_thread |         1000 |        124 |        124 |        294
        tokio multi-thread |         1000 |        188 |        184 |        299
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      42588 |      42084 |      45080
            smarm 1-thread |       200000 |      42252 |      41963 |      43615
      tokio current_thread |       200000 |      15101 |      14994 |      15573
        tokio multi-thread |       200000 |      15979 |      15890 |      16356
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29686 |      29491 |      31263
            smarm 1-thread |        33860 |      29841 |      29586 |      30570
      tokio current_thread |        33860 |      28652 |      28510 |      30359
        tokio multi-thread |        33860 |      34677 |      34461 |      35318
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16909 |      16579 |      20782
            smarm 1-thread |         1000 |      16888 |      16537 |      20808
      tokio current_thread |         1000 |        925 |        911 |       1021
        tokio multi-thread |         1000 |       4192 |       4079 |       4531
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     145813 |     142042 |     152501
            smarm 1-thread |        10000 |     145119 |     141282 |     161294
      tokio current_thread |        10000 |       2968 |       2899 |       3231
        tokio multi-thread |        10000 |       6288 |       4289 |      12226
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9662 |       9254 |      11370
            smarm 1-thread |       320000 |       9673 |       9331 |       9989
      tokio current_thread |       320000 |      18015 |      17334 |      21096
        tokio multi-thread |       320000 |      18384 |      17837 |      19534
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     160492 |     154795 |     180307
            smarm 1-thread |        10000 |     161716 |     156498 |     191986
      tokio current_thread |        10000 |      13895 |      13576 |      14913
        tokio multi-thread |        10000 |      15074 |      14665 |      16070
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30001 |      29600 |      38039
      tokio multi 1-thread |        33860 |      29419 |      28906 |      30079
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         91 |         79 |        186
            smarm 1-thread |            1 |         87 |         81 |        131
      tokio current_thread |            1 |         25 |         25 |        103
        tokio multi-thread |            1 |         56 |         47 |         64
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     190023 |     188250 |     193824
      tokio current_thread |      1000000 |     154681 |     152074 |     187328
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27264 |      26772 |      29512
      tokio current_thread |      1000000 |      53324 |      51744 |      59282
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     155983 |     152595 |     161438
            smarm 1-thread |        10000 |     162122 |     156170 |     200357
      tokio current_thread |        10000 |     276303 |     264291 |     296266
        tokio multi-thread |        10000 |     271350 |     267654 |     285897
--- a/benches/baseline-output/sweep/ai256_tc300k.txt
+++ b/benches/baseline-output/sweep/ai256_tc300k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       9130 |       8720 |      10611
            smarm 1-thread |         1000 |       8808 |       8617 |       9659
      tokio current_thread |         1000 |        126 |        125 |        164
        tokio multi-thread |         1000 |        190 |        184 |        329
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      42270 |      41814 |      44737
            smarm 1-thread |       200000 |      42999 |      42104 |      45424
      tokio current_thread |       200000 |      15441 |      15196 |      16096
        tokio multi-thread |       200000 |      16249 |      16070 |      17620
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29813 |      29627 |      30176
            smarm 1-thread |        33860 |      29613 |      29440 |      31205
      tokio current_thread |        33860 |      28637 |      28406 |      29179
        tokio multi-thread |        33860 |      34472 |      34389 |      36092
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16899 |      16804 |      17017
            smarm 1-thread |         1000 |      17001 |      16704 |      19533
      tokio current_thread |         1000 |        914 |        893 |       1021
        tokio multi-thread |         1000 |       4198 |       4136 |       4297
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     128621 |     126503 |     132268
            smarm 1-thread |        10000 |     131316 |     128354 |     133964
      tokio current_thread |        10000 |       2763 |       2696 |       2996
        tokio multi-thread |        10000 |       6023 |       4300 |      12908
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9225 |       9071 |      11272
            smarm 1-thread |       320000 |       9174 |       9028 |       9335
      tokio current_thread |       320000 |      17210 |      17100 |      18404
        tokio multi-thread |       320000 |      17550 |      17413 |      18080
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     136396 |     133330 |     142485
            smarm 1-thread |        10000 |     137374 |     134345 |     141168
      tokio current_thread |        10000 |      13789 |      13499 |      14621
        tokio multi-thread |        10000 |      15036 |      14729 |      15359
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30065 |      29819 |      32418
      tokio multi 1-thread |        33860 |      29501 |      28916 |      30057
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         94 |         81 |        257
            smarm 1-thread |            1 |         83 |         80 |        134
      tokio current_thread |            1 |         25 |         25 |         33
        tokio multi-thread |            1 |         57 |         48 |        109
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     188506 |     187971 |     190121
      tokio current_thread |      1000000 |     149663 |     148978 |     150733
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      26945 |      26703 |      29430
      tokio current_thread |      1000000 |      52332 |      51838 |      54062
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     146192 |     143776 |     150609
            smarm 1-thread |        10000 |     144012 |     140604 |     153892
      tokio current_thread |        10000 |     268341 |     260941 |     275404
        tokio multi-thread |        10000 |     272691 |     268094 |     307084
--- a/benches/baseline-output/sweep/ai32_tc150k.txt
+++ b/benches/baseline-output/sweep/ai32_tc150k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8653 |       8522 |       9163
            smarm 1-thread |         1000 |       8908 |       8660 |      10606
      tokio current_thread |         1000 |        124 |        123 |        175
        tokio multi-thread |         1000 |        244 |        184 |        340
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      42597 |      41857 |      43492
            smarm 1-thread |       200000 |      42621 |      42097 |      44386
      tokio current_thread |       200000 |      15368 |      15144 |      16484
        tokio multi-thread |       200000 |      16120 |      16012 |      19222
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30499 |      29657 |      33910
            smarm 1-thread |        33860 |      31190 |      30105 |      32675
      tokio current_thread |        33860 |      28748 |      28643 |      29398
        tokio multi-thread |        33860 |      34714 |      34499 |      36338
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16990 |      16853 |      17540
            smarm 1-thread |         1000 |      16944 |      16740 |      18603
      tokio current_thread |         1000 |        937 |        921 |       1056
        tokio multi-thread |         1000 |       4342 |       4205 |       4549
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     130032 |     128075 |     153842
            smarm 1-thread |        10000 |     126396 |     125101 |     131406
      tokio current_thread |        10000 |       2685 |       2629 |       2841
        tokio multi-thread |        10000 |       6014 |       4126 |      11484
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9122 |       8987 |       9334
            smarm 1-thread |       320000 |       9073 |       8956 |      10151
      tokio current_thread |       320000 |      17259 |      17163 |      17673
        tokio multi-thread |       320000 |      22771 |      17709 |      24514
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     137844 |     134570 |     157034
            smarm 1-thread |        10000 |     141200 |     137494 |     156214
      tokio current_thread |        10000 |      14809 |      14024 |      16518
        tokio multi-thread |        10000 |      15089 |      14704 |      15331
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30880 |      29931 |      32667
      tokio multi 1-thread |        33860 |      29862 |      29116 |      31310
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         90 |         80 |        196
            smarm 1-thread |            1 |         87 |         79 |        126
      tokio current_thread |            1 |         25 |         25 |         53
        tokio multi-thread |            1 |         52 |         47 |         88
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     191187 |     187194 |     198269
      tokio current_thread |      1000000 |     152531 |     151113 |     154462
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27413 |      27312 |      29463
      tokio current_thread |      1000000 |      53620 |      52594 |      55332
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     144199 |     141893 |     157984
            smarm 1-thread |        10000 |     144857 |     142722 |     152275
      tokio current_thread |        10000 |     268006 |     264666 |     274542
        tokio multi-thread |        10000 |     271827 |     268740 |     290301
--- a/benches/baseline-output/sweep/ai32_tc300k.txt
+++ b/benches/baseline-output/sweep/ai32_tc300k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8950 |       8591 |      10655
            smarm 1-thread |         1000 |       9688 |       8657 |      11720
      tokio current_thread |         1000 |        123 |        123 |        256
        tokio multi-thread |         1000 |        192 |        177 |        314
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      42965 |      41667 |      44850
            smarm 1-thread |       200000 |      42881 |      41634 |      48864
      tokio current_thread |       200000 |      15112 |      14986 |      15484
        tokio multi-thread |       200000 |      16006 |      15915 |      16647
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29931 |      29750 |      31707
            smarm 1-thread |        33860 |      29977 |      29670 |      30996
      tokio current_thread |        33860 |      28615 |      28441 |      30188
        tokio multi-thread |        33860 |      34371 |      34330 |      35176
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16753 |      16498 |      18516
            smarm 1-thread |         1000 |      16728 |      16599 |      16874
      tokio current_thread |         1000 |        940 |        933 |       1037
        tokio multi-thread |         1000 |       4317 |       4236 |       4427
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     132575 |     128629 |     136999
            smarm 1-thread |        10000 |     130313 |     127372 |     157234
      tokio current_thread |        10000 |       2689 |       2611 |       2833
        tokio multi-thread |        10000 |      11337 |       4288 |      12635
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9122 |       9000 |      11033
            smarm 1-thread |       320000 |       9143 |       9015 |       9333
      tokio current_thread |       320000 |      17705 |      17250 |      18111
        tokio multi-thread |       320000 |      18044 |      17621 |      19484
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     141925 |     135531 |     188381
            smarm 1-thread |        10000 |     139655 |     134291 |     146458
      tokio current_thread |        10000 |      13837 |      13621 |      14877
        tokio multi-thread |        10000 |      14992 |      14542 |      15237
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29687 |      29554 |      31408
      tokio multi 1-thread |        33860 |      28963 |      28742 |      30236
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         83 |         80 |        128
            smarm 1-thread |            1 |         86 |         77 |        149
      tokio current_thread |            1 |         25 |         25 |         50
        tokio multi-thread |            1 |         53 |         47 |         84
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     197474 |     194313 |     201690
      tokio current_thread |      1000000 |     149289 |     148575 |     154319
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      26884 |      26675 |      29436
      tokio current_thread |      1000000 |      52594 |      51941 |      54495
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     148321 |     146050 |     152943
            smarm 1-thread |        10000 |     147961 |     144521 |     152158
      tokio current_thread |        10000 |     264487 |     260848 |     274838
        tokio multi-thread |        10000 |     272103 |     265687 |     285209
--- a/benches/baseline-output/sweep/ai512_tc300k.txt
+++ b/benches/baseline-output/sweep/ai512_tc300k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8574 |       8421 |       8729
            smarm 1-thread |         1000 |       8675 |       8401 |      12686
      tokio current_thread |         1000 |        125 |        125 |        148
        tokio multi-thread |         1000 |        188 |        184 |        291
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      42389 |      41316 |      46466
            smarm 1-thread |       200000 |      41776 |      41342 |      48940
      tokio current_thread |       200000 |      15168 |      15094 |      15658
        tokio multi-thread |       200000 |      15953 |      15862 |      17408
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29680 |      29572 |      30661
            smarm 1-thread |        33860 |      29816 |      29597 |      30401
      tokio current_thread |        33860 |      28657 |      28581 |      29488
        tokio multi-thread |        33860 |      34837 |      34529 |      37270
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      16735 |      16601 |      17444
            smarm 1-thread |         1000 |      16702 |      16500 |      17184
      tokio current_thread |         1000 |        898 |        873 |        994
        tokio multi-thread |         1000 |       4343 |       4241 |       4448
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     128408 |     126199 |     133268
            smarm 1-thread |        10000 |     131599 |     129387 |     135080
      tokio current_thread |        10000 |       2718 |       2661 |       2981
        tokio multi-thread |        10000 |       7264 |       4608 |      11583
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9289 |       9039 |       9751
            smarm 1-thread |       320000 |       9510 |       9157 |       9677
      tokio current_thread |       320000 |      17550 |      17290 |      18578
        tokio multi-thread |       320000 |      18336 |      17527 |      18989
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     139111 |     136105 |     146606
            smarm 1-thread |        10000 |     137302 |     133316 |     141350
      tokio current_thread |        10000 |      13720 |      13455 |      14607
        tokio multi-thread |        10000 |      14964 |      14546 |      15400
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30048 |      29705 |      31530
      tokio multi 1-thread |        33860 |      28894 |      28682 |      30094
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         93 |         81 |        161
            smarm 1-thread |            1 |        103 |         80 |        178
      tokio current_thread |            1 |         25 |         25 |         28
        tokio multi-thread |            1 |         53 |         47 |         74
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     188726 |     187640 |     192658
      tokio current_thread |      1000000 |     149332 |     148133 |     155745
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27630 |      27086 |      29749
      tokio current_thread |      1000000 |      54225 |      53355 |      56307
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     144934 |     143038 |     163552
            smarm 1-thread |        10000 |     146614 |     143653 |     151325
      tokio current_thread |        10000 |     266330 |     263523 |     271639
        tokio multi-thread |        10000 |     274729 |     266323 |     285114
--- a/benches/baseline-output/sweep/ai64_tc150k.txt
+++ b/benches/baseline-output/sweep/ai64_tc150k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8849 |       8486 |       9224
            smarm 1-thread |         1000 |       8841 |       8477 |       9108
      tokio current_thread |         1000 |        124 |        124 |        219
        tokio multi-thread |         1000 |        187 |        184 |        283
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      41681 |      41278 |      43685
            smarm 1-thread |       200000 |      41721 |      41218 |      42261
      tokio current_thread |       200000 |      14969 |      14940 |      15051
        tokio multi-thread |       200000 |      16004 |      15868 |      17569
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      29679 |      29516 |      30105
            smarm 1-thread |        33860 |      29677 |      29594 |      31365
      tokio current_thread |        33860 |      28656 |      28572 |      29239
        tokio multi-thread |        33860 |      34783 |      34617 |      36531
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      17009 |      16822 |      17418
            smarm 1-thread |         1000 |      16866 |      16723 |      17315
      tokio current_thread |         1000 |        880 |        871 |       1035
        tokio multi-thread |         1000 |       4263 |       4178 |       4391
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     126566 |     124995 |     130402
            smarm 1-thread |        10000 |     128278 |     126209 |     135156
      tokio current_thread |        10000 |       2680 |       2640 |       2787
        tokio multi-thread |        10000 |       7411 |       4393 |      12421
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9073 |       8937 |       9324
            smarm 1-thread |       320000 |       9120 |       9018 |       9263
      tokio current_thread |       320000 |      17245 |      17180 |      17574
        tokio multi-thread |       320000 |      18518 |      17685 |      19621
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     141855 |     135415 |     145810
            smarm 1-thread |        10000 |     138265 |     135535 |     142346
      tokio current_thread |        10000 |      14441 |      13453 |      14650
        tokio multi-thread |        10000 |      14956 |      14529 |      15451
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30033 |      29659 |      31803
      tokio multi 1-thread |        33860 |      29078 |      28963 |      30231
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         83 |         79 |        132
            smarm 1-thread |            1 |         85 |         78 |        146
      tokio current_thread |            1 |         25 |         25 |         73
        tokio multi-thread |            1 |         51 |         47 |         64
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     191352 |     188830 |     196235
      tokio current_thread |      1000000 |     152382 |     150674 |     187815
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      27552 |      27099 |      30612
      tokio current_thread |      1000000 |      53160 |      52436 |      55255
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     145243 |     143291 |     173727
            smarm 1-thread |        10000 |     145242 |     142819 |     148457
      tokio current_thread |        10000 |     266471 |     262904 |     269145
        tokio multi-thread |        10000 |     274195 |     269312 |     286111
--- a/benches/baseline-output/sweep/ai64_tc300k.txt
+++ b/benches/baseline-output/sweep/ai64_tc300k.txt
@@ -0,0 +1,126 @@
 smarm general benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
 ================================================================================
  chained_spawn: depth 1000
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |       8735 |       8508 |       9314
            smarm 1-thread |         1000 |       8808 |       8506 |      10346
      tokio current_thread |         1000 |        123 |        123 |        172
        tokio multi-thread |         1000 |        190 |        184 |        273
 ================================================================================
  yield_many: 200 tasks × 1000 yields
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       200000 |      41619 |      41255 |      43489
            smarm 1-thread |       200000 |      41544 |      41196 |      43259
      tokio current_thread |       200000 |      15382 |      15233 |      16007
        tokio multi-thread |       200000 |      16095 |      15999 |      16296
 ================================================================================
  fan_out_compute: primes in [2, 400000) across 64
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30032 |      29838 |      31744
            smarm 1-thread |        33860 |      29782 |      29653 |      30601
      tokio current_thread |        33860 |      28754 |      28614 |      30700
        tokio multi-thread |        33860 |      34988 |      34570 |      36871
 ================================================================================
  ping_pong_oneshot: 1000 rounds
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |         1000 |      17088 |      16868 |      18654
            smarm 1-thread |         1000 |      16951 |      16797 |      17783
      tokio current_thread |         1000 |        932 |        899 |       1019
        tokio multi-thread |         1000 |       4340 |       4273 |       5245
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     129009 |     127353 |     132990
            smarm 1-thread |        10000 |     128009 |     126554 |     140472
      tokio current_thread |        10000 |       2666 |       2624 |       2794
        tokio multi-thread |        10000 |       5974 |       4368 |      11517
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |       9044 |       8970 |      10788
            smarm 1-thread |       320000 |       9087 |       8995 |      12500
      tokio current_thread |       320000 |      17185 |      17072 |      18440
        tokio multi-thread |       320000 |      17720 |      17394 |      19182
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     145819 |     140671 |     150512
            smarm 1-thread |        10000 |     139046 |     135846 |     146127
      tokio current_thread |        10000 |      13866 |      13522 |      14670
        tokio multi-thread |        10000 |      14900 |      14471 |      16378
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      30695 |      29720 |      33196
      tokio multi 1-thread |        33860 |      29261 |      28895 |      31013
 smarm smarm-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
 ================================================================================
  deep_recursion: depth 500
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |            1 |         82 |         79 |        113
            smarm 1-thread |            1 |         85 |         78 |        143
      tokio current_thread |            1 |         25 |         25 |         56
        tokio multi-thread |            1 |         50 |         47 |         63
 ================================================================================
  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |     188698 |     187922 |     192263
      tokio current_thread |      1000000 |     150231 |     148746 |     151723
 ================================================================================
  uncontended_channel: 1→1, 1000000 msgs (single thread)
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |      1000000 |      28461 |      27638 |      30283
      tokio current_thread |      1000000 |      52224 |      51880 |      54732
 ================================================================================
  catch_unwind_panics: 10000 tasks, 50% panic
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     144604 |     143246 |     145585
            smarm 1-thread |        10000 |     148208 |     142691 |     151076
      tokio current_thread |        10000 |     265255 |     260637 |     271065
        tokio multi-thread |        10000 |     273131 |     271313 |     300420
--- a/benches/baseline-output/tokio_favored.txt
+++ b/benches/baseline-output/tokio_favored.txt
@@ -0,0 +1,42 @@
 smarm tokio-favored benchmarks
 available parallelism: 1 threads
 ITERS=15 (+1 warmup, discarded)
 STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
 ================================================================================
  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     105512 |     102322 |     120552
            smarm 1-thread |        10000 |     107113 |     104048 |     112377
      tokio current_thread |        10000 |       2222 |       2124 |       2506
        tokio multi-thread |        10000 |       4546 |       3833 |       7305
 ================================================================================
  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |       320000 |      10456 |      10331 |      10639
            smarm 1-thread |       320000 |      10395 |       9201 |      10549
      tokio current_thread |       320000 |      17348 |      16639 |      19061
        tokio multi-thread |       320000 |      18628 |      17499 |      19298
 ================================================================================
  many_timers: 10000 actors sleeping 1–10 ms
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        10000 |     120242 |     116239 |     127200
            smarm 1-thread |        10000 |     121023 |     113997 |     127826
      tokio current_thread |        10000 |      13581 |      13182 |      14415
        tokio multi-thread |        10000 |      14266 |      14084 |      14843
 ================================================================================
  multi_thread_scaling: primes in [2, 400000) across 64 workers
 ================================================================================
                   runtime |       result |  median µs |     min µs |     max µs
 --------------------------------------------------------------------------------
            smarm 1-thread |        33860 |      19852 |      19601 |      22679
      tokio multi 1-thread |        33860 |      19638 |      18994 |      20102
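The `multi_thread_scaling` result of 33860 is the number of primes in [2, 400000). The sketch below mirrors the `is_prime`, `count_primes`, and `primes_slice` helpers from the bench sources in this patch and checks that the 64 per-worker slices cover the range exactly once. The `sliced_total` function and the `main` driver are illustrative additions, not part of the bench.

```rust
// Sketch: split [0, PRIME_N) into WORKERS contiguous slices and confirm the
// per-slice prime counts add up to the serial count over the whole range.
const PRIME_N: u64 = 400_000;
const WORKERS: u64 = 64;

fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    if n < 4 { return true; }
    if n % 2 == 0 { return false; }
    let mut i = 3u64;
    // Trial division by odd candidates up to sqrt(n).
    while i * i <= n {
        if n % i == 0 { return false; }
        i += 2;
    }
    true
}

fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo..hi).filter(|&n| is_prime(n)).count() as u64
}

// Half-open slice [lo, hi) for worker `w`; the last worker absorbs any remainder.
fn primes_slice(w: u64) -> (u64, u64) {
    let per = PRIME_N / WORKERS;
    let lo = w * per;
    let hi = if w + 1 == WORKERS { PRIME_N } else { lo + per };
    (lo, hi)
}

// Illustrative helper (not in the bench): the serial sum over all slices.
fn sliced_total() -> u64 {
    (0..WORKERS)
        .map(|w| {
            let (lo, hi) = primes_slice(w);
            count_primes(lo, hi)
        })
        .sum()
}

fn main() {
    println!("primes below {PRIME_N}: {}", sliced_total());
}
```

Because 400000 divides evenly by 64, every slice holds 6250 candidates. Candidate density is uniform, but trial division costs more for larger `n`, so later slices likely take somewhat longer than earlier ones.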
--- a/benches/baseline.json
+++ b/benches/baseline.json
@@ -0,0 +1,224 @@
 {
  "chained_spawn": {
    "smarm 1-thread": {
      "result": 1000,
      "median": 8637,
      "min": 8553,
      "max": 8933
    },
    "tokio current_thread": {
      "result": 1000,
      "median": 124,
      "min": 124,
      "max": 153
    },
    "tokio multi-thread": {
      "result": 1000,
      "median": 188,
      "min": 183,
      "max": 229
    }
  },
  "yield_many": {
    "smarm 1-thread": {
      "result": 200000,
      "median": 41622,
      "min": 41063,
      "max": 44973
    },
    "tokio current_thread": {
      "result": 200000,
      "median": 15085,
      "min": 15013,
      "max": 15274
    },
    "tokio multi-thread": {
      "result": 200000,
      "median": 15964,
      "min": 15880,
      "max": 17959
    }
  },
  "fan_out_compute": {
    "smarm 1-thread": {
      "result": 33860,
      "median": 29727,
      "min": 29491,
      "max": 31634
    },
    "tokio current_thread": {
      "result": 33860,
      "median": 28503,
      "min": 28391,
      "max": 28866
    },
    "tokio multi-thread": {
      "result": 33860,
      "median": 34542,
      "min": 34396,
      "max": 36111
    }
  },
  "ping_pong_oneshot": {
    "smarm 1-thread": {
      "result": 1000,
      "median": 16848,
      "min": 16633,
      "max": 17301
    },
    "tokio current_thread": {
      "result": 1000,
      "median": 879,
      "min": 868,
      "max": 973
    },
    "tokio multi-thread": {
      "result": 1000,
      "median": 4328,
      "min": 4223,
      "max": 4461
    }
  },
  "spawn_storm_busy": {
    "smarm 1-thread": {
      "result": 10000,
      "median": 130058,
      "min": 126790,
      "max": 134475
    },
    "tokio current_thread": {
      "result": 10000,
      "median": 2772,
      "min": 2641,
      "max": 4367
    },
    "tokio multi-thread": {
      "result": 10000,
      "median": 7462,
      "min": 4469,
      "max": 12892
    }
  },
  "mpsc_contention": {
    "smarm 1-thread": {
      "result": 320000,
      "median": 9260,
      "min": 9095,
      "max": 10081
    },
    "tokio current_thread": {
      "result": 320000,
      "median": 17570,
      "min": 17213,
      "max": 18276
    },
    "tokio multi-thread": {
      "result": 320000,
      "median": 17593,
      "min": 17452,
      "max": 19564
    }
  },
  "many_timers": {
    "smarm 1-thread": {
      "result": 10000,
      "median": 135806,
      "min": 132573,
      "max": 141651
    },
    "tokio current_thread": {
      "result": 10000,
      "median": 14462,
      "min": 13555,
      "max": 15457
    },
    "tokio multi-thread": {
      "result": 10000,
      "median": 15011,
      "min": 14655,
      "max": 15368
    }
  },
  "multi_thread_scaling": {
    "smarm 1-thread": {
      "result": 33860,
      "median": 30029,
      "min": 29720,
      "max": 31351
    },
    "tokio multi 1-thread": {
      "result": 33860,
      "median": 28983,
      "min": 28908,
      "max": 29323
    }
  },
  "deep_recursion": {
    "smarm 1-thread": {
      "result": 1,
      "median": 83,
      "min": 78,
      "max": 587
    },
    "tokio current_thread": {
      "result": 1,
      "median": 25,
      "min": 25,
      "max": 33
    },
    "tokio multi-thread": {
      "result": 1,
      "median": 59,
      "min": 47,
      "max": 205
    }
  },
  "yield_in_hot_loop": {
    "smarm 1-thread": {
      "result": 1000000,
      "median": 188753,
      "min": 187007,
      "max": 194366
    },
    "tokio current_thread": {
      "result": 1000000,
      "median": 153929,
      "min": 152712,
      "max": 158749
    }
  },
  "uncontended_channel": {
    "smarm 1-thread": {
      "result": 1000000,
      "median": 26811,
      "min": 26498,
      "max": 29069
    },
    "tokio current_thread": {
      "result": 1000000,
      "median": 51888,
      "min": 51530,
      "max": 52708
    }
  },
  "catch_unwind_panics": {
    "smarm 1-thread": {
      "result": 10000,
      "median": 142215,
      "min": 140189,
      "max": 143570
    },
    "tokio current_thread": {
      "result": 10000,
      "median": 682295,
      "min": 670281,
      "max": 700774
    },
    "tokio multi-thread": {
      "result": 10000,
      "median": 662688,
      "min": 641453,
      "max": 681868
    }
  }
 }
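One way to read the baseline is as a ratio of medians. The sketch below hardcodes four `smarm 1-thread` and `tokio current_thread` medians copied from `benches/baseline.json` and prints smarm's time relative to tokio's. The `Entry` struct and `ratio` helper are hypothetical names for illustration and are not part of the repository.

```rust
/// Median wall-clock times in microseconds for one benchmark.
struct Entry {
    bench: &'static str,
    smarm_us: u64,
    tokio_current_us: u64,
}

// Medians copied from benches/baseline.json
// ("smarm 1-thread" and "tokio current_thread").
const BASELINE: [Entry; 4] = [
    Entry { bench: "chained_spawn", smarm_us: 8_637, tokio_current_us: 124 },
    Entry { bench: "mpsc_contention", smarm_us: 9_260, tokio_current_us: 17_570 },
    Entry { bench: "uncontended_channel", smarm_us: 26_811, tokio_current_us: 51_888 },
    Entry { bench: "catch_unwind_panics", smarm_us: 142_215, tokio_current_us: 682_295 },
];

/// smarm median divided by tokio median; values below 1.0 mean smarm was faster.
fn ratio(e: &Entry) -> f64 {
    e.smarm_us as f64 / e.tokio_current_us as f64
}

fn main() {
    for e in &BASELINE {
        let r = ratio(e);
        let verdict = if r < 1.0 { "smarm faster" } else { "tokio faster" };
        println!("{:<22} {:>8.2}x  ({verdict})", e.bench, r);
    }
}
```

These ratios describe a single machine reporting an available parallelism of 1, so they are a regression reference rather than a general performance claim.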
--- a/benches/general.rs
+++ b/benches/general.rs
@@ -0,0 +1,442 @@
 //! General benchmarks — workloads where neither runtime has a structural
 //! advantage. Both should be competitive; large gaps here indicate a real
 //! difference in per-task or per-yield overhead.
 //!
 //! Workloads:
 //!   1. chained_spawn  — task N spawns N+1, depth 1000. Spawn+exit overhead in
 //!                       a serial chain. Adapted from tokio's bench of the same
 //!                       name.
 //!   2. yield_many     — 200 actors × 1000 yields. Pure scheduling throughput
 //!                       with no allocation, no IO. Adapted from tokio.
 //!   3. fan_out_compute— count primes in [2, 400_000) across 64 workers. Same
 //!                       shape as multi_scheduler::primes but lives here for
 //!                       completeness.
 //!   4. ping_pong_oneshot — N rounds of (spawn pair, send oneshot, await).
 //!                       Closer to a request/response workload than channel
 //!                       ping-pong.
 use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::Arc;
 use std::time::Instant;
 // ---------------------------------------------------------------------------
 // Shared harness
 // ---------------------------------------------------------------------------
 const ITERS: u32 = 15;
 fn available_threads() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
 }
 fn print_header(title: &str) {
    println!("\n{}", "=".repeat(80));
    println!("  {title}");
    println!("{}", "=".repeat(80));
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        "runtime", "result", "median µs", "min µs", "max µs"
    );
    println!("{}", "-".repeat(80));
 }
 fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
    let mut times = Vec::new();
    let mut last = 0u64;
    // One warmup iteration, discarded.
    let _ = f();
    for _ in 0..n {
        let (v, t) = f();
        times.push(t);
        last = v;
    }
    times.sort_unstable();
    let median = times[times.len() / 2];
    let min = *times.iter().min().unwrap();
    let max = *times.iter().max().unwrap();
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        name, last, median, min, max
    );
 }
 // ---------------------------------------------------------------------------
 // 1. chained_spawn — depth 1000
 // ---------------------------------------------------------------------------
 const CHAIN_DEPTH: u64 = 1_000;
 fn bench_chained_smarm(threads: usize) -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c2 = counter.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        // Fire-and-forget chain, matching tokio's bench shape: each link
        // spawns the next link and exits immediately; depth 0 signals done
        // via a channel. Crucially this does *not* nest joins on the
        // spawner's stack — important because smarm actor stacks are a
        // fixed 64 KiB.
        let (tx, rx) = smarm::channel::<()>();
        fn iter(c: Arc<AtomicU64>, tx: smarm::Sender<()>, n: u64) {
            if n == 0 {
                tx.send(()).unwrap();
            } else {
                let cc = c.clone();
                smarm::spawn(move || {
                    cc.fetch_add(1, Ordering::Relaxed);
                     iter(cc, tx, n - 1);
                });
                // Caller exits; JoinHandle dropped, no parking.
            }
        }
        iter(c2, tx, CHAIN_DEPTH);
        rx.recv().unwrap();
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_chained_tokio_current() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c2 = counter.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        // Use a oneshot done channel like tokio's own chained_spawn bench.
        let (done_tx, done_rx) = tokio::sync::oneshot::channel();
        fn iter(
            c: Arc<AtomicU64>,
            done: tokio::sync::oneshot::Sender<()>,
            n: u64,
        ) {
            if n == 0 {
                let _ = done.send(());
            } else {
                tokio::task::spawn_local(async move {
                    c.fetch_add(1, Ordering::Relaxed);
                    iter(c, done, n - 1);
                });
            }
        }
        iter(c2, done_tx, CHAIN_DEPTH);
        let _ = done_rx.await;
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_chained_tokio_multi() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c2 = counter.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let (done_tx, done_rx) = tokio::sync::oneshot::channel();
        fn iter(c: Arc<AtomicU64>, done: tokio::sync::oneshot::Sender<()>, n: u64) {
            if n == 0 {
                let _ = done.send(());
            } else {
                tokio::spawn(async move {
                    c.fetch_add(1, Ordering::Relaxed);
                    iter(c, done, n - 1);
                });
            }
        }
        iter(c2, done_tx, CHAIN_DEPTH);
        let _ = done_rx.await;
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 2. yield_many — 200 actors × 1000 yields
 // ---------------------------------------------------------------------------
 const YIELD_TASKS: u64 = 200;
 const YIELD_ROUNDS: u64 = 1_000;
 fn bench_yield_smarm(threads: usize) -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(|| {
        let mut handles = Vec::new();
        for _ in 0..YIELD_TASKS {
            handles.push(smarm::spawn(|| {
                for _ in 0..YIELD_ROUNDS {
                    smarm::yield_now();
                }
            }));
        }
        for h in handles {
            h.join().unwrap();
        }
    });
    (YIELD_TASKS * YIELD_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_yield_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for _ in 0..YIELD_TASKS {
            handles.push(tokio::task::spawn_local(async move {
                for _ in 0..YIELD_ROUNDS {
                    tokio::task::yield_now().await;
                }
            }));
        }
        for h in handles {
            let _ = h.await;
        }
    });
    (YIELD_TASKS * YIELD_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_yield_tokio_multi() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for _ in 0..YIELD_TASKS {
            handles.push(tokio::spawn(async move {
                for _ in 0..YIELD_ROUNDS {
                    tokio::task::yield_now().await;
                }
            }));
        }
        for h in handles {
            let _ = h.await;
        }
    });
    (YIELD_TASKS * YIELD_ROUNDS, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 3. fan_out_compute — primes, same shape as multi_scheduler::primes
 // ---------------------------------------------------------------------------
 const PRIME_N: u64 = 400_000;
 const PRIME_WORKERS: u64 = 64;
 fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    if n < 4 { return true; }
    if n % 2 == 0 { return false; }
    let mut i = 3u64;
    while i * i <= n { if n % i == 0 { return false; } i += 2; }
    true
 }
 fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo..hi).filter(|&n| is_prime(n)).count() as u64
 }
 fn primes_slice(w: u64) -> (u64, u64) {
    let per = PRIME_N / PRIME_WORKERS;
    let lo = w * per;
    let hi = if w + 1 == PRIME_WORKERS { PRIME_N } else { lo + per };
    (lo, hi)
 }
 fn bench_primes_smarm(threads: usize) -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        let mut handles = Vec::new();
        for w in 0..PRIME_WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(smarm::spawn(move || {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_primes_tokio_current() -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for w in 0..PRIME_WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(tokio::task::spawn_local(async move {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_primes_tokio_multi() -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for w in 0..PRIME_WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(tokio::spawn(async move {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 4. ping_pong_oneshot — 1000 rounds of spawn-pair-await
 // ---------------------------------------------------------------------------
 const PP_ROUNDS: u64 = 1_000;
 fn bench_pp_smarm(threads: usize) -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(|| {
        for _ in 0..PP_ROUNDS {
             // smarm has no oneshot, so use two channel<()>s per round: A sends
             // ping, B replies pong, and the parent joins both actors.
            let (tx_ping, rx_ping) = smarm::channel::<()>();
            let (tx_pong, rx_pong) = smarm::channel::<()>();
            let hb = smarm::spawn(move || {
                rx_ping.recv().unwrap();
                tx_pong.send(()).unwrap();
            });
            let ha = smarm::spawn(move || {
                tx_ping.send(()).unwrap();
                rx_pong.recv().unwrap();
            });
            ha.join().unwrap();
            hb.join().unwrap();
        }
    });
    (PP_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_pp_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        for _ in 0..PP_ROUNDS {
            let (tx1, rx1) = tokio::sync::oneshot::channel::<()>();
            let (tx2, rx2) = tokio::sync::oneshot::channel::<()>();
            let hb = tokio::task::spawn_local(async move {
                rx1.await.unwrap();
                tx2.send(()).unwrap();
            });
            let ha = tokio::task::spawn_local(async move {
                tx1.send(()).unwrap();
                rx2.await.unwrap();
            });
            let _ = ha.await;
            let _ = hb.await;
        }
    });
    (PP_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_pp_tokio_multi() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        for _ in 0..PP_ROUNDS {
            let (tx1, rx1) = tokio::sync::oneshot::channel::<()>();
            let (tx2, rx2) = tokio::sync::oneshot::channel::<()>();
            let hb = tokio::spawn(async move {
                rx1.await.unwrap();
                tx2.send(()).unwrap();
            });
            let ha = tokio::spawn(async move {
                tx1.send(()).unwrap();
                rx2.await.unwrap();
            });
            let _ = ha.await;
            let _ = hb.await;
        }
    });
    (PP_ROUNDS, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // Knob helper: reads the SMARM_ALLOC_INTERVAL / SMARM_TIMESLICE_CYCLES env vars
 // so the sweep script can override the preemption knobs without recompiling.
 // ---------------------------------------------------------------------------
 fn bench_cfg(threads: usize) -> smarm::runtime::Config {
    let mut cfg = smarm::runtime::Config::exact(threads);
    if let Ok(v) = std::env::var("SMARM_ALLOC_INTERVAL") {
        if let Ok(n) = v.parse::<u32>() { cfg = cfg.alloc_interval(n); }
    }
    if let Ok(v) = std::env::var("SMARM_TIMESLICE_CYCLES") {
        if let Ok(n) = v.parse::<u64>() { cfg = cfg.timeslice_cycles(n); }
    }
    cfg
 }
 // ---------------------------------------------------------------------------
 // main
 // ---------------------------------------------------------------------------
 fn main() {
    let n = available_threads();
    println!("smarm general benchmarks");
    println!("available parallelism: {n} threads");
    println!("ITERS={ITERS} (+1 warmup, discarded)");
    println!(
        "CHAIN_DEPTH={CHAIN_DEPTH}, YIELD_TASKS={YIELD_TASKS}×{YIELD_ROUNDS}, \
         PRIME_N={PRIME_N}/{PRIME_WORKERS} workers, PP_ROUNDS={PP_ROUNDS}"
    );
    // ---- 1. chained_spawn ----
    print_header(&format!("chained_spawn: depth {CHAIN_DEPTH}"));
    run_n("smarm 1-thread", ITERS, || bench_chained_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_chained_smarm(n));
    run_n("tokio current_thread", ITERS, bench_chained_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_chained_tokio_multi);
    // ---- 2. yield_many ----
    print_header(&format!("yield_many: {YIELD_TASKS} tasks × {YIELD_ROUNDS} yields"));
    run_n("smarm 1-thread", ITERS, || bench_yield_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_yield_smarm(n));
    run_n("tokio current_thread", ITERS, bench_yield_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_yield_tokio_multi);
    // ---- 3. fan_out_compute ----
    print_header(&format!("fan_out_compute: primes in [2, {PRIME_N}) across {PRIME_WORKERS}"));
    run_n("smarm 1-thread", ITERS, || bench_primes_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_primes_smarm(n));
    run_n("tokio current_thread", ITERS, bench_primes_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_primes_tokio_multi);
    // ---- 4. ping_pong_oneshot ----
    print_header(&format!("ping_pong_oneshot: {PP_ROUNDS} rounds"));
    run_n("smarm 1-thread", ITERS, || bench_pp_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_pp_smarm(n));
    run_n("tokio current_thread", ITERS, bench_pp_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_pp_tokio_multi);
 }
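The `bench_cfg` helper in this file silently keeps the default whenever an environment variable is missing or fails to parse. A minimal sketch of that fallback pattern, separated from the environment so it can be tested, might look like the following. `Knobs`, `parse_knob`, and `knobs_from` are hypothetical stand-ins, and the default values are placeholders rather than smarm's actual defaults.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for the two preemption knobs; not smarm's real `Config`.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Knobs {
    alloc_interval: u32,
    timeslice_cycles: u64,
}

/// Parse an optional raw string; `None` when it is absent or malformed.
fn parse_knob<T: FromStr>(raw: Option<&str>) -> Option<T> {
    raw.and_then(|s| s.parse().ok())
}

/// Apply overrides on top of `defaults`, ignoring values that do not parse.
fn knobs_from(defaults: Knobs, alloc: Option<&str>, slice: Option<&str>) -> Knobs {
    Knobs {
        alloc_interval: parse_knob(alloc).unwrap_or(defaults.alloc_interval),
        timeslice_cycles: parse_knob(slice).unwrap_or(defaults.timeslice_cycles),
    }
}

fn main() {
    // Placeholder defaults for illustration only.
    let defaults = Knobs { alloc_interval: 128, timeslice_cycles: 1_000_000 };
    let alloc = std::env::var("SMARM_ALLOC_INTERVAL").ok();
    let slice = std::env::var("SMARM_TIMESLICE_CYCLES").ok();
    let knobs = knobs_from(defaults, alloc.as_deref(), slice.as_deref());
    println!("{knobs:?}");
}
```

Taking the raw strings as arguments keeps the parsing logic independent of process-wide state. The trade-off of the silent fallback is that a typo in a sweep script goes unnoticed, so a sweep may want to print the effective knobs alongside its results.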
--- a/benches/multi_scheduler.rs
+++ b/benches/multi_scheduler.rs
@@ -0,0 +1,343 @@
 //! Benchmarks for the multi-scheduler runtime.
 //!
 //! Three workloads, four runtime configurations:
 //!   - smarm single-thread  (exact = 1)
 //!   - smarm multi-thread   (exact = available_parallelism)
 //!   - tokio current_thread (single-thread baseline)
 //!   - tokio multi-thread   (the parallel comparison)
 //!
 //! Workloads:
 //!   1. Fan-out / fan-in compute  (primes) — CPU-bound, tests parallelism
 //!   2. Ping-pong                 — message-passing overhead, park/unpark cost
 //!   3. Spawn throughput          — cost of spawn + join per actor
 use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::Arc;
 use std::time::Instant;
 // ---------------------------------------------------------------------------
 // Shared helpers
 // ---------------------------------------------------------------------------
 fn available_threads() -> usize {
    std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1)
 }
 fn print_header(title: &str) {
    println!("\n{}", "=".repeat(80));
    println!("  {title}");
    println!("{}", "=".repeat(80));
    println!(
        "{:>22} | {:>12} | {:>10} | {:>10} | {:>10}",
        "runtime", "result", "median µs", "min µs", "max µs"
    );
    println!("{}", "-".repeat(80));
 }
 fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
    let mut times = Vec::new();
    let mut last = 0u64;
    for _ in 0..n {
        let (v, t) = f();
        times.push(t);
        last = v;
    }
    times.sort_unstable();
    let median = times[times.len() / 2];
    let min = *times.iter().min().unwrap();
    let max = *times.iter().max().unwrap();
    println!(
        "{:>22} | {:>12} | {:>10} | {:>10} | {:>10}",
        name, last, median, min, max
    );
 }
 const ITERS: u32 = 7;
 // ---------------------------------------------------------------------------
 // Workload 1: fan-out / fan-in primes
 // ---------------------------------------------------------------------------
 const PRIME_N: u64 = 400_000;
 const WORKERS: u64 = 64;
 fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    if n < 4 { return true; }
    if n % 2 == 0 { return false; }
    let mut i = 3u64;
    while i * i <= n { if n % i == 0 { return false; } i += 2; }
    true
 }
 fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo..hi).filter(|&n| is_prime(n)).count() as u64
 }
 fn primes_slice(w: u64) -> (u64, u64) {
    let per = PRIME_N / WORKERS;
    let lo = w * per;
    let hi = if w + 1 == WORKERS { PRIME_N } else { lo + per };
    (lo, hi)
 }
 fn bench_primes_smarm(threads: usize) -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let start = Instant::now();
    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
        let mut handles = Vec::new();
        for w in 0..WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(smarm::spawn(move || {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_primes_tokio_current() -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for w in 0..WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(tokio::task::spawn_local(async move {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_primes_tokio_multi() -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for w in 0..WORKERS {
            let (lo, hi) = primes_slice(w);
            let tc = t2.clone();
            handles.push(tokio::spawn(async move {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_primes_baseline() -> (u64, u128) {
    let start = Instant::now();
    let total: u64 = (0..WORKERS).map(|w| {
        let (lo, hi) = primes_slice(w);
        count_primes(lo, hi)
    }).sum();
    (total, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // Workload 2: channel ping-pong
 // ---------------------------------------------------------------------------
 const PING_ROUNDS: u64 = 10_000;
 fn bench_pingpong_smarm(threads: usize) -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(|| {
        let (tx_a, rx_a) = smarm::channel::<u64>();
        let (tx_b, rx_b) = smarm::channel::<u64>();
        let ha = smarm::spawn(move || {
            tx_a.send(0).unwrap();
            loop {
                let v = rx_b.recv().unwrap();
                if v >= PING_ROUNDS { break; }
                tx_a.send(v + 1).unwrap();
            }
        });
        let hb = smarm::spawn(move || {
            loop {
                let v = rx_a.recv().unwrap();
                tx_b.send(v + 1).unwrap();
                if v + 1 >= PING_ROUNDS { break; }
            }
        });
        ha.join().unwrap();
        hb.join().unwrap();
    });
    (PING_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_pingpong_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_all()
        .build()
        .unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let (tx_a, mut rx_a) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let (tx_b, mut rx_b) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let ha = tokio::task::spawn_local(async move {
            tx_a.send(0).unwrap();
            loop {
                let v = rx_b.recv().await.unwrap();
                if v >= PING_ROUNDS { break; }
                tx_a.send(v + 1).unwrap();
            }
        });
        let hb = tokio::task::spawn_local(async move {
            loop {
                let v = rx_a.recv().await.unwrap();
                tx_b.send(v + 1).unwrap();
                if v + 1 >= PING_ROUNDS { break; }
            }
        });
        let _ = ha.await;
        let _ = hb.await;
    });
    (PING_ROUNDS, start.elapsed().as_micros())
 }
 fn bench_pingpong_tokio_multi() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(2) // ping-pong only needs 2 threads
        .enable_all()
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let (tx_a, mut rx_a) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let (tx_b, mut rx_b) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let ha = tokio::spawn(async move {
            tx_a.send(0).unwrap();
            loop {
                let v = rx_b.recv().await.unwrap();
                if v >= PING_ROUNDS { break; }
                tx_a.send(v + 1).unwrap();
            }
        });
        let hb = tokio::spawn(async move {
            loop {
                let v = rx_a.recv().await.unwrap();
                tx_b.send(v + 1).unwrap();
                if v + 1 >= PING_ROUNDS { break; }
            }
        });
        let _ = ha.await;
        let _ = hb.await;
    });
    (PING_ROUNDS, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // Workload 3: spawn throughput
 // ---------------------------------------------------------------------------
 const SPAWN_COUNT: u64 = 1_000;
 fn bench_spawn_smarm(threads: usize) -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    let start = Instant::now();
    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
        let mut handles = Vec::new();
        for _ in 0..SPAWN_COUNT {
            let cc = c.clone();
            handles.push(smarm::spawn(move || {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_spawn_tokio_current() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for _ in 0..SPAWN_COUNT {
            let cc = c.clone();
            handles.push(tokio::task::spawn_local(async move {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_spawn_tokio_multi() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for _ in 0..SPAWN_COUNT {
            let cc = c.clone();
            handles.push(tokio::spawn(async move {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // main
 // ---------------------------------------------------------------------------
 fn main() {
    let n = available_threads();
    println!("smarm multi-scheduler benchmarks");
    println!("available parallelism: {n} threads");
    println!("PRIME_N={PRIME_N}, WORKERS={WORKERS}, PING_ROUNDS={PING_ROUNDS}, SPAWN_COUNT={SPAWN_COUNT}");
    // ---- Primes ----
    print_header(&format!("Fan-out/fan-in: count primes in [2, {PRIME_N}) across {WORKERS} workers"));
    run_n("baseline (serial)",       ITERS, bench_primes_baseline);
    run_n("smarm single-thread",     ITERS, || bench_primes_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_primes_smarm(n));
    run_n("tokio current_thread",    ITERS, bench_primes_tokio_current);
    run_n("tokio multi-thread",      ITERS, bench_primes_tokio_multi);
    // ---- Ping-pong ----
     print_header(&format!("Ping-pong: about {PING_ROUNDS} messages exchanged between two actors"));
    run_n("smarm single-thread",     ITERS, || bench_pingpong_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_pingpong_smarm(n));
    run_n("tokio current_thread",    ITERS, bench_pingpong_tokio_current);
    run_n("tokio multi-thread",      ITERS, bench_pingpong_tokio_multi);
    // ---- Spawn throughput ----
    print_header(&format!("Spawn throughput: {SPAWN_COUNT} actors spawned and joined"));
    run_n("smarm single-thread",     ITERS, || bench_spawn_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_spawn_smarm(n));
    run_n("tokio current_thread",    ITERS, bench_spawn_tokio_current);
    run_n("tokio multi-thread",      ITERS, bench_spawn_tokio_multi);
 }
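
The termination rules of the ping-pong workload above are easier to check with a sequential replay. The following is an illustrative Python sketch, not part of the benchmark: `simulate_pingpong` and the two deques are hypothetical stand-ins for the two actors and their channels. It applies the same send/receive rules as `bench_pingpong_smarm` and counts how many messages each side sends.

```python
from collections import deque


def simulate_pingpong(rounds: int) -> tuple[int, int]:
    """Replay the two-actor protocol sequentially.

    Returns (messages sent by actor A, messages sent by actor B).
    Hypothetical helper for illustration; the real benchmark runs the
    two loops as concurrent actors.
    """
    to_b = deque()  # stands in for the tx_a -> rx_a channel
    to_a = deque()  # stands in for the tx_b -> rx_b channel
    sent_a = 0
    sent_b = 0

    # Actor A opens the exchange with 0.
    to_b.append(0)
    sent_a += 1

    while True:
        # Actor B: receive, always reply with v + 1.
        # (B leaves its loop once v + 1 >= rounds; A stops on that same value.)
        v = to_b.popleft()
        to_a.append(v + 1)
        sent_b += 1

        # Actor A: receive, stop once the value reaches the limit,
        # otherwise reply with v + 1.
        v = to_a.popleft()
        if v >= rounds:
            break
        to_b.append(v + 1)
        sent_a += 1

    return sent_a, sent_b


print(simulate_pingpong(10_000))  # (5001, 5001)
```

Under these rules each actor sends `rounds // 2 + 1` messages, so with `PING_ROUNDS = 10_000` the run exchanges 10,002 messages in total, or roughly 5,000 full round-trips.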
--- a/benches/smarm_favored.rs
+++ b/benches/smarm_favored.rs
@@ -0,0 +1,408 @@
 //! Benchmarks where smarm's design has a structural advantage.
 //!
 //! These exist to show what the green-thread + stackful model buys you. The
 //! single-thread numbers are the most interesting ones — they isolate the
 //! per-switch / per-task cost from any contention story.
 //!
 //! Workloads:
 //!   9.  deep_recursion       — actor recurses 500 deep then returns. In
 //!                              smarm this is plain stack recursion on the
 //!                              growable mmap'd stack. In tokio, async fn
 //!                              can't directly recurse — each level must
 //!                              `Box::pin` its future. We measure both.
 //!   10. yield_in_hot_loop    — 2 actors ping yield_now back and forth 500k
 //!                              times. Pure context-switch cost; no
 //!                              channels, no allocation, no contention.
 //!                              Smarm's switch is ~6 GPRs + xmm save and a
 //!                              `ret`; tokio's is poll → state-machine →
 //!                              schedule.
 //!   11. uncontended_channel  — single producer, single consumer, 1M msgs,
 //!                              single-threaded runtime. With no
 //!                              cross-thread contention, smarm's
 //!                              Arc<Mutex<>> channel is essentially free,
 //!                              and the green-thread switch should beat
 //!                              tokio's future polling overhead.
 //!   12. catch_unwind_panics  — spawn 10k tasks; half panic, half succeed.
 //!                              Supervisor handles each. Exploratory — if
 //!                              there's no real gap, drop this one.
 use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::Arc;
 use std::time::Instant;
 // ---------------------------------------------------------------------------
 // Shared harness
 // ---------------------------------------------------------------------------
 const ITERS: u32 = 15;
 fn available_threads() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
 }
 fn print_header(title: &str) {
    println!("\n{}", "=".repeat(80));
    println!("  {title}");
    println!("{}", "=".repeat(80));
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        "runtime", "result", "median µs", "min µs", "max µs"
    );
    println!("{}", "-".repeat(80));
 }
 fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
    let mut times = Vec::new();
    let mut last = 0u64;
    let _ = f(); // warmup
    for _ in 0..n {
        let (v, t) = f();
        times.push(t);
        last = v;
    }
    times.sort_unstable();
    let median = times[times.len() / 2];
    let min = *times.iter().min().unwrap();
    let max = *times.iter().max().unwrap();
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        name, last, median, min, max
    );
 }
 // ---------------------------------------------------------------------------
 // 9. deep_recursion — 500 levels deep
 // ---------------------------------------------------------------------------
 // Each recursive frame holds an `&AtomicU64`, a `u64`, plus prologue/spill:
 // conservatively ~64 B/frame in release builds. Smarm actor stacks are a fixed
 // 64 KiB, so 500 levels (~32 KiB) leaves comfortable headroom while still being
 // deep enough to contrast plain stack recursion with Box::pin recursion.
 const RECURSE_DEPTH: u64 = 500;
 fn bench_recurse_smarm(threads: usize) -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        // Plain Rust recursion on the actor's own (growable) stack.
        fn recurse(c: &AtomicU64, n: u64) -> u64 {
            if n == 0 {
                c.fetch_add(1, Ordering::Relaxed);
                0
            } else {
                1 + recurse(c, n - 1)
            }
        }
        let h = smarm::spawn(move || {
            let _ = recurse(&t2, RECURSE_DEPTH);
        });
        h.join().unwrap();
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_recurse_tokio_current() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c2 = counter.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        // async fn can't self-recurse; each level returns a Box::pin'd future.
        // This is the canonical workaround a real user would write.
        fn recurse(
            c: Arc<AtomicU64>,
            n: u64,
        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64>>> {
            Box::pin(async move {
                if n == 0 {
                    c.fetch_add(1, Ordering::Relaxed);
                    0
                } else {
                    1 + recurse(c, n - 1).await
                }
            })
        }
        let h = tokio::task::spawn_local(async move {
            let _ = recurse(c2, RECURSE_DEPTH).await;
        });
        let _ = h.await;
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_recurse_tokio_multi() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let c2 = counter.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        fn recurse(
            c: Arc<AtomicU64>,
            n: u64,
        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64> + Send>> {
            Box::pin(async move {
                if n == 0 {
                    c.fetch_add(1, Ordering::Relaxed);
                    0
                } else {
                    1 + recurse(c, n - 1).await
                }
            })
        }
        let h = tokio::spawn(async move {
            let _ = recurse(c2, RECURSE_DEPTH).await;
        });
        let _ = h.await;
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 10. yield_in_hot_loop — 2 actors, 500k yields each, single thread
 // ---------------------------------------------------------------------------
 const HOT_YIELDS: u64 = 500_000;
 fn bench_hot_smarm() -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(1)).run(|| {
        let ha = smarm::spawn(|| {
            for _ in 0..HOT_YIELDS {
                smarm::yield_now();
            }
        });
        let hb = smarm::spawn(|| {
            for _ in 0..HOT_YIELDS {
                smarm::yield_now();
            }
        });
        ha.join().unwrap();
        hb.join().unwrap();
    });
    (HOT_YIELDS * 2, start.elapsed().as_micros())
 }
 fn bench_hot_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let ha = tokio::task::spawn_local(async move {
            for _ in 0..HOT_YIELDS {
                tokio::task::yield_now().await;
            }
        });
        let hb = tokio::task::spawn_local(async move {
            for _ in 0..HOT_YIELDS {
                tokio::task::yield_now().await;
            }
        });
        let _ = ha.await;
        let _ = hb.await;
    });
    (HOT_YIELDS * 2, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 11. uncontended_channel — 1 producer, 1 consumer, 1M msgs, single-threaded
 // ---------------------------------------------------------------------------
 const UNCONT_MSGS: u64 = 1_000_000;
 fn bench_unc_smarm() -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(1)).run(|| {
        let (tx, rx) = smarm::channel::<u64>();
        let consumer = smarm::spawn(move || {
            let mut count = 0u64;
             while rx.recv().is_ok() {
                count += 1;
            }
            let _ = count; // discard; run() closure must return ()
        });
        let producer = smarm::spawn(move || {
            for i in 0..UNCONT_MSGS {
                tx.send(i).unwrap();
            }
            // tx drops here, closing the channel.
        });
        producer.join().unwrap();
        let _ = consumer.join().unwrap();
    });
    (UNCONT_MSGS, start.elapsed().as_micros())
 }
 fn bench_unc_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let consumer = tokio::task::spawn_local(async move {
            let mut count = 0u64;
             while rx.recv().await.is_some() {
                count += 1;
            }
            count
        });
        let producer = tokio::task::spawn_local(async move {
            for i in 0..UNCONT_MSGS {
                tx.send(i).unwrap();
            }
        });
        let _ = producer.await;
        let _ = consumer.await;
    });
    (UNCONT_MSGS, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 12. catch_unwind_panics — 10k tasks, half panic
 // ---------------------------------------------------------------------------
 const PANIC_TASKS: u64 = 10_000;
 fn bench_panic_smarm(threads: usize) -> (u64, u128) {
    let ok = Arc::new(AtomicU64::new(0));
    let err = Arc::new(AtomicU64::new(0));
    let ok2 = ok.clone();
    let err2 = err.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        let mut handles = Vec::new();
        for i in 0..PANIC_TASKS {
            handles.push(smarm::spawn(move || {
                if i % 2 == 0 {
                    panic!("planned");
                }
            }));
        }
        for h in handles {
            match h.join() {
                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
            }
        }
    });
    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
    (total, start.elapsed().as_micros())
 }
 fn bench_panic_tokio_current() -> (u64, u128) {
    let ok = Arc::new(AtomicU64::new(0));
    let err = Arc::new(AtomicU64::new(0));
    let ok2 = ok.clone();
    let err2 = err.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for i in 0..PANIC_TASKS {
            handles.push(tokio::task::spawn_local(async move {
                if i % 2 == 0 {
                    panic!("planned");
                }
            }));
        }
        for h in handles {
            match h.await {
                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
            }
        }
    });
    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
    (total, start.elapsed().as_micros())
 }
 fn bench_panic_tokio_multi() -> (u64, u128) {
    let ok = Arc::new(AtomicU64::new(0));
    let err = Arc::new(AtomicU64::new(0));
    let ok2 = ok.clone();
    let err2 = err.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for i in 0..PANIC_TASKS {
            handles.push(tokio::spawn(async move {
                if i % 2 == 0 {
                    panic!("planned");
                }
            }));
        }
        for h in handles {
            match h.await {
                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
            }
        }
    });
    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
    (total, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // Knob helper: reads the SMARM_ALLOC_INTERVAL / SMARM_TIMESLICE_CYCLES env vars
 // so the sweep script can override the preemption knobs without recompiling.
 // Unset or unparseable values are ignored and the Config defaults are kept.
 // ---------------------------------------------------------------------------
 fn bench_cfg(threads: usize) -> smarm::runtime::Config {
     let mut cfg = smarm::runtime::Config::exact(threads);
     if let Ok(v) = std::env::var("SMARM_ALLOC_INTERVAL") {
         if let Ok(n) = v.parse::<u32>() { cfg = cfg.alloc_interval(n); }
     }
     if let Ok(v) = std::env::var("SMARM_TIMESLICE_CYCLES") {
         if let Ok(n) = v.parse::<u64>() { cfg = cfg.timeslice_cycles(n); }
     }
     cfg
 }
 // ---------------------------------------------------------------------------
 // main
 // ---------------------------------------------------------------------------
 fn main() {
    let n = available_threads();
    println!("smarm smarm-favored benchmarks");
    println!("available parallelism: {n} threads");
    println!("ITERS={ITERS} (+1 warmup, discarded)");
    println!(
        "RECURSE_DEPTH={RECURSE_DEPTH}, HOT_YIELDS={HOT_YIELDS}×2, \
         UNCONT_MSGS={UNCONT_MSGS}, PANIC_TASKS={PANIC_TASKS}"
    );
    // ---- 9. deep_recursion ----
    print_header(&format!("deep_recursion: depth {RECURSE_DEPTH}"));
    run_n("smarm 1-thread", ITERS, || bench_recurse_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_recurse_smarm(n));
    run_n("tokio current_thread", ITERS, bench_recurse_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_recurse_tokio_multi);
    // ---- 10. yield_in_hot_loop ----
    print_header(&format!("yield_in_hot_loop: 2 actors × {HOT_YIELDS} yields (single thread)"));
    run_n("smarm 1-thread", ITERS, bench_hot_smarm);
    run_n("tokio current_thread", ITERS, bench_hot_tokio_current);
    // ---- 11. uncontended_channel ----
    print_header(&format!("uncontended_channel: 1→1, {UNCONT_MSGS} msgs (single thread)"));
    run_n("smarm 1-thread", ITERS, bench_unc_smarm);
    run_n("tokio current_thread", ITERS, bench_unc_tokio_current);
    // ---- 12. catch_unwind_panics ----
    print_header(&format!("catch_unwind_panics: {PANIC_TASKS} tasks, 50% panic"));
    run_n("smarm 1-thread", ITERS, || bench_panic_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_panic_smarm(n));
    run_n("tokio current_thread", ITERS, bench_panic_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_panic_tokio_multi);
 }
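
The rows printed by `run_n` above are the same rows that `benches/sweep.py` later scrapes with its `ROW_RE` pattern. The following Python sketch shows that round trip under stated assumptions: `format_row` is a hypothetical helper that mirrors the Rust format string and median selection, and the timings are borrowed from the example row in the sweep script's comments rather than from a real run.

```python
import re

# Same pattern as ROW_RE in benches/sweep.py, split over two lines.
ROW_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<result>\d+)\s*\|\s*(?P<median>\d+)"
    r"\s*\|\s*(?P<min>\d+)\s*\|\s*(?P<max>\d+)\s*$"
)


def format_row(name: str, result: int, times_us: list[int]) -> str:
    """Hypothetical mirror of run_n's output line (column widths 26/12/10/10/10)."""
    times = sorted(times_us)
    median = times[len(times) // 2]  # upper median, as in the Rust harness
    return (
        f"{name:>26} | {result:>12} | {median:>10} | "
        f"{times[0]:>10} | {times[-1]:>10}"
    )


# Illustrative timings only (taken from the example row in sweep.py's comments).
row = format_row("smarm 1-thread", 1_000_000, [33113, 28719, 31473])
match = ROW_RE.match(row)
parsed = {key: int(match.group(key)) for key in ("result", "median", "min", "max")}

# The column-title line printed by print_header has no digits in the result
# column, so the pattern skips it.
title = f"{'runtime':>26} | {'result':>12} | {'median µs':>10} | {'min µs':>10} | {'max µs':>10}"
title_match = ROW_RE.match(title)

print(match.group("name"), parsed)
```

Because the parser keys only on the `|`-separated integer columns, the column widths in `run_n` can change without breaking the sweep script, but adding or removing a column would.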
--- a/benches/sweep.py
+++ b/benches/sweep.py
@@ -0,0 +1,347 @@
 #!/usr/bin/env python3
 """
 smarm bench sweep + regression checker.
 Usage:
    # Run a full knob sweep and print a comparison table:
    python3 benches/sweep.py sweep
    # Check the current build against the committed baseline:
    python3 benches/sweep.py regress
    # Run all benches once (default knobs) and print results:
    python3 benches/sweep.py run
 The sweep grid is defined in SWEEP_GRID below.
 The regression baseline is loaded from benches/baseline.json.
 """
 import argparse
 import json
 import os
 import re
 import subprocess
 import sys
 from pathlib import Path
 # ---------------------------------------------------------------------------
 # Configuration
 # ---------------------------------------------------------------------------
 REPO = Path(__file__).resolve().parent.parent
 # Bench files to run (primes + multi_scheduler omitted — legacy harness,
 # not part of the 12-bench suite, and insensitive to the preemption knobs).
 BENCHES = ["general", "tokio_favored", "smarm_favored"]
 # Knob sweep grid: (alloc_interval, timeslice_cycles)
 # alloc_interval: lower = check RDTSC more often = finer preemption
 # timeslice_cycles: lower = shorter timeslice = more frequent preemption
 SWEEP_GRID = [
    (32,  150_000),
    (64,  150_000),
    (128, 150_000),   # default interval, shorter slice
    (32,  300_000),
    (64,  300_000),
    (128, 300_000),   # <<< baseline (defaults)
    (256, 300_000),
    (512, 300_000),
    (128, 600_000),
    (128, 1_200_000),
 ]
 # Regression threshold: warn if median is more than this % worse than baseline.
 REGRESSION_THRESHOLD_PCT = 10
 # ---------------------------------------------------------------------------
 # Parsing
 # ---------------------------------------------------------------------------
 # Match lines like:
 #   "          smarm 1-thread |      1000000 |      31473 |      28719 |      33113"
 ROW_RE = re.compile(
    r"^\s*(?P<name>[^|]+?)\s*\|\s*(?P<result>\d+)\s*\|\s*(?P<median>\d+)\s*\|\s*(?P<min>\d+)\s*\|\s*(?P<max>\d+)\s*$"
 )
 # Match section headers like:
 #   "  chained_spawn: depth 1000"
 HEADER_RE = re.compile(r"^\s{2}(?P<bench>[a-z_]+)[:—]")
 def parse_output(text: str) -> dict[str, dict[str, dict]]:
    """
    Returns {bench_name: {runtime_label: {median, min, max, result}}}.
    bench_name is the snake_case name extracted from the section header.
    """
    results: dict[str, dict[str, dict]] = {}
    current_bench = None
    for line in text.splitlines():
        hm = HEADER_RE.match(line)
        if hm:
            current_bench = hm.group("bench")
            results.setdefault(current_bench, {})
            continue
        if current_bench is None:
            continue
        rm = ROW_RE.match(line)
        if rm:
            label = rm.group("name").strip()
            results[current_bench][label] = {
                "result": int(rm.group("result")),
                "median": int(rm.group("median")),
                "min":    int(rm.group("min")),
                "max":    int(rm.group("max")),
            }
    return results
 # ---------------------------------------------------------------------------
 # Running
 # ---------------------------------------------------------------------------
 def run_benches(env_extra: dict[str, str] | None = None) -> dict[str, dict[str, dict]]:
    """Run all BENCHES and return merged parsed results."""
    env = os.environ.copy()
    if env_extra:
        env.update(env_extra)
    all_results: dict[str, dict[str, dict]] = {}
    for bench in BENCHES:
        cmd = ["cargo", "bench", "--bench", bench]
        proc = subprocess.run(
            cmd,
            cwd=REPO,
            env=env,
            capture_output=True,
            text=True,
        )
        if proc.returncode != 0:
            print(f"  ERROR running {bench}:\n{proc.stderr[-800:]}", file=sys.stderr)
            continue
        parsed = parse_output(proc.stdout)
        all_results.update(parsed)
    return all_results
 # ---------------------------------------------------------------------------
 # Baseline JSON
 # ---------------------------------------------------------------------------
 BASELINE_PATH = REPO / "benches" / "baseline.json"
 def load_baseline() -> dict:
    if not BASELINE_PATH.exists():
        sys.exit(
            f"No baseline found at {BASELINE_PATH}.\n"
            "Run:  python3 benches/sweep.py run  then save the output manually,\n"
            "or use --save-baseline with the run subcommand."
        )
    return json.loads(BASELINE_PATH.read_text())
 def save_baseline(results: dict) -> None:
    BASELINE_PATH.write_text(json.dumps(results, indent=2))
    print(f"Baseline saved to {BASELINE_PATH}")
 # ---------------------------------------------------------------------------
 # Regression check
 # ---------------------------------------------------------------------------
 def check_regressions(current: dict, baseline: dict) -> bool:
    """
    Compare current results to baseline. Print warnings for regressions.
    Returns True if any regression found.
    """
    any_regression = False
    for bench, runtimes in baseline.items():
        cur_bench = current.get(bench, {})
        for label, base_data in runtimes.items():
            cur_data = cur_bench.get(label)
            if cur_data is None:
                print(f"  MISSING  {bench}/{label} — not present in current run")
                any_regression = True
                continue
            base_med = base_data["median"]
            cur_med  = cur_data["median"]
            if base_med == 0:
                continue
            pct = (cur_med - base_med) / base_med * 100
            if pct > REGRESSION_THRESHOLD_PCT:
                print(
                    f"  REGRESSION  {bench}/{label}: "
                    f"{base_med} → {cur_med} µs  ({pct:+.1f}%)"
                )
                any_regression = True
            elif pct < -REGRESSION_THRESHOLD_PCT:
                print(
                    f"  IMPROVEMENT {bench}/{label}: "
                    f"{base_med} → {cur_med} µs  ({pct:+.1f}%)"
                )
    return any_regression
 # ---------------------------------------------------------------------------
 # Pretty print
 # ---------------------------------------------------------------------------
 def print_results(results: dict, label: str = "") -> None:
    if label:
        print(f"\n{'='*70}")
        print(f"  {label}")
        print(f"{'='*70}")
    for bench, runtimes in sorted(results.items()):
        print(f"\n  [{bench}]")
        print(f"  {'runtime':>28} | {'result':>10} | {'median µs':>10} | {'min':>8} | {'max':>8}")
        print(f"  {'-'*75}")
        for rt_label, data in runtimes.items():
            print(
                f"  {rt_label:>28} | {data['result']:>10} | "
                f"{data['median']:>10} | {data['min']:>8} | {data['max']:>8}"
            )
 def print_sweep_table(sweep_results: list[tuple[int, int, dict]]) -> None:
    """Print a compact comparison across sweep points for each bench/runtime."""
    # Collect all bench/label pairs
    all_keys: list[tuple[str, str]] = []
    for _, _, results in sweep_results:
        for bench, runtimes in results.items():
            for label in runtimes:
                key = (bench, label)
                if key not in all_keys:
                    all_keys.append(key)
    # Header
     col_w = 16  # wide enough for the longest tag, e.g. "ai=128/tc=1200k" (15 chars)
    print(f"\n{'bench/runtime':<45}", end="")
    for interval, cycles, _ in sweep_results:
        tag = f"ai={interval}/tc={cycles//1000}k"
        print(f"  {tag:>{col_w}}", end="")
    print()
    print("-" * (45 + (col_w + 2) * len(sweep_results)))
    for bench, label in all_keys:
        key_str = f"{bench}/{label}"
        print(f"  {key_str:<43}", end="")
        for _, _, results in sweep_results:
            val = results.get(bench, {}).get(label, {}).get("median")
            cell = str(val) if val is not None else "—"
            print(f"  {cell:>{col_w}}", end="")
        print()
 # ---------------------------------------------------------------------------
 # Subcommands
 # ---------------------------------------------------------------------------
 def cmd_run(args) -> None:
    print("Building release binaries…")
    subprocess.run(
        ["cargo", "build", "--release", "--benches"],
        cwd=REPO, check=True, capture_output=True,
    )
    print("Running benches…")
    results = run_benches()
    print_results(results, "Results (default knobs)")
    if args.save_baseline:
        save_baseline(results)
 def cmd_regress(args) -> None:
    baseline = load_baseline()
    print("Building release binaries…")
    subprocess.run(
        ["cargo", "build", "--release", "--benches"],
        cwd=REPO, check=True, capture_output=True,
    )
    print("Running benches…")
    current = run_benches()
    print_results(current, "Current results")
    print(f"\nRegression check (threshold: >{REGRESSION_THRESHOLD_PCT}% slower than baseline)")
    print("-" * 60)
    found = check_regressions(current, baseline)
    if not found:
        print("  No regressions detected.")
    sys.exit(1 if found else 0)
 def cmd_sweep(args) -> None:
    print("Building release binaries (once)…")
    subprocess.run(
        ["cargo", "build", "--release", "--benches"],
        cwd=REPO, check=True, capture_output=True,
    )
    # Benches are pre-built; env vars change runtime behaviour, no recompile needed.
    sweep_results: list[tuple[int, int, dict]] = []
    for interval, cycles in SWEEP_GRID:
        tag = f"alloc_interval={interval}, timeslice_cycles={cycles}"
        print(f"  Running: {tag} …", flush=True)
        env_extra = {
            "SMARM_ALLOC_INTERVAL":    str(interval),
            "SMARM_TIMESLICE_CYCLES":  str(cycles),
        }
        results = run_benches(env_extra)
        sweep_results.append((interval, cycles, results))
    print_sweep_table(sweep_results)
    if args.save_csv:
        import csv
        rows = []
        for interval, cycles, results in sweep_results:
            for bench, runtimes in results.items():
                for label, data in runtimes.items():
                    rows.append({
                        "alloc_interval": interval,
                        "timeslice_cycles": cycles,
                        "bench": bench,
                        "runtime": label,
                        **data,
                    })
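        if not rows:
            # No bench produced output; rows[0] below would raise IndexError.
            print("\nNo results collected; CSV not written.")
        else:
            with open(args.save_csv, "w", newline="") as f:
                writer = csv.DictWriter(f, fieldnames=rows[0].keys())
                writer.writeheader()
                writer.writerows(rows)
            print(f"\nCSV saved to {args.save_csv}")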
 # ---------------------------------------------------------------------------
 # Entry point
 # ---------------------------------------------------------------------------
 def main() -> None:
    parser = argparse.ArgumentParser(description=__doc__, formatter_class=argparse.RawDescriptionHelpFormatter)
    sub = parser.add_subparsers(dest="cmd", required=True)
    p_run = sub.add_parser("run", help="Run benches once with default knobs")
    p_run.add_argument("--save-baseline", action="store_true",
                       help="Save results as the regression baseline")
    p_run.set_defaults(func=cmd_run)
    p_reg = sub.add_parser("regress", help="Check current results against baseline")
    p_reg.set_defaults(func=cmd_regress)
    p_sw = sub.add_parser("sweep", help="Sweep preemption knobs and compare")
    p_sw.add_argument("--save-csv", metavar="FILE",
                      help="Write full sweep results to a CSV file")
    p_sw.set_defaults(func=cmd_sweep)
    args = parser.parse_args()
    args.func(args)
 if __name__ == "__main__":
    main()
--- a/benches/tokio_favored.rs
+++ b/benches/tokio_favored.rs
@@ -0,0 +1,487 @@
 //! Benchmarks where tokio's design has a structural advantage.
 //!
 //! These exist to *measure* the cost of smarm's design choices, not to flatter
 //! either runtime. Expect tokio to win these; the value is in knowing by how
 //! much, and in catching regressions where the gap widens.
 //!
 //! Workloads:
 //!   5. spawn_storm_busy    — keep N workers busy with yielding tasks, then
 //!                            spawn 10k zero-work tasks and join. Adapted from
 //!                            tokio's `spawn_many_remote_busy1`. Tokio's
 //!                            work-stealing deques + per-worker LIFO slot
 //!                            should beat smarm's single global Mutex<>
 //!                            run queue.
 //!   6. mpsc_contention     — 32 producer actors, 1 consumer, 10k messages
 //!                            each. Tokio's mpsc is lock-free on the hot path;
 //!                            smarm's channel is Arc<Mutex<Inner>> per channel
 //!                            *and* takes the runtime mutex on each unpark.
 //!   7. many_timers         — 10k actors each sleep for a random short
 //!                            duration (1–10 ms, deterministic per actor index), all wake within a tight
 //!                            window. Tokio's per-worker sharded timer wheel
 //!                            vs smarm's single shared min-heap (and single
 //!                            drain-lock winner).
 //!   8. multi_thread_scaling— primes again, but sweep thread count 1, 2, 4,
 //!                            available_parallelism(). Smarm's mutex ceiling
 //!                            should show up as soon as scheduling overhead
 //!                            is non-trivial relative to per-actor work.
 use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
 use std::sync::Arc;
 use std::time::{Duration, Instant};
 // ---------------------------------------------------------------------------
 // Shared harness
 // ---------------------------------------------------------------------------
 const ITERS: u32 = 15;
 fn available_threads() -> usize {
    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
 }
 fn print_header(title: &str) {
    println!("\n{}", "=".repeat(80));
    println!("  {title}");
    println!("{}", "=".repeat(80));
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        "runtime", "result", "median µs", "min µs", "max µs"
    );
    println!("{}", "-".repeat(80));
 }
 fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
    let mut times = Vec::new();
    let mut last = 0u64;
    let _ = f(); // warmup
    for _ in 0..n {
        let (v, t) = f();
        times.push(t);
        last = v;
    }
    times.sort_unstable();
    let median = times[times.len() / 2];
    let min = *times.iter().min().unwrap();
    let max = *times.iter().max().unwrap();
    println!(
        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
        name, last, median, min, max
    );
 }
 // ---------------------------------------------------------------------------
 // 5. spawn_storm_busy — workers loaded, then storm of zero-work spawns
 // ---------------------------------------------------------------------------
 const STORM_BACKGROUND: u64 = 8;   // number of background "busy" actors
 const STORM_SPAWN: u64 = 10_000;   // zero-work spawns to time
 fn bench_storm_smarm(threads: usize) -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));
    let c2 = counter.clone();
    let s2 = stop.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        // Background actors: yield in a tight loop until told to stop.
        let mut bg_handles = Vec::new();
        for _ in 0..STORM_BACKGROUND {
            let s = s2.clone();
            bg_handles.push(smarm::spawn(move || {
                while !s.load(Ordering::Relaxed) {
                    smarm::yield_now();
                }
            }));
        }
        // Storm: spawn 10k zero-work actors and join them all.
        let mut handles = Vec::new();
        for _ in 0..STORM_SPAWN {
            let cc = c2.clone();
            handles.push(smarm::spawn(move || {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { h.join().unwrap(); }
        // Tear down background.
        s2.store(true, Ordering::Relaxed);
        for h in bg_handles { h.join().unwrap(); }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_storm_tokio_current() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));
    let c2 = counter.clone();
    let s2 = stop.clone();
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut bg_handles = Vec::new();
        for _ in 0..STORM_BACKGROUND {
            let s = s2.clone();
            bg_handles.push(tokio::task::spawn_local(async move {
                while !s.load(Ordering::Relaxed) {
                    tokio::task::yield_now().await;
                }
            }));
        }
        let mut handles = Vec::new();
        for _ in 0..STORM_SPAWN {
            let cc = c2.clone();
            handles.push(tokio::task::spawn_local(async move {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
        s2.store(true, Ordering::Relaxed);
        for h in bg_handles { let _ = h.await; }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_storm_tokio_multi() -> (u64, u128) {
    let counter = Arc::new(AtomicU64::new(0));
    let stop = Arc::new(AtomicBool::new(false));
    let c2 = counter.clone();
    let s2 = stop.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut bg_handles = Vec::new();
        for _ in 0..STORM_BACKGROUND {
            let s = s2.clone();
            bg_handles.push(tokio::spawn(async move {
                while !s.load(Ordering::Relaxed) {
                    tokio::task::yield_now().await;
                }
            }));
        }
        let mut handles = Vec::new();
        for _ in 0..STORM_SPAWN {
            let cc = c2.clone();
            handles.push(tokio::spawn(async move {
                cc.fetch_add(1, Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
        s2.store(true, Ordering::Relaxed);
        for h in bg_handles { let _ = h.await; }
    });
    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 6. mpsc_contention — 32 producers × 10k msgs into 1 consumer
 // ---------------------------------------------------------------------------
 const MPSC_PRODUCERS: u64 = 32;
 const MPSC_PER_PRODUCER: u64 = 10_000;
 fn bench_mpsc_smarm(threads: usize) -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(|| {
        let (tx, rx) = smarm::channel::<u64>();
        let mut prod_handles = Vec::new();
        for p in 0..MPSC_PRODUCERS {
            let tx = tx.clone();
            prod_handles.push(smarm::spawn(move || {
                for i in 0..MPSC_PER_PRODUCER {
                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
                }
            }));
        }
        drop(tx); // close once producers drop
        let consumer = smarm::spawn(move || {
            let mut count = 0u64;
            while rx.recv().is_ok() {
                count += 1;
            }
            let _ = count; // discard; run() closure must return ()
        });
        for h in prod_handles { h.join().unwrap(); }
        let _ = consumer.join().unwrap();
    });
    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
 }
 fn bench_mpsc_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let mut prod_handles = Vec::new();
        for p in 0..MPSC_PRODUCERS {
            let tx = tx.clone();
            prod_handles.push(tokio::task::spawn_local(async move {
                for i in 0..MPSC_PER_PRODUCER {
                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
                }
            }));
        }
        drop(tx);
        let consumer = tokio::task::spawn_local(async move {
            let mut count = 0u64;
            while rx.recv().await.is_some() {
                count += 1;
            }
            count
        });
        for h in prod_handles { let _ = h.await; }
        let _ = consumer.await;
    });
    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
 }
 fn bench_mpsc_tokio_multi() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
        let mut prod_handles = Vec::new();
        for p in 0..MPSC_PRODUCERS {
            let tx = tx.clone();
            prod_handles.push(tokio::spawn(async move {
                for i in 0..MPSC_PER_PRODUCER {
                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
                }
            }));
        }
        drop(tx);
        let consumer = tokio::spawn(async move {
            let mut count = 0u64;
            while rx.recv().await.is_some() {
                count += 1;
            }
            count
        });
        for h in prod_handles { let _ = h.await; }
        let _ = consumer.await;
    });
    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 7. many_timers — 10k sleeping actors waking in a tight window
 // ---------------------------------------------------------------------------
 const TIMER_ACTORS: u64 = 10_000;
 const TIMER_MIN_MS: u64 = 1;
 const TIMER_MAX_MS: u64 = 10;
 // Deterministic per-actor delay so iterations are comparable.
 fn timer_delay_ms(i: u64) -> u64 {
    TIMER_MIN_MS + (i * 2654435761u64 >> 32) % (TIMER_MAX_MS - TIMER_MIN_MS + 1)
 }
 fn bench_timers_smarm(threads: usize) -> (u64, u128) {
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(|| {
        let mut handles = Vec::new();
        for i in 0..TIMER_ACTORS {
            let ms = timer_delay_ms(i);
            handles.push(smarm::spawn(move || {
                smarm::sleep(Duration::from_millis(ms));
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    (TIMER_ACTORS, start.elapsed().as_micros())
 }
 fn bench_timers_tokio_current() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_current_thread()
        .enable_time()
        .build()
        .unwrap();
    let start = Instant::now();
    let local = tokio::task::LocalSet::new();
    local.block_on(&rt, async move {
        let mut handles = Vec::new();
        for i in 0..TIMER_ACTORS {
            let ms = timer_delay_ms(i);
            handles.push(tokio::task::spawn_local(async move {
                tokio::time::sleep(Duration::from_millis(ms)).await;
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (TIMER_ACTORS, start.elapsed().as_micros())
 }
 fn bench_timers_tokio_multi() -> (u64, u128) {
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(available_threads())
        .enable_time()
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for i in 0..TIMER_ACTORS {
            let ms = timer_delay_ms(i);
            handles.push(tokio::spawn(async move {
                tokio::time::sleep(Duration::from_millis(ms)).await;
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (TIMER_ACTORS, start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // 8. multi_thread_scaling — primes, sweep thread count
 // ---------------------------------------------------------------------------
 const SCALING_N: u64 = 400_000;
 const SCALING_WORKERS: u64 = 64;
 fn is_prime(n: u64) -> bool {
    if n < 2 { return false; }
    if n < 4 { return true; }
    if n % 2 == 0 { return false; }
    let mut i = 3u64;
    while i * i <= n { if n % i == 0 { return false; } i += 2; }
    true
 }
 fn count_primes(lo: u64, hi: u64) -> u64 {
    (lo..hi).filter(|&n| is_prime(n)).count() as u64
 }
 fn scaling_slice(w: u64) -> (u64, u64) {
    let per = SCALING_N / SCALING_WORKERS;
    let lo = w * per;
    let hi = if w + 1 == SCALING_WORKERS { SCALING_N } else { lo + per };
    (lo, hi)
 }
 fn bench_scaling_smarm(threads: usize) -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let start = Instant::now();
    smarm::runtime::init(bench_cfg(threads)).run(move || {
        let mut handles = Vec::new();
        for w in 0..SCALING_WORKERS {
            let (lo, hi) = scaling_slice(w);
            let tc = t2.clone();
            handles.push(smarm::spawn(move || {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 fn bench_scaling_tokio_multi(threads: usize) -> (u64, u128) {
    let total = Arc::new(AtomicU64::new(0));
    let t2 = total.clone();
    let rt = tokio::runtime::Builder::new_multi_thread()
        .worker_threads(threads)
        .build()
        .unwrap();
    let start = Instant::now();
    rt.block_on(async move {
        let mut handles = Vec::new();
        for w in 0..SCALING_WORKERS {
            let (lo, hi) = scaling_slice(w);
            let tc = t2.clone();
            handles.push(tokio::spawn(async move {
                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
            }));
        }
        for h in handles { let _ = h.await; }
    });
    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
 }
 // ---------------------------------------------------------------------------
 // Knob helper: reads SMARM_ALLOC_INTERVAL / SMARM_TIMESLICE_CYCLES env vars
 // so the sweep script can override the preemption knobs without recompiling.
 // ---------------------------------------------------------------------------
 fn bench_cfg(threads: usize) -> smarm::runtime::Config {
    let mut cfg = smarm::runtime::Config::exact(threads);
    if let Ok(v) = std::env::var("SMARM_ALLOC_INTERVAL") {
        if let Ok(n) = v.parse::<u32>() { cfg = cfg.alloc_interval(n); }
    }
    if let Ok(v) = std::env::var("SMARM_TIMESLICE_CYCLES") {
        if let Ok(n) = v.parse::<u64>() { cfg = cfg.timeslice_cycles(n); }
    }
    cfg
 }
 // ---------------------------------------------------------------------------
 // main
 // ---------------------------------------------------------------------------
 fn main() {
    let n = available_threads();
    println!("smarm tokio-favored benchmarks");
    println!("available parallelism: {n} threads");
    println!("ITERS={ITERS} (+1 warmup, discarded)");
    println!(
        "STORM_BACKGROUND={STORM_BACKGROUND}, STORM_SPAWN={STORM_SPAWN}, \
         MPSC={MPSC_PRODUCERS}×{MPSC_PER_PRODUCER}, \
         TIMER_ACTORS={TIMER_ACTORS} ({TIMER_MIN_MS}–{TIMER_MAX_MS} ms), \
         SCALING_N={SCALING_N}/{SCALING_WORKERS}"
    );
    // ---- 5. spawn_storm_busy ----
    print_header(&format!(
        "spawn_storm_busy: {STORM_BACKGROUND} bg yielders + {STORM_SPAWN} zero-work spawns"
    ));
    run_n("smarm 1-thread", ITERS, || bench_storm_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_storm_smarm(n));
    run_n("tokio current_thread", ITERS, bench_storm_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_storm_tokio_multi);
    // ---- 6. mpsc_contention ----
    print_header(&format!(
        "mpsc_contention: {MPSC_PRODUCERS} producers × {MPSC_PER_PRODUCER} msgs → 1 consumer"
    ));
    run_n("smarm 1-thread", ITERS, || bench_mpsc_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_mpsc_smarm(n));
    run_n("tokio current_thread", ITERS, bench_mpsc_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_mpsc_tokio_multi);
    // ---- 7. many_timers ----
    print_header(&format!(
        "many_timers: {TIMER_ACTORS} actors sleeping {TIMER_MIN_MS}–{TIMER_MAX_MS} ms"
    ));
    run_n("smarm 1-thread", ITERS, || bench_timers_smarm(1));
    run_n(&format!("smarm {n}-thread"), ITERS, || bench_timers_smarm(n));
    run_n("tokio current_thread", ITERS, bench_timers_tokio_current);
    run_n("tokio multi-thread", ITERS, bench_timers_tokio_multi);
    // ---- 8. multi_thread_scaling ----
    print_header(&format!(
        "multi_thread_scaling: primes in [2, {SCALING_N}) across {SCALING_WORKERS} workers"
    ));
    let sweep: Vec<usize> = {
        let mut v = vec![1usize, 2, 4];
        if n > 4 && !v.contains(&n) { v.push(n); }
        v.into_iter().filter(|t| *t <= n).collect()
    };
    for t in &sweep {
        run_n(&format!("smarm {t}-thread"), ITERS, || bench_scaling_smarm(*t));
    }
    for t in &sweep {
        run_n(&format!("tokio multi {t}-thread"), ITERS, || bench_scaling_tokio_multi(*t));
    }
 }
--- a/docs/Architecture.md
+++ b/docs/Architecture.md
@@ -0,0 +1,217 @@
 # SMARM Architecture
 > Erlang-style actor concurrency for Rust, without the copies, the colors, or the GC pauses.
 ---
 ## Vision
 Rust gives you the right ownership discipline for safe actor concurrency almost for free — `Send` already
 draws the boundary, the borrow checker already enforces it. What it lacks is an execution model to match:
 async/await is IO-centric, colors your functions, and trades stack simplicity for state-machine complexity;
 OS threads are too heavy to spawn per actor.
 SMARM adds a third option: **green-thread actors on a shared heap**, scheduled cooperatively, with
 message-passing as the only cross-actor communication primitive. You get Erlang's isolation model without
 Erlang's copying GC, and you get Rust's zero-copy ownership transfers without async's cognitive overhead.
 No function coloring. No `Box<dyn Future>`. Just actors, messages, and the borrow checker doing what it
 already does.
 ---
 ## Do: Core Runtime
 ### Actors and scheduling
 Each actor is a lightweight green thread with its own heap-allocated, growable stack. Stacks are
 allocated via `mmap` with a guard page below the region; overflow is detected by the OS without SMARM
 polling for it. Initial stacks are small and grow by remapping on demand.
 The scheduler runs one OS thread per CPU. Each scheduler thread loops against a single global
 `Mutex<HashMap>` queue shared across all schedulers. If queue contention becomes a measured bottleneck
 this can be revisited; the interface will not change.
 SMARM requires `panic = "unwind"`. Users who set `panic = "abort"` accept that supervision and actor
 isolation are silently degraded to process death.
 ### Process descriptor
 Each actor has a descriptor that is hot while the actor runs and will typically live in L1 cache.
 It holds:
 - `stack_base: *mut u8` — bottom of the allocated stack region
 - `stack_cap: usize` — total allocated size
 - `stack_ptr: *mut u8` — current stack pointer (`rsp`), saved on yield
 - `pid: (u32, u32)` — index and generation counter (see PIDs below)
 - `alloc_count: u32` — countdown for preemption sampling
 - `timeslice_start: u64` — `RDTSC` value written on every resume
 - `resize_count: u16` — diagnostic counter for stack growth events
 - `context: *mut ContextSaveArea` — pointer to the register save area (cold, touched only on switch)
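As a rough illustration, the fields above could be laid out as follows. This is a sketch rather than the crate's actual definition: the names `ProcessDescriptor` and `new`, the `#[repr(C)]` layout, and the opaque `ContextSaveArea` placeholder are assumptions made for the example.

```rust
/// Opaque placeholder for the register save area (hypothetical; the real
/// type holds the saved registers described under "Context switching").
#[repr(C)]
pub struct ContextSaveArea {
    _opaque: [u8; 0],
}

/// Hypothetical per-actor descriptor mirroring the field list above.
#[repr(C)]
pub struct ProcessDescriptor {
    pub stack_base: *mut u8,           // bottom of the allocated stack region
    pub stack_cap: usize,              // total allocated size in bytes
    pub stack_ptr: *mut u8,            // saved stack pointer, written on yield
    pub pid: (u32, u32),               // (index, generation)
    pub alloc_count: u32,              // countdown for preemption sampling
    pub timeslice_start: u64,          // cycle counter value at last resume
    pub resize_count: u16,             // diagnostic: stack growth events
    pub context: *mut ContextSaveArea, // cold: touched only on switch
}

impl ProcessDescriptor {
    /// Builds a descriptor for a fresh actor. The stack grows downward, so
    /// the initial stack pointer is the top of the region.
    pub fn new(
        stack_base: *mut u8,
        stack_cap: usize,
        pid: (u32, u32),
        alloc_interval: u32,
    ) -> Self {
        Self {
            stack_base,
            stack_cap,
            stack_ptr: stack_base.wrapping_add(stack_cap),
            pid,
            alloc_count: alloc_interval,
            timeslice_start: 0,
            resize_count: 0,
            context: std::ptr::null_mut(),
        }
    }
}
```

Under these assumptions the descriptor occupies at most 64 bytes on common 64-bit targets, which is consistent with the goal of keeping the hot fields in a single cache line.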
 ### Context switching
 Context switching is implemented in a `#[naked]` assembly shim, one per supported architecture.
 The compiler cannot be asked to switch stacks.
 **Suspend** (yield, preemption, or blocking):
 1. Save callee-saved integer registers and SIMD registers into `ContextSaveArea`.
 2. Save `rsp`/`sp` into the process descriptor.
 3. Load the scheduler's stack pointer from a thread-local and jump back into the scheduler loop.
 **Resume**:
 1. Load `rsp`/`sp` from the process descriptor.
 2. Restore registers from `ContextSaveArea`.
 3. `ret` — the return address is already on the restored stack, execution resumes exactly where the
   actor yielded.
 **x86-64**: saves `rbx`, `rbp`, `r12`–`r15` (6 × 8 = 48 bytes) and `xmm0`–`xmm15` (16 × 16 = 256
 bytes) = 304 bytes total. Full SSE baseline is required; the compiler may autovectorise freely.
 AVX-512 is deferred.
 **ARM64**: saves `x19`–`x30` (12 × 8 = 96 bytes, including the link register `x30` which must be
 saved explicitly — it holds the return address, unlike x86 where `call` pushes it to the stack) and
 `d8`–`d15` (8 × 8 = 64 bytes) = 160 bytes total.
 Each actor owns its `ContextSaveArea` through a `Box<ContextSaveArea>`. Its lifetime equals the
 actor's lifetime, so there is no allocation churn and no bulk deallocation; a plain `Box` is sufficient.
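The byte counts quoted for each architecture can be checked at compile time. The sketch below assumes a `#[repr(C)]` layout and uses hypothetical type names (`ContextSaveAreaX64`, `ContextSaveAreaArm64`); the real `ContextSaveArea` may be laid out differently.

```rust
use std::mem::size_of;

/// Hypothetical x86-64 save area: 6 callee-saved GPRs plus xmm0..xmm15.
#[repr(C)]
pub struct ContextSaveAreaX64 {
    pub gprs: [u64; 6],      // rbx, rbp, r12, r13, r14, r15
    pub xmm: [[u8; 16]; 16], // xmm0..xmm15, 16 bytes each
}

/// Hypothetical ARM64 save area: x19..x30 plus the low halves d8..d15.
#[repr(C)]
pub struct ContextSaveAreaArm64 {
    pub gprs: [u64; 12], // x19..x30 (x30 is the link register)
    pub fprs: [u64; 8],  // d8..d15
}

// Compile-time checks: the build fails if the sizes drift from the text.
const _: () = assert!(size_of::<ContextSaveAreaX64>() == 48 + 256);
const _: () = assert!(size_of::<ContextSaveAreaArm64>() == 96 + 64);
```

Keeping the assertions next to the type definitions means a change to the saved register set cannot silently diverge from the offsets the assembly shim expects.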
 Initial platform target is x86-64 Linux. ARM64 and macOS are natural follow-ons.
 ### Allocator-driven preemption
 Every Nth allocation (N is `alloc_interval`), the allocator reads `RDTSC` and compares it against
 `timeslice_start`. If more than `timeslice_cycles` cycles have elapsed, the actor yields. The workloads
 that starve a scheduler (sustained compute, data transformation) are usually the ones allocating
 frequently, so this approximation holds for the common case.
 `RDTSC` is not monotonic across core migration; a slightly wrong timeslice is acceptable. SMARM is
 not a real-time scheduler.
 Known failure mode: tight no-alloc loops are invisible to this mechanism. Actors doing sustained
 allocation-free compute must call `smarm::yield_now()` explicitly, or offload to a thread pool
 outside the actor scheduler (e.g. rayon). This is documented and acceptable — such loops are rare
 in message-passing workloads.
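The sampling logic can be sketched as a countdown that only reads the cycle counter on every Nth allocation. This is a minimal model, not the runtime's code: `PreemptState`, `on_resume`, and `on_alloc` are hypothetical names, and the cycle counter is passed in as a closure so the example runs anywhere (on x86-64 it could wrap `core::arch::x86_64::_rdtsc`).

```rust
/// Hypothetical per-thread preemption state.
pub struct PreemptState {
    alloc_interval: u32,   // sample the cycle counter every N allocations
    timeslice_cycles: u64, // yield once this many cycles have elapsed
    alloc_count: u32,      // countdown until the next sample
    timeslice_start: u64,  // cycle counter value at the last resume
}

impl PreemptState {
    pub fn new(alloc_interval: u32, timeslice_cycles: u64, now: u64) -> Self {
        let interval = alloc_interval.max(1); // avoid a zero-length countdown
        Self {
            alloc_interval: interval,
            timeslice_cycles,
            alloc_count: interval,
            timeslice_start: now,
        }
    }

    /// Called whenever the scheduler resumes the actor.
    pub fn on_resume(&mut self, now: u64) {
        self.timeslice_start = now;
        self.alloc_count = self.alloc_interval;
    }

    /// Called on every allocation. Returns true if the actor should yield.
    /// `read_cycles` is only invoked on every Nth call, keeping the hot path
    /// to a decrement and a branch.
    pub fn on_alloc(&mut self, read_cycles: impl FnOnce() -> u64) -> bool {
        self.alloc_count -= 1;
        if self.alloc_count > 0 {
            return false;
        }
        self.alloc_count = self.alloc_interval;
        // wrapping_sub tolerates a counter that appears to go backwards
        // after core migration; the result is then just a wrong timeslice.
        read_cycles().wrapping_sub(self.timeslice_start) > self.timeslice_cycles
    }
}
```

A smaller `alloc_interval` samples more often and reacts sooner, at the cost of more counter reads; a larger `timeslice_cycles` lets each actor run longer before it is asked to yield.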
 ### Yield points
 An actor yields at:
 - **Channel send/recv** — the primary communication primitive
 - **Mutex contention** — attempting to lock a held `Arc<Mutex<>>` parks the actor
 - **IO** — blocking on a socket or file descriptor parks the actor until the IO thread signals readiness
 - **`smarm::sleep(duration)`** — parks the actor; the timer wheel re-queues it on expiry
 - **`smarm::yield_now()`** — explicit cooperative yield
 - **Allocator preemption** — as above
 - **Spawn** — does not yield by default; the new actor is queued and the spawner continues
 `std::thread::sleep` inside an actor blocks the entire OS thread and should never be used. SMARM
 may emit a warning if it can detect this.
 ### IO thread
 A single dedicated IO thread runs an `epoll`/`kqueue` loop. Actors blocking on IO register their
 file descriptor and PID; the IO thread moves them back into the global queue when the fd is ready.
 A `HashMap<RawFd, Pid>` maps fds to parked actors. Cancellation (actor dies while waiting on IO)
 deregisters the fd. This is intentionally simple and not pluggable; SMARM is not a general async
 executor.
 ### Communication
 Messages must be `Send`; a `Copy` type qualifies only if it is also `Send` (raw pointers, for
 example, are `Copy` but not `Send`). Non-`Send` types cannot cross an actor boundary; this is
 enforced by the type system with no runtime overhead.
 Two primitives only:
 - **Move** — transfer owned data across a channel. Zero copy. The sender relinquishes ownership
  at the type level. This is the default.
 - **`Arc<Mutex<T>>`** — for genuinely shared long-lived state. Explicit and visible.
 Cross-actor `Rc` or bare pointers are banned. There is no cycle detector. Cross-actor cycles are
 banned by construction: either transfer ownership or use `Arc`.
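The two primitives can be illustrated with the standard library alone. The sketch below uses `std::thread` and `std::sync::mpsc` as stand-ins for smarm actors and channels (an assumption for the sake of a runnable example); the function names are hypothetical.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

/// Move: the buffer is handed to the receiver without being copied.
/// Returns the received length and whether the heap address is unchanged.
pub fn move_across_channel() -> (usize, bool) {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();
    let payload = vec![0u8; 1024];
    let addr_before = payload.as_ptr() as usize;
    let producer = thread::spawn(move || {
        // `payload` is moved into the channel; the sender can no longer use it.
        tx.send(payload).unwrap();
    });
    let received = rx.recv().unwrap();
    producer.join().unwrap();
    (received.len(), received.as_ptr() as usize == addr_before)
}

/// Arc<Mutex<T>>: explicitly shared, long-lived state.
pub fn shared_counter(workers: u64) -> u64 {
    let total = Arc::new(Mutex::new(0u64));
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let t = Arc::clone(&total);
            thread::spawn(move || {
                *t.lock().unwrap() += 1;
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let value = *total.lock().unwrap();
    value
}
```

In the first function only the `Vec` header (pointer, length, capacity) travels through the channel, which is why the heap address is the same on both sides.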
 ### PIDs
 A PID is a `(index, generation)` pair. The index may be reused after an actor dies; the generation
 counter increments on every death. A stale handle holding the wrong generation is a detectable
 error, not a silent misdirection. This avoids the ABA problem without reserving PID space forever.
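A generational index of this kind can be sketched as a slot table with a free list. The names `Pid`, `PidTable`, `allocate`, `retire`, and `is_live` are hypothetical and only illustrate the scheme; they are not taken from the smarm source.

```rust
/// Hypothetical PID: slot index plus the generation it was issued under.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

struct Slot {
    generation: u32,
    live: bool,
}

/// Hypothetical slot table; indices are recycled through `free`.
#[derive(Default)]
pub struct PidTable {
    slots: Vec<Slot>,
    free: Vec<u32>,
}

impl PidTable {
    /// Issues a PID, reusing a free index when one is available.
    pub fn allocate(&mut self) -> Pid {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.live = true;
            Pid { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, live: true });
            Pid { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// Marks the actor dead and bumps the generation so old handles go stale.
    /// Returns false if `pid` was already stale.
    pub fn retire(&mut self, pid: Pid) -> bool {
        if !self.is_live(pid) {
            return false;
        }
        let slot = &mut self.slots[pid.index as usize];
        slot.live = false;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(pid.index);
        true
    }

    /// A handle is valid only if the slot is live and the generations match.
    pub fn is_live(&self, pid: Pid) -> bool {
        self.slots
            .get(pid.index as usize)
            .map_or(false, |s| s.live && s.generation == pid.generation)
    }
}
```

After `retire`, a later `allocate` may hand out the same index, but the old handle fails `is_live` because its generation no longer matches.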
 ### Supervision
 Every actor has a supervisor, assigned at spawn. This is not optional. The root supervisor is
 provided by the runtime; its death is a process exit.
 A supervisor receives one of three signals when a child actor terminates:
 - `Signal::Exit(pid)` — normal completion
 - `Signal::Panic(pid, payload)` — caught via `catch_unwind` at the actor entry point boundary,
  before unwinding can reach the assembly shim
 - `Signal::Timeout(pid)` — actor exceeded a budget (see below)
 The supervisor decides: restart the actor, escalate to its own supervisor, or ignore. Restart
 intensity is capped: if an actor panics more than N times within a time window, the supervisor
 stops restarting and escalates. This prevents a bad prelude or corrupted input from spinning the
 supervisor in a restart loop indefinitely. N and the window are configurable per supervisor with a
 sensible global default.
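The restart-intensity cap can be modelled as a sliding window over recent panic timestamps. This is a sketch under assumptions: `RestartLimiter`, `Decision`, and `on_panic` are hypothetical names, and time is passed in as milliseconds so the example stays deterministic.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    Escalate,
}

/// Hypothetical limiter: allow at most `max_restarts` panics per window.
pub struct RestartLimiter {
    max_restarts: usize,
    window_ms: u64,
    recent: VecDeque<u64>, // timestamps of panics still inside the window
}

impl RestartLimiter {
    pub fn new(max_restarts: usize, window_ms: u64) -> Self {
        Self { max_restarts, window_ms, recent: VecDeque::new() }
    }

    /// Records a panic at `now_ms` and decides what the supervisor does.
    pub fn on_panic(&mut self, now_ms: u64) -> Decision {
        // Forget panics that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now_ms.saturating_sub(oldest) >= self.window_ms {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.push_back(now_ms);
        // More than N panics inside the window: stop restarting.
        if self.recent.len() > self.max_restarts {
            Decision::Escalate
        } else {
            Decision::Restart
        }
    }
}
```

With `max_restarts = 3` and a one-second window, three quick panics are each restarted, a fourth inside the window escalates, and a panic long after the window has passed is restarted again.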
 ### Mutex timeout
 Every `smarm::mutex` lock attempt is mediated by the scheduler. If the lock is not acquired within
 a configurable timeout, the actor receives a `LockTimeout` error rather than parking forever. This
 is a hard runtime guarantee, not a convention. Default timeout is global and configurable;
 individual locks and individual call sites can override it.
 ### Task joining
 Actors can spawn children and wait on a group of handles:
 ```rust
 let h1 = smarm::spawn(|| compute_a());
 let h2 = smarm::spawn(|| compute_b());
 let (a, b) = smarm::join!(h1, h2);
 ```
 `join!` parks the calling actor until all handles complete. The last child to finish re-queues the
 parent. This is a countdown in the parent's descriptor; no polling, no waker registration. A
 `join_timeout!` variant is a natural extension.
 ### Timer wheel
 `smarm::sleep` and supervision timeouts are driven by a timer wheel in the scheduler. Sleeping
 actors are parked and re-queued by the timer thread on expiry. The timer wheel is internal
 infrastructure; its design is an implementation detail.
 ---
 ## Defer: Later Work
 - **Stack sizing policy** — initial size, growth factor, and whether stacks ever shrink are
  implementation decisions to be made with profiling data, not up front.
 - **Queue contention** — if `Mutex<HashMap>` proves to be a bottleneck under profiling, evaluate
  `DashMap` or a lock-free work-stealing deque (e.g. `crossbeam-deque`). Not before.
 - **AVX-512 context save** — extend `ContextSaveArea` when there is a concrete use case.
 - **`smarm::sleep` vs raw sleep semantics** — further control knobs deferred until the basic sleep
  is working and real use cases are understood.
 - **Supervision tree API** — the contract is defined; the recursive hierarchy, restart strategies,
  and introspection API are implementation work.
 - **no_std support** — the assembly shim is no_std friendly but the IO thread and allocator require
  OS primitives. Target is no_std + `alloc` on hosted platforms; bare metal is out of scope.
 - **Distribution** — SMARM is a single-process runtime. No distribution protocol, no BEAM-style
  clustering.
 ---
 ## What SMARM is Not
 - Not a drop-in replacement for Tokio. SMARM does not implement `Future` or the async executor interface.
 - Not a general allocator. SMARM manages actor stacks; heap allocation for actor data goes through
  the system allocator.
 - Not Erlang. No hot code reloading, no distribution protocol, no BEAM bytecode. SMARM is a
  concurrency runtime, not a platform.
 - Not a real-time scheduler. Timeslice accuracy is best-effort.
 ---
 ## On names
 <sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>
--- a/docs/BENCHMARKS_AND_TUNING.md
+++ b/docs/BENCHMARKS_AND_TUNING.md
@@ -0,0 +1,320 @@
 # smarm — Benchmarks & Tuning Recommendations
 > Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
 > kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
 > design reasoning and single-core sweep data; re-validate on real hardware.
 ---
 ## TL;DR
 smarm is competitive with tokio for **channel-heavy, message-passing workloads**
 and wins outright on **uncontended channels**, **MPSC fan-in**, and **panic/unwind isolation**.
 It is significantly slower than tokio (roughly 10× to 70×) for **spawn-heavy** patterns and
 **timer-heavy** workloads. The preemption knobs (`alloc_interval`,
 `timeslice_cycles`) have minimal effect on single-core machines; they are expected to matter
 on multi-core under scheduler-thread contention, which this sandbox could not measure.
 ---
 ## Bench results summary
 All medians in µs. Tokio column is `current_thread` unless noted.
 | Bench                | smarm  | tokio  | ratio  | winner        |
 |----------------------|--------|--------|--------|---------------|
 | `chained_spawn`      | 8 625  | 124    | 70×    | tokio         |
 | `ping_pong_oneshot`  | 16 848 | 879    | 19×    | tokio         |
 | `spawn_storm_busy`   | 126 k  | 2 772  | 45×    | tokio         |
 | `yield_many`         | 41 622 | 15 085 | 2.8×   | tokio         |
 | `yield_in_hot_loop`  | 190 k  | 153 k  | 1.25×  | tokio         |
 | `many_timers`        | 143 k  | 14 462 | 10×    | tokio         |
 | `fan_out_compute`    | 29 727 | 28 503 | 1.04×  | **even**      |
 | `multi_thread_scaling` | 30 k | 29 k   | 1.04×  | **even**      |
 | `deep_recursion`     | 83     | 25     | 3.3×   | tokio         |
 | `mpsc_contention`    | 9 062  | 17 570 | 0.52×  | **smarm** 1.9× |
 | `uncontended_channel`| 27 265 | 51 888 | 0.53×  | **smarm** 1.9× |
 | `catch_unwind_panics`| 142 k  | 682 k  | 0.21×  | **smarm** 4.8× |
 ---
 ## Where smarm wins
 ### Uncontended channels (1.9× faster)
 When a single producer sends to a single consumer with no other actors
 competing for the queue, smarm's channel is meaningfully faster than
 tokio's. This is the core use case smarm is designed for: pipelines of
 actors passing owned data along a chain.
 **Recommendation**: smarm is a good fit for any architecture where data
 flows through a chain of stages, each stage is an actor, and the
 channel between stages is the primary synchronisation point.
 ### Uncontended MPSC (1.9× faster, same reason)
 The `mpsc_contention` bench (32 producers, one consumer) wins for the
 same reason. On a single-thread runtime, smarm's channel mutex is never
 actually contended, so the lock is essentially free. On multi-core this
 advantage will shrink; re-measure.
 ### Panic isolation (4.8× faster recovery)
 `catch_unwind_panics` creates 10 000 actors that each panic. smarm
 recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
 than tokio. This matters if you're building a system that uses panics
 as a fast abort path for malformed input or actor-level faults, or if
 you're using supervision trees seriously.
 **Recommendation**: if your system expects panics to be a normal
 operational event (not just bugs), smarm's supervision story is a
 genuine advantage over tokio's task abort model.
 ---
 ## Where smarm loses, and why
 ### Spawn-heavy workloads (19–70×)
 Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
 a syscall. Tokio tasks are heap-allocated state machines — no stack,
 no syscall, ~100 bytes each. For workloads that spawn thousands of
 short-lived actors per second, this is a structural disadvantage.
 **Recommendations**:
 - Avoid spawning actors for work that completes in microseconds.
  Use a worker-pool pattern: spawn N long-lived actors at startup,
  distribute work over channels.
 - If you genuinely need high-frequency short-lived actors, the stack
  allocation cost is a known roadmap item (stack caching, slab alloc).
  It is not an inherent design flaw — just not implemented yet.
 - `deep_recursion` shows the same problem at depth 500: the bench
  spawns a fresh, single-use actor, and the mmap for its 64 KiB stack
  costs more than the recursion itself. Run recursive work inside a
  long-lived actor so the stack allocation is amortised.
 ### Timer-heavy workloads (10×)
 smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
 shared mutex. Tokio uses a sharded hierarchical timer wheel. With
 10 000 pending timers, every insert and pop on smarm's heap is an
 O(log N) operation under that lock, and `many_timers` runs about
 10× slower than tokio.
 **Recommendations**:
 - Do not use smarm `sleep()` in tight loops with many concurrent
  sleeping actors if timing precision matters.
 - For IO timeouts: prefer a single timer actor that manages a priority
  queue and fans out wakeups over channels, rather than 1 000 actors
  each sleeping directly.
 - The hierarchical timer wheel is listed in `LOOM.md` deferred work.
  It is the correct fix if timer performance becomes a bottleneck.
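 A minimal sketch of the state such a timer actor could hold, assuming
 deadlines are absolute microsecond timestamps and clients are plain
 integer ids. `TimerQueue`, `schedule`, `next_deadline`, and `pop_due`
 are hypothetical names, not part of smarm's API, and the channel wiring
 to each client is omitted:
 ```rust
 use std::cmp::Reverse;
 use std::collections::BinaryHeap;
 /// Hypothetical timer-actor state: a min-heap of (deadline, client id).
 struct TimerQueue {
    heap: BinaryHeap<Reverse<(u64, u32)>>,
 }
 impl TimerQueue {
    fn new() -> Self {
        TimerQueue { heap: BinaryHeap::new() }
    }
    /// Register a wakeup for `client` at absolute time `deadline_us`.
    fn schedule(&mut self, deadline_us: u64, client: u32) {
        self.heap.push(Reverse((deadline_us, client)));
    }
    /// Earliest pending deadline, i.e. how long the actor may sleep.
    fn next_deadline(&self) -> Option<u64> {
        self.heap.peek().map(|Reverse((deadline, _))| *deadline)
    }
    /// Remove and return every client whose deadline has passed,
    /// in deadline order.
    fn pop_due(&mut self, now_us: u64) -> Vec<u32> {
        let mut due = Vec::new();
        while let Some(&Reverse((deadline, client))) = self.heap.peek() {
            if deadline > now_us {
                break;
            }
            self.heap.pop();
            due.push(client);
        }
        due
    }
 }
 ```
 The actor would sleep until `next_deadline()`, call `pop_due(now)`, and
 send one wakeup message per returned client, so the runtime's shared
 timer heap would hold a single entry instead of one per sleeper.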
 ### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
 Every `yield_now()` goes through the runtime mutex and run queue even
 on a single-thread scheduler. Tokio's current_thread scheduler handles
 yields with much lower overhead. smarm's naked context-switch is fast,
 but the lock acquisition around it dominates for high-frequency yields.
 **Recommendation**: minimise explicit `yield_now()` calls in hot paths.
 In message-passing workloads this is natural — yield happens at
 `recv()` and `send()`, which is appropriate. If you are using
 `yield_now()` in a tight loop, consider whether the actor should
 instead be blocking on a channel or sleeping.
 ---
 ## Preemption knob recommendations
 The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
 Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
 ### Findings from the sweep
 The sweep varied alloc_interval in `{32, 64, 128, 256, 512}` and
 timeslice_cycles in `{150k, 300k, 600k, 1200k}` — 10 points total.
 On a single-CPU machine the knobs are almost inert: most benches move
 < 5% across the entire grid. The exceptions are meaningful:
 **Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k`:
 - `spawn_storm_busy` degrades +11–15%
 - `catch_unwind_panics` degrades +10–12%
 The cause: 8 background yielder actors hold the scheduler mutex longer
 per timeslice, delaying the 10 000 actors waiting to be joined. A
 longer timeslice amplifies the global-mutex bottleneck.
 **Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
 `many_timers` improves 3–4%. Actors that are sleeping get rescheduled
 sooner because the runtime polls the timer heap more frequently.
 **alloc_interval has no clear winner.** Moving from 32 to 512 causes
 < 3% variation on every bench. The check frequency is not the
 bottleneck — the lock is.
 ### Recommended starting points
 | Workload                          | alloc_interval | timeslice_cycles |
 |-----------------------------------|----------------|------------------|
 | Default (unknown)                 | 128 (default)  | 300 000 (default)|
 | Many concurrent sleeping actors   | 128            | 150 000          |
 | High-throughput channel pipeline  | 128            | 300 000          |
 | Compute-heavy (few allocs)        | 32             | 300 000          |
 | Strict fairness / many actors     | 64             | 150 000          |
 | Long-running compute batches      | 256            | 600 000          |
 **Note on `timeslice_cycles` calibration**: the default was tuned for
 ≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
 4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
 measure your CPU's TSC frequency at startup and set the cycles value
 accordingly:
 ```rust
 // Approximate TSC frequency measurement (call once at startup)
 fn tsc_hz() -> u64 {
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(std::time::Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    (t1 - t0) * 10  // extrapolate to 1 second
 }
 let target_us = 100u64; // desired timeslice in microseconds
 // Multiply before dividing so the frequency is not truncated to whole MHz.
 let cycles = tsc_hz() * target_us / 1_000_000;
 let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
 );
 ```
 ---
 ## Architecture recommendations
 ### Use actor pools, not per-request actors
 ```rust
 // Avoid: spawning an actor per request
 for req in requests {
    spawn(move || handle(req));
 }
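 // Prefer: fixed pool, channel dispatch.
 // smarm channels are MPSC (one receiver each), so every worker gets
 // its own channel and requests are dispatched round-robin.
 let mut workers = Vec::with_capacity(num_cpus);
 for _ in 0..num_cpus {
    let (tx, rx) = channel();
    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
    workers.push(tx);
 }
 for (i, req) in requests.into_iter().enumerate() {
    workers[i % num_cpus].send(req).unwrap();
 }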
 ```
 The worker pool pattern amortises the 64 KiB mmap cost over the
 lifetime of the pool. The `chained_spawn` bench shows this cost is
 real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
 ### Supervision for fault isolation
 smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
 actor panics. Use `spawn_under` to register a supervisor channel and
 build restart logic:
 ```rust
 let (sup_tx, sup_rx) = channel::<smarm::Signal>();
 let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
 });
 // Supervisor loop
 loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        Ok(Signal::Exit(_)) => break,
        Err(_) => break,
    }
 }
 ```
 This pattern has essentially zero overhead compared to unmonitored
 spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
 faster than tokio's abort/recover cycle.
 ### Explicit preemption in no-alloc hot loops
 The allocator-driven preemption mechanism fires every `alloc_interval`
 allocations. Code that never allocates (tight numeric loops, parsing
 fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
 at the natural loop boundary:
 ```rust
 for chunk in data.chunks(4096) {
    process(chunk);       // no allocations
    smarm::check!();      // yield if timeslice expired
 }
 ```
 This is explicitly called out in `LOOM.md` as a known limitation.
 The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
 smarm is 1.25× slower than tokio even with explicit yields, which sets
 the floor on how much `check!()` can help in truly tight loops.
 ### IO-bound work
 smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
 the actor without blocking the OS scheduler thread. This is correct and
 works well. There is no specific bench for IO-bound workloads in the
 current suite, but the architecture is sound for network servers and
 file-IO pipelines.
 ---
 ## Known limitations and roadmap items
 These are from `LOOM.md` plus observations from the bench suite.
 | Limitation                    | Impact             | Roadmap status     |
 |-------------------------------|--------------------|--------------------|
 | No stack size caching / slab  | High spawn cost    | Deferred           |
 | Global single min-heap timers | Poor at many timers| Deferred (hierarch. wheel) |
 | Global `Mutex<RunQueue>`      | Lock contention    | Deferred (per-thread queues) |
 | No `join!()` macro            | Ergonomics         | Deferred           |
 | x86-64 Linux only             | Portability        | ARM64 deferred     |
 | No restart intensity caps     | Supervision safety | Deferred           |
 | Yield overhead under lock     | Hot-loop fairness  | Structural / ongoing |
 The yield overhead and global mutex are the two issues most likely to
 matter on a real multi-core workload. The sweep confirmed that
 `timeslice_cycles` is a meaningful knob for controlling the mutex
 hold time; the right long-term fix is per-thread run queues with
 work stealing.
 ---
 ## Running the bench suite
 ```sh
 # Run all benches once, print results
 python3 benches/sweep.py run
 # Save current results as regression baseline
 python3 benches/sweep.py run --save-baseline
 # Check for regressions (>10% slower than baseline → exit 1)
 python3 benches/sweep.py regress
 # Sweep preemption knobs across the grid defined in sweep.py
 python3 benches/sweep.py sweep
 # Sweep and save raw data as CSV
 python3 benches/sweep.py sweep --save-csv results.csv
 # Run a single knob configuration manually
 SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
    cargo bench --bench general
 ```
 The regression threshold is 10% and is configurable in `sweep.py`
 (`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
 same file.
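 The regression rule can be sketched as follows. This is an illustration
 in Rust, not the code in `sweep.py`; `is_regression` and its parameters
 are hypothetical names, and it assumes the comparison is made on median
 times:
 ```rust
 /// Returns true when `current_us` is more than `threshold_pct` percent
 /// slower than `baseline_us` (both are median times in microseconds).
 fn is_regression(baseline_us: f64, current_us: f64, threshold_pct: f64) -> bool {
    current_us > baseline_us * (1.0 + threshold_pct / 100.0)
 }
 ```
 With the 10% threshold and a baseline of 8 625 µs, a run of 9 600 µs
 would be flagged and a run of 9 400 µs would not.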
--- a/docs/benchmarks.md
+++ b/docs/benchmarks.md
@@ -0,0 +1,177 @@
 # Benchmarks
 Regression-test and tuning reference for smarm vs tokio.
 ## Running
 ```sh
 cargo bench --bench primes              # original compute bench
 cargo bench --bench multi_scheduler     # original 3-workload bench
 cargo bench --bench general             # benches 1–4
 cargo bench --bench tokio_favored       # benches 5–8
 cargo bench --bench smarm_favored       # benches 9–12
 ```
 Each bench runs one warmup iteration (discarded) and 15 measured iterations.
 Results are reported as median / min / max in microseconds. Median is the
 headline number; the spread between min and max indicates measurement
 stability.
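 The measurement loop described above could look roughly like this. It
 is a sketch under assumptions, not the actual harness: `Stats`,
 `summarize`, and `bench` are hypothetical names, and timing uses
 `std::time::Instant`:
 ```rust
 use std::time::Instant;
 /// Median / min / max of one bench, in microseconds.
 #[derive(Debug, PartialEq)]
 struct Stats {
    median: u128,
    min: u128,
    max: u128,
 }
 /// Summarise measured samples. For an even count this picks the upper
 /// of the two middle values; with 15 samples the median is exact.
 fn summarize(mut samples: Vec<u128>) -> Stats {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_unstable();
    Stats {
        median: samples[samples.len() / 2],
        min: samples[0],
        max: samples[samples.len() - 1],
    }
 }
 /// Run `workload` once as a discarded warmup, then `iters` timed runs.
 fn bench(iters: usize, mut workload: impl FnMut()) -> Stats {
    workload(); // warmup, not recorded
    let samples = (0..iters)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed().as_micros()
        })
        .collect();
    summarize(samples)
 }
 ```
 Calling `bench(15, || run_workload())` with a hypothetical
 `run_workload` matches the one-warmup, 15-iteration scheme.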
 ## Methodology notes
 - The harness times wall-clock elapsed for the full workload, including
  runtime startup and shutdown. For multi-thread runtimes this means worker
  thread spawn cost is included; on short-lived benches this can dominate.
  Where startup matters, the bench is structured so the workload is much
  longer than typical startup.
 - `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
  comparison and `new_multi_thread().worker_threads(N)` for parallel.
  `smarm::runtime::Config::exact(N)` is the equivalent knob.
 - mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel
  semantics. Bounded comparisons would need a separate suite.
 - Random delays in `many_timers` use a deterministic mixing function of the
  actor index so iterations are reproducible.
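 One way to derive reproducible delays from the actor index is sketched
 below. The bench's actual mixing function is not shown here, so this is
 an assumption: it uses the SplitMix64 mixing steps, and `mix64` and
 `delay_us` are hypothetical names:
 ```rust
 /// SplitMix64 mixing steps: a widely used 64-bit integer mixer.
 fn mix64(x: u64) -> u64 {
    let mut z = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
 }
 /// Delay for actor `index` in microseconds, within 1 ms to 10 ms
 /// inclusive. The same index always yields the same delay.
 fn delay_us(index: u64) -> u64 {
    1_000 + mix64(index) % 9_001
 }
 ```
 Because the delay depends only on the index, every iteration of the
 bench sees the same wake schedule without seeding a random generator.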
 ## Bench catalog
 ### General — neither runtime structurally favored
 | # | Bench               | Stresses                                        | Prediction         |
 |---|---------------------|-------------------------------------------------|--------------------|
 | 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain         | Roughly even       |
 | 2 | `yield_many`        | Pure scheduling throughput, explicit yields     | Roughly even       |
 | 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination   | Even (compute-bound) |
 | 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency              | Roughly even       |
 A regression here means a real change in per-task or per-yield cost — those
 should be investigated regardless of which runtime got slower.
 ### Tokio-favored — measures cost of smarm's design choices
 | # | Bench                   | Stresses                                              | Why tokio should win                                                              |
 |---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
 | 5 | `spawn_storm_busy`      | 8 background yielders + 10k zero-work spawns          | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
 | 6 | `mpsc_contention`       | 32 producers × 10k msgs → 1 consumer                  | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
 | 7 | `many_timers`           | 10k actors sleeping 1–10 ms, dense wake window        | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap          |
 | 8 | `multi_thread_scaling`  | Primes, sweep thread count 1, 2, 4, available         | Tokio scales near-linearly; smarm hits its mutex ceiling                          |
 A regression here means a smarm design choice got more expensive. Widening
 gaps signal something to investigate; narrowing gaps after a tuning change is
 the desired direction.
 ### Smarm-favored — measures payoff of green-thread + stackful design
 | #  | Bench                  | Stresses                                                  | Why smarm should win                                                            |
 |----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
 | 9  | `deep_recursion`       | Actor recurses 1000 deep, returns                         | Native stack growth vs tokio's per-level `Box::pin`                             |
 | 10 | `yield_in_hot_loop`    | 2 actors, 500k yields each, single thread                 | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
 | 11 | `uncontended_channel`  | 1→1, 1M msgs, single thread                               | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
 | 12 | `catch_unwind_panics`  | 10k spawns, 50% panic                                     | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
 A regression here means we lost some of smarm's structural advantage. #12 is
 exploratory — if the baseline shows no real gap, drop it.
 ## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
 > Sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt,
 > tokio mt) are equivalent to 1-thread; scaling sweep is limited to 1 thread.
 > Label duplication in bench output ("smarm 1-thread" appearing twice) is
 > because available_parallelism() == 1, so the N-thread variant is identical.
 | Bench               | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
 |---------------------|----------|----------|----------|----------|-------|
 | chained_spawn       | 7136     | 6979     | 113      | 176      | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
 | yield_many          | 40079    | 40073    | 14571    | 14044    | smarm ~2.8x slower; scheduling overhead real |
 | fan_out_compute     | 19347    | 19461    | 18616    | 18905    | roughly even; compute-bound as expected |
 | ping_pong_oneshot   | 13731    | 14176    | 828      | 3342     | smarm ~17x slower; per-round spawn+join cost high |
 | spawn_storm_busy    | 105512   | 107113   | 2222     | 4546     | smarm ~47x slower; global mutex under 8 bg yielders |
 | mpsc_contention     | 10456    | 10395    | 17348    | 18628    | smarm wins; uncontended mutex essentially free on 1-thread |
 | many_timers         | 120242   | 121023   | 13581    | 14266    | smarm ~9x slower; single min-heap vs sharded wheel |
 | multi_thread_scaling | —        | —        | —        | —        | see thread-count sweep below |
 | deep_recursion      | 62       | 71       | 22       | 44       | tokio wins unexpectedly; see sanity-check notes |
 | yield_in_hot_loop   | 182177   | —        | 138335   | —        | tokio wins; smarm prediction wrong; see notes |
 | uncontended_channel | 31473    | —        | 51925    | —        | smarm wins as predicted; ~1.65x |
 | catch_unwind_panics | 112306   | 114305   | 151443   | 161344   | smarm wins as predicted; ~1.35x |
 ### `multi_thread_scaling` thread-count sweep (median µs)
 > Sandbox has 1 logical CPU; only 1-thread row is available.
 | Threads | smarm | tokio mt |
 |---------|-------|----------|
 | 1       | 19852 | 19638    |
 | 2       | —     | —        |
 | 4       | —     | —        |
 | N (avail=1) | 19852 | 19638 |
 ## Tuning experiments
 ### Reduction-budget sweep
 `smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
 the actor checks RDTSC against its timeslice start and yields if over budget.
 The Nth-allocation threshold (the "reduction budget") and the timeslice
 duration are the two knobs.
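 The check can be sketched as below. This is not smarm's implementation:
 `Preempt` and `on_alloc` are hypothetical names, and the clock is
 passed in as a closure standing in for an RDTSC read so the logic can
 be exercised without the runtime:
 ```rust
 use std::cell::Cell;
 /// Hypothetical per-scheduler-thread preemption state.
 struct Preempt {
    alloc_interval: u32,    // check the clock every Nth allocation
    timeslice_cycles: u64,  // budget per timeslice, in TSC cycles
    allocs: Cell<u32>,      // allocations since the last clock check
    slice_start: Cell<u64>, // TSC value when the current slice began
 }
 impl Preempt {
    fn new(alloc_interval: u32, timeslice_cycles: u64, now: u64) -> Self {
        Preempt {
            alloc_interval,
            timeslice_cycles,
            allocs: Cell::new(0),
            slice_start: Cell::new(now),
        }
    }
    /// Called on every allocation; `read_tsc` stands in for RDTSC.
    /// Returns true when the running actor should yield.
    fn on_alloc(&self, read_tsc: impl FnOnce() -> u64) -> bool {
        let n = self.allocs.get() + 1;
        if n < self.alloc_interval {
            self.allocs.set(n); // fast path: counter bump, no clock read
            return false;
        }
        self.allocs.set(0);
        let now = read_tsc();
        if now.wrapping_sub(self.slice_start.get()) < self.timeslice_cycles {
            return false; // still within budget
        }
        self.slice_start.set(now); // next slice starts after the yield
        true
    }
 }
 ```
 A larger `alloc_interval` means fewer clock reads but a later reaction
 to an expired timeslice; a larger `timeslice_cycles` lets each actor
 run longer between yields.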
 Record each experiment as a row below. Reference the commit or the parameter
 values explicitly.
 | Date | Configuration              | Bench (or "all")     | Result vs baseline           | Notes |
 |------|----------------------------|----------------------|------------------------------|-------|
 |      | baseline                   | all                  | —                            |       |
 |      | budget=…, timeslice=…      |                      |                              |       |
 |      |                            |                      |                              |       |
 When the gap on tokio-favored benches narrows without regressing
 smarm-favored benches, the change is a keeper. If a budget change improves
 one workload but regresses another by more, prefer keeping the broader-impact
 configuration unless we have a clear use case for the trade-off.
 ## Sanity-check notes (baseline run)
 ### Compile fixes applied
 Two bench files had a type error: `smarm::Runtime::run()` takes
 `impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
 in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
 (smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
 by changing the tail to `let _ = count;` in both closures, and the
 corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
 No workload semantics changed.
 ### Single-CPU sandbox caveat
 `available_parallelism()` returns 1, so every "N-thread" variant is identical
 to "1-thread". Multi-thread results should not be used to draw scaling
 conclusions; re-run on a multi-core machine before committing to the tuning
 sweep.
 ### Predicted-winner mismatches
 **`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
 At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB
 stack; that allocation cost dominates the actual recursion. Tokio's
 Box::pin recursion allocates 500 small heap objects but avoids the mmap.
 The prediction assumed stack allocation was amortised across many uses; here
 the actor is single-use. Not a bug, but the bench may not exercise the
 intended advantage.
 **`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
 The prediction was that smarm's ~6-GPR naked context switch would beat
 tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
 tokio's current_thread scheduler has very low overhead per yield_now, while
 smarm's yield_now still goes through the runtime mutex and run-queue even on
 a single thread. This is a meaningful data point: smarm's scheduling overhead
 is not as low as the assembly switch cost alone suggests.
 ### Noise / spread
 - `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
 - `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
  consistent with tokio issue #3829 noted in task spec.
 - `many_timers` smarm spread acceptable (~10%).
 ### Result-column equivalence
 All result columns match between runtimes for every bench (same prime counts,
 same message totals, same task counts). Workloads are equivalent.
--- a/docs/smarm
+++ b/docs/smarm
--- a/src/channel.rs
+++ b/src/channel.rs
@@ -1,12 +1,8 @@
 //! Unbounded MPSC channels.
 //!
-//! Single-threaded scheduler: the inner state is `Rc<RefCell<Inner<T>>>`,
+//! Inner state is `Arc<Mutex<Inner<T>>>` so channels can be sent across OS
-//! not `Arc<Mutex>`. We hand-implement `Send` for `Sender<T>` and
+//! threads (required for the multi-scheduler runtime where a sender and
-//! `Receiver<T>` when `T: Send`, on the basis that the only way two actor
+//! receiver may run on different scheduler threads simultaneously).
 //! contexts touch the same channel is by being scheduled on the *same* OS
 //! thread (v0.1 has exactly one). When we add a second scheduler thread,
 //! this lie must be retired: replace `Rc<RefCell>` with `Arc<Mutex>` (or a
 //! lock-free queue) and remove the unsafe Send impls.
 //!
 //! Semantics:
 //!   - Senders are clonable; the last sender drop closes the channel.
@@ -19,12 +15,11 @@
 //!     parked, the receiver is unparked.
 use crate::pid::Pid;
 use std::cell::RefCell;
 use std::collections::VecDeque;
-use std::rc::Rc;
+use std::sync::{Arc, Mutex};
 pub fn channel<T>() -> (Sender<T>, Receiver<T>) {
-    let inner = Rc::new(RefCell::new(Inner {
+    let inner = Arc::new(Mutex::new(Inner {
        queue: VecDeque::new(),
        parked_receiver: None,
        senders: 1,
@@ -41,20 +36,13 @@ struct Inner<T> {
 }
 pub struct Sender<T> {
-    inner: Rc<RefCell<Inner<T>>>,
+    inner: Arc<Mutex<Inner<T>>>,
 }
 pub struct Receiver<T> {
-    inner: Rc<RefCell<Inner<T>>>,
+    inner: Arc<Mutex<Inner<T>>>,
 }
 // SAFETY (v0.1 only): the scheduler is single-threaded. Sender/Receiver can
 // be captured into actor closures (which require Send), but they will only
 // ever be touched from one OS thread. When multi-threading lands, swap the
 // `Rc<RefCell>` for `Arc<Mutex>` and remove these.
 unsafe impl<T: Send> Send for Sender<T> {}
 unsafe impl<T: Send> Send for Receiver<T> {}
 #[derive(Debug, PartialEq, Eq)]
 pub struct SendError<T>(pub T);
@@ -71,7 +59,7 @@ impl std::error::Error for RecvError {}
 impl<T> Clone for Sender<T> {
    fn clone(&self) -> Self {
-        self.inner.borrow_mut().senders += 1;
+        self.inner.lock().unwrap().senders += 1;
        Sender { inner: self.inner.clone() }
    }
 }
@@ -79,11 +67,9 @@ impl<T> Clone for Sender<T> {
 impl<T> Drop for Sender<T> {
    fn drop(&mut self) {
        let unpark = {
-            let mut g = self.inner.borrow_mut();
+            let mut g = self.inner.lock().unwrap();
            g.senders -= 1;
            if g.senders == 0 && g.queue.is_empty() {
                // Channel closed and drained. Wake the receiver so it can
                // see RecvError.
                g.parked_receiver.take()
            } else {
                None
@@ -97,23 +83,25 @@ impl<T> Drop for Sender<T> {
 impl<T> Drop for Receiver<T> {
    fn drop(&mut self) {
-        self.inner.borrow_mut().receiver_alive = false;
+        self.inner.lock().unwrap().receiver_alive = false;
    }
 }
 impl<T> Sender<T> {
    pub fn send(&self, value: T) -> Result<(), SendError<T>> {
        let unpark = {
-            let mut g = self.inner.borrow_mut();
+            let mut g = self.inner.lock().unwrap();
            if !g.receiver_alive {
                return Err(SendError(value));
            }
            g.queue.push_back(value);
            // If the receiver is parked, unpark it.
            g.parked_receiver.take()
        };
        if let Some(pid) = unpark {
            crate::te!(crate::trace::Event::Send { sender: crate::actor::current_pid().unwrap_or(crate::pid::Pid::new(u32::MAX, u32::MAX)), receiver: Some(pid) });
            crate::scheduler::unpark(pid);
        } else {
            crate::te!(crate::trace::Event::Send { sender: crate::actor::current_pid().unwrap_or(crate::pid::Pid::new(u32::MAX, u32::MAX)), receiver: None });
        }
        Ok(())
    }
@@ -122,16 +110,14 @@ impl<T> Sender<T> {
 impl<T> Receiver<T> {
    pub fn recv(&self) -> Result<T, RecvError> {
        loop {
            // Try to take a message.
            {
-                let mut g = self.inner.borrow_mut();
+                let mut g = self.inner.lock().unwrap();
                if let Some(v) = g.queue.pop_front() {
                    return Ok(v);
                }
                if g.senders == 0 {
                    return Err(RecvError);
                }
                // Empty + open: register and park.
                let me = crate::actor::current_pid()
                    .expect("recv() called outside an actor");
                debug_assert!(
@@ -139,19 +125,19 @@ impl<T> Receiver<T> {
                    "channel has more than one receiver"
                );
                g.parked_receiver = Some(me);
                crate::te!(crate::trace::Event::RecvPark(me));
            }
-            // Release the borrow before parking — the unparker will need it.
+            // Release the lock before parking — the unparker will need it.
            crate::scheduler::park_current();
-            // Loop: the message that woke us might already have been taken
+            // Woken up — record it before looping to check the queue.
-            // (it can't, with one receiver, but the senders=0 path can fire
+            crate::te!(crate::trace::Event::RecvWake(crate::actor::current_pid().unwrap()));
            // here too).
        }
    }
    /// Non-blocking. `Ok(Some(v))` if a message was available, `Ok(None)` if
    /// the channel is empty but open, `Err(RecvError)` if closed and drained.
    pub fn try_recv(&self) -> Result<Option<T>, RecvError> {
-        let mut g = self.inner.borrow_mut();
+        let mut g = self.inner.lock().unwrap();
        if let Some(v) = g.queue.pop_front() {
            return Ok(Some(v));
        }
--- a/src/io.rs
+++ b/src/io.rs
@@ -0,0 +1,521 @@
 //! Off-scheduler IO: blocking-work offload and epoll-based fd readiness.
 //!
 //! `block_on_io(closure)` runs `closure` on a dedicated worker OS thread,
 //! parks the calling actor in the meantime, and returns the closure's
 //! value when it completes. Lets actors call into blocking C libraries,
 //! synchronous file IO, or anything else that doesn't fit the readiness
 //! model.
 //!
 //! `wait_readable(fd)` / `wait_writable(fd)` register interest in an fd
 //! with epoll and park the calling actor. When the fd becomes ready, the
 //! epoll thread unparks the actor. The actual `read(2)`/`write(2)` syscall
 //! runs back on the scheduler thread, *inside* the actor — buffer never
 //! leaves the actor, no copying through an intermediary thread. Built on
 //! these are the conveniences `read(fd, &mut buf)` and `write(fd, &buf)`.
 //!
 //! Architecture
 //! ============
 //! Per `run()`, two OS threads:
 //!   - **epoll thread**: owns the epollfd. Loops in `epoll_wait`. On a
 //!     ready fd, pushes `Completion::FdReady { pid, fd, events }` to the
 //!     shared completion queue and writes the scheduler-wake pipe. On the
 //!     shutdown pipe (also registered in epollfd), exits.
 //!   - **pool thread**: blocks on the request mpsc. Runs the closure
 //!     inside `catch_unwind`, pushes `Completion::Blocking { pid, result }`,
 //!     writes the scheduler-wake pipe.
 //!
 //! Both threads share a single `completions: Arc<Mutex<VecDeque<Completion>>>`
 //! and the same scheduler-wake pipe.
 //!
 //! `epoll_ctl` (register/unregister fd interest) is called by the
 //! scheduler thread *directly* on the epollfd. That's well-defined per
 //! `epoll_ctl(2)`: a thread may be calling `epoll_wait` on the epollfd
 //! while another thread calls `epoll_ctl`. Avoids needing a second mpsc
 //! and a second wake mechanism.
 //!
 //! Epoll mode
 //! ==========
 //! Level-triggered with EPOLLONESHOT. After a wakeup the kernel
 //! auto-disarms the fd, so we never get two wakeups for one
 //! `wait_readable` call. The scheduler explicitly `EPOLL_CTL_DEL`s the fd
 //! on completion to free the slot for re-registration. Net effect: each
 //! `wait_readable(fd)` is one ADD, one wakeup, one DEL — symmetric and
 //! stateless between calls.
 //!
 //! Fd hygiene
 //! ==========
 //! If an actor dies while waiting on an fd, the registration is leaked
 //! (the fd stays in the epollfd, armed). EPOLLONESHOT bounds the damage:
 //! at most one stale wakeup, after which the kernel disarms. The stale
 //! wakeup hits a dead pid in `waiters` and is dropped. Acceptable for v0.2;
 //! a future pass should DEL on actor death.
 //!
 //! Buffers used with `read`/`write` should be on fds opened with
 //! `O_NONBLOCK`. If they aren't, the syscall may block the scheduler
 //! thread despite the readiness notification (the fd reporting readable
 //! doesn't guarantee the syscall completes without blocking — e.g. a
 //! signal could be delivered). Documented; not enforced.
 //!
 //! Panic handling
 //! ==============
 //! The pool worker runs the closure inside `catch_unwind` and ships either
 //! the return value or the panic payload back to the scheduler.
 //! `block_on_io` resumes the panic on the calling actor's stack, so the
 //! actor's supervisor sees a real `Signal::Panic` as if the work had run
 //! inline. Fd-wait primitives don't run user code on the IO thread, so
 //! they have no equivalent panic-propagation path.
 use crate::pid::Pid;
 use std::any::Any;
 use std::collections::{HashMap, VecDeque};
 use std::io;
 use std::os::fd::RawFd;
 use std::panic;
 use std::sync::mpsc;
 use std::sync::{Arc, Mutex};
 use std::thread::JoinHandle as OsJoinHandle;
 // ---------------------------------------------------------------------------
 // Wire types
 // ---------------------------------------------------------------------------
 /// What the pool stores while computing a result. `Ok` is the closure's
 /// return value (boxed as `Any`); `Err` is the panic payload.
 pub type IoResult = Result<Box<dyn Any + Send>, Box<dyn Any + Send>>;
 struct Request {
    pid: Pid,
    /// The work to perform. Returns the wire-form result directly.
    work: Box<dyn FnOnce() -> IoResult + Send>,
 }
 /// Completion message from either IO thread back to the scheduler.
 pub enum Completion {
    /// A `block_on_io` closure has finished (Ok = return value, Err = panic
    /// payload).
    Blocking { pid: Pid, result: IoResult },
    /// An fd registered via `wait_readable`/`wait_writable` is ready. The
    /// scheduler looks up the parked pid in `waiters`, unparks it, and
    /// removes the entry. `pid` isn't in this variant because the epoll
    /// thread doesn't have access to the `waiters` map; the scheduler
    /// thread owns that.
    FdReady { fd: RawFd, events: u32 },
 }
 // ---------------------------------------------------------------------------
 // IoThread — created per `run()`, owned by `SchedulerState`.
 // ---------------------------------------------------------------------------
 pub struct IoThread {
    // ----- Channels & queues -----
    /// Submission queue into the blocking-work pool.
    tx: mpsc::Sender<Request>,
    /// Shared completion queue, fed by both the pool and the epoll thread.
    completions: Arc<Mutex<VecDeque<Completion>>>,
    /// Pipe the scheduler polls in its idle path. Both IO threads write to
    /// `wake_write` after pushing a completion.
    wake_read: RawFd,
    wake_write: RawFd,
    // ----- Epoll machinery -----
    /// The epollfd, owned by `IoThread`. Callable cross-thread via
    /// `epoll_ctl` per the man page.
    epollfd: RawFd,
    /// Pipe used to signal the epoll thread to exit. Registered inside the
    /// epollfd so a single `epoll_wait` covers both fd readiness and
    /// shutdown.
    shutdown_read: RawFd,
    shutdown_write: RawFd,
    /// One parked actor per registered fd. Populated by `wait_readable` /
    /// `wait_writable` and drained by the scheduler when a `FdReady`
    /// completion is processed.
    pub waiters: HashMap<RawFd, Pid>,
    // ----- Threads -----
    pool_thread: Option<OsJoinHandle<()>>,
    epoll_thread: Option<OsJoinHandle<()>>,
    /// Number of `block_on_io` requests in-flight. Used by the scheduler's
    /// idle path to decide whether to wait on the pipe or exit. Fd waits
    /// are not counted here; they're counted by `waiters.len()`.
    pub outstanding: u32,
 }
 impl IoThread {
    pub fn start() -> io::Result<Self> {
        // Scheduler-facing wake pipe.
        let (wake_read, wake_write) = make_pipe()?;
        // Pool submission channel + shared completion queue.
        let (tx, rx) = mpsc::channel::<Request>();
        let completions: Arc<Mutex<VecDeque<Completion>>> =
            Arc::new(Mutex::new(VecDeque::new()));
        // Epoll machinery.
        let epollfd = unsafe { libc::epoll_create1(libc::EPOLL_CLOEXEC) };
        if epollfd < 0 {
            // Best-effort fd cleanup before bailing.
            unsafe {
                libc::close(wake_read);
                libc::close(wake_write);
            }
            return Err(io::Error::last_os_error());
        }
        let (shutdown_read, shutdown_write) = match make_pipe() {
            Ok(p) => p,
            Err(e) => {
                unsafe {
                    libc::close(epollfd);
                    libc::close(wake_read);
                    libc::close(wake_write);
                }
                return Err(e);
            }
        };
        // Register the shutdown pipe in epollfd. We use a sentinel `data`
        // value to recognise shutdown events. RawFd values are non-negative,
        // so u64::MAX is unambiguously not a real fd-data encoding.
        let mut shutdown_ev = libc::epoll_event {
            events: libc::EPOLLIN as u32,
            u64: SHUTDOWN_EPOLL_TOKEN,
        };
        if unsafe {
            libc::epoll_ctl(
                epollfd,
                libc::EPOLL_CTL_ADD,
                shutdown_read,
                &mut shutdown_ev as *mut _,
            )
        } < 0
        {
            let e = io::Error::last_os_error();
            unsafe {
                libc::close(epollfd);
                libc::close(shutdown_read);
                libc::close(shutdown_write);
                libc::close(wake_read);
                libc::close(wake_write);
            }
            return Err(e);
        }
        // Spawn pool thread.
        let pool_comps = completions.clone();
        let pool_thread = std::thread::Builder::new()
            .name("smarm-io-pool".into())
            .spawn(move || pool_loop(rx, pool_comps, wake_write))?;
        // Spawn epoll thread.
        let epoll_comps = completions.clone();
        let epoll_thread = std::thread::Builder::new()
            .name("smarm-io-epoll".into())
            .spawn(move || epoll_loop(epollfd, epoll_comps, wake_write))?;
        Ok(Self {
            tx,
            completions,
            wake_read,
            wake_write,
            epollfd,
            shutdown_read,
            shutdown_write,
            waiters: HashMap::new(),
            pool_thread: Some(pool_thread),
            epoll_thread: Some(epoll_thread),
            outstanding: 0,
        })
    }
    /// Hand a request to the pool. Increments `outstanding`.
    pub fn submit(&mut self, pid: Pid, work: Box<dyn FnOnce() -> IoResult + Send>) {
        self.outstanding += 1;
        // Send can only fail if the pool has hung up, which only happens
        // on shutdown. submit during shutdown is a bug.
        self.tx
            .send(Request { pid, work })
            .expect("io pool hung up unexpectedly");
    }
    /// Drain every available completion. Caller (the scheduler) routes the
    /// results and updates `outstanding` / `waiters` accordingly.
    pub fn drain_completions(&mut self) -> Vec<Completion> {
        self.completions.lock().unwrap().drain(..).collect()
    }
    pub fn wake_fd(&self) -> RawFd {
        self.wake_read
    }
    /// Register interest in `fd` becoming readable/writable; record `pid`
    /// as the parked waiter. The epoll thread will push a `FdReady`
    /// completion when the kernel signals.
    ///
    /// EPOLLONESHOT: one wakeup per registration. The scheduler must call
    /// `epoll_deregister` on completion to free the slot for re-registration.
    pub fn epoll_register(
        &mut self,
        fd: RawFd,
        pid: Pid,
        readable: bool,
        writable: bool,
    ) -> io::Result<()> {
        // Two actors waiting on the same fd would be a misuse: the kernel
        // delivers exactly one EPOLLONESHOT wakeup, so the second waiter
        // would hang. Reject up front.
        if self.waiters.contains_key(&fd) {
            return Err(io::Error::new(
                io::ErrorKind::AlreadyExists,
                "fd already has a parked waiter",
            ));
        }
        // Defensive cleanup: if a previous actor died while waiting on this
        // fd, the kernel-side registration was leaked (we don't walk all
        // waiters on actor death). A bare DEL is harmless if the fd isn't
        // registered (ENOENT), and removes any leak.
        unsafe {
            libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_DEL, fd, std::ptr::null_mut());
        }
        let mut events: u32 = libc::EPOLLONESHOT as u32;
        if readable {
            events |= libc::EPOLLIN as u32;
        }
        if writable {
            events |= libc::EPOLLOUT as u32;
        }
        let mut ev = libc::epoll_event {
            events,
            u64: fd as u64,
        };
        let r = unsafe {
            libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_ADD, fd, &mut ev as *mut _)
        };
        if r < 0 {
            return Err(io::Error::last_os_error());
        }
        self.waiters.insert(fd, pid);
        Ok(())
    }
    /// Remove `fd` from the epollfd. Called by the scheduler after a
    /// `FdReady` completion, so the next `wait_readable(fd)` can ADD again.
    ///
    /// Does NOT touch `waiters` — that's the scheduler's bookkeeping; this
    /// is purely the kernel-side cleanup.
    pub fn epoll_deregister(&mut self, fd: RawFd) {
        // EPOLL_CTL_DEL of an already-removed fd returns ENOENT; ignore.
        unsafe {
            libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_DEL, fd, std::ptr::null_mut());
        }
    }
 }
 impl Drop for IoThread {
    fn drop(&mut self) {
        // 1. Signal the epoll thread to exit by writing the shutdown pipe.
        //    Retry on EINTR: step 3 joins the epoll thread before any fd is
        //    closed, so a lost shutdown byte would leave that join hanging.
        let buf: [u8; 1] = [0];
        loop {
            let n = unsafe { libc::write(self.shutdown_write, buf.as_ptr() as *const _, 1) };
            if n < 0 && unsafe { *libc::__errno_location() } == libc::EINTR {
                continue;
            }
            break;
        }
        // 2. Hang up the pool's request channel so the pool thread exits.
        let (dead_tx, _) = mpsc::channel::<Request>();
        let real_tx = std::mem::replace(&mut self.tx, dead_tx);
        drop(real_tx);
        // 3. Join both threads.
        if let Some(h) = self.epoll_thread.take() {
            let _ = h.join();
        }
        if let Some(h) = self.pool_thread.take() {
            let _ = h.join();
        }
        // 4. Close fds.
        unsafe {
            libc::close(self.epollfd);
            libc::close(self.shutdown_read);
            libc::close(self.shutdown_write);
            libc::close(self.wake_read);
            libc::close(self.wake_write);
        }
    }
 }
 /// Sentinel `epoll_event.u64` distinguishing the shutdown pipe from
 /// registered actor fds. RawFd values fit in i32, so the high bits are
 /// available for a marker; we use u64::MAX which can't be a valid fd.
 const SHUTDOWN_EPOLL_TOKEN: u64 = u64::MAX;
 // ---------------------------------------------------------------------------
 // Pool loop
 // ---------------------------------------------------------------------------
 fn pool_loop(
    rx: mpsc::Receiver<Request>,
    completions: Arc<Mutex<VecDeque<Completion>>>,
    wake_write: RawFd,
 ) {
    while let Ok(Request { pid, work }) = rx.recv() {
        let result: IoResult = match panic::catch_unwind(panic::AssertUnwindSafe(work)) {
            Ok(r) => r,
            Err(payload) => Err(payload),
        };
        completions
            .lock()
            .unwrap()
            .push_back(Completion::Blocking { pid, result });
        wake_scheduler(wake_write);
    }
 }
 // ---------------------------------------------------------------------------
 // Epoll loop
 // ---------------------------------------------------------------------------
 fn epoll_loop(
    epollfd: RawFd,
    completions: Arc<Mutex<VecDeque<Completion>>>,
    wake_write: RawFd,
 ) {
    // Buffer for epoll_wait. 64 is plenty for our scale; if a real load
    // appears that needs more, this is a one-line change.
    const MAX_EVENTS: usize = 64;
    let mut events: [libc::epoll_event; MAX_EVENTS] = unsafe { std::mem::zeroed() };
    loop {
        let n = unsafe {
            libc::epoll_wait(
                epollfd,
                events.as_mut_ptr(),
                MAX_EVENTS as libc::c_int,
                -1,
            )
        };
        if n < 0 {
            let e = unsafe { *libc::__errno_location() };
            if e == libc::EINTR {
                continue;
            }
            // Anything else here is a programming error (EBADF on epollfd
            // after we've closed it from Drop — the close races with us).
            // Treat as shutdown.
            return;
        }
        let mut shutdown_requested = false;
        let mut pushed_any = false;
        {
            let mut q = completions.lock().unwrap();
            for ev in events.iter().take(n as usize) {
                if ev.u64 == SHUTDOWN_EPOLL_TOKEN {
                    shutdown_requested = true;
                    continue;
                }
                let fd = ev.u64 as RawFd;
                let evs = ev.events;
                q.push_back(Completion::FdReady {
                    fd,
                    events: evs,
                });
                pushed_any = true;
            }
        }
        if pushed_any {
            wake_scheduler(wake_write);
        }
        if shutdown_requested {
            return;
        }
    }
 }
 /// Write one byte to the scheduler's wake pipe. Retries on EINTR; ignores
 /// EAGAIN (pipe full means there's already an outstanding wake we haven't
 /// consumed yet, which is sufficient).
 fn wake_scheduler(wake_write: RawFd) {
    let buf: [u8; 1] = [0];
    unsafe {
        loop {
            let n = libc::write(wake_write, buf.as_ptr() as *const _, 1);
            if n < 0 {
                let e = *libc::__errno_location();
                if e == libc::EINTR {
                    continue;
                }
            }
            break;
        }
    }
 }
 // ---------------------------------------------------------------------------
 // Pipe helpers (unchanged from v0.2)
 // ---------------------------------------------------------------------------
 fn make_pipe() -> io::Result<(RawFd, RawFd)> {
    let mut fds: [libc::c_int; 2] = [0; 2];
    let r = unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC | libc::O_NONBLOCK) };
    if r != 0 {
        return Err(io::Error::last_os_error());
    }
    Ok((fds[0], fds[1]))
 }
 /// Drain pending bytes from the wake pipe. The scheduler calls this after
 /// a `poll` wakeup so the next idle call sees an empty pipe.
 pub fn drain_wake_pipe(fd: RawFd) {
    let mut buf = [0u8; 64];
    loop {
        let n = unsafe { libc::read(fd, buf.as_mut_ptr() as *mut _, buf.len()) };
        if n <= 0 {
            break;
        }
    }
 }
 /// Block on `fd` for up to `timeout`, returning when either there's data
 /// to read or the timeout elapses. `None` for `timeout` means wait forever.
 pub fn poll_wake(fd: RawFd, timeout: Option<std::time::Duration>) {
    let timeout_ms: libc::c_int = match timeout {
        None => -1,
        // Saturate at i32::MAX ms; `as_millis` truncates, so sub-millisecond
        // timeouts become 0 (a non-blocking poll).
        Some(d) => d.as_millis().min(i32::MAX as u128) as libc::c_int,
    };
    let mut pfd = libc::pollfd {
        fd,
        events: libc::POLLIN,
        revents: 0,
    };
    loop {
        let r = unsafe { libc::poll(&mut pfd as *mut _, 1, timeout_ms) };
        if r < 0 {
            let e = unsafe { *libc::__errno_location() };
            if e == libc::EINTR {
                continue;
            }
        }
        break;
    }
 }
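Note on the panic-handling path in `src/io.rs`: the sketch below shows, using only the standard library, how a worker can run a closure under `catch_unwind` and ship back either the return value or the panic payload, which is the pattern `pool_loop` follows. It is an illustration under assumptions, not the crate's code: `wire`, `spawn_worker`, and `run_one` are hypothetical names, and a plain `mpsc` channel stands in for the shared completion queue and the wake pipe.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};
use std::sync::mpsc;
use std::thread;

/// Same shape as `IoResult` in the diff: `Ok` is the boxed return value,
/// `Err` is the panic payload.
type IoResult = Result<Box<dyn Any + Send>, Box<dyn Any + Send>>;
type Work = Box<dyn FnOnce() -> IoResult + Send>;

/// Hypothetical helper: erase a typed closure into the wire form.
fn wire<T: Send + 'static>(f: impl FnOnce() -> T + Send + 'static) -> Work {
    Box::new(move || -> IoResult { Ok(Box::new(f()) as Box<dyn Any + Send>) })
}

/// Hypothetical single-worker pool: runs each job under `catch_unwind` and
/// sends the result (or the panic payload) back on `done`.
fn spawn_worker(
    jobs: mpsc::Receiver<Work>,
    done: mpsc::Sender<IoResult>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        while let Ok(work) = jobs.recv() {
            let result = match panic::catch_unwind(AssertUnwindSafe(work)) {
                Ok(r) => r,
                Err(payload) => Err(payload),
            };
            if done.send(result).is_err() {
                break; // nobody is listening any more
            }
        }
    })
}

/// Submit one job, wait for its wire-form result, then shut the worker down.
fn run_one(work: Work) -> IoResult {
    let (job_tx, job_rx) = mpsc::channel();
    let (done_tx, done_rx) = mpsc::channel();
    let worker = spawn_worker(job_rx, done_tx);
    job_tx.send(work).expect("worker hung up");
    let result = done_rx.recv().expect("worker dropped the result");
    drop(job_tx); // hang up so the worker loop ends
    worker.join().expect("worker thread itself must not panic");
    result
}

fn main() {
    let ok = run_one(wire(|| 6 * 7));
    println!("ok: {:?}", ok.unwrap().downcast::<i32>().ok());

    let failed = run_one(wire(|| -> i32 { panic!("disk on fire") }));
    let payload = failed.unwrap_err();
    println!("panic payload: {:?}", payload.downcast_ref::<&str>());
}
```

Because the payload reaches the caller intact, code in the position of `block_on_io` could hand it to `std::panic::resume_unwind` so the panic continues on the caller's own stack.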
--- a/src/lib.rs
+++ b/src/lib.rs
@@ -2,11 +2,12 @@
 //!
 //! Erlang-style green-thread actor concurrency for Rust.
 //!
-//! v0.1 is single-threaded. One scheduler, one OS thread. The scheduler
-//! cooperatively interleaves green-thread actors with hand-rolled context
-//! switches. Actors communicate by sending `Send` messages over channels;
-//! every actor has a supervisor, which is itself just an actor with a
-//! `Receiver<Signal>`.
+//! Multi-threaded: N scheduler OS threads (default: one per CPU) share a
+//! single global run queue behind a `Mutex`. Actors communicate by sending
+//! `Send` messages over channels; every actor has a supervisor. Synchronisation
+//! primitives — `Mutex<T>` with mandatory lock timeouts, channel `recv`,
+//! `sleep`, and epoll-backed `wait_readable`/`wait_writable` — all park the
 //! green thread, never the OS thread.
 //!
 //! See `LOOM.md` for the design intent and the deferred-for-later list.
@@ -19,13 +20,13 @@ pub mod channel;
 pub mod scheduler;
 pub mod supervisor;
 pub mod timer;
 pub mod io;
 pub mod mutex;
 pub mod runtime;
 pub mod trace;
 // ---------------------------------------------------------------------------
 // Global allocator
 //
 // The preempting allocator wraps `System`. While `PREEMPTION_ENABLED` is
 // false (the default outside an actor) it adds one branch per allocation
 // and no syscalls. The scheduler flips it on per-resume.
 // ---------------------------------------------------------------------------
 #[global_allocator]
@@ -36,6 +37,24 @@ static ALLOCATOR: preempt::PreemptingAllocator = preempt::PreemptingAllocator;
 // ---------------------------------------------------------------------------
 pub use channel::{channel, Receiver, RecvError, Sender};
 pub use mutex::{LockTimeout, Mutex, MutexGuard};
 pub use pid::Pid;
-pub use scheduler::{run, self_pid, sleep, spawn, spawn_under, yield_now, JoinError, JoinHandle};
+pub use runtime::{init, Config, Runtime};
 pub use scheduler::{
    block_on_io, run, self_pid, sleep, spawn, spawn_under, wait_readable, wait_writable,
    yield_now, JoinError, JoinHandle,
 };
 pub use supervisor::Signal;
 // ---------------------------------------------------------------------------
 // check!()
 // ---------------------------------------------------------------------------
 /// Voluntarily check whether this actor's timeslice has expired, yielding
 /// if so.
 #[macro_export]
 macro_rules! check {
    () => {
        $crate::preempt::maybe_preempt()
    };
 }
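Note on the global allocator comment in `src/lib.rs`: a minimal sketch of an allocator that wraps `System` and does extra work only while a thread-local flag is set, so the disabled path costs one branch per allocation. All names here (`HookedAllocator`, `HOOK_ENABLED`, `EVENTS`, `count_allocs`) are hypothetical, and counting events stands in for smarm's preemption check.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::cell::Cell;

thread_local! {
    /// While `false`, the hook does nothing beyond the branch that reads it.
    static HOOK_ENABLED: Cell<bool> = const { Cell::new(false) };
    /// Allocation events seen on this thread while the hook was enabled.
    static EVENTS: Cell<u64> = const { Cell::new(0) };
}

struct HookedAllocator;

unsafe impl GlobalAlloc for HookedAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Const-initialised `Cell` thread-locals need no lazy setup, so
        // reading them here does not allocate.
        if HOOK_ENABLED.with(|e| e.get()) {
            EVENTS.with(|c| c.set(c.get() + 1));
        }
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

#[global_allocator]
static ALLOCATOR: HookedAllocator = HookedAllocator;

/// Run `f` with the hook enabled and report how many allocation events it
/// triggered on the current thread.
fn count_allocs<R>(f: impl FnOnce() -> R) -> (R, u64) {
    EVENTS.with(|c| c.set(0));
    HOOK_ENABLED.with(|e| e.set(true));
    let out = f();
    HOOK_ENABLED.with(|e| e.set(false));
    (out, EVENTS.with(|c| c.get()))
}

fn main() {
    let (v, n) = count_allocs(|| vec![0u8; 64]);
    println!("len={} allocation events={}", v.len(), n);
}
```

Keeping the flag and counter thread-local mirrors the design described in the diff: each scheduler thread flips its own switch on resume, and no other thread's cache line is touched.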
--- a/src/mutex.rs
+++ b/src/mutex.rs
@@ -0,0 +1,248 @@
 //! Actor-aware mutex with mandatory timeout.
 //!
 //! `Mutex<T>` parks the calling *green* thread on contention rather than
 //! blocking the OS thread. Every lock attempt is bounded by a timeout.
 //!
 //! Internals use `Arc<std::sync::Mutex<...>>` so the type is genuinely
 //! `Send + Sync` and can be shared across scheduler threads.
 //!
 //! Fairness: FIFO. Poisoning: none. Reentrance: deadlock (caller bug).
 use crate::pid::Pid;
 use crate::scheduler;
 use crate::timer::{self, TimerTarget};
 use std::collections::VecDeque;
 use std::sync::{Arc, Mutex as StdMutex};
 use std::time::Duration;
 pub const DEFAULT_TIMEOUT: Duration = Duration::from_secs(30);
 #[derive(Debug, PartialEq, Eq, Clone, Copy)]
 pub struct LockTimeout;
 impl std::fmt::Display for LockTimeout {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "mutex lock timed out")
    }
 }
 impl std::error::Error for LockTimeout {}
 // ---------------------------------------------------------------------------
 // Internals
 // ---------------------------------------------------------------------------
 struct Wait {
    pid: Pid,
    seq: u64,
 }
 struct MutexState {
    holder: Option<Pid>,
    waiters: VecDeque<Wait>,
    next_seq: u64,
    default_timeout: Duration,
 }
 struct MutexCore {
    state: StdMutex<MutexState>,
 }
 impl MutexCore {
    fn new(default_timeout: Duration) -> Self {
        Self {
            state: StdMutex::new(MutexState {
                holder: None,
                waiters: VecDeque::new(),
                next_seq: 0,
                default_timeout,
            }),
        }
    }
 }
 impl TimerTarget for MutexCore {
    fn on_timeout(&self, pid: Pid, wait_seq: u64) {
        let unpark = {
            let mut st = self.state.lock().unwrap();
            // Remove from waiters only if still there with matching seq.
            // If the lock was already granted (holder == Some(pid)), the
            // timer fired after the grant — treat as no-op; the actor
            // will see `is_holder == true` and return Ok.
            if st.holder == Some(pid) {
                return;
            }
            let pos = st.waiters.iter().position(|w| w.pid == pid && w.seq == wait_seq);
            if let Some(i) = pos {
                st.waiters.remove(i);
                true
            } else {
                false
            }
        };
        if unpark {
            scheduler::unpark(pid);
        }
    }
 }
 // ---------------------------------------------------------------------------
 // Public API
 // ---------------------------------------------------------------------------
 pub struct Mutex<T> {
    core: Arc<MutexCore>,
    /// Protected value. `None` while a guard is live; `Some` while free.
    value: Arc<StdMutex<Option<T>>>,
 }
 impl<T> Mutex<T> {
    pub fn new(value: T) -> Self {
        Self {
            core: Arc::new(MutexCore::new(DEFAULT_TIMEOUT)),
            value: Arc::new(StdMutex::new(Some(value))),
        }
    }
    pub fn set_default_timeout(&self, timeout: Duration) {
        self.core.state.lock().unwrap().default_timeout = timeout;
    }
    pub fn lock(&self) -> Result<MutexGuard<'_, T>, LockTimeout> {
        let timeout = self.core.state.lock().unwrap().default_timeout;
        self.lock_timeout(timeout)
    }
    pub fn lock_timeout(&self, timeout: Duration) -> Result<MutexGuard<'_, T>, LockTimeout> {
        // Outside the runtime (e.g. in tests, after run() returns) there is no
        // current actor PID.  Fall back to a blocking std::sync::Mutex acquire.
        let Some(me) = crate::actor::current_pid() else {
            return self.lock_blocking();
        };
        // Fast path: nobody holds it.
        {
            let mut st = self.core.state.lock().unwrap();
            if st.holder.is_none() {
                st.holder = Some(me);
                drop(st);
                let value = self.value.lock().unwrap().take()
                    .expect("Mutex: value missing on free fast path");
                return Ok(MutexGuard { mutex: self, value: Some(value) });
            }
        }
        // Slow path: register as a waiter, set timeout, park.
        let _np = scheduler::NoPreempt::enter();
        let seq = {
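            let mut st = self.core.state.lock().unwrap();
            // Re-check under the lock that enqueues us: if the holder released
            // after the fast-path check, a waiter queued on a free mutex would
            // only ever be woken by its own timeout.
            if st.holder.is_none() {
                st.holder = Some(me);
                drop(st);
                let value = self.value.lock().unwrap().take()
                    .expect("Mutex: value missing on slow-path re-check");
                return Ok(MutexGuard { mutex: self, value: Some(value) });
            }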
            let seq = st.next_seq;
            st.next_seq = st.next_seq.wrapping_add(1);
            st.waiters.push_back(Wait { pid: me, seq });
            seq
        };
        let target: Arc<dyn TimerTarget> = self.core.clone();
        let deadline = timer::deadline_from_now(timeout);
        scheduler::insert_wait_timer(deadline, me, target, seq);
        scheduler::park_current();
        // Resumed. Are we the holder?
        let is_holder = self.core.state.lock().unwrap().holder == Some(me);
        if is_holder {
            let value = self.value.lock().unwrap().take()
                .expect("Mutex: value missing after grant");
            Ok(MutexGuard { mutex: self, value: Some(value) })
        } else {
            Err(LockTimeout)
        }
    }
    pub fn try_lock(&self) -> Option<MutexGuard<'_, T>> {
        let me = crate::actor::current_pid()?;
        let mut st = self.core.state.lock().unwrap();
        if st.holder.is_some() {
            return None;
        }
        st.holder = Some(me);
        drop(st);
        let value = self.value.lock().unwrap().take()
            .expect("Mutex: value missing on try_lock free path");
        Some(MutexGuard { mutex: self, value: Some(value) })
    }
    /// Blocking fallback used when called outside the smarm runtime.
    /// Spins on the internal std mutex; no actor parking, no timeout.
    fn lock_blocking(&self) -> Result<MutexGuard<'_, T>, LockTimeout> {
        // We have no PID to register as holder, so we bypass the holder/waiter
        // tracking and just grab the value mutex directly.  This is safe because
        // outside the runtime there are no green threads competing.
        let value = loop {
            let v = self.value.lock().unwrap().take();
            if let Some(v) = v { break v; }
            std::thread::yield_now();
        };
        Ok(MutexGuard { mutex: self, value: Some(value) })
    }
 }
 impl<T> Clone for Mutex<T> {
    fn clone(&self) -> Self {
        Self { core: self.core.clone(), value: self.value.clone() }
    }
 }
 // Genuinely Send + Sync now that internals are Arc<std::sync::Mutex<...>>.
 unsafe impl<T: Send> Send for Mutex<T> {}
 unsafe impl<T: Send> Sync for Mutex<T> {}
 // ---------------------------------------------------------------------------
 // Guard
 // ---------------------------------------------------------------------------
 pub struct MutexGuard<'a, T> {
    mutex: &'a Mutex<T>,
    value: Option<T>,
 }
 impl<T> std::ops::Deref for MutexGuard<'_, T> {
    type Target = T;
    fn deref(&self) -> &T { self.value.as_ref().expect("MutexGuard: value missing") }
 }
 impl<T> std::ops::DerefMut for MutexGuard<'_, T> {
    fn deref_mut(&mut self) -> &mut T {
        self.value.as_mut().expect("MutexGuard: value missing")
    }
 }
 impl<T: std::fmt::Debug> std::fmt::Debug for MutexGuard<'_, T> {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        f.debug_tuple("MutexGuard")
            .field(self.value.as_ref().expect("MutexGuard: value missing"))
            .finish()
    }
 }
 impl<T> Drop for MutexGuard<'_, T> {
    fn drop(&mut self) {
        let v = self.value.take().expect("MutexGuard: double drop");
        *self.mutex.value.lock().unwrap() = Some(v);
        let next_pid = {
            let mut st = self.mutex.core.state.lock().unwrap();
            match st.waiters.pop_front() {
                Some(w) => {
                    st.holder = Some(w.pid);
                    Some(w.pid)
                }
                None => {
                    st.holder = None;
                    None
                }
            }
        };
        if let Some(pid) = next_pid {
            scheduler::unpark(pid);
        }
    }
 }
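Note on the hand-off protocol in `src/mutex.rs`: the sketch below isolates the bookkeeping (holder, FIFO waiter queue, per-wait sequence number) from parking and timers, to show why a stale timer cannot evict a later wait or revoke a grant. It is a simplified model under assumptions: `LockState`, `Acquire`, and the `u32` pid are hypothetical stand-ins, and the returned values mark where the real code would call `scheduler::unpark`.

```rust
use std::collections::VecDeque;

type Pid = u32; // stand-in for smarm's `Pid`

#[derive(Default)]
struct LockState {
    holder: Option<Pid>,
    waiters: VecDeque<(Pid, u64)>, // (pid, wait sequence number), FIFO
    next_seq: u64,
}

enum Acquire {
    Granted,
    Queued { seq: u64 },
}

impl LockState {
    /// Take the lock if free, otherwise join the back of the queue.
    fn acquire(&mut self, me: Pid) -> Acquire {
        if self.holder.is_none() {
            self.holder = Some(me);
            return Acquire::Granted;
        }
        let seq = self.next_seq;
        self.next_seq = self.next_seq.wrapping_add(1);
        self.waiters.push_back((me, seq));
        Acquire::Queued { seq }
    }

    /// Release: ownership passes directly to the oldest waiter, if any.
    /// Returns the pid that should be unparked.
    fn release(&mut self) -> Option<Pid> {
        self.holder = self.waiters.pop_front().map(|(pid, _)| pid);
        self.holder
    }

    /// Timer callback. Returns `true` only if this exact wait is still
    /// queued, in which case it is removed and the waiter should be woken
    /// to report a timeout.
    fn on_timeout(&mut self, pid: Pid, seq: u64) -> bool {
        if self.holder == Some(pid) {
            return false; // already granted; the timer lost the race
        }
        let pos = self.waiters.iter().position(|&(p, s)| p == pid && s == seq);
        if let Some(i) = pos {
            self.waiters.remove(i);
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut st = LockState::default();
    assert!(matches!(st.acquire(1), Acquire::Granted));
    let Acquire::Queued { seq: s2 } = st.acquire(2) else { panic!("2 should queue") };
    let Acquire::Queued { seq: s3 } = st.acquire(3) else { panic!("3 should queue") };

    assert!(st.on_timeout(2, s2)); // actor 2 gives up first
    assert_eq!(st.release(), Some(3)); // actor 1 releases; 3 is next in line
    assert!(!st.on_timeout(3, s3)); // 3's timer fires late and is ignored
    assert_eq!(st.release(), None); // nobody left; the lock is free
    println!("hand-off order verified");
}
```

The sequence number matters when the same pid waits more than once: a timer armed for an earlier wait carries an older `seq` and therefore cannot remove the newer queue entry.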
--- a/src/preempt.rs
+++ b/src/preempt.rs
@@ -6,10 +6,16 @@
 //! `switch_to_scheduler` to yield. Resetting the counter to the configured
 //! alloc interval amortises the RDTSC across many cheap events.
 //!
-//! Events today are heap allocations (via `PreemptingAllocator`). v0.2 will
-//! add stack-frame entries as a second event source — frames are stack
-//! allocations, the counter naming still fits — sharing this same counter
-//! so both routes behave consistently.
+//! Two event sources today:
+//!   - `PreemptingAllocator` — heap allocations.
+//!   - `smarm::check!()` — explicit preemption point for tight no-alloc
+//!     loops, since stable Rust gives us no transparent way to preempt
 //!     such loops (`__rust_probestack` is emitted inline by LLVM and not
 //!     called at runtime).
 //!
 //! Both sources share `ALLOC_COUNT`, so the timeslice check fires at the
 //! same rate regardless of whether the actor is alloc-heavy, check-heavy,
 //! or mixed.
 //!
 //! All state is thread-local. The scheduler enables preemption on resume
 //! and disables it on the return path, so the scheduler can never preempt
@@ -22,23 +28,42 @@
 use std::alloc::{GlobalAlloc, Layout, System};
 use std::cell::Cell;
-const ALLOC_INTERVAL: u32 = 128;
+pub const DEFAULT_ALLOC_INTERVAL: u32 = 128;
-const TIMESLICE_CYCLES: u64 = 300_000; // ≈ 100µs on a 3 GHz CPU
+pub const DEFAULT_TIMESLICE_CYCLES: u64 = 300_000; // ≈ 100µs on a 3 GHz CPU
 thread_local! {
    /// While `false`, the allocator hook is a no-op.
    pub static PREEMPTION_ENABLED: Cell<bool> = const { Cell::new(false) };
    /// Countdown to next RDTSC check. Reset to the configured alloc interval on resume.
-    static ALLOC_COUNT: Cell<u32> = const { Cell::new(ALLOC_INTERVAL) };
+    static ALLOC_COUNT: Cell<u32> = const { Cell::new(DEFAULT_ALLOC_INTERVAL) };
    /// RDTSC value written by the scheduler on every actor resume.
    static TIMESLICE_START: Cell<u64> = const { Cell::new(0) };
    /// Per-thread copy of the configured alloc interval, written once at
    /// scheduler-thread startup. Kept in a thread-local so the hot path
    /// (`maybe_preempt`) pays only a TLS load, with no cache-coherency traffic.
    static CONFIGURED_ALLOC_INTERVAL: Cell<u32> = const { Cell::new(DEFAULT_ALLOC_INTERVAL) };
    /// Per-thread copy of the configured timeslice, written once at
    /// scheduler-thread startup.
    static CONFIGURED_TIMESLICE_CYCLES: Cell<u64> = const { Cell::new(DEFAULT_TIMESLICE_CYCLES) };
 }
 /// Called once per scheduler thread at startup (before any actor runs).
 /// Writes the runtime-configured preemption knobs into thread-locals so the
 /// hot path reads them without any cross-thread coherency cost.
 pub fn configure_preempt(alloc_interval: u32, timeslice_cycles: u64) {
    CONFIGURED_ALLOC_INTERVAL.with(|c| c.set(alloc_interval));
    CONFIGURED_TIMESLICE_CYCLES.with(|c| c.set(timeslice_cycles));
    // Also prime the countdown so the first resume uses the right interval.
    ALLOC_COUNT.with(|c| c.set(alloc_interval));
 }
 /// Arm the timeslice. Called by the scheduler on every resume.
 pub fn reset_timeslice() {
-    ALLOC_COUNT.with(|c| c.set(ALLOC_INTERVAL));
+    ALLOC_COUNT.with(|c| c.set(CONFIGURED_ALLOC_INTERVAL.with(|i| i.get())));
    TIMESLICE_START.with(|c| c.set(rdtsc()));
 }
@@ -80,18 +105,26 @@ unsafe impl GlobalAlloc for PreemptingAllocator {
 }
 /// Shared preemption check. Called by every preemption event source — the
-/// heap allocator today, the stack-frame entry hook in v0.2. Decrements
-/// `ALLOC_COUNT`; every `ALLOC_INTERVAL` calls reads the timeslice clock
-/// and yields if expired.
+/// heap allocator and `smarm::check!()` (for tight no-alloc loops).
+/// Decrements `ALLOC_COUNT`; once per configured alloc interval it reads
+/// the timeslice clock and yields if expired.
 ///
 /// **Invariant**: must not be called inside a "prep-to-park" region —
 /// e.g. between registering as a channel's parked receiver and calling
 /// `park_current()`. A preemption-driven yield in that window would
 /// reach the scheduler with state=Runnable, the unparker would no-op,
 /// the actor would then park, and the wakeup would be lost. Library
 /// code that touches the parking primitives must keep its prep-to-park
 /// regions allocation-free and check!()-free.
 #[inline(always)]
 pub fn maybe_preempt() {
    ALLOC_COUNT.with(|c| {
        let n = c.get();
        if n == 0 {
-            c.set(ALLOC_INTERVAL);
+            c.set(CONFIGURED_ALLOC_INTERVAL.with(|i| i.get()));
            if PREEMPTION_ENABLED.with(|e| e.get()) {
                let start = TIMESLICE_START.with(|s| s.get());
-                if rdtsc().saturating_sub(start) > TIMESLICE_CYCLES {
+                if rdtsc().saturating_sub(start) > CONFIGURED_TIMESLICE_CYCLES.with(|t| t.get()) {
                    // SAFETY: reachable only inside an actor (the scheduler
                    // sets PREEMPTION_ENABLED on resume and clears it on
                    // return). The scheduler stack is therefore valid.
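Note on the countdown in `src/preempt.rs`: a minimal sketch of how a per-thread counter amortises clock reads across many cheap events, with the interval copied into a thread-local once at startup. This is an approximation under stated assumptions: `Instant` replaces RDTSC, the function returns a flag instead of switching to the scheduler, the decrement branch is assumed (the hunk above is cut off before it), and `configure`, `event`, and `CLOCK_READS` are hypothetical names.

```rust
use std::cell::Cell;
use std::time::{Duration, Instant};

thread_local! {
    /// Per-thread copy of the configured interval (written once at startup).
    static INTERVAL: Cell<u32> = const { Cell::new(128) };
    /// Countdown to the next clock read.
    static COUNT: Cell<u32> = const { Cell::new(128) };
    /// Instrumentation for this sketch only: how often the clock was read.
    static CLOCK_READS: Cell<u64> = const { Cell::new(0) };
}

/// Store the knob in thread-locals and prime the countdown with it.
fn configure(interval: u32) {
    INTERVAL.with(|c| c.set(interval));
    COUNT.with(|c| c.set(interval));
}

/// One preemption event (an allocation or an explicit check). Returns `true`
/// when the timeslice that began at `start` has expired. The clock is only
/// consulted when the countdown reaches zero.
fn event(start: Instant, slice: Duration) -> bool {
    COUNT.with(|c| {
        let n = c.get();
        if n == 0 {
            c.set(INTERVAL.with(|i| i.get()));
            CLOCK_READS.with(|r| r.set(r.get() + 1));
            start.elapsed() > slice
        } else {
            c.set(n - 1);
            false
        }
    })
}

fn main() {
    configure(4);
    let start = Instant::now();
    let slice = Duration::from_secs(3600); // long enough never to expire here
    let expired = (0..10).filter(|_| event(start, slice)).count();
    // With an interval of 4, events 5 and 10 find the counter at zero, so
    // ten events cost two clock reads.
    println!(
        "events=10 clock_reads={} expired={}",
        CLOCK_READS.with(|r| r.get()),
        expired
    );
}
```

A smaller interval reads the clock more often and reacts to an expired timeslice sooner; a larger one makes each event cheaper on average. That is the trade-off the new `Config::alloc_interval()` knob exposes.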
--- a/src/runtime.rs
+++ b/src/runtime.rs
@@ -0,0 +1,787 @@
 //! Multi-scheduler runtime: configuration, initialisation, and the shared
 //! state that all scheduler OS threads operate against.
 //!
 //! # Architecture
 //!
 //! ```text
 //!  init(Config) → Runtime (Arc<RuntimeInner>)
 //!
 //!  RuntimeInner {
 //!    shared: Mutex<SharedState>   ← slot table, run queue, timers, IO
 //!    stats:  Vec<SchedulerStats>  ← one per thread, lockless atomics (RFC 000)
 //!    io_parked:  AtomicU32        ← actors parked on IO
 //!    sleeping:   AtomicU32        ← actors parked on timer
 //!  }
 //! ```
 //!
 //! `Runtime::run(f)` spawns N OS threads, where N is
 //! `Config::resolved_thread_count()`, each running `schedule_loop`. It blocks
 //! until all scheduler threads exit, i.e. until the run queue is empty and
 //! nothing is pending.
 //!
 //! Each scheduler thread holds an `Arc<RuntimeInner>` clone. Per-thread
 //! identity is a small integer index, stored in a thread-local, used to index
 //! into `stats`.
 //!
 //! # Timer / IO drain (try-lock, one-winner)
 //!
 //! On each loop iteration every scheduler thread calls `try_lock()` on a
 //! separate `drain_lock: Mutex<()>`. The winner drains due timers and IO
 //! completions; losers skip the drain and move straight to popping an actor
 //! from the run queue. This is the simplest correct approach; revisit if the
 //! drain becomes a measured bottleneck.
 use crate::actor::{
    clear_current_pid, is_actor_done, reset_actor_done, set_current_actor_box,
    set_current_pid, take_last_outcome, Actor, Outcome,
 };
 use crate::channel::Sender;
 use crate::context::{get_actor_sp, set_actor_sp, switch_to_actor};
 use crate::io::IoThread;
 use crate::pid::Pid;
 use crate::preempt::PREEMPTION_ENABLED;
 use crate::supervisor::Signal;
 use crate::timer::Timers;
 use std::collections::VecDeque;
 use std::sync::atomic::{AtomicU32, AtomicU64, Ordering};
 use std::sync::{Arc, Mutex};
 use std::thread;
 // ---------------------------------------------------------------------------
 // Config
 // ---------------------------------------------------------------------------
 /// Runtime configuration.
 ///
 /// ```
 /// use smarm::runtime::Config;
 ///
 /// // Use all available CPUs (default):
 /// let c = Config::default();
 ///
 /// // Exactly 4 scheduler threads:
 /// let c = Config::exact(4);
 ///
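 /// // Between 2 and 8, clamped to available parallelism:
 /// let c = Config::new(2, 8, None);
 ///
 /// // Tune preemption sensitivity on top of any constructor:
 /// let c = Config::exact(4).alloc_interval(64).timeslice_cycles(150_000);
 /// ```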
 #[derive(Clone, Debug)]
 pub struct Config {
    min: usize,
    max: usize,
    exact: Option<usize>,
    alloc_interval: u32,
    timeslice_cycles: u64,
 }
 impl Config {
    /// Exact thread count; takes precedence over min/max.
    pub fn exact(n: usize) -> Self {
        assert!(n >= 1, "scheduler thread count must be ≥ 1");
        Self {
            min: n, max: n, exact: Some(n),
            alloc_interval: crate::preempt::DEFAULT_ALLOC_INTERVAL,
            timeslice_cycles: crate::preempt::DEFAULT_TIMESLICE_CYCLES,
        }
    }
    /// Bounded range. Thread count = clamp(available_parallelism, min, max),
    /// unless `exact` is `Some`, in which case it takes precedence.
    pub fn new(min: usize, max: usize, exact: Option<usize>) -> Self {
        assert!(min >= 1, "min must be ≥ 1");
        assert!(max >= min, "max must be ≥ min");
        if let Some(e) = exact {
            assert!(e >= 1, "exact must be ≥ 1");
        }
        Self {
            min, max, exact,
            alloc_interval: crate::preempt::DEFAULT_ALLOC_INTERVAL,
            timeslice_cycles: crate::preempt::DEFAULT_TIMESLICE_CYCLES,
        }
    }
    /// How many allocations (or `smarm::check!()` calls) between RDTSC checks.
    /// Lower = more responsive preemption, higher = less overhead.
    /// Default: 128.
    pub fn alloc_interval(mut self, n: u32) -> Self {
        assert!(n >= 1, "alloc_interval must be ≥ 1");
        self.alloc_interval = n;
        self
    }
    /// How many TSC cycles constitute one timeslice.
    /// Default: 300_000 (≈ 100µs on a 3 GHz CPU).
    pub fn timeslice_cycles(mut self, n: u64) -> Self {
        assert!(n >= 1, "timeslice_cycles must be ≥ 1");
        self.timeslice_cycles = n;
        self
    }
    /// The number of scheduler threads this config resolves to.
    pub fn resolved_thread_count(&self) -> usize {
        if let Some(e) = self.exact {
            return e;
        }
        let avail = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        avail.clamp(self.min, self.max)
    }
 }
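 // Illustrative aside, not part of this patch: a minimal sketch of the
 // resolution rule implemented by `resolved_thread_count` above. `resolve` is
 // a hypothetical free function, and `avail` stands in for the value of
 // `thread::available_parallelism()` so the rule can be exercised directly.

```rust
/// Hypothetical stand-in for `Config::resolved_thread_count`, with the
/// available parallelism passed in as a plain argument.
fn resolve(exact: Option<usize>, min: usize, max: usize, avail: usize) -> usize {
    match exact {
        // An exact count wins over the range.
        Some(n) => n,
        // Otherwise clamp the machine's parallelism into [min, max].
        None => avail.clamp(min, max),
    }
}

fn main() {
    let avail = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("range 2..=8 on this machine: {}", resolve(None, 2, 8, avail));
    println!("exact 4: {}", resolve(Some(4), 2, 8, avail));
}
```

 // Passing `avail` explicitly is only a convenience for the sketch; the real
 // method queries the OS itself.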
 impl Default for Config {
    fn default() -> Self {
        let avail = thread::available_parallelism()
            .map(|n| n.get())
            .unwrap_or(1);
        Self {
            min: 1, max: avail, exact: None,
            alloc_interval: crate::preempt::DEFAULT_ALLOC_INTERVAL,
            timeslice_cycles: crate::preempt::DEFAULT_TIMESLICE_CYCLES,
        }
    }
 }
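 // Illustrative aside, not part of this patch: a sketch of how the two knobs
 // (`alloc_interval`, `timeslice_cycles`) might interact. `PreemptState` and
 // its methods are hypothetical; the real state lives in thread-locals in
 // `preempt.rs`, and `now` stands in for an RDTSC read. The decrement branch
 // is an assumption, since only the zero branch is visible in this diff.

```rust
/// Hypothetical per-thread preemption state.
struct PreemptState {
    countdown: u32,
    slice_start: u64,
    alloc_interval: u32,
    timeslice_cycles: u64,
}

impl PreemptState {
    fn new(alloc_interval: u32, timeslice_cycles: u64) -> Self {
        Self { countdown: alloc_interval, slice_start: 0, alloc_interval, timeslice_cycles }
    }

    /// Called when an actor is resumed: a fresh timeslice starts at `now`.
    fn reset_timeslice(&mut self, now: u64) {
        self.slice_start = now;
    }

    /// Called on every allocation. Returns true when the actor should yield.
    fn tick(&mut self, now: u64) -> bool {
        if self.countdown == 0 {
            // Countdown expired: re-arm it and consult the (expensive) clock.
            self.countdown = self.alloc_interval;
            now.saturating_sub(self.slice_start) > self.timeslice_cycles
        } else {
            // Cheap path: just count down, no clock read.
            self.countdown -= 1;
            false
        }
    }
}

fn main() {
    let mut p = PreemptState::new(2, 50);
    for now in [10u64, 20, 30, 40, 90, 100] {
        println!("now={now} preempt={}", p.tick(now));
    }
}
```

 // A smaller interval consults the clock more often; a smaller timeslice
 // makes each consultation more likely to trigger a yield.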
 // ---------------------------------------------------------------------------
 // Per-thread stats (RFC 000 Layer 1 primitives)
 // ---------------------------------------------------------------------------
 /// Lockless per-scheduler-thread counters. Written only by the owning thread;
 /// readable from any thread (introspection actor, tests).
 pub struct SchedulerStats {
    /// PID index of the actor currently on-CPU, or `u32::MAX` when idle.
    pub current_pid_index: AtomicU32,
    /// Snapshot of run queue length maintained on every push/pop.
    pub run_queue_len: AtomicU64,
 }
 impl SchedulerStats {
    fn new() -> Self {
        Self {
            current_pid_index: AtomicU32::new(u32::MAX),
            run_queue_len: AtomicU64::new(0),
        }
    }
 }
 // ---------------------------------------------------------------------------
 // Runtime stats snapshot (for tests / introspection)
 // ---------------------------------------------------------------------------
 pub struct RuntimeStats {
    pub(crate) inner: Arc<RuntimeInner>,
 }
 impl RuntimeStats {
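    /// Sum of the per-thread `run_queue_len` snapshots. The run queue itself
    /// is shared, so with several scheduler threads this can exceed the
    /// queue's actual length.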
    pub fn total_run_queue_len(&self) -> u64 {
        self.inner.stats.iter()
            .map(|s| s.run_queue_len.load(Ordering::Relaxed))
            .sum()
    }
    /// Number of scheduler threads.
    pub fn scheduler_count(&self) -> usize {
        self.inner.stats.len()
    }
    /// Actors currently parked on IO.
    pub fn io_parked_count(&self) -> u32 {
        self.inner.io_parked.load(Ordering::Relaxed)
    }
    /// Actors currently sleeping on a timer.
    pub fn sleeping_count(&self) -> u32 {
        self.inner.sleeping.load(Ordering::Relaxed)
    }
 }
 // ---------------------------------------------------------------------------
 // Shared state (behind Mutex<>)
 // ---------------------------------------------------------------------------
 pub(crate) const ACTOR_STACK_SIZE: usize = 64 * 1024;
 #[derive(Debug)]
 pub(crate) enum State { Runnable, Parked, Done }
 pub(crate) struct Slot {
    pub(crate) generation: u32,
    pub(crate) actor: Option<Actor>,
    pub(crate) state: State,
    pub(crate) waiters: Vec<Pid>,
    pub(crate) outcome: Option<Outcome>,
    pub(crate) supervisor_channel: Option<Sender<Signal>>,
    pub(crate) outstanding_handles: u32,
    pub(crate) pending_io_result: Option<crate::io::IoResult>,
    /// Set by `unpark()` when the actor is still running (not yet Parked).
    /// The scheduler checks this after a Park yield and re-queues instead
    /// of sleeping, closing the lost-wakeup window.
    pub(crate) pending_unpark: bool,
 }
 impl Slot {
    fn vacant() -> Self {
        Self {
            generation: 0,
            actor: None,
            state: State::Done,
            waiters: Vec::new(),
            outcome: None,
            supervisor_channel: None,
            outstanding_handles: 0,
            pending_io_result: None,
            pending_unpark: false,
        }
    }
 }
 pub(crate) type Closure = Box<dyn FnOnce() + Send>;
 pub(crate) struct SharedState {
    pub(crate) slots: Vec<Slot>,
    pub(crate) free_list: Vec<u32>,
    pub(crate) run_queue: VecDeque<Pid>,
    pub(crate) root_pid: Option<Pid>,
    pub(crate) timers: Timers,
    pub(crate) io: Option<IoThread>,
    /// Closures awaiting their first resume, keyed by Pid.
    pub(crate) pending_closures: Vec<(Pid, Closure)>,
 }
 impl SharedState {
    fn new() -> Self {
        Self {
            slots: Vec::new(),
            free_list: Vec::new(),
            run_queue: VecDeque::new(),
            root_pid: None,
            timers: Timers::new(),
            io: None,
            pending_closures: Vec::new(),
        }
    }
    pub(crate) fn allocate_slot(&mut self) -> (u32, u32) {
        if let Some(idx) = self.free_list.pop() {
            let gen = self.slots[idx as usize].generation;
            (idx, gen)
        } else {
            let idx = self.slots.len() as u32;
            self.slots.push(Slot::vacant());
            (idx, 0)
        }
    }
    pub(crate) fn slot(&self, pid: Pid) -> Option<&Slot> {
        let s = self.slots.get(pid.index() as usize)?;
        if s.generation == pid.generation() { Some(s) } else { None }
    }
    pub(crate) fn slot_mut(&mut self, pid: Pid) -> Option<&mut Slot> {
        let s = self.slots.get_mut(pid.index() as usize)?;
        if s.generation == pid.generation() { Some(s) } else { None }
    }
    pub(crate) fn pop_pending_closure(&mut self, pid: Pid) -> Option<Closure> {
        let pos = self.pending_closures.iter().position(|(p, _)| *p == pid)?;
        Some(self.pending_closures.swap_remove(pos).1)
    }
 }
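 // Illustrative aside, not part of this patch: a minimal sketch of the
 // generation check used by `slot()` / `slot_mut()` above. `Handle` and
 // `Table` are hypothetical names standing in for `Pid` and the slot table;
 // the payload is a plain string instead of an actor.

```rust
#[derive(Copy, Clone, Debug, PartialEq)]
struct Handle {
    index: u32,
    generation: u32,
}

/// Hypothetical slot table: each slot stores (generation, value).
struct Table {
    slots: Vec<(u32, Option<&'static str>)>,
    free: Vec<u32>,
}

impl Table {
    fn new() -> Self {
        Self { slots: Vec::new(), free: Vec::new() }
    }

    /// Reuse a freed index if one exists, otherwise grow the table.
    fn insert(&mut self, value: &'static str) -> Handle {
        let index = match self.free.pop() {
            Some(i) => i,
            None => {
                self.slots.push((0, None));
                (self.slots.len() - 1) as u32
            }
        };
        let slot = &mut self.slots[index as usize];
        slot.1 = Some(value);
        Handle { index, generation: slot.0 }
    }

    /// A handle only resolves while its generation matches the slot's.
    fn get(&self, h: Handle) -> Option<&'static str> {
        let (generation, value) = self.slots.get(h.index as usize)?;
        if *generation == h.generation { *value } else { None }
    }

    /// Freeing bumps the generation, which invalidates every old handle.
    fn remove(&mut self, h: Handle) {
        if self.get(h).is_none() {
            return;
        }
        let slot = &mut self.slots[h.index as usize];
        slot.0 = slot.0.wrapping_add(1);
        slot.1 = None;
        self.free.push(h.index);
    }
}

fn main() {
    let mut t = Table::new();
    let a = t.insert("first");
    t.remove(a);
    let b = t.insert("second");
    println!("stale lookup: {:?}, fresh lookup: {:?}", t.get(a), t.get(b));
}
```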
 // ---------------------------------------------------------------------------
 // RuntimeInner — the shared core behind an Arc
 // ---------------------------------------------------------------------------
 pub(crate) struct RuntimeInner {
    pub(crate) shared: Mutex<SharedState>,
    /// Try-lock: at most one scheduler thread drains timers/IO at a time.
    drain_lock: Mutex<()>,
    /// Per-thread stats, indexed by scheduler thread slot (0..N).
    pub(crate) stats: Vec<SchedulerStats>,
    /// Global counters for RFC 000 primitives.
    pub(crate) io_parked: AtomicU32,
    pub(crate) sleeping: AtomicU32,
    /// Preemption knobs, written into each scheduler thread's locals on startup.
    pub(crate) alloc_interval: u32,
    pub(crate) timeslice_cycles: u64,
 }
 impl RuntimeInner {
    fn new(thread_count: usize, alloc_interval: u32, timeslice_cycles: u64) -> Arc<Self> {
        let stats = (0..thread_count).map(|_| SchedulerStats::new()).collect();
        Arc::new(Self {
            shared: Mutex::new(SharedState::new()),
            drain_lock: Mutex::new(()),
            stats,
            io_parked: AtomicU32::new(0),
            sleeping: AtomicU32::new(0),
            alloc_interval,
            timeslice_cycles,
        })
    }
    pub(crate) fn with_shared<R>(&self, f: impl FnOnce(&mut SharedState) -> R) -> R {
        // Preemption must be off while we hold the shared mutex. If an actor
        // called with_shared (e.g. from spawn, join, sleep) and the allocator
        // fired maybe_preempt() while the lock was held, switch_to_scheduler()
        // would context-switch to the scheduler loop, which would immediately
        // deadlock trying to acquire the same mutex.
        let prev = crate::preempt::PREEMPTION_ENABLED.with(|c| c.replace(false));
        let result = f(&mut self.shared.lock().unwrap());
        crate::preempt::PREEMPTION_ENABLED.with(|c| c.set(prev));
        result
    }
 }
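 // Illustrative aside, not part of this patch: a stripped-down sketch of the
 // save/disable/restore pattern used by `with_shared` above. `PREEMPT_OK` and
 // `with_locked` are hypothetical names; the sketch only shows that the flag
 // is off for the duration of the critical section and restored afterwards.

```rust
use std::cell::Cell;
use std::sync::Mutex;

thread_local! {
    /// Hypothetical stand-in for `PREEMPTION_ENABLED`.
    static PREEMPT_OK: Cell<bool> = const { Cell::new(true) };
}

/// Run `f` with the mutex held and preemption disabled on this thread.
fn with_locked<T, R>(m: &Mutex<T>, f: impl FnOnce(&mut T) -> R) -> R {
    // Remember the previous value so callers that were already
    // non-preemptible stay that way afterwards.
    let prev = PREEMPT_OK.with(|c| c.replace(false));
    let result = f(&mut m.lock().unwrap());
    PREEMPT_OK.with(|c| c.set(prev));
    result
}

fn main() {
    let m = Mutex::new(0u32);
    let inside = with_locked(&m, |v| {
        *v += 1;
        PREEMPT_OK.with(|c| c.get())
    });
    let after = PREEMPT_OK.with(|c| c.get());
    println!("flag inside: {inside}, flag after: {after}");
}
```

 // As in `with_shared`, the flag is not restored if `f` panics; a drop guard
 // would be one way to cover that case.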
 // ---------------------------------------------------------------------------
 // Runtime — the public handle
 // ---------------------------------------------------------------------------
 pub struct Runtime {
    inner: Arc<RuntimeInner>,
    thread_count: usize,
 }
 /// Initialise the runtime with the given config. Returns a reusable handle.
 pub fn init(config: Config) -> Runtime {
    let n = config.resolved_thread_count();
    Runtime {
        inner: RuntimeInner::new(n, config.alloc_interval, config.timeslice_cycles),
        thread_count: n,
    }
 }
 impl Runtime {
    /// Run `f` as the initial actor, block until all actors finish.
    /// Can be called multiple times sequentially on the same `Runtime`.
    pub fn run(&self, f: impl FnOnce() + Send + 'static) {
        // Install smarm's panic hook on first call. The default Rust hook is
        // not reentrant — concurrent actor panics can trigger a double-panic
        // abort when the backtrace printer takes an internal lock that is
        // already held. smarm catches every actor panic via `catch_unwind` in
        // the trampoline, so panics never need to reach the hook for runtime
        // correctness; the hook fires only as a side-effect of unwinding before
        // `catch_unwind` catches it.
        //
        // We install once and leave it installed: the previous hook is chained
        // so that panics outside actor context (e.g. in the test harness
        // itself) are still reported normally.
        static HOOK_INSTALLED: std::sync::OnceLock<()> = std::sync::OnceLock::new();
        HOOK_INSTALLED.get_or_init(|| {
            let prev = std::panic::take_hook();
            std::panic::set_hook(Box::new(move |info| {
                // If we are currently executing inside an actor trampoline the
                // panic will be caught by `catch_unwind` momentarily. Suppress
                // the hook output to avoid interleaved noise and reentrancy.
                // Outside actor context, delegate to the previous hook so that
                // genuine runtime panics are still reported.
                if crate::actor::current_pid().is_some() {
                    // Inside an actor — catch_unwind handles it; stay silent.
                } else {
                    prev(info);
                }
            }));
        });
        // Open the trace store for this run (no-op without smarm-trace).
        #[cfg(feature = "smarm-trace")]
        crate::trace::open();
        // Re-initialise shared state for this run.
        {
            let mut s = self.inner.shared.lock().unwrap();
            assert!(s.run_queue.is_empty(), "run() called while previous run still active");
            s.root_pid = Some(ROOT_PID);
            s.io = Some(IoThread::start().expect("failed to start IO thread"));
        }
        // Spawn the initial actor through the public spawn path (which
        // requires a running runtime in the thread-local).
        RUNTIME.with(|r| *r.borrow_mut() = Some(self.inner.clone()));
        let initial_handle = crate::scheduler::spawn(f);
        // Launch N-1 extra scheduler threads. The calling thread is thread 0.
        let mut os_threads = Vec::new();
        for slot in 1..self.thread_count {
            let inner = self.inner.clone();
            let t = thread::spawn(move || {
                RUNTIME.with(|r| *r.borrow_mut() = Some(inner.clone()));
                SCHED_SLOT.with(|s| s.set(slot));
                schedule_loop(&inner, slot);
                RUNTIME.with(|r| *r.borrow_mut() = None);
            });
            os_threads.push(t);
        }
        // Thread 0 runs the loop on the calling thread.
        SCHED_SLOT.with(|s| s.set(0));
        schedule_loop(&self.inner, 0);
        // Wait for all other scheduler threads.
        for t in os_threads {
            let _ = t.join();
        }
        // Drop initial handle (decrements outstanding_handles count).
        drop(initial_handle);
        // Tear down IO and clean up shared state for the next run() call.
        let mut s = self.inner.shared.lock().unwrap();
        drop(s.io.take()); // joins IO threads
        s.pending_closures.clear();
        // Reset per-thread stats.
        for stat in &self.inner.stats {
            stat.current_pid_index.store(u32::MAX, Ordering::Relaxed);
            stat.run_queue_len.store(0, Ordering::Relaxed);
        }
        self.inner.io_parked.store(0, Ordering::Relaxed);
        self.inner.sleeping.store(0, Ordering::Relaxed);
        RUNTIME.with(|r| *r.borrow_mut() = None);
        // Flush trace to disk (no-op without smarm-trace).
        #[cfg(feature = "smarm-trace")]
        crate::trace::flush();
    }
    /// Snapshot of runtime statistics for introspection / tests.
    pub fn stats(&self) -> RuntimeStats {
        RuntimeStats { inner: self.inner.clone() }
    }
 }
 // ---------------------------------------------------------------------------
 // Thread-locals
 // ---------------------------------------------------------------------------
 use std::cell::{Cell, RefCell};
 thread_local! {
    /// The RuntimeInner for the current run(). Set by run() on the calling
    /// thread and by each spawned scheduler thread.
    pub(crate) static RUNTIME: RefCell<Option<Arc<RuntimeInner>>> =
        const { RefCell::new(None) };
    /// This scheduler thread's index into RuntimeInner::stats.
    static SCHED_SLOT: Cell<usize> = const { Cell::new(0) };
    /// What the actor wants when it yields back to the scheduler.
    static YIELD_INTENT: Cell<YieldIntent> = const { Cell::new(YieldIntent::Yield) };
 }
 #[derive(Copy, Clone)]
 pub(crate) enum YieldIntent { Yield, Park }
 pub(crate) fn set_yield_intent(i: YieldIntent) {
    YIELD_INTENT.with(|c| c.set(i));
 }
 // ---------------------------------------------------------------------------
 // Sentinel root PID
 // ---------------------------------------------------------------------------
 pub const ROOT_PID: Pid = Pid::new(u32::MAX, u32::MAX);
 // ---------------------------------------------------------------------------
 // Slot reclamation
 // ---------------------------------------------------------------------------
 pub(crate) fn reclaim_slot(s: &mut SharedState, pid: Pid) {
    let idx = pid.index();
    let slot = &mut s.slots[idx as usize];
    slot.generation = slot.generation.wrapping_add(1);
    slot.actor = None;
    slot.outcome = None;
    slot.waiters.clear();
    slot.supervisor_channel = None;
    slot.state = State::Done;
    slot.outstanding_handles = 0;
    slot.pending_unpark = false;
    slot.pending_io_result = None;
    s.free_list.push(idx);
 }
 // ---------------------------------------------------------------------------
 // finalize_actor
 // ---------------------------------------------------------------------------
 fn finalize_actor(inner: &Arc<RuntimeInner>, pid: Pid, outcome: Outcome) {
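    // The panic payload is not `Clone`, so it travels with the joiner outcome;
    // the supervisor signal carries a unit placeholder in its place.
    let (joiner_outcome, sup_signal) = match outcome {
        Outcome::Exit => (Outcome::Exit, Signal::Exit(pid)),
        Outcome::Panic(payload) => (
            Outcome::Panic(payload),
            Signal::Panic(pid, Box::new(()) as Box<dyn std::any::Any + Send>),
        ),
    };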
    let (waiters, supervisor_pid) = inner.with_shared(|s| {
        let slot = s.slot_mut(pid).expect("finalize_actor: slot vanished");
        let sup = slot.actor.as_ref().map(|a| a.supervisor);
        slot.outcome = Some(joiner_outcome);
        slot.state = State::Done;
        slot.actor = None;
        (std::mem::take(&mut slot.waiters), sup)
    });
    // Deliver to supervisor.
    if let Some(sup) = supervisor_pid {
        let sender = inner.with_shared(|s| {
            s.slot(sup).and_then(|slot| slot.supervisor_channel.clone())
        });
        if let Some(sender) = sender {
            let _ = sender.send(sup_signal);
        }
    }
    // Unpark joiners.
    for joiner in waiters {
        crate::scheduler::unpark(joiner);
    }
    // Reclaim if no outstanding handles.
    inner.with_shared(|s| {
        let reclaim = s.slot(pid).map(|slot| slot.outstanding_handles == 0).unwrap_or(false);
        if reclaim { reclaim_slot(s, pid); }
    });
 }
 // ---------------------------------------------------------------------------
 // schedule_loop — runs on each scheduler OS thread
 // ---------------------------------------------------------------------------
 fn schedule_loop(inner: &Arc<RuntimeInner>, slot: usize) {
    crate::preempt::configure_preempt(inner.alloc_interval, inner.timeslice_cycles);
    let stats = &inner.stats[slot];
    loop {
        // ----------------------------------------------------------------
        // 1. Try to win the drain lock (timers + IO). One winner per round;
        //    losers skip immediately and proceed to step 2.
        // ----------------------------------------------------------------
        if let Ok(_drain_guard) = inner.drain_lock.try_lock() {
            let now = std::time::Instant::now();
            // Drain due timers.
            let due = inner.with_shared(|s| s.timers.pop_due(now));
            for entry in due {
                match entry.reason {
                    crate::timer::Reason::Sleep => {
                        inner.with_shared(|s| {
                            if let Some(slot) = s.slot_mut(entry.pid) {
                                if matches!(slot.state, State::Parked) {
                                    slot.state = State::Runnable;
                                    s.run_queue.push_back(entry.pid);
                                    crate::te!(crate::trace::Event::Enqueue(entry.pid));
                                }
                            }
                        });
                    }
                    crate::timer::Reason::WaitTimeout { target, wait_seq } => {
                        // Runs outside with_shared — the callback may call unpark.
                        target.on_timeout(entry.pid, wait_seq);
                    }
                }
            }
            // Drain IO completions.
            let completions = inner.with_shared(|s| {
                s.io.as_mut().map(|io| io.drain_completions()).unwrap_or_default()
            });
            for completion in completions {
                match completion {
                    crate::io::Completion::Blocking { pid, result } => {
                        inner.with_shared(|s| {
                            if let Some(io) = s.io.as_mut() {
                                io.outstanding = io.outstanding.saturating_sub(1);
                            }
                            if let Some(slot) = s.slot_mut(pid) {
                                slot.pending_io_result = Some(result);
                                if matches!(slot.state, State::Parked) {
                                    slot.state = State::Runnable;
                                    s.run_queue.push_back(pid);
                                    crate::te!(crate::trace::Event::Enqueue(pid));
                                }
                            }
                        });
                    }
                    crate::io::Completion::FdReady { fd, events: _ } => {
                        inner.with_shared(|s| {
                            let parked_pid = s.io.as_mut().and_then(|io| {
                                let pid = io.waiters.remove(&fd);
                                io.epoll_deregister(fd);
                                pid
                            });
                            if let Some(pid) = parked_pid {
                                if let Some(slot) = s.slot_mut(pid) {
                                    match slot.state {
                                        State::Parked => {
                                            slot.state = State::Runnable;
                                            s.run_queue.push_back(pid);
                                            crate::te!(crate::trace::Event::UnparkDirect(pid));
                                            crate::te!(crate::trace::Event::Enqueue(pid));
                                        }
                                        // Actor is between epoll_register
                                        // and park_current. Set the flag so
                                        // the upcoming Park yield re-queues
                                        // instead of suspending. Mirrors
                                        // scheduler::unpark().
                                        State::Runnable => {
                                            slot.pending_unpark = true;
                                            crate::te!(crate::trace::Event::UnparkDeferred(pid));
                                        }
                                        State::Done => {}
                                    }
                                }
                            }
                        });
                    }
                }
            }
        } // drain_guard drops here
        // ----------------------------------------------------------------
        // 2. Pop a runnable actor from the shared queue.
        // ----------------------------------------------------------------
        let pid = match inner.with_shared(|s| {
            let len = s.run_queue.len() as u64;
            stats.run_queue_len.store(len, Ordering::Relaxed);
            s.run_queue.pop_front()
        }) {
            Some(p) => {
                crate::te!(crate::trace::Event::Dequeue(p));
                p
            }
            None => {
                // Queue was empty when we popped. Re-examine under the lock to
                // decide whether to exit or wait. All four conditions must hold
                // simultaneously before we exit:
                //   1. run queue is still empty
                //   2. no live actors (nothing parked, nothing mid-finalize)
                //   3. no pending timers
                //   4. no outstanding IO
                // If any condition fails we sleep on the relevant wake source
                // and retry: "check the fridge is empty before you leave for
                // the airport".
                let (next_deadline, io_outstanding, wake_fd, all_clear) =
                    inner.with_shared(|s| {
                        let next = s.timers.peek_deadline();
                        let (out, fd) = match s.io.as_ref() {
                            Some(io) => (
                                io.outstanding + io.waiters.len() as u32,
                                Some(io.wake_fd()),
                            ),
                            None => (0, None),
                        };
                        let live = s.slots.iter().filter(|slot| slot.actor.is_some()).count();
                        let queue_empty = s.run_queue.is_empty();
                        let all_clear = queue_empty && live == 0 && next.is_none() && out == 0;
                        (next, out, fd, all_clear)
                    });
                if all_clear {
                    return;
                }
                // Something is still in flight. Sleep on the appropriate source
                // to avoid hammering the mutex; the loop will retry on wake.
                match (next_deadline, wake_fd) {
                    (Some(deadline), fd_opt) => {
                        let now = std::time::Instant::now();
                        if deadline > now {
                            let timeout = deadline - now;
                            match fd_opt {
                                Some(fd) => {
                                    crate::io::poll_wake(fd, Some(timeout));
                                    crate::io::drain_wake_pipe(fd);
                                }
                                None => thread::sleep(timeout),
                            }
                        }
                    }
                    (None, Some(fd)) if io_outstanding > 0 => {
                        crate::io::poll_wake(fd, None);
                        crate::io::drain_wake_pipe(fd);
                    }
                    _ => {
                        thread::sleep(std::time::Duration::from_micros(100));
                    }
                }
                continue;
            }
        };
        // ----------------------------------------------------------------
        // 3. Resume the actor.
        // ----------------------------------------------------------------
        let sp = match inner.with_shared(|s| {
            s.slot(pid).and_then(|slot| slot.actor.as_ref().map(|a| a.sp))
        }) {
            Some(sp) => sp,
            None => {
                continue; // stale pid
            }
        };
        // First resume: move the closure into the trampoline's thread-local.
        if let Some(b) = inner.with_shared(|s| s.pop_pending_closure(pid)) {
            set_current_actor_box(b);
        }
        // Update per-thread stats: record who's on-CPU.
        stats.current_pid_index.store(pid.index(), Ordering::Relaxed);
        set_actor_sp(sp);
        set_current_pid(pid);
        reset_actor_done();
        YIELD_INTENT.with(|c| c.set(YieldIntent::Yield));
        crate::preempt::reset_timeslice();
        PREEMPTION_ENABLED.with(|c| c.set(true));
        crate::te!(crate::trace::Event::Resume(pid));
        unsafe { switch_to_actor() };
        PREEMPTION_ENABLED.with(|c| c.set(false));
        stats.current_pid_index.store(u32::MAX, Ordering::Relaxed);
        clear_current_pid();
        let intent = YIELD_INTENT.with(|c| c.get());
        let new_sp = get_actor_sp();
        if is_actor_done() {
            crate::te!(crate::trace::Event::Done(pid));
            let outcome = take_last_outcome().unwrap_or(Outcome::Exit);
            finalize_actor(inner, pid, outcome);
        } else {
            inner.with_shared(|s| {
                if let Some(slot) = s.slot_mut(pid) {
                    if let Some(actor) = slot.actor.as_mut() {
                        actor.sp = new_sp;
                    }
                    match intent {
                        YieldIntent::Yield => {
                            crate::te!(crate::trace::Event::Yield(pid));
                            slot.state = State::Runnable;
                            s.run_queue.push_back(pid);
                            crate::te!(crate::trace::Event::Enqueue(pid));
                        }
                        YieldIntent::Park => {
                            // Check if unpark() fired while the actor was
                            // still running (between registering in the
                            // channel and calling park_current). If so,
                            // re-queue immediately instead of parking.
                            if slot.pending_unpark {
                                slot.pending_unpark = false;
                                slot.state = State::Runnable;
                                s.run_queue.push_back(pid);
                                crate::te!(crate::trace::Event::UnparkFlagConsumed(pid));
                                crate::te!(crate::trace::Event::Enqueue(pid));
                            } else {
                                crate::te!(crate::trace::Event::Park(pid));
                                slot.state = State::Parked;
                            }
                        }
                    }
                }
            });
        }
    }
 }
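 // Illustrative aside, not part of this patch: a minimal sketch of the
 // "try-lock, one winner" drain from step 1 of `schedule_loop`. `try_drain`
 // and the `pending` queue are hypothetical; the real loop drains timers and
 // IO completions rather than a vector of integers.

```rust
use std::sync::Mutex;

/// Drain `pending` if this caller wins `drain_lock`; otherwise skip at once.
fn try_drain(drain_lock: &Mutex<()>, pending: &Mutex<Vec<u32>>) -> Option<Vec<u32>> {
    // Losers (and callers that find the lock poisoned) do not wait.
    let _guard = match drain_lock.try_lock() {
        Ok(g) => g,
        Err(_) => return None,
    };
    // The winner takes everything that is currently queued.
    let mut queue = pending.lock().unwrap();
    Some(std::mem::take(&mut *queue))
}

fn main() {
    let drain_lock = Mutex::new(());
    let pending = Mutex::new(vec![1, 2, 3]);

    // Simulate another scheduler thread holding the drain lock.
    let held = drain_lock.lock().unwrap();
    println!("while held: {:?}", try_drain(&drain_lock, &pending));
    drop(held);

    println!("after release: {:?}", try_drain(&drain_lock, &pending));
}
```

 // Because losers return immediately, a contended drain never delays a thread
 // that could otherwise be running an actor.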
--- a/src/scheduler.rs
+++ b/src/scheduler.rs
@@ -1,200 +1,75 @@
-//! The single-threaded scheduler.
+//! Scheduler public API — thin façade over the multi-scheduler runtime.
 //!
-//! There is one global scheduler per OS thread, stored in a thread-local.
-//! `run(initial)` initialises it, spawns the initial actor, drives the loop
-//! until the run queue is empty, then tears it down.
+//! All heavy lifting lives in `runtime.rs`. This module exposes the same
+//! surface that the rest of the codebase (channel, mutex, io, timer, actor)
+//! calls into, plus the public API re-exported from `lib.rs`.
 //!
-//! Slot table: a `Vec<Slot>` indexed by `Pid::index()`, with a free list of
-//! reusable indices. Each slot has a `generation` counter that increments
+//! The single-threaded `run()` entry point is kept as a convenience wrapper
+//! around `runtime::init(Config::exact(1)).run(f)`.
 //! every time the slot is freed; `Pid` carries the generation it was minted
 //! with, so a stale PID has a mismatching generation and is detected on
 //! lookup.
 //!
 //! Run queue: a `VecDeque<Pid>` of runnable actors. The state of an actor
 //! is implicit in slot.state: `Runnable` means it's either in the queue or
 //! currently executing; `Parked` means it's waiting for something to unpark
 //! it (channel send, join completion, …); `Done` means it has finished and
 //! is awaiting reaping.
 //!
 //! Joining: `JoinHandle::join()` parks the calling actor and registers it
 //! on the target slot's `waiters` list. When the target actor finishes,
 //! the scheduler reaps the slot and unparks every waiter, passing them the
 //! outcome via a side channel (the target's `outcome` field, drained on
 //! the joiner side).
-use crate::actor::{
+use crate::actor::current_pid;
    clear_current_pid, current_pid, is_actor_done, reset_actor_done,
    set_current_actor_box, set_current_pid, take_last_outcome, trampoline, Actor, Outcome,
 };
 use crate::channel::Sender;
 use crate::context::{get_actor_sp, init_actor_stack, set_actor_sp, switch_to_actor};
 use crate::pid::Pid;
-use crate::preempt::PREEMPTION_ENABLED;
+use crate::runtime::{
-use crate::stack::Stack;
+    self, RuntimeInner, YieldIntent, RUNTIME,
 };
 use crate::supervisor::Signal;
-use std::cell::RefCell;
+use std::sync::Arc;
 use std::collections::VecDeque;
 // ---------------------------------------------------------------------------
-// Configuration
+// with_runtime / try_with_runtime
 // ---------------------------------------------------------------------------
-const ACTOR_STACK_SIZE: usize = 64 * 1024;
+/// Borrow the current runtime. Panics if called outside `Runtime::run()`.
-
+pub(crate) fn with_runtime<R>(f: impl FnOnce(&Arc<RuntimeInner>) -> R) -> R {
-// ---------------------------------------------------------------------------
+    RUNTIME.with(|r| {
-// Per-actor slot
+        let b = r.borrow();
-// ---------------------------------------------------------------------------
+        let inner = b.as_ref().expect("smarm: not inside Runtime::run()");
-
+        f(inner)
 enum State {
    /// Either in the run queue or currently executing.
    Runnable,
    /// Removed from the queue, waiting for `unpark()`.
    Parked,
    /// The actor has finished. Slot persists until the last `JoinHandle`
    /// has been joined (or dropped). Then the slot is freed.
    Done,
 }
 struct Slot {
    /// Bumped every time this slot is freed and re-used. A `Pid` with a
    /// non-matching generation is stale.
    generation: u32,
    /// `None` when the slot is free. `Some` otherwise.
    actor: Option<Actor>,
    state: State,
    /// PIDs waiting in `JoinHandle::join`.
    waiters: Vec<Pid>,
    /// The outcome the actor produced, captured when it finished.
    /// Drained by `JoinHandle::join`.
    outcome: Option<Outcome>,
    /// If this slot is a supervisor, the sender into its `Signal` mailbox.
    /// Cloned out and used when one of its children dies.
    supervisor_channel: Option<Sender<Signal>>,
    /// Number of `JoinHandle`s still outstanding for this actor. The slot
    /// is reclaimed only when the actor is done AND outstanding_handles == 0.
    outstanding_handles: u32,
 }
 impl Slot {
    fn vacant() -> Self {
        Self {
            generation: 0,
            actor: None,
            state: State::Done,
            waiters: Vec::new(),
            outcome: None,
            supervisor_channel: None,
            outstanding_handles: 0,
        }
    }
 }
 // ---------------------------------------------------------------------------
 // Scheduler state
 // ---------------------------------------------------------------------------
 struct SchedulerState {
    slots: Vec<Slot>,
    free_list: Vec<u32>,
    run_queue: VecDeque<Pid>,
    /// The root supervisor's PID. Children spawned at the top level are
    /// supervised by this. Set by `run()`.
    root_pid: Option<Pid>,
    /// Pending sleep timers. Min-heap keyed by deadline.
    timers: crate::timer::Timers,
 }
 impl SchedulerState {
    fn new() -> Self {
        Self {
            slots: Vec::new(),
            free_list: Vec::new(),
            run_queue: VecDeque::new(),
            root_pid: None,
            timers: crate::timer::Timers::new(),
        }
    }
    /// Allocate a slot; return its (index, generation).
    fn allocate_slot(&mut self) -> (u32, u32) {
        if let Some(idx) = self.free_list.pop() {
            let s = &mut self.slots[idx as usize];
            (idx, s.generation)
        } else {
            let idx = self.slots.len() as u32;
            self.slots.push(Slot::vacant());
            (idx, 0)
        }
    }
    fn slot(&self, pid: Pid) -> Option<&Slot> {
        let s = self.slots.get(pid.index() as usize)?;
        if s.generation == pid.generation() { Some(s) } else { None }
    }
    fn slot_mut(&mut self, pid: Pid) -> Option<&mut Slot> {
        let s = self.slots.get_mut(pid.index() as usize)?;
        if s.generation == pid.generation() { Some(s) } else { None }
    }
 }
 thread_local! {
    static SCHED: RefCell<Option<SchedulerState>> = const { RefCell::new(None) };
 }
 fn with_sched<R>(f: impl FnOnce(&mut SchedulerState) -> R) -> R {
    SCHED.with(|c| {
        let mut g = c.borrow_mut();
        let s = g.as_mut().expect("scheduler not running");
        f(s)
    })
 }
-/// Same as `with_sched` but returns `None` when there's no scheduler instead
+/// Borrow the runtime if present; returns `None` otherwise.
-/// of panicking. Used on cleanup paths (channel sender drop during shutdown,
+/// Used on cleanup paths (channel Drop during teardown).
-/// for example).
+pub(crate) fn try_with_runtime<R>(f: impl FnOnce(&Arc<RuntimeInner>) -> R) -> Option<R> {
-fn try_with_sched<R>(f: impl FnOnce(&mut SchedulerState) -> R) -> Option<R> {
+    RUNTIME.with(|r| r.borrow().as_ref().map(|inner| f(inner)))
    SCHED.with(|c| {
        let mut g = c.borrow_mut();
        g.as_mut().map(f)
    })
 }
 // ---------------------------------------------------------------------------
-// JoinHandle
+// JoinHandle / JoinError
 // ---------------------------------------------------------------------------
 #[derive(Debug)]
 pub struct JoinError {
    /// Whatever `panic!` was called with.
    pub payload: Box<dyn std::any::Any + Send>,
 }
 pub struct JoinHandle {
    pid: Pid,
    /// `true` once `join()` has been called and the handle has consumed
    /// its outcome. Prevents the Drop impl from double-decrementing.
    consumed: bool,
 }
 impl JoinHandle {
    pub fn pid(&self) -> Pid { self.pid }
    /// Block the calling actor until the target completes. Returns
    /// `Ok(())` on normal exit, `Err(JoinError)` if the target panicked.
    pub fn join(mut self) -> Result<(), JoinError> {
        use crate::actor::Outcome;
        use crate::runtime::State; // need State visibility
        let me = current_pid().expect("join() called outside an actor");
        loop {
-            let outcome = with_sched(|s| {
+            let outcome = with_runtime(|inner| {
                inner.with_shared(|s| {
                    let slot = s.slot_mut(self.pid)
                        .expect("join: target slot has been reused");
                    if matches!(slot.state, State::Done) {
-                    Some(slot.outcome.take().expect("Done slot must have an outcome"))
+                        Some(slot.outcome.take().expect("Done slot must have outcome"))
                    } else {
                        slot.waiters.push(me);
                        None
                    }
                })
            });
            match outcome {
@@ -206,23 +81,30 @@ impl JoinHandle {
                        Outcome::Panic(p) => Err(JoinError { payload: p }),
                    };
                }
-                None => park_current(),
+                None => {
                    let _np = NoPreempt::enter();
                    park_current();
                }
            }
        }
    }
    fn decrement_handle_count(&mut self) {
-        with_sched(|s| {
+        with_runtime(|inner| {
            inner.with_shared(|s| {
                let should_reclaim = match s.slot_mut(self.pid) {
                    Some(slot) => {
-                    slot.outstanding_handles = slot.outstanding_handles.saturating_sub(1);
+                        slot.outstanding_handles =
-                    matches!(slot.state, State::Done) && slot.outstanding_handles == 0
+                            slot.outstanding_handles.saturating_sub(1);
                        matches!(slot.state, crate::runtime::State::Done)
                            && slot.outstanding_handles == 0
                    }
                    None => false,
                };
                if should_reclaim {
-                reclaim_slot(s, self.pid);
+                    crate::runtime::reclaim_slot(s, self.pid);
                }
            })
        });
    }
 }
@@ -230,345 +112,238 @@ impl JoinHandle {
 impl Drop for JoinHandle {
    fn drop(&mut self) {
        if !self.consumed {
            // May be called outside run() if handle is dropped after teardown.
            if try_with_runtime(|_| ()).is_some() {
                self.decrement_handle_count();
            }
        }
-}
+    }
 // ---------------------------------------------------------------------------
 // Slot reclamation
 // ---------------------------------------------------------------------------
 fn reclaim_slot(s: &mut SchedulerState, pid: Pid) {
    let idx = pid.index();
    let slot = &mut s.slots[idx as usize];
    // Bump generation so any stale PIDs from now on miss.
    slot.generation = slot.generation.wrapping_add(1);
    // Drop the actor (its stack with it).
    slot.actor = None;
    slot.outcome = None;
    slot.waiters.clear();
    slot.supervisor_channel = None;
    slot.state = State::Done; // semantically vacant; allocator checks free_list
    slot.outstanding_handles = 0;
    s.free_list.push(idx);
 }
 // ---------------------------------------------------------------------------
 // spawn / spawn_under / self_pid
 // ---------------------------------------------------------------------------
 /// Spawn `f` as a child of the currently-executing actor.
 /// Outside an actor (only legal from `run()`'s initial setup), the child's
 /// supervisor is the root supervisor.
 pub fn spawn(f: impl FnOnce() + Send + 'static) -> JoinHandle {
    let parent = current_pid()
-        .or_else(|| with_sched(|s| s.root_pid))
+        .or_else(|| with_runtime(|inner| inner.with_shared(|s| s.root_pid)))
        .expect("spawn() before run()");
    spawn_under(parent, f)
 }
 /// Spawn `f` with `supervisor` as its parent. The supervisor will receive
 /// a `Signal` on its registered channel when the child terminates.
 pub fn spawn_under(supervisor: Pid, f: impl FnOnce() + Send + 'static) -> JoinHandle {
-    let pid = with_sched(|s| {
+    let pid = with_runtime(|inner| {
        inner.with_shared(|s| {
            let (idx, gen) = s.allocate_slot();
            let pid = Pid::new(idx, gen);
-        let stack = Stack::new(ACTOR_STACK_SIZE)
+            let stack = crate::stack::Stack::new(crate::runtime::ACTOR_STACK_SIZE)
                .expect("stack allocation failed");
-        let sp = init_actor_stack(stack.top(), trampoline);
+            let sp = init_actor_stack(stack.top(), crate::actor::trampoline);
            let slot = &mut s.slots[idx as usize];
-        slot.actor = Some(Actor { pid, stack, sp, supervisor });
+            slot.actor = Some(crate::actor::Actor { pid, stack, sp, supervisor });
-        slot.state = State::Runnable;
+            slot.state = crate::runtime::State::Runnable;
            slot.outstanding_handles = 1;
            slot.outcome = None;
            slot.waiters.clear();
            slot.supervisor_channel = None;
            slot.pending_unpark = false;
            slot.pending_io_result = None;
            s.run_queue.push_back(pid);
            s.pending_closures.push((pid, Box::new(f) as crate::runtime::Closure));
            crate::te!(crate::trace::Event::Spawn { parent: supervisor, child: pid });
            crate::te!(crate::trace::Event::Enqueue(pid));
            pid
-    });
+        })
    // Stash the closure where `schedule_loop` will find it before the first
    // resume.
    PENDING_CLOSURES.with(|c| {
        c.borrow_mut().push((pid, Box::new(f) as Closure));
    });
    JoinHandle { pid, consumed: false }
 }
-type Closure = Box<dyn FnOnce() + Send>;
+use crate::context::init_actor_stack;
 thread_local! {
    /// Closures awaiting their first resume. Keyed by the PID the scheduler
    /// allocated for them in `spawn_under`. The scheduler pops from here in
    /// `pop_pending_closure` right before each first resume.
    static PENDING_CLOSURES: RefCell<Vec<(Pid, Closure)>> = const { RefCell::new(Vec::new()) };
 }
 fn pop_pending_closure(pid: Pid) -> Option<Closure> {
    PENDING_CLOSURES.with(|c| {
        let mut v = c.borrow_mut();
        v.iter().position(|(p, _)| *p == pid).map(|i| v.swap_remove(i).1)
    })
 }
 pub fn self_pid() -> Pid {
    current_pid().expect("self_pid() called outside an actor")
 }
 // ---------------------------------------------------------------------------
-// yield_now / park / unpark
+// yield_now / park_current / unpark
 // ---------------------------------------------------------------------------
 /// Cooperative yield. The current actor goes to the back of the run queue.
 pub fn yield_now() {
-    // Mark ourselves as needing to be re-queued, then yield.
+    runtime::set_yield_intent(YieldIntent::Yield);
    YIELD_INTENT.with(|c| c.set(YieldIntent::Yield));
    unsafe { crate::context::switch_to_scheduler() };
 }
 /// Park the current actor (remove it from the run queue until `unpark`).
 pub fn park_current() {
-    YIELD_INTENT.with(|c| c.set(YieldIntent::Park));
+    runtime::set_yield_intent(YieldIntent::Park);
    unsafe { crate::context::switch_to_scheduler() };
 }
-/// Park the current actor for at least `duration`. A zero duration behaves
+pub fn unpark(pid: Pid) {
-/// like `yield_now` (the deadline is immediately in the past, so the timer
+    let result = try_with_runtime(|inner| {
-/// pops on the next scheduler iteration).
+        inner.with_shared(|s| {
            if let Some(slot) = s.slot_mut(pid) {
                match slot.state {
                    crate::runtime::State::Parked => {
                        // Actor is suspended — safe to re-queue immediately.
                        slot.state = crate::runtime::State::Runnable;
                        s.run_queue.push_back(pid);
                        crate::te!(crate::trace::Event::UnparkDirect(pid));
                        crate::te!(crate::trace::Event::Enqueue(pid));
                    }
                    crate::runtime::State::Runnable => {
                        // Actor is still running (between registering its
                        // parked_receiver and calling park_current). Set the
                        // flag; the scheduler will re-queue after the Park
                        // yield instead of sleeping.
                        slot.pending_unpark = true;
                        crate::te!(crate::trace::Event::UnparkDeferred(pid));
                    }
                    crate::runtime::State::Done => {}
                }
            }
        })
    });
    let _ = result;
 }
 // ---------------------------------------------------------------------------
 // NoPreempt
 // ---------------------------------------------------------------------------
 pub struct NoPreempt(bool);
 impl NoPreempt {
    pub fn enter() -> Self {
        let prev = crate::preempt::PREEMPTION_ENABLED.with(|c| c.replace(false));
        NoPreempt(prev)
    }
 }
 impl Drop for NoPreempt {
    fn drop(&mut self) {
        crate::preempt::PREEMPTION_ENABLED.with(|c| c.set(self.0));
    }
 }
 // ---------------------------------------------------------------------------
 // sleep / insert_wait_timer
 // ---------------------------------------------------------------------------
 pub fn sleep(duration: std::time::Duration) {
    let me = current_pid().expect("sleep() called outside an actor");
    let _np = NoPreempt::enter();
    let deadline = crate::timer::deadline_from_now(duration);
-    with_sched(|s| s.timers.insert(deadline, me));
+    with_runtime(|inner| inner.with_shared(|s| s.timers.insert_sleep(deadline, me)));
    park_current();
 }
-/// Wake a parked actor. If the actor isn't parked (already runnable or done)
+pub fn insert_wait_timer(
-/// this is a no-op — that's important; channel and join can both fire
+    deadline: std::time::Instant,
-/// spurious unparks under some orderings and we want them to be cheap.
+    pid: Pid,
-/// Also a no-op if the scheduler isn't running (covers channel-sender drop
+    target: std::sync::Arc<dyn crate::timer::TimerTarget>,
-/// during runtime teardown).
+    wait_seq: u64,
-pub fn unpark(pid: Pid) {
+) {
-    try_with_sched(|s| {
+    with_runtime(|inner| {
-        if let Some(slot) = s.slot_mut(pid) {
+        inner.with_shared(|s| {
-            if matches!(slot.state, State::Parked) {
+            s.timers.insert(
-                slot.state = State::Runnable;
+                deadline,
-                s.run_queue.push_back(pid);
+                pid,
-            }
+                crate::timer::Reason::WaitTimeout { target, wait_seq },
-        }
+            );
        })
    });
 }
-/// What an actor wants the scheduler to do when control returns from it.
+// ---------------------------------------------------------------------------
-#[derive(Copy, Clone)]
+// block_on_io / wait_readable / wait_writable / read / write
-enum YieldIntent {
+// ---------------------------------------------------------------------------
-    /// Re-queue (yield_now or preemption).
+
-    Yield,
+pub fn block_on_io<F, T>(f: F) -> T
-    /// Remove from the run queue (waiting for unpark).
+where
-    Park,
+    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
 {
    let me = current_pid().expect("block_on_io() called outside an actor");
    let work: Box<dyn FnOnce() -> crate::io::IoResult + Send> = Box::new(move || {
        let v: T = f();
        Ok(Box::new(v) as Box<dyn std::any::Any + Send>)
    });
    {
        let _np = NoPreempt::enter();
        with_runtime(|inner| inner.with_shared(|s| {
            let io = s.io.as_mut().expect("io thread not started");
            io.submit(me, work);
        }));
        park_current();
    }
    let result = with_runtime(|inner| inner.with_shared(|s| {
        s.slot_mut(me)
            .expect("block_on_io: own slot vanished")
            .pending_io_result
            .take()
            .expect("block_on_io: resumed without a result")
    }));
    match result {
        Ok(any) => *any.downcast::<T>().expect("block_on_io: type mismatch"),
        Err(payload) => std::panic::resume_unwind(payload),
    }
 }
-thread_local! {
+pub fn wait_readable(fd: std::os::fd::RawFd) -> std::io::Result<()> {
-    static YIELD_INTENT: std::cell::Cell<YieldIntent> = const { std::cell::Cell::new(YieldIntent::Yield) };
+    wait_fd(fd, true, false)
 }
 pub fn wait_writable(fd: std::os::fd::RawFd) -> std::io::Result<()> {
    wait_fd(fd, false, true)
 }
 fn wait_fd(fd: std::os::fd::RawFd, readable: bool, writable: bool) -> std::io::Result<()> {
    let me = current_pid().expect("wait_*() called outside an actor");
    let _np = NoPreempt::enter();
    with_runtime(|inner| inner.with_shared(|s| {
        let io = s.io.as_mut().expect("io thread not started");
        io.epoll_register(fd, me, readable, writable)
    }))?;
    park_current();
    Ok(())
 }
 pub fn read(fd: std::os::fd::RawFd, buf: &mut [u8]) -> std::io::Result<usize> {
    wait_readable(fd)?;
    let n = unsafe { libc::read(fd, buf.as_mut_ptr() as *mut _, buf.len()) };
    if n < 0 { Err(std::io::Error::last_os_error()) } else { Ok(n as usize) }
 }
 pub fn write(fd: std::os::fd::RawFd, buf: &[u8]) -> std::io::Result<usize> {
    wait_writable(fd)?;
    let n = unsafe { libc::write(fd, buf.as_ptr() as *const _, buf.len()) };
    if n < 0 { Err(std::io::Error::last_os_error()) } else { Ok(n as usize) }
 }
 // ---------------------------------------------------------------------------
-// Supervisor channel registration
+// register_supervisor_channel
 // ---------------------------------------------------------------------------
 /// Register `sender` as the mailbox for signals about children supervised
 /// by `pid`. Idempotent; later calls overwrite.
 pub fn register_supervisor_channel(pid: Pid, sender: Sender<Signal>) {
-    with_sched(|s| {
+    with_runtime(|inner| inner.with_shared(|s| {
        if let Some(slot) = s.slot_mut(pid) {
            slot.supervisor_channel = Some(sender);
        } else {
            panic!("register_supervisor_channel: pid {:?} not found", pid);
        }
-    });
+    }));
 }
 // ---------------------------------------------------------------------------
-// run() — the runtime entry point
+// Legacy run() — convenience wrapper
 // ---------------------------------------------------------------------------
-/// Boot the runtime, spawn `initial` as a child of the root supervisor,
+/// Single-threaded runtime entry point (backwards-compatible wrapper).
-/// drive the scheduler until the run queue is empty, tear down.
+/// Equivalent to `runtime::init(Config::exact(1)).run(f)`.
-///
+pub fn run<F: FnOnce() + Send + 'static>(f: F) {
-/// The root supervisor is a *sentinel* PID, not a real actor. Signals
+    crate::runtime::init(crate::runtime::Config::exact(1)).run(f);
 /// addressed to it are dropped on the floor — that's what "process exits"
 /// means in the spec when nothing escalates further. User code that wants
 /// real supervision spawns its own supervisor actor and uses `spawn_under`.
 pub fn run<F: FnOnce() + Send + 'static>(initial: F) {
    SCHED.with(|c| {
        assert!(c.borrow().is_none(), "smarm::run() called recursively");
        let mut state = SchedulerState::new();
        state.root_pid = Some(ROOT_PID);
        *c.borrow_mut() = Some(state);
    });
    let initial_handle = spawn(initial);
    schedule_loop();
    // Drop the handle BEFORE the scheduler is torn down — its Drop impl
    // calls `with_sched` to decrement the outstanding-handle count.
    drop(initial_handle);
    // Take the SchedulerState out of the thread-local BEFORE dropping it.
    // Dropping it while still inside SCHED.with's RefCell borrow would
    // re-enter (via channel senders' Drop → unpark → try_with_sched).
    let state = SCHED.with(|c| c.borrow_mut().take());
    drop(state);
    PENDING_CLOSURES.with(|c| c.borrow_mut().clear());
 }
 /// Reserved sentinel pid for the root supervisor. Never allocated to a
 /// real actor; lookups return `None`; signals are dropped.
 pub const ROOT_PID: Pid = Pid::new(u32::MAX, u32::MAX);
 fn schedule_loop() {
    loop {
        // 1. Drain due timers into the run queue.
        let now = std::time::Instant::now();
        let due = with_sched(|s| s.timers.pop_due(now));
        for pid in due {
            // Same idempotency as `unpark`: only re-queue if still parked.
            with_sched(|s| {
                if let Some(slot) = s.slot_mut(pid) {
                    if matches!(slot.state, State::Parked) {
                        slot.state = State::Runnable;
                        s.run_queue.push_back(pid);
                    }
                }
            });
        }
        // 2. Pop a runnable actor. If none, sleep on the soonest timer or
        // exit if there isn't one.
        let pid = match with_sched(|s| s.run_queue.pop_front()) {
            Some(p) => p,
            None => {
                let next = with_sched(|s| s.timers.peek_deadline());
                match next {
                    Some(deadline) => {
                        let now = std::time::Instant::now();
                        if deadline > now {
                            // No other thread can wake us; plain sleep is
                            // correct. When the IO thread lands in v0.2
                            // this becomes a Condvar / pipe wakeup.
                            std::thread::sleep(deadline - now);
                        }
                        continue;
                    }
                    None => return, // no runnables, no timers — done.
                }
            }
        };
        // Look up sp; skip stale or already-reaped pids.
        let sp = match with_sched(|s| {
            s.slot(pid).and_then(|slot| slot.actor.as_ref().map(|a| a.sp))
        }) {
            Some(sp) => sp,
            None => continue,
        };
        // If this is a first resume, move the pending closure to the
        // thread-local the trampoline reads.
        if let Some(b) = pop_pending_closure(pid) {
            set_current_actor_box(b);
        }
        set_actor_sp(sp);
        set_current_pid(pid);
        reset_actor_done();
        YIELD_INTENT.with(|c| c.set(YieldIntent::Yield));
        crate::preempt::reset_timeslice();
        PREEMPTION_ENABLED.with(|c| c.set(true));
        unsafe { switch_to_actor() };
        PREEMPTION_ENABLED.with(|c| c.set(false));
        clear_current_pid();
        let intent = YIELD_INTENT.with(|c| c.get());
        let new_sp = get_actor_sp();
        if is_actor_done() {
            let outcome = take_last_outcome().unwrap_or(Outcome::Exit);
            finalize_actor(pid, outcome);
        } else {
            with_sched(|s| {
                if let Some(slot) = s.slot_mut(pid) {
                    if let Some(actor) = slot.actor.as_mut() {
                        actor.sp = new_sp;
                    }
                    match intent {
                        YieldIntent::Yield => {
                            slot.state = State::Runnable;
                            s.run_queue.push_back(pid);
                        }
                        YieldIntent::Park => {
                            slot.state = State::Parked;
                        }
                    }
                }
            });
        }
    }
 }
 fn finalize_actor(pid: Pid, outcome: Outcome) {
    // Joiners get the typed Result with the panic payload. The supervisor
    // gets an informational `Signal::Panic` with an empty payload — its job
    // is policy (restart/escalate), not forensics. Users who need the
    // payload in supervision can plumb their own channel.
    let (joiner_outcome, sup_signal) = match outcome {
        Outcome::Exit             => (Outcome::Exit, Signal::Exit(pid)),
        Outcome::Panic(payload)   => (
            Outcome::Panic(payload),
            Signal::Panic(pid, Box::new(()) as Box<dyn std::any::Any + Send>),
        ),
    };
    // Stash outcome, mark Done, collect waiters, drop the actor stack.
    let (waiters, supervisor_pid) = with_sched(|s| {
        let slot = s.slot_mut(pid).expect("finalize_actor: slot vanished");
        let sup = slot.actor.as_ref().map(|a| a.supervisor);
        slot.outcome = Some(joiner_outcome);
        slot.state = State::Done;
        slot.actor = None;
        let w = std::mem::take(&mut slot.waiters);
        (w, sup)
    });
    // Deliver to supervisor (best-effort; ignore SendError).
    if let Some(sup) = supervisor_pid {
        let sender = with_sched(|s| {
            s.slot(sup).and_then(|slot| slot.supervisor_channel.clone())
        });
        if let Some(sender) = sender {
            let _ = sender.send(sup_signal);
        }
    }
    // Unpark joiners.
    for joiner in waiters {
        unpark(joiner);
    }
    // Reclaim if no outstanding handles.
    with_sched(|s| {
        let should_reclaim = match s.slot(pid) {
            Some(slot) => slot.outstanding_handles == 0,
            None => false,
        };
        if should_reclaim {
            reclaim_slot(s, pid);
        }
    });
 }
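The module docs above describe a slot table in which each `Pid` carries the generation it was minted with, so a PID that outlives its slot fails the lookup. A minimal sketch of that idea follows. It assumes simplified stand-in types: `Pid`, `Slot`, and `Table` here are hypothetical and are not the crate's real definitions, and the `occupied` flag replaces the real `actor: Option<Actor>` field.

```rust
// Illustrative sketch only: `Pid`, `Slot`, and `Table` are simplified
// stand-ins, not the types defined in this crate.
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
struct Pid {
    index: u32,
    generation: u32,
}

struct Slot {
    /// Bumped every time the slot is reclaimed.
    generation: u32,
    /// Stand-in for `actor: Option<Actor>` in the real slot.
    occupied: bool,
}

#[derive(Default)]
struct Table {
    slots: Vec<Slot>,
    free_list: Vec<u32>,
}

impl Table {
    /// Reuse a free index if one exists, otherwise grow the table.
    fn allocate(&mut self) -> Pid {
        let index = match self.free_list.pop() {
            Some(i) => i,
            None => {
                self.slots.push(Slot { generation: 0, occupied: false });
                (self.slots.len() - 1) as u32
            }
        };
        let slot = &mut self.slots[index as usize];
        slot.occupied = true;
        Pid { index, generation: slot.generation }
    }

    /// A PID is live only if its generation matches the slot's current one.
    fn is_live(&self, pid: Pid) -> bool {
        self.slots
            .get(pid.index as usize)
            .map_or(false, |s| s.occupied && s.generation == pid.generation)
    }

    /// Bump the generation so every outstanding copy of `pid` becomes stale.
    fn reclaim(&mut self, pid: Pid) {
        if !self.is_live(pid) {
            return;
        }
        let slot = &mut self.slots[pid.index as usize];
        slot.generation = slot.generation.wrapping_add(1);
        slot.occupied = false;
        self.free_list.push(pid.index);
    }
}

fn main() {
    let mut table = Table::default();
    let first = table.allocate();
    table.reclaim(first);
    let second = table.allocate();
    // Same index, different generation: the old PID no longer resolves.
    assert_eq!(first.index, second.index);
    assert!(!table.is_live(first));
    assert!(table.is_live(second));
}
```

The design trade-off is that a lookup costs one extra integer comparison, in exchange for never handing a reused slot to a holder of an old PID.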
--- a/src/timer.rs
+++ b/src/timer.rs
@@ -1,38 +1,86 @@
-//! Sleep timers.
+//! Sleep + wait-with-timeout timers.
 //!
-//! A min-heap of `(deadline, Pid)` entries lives on `SchedulerState`. When
+//! A min-heap of `(deadline, seq, reason)` entries lives on `SchedulerState`.
-//! an actor calls `sleep`, the runtime inserts the entry, marks the actor
+//! When an actor sleeps or starts a bounded wait (e.g. `mutex.lock()` with a
-//! parked, and yields. On every scheduler loop iteration the runtime pops
+//! timeout), the runtime inserts an entry, marks the actor parked, and yields.
-//! all entries whose deadline has passed and unparks them. When the run
+//! On every scheduler loop iteration the runtime pops all entries whose
-//! queue is empty but the heap is not, the runtime sleeps the OS thread
+//! deadline has passed and dispatches each according to its `Reason`:
 //! until the soonest deadline, then re-checks.
 //!
-//! `BinaryHeap` is a max-heap, so entries are stored with their deadline
+//!   - `Sleep`: unpark the actor.
-//! wrapped in `Reverse` to get min-heap behaviour.
+//!   - `WaitTimeout`: call `on_timeout` on the registered target. The target
 //!     (e.g. a `Mutex`) decides whether the actor was actually still waiting
 //!     (timer fires first → unpark with error) or had already been granted
 //!     what it was waiting for (lock granted first → no-op).
 //!
-//! Stale pids (slot reused since the timer was inserted) are detected on
+//! `BinaryHeap` is a max-heap; entries are wrapped in `Reverse` to get
-//! `due_pids` pop and silently dropped — same convention as the run queue.
+//! min-heap behaviour.
 //!
 //! No cancellation. When a non-timer wakeup happens (e.g. lock granted
 //! before timeout), the timer entry is left in the heap. It will be popped
 //! eventually and the dispatch will observe "actor is no longer parked /
 //! wait_seq is stale" and no-op. Cost is ~32 bytes per stale entry plus a
 //! few cycles on pop; acceptable given the upper bound is "one entry per
 //! parked actor".
 //!
 //! Stale pids (slot reused since the timer was inserted) are filtered on
 //! pop by the scheduler — same convention as the run queue.
 use crate::pid::Pid;
 use std::cmp::Reverse;
 use std::collections::BinaryHeap;
 use std::sync::Arc;
 use std::time::{Duration, Instant};
-#[derive(PartialEq, Eq)]
+/// What to do when a timer entry's deadline arrives.
 ///
 /// Held inside `Entry`; the scheduler dispatches on it for each entry
 /// returned by `pop_due`.
 pub enum Reason {
    /// `sleep(d)`. Unpark `pid` unconditionally (modulo the usual
    /// "still parked?" check the scheduler applies).
    Sleep,
    /// A bounded wait — currently only `Mutex::lock_timeout`. On expiry the
    /// scheduler calls `target.on_timeout(pid, wait_seq)`. The target then
    /// decides whether `pid` was actually still waiting, and if so unparks
    /// it with whatever error the wait was bounded for. `wait_seq` lets the
    /// target tell apart "this wait" from "a later wait by the same actor
    /// on the same target".
    WaitTimeout {
        target: Arc<dyn TimerTarget>,
        wait_seq: u64,
    },
 }
 /// Callback the scheduler invokes when a `WaitTimeout` entry pops.
 ///
 /// Implementors: do not touch `SchedulerState` other than via the public
 /// `unpark` / channel APIs. The scheduler is mid-iteration when this fires.
 pub trait TimerTarget: Send + Sync {
    fn on_timeout(&self, pid: Pid, wait_seq: u64);
 }
 pub struct Entry {
    pub deadline: Instant,
    /// Insertion order, used purely as a tiebreaker so `Entry: Ord` works
    /// without having to compare the `Reason` payload (which contains an
    /// `Arc<dyn TimerTarget>` and isn't `Ord`).
    seq: u64,
    pub pid: Pid,
    pub reason: Reason,
 }
 impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.deadline == other.deadline && self.seq == other.seq
    }
 }
 impl Eq for Entry {}
 impl Ord for Entry {
    fn cmp(&self, other: &Self) -> std::cmp::Ordering {
-        // Only `deadline` matters for ordering; pid is a tiebreaker so the
+        // Earlier deadline first; ties broken by insertion order so the
-        // type is Ord, but the order among same-deadline entries is
+        // ordering is total. `Reason` and `Pid` deliberately don't
-        // irrelevant.
+        // participate.
-        self.deadline
+        self.deadline.cmp(&other.deadline).then_with(|| self.seq.cmp(&other.seq))
            .cmp(&other.deadline)
            .then_with(|| self.pid.index().cmp(&other.pid.index()))
            .then_with(|| self.pid.generation().cmp(&other.pid.generation()))
    }
 }
@@ -46,15 +94,25 @@ impl PartialOrd for Entry {
 pub struct Timers {
    /// Reverse-wrapped so the smallest deadline is at the top.
    heap: BinaryHeap<Reverse<Entry>>,
    /// Monotonic counter for the tiebreaker `seq` field.
    next_seq: u64,
 }
 impl Timers {
    pub fn new() -> Self {
-        Self { heap: BinaryHeap::new() }
+        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }
-    pub fn insert(&mut self, deadline: Instant, pid: Pid) {
-        self.heap.push(Reverse(Entry { deadline, pid }));
+    /// Insert a `Sleep` timer. Convenience for the common case.
+    pub fn insert_sleep(&mut self, deadline: Instant, pid: Pid) {
        self.insert(deadline, pid, Reason::Sleep);
    }
    /// Insert an arbitrary timer entry.
    pub fn insert(&mut self, deadline: Instant, pid: Pid, reason: Reason) {
        let seq = self.next_seq;
        self.next_seq = self.next_seq.wrapping_add(1);
        self.heap.push(Reverse(Entry { deadline, seq, pid, reason }));
    }
    pub fn is_empty(&self) -> bool {
@@ -66,13 +124,13 @@ impl Timers {
        self.heap.peek().map(|r| r.0.deadline)
    }
-    /// Pop and return every pid whose deadline is ≤ `now`.
-    pub fn pop_due(&mut self, now: Instant) -> Vec<Pid> {
+    /// Pop every entry whose deadline is ≤ `now`, in deadline order.
+    /// The scheduler dispatches each entry by inspecting `entry.reason`.
    pub fn pop_due(&mut self, now: Instant) -> Vec<Entry> {
        let mut out = Vec::new();
        while let Some(r) = self.heap.peek() {
            if r.0.deadline <= now {
-                let e = self.heap.pop().unwrap().0;
-                out.push(e.pid);
+                out.push(self.heap.pop().unwrap().0);
            } else {
                break;
            }
@@ -81,7 +139,7 @@ impl Timers {
    }
 }
-/// Wall-clock duration helper exposed for `sleep`.
+/// Wall-clock duration helper exposed for `sleep` and `lock_timeout`.
 pub fn deadline_from_now(duration: Duration) -> Instant {
    Instant::now()
        .checked_add(duration)
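
The `seq` tiebreaker in the timers hunk above can be sketched in isolation. This is a simplified model, not the crate's code: `deadline` is a plain `u64` tick instead of an `Instant`, and the `label` field is a hypothetical stand-in for the `pid`/`reason` payload that stays out of the ordering.

```rust
use std::cmp::{Ordering, Reverse};
use std::collections::BinaryHeap;

/// Simplified timer entry: ordered by (deadline, seq) only.
struct Entry {
    deadline: u64,
    seq: u64,
    label: &'static str, // payload; deliberately ignored by Eq/Ord
}

impl PartialEq for Entry {
    fn eq(&self, other: &Self) -> bool {
        self.deadline == other.deadline && self.seq == other.seq
    }
}
impl Eq for Entry {}
impl PartialOrd for Entry {
    fn partial_cmp(&self, other: &Self) -> Option<Ordering> {
        Some(self.cmp(other))
    }
}
impl Ord for Entry {
    fn cmp(&self, other: &Self) -> Ordering {
        // Earlier deadline first; insertion order breaks ties.
        self.deadline.cmp(&other.deadline).then_with(|| self.seq.cmp(&other.seq))
    }
}

struct Timers {
    // `Reverse` turns the max-heap into a min-heap on (deadline, seq).
    heap: BinaryHeap<Reverse<Entry>>,
    next_seq: u64,
}

impl Timers {
    fn new() -> Self {
        Self { heap: BinaryHeap::new(), next_seq: 0 }
    }

    fn insert(&mut self, deadline: u64, label: &'static str) {
        let seq = self.next_seq;
        self.next_seq = self.next_seq.wrapping_add(1);
        self.heap.push(Reverse(Entry { deadline, seq, label }));
    }

    /// Pop every entry with `deadline <= now`, earliest first.
    fn pop_due(&mut self, now: u64) -> Vec<&'static str> {
        let mut out = Vec::new();
        while let Some(Reverse(e)) = self.heap.peek() {
            if e.deadline > now {
                break;
            }
            out.push(self.heap.pop().unwrap().0.label);
        }
        out
    }
}
```

With this ordering, two entries inserted for the same deadline come back in insertion order, which the old pid-based tiebreaker did not guarantee.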
--- a/src/trace.rs
+++ b/src/trace.rs
@@ -0,0 +1,246 @@
 //! Structured per-event tracing for smarm.
 //!
 //! Enabled by `--features smarm-trace`. Zero cost without the feature.
 //!
 //! Architecture: MPSC. Every scheduler thread holds a thread-local Sender
 //! clone (one mutex acquire per thread, on first use). A dedicated drain
 //! thread owns the Receiver, batches records, and writes to a BufWriter.
 //! The hot path (record()) is a single channel send — no mutex, no disk I/O.
 //!
 //! Usage:
 //!   cargo test --test runtime <test_name> --features smarm-trace
 //!
 //! Output: smarm_trace.json in cwd, or $SMARM_TRACE_FILE.
 //! View:   https://ui.perfetto.dev  or  chrome://tracing
 #[cfg(feature = "smarm-trace")]
 #[macro_export]
 macro_rules! te {
    ($kind:expr) => { $crate::trace::record($kind) };
 }
 #[cfg(not(feature = "smarm-trace"))]
 #[macro_export]
 macro_rules! te {
    ($kind:expr) => { () };
 }
 #[cfg(feature = "smarm-trace")]
 pub use inner::*;
 #[cfg(feature = "smarm-trace")]
 mod inner {
    use crate::pid::Pid;
    use std::io::Write;
    use std::sync::{mpsc, Mutex};
    use std::time::Instant;
    // -----------------------------------------------------------------------
    // Event kinds
    // -----------------------------------------------------------------------
    #[derive(Clone, Debug)]
    pub enum Event {
        // Actor lifecycle
        Spawn { parent: Pid, child: Pid },
        Resume(Pid),
        Yield(Pid),
        Park(Pid),
        Done(Pid),
        // Wakeup paths
        UnparkDirect(Pid),       // unpark() saw Parked   -> re-queued immediately
        UnparkDeferred(Pid),     // unpark() saw Runnable -> set pending_unpark flag
        UnparkFlagConsumed(Pid), // scheduler saw flag on Park -> re-queued instead
        // Channel
        Send { sender: Pid, receiver: Option<Pid> },
        RecvPark(Pid),
        RecvWake(Pid),
        // Queue
        Enqueue(Pid),
        Dequeue(Pid),
    }
    // -----------------------------------------------------------------------
    // Wire format sent through the channel
    // -----------------------------------------------------------------------
    struct Record {
        nanos: u64,   // ns since open()
        tid:   u64,   // OS thread id
        event: Event,
    }
    // Sentinel: drain thread flushes and exits when it receives this.
    enum Msg {
        Event(Record),
        Flush,
    }
    // -----------------------------------------------------------------------
    // Global sender + start time
    // -----------------------------------------------------------------------
    struct Global {
        sender:  mpsc::Sender<Msg>,
        start:   Instant,
    }
    static GLOBAL: Mutex<Option<Global>> = Mutex::new(None);
    // Per-thread state: cached Sender clone + cached copy of start Instant.
    // Both are taken together under a single GLOBAL lock, on the thread's
    // first record() call after open(). record() never touches GLOBAL after
    // that.
    struct LocalState {
        tx:    mpsc::Sender<Msg>,
        start: Instant,
    }
    thread_local! {
        static LOCAL_STATE: std::cell::RefCell<Option<LocalState>> =
            std::cell::RefCell::new(None);
    }
    // -----------------------------------------------------------------------
    // Lifecycle
    // -----------------------------------------------------------------------
    pub fn open() {
        let path = std::env::var("SMARM_TRACE_FILE")
            .unwrap_or_else(|_| "smarm_trace.json".to_owned());
        let (tx, rx) = mpsc::channel::<Msg>();
        let start = Instant::now();
        *GLOBAL.lock().unwrap() = Some(Global { sender: tx, start });
        // Drain thread: owns the Receiver, writes to disk.
        let path_for_thread = path.clone();
        std::thread::Builder::new()
            .name("smarm-trace-drain".into())
            .spawn(move || drain_thread(rx, &path_for_thread))
            .expect("failed to spawn trace drain thread");
        eprintln!("[smarm-trace] writing to {}", path);
    }
    /// Send a Flush sentinel so the drain thread writes the JSON trailer,
    /// flushes its BufWriter, and exits. This does not join the drain thread,
    /// so the final write may complete shortly after `flush()` returns.
    /// Called by Runtime::run after all scheduler threads have exited.
    pub fn flush() {
        // Take the global sender out so later record() calls on fresh
        // threads find no GLOBAL and become no-ops.
        let sender = {
            let mut g = GLOBAL.lock().unwrap();
            g.take().map(|g| g.sender)
        };
        if let Some(tx) = sender {
            let _ = tx.send(Msg::Flush);
            // tx drops here; the drain thread has already been told to exit
            // by the Flush sentinel.
        }
        // Clear thread-local state.
        LOCAL_STATE.with(|c| *c.borrow_mut() = None);
    }
    // -----------------------------------------------------------------------
    // Hot path
    // -----------------------------------------------------------------------
    pub fn record(event: Event) {
        // Disable preemption for the entire duration of record(). Any
        // allocation here (mutex internals, channel send, lazy init) would
        // trigger PreemptingAllocator -> maybe_preempt -> switch_to_scheduler,
        // which would try to re-acquire inner.shared (already held at many
        // te!() call sites) -> deadlock. Guard at the very top, before any
        // allocation-capable call.
        let was_enabled = crate::preempt::PREEMPTION_ENABLED
            .with(|e| { let v = e.get(); e.set(false); v });
        LOCAL_STATE.with(|cell| {
            let mut opt = cell.borrow_mut();
            // Lazily initialise: one mutex hit per thread, ever.
            if opt.is_none() {
                if let Some(g) = GLOBAL.lock().unwrap().as_ref() {
                    let tx = g.sender.clone();
                    *opt = Some(LocalState { tx, start: g.start });
                }
            }
            if let Some(ls) = opt.as_ref() {
                let nanos = ls.start.elapsed().as_nanos() as u64;
                let tid   = os_tid();
                let _ = ls.tx.send(Msg::Event(Record { nanos, tid, event }));
            }
        });
        crate::preempt::PREEMPTION_ENABLED.with(|e| e.set(was_enabled));
    }
    // -----------------------------------------------------------------------
    // Drain thread
    // -----------------------------------------------------------------------
    fn drain_thread(rx: mpsc::Receiver<Msg>, path: &str) {
        let f = match std::fs::File::create(path) {
            Ok(f) => f,
            Err(e) => { eprintln!("[smarm-trace] create failed: {}", e); return; }
        };
        let mut w = std::io::BufWriter::new(f);
        let _ = writeln!(w, "{{\"traceEvents\":[");
        let mut count: u64 = 0;
        let mut first = true;
        loop {
            match rx.recv() {
                Ok(Msg::Event(r)) => {
                    let (name, actor_idx) = chrome_fields(&r.event);
                    let ts_us = r.nanos as f64 / 1000.0;
                    if !first { let _ = w.write_all(b",\n"); }
                    first = false;
                    let _ = write!(w,
                        "{{\"ph\":\"i\",\"ts\":{:.3},\"pid\":{},\"tid\":{},\"name\":{:?},\"s\":\"g\"}}",
                        ts_us, actor_idx, r.tid, name);
                    count += 1;
                }
                Ok(Msg::Flush) | Err(_) => {
                    // Clean close.
                    let _ = writeln!(w, "\n]}}");
                    let _ = w.flush();
                    eprintln!("[smarm-trace] {} events written", count);
                    return;
                }
            }
        }
    }
    // -----------------------------------------------------------------------
    // Chrome trace helpers
    // -----------------------------------------------------------------------
    fn chrome_fields(ev: &Event) -> (String, u32) {
        match ev {
            Event::Spawn { parent, child } =>
                (format!("spawn c={}", child.index()), parent.index()),
            Event::Resume(p)             => ("resume".into(),               p.index()),
            Event::Yield(p)              => ("yield".into(),                p.index()),
            Event::Park(p)               => ("park".into(),                 p.index()),
            Event::Done(p)               => ("done".into(),                 p.index()),
            Event::UnparkDirect(p)       => ("unpark_direct".into(),        p.index()),
            Event::UnparkDeferred(p)     => ("unpark_deferred".into(),      p.index()),
            Event::UnparkFlagConsumed(p) => ("unpark_flag_consumed".into(), p.index()),
            Event::Send { sender, receiver } => (
                format!("send rx={}", receiver
                    .map(|p| p.index().to_string())
                    .unwrap_or_else(|| "none".into())),
                sender.index(),
            ),
            Event::RecvPark(p) => ("recv_park".into(), p.index()),
            Event::RecvWake(p) => ("recv_wake".into(), p.index()),
            Event::Enqueue(p)  => ("enqueue".into(),   p.index()),
            Event::Dequeue(p)  => ("dequeue".into(),   p.index()),
        }
    }
    fn os_tid() -> u64 {
        unsafe { libc::syscall(libc::SYS_gettid) as u64 }
    }
 }
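
The MPSC drain design in `src/trace.rs` can be illustrated with a std-only sketch. The names `spawn_drain` and `run_demo` are hypothetical, events are plain strings, and nothing is written to disk. Unlike the `flush()` in the diff, this sketch keeps the `JoinHandle` and joins it, which is one possible way to make shutdown wait for the drain thread.

```rust
use std::sync::mpsc;
use std::thread::{self, JoinHandle};

enum Msg {
    Event(String),
    Flush, // sentinel: drain thread stops when it sees this
}

/// Spawn a drain thread that collects events until `Flush` arrives or
/// every sender has been dropped.
fn spawn_drain(rx: mpsc::Receiver<Msg>) -> JoinHandle<Vec<String>> {
    thread::spawn(move || {
        let mut out = Vec::new();
        loop {
            match rx.recv() {
                Ok(Msg::Event(e)) => out.push(e),
                Ok(Msg::Flush) | Err(_) => return out,
            }
        }
    })
}

/// Several producer threads, each holding its own `Sender` clone, feed
/// one drain thread. Returns everything the drain thread received.
fn run_demo(producers: usize, per_producer: usize) -> Vec<String> {
    let (tx, rx) = mpsc::channel();
    let drain = spawn_drain(rx);

    let workers: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for n in 0..per_producer {
                    // Hot path: a single channel send, no lock, no I/O.
                    let _ = tx.send(Msg::Event(format!("t{id}:{n}")));
                }
            })
        })
        .collect();

    // All producers finish before the sentinel is sent, so no event can
    // be queued behind Flush.
    for w in workers {
        w.join().unwrap();
    }
    let _ = tx.send(Msg::Flush);
    drain.join().unwrap()
}
```

Sending `Flush` only after the producers have been joined is what guarantees the drain thread sees every event; a producer that sends after the sentinel would have its event silently dropped.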
--- a/tests/io.rs
+++ b/tests/io.rs
@@ -0,0 +1,99 @@
 //! Tests for `block_on_io` — running a blocking closure on a worker OS
 //! thread while the calling actor is parked.
 use smarm::{block_on_io, run, spawn, yield_now};
 use std::sync::atomic::{AtomicU32, Ordering};
 use std::sync::{Arc, Mutex};
 use std::time::Duration;
 #[test]
 fn block_on_io_returns_the_closures_value() {
    let captured: Arc<Mutex<Option<u64>>> = Arc::new(Mutex::new(None));
    let c = captured.clone();
    run(move || {
        let v: u64 = block_on_io(|| {
            // Burn a tiny bit of time so this actually crosses threads.
            std::thread::sleep(Duration::from_millis(5));
            42
        });
        *c.lock().unwrap() = Some(v);
    });
    assert_eq!(*captured.lock().unwrap(), Some(42));
 }
 #[test]
 fn other_actors_run_while_block_on_io_is_in_flight() {
    // While actor A is parked in block_on_io, actor B should be able to
    // make progress.
    let order: Arc<Mutex<Vec<u8>>> = Arc::new(Mutex::new(Vec::new()));
    let oa = order.clone();
    let ob = order.clone();
    run(move || {
        let a = spawn(move || {
            oa.lock().unwrap().push(1); // A starts first.
            block_on_io(|| {
                std::thread::sleep(Duration::from_millis(50));
            });
            oa.lock().unwrap().push(4); // A resumes last.
        });
        let b = spawn(move || {
            // Make sure A enters block_on_io first.
            yield_now();
            ob.lock().unwrap().push(2);
            yield_now();
            ob.lock().unwrap().push(3);
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    // Required interleaving: 1 (A starts) before 2,3 (B runs while A
    // is parked), and 4 (A resumes) after 2,3.
    let v = order.lock().unwrap();
    assert_eq!(v[0], 1, "log: {:?}", *v);
    assert_eq!(v[v.len() - 1], 4, "log: {:?}", *v);
    let pos_2 = v.iter().position(|&x| x == 2).unwrap();
    let pos_3 = v.iter().position(|&x| x == 3).unwrap();
    let pos_4 = v.iter().position(|&x| x == 4).unwrap();
    assert!(pos_2 < pos_4, "B's first step ran after A resumed: {:?}", *v);
    assert!(pos_3 < pos_4, "B's second step ran after A resumed: {:?}", *v);
 }
 #[test]
 fn many_concurrent_block_on_io_calls_all_complete() {
    let counter = Arc::new(AtomicU32::new(0));
    let c = counter.clone();
    run(move || {
        let mut handles = Vec::new();
        for _ in 0..10 {
            let cc = c.clone();
            handles.push(spawn(move || {
                let n: u32 = block_on_io(|| {
                    std::thread::sleep(Duration::from_millis(10));
                    1
                });
                cc.fetch_add(n, Ordering::SeqCst);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    assert_eq!(counter.load(Ordering::SeqCst), 10);
 }
 #[test]
 fn block_on_io_panic_propagates_to_caller() {
    let saw_err = Arc::new(std::sync::atomic::AtomicBool::new(false));
    let s = saw_err.clone();
    run(move || {
        let h = spawn(move || {
            // The closure panics on the worker thread; that should
            // resurface as a panic in this actor.
            let _: () = block_on_io(|| panic!("boom on io thread"));
        });
        if h.join().is_err() {
            s.store(true, Ordering::SeqCst);
        }
    });
    assert!(saw_err.load(Ordering::SeqCst));
 }
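
The behaviour these tests check (value returned, panic resurfacing in the caller) can be approximated with plain OS threads. The helpers below are hypothetical and are not smarm's `block_on_io`: they block the calling thread in `join()`, whereas the tests above expect the calling actor to be parked so other actors keep running.

```rust
use std::thread;

/// Run `f` on a fresh OS thread and wait for it. A panic inside `f`
/// comes back as `Err(payload)` instead of unwinding the caller.
fn run_on_worker<T, F>(f: F) -> thread::Result<T>
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    thread::spawn(f).join()
}

/// Same as `run_on_worker`, but re-raises the worker's panic in the
/// caller, mirroring what `block_on_io_panic_propagates_to_caller` expects.
fn run_on_worker_or_panic<T, F>(f: F) -> T
where
    F: FnOnce() -> T + Send + 'static,
    T: Send + 'static,
{
    match run_on_worker(f) {
        Ok(v) => v,
        Err(payload) => std::panic::resume_unwind(payload),
    }
}
```

`resume_unwind` forwards the original panic payload, so a caller that inspects the payload sees the worker's message rather than a generic join error.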
--- a/tests/io_epoll.rs
+++ b/tests/io_epoll.rs
@@ -0,0 +1,324 @@
 //! Tests for epoll-based fd readiness primitives: `wait_readable`,
 //! `wait_writable`, and the `read`/`write` sugar on top of them.
 //!
 //! Pipes are the convenient test target: cheap to create, easy to drive,
 //! and we already use `libc::pipe2` internally. Each pipe is one direction
 //! and respects `O_NONBLOCK` if we ask for it.
 use smarm::{run, spawn, wait_readable, wait_writable, yield_now};
 use std::os::fd::RawFd;
 use std::sync::atomic::{AtomicU32, Ordering};
 use std::sync::Arc;
 use std::sync::Mutex as StdMutex;
 use std::time::Duration;
 // ---------------------------------------------------------------------------
 // Pipe helper
 // ---------------------------------------------------------------------------
 struct Pipe {
    read: RawFd,
    write: RawFd,
 }
 impl Pipe {
    fn new() -> Self {
        let mut fds: [libc::c_int; 2] = [0; 2];
        let r = unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC | libc::O_NONBLOCK) };
        assert_eq!(r, 0, "pipe2 failed");
        Pipe {
            read: fds[0],
            write: fds[1],
        }
    }
 }
 impl Drop for Pipe {
    fn drop(&mut self) {
        unsafe {
            libc::close(self.read);
            libc::close(self.write);
        }
    }
 }
 fn raw_write(fd: RawFd, buf: &[u8]) -> isize {
    unsafe { libc::write(fd, buf.as_ptr() as *const _, buf.len()) }
 }
 fn raw_read(fd: RawFd, buf: &mut [u8]) -> isize {
    unsafe { libc::read(fd, buf.as_mut_ptr() as *mut _, buf.len()) }
 }
 // ---------------------------------------------------------------------------
 // wait_readable parks until data arrives, then libc::read succeeds.
 // ---------------------------------------------------------------------------
 #[test]
 fn wait_readable_blocks_until_data_arrives_then_read_succeeds() {
    let captured: Arc<StdMutex<Vec<u8>>> = Arc::new(StdMutex::new(Vec::new()));
    let cap = captured.clone();
    let p = Arc::new(Pipe::new());
    let p_reader = p.clone();
    let p_writer = p.clone();
    run(move || {
        let reader = spawn(move || {
            // Initially the pipe is empty; this parks.
            wait_readable(p_reader.read).expect("wait_readable failed");
            // Now data should be readable.
            let mut buf = [0u8; 16];
            let n = raw_read(p_reader.read, &mut buf);
            assert!(n > 0, "read returned {}", n);
            cap.lock().unwrap().extend_from_slice(&buf[..n as usize]);
        });
        let writer = spawn(move || {
            // Yield so the reader gets to park first.
            yield_now();
            yield_now();
            // Sleep a touch so the reader is definitely waiting in epoll.
            smarm::sleep(Duration::from_millis(5));
            let n = raw_write(p_writer.write, b"hello");
            assert_eq!(n, 5);
        });
        reader.join().unwrap();
        writer.join().unwrap();
    });
    assert_eq!(*captured.lock().unwrap(), b"hello");
 }
 // ---------------------------------------------------------------------------
 // The smarm::scheduler::read sugar — wait_readable + libc::read in one call.
 // ---------------------------------------------------------------------------
 #[test]
 fn read_sugar_returns_bytes_from_pipe() {
    let captured: Arc<StdMutex<Vec<u8>>> = Arc::new(StdMutex::new(Vec::new()));
    let cap = captured.clone();
    let p = Arc::new(Pipe::new());
    let p_reader = p.clone();
    let p_writer = p.clone();
    run(move || {
        let reader = spawn(move || {
            let mut buf = [0u8; 16];
            let n = smarm::scheduler::read(p_reader.read, &mut buf)
                .expect("smarm::scheduler::read failed");
            cap.lock().unwrap().extend_from_slice(&buf[..n]);
        });
        let writer = spawn(move || {
            yield_now();
            smarm::sleep(Duration::from_millis(5));
            let _ = raw_write(p_writer.write, b"world");
        });
        reader.join().unwrap();
        writer.join().unwrap();
    });
    assert_eq!(*captured.lock().unwrap(), b"world");
 }
 // ---------------------------------------------------------------------------
 // wait_writable + write — though pipes are almost always writable; the
 // useful test here is that the call doesn't hang on a writable fd.
 // ---------------------------------------------------------------------------
 #[test]
 fn write_sugar_sends_bytes_to_pipe() {
    let counter = Arc::new(AtomicU32::new(0));
    let c = counter.clone();
    let p = Arc::new(Pipe::new());
    let p_writer = p.clone();
    let p_reader = p.clone();
    run(move || {
        let writer = spawn(move || {
            // Pipe is empty + has buffer space, so this returns immediately
            // after wait_writable wakes (which happens fast because the
            // kernel marks an empty pipe as immediately writable).
            let n = smarm::scheduler::write(p_writer.write, b"smarm")
                .expect("write failed");
            assert_eq!(n, 5);
            c.fetch_add(1, Ordering::SeqCst);
        });
        let reader = spawn(move || {
            // Give the writer time.
            smarm::sleep(Duration::from_millis(10));
            let mut buf = [0u8; 16];
            let n = raw_read(p_reader.read, &mut buf);
            assert_eq!(n, 5);
            assert_eq!(&buf[..5], b"smarm");
        });
        writer.join().unwrap();
        reader.join().unwrap();
    });
    assert_eq!(counter.load(Ordering::SeqCst), 1);
 }
 // ---------------------------------------------------------------------------
 // While an actor is parked on wait_readable, other actors keep running.
 // ---------------------------------------------------------------------------
 #[test]
 fn other_actors_run_while_one_is_parked_on_wait_readable() {
    let log: Arc<StdMutex<Vec<u8>>> = Arc::new(StdMutex::new(Vec::new()));
    let la = log.clone();
    let lb = log.clone();
    let p = Arc::new(Pipe::new());
    let p_a = p.clone();
    let p_b = p.clone();
    run(move || {
        let a = spawn(move || {
            la.lock().unwrap().push(b'A');
            wait_readable(p_a.read).unwrap();
            la.lock().unwrap().push(b'a');
        });
        let b = spawn(move || {
            // A starts parking on the empty pipe; B should be free to do
            // its work in the meantime.
            for _ in 0..3 {
                yield_now();
                lb.lock().unwrap().push(b'B');
            }
            // Now wake A.
            let _ = raw_write(p_b.write, b"x");
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    let v = log.lock().unwrap();
    // A goes first ('A'), then B makes progress (multiple 'B's) while A is
    // parked, then A wakes and finishes ('a').
    let pos_big_a = v.iter().position(|&c| c == b'A').unwrap();
    let pos_lit_a = v.iter().position(|&c| c == b'a').unwrap();
    let big_b_count = v.iter().filter(|&&c| c == b'B').count();
    assert_eq!(big_b_count, 3, "B should have made 3 steps: {:?}", *v);
    assert!(pos_big_a < pos_lit_a, "A pre-park before A post-park: {:?}", *v);
    // At least the last B step should be before A resumes.
    let last_big_b = v.iter().rposition(|&c| c == b'B').unwrap();
    assert!(last_big_b < pos_lit_a, "B should finish before A resumes: {:?}", *v);
 }
 // ---------------------------------------------------------------------------
 // Two-way pipe ping-pong via wait_readable.
 // ---------------------------------------------------------------------------
 #[test]
 fn ping_pong_between_two_pipes_completes() {
    // a_to_b: actor A writes, actor B reads.
    // b_to_a: actor B writes, actor A reads.
    let a_to_b = Arc::new(Pipe::new());
    let b_to_a = Arc::new(Pipe::new());
    let counter = Arc::new(AtomicU32::new(0));
    let ca = counter.clone();
    let cb = counter.clone();
    let a_to_b_a = a_to_b.clone();
    let a_to_b_b = a_to_b.clone();
    let b_to_a_a = b_to_a.clone();
    let b_to_a_b = b_to_a.clone();
    run(move || {
        let a = spawn(move || {
            for _ in 0..5 {
                let _ = raw_write(a_to_b_a.write, b"x");
                wait_readable(b_to_a_a.read).unwrap();
                let mut buf = [0u8; 4];
                let _ = raw_read(b_to_a_a.read, &mut buf);
                ca.fetch_add(1, Ordering::SeqCst);
            }
        });
        let b = spawn(move || {
            for _ in 0..5 {
                wait_readable(a_to_b_b.read).unwrap();
                let mut buf = [0u8; 4];
                let _ = raw_read(a_to_b_b.read, &mut buf);
                let _ = raw_write(b_to_a_b.write, b"y");
                cb.fetch_add(1, Ordering::SeqCst);
            }
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    // Both sides did 5 rounds; counter is incremented by both, so total = 10.
    assert_eq!(counter.load(Ordering::SeqCst), 10);
 }
 // ---------------------------------------------------------------------------
 // Same fd reused across calls — DEL+ADD cycle works.
 // ---------------------------------------------------------------------------
 #[test]
 fn same_fd_can_be_waited_on_repeatedly() {
    let p = Arc::new(Pipe::new());
    let p_r = p.clone();
    let p_w = p.clone();
    let counter = Arc::new(AtomicU32::new(0));
    let c = counter.clone();
    run(move || {
        let reader = spawn(move || {
            for _ in 0..4 {
                wait_readable(p_r.read).unwrap();
                let mut buf = [0u8; 4];
                let n = raw_read(p_r.read, &mut buf);
                assert!(n > 0);
                c.fetch_add(1, Ordering::SeqCst);
            }
        });
        let writer = spawn(move || {
            for _ in 0..4 {
                yield_now();
                smarm::sleep(Duration::from_millis(2));
                let _ = raw_write(p_w.write, b"z");
            }
        });
        reader.join().unwrap();
        writer.join().unwrap();
    });
    assert_eq!(counter.load(Ordering::SeqCst), 4);
 }
 // ---------------------------------------------------------------------------
 // Sanity that wait_writable on an already-writable pipe returns promptly.
 // ---------------------------------------------------------------------------
 #[test]
 fn wait_writable_on_empty_pipe_returns_quickly() {
    let p = Arc::new(Pipe::new());
    let p_w = p.clone();
    let start = std::time::Instant::now();
    run(move || {
        wait_writable(p_w.write).unwrap();
    });
    let elapsed = start.elapsed();
    assert!(
        elapsed < Duration::from_millis(200),
        "wait_writable should be fast on a writable fd, took {:?}",
        elapsed
    );
 }
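
The readiness model these tests rely on is that a non-blocking read on an empty fd reports "would block" and succeeds once data has been written. A minimal std-only sketch, assuming a Unix target and using `UnixStream::pair()` in place of the `libc::pipe2` helper (`try_read` and `demo` are hypothetical names):

```rust
use std::io::{ErrorKind, Read, Write};
use std::os::unix::net::UnixStream;

/// Non-blocking read: `Ok(None)` means "not readable yet", which is the
/// point where a runtime would park the caller until the fd is ready.
fn try_read(sock: &mut UnixStream, buf: &mut [u8]) -> std::io::Result<Option<usize>> {
    match sock.read(buf) {
        Ok(n) => Ok(Some(n)),
        Err(e) if e.kind() == ErrorKind::WouldBlock => Ok(None),
        Err(e) => Err(e),
    }
}

/// Read before and after the peer writes; returns both results.
fn demo() -> std::io::Result<(Option<usize>, Option<usize>)> {
    let (mut a, mut b) = UnixStream::pair()?;
    a.set_nonblocking(true)?;

    let mut buf = [0u8; 16];
    let before = try_read(&mut a, &mut buf)?; // nothing written yet
    b.write_all(b"hello")?;
    let after = try_read(&mut a, &mut buf)?; // data is now queued
    Ok((before, after))
}
```

A `wait_readable`-style primitive fills the gap between the two reads: instead of polling, the caller is suspended until the kernel reports the fd readable.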
--- a/tests/mutex.rs
+++ b/tests/mutex.rs
@@ -0,0 +1,314 @@
 //! `smarm::Mutex<T>` tests. All run under the scheduler because `lock()`
 //! needs to be able to park.
 use smarm::{run, spawn, yield_now, LockTimeout, Mutex};
 use std::sync::Arc;
 use std::sync::Mutex as StdMutex;
 use std::sync::atomic::{AtomicU32, Ordering};
 use std::time::{Duration, Instant};
 // ---------------------------------------------------------------------------
 // Uncontended fast path
 // ---------------------------------------------------------------------------
 #[test]
 fn lock_free_mutex_succeeds() {
    let captured = Arc::new(AtomicU32::new(0));
    let c = captured.clone();
    run(move || {
        let m = Mutex::new(42u32);
        {
            let g = m.lock_timeout(Duration::from_millis(500)).unwrap();
            c.store(*g, Ordering::SeqCst);
        }
        // After drop we can lock again.
        let g2 = m.lock_timeout(Duration::from_millis(500)).unwrap();
        assert_eq!(*g2, 42);
    });
    assert_eq!(captured.load(Ordering::SeqCst), 42);
 }
 #[test]
 fn try_lock_returns_some_when_free_none_when_held() {
    let success_flag = Arc::new(AtomicU32::new(0));
    let s = success_flag.clone();
    run(move || {
        let m = Mutex::new(0u32);
        let g = m.try_lock().expect("free");
        // Holding the guard; a second try_lock on the same actor should fail.
        assert!(m.try_lock().is_none());
        drop(g);
        // Now free again.
        let g2 = m.try_lock().expect("free again");
        drop(g2);
        s.store(1, Ordering::SeqCst);
    });
    assert_eq!(success_flag.load(Ordering::SeqCst), 1);
 }
 #[test]
 fn guard_mutates_value_visible_through_next_lock() {
    let final_value = Arc::new(AtomicU32::new(0));
    let f = final_value.clone();
    run(move || {
        let m = Mutex::new(0u32);
        {
            let mut g = m.lock_timeout(Duration::from_millis(500)).unwrap();
            *g = 7;
        }
        let g2 = m.lock_timeout(Duration::from_millis(500)).unwrap();
        f.store(*g2, Ordering::SeqCst);
    });
    assert_eq!(final_value.load(Ordering::SeqCst), 7);
 }
 // ---------------------------------------------------------------------------
 // Contention: a second actor parks until the first releases.
 // ---------------------------------------------------------------------------
 #[test]
 fn contended_lock_parks_until_holder_releases() {
    // Actor A locks, yields (still holding), then releases. Actor B tries
    // to lock in between — B should park, then succeed after A drops.
    let log: Arc<StdMutex<Vec<&'static str>>> = Arc::new(StdMutex::new(Vec::new()));
    let la = log.clone();
    let lb = log.clone();
    run(move || {
        let m = Mutex::new(0u32);
        let m_a = m.clone();
        let m_b = m.clone();
        let a = spawn(move || {
            let g = m_a.lock_timeout(Duration::from_millis(500)).unwrap();
            la.lock().unwrap().push("A_locked");
            // First yield: lets B run past its first yield_now.
            yield_now();
            // Second yield: lets B reach B_try and attempt lock() while we
            // still hold it, so B parks on the mutex.
            yield_now();
            la.lock().unwrap().push("A_dropping");
            drop(g);
            la.lock().unwrap().push("A_dropped");
        });
        let b = spawn(move || {
            // One yield: lets A run and acquire the lock first.
            yield_now();
            lb.lock().unwrap().push("B_try");
            let _g = m_b.lock_timeout(Duration::from_millis(500)).unwrap();
            lb.lock().unwrap().push("B_locked");
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    let v = log.lock().unwrap();
    // A locks, B tries (parks), A drops, B gets the lock.
    let pos_a_locked = v.iter().position(|s| *s == "A_locked").unwrap();
    let pos_b_try = v.iter().position(|s| *s == "B_try").unwrap();
    let pos_a_dropped = v.iter().position(|s| *s == "A_dropped").unwrap();
    let pos_b_locked = v.iter().position(|s| *s == "B_locked").unwrap();
    assert!(pos_a_locked < pos_b_try, "log: {:?}", *v);
    assert!(pos_b_try < pos_a_dropped, "B should attempt before A drops: {:?}", *v);
    assert!(pos_a_dropped < pos_b_locked, "B should lock only after A drops: {:?}", *v);
 }
 // ---------------------------------------------------------------------------
 // Timeout: B times out while A holds the lock past B's deadline.
 // ---------------------------------------------------------------------------
 #[test]
 fn lock_timeout_returns_err_when_holder_never_releases() {
    let saw_err = Arc::new(std::sync::atomic::AtomicBool::new(false));
    let s = saw_err.clone();
    run(move || {
        let m: Mutex<u32> = Mutex::new(0);
        let m_a = m.clone();
        let m_b = m.clone();
        let a = spawn(move || {
            // Hold the lock for 100ms, blocking B's attempt with a 20ms timeout.
            let _g = m_a.lock_timeout(Duration::from_millis(500)).unwrap();
            smarm::sleep(Duration::from_millis(100));
            // _g drops here.
        });
        let b = spawn(move || {
            // Let A acquire first.
            yield_now();
            let t0 = Instant::now();
            let res = m_b.lock_timeout(Duration::from_millis(20));
            let elapsed = t0.elapsed();
            assert!(matches!(res, Err(LockTimeout)), "got {:?}", res);
            // Sanity: actually waited approximately the timeout.
            assert!(
                elapsed >= Duration::from_millis(15),
                "timed out too fast: {:?}",
                elapsed
            );
            assert!(
                elapsed < Duration::from_millis(80),
                "timed out far too slow: {:?}",
                elapsed
            );
            s.store(true, Ordering::SeqCst);
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    assert!(saw_err.load(Ordering::SeqCst));
 }
 // ---------------------------------------------------------------------------
 // FIFO fairness: when many actors queue, they get the lock in arrival order.
 // ---------------------------------------------------------------------------
 #[test]
 fn waiters_are_granted_the_lock_in_fifo_order() {
    let order: Arc<StdMutex<Vec<u32>>> = Arc::new(StdMutex::new(Vec::new()));
    run({
        let order = order.clone();
        move || {
            let m: Mutex<()> = Mutex::new(());
            // Holder: takes the lock, yields to let others queue up, then
            // releases. Each waiter records its arrival order on acquisition.
            let m_holder = m.clone();
            let holder = spawn(move || {
                let g = m_holder.lock_timeout(Duration::from_millis(500)).unwrap();
                // Let waiters pile up.
                for _ in 0..5 {
                    yield_now();
                }
                drop(g);
            });
            // Spawn 4 waiters in order 1, 2, 3, 4. Waiter `id` yields `id`
            // times before calling lock_timeout(), so the holder runs first
            // and the attempts arrive in id order.
            let mut handles = vec![holder];
            for id in 1u32..=4 {
                let m_w = m.clone();
                let o = order.clone();
                handles.push(spawn(move || {
                    // Stagger the lock attempts so they arrive in order.
                    for _ in 0..id {
                        yield_now();
                    }
                    let _g = m_w.lock_timeout(Duration::from_millis(500)).unwrap();
                    o.lock().unwrap().push(id);
                }));
            }
            for h in handles {
                h.join().unwrap();
            }
        }
    });
    let v = order.lock().unwrap().clone();
    assert_eq!(v, vec![1, 2, 3, 4], "waiters should acquire in arrival order");
 }
 // ---------------------------------------------------------------------------
 // Grant-vs-timeout race: holder drops just before timer would fire — waiter
 // should get the lock, not LockTimeout.
 // ---------------------------------------------------------------------------
 #[test]
 fn grant_wins_when_holder_releases_before_timeout() {
    let got_lock = Arc::new(std::sync::atomic::AtomicBool::new(false));
    let g = got_lock.clone();
    run(move || {
        let m: Mutex<u32> = Mutex::new(0);
        let m_a = m.clone();
        let m_b = m.clone();
        let a = spawn(move || {
            let _g = m_a.lock_timeout(Duration::from_millis(500)).unwrap();
            // Hold for 10ms, well under B's 100ms timeout.
            smarm::sleep(Duration::from_millis(10));
        });
        let b = spawn(move || {
            yield_now();
            let res = m_b.lock_timeout(Duration::from_millis(100));
            if res.is_ok() {
                g.store(true, Ordering::SeqCst);
            }
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    assert!(got_lock.load(Ordering::SeqCst));
 }
 // ---------------------------------------------------------------------------
 // Panic in critical section: next waiter still gets the lock (no poisoning).
 // ---------------------------------------------------------------------------
 #[test]
 fn next_waiter_gets_lock_after_holder_panics() {
    let next_got_it = Arc::new(std::sync::atomic::AtomicBool::new(false));
    let n = next_got_it.clone();
    run(move || {
        let m: Mutex<u32> = Mutex::new(7);
        let m_a = m.clone();
        let m_b = m.clone();
        let a = spawn(move || {
            let _g = m_a.lock_timeout(Duration::from_millis(500)).unwrap();
            yield_now();
            panic!("holder dies mid-critical-section");
        });
        let b = spawn(move || {
            yield_now();
            // A is dead but its guard's Drop ran during unwind. We get the lock.
            let g = m_b.lock_timeout(Duration::from_millis(100)).unwrap();
            assert_eq!(*g, 7);
            n.store(true, Ordering::SeqCst);
        });
        let _ = a.join(); // panic — expected
        b.join().unwrap();
    });
    assert!(next_got_it.load(Ordering::SeqCst));
 }
 // ---------------------------------------------------------------------------
 // Multiple short critical sections under contention all complete (no lost
 // wakeups, no deadlock). Counts up to N from M actors.
 // ---------------------------------------------------------------------------
 #[test]
 fn many_actors_increment_shared_counter_via_mutex() {
    const ACTORS: u32 = 8;
    const PER_ACTOR: u32 = 50;
    let final_value = Arc::new(AtomicU32::new(0));
    let fv = final_value.clone();
    run(move || {
        let m: Mutex<u32> = Mutex::new(0);
        let mut handles = Vec::new();
        for _ in 0..ACTORS {
            let m_i = m.clone();
            handles.push(spawn(move || {
                for _ in 0..PER_ACTOR {
                    let mut g = m_i.lock_timeout(Duration::from_millis(500)).unwrap();
                    *g += 1;
                }
            }));
        }
        for h in handles {
            h.join().unwrap();
        }
        let g = m.lock_timeout(Duration::from_millis(500)).unwrap();
        fv.store(*g, Ordering::SeqCst);
    });
    assert_eq!(final_value.load(Ordering::SeqCst), ACTORS * PER_ACTOR);
 }
--- a/tests/preempt.rs
+++ b/tests/preempt.rs
@@ -0,0 +1,66 @@
 //! Tests for explicit preemption via `smarm::check!()`.
 use smarm::{run, spawn};
 use std::sync::atomic::{AtomicU64, Ordering};
 use std::sync::Arc;
 #[test]
 fn check_yields_when_timeslice_expired() {
    // A single actor that drives the timeslice clock to zero manually and
    // then calls check!() is expected to yield, but the scheduler has
    // nothing else to run, so it just re-queues that actor, and there is
    // no per-slot run counter we could observe to prove the yield happened.
    // So instead: spawn two actors that each record a marker, force
    // timeslice expiry, call check!(), and record a second marker; then
    // verify that both actors made progress in interleaved order.
    let order: Arc<std::sync::Mutex<Vec<u8>>> = Arc::new(std::sync::Mutex::new(Vec::new()));
    let o1 = order.clone();
    let o2 = order.clone();
    run(move || {
        let a = spawn(move || {
            o1.lock().unwrap().push(b'A');
            // Force the timeslice to be considered expired.
            smarm::preempt::expire_timeslice_for_test();
            smarm::check!();
            o1.lock().unwrap().push(b'a');
        });
        let b = spawn(move || {
            o2.lock().unwrap().push(b'B');
            smarm::preempt::expire_timeslice_for_test();
            smarm::check!();
            o2.lock().unwrap().push(b'b');
        });
        a.join().unwrap();
        b.join().unwrap();
    });
    // FIFO scheduling + forced preemption: A starts, expires, yields to B;
    // B starts, expires, yields to A; A finishes, B finishes.
    // Required: both uppercase letters appear before either lowercase.
    let v = order.lock().unwrap();
    let pos_big_a = v.iter().position(|&c| c == b'A').unwrap();
    let pos_big_b = v.iter().position(|&c| c == b'B').unwrap();
    let pos_lit_a = v.iter().position(|&c| c == b'a').unwrap();
    let pos_lit_b = v.iter().position(|&c| c == b'b').unwrap();
    assert!(pos_big_a < pos_lit_a, "A's tail ran before A's head: {:?}", *v);
    assert!(pos_big_b < pos_lit_b, "B's tail ran before B's head: {:?}", *v);
    assert!(pos_big_a.max(pos_big_b) < pos_lit_a.min(pos_lit_b),
        "preemption didn't interleave: {:?}", *v);
 }
 #[test]
 fn check_is_a_noop_when_timeslice_not_expired() {
    // After a fresh resume, check!() should be cheap and not yield. Run
    // a single actor that calls check!() many times; it should complete
    // promptly.
    let count = Arc::new(AtomicU64::new(0));
    let c = count.clone();
    run(move || {
        for _ in 0..1_000 {
            smarm::check!();
            c.fetch_add(1, Ordering::Relaxed);
        }
    });
    assert_eq!(count.load(Ordering::Relaxed), 1_000);
 }
--- a/tests/runtime.rs
+++ b/tests/runtime.rs
@@ -0,0 +1,423 @@
 //! Tests for the multi-scheduler runtime: Config, Runtime::run, and
 //! correctness under genuine parallelism.
 //!
 //! The single-threaded correctness properties (channel ordering, mutex
 //! fairness, timer accuracy, etc.) are already covered by the per-module
 //! tests. This file focuses on what changes when N > 1 scheduler threads
 //! are involved:
 //!
 //!   - Config construction and validation
 //!   - Runtime::run blocks until all actors finish
 //!   - All existing cooperative behaviours hold under multi-threading
 //!   - Actors genuinely run on different OS threads
 //!   - No lost wakeups under concurrent park/unpark
 //!   - No slot leaks under high spawn/join churn
 //!   - Panic on one scheduler thread doesn't kill others
 use smarm::{channel, runtime::{Config, Runtime}, spawn, yield_now, JoinHandle};
 use std::sync::{atomic::{AtomicBool, AtomicU64, Ordering}, Arc};
 use std::time::Duration;
 use std::collections::HashSet;
 // ---------------------------------------------------------------------------
 // Helpers
 // ---------------------------------------------------------------------------
 /// Build a runtime with exactly `n` scheduler threads.
 fn rt(n: usize) -> Runtime {
    smarm::runtime::init(Config::exact(n))
 }
 /// Convenient single-threaded runtime (regression guard).
 fn rt1() -> Runtime { rt(1) }
 /// Multi-threaded runtime using all available parallelism.
 fn rt_par() -> Runtime {
    smarm::runtime::init(Config::default())
 }
 // ---------------------------------------------------------------------------
 // Config
 // ---------------------------------------------------------------------------
 #[test]
 fn config_exact_overrides_bounds() {
    let c = Config::exact(3);
    assert_eq!(c.resolved_thread_count(), 3);
 }
 #[test]
 fn config_default_clamps_to_available_parallelism() {
    let c = Config::default();
    let n = c.resolved_thread_count();
    let avail = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    // Default min is 1, default max is available_parallelism.
    assert!(n >= 1 && n <= avail);
 }
 #[test]
 fn config_min_max_clamps() {
    // No exact override (None): clamp to min=2, max=4, even if available parallelism is >4.
    let c = Config::new(2, 4, None);
    let n = c.resolved_thread_count();
    assert!(n >= 2 && n <= 4, "expected 2..=4, got {n}");
 }
 #[test]
 fn config_min_1_max_1_is_single_threaded() {
    let c = Config::new(1, 1, None);
    assert_eq!(c.resolved_thread_count(), 1);
 }
 // ---------------------------------------------------------------------------
 // Runtime::run — basic lifecycle
 // ---------------------------------------------------------------------------
 #[test]
 fn runtime_run_executes_closure() {
    let flag = Arc::new(AtomicBool::new(false));
    let f = flag.clone();
    rt(1).run(move || { f.store(true, Ordering::SeqCst); });
    assert!(flag.load(Ordering::SeqCst));
 }
 #[test]
 fn runtime_run_blocks_until_all_actors_done() {
    // Spawn a batch of 20 actors; the counter should be exactly 20 when run returns.
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    rt(2).run(move || {
        let mut handles = Vec::new();
        for _ in 0..20 {
            let cc = c.clone();
            handles.push(spawn(move || {
                cc.fetch_add(1, Ordering::SeqCst);
            }));
        }
        for h in handles {
            h.join().unwrap();
        }
    });
    assert_eq!(counter.load(Ordering::SeqCst), 20);
 }
 #[test]
 fn runtime_can_be_used_multiple_times_sequentially() {
    // Each call to run() is independent.
    let r = rt(2);
    let a = Arc::new(AtomicU64::new(0));
    let b = Arc::new(AtomicU64::new(0));
    let ac = a.clone();
    let bc = b.clone();
    r.run(move || { ac.fetch_add(1, Ordering::SeqCst); });
    r.run(move || { bc.fetch_add(1, Ordering::SeqCst); });
    assert_eq!(a.load(Ordering::SeqCst), 1);
    assert_eq!(b.load(Ordering::SeqCst), 1);
 }
 // ---------------------------------------------------------------------------
 // Single-threaded regression: exact(1) must behave identically to old run()
 // ---------------------------------------------------------------------------
 #[test]
 fn exact_1_spawn_join_works() {
    let v = Arc::new(AtomicU64::new(0));
    let vc = v.clone();
    rt1().run(move || {
        let h = spawn(move || { vc.store(42, Ordering::SeqCst); });
        h.join().unwrap();
    });
    assert_eq!(v.load(Ordering::SeqCst), 42);
 }
 #[test]
 fn exact_1_channel_recv_parks_and_wakes() {
    let v = Arc::new(AtomicU64::new(0));
    let vc = v.clone();
    rt1().run(move || {
        let (tx, rx) = channel::<u64>();
        let h = spawn(move || {
            let val = rx.recv().unwrap();
            vc.store(val, Ordering::SeqCst);
        });
        yield_now();
        tx.send(99).unwrap();
        h.join().unwrap();
    });
    assert_eq!(v.load(Ordering::SeqCst), 99);
 }
 #[test]
 fn exact_1_panic_captured() {
    let saw_err = Arc::new(AtomicBool::new(false));
    let s = saw_err.clone();
    rt1().run(move || {
        let h = spawn(|| panic!("oops"));
        if h.join().is_err() { s.store(true, Ordering::SeqCst); }
    });
    assert!(saw_err.load(Ordering::SeqCst));
 }
 // ---------------------------------------------------------------------------
 // Multi-threaded correctness
 // ---------------------------------------------------------------------------
 #[test]
 fn multi_thread_all_actors_complete() {
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    rt_par().run(move || {
        let mut handles = Vec::new();
        for _ in 0..100 {
            let cc = c.clone();
            handles.push(spawn(move || {
                cc.fetch_add(1, Ordering::SeqCst);
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    assert_eq!(counter.load(Ordering::SeqCst), 100);
 }
 #[test]
 fn multi_thread_channel_wakeup_across_threads() {
    // Receiver parks; sender runs (potentially on a different OS thread).
    // Verifies no lost wakeup.
    let received = Arc::new(AtomicU64::new(0));
    let rc = received.clone();
    rt_par().run(move || {
        let (tx, rx) = channel::<u64>();
        let h = spawn(move || {
            let v = rx.recv().unwrap();
            rc.store(v, Ordering::SeqCst);
        });
        // Let receiver park.
        yield_now();
        tx.send(7).unwrap();
        h.join().unwrap();
    });
    assert_eq!(received.load(Ordering::SeqCst), 7);
 }
 #[test]
 fn multi_thread_many_channels_no_lost_wakeups() {
    // N pairs of (sender actor, receiver actor). Each pair exchanges one
    // message. All must complete — any lost wakeup causes a deadlock/timeout.
    const PAIRS: usize = 50;
    let count = Arc::new(AtomicU64::new(0));
    let c = count.clone();
    rt_par().run(move || {
        let mut handles: Vec<JoinHandle> = Vec::new();
        for _ in 0..PAIRS {
            let (tx, rx) = channel::<u64>();
            let cc = c.clone();
            handles.push(spawn(move || {
                let v = rx.recv().unwrap();
                cc.fetch_add(v, Ordering::SeqCst);
            }));
            handles.push(spawn(move || {
                tx.send(1).unwrap();
            }));
        }
        for h in handles { h.join().unwrap(); }
    });
    assert_eq!(count.load(Ordering::SeqCst), PAIRS as u64);
 }
 #[test]
 fn multi_thread_mutex_contention_no_deadlock() {
    use smarm::Mutex;
    const ACTORS: usize = 20;
    const PER: u64 = 100;
    let total = Arc::new(AtomicU64::new(0));
    let t = total.clone();
    rt_par().run(move || {
        let m: Mutex<u64> = Mutex::new(0);
        let mut handles = Vec::new();
        for _ in 0..ACTORS {
            let mc = m.clone();
            let tc = t.clone();
            handles.push(spawn(move || {
                for _ in 0..PER {
                    let mut g = mc.lock_timeout(Duration::from_secs(5)).unwrap();
                    *g += 1;
                    tc.fetch_add(0, Ordering::SeqCst); // just a memory barrier
                }
            }));
        }
        for h in handles { h.join().unwrap(); }
        let g = m.lock_timeout(Duration::from_secs(1)).unwrap();
        t.store(*g, Ordering::SeqCst);
    });
    assert_eq!(total.load(Ordering::SeqCst), ACTORS as u64 * PER);
 }
 #[test]
 fn multi_thread_join_across_threads() {
    // Parent joins a child that may run on a different scheduler thread.
    let v = Arc::new(AtomicU64::new(0));
    let vc = v.clone();
    rt_par().run(move || {
        let h = spawn(move || {
            // Do some work to make scheduling interesting.
            for _ in 0..10 { yield_now(); }
            vc.store(1, Ordering::SeqCst);
        });
        h.join().unwrap();
    });
    assert_eq!(v.load(Ordering::SeqCst), 1);
 }
 // ---------------------------------------------------------------------------
 // Actors run on distinct OS threads
 //
 // We collect the OS thread IDs that actors execute on. With N schedulers
 // and enough actors, we expect to see more than one thread ID.
 // ---------------------------------------------------------------------------
 #[test]
 fn actors_run_on_multiple_os_threads() {
    let thread_ids: Arc<smarm::Mutex<HashSet<u64>>> =
        Arc::new(smarm::Mutex::new(HashSet::new()));
    rt_par().run({
        let ids = thread_ids.clone();
        move || {
            let mut handles = Vec::new();
            for _ in 0..64 {
                let idc = ids.clone();
                handles.push(spawn(move || {
                    let tid = unsafe { libc::syscall(libc::SYS_gettid) as u64 };
                    let mut g = idc.lock_timeout(Duration::from_secs(1)).unwrap();
                    g.insert(tid);
                }));
            }
            for h in handles { h.join().unwrap(); }
        }
    });
    let n = std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    let ids = thread_ids.lock_timeout(Duration::from_secs(1)).unwrap();
    // If we have >1 scheduler threads, we expect >1 OS thread IDs.
    // On a single-CPU machine this may be 1; we just assert ≥ 1.
    assert!(!ids.is_empty());
    // With n > 1 we strongly expect parallelism, since 64 actors should
    // spread. Scheduling is non-deterministic, though, so we log rather
    // than assert to avoid flakiness on loaded CI.
    if n > 1 && ids.len() == 1 {
        eprintln!("WARNING: 64 actors all ran on the same OS thread (flaky on loaded system)");
    }
 }
 // ---------------------------------------------------------------------------
 // Scheduler stats (RFC 000 Layer 1 primitives)
 // ---------------------------------------------------------------------------
 #[test]
 fn scheduler_stats_run_queue_len_is_observable() {
    // After spawning actors but before they run, the queue should be non-empty.
    // We can't observe this from inside run() without a snapshot API, but we
    // can verify the stats struct is accessible and returns sane values after
    // run() completes (queue len == 0 at quiescence).
    let r = rt_par();
    r.run(|| {
        for _ in 0..10 { spawn(|| {}); }
        // Don't join — let them drain naturally.
    });
    let stats = r.stats();
    assert_eq!(stats.total_run_queue_len(), 0, "queue should be empty after run()");
 }
 #[test]
 fn scheduler_stats_thread_count_matches_config() {
    let r = rt(3);
    r.run(|| {});
    assert_eq!(r.stats().scheduler_count(), 3);
 }
 // ---------------------------------------------------------------------------
 // Panic isolation: a panicking actor doesn't kill the scheduler thread
 // ---------------------------------------------------------------------------
 #[test]
 fn panic_in_actor_does_not_kill_runtime() {
    let completed = Arc::new(AtomicU64::new(0));
    let c = completed.clone();
    rt_par().run(move || {
        // Spawn a panicker alongside well-behaved actors.
        let bad = spawn(|| panic!("deliberate"));
        let mut good_handles = Vec::new();
        for _ in 0..10 {
            let cc = c.clone();
            good_handles.push(spawn(move || {
                cc.fetch_add(1, Ordering::SeqCst);
            }));
        }
        let _ = bad.join(); // expect Err
        for h in good_handles { h.join().unwrap(); }
    });
    assert_eq!(completed.load(Ordering::SeqCst), 10);
 }
 // ---------------------------------------------------------------------------
 // No slot leaks: rapid spawn/join churn
 // ---------------------------------------------------------------------------
 #[test]
 fn no_slot_leak_under_churn() {
    // Spawn and join many short actors in a loop. If slots leak, the slot
    // table grows unboundedly. We can't directly measure it without an
    // introspection API, but the test at least checks correctness under
    // churn and will OOM if there's a severe leak.
    let counter = Arc::new(AtomicU64::new(0));
    let c = counter.clone();
    rt_par().run(move || {
        for _ in 0..500 {
            let cc = c.clone();
            spawn(move || { cc.fetch_add(1, Ordering::SeqCst); })
                .join()
                .unwrap();
        }
    });
    assert_eq!(counter.load(Ordering::SeqCst), 500);
 }
 // ---------------------------------------------------------------------------
 // Ping-pong: channel round-trips between two actors
 // ---------------------------------------------------------------------------
 #[test]
 fn ping_pong_completes() {
    const ROUNDS: u64 = 1_000;
    let final_val = Arc::new(AtomicU64::new(0));
    let fv = final_val.clone();
    rt_par().run(move || {
        let (tx_a, rx_a) = channel::<u64>();
        let (tx_b, rx_b) = channel::<u64>();
        let h_a = spawn(move || {
            tx_a.send(0).unwrap();
            for _ in 0..ROUNDS {
                let v = rx_b.recv().unwrap();
                tx_a.send(v + 1).unwrap();
            }
        });
        let h_b = spawn(move || {
            for _ in 0..=ROUNDS {
                let v = rx_a.recv().unwrap();
                if v < ROUNDS {
                    tx_b.send(v).unwrap();
                } else {
                    fv.store(v, Ordering::SeqCst);
                }
            }
        });
        h_a.join().unwrap();
        h_b.join().unwrap();
    });
    assert_eq!(final_val.load(Ordering::SeqCst), ROUNDS);
 }
--- a/tests/stress.rs
+++ b/tests/stress.rs
@@ -0,0 +1,448 @@
 //! Stress tests targeting lost wakeups, PID table pressure, thundering herds,
 //! and panic isolation under concurrency.
 //!
 //! These tests are designed to find bugs that functional happy-path tests
 //! cannot: races in the park/unpark protocol, slot leaks under concurrent
 //! churn, and scheduler corruption from concurrent panics.
 //!
 //! Every test that could hang is bounded by a join on a known-finite set of
 //! handles. A deadlock from a lost wakeup will hang the test binary rather
 //! than produce a false pass, so run it under an external timeout (for
 //! example `timeout 300 cargo test`) or under a CI timeout.
 use smarm::{channel, runtime::{Config, Runtime}, spawn, yield_now, JoinHandle};
 use std::sync::{
    atomic::{AtomicU64, AtomicUsize, Ordering},
    Arc,
 };
 fn rt(n: usize) -> Runtime {
    smarm::runtime::init(Config::exact(n))
 }
 fn rt_par() -> Runtime {
    smarm::runtime::init(Config::default())
 }
 // ---------------------------------------------------------------------------
 // P0: Lost-wakeup — many concurrent sender/receiver pairs
 //
 // 500 independent (tx, rx) pairs. Each sender and receiver are separate
 // actors. No ordering is imposed between pairs. Any lost wakeup causes one
 // receiver to park forever, deadlocking the join at the end.
 // ---------------------------------------------------------------------------
 #[test]
 fn lost_wakeup_many_pairs() {
    const PAIRS: usize = 500;
    let count = Arc::new(AtomicU64::new(0));
    for threads in [1, 2, 4] {
        count.store(0, Ordering::SeqCst);
        let c = count.clone();
        rt(threads).run(move || {
            let mut handles: Vec<JoinHandle> = Vec::with_capacity(PAIRS * 2);
            for _ in 0..PAIRS {
                let (tx, rx) = channel::<u64>();
                let cc = c.clone();
                // Receiver parks immediately.
                handles.push(spawn(move || {
                    let v = rx.recv().unwrap();
                    cc.fetch_add(v, Ordering::SeqCst);
                }));
                // Sender fires without any yield — races with receiver parking.
                handles.push(spawn(move || {
                    tx.send(1).unwrap();
                }));
            }
            for h in handles {
                h.join().unwrap();
            }
        });
        assert_eq!(
            count.load(Ordering::SeqCst),
            PAIRS as u64,
            "lost wakeup on {threads}-thread runtime"
        );
    }
 }
 // ---------------------------------------------------------------------------
 // P0: Lost-wakeup — rapid-fire single receiver
 //
 // One receiver, SENDERS senders, all spawned at once. The receiver loops
 // receiving SENDERS messages. Race: a sender may fire before the receiver
 // has parked, or exactly as it is transitioning to parked.
 // ---------------------------------------------------------------------------
 #[test]
 fn lost_wakeup_rapid_fire_single_receiver() {
    const SENDERS: u64 = 200;
    for threads in [1, 2, 4] {
        let received = Arc::new(AtomicU64::new(0));
        let rc = received.clone();
        rt(threads).run(move || {
            let (tx, rx) = channel::<u64>();
            let mut handles: Vec<JoinHandle> = Vec::with_capacity(SENDERS as usize + 1);
            // Receiver loops until it has seen all messages.
            handles.push(spawn(move || {
                let mut n = 0u64;
                while n < SENDERS {
                    rx.recv().unwrap();
                    n += 1;
                }
                rc.store(n, Ordering::SeqCst);
            }));
            // All senders fire with no deliberate delay.
            for _ in 0..SENDERS {
                let txc = tx.clone();
                handles.push(spawn(move || {
                    txc.send(1).unwrap();
                }));
            }
            for h in handles {
                h.join().unwrap();
            }
        });
        assert_eq!(
            received.load(Ordering::SeqCst),
            SENDERS,
            "missed messages on {threads}-thread runtime"
        );
    }
 }
 // ---------------------------------------------------------------------------
 // P0: Lost-wakeup — wakeup during yield chain
 //
 // Receiver yields N times before it would naturally park. Sender fires
 // during that window. Tests the race between "actor is on the run queue
 // yielding" and "actor transitions to parked."
 // ---------------------------------------------------------------------------
 #[test]
 fn lost_wakeup_during_yield_chain() {
    const YIELDS: usize = 20;
    const PAIRS: usize = 100;
    let count = Arc::new(AtomicU64::new(0));
    let c = count.clone();
    rt_par().run(move || {
        let mut handles: Vec<JoinHandle> = Vec::with_capacity(PAIRS * 2);
        for _ in 0..PAIRS {
            let (tx, rx) = channel::<u64>();
            let cc = c.clone();
            handles.push(spawn(move || {
                // Yield several times, then block.
                for _ in 0..YIELDS {
                    yield_now();
                }
                let v = rx.recv().unwrap();
                cc.fetch_add(v, Ordering::SeqCst);
            }));
            handles.push(spawn(move || {
                // Fire immediately — may arrive while receiver is still yielding.
                tx.send(1).unwrap();
            }));
        }
        for h in handles {
            h.join().unwrap();
        }
    });
    assert_eq!(count.load(Ordering::SeqCst), PAIRS as u64);
 }
 // ---------------------------------------------------------------------------
 // P2: Thundering herd
 //
 // N actors all block on recv from their own channel. A coordinator sends
 // to all channels in rapid succession. All N actors must wake and complete.
 // Common bug: wakeup list walked destructively while lock is dropped
 // mid-walk, causing some actors to never be re-queued.
 // ---------------------------------------------------------------------------
 #[test]
 fn thundering_herd_all_wake() {
    const HERD: usize = 200;
    let woke = Arc::new(AtomicUsize::new(0));
    let w = woke.clone();
    rt_par().run(move || {
        let mut senders: Vec<smarm::Sender<u8>> = Vec::with_capacity(HERD);
        let mut handles: Vec<JoinHandle> = Vec::with_capacity(HERD + 1);
        for _ in 0..HERD {
            let (tx, rx) = channel::<u8>();
            senders.push(tx);
            let wc = w.clone();
            handles.push(spawn(move || {
                rx.recv().unwrap();
                wc.fetch_add(1, Ordering::SeqCst);
            }));
        }
        // Give the receivers a chance to park before we send (best effort).
        for _ in 0..4 { yield_now(); }
        // Coordinator blasts all channels.
        handles.push(spawn(move || {
            for tx in senders {
                tx.send(1).unwrap();
            }
        }));
        for h in handles {
            h.join().unwrap();
        }
    });
    assert_eq!(woke.load(Ordering::SeqCst), HERD);
 }
 // ---------------------------------------------------------------------------
 // P1: Concurrent spawn/join churn — PID table pressure
 //
 // K parent actors each spawn M children and join them, all concurrently.
 // Exercises PID allocation/deallocation racing across scheduler threads.
 // A generation-counter bug or slot leak will either corrupt a join result
 // or accumulate memory without bound.
 // ---------------------------------------------------------------------------
 #[test]
 fn concurrent_spawn_join_churn() {
    const PARENTS: usize = 20;
    const CHILDREN_PER_PARENT: usize = 50;
    const EXPECTED: u64 = (PARENTS * CHILDREN_PER_PARENT) as u64;
    let total = Arc::new(AtomicU64::new(0));
    let t = total.clone();
    rt_par().run(move || {
        let mut parent_handles: Vec<JoinHandle> = Vec::with_capacity(PARENTS);
        for _ in 0..PARENTS {
            let tc = t.clone();
            parent_handles.push(spawn(move || {
                let mut child_handles: Vec<JoinHandle> =
                    Vec::with_capacity(CHILDREN_PER_PARENT);
                for _ in 0..CHILDREN_PER_PARENT {
                    let tcc = tc.clone();
                    child_handles.push(spawn(move || {
                        tcc.fetch_add(1, Ordering::SeqCst);
                    }));
                }
                for h in child_handles {
                    h.join().unwrap();
                }
            }));
        }
        for h in parent_handles {
            h.join().unwrap();
        }
    });
    assert_eq!(total.load(Ordering::SeqCst), EXPECTED);
 }
 // ---------------------------------------------------------------------------
 // P0: Join race — join called after child has already finished
 //
 // The child is given time to complete before the parent calls join. This
 // exercises a different code path than "join before child finishes":
 // the wakeup has already fired and the result must be stored in the slot.
 // A bug here leaves the parent hanging or returns a corrupted result.
 // ---------------------------------------------------------------------------
 #[test]
 fn join_race_child_finishes_first() {
    const REPS: usize = 300;
    let ok = Arc::new(AtomicUsize::new(0));
    let o = ok.clone();
    rt_par().run(move || {
        let mut handles: Vec<JoinHandle> = Vec::with_capacity(REPS);
        for _ in 0..REPS {
            let oc = o.clone();
            let h = spawn(move || {
                // Child does a tiny bit of work and exits quickly.
                oc.fetch_add(1, Ordering::SeqCst);
            });
            handles.push(h);
        }
        // Yield enough to let children run to completion before we join.
        for _ in 0..8 { yield_now(); }
        for h in handles {
            // If child already finished, join must return immediately with Ok.
            h.join().unwrap();
        }
    });
    assert_eq!(ok.load(Ordering::SeqCst), REPS);
 }
 // ---------------------------------------------------------------------------
 // P3: Panic storm — concurrent panics don't corrupt the scheduler
 //
 // Many actors panic at the same time while a separate cohort of well-behaved
 // actors makes progress. If a panic corrupts the run queue or the slot table,
 // the well-behaved actors will deadlock or produce wrong counts.
 // ---------------------------------------------------------------------------
 #[test]
 fn panic_storm_does_not_corrupt_scheduler() {
    const PANICKERS: usize = 50;
    const WORKERS: usize = 50;
    const WORK_PER_ACTOR: u64 = 10;
    let total = Arc::new(AtomicU64::new(0));
    let t = total.clone();
    rt_par().run(move || {
        let mut handles: Vec<JoinHandle> = Vec::with_capacity(PANICKERS + WORKERS);
        // Spawn all panickers.
        for _ in 0..PANICKERS {
            handles.push(spawn(|| panic!("deliberate panic storm")));
        }
        // Then spawn the well-behaved workers; they run concurrently with the panickers.
        for _ in 0..WORKERS {
            let tc = t.clone();
            handles.push(spawn(move || {
                for _ in 0..WORK_PER_ACTOR {
                    yield_now();
                    tc.fetch_add(1, Ordering::SeqCst);
                }
            }));
        }
        // Collect results — panickers return Err, workers return Ok.
        let mut panic_count = 0usize;
        let mut ok_count = 0usize;
        for h in handles {
            match h.join() {
                Ok(()) => ok_count += 1,
                Err(_) => panic_count += 1,
            }
        }
        assert_eq!(panic_count, PANICKERS, "wrong number of panics captured");
        assert_eq!(ok_count, WORKERS, "some workers lost");
    });
    assert_eq!(
        total.load(Ordering::SeqCst),
        WORKERS as u64 * WORK_PER_ACTOR,
        "workers produced wrong count — scheduler corruption suspected"
    );
 }
 // ---------------------------------------------------------------------------
 // P1: Sequential slot reuse — generation counter correctness
 //
 // Spawn an actor, join it, then spawn a new actor. The new actor will likely
 // reuse the same slot index. A stale handle to the first actor must not
 // accidentally refer to the second. We can't hold a stale handle across a
 // join (join consumes the handle), but we can verify that PID generations
 // are distinct across reuse.
 // ---------------------------------------------------------------------------
 #[test]
 fn pid_generation_increments_on_reuse() {
    use smarm::self_pid;
    let pids: Arc<smarm::Mutex<Vec<smarm::Pid>>> =
        Arc::new(smarm::Mutex::new(Vec::new()));
    let p = pids.clone();
    rt(1).run(move || {
        // Single-threaded to maximise slot reuse.
        for _ in 0..100 {
            let pc = p.clone();
            spawn(move || {
                let pid = self_pid();
                let mut g = pc.lock_timeout(std::time::Duration::from_secs(5)).unwrap();
                g.push(pid);
            })
            .join()
            .unwrap();
        }
    });
    let g = pids.lock_timeout(std::time::Duration::from_secs(1)).unwrap();
    // Any two PIDs that share an index must have different generations.
    for i in 0..g.len() {
        for j in (i + 1)..g.len() {
            if g[i].index() == g[j].index() {
                assert_ne!(
                    g[i].generation(),
                    g[j].generation(),
                    "slot {} reused without incrementing generation",
                    g[i].index()
                );
            }
        }
    }
 }
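The property this test checks can be sketched with a small generational slot table. This is an illustrative model only, assuming a free list and a per-slot generation counter; `Handle`, `SlotTable`, and their methods are hypothetical and do not describe smarm's actual slot table.

```rust
/// Hypothetical handle: a slot index plus the generation it was issued under.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct Handle {
    index: u32,
    generation: u32,
}

/// Hypothetical slot table with index reuse.
struct SlotTable {
    generations: Vec<u32>, // current generation of each slot
    occupied: Vec<bool>,   // whether each slot currently holds a live entry
    free: Vec<u32>,        // indices available for reuse
}

impl SlotTable {
    fn new() -> Self {
        SlotTable { generations: Vec::new(), occupied: Vec::new(), free: Vec::new() }
    }

    /// Allocate a slot, reusing a freed index when one is available.
    fn allocate(&mut self) -> Handle {
        if let Some(index) = self.free.pop() {
            self.occupied[index as usize] = true;
            Handle { index, generation: self.generations[index as usize] }
        } else {
            let index = self.generations.len() as u32;
            self.generations.push(0);
            self.occupied.push(true);
            Handle { index, generation: 0 }
        }
    }

    /// A handle is live only if its generation matches the slot's current one.
    fn is_live(&self, h: Handle) -> bool {
        let i = h.index as usize;
        i < self.generations.len() && self.occupied[i] && self.generations[i] == h.generation
    }

    /// Release a slot and bump its generation so older handles go stale.
    fn release(&mut self, h: Handle) -> bool {
        if !self.is_live(h) {
            return false;
        }
        let i = h.index as usize;
        self.occupied[i] = false;
        self.generations[i] = self.generations[i].wrapping_add(1);
        self.free.push(h.index);
        true
    }
}
```

In this model, a handle issued before a release shares its index with the next allocation but fails the generation comparison, which is the distinction the assertion above relies on.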
 // ---------------------------------------------------------------------------
 // P0: Channel backpressure — slow receiver, fast sender
 //
 // Sender produces messages faster than the receiver consumes them. The
 // channel must not lose messages or deadlock regardless of how deep the
 // queue grows. Tests unbounded channel growth and lossless delivery (via a sum check).
 // ---------------------------------------------------------------------------
 #[test]
 fn channel_backpressure_no_loss() {
    const MESSAGES: u64 = 10_000;
    let received = Arc::new(AtomicU64::new(0));
    let rc = received.clone();
    rt_par().run(move || {
        let (tx, rx) = channel::<u64>();
        let receiver = spawn(move || {
            let mut sum = 0u64;
            for _ in 0..MESSAGES {
                sum += rx.recv().unwrap();
            }
            rc.store(sum, Ordering::SeqCst);
        });
        // Send all messages from the parent without waiting.
        for i in 0..MESSAGES {
            tx.send(i).unwrap();
        }
        receiver.join().unwrap();
    });
    // Sum of 0..MESSAGES
    let expected: u64 = (0..MESSAGES).sum();
    assert_eq!(received.load(Ordering::SeqCst), expected);
 }
--- a/tests/timer.rs
+++ b/tests/timer.rs
@@ -114,3 +114,94 @@ fn many_concurrent_sleepers_all_wake() {
    });
    assert_eq!(counter.load(std::sync::atomic::Ordering::SeqCst), 20);
 }
 // ---------------------------------------------------------------------------
 // Direct tests on the Timers data structure. No scheduler involved — these
 // cover the new Reason machinery without needing a Mutex implementation.
 // ---------------------------------------------------------------------------
 use smarm::pid::Pid;
 use smarm::timer::{Reason, TimerTarget, Timers};
 struct RecordingTarget {
    calls: Mutex<Vec<(Pid, u64)>>,
 }
 impl TimerTarget for RecordingTarget {
    fn on_timeout(&self, pid: Pid, seq: u64) {
        self.calls.lock().unwrap().push((pid, seq));
    }
 }
 #[test]
 fn timers_pop_due_returns_entries_in_deadline_order() {
    let mut t = Timers::new();
    let now = Instant::now();
    // Insert out of order; pop_due should hand them back sorted by deadline.
    t.insert_sleep(now + Duration::from_millis(30), Pid::new(0, 0));
    t.insert_sleep(now + Duration::from_millis(10), Pid::new(1, 0));
    t.insert_sleep(now + Duration::from_millis(20), Pid::new(2, 0));
    // Advance past all of them.
    let due = t.pop_due(now + Duration::from_millis(50));
    let pids: Vec<u32> = due.iter().map(|e| e.pid.index()).collect();
    assert_eq!(pids, vec![1, 2, 0]);
    assert!(t.is_empty());
 }
 #[test]
 fn timers_only_pop_entries_whose_deadline_has_passed() {
    let mut t = Timers::new();
    let now = Instant::now();
    t.insert_sleep(now + Duration::from_millis(5), Pid::new(0, 0));
    t.insert_sleep(now + Duration::from_millis(100), Pid::new(1, 0));
    let due = t.pop_due(now + Duration::from_millis(20));
    assert_eq!(due.len(), 1);
    assert_eq!(due[0].pid.index(), 0);
    assert!(!t.is_empty());
    // The unpopped entry's deadline is still visible.
    assert!(t.peek_deadline().is_some());
 }
 #[test]
 fn timers_mix_sleep_and_wait_timeout_reasons() {
    let mut t = Timers::new();
    let target = Arc::new(RecordingTarget { calls: Mutex::new(Vec::new()) });
    let now = Instant::now();
    t.insert_sleep(now + Duration::from_millis(5), Pid::new(0, 0));
    t.insert(
        now + Duration::from_millis(10),
        Pid::new(1, 0),
        Reason::WaitTimeout { target: target.clone(), wait_seq: 42 },
    );
    let due = t.pop_due(now + Duration::from_millis(20));
    assert_eq!(due.len(), 2);
    // Order: Sleep (5ms) first, WaitTimeout (10ms) second.
    match &due[0].reason {
        Reason::Sleep => {}
        _ => panic!("first entry should be a Sleep"),
    }
    match &due[1].reason {
        Reason::WaitTimeout { wait_seq, .. } => assert_eq!(*wait_seq, 42),
        _ => panic!("second entry should be a WaitTimeout"),
    }
 }
 #[test]
 fn same_deadline_entries_pop_in_insertion_order() {
    // The `seq` tiebreaker ensures that entries inserted with the same
    // deadline pop in the order they were inserted.
    let mut t = Timers::new();
    let now = Instant::now();
    let d = now + Duration::from_millis(10);
    t.insert_sleep(d, Pid::new(0, 0));
    t.insert_sleep(d, Pid::new(1, 0));
    t.insert_sleep(d, Pid::new(2, 0));
    let due = t.pop_due(now + Duration::from_millis(20));
    let pids: Vec<u32> = due.iter().map(|e| e.pid.index()).collect();
    assert_eq!(pids, vec![0, 1, 2]);
 }
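The deadline ordering and insertion-order tiebreak exercised by the tests above can be sketched with a min-heap keyed on `(deadline, seq)`. This is a minimal model under stated assumptions: `TimerQueue` and its methods are hypothetical, and nothing here is taken from the real `Timers` implementation, whose internals are not shown in this diff.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Hypothetical timer queue. Entries are (deadline, seq, id); `Reverse`
/// turns the max-heap into a min-heap so the earliest deadline pops first.
struct TimerQueue {
    heap: BinaryHeap<Reverse<(Instant, u64, u32)>>,
    next_seq: u64,
}

impl TimerQueue {
    fn new() -> Self {
        TimerQueue { heap: BinaryHeap::new(), next_seq: 0 }
    }

    /// Insert an entry, stamping it with a monotonically increasing `seq`.
    /// Equal deadlines then compare by `seq`, i.e. by insertion order.
    fn insert(&mut self, deadline: Instant, id: u32) {
        let seq = self.next_seq;
        self.next_seq += 1;
        self.heap.push(Reverse((deadline, seq, id)));
    }

    /// Pop every entry whose deadline is at or before `now`, earliest first.
    fn pop_due(&mut self, now: Instant) -> Vec<u32> {
        let mut due = Vec::new();
        while let Some(&Reverse((deadline, _, _))) = self.heap.peek() {
            if deadline > now {
                break;
            }
            if let Some(Reverse((_, _, id))) = self.heap.pop() {
                due.push(id);
            }
        }
        due
    }

    fn is_empty(&self) -> bool {
        self.heap.is_empty()
    }
}

/// Two entries share a deadline; a third is earlier but inserted last.
fn example() -> Vec<u32> {
    let now = Instant::now();
    let shared = now + Duration::from_millis(10);
    let mut q = TimerQueue::new();
    q.insert(shared, 0);
    q.insert(shared, 1);
    q.insert(now + Duration::from_millis(5), 2);
    q.pop_due(now + Duration::from_millis(20))
}
```

Putting `seq` second in the tuple means it only decides the order when two deadlines compare equal.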
Author	SHA1	Message	Date
smarm	d432349f99	Update the documentation	2026-05-25 22:14:07 +02:00
smarm	2b85ef60b2	Make preemption knobs configurable; fix unused-variable warnings Add `Config::alloc_interval()` and `Config::timeslice_cycles()` so callers can tune preemption sensitivity at runtime. The values flow through `RuntimeInner` and are written into per-scheduler-thread locals via a new `configure_preempt()` call at thread startup, keeping the hot path free of cross-thread coherency traffic. Fix unused-variable warnings in channel.rs by inlining `current_pid()` directly into `te!` macro arguments — since the no-op macro arm never evaluates its argument, no binding is needed at the call site. Clean up a handful of dead imports exposed by the refactor.	2026-05-25 21:52:16 +02:00
Bench	3da6ffaa77	benches: expose preemption knobs + sweep runner Config API changes (src/preempt.rs, src/runtime.rs): - preempt: promote ALLOC_INTERVAL and TIMESLICE_CYCLES from bare consts to DEFAULT_ALLOC_INTERVAL / DEFAULT_TIMESLICE_CYCLES; store active values in thread-locals set on each actor resume so multiple runtimes can use different settings concurrently. - runtime: add alloc_interval / timeslice_cycles fields to Config; add Config::alloc_interval(n) and Config::timeslice_cycles(c) builder methods; thread the values through RuntimeInner to the reset_timeslice() call in schedule_loop. Bench changes: - Add bench_cfg(threads) helper to general/tokio_favored/smarm_favored that wraps Config::exact and reads SMARM_ALLOC_INTERVAL / SMARM_TIMESLICE_CYCLES env vars, so the sweep script can vary knobs without recompiling. Sweep tooling (benches/sweep.py): - 'run': run the 3-file bench suite once; --save-baseline persists JSON - 'regress': compare current run against baseline.json, exit 1 on any bench that regresses >10% vs stored medians - 'sweep': run the full SWEEP_GRID (10 points), print comparison table, optional --save-csv; binaries pre-built so no recompile per point Sweep results (10-point grid, 1-CPU sandbox): - The preemption knobs have very little effect on this single-CPU machine. Most benches move <5% across the entire grid. - Longer timeslices (tc=600k, tc=1200k) reliably hurt spawn_storm_busy (+11-15%) and catch_unwind_panics (+10-12%) because actors hold the scheduler mutex longer per timeslice, stalling the storm of joinable tasks. - Shorter timeslices (tc=150k) give a small improvement on many_timers (-3-4%) and a wash everywhere else. - yield_in_hot_loop and uncontended_channel are essentially flat across all knobs; both are scheduling-dominated and call yield_now explicitly, so the RDTSC-driven preemption path is irrelevant. - Conclusion: the knobs matter primarily under contention (multi-core). Re-run sweep on a multi-core machine before drawing tuning conclusions.	2026-05-25 13:04:58 +00:00
Bench	6d1c59fb99	benches: baseline results Two compile fixes: - tokio_favored.rs bench_mpsc_smarm: consumer spawn closure returned u64 via bare 'count' tail expression; smarm::Runtime::run() requires FnOnce()->(). Fixed to 'let _ = count;'. Same fix on the consumer.join() call site. - smarm_favored.rs bench_unc_smarm: same pattern, same fix. Baseline run: Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, smarm 0.3.0, no RUSTFLAGS. Single-CPU sandbox — N-thread rows identical to 1-thread; scaling sweep limited to 1 thread. Notable findings: - deep_recursion: tokio wins (22 vs 62 us); mmap stack alloc cost dominates for single-use actors at depth 500. - yield_in_hot_loop: tokio wins (138 vs 182 ms); smarm mutex overhead on yield_now exceeds expected naked-switch advantage on 1 CPU. - mpsc_contention/uncontended_channel/catch_unwind_panics: smarm wins as predicted. - spawn_storm_busy: smarm 47x slower; global mutex saturated by bg yielders.	2026-05-25 13:04:54 +00:00
Bench	4b348d12be	docs: BENCHMARKS_AND_TUNING.md — bench results, knob recommendations, arch guidance	2026-05-25 13:04:50 +00:00
smarm	aeacaf6118	fix: stress testing & stability (v0.6.5) Improve reliability under high load: - tests/stress.rs: New comprehensive stress test suite (448 lines) - Fine-tune I/O & runtime scheduling edge cases - Pin versions & fix MSRV compatibility	2026-05-24 07:03:45 +00:00
Claude	978678a46e	feat: full runtime redesign (v0.6) Complete rewrite with improved architecture & correctness: - src/runtime.rs: Simplified task scheduling with proper state transitions - src/scheduler.rs: Decoupled from runtime, pure task queue logic - src/io.rs, src/mutex.rs: Refactored for clarity & performance - New actor model framework (src/actor.rs, src/context.rs) - Channel primitives (src/channel.rs) & process IDs (src/pid.rs) - Preemption framework (src/preempt.rs) for fair timeslicing - Expanded benchmarks & tests (multi_scheduler, primes, runtime)	2026-05-23 16:09:35 +00:00
Claude	078447539c	chore: reset working tree (v0.5) Temporary commit clearing working tree for v0.6 rebuild	2026-05-23 16:09:35 +00:00
Claude	e9fdbb1160	refactor: centralize runtime logic (v0.4) Extract scheduler responsibilities into a dedicated Runtime component: - src/runtime.rs: New centralized control flow (669 lines) - src/scheduler.rs: Simplified to task queue & preemption management - tests/runtime.rs: Comprehensive runtime test suite - benches/multi_scheduler.rs: Multi-runtime scheduling benchmarks - Improves modularity and enables per-runtime configuration	2026-05-23 16:09:32 +00:00
Claude	8cbef1dfc1	feat: I/O and mutex support (v0.3) Add epoll-based non-blocking I/O and kernel-like mutexes: - src/io.rs: Complete epoll backend with timeout & error handling - src/mutex.rs: Fair mutex with waiter queues & parking integration - Enhanced scheduler to support synchronous I/O blocking - Comprehensive test suites for I/O (epoll) and mutex behavior - Documentation: LOOM.md concurrency model & README	2026-05-23 16:09:29 +00:00
Claude	d3ab81b833	preempt: explicit check!() macro for no-alloc loops Stable Rust emits stack probes inline (subq/movq/jne loop) rather than calling __rust_probestack, so there's no transparent hook for stack- frame preemption. Override of __rust_probestack links cleanly but never runs. Falling back to an explicit check!() that users drop into hot compute loops. check!() decrements the same ALLOC_COUNT counter as the heap path, so both event sources fire timeslice checks at the same rate. Documents the prep-to-park invariant on maybe_preempt — library code that registers a wakeup and then parks must keep that window alloc-free and check-free, or a preemption-driven yield in the middle would lose the wakeup.	2026-05-22 05:37:04 +00:00
Claude	51bfccc3c2	feat: I/O and mutex support (v0.3) Add epoll-based non-blocking I/O and kernel-like mutexes: - src/io.rs: Complete epoll backend with timeout & error handling - src/mutex.rs: Fair mutex with waiter queues & parking integration - Enhanced scheduler to support synchronous I/O blocking - Comprehensive test suites for I/O (epoll) and mutex behavior - Documentation: LOOM.md concurrency model & README	2026-05-22 05:32:24 +00:00
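The `preempt: explicit check!() macro` commit in the log above says that `check!()` decrements the same counter as the heap path, so both event sources trigger timeslice checks at the same rate. A minimal sketch of that idea follows. Everything in it is an assumption made for illustration: the interval value, the thread-local layout, the `tick` helper, and the `TIMESLICE_CHECKS` tally that stands in for the real timeslice check.

```rust
use std::cell::Cell;

/// Hypothetical interval, kept small here so the behaviour is easy to follow.
const ALLOC_INTERVAL: u32 = 4;

thread_local! {
    /// Shared countdown, decremented by every event source.
    static COUNTDOWN: Cell<u32> = Cell::new(ALLOC_INTERVAL);
    /// Stand-in for the real timeslice check: counts how often it would run.
    static TIMESLICE_CHECKS: Cell<u32> = Cell::new(0);
}

/// One event: decrement the shared countdown and, when it reaches zero,
/// reset it and run the (stand-in) timeslice check.
fn tick() {
    COUNTDOWN.with(|c| {
        let left = c.get() - 1;
        if left == 0 {
            c.set(ALLOC_INTERVAL);
            TIMESLICE_CHECKS.with(|n| n.set(n.get() + 1));
        } else {
            c.set(left);
        }
    });
}

/// Hypothetical allocator hook: the heap path feeds the same countdown.
fn on_alloc() {
    tick();
}

/// Explicit check for loops that never allocate.
macro_rules! check {
    () => {
        tick()
    };
}

fn timeslice_checks() -> u32 {
    TIMESLICE_CHECKS.with(|n| n.get())
}

/// Mix both event sources; every fourth event triggers one check.
fn example() -> u32 {
    let before = timeslice_checks();
    for i in 0..8 {
        if i % 2 == 0 { on_alloc() } else { check!() }
    }
    timeslice_checks() - before
}
```

Because both paths call the same `tick`, a loop that mixes allocations and explicit checks reaches the timeslice check after the same number of events as a loop that uses only one of them.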