benches: baseline results

Two compile fixes:
- tokio_favored.rs bench_mpsc_smarm: consumer spawn closure returned u64 via bare 'count' tail expression; smarm::Runtime::run() requires FnOnce()->(). Fixed to 'let _ = count;'. Same fix on the consumer.join() call site.
- smarm_favored.rs bench_unc_smarm: same pattern, same fix.

Baseline run: Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, smarm 0.3.0, no RUSTFLAGS. Single-CPU sandbox: N-thread rows identical to 1-thread; scaling sweep limited to 1 thread.

Notable findings:
- deep_recursion: tokio wins (22 vs 62 us); mmap stack alloc cost dominates for single-use actors at depth 500.
- yield_in_hot_loop: tokio wins (138 vs 182 ms); smarm mutex overhead on yield_now exceeds expected naked-switch advantage on 1 CPU.
- mpsc_contention/uncontended_channel/catch_unwind_panics: smarm wins as predicted.
- spawn_storm_busy: smarm 47x slower; global mutex saturated by bg yielders.
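
The compile fix described above can be illustrated in isolation. The following is a minimal sketch, not smarm's actual API: `run_unit` is a hypothetical stand-in for any entry point bounded by `FnOnce() -> ()`, and `std::sync::mpsc` is used in place of smarm's channel so the example has no external dependencies.

```rust
use std::sync::mpsc;

/// Hypothetical stand-in for an API that only accepts closures returning `()`.
/// This is NOT smarm's real signature; the commit message only states that
/// `FnOnce() -> ()` is required.
fn run_unit<F: FnOnce()>(f: F) {
    f();
}

/// Drains a channel inside a unit-returning closure and reports the tally
/// through a captured variable instead of a return value.
fn drain_count(msgs: u64) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    for i in 0..msgs {
        tx.send(i).unwrap();
    }
    drop(tx); // close the channel so recv() eventually returns Err

    let mut seen = 0u64;
    run_unit(|| {
        let mut count = 0u64;
        while rx.recv().is_ok() {
            count += 1;
        }
        // A bare `count` as the tail expression would make this closure
        // return u64 and fail the `FnOnce()` bound. Ending with a statement
        // (here an assignment; `let _ = count;` works too) keeps it `()`.
        seen = count;
    });
    seen
}
```

Calling `drain_count(5)` returns the number of messages received. The benchmarks discard the tally with `let _ = count;` because they report the constant message count rather than the measured one.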
2026-05-25 13:04:54 +00:00
parent 4b348d12be
commit 6d1c59fb99
8 changed files with 1205 additions and 0 deletions
@@ -0,0 +1,44 @@
+smarm general benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
+
+================================================================================
+  chained_spawn: depth 1000
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |         1000 |       7136 |       6929 |       8347
+            smarm 1-thread |         1000 |       6979 |       6790 |       7364
+      tokio current_thread |         1000 |        113 |        112 |        322
+        tokio multi-thread |         1000 |        176 |        170 |        355
+
+================================================================================
+  yield_many: 200 tasks × 1000 yields
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |       200000 |      40079 |      39606 |      41913
+            smarm 1-thread |       200000 |      40073 |      39298 |      43173
+      tokio current_thread |       200000 |      14571 |      14430 |      14670
+        tokio multi-thread |       200000 |      14044 |      13306 |      14432
+
+================================================================================
+  fan_out_compute: primes in [2, 400000) across 64
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        33860 |      19347 |      19185 |      19703
+            smarm 1-thread |        33860 |      19461 |      19202 |      21172
+      tokio current_thread |        33860 |      18616 |      18553 |      18987
+        tokio multi-thread |        33860 |      18905 |      18755 |      19035
+
+================================================================================
+  ping_pong_oneshot: 1000 rounds
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |         1000 |      13731 |      13555 |      15545
+            smarm 1-thread |         1000 |      14176 |      13870 |      14892
+      tokio current_thread |         1000 |        828 |        788 |        939
+        tokio multi-thread |         1000 |       3342 |       3233 |       3624
@@ -0,0 +1,34 @@
+smarm multi-scheduler benchmarks
+available parallelism: 1 threads
+PRIME_N=400000, WORKERS=64, PING_ROUNDS=10000, SPAWN_COUNT=1000
+
+================================================================================
+  Fan-out/fan-in: count primes in [2, 400000) across 64 workers
+================================================================================
+               runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+     baseline (serial) |        33860 |      18581 |      18519 |      18905
+   smarm single-thread |        33860 |      19467 |      19354 |      22082
+        smarm 1-thread |        33860 |      19345 |      19287 |      19653
+  tokio current_thread |        33860 |      18681 |      18591 |      18982
+    tokio multi-thread |        33860 |      18948 |      18726 |      19212
+
+================================================================================
+  Ping-pong: 10000 round-trips between two actors
+================================================================================
+               runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+   smarm single-thread |        10000 |       2547 |       2473 |       2841
+        smarm 1-thread |        10000 |       2546 |       2518 |       2702
+  tokio current_thread |        10000 |       1221 |       1168 |       1366
+    tokio multi-thread |        10000 |       1487 |       1316 |       2331
+
+================================================================================
+  Spawn throughput: 1000 actors spawned and joined
+================================================================================
+               runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+   smarm single-thread |         1000 |       8934 |       8066 |      12204
+        smarm 1-thread |         1000 |       8102 |       8041 |      10849
+  tokio current_thread |         1000 |        212 |        210 |        331
+    tokio multi-thread |         1000 |        330 |        301 |        604
@@ -0,0 +1,7 @@
+Counting primes in [2, 200000) across 16 workers, 5 iterations each
+
+     runtime |    primes found |           median |             min |             max
+--------------------------------------------------------------------------------
+    baseline | primes:  17984 | median:     7244 µs | min:     7231 µs | max:     7509 µs
+       smarm | primes:  17984 | median:     7592 µs | min:     7505 µs | max:     8130 µs
+       tokio | primes:  17984 | median:     7263 µs | min:     7225 µs | max:     9067 µs
@@ -0,0 +1,40 @@
+smarm smarm-favored benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
+
+================================================================================
+  deep_recursion: depth 500
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |            1 |         62 |         59 |        682
+            smarm 1-thread |            1 |         71 |         61 |        210
+      tokio current_thread |            1 |         22 |         22 |         23
+        tokio multi-thread |            1 |         44 |         38 |         79
+
+================================================================================
+  yield_in_hot_loop: 2 actors × 500000 yields (single thread)
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |      1000000 |     182177 |     180380 |     184410
+      tokio current_thread |      1000000 |     138335 |     136097 |     141196
+
+================================================================================
+  uncontended_channel: 1→1, 1000000 msgs (single thread)
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |      1000000 |      31473 |      28719 |      33113
+      tokio current_thread |      1000000 |      51925 |      51205 |      53043
+
+================================================================================
+  catch_unwind_panics: 10000 tasks, 50% panic
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     112306 |     109702 |     119859
+            smarm 1-thread |        10000 |     114305 |     112030 |     121326
+      tokio current_thread |        10000 |     151443 |     150949 |     153800
+        tokio multi-thread |        10000 |     161344 |     160385 |     167573
@@ -0,0 +1,42 @@
+smarm tokio-favored benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
+
+================================================================================
+  spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     105512 |     102322 |     120552
+            smarm 1-thread |        10000 |     107113 |     104048 |     112377
+      tokio current_thread |        10000 |       2222 |       2124 |       2506
+        tokio multi-thread |        10000 |       4546 |       3833 |       7305
+
+================================================================================
+  mpsc_contention: 32 producers × 10000 msgs → 1 consumer
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |       320000 |      10456 |      10331 |      10639
+            smarm 1-thread |       320000 |      10395 |       9201 |      10549
+      tokio current_thread |       320000 |      17348 |      16639 |      19061
+        tokio multi-thread |       320000 |      18628 |      17499 |      19298
+
+================================================================================
+  many_timers: 10000 actors sleeping 1–10 ms
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     120242 |     116239 |     127200
+            smarm 1-thread |        10000 |     121023 |     113997 |     127826
+      tokio current_thread |        10000 |      13581 |      13182 |      14415
+        tokio multi-thread |        10000 |      14266 |      14084 |      14843
+
+================================================================================
+  multi_thread_scaling: primes in [2, 400000) across 64 workers
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        33860 |      19852 |      19601 |      22679
+      tokio multi 1-thread |        33860 |      19638 |      18994 |      20102
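
The "47x slower" figure quoted in the commit message for spawn_storm_busy can be reproduced from the median columns. A small sketch of that arithmetic follows; `slowdown` and `as_multiple` are illustrative helper names and are not part of the benchmark sources.

```rust
/// Ratio of two median timings expressed in the same unit.
fn slowdown(slower_us: u64, faster_us: u64) -> f64 {
    slower_us as f64 / faster_us as f64
}

/// Formats a ratio as a rounded multiple, e.g. "47x".
fn as_multiple(ratio: f64) -> String {
    format!("{}x", ratio.round() as u64)
}
```

With the spawn_storm_busy medians (105512 µs for smarm 1-thread, 2222 µs for tokio current_thread), `slowdown(105_512, 2_222)` is roughly 47.5, and `as_multiple` renders it as `47x`.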
@@ -0,0 +1,391 @@
+//! Benchmarks where smarm's design has a structural advantage.
+//!
+//! These exist to show what the green-thread + stackful model buys you. The
+//! single-thread numbers are the most interesting ones — they isolate the
+//! per-switch / per-task cost from any contention story.
+//!
+//! Workloads:
+//!   9.  deep_recursion       — actor recurses 500 deep then returns. In
+//!                              smarm this is plain stack recursion on the
+//!                              growable mmap'd stack. In tokio, async fn
+//!                              can't directly recurse — each level must
+//!                              `Box::pin` its future. We measure both.
+//!   10. yield_in_hot_loop    — 2 actors ping yield_now back and forth 500k
+//!                              times. Pure context-switch cost; no
+//!                              channels, no allocation, no contention.
+//!                              Smarm's switch is ~6 GPRs + xmm save and a
+//!                              `ret`; tokio's is poll → state-machine →
+//!                              schedule.
+//!   11. uncontended_channel  — single producer, single consumer, 1M msgs,
+//!                              single-threaded runtime. With no
+//!                              cross-thread contention, smarm's
+//!                              Arc<Mutex<>> channel is essentially free,
+//!                              and the green-thread switch should beat
+//!                              tokio's future polling overhead.
+//!   12. catch_unwind_panics  — spawn 10k tasks; half panic, half succeed.
+//!                              Supervisor handles each. Exploratory — if
+//!                              there's no real gap, drop this one.
+
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::Arc;
+use std::time::Instant;
+
+// ---------------------------------------------------------------------------
+// Shared harness
+// ---------------------------------------------------------------------------
+
+const ITERS: u32 = 15;
+
+fn available_threads() -> usize {
+    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
+}
+
+fn print_header(title: &str) {
+    println!("\n{}", "=".repeat(80));
+    println!("  {title}");
+    println!("{}", "=".repeat(80));
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        "runtime", "result", "median µs", "min µs", "max µs"
+    );
+    println!("{}", "-".repeat(80));
+}
+
+fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
+    let mut times = Vec::new();
+    let mut last = 0u64;
+    let _ = f(); // warmup
+    for _ in 0..n {
+        let (v, t) = f();
+        times.push(t);
+        last = v;
+    }
+    times.sort_unstable();
+    let median = times[times.len() / 2];
+    let min = *times.iter().min().unwrap();
+    let max = *times.iter().max().unwrap();
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        name, last, median, min, max
+    );
+}
+
+// ---------------------------------------------------------------------------
+// 9. deep_recursion — 500 levels deep
+// ---------------------------------------------------------------------------
+
+// Each recursive frame holds an `&AtomicU64`, a `u64`, plus prologue/spill:
+// conservatively ~64 B/frame in release builds. Smarm actor stacks are a fixed
+// 64 KiB, so 500 levels (~32 KiB) leave comfortable headroom while still being
+// deep enough to contrast plain stack recursion with Box::pin recursion.
+const RECURSE_DEPTH: u64 = 500;
+
+fn bench_recurse_smarm(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        // Plain Rust recursion on the actor's own (growable) stack.
+        fn recurse(c: &AtomicU64, n: u64) -> u64 {
+            if n == 0 {
+                c.fetch_add(1, Ordering::Relaxed);
+                0
+            } else {
+                1 + recurse(c, n - 1)
+            }
+        }
+        let h = smarm::spawn(move || {
+            let _ = recurse(&t2, RECURSE_DEPTH);
+        });
+        h.join().unwrap();
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_recurse_tokio_current() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let c2 = counter.clone();
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        // async fn can't self-recurse; each level returns a Box::pin'd future.
+        // This is the canonical workaround a real user would write.
+        fn recurse(
+            c: Arc<AtomicU64>,
+            n: u64,
+        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64>>> {
+            Box::pin(async move {
+                if n == 0 {
+                    c.fetch_add(1, Ordering::Relaxed);
+                    0
+                } else {
+                    1 + recurse(c, n - 1).await
+                }
+            })
+        }
+        let h = tokio::task::spawn_local(async move {
+            let _ = recurse(c2, RECURSE_DEPTH).await;
+        });
+        let _ = h.await;
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_recurse_tokio_multi() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let c2 = counter.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        fn recurse(
+            c: Arc<AtomicU64>,
+            n: u64,
+        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64> + Send>> {
+            Box::pin(async move {
+                if n == 0 {
+                    c.fetch_add(1, Ordering::Relaxed);
+                    0
+                } else {
+                    1 + recurse(c, n - 1).await
+                }
+            })
+        }
+        let h = tokio::spawn(async move {
+            let _ = recurse(c2, RECURSE_DEPTH).await;
+        });
+        let _ = h.await;
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 10. yield_in_hot_loop — 2 actors, 500k yields each, single thread
+// ---------------------------------------------------------------------------
+
+const HOT_YIELDS: u64 = 500_000;
+
+fn bench_hot_smarm() -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(1)).run(|| {
+        let ha = smarm::spawn(|| {
+            for _ in 0..HOT_YIELDS {
+                smarm::yield_now();
+            }
+        });
+        let hb = smarm::spawn(|| {
+            for _ in 0..HOT_YIELDS {
+                smarm::yield_now();
+            }
+        });
+        ha.join().unwrap();
+        hb.join().unwrap();
+    });
+    (HOT_YIELDS * 2, start.elapsed().as_micros())
+}
+
+fn bench_hot_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let ha = tokio::task::spawn_local(async move {
+            for _ in 0..HOT_YIELDS {
+                tokio::task::yield_now().await;
+            }
+        });
+        let hb = tokio::task::spawn_local(async move {
+            for _ in 0..HOT_YIELDS {
+                tokio::task::yield_now().await;
+            }
+        });
+        let _ = ha.await;
+        let _ = hb.await;
+    });
+    (HOT_YIELDS * 2, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 11. uncontended_channel — 1 producer, 1 consumer, 1M msgs, single-threaded
+// ---------------------------------------------------------------------------
+
+const UNCONT_MSGS: u64 = 1_000_000;
+
+fn bench_unc_smarm() -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(1)).run(|| {
+        let (tx, rx) = smarm::channel::<u64>();
+        let consumer = smarm::spawn(move || {
+            let mut count = 0u64;
+            while rx.recv().is_ok() {
+                count += 1;
+            }
+            let _ = count; // discard the tally so this closure returns ()
+        });
+        let producer = smarm::spawn(move || {
+            for i in 0..UNCONT_MSGS {
+                tx.send(i).unwrap();
+            }
+            // tx drops here, closing the channel.
+        });
+        producer.join().unwrap();
+        let _ = consumer.join().unwrap();
+    });
+    (UNCONT_MSGS, start.elapsed().as_micros())
+}
+
+fn bench_unc_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let consumer = tokio::task::spawn_local(async move {
+            let mut count = 0u64;
+            while rx.recv().await.is_some() {
+                count += 1;
+            }
+            count
+        });
+        let producer = tokio::task::spawn_local(async move {
+            for i in 0..UNCONT_MSGS {
+                tx.send(i).unwrap();
+            }
+        });
+        let _ = producer.await;
+        let _ = consumer.await;
+    });
+    (UNCONT_MSGS, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 12. catch_unwind_panics — 10k tasks, half panic
+// ---------------------------------------------------------------------------
+
+const PANIC_TASKS: u64 = 10_000;
+
+fn bench_panic_smarm(threads: usize) -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(smarm::spawn(move || {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.join() {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+fn bench_panic_tokio_current() -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(tokio::task::spawn_local(async move {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.await {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+fn bench_panic_tokio_multi() -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(tokio::spawn(async move {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.await {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// main
+// ---------------------------------------------------------------------------
+
+fn main() {
+    let n = available_threads();
+    println!("smarm smarm-favored benchmarks");
+    println!("available parallelism: {n} threads");
+    println!("ITERS={ITERS} (+1 warmup, discarded)");
+    println!(
+        "RECURSE_DEPTH={RECURSE_DEPTH}, HOT_YIELDS={HOT_YIELDS}×2, \
+         UNCONT_MSGS={UNCONT_MSGS}, PANIC_TASKS={PANIC_TASKS}"
+    );
+
+    // ---- 9. deep_recursion ----
+    print_header(&format!("deep_recursion: depth {RECURSE_DEPTH}"));
+    run_n("smarm 1-thread", ITERS, || bench_recurse_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_recurse_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_recurse_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_recurse_tokio_multi);
+
+    // ---- 10. yield_in_hot_loop ----
+    print_header(&format!("yield_in_hot_loop: 2 actors × {HOT_YIELDS} yields (single thread)"));
+    run_n("smarm 1-thread", ITERS, bench_hot_smarm);
+    run_n("tokio current_thread", ITERS, bench_hot_tokio_current);
+
+    // ---- 11. uncontended_channel ----
+    print_header(&format!("uncontended_channel: 1→1, {UNCONT_MSGS} msgs (single thread)"));
+    run_n("smarm 1-thread", ITERS, bench_unc_smarm);
+    run_n("tokio current_thread", ITERS, bench_unc_tokio_current);
+
+    // ---- 12. catch_unwind_panics ----
+    print_header(&format!("catch_unwind_panics: {PANIC_TASKS} tasks, 50% panic"));
+    run_n("smarm 1-thread", ITERS, || bench_panic_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_panic_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_panic_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_panic_tokio_multi);
+}
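
The deep_recursion workload above contrasts plain stack recursion with `Box::pin` recursion. The following is a dependency-free sketch of that contrast, under stated assumptions: `NoopWake` and `poll_to_completion` are hypothetical helpers standing in for an executor (the benchmark itself uses tokio), and the shared counter is omitted for brevity.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Waker that does nothing; sufficient because the futures below never
/// return `Poll::Pending`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Hypothetical minimal driver: polls a future in a loop until it is ready.
/// A real executor would park instead of spinning on `Pending`.
fn poll_to_completion<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

/// Async recursion: each level heap-allocates its future via `Box::pin`,
/// which gives the recursive return type a known size.
fn recurse_boxed(n: u64) -> Pin<Box<dyn Future<Output = u64>>> {
    Box::pin(async move {
        if n == 0 {
            0
        } else {
            1 + recurse_boxed(n - 1).await
        }
    })
}

/// Plain recursion: each level is an ordinary stack frame, no allocation.
fn recurse_plain(n: u64) -> u64 {
    if n == 0 {
        0
    } else {
        1 + recurse_plain(n - 1)
    }
}
```

Both `poll_to_completion(recurse_boxed(500))` and `recurse_plain(500)` compute the same depth; the boxed variant performs one heap allocation per level, while the plain variant only consumes stack.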
@@ -0,0 +1,470 @@
+//! Benchmarks where tokio's design has a structural advantage.
+//!
+//! These exist to *measure* the cost of smarm's design choices, not to flatter
+//! either runtime. Expect tokio to win these; the value is in knowing by how
+//! much, and in catching regressions where the gap widens.
+//!
+//! Workloads:
+//!   5. spawn_storm_busy    — keep N workers busy with yielding tasks, then
+//!                            spawn 10k zero-work tasks and join. Adapted from
+//!                            tokio's `spawn_many_remote_busy1`. Tokio's
+//!                            work-stealing deques + per-worker LIFO slot
+//!                            should beat smarm's single global Mutex<>
+//!                            run queue.
+//!   6. mpsc_contention     — 32 producer actors, 1 consumer, 10k messages
+//!                            each. Tokio's mpsc is lock-free on the hot path;
+//!                            smarm's channel is Arc<Mutex<Inner>> per channel
+//!                            *and* takes the runtime mutex on each unpark.
+//!   7. many_timers         — 10k actors each sleep for a random short
+//!                            duration (1–10 ms), all wake within a tight
+//!                            window. Tokio's per-worker sharded timer wheel
+//!                            vs smarm's single shared min-heap (and single
+//!                            drain-lock winner).
+//!   8. multi_thread_scaling— primes again, but sweep thread count 1, 2, 4,
+//!                            available_parallelism(). Smarm's mutex ceiling
+//!                            should show up as soon as scheduling overhead
+//!                            is non-trivial relative to per-actor work.
+
+use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
+use std::sync::Arc;
+use std::time::{Duration, Instant};
+
+// ---------------------------------------------------------------------------
+// Shared harness
+// ---------------------------------------------------------------------------
+
+const ITERS: u32 = 15;
+
+fn available_threads() -> usize {
+    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
+}
+
+fn print_header(title: &str) {
+    println!("\n{}", "=".repeat(80));
+    println!("  {title}");
+    println!("{}", "=".repeat(80));
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        "runtime", "result", "median µs", "min µs", "max µs"
+    );
+    println!("{}", "-".repeat(80));
+}
+
+fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
+    let mut times = Vec::new();
+    let mut last = 0u64;
+    let _ = f(); // warmup
+    for _ in 0..n {
+        let (v, t) = f();
+        times.push(t);
+        last = v;
+    }
+    times.sort_unstable();
+    let median = times[times.len() / 2];
+    let min = *times.iter().min().unwrap();
+    let max = *times.iter().max().unwrap();
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        name, last, median, min, max
+    );
+}
+
+// ---------------------------------------------------------------------------
+// 5. spawn_storm_busy — workers loaded, then storm of zero-work spawns
+// ---------------------------------------------------------------------------
+
+const STORM_BACKGROUND: u64 = 8;   // number of background "busy" actors
+const STORM_SPAWN: u64 = 10_000;   // zero-work spawns to time
+
+fn bench_storm_smarm(threads: usize) -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        // Background actors: yield in a tight loop until told to stop.
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(smarm::spawn(move || {
+                while !s.load(Ordering::Relaxed) {
+                    smarm::yield_now();
+                }
+            }));
+        }
+
+        // Storm: spawn 10k zero-work actors and join them all.
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(smarm::spawn(move || {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+
+        // Tear down background.
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { h.join().unwrap(); }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_storm_tokio_current() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(tokio::task::spawn_local(async move {
+                while !s.load(Ordering::Relaxed) {
+                    tokio::task::yield_now().await;
+                }
+            }));
+        }
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(tokio::task::spawn_local(async move {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { let _ = h.await; }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_storm_tokio_multi() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(tokio::spawn(async move {
+                while !s.load(Ordering::Relaxed) {
+                    tokio::task::yield_now().await;
+                }
+            }));
+        }
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(tokio::spawn(async move {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { let _ = h.await; }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 6. mpsc_contention — 32 producers × 10k msgs into 1 consumer
+// ---------------------------------------------------------------------------
+
+const MPSC_PRODUCERS: u64 = 32;
+const MPSC_PER_PRODUCER: u64 = 10_000;
+
+fn bench_mpsc_smarm(threads: usize) -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(|| {
+        let (tx, rx) = smarm::channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(smarm::spawn(move || {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx); // drop the original sender so the channel closes once every producer clone is dropped
+        let consumer = smarm::spawn(move || {
+            let mut count = 0u64;
+            while rx.recv().is_ok() {
+                count += 1;
+            }
+            let _ = count; // discard; run() closure must return ()
+        });
+        for h in prod_handles { h.join().unwrap(); }
+        let _ = consumer.join().unwrap();
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+fn bench_mpsc_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(tokio::task::spawn_local(async move {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx);
+        let consumer = tokio::task::spawn_local(async move {
+            let mut count = 0u64;
+            while rx.recv().await.is_some() {
+                count += 1;
+            }
+            count
+        });
+        for h in prod_handles { let _ = h.await; }
+        let _ = consumer.await;
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+fn bench_mpsc_tokio_multi() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(tokio::spawn(async move {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx);
+        let consumer = tokio::spawn(async move {
+            let mut count = 0u64;
+            while rx.recv().await.is_some() {
+                count += 1;
+            }
+            count
+        });
+        for h in prod_handles { let _ = h.await; }
+        let _ = consumer.await;
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 7. many_timers — 10k sleeping actors waking in a tight window
+// ---------------------------------------------------------------------------
+
+const TIMER_ACTORS: u64 = 10_000;
+const TIMER_MIN_MS: u64 = 1;
+const TIMER_MAX_MS: u64 = 10;
+
+// Deterministic per-actor delay so iterations are comparable.
+fn timer_delay_ms(i: u64) -> u64 {
+    TIMER_MIN_MS + (i * 2654435761u64 >> 32) % (TIMER_MAX_MS - TIMER_MIN_MS + 1)
+}
+
+fn bench_timers_smarm(threads: usize) -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(|| {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(smarm::spawn(move || {
+                smarm::sleep(Duration::from_millis(ms));
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+fn bench_timers_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread()
+        .enable_time()
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(tokio::task::spawn_local(async move {
+                tokio::time::sleep(Duration::from_millis(ms)).await;
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+fn bench_timers_tokio_multi() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .enable_time()
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(tokio::spawn(async move {
+                tokio::time::sleep(Duration::from_millis(ms)).await;
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 8. multi_thread_scaling — primes, sweep thread count
+// ---------------------------------------------------------------------------
+
+const SCALING_N: u64 = 400_000;
+const SCALING_WORKERS: u64 = 64;
+
+fn is_prime(n: u64) -> bool {
+    if n < 2 { return false; }
+    if n < 4 { return true; }
+    if n % 2 == 0 { return false; }
+    let mut i = 3u64;
+    while i * i <= n { if n % i == 0 { return false; } i += 2; }
+    true
+}
+
+fn count_primes(lo: u64, hi: u64) -> u64 {
+    (lo..hi).filter(|&n| is_prime(n)).count() as u64
+}
+
+fn scaling_slice(w: u64) -> (u64, u64) {
+    let per = SCALING_N / SCALING_WORKERS;
+    let lo = w * per;
+    let hi = if w + 1 == SCALING_WORKERS { SCALING_N } else { lo + per };
+    (lo, hi)
+}
+
+fn bench_scaling_smarm(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        let mut handles = Vec::new();
+        for w in 0..SCALING_WORKERS {
+            let (lo, hi) = scaling_slice(w);
+            let tc = t2.clone();
+            handles.push(smarm::spawn(move || {
+                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_scaling_tokio_multi(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(threads)
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for w in 0..SCALING_WORKERS {
+            let (lo, hi) = scaling_slice(w);
+            let tc = t2.clone();
+            handles.push(tokio::spawn(async move {
+                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// main
+// ---------------------------------------------------------------------------
+
+fn main() {
+    let n = available_threads();
+    println!("smarm tokio-favored benchmarks");
+    println!("available parallelism: {n} threads");
+    println!("ITERS={ITERS} (+1 warmup, discarded)");
+    println!(
+        "STORM_BACKGROUND={STORM_BACKGROUND}, STORM_SPAWN={STORM_SPAWN}, \
+         MPSC={MPSC_PRODUCERS}×{MPSC_PER_PRODUCER}, \
+         TIMER_ACTORS={TIMER_ACTORS} ({TIMER_MIN_MS}–{TIMER_MAX_MS} ms), \
+         SCALING_N={SCALING_N}/{SCALING_WORKERS}"
+    );
+
+    // ---- 5. spawn_storm_busy ----
+    print_header(&format!(
+        "spawn_storm_busy: {STORM_BACKGROUND} bg yielders + {STORM_SPAWN} zero-work spawns"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_storm_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_storm_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_storm_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_storm_tokio_multi);
+
+    // ---- 6. mpsc_contention ----
+    print_header(&format!(
+        "mpsc_contention: {MPSC_PRODUCERS} producers × {MPSC_PER_PRODUCER} msgs → 1 consumer"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_mpsc_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_mpsc_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_mpsc_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_mpsc_tokio_multi);
+
+    // ---- 7. many_timers ----
+    print_header(&format!(
+        "many_timers: {TIMER_ACTORS} actors sleeping {TIMER_MIN_MS}–{TIMER_MAX_MS} ms"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_timers_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_timers_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_timers_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_timers_tokio_multi);
+
+    // ---- 8. multi_thread_scaling ----
+    print_header(&format!(
+        "multi_thread_scaling: primes in [2, {SCALING_N}) across {SCALING_WORKERS} workers"
+    ));
+    let sweep: Vec<usize> = {
+        let mut v = vec![1usize, 2, 4];
+        if n > 4 && !v.contains(&n) { v.push(n); }
+        v.into_iter().filter(|t| *t <= n).collect()
+    };
+    for t in &sweep {
+        run_n(&format!("smarm {t}-thread"), ITERS, || bench_scaling_smarm(*t));
+    }
+    for t in &sweep {
+        run_n(&format!("tokio multi {t}-thread"), ITERS, || bench_scaling_tokio_multi(*t));
+    }
+}
@@ -0,0 +1,177 @@
+# Benchmarks
+
+Regression-test and tuning reference for smarm vs tokio.
+
+## Running
+
+```sh
+cargo bench --bench primes              # original compute bench
+cargo bench --bench multi_scheduler     # original 3-workload bench
+cargo bench --bench general             # benches 1–4
+cargo bench --bench tokio_favored       # benches 5–8
+cargo bench --bench smarm_favored       # benches 9–12
+```
+
+Each bench runs one warmup iteration (discarded) and 15 measured iterations.
+Results are reported as median / min / max in microseconds. Median is the
+headline number; the spread between min and max indicates measurement
+stability.
+
+## Methodology notes
+
+- The harness times wall-clock elapsed for the full workload. In
+  `tokio_favored.rs` the smarm timer starts before
+  `smarm::runtime::init(...)`, so smarm rows include runtime startup, while
+  each tokio runtime is built before the timer starts, so tokio rows exclude
+  builder and worker-thread spawn cost. On short-lived benches this
+  difference can dominate. Where startup matters, the bench is structured so
+  the workload is much longer than typical startup.
+- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
+  comparison and `new_multi_thread().worker_threads(N)` for parallel.
+  `smarm::runtime::Config::exact(N)` is the equivalent knob.
+- mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel
+  semantics. Bounded comparisons would need a separate suite.
+- Random delays in `many_timers` use a deterministic mixing function of the
+  actor index so iterations are reproducible.
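+
+The deterministic delay can be checked in isolation. The sketch below copies
+the mixing expression from `timer_delay_ms` in `tokio_favored.rs`; the `main`
+function and its checks are illustrative additions and are not part of the
+bench.
+
+```rust
+const TIMER_MIN_MS: u64 = 1;
+const TIMER_MAX_MS: u64 = 10;
+
+/// Multiplicative hash of the actor index, folded into [MIN, MAX].
+fn timer_delay_ms(i: u64) -> u64 {
+    let span = TIMER_MAX_MS - TIMER_MIN_MS + 1;
+    TIMER_MIN_MS + ((i * 2654435761u64) >> 32) % span
+}
+
+fn main() {
+    // Same index, same delay: two passes produce identical schedules.
+    let first: Vec<u64> = (0..10_000).map(timer_delay_ms).collect();
+    let second: Vec<u64> = (0..10_000).map(timer_delay_ms).collect();
+    assert_eq!(first, second);
+    // Every delay stays inside the configured window.
+    assert!(first.iter().all(|d| (TIMER_MIN_MS..=TIMER_MAX_MS).contains(d)));
+    println!("first delays: {:?}", &first[..8]);
+}
+```
+
+Because the delay depends only on the actor index, each of the 15 measured
+iterations sees the same wake schedule, so spread in `many_timers` reflects
+the runtime and the machine rather than the input.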
+
+## Bench catalog
+
+### General — neither runtime structurally favored
+
+| # | Bench               | Stresses                                        | Prediction         |
+|---|---------------------|-------------------------------------------------|--------------------|
+| 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain         | Roughly even       |
+| 2 | `yield_many`        | Pure scheduling throughput, explicit yields     | Roughly even       |
+| 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination   | Even (compute-bound) |
+| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency              | Roughly even       |
+
+A regression here means a real change in per-task or per-yield cost — those
+should be investigated regardless of which runtime got slower.
+
+### Tokio-favored — measures cost of smarm's design choices
+
+| # | Bench                   | Stresses                                              | Why tokio should win                                                              |
+|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
+| 5 | `spawn_storm_busy`      | 8 background yielders + 10k zero-work spawns          | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
+| 6 | `mpsc_contention`       | 32 producers × 10k msgs → 1 consumer                  | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
+| 7 | `many_timers`           | 10k actors sleeping 1–10 ms, dense wake window        | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap          |
+| 8 | `multi_thread_scaling`  | Primes, sweep thread count 1, 2, 4, available         | Tokio scales near-linearly; smarm hits its mutex ceiling                          |
+
+A regression here means a smarm design choice got more expensive. Widening
+gaps signal something to investigate; narrowing gaps after a tuning change is
+the desired direction.
+
+### Smarm-favored — measures payoff of green-thread + stackful design
+
+| #  | Bench                  | Stresses                                                  | Why smarm should win                                                            |
+|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
+| 9  | `deep_recursion`       | Actor recurses 1000 deep, returns                         | Native stack growth vs tokio's per-level `Box::pin`                             |
+| 10 | `yield_in_hot_loop`    | 2 actors, 500k yields each, single thread                 | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
+| 11 | `uncontended_channel`  | 1→1, 1M msgs, single thread                               | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
+| 12 | `catch_unwind_panics`  | 10k spawns, 50% panic                                     | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
+
+A regression here means we lost some of smarm's structural advantage. #12 is
+exploratory — if the baseline shows no real gap, drop it.
+
+## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
+
+> The sandbox has only 1 logical CPU, so every multi-thread row (smarm Nt,
+> tokio mt) runs with a single worker and the scaling sweep is limited to
+> 1 thread. The bench output prints "smarm 1-thread" twice because
+> `available_parallelism()` returns 1, which makes the N-thread variant
+> identical to the 1-thread one.
+
+| Bench               | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
+|---------------------|----------|----------|----------|----------|-------|
+| chained_spawn       | 7136     | 6979     | 113      | 176      | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
+| yield_many          | 40079    | 40073    | 14571    | 14044    | smarm ~2.8x slower; scheduling overhead real |
+| fan_out_compute     | 19347    | 19461    | 18616    | 18905    | roughly even; compute-bound as expected |
+| ping_pong_oneshot   | 13731    | 14176    | 828      | 3342     | smarm ~17x slower; per-round spawn+join cost high |
+| spawn_storm_busy    | 105512   | 107113   | 2222     | 4546     | smarm ~47x slower; global mutex under 8 bg yielders |
+| mpsc_contention     | 10456    | 10395    | 17348    | 18628    | smarm wins (~1.7x), against the tokio-favored prediction; uncontended mutex essentially free on 1 thread |
+| many_timers         | 120242   | 121023   | 13581    | 14266    | smarm ~9x slower; single min-heap vs sharded wheel |
+| multi_thread_scaling | —       | —        | —        | —        | see thread-count sweep below |
+| deep_recursion      | 62       | 71       | 22       | 44       | tokio wins unexpectedly; see sanity-check notes |
+| yield_in_hot_loop   | 182177   | —        | 138335   | —        | tokio wins; smarm prediction wrong; see notes |
+| uncontended_channel | 31473    | —        | 51925    | —        | smarm wins as predicted; ~1.65x |
+| catch_unwind_panics | 112306   | 114305   | 151443   | 161344   | smarm wins as predicted; ~1.35x |
+
+### `multi_thread_scaling` thread-count sweep (median µs)
+
+> Sandbox has 1 logical CPU; only 1-thread row is available.
+
+| Threads | smarm | tokio mt |
+|---------|-------|----------|
+| 1       | 19852 | 19638    |
+| 2       | —     | —        |
+| 4       | —     | —        |
+| N (avail=1) | 19852 | 19638 |
+
+## Tuning experiments
+
+### Reduction-budget sweep
+
+`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
+the actor checks RDTSC against its timeslice start and yields if over budget.
+The Nth-allocation threshold (the "reduction budget") and the timeslice
+duration are the two knobs.
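+
+To make the two knobs concrete, here is a minimal sketch of the check
+described above. It is not smarm's implementation: the `ReductionBudget`
+type and its fields are hypothetical, and `std::time::Instant` stands in for
+the RDTSC read so the example stays portable.
+
+```rust
+use std::time::{Duration, Instant};
+
+/// Hypothetical per-actor preemption state; names are illustrative only.
+struct ReductionBudget {
+    check_every: u32,     // the "reduction budget": allocations between clock checks
+    timeslice: Duration,  // how long an actor may run before it should yield
+    allocs: u32,          // allocations seen since the last clock check
+    slice_start: Instant, // when the current timeslice began
+}
+
+impl ReductionBudget {
+    fn new(check_every: u32, timeslice: Duration) -> Self {
+        Self { check_every, timeslice, allocs: 0, slice_start: Instant::now() }
+    }
+
+    /// Called on every allocation; returns true when the actor should yield.
+    fn on_alloc(&mut self) -> bool {
+        self.allocs += 1;
+        if self.allocs < self.check_every {
+            return false; // cheap path: no clock read
+        }
+        self.allocs = 0;
+        if self.slice_start.elapsed() >= self.timeslice {
+            self.slice_start = Instant::now(); // next slice starts after the yield
+            return true;
+        }
+        false
+    }
+}
+
+fn main() {
+    // A zero-length timeslice makes every clock check request a yield.
+    let mut b = ReductionBudget::new(3, Duration::ZERO);
+    let decisions: Vec<bool> = (0..6).map(|_| b.on_alloc()).collect();
+    assert_eq!(decisions, [false, false, true, false, false, true]);
+}
+```
+
+In this model a larger budget means fewer clock reads but coarser
+preemption, and a shorter timeslice means more yields per unit of work. The
+sweep table below is meant to capture that trade-off in measured numbers.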
+
+Record each experiment as a row below. Reference the commit or the parameter
+values explicitly.
+
+| Date | Configuration              | Bench (or "all")     | Result vs baseline           | Notes |
+|------|----------------------------|----------------------|------------------------------|-------|
+|      | baseline                   | all                  | —                            |       |
+|      | budget=…, timeslice=…      |                      |                              |       |
+|      |                            |                      |                              |       |
+
+When the gap on tokio-favored benches narrows without regressing
+smarm-favored benches, the change is a keeper. If a budget change improves
+one workload but regresses another by more, prefer keeping the broader-impact
+configuration unless we have a clear use case for the trade-off.
+
+## Sanity-check notes (baseline run)
+
+### Compile fixes applied
+
+Two bench files had a type error: `smarm::Runtime::run()` takes
+`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
+in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
+(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
+by changing the tail to `let _ = count;` in both closures, and the
+corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
+No workload semantics changed.
+
+### Single-CPU sandbox caveat
+
+`available_parallelism()` returns 1, so every "N-thread" variant is identical
+to "1-thread". Multi-thread results should not be used to draw scaling
+conclusions; re-run on a multi-core machine before committing to the tuning
+sweep.
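+
+A harness can detect this situation up front instead of printing duplicate
+rows. The sketch below uses only the standard library; `thread_sweep` is a
+hypothetical helper, and unlike the sweep in `main` it always includes `n`
+itself and removes repeated entries.
+
+```rust
+use std::thread::available_parallelism;
+
+/// Thread counts to benchmark: 1, 2, 4 and `n`, capped at `n`, no repeats.
+fn thread_sweep(n: usize) -> Vec<usize> {
+    let mut v: Vec<usize> = vec![1, 2, 4, n].into_iter().filter(|t| *t <= n).collect();
+    v.sort_unstable();
+    v.dedup();
+    v
+}
+
+fn main() {
+    // Fall back to 1 if the platform cannot report its parallelism.
+    let n = available_parallelism().map(|p| p.get()).unwrap_or(1);
+    if n == 1 {
+        eprintln!("warning: 1 logical CPU; multi-thread rows carry no scaling signal");
+    }
+    for t in thread_sweep(n) {
+        println!("would run: smarm {t}-thread, tokio multi {t}-thread");
+    }
+}
+```
+
+On the baseline sandbox this yields a single entry, so each bench would
+print one smarm row rather than two identical ones.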
+
+### Predicted-winner mismatches
+
+**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
+At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB
+stack; that allocation cost dominates the actual recursion. Tokio's
+`Box::pin` recursion allocates 500 small heap objects but avoids the mmap.
+The prediction assumed stack allocation was amortised across many uses; here
+the actor is single-use. Not a bug, but the bench may not exercise the
+intended advantage.
+
+**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
+The prediction was that smarm's ~6-GPR naked context switch would beat
+tokio's poll/state-machine cycle. In practice, on the single-CPU sandbox,
+tokio's `current_thread` scheduler has very low overhead per `yield_now`,
+while smarm's `yield_now` still takes the runtime mutex and goes through the
+run queue even on a single thread. This is a meaningful data point: smarm's
+scheduling overhead is higher than the assembly switch cost alone suggests.
+
+### Noise / spread
+
+- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
+- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
+  consistent with tokio issue #3829 noted in task spec.
+- `many_timers` smarm spread acceptable (~10%).
+
+### Result-column equivalence
+
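+All result columns match between runtimes for every bench (same prime counts,
+same message totals, same task counts). Workloads are equivalent. Note that
+`mpsc_contention` and `many_timers` report a constant computed from the bench
+parameters rather than a measured count, so their match holds by construction.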