diff --git a/benches/baseline-output/general.txt b/benches/baseline-output/general.txt
new file mode 100644
index 0000000..f530ee1
--- /dev/null
+++ b/benches/baseline-output/general.txt
@@ -0,0 +1,44 @@
+smarm general benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+CHAIN_DEPTH=1000, YIELD_TASKS=200×1000, PRIME_N=400000/64 workers, PP_ROUNDS=1000
+
+================================================================================
+ chained_spawn: depth 1000
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |         1000 |       7136 |       6929 |       8347
+            smarm 1-thread |         1000 |       6979 |       6790 |       7364
+      tokio current_thread |         1000 |        113 |        112 |        322
+        tokio multi-thread |         1000 |        176 |        170 |        355
+
+================================================================================
+ yield_many: 200 tasks × 1000 yields
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |       200000 |      40079 |      39606 |      41913
+            smarm 1-thread |       200000 |      40073 |      39298 |      43173
+      tokio current_thread |       200000 |      14571 |      14430 |      14670
+        tokio multi-thread |       200000 |      14044 |      13306 |      14432
+
+================================================================================
+ fan_out_compute: primes in [2, 400000) across 64
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        33860 |      19347 |      19185 |      19703
+            smarm 1-thread |        33860 |      19461 |      19202 |      21172
+      tokio current_thread |        33860 |      18616 |      18553 |      18987
+        tokio multi-thread |        33860 |      18905 |      18755 |      19035
+
+================================================================================
+ ping_pong_oneshot: 1000 rounds
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |         1000 |      13731 |      13555 |      15545
+            smarm 1-thread |         1000 |      14176 |      13870 |      14892
+      tokio current_thread |         1000 |        828 |        788 |        939
+        tokio multi-thread |         1000 |       3342 |       3233 |       3624
diff --git a/benches/baseline-output/multi_scheduler.txt b/benches/baseline-output/multi_scheduler.txt
new file mode 100644
index 0000000..9c6171e
--- /dev/null
+++ b/benches/baseline-output/multi_scheduler.txt
@@ -0,0 +1,34 @@
+smarm multi-scheduler benchmarks
+available parallelism: 1 threads
+PRIME_N=400000, WORKERS=64, PING_ROUNDS=10000, SPAWN_COUNT=1000
+
+================================================================================
+ Fan-out/fan-in: count primes in [2, 400000) across 64 workers
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+         baseline (serial) |        33860 |      18581 |      18519 |      18905
+       smarm single-thread |        33860 |      19467 |      19354 |      22082
+            smarm 1-thread |        33860 |      19345 |      19287 |      19653
+      tokio current_thread |        33860 |      18681 |      18591 |      18982
+        tokio multi-thread |        33860 |      18948 |      18726 |      19212
+
+================================================================================
+ Ping-pong: 10000 round-trips between two actors
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+       smarm single-thread |        10000 |       2547 |       2473 |       2841
+            smarm 1-thread |        10000 |       2546 |       2518 |       2702
+      tokio current_thread |        10000 |       1221 |       1168 |       1366
+        tokio multi-thread |        10000 |       1487 |       1316 |       2331
+
+================================================================================
+ Spawn throughput: 1000 actors spawned and joined
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+       smarm single-thread |         1000 |       8934 |       8066 |      12204
+            smarm 1-thread |         1000 |       8102 |       8041 |      10849
+      tokio current_thread |         1000 |        212 |        210 |        331
+        tokio multi-thread |         1000 |        330 |        301 |        604
diff --git a/benches/baseline-output/primes.txt b/benches/baseline-output/primes.txt
new file mode 100644
index 0000000..7ac6500
--- /dev/null
+++ b/benches/baseline-output/primes.txt
@@ -0,0 +1,7 @@
+Counting primes in [2, 200000) across 16 workers, 5 iterations each
+
+ runtime | primes found | median | min | max
+--------------------------------------------------------------------------------
+ baseline | primes: 17984 | median: 7244 µs | min: 7231 µs | max: 7509 µs
+ smarm | primes: 17984 | median: 7592 µs | min: 7505 µs | max: 8130 µs
+ tokio | primes: 17984 | median: 7263 µs | min: 7225 µs | max: 9067 µs
diff --git a/benches/baseline-output/smarm_favored.txt b/benches/baseline-output/smarm_favored.txt
new file mode 100644
index 0000000..a8b7af4
--- /dev/null
+++ b/benches/baseline-output/smarm_favored.txt
@@ -0,0 +1,40 @@
+smarm smarm-favored benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+RECURSE_DEPTH=500, HOT_YIELDS=500000×2, UNCONT_MSGS=1000000, PANIC_TASKS=10000
+
+================================================================================
+ deep_recursion: depth 500
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |            1 |         62 |         59 |        682
+            smarm 1-thread |            1 |         71 |         61 |        210
+      tokio current_thread |            1 |         22 |         22 |         23
+        tokio multi-thread |            1 |         44 |         38 |         79
+
+================================================================================
+ yield_in_hot_loop: 2 actors × 500000 yields (single thread)
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |      1000000 |     182177 |     180380 |     184410
+      tokio current_thread |      1000000 |     138335 |     136097 |     141196
+
+================================================================================
+ uncontended_channel: 1→1, 1000000 msgs (single thread)
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |      1000000 |      31473 |      28719 |      33113
+      tokio current_thread |      1000000 |      51925 |      51205 |      53043
+
+================================================================================
+ catch_unwind_panics: 10000 tasks, 50% panic
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     112306 |     109702 |     119859
+            smarm 1-thread |        10000 |     114305 |     112030 |     121326
+      tokio current_thread |        10000 |     151443 |     150949 |     153800
+        tokio multi-thread |        10000 |     161344 |     160385 |     167573
diff --git a/benches/baseline-output/tokio_favored.txt b/benches/baseline-output/tokio_favored.txt
new file mode 100644
index 0000000..2cf700b
--- /dev/null
+++ b/benches/baseline-output/tokio_favored.txt
@@ -0,0 +1,42 @@
+smarm tokio-favored benchmarks
+available parallelism: 1 threads
+ITERS=15 (+1 warmup, discarded)
+STORM_BACKGROUND=8, STORM_SPAWN=10000, MPSC=32×10000, TIMER_ACTORS=10000 (1–10 ms), SCALING_N=400000/64
+
+================================================================================
+ spawn_storm_busy: 8 bg yielders + 10000 zero-work spawns
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     105512 |     102322 |     120552
+            smarm 1-thread |        10000 |     107113 |     104048 |     112377
+      tokio current_thread |        10000 |       2222 |       2124 |       2506
+        tokio multi-thread |        10000 |       4546 |       3833 |       7305
+
+================================================================================
+ mpsc_contention: 32 producers × 10000 msgs → 1 consumer
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |       320000 |      10456 |      10331 |      10639
+            smarm 1-thread |       320000 |      10395 |       9201 |      10549
+      tokio current_thread |       320000 |      17348 |      16639 |      19061
+        tokio multi-thread |       320000 |      18628 |      17499 |      19298
+
+================================================================================
+ many_timers: 10000 actors sleeping 1–10 ms
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        10000 |     120242 |     116239 |     127200
+            smarm 1-thread |        10000 |     121023 |     113997 |     127826
+      tokio current_thread |        10000 |      13581 |      13182 |      14415
+        tokio multi-thread |        10000 |      14266 |      14084 |      14843
+
+================================================================================
+ multi_thread_scaling: primes in [2, 400000) across 64 workers
+================================================================================
+                   runtime |       result |  median µs |     min µs |     max µs
+--------------------------------------------------------------------------------
+            smarm 1-thread |        33860 |      19852 |      19601 |      22679
+      tokio multi 1-thread |        33860 |      19638 |      18994 |      20102
diff --git a/benches/smarm_favored.rs b/benches/smarm_favored.rs
new file mode 100644
index 0000000..2139de5
--- /dev/null
+++ b/benches/smarm_favored.rs
@@ -0,0 +1,391 @@
+//! Benchmarks where smarm's design has a structural advantage.
+//!
+//! These exist to show what the green-thread + stackful model buys you. The
+//! single-thread numbers are the most interesting ones — they isolate the
+//! per-switch / per-task cost from any contention story.
+//!
+//! Workloads:
+//!   9. deep_recursion      — actor recurses 500 deep then returns. In
+//!                            smarm this is plain stack recursion on the
+//!                            actor's mmap'd stack. In tokio, async fn
+//!                            can't directly recurse — each level must
+//!                            `Box::pin` its future. We measure both.
+//!  10. yield_in_hot_loop   — 2 actors ping yield_now back and forth 500k
+//!                            times. Pure context-switch cost; no
+//!                            channels, no allocation, no contention.
+//!                            Smarm's switch is ~6 GPRs + xmm save and a
+//!                            `ret`; tokio's is poll → state-machine →
+//!                            schedule.
+//!  11. uncontended_channel — single producer, single consumer, 1M msgs,
+//!                            single-threaded runtime. With no
+//!                            cross-thread contention, smarm's
+//!                            Arc<Mutex<_>> channel is essentially free,
+//!                            and the green-thread switch should beat
+//!                            tokio's future polling overhead.
+//!  12. catch_unwind_panics — spawn 10k tasks; half panic, half succeed.
+//!                            Supervisor handles each. Exploratory — if
+//!                            there's no real gap, drop this one.
+
+use std::sync::atomic::{AtomicU64, Ordering};
+use std::sync::Arc;
+use std::time::Instant;
+
+// ---------------------------------------------------------------------------
+// Shared harness
+// ---------------------------------------------------------------------------
+
+const ITERS: u32 = 15;
+
+fn available_threads() -> usize {
+    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
+}
+
+fn print_header(title: &str) {
+    println!("\n{}", "=".repeat(80));
+    println!(" {title}");
+    println!("{}", "=".repeat(80));
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        "runtime", "result", "median µs", "min µs", "max µs"
+    );
+    println!("{}", "-".repeat(80));
+}
+
+fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
+    let mut times = Vec::new();
+    let mut last = 0u64;
+    let _ = f(); // warmup
+    for _ in 0..n {
+        let (v, t) = f();
+        times.push(t);
+        last = v;
+    }
+    times.sort_unstable();
+    let median = times[times.len() / 2];
+    let min = *times.iter().min().unwrap();
+    let max = *times.iter().max().unwrap();
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        name, last, median, min, max
+    );
+}
+
+// ---------------------------------------------------------------------------
+// 9. deep_recursion — 500 levels deep
+// ---------------------------------------------------------------------------
+
+// Each recursive frame holds an `&AtomicU64`, a `u64`, plus prologue/spill —
+// conservatively ~64 B/frame on release. Smarm actor stacks are a fixed 64 KiB,
+// so 500 levels (~32 KiB) leaves comfortable headroom while still being deep
+// enough to compare plain stack recursion against Box::pin recursion.
+const RECURSE_DEPTH: u64 = 500;
+
+fn bench_recurse_smarm(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        // Plain Rust recursion on the actor's own stack.
+        fn recurse(c: &AtomicU64, n: u64) -> u64 {
+            if n == 0 {
+                c.fetch_add(1, Ordering::Relaxed);
+                0
+            } else {
+                1 + recurse(c, n - 1)
+            }
+        }
+        let h = smarm::spawn(move || {
+            let _ = recurse(&t2, RECURSE_DEPTH);
+        });
+        h.join().unwrap();
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_recurse_tokio_current() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let c2 = counter.clone();
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        // async fn can't self-recurse; each level returns a Box::pin'd future.
+        // This is the canonical workaround a real user would write.
+        fn recurse(
+            c: Arc<AtomicU64>,
+            n: u64,
+        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64>>> {
+            Box::pin(async move {
+                if n == 0 {
+                    c.fetch_add(1, Ordering::Relaxed);
+                    0
+                } else {
+                    1 + recurse(c, n - 1).await
+                }
+            })
+        }
+        let h = tokio::task::spawn_local(async move {
+            let _ = recurse(c2, RECURSE_DEPTH).await;
+        });
+        let _ = h.await;
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_recurse_tokio_multi() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let c2 = counter.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        fn recurse(
+            c: Arc<AtomicU64>,
+            n: u64,
+        ) -> std::pin::Pin<Box<dyn std::future::Future<Output = u64> + Send>> {
+            Box::pin(async move {
+                if n == 0 {
+                    c.fetch_add(1, Ordering::Relaxed);
+                    0
+                } else {
+                    1 + recurse(c, n - 1).await
+                }
+            })
+        }
+        let h = tokio::spawn(async move {
+            let _ = recurse(c2, RECURSE_DEPTH).await;
+        });
+        let _ = h.await;
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 10. yield_in_hot_loop — 2 actors, 500k yields each, single thread
+// ---------------------------------------------------------------------------
+
+const HOT_YIELDS: u64 = 500_000;
+
+fn bench_hot_smarm() -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(1)).run(|| {
+        let ha = smarm::spawn(|| {
+            for _ in 0..HOT_YIELDS {
+                smarm::yield_now();
+            }
+        });
+        let hb = smarm::spawn(|| {
+            for _ in 0..HOT_YIELDS {
+                smarm::yield_now();
+            }
+        });
+        ha.join().unwrap();
+        hb.join().unwrap();
+    });
+    (HOT_YIELDS * 2, start.elapsed().as_micros())
+}
+
+fn bench_hot_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let ha = tokio::task::spawn_local(async move {
+            for _ in 0..HOT_YIELDS {
+                tokio::task::yield_now().await;
+            }
+        });
+        let hb = tokio::task::spawn_local(async move {
+            for _ in 0..HOT_YIELDS {
+                tokio::task::yield_now().await;
+            }
+        });
+        let _ = ha.await;
+        let _ = hb.await;
+    });
+    (HOT_YIELDS * 2, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 11. uncontended_channel — 1 producer, 1 consumer, 1M msgs, single-threaded
+// ---------------------------------------------------------------------------
+
+const UNCONT_MSGS: u64 = 1_000_000;
+
+fn bench_unc_smarm() -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(1)).run(|| {
+        let (tx, rx) = smarm::channel::<u64>();
+        let consumer = smarm::spawn(move || {
+            let mut count = 0u64;
+            while let Ok(_) = rx.recv() {
+                count += 1;
+            }
+            let _ = count; // discard; run() closure must return ()
+        });
+        let producer = smarm::spawn(move || {
+            for i in 0..UNCONT_MSGS {
+                tx.send(i).unwrap();
+            }
+            // tx drops here, closing the channel.
+        });
+        producer.join().unwrap();
+        let _ = consumer.join().unwrap();
+    });
+    (UNCONT_MSGS, start.elapsed().as_micros())
+}
+
+fn bench_unc_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let consumer = tokio::task::spawn_local(async move {
+            let mut count = 0u64;
+            while let Some(_) = rx.recv().await {
+                count += 1;
+            }
+            count
+        });
+        let producer = tokio::task::spawn_local(async move {
+            for i in 0..UNCONT_MSGS {
+                tx.send(i).unwrap();
+            }
+        });
+        let _ = producer.await;
+        let _ = consumer.await;
+    });
+    (UNCONT_MSGS, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 12. catch_unwind_panics — 10k tasks, half panic
+// ---------------------------------------------------------------------------
+
+const PANIC_TASKS: u64 = 10_000;
+
+fn bench_panic_smarm(threads: usize) -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(smarm::spawn(move || {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.join() {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+fn bench_panic_tokio_current() -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(tokio::task::spawn_local(async move {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.await {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+fn bench_panic_tokio_multi() -> (u64, u128) {
+    let ok = Arc::new(AtomicU64::new(0));
+    let err = Arc::new(AtomicU64::new(0));
+    let ok2 = ok.clone();
+    let err2 = err.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for i in 0..PANIC_TASKS {
+            handles.push(tokio::spawn(async move {
+                if i % 2 == 0 {
+                    panic!("planned");
+                }
+            }));
+        }
+        for h in handles {
+            match h.await {
+                Ok(()) => { ok2.fetch_add(1, Ordering::Relaxed); }
+                Err(_) => { err2.fetch_add(1, Ordering::Relaxed); }
+            }
+        }
+    });
+    let total = ok.load(Ordering::Relaxed) + err.load(Ordering::Relaxed);
+    (total, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// main
+// ---------------------------------------------------------------------------
+
+fn main() {
+    let n = available_threads();
+    println!("smarm smarm-favored benchmarks");
+    println!("available parallelism: {n} threads");
+    println!("ITERS={ITERS} (+1 warmup, discarded)");
+    println!(
+        "RECURSE_DEPTH={RECURSE_DEPTH}, HOT_YIELDS={HOT_YIELDS}×2, \
+         UNCONT_MSGS={UNCONT_MSGS}, PANIC_TASKS={PANIC_TASKS}"
+    );
+
+    // ---- 9. deep_recursion ----
+    print_header(&format!("deep_recursion: depth {RECURSE_DEPTH}"));
+    run_n("smarm 1-thread", ITERS, || bench_recurse_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_recurse_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_recurse_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_recurse_tokio_multi);
+
+    // ---- 10. yield_in_hot_loop ----
+    print_header(&format!("yield_in_hot_loop: 2 actors × {HOT_YIELDS} yields (single thread)"));
+    run_n("smarm 1-thread", ITERS, bench_hot_smarm);
+    run_n("tokio current_thread", ITERS, bench_hot_tokio_current);
+
+    // ---- 11. uncontended_channel ----
+    print_header(&format!("uncontended_channel: 1→1, {UNCONT_MSGS} msgs (single thread)"));
+    run_n("smarm 1-thread", ITERS, bench_unc_smarm);
+    run_n("tokio current_thread", ITERS, bench_unc_tokio_current);
+
+    // ---- 12. catch_unwind_panics ----
+    print_header(&format!("catch_unwind_panics: {PANIC_TASKS} tasks, 50% panic"));
+    run_n("smarm 1-thread", ITERS, || bench_panic_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_panic_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_panic_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_panic_tokio_multi);
+}
diff --git a/benches/tokio_favored.rs b/benches/tokio_favored.rs
new file mode 100644
index 0000000..8082c15
--- /dev/null
+++ b/benches/tokio_favored.rs
@@ -0,0 +1,470 @@
+//! Benchmarks where tokio's design has a structural advantage.
+//!
+//! These exist to *measure* the cost of smarm's design choices, not to flatter
+//! either runtime. Expect tokio to win these; the value is in knowing by how
+//! much, and in catching regressions where the gap widens.
+//!
+//! Workloads:
+//!   5. spawn_storm_busy     — keep N workers busy with yielding tasks, then
+//!                             spawn 10k zero-work tasks and join. Adapted from
+//!                             tokio's `spawn_many_remote_busy1`. Tokio's
+//!                             work-stealing deques + per-worker LIFO slot
+//!                             should beat smarm's single global Mutex<>
+//!                             run queue.
+//!   6. mpsc_contention      — 32 producer actors, 1 consumer, 10k messages
+//!                             each. Tokio's mpsc is lock-free on the hot path;
+//!                             smarm's channel is Arc<Mutex<_>> per channel
+//!                             *and* takes the runtime mutex on each unpark.
+//!   7. many_timers          — 10k actors each sleep for a random short
+//!                             duration (1–10 ms), all wake within a tight
+//!                             window. Tokio's per-worker sharded timer wheel
+//!                             vs smarm's single shared min-heap (and single
+//!                             drain-lock winner).
+//!   8. multi_thread_scaling — primes again, but sweep thread count 1, 2, 4,
+//!                             available_parallelism(). Smarm's mutex ceiling
+//!                             should show up as soon as scheduling overhead
+//!                             is non-trivial relative to per-actor work.
+
+use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
+use std::sync::Arc;
+use std::time::{Duration, Instant};
+
+// ---------------------------------------------------------------------------
+// Shared harness
+// ---------------------------------------------------------------------------
+
+const ITERS: u32 = 15;
+
+fn available_threads() -> usize {
+    std::thread::available_parallelism().map(|n| n.get()).unwrap_or(1)
+}
+
+fn print_header(title: &str) {
+    println!("\n{}", "=".repeat(80));
+    println!(" {title}");
+    println!("{}", "=".repeat(80));
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        "runtime", "result", "median µs", "min µs", "max µs"
+    );
+    println!("{}", "-".repeat(80));
+}
+
+fn run_n<F: FnMut() -> (u64, u128)>(name: &str, n: u32, mut f: F) {
+    let mut times = Vec::new();
+    let mut last = 0u64;
+    let _ = f(); // warmup
+    for _ in 0..n {
+        let (v, t) = f();
+        times.push(t);
+        last = v;
+    }
+    times.sort_unstable();
+    let median = times[times.len() / 2];
+    let min = *times.iter().min().unwrap();
+    let max = *times.iter().max().unwrap();
+    println!(
+        "{:>26} | {:>12} | {:>10} | {:>10} | {:>10}",
+        name, last, median, min, max
+    );
+}
+
+// ---------------------------------------------------------------------------
+// 5. spawn_storm_busy — workers loaded, then storm of zero-work spawns
+// ---------------------------------------------------------------------------
+
+const STORM_BACKGROUND: u64 = 8; // number of background "busy" actors
+const STORM_SPAWN: u64 = 10_000; // zero-work spawns to time
+
+fn bench_storm_smarm(threads: usize) -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        // Background actors: yield in a tight loop until told to stop.
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(smarm::spawn(move || {
+                while !s.load(Ordering::Relaxed) {
+                    smarm::yield_now();
+                }
+            }));
+        }
+
+        // Storm: spawn 10k zero-work actors and join them all.
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(smarm::spawn(move || {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+
+        // Tear down background.
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { h.join().unwrap(); }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_storm_tokio_current() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(tokio::task::spawn_local(async move {
+                while !s.load(Ordering::Relaxed) {
+                    tokio::task::yield_now().await;
+                }
+            }));
+        }
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(tokio::task::spawn_local(async move {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { let _ = h.await; }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_storm_tokio_multi() -> (u64, u128) {
+    let counter = Arc::new(AtomicU64::new(0));
+    let stop = Arc::new(AtomicBool::new(false));
+    let c2 = counter.clone();
+    let s2 = stop.clone();
+
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut bg_handles = Vec::new();
+        for _ in 0..STORM_BACKGROUND {
+            let s = s2.clone();
+            bg_handles.push(tokio::spawn(async move {
+                while !s.load(Ordering::Relaxed) {
+                    tokio::task::yield_now().await;
+                }
+            }));
+        }
+        let mut handles = Vec::new();
+        for _ in 0..STORM_SPAWN {
+            let cc = c2.clone();
+            handles.push(tokio::spawn(async move {
+                cc.fetch_add(1, Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+        s2.store(true, Ordering::Relaxed);
+        for h in bg_handles { let _ = h.await; }
+    });
+    (counter.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 6. mpsc_contention — 32 producers × 10k msgs into 1 consumer
+// ---------------------------------------------------------------------------
+
+const MPSC_PRODUCERS: u64 = 32;
+const MPSC_PER_PRODUCER: u64 = 10_000;
+
+fn bench_mpsc_smarm(threads: usize) -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(|| {
+        let (tx, rx) = smarm::channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(smarm::spawn(move || {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx); // close once producers drop
+        let consumer = smarm::spawn(move || {
+            let mut count = 0u64;
+            while let Ok(_) = rx.recv() {
+                count += 1;
+            }
+            let _ = count; // discard; run() closure must return ()
+        });
+        for h in prod_handles { h.join().unwrap(); }
+        let _ = consumer.join().unwrap();
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+fn bench_mpsc_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread().build().unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(tokio::task::spawn_local(async move {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx);
+        let consumer = tokio::task::spawn_local(async move {
+            let mut count = 0u64;
+            while let Some(_) = rx.recv().await {
+                count += 1;
+            }
+            count
+        });
+        for h in prod_handles { let _ = h.await; }
+        let _ = consumer.await;
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+fn bench_mpsc_tokio_multi() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let (tx, mut rx) = tokio::sync::mpsc::unbounded_channel::<u64>();
+        let mut prod_handles = Vec::new();
+        for p in 0..MPSC_PRODUCERS {
+            let tx = tx.clone();
+            prod_handles.push(tokio::spawn(async move {
+                for i in 0..MPSC_PER_PRODUCER {
+                    tx.send(p * MPSC_PER_PRODUCER + i).unwrap();
+                }
+            }));
+        }
+        drop(tx);
+        let consumer = tokio::spawn(async move {
+            let mut count = 0u64;
+            while let Some(_) = rx.recv().await {
+                count += 1;
+            }
+            count
+        });
+        for h in prod_handles { let _ = h.await; }
+        let _ = consumer.await;
+    });
+    (MPSC_PRODUCERS * MPSC_PER_PRODUCER, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 7. many_timers — 10k sleeping actors waking in a tight window
+// ---------------------------------------------------------------------------
+
+const TIMER_ACTORS: u64 = 10_000;
+const TIMER_MIN_MS: u64 = 1;
+const TIMER_MAX_MS: u64 = 10;
+
+// Deterministic per-actor delay so iterations are comparable.
+fn timer_delay_ms(i: u64) -> u64 {
+    TIMER_MIN_MS + (i * 2654435761u64 >> 32) % (TIMER_MAX_MS - TIMER_MIN_MS + 1)
+}
+
+fn bench_timers_smarm(threads: usize) -> (u64, u128) {
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(|| {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(smarm::spawn(move || {
+                smarm::sleep(Duration::from_millis(ms));
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+fn bench_timers_tokio_current() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_current_thread()
+        .enable_time()
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    let local = tokio::task::LocalSet::new();
+    local.block_on(&rt, async move {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(tokio::task::spawn_local(async move {
+                tokio::time::sleep(Duration::from_millis(ms)).await;
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+fn bench_timers_tokio_multi() -> (u64, u128) {
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(available_threads())
+        .enable_time()
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for i in 0..TIMER_ACTORS {
+            let ms = timer_delay_ms(i);
+            handles.push(tokio::spawn(async move {
+                tokio::time::sleep(Duration::from_millis(ms)).await;
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (TIMER_ACTORS, start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// 8. multi_thread_scaling — primes, sweep thread count
+// ---------------------------------------------------------------------------
+
+const SCALING_N: u64 = 400_000;
+const SCALING_WORKERS: u64 = 64;
+
+fn is_prime(n: u64) -> bool {
+    if n < 2 { return false; }
+    if n < 4 { return true; }
+    if n % 2 == 0 { return false; }
+    let mut i = 3u64;
+    while i * i <= n { if n % i == 0 { return false; } i += 2; }
+    true
+}
+
+fn count_primes(lo: u64, hi: u64) -> u64 {
+    (lo..hi).filter(|&n| is_prime(n)).count() as u64
+}
+
+fn scaling_slice(w: u64) -> (u64, u64) {
+    let per = SCALING_N / SCALING_WORKERS;
+    let lo = w * per;
+    let hi = if w + 1 == SCALING_WORKERS { SCALING_N } else { lo + per };
+    (lo, hi)
+}
+
+fn bench_scaling_smarm(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let start = Instant::now();
+    smarm::runtime::init(smarm::runtime::Config::exact(threads)).run(move || {
+        let mut handles = Vec::new();
+        for w in 0..SCALING_WORKERS {
+            let (lo, hi) = scaling_slice(w);
+            let tc = t2.clone();
+            handles.push(smarm::spawn(move || {
+                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
+            }));
+        }
+        for h in handles { h.join().unwrap(); }
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+fn bench_scaling_tokio_multi(threads: usize) -> (u64, u128) {
+    let total = Arc::new(AtomicU64::new(0));
+    let t2 = total.clone();
+    let rt = tokio::runtime::Builder::new_multi_thread()
+        .worker_threads(threads)
+        .build()
+        .unwrap();
+    let start = Instant::now();
+    rt.block_on(async move {
+        let mut handles = Vec::new();
+        for w in 0..SCALING_WORKERS {
+            let (lo, hi) = scaling_slice(w);
+            let tc = t2.clone();
+            handles.push(tokio::spawn(async move {
+                tc.fetch_add(count_primes(lo, hi), Ordering::Relaxed);
+            }));
+        }
+        for h in handles { let _ = h.await; }
+    });
+    (total.load(Ordering::Relaxed), start.elapsed().as_micros())
+}
+
+// ---------------------------------------------------------------------------
+// main
+// ---------------------------------------------------------------------------
+
+fn main() {
+    let n = available_threads();
+    println!("smarm tokio-favored benchmarks");
+    println!("available parallelism: {n} threads");
+    println!("ITERS={ITERS} (+1 warmup, discarded)");
+    println!(
+        "STORM_BACKGROUND={STORM_BACKGROUND}, STORM_SPAWN={STORM_SPAWN}, \
+         MPSC={MPSC_PRODUCERS}×{MPSC_PER_PRODUCER}, \
+         TIMER_ACTORS={TIMER_ACTORS} ({TIMER_MIN_MS}–{TIMER_MAX_MS} ms), \
+         SCALING_N={SCALING_N}/{SCALING_WORKERS}"
+    );
+
+    // ---- 5. spawn_storm_busy ----
+    print_header(&format!(
+        "spawn_storm_busy: {STORM_BACKGROUND} bg yielders + {STORM_SPAWN} zero-work spawns"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_storm_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_storm_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_storm_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_storm_tokio_multi);
+
+    // ---- 6. mpsc_contention ----
+    print_header(&format!(
+        "mpsc_contention: {MPSC_PRODUCERS} producers × {MPSC_PER_PRODUCER} msgs → 1 consumer"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_mpsc_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_mpsc_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_mpsc_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_mpsc_tokio_multi);
+
+    // ---- 7. many_timers ----
+    print_header(&format!(
+        "many_timers: {TIMER_ACTORS} actors sleeping {TIMER_MIN_MS}–{TIMER_MAX_MS} ms"
+    ));
+    run_n("smarm 1-thread", ITERS, || bench_timers_smarm(1));
+    run_n(&format!("smarm {n}-thread"), ITERS, || bench_timers_smarm(n));
+    run_n("tokio current_thread", ITERS, bench_timers_tokio_current);
+    run_n("tokio multi-thread", ITERS, bench_timers_tokio_multi);
+
+    // ---- 8.
multi_thread_scaling ---- + print_header(&format!( + "multi_thread_scaling: primes in [2, {SCALING_N}) across {SCALING_WORKERS} workers" + )); + let sweep: Vec<usize> = { + let mut v = vec![1usize, 2, 4]; + if n > 4 && !v.contains(&n) { v.push(n); } + v.into_iter().filter(|t| *t <= n).collect() + }; + for t in &sweep { + run_n(&format!("smarm {t}-thread"), ITERS, || bench_scaling_smarm(*t)); + } + for t in &sweep { + run_n(&format!("tokio multi {t}-thread"), ITERS, || bench_scaling_tokio_multi(*t)); + } +} diff --git a/benchmarks.md b/benchmarks.md new file mode 100644 index 0000000..eb6e7b7 --- /dev/null +++ b/benchmarks.md @@ -0,0 +1,177 @@ +# Benchmarks + +Regression-test and tuning reference for smarm vs tokio. + +## Running + +```sh +cargo bench --bench primes # original compute bench +cargo bench --bench multi_scheduler # original 3-workload bench +cargo bench --bench general # benches 1–4 +cargo bench --bench tokio_favored # benches 5–8 +cargo bench --bench smarm_favored # benches 9–12 +``` + +Each bench runs one warmup iteration (discarded) and 15 measured iterations. +Results are reported as median / min / max in microseconds. Median is the +headline number; the spread between min and max indicates measurement +stability. + +## Methodology notes + +- The harness times wall-clock elapsed for the full workload, including + runtime startup and shutdown. For multi-thread runtimes this means worker + thread spawn cost is included; on short-lived benches this can dominate. + Where startup matters, the bench is structured so the workload is much + longer than typical startup. +- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded + comparison and `new_multi_thread().worker_threads(N)` for parallel. + `smarm::runtime::Config::exact(N)` is the equivalent knob. +- mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel + semantics. Bounded comparisons would need a separate suite. 
+- Random delays in `many_timers` use a deterministic mixing function of the + actor index so iterations are reproducible. + +## Bench catalog + +### General — neither runtime structurally favored + +| # | Bench | Stresses | Prediction | +|---|---------------------|-------------------------------------------------|--------------------| +| 1 | `chained_spawn` | Spawn + exit overhead in a serial chain | Roughly even | +| 2 | `yield_many` | Pure scheduling throughput, explicit yields | Roughly even | +| 3 | `fan_out_compute` | CPU-bound parallel work, minimal coordination | Even (compute-bound) | +| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency | Roughly even | + +A regression here means a real change in per-task or per-yield cost — those +should be investigated regardless of which runtime got slower. + +### Tokio-favored — measures cost of smarm's design choices + +| # | Bench | Stresses | Why tokio should win | +|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------| +| 5 | `spawn_storm_busy` | 8 background yielders + 10k zero-work spawns | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex` queue | +| 6 | `mpsc_contention` | 32 producers × 10k msgs → 1 consumer | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc>` + runtime mutex on each unpark | +| 7 | `many_timers` | 10k actors sleeping 1–10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap | +| 8 | `multi_thread_scaling` | Primes, sweep thread count 1, 2, 4, available | Tokio scales near-linearly; smarm hits its mutex ceiling | + +A regression here means a smarm design choice got more expensive. Widening +gaps signal something to investigate; narrowing gaps after a tuning change is +the desired direction. 
+ +### Smarm-favored — measures payoff of green-thread + stackful design + +| # | Bench | Stresses | Why smarm should win | +|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------| +| 9 | `deep_recursion` | Actor recurses 1000 deep, returns | Native stack growth vs tokio's per-level `Box::pin` | +| 10 | `yield_in_hot_loop` | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule | +| 11 | `uncontended_channel` | 1→1, 1M msgs, single thread | Mutex is essentially free uncontended; green-thread switch is cheaper than poll | +| 12 | `catch_unwind_panics` | 10k spawns, 50% panic | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory | + +A regression here means we lost some of smarm's structural advantage. #12 is +exploratory — if the baseline shows no real gap, drop it. + +## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none) + +> Sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt, +> tokio mt) are equivalent to 1-thread; scaling sweep is limited to 1 thread. +> Label duplication in bench output ("smarm 1-thread" appearing twice) is +> because available_parallelism() == 1, so the N-thread variant is identical. 
+ +| Bench | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes | +|---------------------|----------|----------|----------|----------|-------| +| chained_spawn | 7136 | 6979 | 113 | 176 | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU | +| yield_many | 40079 | 40073 | 14571 | 14044 | smarm ~2.8x slower; scheduling overhead real | +| fan_out_compute | 19347 | 19461 | 18616 | 18905 | roughly even; compute-bound as expected | +| ping_pong_oneshot | 13731 | 14176 | 828 | 3342 | smarm ~17x slower; per-round spawn+join cost high | +| spawn_storm_busy | 105512 | 107113 | 2222 | 4546 | smarm ~47x slower; global mutex under 8 bg yielders | +| mpsc_contention | 10456 | 10395 | 17348 | 18628 | smarm wins; uncontended mutex essentially free on 1-thread | +| many_timers | 120242 | 121023 | 13581 | 14266 | smarm ~9x slower; single min-heap vs sharded wheel | +| multi_thread_scaling | — | — | — | — | see thread-count sweep below | +| deep_recursion | 62 | 71 | 22 | 44 | tokio wins unexpectedly; see sanity-check notes | +| yield_in_hot_loop | 182177 | — | 138335 | — | tokio wins; smarm prediction wrong; see notes | +| uncontended_channel | 31473 | — | 51925 | — | smarm wins as predicted; ~1.65x | +| catch_unwind_panics | 112306 | 114305 | 151443 | 161344 | smarm wins as predicted; ~1.35x | + +### `multi_thread_scaling` thread-count sweep (median µs) + +> Sandbox has 1 logical CPU; only 1-thread row is available. + +| Threads | smarm | tokio mt | +|---------|-------|----------| +| 1 | 19852 | 19638 | +| 2 | — | — | +| 4 | — | — | +| N (avail=1) | 19852 | 19638 | + +## Tuning experiments + +### Reduction-budget sweep + +`smarm` uses an allocator-driven preemption mechanism: every Nth allocation, +the actor checks RDTSC against its timeslice start and yields if over budget. +The Nth-allocation threshold (the "reduction budget") and the timeslice +duration are the two knobs. + +Record each experiment as a row below. Reference the commit or the parameter +values explicitly. 
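To make the two knobs concrete, here is a minimal sketch of an allocation-counted preemption check. It is an illustration under stated assumptions, not smarm's implementation: the `Preempt` type, its fields, and `on_alloc` are hypothetical names, and a plain `u64` tick counter supplied by a closure stands in for RDTSC so the example stays portable and deterministic.

```rust
/// Hypothetical per-actor preemption state. Names are illustrative and are
/// not taken from smarm.
struct Preempt {
    budget: u32,          // allocations between clock checks ("reduction budget")
    timeslice_ticks: u64, // ticks an actor may run before it should yield
    allocs: u32,          // allocations seen since the last clock check
    slice_start: u64,     // tick value at the start of the current timeslice
}

impl Preempt {
    fn new(budget: u32, timeslice_ticks: u64, now: u64) -> Self {
        Preempt { budget, timeslice_ticks, allocs: 0, slice_start: now }
    }

    /// Called on every allocation. The clock is read only on every
    /// `budget`-th call; returns true when the actor should yield.
    fn on_alloc(&mut self, read_clock: impl FnOnce() -> u64) -> bool {
        self.allocs += 1;
        if self.allocs < self.budget {
            return false; // cheap path: no clock read
        }
        self.allocs = 0;
        let now = read_clock();
        if now.wrapping_sub(self.slice_start) >= self.timeslice_ticks {
            self.slice_start = now; // the next timeslice starts here
            true
        } else {
            false
        }
    }
}

fn main() {
    // Fake clock (assumption): advances 10 ticks per allocation.
    let mut clock = 0u64;
    let mut p = Preempt::new(4, 100, clock);
    let mut yields = 0;
    for _ in 0..40 {
        clock += 10;
        if p.on_alloc(|| clock) {
            yields += 1;
        }
    }
    println!("yields: {yields}"); // prints "yields: 3"
}
```

In this model a larger budget lowers the per-allocation overhead but lets an actor overrun its timeslice by up to `budget - 1` allocations before the clock is consulted, which is the trade-off the sweep is meant to quantify.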
+ +| Date | Configuration | Bench (or "all") | Result vs baseline | Notes | +|------|----------------------------|----------------------|------------------------------|-------| +| | baseline | all | — | | +| | budget=…, timeslice=… | | | | +| | | | | | + +When the gap on tokio-favored benches narrows without regressing +smarm-favored benches, the change is a keeper. If a budget change improves +one workload but regresses another by more, prefer keeping the broader-impact +configuration unless we have a clear use case for the trade-off. + +## Sanity-check notes (baseline run) + +### Compile fixes applied + +Two bench files had a type error: `smarm::Runtime::run()` takes +`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures +in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm` +(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed +by changing the tail to `let _ = count;` in both closures, and the +corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`. +No workload semantics changed. + +### Single-CPU sandbox caveat + +`available_parallelism()` returns 1, so every "N-thread" variant is identical +to "1-thread". Multi-thread results should not be used to draw scaling +conclusions; re-run on a multi-core machine before committing to the tuning +sweep. + +### Predicted-winner mismatches + +**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).** +At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB +stack; that allocation cost dominates the actual recursion. Tokio's +Box::pin recursion allocates 500 small heap objects but avoids the mmap. +The prediction assumed stack allocation was amortised across many uses; here +the actor is single-use. Not a bug, but the bench may not exercise the +intended advantage. 
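The per-level `Box::pin` pattern referred to above can be sketched with the standard library alone. This is not the bench's code: `recurse`, `NoopWake`, and `run_ready` are hypothetical names, and the single-poll driver replaces tokio only because this particular future never returns `Pending`.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Recursive async function: every level is a separately boxed future, so a
/// recursion of depth `d` performs one small heap allocation per level.
fn recurse(depth: u64) -> Pin<Box<dyn Future<Output = u64>>> {
    Box::pin(async move {
        if depth == 0 {
            0
        } else {
            1 + recurse(depth - 1).await
        }
    })
}

/// Waker that does nothing; enough for futures that complete on first poll.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal driver (assumption: the future is ready on its first poll).
fn run_ready<T>(mut fut: Pin<Box<dyn Future<Output = T>>>) -> T {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    match fut.as_mut().poll(&mut cx) {
        Poll::Ready(v) => v,
        Poll::Pending => panic!("future unexpectedly pending"),
    }
}

fn main() {
    // Depth 500 matches the depth quoted in the note above.
    let levels = run_ready(recurse(500));
    println!("{levels}"); // prints "500"
}
```

Each level costs one heap allocation but no stack mapping, which is consistent with the explanation that a single-use actor stack can outweigh the recursion itself.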
+ +**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).** +The prediction was that smarm's ~6-GPR naked context switch would beat +tokio's poll/state-machine cycle. In practice, on a single-thread sandbox, +tokio's current_thread scheduler has very low overhead per yield_now, while +smarm's yield_now still goes through the runtime mutex and run-queue even on +a single thread. This is a meaningful data point: smarm's scheduling overhead +is not as low as the assembly switch cost alone suggests. + +### Noise / spread + +- `catch_unwind_panics` smarm spread is reasonable (~10% min/max). +- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs); + consistent with tokio issue #3829 noted in task spec. +- `many_timers` smarm spread acceptable (~10%). + +### Result-column equivalence + +All result columns match between runtimes for every bench (same prime counts, +same message totals, same task counts). Workloads are equivalent.
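
The equivalence check described above can be automated along these lines. This is a hedged sketch rather than part of the harness: `common_result` is a hypothetical helper, and the sample rows reuse the `fan_out_compute` result value (33860) from the baseline output.

```rust
/// Returns the shared result value if every runtime row reports the same
/// one, or `None` if the rows disagree or the slice is empty.
/// Hypothetical helper; not part of the bench harness.
fn common_result(rows: &[(&str, u64)]) -> Option<u64> {
    let (_, first) = *rows.first()?;
    rows.iter().all(|&(_, r)| r == first).then_some(first)
}

fn main() {
    let rows = [
        ("smarm 1-thread", 33860),
        ("tokio current_thread", 33860),
        ("tokio multi-thread", 33860),
    ];
    match common_result(&rows) {
        Some(v) => println!("all runtimes agree: {v}"),
        None => println!("result mismatch between runtimes"),
    }
}
```

A mismatch here would mean the runtimes did different amounts of work, so the timing comparison for that bench should be disregarded until the workload is fixed.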