smarm/docs/benchmarks.md
2026-05-25 22:14:07 +02:00

Benchmarks

Regression-test and tuning reference for smarm vs tokio.

Running

cargo bench --bench primes              # original compute bench
cargo bench --bench multi_scheduler     # original 3-workload bench
cargo bench --bench general             # benches 1-4
cargo bench --bench tokio_favored       # benches 5-8
cargo bench --bench smarm_favored       # benches 9-12

Each bench runs one warmup iteration (discarded) and 15 measured iterations. Results are reported as median / min / max in microseconds. Median is the headline number; the spread between min and max indicates measurement stability.
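A minimal sketch of the measurement loop described above, using only the standard library. The names run_bench, summarize and Summary are illustrative and not taken from the actual harness; the placeholder workload stands in for a full runtime start, workload and shutdown.

```rust
use std::time::{Duration, Instant};

/// Median / min / max of the measured iterations, in microseconds.
#[derive(Debug, PartialEq)]
struct Summary {
    median_us: u128,
    min_us: u128,
    max_us: u128,
}

/// Summarize a set of samples. Panics on an empty slice.
fn summarize(samples: &[Duration]) -> Summary {
    let mut us: Vec<u128> = samples.iter().map(|d| d.as_micros()).collect();
    us.sort_unstable();
    Summary {
        // Middle element: the exact median for odd counts such as 15.
        median_us: us[us.len() / 2],
        min_us: us[0],
        max_us: us[us.len() - 1],
    }
}

/// Run `workload` once as a discarded warmup, then `iters` timed runs.
fn run_bench<F: FnMut()>(iters: usize, mut workload: F) -> Summary {
    workload(); // warmup, not recorded
    let samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed()
        })
        .collect();
    summarize(&samples)
}

fn main() {
    let s = run_bench(15, || {
        // Placeholder workload; a real bench would start a runtime here.
        let v: u64 = (0..100_000u64).sum();
        std::hint::black_box(v);
    });
    println!("median {} µs (min {}, max {})", s.median_us, s.min_us, s.max_us);
}
```

With an odd iteration count the median is a real sample rather than an average of two, which keeps the headline number directly comparable to min and max.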

Methodology notes

  • The harness measures elapsed wall-clock time for the full workload, including runtime startup and shutdown. For multi-thread runtimes this includes the cost of spawning worker threads, which can dominate short-lived benches. Where startup matters, the bench is structured so that the workload runs much longer than typical startup.
  • tokio uses new_current_thread + LocalSet for the single-threaded comparison and new_multi_thread().worker_threads(N) for parallel. smarm::runtime::Config::exact(N) is the equivalent knob.
  • mpsc choice: the tokio side uses unbounded_channel to match smarm's unbounded channel semantics. Bounded comparisons would need a separate suite.
  • Random delays in many_timers use a deterministic mixing function of the actor index so iterations are reproducible.
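The reproducible-delay idea in the last bullet can be sketched as follows. This is an assumption-heavy illustration: the mixer shown is the SplitMix64 finalizer, which may not be the function the bench uses, and the 1 to 10 ms range and the name delay_for are hypothetical.

```rust
use std::time::Duration;

/// SplitMix64 finalizer: a cheap, stateless bit mixer.
/// Assumption: the real bench may use a different mixing function.
fn mix(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Map an actor index to a delay in the 1..=10 ms range (assumed range).
fn delay_for(actor_index: u64) -> Duration {
    Duration::from_millis(1 + mix(actor_index) % 10)
}

fn main() {
    // The same index always yields the same delay, so every iteration
    // of the bench sees an identical wake schedule.
    for i in 0..5u64 {
        println!("actor {} sleeps {:?}", i, delay_for(i));
    }
}
```

Because the delay depends only on the actor index, no RNG state or seed has to be shared between the smarm and tokio variants.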

Bench catalog

General — neither runtime structurally favored

| # | Bench | Stresses | Prediction |
|---|---|---|---|
| 1 | chained_spawn | Spawn + exit overhead in a serial chain | Roughly even |
| 2 | yield_many | Pure scheduling throughput, explicit yields | Roughly even |
| 3 | fan_out_compute | CPU-bound parallel work, minimal coordination | Even (compute-bound) |
| 4 | ping_pong_oneshot | Spawn + oneshot round-trip latency | Roughly even |

A regression here means a real change in per-task or per-yield cost — those should be investigated regardless of which runtime got slower.

Tokio-favored — measures cost of smarm's design choices

| # | Bench | Stresses | Why tokio should win |
|---|---|---|---|
| 5 | spawn_storm_busy | 8 background yielders + 10k zero-work spawns | Tokio's per-worker deque + LIFO slot vs smarm's global Mutex<SharedState> queue |
| 6 | mpsc_contention | 32 producers × 10k msgs → 1 consumer | Tokio's mpsc is lock-free on the hot path; smarm channel is Arc<Mutex<Inner>> + runtime mutex on each unpark |
| 7 | many_timers | 10k actors sleeping 1-10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap |
| 8 | multi_thread_scaling | Primes, sweep thread count 1, 2, 4, available | Tokio scales near-linearly; smarm hits its mutex ceiling |

A regression here means a smarm design choice got more expensive. Widening gaps signal something to investigate; narrowing gaps after a tuning change is the desired direction.

Smarm-favored — measures payoff of green-thread + stackful design

| # | Bench | Stresses | Why smarm should win |
|---|---|---|---|
| 9 | deep_recursion | Actor recurses 1000 deep, returns | Native stack growth vs tokio's per-level Box::pin |
| 10 | yield_in_hot_loop | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
| 11 | uncontended_channel | 1→1, 1M msgs, single thread | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
| 12 | catch_unwind_panics | 10k spawns, 50% panic | Smarm has catch_unwind at the actor entry; both runtimes do this but the boundaries differ; exploratory |

A regression here means we lost some of smarm's structural advantage. #12 is exploratory — if the baseline shows no real gap, drop it.

Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)

The sandbox environment has only 1 logical CPU, so all multi-thread columns (smarm Nt, tokio mt) are equivalent to 1-thread runs and the scaling sweep is limited to 1 thread. The bench output duplicates labels ("smarm 1-thread" appears twice) because available_parallelism() == 1 makes the N-thread variant identical to the 1-thread one.

| Bench | smarm 1t (µs) | smarm Nt (µs) | tokio ct (µs) | tokio mt (µs) | Notes |
|---|---|---|---|---|---|
| chained_spawn | 7136 | 6979 | 113 | 176 | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
| yield_many | 40079 | 40073 | 14571 | 14044 | smarm ~2.8x slower; scheduling overhead real |
| fan_out_compute | 19347 | 19461 | 18616 | 18905 | roughly even; compute-bound as expected |
| ping_pong_oneshot | 13731 | 14176 | 828 | 3342 | smarm ~17x slower; per-round spawn+join cost high |
| spawn_storm_busy | 105512 | 107113 | 2222 | 4546 | smarm ~47x slower; global mutex under 8 bg yielders |
| mpsc_contention | 10456 | 10395 | 17348 | 18628 | smarm wins; uncontended mutex essentially free on 1-thread |
| many_timers | 120242 | 121023 | 13581 | 14266 | smarm ~9x slower; single min-heap vs sharded wheel |
| multi_thread_scaling | | | | | see thread-count sweep below |
| deep_recursion | 62 | 71 | 22 | 44 | tokio wins unexpectedly; see sanity-check notes |
| yield_in_hot_loop | 182177 | | 138335 | | tokio wins; smarm prediction wrong; see notes |
| uncontended_channel | 31473 | | 51925 | | smarm wins as predicted; ~1.65x |
| catch_unwind_panics | 112306 | 114305 | 151443 | 161344 | smarm wins as predicted; ~1.35x |

multi_thread_scaling thread-count sweep (median µs)

Sandbox has 1 logical CPU; only 1-thread row is available.

| Threads | smarm | tokio mt |
|---|---|---|
| 1 | 19852 | 19638 |
| 2 | | |
| 4 | | |
| N (avail=1) | 19852 | 19638 |

Tuning experiments

Reduction-budget sweep

smarm uses an allocator-driven preemption mechanism: every Nth allocation, the actor checks RDTSC against its timeslice start and yields if over budget. The Nth-allocation threshold (the "reduction budget") and the timeslice duration are the two knobs.
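To make the two knobs concrete, here is a rough sketch of the check, not smarm's implementation. It uses std::time::Instant in place of RDTSC, does not hook a real allocator, and every name (Preempt, on_alloc, start_slice) is hypothetical.

```rust
use std::time::{Duration, Instant};

/// Per-actor preemption state. All names here are hypothetical.
struct Preempt {
    budget: u32,             // read the clock every `budget` allocations
    allocs_since_check: u32, // allocations since the last clock read
    timeslice: Duration,     // how long an actor may run before yielding
    slice_start: Instant,    // when the current timeslice began
}

impl Preempt {
    fn new(budget: u32, timeslice: Duration) -> Self {
        Preempt {
            budget,
            allocs_since_check: 0,
            timeslice,
            slice_start: Instant::now(),
        }
    }

    /// Call on every allocation. Returns true when the actor should yield.
    fn on_alloc(&mut self) -> bool {
        self.allocs_since_check += 1;
        if self.allocs_since_check < self.budget {
            return false; // cheap path: counter bump only, no clock read
        }
        self.allocs_since_check = 0;
        self.slice_start.elapsed() >= self.timeslice
    }

    /// Call when the actor is resumed, to begin a fresh timeslice.
    fn start_slice(&mut self) {
        self.allocs_since_check = 0;
        self.slice_start = Instant::now();
    }
}

fn main() {
    let mut p = Preempt::new(64, Duration::from_micros(100));
    p.start_slice();
    let mut allocs = 0u64;
    // Simulate an allocation-heavy actor until the check asks it to yield.
    while !p.on_alloc() {
        allocs += 1;
    }
    println!("yield requested after {} allocations", allocs);
}
```

A larger budget makes the common path cheaper but lets an actor overrun its timeslice by up to budget allocations before the clock is consulted, which is the trade-off the sweep is meant to measure.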

Record each experiment as a row below. Reference the commit or the parameter values explicitly.

| Date | Configuration | Bench (or "all") | Result vs baseline | Notes |
|---|---|---|---|---|
| | baseline | all | | |
| | budget=…, timeslice=… | | | |

When the gap on tokio-favored benches narrows without regressing smarm-favored benches, the change is a keeper. If a budget change improves one workload but regresses another by more, prefer keeping the broader-impact configuration unless we have a clear use case for the trade-off.
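One possible encoding of the keeper rule, as a sketch. The gap is taken as the smarm/tokio median ratio; the noise tolerance, the GapChange type and the "at least one gap narrows" reading are assumptions of this example, not an agreed policy.

```rust
/// Slowdown of smarm relative to tokio for one bench (1.0 = parity).
fn gap(smarm_us: f64, tokio_us: f64) -> f64 {
    smarm_us / tokio_us
}

/// One bench's gap before and after a tuning change.
struct GapChange {
    before: f64,
    after: f64,
}

/// Keeper rule: at least one tokio-favored gap narrows by more than `noise`,
/// and no smarm-favored gap widens by more than `noise`.
/// `noise` is a relative tolerance (0.10 = 10%) and is an assumption of this sketch.
fn is_keeper(tokio_favored: &[GapChange], smarm_favored: &[GapChange], noise: f64) -> bool {
    let narrowed = tokio_favored
        .iter()
        .any(|g| g.after < g.before * (1.0 - noise));
    let regressed = smarm_favored
        .iter()
        .any(|g| g.after > g.before * (1.0 + noise));
    narrowed && !regressed
}

fn main() {
    // Baseline gaps from the baseline medians (smarm 1t / tokio ct).
    let timers_before = gap(120_242.0, 13_581.0); // many_timers
    let channel_before = gap(31_473.0, 51_925.0); // uncontended_channel
    // The "after" values are made up purely to exercise the rule.
    let tokio_favored = [GapChange { before: timers_before, after: 6.0 }];
    let smarm_favored = [GapChange { before: channel_before, after: channel_before }];
    println!("keeper: {}", is_keeper(&tokio_favored, &smarm_favored, 0.10));
}
```

Comparing ratios rather than raw medians keeps the rule meaningful when the whole machine gets faster or slower between runs.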

Sanity-check notes (baseline run)

Compile fixes applied

Two bench files had a type error: smarm::Runtime::run() takes impl FnOnce() + Send + 'static (returns ()), but the consumer closures in bench_mpsc_smarm (tokio_favored.rs) and bench_unc_smarm (smarm_favored.rs) returned u64 via a bare count tail expression. Fixed by changing the tail to let _ = count; in both closures, and the corresponding consumer.join().unwrap() calls to let _ = consumer.join().... No workload semantics changed.
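The shape of the fix can be reproduced without smarm. In the sketch below, run is a hypothetical stand-in built on std::thread that carries the same bound as quoted above; it is not smarm's code.

```rust
use std::thread;

/// Stand-in for an entry point that accepts only unit-returning closures.
/// Hypothetical: mirrors the bound quoted above, not smarm's actual code.
fn run(f: impl FnOnce() + Send + 'static) {
    thread::spawn(f).join().unwrap();
}

fn main() {
    // Does not compile: the bare `count` tail makes the closure return u64.
    // run(|| {
    //     let count: u64 = (0..10u64).sum();
    //     count
    // });

    // Compiles: binding the value to `_` discards it, so the closure returns ().
    run(|| {
        let count: u64 = (0..10u64).sum();
        let _ = count;
    });
}
```

The work inside the closure is unchanged by the fix; only the closure's return type differs.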

Single-CPU sandbox caveat

available_parallelism() returns 1, so every "N-thread" variant is identical to "1-thread". Multi-thread results should not be used to draw scaling conclusions; re-run on a multi-core machine before committing to the tuning sweep.
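A possible way to keep duplicate labels out of the output on small machines is to build the sweep list from the detected CPU count. This is a suggestion only; sweep_counts is a hypothetical helper and the current benches do not do this.

```rust
use std::thread::available_parallelism;

/// Thread counts for the scaling sweep: 1, 2, 4 and the available count,
/// keeping only values the machine can actually run and dropping duplicates.
fn sweep_counts(available: usize) -> Vec<usize> {
    let mut counts: Vec<usize> = vec![1, 2, 4, available]
        .into_iter()
        .filter(|&n| n >= 1 && n <= available)
        .collect();
    counts.sort_unstable();
    counts.dedup();
    counts
}

fn main() {
    // Fall back to 1 if the platform cannot report its parallelism.
    let available = available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("sweeping thread counts: {:?}", sweep_counts(available));
}
```

On the 1-CPU sandbox this yields a single entry, so the 1-thread and N-thread variants would no longer both appear.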

Predicted-winner mismatches

deep_recursion — tokio wins (22 µs) over smarm (62 µs). At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB stack; that allocation cost dominates the actual recursion. Tokio's Box::pin recursion allocates 500 small heap objects but avoids the mmap. The prediction assumed stack allocation was amortised across many uses; here the actor is single-use. Not a bug, but the bench may not exercise the intended advantage.
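For readers unfamiliar with why the async side allocates per level, the following std-only sketch shows the boxing pattern next to plain recursion. It is not the bench code: recurse_async, recurse_native and the one-future block_on are illustrative, and no tokio or smarm APIs are used.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Async recursion must box each level, because a future cannot contain itself.
/// Every call heap-allocates one state machine for that level.
fn recurse_async(depth: u32) -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(async move {
        if depth == 0 {
            0
        } else {
            1 + recurse_async(depth - 1).await
        }
    })
}

/// The stackful equivalent: plain recursion on the native stack, no heap use.
fn recurse_native(depth: u32) -> u32 {
    if depth == 0 {
        0
    } else {
        1 + recurse_native(depth - 1)
    }
}

/// Waker that does nothing; enough for futures that never return Pending.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor: polls the future in a loop until it completes.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

fn main() {
    // Depth 500, matching the note above.
    println!("async:  {}", block_on(recurse_async(500)));
    println!("native: {}", recurse_native(500));
}
```

The native version only pays off once a stack already exists, which is why a single-use actor with a freshly mapped stack can lose to 500 small boxed futures.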

yield_in_hot_loop — tokio wins (138 ms) over smarm (182 ms). The prediction was that smarm's ~6-GPR naked context switch would beat tokio's poll/state-machine cycle. In practice, on a single-thread sandbox, tokio's current_thread scheduler has very low overhead per yield_now, while smarm's yield_now still goes through the runtime mutex and run-queue even on a single thread. This is a meaningful data point: smarm's scheduling overhead is not as low as the assembly switch cost alone suggests.

Noise / spread

  • catch_unwind_panics smarm spread is reasonable (~10% min/max).
  • spawn_storm_busy tokio multi-thread has notable spread (3833-7305 µs); consistent with tokio issue #3829 noted in task spec.
  • many_timers smarm spread acceptable (~10%).

Result-column equivalence

All result columns match between runtimes for every bench (same prime counts, same message totals, same task counts). Workloads are equivalent.