# Benchmarks
Regression-test and tuning reference for smarm vs tokio.
## Running
```sh
cargo bench --bench primes # original compute bench
cargo bench --bench multi_scheduler # original 3-workload bench
cargo bench --bench general # benches 1–4
cargo bench --bench tokio_favored # benches 5–8
cargo bench --bench smarm_favored # benches 9–12
```
Each bench runs one warmup iteration (discarded) and 15 measured iterations.
Results are reported as median / min / max in microseconds. Median is the
headline number; the spread between min and max indicates measurement
stability.
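
The measurement loop described above can be sketched as follows. This is a minimal illustration using only `std`, not the actual harness: the names `Summary`, `summarize`, and `measure` are hypothetical, and taking the upper-middle element as the median for an even sample count is an assumption (with 15 iterations the middle element is unambiguous).

```rust
use std::time::{Duration, Instant};

/// Summary of one bench: median / min / max in microseconds.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Summary {
    pub median_us: u128,
    pub min_us: u128,
    pub max_us: u128,
}

/// Reduce a set of measured durations to median / min / max.
/// For an even sample count the upper-middle element is used (assumption).
pub fn summarize(samples: &[Duration]) -> Summary {
    assert!(!samples.is_empty(), "need at least one measured iteration");
    let mut us: Vec<u128> = samples.iter().map(|d| d.as_micros()).collect();
    us.sort_unstable();
    Summary {
        median_us: us[us.len() / 2],
        min_us: us[0],
        max_us: us[us.len() - 1],
    }
}

/// Run `workload` once as a discarded warmup, then `iters` timed iterations.
pub fn measure<F: FnMut()>(iters: usize, mut workload: F) -> Summary {
    workload(); // warmup, not recorded
    let samples: Vec<Duration> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed()
        })
        .collect();
    summarize(&samples)
}
```

Calling `measure(15, || run_workload())` with a hypothetical `run_workload` would invoke the workload 16 times and report statistics over the last 15.
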
## Methodology notes
- The harness measures wall-clock time for the full workload, including
runtime startup and shutdown. For multi-thread runtimes this means worker
thread spawn cost is included; on short-lived benches this can dominate.
Where startup matters, the bench is structured so the workload is much
longer than typical startup.
- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
comparison and `new_multi_thread().worker_threads(N)` for parallel.
`smarm::runtime::Config::exact(N)` is the equivalent knob.
- mpsc choice: tokio's `unbounded_channel` is used, to match smarm's unbounded channel
semantics. Bounded comparisons would need a separate suite.
- Random delays in `many_timers` use a deterministic mixing function of the
actor index so iterations are reproducible.
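
A deterministic per-actor delay of the kind mentioned in the last bullet could look like the sketch below. The bench's real mixing function is not shown in this document; the SplitMix64 finalizer is swapped in here purely as an example of a stateless mixer, and `mix64` and `delay_for` are hypothetical names. The bounds are parameters because the exact delay range is an assumption on the caller's side.

```rust
use std::time::Duration;

/// SplitMix64 finalizer: a stateless bit mixer (assumed here; the bench's
/// real mixing function is not shown in this document).
pub fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Map an actor index to a sleep duration in [min_ms, max_ms], inclusive.
/// The same index always yields the same delay, so iterations repeat exactly.
pub fn delay_for(actor_index: u64, min_ms: u64, max_ms: u64) -> Duration {
    assert!(min_ms <= max_ms, "empty delay range");
    let span = max_ms - min_ms + 1;
    Duration::from_millis(min_ms + mix64(actor_index) % span)
}
```

Because the delay depends only on the actor index, no RNG state has to be seeded or shared between iterations.
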
## Bench catalog
### General — neither runtime structurally favored
| # | Bench | Stresses | Prediction |
|---|---------------------|-------------------------------------------------|--------------------|
| 1 | `chained_spawn` | Spawn + exit overhead in a serial chain | Roughly even |
| 2 | `yield_many` | Pure scheduling throughput, explicit yields | Roughly even |
| 3 | `fan_out_compute` | CPU-bound parallel work, minimal coordination | Even (compute-bound) |
| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency | Roughly even |

A regression here means a real change in per-task or per-yield cost; it
should be investigated regardless of which runtime got slower.
### Tokio-favored — measures cost of smarm's design choices
| # | Bench | Stresses | Why tokio should win |
|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
| 5 | `spawn_storm_busy` | 8 background yielders + 10k zero-work spawns | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
| 6 | `mpsc_contention` | 32 producers × 10k msgs → 1 consumer | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
| 7 | `many_timers` | 10k actors sleeping 1–10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap |
| 8 | `multi_thread_scaling` | Primes, sweep thread count 1, 2, 4, available | Tokio scales near-linearly; smarm hits its mutex ceiling |

A regression here means a smarm design choice got more expensive. Widening
gaps signal something to investigate; narrowing gaps after a tuning change is
the desired direction.
### Smarm-favored — measures payoff of green-thread + stackful design
| # | Bench | Stresses | Why smarm should win |
|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
| 9 | `deep_recursion` | Actor recurses 1000 deep, returns | Native stack growth vs tokio's per-level `Box::pin` |
| 10 | `yield_in_hot_loop` | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
| 11 | `uncontended_channel` | 1→1, 1M msgs, single thread | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
| 12 | `catch_unwind_panics` | 10k spawns, 50% panic | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |

A regression here means we lost some of smarm's structural advantage. Bench 12
is exploratory: if the baseline shows no real gap, drop it.
## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
> The sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt,
> tokio mt) are equivalent to 1-thread; the scaling sweep is limited to 1 thread.
> Label duplication in bench output ("smarm 1-thread" appearing twice) occurs
> because `available_parallelism()` returns 1, so the N-thread variant is identical.

| Bench | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
|---------------------|----------|----------|----------|----------|-------|
| chained_spawn | 7136 | 6979 | 113 | 176 | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
| yield_many | 40079 | 40073 | 14571 | 14044 | smarm ~2.8x slower; scheduling overhead real |
| fan_out_compute | 19347 | 19461 | 18616 | 18905 | roughly even; compute-bound as expected |
| ping_pong_oneshot | 13731 | 14176 | 828 | 3342 | smarm ~17x slower; per-round spawn+join cost high |
| spawn_storm_busy | 105512 | 107113 | 2222 | 4546 | smarm ~47x slower; global mutex under 8 bg yielders |
| mpsc_contention | 10456 | 10395 | 17348 | 18628 | smarm wins; uncontended mutex essentially free on 1-thread |
| many_timers | 120242 | 121023 | 13581 | 14266 | smarm ~9x slower; single min-heap vs sharded wheel |
| multi_thread_scaling |          |          |          |          | see thread-count sweep below |
| deep_recursion | 62 | 71 | 22 | 44 | tokio wins unexpectedly; see sanity-check notes |
| yield_in_hot_loop | 182177 | — | 138335 | — | tokio wins; smarm prediction wrong; see notes |
| uncontended_channel | 31473 | — | 51925 | — | smarm wins as predicted; ~1.65x |
| catch_unwind_panics | 112306 | 114305 | 151443 | 161344 | smarm wins as predicted; ~1.35x |
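
The "~Nx slower" figures in the Notes column are ratios of the two medians being compared. As a worked check, assuming the 1-thread smarm column is compared against tokio `ct` (the helper name `slowdown` is hypothetical):

```rust
/// How many times slower `slower_us` is than `faster_us`, given two medians
/// in microseconds.
pub fn slowdown(slower_us: u64, faster_us: u64) -> f64 {
    assert!(faster_us > 0, "median must be non-zero");
    slower_us as f64 / faster_us as f64
}
```

For example, `slowdown(7136, 113)` for `chained_spawn` is about 63, and `slowdown(51925, 31473)` for `uncontended_channel` is about 1.65, in line with the rounded figures in the table.
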
### `multi_thread_scaling` thread-count sweep (median µs)
> The sandbox has 1 logical CPU; only the 1-thread row is available.

| Threads | smarm | tokio mt |
|---------|-------|----------|
| 1 | 19852 | 19638 |
| 2 | — | — |
| 4 | — | — |
| N (avail=1) | 19852 | 19638 |
## Tuning experiments
### Reduction-budget sweep
`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
the actor checks RDTSC against its timeslice start and yields if over budget.
The Nth-allocation threshold (the "reduction budget") and the timeslice
duration are the two knobs.
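
To make the two knobs concrete, here is a minimal sketch of the check under stated assumptions. It is not smarm's implementation: `Preempt` and its fields are hypothetical names, and the timestamp is supplied by a closure that stands in for an RDTSC read so the example stays portable.

```rust
/// Hypothetical per-actor preemption state; names are illustrative only.
pub struct Preempt {
    budget: u32,             // read the clock every `budget` allocations
    allocs_since_check: u32, // allocations seen since the last clock read
    slice_start: u64,        // timestamp (ticks) when the timeslice began
    slice_ticks: u64,        // timeslice length in ticks
}

impl Preempt {
    pub fn new(budget: u32, slice_ticks: u64, now: u64) -> Self {
        assert!(budget > 0, "reduction budget must be at least 1");
        Preempt { budget, allocs_since_check: 0, slice_start: now, slice_ticks }
    }

    /// Called on each allocation. Returns true when the actor should yield.
    /// `now` stands in for an RDTSC read and is only invoked on every
    /// `budget`-th call, which keeps the common path to a counter bump.
    pub fn on_alloc(&mut self, now: impl FnOnce() -> u64) -> bool {
        self.allocs_since_check += 1;
        if self.allocs_since_check < self.budget {
            return false;
        }
        self.allocs_since_check = 0;
        now().wrapping_sub(self.slice_start) > self.slice_ticks
    }

    /// Start a fresh timeslice, e.g. when the actor is resumed.
    pub fn reset(&mut self, now: u64) {
        self.slice_start = now;
        self.allocs_since_check = 0;
    }
}
```

In this sketch a larger budget makes the per-allocation cost cheaper but delays detection of an overrun, while a longer timeslice reduces yields at the cost of latency for other actors.
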
Record each experiment as a row below. Reference the commit or the parameter
values explicitly.
| Date | Configuration | Bench (or "all") | Result vs baseline | Notes |
|------|----------------------------|----------------------|------------------------------|-------|
| | baseline | all | — | |
| | budget=…, timeslice=… | | | |
| | | | | |

When the gap on tokio-favored benches narrows without regressing
smarm-favored benches, the change is a keeper. If a budget change improves
one workload but regresses another by more, prefer keeping the broader-impact
configuration unless we have a clear use case for the trade-off.
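
The keeper rule can be written down as a small predicate. The sketch below is one possible reading, not an agreed definition: it treats "narrows" as at least one tokio-favored smarm/tokio ratio shrinking, and "regressing" as a smarm median growing beyond a noise allowance. `Medians`, `is_keeper`, and `tolerance` are hypothetical names.

```rust
/// Median timings in microseconds for one bench under one configuration.
#[derive(Clone, Copy, Debug)]
pub struct Medians {
    pub smarm_us: f64,
    pub tokio_us: f64,
}

impl Medians {
    /// smarm / tokio ratio; values above 1.0 mean smarm is slower.
    pub fn gap(&self) -> f64 {
        self.smarm_us / self.tokio_us
    }
}

/// One reading of the keeper rule. Each slice holds (baseline, candidate)
/// pairs. `tolerance` is a hypothetical noise allowance, e.g. 0.05 for 5%.
pub fn is_keeper(
    tokio_favored: &[(Medians, Medians)],
    smarm_favored: &[(Medians, Medians)],
    tolerance: f64,
) -> bool {
    // At least one tokio-favored gap must shrink...
    let narrowed = tokio_favored.iter().any(|(b, c)| c.gap() < b.gap());
    // ...and no smarm-favored bench may slow down beyond the noise allowance.
    let no_regression = smarm_favored
        .iter()
        .all(|(b, c)| c.smarm_us <= b.smarm_us * (1.0 + tolerance));
    narrowed && no_regression
}
```

The tolerance should be chosen with the observed min/max spread in mind, since a change smaller than the spread is not distinguishable from noise.
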
## Sanity-check notes (baseline run)
### Compile fixes applied
Two bench files had a type error: `smarm::Runtime::run()` takes
`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
by changing the tail to `let _ = count;` in both closures, and the
corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
No workload semantics changed.
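
The shape of the fix can be reproduced with `std` alone. In this sketch `run_unit` is a hypothetical stand-in that only copies the bound quoted above (`impl FnOnce() + Send + 'static`); it is not smarm's API, and `consume` is an illustrative consumer rather than the bench code.

```rust
use std::thread;

/// Stand-in with the same closure bound as quoted for `smarm::Runtime::run()`.
/// The closure must return `()`.
pub fn run_unit(f: impl FnOnce() + Send + 'static) {
    thread::spawn(f).join().unwrap();
}

/// Illustrative consumer that counts messages and then discards the count.
pub fn consume(msgs: Vec<u64>) {
    run_unit(move || {
        let mut count: u64 = 0;
        for _ in &msgs {
            count += 1;
        }
        // A bare `count` tail expression here would make the closure return
        // `u64`, which does not satisfy `FnOnce()` and fails to type-check.
        let _ = count;
    });
}
```

Binding the value to `_` keeps the counting work in place while giving the closure a unit return type.
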
### Single-CPU sandbox caveat
`available_parallelism()` returns 1, so every "N-thread" variant is identical
to "1-thread". Multi-thread results should not be used to draw scaling
conclusions; re-run on a multi-core machine before committing to the tuning
sweep.
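
A quick way to see what a given machine will produce is to compute the sweep's thread counts up front. The sketch below is an assumption about how such a list could be built, not the bench's code: `sweep_counts` and `available` are hypothetical names, and the list is capped at the available parallelism and de-duplicated so that a 1-CPU machine yields a single entry instead of a repeated label.

```rust
use std::num::NonZeroUsize;
use std::thread;

/// Thread counts for the sweep: 1, 2, 4 and `available`, capped at
/// `available`, sorted and de-duplicated.
pub fn sweep_counts(available: usize) -> Vec<usize> {
    let mut counts: Vec<usize> = vec![1, 2, 4, available]
        .into_iter()
        .filter(|&n| n >= 1 && n <= available)
        .collect();
    counts.sort_unstable();
    counts.dedup();
    counts
}

/// Logical CPUs visible to the process, falling back to 1 if unknown.
pub fn available() -> usize {
    thread::available_parallelism()
        .map(NonZeroUsize::get)
        .unwrap_or(1)
}
```

If `sweep_counts(available())` has fewer than two entries, the machine cannot say anything about scaling and the multi-thread columns should be read as repeats of the 1-thread result.
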
### Predicted-winner mismatches
**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB
stack; that allocation cost dominates the actual recursion. Tokio's
`Box::pin` recursion allocates 500 small heap objects but avoids the mmap.
The prediction assumed stack allocation was amortised across many uses; here
the actor is single-use. Not a bug, but the bench may not exercise the
intended advantage.
**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
The prediction was that smarm's ~6-GPR naked context switch would beat
tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
tokio's `current_thread` scheduler has very low overhead per `yield_now`, while
smarm's `yield_now` still goes through the runtime mutex and run queue even on
a single thread. This is a meaningful data point: smarm's scheduling overhead
is not as low as the assembly switch cost alone suggests.
### Noise / spread
- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
consistent with tokio issue #3829 noted in task spec.
- `many_timers` smarm spread acceptable (~10%).
### Result-column equivalence
All result columns match between runtimes for every bench (same prime counts,
same message totals, same task counts). Workloads are equivalent.