Update the documentation
177
docs/benchmarks.md
Normal file
@@ -0,0 +1,177 @@
# Benchmarks

Regression-test and tuning reference for smarm vs tokio.

## Running

```sh
cargo bench --bench primes            # original compute bench
cargo bench --bench multi_scheduler   # original 3-workload bench
cargo bench --bench general           # benches 1–4
cargo bench --bench tokio_favored     # benches 5–8
cargo bench --bench smarm_favored     # benches 9–12
```

Each bench runs one warmup iteration (discarded) and 15 measured iterations.
Results are reported as median / min / max in microseconds. Median is the
headline number; the spread between min and max indicates measurement
stability.

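The measurement scheme described above (one discarded warmup, then a fixed number of timed iterations summarized as median / min / max) can be sketched as follows. This is a minimal illustration, not the actual harness: the names `Stats`, `summarize`, and `measure` are hypothetical, and the choice of the upper middle value for even sample counts is an assumption.

```rust
use std::time::Instant;

/// Median / min / max of a set of timings, in microseconds.
#[derive(Debug, PartialEq)]
struct Stats {
    median: u128,
    min: u128,
    max: u128,
}

/// Summarize non-empty samples. For an even count this takes the upper of
/// the two middle values; with 15 samples the median is the 8th smallest.
fn summarize(mut samples: Vec<u128>) -> Stats {
    assert!(!samples.is_empty(), "need at least one sample");
    samples.sort_unstable();
    Stats {
        median: samples[samples.len() / 2],
        min: samples[0],
        max: samples[samples.len() - 1],
    }
}

/// Run `workload` once as a discarded warmup, then `iters` timed runs.
/// Each timed run covers the whole closure, so runtime startup and
/// shutdown are included if the closure performs them.
fn measure<F: FnMut()>(iters: usize, mut workload: F) -> Stats {
    workload(); // warmup, not recorded
    let samples: Vec<u128> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed().as_micros()
        })
        .collect();
    summarize(samples)
}
```

A call such as `measure(15, || run_workload())`, where `run_workload` is a placeholder for one full bench body, would then yield the three numbers reported per bench.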
## Methodology notes

- The harness measures wall-clock time for the full workload, including
  runtime startup and shutdown. For multi-thread runtimes this includes the
  cost of spawning worker threads, which can dominate on short-lived benches.
  Where startup matters, the bench is structured so the workload runs much
  longer than typical startup.
- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
  comparison and `new_multi_thread().worker_threads(N)` for the parallel one.
  `smarm::runtime::Config::exact(N)` is the equivalent knob.
- Channel choice: tokio's `unbounded_channel` is used to match smarm's
  unbounded channel semantics. Bounded comparisons would need a separate suite.
- Random delays in `many_timers` come from a deterministic mixing function of
  the actor index, so iterations are reproducible.

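To illustrate the last point, here is one way a deterministic per-actor delay could be derived from the actor index. The bench's actual mixing function is not shown in this document; the SplitMix64-style finalizer below, the names `mix64` and `delay_for_actor`, and the whole-millisecond granularity are all assumptions made for the sketch.

```rust
use std::time::Duration;

/// SplitMix64-style finalizer: a fixed bit-mixing of the input with no
/// RNG state, so the same index always produces the same output.
fn mix64(mut x: u64) -> u64 {
    x = x.wrapping_add(0x9E37_79B9_7F4A_7C15);
    x = (x ^ (x >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    x = (x ^ (x >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    x ^ (x >> 31)
}

/// Map an actor index to a sleep between 1 ms and 10 ms inclusive.
fn delay_for_actor(index: u64) -> Duration {
    Duration::from_millis(1 + mix64(index) % 10)
}
```

Because the delay depends only on the index, every iteration of the bench schedules the same set of wake-up times, which keeps run-to-run comparisons meaningful.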
## Bench catalog

### General: neither runtime structurally favored

| # | Bench               | Stresses                                      | Prediction           |
|---|---------------------|-----------------------------------------------|----------------------|
| 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain       | Roughly even         |
| 2 | `yield_many`        | Pure scheduling throughput, explicit yields   | Roughly even         |
| 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination | Even (compute-bound) |
| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency            | Roughly even         |

A regression here means a real change in per-task or per-yield cost; it
should be investigated regardless of which runtime got slower.

### Tokio-favored: measures cost of smarm's design choices

| # | Bench                  | Stresses                                       | Why tokio should win |
|---|------------------------|------------------------------------------------|----------------------|
| 5 | `spawn_storm_busy`     | 8 background yielders + 10k zero-work spawns   | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
| 6 | `mpsc_contention`      | 32 producers × 10k msgs → 1 consumer           | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
| 7 | `many_timers`          | 10k actors sleeping 1–10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap |
| 8 | `multi_thread_scaling` | Primes, sweep thread count 1, 2, 4, available  | Tokio scales near-linearly; smarm hits its mutex ceiling |

A regression here means a smarm design choice got more expensive. A widening
gap is something to investigate; a narrowing gap after a tuning change is
the desired direction.

### Smarm-favored: measures payoff of green-thread + stackful design

| #  | Bench                 | Stresses                                  | Why smarm should win |
|----|-----------------------|-------------------------------------------|----------------------|
| 9  | `deep_recursion`      | Actor recurses 1000 deep, returns         | Native stack growth vs tokio's per-level `Box::pin` |
| 10 | `yield_in_hot_loop`   | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
| 11 | `uncontended_channel` | 1→1, 1M msgs, single thread               | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
| 12 | `catch_unwind_panics` | 10k spawns, 50% panic                     | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ (exploratory) |

A regression here means we lost some of smarm's structural advantage. #12 is
exploratory: if the baseline shows no real gap, drop it.

## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)

> The sandbox environment has only 1 logical CPU. All multi-thread columns
> (smarm Nt, tokio mt) are therefore equivalent to 1-thread runs, and the
> scaling sweep is limited to 1 thread. The bench output prints the
> "smarm 1-thread" label twice because `available_parallelism()` returns 1,
> which makes the N-thread variant identical to the 1-thread one.

| Bench                | smarm 1t (µs) | smarm Nt (µs) | tokio ct (µs) | tokio mt (µs) | Notes |
|----------------------|---------------|---------------|---------------|---------------|-------|
| chained_spawn        | 7136          | 6979          | 113           | 176           | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
| yield_many           | 40079         | 40073         | 14571         | 14044         | smarm ~2.8x slower; scheduling overhead real |
| fan_out_compute      | 19347         | 19461         | 18616         | 18905         | roughly even; compute-bound as expected |
| ping_pong_oneshot    | 13731         | 14176         | 828           | 3342          | smarm ~17x slower; per-round spawn+join cost high |
| spawn_storm_busy     | 105512        | 107113        | 2222          | 4546          | smarm ~47x slower; global mutex under 8 bg yielders |
| mpsc_contention      | 10456         | 10395         | 17348         | 18628         | smarm wins (~1.7x); uncontended mutex essentially free on 1-thread |
| many_timers          | 120242        | 121023        | 13581         | 14266         | smarm ~9x slower; single min-heap vs sharded wheel |
| multi_thread_scaling |               |               |               |               | see thread-count sweep below |
| deep_recursion       | 62            | 71            | 22            | 44            | tokio wins unexpectedly; see sanity-check notes |
| yield_in_hot_loop    | 182177        | n/a           | 138335        | n/a           | tokio wins; smarm prediction wrong; see notes |
| uncontended_channel  | 31473         | n/a           | 51925         | n/a           | smarm wins as predicted; ~1.65x |
| catch_unwind_panics  | 112306        | 114305        | 151443        | 161344        | smarm wins as predicted; ~1.35x |

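The slowdown factors in the Notes column above appear to be ratios of the single-thread timings (smarm 1t divided by tokio ct); that reading is an assumption, but it reproduces the quoted figures. A small sketch that recomputes them from the table values, using hypothetical names `BASELINE`, `slowdown`, and `report`:

```rust
/// (bench, smarm 1t µs, tokio ct µs), copied from the baseline table.
const BASELINE: [(&str, u64, u64); 5] = [
    ("chained_spawn", 7136, 113),
    ("yield_many", 40079, 14571),
    ("ping_pong_oneshot", 13731, 828),
    ("spawn_storm_busy", 105512, 2222),
    ("many_timers", 120242, 13581),
];

/// Ratio of the two timings; a value above 1.0 means smarm was slower.
fn slowdown(smarm_us: u64, tokio_us: u64) -> f64 {
    smarm_us as f64 / tokio_us as f64
}

/// One formatted line per bench, with the ratio to one decimal place.
fn report() -> Vec<String> {
    BASELINE
        .iter()
        .map(|&(name, smarm, tokio)| format!("{name}: {:.1}x", slowdown(smarm, tokio)))
        .collect()
}
```

Recomputing the ratios this way after each tuning run makes it easy to see whether a gap has widened or narrowed relative to the baseline.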
### `multi_thread_scaling` thread-count sweep (median µs)

> The sandbox has 1 logical CPU, so only the 1-thread row is available.

| Threads     | smarm | tokio mt |
|-------------|-------|----------|
| 1           | 19852 | 19638    |
| 2           | n/a   | n/a      |
| 4           | n/a   | n/a      |
| N (avail=1) | 19852 | 19638    |

## Tuning experiments

### Reduction-budget sweep

`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
the actor checks RDTSC against its timeslice start and yields if the
timeslice has been exceeded. The Nth-allocation threshold (the "reduction
budget") and the timeslice duration are the two knobs.

Record each experiment as a row below. Reference the commit or the parameter
values explicitly.

| Date | Configuration         | Bench (or "all") | Result vs baseline | Notes |
|------|-----------------------|------------------|--------------------|-------|
|      | baseline              | all              | n/a                |       |
|      | budget=…, timeslice=… |                  |                    |       |
|      |                       |                  |                    |       |

When the gap on tokio-favored benches narrows without regressing
smarm-favored benches, the change is a keeper. If a budget change improves
one workload but regresses another by more, prefer keeping the broader-impact
configuration unless we have a clear use case for the trade-off.

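The preemption check described at the start of this section can be sketched as a counter plus a clock comparison. This is only an illustration of how the two knobs interact, not smarm's implementation: the `Preempt` type and its fields are hypothetical, and `std::time::Instant` stands in for the RDTSC read.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-actor preemption state; names are illustrative.
struct Preempt {
    budget: u32,          // check the clock every `budget` allocations
    timeslice: Duration,  // maximum run time before yielding
    allocs: u32,          // allocations since the last clock check
    slice_start: Instant, // when the current timeslice began
}

impl Preempt {
    fn new(budget: u32, timeslice: Duration) -> Self {
        assert!(budget > 0, "budget must be at least 1");
        Self { budget, timeslice, allocs: 0, slice_start: Instant::now() }
    }

    /// Called on every allocation. Returns true when the actor should yield.
    fn on_alloc(&mut self) -> bool {
        self.allocs += 1;
        if self.allocs < self.budget {
            return false; // cheap path: no clock read
        }
        self.allocs = 0;
        // Clock is read only on every `budget`-th allocation.
        self.slice_start.elapsed() >= self.timeslice
    }

    /// Called when the actor is resumed, to start a fresh timeslice.
    fn on_resume(&mut self) {
        self.allocs = 0;
        self.slice_start = Instant::now();
    }
}
```

In this model a larger budget makes the common path cheaper but lets an actor overrun its timeslice by up to `budget - 1` allocations, while a shorter timeslice yields more often; those are the trade-offs the sweep table is meant to capture.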
## Sanity-check notes (baseline run)

### Compile fixes applied

Two bench files had a type error: `smarm::Runtime::run()` takes
`impl FnOnce() + Send + 'static` (a closure returning `()`), but the consumer
closures in `bench_mpsc_smarm` (`tokio_favored.rs`) and `bench_unc_smarm`
(`smarm_favored.rs`) returned `u64` via a bare `count` tail expression. The
fix changes the tail to `let _ = count;` in both closures, and the
corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
No workload semantics changed.

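The type error above can be reproduced without smarm. The sketch below uses a hypothetical stand-in `run` with the same bound (it simply runs the closure on a `std::thread`), and a hypothetical `consume` function whose closure counts messages; the atomic side channel exists only so the example can observe the count.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Stand-in with the bound described above: the closure must return `()`.
/// This is NOT smarm's implementation, only the same signature shape.
fn run(f: impl FnOnce() + Send + 'static) {
    thread::spawn(f).join().expect("closure panicked");
}

/// Count the messages inside a closure passed to `run`.
fn consume(messages: Vec<u64>) -> u64 {
    let seen = Arc::new(AtomicU64::new(0));
    let seen_in_closure = Arc::clone(&seen);
    run(move || {
        let mut count = 0u64;
        for _ in &messages {
            count += 1;
        }
        seen_in_closure.store(count, Ordering::SeqCst);
        // A bare `count` here would make the closure return `u64` and fail
        // to satisfy `FnOnce()`; discarding it keeps the return type `()`.
        let _ = count;
    });
    seen.load(Ordering::SeqCst)
}
```

Replacing `let _ = count;` with a bare `count` makes the compiler reject the call to `run`, which mirrors the error seen in the two bench files.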
### Single-CPU sandbox caveat

`available_parallelism()` returns 1, so every "N-thread" variant is identical
to "1-thread". Multi-thread results should not be used to draw scaling
conclusions; re-run on a multi-core machine before committing to the tuning
sweep.

### Predicted-winner mismatches

**`deep_recursion`: tokio wins (22 µs) over smarm (62 µs).**
At depth 500, smarm has to spawn a fresh actor, which requires mmap'ing a
64 KiB stack; that allocation cost dominates the actual recursion. Tokio's
`Box::pin` recursion allocates 500 small heap objects but avoids the mmap.
The prediction assumed stack allocation was amortised across many uses; here
the actor is single-use. Not a bug, but the bench may not exercise the
intended advantage.

**`yield_in_hot_loop`: tokio wins (138 ms) over smarm (182 ms).**
The prediction was that smarm's ~6-GPR naked context switch would beat
tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
tokio's `current_thread` scheduler has very low overhead per `yield_now`,
while smarm's `yield_now` still goes through the runtime mutex and run queue
even on a single thread. This is a meaningful data point: smarm's scheduling
overhead is not as low as the assembly switch cost alone suggests.

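For the `deep_recursion` mismatch above, the two recursion styles being compared look roughly like this. It is a sketch, not the bench code: `recurse_native`, `recurse_boxed`, and the tiny `block_on_ready` poll loop are hypothetical, and the loop stands in for a real executor only because these futures never wait on anything.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Native recursion: each level is an ordinary stack frame.
fn recurse_native(depth: u32) -> u32 {
    if depth == 0 { 0 } else { 1 + recurse_native(depth - 1) }
}

/// Async recursion: each level allocates one boxed future on the heap.
fn recurse_boxed(depth: u32) -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(async move {
        if depth == 0 { 0 } else { 1 + recurse_boxed(depth - 1).await }
    })
}

/// Waker that does nothing; enough for futures that complete immediately.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Poll a future to completion. Only suitable for futures that never
/// actually suspend; a pending future would make this loop spin.
fn block_on_ready<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Both `recurse_native(500)` and `block_on_ready(recurse_boxed(500))` compute the same value; the difference the bench was meant to expose is the per-level heap allocation in the boxed version, which the baseline suggests is outweighed by the one-off stack mmap when the actor is used only once.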
### Noise / spread

- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
  consistent with tokio issue #3829 noted in task spec.
- `many_timers` smarm spread is acceptable (~10%).

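One way to put a number on the spread figures above is the range divided by the median. The exact definition behind the "~10%" figures is not stated here, so this formula is an assumption, and `relative_spread` and `is_noisy` are hypothetical helper names.

```rust
/// Relative spread of a set of timings: (max - min) / median.
fn relative_spread(min_us: u64, median_us: u64, max_us: u64) -> f64 {
    assert!(median_us > 0, "median must be non-zero");
    assert!(
        min_us <= median_us && median_us <= max_us,
        "expected min <= median <= max"
    );
    (max_us - min_us) as f64 / median_us as f64
}

/// Flag a result whose spread exceeds a chosen threshold (e.g. 0.25).
fn is_noisy(min_us: u64, median_us: u64, max_us: u64, threshold: f64) -> bool {
    relative_spread(min_us, median_us, max_us) > threshold
}
```

Applied to the `spawn_storm_busy` tokio multi-thread result (min 3833 µs, median 4546 µs, max 7305 µs), this gives a spread of roughly 0.76, well above the ~0.10 seen on the other benches listed.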
### Result-column equivalence

All result columns match between runtimes for every bench (same prime counts,
same message totals, same task counts). Workloads are equivalent.