Update the documentation

2026-05-25 22:14:07 +02:00
parent 2b85ef60b2
commit d432349f99
6 changed files with 1348 additions and 25 deletions
@@ -0,0 +1,177 @@
+# Benchmarks
+
+Regression-test and tuning reference for smarm vs tokio.
+
+## Running
+
+```sh
+cargo bench --bench primes              # original compute bench
+cargo bench --bench multi_scheduler     # original 3-workload bench
+cargo bench --bench general             # benches 1–4
+cargo bench --bench tokio_favored       # benches 5–8
+cargo bench --bench smarm_favored       # benches 9–12
+```
+
+Each bench runs one warmup iteration (discarded) and 15 measured iterations.
+Results are reported as median / min / max in microseconds. Median is the
+headline number; the spread between min and max indicates measurement
+stability.
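+
+The measurement loop described above can be sketched roughly as follows. This
+is a minimal illustration, not the harness shipped in the bench files:
+`Summary`, `summarize`, and `run_bench` are hypothetical names, and the closure
+stands in for a full runtime start, workload, and shutdown.
+
+```rust
+use std::time::{Duration, Instant};
+
+/// Median / min / max of one bench, in microseconds.
+#[derive(Debug, PartialEq)]
+struct Summary {
+    median_us: u128,
+    min_us: u128,
+    max_us: u128,
+}
+
+/// Sort the samples and pick the middle one as the median.
+fn summarize(mut samples: Vec<Duration>) -> Summary {
+    assert!(!samples.is_empty(), "need at least one sample");
+    samples.sort();
+    Summary {
+        median_us: samples[samples.len() / 2].as_micros(),
+        min_us: samples[0].as_micros(),
+        max_us: samples[samples.len() - 1].as_micros(),
+    }
+}
+
+/// Run one discarded warmup iteration, then `measured` timed iterations.
+fn run_bench<F: FnMut()>(mut workload: F, measured: usize) -> Summary {
+    workload(); // warmup, not recorded
+    let samples = (0..measured)
+        .map(|_| {
+            let start = Instant::now();
+            workload();
+            start.elapsed()
+        })
+        .collect();
+    summarize(samples)
+}
+
+fn main() {
+    // Placeholder workload; a real bench would start and stop a runtime here.
+    let s = run_bench(
+        || {
+            std::hint::black_box((0..10_000u64).sum::<u64>());
+        },
+        15,
+    );
+    println!("median {} / min {} / max {} µs", s.median_us, s.min_us, s.max_us);
+}
+```
+
+With an odd sample count such as 15, the middle element of the sorted samples
+is the exact median, so no averaging of two neighbours is needed.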
+
+## Methodology notes
+
+- The harness measures wall-clock time for the full workload, including
+  runtime startup and shutdown. For multi-thread runtimes this means worker
+  thread spawn cost is included; on short-lived benches this can dominate.
+  Where startup matters, the bench is structured so the workload runs much
+  longer than typical startup.
+- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
+  comparison and `new_multi_thread().worker_threads(N)` for parallel.
+  `smarm::runtime::Config::exact(N)` is the equivalent knob.
+- The mpsc benches use tokio's `unbounded_channel` to match smarm's unbounded
+  channel semantics. Bounded comparisons would need a separate suite.
+- Delays in `many_timers` are pseudo-random: they come from a deterministic
+  mixing function of the actor index, so iterations are reproducible.
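+
+As an illustration of the last point, a deterministic delay can be derived
+from the actor index along these lines. The mixer here is the SplitMix64
+finalizer, chosen only for the sketch; the function the bench actually uses is
+not shown in this document, and `mix` and `delay_for` are hypothetical names.
+Whole-millisecond granularity is also an assumption.
+
+```rust
+use std::time::Duration;
+
+/// SplitMix64 finalizer: a stand-in mixer, not necessarily the bench's own.
+fn mix(index: u64) -> u64 {
+    let mut z = index.wrapping_add(0x9E37_79B9_7F4A_7C15);
+    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
+    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
+    z ^ (z >> 31)
+}
+
+/// Map an actor index to a delay in the 1 to 10 ms range.
+/// The same index always yields the same delay.
+fn delay_for(actor_index: u64) -> Duration {
+    Duration::from_millis(1 + mix(actor_index) % 10)
+}
+
+fn main() {
+    for i in 0..5u64 {
+        println!("actor {i}: sleep {:?}", delay_for(i));
+    }
+}
+```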
+
+## Bench catalog
+
+### General — neither runtime structurally favored
+
+| # | Bench               | Stresses                                        | Prediction         |
+|---|---------------------|-------------------------------------------------|--------------------|
+| 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain         | Roughly even       |
+| 2 | `yield_many`        | Pure scheduling throughput, explicit yields     | Roughly even       |
+| 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination   | Even (compute-bound) |
+| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency              | Roughly even       |
+
+A regression here means a real change in per-task or per-yield cost — those
+should be investigated regardless of which runtime got slower.
+
+### Tokio-favored — measures cost of smarm's design choices
+
+| # | Bench                   | Stresses                                              | Why tokio should win                                                              |
+|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
+| 5 | `spawn_storm_busy`      | 8 background yielders + 10k zero-work spawns          | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
+| 6 | `mpsc_contention`       | 32 producers × 10k msgs → 1 consumer                  | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
+| 7 | `many_timers`           | 10k actors sleeping 1–10 ms, dense wake window        | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap          |
+| 8 | `multi_thread_scaling`  | Primes, sweep thread count 1, 2, 4, available         | Tokio scales near-linearly; smarm hits its mutex ceiling                          |
+
+A regression here means a smarm design choice got more expensive. Widening
+gaps signal something to investigate; narrowing gaps after a tuning change is
+the desired direction.
+
+### Smarm-favored — measures payoff of green-thread + stackful design
+
+| #  | Bench                  | Stresses                                                  | Why smarm should win                                                            |
+|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
+| 9  | `deep_recursion`       | Actor recurses 1000 deep, returns                         | Native stack growth vs tokio's per-level `Box::pin`                             |
+| 10 | `yield_in_hot_loop`    | 2 actors, 500k yields each, single thread                 | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
+| 11 | `uncontended_channel`  | 1→1, 1M msgs, single thread                               | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
+| 12 | `catch_unwind_panics`  | 10k spawns, 50% panic                                     | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
+
+A regression here means we lost some of smarm's structural advantage. #12 is
+exploratory — if the baseline shows no real gap, drop it.
+
+## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
+
+> The sandbox environment has only 1 logical CPU. All multi-thread columns
+> (smarm Nt, tokio mt) therefore run with a single worker thread, and the
+> scaling sweep is limited to 1 thread. Labels are duplicated in the bench
+> output ("smarm 1-thread" appears twice) because
+> `available_parallelism()` == 1, so the N-thread variant is identical.
+
+| Bench (times in µs) | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
+|---------------------|----------|----------|----------|----------|-------|
+| chained_spawn       | 7136     | 6979     | 113      | 176      | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
+| yield_many          | 40079    | 40073    | 14571    | 14044    | smarm ~2.8x slower; scheduling overhead real |
+| fan_out_compute     | 19347    | 19461    | 18616    | 18905    | roughly even; compute-bound as expected |
+| ping_pong_oneshot   | 13731    | 14176    | 828      | 3342     | smarm ~17x slower; per-round spawn+join cost high |
+| spawn_storm_busy    | 105512   | 107113   | 2222     | 4546     | smarm ~47x slower; global mutex under 8 bg yielders |
+| mpsc_contention     | 10456    | 10395    | 17348    | 18628    | smarm wins; uncontended mutex essentially free on 1-thread |
+| many_timers         | 120242   | 121023   | 13581    | 14266    | smarm ~9x slower; single min-heap vs sharded wheel |
+| multi_thread_scaling |         |          |          |          | see thread-count sweep below |
+| deep_recursion      | 62       | 71       | 22       | 44       | tokio wins unexpectedly; see sanity-check notes |
+| yield_in_hot_loop   | 182177   | —        | 138335   | —        | tokio wins; smarm prediction wrong; see notes |
+| uncontended_channel | 31473    | —        | 51925    | —        | smarm wins as predicted; ~1.65x |
+| catch_unwind_panics | 112306   | 114305   | 151443   | 161344   | smarm wins as predicted; ~1.35x |
+
+### `multi_thread_scaling` thread-count sweep (median µs)
+
+> The sandbox has 1 logical CPU; only the 1-thread row is available.
+
+| Threads | smarm | tokio mt |
+|---------|-------|----------|
+| 1       | 19852 | 19638    |
+| 2       | —     | —        |
+| 4       | —     | —        |
+| N (avail=1) | 19852 | 19638 |
+
+## Tuning experiments
+
+### Reduction-budget sweep
+
+`smarm` uses an allocator-driven preemption mechanism: on every Nth allocation,
+the actor compares RDTSC against its timeslice start and yields if the
+timeslice has elapsed. The allocation threshold N (the "reduction budget") and
+the timeslice duration are the two knobs.
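+
+The interaction of the two knobs can be sketched as below. This is a
+simplified model under stated assumptions, not smarm's implementation:
+`Preempt` and `on_alloc` are hypothetical names, `std::time::Instant` stands in
+for RDTSC, and `on_alloc` is called by hand rather than from an allocator hook.
+
+```rust
+use std::time::{Duration, Instant};
+
+/// Hypothetical per-actor preemption state.
+struct Preempt {
+    budget: u32,         // check the clock on every `budget`-th allocation
+    timeslice: Duration, // how long an actor may run before yielding
+    allocs: u32,         // allocations since the last clock check
+    slice_start: Instant,
+}
+
+impl Preempt {
+    fn new(budget: u32, timeslice: Duration) -> Self {
+        Preempt { budget, timeslice, allocs: 0, slice_start: Instant::now() }
+    }
+
+    /// Called once per allocation; returns true when the actor should yield.
+    fn on_alloc(&mut self) -> bool {
+        self.allocs += 1;
+        if self.allocs < self.budget {
+            return false; // cheap path: no clock read
+        }
+        self.allocs = 0;
+        if self.slice_start.elapsed() >= self.timeslice {
+            self.slice_start = Instant::now(); // a new slice starts after the yield
+            return true;
+        }
+        false
+    }
+}
+
+fn main() {
+    let mut p = Preempt::new(64, Duration::from_micros(100));
+    let mut yields = 0u32;
+    for _ in 0..1_000_000 {
+        if p.on_alloc() {
+            yields += 1;
+        }
+    }
+    println!("yield points hit: {yields}");
+}
+```
+
+In this model a smaller budget reads the clock more often, which tightens
+preemption latency at the cost of more clock reads, while the timeslice bounds
+how long an allocating actor runs between yields.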
+
+Record each experiment as a row below. Reference the commit or the parameter
+values explicitly.
+
+| Date | Configuration              | Bench (or "all")     | Result vs baseline           | Notes |
+|------|----------------------------|----------------------|------------------------------|-------|
+|      | baseline                   | all                  | —                            |       |
+|      | budget=…, timeslice=…      |                      |                              |       |
+|      |                            |                      |                              |       |
+
+When the gap on tokio-favored benches narrows without regressing
+smarm-favored benches, the change is a keeper. If a budget change improves
+one workload but regresses another by more, prefer keeping the broader-impact
+configuration unless we have a clear use case for the trade-off.
+
+## Sanity-check notes (baseline run)
+
+### Compile fixes applied
+
+Two bench files had a type error: `smarm::Runtime::run()` takes
+`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
+in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
+(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
+by changing the tail to `let _ = count;` in both closures, and the
+corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
+No workload semantics changed.
+
+### Single-CPU sandbox caveat
+
+`available_parallelism()` returns 1, so every "N-thread" variant is identical
+to "1-thread". Multi-thread results should not be used to draw scaling
+conclusions; re-run on a multi-core machine before committing to the tuning
+sweep.
+
+### Predicted-winner mismatches
+
+**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
+At depth 500, smarm spawns a fresh actor, which requires `mmap`ing a 64 KiB
+stack; that allocation cost dominates the actual recursion. Tokio's
+`Box::pin` recursion allocates 500 small heap objects but avoids the `mmap`.
+The prediction assumed stack allocation was amortised across many uses; here
+the actor is single-use. This is not a bug, but the bench may not exercise
+the intended advantage.
+
+**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
+The prediction was that smarm's ~6-GPR naked context switch would beat
+tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
+tokio's `current_thread` scheduler has very low overhead per `yield_now`,
+while smarm's `yield_now` still goes through the runtime mutex and run queue
+even on a single thread. This is a meaningful data point: smarm's scheduling
+overhead is not as low as the assembly switch cost alone suggests.
+
+### Noise / spread
+
+- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
+- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
+  consistent with tokio issue #3829 noted in task spec.
+- `many_timers` smarm spread acceptable (~10%).
+
+### Result-column equivalence
+
+All result columns match between runtimes for every bench (same prime counts,
+same message totals, same task counts). Workloads are equivalent.