From 4b348d12be63f8a898b16012a1a5127ae66200c8 Mon Sep 17 00:00:00 2001
From: Bench
Date: Sun, 24 May 2026 21:51:13 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20BENCHMARKS=5FAND=5FTUNING.md=20?=
 =?UTF-8?q?=E2=80=94=20bench=20results,=20knob=20recommendations,=20arch?=
 =?UTF-8?q?=20guidance?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 BENCHMARKS_AND_TUNING.md | 320 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 BENCHMARKS_AND_TUNING.md

diff --git a/BENCHMARKS_AND_TUNING.md b/BENCHMARKS_AND_TUNING.md
new file mode 100644
index 0000000..0eeadb8
--- /dev/null
+++ b/BENCHMARKS_AND_TUNING.md
@@ -0,0 +1,320 @@
+# smarm: Benchmarks & Tuning Recommendations
+
+> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
+> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
+> design reasoning and single-core sweep data; re-validate on real hardware.
+
+---
+
+## TL;DR
+
+smarm is competitive with tokio for **channel-heavy, message-passing workloads**
+and wins outright on **uncontended channels** and **panic/unwind isolation**.
+It is 19–70× slower than tokio on **spawn-heavy** patterns and about 10×
+slower on **timer-heavy** workloads. The preemption knobs (`alloc_interval`,
+`timeslice_cycles`) have little effect on a single core; they are expected to
+matter more on multi-core under scheduler-thread contention.
+
+---
+
+## Bench results summary
+
+All medians in µs. Tokio column is `current_thread` unless noted.
+
+| Bench                  | smarm (µs) | tokio (µs) | ratio (smarm/tokio) | winner         |
+|------------------------|------------|------------|---------------------|----------------|
+| `chained_spawn`        | 8 625      | 124        | 70×                 | tokio          |
+| `ping_pong_oneshot`    | 16 848     | 879        | 19×                 | tokio          |
+| `spawn_storm_busy`     | 126 k      | 2 772      | 45×                 | tokio          |
+| `yield_many`           | 41 622     | 15 085     | 2.8×                | tokio          |
+| `yield_in_hot_loop`    | 190 k      | 153 k      | 1.25×               | tokio          |
+| `many_timers`          | 143 k      | 14 462     | 10×                 | tokio          |
+| `fan_out_compute`      | 29 727     | 28 503     | 1.04×               | **even**       |
+| `multi_thread_scaling` | 30 k       | 29 k       | 1.04×               | **even**       |
+| `deep_recursion`       | 83         | 25         | 3.3×                | tokio          |
+| `mpsc_contention`      | 9 062      | 17 570     | 0.52×               | **smarm** 1.9× |
+| `uncontended_channel`  | 27 265     | 51 888     | 0.53×               | **smarm** 1.9× |
+| `catch_unwind_panics`  | 142 k      | 682 k      | 0.21×               | **smarm** 4.8× |
+
+---
+
+## Where smarm wins
+
+### Uncontended channels (1.9× faster)
+
+When a single producer sends to a single consumer with no other actors
+competing for the queue, smarm's channel is 1.9× faster than tokio's
+(27 265 µs vs 51 888 µs). This is the core use case smarm is designed for:
+pipelines of actors passing owned data along a chain.
+
+**Recommendation**: smarm is a good fit for any architecture where data
+flows through a chain of stages, each stage is an actor, and the
+channel between stages is the primary synchronisation point.
+
+### Uncontended MPSC (1.9× faster, same reason)
+
+Multi-producer single-consumer works well for the same reason. On a
+single-thread runtime, smarm's mutex is uncontended, so the lock is
+essentially free. On multi-core this advantage should shrink; re-measure.
+
+### Panic isolation (4.8× faster recovery)
+
+`catch_unwind_panics` creates 10 000 actors that each panic. smarm
+recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
+than tokio. This matters if you're building a system that uses panics
+as a fast abort path for malformed input or actor-level faults, or if
+you're using supervision trees seriously.
+
+**Recommendation**: if your system expects panics to be a normal
+operational event (not just bugs), smarm's supervision story is a
+genuine advantage over tokio's task abort model.
+
+---
+
+## Where smarm loses, and why
+
+### Spawn-heavy workloads (19–70×)
+
+Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
+a syscall. Tokio tasks are heap-allocated state machines: no stack,
+no syscall, ~100 bytes each. For workloads that spawn thousands of
+short-lived actors per second, this is a structural disadvantage.
+
+**Recommendations**:
+- Avoid spawning actors for work that completes in microseconds.
+  Use a worker-pool pattern: spawn N long-lived actors at startup,
+  distribute work over channels.
+- If you genuinely need high-frequency short-lived actors, the stack
+  allocation cost is a known roadmap item (stack caching, slab alloc).
+  It is not an inherent design flaw, just not implemented yet.
+- `deep_recursion` shows the same problem at depth 500: smarm spawns
+  a fresh actor per level, paying the mmap cost repeatedly. Recursive
+  decomposition should use explicit stacks or iteration inside a single
+  actor, not actor-per-level spawning.
+
+### Timer-heavy workloads (10×)
+
+smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
+shared mutex. Tokio uses a sharded hierarchical timer wheel. With
+10 000 pending timers, smarm's O(log N) heap under lock is
+about 10× slower (143 k µs vs 14 462 µs in `many_timers`).
+
+**Recommendations**:
+- Do not use smarm `sleep()` in tight loops with many concurrent
+  sleeping actors if timing precision matters.
+- For IO timeouts: prefer a single timer actor that manages a priority
+  queue and fans out wakeups over channels, rather than 1 000 actors
+  each sleeping directly.
+- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
+  It is the correct fix if timer performance becomes a bottleneck.
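The single-timer-actor recommendation above can be sketched roughly as follows. This is an illustration only: it uses `std::thread` and `std::sync::mpsc` as stand-ins for smarm actors and channels, and `TimerReq`, `timer_actor`, and `demo_fire_order` are hypothetical names, not part of smarm. It assumes each request carries a unique `id`.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::sync::mpsc::{self, Receiver, RecvTimeoutError, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical request: wake `reply` with `id` once `deadline` has passed.
struct TimerReq {
    deadline: Instant,
    id: u64,
    reply: Sender<u64>,
}

/// One loop owns the only heap, so clients never share a timer lock.
fn timer_actor(rx: Receiver<TimerReq>) {
    // Min-heap on (deadline, id); reply channels are kept in a side map.
    let mut heap: BinaryHeap<Reverse<(Instant, u64)>> = BinaryHeap::new();
    let mut waiters: HashMap<u64, Sender<u64>> = HashMap::new();
    let mut open = true; // false once every request sender is dropped

    while open || !heap.is_empty() {
        // Fire every timer whose deadline has passed.
        let now = Instant::now();
        while let Some(&Reverse((deadline, id))) = heap.peek() {
            if deadline > now {
                break;
            }
            heap.pop();
            if let Some(tx) = waiters.remove(&id) {
                let _ = tx.send(id); // the waiter may already be gone
            }
        }
        // Wait for a new request, but never past the earliest deadline.
        let wait = heap
            .peek()
            .map(|Reverse((d, _))| d.saturating_duration_since(Instant::now()));
        let msg = match (open, wait) {
            (true, Some(w)) => rx.recv_timeout(w),
            (true, None) => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
            (false, Some(w)) => {
                thread::sleep(w);
                Err(RecvTimeoutError::Timeout)
            }
            (false, None) => break,
        };
        match msg {
            Ok(req) => {
                heap.push(Reverse((req.deadline, req.id)));
                waiters.insert(req.id, req.reply);
            }
            Err(RecvTimeoutError::Timeout) => {}
            Err(RecvTimeoutError::Disconnected) => open = false,
        }
    }
}

/// Usage sketch: timers registered out of order fire in deadline order.
fn demo_fire_order() -> Vec<u64> {
    let (req_tx, req_rx) = mpsc::channel();
    let timer = thread::spawn(move || timer_actor(req_rx));
    let (wake_tx, wake_rx) = mpsc::channel();
    let start = Instant::now();
    for (id, ms) in [(1u64, 300u64), (2, 100), (3, 200)] {
        let deadline = start + Duration::from_millis(ms);
        req_tx
            .send(TimerReq { deadline, id, reply: wake_tx.clone() })
            .unwrap();
    }
    drop(req_tx);
    drop(wake_tx);
    let order: Vec<u64> = wake_rx.iter().collect();
    timer.join().unwrap();
    order
}
```

The design point is that only one loop touches the heap, so the per-timer cost is a channel send rather than a push under a lock shared with every sleeping actor. Whether this beats direct `sleep()` in smarm is not measured by the bench suite.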
+
+### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
+
+Every `yield_now()` goes through the runtime mutex and run queue even
+on a single-thread scheduler. Tokio's `current_thread` scheduler handles
+yields with much lower overhead. smarm's naked context-switch is fast,
+but the lock acquisition around it dominates for high-frequency yields.
+
+**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
+In message-passing workloads this is natural: yield happens at
+`recv()` and `send()`, which is appropriate. If you are using
+`yield_now()` in a tight loop, consider whether the actor should
+instead be blocking on a channel or sleeping.
+
+---
+
+## Preemption knob recommendations
+
+The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
+Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
+
+### Findings from the sweep
+
+The sweep varied `alloc_interval` in `{32, 64, 128, 256, 512}` and
+`timeslice_cycles` in `{150k, 300k, 600k, 1200k}`, 10 points total.
+
+On a single-CPU machine the knobs are almost inert: most benches move
+< 5% across the entire grid. The exceptions are meaningful:
+
+**Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k`:
+
+- `spawn_storm_busy` slows by 11–15%
+- `catch_unwind_panics` slows by 10–12%
+
+The cause: 8 background yielder actors hold the scheduler mutex longer
+per timeslice, delaying the 10 000 actors waiting to be joined. A
+longer timeslice amplifies the global-mutex bottleneck.
+
+**Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
+`many_timers` improves 3–4%. Actors that are sleeping get rescheduled
+sooner because the runtime polls the timer heap more frequently.
+
+**`alloc_interval` has no clear winner.** Moving from 32 to 512 causes
+< 3% variation on every bench. The check frequency is not the
+bottleneck; the lock is.
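To relate the swept `timeslice_cycles` values to wall-clock time, the small sketch below converts cycles to microseconds. It assumes the TSC ticks at the CPU's nominal frequency, the same assumption behind the "≈100 µs at 3 GHz" default; `cycles_to_micros` and `sweep_timeslices_us` are illustrative helpers, not smarm APIs.

```rust
/// Convert a timeslice in TSC cycles to microseconds at a given TSC frequency.
fn cycles_to_micros(cycles: u64, tsc_hz: u64) -> f64 {
    cycles as f64 * 1e6 / tsc_hz as f64
}

/// Wall-clock length of each `timeslice_cycles` value used in the sweep.
fn sweep_timeslices_us(tsc_hz: u64) -> Vec<(u64, f64)> {
    [150_000u64, 300_000, 600_000, 1_200_000]
        .iter()
        .map(|&cycles| (cycles, cycles_to_micros(cycles, tsc_hz)))
        .collect()
}
```

At 3 GHz the four swept values correspond to 50, 100, 200, and 400 µs, so the degradations reported at `tc=600k` and `tc=1200k` are for timeslices of roughly 200 µs and longer.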
+
+### Recommended starting points
+
+| Workload                         | `alloc_interval` | `timeslice_cycles` |
+|----------------------------------|------------------|--------------------|
+| Default (unknown)                | 128 (default)    | 300 000 (default)  |
+| Many concurrent sleeping actors  | 128              | 150 000            |
+| High-throughput channel pipeline | 128              | 300 000            |
+| Compute-heavy (few allocs)       | 32               | 300 000            |
+| Strict fairness / many actors    | 64               | 150 000            |
+| Long-running compute batches     | 256              | 600 000            |
+
+**Note on `timeslice_cycles` calibration**: the default was tuned for
+≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
+4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
+measure your CPU's TSC frequency at startup and set the cycles value
+accordingly:
+
+```rust
+// Approximate TSC frequency measurement (call once at startup)
+fn tsc_hz() -> u64 {
+    let t0 = smarm::preempt::rdtsc();
+    std::thread::sleep(std::time::Duration::from_millis(100));
+    let t1 = smarm::preempt::rdtsc();
+    (t1 - t0) * 10 // extrapolate to 1 second
+}
+
+let target_us = 100u64; // desired timeslice in microseconds
+let cycles = tsc_hz() / 1_000_000 * target_us; // cycles per µs × target µs
+
+let rt = smarm::runtime::init(
+    smarm::runtime::Config::default()
+        .timeslice_cycles(cycles)
+);
+```
+
+---
+
+## Architecture recommendations
+
+### Use actor pools, not per-request actors
+
+```rust
+// Avoid: spawning an actor per request
+for req in requests {
+    spawn(move || handle(req));
+}
+
+// Prefer: fixed pool, channel dispatch
+let (tx, rx) = channel();
+for _ in 0..num_cpus { // num_cpus: worker count chosen by the caller
+    let rx = rx.clone();
+    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
+}
+for req in requests { tx.send(req).unwrap(); }
+```
+
+The worker pool pattern amortises the 64 KiB mmap cost over the
+lifetime of the pool. The `chained_spawn` bench shows this cost is
+real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
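As a rough worked example of why the pool helps, the sketch below turns the `chained_spawn` totals into a per-spawn cost and an overhead fraction. It treats the whole bench total as spawn cost, which is an assumption (the bench also includes scheduling), and the helper names are illustrative.

```rust
/// Median cost of one spawn in microseconds, given a bench total.
fn per_spawn_us(total_us: f64, spawns: u32) -> f64 {
    total_us / spawns as f64
}

/// Fraction of an actor's lifetime spent on spawning when its body does
/// `work_us` microseconds of useful work.
fn spawn_overhead_fraction(spawn_us: f64, work_us: f64) -> f64 {
    spawn_us / (spawn_us + work_us)
}
```

With the numbers above, `per_spawn_us(8625.0, 1000)` is 8.625 µs for smarm and `per_spawn_us(124.0, 1000)` is 0.124 µs for tokio. An actor spawned for 10 µs of work would then spend about 46% of its lifetime on the spawn itself, whereas a pooled actor pays that cost once.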
+
+### Supervision for fault isolation
+
+smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
+actor panics. Use `spawn_under` to register a supervisor channel and
+build restart logic:
+
+```rust
+let (sup_tx, sup_rx) = channel::<Signal>();
+let child = smarm::spawn_under(sup_tx.clone(), move || {
+    // ... actor body ...
+});
+
+// Supervisor loop
+loop {
+    match sup_rx.recv() {
+        Ok(Signal::Panic(pid, _)) => {
+            // restart, escalate, or record
+        }
+        Ok(Signal::Exit(_)) => break,
+        Err(_) => break,
+    }
+}
+```
+
+This pattern has essentially zero overhead compared to unmonitored
+spawning, and the `catch_unwind_panics` bench measures it at 4.8×
+faster than tokio's abort/recover cycle.
+
+### Explicit preemption in no-alloc hot loops
+
+The allocator-driven preemption mechanism fires every `alloc_interval`
+allocations. Code that never allocates (tight numeric loops, parsing
+fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
+at the natural loop boundary:
+
+```rust
+for chunk in data.chunks(4096) {
+    process(chunk); // no allocations
+    smarm::check!(); // yield if timeslice expired
+}
+```
+
+This is explicitly called out in `LOOM.md` as a known limitation.
+The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
+smarm is 1.25× slower than tokio even with explicit yields, which sets
+the floor on how much `check!()` can help in truly tight loops.
+
+### IO-bound work
+
+smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
+the actor without blocking the OS scheduler thread, so other actors on
+that thread keep running. The current suite has no bench for IO-bound
+workloads, so this path is unmeasured, but the architecture is sound
+for network servers and file-IO pipelines.
+
+---
+
+## Known limitations and roadmap items
+
+These are from `LOOM.md` plus observations from the bench suite.
+
+| Limitation                    | Impact              | Roadmap status                  |
+|-------------------------------|---------------------|---------------------------------|
+| No stack size caching / slab  | High spawn cost     | Deferred                        |
+| Global single min-heap timers | Poor at many timers | Deferred (hierarchical wheel)   |
+| Global `Mutex`                | Lock contention     | Deferred (per-thread queues)    |
+| No `join!()` macro            | Ergonomics          | Deferred                        |
+| x86-64 Linux only             | Portability         | ARM64 deferred                  |
+| No restart intensity caps     | Supervision safety  | Deferred                        |
+| Yield overhead under lock     | Hot-loop fairness   | Structural / ongoing            |
+
+The yield overhead and global mutex are the two issues most likely to
+matter on a real multi-core workload. The sweep showed that
+`timeslice_cycles` affects the mutex hold time (10–15% on the contended
+benches); the right long-term fix is per-thread run queues with
+work stealing.
+
+---
+
+## Running the bench suite
+
+```sh
+# Run all benches once, print results
+python3 benches/sweep.py run
+
+# Save current results as regression baseline
+python3 benches/sweep.py run --save-baseline
+
+# Check for regressions (>10% slower than baseline → exit 1)
+python3 benches/sweep.py regress
+
+# Sweep preemption knobs across the grid defined in sweep.py
+python3 benches/sweep.py sweep
+
+# Sweep and save raw data as CSV
+python3 benches/sweep.py sweep --save-csv results.csv
+
+# Run a single knob configuration manually
+SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
+    cargo bench --bench general
+```
+
+The regression threshold is 10% and is configurable in `sweep.py`
+(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
+same file.
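The regression rule described above (more than 10% slower than baseline counts as a failure) can be expressed as a short sketch. This only illustrates the rule; it is not the code in `sweep.py`, and `slowdown_pct` and `regressions` are hypothetical names.

```rust
/// Percentage slowdown of `current_us` relative to `baseline_us`
/// (positive means slower).
fn slowdown_pct(baseline_us: f64, current_us: f64) -> f64 {
    (current_us - baseline_us) / baseline_us * 100.0
}

/// Names of benches whose median got slower by more than `threshold_pct`.
/// Benches missing from the baseline are skipped.
fn regressions<'a>(
    baseline: &[(&'a str, f64)],
    current: &[(&'a str, f64)],
    threshold_pct: f64,
) -> Vec<&'a str> {
    current
        .iter()
        .filter_map(|&(name, cur)| {
            let base = baseline.iter().find(|&&(n, _)| n == name)?.1;
            (slowdown_pct(base, cur) > threshold_pct).then_some(name)
        })
        .collect()
}
```

A caller would exit with status 1 when the returned list is non-empty, mirroring the behaviour the `regress` subcommand is described as having.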