From 4b348d12be63f8a898b16012a1a5127ae66200c8 Mon Sep 17 00:00:00 2001
From: Bench <bench@smarm>
Date: Sun, 24 May 2026 21:51:13 +0000
Subject: [PATCH] =?UTF-8?q?docs:=20BENCHMARKS=5FAND=5FTUNING.md=20?=
 =?UTF-8?q?=E2=80=94=20bench=20results,=20knob=20recommendations,=20arch?=
 =?UTF-8?q?=20guidance?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 BENCHMARKS_AND_TUNING.md | 320 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 320 insertions(+)
 create mode 100644 BENCHMARKS_AND_TUNING.md

diff --git a/BENCHMARKS_AND_TUNING.md b/BENCHMARKS_AND_TUNING.md
new file mode 100644
index 0000000..0eeadb8
--- /dev/null
+++ b/BENCHMARKS_AND_TUNING.md
@@ -0,0 +1,320 @@
+# smarm — Benchmarks & Tuning Recommendations
+
+> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
+> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
+> design reasoning and single-core sweep data; re-validate on real hardware.
+
+---
+
+## TL;DR
+
+smarm is competitive with tokio for **channel-heavy, message-passing workloads**
+and wins outright on **uncontended channels** (1.9×) and **panic/unwind
+isolation** (4.8×). It is 19–70× slower than tokio for **spawn-heavy** patterns
+and about 10× slower for **timer-heavy** workloads. The preemption knobs
+(`alloc_interval`, `timeslice_cycles`) have little effect on a single core;
+they are expected to matter more on multi-core under scheduler-thread contention.
+
+---
+
+## Bench results summary
+
+All medians in µs; ratio is smarm ÷ tokio. Tokio column is `current_thread` unless noted.
+
+| Bench                  | smarm  | tokio  | ratio  | winner         |
+|------------------------|--------|--------|--------|----------------|
+| `chained_spawn`        | 8 625  | 124    | 70×    | tokio          |
+| `ping_pong_oneshot`    | 16 848 | 879    | 19×    | tokio          |
+| `spawn_storm_busy`     | 126 k  | 2 772  | 45×    | tokio          |
+| `yield_many`           | 41 622 | 15 085 | 2.8×   | tokio          |
+| `yield_in_hot_loop`    | 190 k  | 153 k  | 1.25×  | tokio          |
+| `many_timers`          | 143 k  | 14 462 | 10×    | tokio          |
+| `fan_out_compute`      | 29 727 | 28 503 | 1.04×  | **even**       |
+| `multi_thread_scaling` | 30 k   | 29 k   | 1.04×  | **even**       |
+| `deep_recursion`       | 83     | 25     | 3.3×   | tokio          |
+| `mpsc_contention`      | 9 062  | 17 570 | 0.52×  | **smarm** 1.9× |
+| `uncontended_channel`  | 27 265 | 51 888 | 0.53×  | **smarm** 1.9× |
+| `catch_unwind_panics`  | 142 k  | 682 k  | 0.21×  | **smarm** 4.8× |
+
+---
+
+## Where smarm wins
+
+### Uncontended channels (1.9× faster)
+
+When a single producer sends to a single consumer with no other actors
+competing for the queue, smarm's channel is about 1.9× faster than tokio's
+(27 265 µs vs 51 888 µs). This is the core use case smarm is designed for:
+pipelines of actors passing owned data along a chain.
+
+**Recommendation**: smarm is a good fit for any architecture where data
+flows through a chain of stages, each stage is an actor, and the
+channel between stages is the primary synchronisation point.
+
+### MPSC on a single scheduler thread (1.9× faster)
+
+`mpsc_contention` shows the same margin. On a single-thread runtime the
+producers never run in parallel, so smarm's mutex is uncontended and the lock
+is essentially free. On multi-core this advantage will shrink; re-measure.
+
+### Panic isolation (4.8× faster recovery)
+
+`catch_unwind_panics` creates 10 000 actors that each panic. smarm
+recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
+than tokio. This matters if you're building a system that uses panics
+as a fast abort path for malformed input or actor-level faults, or if
+you're using supervision trees seriously.
+
+**Recommendation**: if your system expects panics to be a normal
+operational event (not just bugs), smarm's supervision story is a
+genuine advantage over tokio's task abort model.
+
+---
+
+## Where smarm loses, and why
+
+### Spawn-heavy workloads (19–70×)
+
+Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
+a syscall. Tokio tasks are heap-allocated state machines — no stack,
+no syscall, ~100 bytes each. For workloads that spawn thousands of
+short-lived actors per second, this is a structural disadvantage.
+
+**Recommendations**:
+- Avoid spawning actors for work that completes in microseconds.
+  Use a worker-pool pattern: spawn N long-lived actors at startup,
+  distribute work over channels.
+- If you genuinely need high-frequency short-lived actors, the stack
+  allocation cost is a known roadmap item (stack caching, slab alloc).
+  It is not an inherent design flaw — just not implemented yet.
+- `deep_recursion` shows the same problem at depth 500: smarm spawns
+  a fresh actor per level, paying the mmap cost repeatedly. Recursive
+  decomposition should use explicit stacks or iteration inside a single
+  actor, not actor-per-level spawning.
+
+### Timer-heavy workloads (10×)
+
+smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
+shared mutex. Tokio uses a sharded hierarchical timer wheel. With
+10 000 pending timers, smarm's O(log N) heap operations under the lock are
+about 10× slower in `many_timers`.
+
+**Recommendations**:
+- Do not use smarm `sleep()` in tight loops with many concurrent
+  sleeping actors if timing precision matters.
+- For IO timeouts: prefer a single timer actor that manages a priority
+  queue and fans out wakeups over channels, rather than 1 000 actors
+  each sleeping directly.
+- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
+  It is the correct fix if timer performance becomes a bottleneck.
+
+### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
+
+Every `yield_now()` goes through the runtime mutex and run queue even
+on a single-thread scheduler. Tokio's current_thread scheduler handles
+yields with much lower overhead. smarm's naked context-switch is fast,
+but the lock acquisition around it dominates for high-frequency yields.
+
+**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
+In message-passing workloads this is natural — yield happens at
+`recv()` and `send()`, which is appropriate. If you are using
+`yield_now()` in a tight loop, consider whether the actor should
+instead be blocking on a channel or sleeping.
+
+---
+
+## Preemption knob recommendations
+
+The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
+Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
+
+### Findings from the sweep
+
+The sweep varied `alloc_interval` in `{32, 64, 128, 256, 512}` and
+`timeslice_cycles` in `{150k, 300k, 600k, 1200k}` (10 points total).
+
+On a single-CPU machine the knobs are almost inert: most benches move
+< 5% across the entire grid. The exceptions are meaningful:
+
+**Longer timeslices hurt under contention.** At `timeslice_cycles` of 600k and 1200k:
+
+- `spawn_storm_busy` slows down by 11–15%
+- `catch_unwind_panics` slows down by 10–12%
+
+The cause: 8 background yielder actors hold the scheduler mutex longer
+per timeslice, delaying the 10 000 actors waiting to be joined. A
+longer timeslice amplifies the global-mutex bottleneck.
+
+**Shorter timeslices marginally help timer-heavy work.** At
+`timeslice_cycles = 150k`, `many_timers` improves 3–4%. Sleeping actors get
+rescheduled sooner because the runtime polls the timer heap more frequently.
+
+**alloc_interval has no clear winner.** Moving from 32 to 512 causes
+< 3% variation on every bench. The check frequency is not the
+bottleneck — the lock is.
+
+### Recommended starting points
+
+| Workload                          | alloc_interval | timeslice_cycles |
+|-----------------------------------|----------------|------------------|
+| Default (unknown)                 | 128 (default)  | 300 000 (default)|
+| Many concurrent sleeping actors   | 128            | 150 000          |
+| High-throughput channel pipeline  | 128            | 300 000          |
+| Compute-heavy (few allocs)        | 32             | 300 000          |
+| Strict fairness / many actors     | 64             | 150 000          |
+| Long-running compute batches      | 256            | 600 000          |
+
+**Note on `timeslice_cycles` calibration**: the default was tuned for
+≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
+4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
+measure your CPU's TSC frequency at startup and set the cycles value
+accordingly:
+
+```rust
+// Approximate TSC frequency measurement (call once at startup)
+fn tsc_hz() -> u64 {
+    let t0 = smarm::preempt::rdtsc();
+    std::thread::sleep(std::time::Duration::from_millis(100));
+    let t1 = smarm::preempt::rdtsc();
+    (t1 - t0) * 10  // extrapolate to 1 second
+}
+
+let target_us = 100u64; // desired timeslice in microseconds
+let cycles = tsc_hz() * target_us / 1_000_000; // multiply first so integer division does not truncate
+
+let rt = smarm::runtime::init(
+    smarm::runtime::Config::default()
+        .timeslice_cycles(cycles)
+);
+```
+
+---
+
+## Architecture recommendations
+
+### Use actor pools, not per-request actors
+
+```rust
+// Avoid: spawning an actor per request
+for req in requests {
+    spawn(move || handle(req));
+}
+
+// Prefer: fixed pool, channel dispatch
+let (tx, rx) = channel();
+for _ in 0..num_cpus {
+    let rx = rx.clone();
+    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
+}
+for req in requests { tx.send(req).unwrap(); }
+```
+
+The worker pool pattern amortises the 64 KiB mmap cost over the
+lifetime of the pool. The `chained_spawn` bench shows this cost is
+real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
+
+### Supervision for fault isolation
+
+smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
+actor panics. Use `spawn_under` to register a supervisor channel and
+build restart logic:
+
+```rust
+let (sup_tx, sup_rx) = channel::<smarm::Signal>();
+let child = smarm::spawn_under(sup_tx.clone(), move || {
+    // ... actor body ...
+});
+
+// Supervisor loop
+loop {
+    match sup_rx.recv() {
+        Ok(smarm::Signal::Panic(pid, _)) => {
+            // restart, escalate, or record
+        }
+        Ok(smarm::Signal::Exit(_)) => break,
+        Err(_) => break,
+    }
+}
+```
+
+This pattern has essentially zero overhead compared to unmonitored
+spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
+faster than tokio's abort/recover cycle.
+
+### Explicit preemption in no-alloc hot loops
+
+The allocator-driven preemption mechanism fires every `alloc_interval`
+allocations. Code that never allocates (tight numeric loops, parsing
+fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
+at the natural loop boundary:
+
+```rust
+for chunk in data.chunks(4096) {
+    process(chunk);       // no allocations
+    smarm::check!();      // yield if timeslice expired
+}
+```
+
+`LOOM.md` explicitly calls out the lack of preemption in non-allocating code
+as a known limitation. The `yield_in_hot_loop` bench (1M iterations of
+`yield_now()`) shows smarm is 1.25× slower than tokio even with explicit
+yields, which sets the floor on how much `check!()` can help in tight loops.
+
+### IO-bound work
+
+smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
+the actor without blocking the OS scheduler thread. The current suite has
+no bench for IO-bound workloads, so there is no measured data to report
+here. By design, the approach should suit network servers and file-IO
+pipelines; benchmark your own workload before relying on it.
+
+---
+
+## Known limitations and roadmap items
+
+These are from `LOOM.md` plus observations from the bench suite.
+
+| Limitation                    | Impact             | Roadmap status     |
+|-------------------------------|--------------------|--------------------|
+| No stack caching / slab alloc | High spawn cost    | Deferred           |
+| Global single min-heap timers | Poor at many timers| Deferred (hierarch. wheel) |
+| Global `Mutex<RunQueue>`      | Lock contention    | Deferred (per-thread queues) |
+| No `join!()` macro            | Ergonomics         | Deferred           |
+| x86-64 Linux only             | Portability        | ARM64 deferred     |
+| No restart intensity caps     | Supervision safety | Deferred           |
+| Yield overhead under lock     | Hot-loop fairness  | Structural / ongoing |
+
+The yield overhead and global mutex are the two issues most likely to
+matter on a real multi-core workload. The sweep confirmed that
+`timeslice_cycles` is a meaningful knob for controlling the mutex
+hold time; the right long-term fix is per-thread run queues with
+work stealing.
+
+---
+
+## Running the bench suite
+
+```sh
+# Run all benches once, print results
+python3 benches/sweep.py run
+
+# Save current results as regression baseline
+python3 benches/sweep.py run --save-baseline
+
+# Check for regressions (>10% slower than baseline → exit 1)
+python3 benches/sweep.py regress
+
+# Sweep preemption knobs across the grid defined in sweep.py
+python3 benches/sweep.py sweep
+
+# Sweep and save raw data as CSV
+python3 benches/sweep.py sweep --save-csv results.csv
+
+# Run a single knob configuration manually
+SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
+    cargo bench --bench general
+```
+
+The regression threshold is 10% and is configurable in `sweep.py`
+(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
+same file.