Update the documentation

2026-05-25 22:14:07 +02:00
parent 2b85ef60b2
commit d432349f99
6 changed files with 1348 additions and 25 deletions
@@ -0,0 +1,217 @@
+# SMARM Architecture
+
+> Erlang-style actor concurrency for Rust, without the copies, the colors, or the GC pauses.
+
+---
+
+## Vision
+
+Rust gives you the right ownership discipline for safe actor concurrency almost for free — `Send` already
+draws the boundary, the borrow checker already enforces it. What it lacks is an execution model to match:
+async/await is IO-centric, colors your functions, and trades stack simplicity for state-machine complexity;
+OS threads are too heavy to spawn per actor.
+
+SMARM adds a third option: **green-thread actors on a shared heap**, scheduled cooperatively, with
+message-passing as the only cross-actor communication primitive. You get Erlang's isolation model without
+Erlang's copying GC, and you get Rust's zero-copy ownership transfers without async's cognitive overhead.
+No function coloring. No `Box<dyn Future>`. Just actors, messages, and the borrow checker doing what it
+already does.
+
+---
+
+## Do: Core Runtime
+
+### Actors and scheduling
+
+Each actor is a lightweight green thread with its own heap-allocated, growable stack. Stacks are
+allocated via `mmap` with a guard page below the region; overflow is detected by the OS without SMARM
+polling for it. Initial stacks are small and grow by remapping on demand.
+
+The scheduler runs one OS thread per CPU. Each scheduler thread loops against a single global
+`Mutex<HashMap>` queue shared across all schedulers. If queue contention becomes a measured bottleneck
+this can be revisited; the interface will not change.
+
+SMARM requires `panic = "unwind"` in the Cargo profile. With `panic = "abort"` there is nothing for
+`catch_unwind` to catch, so supervision and actor isolation silently degrade to process death.
+
+### Process descriptor
+
+Each actor has a descriptor that is hot while the actor runs and will typically live in L1 cache.
+It holds:
+
+- `stack_base: *mut u8` — bottom of the allocated stack region
+- `stack_cap: usize` — total allocated size
+- `stack_ptr: *mut u8` — current stack pointer (`rsp`), saved on yield
+- `pid: (u32, u32)` — index and generation counter (see PIDs below)
+- `alloc_count: u32` — countdown for preemption sampling
+- `timeslice_start: u64` — `RDTSC` value written on every resume
+- `resize_count: u16` — diagnostic counter for stack growth events
+- `context: *mut ContextSaveArea` — pointer to the register save area (cold, touched only on switch)
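+
+A minimal sketch of how the fields above could be laid out, assuming a `#[repr(C)]` struct so
+the assembly shim can rely on fixed offsets. The names `ProcessDescriptor` and `Pid`, the use of
+a struct instead of a `(u32, u32)` tuple, and the placeholder body of `ContextSaveArea` are
+assumptions for illustration; they are not taken from the SMARM source.
+
+```rust
+/// Placeholder for the per-architecture register save area (assumed layout).
+#[repr(C, align(16))]
+pub struct ContextSaveArea {
+    pub bytes: [u8; 304], // x86-64 size quoted in the context switching section
+}
+
+/// (index, generation) pair, as described in the PIDs section.
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+#[repr(C)]
+pub struct Pid {
+    pub index: u32,
+    pub generation: u32,
+}
+
+/// Hypothetical descriptor layout: hot fields first, cold pointer last.
+#[repr(C)]
+pub struct ProcessDescriptor {
+    pub stack_base: *mut u8,           // bottom of the allocated stack region
+    pub stack_cap: usize,              // total allocated size
+    pub stack_ptr: *mut u8,            // saved stack pointer, written on yield
+    pub pid: Pid,                      // index and generation
+    pub alloc_count: u32,              // countdown for preemption sampling
+    pub timeslice_start: u64,          // TSC value written on every resume
+    pub resize_count: u16,             // diagnostic counter for stack growth
+    pub context: *mut ContextSaveArea, // cold: touched only on switch
+}
+```
+
+With this field order and `#[repr(C)]`, the descriptor occupies 64 bytes on a 64-bit target,
+which is a single cache line on typical x86-64 parts.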
+
+### Context switching
+
+Context switching is implemented in a `#[naked]` assembly shim, one per supported architecture.
+The compiler cannot be asked to switch stacks.
+
+**Suspend** (yield, preemption, or blocking):
+1. Save callee-saved integer registers and SIMD registers into `ContextSaveArea`.
+2. Save `rsp`/`sp` into the process descriptor.
+3. Load the scheduler's stack pointer from a thread-local and jump back into the scheduler loop.
+
+**Resume**:
+1. Load `rsp`/`sp` from the process descriptor.
+2. Restore registers from `ContextSaveArea`.
+3. `ret` — the return address is already on the restored stack, execution resumes exactly where the
+   actor yielded.
+
+**x86-64**: saves `rbx`, `rbp`, `r12`–`r15` (6 × 8 = 48 bytes) and `xmm0`–`xmm15` (16 × 16 = 256
+bytes) = 304 bytes total. Full SSE baseline is required; the compiler may autovectorise freely.
+AVX-512 is deferred.
+
+**ARM64**: saves `x19`–`x30` (12 × 8 = 96 bytes, including the link register `x30` which must be
+saved explicitly — it holds the return address, unlike x86 where `call` pushes it to the stack) and
+`d8`–`d15` (8 × 8 = 64 bytes) = 160 bytes total.
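+
+The byte counts above can be checked at compile time. The following is a sketch only: the type
+names `SaveAreaX64` and `SaveAreaArm64` and the field order are assumptions, and the real
+`ContextSaveArea` may be laid out differently.
+
+```rust
+/// x86-64 save area: six callee-saved GPRs plus xmm0..xmm15.
+#[repr(C, align(16))]
+pub struct SaveAreaX64 {
+    pub gprs: [u64; 6],      // rbx, rbp, r12, r13, r14, r15
+    pub xmm: [[u8; 16]; 16], // xmm0..xmm15, 16 bytes each
+}
+
+/// ARM64 save area: x19..x30 plus d8..d15.
+#[repr(C, align(16))]
+pub struct SaveAreaArm64 {
+    pub gprs: [u64; 12], // x19..x30; x30 is the link register
+    pub fprs: [u64; 8],  // d8..d15
+}
+
+// 6 * 8 + 16 * 16 = 304 and 12 * 8 + 8 * 8 = 160; both are multiples of 16,
+// so the alignment attribute adds no tail padding.
+const _: () = assert!(core::mem::size_of::<SaveAreaX64>() == 304);
+const _: () = assert!(core::mem::size_of::<SaveAreaArm64>() == 160);
+```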
+
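+Each actor's `ContextSaveArea` is allocated once as a `Box<ContextSaveArea>`, and the descriptor's
+`context` pointer refers to it. Its lifetime equals the actor's lifetime, so there is no churn and
+no bulk deallocation to optimise for; a plain `Box` is sufficient.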
+
+Initial platform target is x86-64 Linux. ARM64 and macOS are natural follow-ons.
+
+### Allocator-driven preemption
+
+Every Nth allocation, the allocator reads `RDTSC` and compares it against `timeslice_start`. If the
+threshold is exceeded the actor yields. The workloads that most often starve a scheduler (sustained
+compute, data transformation) tend to allocate frequently, so allocation count is a cheap proxy
+for elapsed work. It is an approximation, not a guarantee.
+
+`RDTSC` is not monotonic across core migration; a slightly wrong timeslice is acceptable. SMARM is
+not a real-time scheduler.
+
+Known failure mode: tight no-alloc loops are invisible to this mechanism. Actors doing sustained
+allocation-free compute must call `smarm::yield_now()` explicitly, or offload to a thread pool
+outside the actor scheduler (e.g. rayon). This is documented and acceptable — such loops are rare
+in message-passing workloads.
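+
+The check described in this section can be sketched as a countdown that reads the clock only
+on every Nth allocation. This is an illustration under assumptions: the `Preempt` type and its
+method names are hypothetical, and the clock is passed in as a closure so the logic can be
+exercised without `RDTSC`.
+
+```rust
+/// Hypothetical per-actor preemption state; field names follow the descriptor.
+pub struct Preempt {
+    pub alloc_interval: u32,   // N: read the clock every Nth allocation
+    pub timeslice_cycles: u64, // budget in TSC ticks
+    pub alloc_count: u32,      // countdown until the next clock read
+    pub timeslice_start: u64,  // TSC value written on resume
+}
+
+impl Preempt {
+    /// Called from the allocation path. Returns true when the actor should yield.
+    /// `read_tsc` runs only on every Nth call, so most allocations pay for one
+    /// decrement and one branch.
+    pub fn on_alloc(&mut self, read_tsc: impl FnOnce() -> u64) -> bool {
+        self.alloc_count = self.alloc_count.saturating_sub(1);
+        if self.alloc_count > 0 {
+            return false;
+        }
+        self.alloc_count = self.alloc_interval;
+        // wrapping_sub avoids an overflow panic if the TSC appears to go
+        // backwards after a core migration; the huge result forces an early yield.
+        read_tsc().wrapping_sub(self.timeslice_start) > self.timeslice_cycles
+    }
+
+    /// Called by the scheduler when the actor is resumed.
+    pub fn on_resume(&mut self, now: u64) {
+        self.timeslice_start = now;
+        self.alloc_count = self.alloc_interval;
+    }
+}
+```
+
+A loop that never allocates never calls `on_alloc`, which is exactly the failure mode noted
+above.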
+
+### Yield points
+
+An actor yields at:
+
+- **Channel send/recv** — the primary communication primitive
+- **Mutex contention** — attempting to lock a held `Arc<Mutex<>>` parks the actor
+- **IO** — blocking on a socket or file descriptor parks the actor until the IO thread signals readiness
+- **`smarm::sleep(duration)`** — parks the actor; the timer wheel re-queues it on expiry
+- **`smarm::yield_now()`** — explicit cooperative yield
+- **Allocator preemption** — as above
+- **Spawn** — does not yield by default; the new actor is queued and the spawner continues
+
+`std::thread::sleep` inside an actor blocks the entire OS thread and should never be used. SMARM
+may emit a warning if it can detect this.
+
+### IO thread
+
+A single dedicated IO thread runs an `epoll`/`kqueue` loop. Actors blocking on IO register their
+file descriptor and PID; the IO thread moves them back into the global queue when the fd is ready.
+A `HashMap<RawFd, Pid>` maps fds to parked actors. Cancellation (actor dies while waiting on IO)
+deregisters the fd. This is intentionally simple and not pluggable; SMARM is not a general async
+executor.
+
+### Communication
+
+Messages must be `Send`. `Copy` alone is not enough: a raw pointer is `Copy` but not `Send`.
+Non-`Send` types cannot cross an actor boundary; this is enforced by the type system with no
+runtime overhead.
+
+Two primitives only:
+
+- **Move** — transfer owned data across a channel. Zero copy. The sender relinquishes ownership
+  at the type level. This is the default.
+- **`Arc<Mutex<T>>`** — for genuinely shared long-lived state. Explicit and visible.
+
+Cross-actor `Rc` or bare pointers are banned. There is no cycle detector. Cross-actor cycles are
+banned by construction: either transfer ownership or use `Arc`.
+
+### PIDs
+
+A PID is a `(index, generation)` pair. The index may be reused after an actor dies; the generation
+counter increments on every death. A stale handle holding the wrong generation is a detectable
+error, not a silent misdirection. This avoids the ABA problem without reserving PID space forever.
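+
+A minimal sketch of generation-checked slots, assuming a simple `Vec`-backed table. `PidTable`,
+`Slot`, and the method names are hypothetical and only illustrate how a stale handle is
+rejected once its index has been reused.
+
+```rust
+#[derive(Clone, Copy, Debug, PartialEq, Eq)]
+pub struct Pid {
+    pub index: u32,
+    pub generation: u32,
+}
+
+struct Slot<T> {
+    generation: u32,
+    value: Option<T>,
+}
+
+/// Hypothetical actor table: indices are reused, generations are not.
+pub struct PidTable<T> {
+    slots: Vec<Slot<T>>,
+    free: Vec<u32>,
+}
+
+impl<T> PidTable<T> {
+    pub fn new() -> Self {
+        Self { slots: Vec::new(), free: Vec::new() }
+    }
+
+    /// Spawn: reuse a free index if one exists, keeping its bumped generation.
+    pub fn insert(&mut self, value: T) -> Pid {
+        if let Some(index) = self.free.pop() {
+            let slot = &mut self.slots[index as usize];
+            slot.value = Some(value);
+            Pid { index, generation: slot.generation }
+        } else {
+            let index = self.slots.len() as u32;
+            self.slots.push(Slot { generation: 0, value: Some(value) });
+            Pid { index, generation: 0 }
+        }
+    }
+
+    /// A stale handle yields None instead of resolving to the new occupant.
+    pub fn get(&self, pid: Pid) -> Option<&T> {
+        let slot = self.slots.get(pid.index as usize)?;
+        if slot.generation == pid.generation { slot.value.as_ref() } else { None }
+    }
+
+    /// Actor death: empty the slot, bump the generation, recycle the index.
+    pub fn remove(&mut self, pid: Pid) -> Option<T> {
+        let slot = self.slots.get_mut(pid.index as usize)?;
+        if slot.generation != pid.generation {
+            return None;
+        }
+        let value = slot.value.take()?;
+        slot.generation = slot.generation.wrapping_add(1);
+        self.free.push(pid.index);
+        Some(value)
+    }
+}
+```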
+
+### Supervision
+
+Every actor has a supervisor, assigned at spawn. This is not optional. The root supervisor is
+provided by the runtime; its death is a process exit.
+
+A supervisor receives one of three signals when a child actor terminates:
+
+- `Signal::Exit(pid)` — normal completion
+- `Signal::Panic(pid, payload)` — caught via `catch_unwind` at the actor entry point boundary,
+  before unwinding can reach the assembly shim
+- `Signal::Timeout(pid)` — actor exceeded a budget (see below)
+
+The supervisor decides: restart the actor, escalate to its own supervisor, or ignore. Restart
+intensity is capped: if an actor panics more than N times within a time window, the supervisor
+stops restarting and escalates. This prevents a bad prelude or corrupted input from spinning the
+supervisor in a restart loop indefinitely. N and the window are configurable per supervisor with a
+sensible global default.
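+
+The restart cap can be sketched as a sliding window over recent panic times. `RestartLimiter`
+and `record_panic` are hypothetical names, and the caller supplies `now` so the example stays
+deterministic; none of this is taken from the SMARM API.
+
+```rust
+use std::collections::VecDeque;
+use std::time::{Duration, Instant};
+
+/// Hypothetical sliding-window restart counter for one supervised child.
+pub struct RestartLimiter {
+    max_restarts: usize,       // N
+    window: Duration,          // length of the time window
+    recent: VecDeque<Instant>, // panic times still inside the window
+}
+
+impl RestartLimiter {
+    pub fn new(max_restarts: usize, window: Duration) -> Self {
+        Self { max_restarts, window, recent: VecDeque::new() }
+    }
+
+    /// Record a panic observed at `now`. Returns true if the child may be
+    /// restarted, false if the supervisor should stop restarting and escalate.
+    pub fn record_panic(&mut self, now: Instant) -> bool {
+        // Drop panics that have aged out of the window.
+        while let Some(&oldest) = self.recent.front() {
+            if now.duration_since(oldest) > self.window {
+                self.recent.pop_front();
+            } else {
+                break;
+            }
+        }
+        self.recent.push_back(now);
+        // More than N panics inside the window means escalate.
+        self.recent.len() <= self.max_restarts
+    }
+}
+```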
+
+### Mutex timeout
+
+Every `smarm::mutex` lock attempt is mediated by the scheduler. If the lock is not acquired within
+a configurable timeout, the actor receives a `LockTimeout` error rather than parking forever. This
+is a hard runtime guarantee, not a convention. Default timeout is global and configurable;
+individual locks and individual call sites can override it.
+
+### Task joining
+
+Actors can spawn children and wait on a group of handles:
+
+```rust
+let h1 = smarm::spawn(|| compute_a());
+let h2 = smarm::spawn(|| compute_b());
+let (a, b) = smarm::join!(h1, h2);
+```
+
+`join!` parks the calling actor until all handles complete. The last child to finish re-queues the
+parent. This is a countdown in the parent's descriptor; no polling, no waker registration. A
+`join_timeout!` variant is a natural extension.
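+
+The countdown can be sketched with a single atomic counter. `JoinCountdown` and `child_done`
+are hypothetical names; in SMARM the counter is described as living in the parent's descriptor,
+and the re-queue step is left out here.
+
+```rust
+use std::sync::atomic::{AtomicUsize, Ordering};
+
+/// Hypothetical join counter shared between a parent and its children.
+pub struct JoinCountdown {
+    remaining: AtomicUsize,
+}
+
+impl JoinCountdown {
+    pub fn new(children: usize) -> Self {
+        Self { remaining: AtomicUsize::new(children) }
+    }
+
+    /// Called once by each child when it finishes. Returns true for exactly
+    /// one caller, the last child, which is then responsible for re-queueing
+    /// the parked parent.
+    pub fn child_done(&self) -> bool {
+        // fetch_sub returns the previous value; 1 means this call reached zero.
+        self.remaining.fetch_sub(1, Ordering::AcqRel) == 1
+    }
+}
+```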
+
+### Timer wheel
+
+`smarm::sleep` and supervision timeouts are driven by a timer wheel in the scheduler. Sleeping
+actors are parked and re-queued by the timer thread on expiry. The timer wheel is internal
+infrastructure; its design is an implementation detail.
+
+---
+
+## Defer: Later Work
+
+- **Stack sizing policy** — initial size, growth factor, and whether stacks ever shrink are
+  implementation decisions to be made with profiling data, not up front.
+- **Queue contention** — if `Mutex<HashMap>` proves to be a bottleneck under profiling, evaluate
+  `DashMap` or a lock-free work-stealing deque (e.g. `crossbeam-deque`). Not before.
+- **AVX-512 context save** — extend `ContextSaveArea` when there is a concrete use case.
+- **`smarm::sleep` vs raw sleep semantics** — further control knobs deferred until the basic sleep
+  is working and real use cases are understood.
+- **Supervision tree API** — the contract is defined; the recursive hierarchy, restart strategies,
+  and introspection API are implementation work.
+- **no_std support** — the assembly shim is no_std friendly but the IO thread and allocator require
+  OS primitives. Target is no_std + `alloc` on hosted platforms; bare metal is out of scope.
+- **Distribution** — SMARM is a single-process runtime. No distribution protocol, no BEAM-style
+  clustering.
+
+---
+
+## What SMARM is Not
+
+- Not a drop-in replacement for Tokio. SMARM does not implement `Future` or the async executor interface.
+- Not a general allocator. SMARM manages actor stacks; heap allocation for actor data goes through
+  the system allocator.
+- Not Erlang. No hot code reloading, no distribution protocol, no BEAM bytecode. SMARM is a
+  concurrency runtime, not a platform.
+- Not a real-time scheduler. Timeslice accuracy is best-effort.
+
+
+---
+
+## On names
+
+<sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>
@@ -0,0 +1,320 @@
+# smarm — Benchmarks & Tuning Recommendations
+
+> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
+> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
+> design reasoning and single-core sweep data; re-validate on real hardware.
+
+---
+
+## TL;DR
+
+smarm is competitive with tokio for **channel-heavy, message-passing workloads**
+and wins outright on **uncontended channels** and **panic/unwind isolation**.
+It is significantly slower than tokio for **spawn-heavy** patterns and
+**timer-heavy** workloads. The preemption knobs (`alloc_interval`,
+`timeslice_cycles`) have minimal effect on single-core machines; they matter
+on multi-core under scheduler-thread contention.
+
+---
+
+## Bench results summary
+
+All medians in µs. Tokio column is `current_thread` unless noted.
+
+| Bench                | smarm  | tokio  | ratio  | winner        |
+|----------------------|--------|--------|--------|---------------|
+| `chained_spawn`      | 8 625  | 124    | 70×    | tokio         |
+| `ping_pong_oneshot`  | 16 848 | 879    | 19×    | tokio         |
+| `spawn_storm_busy`   | 126 k  | 2 772  | 45×    | tokio         |
+| `yield_many`         | 41 622 | 15 085 | 2.8×   | tokio         |
+| `yield_in_hot_loop`  | 190 k  | 153 k  | 1.25×  | tokio         |
+| `many_timers`        | 143 k  | 14 462 | 10×    | tokio         |
+| `fan_out_compute`    | 29 727 | 28 503 | 1.04×  | **even**      |
+| `multi_thread_scaling` | 30 k | 29 k   | 1.04×  | **even**      |
+| `deep_recursion`     | 83     | 25     | 3.3×   | tokio         |
+| `mpsc_contention`    | 9 062  | 17 570 | 0.52×  | **smarm** 1.9× |
+| `uncontended_channel`| 27 265 | 51 888 | 0.53×  | **smarm** 1.9× |
+| `catch_unwind_panics`| 142 k  | 682 k  | 0.21×  | **smarm** 4.8× |
+
+---
+
+## Where smarm wins
+
+### Uncontended channels (1.9× faster)
+
+When a single producer sends to a single consumer with no other actors
+competing for the queue, smarm's channel is meaningfully faster than
+tokio's. This is the core use case smarm is designed for: pipelines of
+actors passing owned data along a chain.
+
+**Recommendation**: smarm is a good fit for any architecture where data
+flows through a chain of stages, each stage is an actor, and the
+channel between stages is the primary synchronisation point.
+
+### Uncontended MPSC (1.9× faster, same reason)
+
+Multi-producer single-consumer works well for the same reason. On a
+single-thread runtime, smarm's mutex is uncontended, so the lock is
+essentially free. On multi-core this advantage will shrink; re-measure.
+
+### Panic isolation (4.8× faster recovery)
+
+`catch_unwind_panics` creates 10 000 actors that each panic. smarm
+recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
+than tokio. This matters if you're building a system that uses panics
+as a fast abort path for malformed input or actor-level faults, or if
+you're using supervision trees seriously.
+
+**Recommendation**: if your system expects panics to be a normal
+operational event (not just bugs), smarm's supervision story is a
+genuine advantage over tokio's task abort model.
+
+---
+
+## Where smarm loses, and why
+
+### Spawn-heavy workloads (19–70×)
+
+Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
+a syscall. Tokio tasks are heap-allocated state machines — no stack,
+no syscall, ~100 bytes each. For workloads that spawn thousands of
+short-lived actors per second, this is a structural disadvantage.
+
+**Recommendations**:
+- Avoid spawning actors for work that completes in microseconds.
+  Use a worker-pool pattern: spawn N long-lived actors at startup,
+  distribute work over channels.
+- If you genuinely need high-frequency short-lived actors, the stack
+  allocation cost is a known roadmap item (stack caching, slab alloc).
+  It is not an inherent design flaw — just not implemented yet.
+- `deep_recursion` shows the same problem at depth 500: the bench runs
+  the recursion in a fresh, single-use actor, so the 64 KiB mmap is paid
+  on every iteration and dominates the recursion itself. Keep recursive
+  work inside a long-lived actor (native recursion, explicit stacks, or
+  iteration) and do not spawn an actor per level.
+
+### Timer-heavy workloads (10×)
+
+smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
+shared mutex. Tokio uses a sharded hierarchical timer wheel. With
+10 000 pending timers, smarm's O(log N) heap under lock is
+dramatically slower.
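+
+For reference, a min-heap of `(deadline, Pid)` pairs can be sketched with the standard library
+as follows. This is an illustration, not the smarm implementation: `TimerHeap`, the `u64`
+deadline unit, and the `Pid` alias are assumptions, and the surrounding mutex is omitted.
+
+```rust
+use std::cmp::Reverse;
+use std::collections::BinaryHeap;
+
+type Pid = (u32, u32); // (index, generation)
+
+/// Hypothetical timer queue: earliest deadline at the top of the heap.
+pub struct TimerHeap {
+    heap: BinaryHeap<Reverse<(u64, Pid)>>,
+}
+
+impl TimerHeap {
+    pub fn new() -> Self {
+        Self { heap: BinaryHeap::new() }
+    }
+
+    /// O(log N) per sleeping actor.
+    pub fn insert(&mut self, deadline: u64, pid: Pid) {
+        self.heap.push(Reverse((deadline, pid)));
+    }
+
+    /// Pop every actor whose deadline is at or before `now`, earliest first.
+    /// Each pop is O(log N), so a dense wake window costs O(k log N).
+    pub fn pop_expired(&mut self, now: u64) -> Vec<Pid> {
+        let mut due = Vec::new();
+        while let Some(&Reverse((deadline, pid))) = self.heap.peek() {
+            if deadline > now {
+                break;
+            }
+            self.heap.pop();
+            due.push(pid);
+        }
+        due
+    }
+}
+```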
+
+**Recommendations**:
+- Do not use smarm `sleep()` in tight loops with many concurrent
+  sleeping actors if timing precision matters.
+- For IO timeouts: prefer a single timer actor that manages a priority
+  queue and fans out wakeups over channels, rather than 1 000 actors
+  each sleeping directly.
+- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
+  It is the correct fix if timer performance becomes a bottleneck.
+
+### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
+
+Every `yield_now()` goes through the runtime mutex and run queue even
+on a single-thread scheduler. Tokio's current_thread scheduler handles
+yields with much lower overhead. smarm's naked context-switch is fast,
+but the lock acquisition around it dominates for high-frequency yields.
+
+**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
+In message-passing workloads this is natural — yield happens at
+`recv()` and `send()`, which is appropriate. If you are using
+`yield_now()` in a tight loop, consider whether the actor should
+instead be blocking on a channel or sleeping.
+
+---
+
+## Preemption knob recommendations
+
+The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
+Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
+
+### Findings from the sweep
+
+The sweep varied alloc_interval in `{32, 64, 128, 256, 512}` and
+timeslice_cycles in `{150k, 300k, 600k, 1200k}` — 10 points total.
+
+On a single-CPU machine the knobs are almost inert: most benches move
+< 5% across the entire grid. The exceptions are meaningful:
+
+**Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k`:
+
+- `spawn_storm_busy` degrades +11–15%
+- `catch_unwind_panics` degrades +10–12%
+
+The cause: 8 background yielder actors hold the scheduler mutex longer
+per timeslice, delaying the 10 000 actors waiting to be joined. A
+longer timeslice amplifies the global-mutex bottleneck.
+
+**Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
+`many_timers` improves 3–4%. Actors that are sleeping get rescheduled
+sooner because the runtime polls the timer heap more frequently.
+
+**alloc_interval has no clear winner.** Moving from 32 to 512 causes
+< 3% variation on every bench. The check frequency is not the
+bottleneck — the lock is.
+
+### Recommended starting points
+
+| Workload                          | alloc_interval | timeslice_cycles |
+|-----------------------------------|----------------|------------------|
+| Default (unknown)                 | 128 (default)  | 300 000 (default)|
+| Many concurrent sleeping actors   | 128            | 150 000          |
+| High-throughput channel pipeline  | 128            | 300 000          |
+| Compute-heavy (few allocs)        | 32             | 300 000          |
+| Strict fairness / many actors     | 64             | 150 000          |
+| Long-running compute batches      | 256            | 600 000          |
+
+**Note on `timeslice_cycles` calibration**: the default was tuned for
+≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
+4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
+measure your CPU's TSC frequency at startup and set the cycles value
+accordingly:
+
+```rust
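+use std::time::{Duration, Instant};
+
+// Approximate TSC frequency measurement (call once at startup, outside any actor).
+// `thread::sleep` can overshoot, so scale by the measured elapsed time.
+fn tsc_hz() -> u64 {
+    let start = Instant::now();
+    let t0 = smarm::preempt::rdtsc();
+    std::thread::sleep(Duration::from_millis(100));
+    let t1 = smarm::preempt::rdtsc();
+    let elapsed_ns = start.elapsed().as_nanos();
+    ((t1 - t0) as u128 * 1_000_000_000 / elapsed_ns) as u64
+}
+
+let target_us = 100u64; // desired timeslice in microseconds
+// Multiply before dividing so the sub-MHz part of the frequency is not truncated.
+let cycles = tsc_hz() * target_us / 1_000_000;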
+
+let rt = smarm::runtime::init(
+    smarm::runtime::Config::default()
+        .timeslice_cycles(cycles)
+);
+```
+
+---
+
+## Architecture recommendations
+
+### Use actor pools, not per-request actors
+
+```rust
+// Avoid: spawning an actor per request
+for req in requests {
+    spawn(move || handle(req));
+}
+
+// Prefer: fixed pool, channel dispatch
+let (tx, rx) = channel();
+for _ in 0..num_cpus {
+    let rx = rx.clone();
+    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
+}
+for req in requests { tx.send(req).unwrap(); }
+```
+
+The worker pool pattern amortises the 64 KiB mmap cost over the
+lifetime of the pool. The `chained_spawn` bench shows this cost is
+real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
+
+### Supervision for fault isolation
+
+smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
+actor panics. Use `spawn_under` to register a supervisor channel and
+build restart logic:
+
+```rust
+let (sup_tx, sup_rx) = channel::<smarm::Signal>();
+let child = smarm::spawn_under(sup_tx.clone(), move || {
+    // ... actor body ...
+});
+
+// Supervisor loop
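+loop {
+    match sup_rx.recv() {
+        Ok(smarm::Signal::Panic(pid, _payload)) => {
+            // restart, escalate, or record
+        }
+        Ok(smarm::Signal::Exit(_)) => break,
+        // Any other signal variant (the architecture doc also lists `Timeout`).
+        Ok(_) => {}
+        Err(_) => break,
+    }
+}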
+```
+
+This pattern has essentially zero overhead compared to unmonitored
+spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
+faster than tokio's abort/recover cycle.
+
+### Explicit preemption in no-alloc hot loops
+
+The allocator-driven preemption mechanism fires every `alloc_interval`
+allocations. Code that never allocates (tight numeric loops, parsing
+fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
+at the natural loop boundary:
+
+```rust
+for chunk in data.chunks(4096) {
+    process(chunk);       // no allocations
+    smarm::check!();      // yield if timeslice expired
+}
+```
+
+This is explicitly called out in `LOOM.md` as a known limitation.
+The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
+smarm is 1.25× slower than tokio even with explicit yields, which sets
+the floor on how much `check!()` can help in truly tight loops.
+
+### IO-bound work
+
+smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
+the actor without blocking the OS scheduler thread. This is correct and
+works well. There is no specific bench for IO-bound workloads in the
+current suite, but the architecture is sound for network servers and
+file-IO pipelines.
+
+---
+
+## Known limitations and roadmap items
+
+These are from `LOOM.md` plus observations from the bench suite.
+
+| Limitation                    | Impact             | Roadmap status     |
+|-------------------------------|--------------------|--------------------|
+| No stack size caching / slab  | High spawn cost    | Deferred           |
+| Global single min-heap timers | Poor at many timers| Deferred (hierarch. wheel) |
+| Global `Mutex<RunQueue>`      | Lock contention    | Deferred (per-thread queues) |
+| No `join!()` macro            | Ergonomics         | Deferred           |
+| x86-64 Linux only             | Portability        | ARM64 deferred     |
+| No restart intensity caps     | Supervision safety | Deferred           |
+| Yield overhead under lock     | Hot-loop fairness  | Structural / ongoing |
+
+The yield overhead and global mutex are the two issues most likely to
+matter on a real multi-core workload. The sweep confirmed that
+`timeslice_cycles` is a meaningful knob for controlling the mutex
+hold time; the right long-term fix is per-thread run queues with
+work stealing.
+
+---
+
+## Running the bench suite
+
+```sh
+# Run all benches once, print results
+python3 benches/sweep.py run
+
+# Save current results as regression baseline
+python3 benches/sweep.py run --save-baseline
+
+# Check for regressions (>10% slower than baseline → exit 1)
+python3 benches/sweep.py regress
+
+# Sweep preemption knobs across the grid defined in sweep.py
+python3 benches/sweep.py sweep
+
+# Sweep and save raw data as CSV
+python3 benches/sweep.py sweep --save-csv results.csv
+
+# Run a single knob configuration manually
+SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
+    cargo bench --bench general
+```
+
+The regression threshold is 10% and is configurable in `sweep.py`
+(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
+same file.
@@ -0,0 +1,177 @@
+# Benchmarks
+
+Regression-test and tuning reference for smarm vs tokio.
+
+## Running
+
+```sh
+cargo bench --bench primes              # original compute bench
+cargo bench --bench multi_scheduler     # original 3-workload bench
+cargo bench --bench general             # benches 1–4
+cargo bench --bench tokio_favored       # benches 5–8
+cargo bench --bench smarm_favored       # benches 9–12
+```
+
+Each bench runs one warmup iteration (discarded) and 15 measured iterations.
+Results are reported as median / min / max in microseconds. Median is the
+headline number; the spread between min and max indicates measurement
+stability.
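+
+The measurement loop described above can be sketched as follows. This is not the actual bench
+harness: `Stats`, `summarize`, and `measure` are hypothetical names, and the real harness may
+time or report differently.
+
+```rust
+use std::time::Instant;
+
+#[derive(Debug, PartialEq, Eq)]
+pub struct Stats {
+    pub median_us: u128,
+    pub min_us: u128,
+    pub max_us: u128,
+}
+
+/// Reduce raw samples to median / min / max. With an odd sample count
+/// (15 here) the middle element of the sorted list is the exact median.
+pub fn summarize(mut samples: Vec<u128>) -> Stats {
+    assert!(!samples.is_empty(), "need at least one sample");
+    samples.sort_unstable();
+    Stats {
+        median_us: samples[samples.len() / 2],
+        min_us: samples[0],
+        max_us: samples[samples.len() - 1],
+    }
+}
+
+/// Run one discarded warmup, then `iterations` timed runs of `workload`.
+pub fn measure(mut workload: impl FnMut(), iterations: usize) -> Stats {
+    workload(); // warmup, discarded
+    let samples = (0..iterations)
+        .map(|_| {
+            let start = Instant::now();
+            workload();
+            start.elapsed().as_micros()
+        })
+        .collect();
+    summarize(samples)
+}
+```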
+
+## Methodology notes
+
+- The harness times wall-clock elapsed for the full workload, including
+  runtime startup and shutdown. For multi-thread runtimes this means worker
+  thread spawn cost is included; on short-lived benches this can dominate.
+  Where startup matters, the bench is structured so the workload is much
+  longer than typical startup.
+- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
+  comparison and `new_multi_thread().worker_threads(N)` for parallel.
+  `smarm::runtime::Config::exact(N)` is the equivalent knob.
+- mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel
+  semantics. Bounded comparisons would need a separate suite.
+- Random delays in `many_timers` use a deterministic mixing function of the
+  actor index so iterations are reproducible.
+
+## Bench catalog
+
+### General — neither runtime structurally favored
+
+| # | Bench               | Stresses                                        | Prediction         |
+|---|---------------------|-------------------------------------------------|--------------------|
+| 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain         | Roughly even       |
+| 2 | `yield_many`        | Pure scheduling throughput, explicit yields     | Roughly even       |
+| 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination   | Even (compute-bound) |
+| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency              | Roughly even       |
+
+A regression here means a real change in per-task or per-yield cost — those
+should be investigated regardless of which runtime got slower.
+
+### Tokio-favored — measures cost of smarm's design choices
+
+| # | Bench                   | Stresses                                              | Why tokio should win                                                              |
+|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
+| 5 | `spawn_storm_busy`      | 8 background yielders + 10k zero-work spawns          | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
+| 6 | `mpsc_contention`       | 32 producers × 10k msgs → 1 consumer                  | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
+| 7 | `many_timers`           | 10k actors sleeping 1–10 ms, dense wake window        | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap          |
+| 8 | `multi_thread_scaling`  | Primes, sweep thread count 1, 2, 4, available         | Tokio scales near-linearly; smarm hits its mutex ceiling                          |
+
+A regression here means a smarm design choice got more expensive. Widening
+gaps signal something to investigate; narrowing gaps after a tuning change is
+the desired direction.
+
+### Smarm-favored — measures payoff of green-thread + stackful design
+
+| #  | Bench                  | Stresses                                                  | Why smarm should win                                                            |
+|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
+| 9  | `deep_recursion`       | Actor recurses 1000 deep, returns                         | Native stack growth vs tokio's per-level `Box::pin`                             |
+| 10 | `yield_in_hot_loop`    | 2 actors, 500k yields each, single thread                 | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
+| 11 | `uncontended_channel`  | 1→1, 1M msgs, single thread                               | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
+| 12 | `catch_unwind_panics`  | 10k spawns, 50% panic                                     | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
+
+A regression here means we lost some of smarm's structural advantage. #12 is
+exploratory — if the baseline shows no real gap, drop it.
+
+## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
+
+> Sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt,
+> tokio mt) are equivalent to 1-thread; scaling sweep is limited to 1 thread.
+> Label duplication in bench output ("smarm 1-thread" appearing twice) is
+> because available_parallelism() == 1, so the N-thread variant is identical.
+
+| Bench               | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
+|---------------------|----------|----------|----------|----------|-------|
+| chained_spawn       | 7136     | 6979     | 113      | 176      | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
+| yield_many          | 40079    | 40073    | 14571    | 14044    | smarm ~2.8x slower; scheduling overhead real |
+| fan_out_compute     | 19347    | 19461    | 18616    | 18905    | roughly even; compute-bound as expected |
+| ping_pong_oneshot   | 13731    | 14176    | 828      | 3342     | smarm ~17x slower; per-round spawn+join cost high |
+| spawn_storm_busy    | 105512   | 107113   | 2222     | 4546     | smarm ~47x slower; global mutex under 8 bg yielders |
+| mpsc_contention     | 10456    | 10395    | 17348    | 18628    | smarm wins; uncontended mutex essentially free on 1-thread |
+| many_timers         | 120242   | 121023   | 13581    | 14266    | smarm ~9x slower; single min-heap vs sharded wheel |
+| multi_thread_scaling | —      | —        | —        | —        | see thread-count sweep below |
+| deep_recursion      | 62       | 71       | 22       | 44       | tokio wins unexpectedly; see sanity-check notes |
+| yield_in_hot_loop   | 182177   | —        | 138335   | —        | tokio wins; smarm prediction wrong; see notes |
+| uncontended_channel | 31473    | —        | 51925    | —        | smarm wins as predicted; ~1.65x |
+| catch_unwind_panics | 112306   | 114305   | 151443   | 161344   | smarm wins as predicted; ~1.35x |
+
+### `multi_thread_scaling` thread-count sweep (median µs)
+
+> Sandbox has 1 logical CPU; only 1-thread row is available.
+
+| Threads | smarm | tokio mt |
+|---------|-------|----------|
+| 1       | 19852 | 19638    |
+| 2       | —     | —        |
+| 4       | —     | —        |
+| N (avail=1) | 19852 | 19638 |
+
+## Tuning experiments
+
+### Reduction-budget sweep
+
+`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
+the actor checks RDTSC against its timeslice start and yields if over budget.
+The Nth-allocation threshold (the "reduction budget") and the timeslice
+duration are the two knobs.
+
+Record each experiment as a row below. Reference the commit or the parameter
+values explicitly.
+
+| Date | Configuration              | Bench (or "all")     | Result vs baseline           | Notes |
+|------|----------------------------|----------------------|------------------------------|-------|
+|      | baseline                   | all                  | —                            |       |
+|      | budget=…, timeslice=…      |                      |                              |       |
+|      |                            |                      |                              |       |
+
+When the gap on tokio-favored benches narrows without regressing
+smarm-favored benches, the change is a keeper. If a budget change improves
+one workload but regresses another by more, prefer keeping the broader-impact
+configuration unless we have a clear use case for the trade-off.
+
+## Sanity-check notes (baseline run)
+
+### Compile fixes applied
+
+Two bench files had a type error: `smarm::Runtime::run()` takes
+`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
+in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
+(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
+by changing the tail to `let _ = count;` in both closures, and the
+corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
+No workload semantics changed.
+
+### Single-CPU sandbox caveat
+
+`available_parallelism()` returns 1, so every "N-thread" variant is identical
+to "1-thread". Multi-thread results should not be used to draw scaling
+conclusions; re-run on a multi-core machine before committing to the tuning
+sweep.
+
+### Predicted-winner mismatches
+
+**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
+At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB
+stack; that allocation cost dominates the actual recursion. Tokio's
+Box::pin recursion allocates 500 small heap objects but avoids the mmap.
+The prediction assumed stack allocation was amortised across many uses; here
+the actor is single-use. Not a bug, but the bench may not exercise the
+intended advantage.
+
+**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
+The prediction was that smarm's ~6-GPR naked context switch would beat
+tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
+tokio's current_thread scheduler has very low overhead per yield_now, while
+smarm's yield_now still goes through the runtime mutex and run-queue even on
+a single thread. This is a meaningful data point: smarm's scheduling overhead
+is not as low as the assembly switch cost alone suggests.
+
+### Noise / spread
+
+- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
+- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
+  consistent with tokio issue #3829 noted in task spec.
+- `many_timers` smarm spread acceptable (~10%).
+
+### Result-column equivalence
+
+All result columns match between runtimes for every bench (same prime counts,
+same message totals, same task counts). Workloads are equivalent.