smarm/docs/BENCHMARKS_AND_TUNING.md
2026-05-25 22:14:07 +02:00

# smarm — Benchmarks & Tuning Recommendations
> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
> design reasoning and single-core sweep data; re-validate on real hardware.
---
## TL;DR
smarm is competitive with tokio for **channel-heavy, message-passing workloads**
and wins outright on **uncontended channels** and **panic/unwind isolation**.
It is significantly slower than tokio for **spawn-heavy** patterns and
**timer-heavy** workloads. The preemption knobs (`alloc_interval`,
`timeslice_cycles`) have minimal effect on single-core machines; they are
expected to matter more on multi-core under scheduler-thread contention,
which this single-core sandbox could not measure.
---
## Bench results summary
All medians in µs. Tokio column is `current_thread` unless noted.
| Bench | smarm | tokio | ratio | winner |
|----------------------|--------|--------|--------|---------------|
| `chained_spawn` | 8 625 | 124 | 70× | tokio |
| `ping_pong_oneshot` | 16 848 | 879 | 19× | tokio |
| `spawn_storm_busy` | 126 k | 2 772 | 45× | tokio |
| `yield_many` | 41 622 | 15 085 | 2.8× | tokio |
| `yield_in_hot_loop` | 190 k | 153 k | 1.25× | tokio |
| `many_timers` | 143 k | 14 462 | 10× | tokio |
| `fan_out_compute` | 29 727 | 28 503 | 1.04× | **even** |
| `multi_thread_scaling` | 30 k | 29 k | 1.04× | **even** |
| `deep_recursion` | 83 | 25 | 3.3× | tokio |
| `mpsc_contention` | 9 062 | 17 570 | 0.52× | **smarm** 1.9× |
| `uncontended_channel`| 27 265 | 51 888 | 0.53× | **smarm** 1.9× |
| `catch_unwind_panics`| 142 k | 682 k | 0.21× | **smarm** 4.8× |
---
## Where smarm wins
### Uncontended channels (1.9× faster)
When a single producer sends to a single consumer with no other actors
competing for the queue, smarm's channel is meaningfully faster than
tokio's. This is the core use case smarm is designed for: pipelines of
actors passing owned data along a chain.
**Recommendation**: smarm is a good fit for any architecture where data
flows through a chain of stages, each stage is an actor, and the
channel between stages is the primary synchronisation point.
### Uncontended MPSC (1.9× faster, same reason)
Multi-producer single-consumer (`mpsc_contention`) works well for the same
reason. On a single-thread runtime, smarm's mutex is never contended, so
taking the lock is essentially free. On multi-core this advantage will
likely shrink; re-measure.
### Panic isolation (4.8× faster recovery)
`catch_unwind_panics` creates 10 000 actors that each panic. smarm
recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
than tokio. This matters if you're building a system that uses panics
as a fast abort path for malformed input or actor-level faults, or if
you're using supervision trees seriously.
**Recommendation**: if your system expects panics to be a normal
operational event (not just bugs), smarm's supervision story is a
genuine advantage over tokio's task abort model.
---
## Where smarm loses, and why
### Spawn-heavy workloads (19–70×)
Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
a syscall. Tokio tasks are heap-allocated state machines — no stack,
no syscall, ~100 bytes each. For workloads that spawn thousands of
short-lived actors per second, this is a structural disadvantage.
**Recommendations**:
- Avoid spawning actors for work that completes in microseconds.
Use a worker-pool pattern: spawn N long-lived actors at startup,
distribute work over channels.
- If you genuinely need high-frequency short-lived actors, the stack
allocation cost is a known roadmap item (stack caching, slab alloc).
It is not an inherent design flaw — just not implemented yet.
- `deep_recursion` shows the same problem at depth 500: smarm spawns
a fresh actor per level, paying the mmap cost repeatedly. Recursive
decomposition should use explicit stacks or iteration inside a single
actor, not actor-per-level spawning.
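
The last recommendation can be sketched in plain Rust. This is a minimal illustration, not smarm code: `Node`, `sum_tree`, and `chain` are hypothetical names invented for the example. It walks a 500-deep structure with an explicit `Vec` work stack, so the whole traversal could run inside one actor body instead of spawning an actor per level.

```rust
/// Hypothetical tree type used only for this illustration.
pub struct Node {
    pub value: u64,
    pub children: Vec<Node>,
}

/// Sums every value in the tree using an explicit work stack.
/// There is no recursion and no per-level spawn; the only allocation
/// is the stack `Vec`, which grows as needed.
pub fn sum_tree(root: &Node) -> u64 {
    let mut total = 0u64;
    let mut stack: Vec<&Node> = vec![root];
    while let Some(node) = stack.pop() {
        total += node.value;
        // Queue the children so they are visited on later iterations.
        stack.extend(node.children.iter());
    }
    total
}

/// Builds a degenerate chain of `depth` nodes (at least one), each
/// holding the value 1, as a stand-in for a deeply nested workload.
pub fn chain(depth: usize) -> Node {
    let mut node = Node { value: 1, children: Vec::new() };
    for _ in 1..depth {
        node = Node { value: 1, children: vec![node] };
    }
    node
}
```

Calling `sum_tree(&chain(500))` visits all 500 levels in a single loop, which is the shape a single long-lived actor would use in place of 500 spawns.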
### Timer-heavy workloads (10×)
smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
shared mutex. Tokio uses a sharded hierarchical timer wheel. With
10 000 pending timers, every O(log N) heap operation runs under that
lock, and smarm ends up about 10× slower (see `many_timers`).
**Recommendations**:
- Do not use smarm `sleep()` in tight loops with many concurrent
sleeping actors if timing precision matters.
- For IO timeouts: prefer a single timer actor that manages a priority
queue and fans out wakeups over channels, rather than 1 000 actors
each sleeping directly.
- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
It is the correct fix if timer performance becomes a bottleneck.
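
The single-timer-actor idea from the list above could look roughly like the sketch below. It uses `std::thread` and `std::sync::mpsc` as stand-ins for smarm actors and channels, so none of these names are smarm APIs: `TimerRequest`, `spawn_timer_actor`, and `sleep_via_timer` are hypothetical. One thread owns a private priority queue of deadlines and wakes each waiter over its own channel, so callers never touch a shared timer heap.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// Request sent to the timer actor: "signal `wake` at `deadline`".
pub struct TimerRequest {
    pub deadline: Instant,
    pub wake: Sender<()>,
}

/// Spawns the timer actor and returns the channel used to register timeouts.
pub fn spawn_timer_actor() -> Sender<TimerRequest> {
    let (tx, rx) = channel::<TimerRequest>();
    thread::spawn(move || {
        // Min-heap of (deadline, id); `Reverse` flips BinaryHeap's max order.
        let mut heap: BinaryHeap<Reverse<(Instant, u64)>> = BinaryHeap::new();
        let mut wakers: HashMap<u64, Sender<()>> = HashMap::new();
        let mut next_id = 0u64;
        let mut open = true; // becomes false once every request sender is gone
        while open || !heap.is_empty() {
            // Fire every timer whose deadline has passed.
            let now = Instant::now();
            while let Some(&Reverse((deadline, id))) = heap.peek() {
                if deadline > now {
                    break;
                }
                heap.pop();
                if let Some(wake) = wakers.remove(&id) {
                    let _ = wake.send(()); // the waiter may have gone away
                }
            }
            // Wait for a new request, but no longer than the next deadline.
            let next = heap.peek().map(|&Reverse((deadline, _))| deadline);
            let request = match next {
                Some(deadline) => {
                    let wait = deadline.saturating_duration_since(Instant::now());
                    if open {
                        rx.recv_timeout(wait)
                    } else {
                        // No more requests can arrive; just sleep it out.
                        thread::sleep(wait);
                        Err(RecvTimeoutError::Timeout)
                    }
                }
                None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
            };
            match request {
                Ok(req) => {
                    heap.push(Reverse((req.deadline, next_id)));
                    wakers.insert(next_id, req.wake);
                    next_id += 1;
                }
                Err(RecvTimeoutError::Timeout) => {} // loop back and fire
                Err(RecvTimeoutError::Disconnected) => open = false,
            }
        }
    });
    tx
}

/// Client-side helper: registers a timeout and blocks until it fires.
pub fn sleep_via_timer(timer: &Sender<TimerRequest>, delay: Duration) {
    let (wake_tx, wake_rx) = channel();
    let deadline = Instant::now() + delay;
    timer
        .send(TimerRequest { deadline, wake: wake_tx })
        .expect("timer actor stopped");
    wake_rx.recv().expect("timer actor dropped the waker");
}
```

In a smarm port, the waiters would block in `recv()` on their wake channel and only the timer actor would call `sleep()`, which keeps the number of entries in the runtime's global timer heap at one.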
### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
Every `yield_now()` goes through the runtime mutex and run queue even
on a single-thread scheduler. Tokio's `current_thread` scheduler handles
yields with much lower overhead. smarm's naked context-switch is fast,
but the lock acquisition around it dominates for high-frequency yields.
**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
In message-passing workloads this is natural — yield happens at
`recv()` and `send()`, which is appropriate. If you are using
`yield_now()` in a tight loop, consider whether the actor should
instead be blocking on a channel or sleeping.
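
The difference between the two styles can be shown with a small sketch. It uses `std::sync::mpsc` and `std::thread::yield_now()` as stand-ins for smarm's channel and `yield_now()`, so it illustrates the shape of the code rather than smarm's API; `drain_by_polling` and `drain_by_blocking` are hypothetical names.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;

/// Avoid: polls the channel and yields whenever it is empty.
/// Every empty poll costs a trip through the scheduler.
pub fn drain_by_polling(rx: &Receiver<u32>) -> u32 {
    let mut sum = 0;
    loop {
        match rx.try_recv() {
            Ok(v) => sum += v,
            Err(TryRecvError::Empty) => thread::yield_now(),
            Err(TryRecvError::Disconnected) => return sum,
        }
    }
}

/// Prefer: blocks in `recv()`, so the consumer is only rescheduled
/// when a message is actually available or the channel closes.
pub fn drain_by_blocking(rx: &Receiver<u32>) -> u32 {
    let mut sum = 0;
    while let Ok(v) = rx.recv() {
        sum += v;
    }
    sum
}
```

Both functions compute the same result; the blocking version simply never asks the scheduler to run it while there is nothing to do.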
---
## Preemption knob recommendations
The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
### Findings from the sweep
The sweep varied `alloc_interval` in `{32, 64, 128, 256, 512}` and
`timeslice_cycles` in `{150k, 300k, 600k, 1200k}`, 10 points total.
On a single-CPU machine the knobs are almost inert: most benches move
< 5% across the entire grid. The exceptions are meaningful:
**Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k` (`tc` = `timeslice_cycles`):
- `spawn_storm_busy` degrades +11–15%
- `catch_unwind_panics` degrades +10–12%
The cause: 8 background yielder actors hold the scheduler mutex longer
per timeslice, delaying the 10 000 actors waiting to be joined. A
longer timeslice amplifies the global-mutex bottleneck.
**Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
`many_timers` improves 3–4%. Actors that are sleeping get rescheduled
sooner because the runtime polls the timer heap more frequently.
**`alloc_interval` has no clear winner.** Moving from 32 to 512 causes
< 3% variation on every bench. The check frequency is not the
bottleneck; the lock is.
### Recommended starting points
| Workload | alloc_interval | timeslice_cycles |
|-----------------------------------|----------------|------------------|
| Default (unknown) | 128 (default) | 300 000 (default)|
| Many concurrent sleeping actors | 128 | 150 000 |
| High-throughput channel pipeline | 128 | 300 000 |
| Compute-heavy (few allocs) | 32 | 300 000 |
| Strict fairness / many actors | 64 | 150 000 |
| Long-running compute batches | 256 | 600 000 |
**Note on `timeslice_cycles` calibration**: the default was tuned for
100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's 107 µs. On a
4 GHz machine it's 75 µs. If you want a precise target timeslice,
measure your CPU's TSC frequency at startup and set the cycles value
accordingly:
```rust
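use std::time::{Duration, Instant};

// Approximate TSC frequency measurement (call once at startup).
// Scales by the measured elapsed time instead of assuming the sleep
// lasted exactly 100 ms, because sleep() may overshoot.
fn tsc_hz() -> u64 {
    let start = Instant::now();
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    let elapsed_ns = start.elapsed().as_nanos();
    ((t1 - t0) as u128 * 1_000_000_000 / elapsed_ns) as u64
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so integer division does not truncate early.
let cycles = tsc_hz() * target_us / 1_000_000;
let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);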
```
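
The calibration arithmetic in the note above can be checked without touching the TSC. The two helpers below are hypothetical (they are not part of smarm) and only encode the relationship cycles = frequency × time, using a `u128` intermediate so the multiplication cannot overflow.

```rust
/// Converts a target timeslice in microseconds into TSC cycles.
pub fn cycles_for_timeslice(tsc_hz: u64, target_us: u64) -> u64 {
    (tsc_hz as u128 * target_us as u128 / 1_000_000) as u64
}

/// Inverse: how long a cycle budget lasts at a given TSC frequency,
/// in microseconds, rounded to the nearest integer.
pub fn timeslice_us(tsc_hz: u64, cycles: u64) -> u64 {
    let hz = tsc_hz as u128;
    ((cycles as u128 * 1_000_000 + hz / 2) / hz) as u64
}
```

With these, `timeslice_us(2_800_000_000, 300_000)` gives 107 and `timeslice_us(4_000_000_000, 300_000)` gives 75, matching the figures quoted for the default; a 100 µs target on the 2.8 GHz bench machine works out to `cycles_for_timeslice(2_800_000_000, 100)`, which is 280 000 cycles.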
---
## Architecture recommendations
### Use actor pools, not per-request actors
```rust
// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    spawn(move || {
        // Each worker lives for the whole run and pulls requests as they arrive.
        while let Ok(req) = rx.recv() {
            handle(req);
        }
    });
}
for req in requests {
    tx.send(req).unwrap();
}
```
The worker pool pattern amortises the 64 KiB mmap cost over the
lifetime of the pool. The `chained_spawn` bench shows this cost is
real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
### Supervision for fault isolation
smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
actor panics. Use `spawn_under` to register a supervisor channel and
build restart logic:
```rust
use smarm::Signal;

let (sup_tx, sup_rx) = channel::<Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        Ok(Signal::Exit(_)) => break,
        Err(_) => break,
    }
}
```
This pattern has essentially zero overhead compared to unmonitored
spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
faster than tokio's abort/recover cycle.
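
When the supervisor's response to `Signal::Panic` is a restart, it may be worth bounding how often that happens, since a child that panics on startup would otherwise restart in a tight loop. The sketch below is one way a supervisor could do this itself. `RestartLimiter` is a hypothetical helper, not a smarm type; it allows at most `max_restarts` restarts within a sliding time window.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Hypothetical helper: sliding-window cap on restarts.
pub struct RestartLimiter {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>, // timestamps of restarts still inside the window
}

impl RestartLimiter {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        RestartLimiter { max_restarts, window, recent: VecDeque::new() }
    }

    /// Records a restart attempt at `now`. Returns `true` if the restart
    /// is within budget, `false` if the supervisor should escalate instead.
    pub fn allow(&mut self, now: Instant) -> bool {
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            return false;
        }
        self.recent.push_back(now);
        true
    }
}
```

Inside the `Signal::Panic` arm, the supervisor would call `allow(Instant::now())` and restart the child only when it returns `true`, breaking out of the loop or escalating otherwise.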
### Explicit preemption in no-alloc hot loops
The allocator-driven preemption mechanism fires every `alloc_interval`
allocations. Code that never allocates (tight numeric loops, parsing
fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
at the natural loop boundary:
```rust
for chunk in data.chunks(4096) {
    process(chunk);  // no allocations
    smarm::check!(); // yield if timeslice expired
}
```
This is explicitly called out in `LOOM.md` as a known limitation.
The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
smarm is 1.25× slower than tokio even with explicit yields, which sets
the floor on how much `check!()` can help in truly tight loops.
### IO-bound work
smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
the actor without blocking the OS scheduler thread, which is the
behaviour network servers and file-IO pipelines need. The current suite
has no IO-bound bench, so this path is architecturally sound but
unmeasured.
---
## Known limitations and roadmap items
These are from `LOOM.md` plus observations from the bench suite.
| Limitation | Impact | Roadmap status |
|-------------------------------|--------------------|--------------------|
| No stack size caching / slab | High spawn cost | Deferred |
| Global single min-heap timers | Poor at many timers| Deferred (hierarch. wheel) |
| Global `Mutex<RunQueue>` | Lock contention | Deferred (per-thread queues) |
| No `join!()` macro | Ergonomics | Deferred |
| x86-64 Linux only | Portability | ARM64 deferred |
| No restart intensity caps | Supervision safety | Deferred |
| Yield overhead under lock | Hot-loop fairness | Structural / ongoing |
The yield overhead and global mutex are the two issues most likely to
matter on a real multi-core workload. The sweep confirmed that
`timeslice_cycles` is a meaningful knob for controlling the mutex
hold time; the right long-term fix is per-thread run queues with
work stealing.
---
## Running the bench suite
```sh
# Run all benches once, print results
python3 benches/sweep.py run
# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline
# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress
# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep
# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv
# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
    cargo bench --bench general
```
The regression threshold is 10% and is configurable in `sweep.py`
(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
same file.