docs: BENCHMARKS_AND_TUNING.md — bench results, knob recommendations, arch guidance
320
BENCHMARKS_AND_TUNING.md
Normal file
@@ -0,0 +1,320 @@

# smarm: Benchmarks & Tuning Recommendations

> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
> design reasoning and single-core sweep data; re-validate on real hardware.

---

## TL;DR

smarm is competitive with tokio for **channel-heavy, message-passing workloads**
and wins outright on **uncontended channels** (1.9×) and **panic/unwind
isolation** (4.8×). It is 19–70× slower than tokio for **spawn-heavy** patterns
and about 10× slower for **timer-heavy** workloads. The preemption knobs
(`alloc_interval`, `timeslice_cycles`) have minimal effect on single-core
machines; they are expected to matter more on multi-core under
scheduler-thread contention.

---

## Bench results summary

All medians in µs. The tokio column is `current_thread` unless noted. The ratio
is smarm ÷ tokio, so values below 1 mean smarm is faster.

| Bench                  | smarm (µs) | tokio (µs) | ratio | winner         |
|------------------------|------------|------------|-------|----------------|
| `chained_spawn`        | 8 625      | 124        | 70×   | tokio          |
| `ping_pong_oneshot`    | 16 848     | 879        | 19×   | tokio          |
| `spawn_storm_busy`     | 126 k      | 2 772      | 45×   | tokio          |
| `yield_many`           | 41 622     | 15 085     | 2.8×  | tokio          |
| `yield_in_hot_loop`    | 190 k      | 153 k      | 1.25× | tokio          |
| `many_timers`          | 143 k      | 14 462     | 10×   | tokio          |
| `fan_out_compute`      | 29 727     | 28 503     | 1.04× | **even**       |
| `multi_thread_scaling` | 30 k       | 29 k       | 1.04× | **even**       |
| `deep_recursion`       | 83         | 25         | 3.3×  | tokio          |
| `mpsc_contention`      | 9 062      | 17 570     | 0.52× | **smarm** 1.9× |
| `uncontended_channel`  | 27 265     | 51 888     | 0.53× | **smarm** 1.9× |
| `catch_unwind_panics`  | 142 k      | 682 k      | 0.21× | **smarm** 4.8× |

---

## Where smarm wins

### Uncontended channels (1.9× faster)

When a single producer sends to a single consumer with no other actors
competing for the queue, smarm's channel is about 1.9× faster than tokio's
(27 265 µs vs 51 888 µs in `uncontended_channel`). This is the core use case
smarm is designed for: pipelines of actors passing owned data along a chain.

**Recommendation**: smarm is a good fit for any architecture where data
flows through a chain of stages, each stage is an actor, and the
channel between stages is the primary synchronisation point.
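
The shape of such a pipeline can be sketched with the standard library alone. This is a minimal illustration, not smarm code: `std::thread` and `std::sync::mpsc` stand in for smarm actors and channels, and `run_pipeline` and its three stages (parse, square, sum) are hypothetical names chosen for the example. Each stage owns its receiver, forwards owned values, and shuts down when the upstream sender is dropped.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs a three-stage pipeline (parse -> square -> sum) and returns the sum.
/// Threads and std channels are stand-ins for smarm actors and channels.
fn run_pipeline(inputs: Vec<String>) -> u64 {
    let (raw_tx, raw_rx) = mpsc::channel::<String>();
    let (num_tx, num_rx) = mpsc::channel::<u64>();
    let (sq_tx, sq_rx) = mpsc::channel::<u64>();

    // Stage 1: parse strings into numbers, dropping malformed input.
    let parse = thread::spawn(move || {
        for s in raw_rx {
            if let Ok(n) = s.trim().parse::<u64>() {
                if num_tx.send(n).is_err() {
                    break; // downstream stage is gone
                }
            }
        }
        // `num_tx` is dropped here, which closes the next stage's input.
    });

    // Stage 2: square each number and pass it on.
    let square = thread::spawn(move || {
        for n in num_rx {
            if sq_tx.send(n * n).is_err() {
                break;
            }
        }
    });

    // Stage 3: consume until the channel closes, then report the total.
    let sum = thread::spawn(move || sq_rx.iter().sum::<u64>());

    for s in inputs {
        raw_tx.send(s).expect("parse stage exited early");
    }
    drop(raw_tx); // closing the first channel lets the stages drain and exit

    parse.join().unwrap();
    square.join().unwrap();
    sum.join().unwrap()
}
```

The only synchronisation points are the channels between stages, which is the property the recommendation above relies on.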

### MPSC on a single thread (1.9× faster, same reason)

Multi-producer single-consumer (`mpsc_contention`: 9 062 µs vs 17 570 µs)
works well for the same reason. Despite the bench name, producers on a
single-thread runtime never run in parallel, so smarm's mutex is uncontended
and the lock is essentially free. On multi-core this advantage is expected to
shrink; re-measure.

### Panic isolation (4.8× faster recovery)

`catch_unwind_panics` creates 10 000 actors that each panic. smarm
recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
than tokio. This matters if you're building a system that uses panics
as a fast abort path for malformed input or actor-level faults, or if
you rely heavily on supervision trees.

**Recommendation**: if your system expects panics to be a normal
operational event (not just bugs), smarm's supervision story is a
genuine advantage over tokio's task abort model.

---

## Where smarm loses, and why

### Spawn-heavy workloads (19–70×)

Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
a syscall. Tokio tasks are heap-allocated state machines: no stack,
no syscall, ~100 bytes each. For workloads that spawn thousands of
short-lived actors per second, this is a structural disadvantage.

**Recommendations**:
- Avoid spawning actors for work that completes in microseconds.
  Use a worker-pool pattern: spawn N long-lived actors at startup,
  distribute work over channels.
- If you genuinely need high-frequency short-lived actors, the stack
  allocation cost is a known roadmap item (stack caching, slab alloc).
  It is not an inherent design flaw, just not implemented yet.
- `deep_recursion` shows the same problem at depth 500: smarm spawns
  a fresh actor per level, paying the mmap cost repeatedly. Recursive
  decomposition should use explicit stacks or iteration inside a single
  actor, not actor-per-level spawning.

### Timer-heavy workloads (10×)

smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
shared mutex. Tokio uses a sharded hierarchical timer wheel. With
10 000 pending timers, every insert and expiry in smarm is an O(log N)
heap operation performed under that lock, and `many_timers` runs about
10× slower than on tokio.

**Recommendations**:
- Do not use smarm `sleep()` in tight loops with many concurrent
  sleeping actors if timing precision matters.
- For IO timeouts: prefer a single timer actor that manages a priority
  queue and fans out wakeups over channels, rather than 1 000 actors
  each sleeping directly.
- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
  It is the correct fix if timer performance becomes a bottleneck.
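
The single-timer-actor suggestion above could be built around a deadline queue like the one below. This is a sketch under stated assumptions: `TimerQueue` and `WaiterId` are hypothetical names, it uses only the standard library, and the channel fan-out is left out. Because one actor owns the queue, no lock is needed around the heap.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::time::{Duration, Instant};

/// Hypothetical identifier for whoever asked for the timeout.
type WaiterId = u32;

/// Deadline queue owned by a single timer actor.
struct TimerQueue {
    // `Reverse` turns the max-heap into a min-heap on (deadline, waiter).
    heap: BinaryHeap<Reverse<(Instant, WaiterId)>>,
}

impl TimerQueue {
    fn new() -> Self {
        Self { heap: BinaryHeap::new() }
    }

    /// Registers a waiter to be woken at `deadline`.
    fn schedule(&mut self, deadline: Instant, waiter: WaiterId) {
        self.heap.push(Reverse((deadline, waiter)));
    }

    /// Pops every waiter whose deadline is at or before `now`, earliest first.
    fn pop_due(&mut self, now: Instant) -> Vec<WaiterId> {
        let mut due = Vec::new();
        while let Some(Reverse((deadline, waiter))) = self.heap.peek().copied() {
            if deadline > now {
                break;
            }
            self.heap.pop();
            due.push(waiter);
        }
        due
    }

    /// How long the timer actor may sleep before the next deadline, if any.
    fn next_wait(&self, now: Instant) -> Option<Duration> {
        self.heap
            .peek()
            .map(|Reverse((deadline, _))| deadline.saturating_duration_since(now))
    }
}
```

The owning actor would loop: sleep for `next_wait`, call `pop_due`, and send one wakeup message per returned waiter. Only that one actor ever touches the runtime's own timer heap.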

### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)

Every `yield_now()` goes through the runtime mutex and run queue even
on a single-thread scheduler. Tokio's `current_thread` scheduler handles
yields with much lower overhead. smarm's naked context switch is fast,
but the lock acquisition around it dominates for high-frequency yields.

**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
In message-passing workloads this is natural: yields happen at
`recv()` and `send()`, which is appropriate. If you are using
`yield_now()` in a tight loop, consider whether the actor should
instead be blocking on a channel or sleeping.

---

## Preemption knob recommendations

The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).

### Findings from the sweep

The sweep varied `alloc_interval` in `{32, 64, 128, 256, 512}` and
`timeslice_cycles` in `{150k, 300k, 600k, 1200k}`, 10 points total.

On a single-CPU machine the knobs are almost inert: most benches move
< 5% across the entire grid. The exceptions are meaningful:

**Longer timeslices hurt under contention.** At `timeslice_cycles` of 600k
and 1200k:

- `spawn_storm_busy` is 11–15% slower
- `catch_unwind_panics` is 10–12% slower

The cause: 8 background yielder actors hold the scheduler mutex longer
per timeslice, delaying the 10 000 actors waiting to be joined. A
longer timeslice amplifies the global-mutex bottleneck.

**Shorter timeslices marginally help timer-heavy work.** At
`timeslice_cycles` of 150k, `many_timers` improves 3–4%. Sleeping actors
get rescheduled sooner because the runtime polls the timer heap more
frequently.

**`alloc_interval` has no clear winner.** Moving from 32 to 512 causes
< 3% variation on every bench. The check frequency is not the
bottleneck; the lock is.

### Recommended starting points

| Workload                         | `alloc_interval` | `timeslice_cycles` |
|----------------------------------|------------------|--------------------|
| Default (unknown)                | 128 (default)    | 300 000 (default)  |
| Many concurrent sleeping actors  | 128              | 150 000            |
| High-throughput channel pipeline | 128              | 300 000            |
| Compute-heavy (few allocs)       | 32               | 300 000            |
| Strict fairness / many actors    | 64               | 150 000            |
| Long-running compute batches     | 256              | 600 000            |

**Note on `timeslice_cycles` calibration**: the default was tuned for
≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
measure your CPU's TSC frequency at startup and set the cycles value
accordingly:

```rust
use std::time::{Duration, Instant};

// Approximate TSC frequency measurement (call once at startup).
fn tsc_hz() -> u64 {
    let start = Instant::now();
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    // sleep() may overshoot, so scale by the time that actually elapsed.
    ((t1 - t0) as f64 / start.elapsed().as_secs_f64()) as u64
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so the sub-MHz part of the frequency is kept.
let cycles = tsc_hz() * target_us / 1_000_000;

let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);
```
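
The arithmetic behind the note can be checked in isolation. The two helpers below are hypothetical (they are not part of smarm) and assume the TSC ticks at a constant rate: `timeslice_us(300_000, 3_000_000_000)` gives 100 µs, the same budget at 2.8 GHz gives about 107 µs, and at 4 GHz gives 75 µs. `cycles_for_us` goes the other way, widening to `u128` so the multiplication cannot overflow before the division.

```rust
/// Converts a cycle budget into microseconds for a given TSC frequency.
fn timeslice_us(cycles: u64, tsc_hz: u64) -> f64 {
    cycles as f64 * 1_000_000.0 / tsc_hz as f64
}

/// Cycle budget for a target timeslice, rounded down.
/// Multiplies before dividing so no frequency precision is lost.
fn cycles_for_us(target_us: u64, tsc_hz: u64) -> u64 {
    (tsc_hz as u128 * target_us as u128 / 1_000_000) as u64
}
```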

---

## Architecture recommendations

### Use actor pools, not per-request actors

```rust
// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
}
for req in requests { tx.send(req).unwrap(); }
```

The worker pool pattern amortises the 64 KiB mmap cost over the
lifetime of the pool. The `chained_spawn` bench shows this cost is
real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
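
To put those totals in per-request terms, a rough calculation helps. The helpers below are hypothetical, and the per-spawn figure they produce is an upper bound on the raw stack cost, since the bench total also includes scheduling and teardown. Dividing 8 625 µs by 1 000 spawns gives about 8.6 µs per smarm spawn, against about 0.12 µs for tokio.

```rust
/// Average cost of one operation, in microseconds.
fn per_op_us(total_us: f64, ops: u32) -> f64 {
    total_us / f64::from(ops)
}

/// Fraction of an actor's lifetime spent on spawning rather than on work,
/// for a handler that runs for `work_us` microseconds.
fn spawn_share(per_spawn_us: f64, work_us: f64) -> f64 {
    per_spawn_us / (per_spawn_us + work_us)
}
```

With a handler that itself takes about 8.6 µs, `spawn_share` is 0.5: half the time goes to spawning. At 1 ms of work per request the share drops below 1%, which is why the pool matters mainly for microsecond-scale tasks.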

### Supervision for fault isolation

smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
actor panics. Use `spawn_under` to register a supervisor channel and
build restart logic:

```rust
use smarm::Signal;

let (sup_tx, sup_rx) = channel::<Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record (pid identifies the failed actor)
        }
        Ok(Signal::Exit(_)) => break,
        Err(_) => break, // channel closed
    }
}
```

This pattern has essentially zero overhead compared to unmonitored
spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
faster than tokio's abort/recover cycle.
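
A supervisor loop that restarts unconditionally can spin on a crash loop. One way to bound that in user code is a sliding-window restart budget, sketched below. `RestartBudget` is a hypothetical helper built only on the standard library, not a smarm API.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Sliding-window restart budget: at most `max_restarts` within `window`.
struct RestartBudget {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>, // timestamps of restarts still inside the window
}

impl RestartBudget {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, recent: VecDeque::new() }
    }

    /// Called when a child panics at `now`. Returns true if a restart is
    /// still within budget, false if the supervisor should escalate.
    fn allow_restart(&mut self, now: Instant) -> bool {
        // Forget restarts that have aged out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.saturating_duration_since(oldest) >= self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            return false;
        }
        self.recent.push_back(now);
        true
    }
}
```

In the `Signal::Panic` arm the supervisor would call `allow_restart(Instant::now())` and either respawn the child or break out and report the failure upward.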

### Explicit preemption in no-alloc hot loops

The allocator-driven preemption mechanism fires every `alloc_interval`
allocations. Code that never allocates (tight numeric loops, parsing
fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
at the natural loop boundary:

```rust
for chunk in data.chunks(4096) {
    process(chunk);   // no allocations
    smarm::check!();  // yield if timeslice expired
}
```

This is explicitly called out in `LOOM.md` as a known limitation.
The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
smarm is 1.25× slower than tokio even with explicit yields, so
`check!()` restores fairness but does not close that gap in truly
tight loops.

### IO-bound work

smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
the actor without blocking the OS scheduler thread. There is no
IO-bound bench in the current suite, so this is a design argument
rather than a measured result: the architecture should suit network
servers and file-IO pipelines, but measure before relying on it.

---

## Known limitations and roadmap items

These are from `LOOM.md` plus observations from the bench suite.

| Limitation                    | Impact              | Roadmap status               |
|-------------------------------|---------------------|------------------------------|
| No stack caching / slab       | High spawn cost     | Deferred                     |
| Global single min-heap timers | Poor at many timers | Deferred (hierarchical wheel) |
| Global `Mutex<RunQueue>`      | Lock contention     | Deferred (per-thread queues) |
| No `join!()` macro            | Ergonomics          | Deferred                     |
| x86-64 Linux only             | Portability         | ARM64 deferred               |
| No restart intensity caps     | Supervision safety  | Deferred                     |
| Yield overhead under lock     | Hot-loop fairness   | Structural / ongoing         |

The yield overhead and global mutex are the two issues most likely to
matter on a real multi-core workload. The sweep showed that
`timeslice_cycles` does influence mutex hold time (two benches slowed
by 10–15% at longer timeslices); the right long-term fix is per-thread
run queues with work stealing.

---

## Running the bench suite

```sh
# Run all benches once, print results
python3 benches/sweep.py run

# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline

# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress

# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep

# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv

# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
    cargo bench --bench general
```

The regression threshold is 10% and is configurable in `sweep.py`
(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
same file.
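
The threshold rule itself is simple enough to state as code. The function below is an illustrative restatement in Rust, not the logic in `sweep.py`, and assumes a bench counts as regressed when its median exceeds the baseline by more than the threshold percentage.

```rust
/// True if `current_us` is more than `threshold_pct` percent slower
/// than `baseline_us` (both are median times in microseconds).
fn is_regression(baseline_us: f64, current_us: f64, threshold_pct: f64) -> bool {
    current_us > baseline_us * (1.0 + threshold_pct / 100.0)
}
```

With the 10% default, a bench that moves from 8 625 µs to 9 600 µs (about 11% slower) would fail the check, while one that moves to 9 000 µs (about 4% slower) would pass.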