smarm — Benchmarks & Tuning Recommendations
Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox, kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from design reasoning and single-core sweep data; re-validate on real hardware.
TL;DR
smarm is competitive with tokio for channel-heavy, message-passing workloads
and wins outright on uncontended channels and panic/unwind isolation.
It is significantly slower than tokio for spawn-heavy patterns and
timer-heavy workloads. The preemption knobs (alloc_interval,
timeslice_cycles) have minimal effect on single-core machines; they matter
on multi-core under scheduler-thread contention.
Bench results summary
All medians in µs (k = thousands). Ratio is smarm ÷ tokio, so values below 1 favour smarm. Tokio column is current_thread unless noted.
| Bench | smarm | tokio | ratio | winner |
|---|---|---|---|---|
| chained_spawn | 8 625 | 124 | 70× | tokio |
| ping_pong_oneshot | 16 848 | 879 | 19× | tokio |
| spawn_storm_busy | 126 k | 2 772 | 45× | tokio |
| yield_many | 41 622 | 15 085 | 2.8× | tokio |
| yield_in_hot_loop | 190 k | 153 k | 1.25× | tokio |
| many_timers | 143 k | 14 462 | 10× | tokio |
| fan_out_compute | 29 727 | 28 503 | 1.04× | even |
| multi_thread_scaling | 30 k | 29 k | 1.04× | even |
| deep_recursion | 83 | 25 | 3.3× | tokio |
| mpsc_contention | 9 062 | 17 570 | 0.52× | smarm 1.9× |
| uncontended_channel | 27 265 | 51 888 | 0.53× | smarm 1.9× |
| catch_unwind_panics | 142 k | 682 k | 0.21× | smarm 4.8× |
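The ratio and winner columns follow directly from the two medians. As a quick check of the arithmetic, the sketch below recomputes them for three rows of the table; the helper names `ratio` and `speedup` are illustrative and not part of the bench suite.

```rust
/// Ratio as used in the table: smarm median divided by tokio median.
/// Above 1.0 means tokio was faster; below 1.0 means smarm was faster.
fn ratio(smarm_us: f64, tokio_us: f64) -> f64 {
    smarm_us / tokio_us
}

/// Speedup factor of whichever runtime was faster (always >= 1.0).
fn speedup(smarm_us: f64, tokio_us: f64) -> f64 {
    let r = ratio(smarm_us, tokio_us);
    if r < 1.0 { 1.0 / r } else { r }
}

fn main() {
    // Medians in µs, copied from the table above.
    let rows = [
        ("chained_spawn", 8_625.0, 124.0),
        ("mpsc_contention", 9_062.0, 17_570.0),
        ("catch_unwind_panics", 142_000.0, 682_000.0),
    ];
    for (name, smarm, tokio) in rows {
        let winner = if smarm < tokio { "smarm" } else { "tokio" };
        println!(
            "{name}: ratio {:.2}, {winner} {:.1}x faster",
            ratio(smarm, tokio),
            speedup(smarm, tokio)
        );
    }
}
```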
Where smarm wins
Uncontended channels (1.9× faster)
When a single producer sends to a single consumer with no other actors competing for the queue, smarm's channel is meaningfully faster than tokio's. This is the core use case smarm is designed for: pipelines of actors passing owned data along a chain.
Recommendation: smarm is a good fit for any architecture where data flows through a chain of stages, each stage is an actor, and the channel between stages is the primary synchronisation point.
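To make the shape of such a pipeline concrete, here is a minimal sketch that uses std threads and std::sync::mpsc as stand-ins for smarm actors and channels. It does not use the smarm API; `stage` and `run_pipeline` are hypothetical helpers, and each stage owns the items it receives and hands ownership to the next stage.

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

/// Spawn one stage: read owned items from `input`, apply `f`, forward downstream.
fn stage<T, U, F>(input: Receiver<T>, f: F) -> Receiver<U>
where
    T: Send + 'static,
    U: Send + 'static,
    F: Fn(T) -> U + Send + 'static,
{
    let (tx, rx) = channel();
    thread::spawn(move || {
        // The loop ends when the upstream sender is dropped.
        for item in input {
            if tx.send(f(item)).is_err() {
                break; // downstream has gone away
            }
        }
    });
    rx
}

/// Two-stage chain: trim each line, then measure its length.
fn run_pipeline(lines: Vec<String>) -> Vec<usize> {
    let (tx, rx) = channel::<String>();
    let trimmed = stage(rx, |s: String| s.trim().to_owned());
    let lengths = stage(trimmed, |s: String| s.len());
    for line in lines {
        tx.send(line).unwrap();
    }
    drop(tx); // close the head of the chain so every stage can finish
    lengths.iter().collect()
}

fn main() {
    let out = run_pipeline(vec!["  ab ".to_owned(), "cde".to_owned()]);
    println!("{out:?}");
}
```

The channel between stages is the only synchronisation point, which is the property the recommendation relies on.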
Uncontended MPSC (1.9× faster, same reason)
Multi-producer single-consumer works well for the same reason. On a single-thread runtime, smarm's mutex is uncontended, so the lock is essentially free. On multi-core this advantage will shrink; re-measure.
Panic isolation (4.8× faster recovery)
catch_unwind_panics creates 10 000 actors that each panic. smarm
recovers and delivers Signal::Panic to the supervisor 4.8× faster
than tokio. This matters if you're building a system that uses panics
as a fast abort path for malformed input or actor-level faults, or if
you're using supervision trees seriously.
Recommendation: if your system expects panics to be a normal operational event (not just bugs), smarm's supervision story is a genuine advantage over tokio's task abort model.
Where smarm loses, and why
Spawn-heavy workloads (19–70×)
Every smarm actor mmaps a 64 KiB stack with a guard page. This is
a syscall. Tokio tasks are heap-allocated state machines — no stack,
no syscall, ~100 bytes each. For workloads that spawn thousands of
short-lived actors per second, this is a structural disadvantage.
Recommendations:
- Avoid spawning actors for work that completes in microseconds. Use a worker-pool pattern: spawn N long-lived actors at startup, distribute work over channels.
- If you genuinely need high-frequency short-lived actors, the stack allocation cost is a known roadmap item (stack caching, slab alloc). It is not an inherent design flaw — just not implemented yet.
deep_recursion shows the same problem at depth 500: smarm spawns a fresh actor per level, paying the mmap cost repeatedly. Recursive decomposition should use explicit stacks or iteration inside a single actor, not actor-per-level spawning.
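As an illustration of the explicit-stack alternative, the sketch below walks a depth-500 structure inside one function, which in smarm terms would be the body of a single actor. The `Tree` type and both helpers are made up for this example and are not taken from the bench.

```rust
/// A small binary tree used only for illustration.
enum Tree {
    Leaf(u64),
    Node(Box<Tree>, Box<Tree>),
}

/// Sum all leaves with an explicit stack instead of one unit of
/// recursion (or one spawned actor) per level.
fn sum_leaves(root: &Tree) -> u64 {
    let mut total = 0;
    let mut stack: Vec<&Tree> = vec![root];
    while let Some(node) = stack.pop() {
        match node {
            Tree::Leaf(v) => total += v,
            Tree::Node(left, right) => {
                stack.push(&**left);
                stack.push(&**right);
            }
        }
    }
    total
}

/// Build a left-leaning chain of the given depth, iteratively.
fn chain(depth: usize) -> Tree {
    let mut tree = Tree::Leaf(1);
    for _ in 0..depth {
        tree = Tree::Node(Box::new(tree), Box::new(Tree::Leaf(1)));
    }
    tree
}

fn main() {
    let tree = chain(500);
    println!("leaf sum at depth 500: {}", sum_leaves(&tree));
}
```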
Timer-heavy workloads (10×)
smarm uses a global min-heap of (deadline, Pid) pairs behind the
shared mutex. Tokio uses a sharded hierarchical timer wheel. With
10 000 pending timers, smarm's O(log N) heap under lock is
dramatically slower.
Recommendations:
- Do not use smarm sleep() in tight loops with many concurrent sleeping actors if timing precision matters.
- For IO timeouts: prefer a single timer actor that manages a priority queue and fans out wakeups over channels, rather than 1 000 actors each sleeping directly.
- The hierarchical timer wheel is listed in LOOM.md deferred work. It is the correct fix if timer performance becomes a bottleneck.
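The single-timer-actor recommendation can be sketched as follows. This is an assumption-laden illustration, not smarm code: a std thread stands in for the timer actor, std::sync::mpsc stands in for smarm channels, and `TimerRequest`, `spawn_timer` and `sleep_via` are hypothetical names.

```rust
use std::cmp::Reverse;
use std::collections::{BinaryHeap, HashMap};
use std::sync::mpsc::{channel, RecvTimeoutError, Sender};
use std::thread;
use std::time::{Duration, Instant};

/// Ask the timer owner to send on `reply` once `deadline` has passed.
struct TimerRequest {
    deadline: Instant,
    reply: Sender<()>,
}

/// Spawn the single timer owner and return the handle used to register timeouts.
fn spawn_timer() -> Sender<TimerRequest> {
    let (tx, rx) = channel::<TimerRequest>();
    thread::spawn(move || {
        // Min-heap of (deadline, id); reply channels are looked up by id.
        let mut heap: BinaryHeap<Reverse<(Instant, u64)>> = BinaryHeap::new();
        let mut replies: HashMap<u64, Sender<()>> = HashMap::new();
        let mut next_id = 0u64;
        loop {
            // Fan out wakeups for every deadline that is due.
            let now = Instant::now();
            while let Some(Reverse((deadline, id))) = heap.peek().copied() {
                if deadline > now {
                    break;
                }
                heap.pop();
                if let Some(reply) = replies.remove(&id) {
                    let _ = reply.send(()); // the waiter may have gone away
                }
            }
            // Wait for a new request, but no longer than the next deadline.
            let msg = match heap.peek() {
                Some(Reverse((deadline, _))) => {
                    rx.recv_timeout(deadline.saturating_duration_since(Instant::now()))
                }
                None => rx.recv().map_err(|_| RecvTimeoutError::Disconnected),
            };
            match msg {
                Ok(req) => {
                    heap.push(Reverse((req.deadline, next_id)));
                    replies.insert(next_id, req.reply);
                    next_id += 1;
                }
                Err(RecvTimeoutError::Timeout) => {}
                Err(RecvTimeoutError::Disconnected) => match heap.peek() {
                    // No new requests can arrive: drain what is left, then stop.
                    Some(Reverse((deadline, _))) => {
                        thread::sleep(deadline.saturating_duration_since(Instant::now()))
                    }
                    None => break,
                },
            }
        }
    });
    tx
}

/// Block the caller for `delay` by waiting on a reply channel, not by sleeping.
fn sleep_via(timer: &Sender<TimerRequest>, delay: Duration) {
    let (reply_tx, reply_rx) = channel();
    let deadline = Instant::now() + delay;
    timer
        .send(TimerRequest { deadline, reply: reply_tx })
        .expect("timer thread has stopped");
    reply_rx.recv().expect("timer thread dropped the reply");
}

fn main() {
    let timer = spawn_timer();
    let start = Instant::now();
    sleep_via(&timer, Duration::from_millis(20));
    println!("woken after {:?}", start.elapsed());
}
```

Only the timer owner touches the heap, so waiters never contend on it; each waiter simply blocks on its own reply channel.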
Yield overhead (2.8× in yield_many, 1.25× in yield_in_hot_loop)
Every yield_now() goes through the runtime mutex and run queue even
on a single-thread scheduler. Tokio's current_thread scheduler handles
yields with much lower overhead. smarm's naked context-switch is fast,
but the lock acquisition around it dominates for high-frequency yields.
Recommendation: minimise explicit yield_now() calls in hot paths.
In message-passing workloads this is natural — yield happens at
recv() and send(), which is appropriate. If you are using
yield_now() in a tight loop, consider whether the actor should
instead be blocking on a channel or sleeping.
Preemption knob recommendations
The knobs are Config::alloc_interval(n) and Config::timeslice_cycles(c).
Default: alloc_interval = 128, timeslice_cycles = 300_000 (≈100 µs at 3 GHz).
Findings from the sweep
The sweep varied alloc_interval in {32, 64, 128, 256, 512} and
timeslice_cycles in {150k, 300k, 600k, 1200k} — 10 points total.
On a single-CPU machine the knobs are almost inert: most benches move < 5% across the entire grid. The exceptions are meaningful:
Longer timeslices hurt under contention. At tc=600k and tc=1200k:
- spawn_storm_busy degrades +11–15%
- catch_unwind_panics degrades +10–12%
The cause: 8 background yielder actors hold the scheduler mutex longer per timeslice, delaying the 10 000 actors waiting to be joined. A longer timeslice amplifies the global-mutex bottleneck.
Shorter timeslices marginally help timer-heavy work. At tc=150k,
many_timers improves 3–4%. Actors that are sleeping get rescheduled
sooner because the runtime polls the timer heap more frequently.
alloc_interval has no clear winner. Moving from 32 to 512 causes < 3% variation on every bench. The check frequency is not the bottleneck — the lock is.
Recommended starting points
| Workload | alloc_interval | timeslice_cycles |
|---|---|---|
| Default (unknown) | 128 (default) | 300 000 (default) |
| Many concurrent sleeping actors | 128 | 150 000 |
| High-throughput channel pipeline | 128 | 300 000 |
| Compute-heavy (few allocs) | 32 | 300 000 |
| Strict fairness / many actors | 64 | 150 000 |
| Long-running compute batches | 256 | 600 000 |
Note on timeslice_cycles calibration: the default was tuned for
≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
measure your CPU's TSC frequency at startup and set the cycles value
accordingly:
// Approximate TSC frequency measurement (call once at startup).
fn tsc_hz() -> u64 {
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(std::time::Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    (t1 - t0) * 10 // 100 ms sample, extrapolated to 1 second
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so the frequency is not truncated to whole MHz.
let cycles = tsc_hz() * target_us / 1_000_000;
let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);
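The calibration figures quoted in the note can be reproduced with two small conversions. This is a standalone arithmetic sketch; `cycles_to_us` and `us_to_cycles` are illustrative helpers and not part of smarm.

```rust
/// Length in microseconds of a cycle budget at a given TSC frequency.
fn cycles_to_us(cycles: u64, tsc_hz: u64) -> f64 {
    cycles as f64 / tsc_hz as f64 * 1e6
}

/// Cycle budget for a target timeslice. The multiplication happens first,
/// in 128-bit arithmetic, so nothing is truncated or overflows.
fn us_to_cycles(target_us: u64, tsc_hz: u64) -> u64 {
    (tsc_hz as u128 * target_us as u128 / 1_000_000) as u64
}

fn main() {
    for hz in [2_800_000_000u64, 3_000_000_000, 4_000_000_000] {
        println!(
            "{:.1} GHz: 300_000 cycles = {:.0} µs, 100 µs = {} cycles",
            hz as f64 / 1e9,
            cycles_to_us(300_000, hz),
            us_to_cycles(100, hz)
        );
    }
}
```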
Architecture recommendations
Use actor pools, not per-request actors
// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    spawn(move || {
        while let Ok(req) = rx.recv() {
            handle(req);
        }
    });
}
for req in requests {
    tx.send(req).unwrap();
}
The worker pool pattern amortises the 64 KiB mmap cost over the
lifetime of the pool. The chained_spawn bench shows this cost is
real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
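By those figures a smarm spawn costs roughly 8.6 µs (8 625 µs / 1 000), so a pool pays that once per worker instead of once per request. For readers who want to try the pattern without smarm, here is a runnable stand-in built on std threads. It is a sketch under stated assumptions: std's Receiver cannot be cloned, so the workers share it behind a mutex, and `run_pool` and `handle` are hypothetical names.

```rust
use std::sync::mpsc::channel;
use std::sync::{Arc, Mutex};
use std::thread;

/// Placeholder for the real per-request work.
fn handle(req: u64) -> u64 {
    req * req
}

/// Process `requests` on `workers` long-lived threads.
/// Results come back in completion order, not submission order.
fn run_pool(workers: usize, requests: Vec<u64>) -> Vec<u64> {
    let (req_tx, req_rx) = channel::<u64>();
    let (res_tx, res_rx) = channel::<u64>();
    // std's Receiver is single-consumer, so workers share it behind a mutex.
    let req_rx = Arc::new(Mutex::new(req_rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let req_rx = Arc::clone(&req_rx);
            let res_tx = res_tx.clone();
            thread::spawn(move || loop {
                // The lock is released at the end of this statement,
                // before the request is handled.
                let req = match req_rx.lock().unwrap().recv() {
                    Ok(req) => req,
                    Err(_) => break, // all senders dropped: shut down
                };
                res_tx.send(handle(req)).unwrap();
            })
        })
        .collect();
    drop(res_tx); // only the workers hold result senders now

    for req in requests {
        req_tx.send(req).unwrap();
    }
    drop(req_tx); // signal the workers that no more work is coming

    let results: Vec<u64> = res_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results
}

fn main() {
    let mut out = run_pool(4, (1..=10).collect());
    out.sort();
    println!("{out:?}");
}
```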
Supervision for fault isolation
smarm delivers Signal::Panic(pid, payload) to the supervisor when an
actor panics. Use spawn_under to register a supervisor channel and
build restart logic:
use smarm::Signal;

let (sup_tx, sup_rx) = channel::<Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        Ok(Signal::Exit(_)) => break,
        Err(_) => break,
    }
}
This pattern has essentially zero overhead compared to unmonitored
spawning, and the catch_unwind_panics bench confirms it is 4.8×
faster than tokio's abort/recover cycle.
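Since smarm currently has no restart intensity caps, a supervisor that restarts on every Signal::Panic can spin on a child that always fails. One way to decide between "restart" and "escalate" in the loop above is a sliding-window limiter. The sketch below is standalone and hypothetical: `RestartLimiter` is not a smarm type, and the limits shown are arbitrary.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

/// Allows at most `max_restarts` restarts within any sliding `window`.
struct RestartLimiter {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>,
}

impl RestartLimiter {
    fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, recent: VecDeque::new() }
    }

    /// Record a crash observed at `now`. Returns true if the child may be
    /// restarted, false if the supervisor should escalate instead.
    fn allow_restart(&mut self, now: Instant) -> bool {
        // Forget restarts that have fallen out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) > self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            return false;
        }
        self.recent.push_back(now);
        true
    }
}

fn main() {
    let mut limiter = RestartLimiter::new(3, Duration::from_secs(10));
    let t0 = Instant::now();
    for secs in [0, 1, 2, 3, 11] {
        let ok = limiter.allow_restart(t0 + Duration::from_secs(secs));
        println!("crash at +{secs}s: {}", if ok { "restart" } else { "escalate" });
    }
}
```

In the supervisor loop, the Signal::Panic arm would call `allow_restart(Instant::now())` and respawn the child only when it returns true.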
Explicit preemption in no-alloc hot loops
The allocator-driven preemption mechanism fires every alloc_interval
allocations. Code that never allocates (tight numeric loops, parsing
fixed-size buffers) will never yield preemptively. Add smarm::check!()
at the natural loop boundary:
for chunk in data.chunks(4096) {
process(chunk); // no allocations
smarm::check!(); // yield if timeslice expired
}
This is explicitly called out in LOOM.md as a known limitation.
The yield_in_hot_loop bench (1M iterations of yield_now()) shows
smarm is 1.25× slower than tokio even with explicit yields, which sets
the floor on how much check!() can help in truly tight loops.
IO-bound work
smarm's IO path (wait_readable, wait_writable, block_on_io) parks
the actor without blocking the OS scheduler thread. This is correct and
works well. There is no specific bench for IO-bound workloads in the
current suite, but the architecture is sound for network servers and
file-IO pipelines.
Known limitations and roadmap items
These are from LOOM.md plus observations from the bench suite.
| Limitation | Impact | Roadmap status |
|---|---|---|
| No stack size caching / slab | High spawn cost | Deferred |
| Global single min-heap timers | Poor at many timers | Deferred (hierarch. wheel) |
| Global Mutex<RunQueue> | Lock contention | Deferred (per-thread queues) |
| No join!() macro | Ergonomics | Deferred |
| x86-64 Linux only | Portability | ARM64 deferred |
| No restart intensity caps | Supervision safety | Deferred |
| Yield overhead under lock | Hot-loop fairness | Structural / ongoing |
The yield overhead and global mutex are the two issues most likely to
matter on a real multi-core workload. The sweep confirmed that
timeslice_cycles is a meaningful knob for controlling the mutex
hold time; the right long-term fix is per-thread run queues with
work stealing.
Running the bench suite
# Run all benches once, print results
python3 benches/sweep.py run
# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline
# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress
# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep
# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv
# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
cargo bench --bench general
The regression threshold is 10% and is configurable in sweep.py
(REGRESSION_THRESHOLD_PCT). The sweep grid is SWEEP_GRID in the
same file.
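The regression rule itself is a one-line comparison. The sketch below restates it in Rust for clarity; it illustrates the rule as described here and is not the code in sweep.py, and `is_regression` is a hypothetical name.

```rust
/// True if `current_us` is more than `threshold_pct` percent slower
/// than `baseline_us`.
fn is_regression(baseline_us: f64, current_us: f64, threshold_pct: f64) -> bool {
    current_us > baseline_us * (1.0 + threshold_pct / 100.0)
}

fn main() {
    // chained_spawn baseline from the summary table, with two made-up reruns.
    let baseline = 8_625.0;
    for current in [9_000.0, 9_600.0] {
        println!(
            "{current} µs vs {baseline} µs: regression = {}",
            is_regression(baseline, current, 10.0)
        );
    }
}
```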