smarm d432349f99 Update the documentation

2026-05-25 22:14:07 +02:00

smarm — Benchmarks & Tuning Recommendations

Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox, kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from design reasoning and single-core sweep data; re-validate on real hardware.

TL;DR

smarm is competitive with tokio for channel-heavy, message-passing workloads and wins outright on channel throughput (1.9×) and panic/unwind isolation (4.8×). It is 19–70× slower than tokio for spawn-heavy patterns and about 10× slower for timer-heavy workloads. The preemption knobs (alloc_interval, timeslice_cycles) have minimal effect on single-core machines; they are expected to matter more on multi-core under scheduler-thread contention, which is extrapolated rather than measured here.

Bench results summary

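All medians in µs; a "k" suffix means thousands. Tokio column is current_thread unless noted. Ratio is smarm / tokio, so values above 1 favour tokio and values below 1 favour smarm.

Bench	smarm (µs)	tokio (µs)	ratio (smarm/tokio)	winner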
`chained_spawn`	8 625	124	70×	tokio
`ping_pong_oneshot`	16 848	879	19×	tokio
`spawn_storm_busy`	126 k	2 772	45×	tokio
`yield_many`	41 622	15 085	2.8×	tokio
`yield_in_hot_loop`	190 k	153 k	1.25×	tokio
`many_timers`	143 k	14 462	10×	tokio
`fan_out_compute`	29 727	28 503	1.04×	even
`multi_thread_scaling`	30 k	29 k	1.04×	even
`deep_recursion`	83	25	3.3×	tokio
`mpsc_contention`	9 062	17 570	0.52×	smarm 1.9×
`uncontended_channel`	27 265	51 888	0.53×	smarm 1.9×
`catch_unwind_panics`	142 k	682 k	0.21×	smarm 4.8×
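
The ratio and winner columns can be reproduced from the two median columns. The sketch below assumes the ratio is smarm divided by tokio; the 5% band used to call a result "even" is an illustrative assumption, not a threshold taken from the bench suite.

```rust
/// Ratio convention assumed here: smarm median / tokio median.
/// Above 1.0 means tokio is faster; below 1.0 means smarm is faster.
fn ratio(smarm_us: f64, tokio_us: f64) -> f64 {
    smarm_us / tokio_us
}

/// Classifies a result. The 5% "even" band is an assumption for illustration.
fn winner(smarm_us: f64, tokio_us: f64) -> &'static str {
    let r = ratio(smarm_us, tokio_us);
    if (r - 1.0).abs() <= 0.05 {
        "even"
    } else if r > 1.0 {
        "tokio"
    } else {
        "smarm"
    }
}
```

For a row where smarm wins, the "N× faster" figure in the winner column is the reciprocal of the ratio, for example 1 / 0.21 ≈ 4.8 for catch_unwind_panics.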

Where smarm wins

Uncontended channels (1.9× faster)

When a single producer sends to a single consumer with no other actors competing for the queue, smarm's channel is meaningfully faster than tokio's. This is the core use case smarm is designed for: pipelines of actors passing owned data along a chain.

Recommendation: smarm is a good fit for any architecture where data flows through a chain of stages, each stage is an actor, and the channel between stages is the primary synchronisation point.
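
The shape of such a pipeline can be sketched with the standard library alone. This is not smarm code: std::thread and std::sync::mpsc stand in for smarm actors and channels, and the function name run_pipeline and the stage logic are made up for illustration.

```rust
use std::sync::mpsc;
use std::thread;

/// Runs a small pipeline (parse -> double -> collect) and returns its output.
/// Each stage owns the data it receives and hands ownership to the next stage.
fn run_pipeline(inputs: Vec<String>) -> Vec<i64> {
    let (raw_tx, raw_rx) = mpsc::channel::<String>();
    let (num_tx, num_rx) = mpsc::channel::<i64>();
    let (out_tx, out_rx) = mpsc::channel::<i64>();

    // Stage 1: parse owned strings, dropping anything that is not a number.
    let parse = thread::spawn(move || {
        for s in raw_rx {
            if let Ok(n) = s.trim().parse::<i64>() {
                if num_tx.send(n).is_err() {
                    break;
                }
            }
        }
    });

    // Stage 2: transform each value.
    let double = thread::spawn(move || {
        for n in num_rx {
            if out_tx.send(n * 2).is_err() {
                break;
            }
        }
    });

    for s in inputs {
        raw_tx.send(s).expect("parse stage should be alive");
    }
    // Closing the first channel shuts the chain down stage by stage.
    drop(raw_tx);

    let out: Vec<i64> = out_rx.iter().collect();
    parse.join().unwrap();
    double.join().unwrap();
    out
}
```

The only synchronisation point between stages is the channel itself, which is the property the uncontended_channel bench exercises.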

Uncontended MPSC (1.9× faster, same reason)

The mpsc_contention bench (multi-producer, single-consumer) wins for the same reason. On a single-thread runtime, smarm's mutex is never actually contended, so the lock is essentially free. On multi-core this advantage will likely shrink; re-measure.

Panic isolation (4.8× faster recovery)

catch_unwind_panics creates 10 000 actors that each panic. smarm recovers and delivers Signal::Panic to the supervisor 4.8× faster than tokio. This matters if you're building a system that uses panics as a fast abort path for malformed input or actor-level faults, or if you're using supervision trees seriously.

Recommendation: if your system expects panics to be a normal operational event (not just bugs), smarm's supervision story is a genuine advantage over tokio's task abort model.

Where smarm loses, and why

Spawn-heavy workloads (19–70×)

Every smarm actor mmaps a 64 KiB stack with a guard page. This is a syscall. Tokio tasks are heap-allocated state machines — no stack, no syscall, ~100 bytes each. For workloads that spawn thousands of short-lived actors per second, this is a structural disadvantage.

Recommendations:

Avoid spawning actors for work that completes in microseconds. Use a worker-pool pattern: spawn N long-lived actors at startup, distribute work over channels.
If you genuinely need high-frequency short-lived actors, the stack allocation cost is a known roadmap item (stack caching, slab alloc). It is not an inherent design flaw — just not implemented yet.
deep_recursion shows the same problem at depth 500: smarm spawns a fresh actor per level, paying the mmap cost repeatedly. Recursive decomposition should use explicit stacks or iteration inside a single actor, not actor-per-level spawning.
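
As a rough illustration of the last point, recursive decomposition can be flattened into a loop over an explicit work stack that lives inside one actor. The Node type and helper names below are hypothetical and exist only for this sketch.

```rust
/// A small tree used only for illustration.
struct Node {
    value: u64,
    children: Vec<Node>,
}

/// Sums every value using an explicit work stack instead of
/// one actor (or one call frame) per level.
fn sum_tree(root: &Node) -> u64 {
    let mut total = 0;
    let mut stack = vec![root];
    while let Some(node) = stack.pop() {
        total += node.value;
        // Push children instead of recursing or spawning per child.
        stack.extend(node.children.iter());
    }
    total
}

/// Builds a degenerate chain `depth` levels deep with values 1..=depth.
fn chain(depth: u64) -> Node {
    let mut node = Node { value: depth, children: Vec::new() };
    for v in (1..depth).rev() {
        node = Node { value: v, children: vec![node] };
    }
    node
}
```

A depth-500 chain is processed with a single heap-allocated Vec and no per-level stack mapping.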

Timer-heavy workloads (10×)

smarm uses a global min-heap of (deadline, Pid) pairs behind the shared mutex. Tokio uses a sharded hierarchical timer wheel. With 10 000 pending timers, smarm's O(log N) heap operations under the lock come out about 10× slower in many_timers.

Recommendations:

Do not use smarm sleep() in tight loops with many concurrent sleeping actors if timing precision matters.
For IO timeouts: prefer a single timer actor that manages a priority queue and fans out wakeups over channels, rather than 1 000 actors each sleeping directly.
The hierarchical timer wheel is listed in LOOM.md deferred work. It is the correct fix if timer performance becomes a bottleneck.
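
The single-timer-actor idea can be sketched as a priority queue that one actor owns outright, so no shared lock is needed around it. Everything here is an assumption for illustration: the ActorId alias, the TimerQueue type, and the use of plain microsecond counters instead of a real clock. In a real system the ids returned by pop_due would be turned into wakeup messages on each actor's channel.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical actor id type.
type ActorId = u32;

/// State owned by a single timer actor: a min-heap ordered by deadline.
struct TimerQueue {
    heap: BinaryHeap<Reverse<(u64, ActorId)>>,
}

impl TimerQueue {
    fn new() -> Self {
        TimerQueue { heap: BinaryHeap::new() }
    }

    /// Registers a wakeup for `who` at `deadline_us`.
    fn schedule(&mut self, deadline_us: u64, who: ActorId) {
        self.heap.push(Reverse((deadline_us, who)));
    }

    /// Pops every timer whose deadline is at or before `now_us`, earliest first.
    fn pop_due(&mut self, now_us: u64) -> Vec<ActorId> {
        let mut due = Vec::new();
        while let Some(&Reverse((deadline, who))) = self.heap.peek() {
            if deadline > now_us {
                break;
            }
            self.heap.pop();
            due.push(who);
        }
        due
    }

    /// The next deadline, which tells the timer actor how long it may sleep.
    fn next_deadline(&self) -> Option<u64> {
        self.heap.peek().map(|Reverse((d, _))| *d)
    }
}
```

With this layout only the timer actor ever sleeps; the other actors block on their channels and are woken by a message.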

Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)

Every yield_now() goes through the runtime mutex and run queue even on a single-thread scheduler. Tokio's current_thread scheduler handles yields with much lower overhead. smarm's naked context-switch is fast, but the lock acquisition around it dominates for high-frequency yields.

Recommendation: minimise explicit yield_now() calls in hot paths. In message-passing workloads this is natural — yield happens at recv() and send(), which is appropriate. If you are using yield_now() in a tight loop, consider whether the actor should instead be blocking on a channel or sleeping.
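
The difference between the two styles can be shown with standard-library stand-ins. This sketch uses std::sync::mpsc and std::thread::yield_now in place of smarm's channel and yield_now(); the function names are illustrative only.

```rust
use std::sync::mpsc::{Receiver, TryRecvError};
use std::thread;

/// Polling style: spins on try_recv and yields between attempts.
/// Returns (sum of messages, number of explicit yields).
fn drain_by_polling(rx: Receiver<u64>) -> (u64, u64) {
    let mut sum = 0;
    let mut yields = 0;
    loop {
        match rx.try_recv() {
            Ok(v) => sum += v,
            Err(TryRecvError::Empty) => {
                // Every empty poll costs a trip through the scheduler.
                yields += 1;
                thread::yield_now();
            }
            Err(TryRecvError::Disconnected) => return (sum, yields),
        }
    }
}

/// Blocking style: waits inside recv until a message arrives or the
/// channel closes, with no explicit yields in the loop.
fn drain_by_blocking(rx: Receiver<u64>) -> u64 {
    rx.iter().sum()
}
```

Both functions compute the same result; the blocking version simply leaves the decision of when to run again to the channel.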

Preemption knob recommendations

The knobs are Config::alloc_interval(n) and Config::timeslice_cycles(c). Default: alloc_interval = 128, timeslice_cycles = 300_000 (≈100 µs at 3 GHz).

Findings from the sweep

The sweep varied alloc_interval in {32, 64, 128, 256, 512} and timeslice_cycles in {150k, 300k, 600k, 1200k}: 10 points total, a subset of the full 5 × 4 cross product.

On a single-CPU machine the knobs are almost inert: most benches move < 5% across the entire grid. The exceptions are meaningful:

Longer timeslices hurt under contention. At tc=600k and tc=1200k:

spawn_storm_busy degrades +11–15%
catch_unwind_panics degrades +10–12%

The cause: 8 background yielder actors hold the scheduler mutex longer per timeslice, delaying the 10 000 actors waiting to be joined. A longer timeslice amplifies the global-mutex bottleneck.

Shorter timeslices marginally help timer-heavy work. At tc=150k, many_timers improves 3–4%. Actors that are sleeping get rescheduled sooner because the runtime polls the timer heap more frequently.

alloc_interval has no clear winner. Moving from 32 to 512 causes < 3% variation on every bench. The check frequency is not the bottleneck — the lock is.

Recommended starting points

Workload	alloc_interval	timeslice_cycles
Default (unknown)	128 (default)	300 000 (default)
Many concurrent sleeping actors	128	150 000
High-throughput channel pipeline	128	300 000
Compute-heavy (few allocs)	32	300 000
Strict fairness / many actors	64	150 000
Long-running compute batches	256	600 000
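
The table can be encoded as a small lookup if the workload class is chosen at startup. The Workload enum and starting_point function are hypothetical names for this sketch, and the integer types are assumptions; only the numbers come from the table above.

```rust
/// Hypothetical workload labels mirroring the rows of the table above.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Workload {
    Default,
    ManySleepers,
    ChannelPipeline,
    ComputeHeavy,
    StrictFairness,
    LongBatches,
}

/// Returns (alloc_interval, timeslice_cycles) starting points from the table.
fn starting_point(w: Workload) -> (u32, u64) {
    match w {
        Workload::Default | Workload::ChannelPipeline => (128, 300_000),
        Workload::ManySleepers => (128, 150_000),
        Workload::ComputeHeavy => (32, 300_000),
        Workload::StrictFairness => (64, 150_000),
        Workload::LongBatches => (256, 600_000),
    }
}
```

The two values would then be passed to Config::alloc_interval and Config::timeslice_cycles. Treat them as starting points to re-measure, not as tuned results.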

Note on timeslice_cycles calibration: the default was tuned for ≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a 4 GHz machine it's ≈75 µs. If you want a precise target timeslice, measure your CPU's TSC frequency at startup and set the cycles value accordingly:

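use std::time::{Duration, Instant};

// Approximate TSC frequency measurement (call once at startup).
fn tsc_hz() -> u64 {
    let start = Instant::now();
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    // sleep() may overshoot, so scale by the time that actually elapsed.
    let elapsed_ns = start.elapsed().as_nanos().max(1);
    ((t1 - t0) as u128 * 1_000_000_000 / elapsed_ns) as u64
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so the sub-MHz part of the frequency is not truncated.
let cycles = tsc_hz() * target_us / 1_000_000;

let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);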
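
The calibration arithmetic in the note can be checked independently of smarm. The two helper names below are illustrative; they only convert between cycles and microseconds for a given TSC frequency.

```rust
/// Cycles needed for a timeslice of `target_us` at a TSC rate of `tsc_hz`.
fn cycles_for(tsc_hz: u64, target_us: u64) -> u64 {
    // Widen to u128 and multiply first to avoid both overflow and truncation.
    (tsc_hz as u128 * target_us as u128 / 1_000_000) as u64
}

/// Wall-clock length in microseconds of `cycles` at a TSC rate of `tsc_hz`.
fn timeslice_us(cycles: u64, tsc_hz: u64) -> f64 {
    cycles as f64 * 1e6 / tsc_hz as f64
}
```

With these, the default of 300 000 cycles works out to 100 µs at 3 GHz, about 107 µs at 2.8 GHz and 75 µs at 4 GHz, and a 100 µs target on the 2.8 GHz bench machine needs 280 000 cycles.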

Architecture recommendations

Use actor pools, not per-request actors

// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch.
// `num_cpus` is the worker count, defined by the caller.
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    // Each long-lived worker drains the shared queue until recv() fails.
    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
}
for req in requests { tx.send(req).unwrap(); }

The worker pool pattern amortises the 64 KiB mmap cost over the lifetime of the pool. The chained_spawn bench shows this cost is real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.

Supervision for fault isolation

smarm delivers Signal::Panic(pid, payload) to the supervisor when an actor panics. Use spawn_under to register a supervisor channel and build restart logic:

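let (sup_tx, sup_rx) = channel::<smarm::Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        // The child panicked; `pid` identifies which actor failed.
        Ok(smarm::Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        // Normal exit: nothing left to supervise.
        Ok(smarm::Signal::Exit(_)) => break,
        // recv() failed: stop supervising.
        Err(_) => break,
    }
}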

This pattern is intended to add little overhead over unmonitored spawning, although no bench in the current suite compares the two directly. The catch_unwind_panics bench shows panic recovery is 4.8× faster than tokio's abort/recover cycle.
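
Because restart intensity caps are listed as deferred, a supervisor that restarts on every Signal::Panic can loop forever on a child that always crashes. One possible user-side guard is a sliding-window limiter. The RestartLimiter type below is a hypothetical sketch, not part of smarm, and it takes timestamps as plain millisecond counters so it does not depend on any clock API.

```rust
use std::collections::VecDeque;

/// Allows at most `max_restarts` within any sliding window of `window_ms`.
struct RestartLimiter {
    max_restarts: usize,
    window_ms: u64,
    /// Timestamps (ms) of restarts that are still inside the window.
    recent: VecDeque<u64>,
}

impl RestartLimiter {
    fn new(max_restarts: usize, window_ms: u64) -> Self {
        RestartLimiter { max_restarts, window_ms, recent: VecDeque::new() }
    }

    /// Records a restart attempt at `now_ms`. Returns false when the
    /// supervisor should escalate or give up instead of restarting.
    fn allow(&mut self, now_ms: u64) -> bool {
        // Forget restarts that have aged out of the window.
        while let Some(&t) = self.recent.front() {
            if now_ms.saturating_sub(t) >= self.window_ms {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        if self.recent.len() >= self.max_restarts {
            return false;
        }
        self.recent.push_back(now_ms);
        true
    }
}
```

In the supervisor loop, the Panic arm would call allow() before restarting and break out of the loop (or signal its own supervisor) when it returns false.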

Explicit preemption in no-alloc hot loops

The allocator-driven preemption mechanism fires every alloc_interval allocations. Code that never allocates (tight numeric loops, parsing fixed-size buffers) will never yield preemptively. Add smarm::check!() at the natural loop boundary:

for chunk in data.chunks(4096) {
    process(chunk);       // no allocations
    smarm::check!();      // yield if timeslice expired
}

This is explicitly called out in LOOM.md as a known limitation. The yield_in_hot_loop bench (1M iterations of yield_now()) shows smarm is 1.25× slower than tokio even with explicit yields, which sets the floor on how much check!() can help in truly tight loops.
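
Why a no-alloc loop escapes preemption can be seen in a conceptual model of the mechanism. This is not smarm's implementation: the Preempt struct, its field names and its logic are assumptions that mirror only the behaviour described above (a timeslice check every alloc_interval allocations, plus an explicit check).

```rust
/// Conceptual model only; names and logic are assumptions, not smarm internals.
struct Preempt {
    alloc_interval: u32,
    timeslice_cycles: u64,
    allocs_since_check: u32,
    slice_start: u64,
}

impl Preempt {
    fn new(alloc_interval: u32, timeslice_cycles: u64, now: u64) -> Self {
        Preempt { alloc_interval, timeslice_cycles, allocs_since_check: 0, slice_start: now }
    }

    /// Explicit check, playing the role of check!(): returns true if the
    /// timeslice has expired and the actor should yield.
    fn check(&mut self, now: u64) -> bool {
        self.allocs_since_check = 0;
        if now.wrapping_sub(self.slice_start) >= self.timeslice_cycles {
            self.slice_start = now; // a fresh slice starts after the yield
            true
        } else {
            false
        }
    }

    /// Called on every allocation: only every `alloc_interval`-th call
    /// looks at the clock at all.
    fn on_alloc(&mut self, now: u64) -> bool {
        self.allocs_since_check += 1;
        if self.allocs_since_check < self.alloc_interval {
            return false;
        }
        self.check(now)
    }
}
```

In this model, code that never allocates never reaches on_alloc, so an expired timeslice goes unnoticed until something calls check explicitly.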

IO-bound work

smarm's IO path (wait_readable, wait_writable, block_on_io) parks the actor without blocking the OS scheduler thread. The current suite has no bench for IO-bound workloads, so this path is unmeasured, but the design is sound for network servers and file-IO pipelines.

Known limitations and roadmap items

These are from LOOM.md plus observations from the bench suite.

Limitation	Impact	Roadmap status
No stack size caching / slab	High spawn cost	Deferred
Global single min-heap timers	Poor at many timers	Deferred (hierarchical wheel)
Global `Mutex<RunQueue>`	Lock contention	Deferred (per-thread queues)
No `join!()` macro	Ergonomics	Deferred
x86-64 Linux only	Portability	ARM64 deferred
No restart intensity caps	Supervision safety	Deferred
Yield overhead under lock	Hot-loop fairness	Structural / ongoing

The yield overhead and global mutex are the two issues most likely to matter on a real multi-core workload. The sweep confirmed that timeslice_cycles is a meaningful knob for controlling the mutex hold time; the right long-term fix is per-thread run queues with work stealing.

Running the bench suite

# Run all benches once, print results
python3 benches/sweep.py run

# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline

# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress

# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep

# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv

# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
    cargo bench --bench general

The regression threshold is 10% and is configurable in sweep.py (REGRESSION_THRESHOLD_PCT). The sweep grid is SWEEP_GRID in the same file.
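
The regression rule itself is simple percentage arithmetic. The sketch below restates it in Rust for clarity; it is not the code in sweep.py, and the function names are illustrative.

```rust
/// Percent change of `current_us` relative to `baseline_us` (positive = slower).
fn pct_change(baseline_us: f64, current_us: f64) -> f64 {
    (current_us - baseline_us) / baseline_us * 100.0
}

/// True when the current median is more than `threshold_pct` slower
/// than the saved baseline.
fn is_regression(baseline_us: f64, current_us: f64, threshold_pct: f64) -> bool {
    pct_change(baseline_us, current_us) > threshold_pct
}
```

With the default 10% threshold, a bench whose median moves from 8 625 µs to 9 600 µs (about +11.3%) would be flagged, while a move to 9 000 µs (about +4.3%) would not.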
