Update the documentation

smarm
2026-05-25 22:14:07 +02:00
parent 2b85ef60b2
commit d432349f99
6 changed files with 1348 additions and 25 deletions

217
docs/Architecture.md Normal file

@@ -0,0 +1,217 @@
# SMARM Architecture
> Erlang-style actor concurrency for Rust, without the copies, the colors, or the GC pauses.
---
## Vision
Rust gives you the right ownership discipline for safe actor concurrency almost for free — `Send` already
draws the boundary, the borrow checker already enforces it. What it lacks is an execution model to match:
async/await is IO-centric, colors your functions, and trades stack simplicity for state-machine complexity;
OS threads are too heavy to spawn per actor.
SMARM adds a third option: **green-thread actors on a shared heap**, scheduled cooperatively, with
message-passing as the only cross-actor communication primitive. You get Erlang's isolation model without
Erlang's copying GC, and you get Rust's zero-copy ownership transfers without async's cognitive overhead.
No function coloring. No `Box<dyn Future>`. Just actors, messages, and the borrow checker doing what it
already does.
---
## Do: Core Runtime
### Actors and scheduling
Each actor is a lightweight green thread with its own heap-allocated, growable stack. Stacks are
allocated via `mmap` with a guard page below the region; overflow is detected by the OS without SMARM
polling for it. Initial stacks are small and grow by remapping on demand.
The scheduler runs one OS thread per CPU. Each scheduler thread loops against a single global
`Mutex<HashMap>` queue shared across all schedulers. If queue contention becomes a measured bottleneck
this can be revisited; the interface will not change.
SMARM requires `panic = unwind`. Users who set `panic = abort` accept that supervision and actor
isolation are silently degraded to process death.
### Process descriptor
Each actor has a descriptor that is hot while the actor runs and will typically live in L1 cache.
It holds:
- `stack_base: *mut u8` — bottom of the allocated stack region
- `stack_cap: usize` — total allocated size
- `stack_ptr: *mut u8` — current stack pointer (`rsp`), saved on yield
- `pid: (u32, u32)` — index and generation counter (see PIDs below)
- `alloc_count: u32` — countdown for preemption sampling
- `timeslice_start: u64`, the `RDTSC` value written on every resume
- `resize_count: u16` — diagnostic counter for stack growth events
- `context: *mut ContextSaveArea` — pointer to the register save area (cold, touched only on switch)
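
The field list above can be written down as a struct. This is a minimal sketch, not SMARM's actual definition: the `#[repr(C)]` layout, the `Pid` struct standing in for the `(u32, u32)` pair, and the opaque `ContextSaveArea` placeholder are all assumptions made for illustration.

```rust
use std::mem::size_of;

/// Index plus generation counter (stands in for the `(u32, u32)` pair).
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

/// Opaque placeholder for the per-architecture register save area.
#[repr(C)]
pub struct ContextSaveArea {
    _opaque: [u8; 0],
}

/// Hypothetical layout of the hot per-actor descriptor.
#[repr(C)]
pub struct ProcessDescriptor {
    pub stack_base: *mut u8,           // bottom of the allocated stack region
    pub stack_cap: usize,              // total allocated size
    pub stack_ptr: *mut u8,            // saved stack pointer, written on yield
    pub pid: Pid,                      // index and generation
    pub alloc_count: u32,              // countdown for preemption sampling
    pub timeslice_start: u64,          // timestamp written on every resume
    pub resize_count: u16,             // diagnostic counter for stack growth
    pub context: *mut ContextSaveArea, // cold: touched only on switch
}

/// Size of the descriptor in bytes on the current target.
pub fn descriptor_size() -> usize {
    size_of::<ProcessDescriptor>()
}
```

Under these assumptions, and on a 64-bit target, the fields plus padding come to 64 bytes, so the descriptor could fit in a single typical cache line.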
### Context switching
Context switching is implemented in a `#[naked]` assembly shim, one per supported architecture.
The compiler cannot be asked to switch stacks.
**Suspend** (yield, preemption, or blocking):
1. Save callee-saved integer registers and SIMD registers into `ContextSaveArea`.
2. Save `rsp`/`sp` into the process descriptor.
3. Load the scheduler's stack pointer from a thread-local and jump back into the scheduler loop.
**Resume**:
1. Load `rsp`/`sp` from the process descriptor.
2. Restore registers from `ContextSaveArea`.
3. `ret` — the return address is already on the restored stack, execution resumes exactly where the
actor yielded.
**x86-64**: saves `rbx`, `rbp`, `r12`–`r15` (6 × 8 = 48 bytes) and `xmm0`–`xmm15` (16 × 16 = 256
bytes) = 304 bytes total. Full SSE baseline is required; the compiler may autovectorise freely.
AVX-512 is deferred.
**ARM64**: saves `x19`–`x30` (12 × 8 = 96 bytes, including the link register `x30` which must be
saved explicitly — it holds the return address, unlike x86 where `call` pushes it to the stack) and
`d8`–`d15` (8 × 8 = 64 bytes) = 160 bytes total.
`ContextSaveArea` is a `Box<ContextSaveArea>` per actor. Lifetime equals the actor's lifetime;
no churn, no bulk deallocation, `Box` is correct.
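
The byte counts above can be checked against a concrete layout. The sketch below is illustrative only: the type names `SaveAreaX64` and `SaveAreaArm64`, the field names, and the 16-byte alignment are assumptions, not SMARM's real definitions.

```rust
use std::mem::size_of;

/// x86-64 save area: 6 callee-saved GPRs plus xmm0 to xmm15.
#[repr(C, align(16))]
pub struct SaveAreaX64 {
    pub gprs: [u64; 6],      // rbx, rbp, r12 to r15: 6 * 8 = 48 bytes
    pub xmm: [[u8; 16]; 16], // xmm0 to xmm15: 16 * 16 = 256 bytes
}

/// ARM64 save area: x19 to x30 plus d8 to d15.
#[repr(C, align(16))]
pub struct SaveAreaArm64 {
    pub gprs: [u64; 12], // x19 to x30: 12 * 8 = 96 bytes
    pub fprs: [u64; 8],  // d8 to d15: 8 * 8 = 64 bytes
}

/// One boxed, zeroed save area per actor, freed when the actor is dropped.
pub fn new_save_area_x64() -> Box<SaveAreaX64> {
    Box::new(SaveAreaX64 { gprs: [0; 6], xmm: [[0; 16]; 16] })
}

/// (x86-64 size, ARM64 size) in bytes.
pub fn save_area_sizes() -> (usize, usize) {
    (size_of::<SaveAreaX64>(), size_of::<SaveAreaArm64>())
}
```

Both totals are multiples of 16, so the assumed alignment adds no tail padding and the struct sizes match the 304-byte and 160-byte figures quoted above.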
Initial platform target is x86-64 Linux. ARM64 and macOS are natural follow-ons.
### Allocator-driven preemption
Every Nth allocation, the allocator reads `RDTSC` and compares it against `timeslice_start`. If the
threshold is exceeded the actor yields. The workloads that starve a scheduler (sustained compute,
data transformation) usually allocate frequently, so allocation count is a cheap proxy for elapsed
time in the common case.
`RDTSC` is not monotonic across core migration; a slightly wrong timeslice is acceptable. SMARM is
not a real-time scheduler.
Known failure mode: tight no-alloc loops are invisible to this mechanism. Actors doing sustained
allocation-free compute must call `smarm::yield_now()` explicitly, or offload to a thread pool
outside the actor scheduler (e.g. rayon). This is documented and acceptable — such loops are rare
in message-passing workloads.
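
The sampling logic described in this section can be sketched without touching a real allocator. Everything here is hypothetical: `PreemptState`, `on_alloc`, and `on_resume` are illustrative names, and the clock is passed in as a closure so the example runs anywhere rather than calling `RDTSC` directly.

```rust
/// Per-actor preemption state, mirroring `alloc_count` and `timeslice_start`.
pub struct PreemptState {
    alloc_interval: u32,   // sample the clock every Nth allocation
    timeslice_cycles: u64, // budget before the actor should yield
    alloc_count: u32,      // countdown to the next clock sample
    timeslice_start: u64,  // clock value at the last resume
}

impl PreemptState {
    pub fn new(alloc_interval: u32, timeslice_cycles: u64, now: u64) -> Self {
        assert!(alloc_interval > 0, "interval must be at least 1");
        Self {
            alloc_interval,
            timeslice_cycles,
            alloc_count: alloc_interval,
            timeslice_start: now,
        }
    }

    /// Called by the scheduler whenever the actor is resumed.
    pub fn on_resume(&mut self, now: u64) {
        self.timeslice_start = now;
        self.alloc_count = self.alloc_interval;
    }

    /// Called on every allocation. `read_clock` is invoked only on every
    /// Nth call. Returns true when the actor should yield.
    pub fn on_alloc(&mut self, read_clock: impl FnOnce() -> u64) -> bool {
        self.alloc_count -= 1;
        if self.alloc_count > 0 {
            return false; // cheap path: no clock read
        }
        self.alloc_count = self.alloc_interval;
        read_clock().wrapping_sub(self.timeslice_start) > self.timeslice_cycles
    }
}
```

The sketch also makes the failure mode visible: if `on_alloc` is never called, nothing ever reads the clock, which is why allocation-free loops need an explicit yield.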
### Yield points
An actor yields at:
- **Channel send/recv** — the primary communication primitive
- **Mutex contention** — attempting to lock a held `Arc<Mutex<>>` parks the actor
- **IO** — blocking on a socket or file descriptor parks the actor until the IO thread signals readiness
- **`smarm::sleep(duration)`** — parks the actor; the timer wheel re-queues it on expiry
- **`smarm::yield_now()`** — explicit cooperative yield
- **Allocator preemption** — as above
- **Spawn** — does not yield by default; the new actor is queued and the spawner continues
`std::thread::sleep` inside an actor blocks the entire OS thread and should never be used. SMARM
may emit a warning if it can detect this.
### IO thread
A single dedicated IO thread runs an `epoll`/`kqueue` loop. Actors blocking on IO register their
file descriptor and PID; the IO thread moves them back into the global queue when the fd is ready.
A `HashMap<RawFd, Pid>` maps fds to parked actors. Cancellation (actor dies while waiting on IO)
deregisters the fd. This is intentionally simple and not pluggable; SMARM is not a general async
executor.
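
The bookkeeping side of the IO thread (without the `epoll`/`kqueue` loop itself) might look like the sketch below. `IoRegistry` and its method names are assumptions for illustration; only the `HashMap<RawFd, Pid>` shape comes from the text above.

```rust
use std::collections::HashMap;
use std::os::unix::io::RawFd;

/// (index, generation), as in the PID section.
pub type Pid = (u32, u32);

/// Maps file descriptors to the actor parked on each of them.
#[derive(Default)]
pub struct IoRegistry {
    parked: HashMap<RawFd, Pid>,
}

impl IoRegistry {
    /// Park `pid` on `fd`. Returns the previous waiter if one was replaced.
    pub fn register(&mut self, fd: RawFd, pid: Pid) -> Option<Pid> {
        self.parked.insert(fd, pid)
    }

    /// The poller reported `fd` ready: remove and return the actor to re-queue.
    pub fn on_ready(&mut self, fd: RawFd) -> Option<Pid> {
        self.parked.remove(&fd)
    }

    /// The actor died while waiting: drop its registrations and return the
    /// fds that should also be deregistered from the poller.
    pub fn cancel(&mut self, pid: Pid) -> Vec<RawFd> {
        let fds: Vec<RawFd> = self
            .parked
            .iter()
            .filter(|(_, p)| **p == pid)
            .map(|(fd, _)| *fd)
            .collect();
        for fd in &fds {
            self.parked.remove(fd);
        }
        fds
    }
}
```

Cancellation here is a linear scan, which is acceptable for a sketch; a reverse index from PID to fd would avoid it if actor death while parked on IO turned out to be common.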
### Communication
Messages must be `Send` or `Copy`. Non-`Send` types cannot cross an actor boundary; this is
enforced by the type system with no runtime overhead.
Two primitives only:
- **Move** — transfer owned data across a channel. Zero copy. The sender relinquishes ownership
at the type level. This is the default.
- **`Arc<Mutex<T>>`** — for genuinely shared long-lived state. Explicit and visible.
Cross-actor `Rc` or bare pointers are banned. There is no cycle detector. Cross-actor cycles are
banned by construction: either transfer ownership or use `Arc`.
### PIDs
A PID is a `(index, generation)` pair. The index may be reused after an actor dies; the generation
counter increments on every death. A stale handle holding the wrong generation is a detectable
error, not a silent misdirection. This avoids the ABA problem without reserving PID space forever.
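
A generational index table is one common way to get this behaviour. The following is a minimal sketch under that assumption; `PidTable` and its methods are hypothetical names, not SMARM's API.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

/// Slot table that reuses indices but bumps the generation on every death.
#[derive(Default)]
pub struct PidTable {
    generations: Vec<u32>, // current generation of each slot
    live: Vec<bool>,       // whether the slot currently holds an actor
    free: Vec<u32>,        // indices available for reuse
}

impl PidTable {
    /// Allocate a PID, reusing a free slot when one exists.
    pub fn spawn(&mut self) -> Pid {
        let index = match self.free.pop() {
            Some(i) => i,
            None => {
                self.generations.push(0);
                self.live.push(false);
                (self.generations.len() - 1) as u32
            }
        };
        self.live[index as usize] = true;
        Pid { index, generation: self.generations[index as usize] }
    }

    /// True only if the handle refers to the actor currently in the slot.
    pub fn is_live(&self, pid: Pid) -> bool {
        let i = pid.index as usize;
        i < self.live.len() && self.live[i] && self.generations[i] == pid.generation
    }

    /// Record an actor's death. Returns false for a stale handle.
    pub fn kill(&mut self, pid: Pid) -> bool {
        if !self.is_live(pid) {
            return false;
        }
        let i = pid.index as usize;
        self.live[i] = false;
        self.generations[i] = self.generations[i].wrapping_add(1);
        self.free.push(pid.index);
        true
    }
}
```

A handle kept from before the death has the old generation, so a send through it fails the `is_live` check instead of reaching whichever actor now occupies the slot.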
### Supervision
Every actor has a supervisor, assigned at spawn. This is not optional. The root supervisor is
provided by the runtime; its death is a process exit.
A supervisor receives one of three signals when a child actor terminates:
- `Signal::Exit(pid)` — normal completion
- `Signal::Panic(pid, payload)` — caught via `catch_unwind` at the actor entry point boundary,
before unwinding can reach the assembly shim
- `Signal::Timeout(pid)` — actor exceeded a budget (see below)
The supervisor decides: restart the actor, escalate to its own supervisor, or ignore. Restart
intensity is capped: if an actor panics more than N times within a time window, the supervisor
stops restarting and escalates. This prevents a bad prelude or corrupted input from spinning the
supervisor in a restart loop indefinitely. N and the window are configurable per supervisor with a
sensible global default.
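
The restart-intensity cap can be modelled as a sliding window over recent panic timestamps. This is a sketch of that idea only; `RestartLimiter`, `Decision`, and the parameter names are assumptions, and the real supervisor may track restarts differently.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    Escalate,
}

/// Allows at most `max_restarts` panics inside any `window`.
pub struct RestartLimiter {
    max_restarts: usize,
    window: Duration,
    panics: VecDeque<Instant>, // timestamps of recent panics, oldest first
}

impl RestartLimiter {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        Self { max_restarts, window, panics: VecDeque::new() }
    }

    /// Record a panic observed at `now` and decide what the supervisor does.
    pub fn on_panic(&mut self, now: Instant) -> Decision {
        // Forget panics that have fallen out of the window.
        while let Some(&oldest) = self.panics.front() {
            if now.saturating_duration_since(oldest) > self.window {
                self.panics.pop_front();
            } else {
                break;
            }
        }
        self.panics.push_back(now);
        if self.panics.len() > self.max_restarts {
            Decision::Escalate
        } else {
            Decision::Restart
        }
    }
}
```

With `max_restarts = 3`, the first three panics inside the window produce `Restart` and the fourth produces `Escalate`, matching the "more than N times" wording above.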
### Mutex timeout
Every `smarm::mutex` lock attempt is mediated by the scheduler. If the lock is not acquired within
a configurable timeout, the actor receives a `LockTimeout` error rather than parking forever. This
is a hard runtime guarantee, not a convention. Default timeout is global and configurable;
individual locks and individual call sites can override it.
### Task joining
Actors can spawn children and wait on a group of handles:
```rust
let h1 = smarm::spawn(|| compute_a());
let h2 = smarm::spawn(|| compute_b());
let (a, b) = smarm::join!(h1, h2);
```
`join!` parks the calling actor until all handles complete. The last child to finish re-queues the
parent. This is a countdown in the parent's descriptor; no polling, no waker registration. A
`join_timeout!` variant is a natural extension.
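
The "last child re-queues the parent" rule reduces to an atomic countdown. The sketch below shows just that counter, with a hypothetical `JoinCountdown` type; the actual parking and re-queueing are left to the scheduler.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Countdown stored alongside the parent; one decrement per finished child.
pub struct JoinCountdown {
    remaining: AtomicUsize,
}

impl JoinCountdown {
    pub fn new(children: usize) -> Self {
        Self { remaining: AtomicUsize::new(children) }
    }

    /// Called by each child exactly once on completion. Returns true for
    /// exactly one caller, the last child, which should re-queue the parent.
    pub fn child_done(&self) -> bool {
        // fetch_sub returns the previous value, so 1 means "I was the last".
        self.remaining.fetch_sub(1, Ordering::AcqRel) == 1
    }
}
```

Because `fetch_sub` is a single atomic read-modify-write, only one child can observe the previous value 1, so the parent is woken once with no polling.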
### Timer wheel
`smarm::sleep` and supervision timeouts are driven by a timer wheel in the scheduler. Sleeping
actors are parked and re-queued by the timer thread on expiry. The timer wheel is internal
infrastructure; its design is an implementation detail.
---
## Defer: Later Work
- **Stack sizing policy** — initial size, growth factor, and whether stacks ever shrink are
implementation decisions to be made with profiling data, not up front.
- **Queue contention** — if `Mutex<HashMap>` proves to be a bottleneck under profiling, evaluate
`DashMap` or a lock-free work-stealing deque (e.g. `crossbeam-deque`). Not before.
- **AVX-512 context save** — extend `ContextSaveArea` when there is a concrete use case.
- **`smarm::sleep` vs raw sleep semantics** — further control knobs deferred until the basic sleep
is working and real use cases are understood.
- **Supervision tree API** — the contract is defined; the recursive hierarchy, restart strategies,
and introspection API are implementation work.
- **no_std support** — the assembly shim is no_std friendly but the IO thread and allocator require
OS primitives. Target is no_std + `alloc` on hosted platforms; bare metal is out of scope.
- **Distribution** — SMARM is a single-process runtime. No distribution protocol, no BEAM-style
clustering.
---
## What SMARM is Not
- Not a drop-in replacement for Tokio. SMARM does not implement `Future` or the async executor interface.
- Not a general allocator. SMARM manages actor stacks; heap allocation for actor data goes through
the system allocator.
- Not Erlang. No hot code reloading, no distribution protocol, no BEAM bytecode. SMARM is a
concurrency runtime, not a platform.
- Not a real-time scheduler. Timeslice accuracy is best-effort.
---
## On names
<sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>


@@ -0,0 +1,320 @@
# smarm — Benchmarks & Tuning Recommendations
> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
> design reasoning and single-core sweep data; re-validate on real hardware.
---
## TL;DR
smarm is competitive with tokio for **channel-heavy, message-passing workloads**
and wins outright on **uncontended channels** and **panic/unwind isolation**.
It is significantly slower than tokio for **spawn-heavy** patterns and
**timer-heavy** workloads. The preemption knobs (`alloc_interval`,
`timeslice_cycles`) have minimal effect on single-core machines; they matter
on multi-core under scheduler-thread contention.
---
## Bench results summary
All medians in µs. Tokio column is `current_thread` unless noted.
| Bench | smarm | tokio | ratio | winner |
|----------------------|--------|--------|--------|---------------|
| `chained_spawn` | 8 625 | 124 | 70× | tokio |
| `ping_pong_oneshot` | 16 848 | 879 | 19× | tokio |
| `spawn_storm_busy` | 126 k | 2 772 | 45× | tokio |
| `yield_many` | 41 622 | 15 085 | 2.8× | tokio |
| `yield_in_hot_loop` | 190 k | 153 k | 1.25× | tokio |
| `many_timers` | 143 k | 14 462 | 10× | tokio |
| `fan_out_compute` | 29 727 | 28 503 | 1.04× | **even** |
| `multi_thread_scaling` | 30 k | 29 k | 1.04× | **even** |
| `deep_recursion` | 83 | 25 | 3.3× | tokio |
| `mpsc_contention` | 9 062 | 17 570 | 0.52× | **smarm** 1.9× |
| `uncontended_channel`| 27 265 | 51 888 | 0.53× | **smarm** 1.9× |
| `catch_unwind_panics`| 142 k | 682 k | 0.21× | **smarm** 4.8× |
---
## Where smarm wins
### Uncontended channels (1.9× faster)
When a single producer sends to a single consumer with no other actors
competing for the queue, smarm's channel is meaningfully faster than
tokio's. This is the core use case smarm is designed for: pipelines of
actors passing owned data along a chain.
**Recommendation**: smarm is a good fit for any architecture where data
flows through a chain of stages, each stage is an actor, and the
channel between stages is the primary synchronisation point.
### Uncontended MPSC (1.9× faster, same reason)
Multi-producer single-consumer works well for the same reason. On a
single-thread runtime, smarm's mutex is uncontended, so the lock is
essentially free. On multi-core this advantage will shrink; re-measure.
### Panic isolation (4.8× faster recovery)
`catch_unwind_panics` creates 10 000 actors that each panic. smarm
recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
than tokio. This matters if you're building a system that uses panics
as a fast abort path for malformed input or actor-level faults, or if
you're using supervision trees seriously.
**Recommendation**: if your system expects panics to be a normal
operational event (not just bugs), smarm's supervision story is a
genuine advantage over tokio's task abort model.
---
## Where smarm loses, and why
### Spawn-heavy workloads (19–70×)
Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
a syscall. Tokio tasks are heap-allocated state machines — no stack,
no syscall, ~100 bytes each. For workloads that spawn thousands of
short-lived actors per second, this is a structural disadvantage.
**Recommendations**:
- Avoid spawning actors for work that completes in microseconds.
Use a worker-pool pattern: spawn N long-lived actors at startup,
distribute work over channels.
- If you genuinely need high-frequency short-lived actors, the stack
allocation cost is a known roadmap item (stack caching, slab alloc).
It is not an inherent design flaw — just not implemented yet.
- `deep_recursion` shows the same problem at depth 500: smarm spawns
a fresh actor per level, paying the mmap cost repeatedly. Recursive
decomposition should use explicit stacks or iteration inside a single
actor, not actor-per-level spawning.
### Timer-heavy workloads (10×)
smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
shared mutex. Tokio uses a sharded hierarchical timer wheel. With
10 000 pending timers, smarm's O(log N) heap under lock is
dramatically slower.
**Recommendations**:
- Do not use smarm `sleep()` in tight loops with many concurrent
sleeping actors if timing precision matters.
- For IO timeouts: prefer a single timer actor that manages a priority
queue and fans out wakeups over channels, rather than 1 000 actors
each sleeping directly.
- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
It is the correct fix if timer performance becomes a bottleneck.
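
The priority queue a single timer actor would manage can be sketched with the standard library's `BinaryHeap`. This is an illustration of the min-heap of `(deadline, Pid)` pairs described above, not smarm's implementation; `TimerHeap` and its methods are hypothetical, and deadlines are plain `u64` ticks.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// (index, generation)
pub type Pid = (u32, u32);

/// Min-heap of (deadline, pid); `Reverse` turns the max-heap into a min-heap.
#[derive(Default)]
pub struct TimerHeap {
    heap: BinaryHeap<Reverse<(u64, Pid)>>,
}

impl TimerHeap {
    /// Register a wakeup for `pid` at `deadline`. O(log N).
    pub fn sleep_until(&mut self, deadline: u64, pid: Pid) {
        self.heap.push(Reverse((deadline, pid)));
    }

    /// Pop every timer whose deadline is at or before `now`, earliest first.
    pub fn pop_expired(&mut self, now: u64) -> Vec<Pid> {
        let mut woken = Vec::new();
        while let Some(Reverse((deadline, pid))) = self.heap.peek().copied() {
            if deadline > now {
                break; // heap is ordered, so nothing later has expired either
            }
            self.heap.pop();
            woken.push(pid);
        }
        woken
    }
}
```

Owned by one actor, the heap needs no lock of its own; the other actors wait on a channel and receive their wakeup as a message.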
### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
Every `yield_now()` goes through the runtime mutex and run queue even
on a single-thread scheduler. Tokio's current_thread scheduler handles
yields with much lower overhead. smarm's naked context-switch is fast,
but the lock acquisition around it dominates for high-frequency yields.
**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
In message-passing workloads this is natural — yield happens at
`recv()` and `send()`, which is appropriate. If you are using
`yield_now()` in a tight loop, consider whether the actor should
instead be blocking on a channel or sleeping.
---
## Preemption knob recommendations
The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
### Findings from the sweep
The sweep varied alloc_interval in `{32, 64, 128, 256, 512}` and
timeslice_cycles in `{150k, 300k, 600k, 1200k}` — 10 points total.
On a single-CPU machine the knobs are almost inert: most benches move
< 5% across the entire grid. The exceptions are meaningful:
**Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k`:
- `spawn_storm_busy` degrades +11–15%
- `catch_unwind_panics` degrades +10–12%
The cause: 8 background yielder actors hold the scheduler mutex longer
per timeslice, delaying the 10 000 actors waiting to be joined. A
longer timeslice amplifies the global-mutex bottleneck.
**Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
`many_timers` improves 3–4%. Actors that are sleeping get rescheduled
sooner because the runtime polls the timer heap more frequently.
**alloc_interval has no clear winner.** Moving from 32 to 512 causes
< 3% variation on every bench. The check frequency is not the
bottleneck; the lock is.
### Recommended starting points
| Workload | alloc_interval | timeslice_cycles |
|-----------------------------------|----------------|------------------|
| Default (unknown) | 128 (default) | 300 000 (default)|
| Many concurrent sleeping actors | 128 | 150 000 |
| High-throughput channel pipeline | 128 | 300 000 |
| Compute-heavy (few allocs) | 32 | 300 000 |
| Strict fairness / many actors | 64 | 150 000 |
| Long-running compute batches | 256 | 600 000 |
**Note on `timeslice_cycles` calibration**: the default was tuned for
100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's 107 µs. On a
4 GHz machine it's 75 µs. If you want a precise target timeslice,
measure your CPU's TSC frequency at startup and set the cycles value
accordingly:
```rust
// Approximate TSC frequency measurement (call once at startup).
fn tsc_hz() -> u64 {
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(std::time::Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    (t1 - t0) * 10 // extrapolate the 100 ms sample to 1 second
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so the sub-MHz part of the frequency is not truncated.
let cycles = tsc_hz() * target_us / 1_000_000;
let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);
```
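
The calibration arithmetic in the note above is easy to check in isolation. The two helpers below are hypothetical (they are not part of smarm's API) and only restate the cycles-to-microseconds conversion.

```rust
/// Cycles needed for a `target_us` microsecond timeslice at `tsc_hz`.
pub fn timeslice_cycles(tsc_hz: u64, target_us: u64) -> u64 {
    // Widen to u128 and multiply before dividing to avoid overflow and
    // truncation of the sub-MHz part of the frequency.
    ((tsc_hz as u128 * target_us as u128) / 1_000_000) as u64
}

/// Wall-clock length in microseconds of a `cycles` timeslice at `tsc_hz`.
pub fn timeslice_us(cycles: u64, tsc_hz: u64) -> f64 {
    cycles as f64 * 1_000_000.0 / tsc_hz as f64
}
```

For the default of 300 000 cycles this gives 100 µs at 3 GHz, about 107 µs at 2.8 GHz, and 75 µs at 4 GHz, in line with the figures quoted above.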
---
## Architecture recommendations
### Use actor pools, not per-request actors
```rust
// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    spawn(move || {
        while let Ok(req) = rx.recv() {
            handle(req);
        }
    });
}
for req in requests {
    tx.send(req).unwrap();
}
```
The worker pool pattern amortises the 64 KiB mmap cost over the
lifetime of the pool. The `chained_spawn` bench shows this cost is
real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
### Supervision for fault isolation
smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
actor panics. Use `spawn_under` to register a supervisor channel and
build restart logic:
```rust
let (sup_tx, sup_rx) = channel::<smarm::Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        Ok(Signal::Exit(_)) => break,
        Err(_) => break,
    }
}
```
This pattern has essentially zero overhead compared to unmonitored
spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
faster than tokio's abort/recover cycle.
### Explicit preemption in no-alloc hot loops
The allocator-driven preemption mechanism fires every `alloc_interval`
allocations. Code that never allocates (tight numeric loops, parsing
fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
at the natural loop boundary:
```rust
for chunk in data.chunks(4096) {
    process(chunk);  // no allocations
    smarm::check!(); // yield if timeslice expired
}
```
This is explicitly called out in `LOOM.md` as a known limitation.
The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
smarm is 1.25× slower than tokio even with explicit yields, which sets
the floor on how much `check!()` can help in truly tight loops.
### IO-bound work
smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
the actor without blocking the OS scheduler thread. This is correct and
works well. There is no specific bench for IO-bound workloads in the
current suite, but the architecture is sound for network servers and
file-IO pipelines.
---
## Known limitations and roadmap items
These are from `LOOM.md` plus observations from the bench suite.
| Limitation | Impact | Roadmap status |
|-------------------------------|--------------------|--------------------|
| No stack size caching / slab | High spawn cost | Deferred |
| Global single min-heap timers | Poor at many timers| Deferred (hierarch. wheel) |
| Global `Mutex<RunQueue>` | Lock contention | Deferred (per-thread queues) |
| No `join!()` macro | Ergonomics | Deferred |
| x86-64 Linux only | Portability | ARM64 deferred |
| No restart intensity caps | Supervision safety | Deferred |
| Yield overhead under lock | Hot-loop fairness | Structural / ongoing |
The yield overhead and global mutex are the two issues most likely to
matter on a real multi-core workload. The sweep confirmed that
`timeslice_cycles` is a meaningful knob for controlling the mutex
hold time; the right long-term fix is per-thread run queues with
work stealing.
---
## Running the bench suite
```sh
# Run all benches once, print results
python3 benches/sweep.py run
# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline
# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress
# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep
# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv
# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
cargo bench --bench general
```
The regression threshold is 10% and is configurable in `sweep.py`
(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
same file.
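
The regression rule itself is a one-line comparison. The sketch below restates it in Rust for illustration; it is not the code in `sweep.py`, and the function names are assumptions.

```rust
/// Percentage by which `current_us` is slower than `baseline_us`
/// (negative when the current run is faster).
pub fn slowdown_pct(baseline_us: f64, current_us: f64) -> f64 {
    (current_us - baseline_us) / baseline_us * 100.0
}

/// True when the slowdown exceeds the threshold, e.g. 10.0 for 10%.
pub fn is_regression(baseline_us: f64, current_us: f64, threshold_pct: f64) -> bool {
    slowdown_pct(baseline_us, current_us) > threshold_pct
}
```

For example, a bench with a 100 µs baseline that now takes 111 µs counts as a regression at the 10% threshold, while 105 µs does not.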

177
docs/benchmarks.md Normal file

@@ -0,0 +1,177 @@
# Benchmarks
Regression-test and tuning reference for smarm vs tokio.
## Running
```sh
cargo bench --bench primes # original compute bench
cargo bench --bench multi_scheduler # original 3-workload bench
cargo bench --bench general # benches 1–4
cargo bench --bench tokio_favored # benches 5–8
cargo bench --bench smarm_favored # benches 9–12
```
Each bench runs one warmup iteration (discarded) and 15 measured iterations.
Results are reported as median / min / max in microseconds. Median is the
headline number; the spread between min and max indicates measurement
stability.
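
The reported statistics can be computed from the measured iterations as follows. This is a sketch of the summary step only, with hypothetical names (`Summary`, `summarize`); the real harness may differ in details such as how it handles an even sample count.

```rust
#[derive(Debug, PartialEq, Eq)]
pub struct Summary {
    pub median: u64,
    pub min: u64,
    pub max: u64,
}

/// Summarise per-iteration timings in microseconds (warmup already removed).
pub fn summarize(samples_us: &[u64]) -> Option<Summary> {
    if samples_us.is_empty() {
        return None;
    }
    let mut sorted = samples_us.to_vec();
    sorted.sort_unstable();
    let n = sorted.len();
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        // Midpoint of the two central samples, written to avoid overflow.
        let (lo, hi) = (sorted[n / 2 - 1], sorted[n / 2]);
        lo + (hi - lo) / 2
    };
    Some(Summary { median, min: sorted[0], max: sorted[n - 1] })
}
```

With 15 measured iterations the median is simply the 8th sample in sorted order, so a single outlier moves `max` but leaves the headline number untouched.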
## Methodology notes
- The harness times wall-clock elapsed for the full workload, including
runtime startup and shutdown. For multi-thread runtimes this means worker
thread spawn cost is included; on short-lived benches this can dominate.
Where startup matters, the bench is structured so the workload is much
longer than typical startup.
- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
comparison and `new_multi_thread().worker_threads(N)` for parallel.
`smarm::runtime::Config::exact(N)` is the equivalent knob.
- mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel
semantics. Bounded comparisons would need a separate suite.
- Random delays in `many_timers` use a deterministic mixing function of the
actor index so iterations are reproducible.
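
A deterministic index-to-delay mapping of the kind mentioned in the last note could look like the sketch below. The choice of the SplitMix64 finalizer as the mixing function and the `delay_ms` helper are assumptions for illustration; the bench's actual mixer is not shown in this document.

```rust
/// SplitMix64 output function: a fixed, well-mixed hash of the input.
pub fn mix64(mut z: u64) -> u64 {
    z = z.wrapping_add(0x9E37_79B9_7F4A_7C15);
    z = (z ^ (z >> 30)).wrapping_mul(0xBF58_476D_1CE4_E5B9);
    z = (z ^ (z >> 27)).wrapping_mul(0x94D0_49BB_1331_11EB);
    z ^ (z >> 31)
}

/// Pseudo-random but reproducible delay in `min_ms..=max_ms` for an actor.
pub fn delay_ms(actor_index: u64, min_ms: u64, max_ms: u64) -> u64 {
    assert!(min_ms <= max_ms, "empty delay range");
    let span = max_ms - min_ms + 1;
    min_ms + mix64(actor_index) % span
}
```

Because the delay depends only on the actor index, every iteration and every runtime under test sees the same schedule of wakeups, with no RNG state to seed.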
## Bench catalog
### General — neither runtime structurally favored
| # | Bench | Stresses | Prediction |
|---|---------------------|-------------------------------------------------|--------------------|
| 1 | `chained_spawn` | Spawn + exit overhead in a serial chain | Roughly even |
| 2 | `yield_many` | Pure scheduling throughput, explicit yields | Roughly even |
| 3 | `fan_out_compute` | CPU-bound parallel work, minimal coordination | Even (compute-bound) |
| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency | Roughly even |
A regression here means a real change in per-task or per-yield cost — those
should be investigated regardless of which runtime got slower.
### Tokio-favored — measures cost of smarm's design choices
| # | Bench | Stresses | Why tokio should win |
|---|-------------------------|-------------------------------------------------------|-----------------------------------------------------------------------------------|
| 5 | `spawn_storm_busy` | 8 background yielders + 10k zero-work spawns | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
| 6 | `mpsc_contention` | 32 producers × 10k msgs → 1 consumer | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
| 7 | `many_timers` | 10k actors sleeping 1–10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap |
| 8 | `multi_thread_scaling` | Primes, sweep thread count 1, 2, 4, available | Tokio scales near-linearly; smarm hits its mutex ceiling |
A regression here means a smarm design choice got more expensive. Widening
gaps signal something to investigate; narrowing gaps after a tuning change is
the desired direction.
### Smarm-favored — measures payoff of green-thread + stackful design
| # | Bench | Stresses | Why smarm should win |
|----|------------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------|
| 9 | `deep_recursion` | Actor recurses 1000 deep, returns | Native stack growth vs tokio's per-level `Box::pin` |
| 10 | `yield_in_hot_loop` | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
| 11 | `uncontended_channel` | 1→1, 1M msgs, single thread | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
| 12 | `catch_unwind_panics` | 10k spawns, 50% panic | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
A regression here means we lost some of smarm's structural advantage. #12 is
exploratory — if the baseline shows no real gap, drop it.
## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
> Sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt,
> tokio mt) are equivalent to 1-thread; scaling sweep is limited to 1 thread.
> Label duplication in bench output ("smarm 1-thread" appearing twice) is
> because available_parallelism() == 1, so the N-thread variant is identical.
| Bench | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
|---------------------|----------|----------|----------|----------|-------|
| chained_spawn | 7136 | 6979 | 113 | 176 | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
| yield_many | 40079 | 40073 | 14571 | 14044 | smarm ~2.8x slower; scheduling overhead real |
| fan_out_compute | 19347 | 19461 | 18616 | 18905 | roughly even; compute-bound as expected |
| ping_pong_oneshot | 13731 | 14176 | 828 | 3342 | smarm ~17x slower; per-round spawn+join cost high |
| spawn_storm_busy | 105512 | 107113 | 2222 | 4546 | smarm ~47x slower; global mutex under 8 bg yielders |
| mpsc_contention | 10456 | 10395 | 17348 | 18628 | smarm wins; uncontended mutex essentially free on 1-thread |
| many_timers | 120242 | 121023 | 13581 | 14266 | smarm ~9x slower; single min-heap vs sharded wheel |
| multi_thread_scaling | | | | | see thread-count sweep below |
| deep_recursion | 62 | 71 | 22 | 44 | tokio wins unexpectedly; see sanity-check notes |
| yield_in_hot_loop | 182177 | — | 138335 | — | tokio wins; smarm prediction wrong; see notes |
| uncontended_channel | 31473 | — | 51925 | — | smarm wins as predicted; ~1.65x |
| catch_unwind_panics | 112306 | 114305 | 151443 | 161344 | smarm wins as predicted; ~1.35x |
### `multi_thread_scaling` thread-count sweep (median µs)
> Sandbox has 1 logical CPU; only 1-thread row is available.
| Threads | smarm | tokio mt |
|---------|-------|----------|
| 1 | 19852 | 19638 |
| 2 | — | — |
| 4 | — | — |
| N (avail=1) | 19852 | 19638 |
## Tuning experiments
### Reduction-budget sweep
`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
the actor checks RDTSC against its timeslice start and yields if over budget.
The Nth-allocation threshold (the "reduction budget") and the timeslice
duration are the two knobs.
Record each experiment as a row below. Reference the commit or the parameter
values explicitly.
| Date | Configuration | Bench (or "all") | Result vs baseline | Notes |
|------|----------------------------|----------------------|------------------------------|-------|
| | baseline | all | — | |
| | budget=…, timeslice=… | | | |
| | | | | |
When the gap on tokio-favored benches narrows without regressing
smarm-favored benches, the change is a keeper. If a budget change improves
one workload but regresses another by more, prefer keeping the broader-impact
configuration unless we have a clear use case for the trade-off.
## Sanity-check notes (baseline run)
### Compile fixes applied
Two bench files had a type error: `smarm::Runtime::run()` takes
`impl FnOnce() + Send + 'static` (returns `()`), but the consumer closures
in `bench_mpsc_smarm` (tokio_favored.rs) and `bench_unc_smarm`
(smarm_favored.rs) returned `u64` via a bare `count` tail expression. Fixed
by changing the tail to `let _ = count;` in both closures, and the
corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
No workload semantics changed.
### Single-CPU sandbox caveat
`available_parallelism()` returns 1, so every "N-thread" variant is identical
to "1-thread". Multi-thread results should not be used to draw scaling
conclusions; re-run on a multi-core machine before committing to the tuning
sweep.
### Predicted-winner mismatches
**`deep_recursion` — tokio wins (22 µs) over smarm (62 µs).**
At depth 500, smarm spawns a fresh actor which requires mmap'ing a 64 KiB
stack; that allocation cost dominates the actual recursion. Tokio's
Box::pin recursion allocates 500 small heap objects but avoids the mmap.
The prediction assumed stack allocation was amortised across many uses; here
the actor is single-use. Not a bug, but the bench may not exercise the
intended advantage.
**`yield_in_hot_loop` — tokio wins (138 ms) over smarm (182 ms).**
The prediction was that smarm's ~6-GPR naked context switch would beat
tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
tokio's current_thread scheduler has very low overhead per yield_now, while
smarm's yield_now still goes through the runtime mutex and run-queue even on
a single thread. This is a meaningful data point: smarm's scheduling overhead
is not as low as the assembly switch cost alone suggests.
### Noise / spread
- `catch_unwind_panics` smarm spread is reasonable (~10% min/max).
- `spawn_storm_busy` tokio multi-thread has notable spread (3833–7305 µs);
consistent with tokio issue #3829 noted in task spec.
- `many_timers` smarm spread acceptable (~10%).
### Result-column equivalence
All result columns match between runtimes for every bench (same prime counts,
same message totals, same task counts). Workloads are equivalent.

1297
docs/smarm - Deep Dive.html Normal file

File diff suppressed because it is too large