Update the documentation
217
docs/Architecture.md
Normal file
@@ -0,0 +1,217 @@
|
||||
# SMARM Architecture
|
||||
|
||||
> Erlang-style actor concurrency for Rust, without the copies, the colors, or the GC pauses.
|
||||
|
||||
---
|
||||
|
||||
## Vision
|
||||
|
||||
Rust gives you the right ownership discipline for safe actor concurrency almost for free — `Send` already
|
||||
draws the boundary, the borrow checker already enforces it. What it lacks is an execution model to match:
|
||||
async/await is IO-centric, colors your functions, and trades stack simplicity for state-machine complexity;
|
||||
OS threads are too heavy to spawn per actor.
|
||||
|
||||
SMARM adds a third option: **green-thread actors on a shared heap**, scheduled cooperatively, with
|
||||
message-passing as the only cross-actor communication primitive. You get Erlang's isolation model without
|
||||
Erlang's copying GC, and you get Rust's zero-copy ownership transfers without async's cognitive overhead.
|
||||
No function coloring. No `Box<dyn Future>`. Just actors, messages, and the borrow checker doing what it
|
||||
already does.
|
||||
|
||||
---
|
||||
|
||||
## Do: Core Runtime
|
||||
|
||||
### Actors and scheduling
|
||||
|
||||
Each actor is a lightweight green thread with its own heap-allocated, growable stack. Stacks are
|
||||
allocated via `mmap` with a guard page below the region; overflow is detected by the OS without SMARM
|
||||
polling for it. Initial stacks are small and grow by remapping on demand.
|
||||
|
||||
The scheduler runs one OS thread per CPU. Each scheduler thread pulls runnable actors from a single
global `Mutex<HashMap>` queue shared across all schedulers. If queue contention becomes a measured
bottleneck this can be revisited; the interface will not change.
|
||||
|
||||
SMARM requires `panic = "unwind"`. Users who set `panic = "abort"` accept that supervision and actor
isolation are silently degraded to process death.
|
||||
|
||||
### Process descriptor
|
||||
|
||||
Each actor has a descriptor that is hot while the actor runs and will typically live in L1 cache.
|
||||
It holds:
|
||||
|
||||
- `stack_base: *mut u8` — bottom of the allocated stack region
|
||||
- `stack_cap: usize` — total allocated size
|
||||
- `stack_ptr: *mut u8` — current stack pointer (`rsp`), saved on yield
|
||||
- `pid: (u32, u32)` — index and generation counter (see PIDs below)
|
||||
- `alloc_count: u32` — countdown for preemption sampling
|
||||
- `timeslice_start: u64` — `RDTSC` value written on every resume
|
||||
- `resize_count: u16` — diagnostic counter for stack growth events
|
||||
- `context: *mut ContextSaveArea` — pointer to the register save area (cold, touched only on switch)
|
||||
|
||||
### Context switching
|
||||
|
||||
Context switching is implemented in a `#[naked]` assembly shim, one per supported architecture.
|
||||
The compiler cannot be asked to switch stacks.
|
||||
|
||||
**Suspend** (yield, preemption, or blocking):
|
||||
1. Save callee-saved integer registers and SIMD registers into `ContextSaveArea`.
|
||||
2. Save `rsp`/`sp` into the process descriptor.
|
||||
3. Load the scheduler's stack pointer from a thread-local and jump back into the scheduler loop.
|
||||
|
||||
**Resume**:
|
||||
1. Load `rsp`/`sp` from the process descriptor.
|
||||
2. Restore registers from `ContextSaveArea`.
|
||||
3. `ret`. On x86-64 the return address is already on the restored stack (on ARM64 it is in the
   restored `x30`), so execution resumes exactly where the actor yielded.
|
||||
|
||||
**x86-64**: saves `rbx`, `rbp`, `r12`–`r15` (6 × 8 = 48 bytes) and `xmm0`–`xmm15` (16 × 16 = 256
|
||||
bytes) = 304 bytes total. Full SSE baseline is required; the compiler may autovectorise freely.
|
||||
AVX-512 is deferred.
|
||||
|
||||
**ARM64**: saves `x19`–`x30` (12 × 8 = 96 bytes, including the link register `x30` which must be
|
||||
saved explicitly — it holds the return address, unlike x86 where `call` pushes it to the stack) and
|
||||
`d8`–`d15` (8 × 8 = 64 bytes) = 160 bytes total.
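The byte counts above can be checked mechanically. The following is a minimal sketch, assuming a `#[repr(C)]` layout; the struct names, field names, and the `save_area_sizes` helper are illustrative and not taken from the SMARM source.

```rust
use std::mem::size_of;

/// Hypothetical x86-64 save area: six callee-saved GPRs plus xmm0..xmm15.
#[repr(C)]
pub struct X86SaveArea {
    pub gprs: [u64; 6],      // rbx, rbp, r12, r13, r14, r15
    pub xmm: [[u8; 16]; 16], // xmm0..xmm15, 16 bytes each
}

/// Hypothetical ARM64 save area: x19..x30 plus d8..d15.
#[repr(C)]
pub struct Arm64SaveArea {
    pub gprs: [u64; 12], // x19..x30 (x30 is the link register)
    pub fprs: [u64; 8],  // d8..d15, 8 bytes each
}

// Compile-time checks: the build fails if the sizes drift from the text.
const _: () = assert!(size_of::<X86SaveArea>() == 6 * 8 + 16 * 16);
const _: () = assert!(size_of::<Arm64SaveArea>() == 12 * 8 + 8 * 8);

/// Returns (x86-64 size, ARM64 size) in bytes.
pub fn save_area_sizes() -> (usize, usize) {
    (size_of::<X86SaveArea>(), size_of::<Arm64SaveArea>())
}
```

Pinning the sizes with `const` assertions means a later extension of the save area (AVX-512, for instance) has to update the documented totals at the same time.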
|
||||
|
||||
Each actor owns one `ContextSaveArea`, allocated as a `Box<ContextSaveArea>`. Its lifetime equals
the actor's lifetime, so there is no churn and no bulk deallocation; a plain `Box` is sufficient.
|
||||
|
||||
Initial platform target is x86-64 Linux. ARM64 and macOS are natural follow-ons.
|
||||
|
||||
### Allocator-driven preemption
|
||||
|
||||
Every Nth allocation, the allocator reads `RDTSC` and compares it against `timeslice_start`. If the
|
||||
threshold is exceeded the actor yields. The workloads that starve a scheduler — sustained compute,
|
||||
data transformation — are precisely the ones doing frequent allocations, so this approximation is
|
||||
correct by construction.
|
||||
|
||||
`RDTSC` is not monotonic across core migration; a slightly wrong timeslice is acceptable. SMARM is
|
||||
not a real-time scheduler.
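The check described in this section can be sketched as a countdown that only reads the clock on every Nth allocation. This is an illustration under assumptions: the `Preempt` type and its method names are hypothetical, and the clock is passed in as a closure (standing in for `RDTSC`) so the logic can be exercised without hardware access.

```rust
/// Hypothetical per-actor preemption state, mirroring the descriptor
/// fields `alloc_count` and `timeslice_start`.
pub struct Preempt {
    alloc_interval: u32,   // N: read the clock every Nth allocation
    timeslice_cycles: u64, // budget in clock ticks
    alloc_count: u32,      // countdown to the next clock read
    timeslice_start: u64,  // clock value at the last resume
}

impl Preempt {
    pub fn new(alloc_interval: u32, timeslice_cycles: u64, now: u64) -> Self {
        // An interval of 0 would underflow the countdown, so clamp to 1.
        let alloc_interval = alloc_interval.max(1);
        Preempt {
            alloc_interval,
            timeslice_cycles,
            alloc_count: alloc_interval,
            timeslice_start: now,
        }
    }

    /// Called on every resume: restart the timeslice and the countdown.
    pub fn on_resume(&mut self, now: u64) {
        self.timeslice_start = now;
        self.alloc_count = self.alloc_interval;
    }

    /// Called on every allocation. `read_clock` runs only on every Nth call.
    /// Returns true when the actor has exceeded its budget and should yield.
    pub fn on_alloc(&mut self, read_clock: impl FnOnce() -> u64) -> bool {
        self.alloc_count -= 1;
        if self.alloc_count > 0 {
            return false; // cheap path: no clock read
        }
        self.alloc_count = self.alloc_interval;
        // wrapping_sub tolerates a clock that appears to step backwards.
        read_clock().wrapping_sub(self.timeslice_start) > self.timeslice_cycles
    }
}
```

The cheap path is a decrement and a branch, which is why the clock read can be amortised over N allocations.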
|
||||
|
||||
Known failure mode: tight no-alloc loops are invisible to this mechanism. Actors doing sustained
|
||||
allocation-free compute must call `smarm::yield_now()` explicitly, or offload to a thread pool
|
||||
outside the actor scheduler (e.g. rayon). This is documented and acceptable — such loops are rare
|
||||
in message-passing workloads.
|
||||
|
||||
### Yield points
|
||||
|
||||
An actor yields at:
|
||||
|
||||
- **Channel send/recv** — the primary communication primitive
|
||||
- **Mutex contention** — attempting to lock a held `Arc<Mutex<>>` parks the actor
|
||||
- **IO** — blocking on a socket or file descriptor parks the actor until the IO thread signals readiness
|
||||
- **`smarm::sleep(duration)`** — parks the actor; the timer wheel re-queues it on expiry
|
||||
- **`smarm::yield_now()`** — explicit cooperative yield
|
||||
- **Allocator preemption** — as above
|
||||
- **Spawn** — does not yield by default; the new actor is queued and the spawner continues
|
||||
|
||||
`std::thread::sleep` inside an actor blocks the entire OS thread and should never be used. SMARM
|
||||
may emit a warning if it can detect this.
|
||||
|
||||
### IO thread
|
||||
|
||||
A single dedicated IO thread runs an `epoll`/`kqueue` loop. Actors blocking on IO register their
|
||||
file descriptor and PID; the IO thread moves them back into the global queue when the fd is ready.
|
||||
A `HashMap<RawFd, Pid>` maps fds to parked actors. Cancellation (actor dies while waiting on IO)
|
||||
deregisters the fd. This is intentionally simple and not pluggable; SMARM is not a general async
|
||||
executor.
|
||||
|
||||
### Communication
|
||||
|
||||
Messages must be `Send`. `Copy` alone is not enough: a raw pointer is `Copy` but not `Send`.
Non-`Send` types cannot cross an actor boundary; this is enforced by the type system with no
runtime overhead.
|
||||
|
||||
Two primitives only:
|
||||
|
||||
- **Move** — transfer owned data across a channel. Zero copy. The sender relinquishes ownership
|
||||
at the type level. This is the default.
|
||||
- **`Arc<Mutex<T>>`** — for genuinely shared long-lived state. Explicit and visible.
|
||||
|
||||
Cross-actor `Rc` or bare pointers are banned. There is no cycle detector. Cross-actor cycles are
|
||||
banned by construction: either transfer ownership or use `Arc`.
|
||||
|
||||
### PIDs
|
||||
|
||||
A PID is a `(index, generation)` pair. The index may be reused after an actor dies; the generation
|
||||
counter increments on every death. A stale handle holding the wrong generation is a detectable
|
||||
error, not a silent misdirection. This avoids the ABA problem without reserving PID space forever.
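A generation-checked slot table is one way to get this behaviour. The sketch below is illustrative only: `Pid`, `PidTable`, and their methods are hypothetical names, not the SMARM API.

```rust
/// Hypothetical PID: slot index plus the generation it was issued under.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Slot table that reuses indices but rejects stale handles.
pub struct PidTable<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> PidTable<T> {
    pub fn new() -> Self {
        PidTable { slots: Vec::new(), free: Vec::new() }
    }

    /// Stores `value`, reusing a freed index when one is available.
    pub fn insert(&mut self, value: T) -> Pid {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Pid { index, generation: slot.generation }
        } else {
            let index = self.slots.len() as u32;
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Pid { index, generation: 0 }
        }
    }

    /// Returns the entry only if the handle's generation is current.
    pub fn get(&self, pid: Pid) -> Option<&T> {
        let slot = self.slots.get(pid.index as usize)?;
        if slot.generation == pid.generation {
            slot.value.as_ref()
        } else {
            None
        }
    }

    /// Removes the entry and bumps the generation so old handles go stale.
    pub fn remove(&mut self, pid: Pid) -> Option<T> {
        let slot = self.slots.get_mut(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None;
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(pid.index);
        Some(value)
    }
}
```

A lookup with a stale handle returns `None` instead of reaching whichever actor now occupies the reused index.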
|
||||
|
||||
### Supervision
|
||||
|
||||
Every actor has a supervisor, assigned at spawn. This is not optional. The root supervisor is
|
||||
provided by the runtime; its death is a process exit.
|
||||
|
||||
A supervisor receives one of three signals when a child actor terminates:
|
||||
|
||||
- `Signal::Exit(pid)` — normal completion
|
||||
- `Signal::Panic(pid, payload)` — caught via `catch_unwind` at the actor entry point boundary,
|
||||
before unwinding can reach the assembly shim
|
||||
- `Signal::Timeout(pid)` — actor exceeded a budget (see below)
|
||||
|
||||
The supervisor decides: restart the actor, escalate to its own supervisor, or ignore. Restart
|
||||
intensity is capped: if an actor panics more than N times within a time window, the supervisor
|
||||
stops restarting and escalates. This prevents a bad prelude or corrupted input from spinning the
|
||||
supervisor in a restart loop indefinitely. N and the window are configurable per supervisor with a
|
||||
sensible global default.
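The restart cap can be sketched as a sliding window over recent panic timestamps. This is a minimal illustration, assuming the rule "more than N panics inside the window escalates"; `RestartLimiter` and `Decision` are hypothetical names.

```rust
use std::collections::VecDeque;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    Escalate,
}

/// Hypothetical restart-intensity limiter for one supervised child.
pub struct RestartLimiter {
    max_restarts: usize,        // N
    window: Duration,           // sliding time window
    recent: VecDeque<Instant>,  // panic timestamps still inside the window
}

impl RestartLimiter {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        RestartLimiter { max_restarts, window, recent: VecDeque::new() }
    }

    /// Records a panic observed at `now` and decides what to do next.
    pub fn on_panic(&mut self, now: Instant) -> Decision {
        // Forget panics that have fallen out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) > self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.push_back(now);
        if self.recent.len() > self.max_restarts {
            Decision::Escalate
        } else {
            Decision::Restart
        }
    }
}
```

Passing `now` in explicitly keeps the limiter deterministic under test; a real supervisor would presumably supply the runtime's own clock.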
|
||||
|
||||
### Mutex timeout
|
||||
|
||||
Every `smarm::mutex` lock attempt is mediated by the scheduler. If the lock is not acquired within
|
||||
a configurable timeout, the actor receives a `LockTimeout` error rather than parking forever. This
|
||||
is a hard runtime guarantee, not a convention. Default timeout is global and configurable;
|
||||
individual locks and individual call sites can override it.
|
||||
|
||||
### Task joining
|
||||
|
||||
Actors can spawn children and wait on a group of handles:
|
||||
|
||||
```rust
let h1 = smarm::spawn(|| compute_a());
let h2 = smarm::spawn(|| compute_b());
let (a, b) = smarm::join!(h1, h2);
```
|
||||
|
||||
`join!` parks the calling actor until all handles complete. The last child to finish re-queues the
|
||||
parent. This is a countdown in the parent's descriptor; no polling, no waker registration. A
|
||||
`join_timeout!` variant is a natural extension.
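The countdown can be illustrated with an atomic counter: exactly one child observes the transition to zero, and that child is the one that re-queues the parent. The sketch below uses OS threads as stand-ins for actors; `JoinCountdown` and `count_wakeups` are hypothetical names.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical join state shared between a parked parent and its children.
pub struct JoinCountdown {
    remaining: AtomicUsize,
}

impl JoinCountdown {
    pub fn new(children: usize) -> Arc<Self> {
        Arc::new(JoinCountdown { remaining: AtomicUsize::new(children) })
    }

    /// Called by each child on completion. Returns true for exactly one
    /// caller: the child that finished last and must re-queue the parent.
    pub fn child_done(&self) -> bool {
        self.remaining.fetch_sub(1, Ordering::AcqRel) == 1
    }
}

/// Runs `children` OS threads (standing in for actors) and counts how many
/// of them saw themselves as the last to finish.
pub fn count_wakeups(children: usize) -> usize {
    let countdown = JoinCountdown::new(children);
    let wakeups = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..children)
        .map(|_| {
            let countdown = Arc::clone(&countdown);
            let wakeups = Arc::clone(&wakeups);
            std::thread::spawn(move || {
                if countdown.child_done() {
                    // In a runtime this is where the parent would be re-queued.
                    wakeups.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
    wakeups.load(Ordering::Relaxed)
}
```

Because `fetch_sub` is atomic, the parent is woken once regardless of the order in which children finish.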
|
||||
|
||||
### Timer wheel
|
||||
|
||||
`smarm::sleep` and supervision timeouts are driven by a timer wheel in the scheduler. Sleeping
|
||||
actors are parked and re-queued by the timer thread on expiry. The timer wheel is internal
|
||||
infrastructure; its design is an implementation detail.
|
||||
|
||||
---
|
||||
|
||||
## Defer: Later Work
|
||||
|
||||
- **Stack sizing policy** — initial size, growth factor, and whether stacks ever shrink are
|
||||
implementation decisions to be made with profiling data, not up front.
|
||||
- **Queue contention** — if `Mutex<HashMap>` proves to be a bottleneck under profiling, evaluate
|
||||
`DashMap` or a lock-free work-stealing deque (e.g. `crossbeam-deque`). Not before.
|
||||
- **AVX-512 context save** — extend `ContextSaveArea` when there is a concrete use case.
|
||||
- **`smarm::sleep` vs raw sleep semantics** — further control knobs deferred until the basic sleep
|
||||
is working and real use cases are understood.
|
||||
- **Supervision tree API** — the contract is defined; the recursive hierarchy, restart strategies,
|
||||
and introspection API are implementation work.
|
||||
- **no_std support** — the assembly shim is no_std friendly but the IO thread and allocator require
|
||||
OS primitives. Target is no_std + `alloc` on hosted platforms; bare metal is out of scope.
|
||||
- **Distribution** — SMARM is a single-process runtime. No distribution protocol, no BEAM-style
|
||||
clustering.
|
||||
|
||||
---
|
||||
|
||||
## What SMARM is Not
|
||||
|
||||
- Not a drop-in replacement for Tokio. SMARM does not implement `Future` or the async executor interface.
|
||||
- Not a general allocator. SMARM manages actor stacks; heap allocation for actor data goes through
|
||||
the system allocator.
|
||||
- Not Erlang. No hot code reloading, no distribution protocol, no BEAM bytecode. SMARM is a
|
||||
concurrency runtime, not a platform.
|
||||
- Not a real-time scheduler. Timeslice accuracy is best-effort.
|
||||
|
||||
|
||||
---
|
||||
|
||||
## On names
|
||||
|
||||
<sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>
|
||||
320
docs/BENCHMARKS_AND_TUNING.md
Normal file
@@ -0,0 +1,320 @@
|
||||
# smarm — Benchmarks & Tuning Recommendations
|
||||
|
||||
> Based on bench suite v0.3.0, Intel Xeon @ 2.80 GHz, 1-core sandbox,
|
||||
> kernel 6.18.5, rustc 1.95.0. Multi-core conclusions are extrapolated from
|
||||
> design reasoning and single-core sweep data; re-validate on real hardware.
|
||||
|
||||
---
|
||||
|
||||
## TL;DR
|
||||
|
||||
smarm is competitive with tokio for **channel-heavy, message-passing workloads**
|
||||
and wins outright on **uncontended channels** and **panic/unwind isolation**.
|
||||
It is significantly slower than tokio for **spawn-heavy** patterns and
|
||||
**timer-heavy** workloads. The preemption knobs (`alloc_interval`,
|
||||
`timeslice_cycles`) have minimal effect on single-core machines; they matter
|
||||
on multi-core under scheduler-thread contention.
|
||||
|
||||
---
|
||||
|
||||
## Bench results summary
|
||||
|
||||
All medians in µs. Tokio column is `current_thread` unless noted.
|
||||
|
||||
| Bench                  | smarm  | tokio  | ratio | winner         |
|------------------------|--------|--------|-------|----------------|
| `chained_spawn`        | 8 625  | 124    | 70×   | tokio          |
| `ping_pong_oneshot`    | 16 848 | 879    | 19×   | tokio          |
| `spawn_storm_busy`     | 126 k  | 2 772  | 45×   | tokio          |
| `yield_many`           | 41 622 | 15 085 | 2.8×  | tokio          |
| `yield_in_hot_loop`    | 190 k  | 153 k  | 1.25× | tokio          |
| `many_timers`          | 143 k  | 14 462 | 10×   | tokio          |
| `fan_out_compute`      | 29 727 | 28 503 | 1.04× | **even**       |
| `multi_thread_scaling` | 30 k   | 29 k   | 1.04× | **even**       |
| `deep_recursion`       | 83     | 25     | 3.3×  | tokio          |
| `mpsc_contention`      | 9 062  | 17 570 | 0.52× | **smarm** 1.9× |
| `uncontended_channel`  | 27 265 | 51 888 | 0.53× | **smarm** 1.9× |
| `catch_unwind_panics`  | 142 k  | 682 k  | 0.21× | **smarm** 4.8× |
|
||||
|
||||
---
|
||||
|
||||
## Where smarm wins
|
||||
|
||||
### Uncontended channels (1.9× faster)
|
||||
|
||||
When a single producer sends to a single consumer with no other actors
|
||||
competing for the queue, smarm's channel is meaningfully faster than
|
||||
tokio's. This is the core use case smarm is designed for: pipelines of
|
||||
actors passing owned data along a chain.
|
||||
|
||||
**Recommendation**: smarm is a good fit for any architecture where data
|
||||
flows through a chain of stages, each stage is an actor, and the
|
||||
channel between stages is the primary synchronisation point.
|
||||
|
||||
### Uncontended MPSC (1.9× faster, same reason)
|
||||
|
||||
Multi-producer single-consumer works well for the same reason. On a
|
||||
single-thread runtime, smarm's mutex is uncontended, so the lock is
|
||||
essentially free. On multi-core this advantage will shrink; re-measure.
|
||||
|
||||
### Panic isolation (4.8× faster recovery)
|
||||
|
||||
`catch_unwind_panics` creates 10 000 actors that each panic. smarm
|
||||
recovers and delivers `Signal::Panic` to the supervisor 4.8× faster
|
||||
than tokio. This matters if you're building a system that uses panics
|
||||
as a fast abort path for malformed input or actor-level faults, or if
|
||||
you're using supervision trees seriously.
|
||||
|
||||
**Recommendation**: if your system expects panics to be a normal
|
||||
operational event (not just bugs), smarm's supervision story is a
|
||||
genuine advantage over tokio's task abort model.
|
||||
|
||||
---
|
||||
|
||||
## Where smarm loses, and why
|
||||
|
||||
### Spawn-heavy workloads (19–70×)
|
||||
|
||||
Every smarm actor `mmap`s a 64 KiB stack with a guard page. This is
|
||||
a syscall. Tokio tasks are heap-allocated state machines — no stack,
|
||||
no syscall, ~100 bytes each. For workloads that spawn thousands of
|
||||
short-lived actors per second, this is a structural disadvantage.
|
||||
|
||||
**Recommendations**:
|
||||
- Avoid spawning actors for work that completes in microseconds.
|
||||
Use a worker-pool pattern: spawn N long-lived actors at startup,
|
||||
distribute work over channels.
|
||||
- If you genuinely need high-frequency short-lived actors, the stack
|
||||
allocation cost is a known roadmap item (stack caching, slab alloc).
|
||||
It is not an inherent design flaw — just not implemented yet.
|
||||
- `deep_recursion` shows the same problem at depth 500: smarm spawns
|
||||
a fresh actor per level, paying the mmap cost repeatedly. Recursive
|
||||
decomposition should use explicit stacks or iteration inside a single
|
||||
actor, not actor-per-level spawning.
|
||||
|
||||
### Timer-heavy workloads (10×)
|
||||
|
||||
smarm uses a global min-heap of `(deadline, Pid)` pairs behind the
|
||||
shared mutex. Tokio uses a sharded hierarchical timer wheel. With
|
||||
10 000 pending timers, smarm's O(log N) heap under lock is
|
||||
dramatically slower.
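For reference, a min-heap of `(deadline, Pid)` pairs can be sketched with the standard library's `BinaryHeap`. This is an illustration rather than smarm's code: `TimerHeap` is a hypothetical name, deadlines are plain `u64` ticks, and the shared mutex mentioned above is left out.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Stand-in for a PID: (index, generation).
pub type Pid = (u32, u32);

/// Hypothetical global timer queue ordered by earliest deadline.
pub struct TimerHeap {
    // `Reverse` turns the max-heap into a min-heap on (deadline, pid).
    heap: BinaryHeap<Reverse<(u64, Pid)>>,
}

impl TimerHeap {
    pub fn new() -> Self {
        TimerHeap { heap: BinaryHeap::new() }
    }

    /// O(log N) insertion of a sleeping actor.
    pub fn insert(&mut self, deadline: u64, pid: Pid) {
        self.heap.push(Reverse((deadline, pid)));
    }

    /// Pops every timer whose deadline is <= `now`, earliest first.
    /// Each pop is O(log N).
    pub fn expire(&mut self, now: u64) -> Vec<Pid> {
        let mut ready = Vec::new();
        while let Some(&Reverse((deadline, pid))) = self.heap.peek() {
            if deadline > now {
                break;
            }
            self.heap.pop();
            ready.push(pid);
        }
        ready
    }
}
```

With every insert and expiry serialised behind one lock, the per-operation O(log N) cost is paid while other scheduler threads wait, which is the effect the bench measures.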
|
||||
|
||||
**Recommendations**:
|
||||
- Do not use smarm `sleep()` in tight loops with many concurrent
|
||||
sleeping actors if timing precision matters.
|
||||
- For IO timeouts: prefer a single timer actor that manages a priority
|
||||
queue and fans out wakeups over channels, rather than 1 000 actors
|
||||
each sleeping directly.
|
||||
- The hierarchical timer wheel is listed in `LOOM.md` deferred work.
|
||||
It is the correct fix if timer performance becomes a bottleneck.
|
||||
|
||||
### Yield overhead (2.8× in `yield_many`, 1.25× in `yield_in_hot_loop`)
|
||||
|
||||
Every `yield_now()` goes through the runtime mutex and run queue even
|
||||
on a single-thread scheduler. Tokio's current_thread scheduler handles
|
||||
yields with much lower overhead. smarm's naked context-switch is fast,
|
||||
but the lock acquisition around it dominates for high-frequency yields.
|
||||
|
||||
**Recommendation**: minimise explicit `yield_now()` calls in hot paths.
|
||||
In message-passing workloads this is natural — yield happens at
|
||||
`recv()` and `send()`, which is appropriate. If you are using
|
||||
`yield_now()` in a tight loop, consider whether the actor should
|
||||
instead be blocking on a channel or sleeping.
|
||||
|
||||
---
|
||||
|
||||
## Preemption knob recommendations
|
||||
|
||||
The knobs are `Config::alloc_interval(n)` and `Config::timeslice_cycles(c)`.
|
||||
Default: `alloc_interval = 128`, `timeslice_cycles = 300_000` (≈100 µs at 3 GHz).
|
||||
|
||||
### Findings from the sweep
|
||||
|
||||
The sweep varied alloc_interval in `{32, 64, 128, 256, 512}` and
|
||||
timeslice_cycles in `{150k, 300k, 600k, 1200k}` — 10 points total.
|
||||
|
||||
On a single-CPU machine the knobs are almost inert: most benches move
|
||||
< 5% across the entire grid. The exceptions are meaningful:
|
||||
|
||||
**Longer timeslices hurt under contention.** At `tc=600k` and `tc=1200k`:
|
||||
|
||||
- `spawn_storm_busy` slows down by 11–15%
- `catch_unwind_panics` slows down by 10–12%
|
||||
|
||||
The cause: 8 background yielder actors hold the scheduler mutex longer
|
||||
per timeslice, delaying the 10 000 actors waiting to be joined. A
|
||||
longer timeslice amplifies the global-mutex bottleneck.
|
||||
|
||||
**Shorter timeslices marginally help timer-heavy work.** At `tc=150k`,
|
||||
`many_timers` improves 3–4%. Actors that are sleeping get rescheduled
|
||||
sooner because the runtime polls the timer heap more frequently.
|
||||
|
||||
**alloc_interval has no clear winner.** Moving from 32 to 512 causes
|
||||
< 3% variation on every bench. The check frequency is not the
|
||||
bottleneck — the lock is.
|
||||
|
||||
### Recommended starting points
|
||||
|
||||
| Workload                         | alloc_interval | timeslice_cycles  |
|----------------------------------|----------------|-------------------|
| Default (unknown)                | 128 (default)  | 300 000 (default) |
| Many concurrent sleeping actors  | 128            | 150 000           |
| High-throughput channel pipeline | 128            | 300 000           |
| Compute-heavy (few allocs)       | 32             | 300 000           |
| Strict fairness / many actors    | 64             | 150 000           |
| Long-running compute batches     | 256            | 600 000           |
|
||||
|
||||
**Note on `timeslice_cycles` calibration**: the default was tuned for
|
||||
≈100 µs on a 3 GHz CPU. On a 2.8 GHz machine that's ≈107 µs. On a
|
||||
4 GHz machine it's ≈75 µs. If you want a precise target timeslice,
|
||||
measure your CPU's TSC frequency at startup and set the cycles value
|
||||
accordingly:
|
||||
|
||||
```rust
// Approximate TSC frequency measurement (call once at startup,
// outside any actor: `std::thread::sleep` blocks the OS thread)
fn tsc_hz() -> u64 {
    let t0 = smarm::preempt::rdtsc();
    std::thread::sleep(std::time::Duration::from_millis(100));
    let t1 = smarm::preempt::rdtsc();
    (t1 - t0) * 10 // extrapolate from 100 ms to 1 second
}

let target_us = 100u64; // desired timeslice in microseconds
// Multiply before dividing so the sub-MHz part of the frequency is not truncated.
let cycles = tsc_hz() * target_us / 1_000_000;

let rt = smarm::runtime::init(
    smarm::runtime::Config::default()
        .timeslice_cycles(cycles)
);
```
|
||||
|
||||
---
|
||||
|
||||
## Architecture recommendations
|
||||
|
||||
### Use actor pools, not per-request actors
|
||||
|
||||
```rust
// Avoid: spawning an actor per request
for req in requests {
    spawn(move || handle(req));
}

// Prefer: fixed pool, channel dispatch
let (tx, rx) = channel();
for _ in 0..num_cpus {
    let rx = rx.clone();
    spawn(move || { while let Ok(req) = rx.recv() { handle(req); } });
}
for req in requests { tx.send(req).unwrap(); }
```
|
||||
|
||||
The worker pool pattern amortises the 64 KiB mmap cost over the
|
||||
lifetime of the pool. The `chained_spawn` bench shows this cost is
|
||||
real: 8 625 µs for 1 000 sequential spawns vs tokio's 124 µs.
|
||||
|
||||
### Supervision for fault isolation
|
||||
|
||||
smarm delivers `Signal::Panic(pid, payload)` to the supervisor when an
|
||||
actor panics. Use `spawn_under` to register a supervisor channel and
|
||||
build restart logic:
|
||||
|
||||
```rust
use smarm::Signal;

let (sup_tx, sup_rx) = channel::<Signal>();
let child = smarm::spawn_under(sup_tx.clone(), move || {
    // ... actor body ...
});

// Supervisor loop
loop {
    match sup_rx.recv() {
        Ok(Signal::Panic(pid, _)) => {
            // restart, escalate, or record
        }
        Ok(Signal::Exit(_)) => break,
        // Any other signal is ignored in this sketch.
        Ok(_) => {}
        // All senders dropped: nothing left to supervise.
        Err(_) => break,
    }
}
```
|
||||
|
||||
This pattern has essentially zero overhead compared to unmonitored
|
||||
spawning, and the `catch_unwind_panics` bench confirms it is 4.8×
|
||||
faster than tokio's abort/recover cycle.
|
||||
|
||||
### Explicit preemption in no-alloc hot loops
|
||||
|
||||
The allocator-driven preemption mechanism fires every `alloc_interval`
|
||||
allocations. Code that never allocates (tight numeric loops, parsing
|
||||
fixed-size buffers) will never yield preemptively. Add `smarm::check!()`
|
||||
at the natural loop boundary:
|
||||
|
||||
```rust
for chunk in data.chunks(4096) {
    process(chunk);  // no allocations
    smarm::check!(); // yield if timeslice expired
}
```
|
||||
|
||||
This is explicitly called out in `LOOM.md` as a known limitation.
|
||||
The `yield_in_hot_loop` bench (1M iterations of `yield_now()`) shows
|
||||
smarm is 1.25× slower than tokio even with explicit yields, which sets
|
||||
the floor on how much `check!()` can help in truly tight loops.
|
||||
|
||||
### IO-bound work
|
||||
|
||||
smarm's IO path (`wait_readable`, `wait_writable`, `block_on_io`) parks
|
||||
the actor without blocking the OS scheduler thread. This is correct and
|
||||
works well. There is no specific bench for IO-bound workloads in the
|
||||
current suite, but the architecture is sound for network servers and
|
||||
file-IO pipelines.
|
||||
|
||||
---
|
||||
|
||||
## Known limitations and roadmap items
|
||||
|
||||
These are from `LOOM.md` plus observations from the bench suite.
|
||||
|
||||
| Limitation                    | Impact              | Roadmap status               |
|-------------------------------|---------------------|------------------------------|
| No stack size caching / slab  | High spawn cost     | Deferred                     |
| Global single min-heap timers | Poor at many timers | Deferred (hierarch. wheel)   |
| Global `Mutex<RunQueue>`      | Lock contention     | Deferred (per-thread queues) |
| No `join!()` macro            | Ergonomics          | Deferred                     |
| x86-64 Linux only             | Portability         | ARM64 deferred               |
| No restart intensity caps     | Supervision safety  | Deferred                     |
| Yield overhead under lock     | Hot-loop fairness   | Structural / ongoing         |
|
||||
|
||||
The yield overhead and global mutex are the two issues most likely to
|
||||
matter on a real multi-core workload. The sweep confirmed that
|
||||
`timeslice_cycles` is a meaningful knob for controlling the mutex
|
||||
hold time; the right long-term fix is per-thread run queues with
|
||||
work stealing.
|
||||
|
||||
---
|
||||
|
||||
## Running the bench suite
|
||||
|
||||
```sh
# Run all benches once, print results
python3 benches/sweep.py run

# Save current results as regression baseline
python3 benches/sweep.py run --save-baseline

# Check for regressions (>10% slower than baseline → exit 1)
python3 benches/sweep.py regress

# Sweep preemption knobs across the grid defined in sweep.py
python3 benches/sweep.py sweep

# Sweep and save raw data as CSV
python3 benches/sweep.py sweep --save-csv results.csv

# Run a single knob configuration manually
SMARM_ALLOC_INTERVAL=64 SMARM_TIMESLICE_CYCLES=150000 \
  cargo bench --bench general
```
|
||||
|
||||
The regression threshold is 10% and is configurable in `sweep.py`
|
||||
(`REGRESSION_THRESHOLD_PCT`). The sweep grid is `SWEEP_GRID` in the
|
||||
same file.
|
||||
177
docs/benchmarks.md
Normal file
@@ -0,0 +1,177 @@
|
||||
# Benchmarks
|
||||
|
||||
Regression-test and tuning reference for smarm vs tokio.
|
||||
|
||||
## Running
|
||||
|
||||
```sh
cargo bench --bench primes            # original compute bench
cargo bench --bench multi_scheduler   # original 3-workload bench
cargo bench --bench general           # benches 1–4
cargo bench --bench tokio_favored     # benches 5–8
cargo bench --bench smarm_favored     # benches 9–12
```
|
||||
|
||||
Each bench runs one warmup iteration (discarded) and 15 measured iterations.
|
||||
Results are reported as median / min / max in microseconds. Median is the
|
||||
headline number; the spread between min and max indicates measurement
|
||||
stability.
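The reporting scheme described above (one discarded warmup, N timed iterations, median / min / max in microseconds) can be sketched as follows. This is not the harness in the repository; `Summary`, `summarize`, and `bench` are hypothetical names used for illustration.

```rust
use std::time::Instant;

/// Median / min / max of one bench, in microseconds.
#[derive(Debug, PartialEq, Eq)]
pub struct Summary {
    pub median: u128,
    pub min: u128,
    pub max: u128,
}

/// Reduces raw samples to the three reported numbers.
/// Returns None when there are no samples.
pub fn summarize(mut samples: Vec<u128>) -> Option<Summary> {
    if samples.is_empty() {
        return None;
    }
    samples.sort_unstable();
    let n = samples.len();
    let median = if n % 2 == 1 {
        samples[n / 2]
    } else {
        // Even count: average the two middle samples.
        (samples[n / 2 - 1] + samples[n / 2]) / 2
    };
    Some(Summary { median, min: samples[0], max: samples[n - 1] })
}

/// Runs `workload` once as a discarded warmup, then `iters` timed runs.
pub fn bench(iters: usize, mut workload: impl FnMut()) -> Option<Summary> {
    workload(); // warmup, not measured
    let samples: Vec<u128> = (0..iters)
        .map(|_| {
            let start = Instant::now();
            workload();
            start.elapsed().as_micros()
        })
        .collect();
    summarize(samples)
}
```

With 15 measured iterations the median is simply the 8th-smallest sample, so a single outlier moves `max` but not the headline number.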
|
||||
|
||||
## Methodology notes
|
||||
|
||||
- The harness times wall-clock elapsed for the full workload, including
|
||||
runtime startup and shutdown. For multi-thread runtimes this means worker
|
||||
thread spawn cost is included; on short-lived benches this can dominate.
|
||||
Where startup matters, the bench is structured so the workload is much
|
||||
longer than typical startup.
|
||||
- `tokio` uses `new_current_thread` + `LocalSet` for the single-threaded
|
||||
comparison and `new_multi_thread().worker_threads(N)` for parallel.
|
||||
`smarm::runtime::Config::exact(N)` is the equivalent knob.
|
||||
- mpsc choice: tokio's `unbounded_channel` to match smarm's unbounded channel
|
||||
semantics. Bounded comparisons would need a separate suite.
|
||||
- Random delays in `many_timers` use a deterministic mixing function of the
|
||||
actor index so iterations are reproducible.
|
||||
|
||||
## Bench catalog
|
||||
|
||||
### General — neither runtime structurally favored
|
||||
|
||||
| # | Bench               | Stresses                                      | Prediction           |
|---|---------------------|-----------------------------------------------|----------------------|
| 1 | `chained_spawn`     | Spawn + exit overhead in a serial chain       | Roughly even         |
| 2 | `yield_many`        | Pure scheduling throughput, explicit yields   | Roughly even         |
| 3 | `fan_out_compute`   | CPU-bound parallel work, minimal coordination | Even (compute-bound) |
| 4 | `ping_pong_oneshot` | Spawn + oneshot round-trip latency            | Roughly even         |
|
||||
|
||||
A regression here means a real change in per-task or per-yield cost — those
|
||||
should be investigated regardless of which runtime got slower.
|
||||
|
||||
### Tokio-favored — measures cost of smarm's design choices
|
||||
|
||||
| # | Bench | Stresses | Why tokio should win |
|---|-------|----------|----------------------|
| 5 | `spawn_storm_busy` | 8 background yielders + 10k zero-work spawns | Tokio's per-worker deque + LIFO slot vs smarm's global `Mutex<SharedState>` queue |
| 6 | `mpsc_contention` | 32 producers × 10k msgs → 1 consumer | Tokio's mpsc is lock-free on the hot path; smarm channel is `Arc<Mutex<Inner>>` + runtime mutex on each unpark |
| 7 | `many_timers` | 10k actors sleeping 1–10 ms, dense wake window | Tokio's per-worker sharded timer wheel vs smarm's single shared min-heap |
| 8 | `multi_thread_scaling` | Primes, sweep thread count 1, 2, 4, available | Tokio scales near-linearly; smarm hits its mutex ceiling |
|
||||
|
||||
A regression here means a smarm design choice got more expensive. Widening
|
||||
gaps signal something to investigate; narrowing gaps after a tuning change is
|
||||
the desired direction.
|
||||
|
||||
### Smarm-favored — measures payoff of green-thread + stackful design
|
||||
|
||||
| # | Bench | Stresses | Why smarm should win |
|----|-------|----------|----------------------|
| 9 | `deep_recursion` | Actor recurses 1000 deep, returns | Native stack growth vs tokio's per-level `Box::pin` |
| 10 | `yield_in_hot_loop` | 2 actors, 500k yields each, single thread | Naked context switch (~6 GPRs + xmm save + ret) vs poll → state machine → schedule |
| 11 | `uncontended_channel` | 1→1, 1M msgs, single thread | Mutex is essentially free uncontended; green-thread switch is cheaper than poll |
| 12 | `catch_unwind_panics` | 10k spawns, 50% panic | Smarm has `catch_unwind` at the actor entry; both runtimes do this but the boundaries differ — exploratory |
|
||||
|
||||
A regression here means we lost some of smarm's structural advantage. #12 is
|
||||
exploratory — if the baseline shows no real gap, drop it.
|
||||
|
||||
## Baseline (v0.3.0, Intel Xeon @ 2.80GHz, 1 core, kernel 6.18.5, rustc 1.95.0, RUSTFLAGS: none)
|
||||
|
||||
> Sandbox environment has only 1 logical CPU. All multi-thread rows (smarm Nt,
|
||||
> tokio mt) are equivalent to 1-thread; scaling sweep is limited to 1 thread.
|
||||
> Label duplication in bench output ("smarm 1-thread" appearing twice) is
|
||||
> because available_parallelism() == 1, so the N-thread variant is identical.
|
||||
|
||||
| Bench                | smarm 1t | smarm Nt | tokio ct | tokio mt | Notes |
|----------------------|----------|----------|----------|----------|-------|
| chained_spawn        | 7136     | 6979     | 113      | 176      | smarm ~60x slower; spawn+stack alloc dominates on 1 CPU |
| yield_many           | 40079    | 40073    | 14571    | 14044    | smarm ~2.8x slower; scheduling overhead real |
| fan_out_compute      | 19347    | 19461    | 18616    | 18905    | roughly even; compute-bound as expected |
| ping_pong_oneshot    | 13731    | 14176    | 828      | 3342     | smarm ~17x slower; per-round spawn+join cost high |
| spawn_storm_busy     | 105512   | 107113   | 2222     | 4546     | smarm ~47x slower; global mutex under 8 bg yielders |
| mpsc_contention      | 10456    | 10395    | 17348    | 18628    | smarm wins; uncontended mutex essentially free on 1-thread |
| many_timers          | 120242   | 121023   | 13581    | 14266    | smarm ~9x slower; single min-heap vs sharded wheel |
| multi_thread_scaling | —        | —        | —        | —        | see thread-count sweep below |
| deep_recursion       | 62       | 71       | 22       | 44       | tokio wins unexpectedly; see sanity-check notes |
| yield_in_hot_loop    | 182177   | —        | 138335   | —        | tokio wins; smarm prediction wrong; see notes |
| uncontended_channel  | 31473    | —        | 51925    | —        | smarm wins as predicted; ~1.65x |
| catch_unwind_panics  | 112306   | 114305   | 151443   | 161344   | smarm wins as predicted; ~1.35x |
|
||||
|
||||
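The factors in the Notes column are plain ratios of the smarm 1t and tokio ct columns. A small sketch for recomputing them, using a hypothetical `slowdown` helper and values copied from the table above:

```rust
/// Ratio of smarm time to tokio time; values above 1.0 mean smarm is slower.
fn slowdown(smarm_us: f64, tokio_us: f64) -> f64 {
    smarm_us / tokio_us
}

fn main() {
    // (bench, smarm 1t, tokio ct) in µs, copied from the baseline table.
    let rows = [
        ("chained_spawn", 7136.0, 113.0),
        ("yield_many", 40079.0, 14571.0),
        ("ping_pong_oneshot", 13731.0, 828.0),
        ("spawn_storm_busy", 105512.0, 2222.0),
        ("many_timers", 120242.0, 13581.0),
        ("uncontended_channel", 31473.0, 51925.0),
        ("catch_unwind_panics", 112306.0, 151443.0),
    ];

    for &(name, smarm, tokio) in rows.iter() {
        let factor = slowdown(smarm, tokio);
        if factor >= 1.0 {
            println!("{}: smarm {:.2}x slower", name, factor);
        } else {
            println!("{}: tokio {:.2}x slower", name, 1.0 / factor);
        }
    }
}
```

Recomputing the factors from the raw columns after each run keeps the Notes column from drifting out of sync with the numbers.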
### `multi_thread_scaling` thread-count sweep (median µs)

> Sandbox has 1 logical CPU; only 1-thread row is available.

| Threads     | smarm | tokio mt |
|-------------|-------|----------|
| 1           | 19852 | 19638    |
| 2           | —     | —        |
| 4           | —     | —        |
| N (avail=1) | 19852 | 19638    |

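The duplicate "N" row is what a sweep produces when it appends the machine's parallelism to a fixed list without deduplicating. A minimal sketch of a sweep list that avoids this, assuming a hypothetical `sweep_counts` helper (the real bench harness is not shown here):

```rust
use std::thread;

/// Thread counts to sweep: the fixed points 1, 2 and 4 plus the machine's
/// parallelism, capped at what is available and with duplicates removed.
fn sweep_counts(available: usize) -> Vec<usize> {
    let mut counts: Vec<usize> = [1, 2, 4, available]
        .iter()
        .copied()
        .filter(|&n| n >= 1 && n <= available)
        .collect();
    counts.sort_unstable();
    counts.dedup();
    counts
}

fn main() {
    // `available_parallelism` can fail; fall back to a single thread.
    let available = thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    println!("sweeping thread counts {:?}", sweep_counts(available));
}
```

On a 1-CPU sandbox this yields a single entry, so the "1-thread" and "N-thread" variants would not both be run and reported.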
## Tuning experiments

### Reduction-budget sweep

`smarm` uses an allocator-driven preemption mechanism: every Nth allocation,
the actor checks RDTSC against its timeslice start and yields if over budget.
The Nth-allocation threshold (the "reduction budget") and the timeslice
duration are the two knobs.

Record each experiment as a row below. Reference the commit or the parameter
values explicitly.

| Date | Configuration         | Bench (or "all") | Result vs baseline | Notes |
|------|-----------------------|------------------|--------------------|-------|
|      | baseline              | all              | —                  |       |
|      | budget=…, timeslice=… |                  |                    |       |
|      |                       |                  |                    |       |

When the gap on tokio-favored benches narrows without regressing
smarm-favored benches, the change is a keeper. If a budget change improves
one workload but regresses another by more, prefer keeping the broader-impact
configuration unless we have a clear use case for the trade-off.

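The two knobs described at the start of this section can be pictured with a small sketch. It is an illustration under assumptions, not smarm's implementation: `ReductionBudget`, `on_alloc`, and `start_slice` are hypothetical names, and `std::time::Instant` stands in for the RDTSC read.

```rust
use std::time::{Duration, Instant};

/// Hypothetical per-actor preemption state.
struct ReductionBudget {
    every_n: u32,         // knob 1: read the clock on every Nth allocation
    timeslice: Duration,  // knob 2: how long an actor may run before yielding
    allocs: u32,          // allocations since the last clock read
    slice_start: Instant, // stand-in for the RDTSC value at slice start
}

impl ReductionBudget {
    fn new(every_n: u32, timeslice: Duration) -> Self {
        assert!(every_n > 0, "the budget must be at least 1");
        Self { every_n, timeslice, allocs: 0, slice_start: Instant::now() }
    }

    /// Called on every allocation. Returns `true` when the actor should yield.
    fn on_alloc(&mut self) -> bool {
        self.allocs += 1;
        if self.allocs < self.every_n {
            return false; // cheap path: no clock read at all
        }
        self.allocs = 0;
        self.slice_start.elapsed() >= self.timeslice
    }

    /// Called when the scheduler resumes the actor.
    fn start_slice(&mut self) {
        self.allocs = 0;
        self.slice_start = Instant::now();
    }
}

fn main() {
    let mut budget = ReductionBudget::new(64, Duration::from_micros(50));
    let mut yields = 0u32;
    for _ in 0..1_000_000u32 {
        // Each iteration stands in for one allocation made by the actor.
        if budget.on_alloc() {
            yields += 1;
            budget.start_slice();
        }
    }
    println!("yield points hit: {}", yields);
}
```

A smaller `every_n` reads the clock more often and reacts faster to an expired timeslice; a larger one keeps the allocation path cheaper but lets an actor overrun its slice by up to `every_n - 1` allocations.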
## Sanity-check notes (baseline run)

### Compile fixes applied

Two bench files had a type error: `smarm::Runtime::run()` takes
`impl FnOnce() + Send + 'static` (the closure returns `()`), but the consumer
closures in `bench_mpsc_smarm` (`tokio_favored.rs`) and `bench_unc_smarm`
(`smarm_favored.rs`) returned `u64` via a bare `count` tail expression. Fixed
by changing the tail to `let _ = count;` in both closures, and the
corresponding `consumer.join().unwrap()` calls to `let _ = consumer.join()...`.
No workload semantics changed.

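The shape of the type error and of the fix can be reproduced without smarm. In this sketch `run` is a hypothetical stand-in that only copies the bound quoted above, and the atomic counter exists purely so the example has an observable result; neither is taken from the bench sources.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for `smarm::Runtime::run`: same bound, runs inline.
fn run(f: impl FnOnce() + Send + 'static) {
    f();
}

/// Counts `n` messages inside a closure that must return `()`.
fn count_messages(seen: Arc<AtomicU64>, n: u64) {
    run(move || {
        let mut count = 0u64;
        for _ in 0..n {
            count += 1;
        }
        // Illustration only: publish the count so a caller can observe it.
        seen.store(count, Ordering::SeqCst);

        // Ending the closure with a bare `count` would make it return `u64`
        // and fail the `FnOnce()` bound. A statement keeps the return `()`.
        let _ = count;
    });
}

fn main() {
    let seen = Arc::new(AtomicU64::new(0));
    count_messages(Arc::clone(&seen), 1_000);
    println!("consumer saw {} messages", seen.load(Ordering::SeqCst));
}
```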
### Single-CPU sandbox caveat

`available_parallelism()` returns 1, so every "N-thread" variant is identical
to "1-thread". Multi-thread results should not be used to draw scaling
conclusions; re-run on a multi-core machine before committing to the tuning
sweep.

### Predicted-winner mismatches

**`deep_recursion`: tokio wins (22 µs) over smarm (62 µs).**
At depth 500, smarm spawns a fresh actor, which requires `mmap`ing a 64 KiB
stack; that allocation cost dominates the actual recursion. Tokio's
`Box::pin` recursion allocates 500 small heap objects but avoids the `mmap`.
The prediction assumed stack allocation was amortised across many uses; here
the actor is single-use. Not a bug, but the bench may not exercise the
intended advantage.

**`yield_in_hot_loop`: tokio wins (138 ms) over smarm (182 ms).**
The prediction was that smarm's ~6-GPR naked context switch would beat
tokio's poll/state-machine cycle. In practice, on a single-thread sandbox,
tokio's `current_thread` scheduler has very low overhead per `yield_now`,
while smarm's `yield_now` still goes through the runtime mutex and run queue
even on a single thread. This is a meaningful data point: smarm's scheduling
overhead is not as low as the assembly switch cost alone suggests.

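For the `deep_recursion` mismatch, the allocation pattern on the async side can be sketched without tokio. This assumes the bench follows the usual `Box::pin` recursion idiom (its source is not shown here); `recurse`, `NoopWake`, and the busy-polling `block_on` are hypothetical helpers written only for this example.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Async recursion needs a boxed future per level, because a future cannot
/// contain itself by value. Each level is one small heap allocation.
fn recurse(depth: u32) -> Pin<Box<dyn Future<Output = u32>>> {
    Box::pin(async move {
        if depth == 0 {
            0
        } else {
            1 + recurse(depth - 1).await
        }
    })
}

/// Waker that does nothing; enough for futures that never return `Pending`.
struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Minimal executor for this sketch: polls in a loop until the future is ready.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = Box::pin(fut);
    loop {
        if let Poll::Ready(value) = fut.as_mut().poll(&mut cx) {
            return value;
        }
    }
}

fn main() {
    let levels = block_on(recurse(500));
    println!("recursed {} levels", levels);
}
```

Every level costs one ordinary heap allocation and no `mmap` call, which is consistent with the explanation above for why a single-use 64 KiB actor stack loses at this depth.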
### Noise / spread

- `catch_unwind_panics`: smarm spread is reasonable (~10% min/max).
- `spawn_storm_busy`: tokio multi-thread has notable spread (3833–7305 µs),
  consistent with tokio issue #3829 noted in the task spec.
- `many_timers`: smarm spread is acceptable (~10%).

### Result-column equivalence

All result columns match between runtimes for every bench (same prime counts,
same message totals, same task counts). Workloads are equivalent.

1297
docs/smarm - Deep Dive.html
Normal file
File diff suppressed because it is too large