# SMARM Architecture

> Erlang-style actor concurrency for Rust, without the copies, the colors, or the GC pauses.

---
## Vision

Rust gives you the right ownership discipline for safe actor concurrency almost for free — `Send` already
draws the boundary, the borrow checker already enforces it. What it lacks is an execution model to match:
async/await is IO-centric, colors your functions, and trades stack simplicity for state-machine complexity;
OS threads are too heavy to spawn per actor.

SMARM adds a third option: **green-thread actors on a shared heap**, scheduled cooperatively, with
message-passing as the only cross-actor communication primitive. You get Erlang's isolation model without
Erlang's copying GC, and you get Rust's zero-copy ownership transfers without async's cognitive overhead.
No function coloring. No `Box<dyn Future>`. Just actors, messages, and the borrow checker doing what it
already does.

---
## Do: Core Runtime

### Actors and scheduling

Each actor is a lightweight green thread with its own heap-allocated, growable stack. Stacks are
allocated via `mmap` with a guard page below the region; overflow is detected by the OS without SMARM
polling for it. Initial stacks are small and grow by remapping on demand.

The scheduler runs one OS thread per CPU. Each scheduler thread loops against a single global
`Mutex<HashMap>` queue shared across all schedulers. If queue contention becomes a measured bottleneck
this can be revisited; the interface will not change.

SMARM requires `panic = "unwind"`. Users who set `panic = "abort"` accept that supervision and actor
isolation are silently degraded to process death.

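
The scheduler loop described above can be sketched with the standard library alone. This is only an illustration under stated assumptions: `RunQueue`, `pop_any`, and `run_to_completion` are hypothetical names, plain closures stand in for green-thread actors, and workers exit when the queue is empty instead of parking. It shows the locking pattern, not SMARM's actual scheduler.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// (index, generation), as in the PID section.
pub type Pid = (u32, u32);
/// Stand-in for a runnable actor: a closure run to completion.
pub type Task = Box<dyn FnOnce() + Send + 'static>;

/// Hypothetical global run queue: one mutex-protected map shared by all workers.
#[derive(Clone, Default)]
pub struct RunQueue {
    inner: Arc<Mutex<HashMap<Pid, Task>>>,
}

impl RunQueue {
    pub fn push(&self, pid: Pid, task: Task) {
        self.inner.lock().unwrap().insert(pid, task);
    }

    /// Removes an arbitrary runnable entry. The lock is released on return,
    /// so the task itself runs without holding the queue lock.
    fn pop_any(&self) -> Option<(Pid, Task)> {
        let mut map = self.inner.lock().unwrap();
        let pid = *map.keys().next()?;
        map.remove(&pid).map(|task| (pid, task))
    }
}

/// Spawns `workers` OS threads that drain the queue, then waits for them.
pub fn run_to_completion(queue: &RunQueue, workers: usize) {
    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let q = queue.clone();
            thread::spawn(move || {
                while let Some((_pid, task)) = q.pop_any() {
                    task();
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().unwrap();
    }
}
```

In a real runtime the worker count would likely come from `std::thread::available_parallelism()`, and an idle worker would block rather than return.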
### Process descriptor

Each actor has a descriptor that is hot while the actor runs and will typically live in L1 cache.
It holds:

- `stack_base: *mut u8` — bottom of the allocated stack region
- `stack_cap: usize` — total allocated size
- `stack_ptr: *mut u8` — current stack pointer (`rsp`), saved on yield
- `pid: (u32, u32)` — index and generation counter (see PIDs below)
- `alloc_count: u32` — countdown for preemption sampling
- `timeslice_start: u64` — `RDTSC` value written on every resume
- `resize_count: u16` — diagnostic counter for stack growth events
- `context: *mut ContextSaveArea` — pointer to the register save area (cold, touched only on switch)

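
The field list above maps directly onto a struct. The sketch below is one possible layout, not the project's definition: `#[repr(C)]`, the named `Pid` struct (instead of a tuple), the opaque `ContextSaveArea` placeholder, and the 64-byte cache-line figure are all assumptions made so the size can be checked at compile time.

```rust
use std::mem::size_of;

/// Placeholder for the architecture-specific register save area.
#[repr(C)]
pub struct ContextSaveArea {
    _opaque: [u8; 0],
}

/// Index plus generation counter, written as a named struct for a fixed layout.
#[repr(C)]
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

/// Hot per-actor state, fields in the order the document lists them.
#[repr(C)]
pub struct ProcessDescriptor {
    pub stack_base: *mut u8,           // bottom of the allocated stack region
    pub stack_cap: usize,              // total allocated size
    pub stack_ptr: *mut u8,            // saved stack pointer
    pub pid: Pid,                      // index and generation
    pub alloc_count: u32,              // countdown for preemption sampling
    pub timeslice_start: u64,          // clock value written on resume
    pub resize_count: u16,             // stack growth events
    pub context: *mut ContextSaveArea, // cold: touched only on switch
}

/// Assumed cache-line size; typical for x86-64, not guaranteed everywhere.
pub const CACHE_LINE: usize = 64;

// Compile-time check: the descriptor must fit in one assumed cache line.
const _: () = assert!(size_of::<ProcessDescriptor>() <= CACHE_LINE);

pub fn descriptor_size() -> usize {
    size_of::<ProcessDescriptor>()
}
```

On a 64-bit target this layout is 64 bytes, 10 of which are padding after `alloc_count` and `resize_count`, so reordering the small fields would leave room for more state in the same line.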
### Context switching

Context switching is implemented in a `#[naked]` assembly shim, one per supported architecture.
The compiler cannot be asked to switch stacks.

**Suspend** (yield, preemption, or blocking):
1. Save callee-saved integer registers and SIMD registers into `ContextSaveArea`.
2. Save `rsp`/`sp` into the process descriptor.
3. Load the scheduler's stack pointer from a thread-local and jump back into the scheduler loop.

**Resume**:
1. Load `rsp`/`sp` from the process descriptor.
2. Restore registers from `ContextSaveArea`.
3. `ret`. On x86-64 the return address is already on the restored stack; on ARM64 `ret` branches
   to the restored `x30`. Either way, execution resumes exactly where the actor yielded.

**x86-64**: saves `rbx`, `rbp`, `r12`–`r15` (6 × 8 = 48 bytes) and `xmm0`–`xmm15` (16 × 16 = 256
bytes) = 304 bytes total. Full SSE baseline is required; the compiler may autovectorise freely.
AVX-512 is deferred.

**ARM64**: saves `x19`–`x30` (12 × 8 = 96 bytes, including the link register `x30` which must be
saved explicitly — it holds the return address, unlike x86 where `call` pushes it to the stack) and
`d8`–`d15` (8 × 8 = 64 bytes) = 160 bytes total.

Each actor owns its `ContextSaveArea` through a `Box<ContextSaveArea>`. Its lifetime equals the
actor's lifetime, so there is no churn and no bulk deallocation; a plain `Box` is sufficient.

Initial platform target is x86-64 Linux. ARM64 and macOS are natural follow-ons.

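
The byte counts above can be checked mechanically. The structs below are an assumed layout for illustration (the names `X64SaveArea` and `Arm64SaveArea` and the field order are hypothetical); the point is only that the register lists add up to 304 and 160 bytes.

```rust
use std::mem::size_of;

/// x86-64: rbx, rbp, r12..r15 followed by xmm0..xmm15.
#[repr(C)]
pub struct X64SaveArea {
    pub gpr: [u64; 6],       // 6 x 8  = 48 bytes
    pub xmm: [[u8; 16]; 16], // 16 x 16 = 256 bytes
}

/// ARM64: x19..x30 (x30 is the link register) followed by d8..d15.
#[repr(C)]
pub struct Arm64SaveArea {
    pub x: [u64; 12], // 12 x 8 = 96 bytes
    pub d: [u64; 8],  // 8 x 8  = 64 bytes
}

// Compile-time checks against the totals quoted in the text.
const _: () = assert!(size_of::<X64SaveArea>() == 304);
const _: () = assert!(size_of::<Arm64SaveArea>() == 160);

pub fn save_area_sizes() -> (usize, usize) {
    (size_of::<X64SaveArea>(), size_of::<Arm64SaveArea>())
}
```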
### Allocator-driven preemption

Every Nth allocation, the allocator reads `RDTSC` and compares it against `timeslice_start`. If the
threshold is exceeded the actor yields. The workloads that starve a scheduler — sustained compute,
data transformation — are usually the ones doing frequent allocations, so allocation count is a
reasonable proxy for CPU time.

`RDTSC` is not monotonic across core migration; a slightly wrong timeslice is acceptable. SMARM is
not a real-time scheduler.

Known failure mode: tight no-alloc loops are invisible to this mechanism. Actors doing sustained
allocation-free compute must call `smarm::yield_now()` explicitly, or offload to a thread pool
outside the actor scheduler (e.g. rayon). This is documented and acceptable — such loops are rare
in message-passing workloads.

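
The sampling logic can be sketched independently of the allocator. Everything named here is an assumption for illustration: `Preempt`, `SAMPLE_EVERY`, and `TIMESLICE_TICKS` are hypothetical, and the clock is passed in as a closure so the example does not depend on `RDTSC`.

```rust
/// Hypothetical N: sample the clock once per this many allocations.
pub const SAMPLE_EVERY: u32 = 64;
/// Hypothetical timeslice length in clock ticks.
pub const TIMESLICE_TICKS: u64 = 1_000_000;

/// Per-actor preemption state, mirroring two descriptor fields.
pub struct Preempt {
    alloc_count: u32,     // allocations left until the next clock sample
    timeslice_start: u64, // clock value recorded on resume
}

impl Preempt {
    /// Called when the scheduler resumes the actor.
    pub fn resume(now: u64) -> Self {
        Preempt { alloc_count: SAMPLE_EVERY, timeslice_start: now }
    }

    /// Called on every allocation. The clock is read only on every Nth call.
    /// Returns true when the actor has used up its timeslice and should yield.
    pub fn on_alloc(&mut self, read_clock: impl FnOnce() -> u64) -> bool {
        self.alloc_count -= 1;
        if self.alloc_count > 0 {
            return false; // fast path: no clock read
        }
        self.alloc_count = SAMPLE_EVERY;
        let now = read_clock();
        // saturating_sub: a clock that appears to run backwards after a core
        // migration counts as zero elapsed time instead of wrapping.
        now.saturating_sub(self.timeslice_start) > TIMESLICE_TICKS
    }
}
```

On x86-64 the closure would plausibly wrap `core::arch::x86_64::_rdtsc`; the fast path costs one decrement and one branch per allocation.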
### Yield points

An actor yields at:

- **Channel send/recv** — the primary communication primitive
- **Mutex contention** — attempting to lock a held `Arc<Mutex<T>>` parks the actor
- **IO** — blocking on a socket or file descriptor parks the actor until the IO thread signals readiness
- **`smarm::sleep(duration)`** — parks the actor; the timer wheel re-queues it on expiry
- **`smarm::yield_now()`** — explicit cooperative yield
- **Allocator preemption** — see Allocator-driven preemption above
- **Spawn** — does not yield by default; the new actor is queued and the spawner continues

`std::thread::sleep` inside an actor blocks the entire OS thread and should never be used. SMARM
may emit a warning if it can detect this.
### IO thread

A single dedicated IO thread runs an `epoll`/`kqueue` loop. Actors blocking on IO register their
file descriptor and PID; the IO thread moves them back into the global queue when the fd is ready.
A `HashMap<RawFd, Pid>` maps fds to parked actors. Cancellation (actor dies while waiting on IO)
deregisters the fd. This is intentionally simple and not pluggable; SMARM is not a general async
executor.

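
The bookkeeping side of the IO thread is small enough to sketch. This assumes a Unix target and leaves out the actual `epoll`/`kqueue` calls; `IoRegistry` and its method names are hypothetical, and the caller is expected to re-queue whatever `on_ready` returns.

```rust
use std::collections::HashMap;
use std::os::unix::io::RawFd;

/// (index, generation), as in the PID section.
pub type Pid = (u32, u32);

/// Hypothetical fd-to-actor map owned by the IO thread.
#[derive(Default)]
pub struct IoRegistry {
    waiting: HashMap<RawFd, Pid>,
}

impl IoRegistry {
    /// Actor `pid` parks until `fd` becomes ready.
    pub fn register(&mut self, fd: RawFd, pid: Pid) {
        self.waiting.insert(fd, pid);
    }

    /// The poller reported `fd` ready: hand back the actor to re-queue, if any.
    pub fn on_ready(&mut self, fd: RawFd) -> Option<Pid> {
        self.waiting.remove(&fd)
    }

    /// Cancellation: the actor died while parked, so drop its registrations.
    pub fn cancel(&mut self, pid: Pid) {
        self.waiting.retain(|_, waiting_pid| *waiting_pid != pid);
    }
}
```

Because the map is keyed by fd, this sketch allows one parked actor per descriptor; a readiness event for a cancelled registration simply finds nothing to wake.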
### Communication

Messages must be `Send`; `Copy` types are copied rather than moved, but they still need to be
`Send`. Non-`Send` types cannot cross an actor boundary; this is enforced by the type system with
no runtime overhead.

Two primitives only:

- **Move** — transfer owned data across a channel. Zero copy. The sender relinquishes ownership
  at the type level. This is the default.
- **`Arc<Mutex<T>>`** — for genuinely shared long-lived state. Explicit and visible.

Cross-actor `Rc` or bare pointers are banned. There is no cycle detector. Cross-actor cycles are
banned by construction: either transfer ownership or use `Arc`.

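
The zero-copy claim for the move primitive can be demonstrated with the standard library, using an OS thread and `std::sync::mpsc` as stand-ins for an actor and a SMARM channel (an assumption; SMARM's channel API is not shown in this document). The function name `move_is_zero_copy` is hypothetical.

```rust
use std::sync::mpsc;
use std::thread;

/// Sends a 1 MiB buffer to another thread and reports whether the receiver
/// sees the same heap allocation the sender created.
pub fn move_is_zero_copy() -> bool {
    let (tx, rx) = mpsc::channel::<Vec<u8>>();

    let payload = vec![0u8; 1 << 20];
    let sent_addr = payload.as_ptr() as usize;

    // Ownership moves into the channel. Only the Vec's pointer, length, and
    // capacity are transferred; the 1 MiB buffer itself is not copied.
    tx.send(payload).unwrap();
    // Using `payload` after this point would be a compile error.

    let receiver = thread::spawn(move || {
        let got = rx.recv().unwrap();
        got.as_ptr() as usize
    });

    receiver.join().unwrap() == sent_addr
}
```

Trying the same with an `Rc` payload fails at compile time, because `Rc` is not `Send` and so the receiving end cannot be moved into another thread.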
### PIDs

A PID is a `(index, generation)` pair. The index may be reused after an actor dies; the generation
counter increments on every death. A stale handle holding the wrong generation is a detectable
error, not a silent misdirection. This avoids the ABA problem without reserving PID space forever.

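
A generational index table is one common way to get this behaviour. The sketch below is an assumption about how it could look, not SMARM's implementation; `PidTable`, `Slot`, and the method names are hypothetical.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    pub index: u32,
    pub generation: u32,
}

struct Slot<T> {
    generation: u32,
    value: Option<T>,
}

/// Hypothetical actor table with reusable indices.
pub struct PidTable<T> {
    slots: Vec<Slot<T>>,
    free: Vec<u32>,
}

impl<T> PidTable<T> {
    pub fn new() -> Self {
        PidTable { slots: Vec::new(), free: Vec::new() }
    }

    /// Reuses a free index if one exists, otherwise grows the table.
    pub fn spawn(&mut self, value: T) -> Pid {
        if let Some(index) = self.free.pop() {
            let slot = &mut self.slots[index as usize];
            slot.value = Some(value);
            Pid { index, generation: slot.generation }
        } else {
            self.slots.push(Slot { generation: 0, value: Some(value) });
            Pid { index: (self.slots.len() - 1) as u32, generation: 0 }
        }
    }

    /// Actor death: bump the generation so old handles go stale, free the index.
    pub fn kill(&mut self, pid: Pid) -> Option<T> {
        let slot = self.slots.get_mut(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None; // stale handle
        }
        let value = slot.value.take()?;
        slot.generation = slot.generation.wrapping_add(1);
        self.free.push(pid.index);
        Some(value)
    }

    /// Returns None for a stale or unknown PID instead of the wrong actor.
    pub fn get(&self, pid: Pid) -> Option<&T> {
        let slot = self.slots.get(pid.index as usize)?;
        if slot.generation != pid.generation {
            return None;
        }
        slot.value.as_ref()
    }
}
```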
### Supervision

Every actor has a supervisor, assigned at spawn. This is not optional. The root supervisor is
provided by the runtime; its death is a process exit.

A supervisor receives one of three signals when a child actor terminates:

- `Signal::Exit(pid)` — normal completion
- `Signal::Panic(pid, payload)` — caught via `catch_unwind` at the actor entry point boundary,
  before unwinding can reach the assembly shim
- `Signal::Timeout(pid)` — actor exceeded a budget (see below)

The supervisor decides: restart the actor, escalate to its own supervisor, or ignore. Restart
intensity is capped: if an actor panics more than N times within a time window, the supervisor
stops restarting and escalates. This prevents a bad prelude or corrupted input from spinning the
supervisor in a restart loop indefinitely. N and the window are configurable per supervisor with a
sensible global default.

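
Two pieces of this contract lend themselves to a short sketch: turning a panic into a signal at the entry point, and capping restart intensity with a sliding window. The `Signal` variants follow the list above; `run_actor`, `Decision`, `RestartIntensity`, and `on_panic` are hypothetical names, and the "ignore" option is left out for brevity.

```rust
use std::any::Any;
use std::collections::VecDeque;
use std::panic::{catch_unwind, AssertUnwindSafe};
use std::time::{Duration, Instant};

pub type Pid = (u32, u32);

pub enum Signal {
    Exit(Pid),
    Panic(Pid, Box<dyn Any + Send>),
    Timeout(Pid),
}

/// Runs an actor body and converts a panic into a signal, so unwinding
/// stops at the entry point instead of travelling further up the stack.
pub fn run_actor(pid: Pid, body: impl FnOnce()) -> Signal {
    match catch_unwind(AssertUnwindSafe(body)) {
        Ok(()) => Signal::Exit(pid),
        Err(payload) => Signal::Panic(pid, payload),
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Restart,
    Escalate,
}

/// Sliding-window restart limiter: more than `max_restarts` panics inside
/// `window` means the supervisor gives up and escalates.
pub struct RestartIntensity {
    max_restarts: usize,
    window: Duration,
    recent: VecDeque<Instant>,
}

impl RestartIntensity {
    pub fn new(max_restarts: usize, window: Duration) -> Self {
        RestartIntensity { max_restarts, window, recent: VecDeque::new() }
    }

    pub fn on_panic(&mut self, now: Instant) -> Decision {
        // Forget panics that have fallen out of the window.
        while let Some(&oldest) = self.recent.front() {
            if now.duration_since(oldest) > self.window {
                self.recent.pop_front();
            } else {
                break;
            }
        }
        self.recent.push_back(now);
        if self.recent.len() > self.max_restarts {
            Decision::Escalate
        } else {
            Decision::Restart
        }
    }
}
```

Passing `now` in explicitly keeps the limiter deterministic under test; a supervisor would call `on_panic(Instant::now())` each time it receives `Signal::Panic`.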
### Mutex timeout

Every `smarm::mutex` lock attempt is mediated by the scheduler. If the lock is not acquired within
a configurable timeout, the actor receives a `LockTimeout` error rather than parking forever. This
is a hard runtime guarantee, not a convention. Default timeout is global and configurable;
individual locks and individual call sites can override it.

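
The observable behaviour (a lock attempt that either succeeds or returns `LockTimeout`) can be approximated on top of `std::sync::Mutex`. This is a stand-in under a loud assumption: SMARM would park the actor and let the scheduler wake it, whereas this sketch polls `try_lock` and yields the OS thread. `lock_with_timeout` is a hypothetical name.

```rust
use std::sync::{Mutex, MutexGuard, TryLockError};
use std::thread;
use std::time::{Duration, Instant};

#[derive(Debug, PartialEq, Eq)]
pub struct LockTimeout;

/// Tries to take `m` until `timeout` has elapsed, then gives up with an error
/// instead of blocking forever.
pub fn lock_with_timeout<T>(
    m: &Mutex<T>,
    timeout: Duration,
) -> Result<MutexGuard<'_, T>, LockTimeout> {
    let deadline = Instant::now() + timeout;
    loop {
        match m.try_lock() {
            Ok(guard) => return Ok(guard),
            // A poisoned mutex is still acquired; hand the guard to the caller.
            Err(TryLockError::Poisoned(poisoned)) => return Ok(poisoned.into_inner()),
            Err(TryLockError::WouldBlock) => {
                if Instant::now() >= deadline {
                    return Err(LockTimeout);
                }
                // Stand-in for parking the actor: let other threads run.
                thread::yield_now();
            }
        }
    }
}
```

A per-call-site override then amounts to passing a different `timeout`; a per-lock default would live alongside the mutex itself.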
### Task joining

Actors can spawn children and wait on a group of handles:

```rust
// Spawn two child actors; `spawn` queues each child and returns a handle without yielding.
let h1 = smarm::spawn(|| compute_a());
let h2 = smarm::spawn(|| compute_b());
// Park this actor until both children have finished, then collect their results.
let (a, b) = smarm::join!(h1, h2);
```

`join!` parks the calling actor until all handles complete. The last child to finish re-queues the
parent. This is a countdown in the parent's descriptor; no polling, no waker registration. A
`join_timeout!` variant is a natural extension.

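
The countdown itself is a single atomic decrement. In this sketch `JoinCountdown`, `child_done`, and `count_wakeups` are hypothetical names, and OS threads stand in for child actors; the property being shown is that exactly one child, the last to finish, is told to re-queue the parent.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical join counter stored with the parked parent.
pub struct JoinCountdown {
    remaining: AtomicUsize,
}

impl JoinCountdown {
    pub fn new(children: usize) -> Self {
        JoinCountdown { remaining: AtomicUsize::new(children) }
    }

    /// Called by each child as it finishes. Returns true for exactly one
    /// caller: the child that brought the count to zero and must therefore
    /// re-queue the parent.
    pub fn child_done(&self) -> bool {
        // fetch_sub returns the previous value, so 1 means "I was the last".
        self.remaining.fetch_sub(1, Ordering::AcqRel) == 1
    }
}

/// Runs `children` threads to completion and counts how many of them were
/// told to wake the parent.
pub fn count_wakeups(children: usize) -> usize {
    let countdown = Arc::new(JoinCountdown::new(children));
    let handles: Vec<_> = (0..children)
        .map(|_| {
            let c = Arc::clone(&countdown);
            thread::spawn(move || c.child_done())
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().unwrap())
        .filter(|&was_last| was_last)
        .count()
}
```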
### Timer wheel

`smarm::sleep` and supervision timeouts are driven by a timer wheel in the scheduler. Sleeping
actors are parked and re-queued by the timer thread on expiry. The timer wheel is internal
infrastructure; its design is an implementation detail.

---
## Defer: Later Work

- **Stack sizing policy** — initial size, growth factor, and whether stacks ever shrink are
  implementation decisions to be made with profiling data, not up front.
- **Queue contention** — if `Mutex<HashMap>` proves to be a bottleneck under profiling, evaluate
  `DashMap` or a lock-free work-stealing deque (e.g. `crossbeam-deque`). Not before.
- **AVX-512 context save** — extend `ContextSaveArea` when there is a concrete use case.
- **`smarm::sleep` vs raw sleep semantics** — further control knobs deferred until the basic sleep
  is working and real use cases are understood.
- **Supervision tree API** — the contract is defined; the recursive hierarchy, restart strategies,
  and introspection API are implementation work.
- **`no_std` support** — the assembly shim is `no_std` friendly but the IO thread and allocator require
  OS primitives. Target is `no_std` + `alloc` on hosted platforms; bare metal is out of scope.
- **Distribution** — SMARM is a single-process runtime. No distribution protocol, no BEAM-style
  clustering.

---
## What SMARM is Not

- Not a drop-in replacement for Tokio. SMARM does not implement `Future` or the async executor interface.
- Not a general allocator. SMARM manages actor stacks; heap allocation for actor data goes through
  the system allocator.
- Not Erlang. No hot code reloading, no distribution protocol, no BEAM bytecode. SMARM is a
  concurrency runtime, not a platform.
- Not a real-time scheduler. Timeslice accuracy is best-effort.

---
## On names

<sub>The name is a recursive acronym. The M is for Marks, as in the BEAM — Bogdan/Björn's Erlang Abstract Machine, the virtual machine that runs Erlang and Elixir. smarm is not the BEAM. It just admires it from a safe distance.</sub>