Green-Thread Actor Runtime
Erlang's isolation model. Rust's zero-copy ownership. No function colouring.

smarm is a prototype concurrent runtime for Rust. Each actor is a green thread with its own
mmap'd stack. N OS threads share a single global run queue. Actors communicate
exclusively via message passing (owned values over channels); no shared mutable state
without an explicit Arc<Mutex<T>>.

Preemption is allocator-driven: every Nth heap allocation, smarm reads RDTSC and yields the actor if its timeslice has expired. No OS signals, no separate timer thread for scheduling.

No function colouring. No Box<dyn Future>. No poll state machines. Just plain Rust functions that block.
64 KB stacks instead of 8 MB. Context switch in ~10–20 ns (6 GPR saves + ret) instead of a trip through the kernel.
Zero-copy ownership via Rust's type system. No GC pause. No copying GC. Message passing is a move, not a clone.
Module Map
13 source modules, three rough layers. The bottom layer has zero smarm dependencies; the middle layer builds the runtime machinery; the top layer is the public API.
Calls mmap for a contiguous region, then mprotects the bottom page to PROT_NONE. Stack grows downward; overflow hits the guard page → SIGSEGV. Implements Drop via munmap. Zero smarm dependencies.
Two #[naked] assembly functions (switch_to_actor, switch_to_scheduler). Save 6 callee-saved GPRs, swap rsp, restore, ret. Thread-locals hold each side's saved stack pointer. XMM registers are not saved here; the compiler guarantees a spill at Rust call sites.
Implements GlobalAlloc — wraps System allocator. On every Nth alloc, reads RDTSC. If elapsed > timeslice_cycles and preemption is enabled, calls switch_to_scheduler(). Thread-locals hold the countdown, start timestamp, and an enabled flag (scheduler disables it to prevent self-preemption).
struct Pid(u32 index, u32 generation). Index = slot in the actor table. Generation increments on actor death. Stale handles are detectable: a Pid with wrong generation fails slot lookup rather than silently addressing a new actor. Solves ABA without exhausting PID space.
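The generation check can be sketched as follows. This is a minimal illustration, not smarm's actual slot table: `SlotTable`, `Slot`, and the `&'static str` occupant are hypothetical stand-ins for the real actor state.

```rust
/// Generational handle: slot index plus the generation it was issued under.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Pid {
    index: u32,
    generation: u32,
}

/// One entry in the illustrative actor table.
struct Slot {
    generation: u32,
    occupant: Option<&'static str>, // stand-in for real actor state
}

#[derive(Default)]
pub struct SlotTable {
    slots: Vec<Slot>,
    free: Vec<u32>, // indices available for reuse
}

impl SlotTable {
    /// Reuse a free slot if one exists, otherwise grow the table.
    pub fn insert(&mut self, name: &'static str) -> Pid {
        let index = match self.free.pop() {
            Some(i) => i,
            None => {
                self.slots.push(Slot { generation: 0, occupant: None });
                (self.slots.len() - 1) as u32
            }
        };
        let slot = &mut self.slots[index as usize];
        slot.occupant = Some(name);
        Pid { index, generation: slot.generation }
    }

    /// A stale handle returns None instead of addressing the new occupant.
    pub fn get(&self, pid: Pid) -> Option<&'static str> {
        let slot = self.slots.get(pid.index as usize)?;
        if slot.generation == pid.generation {
            slot.occupant
        } else {
            None
        }
    }

    /// Actor death: clear the slot, bump its generation, recycle the index.
    pub fn remove(&mut self, pid: Pid) -> bool {
        match self.slots.get_mut(pid.index as usize) {
            Some(slot) if slot.generation == pid.generation && slot.occupant.is_some() => {
                slot.occupant = None;
                slot.generation = slot.generation.wrapping_add(1);
                self.free.push(pid.index);
                true
            }
            _ => false,
        }
    }
}
```

After `remove(a)` followed by another `insert`, the new Pid shares `a`'s index but carries the next generation, so `get(a)` returns None while the new handle resolves normally.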
Owns the Stack. Defines the trampoline: every actor's first ret lands here. Trampoline reads the closure from a thread-local, calls it inside catch_unwind, writes the Outcome to another thread-local, then yields back to the scheduler. Thread-locals: current PID, pending closure, last outcome, done flag.
The heaviest module. Contains SharedState (slot table, run queue, timers, IO), RuntimeInner (shared state behind a mutex, per-thread stats, drain lock), and schedule_loop, the main scheduler loop that drains timers, drains IO completions, pops actors, resumes them, and handles the post-yield intent (re-queue vs park vs finalize).
Unbounded MPSC. Inner state is Arc<Mutex<Inner<T>>> — senders are clonable, last drop closes channel. recv(): checks queue; if empty, registers self as parked_receiver, releases the lock, calls park_current(). send(): pushes, takes the parked PID, calls unpark(pid).
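A rough sketch of that send/recv handshake, with the scheduler calls factored out so it runs standalone. It assumes PIDs are plain `u32`s and returns the PID to wake rather than calling unpark() itself; `recv_or_register` is a hypothetical name, and close-on-last-sender-drop is omitted.

```rust
use std::collections::VecDeque;
use std::sync::{Arc, Mutex};

struct Inner<T> {
    queue: VecDeque<T>,
    parked_receiver: Option<u32>, // PID of a receiver waiting for data
}

pub struct Sender<T> {
    inner: Arc<Mutex<Inner<T>>>,
}

pub struct Receiver<T> {
    inner: Arc<Mutex<Inner<T>>>,
}

// Manual impl so that cloning a sender does not require T: Clone.
impl<T> Clone for Sender<T> {
    fn clone(&self) -> Self {
        Sender { inner: Arc::clone(&self.inner) }
    }
}

pub fn channel<T>() -> (Sender<T>, Receiver<T>) {
    let inner = Arc::new(Mutex::new(Inner {
        queue: VecDeque::new(),
        parked_receiver: None,
    }));
    (Sender { inner: Arc::clone(&inner) }, Receiver { inner })
}

impl<T> Sender<T> {
    /// Moves `value` into the queue and returns the PID the caller should
    /// unpark, if a receiver had registered itself as parked.
    pub fn send(&self, value: T) -> Option<u32> {
        let mut guard = self.inner.lock().unwrap();
        guard.queue.push_back(value);
        guard.parked_receiver.take()
    }
}

impl<T> Receiver<T> {
    /// Returns the next message, or registers `me` as the parked receiver and
    /// returns None. In the real runtime the caller would then release the
    /// lock and park.
    pub fn recv_or_register(&self, me: u32) -> Option<T> {
        let mut guard = self.inner.lock().unwrap();
        match guard.queue.pop_front() {
            Some(value) => Some(value),
            None => {
                guard.parked_receiver = Some(me);
                None
            }
        }
    }
}
```

The message is moved into the queue and moved out again; nothing is cloned along the way.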
Actor-aware mutex with mandatory timeout (default 30s). Fast path: no holder → grant immediately. Slow path: join FIFO waiter queue, insert a WaitTimeout timer, park. On timer expiry: if actor is still in waiters, unpark it with LockTimeout. On guard drop: pop next waiter, grant, unpark.
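The grant logic above can be modelled without any scheduler at all. The following is a sketch under assumptions: `ActorMutex`, `LockResult`, and the method names are hypothetical, PIDs are plain `u32`s, and parking, timers, and the guard type are left out.

```rust
use std::collections::VecDeque;

#[derive(Debug, PartialEq, Eq)]
pub enum LockResult {
    Granted, // fast path: there was no holder
    Queued,  // slow path: caller joins the waiters and would park
}

#[derive(Default)]
pub struct ActorMutex {
    holder: Option<u32>,
    waiters: VecDeque<u32>, // FIFO order
}

impl ActorMutex {
    pub fn lock(&mut self, pid: u32) -> LockResult {
        if self.holder.is_none() {
            self.holder = Some(pid);
            LockResult::Granted
        } else {
            self.waiters.push_back(pid);
            LockResult::Queued
        }
    }

    /// Guard drop: hand the lock to the next waiter, if any, and return the
    /// PID that should be unparked.
    pub fn unlock(&mut self) -> Option<u32> {
        self.holder = self.waiters.pop_front();
        self.holder
    }

    /// Timer expiry: returns true if `pid` was still waiting (it is removed
    /// and would be unparked with a timeout error). Returns false for a
    /// stale timer entry, which is a no-op.
    pub fn on_timeout(&mut self, pid: u32) -> bool {
        match self.waiters.iter().position(|&w| w == pid) {
            Some(i) => {
                self.waiters.remove(i);
                true
            }
            None => false,
        }
    }
}
```

A timer that fires after its actor was already granted the lock finds no matching waiter, so `on_timeout` returns false and nothing else happens.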
Two background OS threads: an epoll thread (waits on fds with EPOLLONESHOT; on ready, pushes FdReady completion) and a pool thread (runs blocking closures inside catch_unwind; pushes Blocking completion). Both write a wake pipe byte to stir the scheduler. Completions are drained inside schedule_loop.
BinaryHeap<Reverse<Entry>> = min-heap by deadline. Two Reason variants: Sleep (unpark unconditionally) and WaitTimeout (call target.on_timeout()). No cancellation — stale entries are no-ops on pop. Entries inserted by sleep() and mutex::lock_timeout().
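As an illustration of the min-heap and the "stale entries are no-ops" rule, here is a small sketch. The field layout of `Entry`, the `u64` tick deadlines, and the `still_waiting` callback are assumptions made for the example, not smarm's real types.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Reason {
    Sleep,
    WaitTimeout,
}

/// Field order matters: the derived Ord compares `deadline` first.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub struct Entry {
    deadline: u64, // illustrative tick count
    pid: u32,
    reason: Reason,
}

#[derive(Default)]
pub struct Timers {
    heap: BinaryHeap<Reverse<Entry>>, // Reverse turns the max-heap into a min-heap
}

impl Timers {
    pub fn insert(&mut self, deadline: u64, pid: u32, reason: Reason) {
        self.heap.push(Reverse(Entry { deadline, pid, reason }));
    }

    /// Pops every entry with `deadline <= now`, earliest first. Sleep entries
    /// always fire; WaitTimeout entries fire only if the target is still
    /// waiting, otherwise they are dropped as stale.
    pub fn drain_expired<F: Fn(u32) -> bool>(&mut self, now: u64, still_waiting: F) -> Vec<Entry> {
        let mut fired = Vec::new();
        while let Some(&Reverse(entry)) = self.heap.peek() {
            if entry.deadline > now {
                break;
            }
            self.heap.pop();
            match entry.reason {
                Reason::Sleep => fired.push(entry),
                Reason::WaitTimeout if still_waiting(entry.pid) => fired.push(entry),
                Reason::WaitTimeout => {} // stale: the lock was granted earlier
            }
        }
        fired
    }
}
```

Because nothing is ever cancelled, the heap needs no index or tombstone bookkeeping; the cost is the occasional dead entry that is discarded when it surfaces.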
Thin facade. Exposes spawn, yield_now, park_current, unpark, sleep, block_on_io, wait_readable, wait_writable, run. All delegate to runtime. Also owns JoinHandle and the NoPreempt RAII guard.
Just the Signal enum: Exit(Pid) or Panic(Pid, Box<dyn Any+Send>). No restart logic — that's user-space policy. Signals are delivered via the supervisor actor's own channel (Sender<Signal> stored in the child's slot).
Who Imports What
The critical insight: runtime.rs is the hub. Every substantive module either feeds into it or is orchestrated by it. scheduler.rs is purely a facade: it imports runtime and re-exports it through the public API.
Circular dependency: channel and mutex call scheduler::unpark(), which calls into runtime. And runtime's schedule_loop resumes actors that run channel/mutex code. This is intentional — it's the cooperative unpark mechanism. It works because unpark() never blocks and preemption is disabled while holding any smarm internal lock.
What Happens When You Call run(f)
Starting from user code calling smarm::run(|| { ... }). The single-threaded run() is a wrapper around runtime::init(Config::exact(1)).run(f).
Install panic hook (once)
A OnceLock guard installs a custom panic hook that suppresses output inside actor context. Without this, concurrent actor panics can deadlock Rust's default backtrace printer (non-reentrant internal lock). The previous hook is chained for panics outside actors.
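One way such a hook could look, using only std. This is a sketch, not smarm's code: the `IN_ACTOR` thread-local and `set_in_actor()` are hypothetical stand-ins for however the runtime tracks "currently inside an actor".

```rust
use std::cell::Cell;
use std::panic;
use std::sync::OnceLock;

thread_local! {
    // Hypothetical flag: true while this OS thread is running actor code.
    static IN_ACTOR: Cell<bool> = Cell::new(false);
}

static HOOK_INSTALLED: OnceLock<()> = OnceLock::new();

pub fn set_in_actor(value: bool) {
    IN_ACTOR.with(|flag| flag.set(value));
}

/// Installs the hook at most once, no matter how often it is called.
pub fn install_panic_hook() {
    HOOK_INSTALLED.get_or_init(|| {
        // Keep whatever hook was active before so it can be chained.
        let previous = panic::take_hook();
        panic::set_hook(Box::new(move |info| {
            // Inside an actor: stay silent, the runtime reports the panic
            // through its own channels. Outside: defer to the old hook.
            if !IN_ACTOR.with(|flag| flag.get()) {
                previous(info);
            }
        }));
    });
}
```

The hook only decides whether anything is printed; catching the panic itself is still the job of catch_unwind around the actor body.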
Start IoThread io.rs
Creates a wake pipe (O_NONBLOCK). Creates an epollfd. Creates a shutdown pipe and registers it in the epollfd. Spawns the epoll thread (epoll_wait loop) and the pool thread (blocking-work mpsc receiver). Both share a completion VecDeque behind a mutex.
Install RUNTIME thread-local runtime.rs
Arc<RuntimeInner> is cloned into the calling thread's RUNTIME thread-local. This makes with_runtime() work on the calling thread immediately, which the next step needs.
Spawn initial actor scheduler.rs
Calls scheduler::spawn(f). This locks SharedState, allocates a slot, creates a Stack via mmap, calls init_actor_stack() to write the initial register frame (trampoline address + 6 zero GPR slots), stores the closure in pending_closures, pushes the PID to the run queue, and returns a JoinHandle.
Spawn N-1 OS scheduler threads
For each extra thread: clone Arc<RuntimeInner>, spawn OS thread, set RUNTIME and SCHED_SLOT thread-locals, enter schedule_loop. Thread 0 is the calling thread.
Enter schedule_loop on thread 0 runtime.rs
This is a loop { drain → pop → resume → handle-intent }. Thread 0 blocks here until the run queue is empty and no timers or IO are pending. All actors run inside this loop. This call does not return until the program is done.
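The shape of that loop can be shown with a toy model. It is deliberately not how smarm resumes actors: here each "actor" is a boxed closure that returns what it wants next, there is no context switch, and the timer and IO drain steps are left out. `Step`, `Actor`, and `toy_schedule_loop` are names invented for the sketch.

```rust
use std::collections::VecDeque;

/// What the toy actor asks for when control returns to the scheduler.
pub enum Step {
    Yield,    // still runnable: push back to the queue tail
    Park,     // stay off the queue until something unparks it
    Finished, // body returned: finalize and free the slot
}

pub type Actor = Box<dyn FnMut() -> Step>;

/// Runs until the run queue is empty and returns the order in which actor
/// ids were resumed.
pub fn toy_schedule_loop(mut actors: Vec<Option<Actor>>, mut run_queue: VecDeque<usize>) -> Vec<usize> {
    let mut resumed = Vec::new();
    // pop → resume → handle-intent
    while let Some(id) = run_queue.pop_front() {
        let step = match actors[id].as_mut() {
            Some(actor) => {
                resumed.push(id);
                (*actor)()
            }
            None => continue, // slot already finalized
        };
        match step {
            Step::Yield => run_queue.push_back(id),
            Step::Park => {}
            Step::Finished => actors[id] = None,
        }
    }
    resumed
}
```

With one actor that yields twice before finishing and a second that finishes immediately, the resume order is 0, 1, 0, 0: the yielding actor goes to the tail each time and the loop exits once nothing is left to run.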
Shutdown sequence
All scheduler threads return from schedule_loop. OS threads are joined. IoThread::drop() is called: writes shutdown pipe → epoll thread exits; drops the mpsc sender → pool thread exits; closes all fds. SharedState is cleared for a potential next run() call.
The Yield → Schedule → Resume Cycle
This is the heartbeat of the entire runtime. Every context switch follows exactly this path, whether triggered by a cooperative yield, preemption, channel recv, mutex contention, or IO wait.
The 6 Yield Sources
| Source | Intent set | Who re-queues | Notes |
|---|---|---|---|
| yield_now() | Yield | Scheduler immediately | Actor stays Runnable; pushed back to queue tail |
| Allocator preemption | Yield | Scheduler immediately | RDTSC check in maybe_preempt() triggers switch_to_scheduler() |
| channel::recv() (empty) | Park | channel::send() → unpark() | Receiver PID stored in channel's parked_receiver |
| mutex::lock() (contended) | Park | MutexGuard::drop() or timer timeout | FIFO waiter queue; timeout via WaitTimeout timer entry |
| sleep(d) | Park | Timer heap → schedule_loop drain | Inserts Reason::Sleep entry; scheduler unparks on pop |
| wait_readable/writable(fd) | Park | epoll thread → completion queue → scheduler | EPOLLONESHOT; one ADD → one wakeup → one DEL per call |
New Actor From First Resume
Spawning is the trickiest part of the runtime. An actor's first resume is fundamentally different from subsequent ones because we can't "call" into a new stack; we have to ret into it.
scheduler::spawn(f) called
Allocates a slot from the free list or grows the slots vec. Assigns Pid(index, generation). Creates a Stack (64 KB mmap + guard page).
Initial stack frame written context::init_actor_stack()
Starting from (top & ~15) - 8, that is, the stack top rounded down to a 16-byte boundary and then lowered by 8, pushes downward: the trampoline function pointer as the ret address, then 6 zero words for the callee-saved registers. The resulting rsp is stored as actor.sp. No actual function call has happened yet.
high addr ← top
  top-8:  &trampoline   ← will be popped by 'ret'
  top-16: 0             ← rbx
  top-24: 0             ← rbp
  top-32: 0             ← r12
  top-40: 0             ← r13
  top-48: 0             ← r14
  top-56: 0             ← r15   ← initial rsp stored here
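The layout above can be reproduced in safe Rust against an ordinary byte buffer. This is only a sketch of the arithmetic: `init_frame` is a hypothetical helper, the buffer stands in for the mmap'd stack, and the trampoline is represented by an arbitrary `u64` rather than a real function pointer.

```rust
/// Callee-saved GPRs restored on resume: rbx, rbp, r12, r13, r14, r15.
const SAVED_GPRS: usize = 6;
const WORD: usize = 8;

/// Writes the initial frame into `stack` and returns the offset (from the
/// start of the buffer) that would be stored as the actor's initial rsp.
pub fn init_frame(stack: &mut [u8], trampoline: u64) -> usize {
    assert!(stack.len() >= 128, "buffer too small for the initial frame");
    let base = stack.as_ptr() as usize;
    // Round the end of the buffer down to 16 bytes, then step down one word.
    let top = ((base + stack.len()) & !15) - WORD;
    let mut sp = top - base;

    // Return-address slot: the first `ret` pops this and jumps to the trampoline.
    sp -= WORD;
    stack[sp..sp + WORD].copy_from_slice(&trampoline.to_ne_bytes());

    // Six zeroed slots that the first resume pops into the callee-saved GPRs.
    for _ in 0..SAVED_GPRS {
        sp -= WORD;
        stack[sp..sp + WORD].copy_from_slice(&0u64.to_ne_bytes());
    }
    sp
}
```

The returned offset sits 56 bytes below `top`: six register words plus the return address. Once those seven words have been popped, rsp equals `top`, which is 8 modulo 16, the same alignment a function sees immediately after a `call`.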
Closure stored separately
The closure Box<dyn FnOnce() + Send> goes into SharedState::pending_closures keyed by PID, not on the actor's stack. This is because we can't pass it via a register during first resume. The PID is pushed to the run queue; slot state is Runnable.
Scheduler picks up the PID, prepares first resume
Before calling switch_to_actor(), the scheduler pops the closure from pending_closures and writes it to the CURRENT_ACTOR_BOX thread-local. Then sets ACTOR_SP, sets CURRENT_PID, arms the timeslice, enables preemption.
First context switch lands in trampoline()
switch_to_actor() saves the scheduler's GPRs, loads actor.sp as the new rsp, pops the 6 zero words (restoring the "saved" registers to zero), then rets, which pops the trampoline address from the stack and jumps to it. We're now executing on the actor's stack.
trampoline() reads the closure and runs it
Takes the closure from the CURRENT_ACTOR_BOX thread-local (consuming it, so subsequent resumes skip this). Calls it inside panic::catch_unwind(AssertUnwindSafe(f)). The actor's code runs normally from here. Any yield (channel, mutex, preemption) calls switch_to_scheduler(); the scheduler saves actor state, processes intent, loops.
Actor returns → trampoline handles completion
If catch_unwind returns Ok(()), outcome is Exit. If it returns Err(payload), outcome is Panic(payload). Either way, outcome is written to LAST_OUTCOME thread-local, ACTOR_DONE is set to true, then switch_to_scheduler() is called for the last time. Scheduler sees is_actor_done() == true, calls finalize_actor(): delivers Signal to supervisor, unparks joiners, reclaims slot.
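The Ok/Err mapping is plain std and can be tried in isolation. In this sketch `run_actor_body` and `panic_message` are hypothetical helpers, and `Outcome` is assumed to have just the two variants described above; the thread-local writes and the final context switch are omitted.

```rust
use std::any::Any;
use std::panic::{self, AssertUnwindSafe};

pub enum Outcome {
    Exit,
    Panic(Box<dyn Any + Send>),
}

/// Runs the actor body and converts the catch_unwind result into an Outcome.
pub fn run_actor_body<F: FnOnce()>(f: F) -> Outcome {
    match panic::catch_unwind(AssertUnwindSafe(f)) {
        Ok(()) => Outcome::Exit,
        Err(payload) => Outcome::Panic(payload),
    }
}

/// Extracts a readable message from the payload, if the outcome is a panic.
/// panic!("literal") yields a &'static str; a formatted panic yields a String.
pub fn panic_message(outcome: &Outcome) -> Option<String> {
    match outcome {
        Outcome::Exit => None,
        Outcome::Panic(payload) => {
            if let Some(s) = payload.downcast_ref::<&'static str>() {
                Some(s.to_string())
            } else if let Some(s) = payload.downcast_ref::<String>() {
                Some(s.clone())
            } else {
                Some("<non-string panic payload>".to_string())
            }
        }
    }
}
```

A supervisor that receives the payload can use the same downcast pattern to decide how to log or react to the failure.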
Allocator-Driven Timeslicing
How it works
The PreemptingAllocator is installed as the process's #[global_allocator]. Its alloc(), alloc_zeroed(), and realloc() all call maybe_preempt() before delegating to the system allocator.
maybe_preempt() decrements a thread-local counter. Every 128 allocations (default), it reads RDTSC. If rdtsc() - timeslice_start > 300_000 cycles (~100µs at 3 GHz) and PREEMPTION_ENABLED == true, it calls switch_to_scheduler().
The check!() macro calls the same maybe_preempt() function — for tight loops that make no allocations.
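A wrapper of this kind might look roughly as follows. This is a sketch with stand-ins: `HookedAlloc` is a hypothetical name, and `maybe_preempt_stub()` only counts calls where the real hook would run the RDTSC check. Leaving dealloc unhooked is an assumption based on the three methods listed above.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

static HOOK_CALLS: AtomicUsize = AtomicUsize::new(0);

/// Stand-in for maybe_preempt(): just records that the hook ran.
fn maybe_preempt_stub() {
    HOOK_CALLS.fetch_add(1, Ordering::Relaxed);
}

pub fn hook_calls() -> usize {
    HOOK_CALLS.load(Ordering::Relaxed)
}

pub struct HookedAlloc;

unsafe impl GlobalAlloc for HookedAlloc {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        maybe_preempt_stub();
        unsafe { System.alloc(layout) }
    }

    unsafe fn alloc_zeroed(&self, layout: Layout) -> *mut u8 {
        maybe_preempt_stub();
        unsafe { System.alloc_zeroed(layout) }
    }

    unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
        maybe_preempt_stub();
        unsafe { System.realloc(ptr, layout, new_size) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        // Frees pass straight through without a preemption check.
        unsafe { System.dealloc(ptr, layout) }
    }
}
```

To route every heap allocation in a program through it, the type would be registered with `#[global_allocator] static ALLOC: HookedAlloc = HookedAlloc;`. The hook itself must not allocate, or it would re-enter the allocator.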
Invariant: preemption must be off when holding smarm locks
If preemption fired while the scheduler held SharedState, the context switch would try to re-acquire the same mutex → deadlock. smarm prevents this with:
- PREEMPTION_ENABLED = false in the scheduler loop before/after switch_to_actor()
- with_shared() saves and disables preemption while the mutex is held
- NoPreempt RAII guard used in channel/mutex slow paths
- trace::record() also disables preemption (it can allocate)
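A save-and-restore guard of this kind is short to write. The sketch below assumes the flag is a thread-local `Cell<bool>`; the `was_enabled` field and the `preemption_enabled()` helper are illustrative and not taken from smarm's source.

```rust
use std::cell::Cell;

thread_local! {
    static PREEMPTION_ENABLED: Cell<bool> = Cell::new(true);
}

/// While this guard is alive, preemption is off on the current thread.
pub struct NoPreempt {
    was_enabled: bool, // value to restore on drop
}

impl NoPreempt {
    pub fn new() -> Self {
        // replace() returns the previous value, so nesting restores correctly.
        let was_enabled = PREEMPTION_ENABLED.with(|flag| flag.replace(false));
        NoPreempt { was_enabled }
    }
}

impl Drop for NoPreempt {
    fn drop(&mut self) {
        PREEMPTION_ENABLED.with(|flag| flag.set(self.was_enabled));
    }
}

pub fn preemption_enabled() -> bool {
    PREEMPTION_ENABLED.with(|flag| flag.get())
}
```

Saving the previous value instead of unconditionally writing `true` on drop is what makes nested guards safe: an inner guard dropped inside an outer one leaves preemption disabled.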
Known gap: tight no-alloc loops are invisible without explicit check!() calls. This is documented and by design — such loops are uncommon in message-passing workloads.
// preempt.rs (simplified)
pub fn maybe_preempt() {
    ALLOC_COUNT.with(|c| {
        let n = c.get();
        if n == 0 {
            c.set(ACTIVE_ALLOC_INTERVAL.with(|i| i.get())); // reset counter
            if PREEMPTION_ENABLED.with(|e| e.get()) {
                let elapsed = rdtsc() - TIMESLICE_START.with(|s| s.get());
                if elapsed > ACTIVE_TIMESLICE_CYCLES.with(|i| i.get()) {
                    unsafe { switch_to_scheduler() }; // YieldIntent::Yield
                }
            }
        } else {
            c.set(n - 1);
        }
    });
}
Two Background Threads, One Wake Pipe
epoll_ctl ADD/DEL is called by the scheduler thread directly on the epollfd. This is legal per the epoll_ctl(2) man page even while the epoll thread is inside epoll_wait, and it avoids needing a second command channel.
Things That Would Bite You
Between registering as a channel's parked_receiver and calling park_current(), a sender could call unpark(). At that moment the actor is still Runnable, so unpark() sets pending_unpark = true instead of re-queuing. The scheduler checks this flag after the Park yield and re-queues immediately rather than parking. This flag also protects epoll and mutex paths.
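The two orderings can be modelled as a tiny state machine. This is a sketch only: `Slot`, `State`, the `queued` flag (a stand-in for "the PID is on the run queue"), and the method names are hypothetical.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Runnable,
    Parked,
}

pub struct Slot {
    pub state: State,
    pub pending_unpark: bool,
    pub queued: bool, // stand-in for "PID is on the run queue"
}

impl Slot {
    /// An actor that is currently executing.
    pub fn running() -> Self {
        Slot { state: State::Runnable, pending_unpark: false, queued: false }
    }

    /// Called by a sender, a mutex release, or an IO completion.
    pub fn unpark(&mut self) {
        match self.state {
            State::Parked => {
                self.state = State::Runnable;
                self.queued = true;
            }
            // Target has not finished parking yet: remember the wakeup.
            State::Runnable => self.pending_unpark = true,
        }
    }

    /// Called by the scheduler after the actor yields with intent Park.
    pub fn handle_park_intent(&mut self) {
        if self.pending_unpark {
            // The wakeup raced ahead of the park: consume it and re-queue.
            self.pending_unpark = false;
            self.queued = true;
        } else {
            self.state = State::Parked;
        }
    }
}
```

Whether unpark() arrives before or after the park, the actor ends up Runnable and on the queue; without the flag, the early wakeup would be lost and the actor would sleep indefinitely.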
std::thread::sleep inside actor
Blocks the entire OS scheduler thread, which cannot run any other actor until the sleep returns; with a single scheduler thread, every actor stalls. There's no detection. Use smarm::sleep(d) instead.
SharedState
The with_shared() helper disables preemption while the mutex is held. But any code path that allocates inside with_shared and then tries to acquire SharedState again will deadlock. All internal smarm code is carefully structured to avoid this.
All N scheduler threads contend on a single Mutex<SharedState>. This is the primary scalability ceiling, visible in the benchmark suite as "tokio-favored" scenarios. Identified, documented, deferred. The fix would be per-thread deques with work stealing.
When a mutex lock is granted before its timeout, the timer entry stays in the heap. It fires eventually, the callback sees "actor is no longer waiting" and no-ops. Cost is ~32 bytes and a few cycles per stale entry. Bounded by one entry per parked actor.
If an actor dies while waiting on an fd, the epoll registration is leaked. EPOLLONESHOT bounds damage to one stale wakeup, which the scheduler drops when it can't find the PID in waiters. Noted in io.rs as a known gap for a future pass.
Not saving XMM registers in the context switch is intentional and correct. XMM0–15 are caller-saved in the SysV AMD64 ABI. Every yield passes through a Rust call site, so the compiler has already spilled live XMM values to the actor's stack before we get to the naked asm. They're restored when the actor resumes because they're on its own stack.
panic = unwind is required
The trampoline uses catch_unwind to intercept actor panics before they reach the naked assembly shim. If a user sets panic = abort, panics kill the process instead of being caught, and the supervision tree collapses to process death. This is documented and the profile is set in Cargo.toml.