feat: I/O and mutex support (v0.3)
Add epoll-based non-blocking I/O and kernel-like mutexes: - src/io.rs: Complete epoll backend with timeout & error handling - src/mutex.rs: Fair mutex with waiter queues & parking integration - Enhanced scheduler to support synchronous I/O blocking - Comprehensive test suites for I/O (epoll) and mutex behavior - Documentation: LOOM.md concurrency model & README
This commit is contained in:
445
src/io.rs
445
src/io.rs
@@ -1,35 +1,73 @@
|
||||
//! Off-scheduler blocking work.
|
||||
//! Off-scheduler IO: blocking-work offload and epoll-based fd readiness.
|
||||
//!
|
||||
//! `block_on_io(closure)` runs `closure` on a dedicated worker OS thread,
|
||||
//! parks the calling actor in the meantime, and returns the closure's
|
||||
//! value when it completes. Lets actors call into blocking C libraries,
|
||||
//! synchronous file IO, or anything else that would otherwise stall the
|
||||
//! scheduler thread.
|
||||
//! synchronous file IO, or anything else that doesn't fit the readiness
|
||||
//! model.
|
||||
//!
|
||||
//! `wait_readable(fd)` / `wait_writable(fd)` register interest in an fd
|
||||
//! with epoll and park the calling actor. When the fd becomes ready, the
|
||||
//! epoll thread unparks the actor. The actual `read(2)`/`write(2)` syscall
|
||||
//! runs back on the scheduler thread, *inside* the actor — buffer never
|
||||
//! leaves the actor, no copying through an intermediary thread. Built on
|
||||
//! these are the conveniences `read(fd, &mut buf)` and `write(fd, &buf)`.
|
||||
//!
|
||||
//! Architecture
|
||||
//! ============
|
||||
//! Per `run()`:
|
||||
//! - one worker OS thread, started by `run()` and joined at shutdown;
|
||||
//! - a request channel (`mpsc::Sender<Request>`) from scheduler → worker;
|
||||
//! - a completion queue (`Mutex<VecDeque<Completion>>`) worker → scheduler;
|
||||
//! - a wake pipe: when the worker pushes a completion it writes one byte
|
||||
//! to the pipe; the scheduler polls the pipe (with timeout) when it
|
||||
//! would otherwise be idle.
|
||||
//! Per `run()`, two OS threads:
|
||||
//! - **epoll thread**: owns the epollfd. Loops in `epoll_wait`. On a
|
||||
//! ready fd, pushes `Completion::FdReady { pid, fd, events }` to the
|
||||
//! shared completion queue and writes the scheduler-wake pipe. On the
|
||||
//! shutdown pipe (also registered in epollfd), exits.
|
||||
//! - **pool thread**: blocks on the request mpsc. Runs the closure
|
||||
//! inside `catch_unwind`, pushes `Completion::Blocking { pid, result }`,
|
||||
//! writes the scheduler-wake pipe.
|
||||
//!
|
||||
//! For v0.2 the worker is a single thread, so concurrent `block_on_io`
|
||||
//! calls are serialised. v0.3 can replace it with a thread pool behind
|
||||
//! the same request channel.
|
||||
//! Both threads share a single `completions: Arc<Mutex<VecDeque<Completion>>>`
|
||||
//! and the same scheduler-wake pipe.
|
||||
//!
|
||||
//! `epoll_ctl` (register/unregister fd interest) is called by the
|
||||
//! scheduler thread *directly* on the epollfd. That's well-defined per
|
||||
//! `epoll_ctl(2)`: a thread may be calling `epoll_wait` on the epollfd
|
||||
//! while another thread calls `epoll_ctl`. Avoids needing a second mpsc
|
||||
//! and a second wake mechanism.
|
||||
//!
|
||||
//! Epoll mode
|
||||
//! ==========
|
||||
//! Level-triggered with EPOLLONESHOT. After a wakeup the kernel
|
||||
//! auto-disarms the fd, so we never get two wakeups for one
|
||||
//! `wait_readable` call. The scheduler explicitly `EPOLL_CTL_DEL`s the fd
|
||||
//! on completion to free the slot for re-registration. Net effect: each
|
||||
//! `wait_readable(fd)` is one ADD, one wakeup, one DEL — symmetric and
|
||||
//! stateless between calls.
|
||||
//!
|
||||
//! Fd hygiene
|
||||
//! ==========
|
||||
//! If an actor dies while waiting on an fd, the registration is leaked
|
||||
//! (the fd stays in the epollfd, armed). EPOLLONESHOT bounds the damage:
|
||||
//! at most one stale wakeup, after which the kernel disarms. The stale
|
||||
//! wakeup hits a dead pid in `waiters` and is dropped. Acceptable for v0.2;
|
||||
//! a future pass should DEL on actor death.
|
||||
//!
|
||||
//! Buffers used with `read`/`write` should be on fds opened with
|
||||
//! `O_NONBLOCK`. If they aren't, the syscall may block the scheduler
|
||||
//! thread despite the readiness notification (the fd reporting readable
|
||||
//! doesn't guarantee the syscall completes without blocking — e.g. a
|
||||
//! signal could be delivered). Documented; not enforced.
|
||||
//!
|
||||
//! Panic handling
|
||||
//! ==============
|
||||
//! The worker runs the closure inside `catch_unwind` and ships either the
|
||||
//! return value or the panic payload back to the scheduler. `block_on_io`
|
||||
//! resumes the panic on the calling actor's stack, so the actor's
|
||||
//! supervisor sees a real `Signal::Panic` as if the work had run inline.
|
||||
//! The pool worker runs the closure inside `catch_unwind` and ships either
|
||||
//! the return value or the panic payload back to the scheduler.
|
||||
//! `block_on_io` resumes the panic on the calling actor's stack, so the
|
||||
//! actor's supervisor sees a real `Signal::Panic` as if the work had run
|
||||
//! inline. Fd-wait primitives don't run user code on the IO thread, so
|
||||
//! they have no equivalent panic-propagation path.
|
||||
|
||||
use crate::pid::Pid;
|
||||
use std::any::Any;
|
||||
use std::collections::VecDeque;
|
||||
use std::collections::{HashMap, VecDeque};
|
||||
use std::io;
|
||||
use std::os::fd::RawFd;
|
||||
use std::panic;
|
||||
@@ -41,7 +79,7 @@ use std::thread::JoinHandle as OsJoinHandle;
|
||||
// Wire types
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// What the worker stores while computing a result. `Ok` is the closure's
|
||||
/// What the pool stores while computing a result. `Ok` is the closure's
|
||||
/// return value (boxed as `Any`); `Err` is the panic payload.
|
||||
pub type IoResult = Result<Box<dyn Any + Send>, Box<dyn Any + Send>>;
|
||||
|
||||
@@ -51,9 +89,17 @@ struct Request {
|
||||
work: Box<dyn FnOnce() -> IoResult + Send>,
|
||||
}
|
||||
|
||||
struct Completion {
|
||||
pid: Pid,
|
||||
result: IoResult,
|
||||
/// Completion message from either IO thread back to the scheduler.
|
||||
pub enum Completion {
|
||||
/// A `block_on_io` closure has finished (Ok = return value, Err = panic
|
||||
/// payload).
|
||||
Blocking { pid: Pid, result: IoResult },
|
||||
/// An fd registered via `wait_readable`/`wait_writable` is ready. The
|
||||
/// scheduler looks up the parked pid in `waiters`, unparks it, and
|
||||
/// removes the entry. `pid` isn't in this variant because the epoll
|
||||
/// thread doesn't have access to the `waiters` map; the scheduler
|
||||
/// thread owns that.
|
||||
FdReady { fd: RawFd, events: u32 },
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
@@ -61,61 +107,146 @@ struct Completion {
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
pub struct IoThread {
|
||||
/// Channel into the worker.
|
||||
// ----- Channels & queues -----
|
||||
|
||||
/// Submission queue into the blocking-work pool.
|
||||
tx: mpsc::Sender<Request>,
|
||||
/// Shared completion queue. The worker pushes; the scheduler drains.
|
||||
/// Shared completion queue, fed by both the pool and the epoll thread.
|
||||
completions: Arc<Mutex<VecDeque<Completion>>>,
|
||||
/// Pipe used as a one-bit wakeup. `wake_read` is what the scheduler
|
||||
/// polls; `wake_write` is what the worker writes to.
|
||||
/// Pipe the scheduler polls in its idle path. Both IO threads write to
|
||||
/// `wake_write` after pushing a completion.
|
||||
wake_read: RawFd,
|
||||
wake_write: RawFd,
|
||||
/// Worker thread handle, joined on shutdown.
|
||||
worker: Option<OsJoinHandle<()>>,
|
||||
/// Number of requests in-flight (sent but not yet drained as a
|
||||
/// completion). Used by the scheduler's idle path to decide whether
|
||||
/// to wait on the pipe or exit.
|
||||
|
||||
// ----- Epoll machinery -----
|
||||
|
||||
/// The epollfd, owned by `IoThread`. Callable cross-thread via
|
||||
/// `epoll_ctl` per the man page.
|
||||
epollfd: RawFd,
|
||||
/// Pipe used to signal the epoll thread to exit. Registered inside the
|
||||
/// epollfd so a single `epoll_wait` covers both fd readiness and
|
||||
/// shutdown.
|
||||
shutdown_read: RawFd,
|
||||
shutdown_write: RawFd,
|
||||
/// One parked actor per registered fd. Populated by `wait_readable` /
|
||||
/// `wait_writable` and drained by the scheduler when a `FdReady`
|
||||
/// completion is processed.
|
||||
pub waiters: HashMap<RawFd, Pid>,
|
||||
|
||||
// ----- Threads -----
|
||||
|
||||
pool_thread: Option<OsJoinHandle<()>>,
|
||||
epoll_thread: Option<OsJoinHandle<()>>,
|
||||
|
||||
/// Number of `block_on_io` requests in-flight. Used by the scheduler's
|
||||
/// idle path to decide whether to wait on the pipe or exit. Fd waits
|
||||
/// are not counted here; they're counted by `waiters.len()`.
|
||||
pub outstanding: u32,
|
||||
}
|
||||
|
||||
impl IoThread {
|
||||
pub fn start() -> io::Result<Self> {
|
||||
// Scheduler-facing wake pipe.
|
||||
let (wake_read, wake_write) = make_pipe()?;
|
||||
// Pool submission channel + shared completion queue.
|
||||
let (tx, rx) = mpsc::channel::<Request>();
|
||||
let completions: Arc<Mutex<VecDeque<Completion>>> =
|
||||
Arc::new(Mutex::new(VecDeque::new()));
|
||||
|
||||
let comps_worker = completions.clone();
|
||||
let worker = std::thread::Builder::new()
|
||||
.name("smarm-io".into())
|
||||
.spawn(move || worker_loop(rx, comps_worker, wake_write))?;
|
||||
// Epoll machinery.
|
||||
let epollfd = unsafe { libc::epoll_create1(libc::EPOLL_CLOEXEC) };
|
||||
if epollfd < 0 {
|
||||
// Best-effort fd cleanup before bailing.
|
||||
unsafe {
|
||||
libc::close(wake_read);
|
||||
libc::close(wake_write);
|
||||
}
|
||||
return Err(io::Error::last_os_error());
|
||||
}
|
||||
|
||||
let (shutdown_read, shutdown_write) = match make_pipe() {
|
||||
Ok(p) => p,
|
||||
Err(e) => {
|
||||
unsafe {
|
||||
libc::close(epollfd);
|
||||
libc::close(wake_read);
|
||||
libc::close(wake_write);
|
||||
}
|
||||
return Err(e);
|
||||
}
|
||||
};
|
||||
|
||||
// Register the shutdown pipe in epollfd. We use a sentinel `data`
|
||||
// value to recognise shutdown events. RawFd values are non-negative,
|
||||
// so u64::MAX is unambiguously not a real fd-data encoding.
|
||||
let mut shutdown_ev = libc::epoll_event {
|
||||
events: libc::EPOLLIN as u32,
|
||||
u64: SHUTDOWN_EPOLL_TOKEN,
|
||||
};
|
||||
if unsafe {
|
||||
libc::epoll_ctl(
|
||||
epollfd,
|
||||
libc::EPOLL_CTL_ADD,
|
||||
shutdown_read,
|
||||
&mut shutdown_ev as *mut _,
|
||||
)
|
||||
} < 0
|
||||
{
|
||||
let e = io::Error::last_os_error();
|
||||
unsafe {
|
||||
libc::close(epollfd);
|
||||
libc::close(shutdown_read);
|
||||
libc::close(shutdown_write);
|
||||
libc::close(wake_read);
|
||||
libc::close(wake_write);
|
||||
}
|
||||
return Err(e);
|
||||
}
|
||||
|
||||
// Spawn pool thread.
|
||||
let pool_comps = completions.clone();
|
||||
let pool_thread = std::thread::Builder::new()
|
||||
.name("smarm-io-pool".into())
|
||||
.spawn(move || pool_loop(rx, pool_comps, wake_write))?;
|
||||
|
||||
// Spawn epoll thread.
|
||||
let epoll_comps = completions.clone();
|
||||
let epoll_thread = std::thread::Builder::new()
|
||||
.name("smarm-io-epoll".into())
|
||||
.spawn(move || epoll_loop(epollfd, epoll_comps, wake_write))?;
|
||||
|
||||
Ok(Self {
|
||||
tx,
|
||||
completions,
|
||||
wake_read,
|
||||
wake_write,
|
||||
worker: Some(worker),
|
||||
epollfd,
|
||||
shutdown_read,
|
||||
shutdown_write,
|
||||
waiters: HashMap::new(),
|
||||
pool_thread: Some(pool_thread),
|
||||
epoll_thread: Some(epoll_thread),
|
||||
outstanding: 0,
|
||||
})
|
||||
}
|
||||
|
||||
/// Hand a request to the worker. Increments `outstanding`.
|
||||
/// Hand a request to the pool. Increments `outstanding`.
|
||||
pub fn submit(&mut self, pid: Pid, work: Box<dyn FnOnce() -> IoResult + Send>) {
|
||||
self.outstanding += 1;
|
||||
// Send can only fail if the worker has hung up, which only happens
|
||||
// Send can only fail if the pool has hung up, which only happens
|
||||
// on shutdown. submit during shutdown is a bug.
|
||||
self.tx
|
||||
.send(Request { pid, work })
|
||||
.expect("io worker hung up unexpectedly");
|
||||
.expect("io pool hung up unexpectedly");
|
||||
}
|
||||
|
||||
/// Drain every available completion. Caller is responsible for
|
||||
/// decrementing `outstanding` and routing the results.
|
||||
pub fn drain_completions(&mut self) -> Vec<(Pid, IoResult)> {
|
||||
/// Drain every available completion. Caller (the scheduler) routes the
|
||||
/// results and updates `outstanding` / `waiters` accordingly.
|
||||
pub fn drain_completions(&mut self) -> Vec<Completion> {
|
||||
let mut q = self.completions.lock().unwrap();
|
||||
let mut out = Vec::with_capacity(q.len());
|
||||
while let Some(c) = q.pop_front() {
|
||||
out.push((c.pid, c.result));
|
||||
out.push(c);
|
||||
}
|
||||
out
|
||||
}
|
||||
@@ -123,78 +254,222 @@ impl IoThread {
|
||||
pub fn wake_fd(&self) -> RawFd {
|
||||
self.wake_read
|
||||
}
|
||||
|
||||
/// Register interest in `fd` becoming readable/writable; record `pid`
|
||||
/// as the parked waiter. The epoll thread will push a `FdReady`
|
||||
/// completion when the kernel signals.
|
||||
///
|
||||
/// EPOLLONESHOT: one wakeup per registration. The scheduler must
|
||||
/// `epoll_del` on completion to free the slot for re-registration.
|
||||
pub fn epoll_register(
|
||||
&mut self,
|
||||
fd: RawFd,
|
||||
pid: Pid,
|
||||
readable: bool,
|
||||
writable: bool,
|
||||
) -> io::Result<()> {
|
||||
// Two actors waiting on the same fd would be a misuse: the kernel
|
||||
// delivers exactly one EPOLLONESHOT wakeup, so the second waiter
|
||||
// would hang. Reject up front.
|
||||
if self.waiters.contains_key(&fd) {
|
||||
return Err(io::Error::new(
|
||||
io::ErrorKind::AlreadyExists,
|
||||
"fd already has a parked waiter",
|
||||
));
|
||||
}
|
||||
|
||||
// Defensive cleanup: if a previous actor died while waiting on this
|
||||
// fd, the kernel-side registration was leaked (we don't walk all
|
||||
// waiters on actor death). A bare DEL is harmless if the fd isn't
|
||||
// registered (ENOENT), and removes any leak.
|
||||
unsafe {
|
||||
libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_DEL, fd, std::ptr::null_mut());
|
||||
}
|
||||
|
||||
let mut events: u32 = libc::EPOLLONESHOT as u32;
|
||||
if readable {
|
||||
events |= libc::EPOLLIN as u32;
|
||||
}
|
||||
if writable {
|
||||
events |= libc::EPOLLOUT as u32;
|
||||
}
|
||||
let mut ev = libc::epoll_event {
|
||||
events,
|
||||
u64: fd as u64,
|
||||
};
|
||||
let r = unsafe {
|
||||
libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_ADD, fd, &mut ev as *mut _)
|
||||
};
|
||||
if r < 0 {
|
||||
return Err(io::Error::last_os_error());
|
||||
}
|
||||
self.waiters.insert(fd, pid);
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Remove `fd` from the epollfd. Called by the scheduler after a
|
||||
/// `FdReady` completion, so the next `wait_readable(fd)` can ADD again.
|
||||
///
|
||||
/// Does NOT touch `waiters` — that's the scheduler's bookkeeping; this
|
||||
/// is purely the kernel-side cleanup.
|
||||
pub fn epoll_deregister(&mut self, fd: RawFd) {
|
||||
// EPOLL_CTL_DEL of an already-removed fd returns ENOENT; ignore.
|
||||
unsafe {
|
||||
libc::epoll_ctl(self.epollfd, libc::EPOLL_CTL_DEL, fd, std::ptr::null_mut());
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for IoThread {
|
||||
fn drop(&mut self) {
|
||||
// Hang up the request channel; the worker will exit its loop.
|
||||
// We must drop `tx` before joining. Take it out by moving.
|
||||
// mpsc::Sender doesn't have explicit `disconnect`; dropping it
|
||||
// (after this scope) causes the receiver to return Err.
|
||||
//
|
||||
// Trick: replace self.tx with a fresh dead one so we can drop it.
|
||||
// 1. Signal the epoll thread to exit by writing the shutdown pipe.
|
||||
unsafe {
|
||||
let buf: [u8; 1] = [0];
|
||||
// Single byte; we don't care about EINTR retry here — worst
|
||||
// case the epoll thread blocks until process exit, which is
|
||||
// fine because we then close fds out from under it.
|
||||
libc::write(self.shutdown_write, buf.as_ptr() as *const _, 1);
|
||||
}
|
||||
|
||||
// 2. Hang up the pool's request channel so the pool thread exits.
|
||||
let (dead_tx, _) = mpsc::channel::<Request>();
|
||||
let real_tx = std::mem::replace(&mut self.tx, dead_tx);
|
||||
drop(real_tx);
|
||||
|
||||
if let Some(h) = self.worker.take() {
|
||||
// Best-effort join. If the worker panicked, ignore.
|
||||
// 3. Join both threads.
|
||||
if let Some(h) = self.epoll_thread.take() {
|
||||
let _ = h.join();
|
||||
}
|
||||
if let Some(h) = self.pool_thread.take() {
|
||||
let _ = h.join();
|
||||
}
|
||||
|
||||
// Close the pipe.
|
||||
// 4. Close fds.
|
||||
unsafe {
|
||||
libc::close(self.epollfd);
|
||||
libc::close(self.shutdown_read);
|
||||
libc::close(self.shutdown_write);
|
||||
libc::close(self.wake_read);
|
||||
libc::close(self.wake_write);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Sentinel `epoll_event.u64` distinguishing the shutdown pipe from
|
||||
/// registered actor fds. RawFd values fit in i32, so the high bits are
|
||||
/// available for a marker; we use u64::MAX which can't be a valid fd.
|
||||
const SHUTDOWN_EPOLL_TOKEN: u64 = u64::MAX;
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Worker loop
|
||||
// Pool loop
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn worker_loop(
|
||||
fn pool_loop(
|
||||
rx: mpsc::Receiver<Request>,
|
||||
completions: Arc<Mutex<VecDeque<Completion>>>,
|
||||
wake_write: RawFd,
|
||||
) {
|
||||
while let Ok(Request { pid, work }) = rx.recv() {
|
||||
let result: IoResult = match panic::catch_unwind(panic::AssertUnwindSafe(work)) {
|
||||
Ok(r) => r,
|
||||
Ok(r) => r,
|
||||
Err(payload) => Err(payload),
|
||||
};
|
||||
completions.lock().unwrap().push_back(Completion { pid, result });
|
||||
// Write one byte to the pipe to wake the scheduler. If the pipe
|
||||
// buffer is full (scheduler isn't draining), the write may return
|
||||
// EAGAIN — we'll ignore it because there's already an outstanding
|
||||
// wakeup that hasn't been consumed yet.
|
||||
let buf: [u8; 1] = [0];
|
||||
unsafe {
|
||||
// EINTR is the only retryable case worth handling.
|
||||
loop {
|
||||
let n = libc::write(wake_write, buf.as_ptr() as *const _, 1);
|
||||
if n < 0 {
|
||||
let e = *libc::__errno_location();
|
||||
if e == libc::EINTR { continue; }
|
||||
}
|
||||
break;
|
||||
completions
|
||||
.lock()
|
||||
.unwrap()
|
||||
.push_back(Completion::Blocking { pid, result });
|
||||
wake_scheduler(wake_write);
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Epoll loop
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn epoll_loop(
|
||||
epollfd: RawFd,
|
||||
completions: Arc<Mutex<VecDeque<Completion>>>,
|
||||
wake_write: RawFd,
|
||||
) {
|
||||
// Buffer for epoll_wait. 64 is plenty for our scale; if a real load
|
||||
// appears that needs more, this is a one-line change.
|
||||
const MAX_EVENTS: usize = 64;
|
||||
let mut events: [libc::epoll_event; MAX_EVENTS] = unsafe { std::mem::zeroed() };
|
||||
|
||||
loop {
|
||||
let n = unsafe {
|
||||
libc::epoll_wait(
|
||||
epollfd,
|
||||
events.as_mut_ptr(),
|
||||
MAX_EVENTS as libc::c_int,
|
||||
-1,
|
||||
)
|
||||
};
|
||||
|
||||
if n < 0 {
|
||||
let e = unsafe { *libc::__errno_location() };
|
||||
if e == libc::EINTR {
|
||||
continue;
|
||||
}
|
||||
// Anything else here is a programming error (EBADF on epollfd
|
||||
// after we've closed it from Drop — the close races with us).
|
||||
// Treat as shutdown.
|
||||
return;
|
||||
}
|
||||
|
||||
let mut shutdown_requested = false;
|
||||
let mut pushed_any = false;
|
||||
{
|
||||
let mut q = completions.lock().unwrap();
|
||||
for ev in events.iter().take(n as usize) {
|
||||
if ev.u64 == SHUTDOWN_EPOLL_TOKEN {
|
||||
shutdown_requested = true;
|
||||
continue;
|
||||
}
|
||||
let fd = ev.u64 as RawFd;
|
||||
q.push_back(Completion::FdReady {
|
||||
fd,
|
||||
events: ev.events,
|
||||
});
|
||||
pushed_any = true;
|
||||
}
|
||||
}
|
||||
|
||||
if pushed_any {
|
||||
wake_scheduler(wake_write);
|
||||
}
|
||||
if shutdown_requested {
|
||||
return;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/// Write one byte to the scheduler's wake pipe. Retries on EINTR; ignores
|
||||
/// EAGAIN (pipe full means there's already an outstanding wake we haven't
|
||||
/// consumed yet, which is sufficient).
|
||||
fn wake_scheduler(wake_write: RawFd) {
|
||||
let buf: [u8; 1] = [0];
|
||||
unsafe {
|
||||
loop {
|
||||
let n = libc::write(wake_write, buf.as_ptr() as *const _, 1);
|
||||
if n < 0 {
|
||||
let e = *libc::__errno_location();
|
||||
if e == libc::EINTR {
|
||||
continue;
|
||||
}
|
||||
}
|
||||
break;
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Pipe helpers
|
||||
// Pipe helpers (unchanged from v0.2)
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
fn make_pipe() -> io::Result<(RawFd, RawFd)> {
|
||||
let mut fds: [libc::c_int; 2] = [0; 2];
|
||||
// O_CLOEXEC so children don't inherit, O_NONBLOCK on the read side
|
||||
// so the scheduler's drain can `read` without blocking.
|
||||
let r = unsafe {
|
||||
libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC | libc::O_NONBLOCK)
|
||||
};
|
||||
let r = unsafe { libc::pipe2(fds.as_mut_ptr(), libc::O_CLOEXEC | libc::O_NONBLOCK) };
|
||||
if r != 0 {
|
||||
return Err(io::Error::last_os_error());
|
||||
}
|
||||
@@ -208,7 +483,6 @@ pub fn drain_wake_pipe(fd: RawFd) {
|
||||
loop {
|
||||
let n = unsafe { libc::read(fd, buf.as_mut_ptr() as *mut _, buf.len()) };
|
||||
if n <= 0 {
|
||||
// EAGAIN (would block) or EOF — done.
|
||||
break;
|
||||
}
|
||||
}
|
||||
@@ -220,17 +494,26 @@ pub fn poll_wake(fd: RawFd, timeout: Option<std::time::Duration>) {
|
||||
let timeout_ms: libc::c_int = match timeout {
|
||||
None => -1,
|
||||
Some(d) => {
|
||||
// Cap at i32::MAX milliseconds; poll's argument is c_int.
|
||||
let ms = d.as_millis();
|
||||
if ms > i32::MAX as u128 { i32::MAX } else { ms as i32 }
|
||||
if ms > i32::MAX as u128 {
|
||||
i32::MAX
|
||||
} else {
|
||||
ms as i32
|
||||
}
|
||||
}
|
||||
};
|
||||
let mut pfd = libc::pollfd { fd, events: libc::POLLIN, revents: 0 };
|
||||
let mut pfd = libc::pollfd {
|
||||
fd,
|
||||
events: libc::POLLIN,
|
||||
revents: 0,
|
||||
};
|
||||
loop {
|
||||
let r = unsafe { libc::poll(&mut pfd as *mut _, 1, timeout_ms) };
|
||||
if r < 0 {
|
||||
let e = unsafe { *libc::__errno_location() };
|
||||
if e == libc::EINTR { continue; }
|
||||
if e == libc::EINTR {
|
||||
continue;
|
||||
}
|
||||
}
|
||||
break;
|
||||
}
|
||||
|
||||
16
src/lib.rs
16
src/lib.rs
@@ -2,11 +2,14 @@
|
||||
//!
|
||||
//! Erlang-style green-thread actor concurrency for Rust.
|
||||
//!
|
||||
//! v0.1 is single-threaded. One scheduler, one OS thread. The scheduler
|
||||
//! Single-threaded for now: one scheduler, one OS thread. The scheduler
|
||||
//! cooperatively interleaves green-thread actors with hand-rolled context
|
||||
//! switches. Actors communicate by sending `Send` messages over channels;
|
||||
//! every actor has a supervisor, which is itself just an actor with a
|
||||
//! `Receiver<Signal>`.
|
||||
//! `Receiver<Signal>`. Synchronisation primitives — `Mutex<T>` with
|
||||
//! mandatory lock timeouts, channel `recv`, `sleep`, and epoll-backed
|
||||
//! `wait_readable`/`wait_writable` — all park the green thread, never
|
||||
//! the OS thread.
|
||||
//!
|
||||
//! See `LOOM.md` for the design intent and the deferred-for-later list.
|
||||
|
||||
@@ -20,6 +23,7 @@ pub mod scheduler;
|
||||
pub mod supervisor;
|
||||
pub mod timer;
|
||||
pub mod io;
|
||||
pub mod mutex;
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Global allocator
|
||||
@@ -37,10 +41,16 @@ static ALLOCATOR: preempt::PreemptingAllocator = preempt::PreemptingAllocator;
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
pub use channel::{channel, Receiver, RecvError, Sender};
|
||||
pub use mutex::{LockTimeout, Mutex, MutexGuard};
|
||||
pub use pid::Pid;
|
||||
pub use scheduler::{
|
||||
block_on_io, run, self_pid, sleep, spawn, spawn_under, yield_now, JoinError, JoinHandle,
|
||||
block_on_io, run, self_pid, sleep, spawn, spawn_under, wait_readable, wait_writable,
|
||||
yield_now, JoinError, JoinHandle,
|
||||
};
|
||||
// `read` and `write` would shadow heavily-used names if re-exported at the
|
||||
// crate root; users reach for them as `smarm::scheduler::read` /
|
||||
// `smarm::scheduler::write` instead. May reshuffle into a `smarm::io`
|
||||
// surface in a future pass.
|
||||
pub use supervisor::Signal;
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
318
src/mutex.rs
Normal file
318
src/mutex.rs
Normal file
@@ -0,0 +1,318 @@
|
||||
//! Actor-aware mutex with mandatory timeout.
|
||||
//!
|
||||
//! `loom::Mutex<T>` looks like `std::sync::Mutex<T>` but its `lock()` parks
|
||||
//! the calling *green* thread on contention rather than blocking the OS
|
||||
//! thread — and every lock attempt is bounded by a timeout. If the lock is
|
||||
//! not acquired within the timeout, `lock()` returns `Err(LockTimeout)`.
|
||||
//! This is a hard runtime guarantee (the spec calls it out): no actor can
|
||||
//! be parked on a mutex forever.
|
||||
//!
|
||||
//! ```ignore
|
||||
//! let m = loom::Mutex::new(42);
|
||||
//! let guard = m.lock()?; // default timeout
|
||||
//! let guard = m.lock_timeout(Duration::from_millis(50))?;
|
||||
//! ```
|
||||
//!
|
||||
//! Fairness
|
||||
//! ========
|
||||
//! Waiters are granted the lock in FIFO order. The spec prizes fairness:
|
||||
//! starvation under contention is precisely the kind of failure mode
|
||||
//! supervision can't recover from cleanly. LIFO would be faster on cache
|
||||
//! locality and is not offered.
|
||||
//!
|
||||
//! Poisoning
|
||||
//! =========
|
||||
//! Unlike `std::sync::Mutex`, `loom::Mutex` does not poison on panic. If a
|
||||
//! holder panics while holding the lock, the next waiter receives the
|
||||
//! (now-untouched) value. Rationale: supervision handles the panic at the
|
||||
//! actor level; a separate poisoning channel is redundant and adds an
|
||||
//! error case to every `lock()`. Users who care about "the value may be in
|
||||
//! an inconsistent state after a panic" should encode that in `T` itself
|
||||
//! (e.g. `Mutex<Option<State>>` and `take()` the value at the start of
|
||||
//! each critical section).
|
||||
//!
|
||||
//! Reentrance
|
||||
//! ==========
|
||||
//! Not reentrant. An actor that already holds the lock and calls `lock()`
|
||||
//! again on the same mutex will wait on its own grant — and time out. This
|
||||
//! is a bug in the caller, not a feature.
|
||||
//!
|
||||
//! Multi-threading note
|
||||
//! ====================
|
||||
//! The current implementation uses `Rc<RefCell<…>>` internals because the
|
||||
//! v0.2 scheduler is single-threaded. The public API is identical to what
|
||||
//! the eventual multi-threaded version will expose; the migration replaces
|
||||
//! the `Rc<RefCell>` with `Arc<sync::Mutex>` around bookkeeping (waiters
|
||||
//! queue, holder pid) — the *value* itself never goes through a blocking
|
||||
//! OS-level lock, because contention always parks the green thread first.
|
||||
//! No `unsafe impl Send` games today: `loom::Mutex<T>` is `!Send` on v0.2,
|
||||
//! which is correct given there is only one OS thread.
|
||||
|
||||
use crate::pid::Pid;
|
||||
use crate::scheduler;
|
||||
use crate::timer::{self, TimerTarget};
|
||||
use std::cell::{Cell, RefCell};
|
||||
use std::collections::VecDeque;
|
||||
use std::rc::Rc;
|
||||
use std::time::Duration;
|
||||
|
||||
/// 30 seconds. Override per-call with `lock_timeout`, or per-mutex (TODO)
|
||||
/// once the supervisor-level policy hook lands.
|
||||
pub const DEFAULT_TIMEOUT: Duration = Duration::from_secs(30);
|
||||
|
||||
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
|
||||
pub struct LockTimeout;
|
||||
|
||||
impl std::fmt::Display for LockTimeout {
|
||||
fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
|
||||
write!(f, "mutex lock timed out")
|
||||
}
|
||||
}
|
||||
impl std::error::Error for LockTimeout {}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Internals
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// A pending lock attempt. Sits in `MutexCore::state.waiters` from the
|
||||
/// moment an actor parks until it is either granted the lock (popped by
|
||||
/// `MutexGuard::drop`) or times out (popped by `on_timeout`).
|
||||
struct Wait {
|
||||
pid: Pid,
|
||||
/// Per-mutex monotonic sequence. Lets `on_timeout` recognise "this
|
||||
/// specific wait" vs. "a later wait by the same pid on the same
|
||||
/// mutex" — important because a single actor can re-acquire and then
|
||||
/// re-wait, and we don't want a stale timer firing to disturb the new
|
||||
/// wait.
|
||||
seq: u64,
|
||||
}
|
||||
|
||||
/// The non-generic part of the mutex. Lives inside `Rc<>` so it can also
|
||||
/// be stashed (as `Rc<dyn TimerTarget>`) inside a timer entry.
|
||||
struct MutexCore {
|
||||
state: RefCell<MutexState>,
|
||||
default_timeout: Cell<Duration>,
|
||||
}
|
||||
|
||||
struct MutexState {
|
||||
holder: Option<Pid>,
|
||||
waiters: VecDeque<Wait>,
|
||||
next_seq: u64,
|
||||
}
|
||||
|
||||
impl MutexCore {
|
||||
fn new(default_timeout: Duration) -> Self {
|
||||
Self {
|
||||
state: RefCell::new(MutexState {
|
||||
holder: None,
|
||||
waiters: VecDeque::new(),
|
||||
next_seq: 0,
|
||||
}),
|
||||
default_timeout: Cell::new(default_timeout),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
impl TimerTarget for MutexCore {
|
||||
fn on_timeout(&self, pid: Pid, wait_seq: u64) {
|
||||
// Remove the waiter with this seq, if it's still queued. If it's
|
||||
// gone the lock was already granted to this actor before the timer
|
||||
// popped — the actor will return normally; do nothing.
|
||||
let removed = {
|
||||
let mut st = self.state.borrow_mut();
|
||||
if let Some(pos) = st.waiters.iter().position(|w| w.seq == wait_seq) {
|
||||
st.waiters.remove(pos);
|
||||
true
|
||||
} else {
|
||||
false
|
||||
}
|
||||
};
|
||||
if removed {
|
||||
// The actor is parked, waiting on us. Wake it up; `lock_timeout`
|
||||
// will resume, observe `holder != Some(self)`, and return
|
||||
// LockTimeout.
|
||||
scheduler::unpark(pid);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Public API
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
pub struct Mutex<T> {
|
||||
core: Rc<MutexCore>,
|
||||
/// `None` while the lock is held; `Some(T)` while free or while a
|
||||
/// grantee is in the gap between unpark and resumption.
|
||||
value: Rc<RefCell<Option<T>>>,
|
||||
}
|
||||
|
||||
impl<T> Mutex<T> {
|
||||
pub fn new(value: T) -> Self {
|
||||
Self {
|
||||
core: Rc::new(MutexCore::new(DEFAULT_TIMEOUT)),
|
||||
value: Rc::new(RefCell::new(Some(value))),
|
||||
}
|
||||
}
|
||||
|
||||
/// Set the default timeout used by `lock()`. Does not affect in-flight
|
||||
/// `lock_timeout` calls.
|
||||
pub fn set_default_timeout(&self, timeout: Duration) {
|
||||
self.core.default_timeout.set(timeout);
|
||||
}
|
||||
|
||||
/// Acquire the lock, blocking the calling actor until it's granted or
|
||||
/// the default timeout expires.
|
||||
pub fn lock(&self) -> Result<MutexGuard<'_, T>, LockTimeout> {
|
||||
self.lock_timeout(self.core.default_timeout.get())
|
||||
}
|
||||
|
||||
/// Acquire the lock with an explicit timeout.
|
||||
pub fn lock_timeout(&self, timeout: Duration) -> Result<MutexGuard<'_, T>, LockTimeout> {
|
||||
let me = scheduler::self_pid();
|
||||
|
||||
// Fast path: nobody holds it. Mark ourselves as holder, take the
|
||||
// value out, return a guard.
|
||||
{
|
||||
let mut st = self.core.state.borrow_mut();
|
||||
if st.holder.is_none() {
|
||||
st.holder = Some(me);
|
||||
drop(st);
|
||||
let value = self
|
||||
.value
|
||||
.borrow_mut()
|
||||
.take()
|
||||
.expect("Mutex: value missing on free fast path");
|
||||
return Ok(MutexGuard {
|
||||
mutex: self,
|
||||
value: Some(value),
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// Slow path: register as a waiter, schedule a timeout, park.
|
||||
// No preemption during prep-to-park — see scheduler::NoPreempt.
|
||||
let _np = scheduler::NoPreempt::enter();
|
||||
let seq = {
|
||||
let mut st = self.core.state.borrow_mut();
|
||||
let seq = st.next_seq;
|
||||
st.next_seq = st.next_seq.wrapping_add(1);
|
||||
st.waiters.push_back(Wait { pid: me, seq });
|
||||
seq
|
||||
};
|
||||
|
||||
let target: Rc<dyn TimerTarget> = self.core.clone();
|
||||
let deadline = timer::deadline_from_now(timeout);
|
||||
scheduler::insert_wait_timer(deadline, me, target, seq);
|
||||
scheduler::park_current();
|
||||
|
||||
// Resumed. Two possibilities:
|
||||
// (a) MutexGuard::drop on the previous holder popped us off the
|
||||
// waiters queue, set core.holder = me, and unparked us.
|
||||
// => self.value is Some, we take it and return Ok.
|
||||
// (b) on_timeout fired: it removed us from waiters and unparked
|
||||
// us, but did NOT set holder. core.holder is whatever it was
|
||||
// (Some(other) or None). => return Err.
|
||||
let is_holder = self.core.state.borrow().holder == Some(me);
|
||||
if is_holder {
|
||||
let value = self
|
||||
.value
|
||||
.borrow_mut()
|
||||
.take()
|
||||
.expect("Mutex: value missing after grant");
|
||||
Ok(MutexGuard {
|
||||
mutex: self,
|
||||
value: Some(value),
|
||||
})
|
||||
} else {
|
||||
Err(LockTimeout)
|
||||
}
|
||||
}
|
||||
|
||||
/// Non-blocking attempt. Returns `Some` if the lock was free, `None`
|
||||
/// otherwise. Useful as a fast path before a long-running computation,
|
||||
/// or for tests.
|
||||
pub fn try_lock(&self) -> Option<MutexGuard<'_, T>> {
|
||||
let mut st = self.core.state.borrow_mut();
|
||||
if st.holder.is_some() {
|
||||
return None;
|
||||
}
|
||||
let me = scheduler::self_pid();
|
||||
st.holder = Some(me);
|
||||
drop(st);
|
||||
let value = self
|
||||
.value
|
||||
.borrow_mut()
|
||||
.take()
|
||||
.expect("Mutex: value missing on try_lock free path");
|
||||
Some(MutexGuard {
|
||||
mutex: self,
|
||||
value: Some(value),
|
||||
})
|
||||
}
|
||||
}
|
||||
|
||||
impl<T> Clone for Mutex<T> {
|
||||
/// Cloning a `Mutex<T>` clones the handle, not the protected value —
|
||||
/// both clones refer to the same lock state and the same `T`.
|
||||
fn clone(&self) -> Self {
|
||||
Self {
|
||||
core: self.core.clone(),
|
||||
value: self.value.clone(),
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Guard
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
pub struct MutexGuard<'a, T> {
|
||||
mutex: &'a Mutex<T>,
|
||||
/// The protected value, taken out of `mutex.value` while the guard is
|
||||
/// alive. `Option` only so `Drop` can put it back; in normal use this
|
||||
/// is always `Some` while the guard is observable.
|
||||
value: Option<T>,
|
||||
}
|
||||
|
||||
impl<T> std::ops::Deref for MutexGuard<'_, T> {
|
||||
type Target = T;
|
||||
fn deref(&self) -> &T {
|
||||
self.value.as_ref().expect("MutexGuard: value missing")
|
||||
}
|
||||
}
|
||||
|
||||
impl<T> std::ops::DerefMut for MutexGuard<'_, T> {
|
||||
fn deref_mut(&mut self) -> &mut T {
|
||||
self.value.as_mut().expect("MutexGuard: value missing")
|
||||
}
|
||||
}
|
||||
|
||||
impl<T> Drop for MutexGuard<'_, T> {
|
||||
fn drop(&mut self) {
|
||||
// Put the value back into the mutex.
|
||||
let v = self.value.take().expect("MutexGuard: double drop");
|
||||
*self.mutex.value.borrow_mut() = Some(v);
|
||||
|
||||
// Pick the next waiter (if any) and grant it the lock by writing
|
||||
// its pid into `holder` *before* unparking. The grantee, on
|
||||
// resumption, will see `holder == self_pid` and take the value.
|
||||
let next_pid = {
|
||||
let mut st = self.mutex.core.state.borrow_mut();
|
||||
let next = st.waiters.pop_front();
|
||||
match next {
|
||||
Some(w) => {
|
||||
st.holder = Some(w.pid);
|
||||
Some(w.pid)
|
||||
}
|
||||
None => {
|
||||
st.holder = None;
|
||||
None
|
||||
}
|
||||
}
|
||||
};
|
||||
if let Some(pid) = next_pid {
|
||||
scheduler::unpark(pid);
|
||||
}
|
||||
}
|
||||
}
|
||||
262
src/scheduler.rs
262
src/scheduler.rs
@@ -344,16 +344,69 @@ pub fn park_current() {
|
||||
unsafe { crate::context::switch_to_scheduler() };
|
||||
}
|
||||
|
||||
/// RAII guard that disables allocator-driven preemption for its lifetime.
|
||||
///
|
||||
/// The "prep-to-park" hazard described in `preempt.rs`: a primitive that
|
||||
/// (a) registers an unparker (channel waiter slot, fd waiter map, mutex
|
||||
/// waiter queue, …) and then (b) calls `park_current()` must not yield
|
||||
/// between (a) and (b). If it does, an early unpark fires while the actor
|
||||
/// is still Runnable, the unpark no-ops, and then the actor parks with no
|
||||
/// one to wake it.
|
||||
///
|
||||
/// Library code wraps the prep + park in `let _g = NoPreempt::enter();`
|
||||
/// and the guard is held until just after `park_current` returns (or
|
||||
/// dropped earlier, immediately before `park_current`, since `park_current`
|
||||
/// itself returns control to the scheduler which disables preemption on
|
||||
/// its own path).
|
||||
pub struct NoPreempt(bool);
|
||||
|
||||
impl NoPreempt {
|
||||
pub fn enter() -> Self {
|
||||
let prev = crate::preempt::PREEMPTION_ENABLED.with(|c| c.replace(false));
|
||||
NoPreempt(prev)
|
||||
}
|
||||
}
|
||||
|
||||
impl Drop for NoPreempt {
|
||||
fn drop(&mut self) {
|
||||
crate::preempt::PREEMPTION_ENABLED.with(|c| c.set(self.0));
|
||||
}
|
||||
}
|
||||
|
||||
/// Park the current actor for at least `duration`. A zero duration behaves
|
||||
/// like `yield_now` (the deadline is immediately in the past, so the timer
|
||||
/// pops on the next scheduler iteration).
|
||||
pub fn sleep(duration: std::time::Duration) {
|
||||
let me = current_pid().expect("sleep() called outside an actor");
|
||||
let _np = NoPreempt::enter();
|
||||
let deadline = crate::timer::deadline_from_now(duration);
|
||||
with_sched(|s| s.timers.insert(deadline, me));
|
||||
with_sched(|s| s.timers.insert_sleep(deadline, me));
|
||||
park_current();
|
||||
}
|
||||
|
||||
/// Insert a `WaitTimeout` timer entry. Library code (`Mutex::lock_timeout`
|
||||
/// today, future bounded-wait primitives) calls this just before
|
||||
/// `park_current()` so that if the wait isn't satisfied by `deadline`,
|
||||
/// `target.on_timeout(pid, wait_seq)` will fire.
|
||||
///
|
||||
/// Cancellation: not needed. If the wait is satisfied early, the entry is
|
||||
/// still in the heap and will pop in due course; `on_timeout` is expected
|
||||
/// to be idempotent on stale-seq.
|
||||
pub fn insert_wait_timer(
|
||||
deadline: std::time::Instant,
|
||||
pid: Pid,
|
||||
target: std::rc::Rc<dyn crate::timer::TimerTarget>,
|
||||
wait_seq: u64,
|
||||
) {
|
||||
with_sched(|s| {
|
||||
s.timers.insert(
|
||||
deadline,
|
||||
pid,
|
||||
crate::timer::Reason::WaitTimeout { target, wait_seq },
|
||||
);
|
||||
});
|
||||
}
|
||||
|
||||
/// Run `f` on the IO worker thread, park the current actor while it runs,
|
||||
/// and return `f`'s value when it completes. Panics inside `f` propagate
|
||||
/// to the calling actor.
|
||||
@@ -377,12 +430,14 @@ where
|
||||
Ok(Box::new(v) as Box<dyn std::any::Any + Send>)
|
||||
});
|
||||
|
||||
with_sched(|s| {
|
||||
let io = s.io.as_mut().expect("io thread not started");
|
||||
io.submit(me, work);
|
||||
});
|
||||
|
||||
park_current();
|
||||
{
|
||||
let _np = NoPreempt::enter();
|
||||
with_sched(|s| {
|
||||
let io = s.io.as_mut().expect("io thread not started");
|
||||
io.submit(me, work);
|
||||
});
|
||||
park_current();
|
||||
}
|
||||
|
||||
// On resume, our slot has a result waiting.
|
||||
let result = with_sched(|s| {
|
||||
@@ -401,6 +456,83 @@ where
|
||||
}
|
||||
}
|
||||
|
||||
// ---------------------------------------------------------------------------
|
||||
// Fd-readiness primitives.
|
||||
//
|
||||
// `wait_readable(fd)` / `wait_writable(fd)` register interest with the
|
||||
// epoll thread, park the calling actor, and return when the kernel
|
||||
// signals readiness. The subsequent syscall (`read`/`write`) is done on
|
||||
// the actor's own thread by the caller — no buffer crosses an actor
|
||||
// boundary.
|
||||
//
|
||||
// Fds passed in should be O_NONBLOCK; see io.rs module docs.
|
||||
// ---------------------------------------------------------------------------
|
||||
|
||||
/// Park the calling actor until `fd` is readable.
|
||||
pub fn wait_readable(fd: std::os::fd::RawFd) -> std::io::Result<()> {
|
||||
wait_fd(fd, /*readable=*/ true, /*writable=*/ false)
|
||||
}
|
||||
|
||||
/// Park the calling actor until `fd` is writable.
|
||||
pub fn wait_writable(fd: std::os::fd::RawFd) -> std::io::Result<()> {
|
||||
wait_fd(fd, /*readable=*/ false, /*writable=*/ true)
|
||||
}
|
||||
|
||||
fn wait_fd(fd: std::os::fd::RawFd, readable: bool, writable: bool) -> std::io::Result<()> {
|
||||
let me = current_pid().expect("wait_*() called outside an actor");
|
||||
|
||||
// Register with the epoll thread. If registration fails (bad fd,
|
||||
// already-parked waiter, OOM in the kernel), return the error
|
||||
// without parking — the actor never went to sleep.
|
||||
let _np = NoPreempt::enter();
|
||||
with_sched(|s| {
|
||||
let io = s.io.as_mut().expect("io thread not started");
|
||||
io.epoll_register(fd, me, readable, writable)
|
||||
})?;
|
||||
|
||||
park_current();
|
||||
// On resume, the scheduler has already removed `fd` from `waiters`
|
||||
// and DEL'd it from epollfd. There is no per-call return value;
|
||||
// success here just means "fd is ready, go do your syscall".
|
||||
//
|
||||
// Note: there is no error path on resume because v0.2 doesn't time
|
||||
// out fd waits and doesn't otherwise spurious-wake. If those are
|
||||
// added, this function grows a non-trivial return.
|
||||
Ok(())
|
||||
}
|
||||
|
||||
/// Wait until `fd` is readable, then run a single `read(2)`. Returns the
|
||||
/// number of bytes read, or an `io::Error` from the syscall.
|
||||
///
|
||||
/// `fd` should be opened `O_NONBLOCK`. With a blocking fd, the kernel's
|
||||
/// readiness signal does not guarantee a non-blocking read — a signal
|
||||
/// could interrupt, and the actor's syscall would then stall the
|
||||
/// scheduler thread.
|
||||
pub fn read(fd: std::os::fd::RawFd, buf: &mut [u8]) -> std::io::Result<usize> {
|
||||
wait_readable(fd)?;
|
||||
let n = unsafe {
|
||||
libc::read(fd, buf.as_mut_ptr() as *mut _, buf.len())
|
||||
};
|
||||
if n < 0 {
|
||||
Err(std::io::Error::last_os_error())
|
||||
} else {
|
||||
Ok(n as usize)
|
||||
}
|
||||
}
|
||||
|
||||
/// Wait until `fd` is writable, then run a single `write(2)`.
|
||||
pub fn write(fd: std::os::fd::RawFd, buf: &[u8]) -> std::io::Result<usize> {
|
||||
wait_writable(fd)?;
|
||||
let n = unsafe {
|
||||
libc::write(fd, buf.as_ptr() as *const _, buf.len())
|
||||
};
|
||||
if n < 0 {
|
||||
Err(std::io::Error::last_os_error())
|
||||
} else {
|
||||
Ok(n as usize)
|
||||
}
|
||||
}
|
||||
|
||||
/// Wake a parked actor. If the actor isn't parked (already runnable or done)
|
||||
/// this is a no-op — that's important; channel and join can both fire
|
||||
/// spurious unparks under some orderings and we want them to be cheap.
|
||||
@@ -488,41 +620,95 @@ pub const ROOT_PID: Pid = Pid::new(u32::MAX, u32::MAX);
|
||||
|
||||
fn schedule_loop() {
|
||||
loop {
|
||||
// 1. Drain due timers into the run queue.
|
||||
// 1. Drain due timers and dispatch by reason.
|
||||
//
|
||||
// Sleep — unpark the actor (idempotently: only if still
|
||||
// parked).
|
||||
// WaitTimeout — call the target's on_timeout. The target decides
|
||||
// whether the wait was still in progress (timer
|
||||
// won the race) or had been fulfilled (the thing
|
||||
// the actor was waiting for arrived first → no-op).
|
||||
// The target is responsible for calling unpark()
|
||||
// if appropriate.
|
||||
let now = std::time::Instant::now();
|
||||
let due = with_sched(|s| s.timers.pop_due(now));
|
||||
for pid in due {
|
||||
// Same idempotency as `unpark`: only re-queue if still parked.
|
||||
with_sched(|s| {
|
||||
if let Some(slot) = s.slot_mut(pid) {
|
||||
if matches!(slot.state, State::Parked) {
|
||||
slot.state = State::Runnable;
|
||||
s.run_queue.push_back(pid);
|
||||
}
|
||||
for entry in due {
|
||||
match entry.reason {
|
||||
crate::timer::Reason::Sleep => {
|
||||
with_sched(|s| {
|
||||
if let Some(slot) = s.slot_mut(entry.pid) {
|
||||
if matches!(slot.state, State::Parked) {
|
||||
slot.state = State::Runnable;
|
||||
s.run_queue.push_back(entry.pid);
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
});
|
||||
crate::timer::Reason::WaitTimeout { target, wait_seq } => {
|
||||
// Note: the target callback runs *outside* with_sched.
|
||||
// It may call back into the scheduler (e.g. unpark), so
|
||||
// we must not hold the SCHED borrow across it.
|
||||
target.on_timeout(entry.pid, wait_seq);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// 2. Drain IO completions: route each result to its slot and
|
||||
// unpark the actor. Drain even when we have other runnables —
|
||||
// it's cheap (a try_lock of the completion queue) and keeps
|
||||
// pending_io_result freshness bounded.
|
||||
// 2. Drain IO completions: route each result by variant.
|
||||
//
|
||||
// Blocking — a `block_on_io` closure finished. Stash the result
|
||||
// on the actor's slot and unpark.
|
||||
// FdReady — an fd registered via `wait_readable`/`wait_writable`
|
||||
// is ready. Look up the parked pid in the io thread's
|
||||
// waiters map, deregister the fd, unpark.
|
||||
//
|
||||
// Drain even when we have other runnables — it's cheap and keeps
|
||||
// `pending_io_result` / `waiters` freshness bounded.
|
||||
let completions = with_sched(|s| {
|
||||
s.io.as_mut().map(|io| io.drain_completions()).unwrap_or_default()
|
||||
});
|
||||
for (pid, result) in completions {
|
||||
with_sched(|s| {
|
||||
if let Some(io) = s.io.as_mut() {
|
||||
io.outstanding = io.outstanding.saturating_sub(1);
|
||||
for completion in completions {
|
||||
match completion {
|
||||
crate::io::Completion::Blocking { pid, result } => {
|
||||
with_sched(|s| {
|
||||
if let Some(io) = s.io.as_mut() {
|
||||
io.outstanding = io.outstanding.saturating_sub(1);
|
||||
}
|
||||
if let Some(slot) = s.slot_mut(pid) {
|
||||
slot.pending_io_result = Some(result);
|
||||
if matches!(slot.state, State::Parked) {
|
||||
slot.state = State::Runnable;
|
||||
s.run_queue.push_back(pid);
|
||||
}
|
||||
}
|
||||
});
|
||||
}
|
||||
if let Some(slot) = s.slot_mut(pid) {
|
||||
slot.pending_io_result = Some(result);
|
||||
if matches!(slot.state, State::Parked) {
|
||||
slot.state = State::Runnable;
|
||||
s.run_queue.push_back(pid);
|
||||
}
|
||||
crate::io::Completion::FdReady { fd, events: _ } => {
|
||||
with_sched(|s| {
|
||||
let parked_pid = s.io.as_mut()
|
||||
.and_then(|io| {
|
||||
let pid = io.waiters.remove(&fd);
|
||||
// Deregister the fd from epollfd; the
|
||||
// EPOLLONESHOT already disarmed it but the
|
||||
// slot is still occupied until we DEL.
|
||||
io.epoll_deregister(fd);
|
||||
pid
|
||||
});
|
||||
if let Some(pid) = parked_pid {
|
||||
if let Some(slot) = s.slot_mut(pid) {
|
||||
if matches!(slot.state, State::Parked) {
|
||||
slot.state = State::Runnable;
|
||||
s.run_queue.push_back(pid);
|
||||
}
|
||||
}
|
||||
// else: actor died between registering and the
|
||||
// fd firing. Nothing to do; the registration
|
||||
// has been cleaned up.
|
||||
}
|
||||
// else: fd not in waiters — probably a duplicate
|
||||
// FdReady from a previous registration, ignore.
|
||||
});
|
||||
}
|
||||
});
|
||||
}
|
||||
}
|
||||
|
||||
// 3. Pop a runnable actor. If none, decide whether to block on
|
||||
@@ -536,11 +722,19 @@ fn schedule_loop() {
|
||||
// trying to take the completions mutex, which is fine,
|
||||
// but the scheduler thread itself mustn't hold SCHED
|
||||
// borrowed across a blocking syscall.
|
||||
//
|
||||
// "Outstanding" here means *anything* the IO thread is
|
||||
// expected to deliver a wakeup for: in-flight blocking
|
||||
// calls AND parked fd waiters. If either is non-zero we
|
||||
// must wait for the IO thread, not exit.
|
||||
let (next_deadline, io_outstanding, wake_fd) = with_sched(|s| {
|
||||
let next = s.timers.peek_deadline();
|
||||
let (out, fd) = match s.io.as_ref() {
|
||||
Some(io) => (io.outstanding, Some(io.wake_fd())),
|
||||
None => (0, None),
|
||||
Some(io) => (
|
||||
io.outstanding + io.waiters.len() as u32,
|
||||
Some(io.wake_fd()),
|
||||
),
|
||||
None => (0, None),
|
||||
};
|
||||
(next, out, fd)
|
||||
});
|
||||
|
||||
112
src/timer.rs
112
src/timer.rs
@@ -1,38 +1,86 @@
|
||||
//! Sleep timers.
|
||||
//! Sleep + wait-with-timeout timers.
|
||||
//!
|
||||
//! A min-heap of `(deadline, Pid)` entries lives on `SchedulerState`. When
|
||||
//! an actor calls `sleep`, the runtime inserts the entry, marks the actor
|
||||
//! parked, and yields. On every scheduler loop iteration the runtime pops
|
||||
//! all entries whose deadline has passed and unparks them. When the run
|
||||
//! queue is empty but the heap is not, the runtime sleeps the OS thread
|
||||
//! until the soonest deadline, then re-checks.
|
||||
//! A min-heap of `(deadline, seq, reason)` entries lives on `SchedulerState`.
|
||||
//! When an actor sleeps or starts a bounded wait (e.g. `mutex.lock()` with a
|
||||
//! timeout), the runtime inserts an entry, marks the actor parked, and yields.
|
||||
//! On every scheduler loop iteration the runtime pops all entries whose
|
||||
//! deadline has passed and dispatches each according to its `Reason`:
|
||||
//!
|
||||
//! `BinaryHeap` is a max-heap, so entries are stored with their deadline
|
||||
//! wrapped in `Reverse` to get min-heap behaviour.
|
||||
//! - `Sleep`: unpark the actor.
|
||||
//! - `WaitTimeout`: call `on_timeout` on the registered target. The target
|
||||
//! (e.g. a `Mutex`) decides whether the actor was actually still waiting
|
||||
//! (timer fires first → unpark with error) or had already been granted
|
||||
//! what it was waiting for (lock granted first → no-op).
|
||||
//!
|
||||
//! Stale pids (slot reused since the timer was inserted) are detected on
|
||||
//! `due_pids` pop and silently dropped — same convention as the run queue.
|
||||
//! `BinaryHeap` is a max-heap; entries are wrapped in `Reverse` to get
|
||||
//! min-heap behaviour.
|
||||
//!
|
||||
//! No cancellation. When a non-timer wakeup happens (e.g. lock granted
|
||||
//! before timeout), the timer entry is left in the heap. It will be popped
|
||||
//! eventually and the dispatch will observe "actor is no longer parked /
|
||||
//! wait_seq is stale" and no-op. Cost is ~32 bytes per stale entry plus a
|
||||
//! few cycles on pop; acceptable given the upper bound is "one entry per
|
||||
//! parked actor".
|
||||
//!
|
||||
//! Stale pids (slot reused since the timer was inserted) are filtered on
|
||||
//! pop by the scheduler — same convention as the run queue.
|
||||
|
||||
use crate::pid::Pid;
|
||||
use std::cmp::Reverse;
|
||||
use std::collections::BinaryHeap;
|
||||
use std::rc::Rc;
|
||||
use std::time::{Duration, Instant};
|
||||
|
||||
#[derive(PartialEq, Eq)]
|
||||
/// What to do when a timer entry's deadline arrives.
|
||||
///
|
||||
/// Held inside `Entry`, dispatched by the scheduler in `pop_due`.
|
||||
pub enum Reason {
|
||||
/// `loom::sleep(d)`. Unpark `pid` unconditionally (modulo the usual
|
||||
/// "still parked?" check the scheduler applies).
|
||||
Sleep,
|
||||
/// A bounded wait — currently only `Mutex::lock_timeout`. On expiry the
|
||||
/// scheduler calls `target.on_timeout(pid, wait_seq)`. The target then
|
||||
/// decides whether `pid` was actually still waiting, and if so unparks
|
||||
/// it with whatever error the wait was bounded for. `wait_seq` lets the
|
||||
/// target tell apart "this wait" from "a later wait by the same actor
|
||||
/// on the same target".
|
||||
WaitTimeout {
|
||||
target: Rc<dyn TimerTarget>,
|
||||
wait_seq: u64,
|
||||
},
|
||||
}
|
||||
|
||||
/// Callback the scheduler invokes when a `WaitTimeout` entry pops.
|
||||
///
|
||||
/// Implementors: do not touch `SchedulerState` other than via the public
|
||||
/// `unpark` / channel APIs. The scheduler is mid-iteration when this fires.
|
||||
pub trait TimerTarget {
|
||||
fn on_timeout(&self, pid: Pid, wait_seq: u64);
|
||||
}
|
||||
|
||||
pub struct Entry {
|
||||
pub deadline: Instant,
|
||||
/// Insertion order, used purely as a tiebreaker so `Entry: Ord` works
|
||||
/// without having to compare the `Reason` payload (which contains an
|
||||
/// `Rc<dyn TimerTarget>` and isn't `Ord`).
|
||||
seq: u64,
|
||||
pub pid: Pid,
|
||||
pub reason: Reason,
|
||||
}
|
||||
|
||||
impl PartialEq for Entry {
|
||||
fn eq(&self, other: &Self) -> bool {
|
||||
self.deadline == other.deadline && self.seq == other.seq
|
||||
}
|
||||
}
|
||||
impl Eq for Entry {}
|
||||
|
||||
impl Ord for Entry {
|
||||
fn cmp(&self, other: &Self) -> std::cmp::Ordering {
|
||||
// Only `deadline` matters for ordering; pid is a tiebreaker so the
|
||||
// type is Ord, but the order among same-deadline entries is
|
||||
// irrelevant.
|
||||
self.deadline
|
||||
.cmp(&other.deadline)
|
||||
.then_with(|| self.pid.index().cmp(&other.pid.index()))
|
||||
.then_with(|| self.pid.generation().cmp(&other.pid.generation()))
|
||||
// Earlier deadline first; ties broken by insertion order so the
|
||||
// ordering is total. `Reason` and `Pid` deliberately don't
|
||||
// participate.
|
||||
self.deadline.cmp(&other.deadline).then_with(|| self.seq.cmp(&other.seq))
|
||||
}
|
||||
}
|
||||
|
||||
@@ -46,15 +94,25 @@ impl PartialOrd for Entry {
|
||||
pub struct Timers {
|
||||
/// Reverse-wrapped so the smallest deadline is at the top.
|
||||
heap: BinaryHeap<Reverse<Entry>>,
|
||||
/// Monotonic counter for the tiebreaker `seq` field.
|
||||
next_seq: u64,
|
||||
}
|
||||
|
||||
impl Timers {
|
||||
pub fn new() -> Self {
|
||||
Self { heap: BinaryHeap::new() }
|
||||
Self { heap: BinaryHeap::new(), next_seq: 0 }
|
||||
}
|
||||
|
||||
pub fn insert(&mut self, deadline: Instant, pid: Pid) {
|
||||
self.heap.push(Reverse(Entry { deadline, pid }));
|
||||
/// Insert a `Sleep` timer. Convenience for the common case.
|
||||
pub fn insert_sleep(&mut self, deadline: Instant, pid: Pid) {
|
||||
self.insert(deadline, pid, Reason::Sleep);
|
||||
}
|
||||
|
||||
/// Insert an arbitrary timer entry.
|
||||
pub fn insert(&mut self, deadline: Instant, pid: Pid, reason: Reason) {
|
||||
let seq = self.next_seq;
|
||||
self.next_seq = self.next_seq.wrapping_add(1);
|
||||
self.heap.push(Reverse(Entry { deadline, seq, pid, reason }));
|
||||
}
|
||||
|
||||
pub fn is_empty(&self) -> bool {
|
||||
@@ -66,13 +124,13 @@ impl Timers {
|
||||
self.heap.peek().map(|r| r.0.deadline)
|
||||
}
|
||||
|
||||
/// Pop and return every pid whose deadline is ≤ `now`.
|
||||
pub fn pop_due(&mut self, now: Instant) -> Vec<Pid> {
|
||||
/// Pop every entry whose deadline is ≤ `now`, in deadline order.
|
||||
/// The scheduler dispatches each entry by inspecting `entry.reason`.
|
||||
pub fn pop_due(&mut self, now: Instant) -> Vec<Entry> {
|
||||
let mut out = Vec::new();
|
||||
while let Some(r) = self.heap.peek() {
|
||||
if r.0.deadline <= now {
|
||||
let e = self.heap.pop().unwrap().0;
|
||||
out.push(e.pid);
|
||||
out.push(self.heap.pop().unwrap().0);
|
||||
} else {
|
||||
break;
|
||||
}
|
||||
@@ -81,7 +139,7 @@ impl Timers {
|
||||
}
|
||||
}
|
||||
|
||||
/// Wall-clock duration helper exposed for `sleep`.
|
||||
/// Wall-clock duration helper exposed for `sleep` and `lock_timeout`.
|
||||
pub fn deadline_from_now(duration: Duration) -> Instant {
|
||||
Instant::now()
|
||||
.checked_add(duration)
|
||||
|
||||
Reference in New Issue
Block a user