Module rsyscall.near

Definitions of namespace-local identifiers, syscalls, and SyscallInterface

These namespace-local identifiers are like near pointers, in systems with segmented memory. They are valid only within a specific segment (namespace).

The syscalls are instructions, operating on near pointers and other arguments.

The SyscallInterface is the segment register override prefix, which is used with the instruction to say which segment register to use for the syscall.

We don't know from a segment register override prefix alone that the near pointers we are passing to an instruction are valid pointers in the segment currently contained in the segment register.

In terms of our actual classes: We don't know from a SyscallInterface alone that the identifiers we are passing to a syscall match the namespaces active in the task behind the SyscallInterface.

(The task is like the segment register, in this analogy.)
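To make the analogy concrete, here is a minimal sketch. `FakeTask` and its dict-based file descriptor table are hypothetical stand-ins invented for illustration, not part of rsyscall; only the shape of `FileDescriptor` follows the class documented below.

```python
from dataclasses import dataclass, field
from typing import Dict

@dataclass(frozen=True)
class FileDescriptor:
    """A near identifier: just a number, with no record of its namespace."""
    number: int

@dataclass
class FakeTask:
    """Hypothetical stand-in for a task: it owns one file descriptor table."""
    name: str
    fd_table: Dict[int, str] = field(default_factory=dict)

    def lookup(self, fd: FileDescriptor) -> str:
        # The number is resolved in this task's table, wherever fd came from.
        return self.fd_table[fd.number]

task_a = FakeTask("a", {3: "/etc/passwd"})
task_b = FakeTask("b", {3: "/dev/null"})
fd = FileDescriptor(3)  # created with task_a in mind

# Nothing stops us from using fd with task_b, where it names another file.
same_number_different_file = (task_a.lookup(fd), task_b.lookup(fd))
```

The identifier alone cannot catch the mix-up, which is why the pairing of identifier and task has to be tracked at a higher level.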

"""Definitions of namespace-local identifiers, syscalls, and SyscallInterface

These namespace-local identifiers are like near pointers, in systems
with segmented memory. They are valid only within a specific segment
(namespace).

The syscalls are instructions, operating on near pointers and other
arguments.

The SyscallInterface is the segment register override prefix, which is
used with the instruction to say which segment register to use for the
syscall.

We don't know from a segment register override prefix alone that the
near pointers we are passing to an instruction are valid pointers in
the segment currently contained in the segment register.

In terms of our actual classes: We don't know from a SyscallInterface
alone that the identifiers we are passing to a syscall match the
namespaces active in the task behind the SyscallInterface.

(The task is like the segment register, in this analogy.)

"""
# re-exported namespace-local identifiers
from rsyscall.near.types import (
    FileDescriptor,
    WatchDescriptor,
    Address,
    MemoryMapping,
    Process,
    ProcessGroup,
)
# re-exported SyscallInterface
from rsyscall.near.sysif import SyscallInterface, SyscallHangup
__all__ = [
    'FileDescriptor',
    'WatchDescriptor',
    'Address',
    'MemoryMapping',
    'Process',
    'ProcessGroup',
    'SyscallInterface', 'SyscallHangup',
]

Sub-modules

rsyscall.near.sysif

The lowest-level interface for making syscalls …

rsyscall.near.types

Definitions of namespace-local identifiers …

Classes

class FileDescriptor (number: int)

The integer identifier for a file descriptor taken by many syscalls.

This is a file descriptor in a specific file descriptor table, but this object alone does not tell us which file descriptor table that is.

@dataclass(frozen=True)
class FileDescriptor:
    """The integer identifier for a file descriptor taken by many syscalls.

    This is a file descriptor in a specific file descriptor table, but this object alone
    does not tell us which file descriptor table that is.

    """
    __slots__ = ('number',)
    number: int

    def __str__(self) -> str:
        return f"FD({self.number})"

    def __repr__(self) -> str:
        return str(self)

    def __int__(self) -> int:
        return self.number

Instance variables

var number : int
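As a brief illustration of how these wrappers behave, the sketch below re-declares simplified copies of `FileDescriptor` and `WatchDescriptor` (without `__slots__`) so that it runs on its own. It assumes nothing beyond the methods shown in the listings on this page.

```python
from dataclasses import dataclass, FrozenInstanceError

@dataclass(frozen=True)
class FileDescriptor:
    number: int

    def __str__(self) -> str:
        return f"FD({self.number})"

    def __repr__(self) -> str:
        return str(self)

    def __int__(self) -> int:
        return self.number

@dataclass(frozen=True)
class WatchDescriptor:
    number: int

fd = FileDescriptor(3)
text = str(fd)       # human-readable form
raw = int(fd)        # the integer actually passed to a syscall

# Frozen dataclasses reject mutation, so an identifier cannot be silently changed.
try:
    fd.number = 4    # type: ignore[misc]
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# Distinct wrapper types never compare equal, even with the same number.
mixed_up = (fd == WatchDescriptor(3))
```

Frozen dataclasses with generated equality are also hashable, so these identifiers can be used as dictionary keys.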


class WatchDescriptor (number: int)

The integer identifier for an inotify watch descriptor taken by inotify syscalls.

This is a watch descriptor for a specific inotify instance, but this object alone does not tell us which inotify instance that is.

@dataclass(frozen=True)
class WatchDescriptor:
    """The integer identifier for an inotify watch descriptor taken by inotify syscalls.

    This is a watch descriptor for a specific inotify instance, but this object alone
    does not tell us which inotify instance that is.

    """
    number: int

    def __str__(self) -> str:
        return f"WD({self.number})"

    def __repr__(self) -> str:
        return str(self)

    def __int__(self) -> int:
        return self.number

Class variables

var number : int

class Address (address: int)

The integer identifier for a virtual memory address as taken by many syscalls.

This is an address in a specific address space, but this object alone does not tell us which address space that is.

@dataclass(frozen=True)
class Address:
    """The integer identifier for a virtual memory address as taken by many syscalls.

    This is an address in a specific address space, but this object alone does not tell
    us which address space that is.

    """
    __slots__ = ('address',)
    address: int

    def __add__(self, other: int) -> 'Address':
        return Address(self.address + other)

    def __sub__(self, other: int) -> 'Address':
        return Address(self.address - other)

    def __str__(self) -> str:
        return f"Address({hex(self.address)})"

    def __repr__(self) -> str:
        return str(self)

    def __int__(self) -> int:
        return self.address

Instance variables

var address : int
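The arithmetic operators can be exercised as in the following sketch, which re-declares a simplified copy of `Address` (without `__slots__`) so that it is self-contained. Note that `__add__` and `__sub__` as listed take an integer offset; the distance between two addresses is computed here through `int()`, which is a choice of this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    address: int

    def __add__(self, other: int) -> 'Address':
        return Address(self.address + other)

    def __sub__(self, other: int) -> 'Address':
        return Address(self.address - other)

    def __str__(self) -> str:
        return f"Address({hex(self.address)})"

    def __int__(self) -> int:
        return self.address

base = Address(0x1000)
next_page = base + 4096            # offset forward by one 4 KiB page
last_byte = next_page - 1          # offset backward by one byte

# Both operands are wrapped, so take the integer difference explicitly.
distance = int(next_page) - int(base)
```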


class MemoryMapping (address: int, length: int, page_size: int)

The integer identifiers for a virtual memory mapping as taken by many syscalls.

This is a mapping in a specific address space, but this object alone does not tell us which address space that is.

We require three pieces of information to describe a memory mapping:
- Address is the start address of the memory mapping
- Length is the length in bytes of the memory-mapped region
- Page size is the size in bytes of the pages backing the mapping

Page size is unusual, but required for robustness: While the syscalls related to memory mappings don't appear to depend on page size, that's an illusion. They seem to deal in sizes in terms of bytes, but if you provide a size which is not a multiple of the page size, silent failures or misbehaviors will occur. Misbehaviors include the sizes being rounded up to the page size, including in munmap, thus unmapping more memory than expected.

As long as we ensure that the original length we pass to mmap is a multiple of the page size that will be used for the mapping, then we could get by with just storing the length and not the page size. However, the memory mapping API allows unmapping only part of a mapping, or in general performing operations on only part of a mapping. These splits must happen at page boundaries, and therefore to support specifying these splits without allowing silent rounding errors, we need to know the page size of the mapping.

This is especially troubling when mmapping files with an unknown page size, such as those passed to us from another program. memfd_create or hugetlbfs can be used to create files with an unknown page size, which cannot be robustly unmapped. At this time, we don't know of a way to learn the page size of such a file. One good solution would be for mmap to be taught a new MAP_ENFORCE_PAGE_SIZE flag which requires MAP_HUGE_* to be passed when mapping files with nonstandard page size. In this way, we could assert the page size of the file and protect against attackers sending us files with unexpected page sizes.
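The rounding hazard described above comes down to simple arithmetic. The sketch below is illustrative only: `round_up_to_page` is a hypothetical helper, not an rsyscall function, and it models the round-up-to-a-page behavior rather than calling munmap.

```python
def round_up_to_page(length: int, page_size: int) -> int:
    """Round length up to the next multiple of page_size."""
    return (length + page_size - 1) // page_size * page_size

SMALL_PAGE = 4096               # common base page size on x86-64
HUGE_PAGE = 2 * 1024 * 1024     # a 2 MiB huge page

requested = 5000                # not a multiple of either page size

# With 4 KiB pages, a 5000-byte request really covers two pages.
covered_small = round_up_to_page(requested, SMALL_PAGE)

# With 2 MiB pages, the same request covers a whole huge page.
covered_huge = round_up_to_page(requested, HUGE_PAGE)

# Bytes affected beyond what the caller asked for.
excess_small = covered_small - requested
excess_huge = covered_huge - requested
```

The same byte count touches very different amounts of memory depending on the page size, which is why the page size is stored alongside the address and length.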

@dataclass(frozen=True)
class MemoryMapping:
    """The integer identifiers for a virtual memory mapping as taken by many syscalls.

    This is a mapping in a specific address space, but this object alone does not tell us
    which address space that is.

    We require three pieces of information to describe a memory mapping:
    - Address is the start address of the memory mapping
    - Length is the length in bytes of the memory-mapped region
    - Page size is the size in bytes of the pages backing the mapping

    Page size is unusual, but required for robustness: While the syscalls related to
    memory mappings don't appear to depend on page size, that's an illusion. They seem to
    deal in sizes in terms of bytes, but if you provide a size which is not a multiple of
    the page size, silent failures or misbehaviors will occur. Misbehaviors include the
    sizes being rounded up to the page size, including in munmap, thus unmapping more
    memory than expected.

    As long as we ensure that the original length we pass to mmap is a multiple of the
    page size that will be used for the mapping, then we could get by with just storing
    the length and not the page size. However, the memory mapping API allows unmapping
    only part of a mapping, or in general performing operations on only part of a
    mapping. These splits must happen at page boundaries, and therefore to support
    specifying these splits without allowing silent rounding errors, we need to know the
    page size of the mapping.

    This is especially troubling when mmapping files with an unknown page size, such as
    those passed to us from another program. memfd_create or hugetlbfs can be used to
    create files with an unknown page size, which cannot be robustly unmapped. At this time,
    we don't know of a way to learn the page size of such a file. One good solution would
    be for mmap to be taught a new MAP_ENFORCE_PAGE_SIZE flag which requires MAP_HUGE_* to
    be passed when mapping files with nonstandard page size. In this way, we could assert
    the page size of the file and protect against attackers sending us files with
    unexpected page sizes.

    """
    __slots__ = ('address', 'length', 'page_size')
    address: int
    length: int
    page_size: int

    def __post_init__(self) -> None:
        if (self.address % self.page_size) != 0:
            raise Exception("the address for this memory-mapping is not page-aligned", self)
        if (self.length % self.page_size) != 0:
            raise Exception("the length for this memory-mapping is not page-aligned", self)

    def as_address(self) -> Address:
        "Return the starting address of this memory mapping."
        return Address(self.address)

    def __str__(self) -> str:
        if self.page_size == 4096:
            return f"MMap({hex(self.address)}, {self.length})"
        else:
            return f"MMap(pgsz={self.page_size}, {hex(self.address)}, {self.length})"

    def __repr__(self) -> str:
        return str(self)

Instance variables

var address : int


var length : int


var page_size : int


Methods

def as_address(self) ‑> Address

Return the starting address of this memory mapping.

def as_address(self) -> Address:
    "Return the starting address of this memory mapping."
    return Address(self.address)
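
The alignment checks and `as_address` can be tried out with the sketch below. It re-declares simplified copies of `Address` and `MemoryMapping` so that it runs standalone, and it assumes the validation hook is the standard dataclass `__post_init__`. Raising `ValueError` is a choice of this sketch; the listing above raises a plain `Exception`.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Address:
    address: int

@dataclass(frozen=True)
class MemoryMapping:
    address: int
    length: int
    page_size: int

    def __post_init__(self) -> None:
        # dataclasses call this automatically after __init__ has set the fields.
        if self.address % self.page_size != 0:
            raise ValueError("address is not page-aligned", self)
        if self.length % self.page_size != 0:
            raise ValueError("length is not page-aligned", self)

    def as_address(self) -> Address:
        return Address(self.address)

# Two 4 KiB pages starting at a page-aligned address: accepted.
mapping = MemoryMapping(address=0x7f0000000000, length=8192, page_size=4096)
start = mapping.as_address()

# A length that is not a multiple of the page size: rejected up front,
# instead of being silently rounded later by the kernel.
try:
    MemoryMapping(address=0x7f0000000000, length=5000, page_size=4096)
except ValueError:
    rejected = True
else:
    rejected = False
```
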
class Process (id: int)

The integer identifier for a process taken by many syscalls.

This is a process in a specific pid namespace, but this object alone does not tell us which pid namespace that is.

@dataclass(frozen=True)
class Process:
    """The integer identifier for a process taken by many syscalls.

    This is a process in a specific pid namespace, but this object alone does not tell us
    which pid namespace that is.

    """
    id: int

    def __int__(self) -> int:
        return self.id

    def __str__(self) -> str:
        return f'Process({self.id})'

    def __repr__(self) -> str:
        return str(self)

Class variables

var id : int

class ProcessGroup (id: int)

The integer identifier for a process group taken by many syscalls.

This is a process group in a specific pid namespace, but this object alone does not tell us which pid namespace that is.

@dataclass(frozen=True)
class ProcessGroup:
    """The integer identifier for a process group taken by many syscalls.

    This is a process group in a specific pid namespace, but this object alone does not
    tell us which pid namespace that is.

    """
    id: int

    def __int__(self) -> int:
        return self.id

Class variables

var id : int

class SyscallInterface

The lowest-level interface for an object which lets us send syscalls to some process.

We send syscalls to a process, but nothing in this interface tells us anything about the process to which we're sending syscalls; that information is maintained in the Task, which contains an object matching this interface.

This is like the segment register override prefix, with no awareness of the contents of the register.

class SyscallInterface:
    """The lowest-level interface for an object which lets us send syscalls to some process.

    We send syscalls to a process, but nothing in this interface tells us anything about
    the process to which we're sending syscalls; that information is maintained in the
    Task, which contains an object matching this interface.

    This is like the segment register override prefix, with no awareness of the contents
    of the register.

    """
    @abc.abstractmethod
    async def syscall(self, number: SYS, arg1=0, arg2=0, arg3=0, arg4=0, arg5=0, arg6=0) -> int:
        """Send a syscall and wait for it to complete, throwing on error results.

        We provide a guarantee that if the syscall was sent to the process, then we will
        not return until the syscall has completed or our connection has broken.  To
        achieve this, we shield against Python coroutine cancellation while waiting for
        the syscall response.

        This guarantee is important so that our caller can deal with state changes caused
        by the syscall. If our coroutine was cancelled in the middle of a syscall, the
        result of the syscall would be discarded, and our caller wouldn't be able to
        guarantee that state changes in the process are reflected in state changes in
        Python.

        For example, a coroutine calling waitid could be cancelled; if that happened, we
        could discard a child state change indicating that the child exited. If that
        happened, future calls to waitid on that child would be invalid, or maybe return
        events for an unrelated child. We'd be completely confused.

        Instead, thanks to our guarantee, syscalls made through this method can be treated
        as atomic: They will either be submitted and completed, or not submitted at all.
        (If they're submitted and not completed due to blocking forever, that just means
        we'll never return.) There's no possibility of making a syscall, causing a
        side-effect, and never learning about the side-effect you caused.

        Since most syscalls use this method, this guarantee applies to most syscalls.

        This prevents us from cancelling an in-progress syscall if it has already been
        submitted to the process; meaning, we can't discard the result of the syscall, we
        have to wait for it.

        This may seem excessive, but the question is, what should we assume as our default?
        That syscall results can be dropped safely, or that they cannot be dropped?
        Most syscalls are side-effectful: even a simple read usually consumes data in a
        side-effectful manner, and others allocate resources which might be leaked, or can
        cause state changes. Thus, "weakening" (droppability) is not generally true for
        syscall results: syscall results cannot, in most cases, be safely ignored.

        For callers who want to preserve the ability for their coroutine to be cancelled
        even while waiting for a syscall response, the `submit_syscall` API can be used.

        Note that this Python-level cancellation protection has nothing to do with
        actually interrupting a syscall. That ability is still preserved with this
        interface; just send a signal to trigger an EINTR in the syscalling process, and
        we'll get back that EINTR as the syscall response. If you just want to be able to
        cancel deadlocked processes, you should do that. That's the true API for
        "cancellation" of syscalls on Linux.

        Likewise, if the rsyscall server dies, or we get an EOF on the syscall connection,
        or any other event causes response.receive to throw an exception, we'll still
        return that exception; so you can always fall back on killing the rsyscall server
        to stop a deadlock.

        """
        pass

    # non-syscall operations which we haven't figured out how to get rid of yet
    @abc.abstractmethod
    async def close_interface(self) -> None:
        "Close this syscall interface, shutting down the connection to the remote process."
        pass

    @abc.abstractmethod
    def get_activity_fd(self) -> t.Optional[handle.FileDescriptor]:
        """When this file descriptor is readable, it means other things want to run on this thread.

        Users of the SyscallInterface should ensure that when they block, they are
        monitoring this fd as well.

        Typically, this is the file descriptor which the rsyscall server reads for
        incoming syscalls.

        """
        pass

Methods

async def syscall(self, number: SYS, arg1=0, arg2=0, arg3=0, arg4=0, arg5=0, arg6=0) ‑> int

Send a syscall and wait for it to complete, throwing on error results.

We provide a guarantee that if the syscall was sent to the process, then we will not return until the syscall has completed or our connection has broken. To achieve this, we shield against Python coroutine cancellation while waiting for the syscall response.

This guarantee is important so that our caller can deal with state changes caused by the syscall. If our coroutine was cancelled in the middle of a syscall, the result of the syscall would be discarded, and our caller wouldn't be able to guarantee that state changes in the process are reflected in state changes in Python.

For example, a coroutine calling waitid could be cancelled; if that happened, we could discard a child state change indicating that the child exited. If that happened, future calls to waitid on that child would be invalid, or maybe return events for an unrelated child. We'd be completely confused.

Instead, thanks to our guarantee, syscalls made through this method can be treated as atomic: They will either be submitted and completed, or not submitted at all. (If they're submitted and not completed due to blocking forever, that just means we'll never return.) There's no possibility of making a syscall, causing a side-effect, and never learning about the side-effect you caused.

Since most syscalls use this method, this guarantee applies to most syscalls.

This prevents us from cancelling an in-progress syscall if it has already been submitted to the process; meaning, we can't discard the result of the syscall, we have to wait for it.

This may seem excessive, but the question is, what should we assume as our default? That syscall results can be dropped safely, or that they cannot be dropped? Most syscalls are side-effectful: even a simple read usually consumes data in a side-effectful manner, and others allocate resources which might be leaked, or can cause state changes. Thus, "weakening" (droppability) is not generally true for syscall results: syscall results cannot, in most cases, be safely ignored.

For callers who want to preserve the ability for their coroutine to be cancelled even while waiting for a syscall response, the submit_syscall API can be used.

Note that this Python-level cancellation protection has nothing to do with actually interrupting a syscall. That ability is still preserved with this interface; just send a signal to trigger an EINTR in the syscalling process, and we'll get back that EINTR as the syscall response. If you just want to be able to cancel deadlocked processes, you should do that. That's the true API for "cancellation" of syscalls on Linux.

Likewise, if the rsyscall server dies, or we get an EOF on the syscall connection, or any other event causes response.receive to throw an exception, we'll still return that exception; so you can always fall back on killing the rsyscall server to stop a deadlock.
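The cancellation guarantee can be illustrated with a small sketch. It uses `asyncio` from the standard library purely as a stand-in, since this page does not show which event loop or shielding mechanism an implementation uses. `fake_syscall` and `shielded_syscall` are hypothetical names invented here.

```python
import asyncio
from typing import List

async def fake_syscall(log: List[str]) -> int:
    """Hypothetical stand-in for a syscall already submitted to the process."""
    await asyncio.sleep(0.01)          # the process takes a while to respond
    log.append("syscall completed")
    return 0

async def shielded_syscall(log: List[str]) -> int:
    inner = asyncio.ensure_future(fake_syscall(log))
    try:
        # shield() keeps cancellation of our caller from cancelling `inner`.
        return await asyncio.shield(inner)
    except asyncio.CancelledError:
        # We were cancelled, but the syscall is in flight: wait for its result
        # so the side effect is observed, then let the cancellation proceed.
        result = await inner
        log.append(f"observed result {result} before honoring cancellation")
        raise

async def demo() -> List[str]:
    log: List[str] = []
    task = asyncio.ensure_future(shielded_syscall(log))
    await asyncio.sleep(0)             # let the task start and submit the syscall
    task.cancel()                      # the caller tries to cancel mid-syscall
    try:
        await task
    except asyncio.CancelledError:
        log.append("caller cancelled")
    return log

events = asyncio.run(demo())
```

In this sketch the cancellation is delayed rather than lost: the caller still sees `CancelledError`, but only after the syscall result has been received.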

@abc.abstractmethod
async def syscall(self, number: SYS, arg1=0, arg2=0, arg3=0, arg4=0, arg5=0, arg6=0) -> int:
    """Send a syscall and wait for it to complete, throwing on error results.

    We provide a guarantee that if the syscall was sent to the process, then we will
    not return until the syscall has completed or our connection has broken.  To
    achieve this, we shield against Python coroutine cancellation while waiting for
    the syscall response.

    This guarantee is important so that our caller can deal with state changes caused
    by the syscall. If our coroutine was cancelled in the middle of a syscall, the
    result of the syscall would be discarded, and our caller wouldn't be able to
    guarantee that state changes in the process are reflected in state changes in
    Python.

    For example, a coroutine calling waitid could be cancelled; if that happened, we
    could discard a child state change indicating that the child exited. If that
    happened, future calls to waitid on that child would be invalid, or maybe return
    events for an unrelated child. We'd be completely confused.

    Instead, thanks to our guarantee, syscalls made through this method can be treated
    as atomic: They will either be submitted and completed, or not submitted at all.
    (If they're submitted and not completed due to blocking forever, that just means
    we'll never return.) There's no possibility of making a syscall, causing a
    side-effect, and never learning about the side-effect you caused.

    Since most syscalls use this method, this guarantee applies to most syscalls.

    This prevents us from cancelling an in-progress syscall if it has already been
    submitted to the process; meaning, we can't discard the result of the syscall, we
    have to wait for it.

    This may seem excessive, but the question is, what should we assume as our default?
    That syscall results can be dropped safely, or that they cannot be dropped?
    Most syscalls are side-effectful: even a simple read usually consumes data in a
    side-effectful manner, and others allocate resources which might be leaked, or can
    cause state changes. Thus, "weakening" (droppability) is not generally true for
    syscall results: syscall results cannot, in most cases, be safely ignored.

    For callers who want to preserve the ability for their coroutine to be cancelled
    even while waiting for a syscall response, the `submit_syscall` API can be used.

    Note that this Python-level cancellation protection has nothing to do with
    actually interrupting a syscall. That ability is still preserved with this
    interface; just send a signal to trigger an EINTR in the syscalling process, and
    we'll get back that EINTR as the syscall response. If you just want to be able to
    cancel deadlocked processes, you should do that. That's the true API for
    "cancellation" of syscalls on Linux.

    Likewise, if the rsyscall server dies, or we get an EOF on the syscall connection,
    or any other event causes response.receive to throw an exception, we'll still
    return that exception; so you can always fall back on killing the rsyscall server
    to stop a deadlock.

    """
    pass

async def close_interface(self) ‑> NoneType

Close this syscall interface, shutting down the connection to the remote process.

@abc.abstractmethod
async def close_interface(self) -> None:
    "Close this syscall interface, shutting down the connection to the remote process."
    pass

def get_activity_fd(self) ‑> t.Optional[handle.FileDescriptor]

When this file descriptor is readable, it means other things want to run on this thread.

Users of the SyscallInterface should ensure that when they block, they are monitoring this fd as well.

Typically, this is the file descriptor which the rsyscall server reads for incoming syscalls.
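One way to picture "monitoring this fd as well" is the sketch below. It uses plain OS pipes and the standard `selectors` module as stand-ins; `wait_readable` is a hypothetical helper, and real code would go through the library's own file descriptor handles rather than raw integers.

```python
import os
import selectors
from typing import List, Optional

def wait_readable(own_fd: int, activity_fd: Optional[int], timeout: float = 1.0) -> List[str]:
    """Block until own_fd or activity_fd is readable; report which ones are."""
    sel = selectors.DefaultSelector()
    sel.register(own_fd, selectors.EVENT_READ, "own")
    if activity_fd is not None:
        # Watching the activity fd lets other work wake us while we block.
        sel.register(activity_fd, selectors.EVENT_READ, "activity")
    try:
        return sorted(key.data for key, _ in sel.select(timeout=timeout))
    finally:
        sel.close()

own_r, own_w = os.pipe()       # the fd we actually care about
act_r, act_w = os.pipe()       # stand-in for the activity fd

os.write(act_w, b"x")          # something else wants to run
ready = wait_readable(own_r, act_r)

for fd in (own_r, own_w, act_r, act_w):
    os.close(fd)
```

Here the wait returns because of the activity fd, even though nothing has arrived on the fd the caller was originally waiting for.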

@abc.abstractmethod
def get_activity_fd(self) -> t.Optional[handle.FileDescriptor]:
    """When this file descriptor is readable, it means other things want to run on this thread.

    Users of the SyscallInterface should ensure that when they block, they are
    monitoring this fd as well.

    Typically, this is the file descriptor which the rsyscall server reads for
    incoming syscalls.

    """
    pass

class SyscallHangup (*args, **kwargs)

This syscall was sent, but we got the equivalent of a hangup when we read the result.

We don't know if the syscall was actually executed or not. The hangup may not be actually related to the syscall we sent; we'd also get a hangup for syscalls if the process died.

Note that for some syscalls (exit and exec, namely), this result indicates success. (Although not with absolute certainty, since the hangup could also be unrelated in those cases.)

Raised by SyscallInterface.syscall.
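A caller issuing exit might treat the hangup as the expected outcome, roughly as sketched below. The exception classes are re-declared locally, and `ExitingInterface` and `exit_process` are hypothetical names invented for this illustration. A plain integer stands in for the `SYS` enum, using 60, the x86-64 Linux number for exit.

```python
import asyncio

class SyscallError(Exception):
    pass

class SyscallHangup(SyscallError):
    pass

class ExitingInterface:
    """Hypothetical interface whose process goes away when sent exit."""
    async def syscall(self, number: int, arg1: int = 0) -> int:
        # The process is gone, so reading the response yields a hangup.
        raise SyscallHangup()

SYS_EXIT = 60  # x86-64 Linux syscall number for exit

async def exit_process(sysif: ExitingInterface, status: int) -> str:
    try:
        await sysif.syscall(SYS_EXIT, status)
    except SyscallHangup:
        # For exit, a hangup most likely means success, though it could in
        # principle stem from an unrelated death of the process.
        return "process probably exited"
    # exit never returns to its caller, so a normal return is a surprise.
    raise RuntimeError("exit syscall returned")

outcome = asyncio.run(exit_process(ExitingInterface(), 0))
```

For most other syscalls the same exception would instead signal that the outcome is unknown and should be handled as an error.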

class SyscallHangup(SyscallError):
    """This syscall was sent, but we got the equivalent of a hangup when we read the result.

    We don't know if the syscall was actually executed or not.  The
    hangup may not be actually related to the syscall we sent; we'd
    also get a hangup for syscalls if the process died.

    Note that for some syscalls (exit and exec, namely), this result
    indicates success. (Although not with absolute certainty, since
    the hangup could also be unrelated in those cases.)

    Raised by SyscallInterface.syscall.

    """
    pass

Ancestors

  • rsyscall.near.sysif.SyscallError
  • builtins.Exception
  • builtins.BaseException