Implements a thread-safe allocator with the following guarantees:
* `deinit` reports all leaks and frees all backing memory.
* All allocation mismatches result in either a panic or segmentation
fault.
* Allocations from other `SafeAllocator` instances cause a panic (if
`Options.canary` differ).
* Double frees and operation (resize, remap, and free) races panic or
segmentation fault.
Given the backing allocator does not reuse memory, it does not reuse
memory either and
* Most writes after free will segmentation fault or are eventually
detected and panic.
`std.heap.DebugAllocator` has been deprecated (I have also deprecated
`std.heap.Check` since this was its last usage and returning a `usize`
leak count is a much cleaner approach).
- General Design
Every allocation is trailed by an `AllocFooter` which contains metadata
for the allocation and stack traces. It is protected by a checksum to
catch corruption from allocation overwrites and report canary
mismatches. An allocation's memory has a minimum alignment of
`AllocFooter` so that the footer is at a fixed offset determined from
the allocation size. An allocation's memory is stored either:
* Inside linearly-filled buckets for small allocations.
* Inside an allocation directly from the backing allocator.
To track allocations, each thread maintains a table of backing
allocations. The table may be modified by other threads in the case of
a producer-consumer operation, so the table is a linked list only
expanded by creating new segments. Each thread maintains a linked list
of free entries, which may contain entries from other threads' tables.
In the case of producer-consumer operations, acquire/release ordering
is assumed to be provided externally. This is also assumed by all other
thread-safe allocators that reuse memory as otherwise there would be
data races on reuse of allocated memory.
- Fuzz Tests
Two fuzz tests have also been added for the allocator. They check that
there is no memory reuse, that returned memory is writable, and that
it is not overwritten. The multi-threaded fuzz test spawns a number of
worker threads which are used for all the test runs. I have run these
tests extensively under TSAN.
- Performance Measurements
Building the standard library tests with a RelaseSafe compiler build
and `-Ddebug-allocator`:
```
Benchmark 1 (3 runs): ./master-out/bin/zig test --zig-lib-dir lib lib/std/std.zig -femit-bin=test --test-no-exec
measurement mean ± σ min … max outliers delta
wall_time 29.4s ± 157ms 29.2s … 29.5s 0 ( 0%) 0%
peak_rss 2.24GB ± 3.49MB 2.23GB … 2.24GB 0 ( 0%) 0%
cpu_cycles 143G ± 999M 142G … 144G 0 ( 0%) 0%
instructions 268G ± 5.22M 268G … 268G 0 ( 0%) 0%
cache_references 13.1G ± 88.8M 13.0G … 13.2G 0 ( 0%) 0%
cache_misses 2.38G ± 30.7M 2.35G … 2.41G 0 ( 0%) 0%
branch_misses 634M ± 6.22M 629M … 641M 0 ( 0%) 0%
Benchmark 2 (3 runs): ./branch-out/bin/zig test --zig-lib-dir lib lib/std/std.zig -femit-bin=test --test-no-exec
measurement mean ± σ min … max outliers delta
wall_time 22.1s ± 88.6ms 22.0s … 22.2s 0 ( 0%) ⚡- 24.7% ± 1.0%
peak_rss 1.11GB ± 799KB 1.11GB … 1.11GB 0 ( 0%) ⚡- 50.3% ± 0.3%
cpu_cycles 136G ± 480M 136G … 137G 0 ( 0%) ⚡- 4.4% ± 1.2%
instructions 273G ± 2.07M 273G … 273G 0 ( 0%) 💩+ 1.6% ± 0.0%
cache_references 12.3G ± 71.3M 12.2G … 12.4G 0 ( 0%) ⚡- 6.0% ± 1.4%
cache_misses 2.02G ± 11.5M 2.01G … 2.03G 0 ( 0%) ⚡- 14.9% ± 2.2%
branch_misses 569M ± 2.65M 567M … 572M 0 ( 0%) ⚡- 10.2% ± 1.7%
```
This will be relevant once #31605 is merged.
In general, stack traces do *not* contain unique addresses for inlined
frames, but for error return traces, they will after the above PR. This
bool indicates that code printing the trace should not try to resolve
inline frames since they're explicitly encoded into the instruction
addresses.
This is set as state on stack trace rather than passed into the
formatting methods as an argument, as it's not really a formatting
option--whether or not it's correct to resolve inlines is decided at the
time of capture!
The cmpxchg is there to recover alignment padding that isn't needed (which
can only be determined after the fetch-and-add that reserves it as allocated
memory). As cmpxchg tends to be a very expensive operation, it is actually
faster to introduce an additional branch here that checks if the cmpxchg
would be a noop (because all of the reserved alignment padding was in fact
necessary) and skips it if that's the case.
This does not measurably regress performance if the arena is only accessed
by a single thread and yields slight performance benefits for multi-threaded
usage. If the arena is commonly used for unaligned allocations, the perf
benefits are quite significant.
Co-authored-by: Jacob Young <amazingjacob@gmail.com>
This prevents a race between `alloc` and `free` where T1 receives memory
from `alloc` that is semantically about to be freed by T2 and still being
accessed, but the `free` is already visible to T1. Using acquire-release
here guarantees that any `free` is only published after all accesses to
the memory being freed have already happened.
Co-authored-by: Jacob Young <amazingjacob@gmail.com>
This prevents a race between `alloc` and `free` where T1 receives memory
from `alloc` that is semantically about to be freed by T2 and still being
accessed, but the `free` is already visible to T1. Using acquire-release
here guarantees that any `free` is only published after all accesses to
the memory being freed have already happened.
Co-authored-by: Jacob Young <amazingjacob@gmail.com>
This reverts commit 589bcb2544.
The scenario presented in the reverted commit cannot actually happen.
Even if there are two contiguous arena nodes N1 and N2 and the `end_index`
of N1 points to somewhere in N2, a `resize` can never lead to an increase
of the `end_index` of N1 since it checks whether it's `<= size` first.
A `resize`/`free` *can* decrease `end_index`, but even if it is wrongly
assumed that some allocation that belongs to N2 actually belongs to N1
based on the `end_index` of N1, it can only ever be decreased to the start
of the buffer of N2. That's because a valid allocation of N2 logically
cannot be at any lower address than N2 itself. And any point still in N2
can never also be in N1, so there's no danger of overwriting any other
allocations of N1.
At smaller workloads the overhead of setting up a new `std.Io.Threaded`
for every run to reset thread-local state becomes more noticeable, so this
commit also switches from thread-local storage to a shared atomic variable
for keeping track of the most recent allocation. This has the side-effect
of simplifying the overall implementation a bit.
Shrinking allocations should always succeed with these allocators, even
if the allocation in question is the most recent one and `resize` didn't
manage to decrement the end index of its buffer successfully.
The fuzz test consists of a planning phase where the fuzzing smith is used
to generate a list of actions to be executed and an execution phase where
the actions are all executed by multiple threads at the same time. Each
action is only executed exactly once and is performed on an `ArenaAllocator`
and on a `FixedBufferAllocator` (for reference). The arena is backed by a
special allocator that purposely introduces spurious allocation failures.
After all actions are executed, the contents of all allocation pairs are
compared to each other.
`FixedBufferAllocator.threadSafeAllocator()` already provided a thread-safe
`alloc` implementation, but all other functions were nops. This commit
implements the remaining `Allocator` functions and tightens up the memory
orderings in `alloc` a bit, `monotonic` is good enough here.
If we use `@cmpxchgStrong` instead of `@cmpxchgWeak` to adjust the `end_index`
in `resize` and `free`, the only reason the CAS can fail is that another
thread has changed `end_index` in the meantime. If that's happened, the
allocation we were trying to resize/free isn't the most recent allocation
anymore and there's no point in retrying, so we can get rid of the loop.
The `alignedIndex` function is very hot (literally every single `alloc`
call invokes it at least once) and `std.mem.alignPointerOffset` seems to
be very slow, so this commit replaces this functions with a custom
implementation that doesn't do any unnecessary validation and doesn't have
any branches as a result of that. The validation `std.mem.alignPointerOffset`
does isn't necessary anyway, we're not actually calculating an offset that
we plan to apply to a pointer directly, but an offset into a valid buffer
that we only apply to a pointer if the result is inside of that buffer.
This leads to a ~4% speedup in a synthetic benchmark that just puts a lot
of concurrent load on an `ArenaAllocator`.
Previously resetting with `retain_capacity < @sizeOf(Node)` would create
an invalid node. This is now fixed, plus `Node.size` now has its own `Size`
type that provides additional safety via assertions to prevent bugs like
this in the future.
This is achieved by bumping `end_index` by a large enough amount so that
a suitably aligned region of memory can always be provided. The potential
wasted space this creates is then recovered by a single cmpxchg. This is
always successful for single-threaded arenas which means that this version
still behaves exactly the same as the old single-threaded implementation
when only being accessed by one thread at a time. It can however fail when
another thread bumps `end_index` in the meantime. The observerd failure
rates under extreme load are:
2 Threads: 4-5%
3 Threads: 13-15%
4 Threads: 15-17%
5 Threads: 17-18%
6 Threads: 19-20%
7 Threads: 18-21%
This version offers ~25% faster performance under extreme load from 7 threads,
with diminishing speedups for less threads. The performance for 1 and 2
threads is nearly identical.
Modifies the `Allocator` implementation provided by `ArenaAllocator` to be
threadsafe using only atomics and no synchronization primitives locked
behind an `Io` implementation.
At its core this is a lock-free singly linked list which uses CAS loops to
exchange the head node. A nice property of `ArenaAllocator` is that the
only functions that can ever remove nodes from its linked list are `reset`
and `deinit`, both of which are not part of the `Allocator` interface and
thus aren't threadsafe, so node-related ABA problems are impossible.
There *are* some trade-offs: end index tracking is now per node instead of
per allocator instance. It's not possible to publish a head node and its
end index at the same time if the latter isn't part of the former.
Another compromise had to be made in regards to resizing existing nodes.
Annoyingly, `rawResize` of an arbitrary thread-safe child allocator can
of course never be guaranteed to be an atomic operation, so only one
`alloc` call can ever resize at the same time, other threads have to
consider any resizes they attempt during that time failed. This causes
slightly less optimal behavior than what could be achieved with a mutex.
The LSB of `Node.size` is used to signal that a node is being resized.
This means that all nodes have to have an even size.
Calls to `alloc` have to allocate new nodes optimistically as they can
only know whether any CAS on a head node will succeed after attempting it,
and to attempt the CAS they of course already need to know the address of
the freshly allocated node they are trying to make the new head.
The simplest solution to this would be to just free the new node again if
a CAS fails, however this can be expensive and would mean that in practice
arenas could only really be used with a GPA as their child allocator. To
work around this, this implementation keeps its own free list of nodes
which didn't make their CAS to be reused by a later `alloc` invocation.
To keep things simple and avoid ABA problems the free list is only ever
be accessed beyond its head by 'stealing' the head node (and thus the
entire list) with an atomic swap. This makes iteration and removal trivial
since there's only ever one thread doing it at a time which also owns all
nodes it's holding. When the thread is done it can just push its list onto
the free list again.
This implementation offers comparable performance to the previous one when
only being accessed by a single thread and a slight speedup compared to
the previous implementation wrapped into a `ThreadSafeAllocator` up to ~7
threads performing operations on it concurrently.
(measured on a base model MacBook Pro M1)
Linux's approach to mapping the main thread's stack is quite odd: it essentially
tries to select an mmap address (assuming unhinted mmap calls) which do not
cover the region of virtual address space into which the stack *would* grow
(based on the stack rlimit), but it doesn't actually *prevent* those pages from
being mapped. It also doesn't try particularly hard: it's been observed that the
first (unhinted) mmap call in a simple application is usually put at an address
which is within a gigabyte or two of the stack, which is close enough to make
issues somewhat likely. In particular, if we get an address which is close-ish
to the stack, and then `mremap` it without the MAY_MOVE flag, we are *very*
likely to map pages in this "theoretical stack region". This is particularly a
problem on loongarch64, where the initial mmap address is empirically only
around 200 megabytes from the stack (whereas on most other 64-bit targets it's
closer to a gigabyte).
To work around this, we just need to avoid mremap in some cases. Unfortunately,
this system call isn't used too heavily by musl or glibc, so design issues like
this can and do exist without being caught. So, when `PageAllocator.resize` is
called, let's not try to `mremap` to grow the pages. We can still call `mremap`
in the `PageAllocator.remap` path, because in that case we can set the
`MAY_MOVE` flag, which empirically appears to make the Linux kernel avoid the
problematic "theoretical stack region".
The old logic was fine for targets where the stack grows up (so, literally just
hppa), but problematic on targets where it grows down, because we could hint
that we wanted an allocation to happen in an area of the address space that the
kernel expects to be able to expand the stack into. The kernel is happy to
satisfy such a hint despite the obvious problems this leads to later down the
road.
Co-authored-by: rpkak <rpkak@noreply.codeberg.org>
- delete std.Thread.Futex
- delete std.Thread.Mutex
- delete std.Thread.Semaphore
- delete std.Thread.Condition
- delete std.Thread.RwLock
- delete std.once
std.Thread.Mutex.Recursive remains... for now. it will be replaced with
a special purpose mechanism used only by panic logic.
std.Io.Threaded exposes mutexLock and mutexUnlock for the advanced case
when you need to call them directly.
This commit sketches an idea for how to deal with detection of file
streams as being terminals.
When a File stream is a terminal, writes through the stream should have
their escapes stripped unless the programmer explicitly enables terminal
escapes. Furthermore, the programmer needs a convenient API for
intentionally outputting escapes into the stream. In particular it
should be possible to set colors that are silently discarded when the
stream is not a terminal.
This commit makes `Io.File.Writer` track the terminal mode in the
already-existing `mode` field, making it the appropriate place to
implement escape stripping.
`Io.lockStderrWriter` returns a `*Io.File.Writer` with terminal
detection already done by default. This is a higher-level application
layer stream for writing to stderr.
Meanwhile, `std.debug.lockStderrWriter` also returns a `*Io.File.Writer`
but a lower-level one that is hard-coded to use a static single-threaded
`std.Io.Threaded` instance. This is the same instance that is used for
collecting debug information and iterating the unwind info.
instead, allow the user to set it as a field.
this fixes a bug where leak printing and error printing would run tty
config detection for stderr, and then emit a log, which is not necessary
going to print to stderr.
however, the nice defaults are gone; the user must explicitly assign the
tty_config field during initialization or else the logging will not have
color.
related: https://github.com/ziglang/zig/issues/24510