This will be relevant once #31605 is merged.
In general, stack traces do *not* contain unique addresses for inlined
frames, but for error return traces, they will after the above PR. This
bool indicates that code printing the trace should not try to resolve
inline frames since they're explicitly encoded into the instruction
addresses.
This is set as state on the stack trace rather than passed into the
formatting methods as an argument, since it's not really a formatting
option: whether or not it's correct to resolve inlines is decided at the
time of capture!
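Roughly, the shape is the existing `std.builtin.StackTrace` plus one flag;
the field name here is hypothetical (the real one is defined by the PR above):

```zig
pub const StackTrace = struct {
    index: usize,
    instruction_addresses: []usize,
    /// Hypothetical field: when true, every inlined frame already has its own
    /// unique address in `instruction_addresses`, so printing code must not
    /// expand inline frames again. Decided at capture time, not format time.
    inline_frames_resolved: bool = false,
};
```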
The cmpxchg is there to recover alignment padding that isn't needed (which
can only be determined after the fetch-and-add that reserves it as allocated
memory). As cmpxchg tends to be a very expensive operation, it is actually
faster to introduce an additional branch here that checks if the cmpxchg
would be a noop (because all of the reserved alignment padding was in fact
necessary) and skips it if that's the case.
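Roughly, the recovery path looks something like this (an illustrative sketch,
not the exact code in this commit; the helper call and memory orderings are
assumptions):

```zig
const std = @import("std");

// Sketch: bump allocation over `buf` with an atomic `end_index`, over-reserving
// for alignment and then handing the unused padding back with a single cmpxchg.
fn allocAligned(buf: []u8, end_index: *usize, len: usize, alignment: usize) ?[*]u8 {
    // Reserve enough for the worst-case alignment padding up front.
    const reserved = len + alignment - 1;
    const start = @atomicRmw(usize, end_index, .Add, reserved, .monotonic);
    const reserved_end = start + reserved;
    if (reserved_end > buf.len) return null; // (rollback of the reservation elided)

    const aligned_addr = std.mem.alignForward(usize, @intFromPtr(buf.ptr) + start, alignment);
    const new_end = aligned_addr + len - @intFromPtr(buf.ptr);

    // The branch: skip the cmpxchg entirely when it would be a no-op because
    // all of the reserved padding turned out to be needed.
    if (new_end != reserved_end) {
        // Hand back the unused padding, but only if nobody allocated after us.
        _ = @cmpxchgStrong(usize, end_index, reserved_end, new_end, .monotonic, .monotonic);
    }
    return @as([*]u8, @ptrFromInt(aligned_addr));
}
```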
This does not measurably regress performance if the arena is only accessed
by a single thread and yields slight performance benefits for multi-threaded
usage. If the arena is commonly used for unaligned allocations, the perf
benefits are quite significant.
Co-authored-by: Jacob Young <amazingjacob@gmail.com>
This prevents a race between `alloc` and `free` where T1 receives memory
from `alloc` that T2 is semantically about to free and is still accessing,
because the `free` is already visible to T1. Using acquire-release here
guarantees that any `free` is only published after all accesses to the
memory being freed have already happened.
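A minimal sketch of the pairing (illustrative names; assumes the kind of bump
`end_index` these allocators use):

```zig
var end_index: usize = 0;

// T2: all of T2's accesses to the freed bytes happen before this release,
// so they are published together with the lowered `end_index`.
fn freeMostRecent(old_end: usize, new_end: usize) bool {
    return @cmpxchgStrong(usize, &end_index, old_end, new_end, .release, .monotonic) == null;
}

// T1: the acquire guarantees that if this bump observes the lowered
// `end_index` (and can therefore hand out the just-freed bytes), it also
// observes all of T2's earlier accesses to them as completed.
fn reserve(n: usize) usize {
    return @atomicRmw(usize, &end_index, .Add, n, .acquire);
}
```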
Co-authored-by: Jacob Young <amazingjacob@gmail.com>
This reverts commit 589bcb2544.
The scenario presented in the reverted commit cannot actually happen.
Even if there are two contiguous arena nodes N1 and N2 and the `end_index`
of N1 points to somewhere in N2, a `resize` can never lead to an increase
of the `end_index` of N1 since it checks whether it's `<= size` first.
A `resize`/`free` *can* decrease `end_index`, but even if it is wrongly
assumed that some allocation that belongs to N2 actually belongs to N1
based on the `end_index` of N1, it can only ever be decreased to the start
of the buffer of N2. That's because a valid allocation of N2 logically
cannot be at any lower address than N2 itself. And any point still in N2
can never also be in N1, so there's no danger of overwriting any other
allocations of N1.
At smaller workloads the overhead of setting up a new `std.Io.Threaded`
for every run to reset thread-local state becomes more noticeable, so this
commit also switches from thread-local storage to a shared atomic variable
for keeping track of the most recent allocation. This has the side-effect
of simplifying the overall implementation a bit.
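The shared state is essentially just this (names illustrative):

```zig
// One atomic shared by all threads instead of a thread-local slot.
var most_recent_addr: usize = 0;

fn noteAllocation(addr: usize) void {
    @atomicStore(usize, &most_recent_addr, addr, .monotonic);
}

fn mostRecentAllocation() usize {
    return @atomicLoad(usize, &most_recent_addr, .monotonic);
}
```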
Shrinking allocations should always succeed with these allocators, even
if the allocation in question is the most recent one and `resize` didn't
manage to decrement the end index of its buffer successfully.
The fuzz test consists of a planning phase where the fuzzing smith is used
to generate a list of actions to be executed and an execution phase where
the actions are all executed by multiple threads at the same time. Each
action is executed exactly once and is performed on an `ArenaAllocator`
and on a `FixedBufferAllocator` (for reference). The arena is backed by a
special allocator that purposely introduces spurious allocation failures.
After all actions are executed, the contents of all allocation pairs are
compared to each other.
`FixedBufferAllocator.threadSafeAllocator()` already provided a thread-safe
`alloc` implementation, but all other functions were no-ops. This commit
implements the remaining `Allocator` functions and tightens up the memory
orderings in `alloc` a bit; `monotonic` is good enough here.
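A sketch of what such a thread-safe `alloc` amounts to (illustrative, not the
exact std code; assumes a recent Zig with lowercase atomic orderings):

```zig
const std = @import("std");

fn threadSafeAlloc(buf: []u8, end_index: *usize, len: usize, alignment: usize) ?[*]u8 {
    var end = @atomicLoad(usize, end_index, .monotonic);
    while (true) {
        const aligned_addr = std.mem.alignForward(usize, @intFromPtr(buf.ptr) + end, alignment);
        const new_end = aligned_addr + len - @intFromPtr(buf.ptr);
        if (new_end > buf.len) return null;
        // Per the commit, `monotonic` is good enough for the index bump itself.
        end = @cmpxchgWeak(usize, end_index, end, new_end, .monotonic, .monotonic) orelse
            return @as([*]u8, @ptrFromInt(aligned_addr));
    }
}
```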
If we use `@cmpxchgStrong` instead of `@cmpxchgWeak` to adjust the `end_index`
in `resize` and `free`, the only reason the CAS can fail is that another
thread has changed `end_index` in the meantime. If that's happened, the
allocation we were trying to resize/free isn't the most recent allocation
anymore and there's no point in retrying, so we can get rid of the loop.
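In sketch form (illustrative names and orderings), the loop-free adjustment is
just:

```zig
// With a strong CAS there are no spurious failures, so a failed CAS can only
// mean another thread moved `end_index`: our allocation is no longer the most
// recent one, and retrying would be pointless.
fn tryAdjustEndIndex(end_index: *usize, expected: usize, new: usize) bool {
    return @cmpxchgStrong(usize, end_index, expected, new, .release, .monotonic) == null;
}
```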
The `alignedIndex` function is very hot (literally every single `alloc`
call invokes it at least once) and `std.mem.alignPointerOffset` seems to
be very slow, so this commit replaces this function with a custom
implementation that doesn't do any unnecessary validation and, as a result,
doesn't have any branches. The validation `std.mem.alignPointerOffset`
does isn't necessary anyway: we're not actually calculating an offset that
we plan to apply to a pointer directly, but an offset into a valid buffer
that we only apply to a pointer if the result is inside of that buffer.
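The replacement boils down to a branchless align-forward over the buffer index,
something like this (illustrative names; `alignment` is assumed to be a power
of two):

```zig
fn alignedIndex(base: [*]u8, index: usize, alignment: usize) usize {
    // No validity checks: the result is only an index into a known-valid
    // buffer and is only turned into a pointer if it still fits inside it.
    const addr = @intFromPtr(base) + index;
    const aligned_addr = (addr + (alignment - 1)) & ~(alignment - 1);
    return aligned_addr - @intFromPtr(base);
}
```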
This leads to a ~4% speedup in a synthetic benchmark that just puts a lot
of concurrent load on an `ArenaAllocator`.
Previously resetting with `retain_capacity < @sizeOf(Node)` would create
an invalid node. This is now fixed, plus `Node.size` now has its own `Size`
type that provides additional safety via assertions to prevent bugs like
this in the future.
This is achieved by bumping `end_index` by a large enough amount so that
a suitably aligned region of memory can always be provided. The potential
wasted space this creates is then recovered by a single cmpxchg. This is
always successful for single-threaded arenas, which means that this version
still behaves exactly the same as the old single-threaded implementation
when only being accessed by one thread at a time. It can, however, fail when
another thread bumps `end_index` in the meantime. The observed failure
rates under extreme load are:
2 Threads: 4-5%
3 Threads: 13-15%
4 Threads: 15-17%
5 Threads: 17-18%
6 Threads: 19-20%
7 Threads: 18-21%
This version offers ~25% faster performance under extreme load from 7 threads,
with diminishing speedups for fewer threads. The performance for 1 and 2
threads is nearly identical.
Modifies the `Allocator` implementation provided by `ArenaAllocator` to be
threadsafe using only atomics and no synchronization primitives locked
behind an `Io` implementation.
At its core this is a lock-free singly linked list which uses CAS loops to
exchange the head node. A nice property of `ArenaAllocator` is that the
only functions that can ever remove nodes from its linked list are `reset`
and `deinit`, both of which are not part of the `Allocator` interface and
thus aren't threadsafe, so node-related ABA problems are impossible.
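The head exchange is the classic lock-free push, roughly as follows
(illustrative sketch; assumes atomics over optional pointers):

```zig
const Node = struct { next: ?*Node, size: usize };

fn pushHead(head: *?*Node, node: *Node) void {
    var old = @atomicLoad(?*Node, head, .monotonic);
    while (true) {
        node.next = old;
        // Publish the node's contents together with the new head.
        old = @cmpxchgWeak(?*Node, head, old, node, .release, .monotonic) orelse return;
    }
}
```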
There *are* some trade-offs: end index tracking is now per node instead of
per allocator instance. It's not possible to publish a head node and its
end index at the same time if the latter isn't part of the former.
Another compromise had to be made with regard to resizing existing nodes.
Annoyingly, `rawResize` of an arbitrary thread-safe child allocator can
of course never be guaranteed to be an atomic operation, so only one
`alloc` call can ever resize at a time; other threads have to consider
any resizes they attempt during that time failed. This causes slightly
less optimal behavior than what could be achieved with a mutex.
The LSB of `Node.size` is used to signal that a node is being resized.
This means that all nodes have to have an even size.
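Sketched out, the resize flag works roughly like this (illustrative; the std
implementation may differ in detail):

```zig
// Node sizes are always even, so a set LSB can only mean "resize in progress".
fn tryBeginResize(size: *usize) ?usize {
    const old = @atomicRmw(usize, size, .Or, 1, .acquire);
    if (old & 1 != 0) return null; // another thread is resizing; treat ours as failed
    return old; // caller owns the resize until it stores an even size again
}

fn endResize(size: *usize, new_size: usize) void {
    @atomicStore(usize, size, new_size, .release); // new_size must be even
}
```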
Calls to `alloc` have to allocate new nodes optimistically as they can
only know whether any CAS on a head node will succeed after attempting it,
and to attempt the CAS they of course already need to know the address of
the freshly allocated node they are trying to make the new head.
The simplest solution to this would be to just free the new node again if
a CAS fails; however, this can be expensive and would mean that in practice
arenas could only really be used with a GPA as their child allocator. To
work around this, this implementation keeps its own free list of nodes
whose CAS didn't succeed, to be reused by a later `alloc` invocation.
To keep things simple and avoid ABA problems, the free list is only ever
accessed beyond its head by 'stealing' the head node (and thus the
entire list) with an atomic swap. This makes iteration and removal trivial
since there's only ever one thread doing it at a time, which also owns all
the nodes it's holding. When the thread is done, it can just push its list
onto the free list again.
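The stealing pattern, roughly (illustrative sketch; assumes atomics over
optional pointers):

```zig
const Node = struct { next: ?*Node };

var free_list: ?*Node = null;

// A single swap takes ownership of every node currently on the list, so the
// caller may walk and modify the chain without further atomics.
fn stealAll() ?*Node {
    return @atomicRmw(?*Node, &free_list, .Xchg, null, .acquire);
}

// When done, splice the whole chain back in front of the current head.
fn donate(first: *Node, last: *Node) void {
    var old = @atomicLoad(?*Node, &free_list, .monotonic);
    while (true) {
        last.next = old;
        old = @cmpxchgWeak(?*Node, &free_list, old, first, .release, .monotonic) orelse return;
    }
}
```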
This implementation offers comparable performance to the previous one when
only being accessed by a single thread, and a slight speedup compared to
the previous implementation wrapped in a `ThreadSafeAllocator` for up to ~7
threads performing operations on it concurrently.
(measured on a base model MacBook Pro M1)
Linux's approach to mapping the main thread's stack is quite odd: it essentially
tries to select an mmap address (assuming unhinted mmap calls) which does not
cover the region of virtual address space into which the stack *would* grow
(based on the stack rlimit), but it doesn't actually *prevent* those pages from
being mapped. It also doesn't try particularly hard: it's been observed that the
first (unhinted) mmap call in a simple application is usually put at an address
which is within a gigabyte or two of the stack, which is close enough to make
issues somewhat likely. In particular, if we get an address which is close-ish
to the stack, and then `mremap` it without the MAY_MOVE flag, we are *very*
likely to map pages in this "theoretical stack region". This is particularly a
problem on loongarch64, where the initial mmap address is empirically only
around 200 megabytes from the stack (whereas on most other 64-bit targets it's
closer to a gigabyte).
To work around this, we just need to avoid mremap in some cases. Unfortunately,
this system call isn't used too heavily by musl or glibc, so design issues like
this can and do exist without being caught. So, when `PageAllocator.resize` is
called, let's not try to `mremap` to grow the pages. We can still call `mremap`
in the `PageAllocator.remap` path, because in that case we can set the
`MAY_MOVE` flag, which empirically appears to make the Linux kernel avoid the
problematic "theoretical stack region".
The old logic was fine for targets where the stack grows up (so, literally just
hppa), but problematic on targets where it grows down, because we could hint
that we wanted an allocation to happen in an area of the address space that the
kernel expects to be able to expand the stack into. The kernel is happy to
satisfy such a hint despite the obvious problems this leads to later down the
road.
Co-authored-by: rpkak <rpkak@noreply.codeberg.org>
- delete std.Thread.Futex
- delete std.Thread.Mutex
- delete std.Thread.Semaphore
- delete std.Thread.Condition
- delete std.Thread.RwLock
- delete std.once
std.Thread.Mutex.Recursive remains... for now. it will be replaced with
a special purpose mechanism used only by panic logic.
std.Io.Threaded exposes mutexLock and mutexUnlock for the advanced case
when you need to call them directly.
This commit sketches an idea for how to handle detecting whether file
streams are terminals.
When a File stream is a terminal, writes through the stream should have
their escapes stripped unless the programmer explicitly enables terminal
escapes. Furthermore, the programmer needs a convenient API for
intentionally outputting escapes into the stream. In particular it
should be possible to set colors that are silently discarded when the
stream is not a terminal.
This commit makes `Io.File.Writer` track the terminal mode in the
already-existing `mode` field, making it the appropriate place to
implement escape stripping.
`Io.lockStderrWriter` returns a `*Io.File.Writer` with terminal
detection already done by default. This is a higher-level application
layer stream for writing to stderr.
Meanwhile, `std.debug.lockStderrWriter` also returns a `*Io.File.Writer`
but a lower-level one that is hard-coded to use a static single-threaded
`std.Io.Threaded` instance. This is the same instance that is used for
collecting debug information and iterating the unwind info.
instead, allow the user to set it as a field.
this fixes a bug where leak printing and error printing would run tty
config detection for stderr, and then emit a log, which is not necessarily
going to print to stderr.
however, the nice defaults are gone; the user must explicitly assign the
tty_config field during initialization or else the logging will not have
color.
related: https://github.com/ziglang/zig/issues/24510
This is a major refactor to `Step.Run` which adds new functionality,
primarily to the execution of Zig tests.
* All tests are run, even if a test crashes. This happens through the
same mechanism as timeouts, where the test process is repeatedly
respawned as needed.
* The build status output is more precise. For each unit test, it
differentiates pass, skip, fail, crash, and timeout. Memory leaks are
reported separately, as they do not indicate a test's "status", but
are rather an additional property (a test with leaks may still pass!).
* The number of memory leaks is tracked and reported, both per-test and
for a whole `Run` step.
* Reporting is made clearer when a step is failed solely due to error
logs (`std.log.err`) even though every unit test passed.
Our usage of `ucontext_t` in the standard library was kind of
problematic. We unnecessarily mimicked libc-specific structures, and our
`getcontext` implementation was overkill for our use case of stack
tracing.
This commit introduces a new namespace, `std.debug.cpu_context`, which
contains "context" types for various architectures (currently x86,
x86_64, ARM, and AARCH64) containing the general-purpose CPU registers;
the ones needed in practice for stack unwinding. Each implementation has
a function `current` which populates the structure using inline
assembly. The structure is user-overrideable, though that should only be
necessary if the standard library does not have an implementation for
the *architecture*: that is to say, none of this is OS-dependent.
Of course, in POSIX signal handlers, we get a `ucontext_t` from the
kernel. The function `std.debug.cpu_context.fromPosixSignalContext`
converts this to a `std.debug.cpu_context.Native` with a big ol' target
switch.
This functionality is not exposed from `std.c` or `std.posix`, and
neither are `ucontext_t`, `mcontext_t`, or `getcontext`. The rationale
is that these types and functions do not conform to a specific ABI, and
in fact tend to get updated over time based on CPU features and
extensions; in addition, different libcs use different structures which
are "partially compatible" with the kernel structure. Overall, it's a
mess, but all we need is the kernel context, so we can just define a
kernel-compatible structure as long as we don't claim C compatibility by
putting it in `std.c` or `std.posix`.
This change resulted in a few nice `std.debug` simplifications, but
nothing too noteworthy. However, the main benefit of this change is that
DWARF unwinding---sometimes necessary for collecting stack traces
reliably---now requires far less target-specific integration.
Also fix a bug I noticed in `PageAllocator` (I found this due to a bug
in my distro's QEMU distribution; thanks, broken QEMU patch!) and I
think a couple of minor bugs in `std.debug`.
Resolves: #23801
Resolves: #23802