Add intrinsic for launch-sized workgroup memory on GPUs
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.
# Interface
With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T`
intrinsic.
It returns the pointer to workgroup memory guaranteeing that it is
aligned to at least the alignment of `T`.
The pointer is dereferencable for the size specified when launching the
current gpu-kernel (which may be the size of `T` but can also be larger
or smaller or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.
## Alternative Interfaces
It was also considered to expose dynamic workgroup memory as extern
static variables in Rust, like they are represented in LLVM IR.
However, due to the pointer not being guaranteed to be dereferencable
(that depends on the allocated size at runtime), such a global must be
zero-sized, which makes global variables a bad fit.
# Implementation Details
Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu an nvptx, so users have to pass this
out-of-band or rely on target specific ways for now.
Tracking issue: rust-lang/rust#135516
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.
# Interface
With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T`
intrinsic.
It returns the pointer to workgroup memory guaranteeing that it is
aligned to at least the alignment of `T`.
The pointer is dereferencable for the size specified when launching the
current gpu-kernel (which may be the size of `T` but can also be larger
or smaller or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.
## Alternative Interfaces
It was also considered to expose dynamic workgroup memory as extern
static variables in Rust, like they are represented in LLVM IR.
However, due to the pointer not being guaranteed to be dereferencable
(that depends on the allocated size at runtime), such a global must be
zero-sized, which makes global variables a bad fit.
# Implementation Details
Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu an nvptx, so users have to pass this
out-of-band or rely on target specific ways for now.
Use convergent attribute to funcs for GPU targets
On targets with convergent operations, we need to add the convergent attribute to all functions that run convergent operations. Following clang, we can conservatively apply the attribute to all functions when compiling for such a target and rely on LLVM optimizing away the attribute in cases where it is not necessary.
This affects the amdgpu and nvptx targets.
cc @kjetilkjeka, @kulst for nvptx
cc @ZuseZ4
r? @nnethercote, as you already reviewed this in the other PR
Split out from rust-lang/rust#149637, the part here should be uncontroversial.
debuginfo: emit DW_TAG_call_site entries
Set `FlagAllCallsDescribed` on function definition DIEs so LLVM emits DW_TAG_call_site entries, letting debuggers and analysis tools track tail calls.
Add `-Zsanitize=kernel-hwaddress`
The Linux kernel has a config option called `CONFIG_KASAN_SW_TAGS` that enables `-fsanitize=kernel-hwaddress`. This is not supported by Rust.
One slightly awkward detail is that `#[sanitize(address = "off")]` applies to both `-Zsanitize=address` and `-Zsanitize=kernel-address`. Probably it was done this way because both are the same LLVM pass. I replicated this logic here for hwaddress, but it might be undesirable.
Note that `#[sanitize(kernel_hwaddress = "off")]` could be supported as an annotation on statics, but since it's also missing for `#[sanitize(hwaddress = "off")]`, I did not add it.
MCP: https://github.com/rust-lang/compiler-team/issues/975
Tracking issue: https://github.com/rust-lang/rust/issues/154171
cc @rcvalle @maurer @ojeda
On targets with convergent operations, we need to add the convergent
attribute to all functions that run convergent operations. Following
clang, we can conservatively apply the attribute to all functions when
compiling for such a target and rely on LLVM optimizing away the
attribute in cases where it is not necessary.
This affects the amdgpu and nvptx targets.
Always use the ThinLTO pipeline for pre-link optimizations
When using cargo this was already effectively done for all dependencies as cargo passes -Clinker-plugin-lto without -Clto=fat/thin. -Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO pre-link pipeline is faster than the fat LTO one. And according to the benchmarks in [^1] there is barely any runtime performance difference between executables that used fat LTO with the fat vs ThinLTO pre-link pipeline.
This also helps avoid having yet another code path if we want to support Unified LTO (that is a single bitcode file that supports being used for both fat LTO and ThinLTO when using linker plugin LTO, we already support it when rustc does LTO as ThinLTO bitcode is enough of a superset of fat LTO bitcode that it happens to work by accident if you don't explicitly have a check preventing mixing of them for the current set of LTO features that rustc exposes.) I'm currently still investigating if rustc would benefit from Unified LTO and how exactly to integrate it.
[^1]: https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774
[win] Fix truncated unwinds for Arm64 Windows
Panic backtraces on ARM64 Windows are truncated because Rust's LLVM configuration sets `NoTrapAfterNoreturn = true`, which suppresses the generation of `brk #0x1` (trap) instructions after calls to `noreturn` functions. Without this trap instruction, the return address from a `noreturn` call points past the end of the calling function into an unrelated function, causing `RtlLookupFunctionEntry` to return the wrong unwind information, which terminates the stack walk prematurely.
In general, `NoTrapAfterNoreturn = true` is recommended against for Windows, since we have seen security vulnerabilities in the past where an attacker has managed to return from a noreturn function, or the function wasn't actually noereturn, resulting in executing whatever was after the call.
This change disables setting `NoTrapAfterNoreturn = true` for Windows.
Fixesrust-lang/rust#140489
When using cargo this was already effectively done for all dependencies
as cargo passes -Clinker-plugin-lto without -Clto=fat/thin.
-Clinker-plugin-lto assumes that ThinLTO will be used. The ThinLTO
pre-link pipeline is faster than the fat LTO one. And according to the
benchmarks in [1] there is barely any runtime performance difference
between executables that used fat LTO with the fat vs ThinLTO pre-link
pipeline.
[1]: https://discourse.llvm.org/t/rfc-a-unified-lto-bitcode-frontend/61774
The build script here wants to sniff for `-stdlib=libc++` but was
missing the leading dashes. We caught this on the Rust/LLVM HEADs
builder which also uses libc++.
Use shell-words to parse output from llvm-config
llvm-config might output paths that contain spaces, in which case the naive approach of splitting on whitespace breaks; instead we ask llvm-config to quote any paths and use the [shell-words](https://crates.io/crates/shell-words) crate by @tmiasko (a new dependency) to parse the output.
r? ChrisDenton
Fixesrust-lang/rust#152707
As far as I can tell it was introduced to allow fat LTO with
-Clinker-plugin-lto. Later a change was made to automatically disable
ThinLTO summary generation when -Clinker-plugin-lto -Clto=fat is used,
so we can safely remove it.
llvm-config might output paths that contain spaces, in which case the
naive approach of splitting on whitespace breaks; instead we ask
llvm-config to quote any paths and use the shell-words crate to parse
the output.
Add -Z large-data-threshold
This flag allows specifying the threshold size for placing static data in large data sections when using the medium code model on x86-64.
When using -Ccode-model=medium, data smaller than this threshold uses RIP-relative addressing (32-bit offsets), while larger data uses absolute 64-bit addressing. This allows the compiler to generate more efficient code for smaller data while still supporting data larger than 2GB.
This mirrors the -mlarge-data-threshold flag available in GCC and Clang. The default threshold is 65536 bytes (64KB) if not specified, matching LLVM's default behavior.
The attribute now has a size parameter and sorts differently:
* Explicitly omit size parameter during construction on 23+
* Tolerate alternate sorting in tests
https://github.com/llvm/llvm-project/pull/171712
This flag allows specifying the threshold size for placing static data
in large data sections when using the medium code model on x86-64.
When using -Ccode-model=medium, data smaller than this threshold uses
RIP-relative addressing (32-bit offsets), while larger data uses
absolute 64-bit addressing. This allows the compiler to generate more
efficient code for smaller data while still supporting data larger than
2GB.
This mirrors the -mlarge-data-threshold flag available in GCC and Clang.
The default threshold is 65536 bytes (64KB) if not specified, matching
LLVM's default behavior.
Allow inline calls to offload intrinsic
Removes explicit insertion point handling and recovers the pointer at the end of the saved basic block.
r? `@ZuseZ4`
fixes: https://github.com/rust-lang/rust/issues/150413
Accommodate LLVM PassPlugin rename
LLVM [recently moved](https://github.com/llvm/llvm-project/pull/173279) their `PassPlugin` files to a new folder. This PR updates our `PassWrapper` to point to the new location.