Add intrinsic for launch-sized workgroup memory on GPUs
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.
# Interface
With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T`
intrinsic.
It returns a pointer to workgroup memory, guaranteed to be aligned to at least the alignment of `T`.
The pointer is dereferenceable for the size specified when launching the
current gpu-kernel (which may be the size of `T`, but can also be larger,
smaller, or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.
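As a rough usage sketch (the import path and element type here are assumptions; only the intrinsic name and signature come from this change):
```rust
#![feature(core_intrinsics)]

// Hedged sketch: the exact module path of the intrinsic is an assumption.
use core::intrinsics::gpu_launch_sized_workgroup_mem;

fn kernel_body() {
    // Every invocation in the workgroup gets the same address, aligned for u32.
    let scratch: *mut u32 = gpu_launch_sized_workgroup_mem::<u32>();
    // SAFETY: only valid if the kernel was launched with at least
    // size_of::<u32>() bytes of workgroup memory; the intrinsic itself does
    // not guarantee that the launch-time size covers `T`.
    unsafe { scratch.write(0) };
}
```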
## Alternative Interfaces
We also considered exposing dynamic workgroup memory as extern
static variables in Rust, mirroring how it is represented in LLVM IR.
However, because the pointer is not guaranteed to be dereferenceable
(that depends on the size allocated at runtime), such a global would have
to be zero-sized, which makes global variables a bad fit.
# Implementation Details
Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside the defined size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu and nvptx, so users have to pass it
out-of-band or rely on target-specific mechanisms for now.
Tracking issue: rust-lang/rust#135516
We were declaring the type as i32, but would often provide an i64.
That has never failed so far, but it starts failing (as in, crashing LLVM)
when working with 128-bit values that are 16-byte aligned. So we may as well
use the more robust approach now.
Store a PathBuf rather than SerializedModule for cached modules
In cg_gcc `ModuleBuffer` already only contains a path anyway. And for moving LTO into `-Zlink-only` we will need to serialize `MaybeLtoModules`. By storing a path for cached modules we avoid writing them to disk a second time during serialization of `MaybeLtoModules`.
Some further improvements will require changes to cg_gcc that I would prefer landing in the cg_gcc repo to actually test the LTO changes in CI.
Part of https://github.com/rust-lang/compiler-team/issues/908
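As a loose sketch of the design choice (names are illustrative, not the actual cg_ssa/cg_gcc types):
```rust
use std::path::PathBuf;

// Before (roughly): the serialized bytes are kept around, so serializing the
// surrounding structure writes the module out a second time.
struct CachedModuleBytes(Vec<u8>);

// After (roughly): only the location of the already-written cache file is
// stored, so serializing `MaybeLtoModules` just records the path.
struct CachedModulePath(PathBuf);
```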
Add autocast support for `x86amx`
Builds on rust-lang/rust#140763 by further adding autocasts for `x86amx` from/to vectors of size 8192 bits.
This also disables SIMD vector ABI checks for the `"unadjusted"` ABI because
- This is primarily used to link with LLVM intrinsics, which don't actually lower to function calls with vector arguments. Even with other cg backends, this is true.
- This ABI is internal and perma-unstable (and also super specific), so it is very unlikely that this will cause breakages.
- (The primary reason) Without doing this we can't actually use 8192 bit long vectors to represent `x86amx`
> Why do we need a bypass for `x86amx`? Can't we use a `#[lang_item]` or something?
If `x86amx` was a normal LLVM type, this approach would've worked and I would also prefer it. But LLVM specifies that
> No instruction is allowed for this type. There are no arguments, arrays, pointers, vectors or constants of this type.
So we can't treat it like a normal type at all -- even if we add it like a lang-item, we would still have to special-case everywhere to check if we are passing to the correct LLVM intrinsic, and only then use the `x86amx` type. IMO this is needlessly complex, and way worse than this solution, which just adds it to the autocast list in cg_llvm
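For scale, a sketch of what an 8192-bit vector type looks like on the Rust side (the type itself is illustrative, not from this PR):
```rust
#![feature(repr_simd)]

// An AMX tile register holds up to 1024 bytes (16 rows x 64 bytes), i.e.
// 8192 bits; a SIMD vector of that size is what cg_llvm autocasts to/from
// LLVM's `x86amx` when calling the corresponding intrinsics.
#[repr(simd)]
#[derive(Copy, Clone)]
pub struct AmxTile([u8; 1024]); // 1024 * 8 = 8192 bits
```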
r? codegen
Introduce `Unnormalized` wrapper
This is the first step of the [eager normalization](https://rust-lang.zulipchat.com/#narrow/channel/364551-t-types.2Ftrait-system-refactor/topic/Eager.20normalization.2C.20ahoy.21/with/582996293) series.
This PR introduces an `Unnormalized` wrapper and makes most normalization routines consume it. The purpose is to make normalization explicit.
This PR contains no behavior change.
API changes are in the first two commits.
There are some normalization routines left untouched:
- `normalize` in the type checker of borrowck: better do it together with `field.ty()` returning `Unnormalized`.
- `normalize_with_depth`: only used inside the old solver. Can be done later.
- `query_normalize`: rarely used.
- misc local normalization helpers.
The compiler errors are mostly fixed via `ast-grep`, with exceptions handled manually.
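For intuition only, a toy version of what such a wrapper can look like (not the actual rustc API):
```rust
// Toy sketch: a value that has not been normalized yet cannot be used as-is;
// the only way to get the inner value out is an explicit normalization step,
// which makes every normalization site visible in the code.
struct Unnormalized<T>(T);

impl<T> Unnormalized<T> {
    fn new(value: T) -> Self {
        Unnormalized(value)
    }

    // Hypothetical API: consume the wrapper by running a normalizer over it.
    fn normalize_with(self, normalize: impl FnOnce(T) -> T) -> T {
        normalize(self.0)
    }
}
```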
Refactor FnDecl and FnSig non-type fields into a new wrapper type
#### Why this Refactor?
This PR is part of an initial cleanup for the [arg splat experiment](https://github.com/rust-lang/rust/issues/153629), but it's a useful refactor by itself.
It refactors the non-type fields of `FnDecl`, `FnSig`, and `FnHeader` into new packed wrapper types, based on this comment in the `splat` experiment PR:
https://github.com/rust-lang/rust/pull/153697#discussion_r3004637413
It also refactors some common `FnSig` creation settings into their own methods. I did this instead of creating a struct with defaults.
#### Relationship to `splat` Experiment
I don't think we can use functional struct updates (`..default()`) to create `FnDecl` and `FnSig`, because we need the bit-packing for the `splat` experiment.
Bit-packing will avoid breaking "type is small" assertions for commonly used types when `splat` is added.
This PR packs these types:
- ExternAbi: enum + `unwind` variants (38) -> 6 bits
- ImplicitSelfKind: enum variants (5) -> 3 bits
- lifetime_elision_allowed, safety, c_variadic: bool -> 1 bit
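A rough sketch of the packing idea (field layout, shifts, and names are mine, not the PR's):
```rust
// 6 bits of ABI + 3 bits of implicit-self kind + a few flag bits fit in a
// single u16, which is what keeps the wrapper (and thus FnSig/FnDecl) small.
#[derive(Copy, Clone)]
struct FnSigFlags(u16);

impl FnSigFlags {
    const ABI_MASK: u16 = 0b11_1111;                  // 6 bits: 38 ExternAbi variants
    const SELF_SHIFT: u32 = 6;
    const SELF_MASK: u16 = 0b111 << Self::SELF_SHIFT; // 3 bits: 5 ImplicitSelfKind variants
    const C_VARIADIC: u16 = 1 << 9;                   // 1 bit per bool field

    fn abi_index(self) -> u16 {
        self.0 & Self::ABI_MASK
    }

    fn implicit_self_index(self) -> u16 {
        (self.0 & Self::SELF_MASK) >> Self::SELF_SHIFT
    }

    fn is_c_variadic(self) -> bool {
        self.0 & Self::C_VARIADIC != 0
    }
}
```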
#### Minor Changes
Fixes some typos, and applies rustfmt to clippy files that got skipped somehow.
preserve SIMD element type information
Preserve the SIMD element type and provide it to LLVM for better optimization.
This is relevant for AArch64 types like `int16x4x2_t`, see also https://github.com/llvm/llvm-project/issues/181514. Such types are defined like so:
```rust
#[repr(simd)]
struct int16x4_t([i16; 4]);
#[repr(C)]
struct int16x4x2_t(pub int16x4_t, pub int16x4_t);
```
Previously this would be translated to the opaque `[2 x <8 x i8>]`, with this PR it is instead `[2 x <4 x i16>]`. That change is not relevant for the ABI, but using the correct type prevents bitcasts that can (indeed, do) confuse the LLVM pattern matcher.
This change will make it possible to implement the deinterleaving loads on AArch64 in a portable way (without neon-specific intrinsics), which means that e.g. Miri or the cranelift backend can run them without additional support.
discussion at [#t-compiler > loss of vector element type information](https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/loss.20of.20vector.20element.20type.20information/with/584483611)
Implement EII for statics
This PR implements EII for statics. I've tried to mirror the implementation for functions in a few places; this causes some duplicate code, but I'm also not really sure whether there's a clean way to merge the implementations.
This does not implement defaults for static EIIs yet; I will do that in a followup PR.
Use closures more consistently in `dep_graph.rs`.
This file has several methods that take a `FnOnce() -> R` closure:
- `DepGraph::with_ignore`
- `DepGraph::with_query_deserialization`
- `DepGraph::with_anon_task`
- `DepGraphData::with_anon_task_inner`
It also has two methods that take a faux closure via an `A` argument and a `fn(TyCtxt<'tcx>, A) -> R` argument:
- `DepGraph::with_task`
- `DepGraphData::with_task`
The rationale is that the faux closures exercise tight control over what state the task has access to. This seems silly when (a) the task is passed a `TyCtxt`, and (b) similar nearby functions take real closures. They are also more awkward to use, e.g. requiring multiple arguments to be gathered into a tuple. This commit changes the faux closures to real closures.
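To illustrate the difference, simplified signatures with a stand-in `Ctxt` instead of `TyCtxt`:
```rust
struct Ctxt;

// "Faux closure" style: state is threaded through an explicit argument and a
// plain fn pointer, so callers must bundle everything they need into `arg`.
fn with_task_fnptr<A, R>(cx: &Ctxt, arg: A, task: fn(&Ctxt, A) -> R) -> R {
    task(cx, arg)
}

// Real closure style: the task simply captures whatever it needs.
fn with_task_closure<R>(cx: &Ctxt, task: impl FnOnce(&Ctxt) -> R) -> R {
    task(cx)
}
```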
r? @Zalathar
Building on the previous change, support scalable vectors with
`simd_select`. Previous patches already landed the necessary changes in
the implementation of this intrinsic, but didn't allow scalable vector
arguments to be passed in.
Previously `sve_cast`'s implementation was abstracted to power both
`sve_cast` and `simd_cast` which supported scalable and non-scalable
vectors respectively. In anticipation of having to do this for another
`simd_*` intrinsic, `sve_cast` is removed and `simd_cast` is changed to
accept both scalable and non-scalable vectors, an approach that will
scale better to the other intrinsics.
hwaddress: automatically add `-Ctarget-feature=+tagged-globals`
Note that since HWAddressSanitizer is/should be a target modifier, we do not have to worry about whether this LLVM target feature changes the ABI.
Fixes: rust-lang/rust#148185
In cg_gcc ModuleBuffer already only contains a path anyway. And for
moving LTO into -Zlink-only we will need to serialize MaybeLtoModules.
By storing a path for cached modules we avoid writing them to the disk a
second time during serialization of MaybeLtoModules.
Add convergent attribute to functions for GPU targets
On targets with convergent operations, we need to add the convergent attribute to all functions that run convergent operations. Following clang, we can conservatively apply the attribute to all functions when compiling for such a target and rely on LLVM optimizing away the attribute in cases where it is not necessary.
This affects the amdgpu and nvptx targets.
cc @kjetilkjeka, @kulst for nvptx
cc @ZuseZ4
r? @nnethercote, as you already reviewed this in the other PR
Split out from rust-lang/rust#149637, the part here should be uncontroversial.
Because the things in this module aren't MIR and don't use anything
from `rustc_middle::mir`. Also, modules that use `mono` often don't use
anything else from `rustc_middle::mir`.
Various LTO cleanups
* Move some special casing of thin local LTO into a single location.
* Move lto_import_only_modules handling for fat LTO earlier. There is no reason to keep it separate until right before passing the LTO modules to the codegen backend. For thin LTO this introduces `ThinLtoInput` to correctly handle incr comp caching.
* Remove the `Linker` type from cg_llvm. It previously helped deduplicate code for `-Zcombine-cgus`, but that flag no longer exists.
Part of https://github.com/rust-lang/compiler-team/issues/908
Abstract over the existing `simd_cast` intrinsic to implement a new
`sve_cast` intrinsic - this is better than allowing scalable vectors to
be used with all of the generic `simd_*` intrinsics.
Clang changed to representing tuples of scalable vectors as
structs rather than as wide vectors (that is, scalable vector types
where the `N` part of the `<vscale x N x ty>` type was multiplied by
the number of vectors). rustc mirrored this in the initial implementation
of scalable vectors.
Earlier versions of our patches used the wide vector representation and
our intrinsic patches used the legacy
`llvm.aarch64.sve.tuple.{create,get,set}{2,3,4}` intrinsics for creating
these tuples/getting/setting the vectors, which were only supported
due to LLVM's `AutoUpgrade` pass converting these intrinsics into
`llvm.vector.insert`. `AutoUpgrade` only supports these legacy intrinsics
with the wide vector representation.
With the current struct representation, Clang has special handling in
codegen for generating `insertvalue`/`extractvalue` instructions for
these operations, which must be replicated by rustc's codegen for our
intrinsics to use. This patch implements new intrinsics in
`core::intrinsics::scalable` (mirroring the structure of
`core::intrinsics::simd`) which rustc lowers to the appropriate
`insertvalue`/`extractvalue` instructions.
Instead of just using regular struct lowering for these types, which
results in an incorrect ABI (e.g. returning indirectly), use
`BackendRepr::ScalableVector` which will lower to the correct type and
be passed in registers.
This also enables some simplifications for generating alloca of scalable
vectors and greater re-use of `scalable_vector_parts`.
A LLVM codegen test demonstrating the changed IR this generates is
included in the next commit alongside some intrinsics that make these
tuples usable.