Add intrinsic for launch-sized workgroup memory on GPUs
Workgroup memory is a memory region that is shared between all
threads in a workgroup on GPUs. Workgroup memory can be allocated
statically or after compilation, when launching a gpu-kernel.
The intrinsic added here returns the pointer to the memory that is
allocated at launch-time.
# Interface
With this change, workgroup memory can be accessed in Rust by
calling the new `gpu_launch_sized_workgroup_mem<T>() -> *mut T`
intrinsic.
It returns the pointer to workgroup memory, guaranteeing that it is
aligned to at least the alignment of `T`.
The pointer is dereferenceable for the size specified when launching the
current gpu-kernel (which may be the size of `T` but can also be larger,
smaller, or zero).
All calls to this intrinsic return a pointer to the same address.
See the intrinsic documentation for more details.
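For illustration, a minimal usage sketch (hedged: the `core::intrinsics` import path and the zero-filling use case are assumptions, not part of this change):
```rust
#![feature(core_intrinsics)]

use core::intrinsics::gpu_launch_sized_workgroup_mem;

/// Hypothetical kernel helper: zero `len_in_floats` f32 values in the
/// launch-sized workgroup memory. The length must come from the host
/// (e.g. as a kernel argument), since the allocated size cannot be
/// queried from inside the kernel.
unsafe fn zero_workgroup_mem(len_in_floats: usize) {
    // Every call returns the same address, aligned to at least align_of::<f32>().
    let mem: *mut f32 = unsafe { gpu_launch_sized_workgroup_mem::<f32>() };
    for i in 0..len_in_floats {
        // Only dereferenceable up to the size chosen at launch time.
        unsafe { mem.add(i).write(0.0) };
    }
}
```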
## Alternative Interfaces
It was also considered to expose dynamic workgroup memory as extern
static variables in Rust, like they are represented in LLVM IR.
However, because the pointer is not guaranteed to be dereferenceable
(that depends on the size allocated at runtime), such a global must be
zero-sized, which makes global variables a bad fit.
# Implementation Details
Workgroup memory in amdgpu and nvptx lives in address space 3.
Workgroup memory from a launch is implemented by creating an
external global variable in address space 3. The global is declared with
size 0, as the actual size is only known at runtime. It is defined
behavior in LLVM to access an external global outside its declared size.
There is no similar way to get the allocated size of launch-sized
workgroup memory on amdgpu and nvptx, so users have to pass this
out-of-band or rely on target-specific ways for now.
Tracking issue: rust-lang/rust#135516
codegen: Copy to an alloca when an indirectly passed argument is neither by-val nor by-move.
Fixes https://github.com/rust-lang/rust/issues/155241.
When a value is passed via an indirect pointer, the value needs to be copied to a new alloca. On x86_64-unknown-linux-gnu, `Thing` below is such a case:
```rust
#[derive(Clone, Copy)]
struct Thing(usize, usize, usize);
pub fn foo() {
    let thing = Thing(0, 0, 0);
    bar(thing);
    assert_eq!(thing.0, 0);
}

#[inline(never)]
#[unsafe(no_mangle)]
pub fn bar(mut thing: Thing) {
    thing.0 = 1;
}
```
Before passing `thing` to the `bar` function, it needs to be copied to an alloca that is then passed to `bar`.
```llvm
%0 = alloca [24 x i8], align 8
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %0, ptr align 8 %thing, i64 24, i1 false)
call void @bar(ptr %0)
```
This patch applies the rule to the untupled arguments as well.
```rust
#![feature(fn_traits)]
#[derive(Clone, Copy)]
struct Thing(usize, usize, usize);
#[inline(never)]
#[unsafe(no_mangle)]
pub fn foo() {
    let thing = (Thing(0, 0, 0),);
    (|mut thing: Thing| {
        thing.0 = 1;
    }).call(thing);
    assert_eq!(thing.0.0, 0);
}
```
For this case, this patch changes from
```llvm
; call example::foo::{closure#0}
call void @_RNCNvCs15qdZVLwHPA_7example3foo0B3_(ptr ..., ptr %thing)
```
to
```llvm
%0 = alloca [24 x i8], align 8
call void @llvm.memcpy.p0.p0.i64(ptr align 8 %0, ptr align 8 %thing, i64 24, i1 false)
; call example::foo::{closure#0}
call void @_RNCNvCs15qdZVLwHPA_7example3foo0B3_(ptr ..., ptr %0)
```
However, the same rule cannot be applied to tail calls; that would be unsound, because the caller's stack frame is overwritten by the callee's stack frame. Fortunately, https://github.com/rust-lang/rust/pull/151143 has already handled that special case; we must not copy again.
No copy is needed for by-move arguments, because the argument is passed to the callee "in place".
No copy is needed for by-val arguments either, because the attribute implies that a hidden copy of the pointee is made between the caller and the callee.
NOTE: The patch has a trick for tail calls, which we pass by-move. We could choose to copy to an alloca even for by-move arguments, but tail calls strictly require by-move.
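To illustrate why tail calls are excluded, here is a hedged sketch using the unstable `explicit_tail_calls` feature; the function names are illustrative, not from the patch:
```rust
#![feature(explicit_tail_calls)]

#[derive(Clone, Copy)]
struct Thing(usize, usize, usize);

fn callee(t: Thing) -> usize {
    t.0
}

fn caller(t: Thing) -> usize {
    // The caller's stack frame is replaced by the callee's, so the argument
    // must be passed by-move; copying it into a caller-owned alloca first
    // would hand the callee a pointer into a dead frame.
    become callee(t)
}
```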
Store a PathBuf rather than SerializedModule for cached modules
In cg_gcc, `ModuleBuffer` already only contains a path anyway. And for moving LTO into `-Zlink-only` we will need to serialize `MaybeLtoModules`. By storing a path for cached modules we avoid writing them to disk a second time during serialization of `MaybeLtoModules`.
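Roughly the idea, as a hedged sketch; `CachedModuleSource` and its variants are illustrative names, not the actual cg_ssa/cg_gcc types:
```rust
use std::path::PathBuf;

enum CachedModuleSource {
    // Previously: the serialized module bytes were held in memory and had to
    // be written out again when the cached module was serialized.
    Buffer(Vec<u8>),
    // Now: only the path of the file already on disk is stored, so serializing
    // `MaybeLtoModules` does not write the module a second time.
    Path(PathBuf),
}
```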
Some further improvements will require changes to cg_gcc that I would prefer landing in the cg_gcc repo to actually test the LTO changes in CI.
Part of https://github.com/rust-lang/compiler-team/issues/908
Refactor FnDecl and FnSig non-type fields into a new wrapper type
#### Why this Refactor?
This PR is part of an initial cleanup for the [arg splat experiment](https://github.com/rust-lang/rust/issues/153629), but it's a useful refactor by itself.
It refactors the non-type fields of `FnDecl`, `FnSig`, and `FnHeader` into new packed wrapper types, based on this comment in the `splat` experiment PR:
https://github.com/rust-lang/rust/pull/153697#discussion_r3004637413
It also refactors some common `FnSig` creation settings into their own methods. I did this instead of creating a struct with defaults.
#### Relationship to `splat` Experiment
I don't think we can use functional struct updates (`..default()`) to create `FnDecl` and `FnSig`, because we need the bit-packing for the `splat` experiment.
Bit-packing will avoid breaking "type is small" assertions for commonly used types when `splat` is added.
This PR packs these types:
- ExternAbi: enum + `unwind` variants (38) -> 6 bits
- ImplicitSelfKind: enum variants (5) -> 3 bits
- lifetime_elision_allowed, safety, c_variadic: bool -> 1 bit each
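As a rough sketch of the packing idea (the wrapper name, field layout, and bit positions below are illustrative assumptions, not the PR's actual encoding):
```rust
// Hedged sketch: pack the non-type parts of a signature into one small wrapper.
#[derive(Clone, Copy)]
struct FnSigFlags(u16);

impl FnSigFlags {
    const SAFETY_BIT: u16 = 1 << 0; // safety: 1 bit
    const C_VARIADIC_BIT: u16 = 1 << 1; // c_variadic: 1 bit
    const ABI_SHIFT: u32 = 2; // ExternAbi index: 6 bits
    const ABI_MASK: u16 = 0b11_1111;

    fn new(is_safe: bool, c_variadic: bool, abi_index: u8) -> Self {
        let mut bits = 0u16;
        if is_safe {
            bits |= Self::SAFETY_BIT;
        }
        if c_variadic {
            bits |= Self::C_VARIADIC_BIT;
        }
        bits |= (u16::from(abi_index) & Self::ABI_MASK) << Self::ABI_SHIFT;
        FnSigFlags(bits)
    }

    fn c_variadic(self) -> bool {
        self.0 & Self::C_VARIADIC_BIT != 0
    }

    fn abi_index(self) -> u8 {
        ((self.0 >> Self::ABI_SHIFT) & Self::ABI_MASK) as u8
    }
}
```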
#### Minor Changes
Fixes some typos, and applies rustfmt to clippy files that got skipped somehow.
Most diagnostic types are only used within their own crate, and so have
a `pub(crate)` visibility. We have some diagnostic types that are
unnecessarily `pub`. This is bad because (a) it works against information
hiding, and (b) if a `pub(crate)` type becomes unused the compiler will
warn, but it won't warn for a `pub` type.
This commit eliminates unnecessary `pub` visibilities for some
diagnostic types, and also some related things due to knock-on effects.
(I found these types with some ad hoc use of `grep`.)
preserve SIMD element type information
Preserve the SIMD element type and provide it to LLVM for better optimization.
This is relevant for AArch64 types like `int16x4x2_t`, see also https://github.com/llvm/llvm-project/issues/181514. Such types are defined like so:
```rust
#[repr(simd)]
struct int16x4_t([i16; 4]);
#[repr(C)]
struct int16x4x2_t(pub int16x4_t, pub int16x4_t);
```
Previously this would be translated to the opaque `[2 x <8 x i8>]`, with this PR it is instead `[2 x <4 x i16>]`. That change is not relevant for the ABI, but using the correct type prevents bitcasts that can (indeed, do) confuse the LLVM pattern matcher.
This change will make it possible to implement the deinterleaving loads on AArch64 in a portable way (without neon-specific intrinsics), which means that e.g. Miri or the cranelift backend can run them without additional support.
discussion at [#t-compiler > loss of vector element type information](https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/loss.20of.20vector.20element.20type.20information/with/584483611)
Start using pattern types in libcore
cc rust-lang/rust#135996
Replaces the innards of `NonNull` with `*const T is !null`.
This does affect LLVM's optimizations, as now reading the field preserves the metadata that the field is not null, and transmuting to another type (e.g. just a raw pointer), will also preserve that information for optimizations. This can cause LLVM opts to do more work, but it's not guaranteed to produce better machine code.
Once we also remove all uses of `rustc_layout_scalar_valid_range_start` from rustc itself, we can remove support for that attribute entirely and handle all such needs via pattern types.
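For illustration, a hedged sketch of the shape of the change; the feature gates and the `pattern_type!` macro path are assumptions, and this is not the actual libcore source:
```rust
#![feature(pattern_types, pattern_type_macro)]

use core::pat::pattern_type;

pub struct NonNull<T: ?Sized> {
    // Before: a plain `*const T` field, with the non-null niche declared via
    // the rustc_layout_scalar_valid_range_start attribute on the struct.
    // After: the niche is part of the field's type itself.
    pointer: pattern_type!(*const T is !null),
}
```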
Previously `sve_cast`'s implementation was abstracted to power both
`sve_cast` and `simd_cast` which supported scalable and non-scalable
vectors respectively. In anticipation of having to do this for another
`simd_*` intrinsic, `sve_cast` is removed and `simd_cast` is changed to
accept both scalable and non-scalable vectors, an approach that will
scale better to the other intrinsics.
Invert dependency between `rustc_errors` and `rustc_abi`.
Currently, `rustc_errors` depends on `rustc_abi`, which depends on `rustc_error_messages`. This is a bit odd.
`rustc_errors` depends on `rustc_abi` for a single reason: `rustc_abi` defines a type `TargetDataLayoutErrors` and `rustc_errors` impls `Diagnostic` for that type.
We can get a more natural relationship by inverting the dependency, moving the `Diagnostic` trait upstream. Then `rustc_abi` defines `TargetDataLayoutErrors` and also impls `Diagnostic` for it. `rustc_errors` is already pretty far upstream in the crate graph; it doesn't hurt to push it a little further, because errors are a very low-level concept.
r? @davidtwco
hwaddress: automatically add `-Ctarget-feature=+tagged-globals`
Note that since HWAddressSanitizer is/should be a target modifier, we do not have to worry about whether this LLVM target feature changes the ABI.
Fixes: rust-lang/rust#148185
test `#[naked]` with `#[link_section = "..."]` on windows
As a part of https://github.com/rust-lang/rust/pull/147811 I ran into the fact that we don't actually match (current) LLVM output.
r? @mati865
Because the things in this module aren't MIR and don't use anything
from `rustc_middle::mir`. Also, modules that use `mono` often don't use
anything else from `rustc_middle::mir`.
Rollup of 9 pull requests
Successful merges:
- rust-lang/rust#153440 (Various LTO cleanups)
- rust-lang/rust#151899 (Constify fold, reduce and last for iterator)
- rust-lang/rust#154561 (Suggest similar keyword when visibility is not followed by an item)
- rust-lang/rust#154657 (Fix pattern assignment suggestions for uninitialized bindings)
- rust-lang/rust#154717 (Fix ICE in unsafe binder discriminant helpers)
- rust-lang/rust#154722 (fix(lints): Improve `ill_formed_attribute_input` with better help message)
- rust-lang/rust#154777 (`#[cfg]`: suggest alternative `target_` name when the value does not match)
- rust-lang/rust#154849 (Promote `char::is_case_ignorable` from perma-unstable to unstable)
- rust-lang/rust#154850 (ast_validation: scalable vectors okay for rustdoc)
Various LTO cleanups
* Move some special casing of thin local LTO into a single location.
* Move lto_import_only_modules handling for fat LTO earlier. There is no reason to keep it separate until right before passing the LTO modules to the codegen backend. For thin LTO this introduces `ThinLtoInput` to correctly handle incr comp caching.
* Remove the `Linker` type from cg_llvm. It previously helped deduplicate code for `-Zcombine-cgus`, but that flag no longer exists.
Part of https://github.com/rust-lang/compiler-team/issues/908
Remove `Clone` impl for `StableHashingContext`.
`HashStable::hash_stable` takes a `&mut Hcx`. In contrast, `ToStableHashKey::to_stable_hash_key` takes a `&Hcx`. But there are some places where the latter calls the former, and due to the mismatch a `clone` call is required to get a mutable `StableHashingContext`.
This commit changes `to_stable_hash_key` to instead take a `&mut Hcx`. This eliminates the mismatch, the need for the clones, and the need for the `Clone` impls.
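Schematically, a simplified sketch of the signature change; the real trait (in `rustc_data_structures::stable_hasher`) has more bounds:
```rust
trait ToStableHashKey<Hcx> {
    type KeyType;

    // Before: fn to_stable_hash_key(&self, hcx: &Hcx) -> Self::KeyType;
    // After: taking &mut Hcx matches HashStable::hash_stable, so callers no
    // longer need to clone the StableHashingContext to get a mutable one.
    fn to_stable_hash_key(&self, hcx: &mut Hcx) -> Self::KeyType;
}
```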
r? @fee1-dead
Instead of just using regular struct lowering for these types, which
results in an incorrect ABI (e.g. returning indirectly), use
`BackendRepr::ScalableVector` which will lower to the correct type and
be passed in registers.
This also enables some simplifications for generating allocas of scalable
vectors and greater re-use of `scalable_vector_parts`.
An LLVM codegen test demonstrating the changed IR this generates is
included in the next commit alongside some intrinsics that make these
tuples usable.
Use `Hcx`/`hcx` consistently for `StableHashingContext`.
The `HashStable` and `ToStableHashKey` traits both have a type parameter that is sometimes called `CTX` and sometimes called `HCX`. (In practice this type parameter is always instantiated as `StableHashingContext`.) Similarly, variables with these types are sometimes called `ctx` and sometimes called `hcx`. This inconsistency has bugged me for some time.
The `HCX`/`hcx` form is more informative (the `H`/`h` indicates what type of context it is) and it matches other cases like `tcx`, `dcx`, `icx`.
Also, RFC 430 says that type parameters should have names that are "concise UpperCamelCase, usually single uppercase letter: T". In this case `H` feels insufficient, and `Hcx` feels better.
Therefore, this commit changes the code to use `Hcx`/`hcx` everywhere.
r? @petrochenkov
Error enum names should not be plural. Even though there are multiple
possible errors, each instance of an error enum describes a single
error. There are dozens of singular error enum names, and only two
plural error enum names. This commit makes them both singular.
Merge `fabsf16/32/64/128` into `fabs::<F>`
Following [a small conversation on Zulip](https://rust-lang.zulipchat.com/#narrow/channel/131828-t-compiler/topic/Float.20intrinsics/with/521501401) (and because I'd be interested in starting to contribute to Rust), I thought I'd give a try at merging the float intrinsics :)
This PR just merges `fabsf16`, `fabsf32`, `fabsf64`, `fabsf128`, as it felt like an easy first target.
Notes:
- I'm opening the PR for one intrinsic as it's probably easier if the shift is done one intrinsic at a time, but let me know if you'd rather I do several at a time to reduce the number of PRs.
- Currently this PR increases LoC, despite being an attempt at simplifying the intrinsics/compiler. I believe this increase is a one-time thing, as I had to define new functions and move some things around; hopefully future PRs/commits will reduce the overall LoC.
- `fabsf32` and `fabsf64` are `#[rustc_intrinsic_const_stable_indirect]`, while `fabsf16` and `fabsf128` aren't; because `f32`/`f64` expect the function to be const, the generic version must be made indirectly stable too. We'd need to check with T-lang that this change is ok; the only other intrinsics with such a mismatch are `minnum`, `maxnum`, and `copysign`.
- I haven't touched libm because I'm not familiar with how it works; any guidance would be welcome!
Pass -pg to linker when using -Zinstrument-mcount
This selects a slightly different crt on gnu targets, which enables the profiler within glibc.
This makes using gprof a little easier with Rust binaries. Otherwise, rustc must be passed `-Clink-args=-pg` to ensure the correct startup code is linked.