Commit Graph

305 Commits

bors db3e99bbab Auto merge of #150605 - RalfJung:fallback-intrinsic-skip, r=mati865
skip codegen for intrinsics with big fallback bodies if backend does not need them

This hopefully fixes the perf regression from https://github.com/rust-lang/rust/pull/148478. I only added the intrinsics with big fallback bodies to the list; it doesn't seem worth the effort of going through the entire list.

Fixes https://github.com/rust-lang/rust/issues/149945
Cc @scottmcm @bjorn3
2026-02-04 17:12:58 +00:00
Jonathan Brouwer 9fd5712bf5 Rollup merge of #151526 - ZuseZ4:fix-autodiff-codegen-tests, r=oli-obk
Fix autodiff codegen tests

Preparing autodiff for release on nightly. Since we haven't been running these tests in CI, they regressed over the last months. These changes fix this and hopefully make the tests more robust for the future.

r? compiler
2026-02-04 14:39:19 +01:00
Jacob Pratt e2c5b89d2a Rollup merge of #151958 - chahar-ritik:add-slp-vectorization-test, r=jieyouxu
Add codegen test for SLP vectorization

Closes rust-lang/rust#142519

This PR adds a codegen regression test for rust-lang/rust#142519, a regression in LLVM that caused auto-vectorization to fail, leading to significant performance loss.

- The SLP vectorizer correctly groups the 4-byte operations into `<4 x i8>` vectors.
- The loop state is maintained in SIMD registers (`phi <4 x i8>`).
- The test remains robust across architectures (AArch64 vs x86_64) by allowing flexible store types (`i32` or `<4 x i8>`).
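
A minimal sketch of the kind of source pattern and FileCheck assertions such a test relies on (hypothetical function and CHECK lines, not the actual test file):

```rust
// Hypothetical illustration: four independent byte operations that the
// SLP vectorizer should merge into a single <4 x i8> operation.
#[no_mangle]
pub fn add4(a: [u8; 4], b: [u8; 4]) -> [u8; 4] {
    // CHECK-LABEL: @add4
    // CHECK: add <4 x i8>
    [
        a[0].wrapping_add(b[0]),
        a[1].wrapping_add(b[1]),
        a[2].wrapping_add(b[2]),
        a[3].wrapping_add(b[3]),
    ]
}
```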
2026-02-02 23:12:05 -05:00
ltdk 28feae0c87 Move bigint helper tracking issues 2026-02-02 18:45:26 -05:00
Ritik Chahar 8476e893e7 Update min-llvm-version: 22
Co-authored-by: Nikita Popov <github@npopov.com>
2026-02-02 16:47:09 +05:30
ritik chahar 6176945223 fix: remove space for tidy and only for x86_64 2026-02-02 16:05:08 +05:30
ritik chahar 0830a5a928 fix: add min-llvm-version 2026-02-02 15:44:50 +05:30
ritik chahar 95ac5673ce Fix SLP vectorization test CHECK patterns 2026-02-02 15:38:26 +05:30
ritik chahar c64f9a0fc4 Add backlink to issue 2026-02-02 07:38:14 +05:30
ritik chahar 1c396d24dd Restrict test to x86_64 per reviewer feedback 2026-02-01 22:14:13 +05:30
ritik chahar 0a60bd653d fix: remove trailing newline for tidy 2026-02-01 22:09:05 +05:30
ritik chahar 2292d53b7b Add codegen test for SLP vectorization 2026-02-01 21:41:43 +05:30
Nikita Popov acb5ee2f84 Disable append-elements.rs test with debug assertions
The IR is a bit different (in particular wrt naming) if
debug-assertions-std is enabled. Peculiarly, the issue goes away
if overflow-check-std is also enabled, which is why CI did not
catch this.
2026-01-30 13:01:22 +01:00
Stuart Cook 3d102a7812 Rollup merge of #150893 - ZuseZ4:move-un-register-lib, r=oli-obk
offload: move (un)register lib into global_ctors

Right now we initialize the openmp/offload runtime before every single offload call, and tear it down directly afterwards.
What we should rather do is initialize it once in the binary startup code, and tear it down at the end of the binary execution. Here I implement these changes.

As a result, our generated IR uses far fewer globals, which in turn simplifies the refactoring in https://github.com/rust-lang/rust/pull/150683, where I introduce a new variant of our offload intrinsic.
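
As a sketch of the general idea (not the PR's code): entries in `llvm.global_ctors` lower to `.init_array` on ELF targets, which in Rust terms looks roughly like registering a constructor once at startup:

```rust
// Hypothetical sketch: register the offload runtime once at program
// startup instead of around every offload call. `register_offload_lib`
// is an assumed name; the section attribute uses the edition-2024
// spelling and is ELF-specific.
unsafe extern "C" fn register_offload_lib() {
    // one-time runtime initialization would go here
}

#[used]
#[unsafe(link_section = ".init_array")]
static REGISTER: unsafe extern "C" fn() = register_offload_lib;
```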

r? oli-obk
2026-01-28 19:03:51 +11:00
Manuel Drehwald 35ce8ab120 adjust testcase for new logic 2026-01-27 10:43:21 -08:00
Stuart Cook 1c892e829c Rollup merge of #147436 - okaneco:eq_ignore_ascii_autovec, r=scottmcm
slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization

- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test checking for vectorization and no panicking
- Add benches for `eq_ignore_ascii_case`

---

The optimized function is initially only enabled for x86_64, which has `sse2` as part of its baseline, but none of the code is platform-specific. Other platforms with SIMD instructions may also benefit from this implementation.

Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a check that the length is at least 16.
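
A minimal sketch of the chunked shape that invites auto-vectorization (illustrative; the chunk width and helper name are assumptions, not the PR's exact code):

```rust
// Illustrative: process 16-byte chunks with a branch-free accumulator
// so LLVM can auto-vectorize the inner loop.
fn chunks_eq_ignore_ascii_case(a: &[u8], b: &[u8]) -> bool {
    debug_assert_eq!(a.len(), b.len());
    let (a_chunks, a_rest) = a.as_chunks::<16>();
    let (b_chunks, b_rest) = b.as_chunks::<16>();
    let mut eq = true;
    for (x, y) in a_chunks.iter().zip(b_chunks) {
        for i in 0..16 {
            // `&=` avoids early exits that would block vectorization
            eq &= x[i].to_ascii_lowercase() == y[i].to_ascii_lowercase();
        }
    }
    eq && a_rest.eq_ignore_ascii_case(b_rest)
}
```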

Benchmarks: cases below 16 bytes are unaffected; all cases above show sizeable improvements.
```
before:
    str::eq_ignore_ascii_case::bench_large_str_eq         4942.30ns/iter +/- 48.20
    str::eq_ignore_ascii_case::bench_medium_str_eq         632.01ns/iter +/- 16.87
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        16.28ns/iter  +/- 0.45
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        35.23ns/iter  +/- 2.28
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq       7.56ns/iter  +/- 0.22
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq    2.64ns/iter  +/- 0.06
after:
    str::eq_ignore_ascii_case::bench_large_str_eq         611.63ns/iter +/- 28.29
    str::eq_ignore_ascii_case::bench_medium_str_eq         77.10ns/iter +/- 19.76
    str::eq_ignore_ascii_case::bench_str_17_bytes_eq        3.49ns/iter  +/- 0.39
    str::eq_ignore_ascii_case::bench_str_31_bytes_eq        3.50ns/iter  +/- 0.27
    str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq      7.27ns/iter  +/- 0.09
    str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq   2.60ns/iter  +/- 0.05
```
2026-01-27 17:36:35 +11:00
Jonathan Pallant 6ecb3f33f0 Adds two new Tier 3 targets - aarch64v8r-unknown-none and aarch64v8r-unknown-none-softfloat.
The existing `aarch64-unknown-none` target assumes Armv8.0-A as a baseline. However, Arm recently released the Arm Cortex-R82 processor, which is the first to implement the Armv8-R AArch64 mode architecture. This architecture is similar to Armv8-A AArch64; however, it has a different set of mandatory features and is based on Armv8.4. It is largely unrelated to the existing Armv8-R architecture target (`armv8r-none-eabihf`), which only operates in AArch32 mode.

The second target, `aarch64v8r-unknown-none-softfloat`, allows for possible Armv8-R AArch64 CPUs with no FPU, or for use-cases where FPU register stacking is not desired. As with the existing `aarch64-unknown-none` target, we have coupled FPU support and Neon support together; there is no 'has FPU but does not have NEON' target proposed, even though the architecture technically allows for it.

This PR was developed by Ferrous Systems on behalf of Arm. Arm is the owner of these changes.
2026-01-26 12:43:52 +00:00
bors 873d4682c7 Auto merge of #151337 - the8472:bail-before-memcpy2, r=Mark-Simulacrum
optimize `vec.extend(slice.to_vec())`, take 2

Redoing https://github.com/rust-lang/rust/pull/130998
It was reverted in https://github.com/rust-lang/rust/pull/151150 due to flakiness. I have traced this to layout randomization perturbing the test (the failure reproduces locally with layout randomization enabled); the test now opts out of randomization.
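
The pattern in question, for reference (illustrative):

```rust
// Extending a Vec with a freshly allocated copy of a slice. The
// optimization aims to let LLVM see through the temporary `to_vec()`
// allocation and lower this to a straightforward append.
fn append(v: &mut Vec<u8>, s: &[u8]) {
    v.extend(s.to_vec());
}
```

`v.extend_from_slice(s)` is of course the direct spelling; the point of the optimization is making the indirect form comparably cheap.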
2026-01-25 19:45:35 +00:00
Matthias Krüger 0de96f455d Rollup merge of #151405 - heiher:fix-cli, r=Mark-Simulacrum
LoongArch: Fix call-llvm-intrinsics test
2026-01-25 16:27:23 +01:00
Matthias Krüger f6a8326a99 Rollup merge of #151404 - heiher:fix-dae, r=Mark-Simulacrum
LoongArch: Fix direct-access-external-data test

On LoongArch targets, `-Cdirect-access-external-data` defaults to `no`. Since copy relocations are not supported, `dso_local` is not emitted under `-Crelocation-model=static`, unlike on other targets.
2026-01-25 16:27:22 +01:00
Matthias Krüger 9dffb21112 Rollup merge of #150065 - is57primenumber:add-slice-cse-test, r=Mark-Simulacrum
add CSE optimization tests for iterating over slice

This PR adds a regression test for issue rust-lang/rust#119573, verifying that a critical optimization, Common Subexpression Elimination (CSE), is correctly applied to various slice iteration patterns.
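
A hedged illustration of the kind of pattern where CSE matters (not the actual test): repeated indexing into the same slice should share one bounds check and one load.

```rust
// Illustrative: the two `s[i]` expressions should be evaluated once;
// CSE merges the duplicated bounds check and load.
pub fn sum_twice(s: &[u32]) -> u32 {
    let mut total = 0u32;
    for i in 0..s.len() {
        total = total.wrapping_add(s[i]).wrapping_add(s[i]);
    }
    total
}
```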
2026-01-25 07:42:59 +01:00
Matthias Krüger b651be2191 Rollup merge of #145393 - clubby789:issue-138497, r=Mark-Simulacrum
Add codegen test for removing trailing zeroes from `NonZero`

Closes rust-lang/rust#138497
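
A hedged guess at the pattern involved (assuming the test covers `NonZero::trailing_zeros`): because a `NonZero` value can never be zero, LLVM can emit a bare `cttz` without a zero-input special case.

```rust
use std::num::NonZeroU32;

// Since `x` is provably nonzero, `trailing_zeros` needs no zero check,
// and the shift amount is always in range.
pub fn strip_trailing_zeros(x: NonZeroU32) -> u32 {
    x.get() >> x.trailing_zeros()
}
```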
2026-01-25 07:42:56 +01:00
bors 75963ce795 Auto merge of #151065 - nagisa:add-preserve-none-abi, r=petrochenkov
abi: add a rust-preserve-none calling convention

This is the conceptual opposite of the rust-cold calling convention and is particularly useful in combination with the new `explicit_tail_calls` feature.

For relatively tight loops implemented with tail calls (`become`), each function with the regular calling convention is still responsible for restoring the initial values of the preserved registers. So it is not unusual to end up with a situation where each step in the tail-call loop is spilling and reloading registers, along the lines of:

    foo:
        push r12
        ; do things
        pop r12
        jmp next_step

This adds up quickly, especially when most of the clobberable registers are already used to pass arguments or for other purposes.

I was thinking of making the name of this ABI a little less LLVM-derived and more like a conceptual inverse of `rust-cold`, but could not come up with a great name (`rust-cold` is itself not a great name: cold in what context? from which perspective? is it supposed to mean that the function is rarely called?)
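
A minimal sketch of the kind of tail-call loop where this matters (requires the unstable `explicit_tail_calls` feature; illustrative, not from the PR):

```rust
#![feature(explicit_tail_calls)]

// Each `become` replaces the caller's frame; under a preserve-none
// convention there are no callee-saved registers to spill and reload
// at every step of the loop.
fn step(n: u64, acc: u64) -> u64 {
    if n == 0 {
        return acc;
    }
    become step(n - 1, acc + n);
}
```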
2026-01-25 02:49:32 +00:00
Matthias Krüger 3a69035338 Rollup merge of #151346 - folkertdev:simd-splat, r=workingjubilee
add `simd_splat` intrinsic

Add `simd_splat` which lowers to the LLVM canonical splat sequence.

```llvm
%v0 = insertelement <N x elem> poison, elem %x, i32 0
%splat = shufflevector <N x elem> %v0, <N x elem> poison, <N x i32> zeroinitializer
```

Right now we try to fake it using one of

```rust
fn splat(x: u32) -> u32x8 {
    u32x8::from_array([x; 8])
}
```

or (in `stdarch`)

```rust
fn splat(value: $elem_type) -> $name {
    #[derive(Copy, Clone)]
    #[repr(simd)]
    struct JustOne([$elem_type; 1]);
    let one = JustOne([value]);
    // SAFETY: 0 is always in-bounds because we're shuffling
    // a simd type with exactly one element.
    unsafe { simd_shuffle!(one, one, [0; $len]) }
}
```

Both of these can confuse the LLVM optimizer, producing sub-par code. Some examples:

- https://github.com/rust-lang/rust/issues/60637
- https://github.com/rust-lang/rust/issues/137407
- https://github.com/rust-lang/rust/issues/122623
- https://github.com/rust-lang/rust/issues/97804

---

As far as I can tell there is no way to provide a fallback implementation for this intrinsic, because there is no `const` way of evaluating the number of elements (there might be issues beyond that, too). So, I added implementations for all 4 backends.

Both GCC and const-eval appear to have some issues with simd vectors containing pointers. I have a workaround for GCC, but haven't yet been able to make const-eval work. See the comments below.

Currently this just adds the intrinsic, it does not actually use it anywhere yet.
2026-01-24 21:04:15 +01:00
Simonas Kazlauskas 6db94dbc25 abi: add a rust-preserve-none calling convention
This is the conceptual opposite of the rust-cold calling convention and
is particularly useful in combination with the new `explicit_tail_calls`
feature.

For relatively tight loops implemented with tail calls (`become`), each
function with the regular calling convention is still responsible for
restoring the initial values of the preserved registers. So it is not
unusual to end up with a situation where each step in the tail-call
loop is spilling and reloading registers, along the lines of:

    foo:
        push r12
        ; do things
        pop r12
        jmp next_step

This adds up quickly, especially when most of the clobberable registers
are already used to pass arguments or for other purposes.

I was thinking of making the name of this ABI a little less LLVM-derived
and more like a conceptual inverse of `rust-cold`, but could not come up
with a great name (`rust-cold` is itself not a great name: cold in what
context? from which perspective? is it supposed to mean that the
function is rarely called?)
2026-01-24 19:23:17 +02:00
Folkert de Vries 71f34429ac const-eval: do not call immediate_const_vector on vector of pointers 2026-01-24 10:40:47 +01:00
Manuel Drehwald b6d567c12c Shorten the autodiff batching test, to make it more reliable 2026-01-23 23:18:52 -08:00
Jonathan Brouwer 13f0399a57 Rollup merge of #151259 - bonega:fix-is-ascii-avx512, r=folkertdev
Fix is_ascii performance regression on AVX-512 CPUs when compiling with -C target-cpu=native

## Summary

This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`.

On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.

This change is intended as a **temporary workaround** for poor upstream LLVM codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.

## Problem

When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb` instruction. This causes a **~22x performance regression**.

Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.

## Root cause (upstream)

The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns.

An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906

Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.

## Solution

Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.
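
A minimal sketch of the intrinsic-based check (illustrative, not the exact std implementation): a 16-byte chunk is ASCII iff no byte has its high bit set, i.e. the movemask is zero.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};

// `_mm_movemask_epi8` compiles to `pmovmskb`, which collects the high
// bit of each byte; a nonzero mask means some byte is non-ASCII.
#[cfg(target_arch = "x86_64")]
unsafe fn chunk_is_ascii(ptr: *const u8) -> bool {
    // SAFETY: caller guarantees `ptr` is valid for a 16-byte read
    let v = unsafe { _mm_loadu_si128(ptr as *const __m128i) };
    _mm_movemask_epi8(v) == 0
}
```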

## Godbolt Links (Rust 1.92)

| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |

## Benchmark Results

**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)

### Default Target (SSE2) — Mixed

| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |

### Native Target (AVX-512) — Up to 24x Faster

| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |

### Summary

- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)

Note: this is the pure ascii path, but the story is similar for the others. See the linked bench project.

## Test Plan

- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass

## Reproduction / Test Projects

Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation

- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer

## Related Issues

- issue opened by @folkertdev: llvm/llvm-project#176906
- Regression introduced in https://github.com/rust-lang/rust/pull/130733
2026-01-24 08:18:05 +01:00
Manuel Drehwald 7bcc8a7053 update abi handling test 2026-01-23 21:54:04 -08:00
Manuel Drehwald d7877615b4 Update test after new mangling scheme, make test more robust 2026-01-23 20:37:58 -08:00
Manuel Drehwald c0c6e2166d make generic test invariant of function order 2026-01-23 19:58:29 -08:00
bors 9283d592de Auto merge of #151389 - scottmcm:vec-repeat, r=joboet
Use `repeat_packed` when calculating layouts in `RawVec`

Seeing whether this helps the icounts seen in https://github.com/rust-lang/rust/pull/148769#issuecomment-3769921666
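
For context, `Layout::repeat_packed` (unstable, behind `alloc_layout_extra`) computes the layout of `n` instances without inserting padding between them; since the size of a Rust type is already a multiple of its alignment, this matches `repeat` for `RawVec`'s purposes while doing less work. A hedged sketch of what it computes:

```rust
#![feature(alloc_layout_extra)]
use std::alloc::Layout;

fn main() {
    let elem = Layout::new::<u64>();
    // 8 elements of 8 bytes each, no inter-element padding
    let array = elem.repeat_packed(8).unwrap();
    assert_eq!((array.size(), array.align()), (64, 8));
}
```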
2026-01-23 07:24:11 +00:00
Andreas Liljeqvist c609cce8cf Merge is_ascii codegen tests using revisions
Combine the x86_64 and loongarch64 is_ascii tests into a single file
using compiletest revisions. Both now test assembly output:

- X86_64: Verifies no broken kshiftrd/kshiftrq instructions (AVX-512 fix)
- LA64: Verifies vmskltz.b instruction is used (auto-vectorization)
2026-01-22 22:18:00 +01:00
Matthew Maurer b639b0a4d8 llvm: Tolerate dead_on_return attribute changes
The attribute now has a size parameter and sorts differently:
* Explicitly omit size parameter during construction on 23+
* Tolerate alternate sorting in tests

https://github.com/llvm/llvm-project/pull/171712
2026-01-21 23:39:03 +00:00
Scott McMurray c3f309e32b Use repeat_packed when calculating layouts in RawVec 2026-01-21 01:11:12 -08:00
Jacob Pratt 43d2006c25 Rollup merge of #150436 - va-list-copy, r=workingjubilee,RalfJung
`c_variadic`: impl `va_copy` and `va_end` as Rust intrinsics

tracking issue: https://github.com/rust-lang/rust/issues/44930

Implement `va_copy` as (the Rust equivalent of) `memcpy`, which is the behavior of all current LLVM targets. By providing our own implementation, we can guarantee its behavior. These guarantees are important for implementing c-variadics in e.g. const-eval.

Discussed in [#t-compiler/const-eval > c-variadics in const-eval](https://rust-lang.zulipchat.com/#narrow/channel/146212-t-compiler.2Fconst-eval/topic/c-variadics.20in.20const-eval/with/565509704).

I've also updated the comment for `Drop` a bit. The background here is that the C standard requires that `va_end` is used in the same function (and really, in the same scope) as the corresponding `va_start` or `va_copy`. That is because historically `va_start` would start a scope, which `va_end` would then close. e.g.

https://softwarepreservation.computerhistory.org/c_plus_plus/cfront/release_3.0.3/source/incl-master/proto-headers/stdarg.sol

```c
#define         va_start(ap, parmN)     {\
        va_buf  _va;\
        _vastart(ap = (va_list)_va, (char *)&parmN + sizeof parmN)
#define         va_end(ap)      }
#define         va_arg(ap, mode)        *((mode *)_vaarg(ap, sizeof (mode)))
```

The C standard still has to consider such implementations, but for Rust they are irrelevant. Hence we can use `Clone` for `va_copy` and `Drop` for `va_end`.
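
A minimal sketch of what this means for users of the unstable `c_variadic` feature (illustrative; the function is hypothetical):

```rust
#![feature(c_variadic)]

// `clone` on the variadic list performs the va_copy; the implicit Drop
// at the end of the function performs the va_end for both lists.
pub unsafe extern "C" fn sum(n: usize, args: ...) -> i64 {
    let mut copy = args.clone(); // va_copy
    let mut total = 0i64;
    for _ in 0..n {
        // SAFETY: the caller passes at least `n` i64 arguments
        total += unsafe { copy.arg::<i64>() };
    }
    total
} // `args` and `copy` dropped here: va_end
```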
2026-01-20 19:46:29 -05:00
Jamie Hill-Daniel 76438f032a Add codegen test for issue 138497 2026-01-20 21:37:31 +00:00
Folkert de Vries dd9241d150 c_variadic: use Clone instead of LLVM va_copy 2026-01-20 18:38:50 +01:00
Nikita Popov 0be66603ac Avoid passing addrspacecast to lifetime intrinsics
Since LLVM 22 the alloca must be passed directly. Do this by
stripping the addrspacecast if it exists.
2026-01-20 14:47:04 +01:00
WANG Rui d977471ce2 LoongArch: Fix call-llvm-intrinsics test 2026-01-20 19:43:06 +08:00
WANG Rui e3f198ec05 LoongArch: Fix direct-access-external-data test
On LoongArch targets, `-Cdirect-access-external-data` defaults to `no`.
Since copy relocations are not supported, `dso_local` is not emitted
under `-Crelocation-model=static`, unlike on other targets.
2026-01-20 16:26:15 +08:00
Stuart Cook 1262ff906b Rollup merge of #150288 - offload-bench-fix, r=ZuseZ4
Add scalar support for offload

This PR adds scalar support to the offload feature. The scalar management has two main parts:

On the host side, each scalar arg is cast to an `ix` integer type, zero-extended to `i64`, and passed to the kernel in that form.
On the device, each scalar arg (an `i64` at that point) is truncated to `ix` and then cast back to the original type.
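
In Rust terms, the round trip looks roughly like this for a 32-bit float (an analogy, not the actual generated code):

```rust
fn main() {
    let x: f32 = 1.5;
    // host: bitcast the scalar to a same-width integer, zero-extend to i64
    let wide: u64 = x.to_bits() as u64;
    // device: truncate back to the original width, then bitcast back
    let back = f32::from_bits(wide as u32);
    assert_eq!(x, back);
}
```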

r? @ZuseZ4
2026-01-20 18:00:08 +11:00
Marcelo Domínguez 307a4fcdf8 Add scalar support for both host and device 2026-01-19 22:28:42 +01:00
Folkert de Vries 80c0b99de0 add simd_splat intrinsic 2026-01-19 16:48:28 +01:00
Jonathan Brouwer a56e2d3037 Rollup merge of #151071 - gen-openmp-metadata, r=nnethercote
Generate openmp metadata

LLVM has an openmp-opt pass, which is part of the default O3 pipeline.
The pass bails if we don't have a global called openmp, so let's generate it if people enable our experimental offload feature. openmp is a superset of the offload feature, so they share optimizations.
In follow-up PRs I'll start verifying that LLVM optimizes Rust the way we want it.

r? compiler
2026-01-19 08:31:31 +01:00
The 8472 2b8f4a562f avoid phi node for pointers flowing into Vec appends 2026-01-18 21:03:14 +01:00
Andreas Liljeqvist a0f9a15b4a Fix is_ascii performance regression on AVX-512 CPUs
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on
AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is
marked `#[inline]`, it gets inlined and recompiled with the user's
target settings. The previous implementation used a counting loop that
LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled,
LLVM uses k-registers and extracts bits individually with ~31
`kshiftrd` instructions.

This fix replaces the counting loop with explicit SSE2 intrinsics
(`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64.
`_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen
regardless of CPU features.

Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):
- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)

The loongarch64 implementation retains the original counting loop
since it doesn't have this issue.

Regression from: https://github.com/rust-lang/rust/pull/130733
2026-01-17 17:38:51 +01:00
Manuel Drehwald 5c85d522d0 Generate global openmp metadata to trigger llvm openmp-opt pass 2026-01-16 14:57:32 -05:00
Jacob Pratt 6912c676cd Rollup merge of #150607 - dispatch-ptr-intrinsic, r=workingjubilee
Add amdgpu_dispatch_ptr intrinsic

There is an ongoing discussion in rust-lang/rust#150452 about using address spaces from the Rust language in some way.
As that discussion will likely not conclude soon, this PR adds one rustc_intrinsic with an addrspacecast to unblock access to basic information like launch and workgroup size, and to make it possible to implement something like `core::gpu`.

Add a rustc intrinsic `amdgpu_dispatch_ptr` to access the kernel dispatch packet on amdgpu.
The HSA kernel dispatch packet contains important information like the launch size and workgroup size.

The Rust intrinsic lowers to the `llvm.amdgcn.dispatch.ptr` LLVM intrinsic, which returns a `ptr addrspace(4)`, plus an addrspacecast to `addrspace(0)`, so it can be returned as a Rust reference.
The returned pointer/reference is valid for the whole program lifetime, and is therefore `'static`.
The return type of the intrinsic (`&'static ()`) does not mention the struct so that rustc does not need to know the exact struct type. An alternative would be to define the struct as a lang item or add a generic argument to the function.
Is this ok or is there a better way (also, should it return a pointer instead of a reference)?

Short version:
```rust
#[cfg(target_arch = "amdgpu")]
pub fn amdgpu_dispatch_ptr() -> *const ();
```

Tracking issue: rust-lang/rust#135024
2026-01-15 19:35:46 -05:00
Jieyou Xu cd79ff2e2c Revert "avoid phi node for pointers flowing into Vec appends #130998"
This reverts PR <https://github.com/rust-lang/rust/pull/130998> because
the added test seems to be flaky / non-deterministic, and has been
failing in unrelated PRs during merge CI.
2026-01-15 09:37:16 +08:00