stabilizes `core::range::RangeFrom`
stabilizes `core::range::RangeFromIter`
add examples for `remainder` method on range iterators
`RangeFromIter::remainder` was not stabilized (see issue 154458)
Rename `into_range` to `try_into_slice_range`:
- Prepend `try_` to show that it returns `None` on error, like `try_range`
- Add `_slice` to make it consistent with `into_slice_range`
The checks for `self.end() == usize::MAX` and `self.end() + 1 > slice.len()`
can be replaced with the single check `self.end() >= slice.len()`:
if `self.end() < slice.len()`, then `self.end() + 1 <= slice.len()` and,
since `slice.len() <= usize::MAX`, also `self.end() < usize::MAX`, so neither
original condition holds; conversely, if `self.end() >= slice.len()`, then
either `self.end() == usize::MAX` or `self.end() + 1 > slice.len()` holds.
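A quick standalone check of this equivalence (illustrative only, not the library code; `end` and `len` stand in for `self.end()` and `slice.len()`):

```rust
fn old_check(end: usize, len: usize) -> bool {
    // Short-circuits before `end + 1` can overflow.
    end == usize::MAX || end + 1 > len
}

fn new_check(end: usize, len: usize) -> bool {
    end >= len
}

fn main() {
    // Both forms reject exactly the same (end, len) pairs.
    for &end in &[0, 1, 5, usize::MAX - 1, usize::MAX] {
        for &len in &[0, 1, 5, 6, usize::MAX] {
            assert_eq!(old_check(end, len), new_check(end, len));
        }
    }
}
```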
Align `ArrayWindows` trait impls with `Windows`
With `slice::ArrayWindows` getting ready to stabilize in 1.94, I noticed that it currently has some differences in trait implementations compared to `slice::Windows`, and I think we should align these.
- Remove `derive(Copy)` -- we generally don't want `Copy` for iterators at all, as this is seen as a footgun (e.g. rust-lang/rust#21809). This is obviously a breaking change though, so we should only remove this if we also backport the removal before it's stable. Otherwise, it should at least be replaced by a manual impl without requiring `T: Copy`.
- Manually `impl Clone`, simply to avoid requiring `T: Clone` (see the sketch after this list).
- `impl FusedIterator`, because it is trivially so. The `since = "1.94.0"` assumes we'll backport this; otherwise, we should change it to the "current" placeholder.
- `impl TrustedLen`, because we can trust our implementation.
- `impl TrustedRandomAccess`, because the required `__iterator_get_unchecked` method is straightforward.
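For illustration, a sketch of what these impls typically look like on a slice-backed iterator; `Lookalike` is a stand-in assuming `N > 0`, not the real `ArrayWindows`, and the internal unstable traits (`TrustedLen`, `TrustedRandomAccess`) are omitted:

```rust
use core::iter::FusedIterator;

struct Lookalike<'a, T, const N: usize> {
    slice: &'a [T],
}

// A derived `Clone` would add an unnecessary `T: Clone` bound; cloning the
// shared reference is always possible.
impl<'a, T, const N: usize> Clone for Lookalike<'a, T, N> {
    fn clone(&self) -> Self {
        Lookalike { slice: self.slice }
    }
}

impl<'a, T, const N: usize> Iterator for Lookalike<'a, T, N> {
    type Item = &'a [T; N];

    fn next(&mut self) -> Option<Self::Item> {
        // `first_chunk` returns `None` once fewer than `N` elements remain.
        let window = self.slice.first_chunk::<N>()?;
        self.slice = &self.slice[1..];
        Some(window)
    }
}

// Trivially fused: once fewer than `N` elements remain, `next` keeps
// returning `None` without touching the slice again.
impl<'a, T, const N: usize> FusedIterator for Lookalike<'a, T, N> {}
```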
r? libs-api
@rustbot label beta-nominated
(at least for the `Copy` removal, but we could be more selective about the rest).
Remove duplicated code in `slice/index.rs`
Looks like `const fn` support is far enough along now that we no longer need these two copies and can simply have one call the other.
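The general shape, with hypothetical helper names (not the actual `slice/index.rs` functions): once the unchecked variant can be a `const fn`, the checked variant can simply call it instead of duplicating the body:

```rust
const fn checked_get(slice: &[u8], i: usize) -> Option<&u8> {
    if i < slice.len() {
        // SAFETY: `i` was bounds-checked just above.
        Some(unsafe { unchecked_get(slice, i) })
    } else {
        None
    }
}

const unsafe fn unchecked_get(slice: &[u8], i: usize) -> &u8 {
    // SAFETY: the caller guarantees `i < slice.len()`.
    unsafe { &*slice.as_ptr().add(i) }
}
```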
Tweak `SlicePartialEq` to allow MIR-inlining the `compare_bytes` call
rust-lang/rust#150265 disabled this because doing so was a net perf win, but let's see if we can tweak the structure to allow more inlining on this side while still not MIR-inlining the loop when it's not just `memcmp`, hopefully preserving that perf win.
This should also allow MIR-inlining the length check, which was previously blocked, and thus might allow some obvious non-matches to optimize away as well.
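Roughly the shape being aimed for (illustrative only, not the actual `SlicePartialEq` code):

```rust
#[inline]
fn slices_eq(a: &[u8], b: &[u8]) -> bool {
    // Cheap length check in a small, MIR-inlinable wrapper: callers comparing
    // slices of obviously different lengths can optimize the whole call away.
    if a.len() != b.len() {
        return false;
    }
    // The bulky comparison lives in its own function so the MIR inliner never
    // has to pull the loop in. (In the real code the byte case is routed to
    // `compare_bytes`, i.e. `memcmp`, rather than a Rust loop.)
    eq_loop(a, b)
}

#[inline(never)]
fn eq_loop(a: &[u8], b: &[u8]) -> bool {
    a.iter().zip(b).all(|(x, y)| x == y)
}
```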
slice/ascii: Optimize `eq_ignore_ascii_case` with auto-vectorization
- Refactor the current functionality into a helper function
- Use `as_chunks` to encourage auto-vectorization in the optimized chunk processing function
- Add a codegen test checking for vectorization and no panicking
- Add benches for `eq_ignore_ascii_case`
---
The optimized function is initially only enabled for x86_64, which has `sse2` as part of its baseline, but none of the code is platform specific; other platforms with SIMD instructions may also benefit from this implementation.
Performance improvements only manifest for slices of 16 bytes or longer, so the optimized path is gated behind a check that the length is at least 16.
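A rough sketch of the idea (illustrative only, not the library implementation; the real code also applies the ≥ 16 byte gate described above):

```rust
fn eq_ignore_ascii_case_chunked(left: &[u8], right: &[u8]) -> bool {
    if left.len() != right.len() {
        return false;
    }
    // Fixed-size blocks from `as_chunks` give LLVM a vectorization-friendly
    // loop body; the sub-16-byte remainders fall back to the scalar path.
    let (left_chunks, left_rem) = left.as_chunks::<16>();
    let (right_chunks, right_rem) = right.as_chunks::<16>();
    let chunks_eq = left_chunks.iter().zip(right_chunks).all(|(a, b)| {
        // Accumulate with `&` instead of short-circuiting so the whole
        // 16-byte block is processed branchlessly.
        a.iter().zip(b).fold(true, |acc, (x, y)| {
            acc & (x.to_ascii_lowercase() == y.to_ascii_lowercase())
        })
    });
    chunks_eq && left_rem.eq_ignore_ascii_case(right_rem)
}
```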
Benchmarks: cases below 16 bytes are unaffected; all cases above show sizeable improvements.
```
before:
str::eq_ignore_ascii_case::bench_large_str_eq 4942.30ns/iter +/- 48.20
str::eq_ignore_ascii_case::bench_medium_str_eq 632.01ns/iter +/- 16.87
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 16.28ns/iter +/- 0.45
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 35.23ns/iter +/- 2.28
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.56ns/iter +/- 0.22
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.64ns/iter +/- 0.06
after:
str::eq_ignore_ascii_case::bench_large_str_eq 611.63ns/iter +/- 28.29
str::eq_ignore_ascii_case::bench_medium_str_eq 77.10ns/iter +/- 19.76
str::eq_ignore_ascii_case::bench_str_17_bytes_eq 3.49ns/iter +/- 0.39
str::eq_ignore_ascii_case::bench_str_31_bytes_eq 3.50ns/iter +/- 0.27
str::eq_ignore_ascii_case::bench_str_of_8_bytes_eq 7.27ns/iter +/- 0.09
str::eq_ignore_ascii_case::bench_str_under_8_bytes_eq 2.60ns/iter +/- 0.05
```
remove `#[deprecated]` from unstable & internal `SipHasher13` and `24` types
These types are unstable and `doc(hidden)` (under the internal feature `hashmap_internals`). Deprecating them only adds noise (`#[allow(deprecated)]`) to all places where they are used, so this PR removes the deprecation attributes from them.
It also includes a few other small cleanups in separate commits, including one I overlooked in rust-lang/rust#151228.
Fix `is_ascii` performance regression on AVX-512 CPUs when compiling with `-C target-cpu=native`
## Summary
This PR fixes a severe performance regression in `slice::is_ascii` on AVX-512 CPUs when compiling with `-C target-cpu=native`.
On affected systems, the current implementation achieves only ~3 GB/s for large inputs, compared to ~60–70 GB/s previously (≈20–24× regression). This PR restores the original performance characteristics.
This change is intended as a **temporary workaround** for poor upstream LLVM codegen. Once the underlying LLVM issue is fixed and Rust is able to consume that fix, this workaround should be reverted.
## Problem
When `is_ascii` is compiled with AVX-512 enabled, LLVM's auto-vectorization generates ~31 `kshiftrd` instructions to extract mask bits one-by-one, instead of using the efficient `pmovmskb`
instruction. This causes a **~22x performance regression**.
Because `is_ascii` is marked `#[inline]`, it gets inlined and recompiled with the user's target settings, affecting anyone using `-C target-cpu=native` on AVX-512 CPUs.
## Root cause (upstream)
The underlying issue appears to be an LLVM vectorizer/backend bug affecting certain AVX-512 patterns.
An upstream issue has been filed by @folkertdev to track the root cause: llvm/llvm-project#176906
Until this is resolved in LLVM and picked up by rustc, this PR avoids triggering the problematic codegen pattern.
## Solution
Replace the counting loop with explicit SSE2 intrinsics (`_mm_movemask_epi8`) that force `pmovmskb` codegen regardless of CPU features.
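Sketched, the core of the fix looks like this (simplified; the real implementation also handles the head/tail, chunk OR-ing with `_mm_or_si128`, and the short-input fallback):

```rust
#[cfg(all(target_arch = "x86_64", target_feature = "sse2"))]
fn chunk_is_ascii_sse2(chunk: &[u8; 16]) -> bool {
    use core::arch::x86_64::{__m128i, _mm_loadu_si128, _mm_movemask_epi8};
    // SAFETY: the pointer is valid for 16 bytes and `_mm_loadu_si128` permits
    // unaligned loads; SSE2 is part of the x86_64 baseline.
    let mask = unsafe {
        let v = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
        // Collects the high bit of every byte into an i32; this compiles to
        // `pmovmskb`/`vpmovmskb` regardless of wider features like AVX-512.
        _mm_movemask_epi8(v)
    };
    // All-ASCII iff no byte has its high bit set.
    mask == 0
}
```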
## Godbolt Links (Rust 1.92)
| Pattern | Target | Link | Result |
|---------|--------|------|--------|
| Counting loop (old) | Default SSE2 | https://godbolt.org/z/sE86xz4fY | `pmovmskb` |
| Counting loop (old) | AVX-512 (znver4) | https://godbolt.org/z/b3jvMhGd3 | 31x `kshiftrd` (broken) |
| SSE2 intrinsics (fix) | Default SSE2 | https://godbolt.org/z/hMeGfeaPv | `pmovmskb` |
| SSE2 intrinsics (fix) | AVX-512 (znver4) | https://godbolt.org/z/Tdvdqjohn | `vpmovmskb` (fixed) |
## Benchmark Results
**CPU:** AMD Ryzen 5 7500F (Zen 4 with AVX-512)
### Default Target (SSE2) — Mixed
| Size | Before | After | Change |
|------|--------|-------|--------|
| 4 B | 1.8 GB/s | 2.0 GB/s | **+11%** |
| 8 B | 3.2 GB/s | 5.8 GB/s | **+81%** |
| 16 B | 5.3 GB/s | 8.5 GB/s | **+60%** |
| 32 B | 17.7 GB/s | 15.8 GB/s | -11% |
| 64 B | 28.6 GB/s | 25.1 GB/s | -12% |
| 256 B | 51.5 GB/s | 48.6 GB/s | ~same |
| 1 KB | 64.9 GB/s | 60.7 GB/s | ~same |
| 4 KB+ | ~68-70 GB/s | ~68-72 GB/s | ~same |
### Native Target (AVX-512) — Up to 24x Faster
| Size | Before | After | Speedup |
|------|--------|-------|---------|
| 4 B | 1.2 GB/s | 2.0 GB/s | **1.7x** |
| 8 B | 1.6 GB/s | 5.0 GB/s | **3.3x** |
| 16 B | ~7 GB/s | ~7 GB/s | ~same |
| 32 B | 2.9 GB/s | 14.2 GB/s | **4.9x** |
| 64 B | 2.9 GB/s | 23.2 GB/s | **8x** |
| 256 B | 2.9 GB/s | 47.2 GB/s | **16x** |
| 1 KB | 2.8 GB/s | 60.0 GB/s | **21x** |
| 4 KB+ | 2.9 GB/s | ~68-70 GB/s | **23-24x** |
### Summary
- **SSE2 (default):** Small inputs (4-16 B) 11-81% faster; 32-64 B ~11% slower; large inputs unchanged
- **AVX-512 (native):** 21-24x faster for inputs ≥1 KB, peak ~70 GB/s (was ~3 GB/s)
Note: this is the pure ASCII path, but the story is similar for the other cases.
See the linked bench project.
## Test Plan
- [x] Assembly test (`slice-is-ascii-avx512.rs`) verifies no `kshiftrd` with AVX-512
- [x] Existing codegen test updated to `loongarch64`-only (auto-vectorization still used there)
- [x] Fuzz testing confirms old/new implementations produce identical results (~53M iterations)
- [x] Benchmarks confirm performance improvement
- [x] Tidy checks pass
## Reproduction / Test Projects
Standalone validation tools: https://github.com/bonega/is-ascii-fix-validation
- `bench/` - Criterion benchmarks for SSE2 vs AVX-512 comparison
- `fuzz/` - Compares old/new implementations with libfuzzer
## Related Issues
- issue opened by @folkertdev llvm/llvm-project#176906
- Regression introduced in https://github.com/rust-lang/rust/pull/130733
Remove the `#[target_feature(enable = "sse2")]` attribute and make the
function safe to call. The SSE2 requirement is already enforced by the
`#[cfg(target_feature = "sse2")]` predicate.
Individual unsafe blocks are used for intrinsic calls with appropriate
SAFETY comments.
Also add a FIXME referencing llvm#176906 to track when this
workaround can be removed.
For inputs smaller than 32 bytes, use usize-at-a-time processing instead of
calling the SSE2 function. This avoids the function-call overhead from
`#[target_feature(enable = "sse2")]`, which prevents inlining.
Also move `CHUNK_SIZE` to module level so it can be shared between
`is_ascii` and `is_ascii_sse2`.
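A hedged sketch of the usize-at-a-time idea for the short-input path (illustrative only; the real code handles chunking and alignment differently):

```rust
use core::mem::size_of;

fn is_ascii_usize_at_a_time(bytes: &[u8]) -> bool {
    // 0x80 in every byte lane: a word is all-ASCII iff none of these high
    // bits are set.
    const MASK: usize = usize::from_ne_bytes([0x80; size_of::<usize>()]);
    let mut words = bytes.chunks_exact(size_of::<usize>());
    let body_ok = words
        .by_ref()
        .all(|w| usize::from_ne_bytes(w.try_into().unwrap()) & MASK == 0);
    // The leftover tail (fewer than `size_of::<usize>()` bytes) goes
    // byte-by-byte.
    body_ok && words.remainder().iter().all(u8::is_ascii)
}
```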
When `[u8]::is_ascii()` is compiled with `-C target-cpu=native` on
AVX-512 CPUs, LLVM generates inefficient code. Because `is_ascii` is
marked `#[inline]`, it gets inlined and recompiled with the user's
target settings. The previous implementation used a counting loop that
LLVM auto-vectorizes to `pmovmskb` on SSE2, but with AVX-512 enabled,
LLVM uses k-registers and extracts bits individually with ~31
`kshiftrd` instructions.
This fix replaces the counting loop with explicit SSE2 intrinsics
(`_mm_loadu_si128`, `_mm_or_si128`, `_mm_movemask_epi8`) for x86_64.
`_mm_movemask_epi8` compiles to `pmovmskb`, forcing efficient codegen
regardless of CPU features.
Benchmark results on AMD Ryzen 5 7500F (Zen 4 with AVX-512):
- Default build: ~73 GB/s → ~74 GB/s (no regression)
- With -C target-cpu=native: ~3 GB/s → ~67 GB/s (22x improvement)
The loongarch64 implementation retains the original counting loop
since it doesn't have this issue.
Regression from: https://github.com/rust-lang/rust/pull/130733
Expand `str_as_str` to more types
Tracking issue: rust-lang/rust#130366
ACP: https://github.com/rust-lang/libs-team/issues/643
This PR expands the `str_as_str` feature and adds analogous methods to more types, namely:
- `&CStr`
- `&[T]`, `&mut [T]`
- `&OsStr`
- `&Path`
- `&ByteStr`, `&mut ByteStr` (tracking issue: rust-lang/rust#134915) (technically not part of the ACP)
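For context, a nightly-only sketch of the pattern being extended, using the existing `str::as_str` (the method names added for the other types follow the ACP, not this sketch):

```rust
#![feature(str_as_str)]

fn takes_str(_: &str) {}

fn main() {
    let boxed: Box<str> = "hello".into();
    // `as_str` is an identity method on `str` itself, so it is reachable
    // through any `Deref<Target = str>` type without writing `&*boxed`.
    takes_str(boxed.as_str());
}
```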