Commit Graph

61 Commits

Author SHA1 Message Date
Jules Bertholet 8d072616a5 Add APIs for dealing with titlecase
- `char::is_cased`
- `char::is_titlecase`
- `char::case`
- `char::to_titlecase`
2026-03-21 14:30:38 -04:00
Karl Meakin 902199b5b1 Compress case-mapping tables 2026-03-08 22:39:35 +00:00
Karl Meakin 95365cc5bf Make unicode_data tests normal
Instead of generating a standalone executable to test `unicode_data`,
generate normal tests in `coretests`. This ensures tests are always
generated, and will be run as part of the normal testsuite.

Also change the generated tests to loop over lookup tables, rather than
generating a separate `assert_eq!()` statement for every codepoint. The
old approach produced a massive (20,000 lines plus) file which took
minutes to compile!
2026-03-08 22:38:31 +00:00
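The table-driven test style described in "Make unicode_data tests normal" can be sketched roughly as follows. This is only an illustration, not the generated code: the table `LOWERCASE_SAMPLE` and the helper `first_mismatch` are hypothetical names, and the sample ranges are a tiny hand-picked subset.

```rust
/// Hypothetical sample table: inclusive ranges of code points that are
/// expected to be lowercase. Real generated tables are far larger.
const LOWERCASE_SAMPLE: &[(u32, u32)] = &[
    (0x61, 0x7A),   // 'a'..='z'
    (0xDF, 0xF6),   // 'ß'..='ö'
    (0x3B1, 0x3C9), // 'α'..='ω'
];

/// Loops over every code point in `table` and returns the first one for
/// which `pred` is false, instead of emitting one `assert_eq!` per code point.
fn first_mismatch(table: &[(u32, u32)], pred: fn(char) -> bool) -> Option<char> {
    for &(lo, hi) in table {
        for cp in lo..=hi {
            let c = char::from_u32(cp).expect("sample ranges contain no surrogates");
            if !pred(c) {
                return Some(c);
            }
        }
    }
    None
}
```

A test body then shrinks to a single call per table, for example `assert_eq!(first_mismatch(LOWERCASE_SAMPLE, char::is_lowercase), None)`.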
Marco A L Barbosa 262426d535 Avoid index check in char::to_lowercase and char::to_uppercase 2025-12-30 16:20:19 -03:00
Jieyou Xu 4aeb297064 Revert "unicode_data refactors RUST-147622"
This PR reverts RUST-147622 for several reasons:

1. The RUST-147622 PR would format the generated core library code using
   an arbitrary `rustfmt` picked up from `PATH`, which will cause
   hard-to-debug failures when the `rustfmt` used to format the
   generated unicode data code differs from the `rustfmt` used to
   format the in-tree library code.
2. Previously, the `unicode-table-generator` tests were not run under CI
   as part of `coretests`, and since for the `x86_64-gnu-aux` job we run
   library `coretests` with `miri`, the generated tests unfortunately
   caused an unacceptably large Merge CI time regression from ~2 hours
   to ~3.5 hours, making it the slowest Merge CI job (and thus the new
   bottleneck).
3. This PR also has an unintended effect of causing a diagnostic
   regression (RUST-148387), though that's mostly an edge case not
   properly handled by `rustc` diagnostics.

Given that these are three distinct causes with non-trivial fixes, I'm
proposing to revert this PR to return us to baseline. This is not
prejudice against relanding the changes with these issues addressed, but
to alleviate time pressure to address these non-trivial issues.
2025-11-03 19:53:11 +08:00
Karl Meakin 0e6131c9aa refactor: make unicode_data tests normal tests
Instead of generating a standalone executable to test `unicode_data`,
generate normal tests in `coretests`. This ensures tests are always
generated, and will be run as part of the normal testsuite.

Also change the generated tests to loop over lookup tables, rather than
generating a separate `assert_eq!()` statement for every codepoint. The
old approach produced a massive (20,000 lines plus) file which took
minutes to compile!
2025-10-31 14:12:17 +00:00
Karl Meakin 9a80731094 refactor: make string formatting more readable
To make the final output code easier to see:
* Get rid of the unnecessary line-noise of `.unwrap()`ing calls to
  `write!()` by moving the `.unwrap()` into a macro.
* Join consecutive `write!()` calls using a single multiline format
  string.
* Replace `.push()` and `.push_str(format!())` with `write!()`.
* If after doing all of the above, there is only a single `write!()`
  call in the function, just construct the string directly with
  `format!()`.
2025-10-31 14:12:14 +00:00
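A minimal sketch of the pattern from "make string formatting more readable", assuming a generator that writes Rust source into a `String`. The macro name `writeln_unwrap!` and the function `emit_table` are hypothetical and are not taken from `unicode-table-generator`.

```rust
use std::fmt::Write;

/// Like `writeln!`, but with the `.unwrap()` folded into the macro.
/// Writing into a `String` cannot fail, so the unwrap is only noise
/// at each call site.
macro_rules! writeln_unwrap {
    ($dst:expr, $($arg:tt)*) => {
        writeln!($dst, $($arg)*).unwrap()
    };
}

/// Emits a `static` table declaration as Rust source text.
fn emit_table(name: &str, values: &[u32]) -> String {
    let mut out = String::new();
    writeln_unwrap!(out, "static {name}: [u32; {}] = [", values.len());
    for v in values {
        writeln_unwrap!(out, "    {v:#x},");
    }
    writeln_unwrap!(out, "];");
    out
}
```

With the unwrap hidden, the shape of the emitted code is easier to read straight off the format strings.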
Karl Meakin c8ab4279a5 refactor: format unicode_data
Remove `#[rustfmt::skip]` from all the generated modules in
`unicode_data.rs`. This means we won't have to worry so much about
getting indentation and formatting right when generating code.

Exempted for now some tables which would be too big when formatted by
`rustfmt`.
2025-10-31 14:11:39 +00:00
Karl Meakin bf7b05c97b refactor: move runtime functions to core
Instead of `include_str!()`ing `range_search.rs`, just make it a normal
module under `core::unicode`. This means the same source code doesn't
have to be checked in twice, and it plays nicer with IDEs.

Also rename it to `rt` since it includes functions for searching the
bitsets and case conversion tables as well as the range
representation.
2025-10-31 14:11:35 +00:00
Marco Cavenati 83a99aadcf Bump unicode printable to version 17.0.0 2025-09-09 15:01:47 +02:00
Marco Cavenati 400e687598 Bump unicode_data to version 17.0.0 2025-09-09 09:03:52 +02:00
Karl Meakin a8c669461f optimization: Don't include ASCII characters in Unicode tables
The ASCII subset of Unicode is fixed and will never change, so we don't
need to generate tables for it with every new Unicode version. This
saves a few bytes of static data and speeds up `char::is_control` and
`char::is_grapheme_extended` on ASCII inputs.

Since the table lookup functions exported from the `unicode` module will
give nonsensical errors on ASCII input (and in fact will panic in debug
mode), I had to add some private wrapper methods to `char` which check
for ASCII-ness first.
2025-09-07 15:21:24 +02:00
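The private-wrapper idea from "Don't include ASCII characters in Unicode tables" might look roughly like this. Both function names are hypothetical, and the "table" is a single hard-coded sample range standing in for generated data.

```rust
/// Hypothetical stand-in for a generated table lookup that only covers
/// non-ASCII code points (sample range only, not real generated data).
fn lookup_non_ascii(c: char) -> bool {
    // Mirrors the commit message: calling this with ASCII input is a bug
    // and panics in debug builds.
    debug_assert!(!c.is_ascii(), "the table has no entries for ASCII");
    matches!(c as u32, 0x300..=0x36F)
}

/// Wrapper pattern: answer ASCII inputs directly and only consult the
/// table for non-ASCII code points.
fn is_grapheme_extended_sketch(c: char) -> bool {
    !c.is_ascii() && lookup_non_ascii(c)
}
```

Because `&&` short-circuits, ASCII inputs never reach the table lookup at all.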
Marijn Schouten 98e10290c9 change file-is-generated doc comment to inner 2025-09-05 10:08:45 +00:00
Stuart Cook 8b790d7b00 Rollup merge of #145414 - Kmeakin:km/unicode-table-refactors, r=joshtriplett,tgross35
unicode-table-generator refactors

Split off from https://github.com/rust-lang/rust/pull/145219
2025-09-03 23:08:06 +10:00
bors 0f50696801 Auto merge of #145479 - Kmeakin:km/hardcode-char-is-control, r=joboet
Hard-code `char::is_control`

Split off from https://github.com/rust-lang/rust/pull/145219

According to
https://www.unicode.org/policies/stability_policy.html#Property_Value, the set of codepoints in `Cc` will never change. So we can hard-code the patterns to match against instead of using a table.

This doesn't change the generated assembly, since the lookup table is small enough that [LLVM is able to inline the whole search](https://godbolt.org/z/bG8dM37YG). But this does reduce the chance of regressions if LLVM's heuristics change in the future, and means less generated Rust code checked in to `unicode_data.rs`.
2025-08-30 14:18:21 +00:00
Karl Meakin 1bb9b151c9 refactor: Hard-code char::is_control
According to
https://www.unicode.org/policies/stability_policy.html#Property_Value,
the set of codepoints in `Cc` will never change. So we can hard-code
the patterns to match against instead of using a table.
2025-08-16 01:46:30 +01:00
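For illustration, a hard-coded check for general category `Cc` as described in "Hard-code char::is_control" could be written as below. The function name is hypothetical and the actual code in `core` may be phrased differently. The ranges are the two blocks of control codes, U+0000..=U+001F and U+007F..=U+009F.

```rust
/// Sketch of a table-free `Cc` check: the C0 controls plus DEL, and the
/// C1 controls. The Unicode stability policy guarantees this set is fixed.
const fn is_control_hardcoded(c: char) -> bool {
    matches!(c, '\0'..='\x1f' | '\x7f'..='\u{9f}')
}
```

The sketch can be compared against the standard library's `char::is_control` over every valid `char`.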
Karl Meakin 69e1974bb0 refactor: Include size of case conversion tables
Include the sizes of the `to_lowercase` and `to_uppercase` tables in the
total size calculations.
2025-08-15 01:29:12 +00:00
Karl Meakin c992245361 refactor: Include table sizes in comment at top of unicode_data.rs
To make changes in table size obvious from git diffs
2025-08-15 01:29:12 +00:00
ltdk c855a2f4d6 Hide docs for core::unicode 2025-08-13 12:27:44 -04:00
yukang 93db9e7ee0 Remove unnecessary parens in closure body with unused lint 2025-07-10 09:25:56 +08:00
Markus Reiter 3628a8f326 Remove unneeded parentheses. 2025-03-08 12:56:00 +01:00
Markus Reiter 90ebc24607 Use intrinsics::assume instead of hint::assert_unchecked. 2025-03-07 20:19:12 +01:00
Markus Reiter 22725588d3 Never inline lookup_slow. 2025-03-07 20:17:52 +01:00
Markus Reiter 34ac75be28 Add second precondition for skip_search. 2025-03-06 21:38:39 +01:00
Markus Reiter 222adac953 Allow optimizing out panic_bounds_check in Unicode checks. 2025-03-06 21:38:39 +01:00
Urgau 8e61502484 core: add #![warn(unreachable_pub)] 2025-01-20 18:35:32 +01:00
Jakub Beránek 536516f949 Reformat Python code with ruff 2024-12-04 23:03:44 +01:00
Boxy 22998f0785 update cfgs 2024-11-27 15:14:54 +00:00
Ralf Jung eddab479fd stabilize const_unicode_case_lookup 2024-11-12 15:13:31 +01:00
bors cf2b370ad0 Auto merge of #132500 - RalfJung:char-is-whitespace-const, r=jhpratt
make char::is_whitespace unstably const

I am adding this to the existing https://github.com/rust-lang/rust/issues/132241 feature gate, since `is_digit` and `is_whitespace` seem similar enough that one can group them together.
2024-11-06 04:07:32 +00:00
Matthias Krüger b438a5cd2a Rollup merge of #132499 - RalfJung:unicode_data.rs, r=tgross35
unicode_data.rs: show command for generating file

https://github.com/rust-lang/rust/pull/131647 made this an easily runnable tool, now we just have to mention that in the comment. :)

Fixes https://github.com/rust-lang/rust/issues/131640.
2024-11-03 12:08:51 +01:00
Ralf Jung 0804815e69 make char::is_whitespace unstably const 2024-11-02 10:17:16 +01:00
Ralf Jung 720d618b5f unicode_data.rs: show command for generating file 2024-11-02 10:06:52 +01:00
Ralf Jung 66351a6184 get rid of a whole bunch of unnecessary rustc_const_unstable attributes 2024-11-02 09:59:55 +01:00
Ralf Jung 90e4f10f6c switch unicode-data back to 'static' 2024-10-13 11:53:06 +02:00
Matthias Krüger 4428d6f363 Rollup merge of #130101 - RalfJung:const-cleanup, r=fee1-dead
some const cleanup: remove unnecessary attributes, add const-hack indications

I learned that we use `FIXME(const-hack)` on top of the "const-hack" label. That seems much better since it marks the right place in the code and moves around with the code. So I went through the PRs with that label and added appropriate FIXMEs in the code. IMO this means we can then remove the label -- Cc ``@rust-lang/wg-const-eval.``

I also noticed some const stability attributes that don't do anything useful, and removed them.

r? ``@fee1-dead``
2024-09-12 19:03:41 +02:00
Marcondiro c8d9bd488a Bump unicode printable to version 16.0.0 2024-09-10 11:13:35 +02:00
Marcondiro bdda4ec2f5 Bump unicode_data to version 16.0.0 2024-09-10 10:50:20 +02:00
Ralf Jung 332fa6aa6e add FIXME(const-hack) 2024-09-08 23:08:40 +02:00
Nicholas Nethercote c5dadd0408 Use #[rustfmt::skip] on some use groups to prevent reordering.
`use` declarations will be reformatted in #125443. Very rarely, there is
a desire to force a group of `use` declarations together in a way that
auto-formatting will break up. E.g. when you want a single comment to
apply to a group. #126776 dealt with all of these in the codebase,
ensuring that no comments intended for multiple `use` declarations would
end up in the wrong place. But some people were unhappy with it.

This commit uses `#[rustfmt::skip]` to create these custom `use` groups
in an idiomatic way for a few of the cases changed in #126776. This
works because rustfmt treats any `use` item annotated with
`#[rustfmt::skip]` as a barrier and won't reorder other `use` items
around it.
2024-07-19 13:26:48 +10:00
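A small sketch of the `#[rustfmt::skip]` barrier described in "Use #[rustfmt::skip] on some use groups to prevent reordering". The imports and the `summarize` function are arbitrary examples chosen for illustration, not code from the commit.

```rust
// Collections used for deterministic (sorted) output. The attribute keeps
// this comment attached to the group it describes: rustfmt will not
// reorder other `use` items across the annotated one.
#[rustfmt::skip]
use std::collections::BTreeSet;
use std::collections::BTreeMap;

// Formatting helpers.
#[rustfmt::skip]
use std::fmt::Write;

/// Counts words and reports them in sorted order.
fn summarize(words: &[&str]) -> String {
    let unique: BTreeSet<&str> = words.iter().copied().collect();
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for w in words {
        *counts.entry(*w).or_insert(0) += 1;
    }
    let mut out = String::new();
    write!(out, "{} unique:", unique.len()).unwrap();
    for (w, n) in &counts {
        write!(out, " {w}={n}").unwrap();
    }
    out
}
```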
Nicholas Nethercote 75b6ec9800 Avoid comments that describe multiple use items.
There are some comments describing multiple subsequent `use` items. When
the big `use` reformatting happens some of these `use` items will be
reordered, possibly moving them away from the comment. With this
additional level of formatting it's not really feasible to have comments
of this type. This commit removes them in various ways:

- merging separate `use` items when appropriate;

- inserting blank lines between the comment and the first `use` item;

- outright deletion (for comments that are relatively low-value);

- adding a separate "top-level" comment.

We also entirely skip formatting for four library files that contain
nothing but `pub use` re-exports, where reordering would be painful.
2024-07-17 08:02:46 +10:00
Arpad Borsos 488598c183 Add a lower bound check to unicode-table-generator output
This adds a dedicated check for the lower bound
(if it is outside of ASCII range) to the output of the `unicode-table-generator` tool.

This generalizes the ASCII-only fast-path, but only for the `Grapheme_Extend` property for now,
as that is the only one with a lower bound outside of ASCII.
2024-04-20 10:16:45 +02:00
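The lower-bound check from "Add a lower bound check to unicode-table-generator output" can be sketched like this, assuming a sorted table of inclusive ranges. `RANGES` is a hypothetical three-entry sample, and the real generator emits a different table representation.

```rust
use std::cmp::Ordering;

/// Hypothetical sorted, non-overlapping inclusive ranges (sample only).
const RANGES: &[(u32, u32)] = &[(0x300, 0x36F), (0x483, 0x489), (0x591, 0x5BD)];

/// Returns true if `c` falls inside one of the sample ranges.
fn in_ranges(c: char) -> bool {
    let cp = c as u32;
    // Fast path: nothing below the first range's start can match, so all
    // of ASCII (and here everything up to U+02FF) skips the search.
    if cp < RANGES[0].0 {
        return false;
    }
    RANGES
        .binary_search_by(|&(lo, hi)| {
            if hi < cp {
                Ordering::Less
            } else if lo > cp {
                Ordering::Greater
            } else {
                Ordering::Equal
            }
        })
        .is_ok()
}
```

When the first range starts above U+007F, this check covers the ASCII-only fast path for free, which is the generalization the commit describes.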
Marcondiro e9870b5df3 Bump Unicode printables to version 15.1, align to unicode_data 2024-03-28 11:21:52 +01:00
Marcondiro 01fa7209d5 Bump Unicode to version 15.1.0, regenerate tables 2024-02-09 17:35:46 +01:00
Trevor Gross 22d00dcd47 Apply changes to fix python linting errors 2023-06-16 20:56:01 -04:00
Martin Gammelsæter 54f55efb9a Use hex literal for INDEX_MASK 2023-03-21 09:59:47 +01:00
Martin Gammelsæter 355e1dda1d Improve case mapping encoding scheme
The indices are encoded as `u32`s in the range of invalid `char`s, so
that we know that if any mapping fails to parse as a `char` we should
use the value for lookup in the multi-table.

This avoids the second binary search in cases where a multi-`char`
mapping is needed.

Idea from @nikic
2023-03-16 21:42:15 +01:00
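A minimal sketch of the encoding described in "Improve case mapping encoding scheme", under stated assumptions: the value of `INDEX_MASK`, the table names, and the three sample entries are chosen for illustration and are not the real generated data. Any `u32` with the mask bit set lies above `0x10FFFF`, so `char::from_u32` rejects it, and that rejection is the signal to use the multi-character table.

```rust
/// Assumed marker bit; any value with this bit set is not a valid `char`.
const INDEX_MASK: u32 = 0x40_0000;

/// Sample single table, sorted by key: (input char, encoded mapping).
/// An encoded mapping is either a valid `char` or `INDEX_MASK | index`.
const UPPER_SAMPLE: &[(char, u32)] = &[
    ('\u{DF}', INDEX_MASK),       // 'ß' -> multi entry 0
    ('\u{E0}', 0xC0),             // 'à' -> 'À'
    ('\u{FB01}', INDEX_MASK | 1), // 'ﬁ' -> multi entry 1
];

/// Sample multi-character table, padded with '\0'.
const UPPER_MULTI_SAMPLE: &[[char; 3]] = &[['S', 'S', '\0'], ['F', 'I', '\0']];

/// Looks up an uppercase mapping with a single binary search.
fn to_upper_sketch(c: char) -> [char; 3] {
    match UPPER_SAMPLE.binary_search_by_key(&c, |&(key, _)| key) {
        Ok(i) => {
            let encoded = UPPER_SAMPLE[i].1;
            match char::from_u32(encoded) {
                // Valid `char`: a plain single-character mapping.
                Some(single) => [single, '\0', '\0'],
                // Not a `char`: strip the marker bit and index the multi table.
                None => UPPER_MULTI_SAMPLE[(encoded & (INDEX_MASK - 1)) as usize],
            }
        }
        // Not in the sample table: treat as mapping to itself.
        Err(_) => [c, '\0', '\0'],
    }
}
```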
Martin Gammelsæter f9bd884385 Split unicode case LUTs in single and multi variants
The majority of char case replacements are single char replacements,
so storing them as [char; 3] wastes a lot of space.

This commit splits the replacement tables for both `to_lower` and
`to_upper` into two separate tables, one with single-character mappings
and one with multi-character mappings.

This reduces the binary size for programs using all of these tables
by roughly 24K bytes.
2023-03-16 12:34:04 +01:00
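The space argument in "Split unicode case LUTs in single and multi variants" can be made concrete with a back-of-the-envelope calculation. The helper names and the entry counts used with them are hypothetical. The sketch only shows why storing every mapping as `[char; 3]` is wasteful and does not attempt to reproduce the reported saving.

```rust
use std::mem::size_of;

/// Bytes used when every mapping is stored as `(char, [char; 3])`.
fn combined_bytes(entries: usize) -> usize {
    entries * size_of::<(char, [char; 3])>()
}

/// Bytes used when single-character mappings are stored as `(char, char)`
/// and only the multi-character ones keep the `[char; 3]` payload.
fn split_bytes(single: usize, multi: usize) -> usize {
    single * size_of::<(char, char)>() + multi * size_of::<(char, [char; 3])>()
}
```

For a made-up table with 900 single and 100 multi mappings, the combined layout needs 16,000 bytes and the split layout 8,800 bytes.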
Martin Gammelsæter 8a4eb9e3a8 Skip serializing ascii chars in case LUTs
Since ascii chars are already handled by a special case in the
`to_lower` and `to_upper` functions, there's no need to waste space on
them in the LUTs.
2023-03-15 17:27:23 +01:00
jonathanCogan db47071df2 Replace libstd, libcore, liballoc in line comments. 2022-12-30 14:00:42 +01:00