Remove support for 1-token lookahead from the lexer
`StringReader` maintained `peek_token` and `peek_span_src_raw` for look ahead.
`peek_token` was used only by rustdoc syntax coloring. After moving peeking logic into highlighter, I was able to remove `peek_token` from the lexer. I tried to use `iter::Peekable`, but that wasn't as pretty as I hoped, due to buffered fatal errors. So I went with hand-rolled peeking.
After that I've noticed that the only peeking behavior left was for raw tokens to test tt jointness. I've rewritten it in terms of trivia tokens, and not just spans.
After that it became possible to simplify the awkward constructor of the lexer, which could return `Err` if the first peeked token contained error.
Allow attributes in formal function parameters
Implements https://github.com/rust-lang/rust/issues/60406.
This is my first contribution to the compiler and since this is a large and complex project, I am not fully aware of the consequences of the changes I have made.
**TODO**
- [x] Forbid some built-in attributes.
- [x] Expand cfg/cfg_attr
- Add detail on origin of current parser when reaching EOF and stop
saying "found <eof>" and point at the end of macro calls
- Handle empty `cfg_attr` attribute
- Reword empty `derive` attribute error
Add stream_to_parser_with_base_dir
This PR adds `stream_to_parser_with_base_dir`, which creates a parser from a token stream and a base directory.
Context: I would like to parse `cfg_if!` macro and get a list of modules defined inside it from rustfmt so that rustfmt can format those modules (cc https://github.com/rust-lang/rustfmt/issues/3253). To do so, I need to create a parser from `TokenStream` and set the directory of `Parser` to the same directory as the parent directory of a file which contains `cfg_if!` invocation. AFAIK there is no way to achieve this, and hence this PR.
Alternatively, I could change the visibility of `Parser.directory` from `crate` to `pub` so that the value can be modified after initializing a parser. I don't have a preference over either approach (or others, as long as it works).
Move token tree related lexer state to a separate struct
Just a types-based refactoring.
We only used a bunch of fields when tokenizing into a token tree, so let's move them out of the base lexer
Identify when a stmt could have been parsed as an expr
There are some expressions that can be parsed as a statement without
a trailing semicolon depending on the context, which can lead to
confusing errors due to the same looking code being accepted in some
places and not others. Identify these cases and suggest enclosing in
parenthesis making the parse non-ambiguous without changing the
accepted grammar.
Fix#54186, cc #54482, fix#59975, fix#47287.
introduce unescape module
A WIP PR to gauge early feedback
Currently, we deal with escape sequences twice: once when we [lex](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/lexer/mod.rs#L928-L1065) a string, and a second time when we [unescape](https://github.com/rust-lang/rust/blob/112f7e9ac564e2cfcfc13d599c8376a219fde1bc/src/libsyntax/parse/mod.rs#L313-L366) literals. Note that we also produce different sets of diagnostics in these two cases.
This PR aims to remove this duplication, by introducing a new `unescape` module as a single source of truth for character escaping rules.
I think this would be a useful cleanup by itself, but I also need this for https://github.com/rust-lang/rust/pull/59706.
In the current state, the PR has `unescape` module which fully (modulo bugs) deals with string and char literals. I am quite happy about the state of this module
What this PR doesn't have yet are:
* [x] handling of byte and byte string literals (should be simple to add)
* [x] good diagnostics
* [x] actual removal of code from lexer (giant `scan_char_or_byte` should go away completely)
* [x] performance check
* [x] general cleanup of the new code
Diagnostics will be the most labor-consuming bit here, but they are mostly a question of just correctly adjusting spans to sub-tokens. The current setup for diagnostics is that `unescape` produces a plain old `enum` with various problems, and they are rendered into `Handler` separately. This bit is not actually required (it is possible to just pass the `Handler` in), but I like the separation between diagnostics and logic this approach imposes, and such separation should again be useful for #59706
cc @eddyb , @petrochenkov
Currently, we deal with escape sequences twice: once when we lex a
string, and a second time when we unescape literals. This PR aims to
remove this duplication, by introducing a new `unescape` mode as a
single source of truth for character escaping rules
There are some expressions that can be parsed as a statement without
a trailing semicolon depending on the context, which can lead to
confusing errors due to the same looking code being accepted in some
places and not others. Identify these cases and suggest enclosing in
parenthesis making the parse non-ambiguous without changing the
accepted grammar.