The closest namespace the pi/4 constant could belong to is `trig.zig` since
it's used across trig function implementations. On the other hand, chucking
`long double` bit slicing functions into `trig.zig` seems a little more
awkward, so they're put into their own namespace.
The implementation was ported from `musl`. Unit tests for `f80` and `f128` were also added.
`__cosl` was already implemented in `trig.zig` while working on `sinl`.
The changes were tested by running:
```
$ ./build/stage3/bin/zig build -p stage4 -Denable-llvm -Dno-lib
$ stage4/bin/zig build test-libc -Dlibc-test-path=<LIBC-TEST-PATH> -Dtest-filter='math.cosl' -fqemu -fwasmtime --summary line
Build Summary: 553/553 steps succeeded
```
The implementation was ported from `musl`. Unit tests for `f80` and `f128` were also added.
The changes were tested by running:
```
$ ./build/stage3/bin/zig build -p stage4 -Denable-llvm -Dno-lib
$ stage4/bin/zig build test-libc -Dlibc-test-path=<LIBC-TEST-PATH> -Dtest-filter=sinl -fqemu -fwasmtime --summary line
Build Summary: 553/553 steps succeeded
```
The logic was more or less ported from `musl`, with small adjustments
where it was convenient. The 'internal' `__tanl` function was implemented
in the `trig.zig` module along with other 'internal' trigonometric functions.
Now, the `tanl` implementation is precise enough to pass all the relevant `libc-test` suite tests:
```
$ ./build/stage3/bin/zig build -p stage4 -Denable-llvm -Dno-lib
$ stage4/bin/zig build test-libc -Dlibc-test-path=<LIBC-TEST-PATH> -Dtest-filter=tanl -fqemu -fwasmtime --summary line
Build Summary: 553/553 steps succeeded
```
The unit tests were also extended to include cases for `f80` and `f128`, and they're passing.
Additionally, add helper functions for fetching the sign+exponent
and top 16 bits of a f80/f128's mantissa.
We'll need these functions to implement `cosl`, `sinl`, `tanl`, and,
transitively, `sincosl`.
Replace the O(n) shift-subtract loop with a constant-time trial
quotient approach (Knuth Algorithm D, TAOCP Vol 2 Section 4.3.1).
The old code iterates clz(b_hi)-clz(a_hi)+1 times (up to 64
iterations of 128-bit arithmetic). The new code uses a single
divwide call to get a trial quotient, then verifies with two
native-width widening multiplies.
Benchmark (Apple M1, ReleaseFast):
- Large divisor, large shift: 87ns -> 7.5ns (11.5x faster)
- Small divisor / uniform: unchanged
use the "symbol" helper function in all exports
move all declarations from common.zig to compiler_rt.zig
flatten the tree structure somewhat (move contents of tiny files into
parent files)
No functional changes.
- delete std.Thread.Futex
- delete std.Thread.Mutex
- delete std.Thread.Semaphore
- delete std.Thread.Condition
- delete std.Thread.RwLock
- delete std.once
std.Thread.Mutex.Recursive remains... for now. it will be replaced with
a special purpose mechanism used only by panic logic.
std.Io.Threaded exposes mutexLock and mutexUnlock for the advanced case
when you need to call them directly.
Our implementation did the classic add-sub rounding trick `(y = x +/- C =+ C - x)`
with `C = 1 / eps(T) = 2^(mantissa - 1)`. This approach only works for values whose
magnitude is below the rounding capacity of the constant. For a 64-bit mantissa
(like f80 has), `C = 2^63` only rounds for `|x| < 2^63`. Before we allowed this to
be ran on `e < bias + 64` aka `|x| < 2^64`. And because it isn't large enough,
we lose a bit to rounding.
For reference, the musl implementation does the same thing, using `mantissa - 1`:
https://git.musl-libc.org/cgit/musl/tree/src/math/ceill.c#n18
where `LDBL_MANT_DIG` is 64 for `long double` on x86.
This commit also combines the floor and ceil implementations into one generic one.
Simplifies the logic, clarifies the comment, and fixes a minor bug,
which is that we exported the Windows ABI name *instead* of the standard
compiler-rt name, but it's meant to be exported *in addition* to the
standard name (this is LLVM's behavior and it is more useful).
Apple's own headers and tbd files prefer to think of Mac Catalyst as a distinct
OS target. Earlier, when DriverKit support was added to LLVM, it was represented
a distinct OS. So why Apple decided to only represent Mac Catalyst as an ABI in
the target triple is beyond me. But this isn't the first time they've ignored
established target triple norms (see: armv7k and aarch64_32) and it probably
won't be the last.
While doing this, I also audited all Darwin OS prongs throughout the codebase
and made sure they cover all the tags.
`__addosi4`, `__addodi4`, `__addoti4`, `__subosi4`, `__subodi4`, and
`__suboti4` were all functions which we invented for no apparent reason.
Neither LLVM, nor GCC, nor the Zig compiler use these functions. It
appears the functions were created in a kind of misunderstanding of an
old language proposal; see https://github.com/ziglang/zig/pull/10824.
There is no benefit to these functions existing; if a Zig compiler
backend needs this operation, it is trivial to implement, and *far*
simpler than calling a compiler-rt routine. Therefore, this commit
deletes them. A small amount of that code was used by other parts of
compiler-rt; the logic is trivial so has just been inlined where needed.
I also chose to quickly implement `__addvdi3` (a standard function)
because it is trivial and we already implement the `sub` parallel.
The previous version (ported from musl) used bit-by-bit calculations and was slow, but the current version (also ported from musl) uses lookup tables combined with Goldschmidt iterations to significantly improve the speed.