Replacing log2f with a Faster Approximation

2023-09-26

This problem came up while optimizing my implementation of WFC (Wave Function Collapse), but I figured it deserves its own post. One part of that algorithm is calculating Shannon entropies, which means calculating base-2 logarithms. And in this algorithm, a LOT of logarithms get calculated.
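
(For reference, the Shannon entropy of a probability distribution p_1, ..., p_n is H = -(p_1 * log2(p_1) + ... + p_n * log2(p_n)), so there is one logarithm to compute per probability.)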

My program was very slow, so I profiled it and realized that most of the time was spent on C’s log2f function. This is generally a very well-implemented function that offers high precision and covers a lot of corner cases. It can, however, be a bit slow.

I did not need a very precise logarithm function. I did need it to be much faster, and I knew my inputs would always be positive numbers - never NaN, infinity, nor any other special value. The solution was obvious - I would write my own log2f!

Now, one should not just start replacing functions from the standard library unless it is really, provably necessary. But in my case… well, I think it was necessary:

Approximating the binary logarithm

Library authors use a lot of tricks to implement floating-point math functions. In part they use math identities and approximation methods, in part they use bit-twiddling hacks on floating-point representations. Because of that, the code can get quite involved and ends up looking very cryptic. E.g. here's musl libc's implementation of log2f: https://git.musl-libc.org/cgit/musl/tree/src/math/log2f.c, and of log2 (for doubles): https://git.musl-libc.org/cgit/musl/tree/src/math/log2.c. And this is an implementation of log (natural base) from a different codebase: https://www.netlib.org/fdlibm/e_log.c. As you can see, they are not simple. I am not qualified to explain them, so I will focus only on what I've implemented.

My first idea was to develop a Taylor polynomial, stick it in a function, and call it a day. All my inputs are in the range (0, 1], so the expansion would be centered around 0.5, and I'd play around with different polynomial degrees for a good precision-speed tradeoff. There was one problem, though. Here's what a 5th-degree Taylor polynomial would look like:

See how it diverges from the actual function as x nears zero? The y-axis is an asymptote for logarithms and no polynomial will simulate that well enough. This means that, for small enough numbers, the approximation error is inevitably going to be large. Thankfully, there is a solution - I’ll map my arguments to a different range. Floats are already represented as m * 2^e, where m is the mantissa and e is the exponent. This representation is convenient because it can easily be broken up by a binary logarithm:

log2(x) = log2(m * 2^e) = log2(m) + log2(2^e) = log2(m) + e

The float mantissa is always in the range [1, 2), and it turns out that approximating log2(x) on this range is quite easy. Hell, even log2(x) ~ x - 1 is not too far off, and x - 1 + 0.043 is even better.
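
To make this concrete: 0.375 is stored as 1.5 * 2^(-2), so log2(0.375) = log2(1.5) + (-2) ≈ 0.585 - 2 = -1.415 - only log2(1.5), with the mantissa safely inside [1, 2), needs to be approximated.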

A Taylor polynomial can now be applied on this range. But, while researching how to approximate the binary logarithm, I learned that the Taylor approximation I was taught in school is not the best one. Two topics kept coming up - Chebyshev polynomials and the Remez algorithm.

To put it as simply as I can, a Chebyshev approximation (of a specified degree) aims to minimize the maximum error by constructing a polynomial whose error function rises and falls in a sine-like fashion. Imagine the polynomial oscillating above and below the function it approximates. For example, this is the error function of the 5th-degree Chebyshev polynomial for log2(x) in the range [1, 2]:

The Remez algorithm computes the coefficients of such an approximating polynomial. I won't pretend to understand how it works, but fortunately, I don't have to - I found a project called lolremez that runs the algorithm and outputs the coefficients. For the curious: check out the reference links on the Remez algorithm at the bottom.
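
As a rough illustration of how it's used (the exact flags and expression syntax here are from memory, so double-check them against lolremez --help and the project wiki), asking for a 3rd-degree approximation of the binary logarithm over [1, 2] would look something like:

    lolremez --degree 3 --range 1:2 "log(x)/log(2)"

The output is the resulting polynomial, whose coefficients can then be pasted into code.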

Implementation

Now that we have all that, what should the code look like?

First, it should extract the mantissa and exponent. This is done with some bit-twiddling. Here’s a refresher on the 32-bit float representation:

sign
|   <--exponent--->   <-----------------mantissa------------------>
V
31  30            23  22                                          0
. | . . . . . . . . | . . . . . . . . . . . . . . . . . . . . . . .

The actual number is then: (-1)^sign * 2^(exponent - 127) * 1.{mantissa bits}.
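
For example, 6.5 is stored with sign = 0, an exponent field of 129, and mantissa bits that encode 1.625: (-1)^0 * 2^(129 - 127) * 1.625 = 6.5.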

The exponent can be extracted like this (assuming the number is positive):

    int32_t xi;
    memcpy(&xi, &x, sizeof(xi));          // reinterpret the bits of the (positive) float x as an int32_t
    float e = (float)((xi >> 23) - 127);  // shift out the 23 mantissa bits and subtract the exponent bias

The memcpy call reinterprets the bytes of the float as an integer so that bitwise operations can be performed on them. memcpy is a standard-compliant way to reinterpret the bytes of one type as a different type, and it works in both C and C++. That's right, C and C++ rules are not the same. Compilers should optimize the call to memcpy away, so don't worry about speed.

The mantissa can be extracted like this:

    float m;
    int32_t mxi = (xi & ((1 << 23) - 1)) | (127 << 23);  // keep the mantissa bits, set the exponent field to 127 (i.e. 2^0)
    memcpy(&m, &mxi, sizeof(m));                         // reinterpret back into a float in [1, 2)

(1 << 23) - 1 is just a series of 23 ones in the lowest bits - this bitmask selects just the mantissa bits. The code then creates the bit representation of (-1)^0 * 2^(127 - 127) * 1.{mantissa bits}, which resolves to just the mantissa, and reinterprets those bytes as a float.
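
Continuing the 6.5 example from above: these two snippets produce e = 2.0f and m = 1.625f, and indeed log2(6.5) = log2(1.625) + 2.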

The mantissa is always inside [1, 2), so the next step is to evaluate the approximation polynomial on it. For example, for the 3rd-degree case:

    // Horner's method: start from the highest-degree coefficient...
    float l = 0.15824871f;
    l = l * m + -1.051875f;
    l = l * m + 3.0478842f;
    l = l * m + -2.1536207f;  // ...and finish with the constant term.

The total logarithm value is then l + e.
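
Putting those pieces together, here is a minimal sketch of the whole thing (the name fast_log2 and the inline wrapper are mine; the coefficients are the 3rd-degree ones from above):

    #include <stdint.h>
    #include <string.h>

    // Fast log2 approximation; assumes x is positive, finite, and not NaN.
    static inline float fast_log2(float x) {
        int32_t xi;
        memcpy(&xi, &x, sizeof(xi));

        // Exponent: shift out the mantissa bits and remove the bias.
        float e = (float)((xi >> 23) - 127);

        // Mantissa: keep the low 23 bits, force the exponent field to 127 (2^0).
        float m;
        int32_t mxi = (xi & ((1 << 23) - 1)) | (127 << 23);
        memcpy(&m, &mxi, sizeof(m));

        // 3rd-degree polynomial for log2(m) on [1, 2), in Horner form.
        float l = 0.15824871f;
        l = l * m + -1.051875f;
        l = l * m + 3.0478842f;
        l = l * m + -2.1536207f;

        return l + e;
    }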

You can find the final code for various degree approximations here.

Benchmark

Here is the benchmark code. My setup:

The benchmark calls various implementations on one billion different values. In the tables below, err is the maximum absolute difference from the equivalent log2f result. time is the time in seconds it took to make a billion calls (plus some test overhead), averaged across 5 runs. err01 and time01 are the same, except calls were made on a different set of a billion values, all of which were in the range (0, 1].
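
The linked benchmark is what produced the numbers below; just to illustrate the idea, a stripped-down timing loop looks roughly like this (it times log2f here - swapping in one of the approximations works the same way):

    #include <math.h>
    #include <stdio.h>
    #include <time.h>

    int main(void) {
        const long n = 1000000000L; // one billion calls
        float sum = 0.0f;           // accumulate results so the calls can't be optimized away

        clock_t t0 = clock();
        for (long i = 1; i <= n; ++i) {
            float x = (float)i / (float)n; // values in (0, 1]
            sum += log2f(x);
        }
        clock_t t1 = clock();

        printf("%f seconds (checksum %f)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
        return 0;
    }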

I also added two functions I found in a GitHub repo called fastapprox. I have only a vague idea of how they compute the logarithm (there are a lot of magic constants), but they are reasonably fast and precise, and therefore a nice reference for comparison.

These are the results after compiling with -O3 -march=native:

function      time       err       time01     err01
log2f         23.990800  0.000000  23.738200  0.000000
fasterlog2     0.440600  0.057310   0.440400  0.057307
fastlog2       0.453400  0.000153   0.440800  0.000147
remez deg-1    0.441400  0.043037   0.441000  0.043036
remez deg-2    0.442800  0.004940   0.442800  0.004940
remez deg-3    0.443600  0.000639   0.441400  0.000639
remez deg-4    0.451000  0.000090   0.440600  0.000090
remez deg-5    0.445800  0.000015   0.440800  0.000015

These are the results after compiling with -O2 only (no -march):

function      time       err       time01     err01
log2f         23.943800  0.000000  23.718400  0.000000
fasterlog2     0.439800  0.057310   0.442600  0.057307
fastlog2       0.621800  0.000153   0.626800  0.000148
remez deg-1    0.439800  0.043037   0.441600  0.043036
remez deg-2    0.442600  0.004942   0.442400  0.004941
remez deg-3    0.521800  0.000639   0.522400  0.000639
remez deg-4    0.631000  0.000090   0.635600  0.000090
remez deg-5    0.735400  0.000015   0.734400  0.000015

All of them are much faster than log2f. With -O3 -march=native the times are so close that the benchmark isn't able to determine which functions are faster. With -O2 the differences do show, and it's clear that adding more degrees makes for slower functions.

For the first four remez versions, precision increased by about one decimal place with each degree added.

For my project, I ended up going with the 3rd-degree version. At lower degrees, my program's outputs started to differ from what I got with log2f. I figured the 3rd-degree polynomial was a good compromise between precision and performance.

Conclusion

Don’t take this to mean you should replace all your calls to log2f - you probably shouldn’t. But if you ever run into a situation where a math function from the standard library is slow, hopefully this gives you some ideas on what you can do about it.

One thing I neglected to mention is that imprecision in floating-point operations compounds. If you use a slightly imprecise logarithm value in some calculation, and then use those results in further calculations, the results will diverge more and more from the correct ones. What you use the logarithm value for is an important factor in deciding how much precision loss you can afford.
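
For instance, with the deg-3 version each logarithm can be off by up to roughly 6 * 10^-4, so a plain sum of a thousand such values can, in the worst case, drift by about 0.6.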

Anyway, as for my program from the start of the post, here’s what the profiler reported after I replaced log2f:

Those 125 seconds went down to 31 - nice!

References

https://austinhenley.com/blog/cosine.html

https://www.youtube.com/watch?v=p8u_k2LIZyo

https://en.wikipedia.org/wiki/Single-precision_floating-point_format

https://math.stackexchange.com/questions/3381629/what-is-the-fastest-algorithm-for-finding-the-natural-logarithm-of-a-big-number

https://stackoverflow.com/questions/9411823/fast-log2float-x-implementation-c

https://github.com/romeric/fastapprox/blob/master/fastapprox/src/fastlog.h

https://www.flipcode.com/archives/Fast_log_Function.shtml

https://git.musl-libc.org/cgit/musl/tree/src/math/log2f.c

https://git.musl-libc.org/cgit/musl/tree/src/math/log2.c

https://www.netlib.org/fdlibm/e_log.c

http://jrfonseca.blogspot.com/2008/09/fast-sse2-pow-tables-or-polynomials.html

https://www1.icsi.berkeley.edu/pubs/techreports/TR-07-002.pdf

https://en.wikipedia.org/wiki/Chebyshev_polynomials

https://en.wikipedia.org/wiki/Remez_algorithm

https://www.boost.org/doc/libs/1_83_0/libs/math/doc/html/math_toolkit/remez.html

https://github.com/samhocevar/lolremez

https://github.com/samhocevar/lolremez/wiki/Implementation-details

https://stackoverflow.com/questions/38601780/can-memcpy-be-used-for-type-punning

https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8

https://cellperformance.beyond3d.com/articles/2006/06/understanding-strict-aliasing.html

https://www.vplesko.com/posts/wfc/devlog_0.html

https://github.com/vplesko/libwfc

https://en.wikipedia.org/wiki/Entropy_(information_theory)