# Floating Point

In programming floating point (colloquially just *float*) is a way of representing [fractional](rational_number.md) numbers (such as 5.13) and approximating [real numbers](real_number.md) (i.e. numbers with higher than [integer](integer.md) precision), which is a bit more complex than simpler methods for doing so (such as [fixed point](fixed_point.md)). The core idea is to use a radix ("decimal") point that's not fixed but can move around so as to allow representation of both very small and very big values. Nowadays floating point is the standard way of [approximating](approximation.md) [real numbers](real_number.md) in computers (floating point types are called *real* in some programming languages even though they represent only [rational numbers](rational_number.md); floats can't e.g. represent [pi](pi.md) exactly). Basically all of the popular [programming languages](programming_language.md) have a floating point [data type](data_type.md) that adheres to the IEEE 754 standard and all personal computers have a hardware floating point unit ([FPU](fpu.md)), so floating point is widely used in all [modern](modern.md) programs. However most of the time a simpler representation of fractional numbers, such as the mentioned [fixed point](fixed_point.md), suffices, and weaker computers (e.g. [embedded](embedded.md)) may lack the hardware support, so floating point operations are emulated in software and therefore slow -- remember, float rhymes with [bloat](bloat.md). Prefer fixed point.

Back in the earlier days of personal computers -- like the early [90s](90s.md) -- hardware accelerated floating point still wasn't completely common. For example the Intel 80286 didn't have a built-in FPU, it had to be bought extra as a separate coprocessor chip, and that was usually done only by professionals like engineers and scientists; games didn't really use floating point. Integrated FPUs became standard only later on.

**Floating point is tricky**: it works most of the time but a danger lies in programmers relying on this kind of [magic](magic.md) too much, some new generation programmers may not even be very aware of how floats internally work. Even though the principle is not difficult to understand, the emergent behavior of the math can get really complex and practical problems of implementation and standardization don't help either. One floating point expression may evaluate differently on different systems, for example due to different rounding settings. Floating point can introduce [chaotic](chaos.md) behavior into linear systems as it inherently makes rounding errors and so becomes a nonlinear system (source: http://foldoc.org/chaos). One common pitfall of float is working with big and small numbers at the same time -- due to differing precision at different scales small values simply get lost when mixed with big numbers and sometimes this has to be worked around with tricks (see e.g. [this](http://the-witness.net/news/2022/02/a-shader-trick/) devlog of The Witness where a float time variable sent into a [shader](shader.md) is periodically reset so as to not grow too large and cause the mentioned issue). Another famous trickiness of float is that you shouldn't really be comparing floats for equality with a normal `==` operator as small rounding errors may make even mathematically equal expressions unequal (i.e. you should use some range comparison instead).
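The last two pitfalls can be sketched in a few lines of C. This is only a minimal illustration assuming the usual IEEE 754 `float` and `double`; the function names `approxEqual` and `isAbsorbed` are made up for this example and the tolerance is something you have to choose yourself for the scale of your values.

```c
#include <assert.h>

/* Range comparison: returns 1 if a and b differ by at most eps, else 0.
   The tolerance eps is the caller's choice, there is no universal value. */
int approxEqual(double a, double b, double eps)
{
  assert(eps >= 0);

  double difference = a - b;

  if (difference < 0)
    difference = -difference;

  return difference <= eps;
}

/* Returns 1 if adding small to big gives back big, i.e. small got lost. */
int isAbsorbed(float big, float small)
{
  volatile float sum = big + small; /* volatile: make sure we round to float */

  return sum == big;
}
```

On a typical PC `0.1 + 0.2 == 0.3` is false for doubles while `approxEqual(0.1 + 0.2,0.3,1e-9)` returns 1, and `isAbsorbed(100000000.0f,1.0f)` returns 1 because near 10^8 neighboring floats are 8 apart.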

And there is more: floating point behavior really depends on the language you're using (and possibly even the compiler, its settings etc.) and it may not always be completely defined/specified, leading to possible [nondeterministic](determinism.md) behavior which can cause real trouble e.g. in physics engines. This may also lead to nasty bugs and trouble with [portability](portability.md) (i.e. assuring the exact same behavior on all platforms).

There is also a bit of an unfortunate situation with standardization. The widely adopted IEEE 754 standard is far from flawless in design, it's actually kind of bad, but it came to be so widely established and supported in hardware that it's immensely difficult to replace it even with objectively better ways of handling floating point numbers, for example [posits](posit.md).

{ Really as I'm now getting down the float rabbit hole I'm seeing what a huge mess it all is, I'm not nearly an expert on this so maybe I've written some BS here, which just confirms how messy floats are. Anyway, from the articles I'm reading even being an expert on this issue doesn't seem to guarantee a complete understanding of it :) Just avoid floats if you can. ~drummyfish }

For starters consider the following snippet (let's now assume the standard 32 bit IEEE float etc.):

```
for (float f = 0; f < 20000000; f++)
  if (((int) f) % 4096 == 0) // once in a while output current f
    printf("%f\n",f);
```

Take a look at the code and guess what it does. The loop should count up to 20 million and stop, right? NOPE. The loop will never end because *f* will never reach 20 million -- and no, it's not because 20 million would be too high a value, in fact it's laughably low considering that float can store values up to the order of 10^38. What gives then? Upon running the loop you'll notice it gets stuck at the value 16777216.0 (2^24), which is the line beyond which the gap between neighboring floats grows bigger than 1, meaning the number can no longer be incremented by one because float cannot represent the next integer, 16777217 (the sum gets rounded back to 16777216). And that's just a very basic, innocent looking loop.
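To see the sticking point without waiting on an endless loop, here is a small sketch (again assuming 32 bit IEEE 754 float with the default round to nearest mode; `findStuckValue` is a name invented for this example) that counts up by one and reports the first value at which the increment no longer changes anything.

```c
#include <assert.h>

/* Counts up from start by 1 and returns the first value f for which
   f + 1 == f, i.e. the value at which the loop above gets stuck. */
float findStuckValue(float start)
{
  assert(start >= 0);

  volatile float f = start; /* volatile: keep the value rounded to float */

  for (;;)
  {
    volatile float next = f + 1.0f;

    if (next == f) /* the increment was rounded away */
      return f;

    f = next;
  }
}
```

Calling `findStuckValue(0)` returns 16777216 under the stated assumptions. The usual way around this is to count with an integer and convert to float only where a float is really needed.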

Is floating point literally evil? Well, of course not, but it is extremely overused. You may need it for precise scientific simulations, e.g. [numerical integration](numerical_integration.md), but as our [small3dlib](small3dlib.md) shows, you can comfortably do even [3D rendering](3d_rendering.md) without it. So always consider whether you REALLY need float. **You mostly do NOT need it**.

**Simple example of avoiding floating point**: many noobs think that if they e.g. need to multiply some integer *x* by let's say 2.34 they have to use floating point. This is of course false and just proves most retarddevs don't know elementary school [math](math.md). Multiplying *x* by 2.34 is the same as *(x * 234) / 100*, which we can [optimize](optimization.md) to an approximately equal division by a power of two as *(x * 2396) / 1024* (because 2.34 * 1024 is approximately 2396). Indeed, given e.g. *x = 56* we get the same integer result 131 in both cases, the latter just completely avoiding floating point.
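Written out in C the two variants may look as follows. It's just a sketch: the function names are made up, the results are rounded towards zero and the asserts guard against [overflow](overflow.md) of the intermediate product, which is a limitation you have to keep in mind with this trick.

```c
#include <assert.h>
#include <limits.h>

/* x * 2.34 with integers only, using the exact ratio 234 / 100 */
int mul234Exact(int x)
{
  assert(x >= 0 && x <= INT_MAX / 234); /* x * 234 mustn't overflow */

  return (x * 234) / 100;
}

/* x * 2.34 approximated as 2396 / 1024; dividing by a power of two is
   cheap as the compiler can turn it into a bit shift */
int mul234Pow2(int x)
{
  assert(x >= 0 && x <= INT_MAX / 2396); /* x * 2396 mustn't overflow */

  return (x * 2396) / 1024;
}
```

Both return 131 for *x = 56*, but mind that the power of two version is only approximately equal: for *x = 100* the first function gives 234 while the second one gives 233.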

## How It Works

The gist of the basic idea is this: we have digits in memory and in addition we have a position of the radix point among these digits, i.e. both digits and position of the radix point can change. The fact that the radix point can move is reflected in the name *floating point*. In the end any number stored in float can be written with a finite number of digits with a radix point, e.g. 12.34. Notice that any such number can also always be written as a simple fraction of two integers (e.g. 12.34 = 1 * 10 + 2 * 1 + 3 * 1/10 + 4 * 1/100 = 617/50), i.e. any such number is always a rational number. This is why we say that floats represent fractional numbers and not true real numbers (real numbers such as [pi](pi.md), [e](e.md) or square root of 2 can only be approximated).

More precisely floats represent numbers by storing two main parts: the actual encoded digits, called the **mantissa** (or significand etc.), and the position of the radix point. The position of the radix point is called the **exponent** because mathematically floating point works similarly to the scientific notation of extreme numbers, which uses exponentiation. For example instead of writing 0.0000123 scientists write 123 * 10^-7 -- here 123 would be the mantissa and -7 the exponent.

Though various numeric bases come to consideration, in [computers](computer.md) we almost exclusively use [base 2](binary.md), so we are about to stick with base 2 from now on. Moving on, our numbers will be of format:

*mantissa * 2^exponent*

Note that besides mantissa and exponent there may also be other parts, typically there is also a sign bit that says whether the number is positive or negative.

Let's now consider an extremely simple floating point format based on the above. Keep in mind this is an EXTREMELY NAIVE inefficient format that wastes values. We won't consider negative numbers. We will use 6 bits for our numbers:

- 3 leftmost bits for mantissa: This allows us to represent 2^3 = 8 base values: 0 to 7 (including both).
- 3 rightmost bits for exponent: We will encode exponent in [two's complement](twos_complement.md) so that it can represent values from -4 to 3 (including both).

So for example the binary representation `110011` stores mantissa `110` (6) and exponent `011` (3), so the number it represents is 6 * 2^3 = 48. Similarly `001101` represents 1 * 2^-3 = 1/8 = 0.125.
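The decoding just described can be sketched in C like this (`toyDecode` is a hypothetical helper for our toy format, it's not any real library function). A handy coincidence: 6 bits are exactly two [octal](octal.md) digits, so in an octal literal the first digit is the mantissa and the second one the exponent bits.

```c
#include <assert.h>

/* Decodes the 6 bit toy format: upper 3 bits are the mantissa (0 to 7),
   lower 3 bits are the exponent in two's complement (-4 to 3). */
double toyDecode(unsigned int code)
{
  assert(code < 64);

  int mantissa = (code >> 3) & 0x07;
  int exponent = code & 0x07;

  if (exponent >= 4) /* two's complement: stored 4 to 7 mean -4 to -1 */
    exponent -= 8;

  double result = mantissa;

  for (; exponent > 0; exponent--) /* multiply by 2^exponent */
    result *= 2;

  for (; exponent < 0; exponent++) /* or divide if it's negative */
    result /= 2;

  return result;
}
```

Here `toyDecode(063)` (binary `110011`) returns 48 and `toyDecode(015)` (binary `001101`) returns 0.125, matching the values computed by hand above.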

Note a few things: firstly our format is [shit](shit.md) because some numbers have multiple representations, e.g. 0 can be represented as `000000`, `000001`, `000010`, `000011` etc., in fact we have 8 zeros! That's unforgivable and formats used in practice address this (usually by prepending an implicit 1 to mantissa).

Secondly observe the non-uniform distribution of our numbers: whilst we have good resolution close to 0 (we can represent 1/16, 2/16, 3/16, ...), the resolution in high numbers falls (the highest number we can represent is 56 but the second highest is 48, we can NOT represent e.g. 50 exactly). Realize that obviously with 6 bits we can still represent only 64 numbers at most! So float is NOT a magical way to get more numbers, with integers on 6 bits we can represent numbers from 0 to 63 spaced exactly by 1 and with our floating point we can represent numbers spaced as close as 1/16th but only in the region near 0, we pay the price of having big gaps in higher numbers.
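How many different numbers do we actually get out of the 64 bit patterns? A brute force count is easy to write; the sketch below (with a made up name `toyCountDistinct`) stores every value as an integer number of sixteenths, which works because 1/16 is the finest step of the format, so the check itself needs no floating point.

```c
#include <assert.h>

/* Counts how many different numbers the toy format can really encode.
   Each value is held as an integer number of sixteenths, computed as
   mantissa << (exponent + 4). */
int toyCountDistinct(void)
{
  unsigned char seen[7 * 128 + 1] = {0}; /* biggest value: 56 = 896 / 16 */
  int count = 0;

  for (int mantissa = 0; mantissa < 8; mantissa++)
    for (int exponent = -4; exponent <= 3; exponent++)
    {
      int sixteenths = mantissa << (exponent + 4);

      assert(sixteenths <= 7 * 128);

      if (!seen[sixteenths]) /* first time we see this value? */
      {
        seen[sixteenths] = 1;
        count++;
      }
    }

  return count;
}
```

The function returns 36, i.e. because of the multiple representations our naive format wastes 28 of its 64 bit patterns.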

Also notice that things like simple addition of numbers become more difficult and time consuming, you have to include conversions and [rounding](rounding.md) -- while with fixed point addition is a single machine instruction, same as integer addition, here with a software implementation we might end up with dozens of instructions (specialized hardware can perform addition fast but still, not all computers have that hardware).

Rounding errors will appear and accumulate during computations: imagine the operation 48 + 1/8. Both numbers can be represented in our system but not the result (48.125). We have to round the result and end up with 48 again. Imagine you perform 64 such additions in succession (e.g. in a loop): mathematically the result should be 48 + 64 * 1/8 = 56, which is a result we can represent in our system, but we will nevertheless get the wrong result (48) due to rounding errors in each addition. So the behavior of float can be **non intuitive** and dangerous, at least for those who don't know how it works.
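The accumulation can be simulated with a short sketch. Values are again held as integer sixteenths (48 is 768 sixteenths, 1/8 is 2 sixteenths), `toyRound` and `toyAccumulate` are names invented here, and the rounding simply picks the nearest representable value, which is an assumption of this example (real formats also define what happens with ties).

```c
#include <assert.h>
#include <stdlib.h>

/* Rounds a value given in sixteenths to the nearest value the toy format
   can represent (also returned in sixteenths). */
int toyRound(int sixteenths)
{
  assert(sixteenths >= 0);

  int best = 0;

  for (int mantissa = 0; mantissa < 8; mantissa++)
    for (int exponent = -4; exponent <= 3; exponent++)
    {
      int candidate = mantissa << (exponent + 4);

      if (abs(candidate - sixteenths) < abs(best - sixteenths))
        best = candidate;
    }

  return best;
}

/* Adds increment to start count times, rounding after every single
   addition just like arithmetic in the toy format would have to. */
int toyAccumulate(int start, int increment, int count)
{
  int sum = toyRound(start);

  for (int i = 0; i < count; i++)
    sum = toyRound(sum + toyRound(increment));

  return sum;
}
```

`toyAccumulate(48 * 16,2,64)` returns 768 sixteenths, i.e. still 48, whereas rounding only once at the end, `toyRound(48 * 16 + 64 * 2)`, gives 896 sixteenths, i.e. the correct 56.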

## Standard Float Format: IEEE 754

IEEE 754 is THE standard that basically all computers use for floating point nowadays -- it specifies the exact representation of floating point numbers as well as rounding rules, required operations that implementations should provide etc. However note that the standard is **kind of [shitty](shit.md)** -- even if we want to use floating point numbers there exist better ways such as **[posits](posit.md)** that outperform this standard. Nevertheless IEEE 754 has been established in the industry to the point that it's unlikely to go away anytime soon. So it's good to know how it works.

Numbers in this standard are signed, have positive and negative zero (oops), can represent plus and minus [infinity](infinity.md) and different [NaNs](nan.md) (not a number). In fact there are thousands to billions of different NaNs which are basically wasted values. These inefficiencies are addressed by the mentioned [posits](posit.md).
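These oddities are easy to poke at from C. The following sketch assumes that `double` is the IEEE 754 binary64 format (true on basically all PCs, but not guaranteed by the C standard); `doubleFromBits` and `looksLikeNaN` are names made up for the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Builds a double from its raw 64 bit pattern. */
double doubleFromBits(uint64_t bits)
{
  double result;

  assert(sizeof(result) == sizeof(bits));
  memcpy(&result, &bits, sizeof(result));

  return result;
}

/* Returns 1 if x is a NaN: NaN is the only value not equal to itself. */
int looksLikeNaN(double x)
{
  return x != x;
}
```

For example `doubleFromBits(0x8000000000000000ULL)` is the negative zero, which nevertheless compares equal to the normal `0.0`, and both `0x7ff8000000000000ULL` and `0x7ff8000000000001ULL` give a NaN, just two of the huge number of bit patterns spent on NaNs.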

Briefly the representation is as follows (hold on to your chair): the leftmost bit is the sign bit, then the exponent follows (the number of bits depends on the specific format) and the rest of the bits is the mantissa. In the mantissa an implicit `1.` is considered (except when the exponent is all 0s), i.e. we "imagine" `1.` in front of the mantissa bits but this 1 is not physically stored. The exponent is in so called biased format, i.e. we have to subtract half (rounded down) of the maximum possible value to get the real value (e.g. if we have 8 bits for the exponent and the directly stored value is 120, we have to subtract 255 / 2 = 127 to get the real exponent value, in this case we get -7). However two values of the exponent have special meaning. All 0s signify a so called denormalized (also subnormal) { Lol in Spanish subnormal means retarded. ~drummyfish } number in which we consider the exponent to be that which is otherwise lowest possible (e.g. -126 in case of 8 bit exponent) but we do NOT consider the implicit 1 in front of the mantissa (we instead consider `0.`), i.e. this allows storing [zero](zero.md) (positive and negative) and very small numbers. All 1s in the exponent signify either [infinity](infinity.md) (positive and negative) in case the mantissa is all 0s, or a [NaN](nan.md) otherwise -- considering here we have the whole mantissa plus sign bit unused, we actually have many different NaNs ([WTF](wtf.md)), but usually we only distinguish two kinds of NaNs: quiet (qNaN) and signaling (sNaN, throws an [exception](exception.md)) that are distinguished by the leftmost bit in the mantissa (1 for qNaN, 0 for sNaN).

The standard specifies many formats that are either binary or decimal and use various numbers of bits. The most relevant ones are the following:

| name                              |M bits|E bits| smallest positive and biggest number     | step <= 1 up to      |
| --------------------------------- | ---- | ---- | ---------------------------------------- | -------------------- |
|binary16 (half precision)          | 10   | 5    |2^(-24), 65504                            | 2048                 |
|binary32 (single precision, float) | 23   | 8    |2^(-149), 2^127 * (2 - 2^-23) ~= 3 * 10^38| 16777216             |
|binary64 (double precision, double)| 52   | 11   |2^(-1074), ~10^308                        | 9007199254740992     |
|binary128 (quadruple precision)    | 112  | 15   |2^(-16494), ~10^4932                      | ~10^34               |

**Example?** Let's say we have float (binary32) value `11000000111100000000000000000000`: first bit (sign) is 1 so the number is negative. Then we have 8 bits of exponent: `10000001` (129) which converted from the biased format (subtracting 127) gives exponent value of 2. Then mantissa bits follow: `11100000000000000000000`. As we're dealing with a normal number (exponent bits are neither all 1s nor all 0s), we have to imagine the implicit `1.` in front of mantissa, i.e. our actual mantissa is `1.11100000000000000000000` = 1.875. The final number is therefore -1 * 1.875 * 2^2 = -7.5.
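The same decoding can be done by a program. The sketch below handles only normal numbers (no zeros, subnormals, infinities or NaNs) and assumes that C's `float` is binary32 wherever it reinterprets the bits; all the names (`FloatFields`, `floatSplit`, `floatDecodeNormal`, `floatFromBits`) are made up for the example. The bit pattern from above is `0xC0F00000` in hexadecimal.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef struct
{
  unsigned int sign; /* 0 = positive, 1 = negative */
  int exponent;      /* real exponent, bias already subtracted */
  uint32_t mantissa; /* the 23 stored bits, without the implicit 1 */
} FloatFields;

/* Splits raw binary32 bits into the three fields. */
FloatFields floatSplit(uint32_t bits)
{
  FloatFields f;

  f.sign = bits >> 31;
  f.exponent = (int) ((bits >> 23) & 0xff) - 127;
  f.mantissa = bits & 0x7fffff;

  return f;
}

/* Computes the value of a normal number by hand. */
double floatDecodeNormal(uint32_t bits)
{
  FloatFields f = floatSplit(bits);
  uint32_t stored = (bits >> 23) & 0xff;

  assert(stored != 0 && stored != 255); /* normal numbers only */

  double value = 1.0 + f.mantissa / 8388608.0; /* implicit 1, 2^23 */

  for (int e = f.exponent; e > 0; e--)
    value *= 2;

  for (int e = f.exponent; e < 0; e++)
    value /= 2;

  return f.sign ? -value : value;
}

/* Lets the machine do the decoding by reinterpreting the bits. */
float floatFromBits(uint32_t bits)
{
  float result;

  assert(sizeof(result) == sizeof(bits));
  memcpy(&result, &bits, sizeof(result));

  return result;
}
```

Both `floatDecodeNormal(0xC0F00000u)` and `floatFromBits(0xC0F00000u)` give -7.5, in agreement with the manual computation.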

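The following table shows approximate resolution (i.e. distance to the next representable value) of float (32 bit) and double (64 bit) near a given stored value. Take the numbers only as order of magnitude guides: the exact spacing is always a power of two, e.g. for float it is exactly 1 between 2^23 and 2^24.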

| value   | float      | double     |
| ------- | ---------- | ---------- |
| 10^-20  | 3 * 10^-28 | 6 * 10^-37 |
| 10^-19  | 2 * 10^-27 | 5 * 10^-36 |
| 10^-18  | 4 * 10^-26 | 8 * 10^-35 |
| 10^-17  | 3 * 10^-25 | 6 * 10^-34 |
| 10^-16  | 2 * 10^-24 | 5 * 10^-33 |
| 10^-15  | 4 * 10^-23 | 8 * 10^-32 |
| 10^-14  | 3 * 10^-22 | 6 * 10^-31 |
| 10^-13  | 2 * 10^-21 | 5 * 10^-30 |
| 10^-12  | 4 * 10^-20 | 8 * 10^-29 |
| 10^-11  | 3 * 10^-19 | 6 * 10^-28 |
| 10^-10  | 2 * 10^-18 | 5 * 10^-27 |
| 10^-9   | 4 * 10^-17 | 8 * 10^-26 |
| 10^-8   | 3 * 10^-16 | 7 * 10^-25 |
| 10^-7   | 3 * 10^-15 | 5 * 10^-24 |
| 10^-6   | 4 * 10^-14 | 8 * 10^-23 |
| 10^-5   | 3 * 10^-13 | 7 * 10^-22 |
| 10^-4   | 3 * 10^-12 | 5 * 10^-21 |
| 10^-3   | 4 * 10^-11 | 9 * 10^-20 |
| 10^-2   | 3 * 10^-10 | 7 * 10^-19 |
| 10^-1   | 3 * 10^-09 | 5 * 10^-18 |
| 1       | 5 * 10^-08 | 9 * 10^-17 |
| 10      | 4 * 10^-07 | 7 * 10^-16 |
| 100     | 3 * 10^-06 | 6 * 10^-15 |
| 1000    | 2 * 10^-05 | 4 * 10^-14 |
| 10000   | 4 * 10^-04 | 7 * 10^-13 |
| 100000  | 3 * 10^-03 | 6 * 10^-12 |
| 1000000 | 0.02       | 4 * 10^-11 |
| 10^7    | 0.42       | 7 * 10^-10 |
| 10^8    | 3.38       | 6 * 10^-09 |
| 10^9    | 27.10      | 5 * 10^-08 |
| 10^10   | 433.68     | 8 * 10^-07 |
| 10^11   | 3469.44    | 6 * 10^-06 |
| 10^12   | 27755.57   | 5 * 10^-05 |
| 10^13   | 444089.21  | 8 * 10^-04 |
| 10^14   | 3552713.75 | 6 * 10^-03 |
| 10^15   | 28421710   | 0.05       |
| 10^16   | 454747360  | 0.84       |
| 10^17   | 3637978880 | 6.77       |
| 10^18   | 29103831040| 54.21      |
| 10^19   | 4 * 10^11  | 867.36     |
| 10^20   | 3 * 10^12  | 6938.89    |
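
To get the exact spacing near some particular value you can ask the machine. The following sketch assumes 32 bit IEEE 754 float and a positive finite input that isn't the biggest float; it relies on the fact that for positive floats adding 1 to the raw bit pattern gives the next bigger float. `floatSpacing` is a made up name, the standard function `nextafterf` from *math.h* does a similar job more robustly.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the distance from x to the next bigger float. */
float floatSpacing(float x)
{
  uint32_t bits;
  float next;

  assert(sizeof(x) == sizeof(bits));
  assert(x > 0);

  memcpy(&bits, &x, sizeof(bits));    /* get the raw bit pattern */
  bits++;                             /* move to the next float */
  memcpy(&next, &bits, sizeof(next));

  return next - x;
}
```

For instance `floatSpacing(1.0f)` returns 2^-23 (about 1.19 * 10^-7), `floatSpacing(10000000.0f)` returns 1, `floatSpacing(16777216.0f)` returns 2 and `floatSpacing(100000000.0f)` returns 8.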

## See Also

- [posit](posit.md)
- [fixed point](fixed_point.md)
- [conum](conum.md)
