Floating point numbers

This is another essential mathematical page, explaining how C++ stores floating point numbers (in the type called "double"). Modern processors can calculate with numbers stored in this way very quickly, so the advantages of using these numerical formats are considerable. You will have to learn about the disadvantages too.

1. Floating point numbers

Recall that in binary we can have a "binary point" and represent numbers in a similar sort of way to decimal, including numbers that are not integers. For example, doing the long division gives the binary for 1/10 as

0.0001100110011001100... (the pattern repeats forever)

Non-integer numbers are usually represented in computer languages (including C++) in the following way (called, for historical reasons, "double precision" or "double"). We imagine the number written as

sign 1.b0 b1 ... b50 b51 × 2^exponent

where the b0, b1, ..., b51 are binary digits (0 or 1) and the exponent runs from -1022 to +1023. The number 1.b0 b1 ... b50 b51 is called the mantissa. The sign is a single binary digit for + or -.

The choice of 52 digits after the binary point and the range -1022 to +1023 for the exponent is an international standard (IEEE 754 "double precision"). It is what is used in C++'s double type, and is also used by popular processors from Intel, AMD, etc. Other standards (such as a different number of digits or a different range for the exponent) also exist.

The "1." at the beginning of the mantissa is convenient because it doesn't need to be stored (since it is always there). The problem with this is that it means that zero cannot be represented. By convention 0 is represented as 1.0000 × 2^-1023. This means C++'s double has both +0 and -0. (Work out what this means if you can!) There are also strange conventions for other double "numbers" such as ∞, -∞ and NaN (Not a Number). You might come across these at times. Whether they mean anything much is highly debatable. (I am not 100% sure the standards are complete in these regards, and even if they are the implementations might be buggy.)

Whatever the number of bits used in the mantissa and exponent, there will often be some rounding. For example 1/10 cannot be represented exactly but will probably be stored as

1.1001100110011001100110011001100110011001100110011010 × 2^-4

The last part, 11010, is chosen because it makes the result slightly closer to the actual number than the alternative, 11001. Unfortunately rounding algorithms like these are fraught with danger, with many seemingly arbitrary options to choose from (such as which way to round when it apparently doesn't matter) and are often not implemented very well.

Important. Note that if you do "illegal" calculations on double values such as dividing by zero, the computer program may continue regardless, producing strange values such as ∞ or NaN. The implementation details here are tricky.

Technical details aside, the important thing is that there is a sign, a mantissa (53 binary digits, always starting "1.") and an 11 binary digit exponent. This is stored in 64 bits. Thus "double" can be expected to carry about 53 binary digits of precision in a range from about ± 2^-1022 to ± 2^1023. This translates to approximately (and slightly less than) 16 decimal digits of precision between approximately ± 10^-308 and ± 10^308. Rounding in some form or other almost always takes place when floating point values are used.

The C++ (and IEEE) standard guarantees that every int value can be exactly represented as a double.

For most compilers (including GCC running on Intel or AMD processors) double arithmetic is carried out in the processor itself. This means it is fast, but you are relying on Intel or AMD doing their work correctly and calculating arithmetic without bugs. In the past this has not always been the case. See for example https://en.wikipedia.org/wiki/Pentium_FDIV_bug. Typically, multiplication is faster than division, so it is often worthwhile avoiding divisions where possible. For example, multiplying by 0.1 is usually faster than dividing by 10, but beware that the two are not guaranteed to give identical results, since 0.1 is itself a rounded value.

Remark.

In worked examples, done on paper to help understand some of these limitations, we normally work with some smaller number N of binary digits in the mantissa (N less than 53) and assume an arbitrary amount of precision in the exponent. I always include the mandatory "1." when I count the number of digits in the mantissa.

Example.

If the number -1/9 were to be represented in this format with a mantissa of 8 binary digits, start by looking at the binary representation of 1/9, which is 0.000111000111000111... or 1.11000111000... × 2^-4. To 8 digits this is 1.1100011 × 2^-4, but the truncated part is greater than a half, so the last digit should be rounded up. The final answer is therefore: -1/9 is (to 8 mantissa digits) best approximated as -1.1100100 × 2^-4.

Of course, the most important fact to remember is that even "double" precision numbers are not perfect, and because of the way the numbers are represented unexpected things can happen. One of the most awkward things to remember is that, because "double" precision is base 2, some numbers that are normally easy to represent exactly in base 10 are impossible to represent exactly in "double".

Remark.

Beware exact tests of ==, <= and < using double. The numbers concerned may not be as "exact" as you might think.

Example.

What do you think the result of the following program is?

// addtenth.cpp

// A puzzle! What happens?
#include <iostream>

using namespace std;

int main() {
  double x = 0.0;
  while ( x < 10.0 ) {
    x = x + 1.0/10.0;
  }
  cout << "End of loop: x is now " << x << endl;
  return 0;
}