Double precision in C++

C++'s standard built-in type for real numbers ("floating point" numbers) is called "double". This is for historical reasons: when C and C++ were first designed, the standard type for floating point was called "float", and "double" was an "extra" that was usually not used. Double precision real numbers were costly in time and memory because all floating point arithmetic was computed in separate steps from ordinary integer arithmetic. Nowadays CPUs implement double precision arithmetic in hardware and execute it very quickly, so there is little extra time cost in using double any more. It is true that double precision requires twice the amount of memory, but memory is comparatively cheap, and only specialist applications need arrays of "float"s so huge that they would not fit into memory as "double"s. So "double" should be the normal choice for "real" arithmetic.

C++ also has a "long double" which may provide additional precision, but this is not guaranteed to be any different to the usual "double". If you require even more precision, there are packages that give multiple precision arithmetic. This page, however concerns the usual "double" type. But first, we'll catch up on the details about int.

1. Representation of ints

The int data type in C++ is relatively simple and easy to understand. An int variable takes a fixed number of bits, each of which is 0 or 1. The C++ standard specifies that there must be at least 16 bits, and on the machines you have been using there are 32. To simplify the discussion here we will assume there are only 8 bits.

Using 8-bit arithmetic, an int value therefore might be stored internally as 10010111 say, where the most significant bit is to the left and the least significant to the right. If this were a normal binary number it would represent 2^7 + 2^4 + 2^2 + 2^1 + 2^0 = 128 + 16 + 4 + 2 + 1 = 151. But then negative numbers would not be possible. In fact, in C++, int is represented in a clever system called two's complement, where the most significant digit counts as the negative of what it would have been in normal binary. So the number above (in 8-bit two's complement) is -128 + 16 + 4 + 2 + 1 = -105. This way, all integers from -128 to 127 can be represented. For example, -128 is 10000000, 127 is 01111111, and -1 is 11111111.
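
Here is a minimal sketch checking this on an 8-bit type; int8_t and bitset are just a convenient way to display bit patterns, and are not needed for anything in the discussion. (Binary literals like 0b10010111 need a 2014-or-later compiler, and before C++20 the narrowing conversion is strictly implementation-defined, though two's complement wraparound is what every common machine does.)

// sketch: the bit pattern 10010111 as two's complement
#include <bitset>
#include <cstdint>
#include <iostream>

int main() {
  // reinterpret the 8-bit pattern 10010111 as a two's complement value
  std::int8_t n = static_cast<std::int8_t>(0b10010111);
  std::cout << static_cast<int>(n) << std::endl; // prints -105

  // and conversely, show the 8-bit pattern of -105
  std::cout << std::bitset<8>(static_cast<std::uint8_t>(-105)) << std::endl; // 10010111
}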

In 32-bit two's complement, everything is just the same except the number of bits. The most significant bit counts as -2^31 and the other bits are positive. All integers from -2^31 to 2^31 - 1 are represented. Amongst other things, this explains why in C++'s int, when 1 is added to 2^31 - 1, i.e. 0111...1 + 0000...1, the result is 1000...0 or -2^31.

// twoscompl.cpp

#include <iostream>

using namespace std;

int main() {
  int n = 1; 
  int x = 0;
  for (int k=0; k<31; k++) { 
    x = x+n; 
    n = n+n; // overflows on very last time
  }
  /* 
     at this point,
     x is 1+2+4+...+2^30 = 2^31-1
     n = whatever C++ calculates as 2^30+2^30
  */
  cout << "x==" << x << " n==" << n << endl;
  
  // now add 1 to x
  x = x+1;
  cout << "x==" << x << endl;
}

The two's complement system is very clever: adding in two's complement uses exactly the same algorithm as adding ordinary positive binary numbers, so you don't have to worry whether you are adding two positives, two negatives, or one of each.
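
The following minimal sketch illustrates that fact at 8 bits (the values -57 and 23 are arbitrary choices, not from the text): the sum is computed by plain unsigned binary addition, and reinterpreting the resulting bits as signed gives the correct signed answer.

// sketch: signed addition via unsigned binary addition
#include <cstdint>
#include <iostream>

int main() {
  std::int8_t a = -57, b = 23;
  // copy the bit patterns into unsigned bytes and add them as
  // ordinary (unsigned) binary numbers, discarding any carry out
  std::uint8_t usum = static_cast<std::uint8_t>(
      static_cast<std::uint8_t>(a) + static_cast<std::uint8_t>(b));
  // reinterpret the resulting bits as a signed two's complement value
  std::int8_t ssum = static_cast<std::int8_t>(usum);
  std::cout << static_cast<int>(ssum) << std::endl; // prints -34, which is -57 + 23
}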

2. Representation of doubles

The usual form of a number represented by a "double" on the computer is a number in binary written in the following form:

±1.b51 b50 ... b1 b0 × 2^exponent

where the b_i are binary digits (0 or 1) and the exponent runs from -1022 to +1023. The number 1.b51 b50 ... b1 b0 is called the mantissa.

Here the system is very different. There are 64 bits to represent a double (compared to 32 for int). The sign is represented by a bit this time (1 for "-" and 0 for "+"). The exponent is an 11-bit binary number, but is stored as a "positive" number in the range 0..2047, from which 1023 must be subtracted to get the true exponent. Moreover the bit patterns 00000000000 (which would have been -1023) and 11111111111 (which would have been +1024) have special meanings (see below) so are not available. So the exponent goes from -1022 to +1023.

The remaining 52 bits store the digits b51 ... b0 of the number itself, in binary of course, with an implicit "1." (that is not stored) at the beginning, giving 53 significant binary digits in all.
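
As an illustration, the sketch below unpacks these three fields directly from the bits of a double. It assumes the standard IEEE 754 64-bit layout described above (sign in the top bit, then 11 exponent bits, then 52 mantissa bits), which is what virtually all current hardware uses.

// sketch: unpacking the sign, exponent and mantissa fields
#include <cstdint>
#include <cstring>
#include <iostream>

int main() {
  double x = -0.1;
  std::uint64_t bits;
  std::memcpy(&bits, &x, sizeof bits);  // reinterpret the 64 bits of x

  std::uint64_t sign     = bits >> 63;               // 1 sign bit
  int           stored   = (bits >> 52) & 0x7FF;     // 11-bit exponent, biased by 1023
  std::uint64_t mantissa = bits & 0xFFFFFFFFFFFFFull; // 52 stored mantissa bits

  std::cout << "sign bit: " << sign << std::endl
            << "true exponent: " << (stored - 1023) << std::endl
            << "mantissa bits: " << std::hex << mantissa << std::endl;
  // for x = -0.1 this prints sign 1, exponent -4, and the mantissa bits
  // of 1.1001100110011... (0.1 is a recurring fraction in binary)
}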

Remark.

As a special case, with the exponent equal to -1023 and all b_i = 0, the number represented is ±0. (Whether -0 is of any real meaning will depend on your application and careful readings of the standard. Here I shall regard +0 and -0 as essentially the same.) With the exponent equal to +1024 the number represented is either ±∞ or "NaN" ("not a number"). There are in fact many variations here depending on the values of the b_i, and you will have to read the standards to unpick what goes on here. They are not easy reading.

Technical details aside, the important thing is that there is a sign, a mantissa (53 binary digits, always starting "1.") and an 11 binary digit exponent. Thus "double" can be expected to carry about 53 binary digits of precision in a range from about ±2^-1022 to ±2^1023. This translates to approximately (and slightly less than) 16 decimal digits of precision between approximately ±10^-308 and ±10^308.
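
These figures can be read off at run time from std::numeric_limits (in <limits>). The specific member names below are standard library facts, not from the text above; a minimal check might look like this:

// sketch: querying the precision and range of double
#include <iostream>
#include <limits>

int main() {
  typedef std::numeric_limits<double> lim;
  std::cout << "mantissa bits: " << lim::digits << std::endl        // 53
            << "decimal digits: " << lim::digits10 << std::endl     // 15
            << "smallest normal: " << lim::min() << std::endl       // about 2.2e-308
            << "largest: " << lim::max() << std::endl               // about 1.8e308
            << "epsilon (lsb of 1.0): " << lim::epsilon() << std::endl; // 2^-52
}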

Remark.

In worked examples, done on paper to help understand some of these limitations, we normally work with some smaller number N of binary digits in the mantissa (rather than 53) and assume the exponent can take whatever value is required.

Example.

If the number -1/9 were to be represented in this format with a mantissa of 8 binary digits (including the initial "1."), start by looking at the binary representation of 1/9, which is 0.000111000111000111... or 1.11000111000... × 2^-4. To 8 digits this is 1.1100011 × 2^-4, but the truncated part is greater than a half, so the last digit should be rounded up. The final answer is therefore: -1/9 is (to 8 mantissa digits) best approximated as -1.1100100 × 2^-4.
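
You can check this against what a real double actually stores by printing in hexadecimal floating point. The sketch below uses the std::hexfloat manipulator (a C++11 feature, not mentioned in the text above); the repeating hex digits c71c71... are exactly the repeating binary digits 110001110001... grouped four at a time.

// sketch: -1/9 in hexadecimal floating point
#include <iostream>

int main() {
  std::cout << std::hexfloat << -1.0/9.0 << std::endl;
  // prints -0x1.c71c71c71c71cp-4, i.e. -1.1100011100011100... × 2^-4
}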

Of course, the most important fact to remember is that even "double" precision numbers are not perfect, and because of the way the numbers are represented unexpected things can happen. One of the most awkward things to remember is that, because "double" precision is base 2, some numbers that are normally easy to represent exactly in base 10 are impossible to represent exactly in "double".

Example.

What do you think the result of the following program is?

// addtenth.cpp

// A puzzle! what happens?
#include <iostream>

using namespace std;

int main() {
  double x = 0.0;
  while ( x < 10.0 ) {
    x = x + 1.0/10.0;
  }
  cout << "End of loop: x is now " << x << endl;
  return 0;
}

3. Operators and functions on doubles

C++ provides the usual arithmetic operations on doubles: +, -, *, / and the assignments =, +=, -=, *=, /=. To ensure that an operation is carried out in double arithmetic, make sure that at least one operand of each operator is a double. For constants, you can always turn a number such as "123" into a double with either a decimal point or an exponent, e.g. "123.0" or "123E0".
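
This matters most with division, where an all-int expression silently discards the remainder. A minimal illustration:

// sketch: integer division versus double division
#include <iostream>

int main() {
  std::cout << 1/10 << std::endl;    // both operands int: integer division, prints 0
  std::cout << 1.0/10 << std::endl;  // one operand double: prints 0.1
  std::cout << 1/10.0 << std::endl;  // likewise prints 0.1
}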

C++ also has a long list of mathematical functions on doubles. Here is a sample of some of them. To use these, do #include <cmath>; I have implicitly assumed using namespace std; (or else prefix std:: to all the names). A short usage example follows the list.

double abs(double x); // absolute value
double acos(double x); // inverse cos, principal value radians
double asin(double x); // inverse sin, principal value radians
double atan(double x); // inverse tan, principal value radians
double atan2(double y, double x); // inverse tan of y/x in range -pi..pi
double ceil(double x); // round up to next integer
double cos(double x); // cos of x in radians
double cosh(double x); // hyperbolic cos of x
double exp(double x); // e to the x
double exp2(double x); // 2 to the x
double floor(double x); // round down to next integer
double log(double x); // natural log of x
double log10(double x); // base 10 log of x
double pow(double x, double y); // x to the y
double sin(double x); // sin of x in radians
double sinh(double x); // hyperbolic sin of x
double sqrt(double x); // square root of x
double tan(double x); // tan of x in radians
double tanh(double x); // hyperbolic tan of x
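
For instance, a minimal sketch exercising a few of these (the particular values are arbitrary):

// sketch: a few <cmath> functions in use
#include <cmath>
#include <iostream>

using namespace std;

int main() {
  double x = 2.0;
  cout << sqrt(x) << endl;                       // 1.41421...
  cout << pow(x, 10.0) << endl;                  // 1024
  cout << atan2(1.0, 1.0) << endl;               // pi/4 = 0.785398...
  cout << floor(-2.5) << " " << ceil(-2.5) << endl; // -3 -2
}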

If you want the value π you can use M_PI. (You may need to write #define _USE_MATH_DEFINES just before #include <cmath>.) Some other constants, such as M_E (the number e) and M_SQRT2 (the square root of 2), are available in the same way.

It is officially better in C++ not to use these #defines for mathematical constants but, for example, to write const double pi = 3.14159265358979323846;.

Also, be aware that #include <math.h> is subtly different, and abs on doubles won't work properly if you do this: you may silently get the integer version of abs from C instead. Note that #include <math.h> is from C, and isn't really C++.

Later versions of C++ (such as the 2011 version) have even more functions on doubles. If you need them, use a good online reference, such as http://www.cplusplus.com/.

4. Values of type double

Given a number x in the form

±1.b51 b50 ... b1 b0 × 2^exponent

the value of the last binary digit b0, if it were 1, would be

lsb(x) = 1 × 2^(exponent - 52)

This number is quite interesting for a number of reasons: for example, x + lsb(x) is the next representable double after x (its "successor"), and lsb(x)/2 bounds the rounding error committed when an exact result is rounded to x (see the Example at the end of this section).

These various values can be calculated by a C++ program using some other functions in <cmath>. This is not entirely easy, and certainly not a "beginner's programming exercise", but I did it in the following program.

// doublesucc.cpp

#include <iostream>
#include <cmath>

/* Prints information about x to cout
 * illustrates how to get exponent and mantissa of x
 * where x = mantissa*2^exponent and
 * either 0.5 <= |mantissa| < 1.0 or mantissa=0.0
 */
void printdouble(double x) {
  int exponent = 0;
  // NOTE: the & in the next line converts exponent to a pointer as
  // required by this old-style library function.  This is like "call
  // by reference" but using the old C style way of doing it.
  double mantissa = frexp( x, &exponent );
  std::cout << x << " = " << mantissa << " * 2^" << exponent << std::endl;
  // sanity check:
  double y = ldexp(mantissa,exponent);
  if (x != y) std::cerr << "Error, these should be equal!\n"; 
}

/* returns the value d that a binary 1 would have in the least
 * significant position in the mantissa. This is also the least d in
 * the form 1*2^k such that x+d > x */
double leastsigbit(double x) {
  int exponent = 0;
  frexp( x, &exponent ); // get exponent and discard mantissa
  // Here we pretend I don't know the number of binary digits in the
  // mantissa and find it numerically:
  double e = ldexp(0.5,exponent); // e = 0.5*2^exponent
  double f = e/2.0;
  while (x+f > x) { e=f; f=f/2.0; }
  return e;
}

/* returns the double value just greater than x */
double succdouble(double x) {
  return x+leastsigbit(x);
}

/* returns the double value just less than x */
double preddouble(double x) {
  return x-leastsigbit(x);
}

/* returns the double value with same sign and numerically just
* greater than x; returns 0.0 if x==0.0 */
double succabsdouble(double x) {
  if (x<0.0) return preddouble(x);
  if (x>0.0) return succdouble(x);
  return 0.0;
}

/* returns the double value with same sign and numerically just less
   than x; returns 0.0 if x==0.0 */
double predabsdouble(double x) {
  if (x<0.0) return succdouble(x);
  if (x>0.0) return preddouble(x);
  return 0.0;
}

int main() {
  std::cout.precision(20); // for illustration
  double x;
  std::cin >> x;
  std::cout << "Value:       ";
  printdouble(x);
  std::cout << "LSB:         ";
  printdouble(leastsigbit(x));
  std::cout << "Successor:   ";
  printdouble(succdouble(x));
  std::cout << "Predecessor: ";
  printdouble(preddouble(x));
}

Example.

Real world numbers X1, X2 are known to C++ by double approximations x1, x2, within errors of e1, e2 respectively, i.e. |Xi - xi| < ei for each i. The value Y = X1 + X2 is approximated by the result of calculating y = x1 + x2 in C++. Give a similar error bound for y as an approximation to Y.

Solution.

If y = x1 + x2 were calculated exactly, then the worst possible error would be e1 + e2. However, rounding may also occur in the calculation of y = x1 + x2, so this value is not exact but only accurate to lsb(y)/2. So the best possible error estimate is in fact |Y - y| < e1 + e2 + lsb(y)/2.

There's not a lot more we can say here. In particular, the final answer depends on the least significant digit of y, and not on those of x1 or x2. But it is possible to see that lsb(y) is at most twice the maximum of lsb(x1) and lsb(x2).
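
A tiny illustration of this rounding in action (the constants 0.1, 0.2 and 0.3 are the classic example, not from the text): all three are already approximations as doubles, and the addition rounds again, so the computed sum differs from the double nearest 0.3 by one unit in the last place.

// sketch: rounding error in a single addition
#include <iostream>

int main() {
  double x = 0.1 + 0.2;
  std::cout.precision(20);
  std::cout << x << std::endl;          // prints 0.30000000000000004441...
  std::cout << (x == 0.3) << std::endl; // prints 0: the two doubles differ
}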

The moral of this example is that errors and inaccuracies are everywhere when using double, and that new errors (albeit rather small ones) creep in with each operation performed.