Errors and precision in programming with floating point

Numerical programming with any kind of fixed-precision floating point arithmetic will introduce errors; they cannot be avoided entirely. Simply increasing the number of digits (from double to long double, say) will not eliminate them and may not help at all. It is up to you, the programmer, to keep track of possible errors and their magnitudes.

Precision does not necessarily mean "having more digits": it means knowing how many of your digits are correct. Without this you have no chance of usefully obtaining a more accurate answer to your problem.

This page is a basic theoretical discussion of errors and their sources.

1. Measurement of error

Suppose there is an external real quantity $x$. This may be a known mathematical constant such as $\pi$ or $e$, or the value of a physical quantity. If the latter, this physical quantity will have to be measured, and even just this measurement will introduce some inaccuracy. In any case, we end up with a value $\tilde{x}$ that is represented in C++'s double precision floating point (or some other system), and there is no reason to expect $x$ and $\tilde{x}$ to be the same.

In general $\tilde{x} = x + e$, where $e$ is the error in the representation of $x$ by $\tilde{x}$.

Definition.

The absolute error in the representation $\tilde{x}$ for $x$ is the mathematical quantity $\mathrm{err}(x) = \tilde{x} - x$.

The absolute error is an indication of how accurate we are, but an absolute error of 1 may be good or bad. Knowing the current distance between the Earth and Pluto in km with an absolute error of 1 might be rather good, but knowing $\pi$ to an absolute error of 1 is hardly accurate at all.

Definition.

The relative error in $\tilde{x}$ is the mathematical quantity $\text{rel-err}(x) = \dfrac{\tilde{x} - x}{x}$.

Note that we divide by the exact (accurate) value $x$ here. In practice this is often inconvenient, as $x$ is not known. But $\tilde{x}$ is, so we often estimate the relative error as $\dfrac{\tilde{x} - x}{\tilde{x}}$.

The relative error is usually a good indication of how many digits are valid. However, it is not the appropriate measure in every situation.
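As a small illustrative sketch (not one of the programs referred to elsewhere on this page), the following computes the absolute and relative error of the six-figure approximation 3.14159 to $\pi$, taking the full double-precision value of $\pi$ (obtained here as 4*atan(1)) as the "exact" value for comparison.

    #include <cmath>
    #include <iostream>

    int main()
    {
        const double x  = 4.0 * std::atan(1.0); // treat the full double value of pi as "exact"
        const double xt = 3.14159;              // the approximation under study
        std::cout << "absolute error = " << xt - x       << '\n';
        std::cout << "relative error = " << (xt - x) / x << '\n';
    }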

Exercise.

Write a program to compute $\sin(10^9 \pi)$. What is the absolute error? The relative error?

Solution.

An example program is provided at cpp/sinbillpi.cpp. When run, it gives the following output:

pi = 3.14159
sin( 1*pi) = 1.22465e-16
sin( 10*pi) = -1.22465e-15
sin( 100*pi) = 1.96439e-15
sin( 1000*pi) = -3.21417e-13
sin( 10000*pi) = -4.85682e-13
sin( 100000*pi) = -3.39607e-11
sin( 1e+06*pi) = -2.23191e-10
sin( 1e+07*pi) = 5.62056e-10
sin( 1e+08*pi) = -3.90829e-08
sin( 1e+09*pi) = -3.32014e-08
sin( 1e+10*pi) = -2.23936e-06
sin( 1e+11*pi) = -1.47642e-05
sin( 1e+12*pi) = -0.000269713

The absolute error (around $3 \times 10^{-8}$ in the run above) is not very impressive. The relative error is infinite (strictly, undefined), because the exact value of $\sin(10^9 \pi)$ is $0$ and we divide by it.
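For reference, a minimal sketch along these lines is shown below; the actual cpp/sinbillpi.cpp may well differ in details such as output formatting.

    #include <cmath>
    #include <iostream>

    int main()
    {
        const double pi = 4.0 * std::atan(1.0);   // pi correct to double precision
        std::cout << "pi = " << pi << '\n';
        // sin(n*pi) is exactly 0 for every integer n, so everything printed
        // below is pure error, growing as n*pi is represented less accurately.
        for (double n = 1.0; n <= 1.0e12; n *= 10.0)
            std::cout << "sin( " << n << "*pi) = " << std::sin(n * pi) << '\n';
    }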

Exercise.

Given that "double" has an accuracy of approximately 16 decimal digits, what is the expected relative accuracy of $\tilde{x}$ for a reasonable nonzero quantity $x$?

Solution.

Taking (slightly optimistically) the accuracy of "double" as 16 decimal digits, the expected relative accuracy of $\tilde{x}$ for a reasonable nonzero quantity $x$ would be $5 \times 10^{-16}$. (This makes no sense when $x = 0$ because of the division by zero.) In practice, "double" is not quite as accurate as this: its 53 binary digits can represent integers from 0 up to $2^{53} \approx 9.0 \times 10^{15}$, so only 15 decimal digits can always be represented exactly. If the value is calculated (even in a library routine such as one from <cmath>), special care is needed to ensure that the full possible accuracy is obtained, and this might not be feasible.
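One way to inspect these limits on your own system is via <limits>; the figures in the comments are those of a typical IEEE 754 double and may differ on unusual platforms.

    #include <iostream>
    #include <limits>

    int main()
    {
        std::cout << "machine epsilon = "
                  << std::numeric_limits<double>::epsilon() << '\n'; // about 2.22e-16
        std::cout << "decimal digits  = "
                  << std::numeric_limits<double>::digits10  << '\n'; // 15
        std::cout << "binary digits   = "
                  << std::numeric_limits<double>::digits    << '\n'; // 53
    }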

Exercise.

3.14159 is π to six significant digits. Pretend that you do not know the next digit, and give a bound on the relative accuracy of this approximation.

Solution.

If 3.14159 represents some number to 6 significant digits then that number is between 3.141585 and 3.141595, so the absolute accuracy is at worst 0.000005. The relative accuracy is slightly tricky because we have to divide by the actual number, not the approximation, and we don't know the actual number! But in the worst possible case it will still be less than $0.000005 / 3.141585$, or about 0.000001592.

2. Experimental error

There are a number of possible sources of error. The first of these is experimental error. Data you have just measured will not in general be exact, and in precise scientific work it is essential to identify the sources of error in the measurement and to estimate their magnitude.

Example.

The discovery by Penzias and Wilson of the cosmic microwave background at a temperature of 3.5 K, plus or minus 1 K, makes interesting reading. By far the most important mathematical analysis in their work was the careful quantification of that maximum error of one kelvin. An error of plus or minus 5 K would have made the measurement useless.

In these notes we will mostly ignore experimental error, for the simple reason that none of the data in the examples here come from experiments. That is not to say that you can ignore it in real-life situations. Some problems might say "assume an experimental error that is no more than .." and then you will have to do so, but you will not have to analyse where this experimental error comes from. That part of the work will have been done for you.

3. Round-off error

The next source of error is one we have already seen. It is called round-off error, and it is caused by approximating a real quantity $x$ with a fixed-precision double value $\tilde{x}$, necessarily discarding digits once the double's precision has been "used up".

There are a number of possible strategies for "rounding". One is always to truncate, simply deleting the extra digits (which rounds towards zero). Another is always to round downwards (towards minus infinity). Another is to take the nearest possible value (with some convention when there are two equally close possibilities, e.g. a trailing 5 is always rounded up). "Nearest possible value" seems the most sensible approach. In the early days of computers the strategies adopted were not always the best ones, and truncation was common.

Example.

-2.345355 to 5 significant digits is -2.3453 (truncation), -2.3454 (rounding downwards) or -2.3454 (nearest). 2.345355 to 5 significant digits is 2.3453 (truncation), 2.3453 (rounding downwards) or 2.3454 (nearest).
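The three policies can be imitated with std::trunc, std::floor and std::round from <cmath>, as in the sketch below. The scaling by $10^4$ keeps 5 significant digits for these particular values; a general-purpose routine would compute the scale from the number itself.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double vs[] = { -2.345355, 2.345355 };
        for (double v : vs) {
            double s = v * 1.0e4;   // line the 5th significant digit up with the units place
            std::printf("%+.6f:  truncate %+.4f   round down %+.4f   nearest %+.4f\n",
                        v, std::trunc(s) / 1.0e4, std::floor(s) / 1.0e4, std::round(s) / 1.0e4);
        }
    }

Note that std::round breaks exact ties by rounding away from zero, which is one possible convention; it does not matter for these particular values.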

Of course, the actual calculations are usually done in binary, not base 10, but the same principles apply. Also, when doing calculations with double values, it is sensible to use a "few more digits" internally and then round to the best value possible under the chosen policy. Modern hardware following the IEEE 754 standard is required to round the basic arithmetic operations correctly, but it certainly has been the case that computers in the past did not carry enough extra digits in their calculations. Moreover, bugs in the floating point units of commercially sold CPUs (the Intel Pentium FDIV bug, for example) are not unheard-of.

For this module, I will assume that calculations are done sufficiently carefully, and without bugs in hardware or software, to obtain the best (i.e. closest) double output value from the two (or more) double input values.

But the input double values are not necessarily exact themselves. The lesson is that round-off error accumulates over a sequence of calculations.

Exercise.

A quantity $x$ which is approximately 9 is known by an estimate $\tilde{x}$ correct to 6 significant figures. The quantity $y = 2x$ is estimated by computing $\tilde{y} = \tilde{x} + \tilde{x}$. Give bounds on the absolute errors $\mathrm{err}(x)$ and $\mathrm{err}(y)$.

Solution.

In this case, the error in $\tilde{x}$ is at most 0.000005. If the addition is exact, then the errors add: the error in $\tilde{y}$ is twice the error in $\tilde{x}$, so it is at most 0.00001.

Exercise.

Do the same, bounding the absolute error of $\tilde{x} + \tilde{y}$ when $x$ is about 10000 and $y$ is about 1000, and decimal addition is used to 6 figures.

Solution.

In this case, the error in $\tilde{x}$ is at most 0.05 and the error in $\tilde{y}$ is at most 0.005. But the error in $\tilde{x} + \tilde{y}$ can be more than 0.055, the sum of the errors. For example, if $\tilde{x} = 10000.1$ represents $x = 10000.05$ and $\tilde{y} = 1000.15$ represents $y = 1000.145$, then $\tilde{x} + \tilde{y}$ is calculated as 11000.3 (rounding up), but the actual exact answer is 11000.195, so the error is 0.105. In general an additional "rounding error on addition" sometimes has to be added, up to half of the least significant digit of the answer (in this case 0.05); indeed 0.105 is the maximum possible error in this case.

As a rough rule of thumb, if $N$ quantities $x_i$ are known to (absolute) accuracy $e$ and all subsequent arithmetic is carried out exactly, with all possible digits, then the sum $\sum_i x_i$ can be calculated to accuracy $eN$. However, during or after this addition, further round-offs are usually required, each contributing up to half the least significant digit of the running total, and the error may therefore be worse than $eN$ when the final answer is represented as a double.

Note however that if the (signed) errors $\tilde{x}_i - x_i$ are evenly distributed in the interval $(-e, e)$ and numbers of similar size are added together, then positive and negative errors will often cancel out, and we might expect an accuracy of around $e\sqrt{N}$ for large $N$ (plus additional round-off errors from using doubles, as before). Of course, the assumption of an even distribution should be tested properly before drawing this conclusion. (The square root comes from the formula for the expected distance from the starting point after $N$ steps of a random walk; see https://en.wikipedia.org/wiki/Random_walk.)
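The following sketch shows the accumulation in practice: it adds 0.1 ten million times, so the exact answer is 1000000, but 0.1 is not exactly representable as a double and every addition can introduce a further rounding error of up to half an ulp of the running total. The size of the final error will depend on your compiler and hardware.

    #include <cstdio>

    int main()
    {
        const long N = 10000000L;        // ten million terms
        double sum = 0.0;
        for (long i = 0; i < N; ++i)
            sum += 0.1;                  // 0.1 is itself only approximately representable
        std::printf("computed sum = %.10f\n", sum);
        std::printf("exact sum    = %.10f\n", 1.0e6);
        std::printf("error        = %g\n", sum - 1.0e6);
    }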

Remark.

Many older computer systems did not do arithmetic by rounding each calculation to the "best possible" representation, but simply chopped off the extra digits, and this caused serious problems. On such systems round-off errors were not evenly distributed but were all on "one side". Even if your computer does its arithmetic properly, your experimental set-up or your algorithm may for one reason or another favour round-off errors on one side.

Round-off has a particularly nasty aspect when estimates of two nearly equal values $x, y$ are subtracted, $\tilde{x} - \tilde{y}$, or when estimates of two values $x, y$ with $x \approx -y$ are added, $\tilde{x} + \tilde{y}$. This is called catastrophic cancellation.

Example.

The exact quantities $x = 1.121023446$ and $y = 1.120865326$ are estimated to 6 significant figures as $\tilde{x}$ and $\tilde{y}$. Say what $\tilde{x}$ and $\tilde{y}$ are and give the absolute and relative errors of these approximations. Compute $z = x - y$. Also compute $\tilde{z} = \tilde{x} - \tilde{y}$ (working to 6 significant figures). Compute the absolute and relative errors $\mathrm{err}(z)$ and $\text{rel-err}(z)$.

Solution.

We have $x - y = 0.00015812 = 1.5812 \times 10^{-4}$, while $\tilde{x} - \tilde{y} = 1.12102 - 1.12087 = 0.00015 = 1.50000 \times 10^{-4}$ working to 6 significant figures. The absolute error $\mathrm{err}(z)$ has magnitude $8.12 \times 10^{-6}$, which is as expected, being exactly $|\mathrm{err}(x)| + |\mathrm{err}(y)|$: the rounding in $\tilde{x}$ is downwards and the rounding in $\tilde{y}$ is upwards, so both push $\tilde{z}$ below $z$ and their magnitudes add. The relative error has magnitude $|\text{rel-err}(z)| = \frac{8.12 \times 10^{-6}}{0.00015812}$, which is about 0.05. Thus despite working to 6 significant figures throughout, the relative error tells us that only two significant figures of the answer are correct. If we didn't know the "correct" value and were not careful enough, we might have assumed all 6 digits were good.

The above example was designed to be a mild illustration of the problem. When catastrophic cancellation occurs it can be genuinely disastrous. Note that catastrophic cancellation can occur even when one of the two values $x, y$ is known exactly.
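A classic illustration is computing $1 - \cos(x)$ for small $x$: here $\cos(x)$ is known essentially to full double accuracy, yet the subtraction destroys most of the significant figures. The sketch below compares the naive formula with the mathematically equivalent form $2\sin^2(x/2)$, which involves no subtraction of nearly equal values.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        for (int k = 2; k <= 8; k += 2) {
            double x = std::pow(10.0, -k);
            double naive  = 1.0 - std::cos(x);                           // catastrophic cancellation
            double stable = 2.0 * std::sin(0.5 * x) * std::sin(0.5 * x); // no cancellation
            std::printf("x = 1e-%d   1 - cos(x) = %.15g   2*sin(x/2)^2 = %.15g\n",
                        k, naive, stable);
        }
    }

For $x = 10^{-8}$ the naive formula typically gives exactly 0, while the stable form still returns a value close to the true answer $x^2/2$.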

Exercise.

A fairly recent C++ specification (C++ 2011; the functions may be present in other systems too) includes the following functions in the <cmath> library.

$\mathrm{erf}(x) = \dfrac{2}{\sqrt{\pi}} \displaystyle\int_0^x e^{-t^2}\,dt$

and

$\mathrm{erfc}(x) = 1 - \dfrac{2}{\sqrt{\pi}} \displaystyle\int_0^x e^{-t^2}\,dt$

(These may or may not be present in your compiler package. Set your compiler to use C++ 2011 if possible.) Why do you think both are included even though they are so similar and it is so easy to get one of them from the other? What is wrong with the code "y = 1.0 - erf(x);", and for which values of x is this a problem? (Hint: $\int_0^\infty e^{-t^2}\,dt = \frac{\sqrt{\pi}}{2}$.)

Solution.

For large positive values of $x$ the value erf(x) is very close to 1, so there is catastrophic cancellation in y = 1.0 - erf(x); the computed difference has few (if any) correct digits. To obtain this small value accurately a different method of computation, as used in erfc, is required.
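The effect is easy to see by comparing the two directly; erf and erfc are available in <cmath> from C++ 2011 onwards.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        // For moderately large x, erf(x) rounds to a double extremely close to
        // (or exactly equal to) 1, so 1.0 - erf(x) has few or no correct digits,
        // while erfc(x) returns the small tail to full relative accuracy.
        for (int x = 1; x <= 7; x += 2)
            std::printf("x = %d   1 - erf(x) = %.15g   erfc(x) = %.15g\n",
                        x, 1.0 - std::erf((double) x), std::erfc((double) x));
    }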

4. Truncation error

The final kind of error is called truncation error. This is also easy to understand, given that we often calculate an approximation to a limit (such as an infinite sum or product) by taking only the first $N$ terms. For instance the quantity $e^x$ might be calculated using the first $N$ terms of the series

$\displaystyle\sum_{n=0}^{\infty} \frac{x^n}{n!}.$

The truncation error is the sum of the terms omitted,

$\text{truncation-error}(N) = \displaystyle\sum_{n=N}^{\infty} \frac{x^n}{n!}.$

Example.

An example program is provided at cpp/expseries.cpp.
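The idea is roughly as follows; this is a minimal sketch only, the actual cpp/expseries.cpp may differ, and exp_series is a name introduced here just for illustration.

    #include <cmath>
    #include <cstdio>

    // Approximate exp(x) by the first N terms of its Taylor series; the
    // difference from the library value illustrates the truncation error.
    double exp_series(double x, int N)
    {
        double term = 1.0, sum = 1.0;   // the n = 0 term is 1
        for (int n = 1; n < N; ++n) {
            term *= x / n;              // builds x^n / n! incrementally
            sum  += term;
        }
        return sum;
    }

    int main()
    {
        const double x = 1.0;
        for (int N = 2; N <= 14; N += 3)
            std::printf("N = %2d   partial sum = %.15f   error = %g\n",
                        N, exp_series(x, N), exp_series(x, N) - std::exp(x));
    }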

As a general rule, the more terms that are taken, the smaller the truncation error will be but the greater the round-off error. You will see this in example programs at various points in the course.

Exercise.

Explain in broad terms why it is generally the case that reducing truncation error increases round-off error.

Of course, truncating a series too early will introduce large errors. But in some extreme cases round-off error can dominate significantly and even make the algorithm "unstable" and give silly answers. It is sensible to decide how much accuracy is required and design the algorithm (and amount of truncation) with this in mind.

5. Order of convergence

Finally, we give a few words by way of introduction on orders of convergence. The object is just to give a taste for what you may see later. Some detailed examples will be seen in other pages, and you will see more discussion of this in later modules on numerical analysis.

A typical situation is that an iterative process is used to calculate an estimate $\tilde{X}$ of a quantity $X$. The process goes in stages $0, 1, 2, \ldots, n, \ldots$, with estimate $\tilde{X}_n$ and exact error $e_n = \tilde{X}_n - X$ at stage $n$. The order of convergence describes how $e_n$ behaves as a function of $n$.

If there is a constant $0 < \mu < 1$ such that $|e_{n+1}| \le \mu |e_n|$ for all but finitely many $n$ (i.e. there is some $n_0$ such that $|e_{n+1}| \le \mu |e_n|$ for all $n > n_0$), then the process is said to converge linearly, or to be first order. This is a rather common situation, and linear convergence usually gives a predictable estimate of the error, which is multiplied by (at most) $\mu$ at each step. For example, with $\mu = \frac{1}{2}$ the error is at least halved at each step.

If $|e_{n+1}| \le \mu |e_n|^{\gamma}$ for some constant $\gamma$ then the process converges with order $\gamma$. In the case $\gamma = 2$ we have $|e_{n+1}| \le \mu |e_n|^2$, and provided some $|e_n|$ is sufficiently small compared to $\mu$, we say the process converges quadratically. Quadratic processes (if they converge at all) converge rather quickly. However, they can be more unstable, with errors growing in size if the iterates do not get close enough to the required answer.
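As a concrete (if informal) illustration, the sketch below solves $x^2 - 2 = 0$ both by bisection, whose error bound is halved at every step (linear convergence with $\mu = \frac{1}{2}$), and by Newton's method, whose error is roughly squared at every step (quadratic convergence).

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const double root = std::sqrt(2.0);
        double a = 1.0, b = 2.0;          // bisection bracket: x^2 - 2 changes sign on [a, b]
        double x = 2.0;                   // Newton starting point
        for (int n = 1; n <= 6; ++n) {
            double mid = 0.5 * (a + b);
            if (mid * mid - 2.0 > 0.0) b = mid; else a = mid;   // keep the sign change
            x = 0.5 * (x + 2.0 / x);      // Newton step for f(x) = x^2 - 2
            std::printf("n = %d   bisection error bound = %.2e   Newton error = %.2e\n",
                        n, 0.5 * (b - a), std::fabs(x - root));
        }
    }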

You will see examples of first order convergence and second order convergence in these pages. For more details and more examples you must follow a more advanced module. The purpose here is just to explain the difference and alert you to an interesting topic that you may see elsewhere.