A worked example: roots of a quadratic

This web page goes through a really simple problem in detail: finding the roots of a quadratic via the familiar formula. The equation to be solved is aX^2 + bX + c = 0, where the coefficients a, b, c are input via the console.

1. A first attempt

Most people's first attempt at a solution would look something like this.

// quadroot.cpp (click here to download)

// solves quadratics using the formula
#include <iostream>
#include <cmath>

using namespace std;

double p(double a,double b,double c,double x) { 
  return a*x*x + b*x + c;
}

int main() {
  cout.precision(16); // maximum precision for double, for illustration only
  double a,b,c; // coefficients
  double x1,x2; // two roots
  cout << "This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c." << endl;
  cout << "Input the values for a, b, c." << endl;
  cin >> a >> b >> c;
  cout << "The polynomial is p(X) = " << a << "X^2 + " << b << "X + " << c << endl;
  double d = b*b - 4.0*a*c; // discriminant
  if (d < 0.0) {
    cout << "No real solutions!" << endl;
  } else {
    x1 = ( -b + sqrt( d ) ) / 2.0 / a;
    x2 = ( -b - sqrt( d ) ) / 2.0 / a;
    cout << "The roots are " << x1 << " and " << x2 << endl;
    cout << "Check: " << endl;
    cout << "p(" << x1 << ") = " << p(a,b,c,x1) << endl;
    cout << "p(" << x2 << ") = " << p(a,b,c,x2) << endl;
  }
  return 0;
}

Here it is in action.

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
1 4 1
The polynomial is p(X) = 1X^2 + 4X + 1
The roots are -0.2679491924311228 and -3.732050807568877
Check: 
p(-0.2679491924311228) = -4.440892098500626e-16
p(-3.732050807568877) = 0

The checks that the supposed roots do at least make the polynomial very small (to around 10^-16, the approximate "accuracy" of "double") are reassuring, but they do not tell us the accuracy of the roots themselves. Unfortunately not all runs are quite so convincing.

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
1 1000000 1
The polynomial is p(X) = 1X^2 + 1000000X + 1
The roots are -1.00000761449337e-06 and -999999.999999
Check: 
p(-1.00000761449337e-06) = -7.614492369967252e-06
p(-999999.999999) = 0

and

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
1 1000000000 1
The polynomial is p(X) = 1X^2 + 1000000000X + 1
The roots are 0 and -1000000000
Check: 
p(0) = 1
p(-1000000000) = 1

Let us try to get some rough estimates of the accuracy of this program. Firstly, we have seen that "double" is represented in binary to approximately 16 decimal places. Let us simplify this slightly and assume that exactly 16 decimal places are correct. (There are some cases when a double is not quite correct to this accuracy, but we ignore these for now.) In the case when x ≠ 0, to say we have x correct to 16 decimal digits is to say the relative error is at most 5 × 10^-16. (The relative error of an estimate x̃ to a real value x is the ratio of the absolute error |x - x̃| to the actual value |x|. See Errors and precision in programming with floating point.)

Next we assume that a, b, c are all given exactly and are all represented exactly by doubles. In other words, there is no experimental error in the input. That is certainly the case with the coefficients in the three examples just given, because they are plain integer values that are not too big.

The program computes the values -b, b^2 - 4ac, and 2a. Of these, the most complicated is the second. But if a, b, c are known exactly, then the computations of b^2 and 4ac, rounded to 16 decimal places, should be accurate to 16 decimal places. So the relative error in b^2 and in 4ac will be at most 5 × 10^-16. One could argue that if b^2 and 4ac are close together, the relative error of the difference is much greater. That's true, but as it happens this issue is of only secondary importance for this problem, and the problems we are seeing are due to something else. We will return to this issue later, but for now we assume that b^2 and 4ac are rather different, so that the severe loss of relative accuracy (called "catastrophic cancellation") in b^2 - 4ac does not occur. For a similar reason, taking square roots of very small numbers can introduce large relative errors (because the derivative of y = √x is large near x = 0), but we again assume that this is not the case and analyse it properly later. Indeed, in all three cases above b^2 is large compared to 4ac.

We conclude that, except in these extreme cases, we might expect a relative error in b^2 - 4ac approaching 16 decimal places, at least as small as 10^-15, say. The same goes for -b and 2a.

Now look at the ± in -b ± √(b^2 - 4ac). In one case the signs of -b and ±√(b^2 - 4ac) are the same, and in the other case they are different. (Which way round this is depends on whether b is positive or negative.) When the signs are the same, the operation causes no real problem, so

Proposition.

If b is positive then (-b - √(b^2 - 4ac)) / 2a will be a good approximation to one of the roots. (Something similar holds if b is negative and we take the other sign for the square root.)

This doesn't quite say how good, but at least if b^2 and 4ac are not approximately equal we might hope to get close to 15 decimal places of accuracy. But

Proposition.

If b is positive and b^2 is large compared to 4ac then catastrophic cancellation occurs in the addition in -b + √(b^2 - 4ac). (Something similar holds if b is negative and we take the other sign for the square root.)

Looking at the second and third examples, we can see this is the case, and one of our roots is probably very poor. It does not take much checking to see that in each case it is the first of the two roots that is suspect. Even in the last case, where the polynomial evaluated to 1 rather than 0, the root of approximately -1000000000 is rather accurate; it is the other root, 0, that is probably very inaccurate.

In any case, a nagging worry remains: we do not know how many digits in the answer are good. For example, consider this run.

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
1 2.0000000001 1
The polynomial is p(X) = 1X^2 + 2.0000000001X + 1
The roots are -0.9999900000495863 and -1.000010000050414
Check: 
p(-0.9999900000495863) = 1.110223024625157e-16
p(-1.000010000050414) = 2.220446049250313e-16

2. A program with error estimates

Rather than give exact analytic estimates of the errors (as a formula perhaps) we try to estimate the errors introduced at each stage. We will switch between absolute error and relative error as needed, and recall that they are related by

rel-err(X) = err(X) / X ≈ err(X) / X̃

As noted, we believe we can trust the computer to give us full precision for double arithmetic on numbers known exactly. Full precision for double, at least for nonzero numbers that are representable in the format, is a relative error of about 5 × 10^-16. For simplicity we use this figure. If x is so represented, its approximate absolute error is 5|x| × 10^-16.

What we do is take the last program apart and analyse the errors step by step, using the computer to do the calculations for us. I offer my program here and explain it in the notes following.

// quadrooterr.cpp (click here to download)

// solves quadratics, with error estimates
#include <iostream>
#include <cmath>

using namespace std;

double p(double a,double b,double c,double x) { 
  return a*x*x + b*x + c;
}

double basicerr(double x) { 
  return abs(x) * 5.0E-16; 
}

double numdigits(double x, double x_err) { 
  return floor(log10(abs( x ))) - floor(log10(abs( 2.0 * x_err )));
}

int main() {
  cout.precision(16); // maximum precision for double, for illustration only
  double a,b,c; // coefficients
  cout << "This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c." << endl;
  cout << "Input the values for a, b, c." << endl;
  cin >> a >> b >> c;
  cout << "The polynomial is p(X) = " << a << "X^2 + " << b << "X + " << c << endl;
  //
  // now compute various values with estimates of errors, assuming a,b,c are exact
  //
  double bsquared = b*b; 
  double bsquared_err = basicerr(bsquared); // estimate of max absolute error in bsquared
  double fourac = 4.0*a*c;
  double fourac_err = basicerr(fourac); // estimate of max absolute error in fourac
  double discrim = bsquared - fourac; // discriminant
  double discrim_err = bsquared_err + fourac_err + basicerr(discrim); // estimate of max absolute error in discriminant
  if (discrim < 0.0) {
    cout << "No real solutions!" << endl;
    return 0; // exit main
  } 
  // 
  // If there are real solutions, continue as before
  //
  double rootdiscrim = sqrt(discrim); // square root of the discriminant
  double rootdiscrim_err = 0.5 * discrim_err / rootdiscrim + basicerr(rootdiscrim); // estimate of absolute error in rootdiscrim
  double top1 = -b + rootdiscrim;
  double top2 = -b - rootdiscrim;
  double top1_err = rootdiscrim_err + basicerr(top1);
  double top2_err = rootdiscrim_err + basicerr(top2);
  double x1 = top1 / 2.0 / a; // possible problem if a = 0
  double x2 = top2 / 2.0 / a; // possible problem if a = 0
  double x1_err = abs(top1_err / 2.0 / a) + basicerr(x1);
  double x2_err = abs(top2_err / 2.0 / a) + basicerr(x2);
  //
  // Compute number of digits, output answer and give error estimate
  //
  double x1_digits = numdigits( x1, x1_err );
  double x2_digits = numdigits( x2, x2_err );
  cout << "The roots are:" << endl;
  cout << x1 << " pm " << x1_err << " (" << x1_digits << " significant digits)" << endl;
  cout << x2 << " pm " << x2_err << " (" << x2_digits << " significant digits)" << endl;
  cout << "Check: " << endl;
  cout << "p(" << x1 << ") = " << p(a,b,c,x1) << endl;
  cout << "p(" << x2 << ") = " << p(a,b,c,x2) << endl;
  return 0;
}

You see that we have broken the calculation of long formulas into several steps. (This is generally speaking a very good thing. It takes the computer no longer to do several short steps than one long equivalent step, but the program is easier to read. In fact we may spot values or formulas that can be re-used, which may actually make the computation shorter.) For a quantity xxxx, the quantity xxxx_err is an estimate of the maximum absolute error we expect. Some of the errors, such as those in bsquared and fourac, are computed just as indicated above using the presumed precision of double. In fact the computation |x| × 5 × 10^-16 is used so often that it is programmed as a simple function. (This is also good programming style. As well as saving typing, it means the number 5 × 10^-16 can be changed throughout in one single place.)

So you see how bsquared_err and fourac_err are computed. The discriminant is the difference of the values bsquared and fourac, so the maximum error is the sum of their errors plus a possible additional round-off error. This gives discrim_err and takes us through the first section of the program.

The estimation of errors in the next section is a bit more tricky. Clearly we do rootdiscrim = sqrt(discrim); but what is the possible error here? Writing Δ for the discriminant and Δ̃ for its approximation, we have Δ̃ = Δ + h, where the error h is in absolute value at most discrim_err. We can reasonably assume that h is small compared to Δ, and expanding as a series,

√(Δ + h) = √Δ · √(1 + h/Δ) = √Δ (1 + (1/2)(h/Δ) + …)

So the relative error in √Δ is approximately (1/2)(h/Δ), and the absolute error is at most (1/2)(h/√Δ) plus a round-off term. (Or, done another way, you can observe that √(x + h) ≈ √x + h·d(√x)/dx = √x + h/(2√x). From this you get the same estimate of the error.) This explains the line rootdiscrim_err = 0.5 * discrim_err / rootdiscrim + basicerr(rootdiscrim);.

The absolute errors in a sum and a difference are at most the sum of the errors plus round-off, hence top1_err and top2_err; and the absolute error in a quotient such as top1/2.0/a, of a variable top1 by an exactly known quantity 2a, is the quotient of the error by the same quantity, plus round-off. This explains the next section of code.

Finally, the number of good digits can be estimated from the relative error (for a nonzero quantity) by noting that if the relative error is within 5 × 10^-n then n digits are good. This leads to the logarithmic formula in the function numdigits(x, x_err). The final part of the program prints the answers with their estimated errors and expected number of digits of accuracy.

We can run the program with the same test data, as follows.

// quadrooterr.txt (click here to download)

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 4X + 1
The roots are:
-0.2679491924311228 pm 2.288675134594813e-15 (14 significant digits)
-3.732050807568877 pm 5.752776749732568e-15 (14 significant digits)
Check: 
p(-0.2679491924311228) = -4.440892098500626e-16
p(-3.732050807568877) = 0
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 1000000X + 1
The roots are:
-1.00000761449337e-06 pm 5.000000000010001e-10 (3 significant digits)
-999999.999999 pm 1.499999999999e-09 (14 significant digits)
Check: 
p(-1.00000761449337e-06) = -7.614492369967252e-06
p(-999999.999999) = 0
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 1000000000X + 1
The roots are:
0 pm 5.000000000000001e-07 (-inf significant digits)
-1000000000 pm 1.5e-06 (15 significant digits)
Check: 
p(0) = 1
p(-1000000000) = 1
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 2.0000000001X + 1
The roots are:
-0.9999900000495863 pm 5.000099793149091e-11 (9 significant digits)
-1.000010000050414 pm 5.000099795149091e-11 (10 significant digits)
Check: 
p(-0.9999900000495863) = 1.110223024625157e-16
p(-1.000010000050414) = 2.220446049250313e-16

Using these rather conservative estimates of errors, we see that in the first case both roots are accurate to most of the digits shown, while in the second and third cases only one of the roots is good to a reasonable number of digits. The answers in the fourth example are of intermediate quality.

3. A program with alternative formulas

The conclusion of the last section is that the "famous" formula for the roots of a quadratic is good analytically, but sometimes poor numerically. What can we do about this? One option is to find a better algorithm; for cubics and higher this is certainly the preferred approach, and we will discuss the bisection method later. The alternative is to look for alternative formulas, and that is what we do here.

In fact there are other useful, but not well-known, formulas for the roots of a quadratic. We have

(-b + √(b^2 - 4ac)) / 2a = ((-b + √(b^2 - 4ac)) / 2a) × ((-b - √(b^2 - 4ac)) / (-b - √(b^2 - 4ac))) = 2c / (-b - √(b^2 - 4ac))

and

(-b - √(b^2 - 4ac)) / 2a = 2c / (-b + √(b^2 - 4ac))

So the roots are 2c / (-b ± √(b^2 - 4ac)). This formula is less well known because it is less useful analytically, but it turns out to be equally useful numerically. The new program computes the roots by both methods and, comparing the errors, chooses the better value.

// quadrootalt.cpp (click here to download)

// solves quadratics, with error estimates and alternative formulas
#include <iostream>
#include <cmath>

using namespace std;

double p(double a,double b,double c,double x) { 
  return a*x*x + b*x + c;
}

double basicerr(double x) { 
  return abs(x) * 5.0E-16; 
}

double numdigits(double x, double x_err) { 
  return floor(log10(abs( x ))) - floor(log10(abs( 2.0 * x_err )));
}

int main() {
  cout.precision(16); // maximum precision for double, for illustration only
  double a,b,c; // coefficients
  cout << "This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c." << endl;
  cout << "Input the values for a, b, c." << endl;
  cin >> a >> b >> c;
  cout << "The polynomial is p(X) = " << a << "X^2 + " << b << "X + " << c << endl;
  //
  // now compute various values with estimates of errors, assuming a,b,c are exact
  //
  double bsquared = b*b; 
  double bsquared_err = basicerr(bsquared); // estimate of max absolute error in bsquared
  double fourac = 4.0*a*c;
  double fourac_err = basicerr(fourac); // estimate of max absolute error in fourac
  double discrim = bsquared - fourac; // discriminant
  double discrim_err = bsquared_err + fourac_err + basicerr(discrim); // estimate of max absolute error in discriminant
  if (discrim < 0.0) {
    cout << "No real solutions!" << endl;
    return 0; // exit main
  } 
  // 
  // If there are real solutions, continue as before
  //
  double rootdiscrim = sqrt(discrim); // square root of the discriminant
  double rootdiscrim_err = 0.5 * discrim_err / rootdiscrim + basicerr(rootdiscrim); // estimate of absolute error in rootdiscrim
  double top1 = -b + rootdiscrim;
  double top2 = -b - rootdiscrim;
  double top1_err = rootdiscrim_err + basicerr(top1);
  double top2_err = rootdiscrim_err + basicerr(top2);
  double x1 = top1 / 2.0 / a; // possible problem if a = 0
  double x2 = top2 / 2.0 / a; // possible problem if a = 0
  double x1alt = 2.0 * c / top2; // possible problem if the root is zero
  double x2alt = 2.0 * c / top1; // possible problem if the root is zero
  double x1_err = abs(top1_err / 2.0 / a) + basicerr(x1);
  double x2_err = abs(top2_err / 2.0 / a) + basicerr(x2);
  double x1alt_err = abs(2.0 * c * top2_err / top2 / top2) + basicerr(x1alt);
  double x2alt_err = abs(2.0 * c * top1_err / top1 / top1) + basicerr(x2alt);
  //
  // Now choose best answer
  //
  if ( abs(x1alt_err) < abs(x1_err) ) { x1 = x1alt; x1_err=x1alt_err; }
  if ( abs(x2alt_err) < abs(x2_err) ) { x2 = x2alt; x2_err=x2alt_err; }
  //
  // Compute number of digits, output answer and give error estimate
  //
  double x1_digits = numdigits( x1, x1_err );
  double x2_digits = numdigits( x2, x2_err );
  cout << "The roots are:" << endl;
  cout << x1 << " pm " << x1_err << " (" << x1_digits << " significant digits)" << endl;
  cout << x2 << " pm " << x2_err << " (" << x2_digits << " significant digits)" << endl;
  cout << "Check: " << endl;
  cout << "p(" << x1 << ") = " << p(a,b,c,x1) << endl;
  cout << "p(" << x2 << ") = " << p(a,b,c,x2) << endl;
  return 0;
}

The only part that needs further explanation is the estimation of the error in the calculation x1alt = 2.0 * c / top2, and similarly for x2alt. Again, 2c is a quantity we suppose we know exactly. If we estimate a quantity C/x by computing C/(x + h), where h is the absolute error in x, assumed small relative to x, then

C/(x + h) ≈ C/x + h·C·d(1/x)/dx = (C/x)(1 - h/x)

So the relative error is approximately h/x and the absolute error is approximately |hC/x^2|. This is the formula used in the code.

Running the program with the same test data, we get

// quadrootalt.txt (click here to download)

This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 4X + 1
The roots are:
-0.2679491924311227 pm 4.13030787576954e-16 (15 significant digits)
-3.732050807568877 pm 5.752776749732568e-15 (14 significant digits)
Check: 
p(-0.2679491924311227) = 0
p(-3.732050807568877) = 0
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 1000000X + 1
The roots are:
-1.000000000001e-06 pm 1.500000000002e-21 (15 significant digits)
-999999.999999 pm 1.499999999999e-09 (14 significant digits)
Check: 
p(-1.000000000001e-06) = 0
p(-999999.999999) = 0
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 1000000000X + 1
The roots are:
-1e-09 pm 1.5e-24 (15 significant digits)
-1000000000 pm 1.5e-06 (15 significant digits)
Check: 
p(-1e-09) = 0
p(-1000000000) = 1
This program solves the equation p(X)=0 where p(X) = aX^2 + bX + c.
Input the values for a, b, c.
The polynomial is p(X) = 1X^2 + 2.0000000001X + 1
The roots are:
-0.9999900000495863 pm 4.999999794149066e-11 (10 significant digits)
-1.000010000050414 pm 5.000099795149091e-11 (10 significant digits)
Check: 
p(-0.9999900000495863) = 1.110223024625157e-16
p(-1.000010000050414) = 2.220446049250313e-16

You will note a small change (a possible improvement, but hardly important) in the first case, significant improvements in the second and third cases as the cancellation of digits in -b + √(b^2 - 4ac) is avoided, and little improvement in the fourth case, which remains slightly problematic. The fourth case is less accurate because of cancellation of digits in b^2 - 4ac itself; to avoid that problem a different formula or algorithm must be used.

4. Conclusions

Careful analysis of errors pays off! The additional confidence we gain in the answers is significant, and using the computer to do much of the analysis saves a lot of work. On the other hand, the calculation of the errors takes a significant amount of computer time, certainly more than double the time required just to compute the formulas themselves. If this work were carried out inside a large loop, the extra time required would be very noticeable.

In cases when the "check" works to an accuracy of about 10^-16, this does not mean the answers are correct to the same relative accuracy. In the final case, the roots x1, x2 had p(x1), p(x2) of the order of 10^-16, but the roots themselves are unlikely to be this accurate. Be suspicious of simple checks like this!

Note that the final program is by no means perfect. In terms of possible bugs, there are a number of places where division by zero can, and sometimes does, occur; C++ quietly records "infinity" in such cases, which may or may not be appropriate. The program also does not check that all intermediate values are represented in a reasonable way in the "double" format, or whether such values are stored to sufficient accuracy; whether this matters for any particular input is not checked. The program is not perfect on aesthetic grounds either: there is repeated code, and it might be better to compute the roots in an array. It should also never be necessary to compute all four of x1, x2, x1alt, x2alt and their errors: two of these suffice, but it is necessary to work out which two.

Suggestions.

  • Break your programming into many short steps. Use plenty of variables with appropriate names.
  • Always use your programming skills (e.g. functions) to make the program as short and elegant as possible: additional care almost always pays off.
  • Use the computer to do most of the hard work.
  • Comment the code with any possible issues and return to them later.
  • Consider adding a software "switch" to turn off error analysis in long fast runs when the errors have already been determined.
  • Consider re-writing the program to use "long double" or some other longer precision format and comparing the answers.