More on sorting algorithms

1. Introduction

These notes conclude the mini-project on insertion sort and Shell sort. Most of the information here is additional background and should be of interest but will not be assessed. However you need to pay attention to the details on your C++ source file that you must submit for the assessment.

The sorting methods you have been looking at are all called "comparison sorts" because we have chosen to make the number of comparisons between vector elements the measure of time. Comparison sorts can be written that work in worst case O ( n log n ) time (e.g. heap sort) and this is "best possible" (see below).

2. Commentary on insertion sort

Insertion sort is arguably the most useful "easy to program" sorting algorithm.

Its main advantage is that it is able to "detect" when a vector is already sorted in good time and quit early, and that it also works very quickly on "nearly sorted" vectors.

Its main disadvantage is that, in general, it requires a lot of "swaps" in a vector. When insertion sort is running you will find the computer does a lot of work that looks like: { double temp=v[i]; v[i]=v[j]; v[j]=temp; } .

We have not been measuring the number of such "swaps", but if this kind of swapping is considered costly (e.g. if you are sorting other data rather than double) you probably will want to consider a different method. (But note also that there are alternative implementations of arrays like "vector" for which insertion and deletion of elements are very simple and fast, usually at the expense of some other operation.)

In fact, if minimising the number of "swaps" is important, then a modified selection sort is possibly the best method!

Exercise.

(Optional.) Implement a modification of selection sort that does at most n swaps to sort a vector of n entries. (Do not include it in your C++ source file for submission though: do it as a new project.)

With some effort, insertion sort can be "improved" by an "interval halving" technique. The idea is, to insert the value x = v[n] into its correct place in v[0], v[1], ..., v[n] , you should test v[0] and v[n-1] first. If x is not between these the insertion can be done directly, otherwise test v[n/2] and then either v[(n/2)/2] or v[(n+n/2)/2] , and so on, always maintaining indices i, j so that x is between v[i] and v[j] and halving the interval from i to j each time. By this kind of search one can find the insertion point and then do the insertion.

Exercise.

(For ambitious students only.) Implement this method, as insertionsort_withintervalhalving. (Do not include it in your C++ source file for submission though: do it as a new project.) Assess its worst case and average case run time, measured in terms of number of comparisons.

Whether insertionsort_withintervalhalving is an improvement in practice will depend on your point of view (whether worst-case or average case is more important, what your typical data is, and how much you value your programming time).

3. Commentary on Shell sort

Shell sort is named after its inventor, Donald Shell, and is a very clever modification of insertion sort, with significantly improved performance. It has similar advantages in that it sorts nearly sorted data quickly and similar disadvantages in terms of the number of swaps.

How good shell sort works in practice or in "average cases" depends on the choice of the "gaps". No one knows what the best possible choice of gaps is. There are two schools of thought:

A number of open questions exist concerning the choice of gaps and whether the performance of Shell sort can be made to approach that of heap sort. Be warned: some rubbish is printed on the subject too. For example, the "gaps" values of the form 2 k has been advocated by otherwise quite responsible authors, but this choice of gaps yields worst case performance no better than O ( n 2 ) .

Because the choice of gaps is critical in shell sort it is sensible to re-write the shellsort function so that a particular sequence of gaps can be selected. See below for this change in your code (which is needed for the assignment.)

4. Comparison sorts in general

Here I will show that a worst case of O ( n log n ) is the best one can hope for in a comparison sort.

Lemma.

For positive n we have ( 1 + 2 n ) 1 2 n < 3 .

Proof: The sequence ( ( 1 + 2 n ) 1 2 n ) is a well-known increasing sequence converging to e = 2.71828 < 3 . (To see it is increasing note that ( 1 + 2 n ) n n +1 1 + 2 n + 1 by Bernouilli's inequality (see https://en.wikipedia.org/wiki/Bernoulli%27s_inequality ) hence ( 1 + 2 n ) n 2 ( 1 + 2 n + 1 ) n + 1 2 .)

Lemma.

For n at least 2 we have n ! n 1 2 n .

Proof: By induction on n . Let P ( n ) denote n ! n 1 2 n . We show that: P ( n ) implies P ( n +2 ) . Indeed, assuming P ( n ) , we have

( n + 2 ) ! = n ! ( n 2 + 3 n + 2 ) n 1 2 n ( n 2 + 3 n + 2 ) = ( n + 2 ) 1 2 n ( n n + 2 ) 1 2 n ( n 2 + 3 n + 2 ) = ( n + 2 ) 1 2 n 1 ( 1 + 2 n ) 1 2 n ( n 2 + 3 n + 2 ) ( n + 2 ) 1 2 n 1 3 ( n 2 + 3 n + 2 ) ( n + 2 ) 1 2 n ( n + 2 )

using the previous lemma. Note that as we have shown P ( n ) implies P ( n +2 ) we need two base cases: for n = 2 and n = 3 . Both are easy to check.

(Note too that better approximations to n ! than the above are possible by using Stirling's approximation to the factorial, https://en.wikipedia.org/wiki/Stirling%27s_approximation.)

Theorem.

Any comparison sort algorithm runs in worst case no better than O ( n log n ) .

Proof: Suppose the algorithm is correct and requires no more than f ( n ) comparisons on an input vector of size n . Without loss of generality assume the items in the input vector are distinct. There are n ! ways of ordering such a vector and only one of these is correctly sorted.

Since the algorithm uses at most f ( n ) comparisons, each of which has a "yes" or "no" answer, the answers to these comparisons distinguishes at most 2 f ( n ) orderings of the vector. (Think of a game of 20-questions where it is possible without guessing to identify one object out of 2 20 .) Therefore, since all possible orderings must be identified (and sorted) by the algorithm, we have 2 f ( n ) n ! .

It follows, taking logs to base 2, that f ( n ) log ( n ! ) 1 2 n log n , using the previous lemma. So the algorithm runs in time no better than O ( n log n ) .

5. Preparing your sorting C++ source file for submission

As usual, you will need to submit a C++ source file for marking via canvas. The C++ file should contain the four functions required (which will be assessed) and may contain other functions (which will be ignored, provided they compile correctly in C++ 2011 and do not affect the main four functions).

The C++ file must not use any C++ feature not covered in these web pages to this point. It is only allowed to #include the four header files <iostream> <vector> <cstdlib> <ctime> used in the original template, and must contain the following four functions with their signatures exactly as here.

int doOneInsertion(vector<double>& v, int n) { .. }
int insertionsort(vector<double>& vec) {  .. }
int doOneInsertionSS(vector<double>& v, int n, int h) { .. }
int shellsort(vector<double>& vec, const vector<int>& gaps) { .. }

These signatures should be character for character the same as yours. (The only possible variation is that you can have different variable names than mine above: n, h, v, vec and gaps.) The other parts must be the same else you will lose marks.

Note that the "gaps" vector is an input to shellsort. With the variable

const vector<int> g = {701, 301, 132, 57, 23, 10, 4, 1};

which can now be declared in main if you like, or can be a global variable as before, your program should be able to sort a vector of double with a line in main like

vector<double> data1;
// do something to put numbers in data1 here
count = shellsort(data1,g);

This code will be a useful starting point for someone interested in investigating the effect of other "gaps" vectors, to see how shell sort performs with different gaps.