Other sorting algorithms

1. Introduction

For these exercises you will be looking at some sorting algorithms.

The topic of "sorting algorithms" is huge. It would be easy to devote a full module on them. (And this would be a very interesting module indeed.)

You have seen algorithms for "selection sort" and "heap sort", and will know how differently these behave. (Heap sort is very good and selection sort is really very bad. You do not have to understand heap sort to do these exercises, but it helps to know that some algorithms are much better than others and in what ways they are better.) Here we will look at two others which are intermediate in performance: Insertion sort and Shell sort.

(Note that I have made the arbitrary choice that we will sort vectors in decreasing order throughout. That won't always be the same as examples you find on the internet, but is necessary for assessments.)

How do we measure "performance"? Here, you are to use the measure which is:

As well as being straightforward, this measure of time is of practical importance too. Swapping entries in arrays can often be made very fast using pointers or references. The task of deciding whether one entry is bigger than another is admittedly fairly fast for double values but for more complex data this task is often highly significant and cannot be reduced.

You will be looking at Insertion sort and Shell sort, sorting a vector and counting the number of comparisons in this way. For assessment purposes, I will test to see if you have the right algorithm by checking that your methods count the correct number of comparisons on certain input data. If this count is wrong then your algorithm is also wrong.

Insertion sort and Shell sort are intermediate in performance: both are better than selection sort and arguably not as good as heap sort in general, though both the Insertion and Shell methods are much simpler than heap sort and have the ability to "stop early" when they detect the array is already sorted, making them preferable and of practical importance in certain circumstances such as when the array is already partially sorted, or if the array is fairly short.

As always, various other web pages in this section should help you. In particular the summaries on programming constructs (including "break"), and on a summary of expressions may be useful, as well as the pages on arrays. The issue already discussed on "segmentation faults" that can occur when using indices for vectors or arrays outside the correct range is an important one and you should pay it attention, putting "guards" in your program from the very start. You may also like to read the page on array sorting and run time. You may find that turning on compiler warnings (using "-Wall" in the compiler) will save you lots of time in the long run. (Some IDEs do this by default.)

You can safely index your arrays with int going from 2 -31 to 2 31 - 1 . None of the arrays here will be so long that the int overflows. However with compiler warnings enabled you might prefer to use unsigned int going from 0 to 2 32 - 1 . Or on some computers you may have to use unsigned long int going from 0 to 2 64 - 1 . Do ask the lecturer if you are not sure. But make sure the signature of the functions required by the assessment exactly matches the ones given and don't change int to unsigned int there.

Some code here does require C++2011. This should be the default but if it isn't it can be enabled with "-std=c++11" in the compiler. So to compile you could write the command,

c++ -std=c++11 -Wall myprog.cpp -o myprog

A program comparing selection sort and heap sort with counting of the number of comparisons is available as heapsort_count.cpp. For example, the main selection sort function becomes

/* version of selectionsort that sorts a vector and returns the number of comparisons made */
int selectionsort(vector<double>& vec) {
  int count = 0; // initialise number of comparisons as zero
  int s = vec.size();
  for (int i=0; i<s; i++) {
    for (int j=i+1; j<s; j++) {
      count++; // There's a comparison coming up!
      if (vec[i] < vec[j]) {
        double x = vec[i];
	vec[i] = vec[j];
	vec[j] = x;
      }
    }
  }
  return count; // return number of comparisons
}

The count variable is initialised as 0 and incremented "count++;" just before the comparison is made "if (vec[i] < vec[j]) { ... }". It is done this way to make sure it is impossible to do one without also doing the other. You should get used to incrementing like this just before comparisons are made.

You can print a vector of doubles with code like this.

/* Print a vector to standard output */
void vector_print(vector<double>& data) {  
  int nentries = data.size();
  for (int n = 0 ; n<nentries ; n++) {
    cout << data[n] << " ";
  }
  cout << endl;
}

The programming for two weeks is explained in this web page. There will be a follow up web page with final details of the assessments.

2. Insertion sort

Insertion sort is rather like selectionsort in many ways, but works faster because the program "detects" when the array is successfully sorted and can therefore stop sooner in many cases. It works by repeatedly inserting a single item in a sorted list like you might sort a hand of cards in Bridge. All that is required is to work out which position to insert the item.

Here is an example of insertion sort in action.

0,2,1,3,5,4   // original list
0,  2,1,3,5,4 // the 0 is "sorted", and the rest must be inserted
2,0  ,1,3,5,4 // insert 2 into the sorted list
2,1,0  ,3,5,4 // insert 1 into the sorted list
3,2,1,0  ,5,4 // insert 3 into the sorted list
5,3,2,1,0  ,4 // insert 5 into the sorted list
5,4,3,2,1,0   // insert 4 into the sorted list and finished

Note that at any time the list is considered in two parts, a sorted part and a non-sorted part. The "insertion" inserts the first nonsorted item into the sorted part. The gap between the parts printed above is just to help you understand what is happening: there is no such gap on the computer.

Here is another example, counting the number of comparisons. It shows how a nearly sorted list can be sorted very quickly.

4,5,2,1,3    // original list
4,  5,2,1,3  // the 4 is "sorted"
5,4  ,2,1,3  // insert the 5, one comparison with 4
5,4,2  ,1,3  // insert the 2, one comparison with 4
5,4,2,1  ,3  // insert the 1, one comparison with 2
5,4,3,2,1    // insert the 3, three comparisons with 1,2,4

so the total number of comparisons is 6. Note that the comparisons are going from right to left for each insertion.

We'll split the insertion sort algorithm into two parts: one to do one insertion and one to put it together.

/** 
 * Part of insertion sort
 *
 *  assuming: 
 *      entries v[0]..v[n-1] are in decreasing order and 
 *    	n is positive and less than the size of v
 *  what it does: 
 *      This function "inserts" the value v[n] into the list, 
 *      comparing v[n] only with values to its immediate left and
 *      swapping some of the entries as required, so that 
 *      the resulting v[0]..v[n-1],v[n] are  in decreasing order. 
 *  at the end:
 *      the entries v[0]..v[n] are the same values as they
 *      were orginally but possibly in a different order
 *  return value:
 *      the number of comparisons (<, <= , > or >=) that were 
 *      performed on double values from v[]
 *
 * For example if v[] contains 23 13 10 4 2 1 12 17 2 then after 
 * doOneInsertion(v,6), v[] contains 23 13 12 10 4 2 1 17 2 and 
 * 5 comparisons were needed, so 5 is returned.
 */
int doOneInsertion(vector<double>& v, int n) { 
	// ... 
}

With this, it is easy to understand and implement the insertion sort algorithm. It goes as follows. Since the first item of the vector on its own is always sorted :) we go with increasing index starting at 1 and repeatedly insert that item into the list of previously sorted items.

/** 
 * Sorts an arbitrary vector vec by insertion sort, 
 * repeatedly using doOneInsertion(...)
 *
 * Returns the total number of comparisons that were made. 
 */
int insertionsort(vector<double>& vec) {
    // ...
}

Insertion sort.

Implement these functions. Test them thoroughly.

There is a template to help you, which contains some suggestions for testing. (Your "main" will not be assessed. Only the two functions above.) The template also contains some "helper" functions for you to use in "main". (NOTE: There are many many examples in these web pages. If you need even more examples these "helper" functions are well-worth looking at.)

// insertionsort-template.cpp (click here to download)

/** creates a random vector and sorts it by insertionsort. */
#include <iostream>
#include <vector>
#include <cstdlib>
#include <ctime>
using namespace std;

/** 
 * Part of insertion sort
 */
int doOneInsertion(vector<double>& v, int n) {
	// over to you...	
 }  

/** 
 * Sorts an arbitrary vector vec by insertion sort, 
 * repeatedly using doOneInsertion(...)
 *
 * Returns the total number of comparisons that were made. 
 */
int insertionsort(vector<double>& vec) {  
	// over to you...
}

/* Print a vector to standard output, cout. */
void printVector(vector<double>& data) {  
  int nentries = data.size();
  for (int n = 0 ; n<nentries ; n++) {
    cout << data[n] << " ";
  }
  cout << endl;
}

/* Returns true if vec is sorted in nonincreasing order, false
   otherwise, and prints a message to cout. */
bool testSorted(vector<double>& vec) {
  int s = vec.size();
  for (int i=1; i<s; i++) {
    if (vec[i-1] < vec[i]) {
      cout << "Data is not sorted" << endl;
      return false;
    }
  }
  cout << "Data is sorted" << endl;
  return true;  
}

/* A version of selectionsort that returns the number of comparisons made. */
int selectionsort(vector<double>& vec) {
  int count = 0;
  int s = vec.size();
  for (int i=0; i<s; i++) {
    for (int j=i+1; j<s; j++) {
      count++; // There's a comparison coming up!
      if (vec[i] < vec[j]) {
	double x = vec[i];
	vec[i] = vec[j];
	vec[j] = x;
      }
    }
  }
  return count;
}

/* fills the vector "data" with nentries chosen
 * "randomly" from 0..1 */
void fillVector(vector<double>& data, int nentries) {
  data.clear();
  for (int n = 0 ; n<nentries ; n++) {
    double p = double(rand())/RAND_MAX; 
    // you may prefer to replace the above line with
    //   cin >> p;
    // so that you can enter your own data from the keyboard.
    // or put this version as a new function to be called from main.
    data.push_back(p);
  }	
}

/* A "main" fo you to tinker with to help you get started in 
 * testing your functions. This will not be assessed, but
 * thorough testing is still your responsibility.
 * un-commentout the lines with // as you wish... 
 */
int main () {
  int count = 0;
  vector<double> data1;
  vector<double> data2;
  vector<double> data3;
  int nentries; 
  cout << "Number of entries? " << endl;
  cin >> nentries;

  /* initialize random seed: */
  srand(time(NULL));
  // OR
  // srand(102938); // change this number if needed

  fillVector(data1, nentries);
  // vectors can be copied in C++ using = like any other variables
  data2 = data1;
  data3 = data1;

  // printVector(data1);
  
  testSorted(data1); // check if the vector is already sorted
  cout << "Sorting.." << endl;
  
  cout << "Using selectionsort..." << endl;
  count = selectionsort(data1);
  testSorted(data1);
  cout << "selectionsort complete with " << count << " comparisons" << endl;

  cout << "Using insertionsort..." << endl;
  count = insertionsort(data2);
  // printVector(data2);
  testSorted(data2);
  cout << "insertionsort complete with " << count << " comparisons" << endl;

  // cout << "Using shellsort..." << endl;
  // count = shellsort(data3);
  // printVector(data3);
  // testSorted(data3);
  // cout << "shellsort complete with " << count << " comparisons" << endl;

  return 0;
}


The template contains a function to fill a vector with pseudorandom numbers. This is done in "main" with

fillVector(data1, nentries);

Before this you have a choice:

  /* initialize random seed: */
  srand(time(NULL));
  // OR
  // srand(102938); // change this number if needed

The idea is that you choose something different every time the program is run (with a "seed" for the random numbers based on the time) or you can explicitly specify a seed (102938, or anything you like) so that the "random" vector is the same each time the program runs. Both approaches can be useful for example if your program contains a bug and you want to repeatedly test it on the same data.

3. Discussion

The main advantage of insertion sort over selection sort is that it is able to "quit early" when it detects the array is already sorted. This means that it is likely to work a little bit faster than selection sort for most "random" data, but should work much faster than selection sort when the array is already "partially sorted".

On the other hand, insertion sort is not any faster than selection sort for some kinds of data that is "badly sorted" to start with.

You should be able to see a significant improvement with insertion sort over selection sort when you replace fillVector with the following function that generates "partially sorted random data".

/* fills the vector "data" with nentries chosen
   "in a partially sorted way" from 0..1 */
void fillVectorPS(vector<double>& data, int nentries) {
  data.clear();
  /* create data in "blocks".  Block n=k contains
     values from k/10 to (k+1)/10  Blocks count
     downwards. */
  int k = 10; // number of block we are in
  int nblocks = 10; // number of blocks
  int n = 0; // index of next item;
  double blocksize = 1.0/k; 
  while(n<nentries) {
    double p = double(rand())/RAND_MAX; // from 0 to 1
    p = p*blocksize + k*blocksize;
    data.push_back(p);
    n++;
    if (n%nblocks == 0) k--;
  }	
}

(You don't need to remove fillVector. Simply add this function to your file near fillVector and change the line in "main" that refers to fillVector replacing it with fillVectorPS.)

There are plenty of other ways of "partially sorting" data to make insertion sort work faster. You can experiment with these if you have time.

The conclusion is that insertion sort is often no better than selection sort, but sometimes is much better. We will exploit this in the next section.

4. Shell sort

Shell sort is a variation of insertion sort which is particularly interesting and has some mathematical open problems associated with it. At first sight, it is like doing insertion sort many times, when only the final time does it "properly".

A better way of thinking about what it is doing is that the first few passes over the data are roughly sorting it, and then the last pass can be much quicker.

Definition.

A subsequence of a vector vec is a sequence of entries of the form vec[s],vec[s+h],vec[s+2*h],...,vec[s+k*h], where s is the starting index, h is the gap (a positive int), and n == s+k*h is the final index.

The idea is that instead of sorting the whole of a vector we can choose to sort a subsequence of it.

Here is a worked example,

0,13,8,4,7,14,2,5,6,9,10,1,12,3,11,15 // original data
// sort the 8 subsequences of gap 8 and length 2
6,13,8,4,7,14,2,5, 0,9,10,1,12,3,11,15 // 0,6 swapped
6,13,8,4,7,14,2,5, 0,9,10,1,12,3,11,15 // 13,9 remain
6,13,10,4,7,14,2,5, 0,9,8,1,12,3,11,15 // 8,10 swapped
6,13,10,4,7,14,2,5, 0,9,8,1,12,3,11,15 // 4,1 remain
6,13,10,4,12,14,2,5, 0,9,8,1,7,3,11,15 // 7,12 swapped
6,13,10,4,12,14,2,5, 0,9,8,1,7,3,11,15 // 14,3 remain
6,13,10,4,12,14,11,5, 0,9,8,1,7,3,2,15 // 2,11 swapped
6,13,10,4,12,14,11,15, 0,9,8,1,7,3,2,5 // 5,15 swapped
// sort the 4 subsequences of gap 4 and length 4
12,13,10,4, 7,14,11,15, 6,9,8,1, 0,3,2,5 // 12,7,6,0 put in order
12,14,10,4, 7,13,11,15, 6,9,8,1, 0,3,2,5 // 14,13,9,3 put in order
12,14,11,4, 7,13,10,15, 6,9,8,1, 0,3,2,5 // 11,10,8,2 put in order
12,14,11,15, 7,13,10,5, 6,9,8,4, 0,3,2,1 // 15,5,4,1 put in order
// sort the 2 subsequences of gap 2 and length 8
12,14, 11,15, 10,13, 8,5, 7,9, 6,4, 2,3, 0,1 // 12,11,10,8,7,6,2,0 put in order
12,15, 11,14, 10,13, 8,9, 7,5, 6,4, 2,3, 0,1 // 15,14,13,9,5,4,3,1 put in order
// finish with a normal insertion sort
15,14,13,12,11,10,9,8,7,5,6,4,3,2,1,0 // 24 comparisons only were needed for this

The details of what you should do are all explained precisely in the comments in the following.

/** 
 * Part of Shell sort
 * 
 * A subvector of a vector v with entries v[0],v[1],...,v[s-1]
 * is a sequence of entries v[i],v[i+h],v[i+2*h],...,v[i+k*h]
 * (where s is the size of the original vector, 0<=i<h and i+k*h<s).  
 * So a subvector is like a subsequence (RAC:) that is specified by
 *   h>0,    the gap
 *   0<=i<h, the start point
 * The following function is exactly like "doOneInsertion" in
 * insertionsort except that it only works on the
 * the subvector  v[i],v[i+h],v[i+2*h],...,v[i+k*h]
 * where n = i+k*h.  (If you need it, the value i can be
 * calculated from n and h, but you probably don't.)
 *
 *  assuming: 
 *      entries v[i],v[i+h],v[i+2*h],...,v[i+(k-1)*h] are in decreasing 
 *      order, n == i+k*h is less than the size of v, i<h is as above
 *  what it does: 
 *      This function "inserts" the value v[n] into the above list, 
 *      comparing only with items in the same subsequence to its immediate left,
 *      swapping some of the entries as required, so that 
 *      v[i],v[i+h],v[i+2*h],...,v[i+k*h] are in decreasing order. 
 *  at the end: 
 *      the entries v[i],v[i+h],v[i+2*h],...,v[i+k*h] are the same 
 *      values as they were orginally, but possibly in a different order.
 *  return value: 
 *      the number of comparisons (<, <=, >= or >) that were made 
 *      on double values from v[]
 *
 * For example if v[] contains 23 3 13 4 10 7 12 17 2 then after 
 * doOneInsertion(v,6,2), v[] contains 23 3 13 4 12 7 10 17 2 and 
 * 2 comparisons were needed (12 with 10 and 12 with 13) so 2 is returned.
 */
int doOneInsertionSS(vector<double>& v, int n, int h) {
	// over to you...
}

So the old "doOneInsertion" is the special case of this with h==1. This should help you to program it: you should be able to just modify your old code.

The actual Shell sort method uses "doOneInsertionSS" in a clever way. We apply it with a sequence of "gaps" h, starting with a large value of h and then smaller and smaller values. In the best known versions of the method, these gaps are themselves stored in a vector. Here is a programming template for you.

// This is how you can declare and initialise a fixed (const)
// vector of ints in C++2011 onwards.  You'll need to make sure you
// are using at least C++2011 
// (If this is not the default, use the option -std=c++11 in GCC)

const vector<int> gaps = {701, 301, 132, 57, 23, 10, 4, 1};

//
// The above list of gaps was taken from a wikipedia page.
// You can try other sequences here. For example the following
// was recommended once by a very authoritative textbook on the
// subject (and then silently withdrawn in a later edition - it
// isn't very good!)
//
// const vector<int> gaps = {1024, 512, 256, 128, 64, 32, 16, 8, 4, 2, 1};
// 
// on the other hand, according to the wiki,
// 
// const vector<int> gaps = {1023, 511, 255, 127, 63, 31, 15, 7, 3, 1};
//
// is much better, and has the advantage of being able to be calculated by a 
// function, so with care your shellsort algorithm could be rewritten to 
// work on vectors of any size without needing such a const vector.

/** 
 * Sorts an arbitrary vector vec by Shell sort using the vector with
 * the gaps as given here (e.g. ues the vector gaps above).  Returns the 
 * total number of comparisons required.
 */
int shellsort(vector<double>& vec, const vector<int>& gaps) {
	// over to you...
}

One possible method for shell sort involves a loop inside a loop inside a loop:

Shell sort

To sort v[0]..v[s-1]:

  • For each value h from the vector of gaps (in decreasing order)
    • for each starting point i < h,
      • Apply the insertion sort method to sort the subsequence v[i],v[i+h],..,

Since the very last "gap" is 1, the very last run of this method is exactly the same as the original insertion sort, so the array is certainly going to be finally sorted at the end.

Exercise.

Implement this. (There are a number of ways of doing it.) You should copy the above lines into your program for insertion sort, complete them, and modify "main" to test everything. Make sure your code keeps a correct running count of all the comparisons that are needed.

Remember: the specification does not say you need to do array index checking, but it is highly recommended, especially at the debugging stage.

Of course, if you manage to get a working shell sort, you should test it thoroughly on any data you can think of, including but not only random data. What happens if the original vector was already sorted? What if it was sorted the opposite way round? What tests are most appropriate to decide if shell sort is better than insertion sort? What do we mean by "better"? Etc.

This should be enough programming for about two weeks, especially with the testing. Next week's worksheet will finish this task off, doing some further testing and in it I will give details of the functions required for the assessment. As far as possible, the individual functions in your program will be tested individually. For this reason is is going to be essential that your program meets the specification that I give you in next week's web page.