The following is a journal article I wrote for a science writing course. It was originally written as an hwp file and converted to HTML, so the figures and tables were removed. Though incomplete, the general content should still be clear. I should note that I posted this piece more for the sake of English study than for computer science. The original Korean (hwp) file is here.
Title: The Optimal Sorting Method According to the Characteristics of the Data
Abstract
Sorting is the process of rearranging the elements of a list into some linear
order. Sorting is a very important subject in computer science because of its
practical nature and its role as an example of many different approaches to problem solving.
In this article, five internal sorting algorithms - selection sort, insertion
sort, bubble sort, quicksort, and merge sort - are presented and compared with
each other. Although quicksort is often the best sorting method, different
sorting methods can be optimal depending on the characteristics of the data.
Introduction
Sorting is the process of rearranging the elements of a list into
some linear order[1]. Sorting is one of the most important problems in computer science,
especially in the field of algorithms. First, sorting is a very practical problem.
By experimental estimation, almost 70% of total computer time - across every computer
program on every computer on earth - is spent sorting and searching. In
the case of searching, before an element can be found in a list efficiently, the
list should first be sorted[2]. Second, the study of sorting illustrates
many different approaches to the same problem. This variety of ideas about the
same topic is an essential point in studying algorithms, which is one of the
most fundamental aspects of computer programming.
In this review, five internal sorting algorithms - selection sort, insertion
sort, bubble sort, quicksort, and merge sort - will be introduced. Internal sorting
takes place in the main memory of a computer, where its random-access capability
can be exploited. The opposite process, external sorting, is necessary when the number of
elements to be sorted is too large to fit in main memory. In such cases, the
bottleneck between main memory and the external storage device becomes
critical, and a completely different approach is needed[3].
The selection sort is a very elementary algorithm. In this sorting process,
first the smallest element among the n elements (where n is the list size)
is selected and swapped with the first element of the list. Next, the smallest
element among the remaining n-1 elements, from the second to the last, is selected
and swapped with the second element of the list. As a result, after the mth
pass, the m smallest elements occupy the first m positions of the list in
sorted order.
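The selection sort described above can be sketched in a few lines of Python (an illustrative implementation, not taken from the cited texts):

```python
def selection_sort(a):
    """Sort list a in place by repeatedly selecting the minimum."""
    n = len(a)
    for m in range(n - 1):
        # find the index of the smallest element among a[m:]
        smallest = m
        for i in range(m + 1, n):
            if a[i] < a[smallest]:
                smallest = i
        # swap it into position m; after this pass the first m+1
        # positions hold the m+1 smallest values in sorted order
        a[m], a[smallest] = a[smallest], a[m]
    return a
```

Note that the number of comparisons, (n-1) + (n-2) + ... + 1, is the same regardless of the input order, which is why selection sort is O(n^2) even on already-sorted data.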
The insertion sort is another elementary algorithm. Its main idea is
inserting an element into its proper place in an already sorted list. In this sorting
process, a single element can be considered sorted (though it is only one element),
and the size of the sorted list grows by insertion while the list's order
is maintained. As a result, after the mth pass, the first m elements are
sorted and the remaining n-m elements, from the (m+1)th to the nth (last)
element, are not yet sorted (m<n).
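A minimal Python sketch of this insertion process (again illustrative, not from the cited texts) follows; after each pass of the outer loop, the prefix a[:m+1] is sorted:

```python
def insertion_sort(a):
    """Sort list a in place by growing a sorted prefix."""
    for m in range(1, len(a)):
        key = a[m]
        i = m - 1
        # shift sorted elements to the right until key's place is found
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return a
```

On data that are already nearly sorted the inner while loop exits almost immediately, which is why insertion sort performs so well on such inputs.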
The basic idea behind the bubble sort is that the elements to be sorted
are kept in a linear list. Starting with the first two elements, the larger
of two adjacent elements moves to the right, and the smaller moves to the left.
Next, the same operation is applied to the second and third elements of the list.
After repeating this swapping operation up to the last two elements of
the list, the largest element ends up at the right end. The same procedure is then
repeated from the first element to the (n-1)th element; the last element
is the largest, in other words, it is already in its correct position.
After the mth repetition, the m largest elements are in their correct
positions.
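The passes of adjacent swaps described above can be sketched as follows (an illustrative Python implementation, not from the cited texts):

```python
def bubble_sort(a):
    """Sort list a in place by repeated passes of adjacent swaps."""
    n = len(a)
    for m in range(n - 1):
        # one pass: the largest remaining element "bubbles" to
        # position n-1-m; the last m positions are already final
        for i in range(n - 1 - m):
            if a[i] > a[i + 1]:
                a[i], a[i + 1] = a[i + 1], a[i]
    return a
```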
Quicksort is more complicated. Its main idea originates
from the "divide and conquer" methodology[4]. First, an element is
selected, and the list is rearranged around the selected element as shown in Fig. 1.
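One simple way to realize this idea in Python is shown below. This sketch takes the first element as the selected (pivot) element and builds new sublists rather than partitioning in place, so it is illustrative rather than the exact procedure of Fig. 1:

```python
def quick_sort(a):
    """Return a sorted copy of a, partitioning around the first element."""
    if len(a) <= 1:
        return a
    pivot = a[0]
    # divide: split the rest of the list around the pivot
    smaller = [x for x in a[1:] if x < pivot]
    larger = [x for x in a[1:] if x >= pivot]
    # conquer: sort each part recursively, then combine
    return quick_sort(smaller) + [pivot] + quick_sort(larger)
```

When the pivot splits the list evenly the recursion depth is about lg n, giving the O(n lg n) average; a consistently bad pivot (e.g. on already-sorted input with this pivot choice) degenerates to the O(n^2) worst case discussed later.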
The merge sort also uses the "divide and conquer" methodology[6]. Whereas the quicksort algorithm decomposes the list into two sublists of generally different sizes, merge sort decomposes it into two sublists of equal size. After decomposing the list, the two sublists are sorted separately by recursively applying the merge sort algorithm, and then merged. The strategy of the merge sort is shown in Fig. 2.
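The halving-and-merging strategy can be sketched in Python as follows (an illustrative implementation, not the original Fig. 2):

```python
def merge_sort(a):
    """Return a sorted copy of a by splitting into halves and merging."""
    if len(a) <= 1:
        return a
    # divide: split the list into two halves of (nearly) equal size
    mid = len(a) // 2
    left = merge_sort(a[:mid])
    right = merge_sort(a[mid:])
    # combine: merge the two sorted halves into one sorted list
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged
```

Because the split is always into equal halves, the recursion depth is always about lg n, which is why merge sort keeps its O(n lg n) behavior even in the worst case.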
Discussion
The O notation is a good measure for comparing the efficiency of the sorting methods[7]. In other words, the O notation is a useful measure for estimating the running time of a program or subroutine in computer science (in such cases the O notation describes the time complexity of the program). Mathematically, the O notation is defined as follows:
The running time T(n) of some program is O(f(n)) if there are positive constants c and n0 such that for all n equal to or greater than n0, we have T(n) <= c*f(n).
For the sorting methods mentioned above, the time complexity is O(n^2) or O(nlgn). On average, the selection, insertion, and bubble sorts are O(n^2) algorithms, while quicksort and merge sort are O(nlgn) algorithms. By mathematical intuition, the latter are more efficient than the former, and when sufficiently large amounts of data are sorted, the former are almost useless. The relation between the data size and the sorting time of O(nlgn) and O(n^2) algorithms is shown in Table 1[8].
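The gap between the two growth rates can be checked numerically with a short sketch (illustrative values only, not a reproduction of the original Table 1):

```python
import math

# compare the growth of n^2 against n*lg(n) for increasing n
for n in [10, 100, 1000, 10000]:
    n_squared = n * n
    n_lg_n = round(n * math.log2(n))
    print(f"n={n:>6}  n^2={n_squared:>10}  n*lg(n)={n_lg_n:>8}")
```

At n = 10 the two are comparable, but by n = 10000 the quadratic cost is already more than seven hundred times larger, which is the sense in which O(n^2) methods become "almost useless" for large data.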
Beyond the average case, the O notation can differ in the worst case. For example, the worst-case time complexity of quicksort is O(n^2). The average-case and worst-case time complexities of each sorting method are shown below (Table 2).
Despite its inefficiency in the worst case, quicksort is the most
efficient method for average data. For this reason, quicksort is widely used
in spite of its worst-case inefficiency[9].
The speed of each sorting method can vary with data size, and
the size of the data to be sorted is sometimes the essential factor in choosing
the optimal sorting method. For this reason, it is necessary to measure the time
used for sorting at various data sizes. Fig. 3 below shows the results[10].
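Measurements of this kind can be reproduced with a rough sketch like the following, which times an O(n^2) insertion sort against Python's built-in O(nlgn) sort on random data (the data sizes and the use of the built-in sort are assumptions for illustration, not the setup behind Fig. 3):

```python
import random
import timeit

def insertion_sort(a):
    """A simple O(n^2) sort used as the quadratic baseline."""
    for m in range(1, len(a)):
        key, i = a[m], m - 1
        while i >= 0 and a[i] > key:
            a[i + 1] = a[i]
            i -= 1
        a[i + 1] = key
    return a

# time each method once per data size on the same random input
for n in [100, 1000, 5000]:
    data = [random.random() for _ in range(n)]
    t_quad = timeit.timeit(lambda: insertion_sort(data.copy()), number=1)
    t_nlgn = timeit.timeit(lambda: sorted(data), number=1)
    print(f"n={n:>5}  insertion={t_quad:.4f}s  built-in={t_nlgn:.4f}s")
```

As n grows, the quadratic method's time grows roughly a hundredfold for each tenfold increase in n, while the O(nlgn) method grows only slightly faster than linearly.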
Conclusion
When the data size is sufficiently large, quicksort is the best sorting
method, in spite of its inefficiency on worst-case data; but if stable
efficiency is required, merge sort is the best method. When the data size is
small (less than 20), insertion or selection sort is the best method[11].
This is because the overhead required by the O(nlgn) algorithms is quite large.
In other words, for such small amounts of data, the asymptotic efficiency of
O(nlgn) algorithms does not outweigh their overhead. In particular, when the
data are almost sorted before sorting begins, insertion sort is the most effective[12].
References
[1] Aho, Hopcroft, Ullman, Data Structures and Algorithms, Addison-Wesley, p. 253
[2] Baase, Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, p. 48
[3] Baase, Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, p. 48
[4] Baase, Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, p. 53; Aho, Hopcroft, Ullman, Data Structures and Algorithms, Addison-Wesley, p. 306
[5] Horowitz, Sahni, Anderson-Freed, Fundamentals of Data Structures in C, Computer Science Press, p. 332
[6] Baase, Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, p. 53
[7] Aho, Hopcroft, Ullman, Data Structures and Algorithms, Addison-Wesley, p. 17
[8] Horowitz, Sahni, Anderson-Freed, Fundamentals of Data Structures in C, Computer Science Press, p. 40
[9] Horowitz, Sahni, Anderson-Freed, Fundamentals of Data Structures in C, Computer Science Press, p. 329
[10] Horowitz, Sahni, Anderson-Freed, Fundamentals of Data Structures in C, Computer Science Press, p. 370
[11] Aho, Hopcroft, Ullman, Data Structures and Algorithms, Addison-Wesley, p. 260
[12] Baase, Computer Algorithms: Introduction to Design and Analysis, Addison-Wesley, pp. 49-50