Sorting algorithm. 
Author Message
 Sorting algorithm.

I have a couple of questions. I'm writing a program which will sort a
large
file by distributing it into smaller files, then recompiling then into a
large single file again by the following process.

1. Read first entry of each smaller file.

2. Pick which entry is the smallest and write it to large file.

3. Next entry in the small file which was picked becomes the new first
entry
   for that file. If there are none, that file is not looked at again.

4. Repeat steps 2 and 3 until ALL of the files have been read into the
larger
   file. If the large file is not completely sorted, then do it again
and
   again until it is.

Here is an illustration of how the sort works. Consider the following
numbers.

                                 5
                                 2
                                 7
                                 4
                                 1
                                 6
                                 3
                                 0

Now, to sort them we distribute them into say, 4 files.

                           5  2  7  4
                           1  6  3  0

Now we recompiling them.

                                 2
                                 4
                                 0
                                 5
                                 1
                                 6
                                 7
                                 3

Now we redistribute them.

                           2  4  0  5
                           1  6  7  3

Now we recompile them.

                                 0
                                 2
                                 1
                                 4
                                 5
                                 3
                                 6
                                 7

Now we redistribute them.

                           0  2  1  4
                           5  3  6  7

Now we recompile them.

                                 0
                                 1
                                 2
                                 3
                                 4
                                 5
                                 6
                                 7

Now the file is sorted after 3 iterations.

Now, I have two questions.

1. What is the name of this sort?

2. Assuming that you know how many entries are in the large file,
   (8 in my example) what is the optimal number of smaller files that
you
   should split it into (4 in my example) to minimize the number of
   iterations needed to sort the file? Thanks!

-Patrick-



Mon, 29 Nov 1999 03:00:00 GMT  
 Sorting algorithm.

.> I have a couple of questions. I'm writing a program which will sort a
.> large
.> file by distributing it into smaller files, then recompiling then
into a
.> large single file again by the following process.
.>
.> 1. Read first entry of each smaller file.
.>
.> 2. Pick which entry is the smallest and write it to large file.
.>
.> 3. Next entry in the small file which was picked becomes the new
first
.> entry
.>    for that file. If there are none, that file is not looked at
again.
.>
.> 4. Repeat steps 2 and 3 until ALL of the files have been read into
the
.> larger
.>    file. If the large file is not completely sorted, then do it again
.> and
.>    again until it is.
.>
.> Here is an illustration of how the sort works. Consider the following
.> numbers.
.>
.>                                  5
.>                                  2
.>                                  7
.>                                  4
.>                                  1
.>                                  6
.>                                  3
.>                                  0
.>
.> Now, to sort them we distribute them into say, 4 files.
.>
.>                            5  2  7  4
.>                            1  6  3  0
.>

Quote:
>. Now we recompiling them.

.>
.>                                  2
.>                                  4
.>                                  0
.>                                  5
.>                                  1
.>                                  6
.>                                  7
.>                                  3
.>
.> Now we redistribute them.
.>
.>                            2  4  0  5
.>                            1  6  7  3
.>
.> Now we recompile them.
.>
.>                                  0
.>                                  2
.>                                  1
.>                                  4
.>                                  5
.>                                  3
.>                                  6
.>                                  7
.>
.> Now we redistribute them.
.>
.>
.>                            0  2  1  4
.>                            5  3  6  7
.>
.> Now we recompile them.
.>
.>                                  0
.>                                  1
.>                                  2
.>                                  3
.>                                  4
.>                                  5
.>                                  6
.>                                  7
.>
.> Now the file is sorted after 3 iterations.
.>
.> Now, I have two questions.
.>
.> 1. What is the name of this sort?
.>
.> 2. Assuming that you know how many entries are in the large file,
.>    (8 in my example) what is the optimal number of smaller files that
.> you
.>    should split it into (4 in my example) to minimize the number of
.>    iterations needed to sort the file? Thanks!
.>
.> -Patrick-

     This is a form of a merge sort.  In the "old days" when disk space
was scarce, but tape drives were plentiful, this is precisely how large
data files were sorted.  You take your data file on Tape A, distribute
it on tapes B, C, D, and E, and merge it back onto A.  This sort
(sometimes called a polyphase sort) can probably be found in Knuth's
book, "The Art of Computer Programming, Vol 3, Sorting and Searching",
or in Wirth's book, "Algorithms + Data Structures = Programs".

     I suspect you'll find that the question about "optimum number" is
"more is better", though it becomes a hassle managing many open files.
There is probably a point of diminishing returns ...

     What you have implemented is not quite the classical merge sort.
That would have merged the distributed files into as long a run as
possible, and then would have distributed the runs one to a file.  So
your example would have been (re-formatting somewhat for clarity)

  [ 5 2 7 4 1 6 3 0 ]              original file
  [ 5 3 ] [ 2 7 0 ] [ 4 ] [ 1 6 ]  first distribution
  [ 1 2 4 5 6 3 7 0 ]              first merge
  [ 1 2 4 5 6 ] [ 3 7 ] [ 0 ]      redistribution
  [ 0 1 2 3 4 5 6 7 ]              second merge

     When you distribute, try to write out as long a sequence of ordered
information to the file as you can, going on to the next file only when
you need to start a new "run".  Similarly, when merging, merge as long
a run as possible.  

Bob Schor
Pascal Enthusiast



Tue, 30 Nov 1999 03:00:00 GMT  
 Sorting algorithm.

Quote:

> I have a couple of questions. I'm writing a program which will sort a
> large
> file by distributing it into smaller files, then recompiling then into a
> large single file again by the following process.

> 1. Read first entry of each smaller file.

> 2. Pick which entry is the smallest and write it to large file.

> 3. Next entry in the small file which was picked becomes the new first
> entry
>    for that file. If there are none, that file is not looked at again.

> 4. Repeat steps 2 and 3 until ALL of the files have been read into the
> larger
>    file. If the large file is not completely sorted, then do it again
> and
>    again until it is.

> Here is an illustration of how the sort works. Consider the following
> numbers.

>                                  5
>                                  2
>                                  7
>                                  4
>                                  1
>                                  6
>                                  3
>                                  0

> Now, to sort them we distribute them into say, 4 files.

>                            5  2  7  4
>                            1  6  3  0

> Now we recompiling them.

>                                  2
>                                  4
>                                  0
>                                  5
>                                  1
>                                  6
>                                  7
>                                  3

> Now we redistribute them.

>                            2  4  0  5
>                            1  6  7  3

> Now we recompile them.

>                                  0
>                                  2
>                                  1
>                                  4
>                                  5
>                                  3
>                                  6
>                                  7

> Now we redistribute them.

>                            0  2  1  4
>                            5  3  6  7

> Now we recompile them.

>                                  0
>                                  1
>                                  2
>                                  3
>                                  4
>                                  5
>                                  6
>                                  7

> Now the file is sorted after 3 iterations.

> Now, I have two questions.

> 1. What is the name of this sort?

> 2. Assuming that you know how many entries are in the large file,
>    (8 in my example) what is the optimal number of smaller files that
> you
>    should split it into (4 in my example) to minimize the number of
>    iterations needed to sort the file? Thanks!

> -Patrick-

1. merge sort
2. probably two -- let me explain
  on the surface, ideally you'd have the same number of smaller files as
there are elements in the larger file.  This would give you a sort
capability of one iteration :

      5 2 7 4 1 6 3 0 --> files : 5, 2, 7, 4, 1, 6, 3, 0 -~-> sorted:
      0 1 2 3 4 5 6 7
but think of how, within each iteration, you'd look for the smallest
number.  Say you go down the line - look at file #1 first, then file #2,
and so on and so on.  That's called a linear search and is the slowest
there is.  So, you'd be combining a merge sort of 1 iteration with a
linear search of many iterations - result := extremely sllloooowwww
sort.  If you use two files, then you can just do a comparison :

if file1[current] < file2[current] then
    fileresult[current] := file1[current]
else
    fileresult[current] := file2[current];

(of course, I'm talking about arrays here but you can use files if you
want. Also, current is the current index in *that* array)

result := fastest merge sort (because you not combining it with a slower
search/sort).

--
.--.            
|~~| .-------.  Kevin Kreamer

|--| || KCK ||  "Profanity is the one language understood by all
|  | |'-----'|   programmers."
|__|~'-------'



Tue, 30 Nov 1999 03:00:00 GMT  
 
 [ 3 post ] 

 Relevant Pages 

1. Most efficient sort text file routine, sort algorithms

2. Counting Comparisons in a Sorting Algorithm

3. Help! sorting algorithm

4. Quick Sort Algorithm or Source Code

5. sort algorithm trouble...

6. Help me searching a sorting algorithm

7. Sorting algorithm for three keys

8. Sort Algorithms

9. Sorting algorithm.

10. Sorting Algorithm

11. ?:merge sort algorithm

12. Efficient sorting algorithm?

 

 
Powered by phpBB® Forum Software