Sorting algorithm.
Author Message
Sorting algorithm.

I have a couple of questions. I'm writing a program which will sort a
large
file by distributing it into smaller files, then recompiling then into a
large single file again by the following process.

1. Read first entry of each smaller file.

2. Pick which entry is the smallest and write it to large file.

3. Next entry in the small file which was picked becomes the new first
entry
for that file. If there are none, that file is not looked at again.

4. Repeat steps 2 and 3 until ALL of the files have been read into the
larger
file. If the large file is not completely sorted, then do it again
and
again until it is.

Here is an illustration of how the sort works. Consider the following
numbers.

5
2
7
4
1
6
3
0

Now, to sort them we distribute them into say, 4 files.

5  2  7  4
1  6  3  0

Now we recompiling them.

2
4
0
5
1
6
7
3

Now we redistribute them.

2  4  0  5
1  6  7  3

Now we recompile them.

0
2
1
4
5
3
6
7

Now we redistribute them.

0  2  1  4
5  3  6  7

Now we recompile them.

0
1
2
3
4
5
6
7

Now the file is sorted after 3 iterations.

Now, I have two questions.

1. What is the name of this sort?

2. Assuming that you know how many entries are in the large file,
(8 in my example) what is the optimal number of smaller files that
you
should split it into (4 in my example) to minimize the number of
iterations needed to sort the file? Thanks!

-Patrick-

Mon, 29 Nov 1999 03:00:00 GMT
Sorting algorithm.

.> I have a couple of questions. I'm writing a program which will sort a
.> large
.> file by distributing it into smaller files, then recompiling then
into a
.> large single file again by the following process.
.>
.> 1. Read first entry of each smaller file.
.>
.> 2. Pick which entry is the smallest and write it to large file.
.>
.> 3. Next entry in the small file which was picked becomes the new
first
.> entry
.>    for that file. If there are none, that file is not looked at
again.
.>
.> 4. Repeat steps 2 and 3 until ALL of the files have been read into
the
.> larger
.>    file. If the large file is not completely sorted, then do it again
.> and
.>    again until it is.
.>
.> Here is an illustration of how the sort works. Consider the following
.> numbers.
.>
.>                                  5
.>                                  2
.>                                  7
.>                                  4
.>                                  1
.>                                  6
.>                                  3
.>                                  0
.>
.> Now, to sort them we distribute them into say, 4 files.
.>
.>                            5  2  7  4
.>                            1  6  3  0
.>

Quote:
>. Now we recompiling them.

.>
.>                                  2
.>                                  4
.>                                  0
.>                                  5
.>                                  1
.>                                  6
.>                                  7
.>                                  3
.>
.> Now we redistribute them.
.>
.>                            2  4  0  5
.>                            1  6  7  3
.>
.> Now we recompile them.
.>
.>                                  0
.>                                  2
.>                                  1
.>                                  4
.>                                  5
.>                                  3
.>                                  6
.>                                  7
.>
.> Now we redistribute them.
.>
.>
.>                            0  2  1  4
.>                            5  3  6  7
.>
.> Now we recompile them.
.>
.>                                  0
.>                                  1
.>                                  2
.>                                  3
.>                                  4
.>                                  5
.>                                  6
.>                                  7
.>
.> Now the file is sorted after 3 iterations.
.>
.> Now, I have two questions.
.>
.> 1. What is the name of this sort?
.>
.> 2. Assuming that you know how many entries are in the large file,
.>    (8 in my example) what is the optimal number of smaller files that
.> you
.>    should split it into (4 in my example) to minimize the number of
.>    iterations needed to sort the file? Thanks!
.>
.> -Patrick-

This is a form of a merge sort.  In the "old days" when disk space
was scarce, but tape drives were plentiful, this is precisely how large
data files were sorted.  You take your data file on Tape A, distribute
it on tapes B, C, D, and E, and merge it back onto A.  This sort
(sometimes called a polyphase sort) can probably be found in Knuth's
book, "The Art of Computer Programming, Vol 3, Sorting and Searching",
or in Wirth's book, "Algorithms + Data Structures = Programs".

I suspect you'll find that the question about "optimum number" is
"more is better", though it becomes a hassle managing many open files.
There is probably a point of diminishing returns ...

What you have implemented is not quite the classical merge sort.
That would have merged the distributed files into as long a run as
possible, and then would have distributed the runs one to a file.  So
your example would have been (re-formatting somewhat for clarity)

[ 5 2 7 4 1 6 3 0 ]              original file
[ 5 3 ] [ 2 7 0 ] [ 4 ] [ 1 6 ]  first distribution
[ 1 2 4 5 6 3 7 0 ]              first merge
[ 1 2 4 5 6 ] [ 3 7 ] [ 0 ]      redistribution
[ 0 1 2 3 4 5 6 7 ]              second merge

When you distribute, try to write out as long a sequence of ordered
information to the file as you can, going on to the next file only when
you need to start a new "run".  Similarly, when merging, merge as long
a run as possible.

Bob Schor
Pascal Enthusiast

Tue, 30 Nov 1999 03:00:00 GMT
Sorting algorithm.

Quote:

> I have a couple of questions. I'm writing a program which will sort a
> large
> file by distributing it into smaller files, then recompiling then into a
> large single file again by the following process.

> 1. Read first entry of each smaller file.

> 2. Pick which entry is the smallest and write it to large file.

> 3. Next entry in the small file which was picked becomes the new first
> entry
>    for that file. If there are none, that file is not looked at again.

> 4. Repeat steps 2 and 3 until ALL of the files have been read into the
> larger
>    file. If the large file is not completely sorted, then do it again
> and
>    again until it is.

> Here is an illustration of how the sort works. Consider the following
> numbers.

>                                  5
>                                  2
>                                  7
>                                  4
>                                  1
>                                  6
>                                  3
>                                  0

> Now, to sort them we distribute them into say, 4 files.

>                            5  2  7  4
>                            1  6  3  0

> Now we recompiling them.

>                                  2
>                                  4
>                                  0
>                                  5
>                                  1
>                                  6
>                                  7
>                                  3

> Now we redistribute them.

>                            2  4  0  5
>                            1  6  7  3

> Now we recompile them.

>                                  0
>                                  2
>                                  1
>                                  4
>                                  5
>                                  3
>                                  6
>                                  7

> Now we redistribute them.

>                            0  2  1  4
>                            5  3  6  7

> Now we recompile them.

>                                  0
>                                  1
>                                  2
>                                  3
>                                  4
>                                  5
>                                  6
>                                  7

> Now the file is sorted after 3 iterations.

> Now, I have two questions.

> 1. What is the name of this sort?

> 2. Assuming that you know how many entries are in the large file,
>    (8 in my example) what is the optimal number of smaller files that
> you
>    should split it into (4 in my example) to minimize the number of
>    iterations needed to sort the file? Thanks!

> -Patrick-

1. merge sort
2. probably two -- let me explain
on the surface, ideally you'd have the same number of smaller files as
there are elements in the larger file.  This would give you a sort
capability of one iteration :

5 2 7 4 1 6 3 0 --> files : 5, 2, 7, 4, 1, 6, 3, 0 -~-> sorted:
0 1 2 3 4 5 6 7
but think of how, within each iteration, you'd look for the smallest
number.  Say you go down the line - look at file #1 first, then file #2,
and so on and so on.  That's called a linear search and is the slowest
there is.  So, you'd be combining a merge sort of 1 iteration with a
linear search of many iterations - result := extremely sllloooowwww
sort.  If you use two files, then you can just do a comparison :

if file1[current] < file2[current] then
fileresult[current] := file1[current]
else
fileresult[current] := file2[current];

(of course, I'm talking about arrays here but you can use files if you
want. Also, current is the current index in *that* array)

result := fastest merge sort (because you not combining it with a slower
search/sort).

--
.--.
|~~| .-------.  Kevin Kreamer

|--| || KCK ||  "Profanity is the one language understood by all
|  | |'-----'|   programmers."
|__|~'-------'

Tue, 30 Nov 1999 03:00:00 GMT

 Page 1 of 1 [ 3 post ]

Relevant Pages