Card Columns (was Why did they make ... ?)

On Thu, 16 Jan 2003 18:51:22 +0000 (UTC), "Pointless Harlows"

Quote:

>> > Not always.  If you had several sorters and a very large file
>> to sort it
>> > was quite usual to do a block sort first to divide the file on
>> the leftmost
>> > digit, then share out the blocks among the sorters.

>> True. Actually, that was sometimes a good method even with one
>> sorter, because it cut down on the size of decks you were running
>> through the sorter on each pass.  To facilitate this, many shops
>> installed big vertical card trays that stood above and behind the
>> sorter, to hold the decks in each "block".

>Interesting to think that a single card sorter is effectively  Order(n) ...
>i.e. double the records = double the time exactly.
>How many so called fast algorithms can match that ?

With a large scaling factor k where k is the key length
determining the number of passes required. Adding 1 to k has the
same effect on time required as doubling n: so the time
complexity is O(n*2^(k-1)), linear in n, exponential in k, does

Quote:
>Interestingly, if you introduce multiple machines as described by Russ, the
>algorithm isn't as optimal even though the actual speed may be better ... I
>say MAY because the speed would actually be worse if a high proportion of

>Oh what fun ;-)

Thanks. Take care, Brian Inglis         Calgary, Alberta, Canada
--

Tue, 05 Jul 2005 03:31:46 GMT
Card Columns (was Why did they make ... ?)

Quote:
> Interesting to think that a single card sorter is effectively Order(n) ...
> i.e. double the records = double the time exactly.
> How many so called fast algorithms can match that ?

Well, of course, when you increase the number of records, the length
of the key must get longer.  The optimal key length is O(log(n)), so
overall you have O(n*log(n)), but this assumes that you're working
with an optimal key, where the records are consecutively numbered,
with very few gaps in the numbering, and you're only sorting on as
many significant digits as you are actually using.

Quote:
> Interestingly, if you introduce multiple machines as described
by Russ, the
> algorithm isn't as optimal even though the actual speed may be
better ... I
> say MAY because the speed would actually be worse if a high
proportion of

Actually, a block-sort is closer to optimal than the standard
right-to-left approach. Since, if records are
consecutively-numbered, only the first digit wouldn't be
uniformly distributed, the block-sorting method described above
would be better.
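The block-sorting procedure being discussed can be sketched in modern terms. This is an illustrative Python sketch of the idea only (fixed-width, all-numeric keys assumed), not anything that ever ran on tab equipment:

```python
def block_sort(cards, key_len):
    """One pass on the leftmost key digit splits the deck into ten
    blocks (the "block" pass), then each block is finished with an
    LSD radix sort on the remaining digits, just as separate sorters
    could work on the blocks in parallel."""
    blocks = [[] for _ in range(10)]
    for card in cards:
        blocks[int(card[0])].append(card)
    result = []
    for block in blocks:
        # Right-to-left passes over columns key_len-1 .. 1; column 0
        # is already uniform within a block.
        for col in range(key_len - 1, 0, -1):
            pockets = [[] for _ in range(10)]
            for card in block:
                pockets[int(card[col])].append(card)
            block = [c for p in pockets for c in p]
        result.extend(block)
    return result

deck = ["493", "072", "318", "205", "470", "318"]
print(block_sort(deck, 3))   # ['072', '205', '318', '318', '470', '493']
```

Within each block, the passes over the trailing digits run on a much shorter deck, which is exactly the saving described above.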

Tue, 05 Jul 2005 03:58:00 GMT
Card Columns (was Why did they make ... ?)

Quote:

> Interesting to think that a single card sorter is effectively  Order(n) ...
> i.e. double the records = double the time exactly.
> How many so called fast algorithms can match that ?

Cute, but there's a bit of a fallacy hiding there.  The width of the
sorting field will normally be proportional to the logarithm of the
number of records, which brings things back to the standard n*log n
time.

A card-sorter-like sort _is_ possible, and can be faster than QuickSort
when measured in actual machine cycles if procedure-call overhead is
high -- which it is in a recursive context in generalized mainframe
assembler, due to the lack of a stack.  (In a PL/I environment, the
library provides a stack, so the situation isn't so bad.)  I wrote an
assembler macro years ago to generate sort routines of this kind,
although it ran from left to right.  I did find that, as soon as a
partition numbered fewer than 1536 records, it was better to finish the
partition with Shell Sort.  (Shell Sort is unpredictable in timing, but,
as a rule, spikes every time the number of records crosses from 3*2^n-1
to 3*2^n.)
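For reference, the Shell sort used as a finisher for small partitions might look like the following Python sketch. The halving gap sequence here is an assumption; the sequence the original macro used isn't stated:

```python
def shell_sort(a):
    # Diminishing-increment (Shell) sort, in place.  The gap halves
    # each round; the final gap-1 round is a plain insertion sort.
    gap = len(a) // 2
    while gap > 0:
        for i in range(gap, len(a)):     # gapped insertion sort
            item, j = a[i], i
            while j >= gap and a[j - gap] > item:
                a[j] = a[j - gap]
                j -= gap
            a[j] = item
        gap //= 2
    return a
```

Because the earlier gapped passes leave the data nearly ordered, the last pass does little work, which is why it beats plain insertion sort on partitions of a thousand-odd records.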

--
John W. Kennedy
"The poor have sometimes objected to being governed badly;
the rich have always objected to being governed at all."
-- G. K. Chesterton, "The Man Who Was Thursday"

Tue, 05 Jul 2005 04:27:09 GMT
Card Columns (was Why did they make ... ?)

Quote:
>     The assumption that the freedom to make two copies of
> everything opens the door to more efficient sorts seems
> dubious at best.  There's usually something better to do
> with all that extra memory than merely use half of it for
> data and half of it for dirtied swap fodder.

Although I might have agreed with you in the abstract, as a practical
matter, the size of data that one even *considers* sorting in-memory
is very small compared to the memory available in a typical system
today, thanks to the exponential growth we've seen under Moore's Law.
Actually, the true amount of available memory when running an
application is quite variable, owing to the fact that there are
likely to be many other applications running on the system (including
something gratuitous, like a screen-saver and/or a web browser
displaying tons of graphics).

I recognize that the study of in-place sorts dates back to the days
of extremely limited-size systems.  Usually, there was only one
significant computer per university campus, and its available memory
size was effectively a constant value known to everyone who wrote
code for it.  And, since most folks were crunching research data
resulting from experiments and observations, the number of data
points could usually be regulated to be commensurate with the size of
the school's big computer.

Those days are in the past now.  And since I'm still working for a
living, it doesn't do for me to be living in the past, as much as I
like to reminisce with

Quote:

>     One example of the "smarter, not harder" idea can be
> found in Knuth's explication of merge sorting.  He begins
> with the back-and-forth approach using two memory regions
> able to hold N records each, and develops the basic merge
> sort and a few modifications.  But then he shows that making
> a single linked list of the records uses less space (assuming
> a record is larger than a pointer) *and* less time:

> "Thus we have a clear victory for linked-memory
> techniques over sequential allocation, when internal
> merging is being done: Less memory space is required,
> and the program runs about 10 to 20 percent faster."
> -- TAOCP Volume III, section 5.2.4

All true, but we're not talking about the "in-place sorting"
situation any more, which is what my rant was about.

In any case, I agree with the principle of using pointers to avoid
having to move the data around.

Actually, one of the biggest objections I have to in-place sorting
isn't just that the memory savings isn't worth considering anymore.
It's the broader principle that a generalized algorithm is much less
general if the input (the unsorted list) is replaced by the output
(the sorted one).  I find, in many or most of the applications where
I want to sort data, that I often need to preserve the original order
of the records, as well as having one or more sorted "views" of them
available.  In these situations, sorting in place is not only
unnecessary, but actually harmful.
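That "sorted views" idea can be sketched in a few lines of Python: the records stay put in arrival order, and each view is just an array of indices into them (the sample data here is invented for illustration):

```python
# Records in their original (arrival) order -- never rearranged.
records = [("smith", 3), ("jones", 1), ("adams", 2)]

# Each "view" is a permutation of indices, one per sort order.
by_name   = sorted(range(len(records)), key=lambda i: records[i][0])
by_number = sorted(range(len(records)), key=lambda i: records[i][1])

print([records[i][0] for i in by_name])   # alphabetical view
print(records[0])                         # original order intact
```

Any number of views can coexist over the same untouched data, which is exactly what an in-place sort destroys.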

Quote:
> > Even the best of them, the so-called Quick-Sort, can only be
> > fairy-dust.

>     The amount of fairy dust required for the stack (whether
> a call stack or other) is, er, dirt cheap compared to the
> amount required for an entire duplicate copy of the data
> being sorted.  N+lg(N) < 2*N for all N>1

Hmmm. Now I admit that it's been a number of years since I looked at
QuickSort, but I recall that, although it usually requires a modest
investment in stack space, there existed one or more "worst-case
scenarios" in which the required stack size approached the size of
the data to be sorted.  I've always written my programs with the idea
that if a worst-case scenario exists, resources should be allocated
to account for the possibility.  That would be a form of Murphy's
Law, you might say.  My stomach begins to churn whenever someone says
"Yes, but that hardly ever happens."  I guess that comes from working
for years on big systems that are supposed to run 24/7, and noting
that "worst-case scenarios" tend to eventuate on weekends and
holidays.

If I'm wrong about the worst-case thing with Quicksort, I'd be happy
to retract my comment about that. (Well... not *that* happy, to be
honest, but grudgingly willing.) But I'm very reluctant to buy the
idea I've seen suggested of randomly pre-scrambling the data to
improve the odds, so don't even bother mentioning that one.

Quote:
>-- and while I
> admit I'm no expert on sorting algorithms for one-record
> data sets, I can't imagine they'd be improved by making
> an extra copy of the one record ..

Gratuitous sarcasm duly noted.

Quote:
>     I'd be interested (and I imagine others would, too) to
> learn what sorting methods you have found preferable.  Once
> again, Knuth (describing the various methods one might have
> come up with to solve his opening thought-experiment about
> how best to sort five numbers):

A simple merge sort works nicely for me.  The "records" I'm merging
are usually not the records themselves, but rather lists of pointers
to them.  I think of the list as a "view object" that defines a
particular sort order for the data, while I leave the data sitting
there by itself.

If I have a really big list of data to sort, then I'll go with a
generalized file sort, which is also a merge type, combined with some
sort of tree-sorting algorithm on the first pass.

Quote:
>     Moo.  My mind is not udderly open, but I'm not predisposed
> to dismiss your contention as a load of bull.  Let's see the
> arguments to support your opinion; if they're good, I for one
> will happily lap them up.

This is all that I have time for right now. I've got a program
bug to track down.

Tue, 05 Jul 2005 05:03:04 GMT
Card Columns (was Why did they make ... ?)

Quote:

> > > Default seems to be at col 72, in my experience.  Can't imagine why....
> > > :->  I mean, who's going to add a comment in the final 8 cols??

> > Columns 73-80 were for card numbering.

> I've mentioned this before, but for those who missed it:

> The tradition of using only 1-72 in mainframe compilers dates
> back to the 704 through 7094 computers, whose online card
> readers read two 36-bit words from each row of punches, 24
> words per card.  Conversion to characters was done by software.

> Columns 73-80 were ignored because the reader literally
> couldn't pick them up, unless you dropped in
> another control panel that was jumpered differently.
> [...]

Moreover some ttys could print only 72 columns.

--
__Pascal_Bourguignon__                   http://www.informatimago.com/
----------------------------------------------------------------------
There is a fault in reality. Do not adjust your minds. -- Salman Rushdie

Tue, 05 Jul 2005 05:13:14 GMT
Card Columns (was Why did they make ... ?)

Quote:
> Interesting to think that a single card sorter is effectively Order(n) ...
> i.e. double the records = double the time exactly.
> How many so called fast algorithms can match that ?

Radix sort can match it.

-- glen

Tue, 05 Jul 2005 06:23:48 GMT
Card Columns (was Why did they make ... ?)

Quote:
> Moreover some ttys could print only 72 columns.

This is true, but I'm certain that the 72-column convention in
programming languages was well-established on punched cards for
the reason I gave. Programming was done on punched cards well
before systems advanced to the stage where interaction from a
terminal, for programming or other time-sharing functions, was
very common.

I'd need to see some evidence that this isn't just a coincidence.
I have an old Teletype Model 15 sitting in my garage, vintage WW
II. And it has exactly 72 print columns. It also has a very
limited 5-bit character set that doesn't include things like '=',
'+', or '*'.  It would be hard to do fortran with that, much less
PL/I or C.  The ASCII TTY didn't appear until the mid-'60s.

Tue, 05 Jul 2005 06:26:44 GMT
Card Columns (was Why did they make ... ?)

Quote:

> Radix sort can match it.

That's because what card sorters do *IS* radix sort.

Tue, 05 Jul 2005 06:33:54 GMT
Card Columns (was Why did they make ... ?)

Quote:

> >     The assumption that the freedom to make two copies of
> > everything opens the door to more efficient sorts seems
> > dubious at best.  There's usually something better to do
> > with all that extra memory than merely use half of it for
> > data and half of it for dirtied swap fodder.

> Although I might have agreed with you in the abstract, as a
> practical matter, the size of data that one even *considers*
> sorting in-memory is very small compared to the memory available
> in a typical system today, thanks to the exponential growth
> we've seen under Moore's Law. [...]

"Memory is cheap" pundits he'd encounter in design sessions:
He'd hold out his hand, palm up, and invite them to fill
it with memory ...

Memory's even cheaper now than it was back then, but
it still seems to me that using twice as much as you need
isn't a great idea.  Or, turning it around, a method that
can handle twice as much data in the same amount of memory
sounds like a good idea: it lets me sort twice as much
data before the system starts to suffer.

Quote:
> >     One example of the "smarter, not harder" idea can be
> > found in Knuth's explication of merge sorting.  [...]

> > "Thus we have a clear victory for linked-memory
> > techniques over sequential allocation, when internal
> > merging is being done: Less memory space is required,
> > and the program runs about 10 to 20 percent faster."
> > -- TAOCP Volume III, section 5.2.4

> All true, but we're not talking about the "in-place sorting"
> situation any more, which is what my rant was about.

I'm afraid I don't follow: A list merge *is* an in-place
sort; the back-and-forth-in-twice-the-memory method it
beats is -- well, an "out-of-place" sort.  I mentioned it
because it seemed a perfect example of in-place winning
both ends of the time/space tradeoff.
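The list merge can be sketched in Python. This is a simplified recursive illustration of the link-based idea, not Knuth's bottom-up Algorithm L; the records themselves never move, and the only extra space is one link per record:

```python
def list_merge_sort(keys):
    # Records never move: only an array of "link" fields is rewritten,
    # trading one link per record for a full second copy of the data.
    n = len(keys)
    link = [i + 1 for i in range(n)]     # initial chain 0 -> 1 -> ...
    if n:
        link[n - 1] = -1                 # -1 terminates a chain

    def merge(a, b):
        # Merge two sorted chains headed at a and b; return new head.
        if a == -1:
            return b
        if b == -1:
            return a
        if keys[a] <= keys[b]:           # <= keeps the sort stable
            link[a] = merge(link[a], b)
            return a
        link[b] = merge(a, link[b])
        return b

    def msort(head, count):
        if count <= 1:
            if head != -1:
                link[head] = -1
            return head
        mid = head
        for _ in range(count // 2 - 1):
            mid = link[mid]
        rest, link[mid] = link[mid], -1  # cut the chain in two
        return merge(msort(head, count // 2),
                     msort(rest, count - count // 2))

    head, out = msort(0 if n else -1, n), []
    while head != -1:                    # read keys off in link order
        out.append(keys[head])
        head = link[head]
    return out

print(list_merge_sort([5, 2, 4, 6, 1, 3]))   # [1, 2, 3, 4, 5, 6]
```

The `keys` array is never written to, only the `link` array, which is the point of the passage quoted from TAOCP.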

Quote:
> > > Even the best of them, the so-called Quick-Sort, can only be
> > > considered "in-place" if your call-stack is made of magic
> > > fairy-dust.

> >     The amount of fairy dust required for the stack (whether
> > a call stack or other) is, er, dirt cheap compared to the
> > amount required for an entire duplicate copy of the data
> > being sorted.  N+lg(N) < 2*N for all N>1

> Hmmm. Now I admit that it's been a number of years since I
> looked at QuickSort, but I recall that, although it usually
> requires a modest investment in stack space, there existed one
> or more "worst-case scenarios" in which the required stack size
> approached the size of the data to be sorted.

True, taking "worst-case scenario" to mean "coded by an
incompetent."  The simple precaution of stacking the larger
subfile and working on the smaller one first means the stack
depth cannot possibly exceed lg(N/2) = lg(N)-1.

I've never seen an algorithm -- for sorting or anything
else -- that an incompetent couldn't code badly.
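The smaller-subfile-first precaution is easy to demonstrate. Here is a Python sketch (Lomuto partitioning is assumed here for brevity, not any particular historical implementation) that also tracks the maximum stack depth reached:

```python
def partition(a, lo, hi):
    # Lomuto partition around the last element; returns the pivot's
    # final resting place.
    pivot, i = a[hi], lo
    for j in range(lo, hi):
        if a[j] < pivot:
            a[i], a[j] = a[j], a[i]
            i += 1
    a[i], a[hi] = a[hi], a[i]
    return i

def quicksort(a):
    # Always stack the LARGER partition and keep working on the
    # smaller one, so the stack never grows past about lg(N) entries.
    stack = [(0, len(a) - 1)]
    max_depth = 1
    while stack:
        lo, hi = stack.pop()
        while lo < hi:
            p = partition(a, lo, hi)
            if p - lo < hi - p:              # left side is the smaller
                stack.append((p + 1, hi))    # stack the larger right side
                hi = p - 1
            else:
                stack.append((lo, p - 1))    # stack the larger left side
                lo = p + 1
            max_depth = max(max_depth, len(stack) + 1)
    return a, max_depth
```

Even on fully sorted or reversed input the stack stays shallow; only the running time degrades to O(n^2), which is the separate failing discussed below.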

Quote:
> [...] But I'm very reluctant to buy the idea I've seen suggested
> of randomly pre-scrambling the data to improve the odds, so don't
> even bother mentioning that one.

That's an entirely different Quicksort failing: its
running time has a worst case that's embarrassingly bad, and
many people (including Hoare in his original paper) have
suggested various remedies.  However, there's no arrangement
of the input records that can make the stack or other memory
requirements balloon.

Quote:
> This is all that I have time for right now. I've got a program
> bug to track down.

Happy tracking!  I'm sure you'll, er, sort it out and
put it in its place.

--

Tue, 05 Jul 2005 06:05:02 GMT
Card Columns (was Why did they make ... ?)

Quote:

>>Moreover some ttys could print only 72 columns.

> This is true, but I'm certain that the 72-column convention in
> programming languages was well-established on punched cards for
> the reason I gave. Programming was done on punched cards well
> before systems advanced to the stage where interaction from a
> terminal, for programming or other time-sharing functions, was
> very common.

> I'd need to see some evidence that this isn't just a coincidence.
> I have an old Teletype Model 15 sitting in my garage, vintage WW
> II. And it has exactly 72 print columns. It also has a very
> limited 5-bit character set that doesn't include things like '=',
> '+', or '*'.  It would be hard to do FORTRAN with that, much less
> PL/I or C.  The ASCII TTY didn't appear until the mid-'60s.

Actually it did.  The alpha characters were uppercase only and the shift
(FIGS) key accessed all the special symbols assigned to each
alphabetic key.  The downshift (LTRS) key brought you back to alphabetic
mode.

Tue, 05 Jul 2005 07:44:21 GMT
Card Columns (was Why did they make ... ?)

Quote:

> Interesting to think that a single card sorter is effectively  Order(n) ...
> i.e. double the records = double the time exactly.
> How many so called fast algorithms can match that ?

That's just for one pass.  The number of passes is equal to the
number of columns in the sort key plus the number of columns which
are alphanumeric as opposed to numeric (one pass for the zones and a
second for the digits).

If there are no duplicate keys this is at least n*log(n) for some base (probably
between 10 and 26).  On the other hand if there are many duplicate keys and the
number of records sorted greatly exceeds the maximum number of unique keys that
can be represented in the given number of columns then the performance can be
far better than n*log(n). Properly done, card sorting is also stable. I.e.,
records with identical keys retain their original ordering.  (Better not drop
the deck.)

Technically, sorting in this manner on a card sorter is an example of
radix sorting, so called because the method is based on the structure
of the representation of the key in positional notation in some base
(radix).

Ordinary comparison sorting relies solely on the ordering of the keys and does
not depend in any way on how they are encoded.
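What the sorter does is least-significant-digit radix sort, which can be sketched in Python as follows (all-numeric key columns are assumed, so there is no separate zone pass; the sample "cards" are invented for illustration):

```python
def card_sort(deck, key_cols):
    """One pass per key column, right to left: drop each card into
    pockets 0-9 and pick them up in pocket order.  Stable: cards with
    identical keys keep their original relative order.  Passes =
    len(key_cols), so time is O(n*k) for an n-card deck."""
    for col in reversed(key_cols):
        pockets = [[] for _ in range(10)]
        for card in deck:
            pockets[int(card[col])].append(card)
        deck = [card for pocket in pockets for card in pocket]
    return deck

# Sort "cards" on a 3-digit key in columns 0-2:
deck = ["031A", "120B", "031C", "007D"]
print(card_sort(deck, range(3)))
# ['007D', '031A', '031C', '120B'] -- the two 031 cards keep order
```

The stability comes from each pocket preserving arrival order within a pass, which is why dropping the deck between passes ruins everything.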

Tue, 05 Jul 2005 07:52:08 GMT
Card Columns (was Why did they make ... ?)

Quote:

> With a large scaling factor k where k is the key length
> determining the number of passes required. Adding 1 to k has the
> same effect on time required as doubling n: so the time
> complexity is O(n*2^(k-1)), linear in n, exponential in k, does

No.

The time complexity is O(n*k), assuming no columns contain
alphanumeric data and ignoring time between passes.  Of course, if
you want to claim that k is a fixed, predetermined constant, then you
can claim it is O(n)!  But really k varies from problem to problem
and can't honestly be claimed to be a constant unless you want to use
k=80.  It is true that once k is fixed, if you double n you double
the time.

Tue, 05 Jul 2005 08:04:56 GMT
Card Columns (was Why did they make ... ?)
Not that I'm old or anything... not that it's a bad thing ... and less and

Old IRS tab shop had rows and rows of sorters.  One of the employees
went to his supervisor and demanded a hefty raise.  When asked why he
was worth such an extravagant amount he replied "well, I'm the column
9 man".

I guess this was back when there was no reusable code.

Jeff Raben

Tue, 05 Jul 2005 07:53:41 GMT
Card Columns (was Why did they make ... ?)

Quote:

> Well, of course, when you increase the number of records, the
> length of the key must get longer.  The optimal key length is
> O(log(n)), so overall you have O(n*log(n)), but this assumes that
> you're working with an optimal key, where the records are
> consecutively numbered, with very few gaps in the numbering, and
> you're only sorting on as many significant digits as you are
> actually using.

Perhaps yes, perhaps no.  Say the key is an account number and the
record is a customer account record.  Then indeed what you say is
true.  Such keys are usually structured to identify the issuing bank
or whatever, geographical region, etc., as well as the customer, and
also usually involve self-checking digits, so in that case
performance will be n*log(n) or worse.

On the other hand, with the same key but sorting transaction records,
there can be many transactions for the same account.  In real
applications of this kind the batch of transactions sorted at one
time is generally way too small to realize performance better than
n*log(n), but there are examples of applications where there can be a
huge number of records but a very short key with many duplicates, and
in such cases the performance indeed is better than n*log(n).  This
is in fact getting back to the real meaning of sort that someone else
pointed out: dropping into bins as opposed to arranging in order.

Tue, 05 Jul 2005 08:16:32 GMT
Card Columns (was Why did they make ... ?)
On Thu, 16 Jan 2003 15:26:44 -0700, "Russ Holsclaw"

Quote:

>> Moreover some ttys could print only 72 columns.

>This is true, but I'm certain that the 72-column convention in
>programming languages was well-established on punched cards for
>the reason I gave. Programming was done on punched cards well
>before systems advanced to the stage where interaction from a
>terminal, for programming or other time-sharing functions, was
>very common.

>I'd need to see some evidence that this isn't just a coincidence.
>I have an old Teletype Model 15 sitting in my garage, vintage WW
>II. And it has exactly 72 print columns. It also has a very
>limited 5-bit character set that doesn't include things like '=',
>'+', or '*'.  It would be hard to do FORTRAN with that, much less
>PL/I or C.  The ASCII TTY didn't appear until the mid-'60s.

Europe tended to use paper tapes punched offline on teleprinters,
like the Creed models, Friden Flexowriters, TTYs, and read in
from a tape reader, whereas the US tended to use punched cards
due to Hollerith and successors.

Thanks. Take care, Brian Inglis         Calgary, Alberta, Canada
--