Space efficient sorting of arrays 
Author Message
 Space efficient sorting of arrays

Hi all,
I've been trundling along fine up until now, sorting a 2d array by one of
its columns as follows:
myarray{<-}myarray[{gradeup}myarray[;3];]

Now I'm trying to sort large arrays (>100,000 rows) and am running into
WSFULL problems.
Does anyone have any tips on how to make this sorting more efficient?

Cheers,
Ric Sherlock



Sun, 23 May 2004 08:41:23 GMT  
 Space efficient sorting of arrays
Hi Ric,

If you have 100,000 rows and (say) 10 columns of floating-point data (8
bytes for each number), this array will consume 8Mb. APL will need 2 copies
of the data while it reorders the array, so you will need a workspace in
excess of 16Mb to complete the operation. If you use an external file, and
write then read "chunks" of the array you can reduce the space requirement
somewhat, but before we get into that, perhaps you can answer a couple of
questions:

1) How big is your workspace?
2) Which APL system are you using?

Regards,

Morten Kromberg

Quote:
-----Original Message-----

Gateway
Sent: 5. december 2001 02:31

Subject: Space efficient sorting of arrays


Hi all,
I've been trundling along fine up until now, sorting a 2d array by one of
its columns as follows:
myarray{<-}myarray[{gradeup}myarray[;3];]

Now I'm trying to sort large arrays (>100,000 rows) and am running into
WSFULL problems.
Does anyone have any tips on how to make this sorting more efficient?

Cheers,
Ric Sherlock



Sun, 23 May 2004 10:03:56 GMT  
 Space efficient sorting of arrays

Quote:

> Hi all,
> I've been trundling along fine up until now, sorting a 2d array by one of
> its columns as follows:
> myarray{<-}myarray[{gradeup}myarray[;3];]

> Now I'm trying to sort large arrays (>100,000 rows) and am running into
> WSFULL problems.
> Does anyone have any tips on how to make this sorting more efficient?

Don't.  Buy memory.  

It's dead cheap these days ... 256 MB of registered ECC PC 133
can be had for about $50 from crucial.com.

If you've maxed out the machne already  ...

It's not too hard to sort space-efficiently:

   g {<-} {gradeup} A




  k {first}{rho} g

  :While k > 0
      k {<-} k - 1
      :While i[k] {/=} k
         A[k,i[k];] {<-} A[i[k],k;]
         i[k,i[k]]  {<-} i[i[k],k]
      :EndWhile
  :EndWhile

This is going to be rather slow for large arrays ...
10 seconds for a 100000x4 integer matrix (and
within 10% of the same for a 100000-item vector)
on a machine where the {gradeup} takes 0.32 sec
(matrix) or .13 sec (vector), and the direct
reindexing takes another 0.04 (matrix) or 0.01
(vector) seconds.

It's not doing a lot of extra iteration BTW ...
each iteration puts at least one more row/index
into the right place, and you must at least do
the comparison in the outer loop once per row
so that in the end the cost, after the grade,
is linear.  Whopping big linear coefficient,
though.

At the cost of more complexity, you could work
on blocks of the data, swapping the out-of-place
items in a block into place until there are 0
out-of-place items in the block, and then working
on the next block.  Note that in the limit, where
the block size is equal to the first axis length
of A, this amounts to A[i;] {<-} A and so the only
extra cost is the inversion of the permutation,
which is cheap.  This suggests that the gain from
blocking is likely to be substantial and you may
want to set a block size that's just small enough
to reliably avoid WS full.



Sun, 23 May 2004 11:01:05 GMT  
 Space efficient sorting of arrays
Hi Morten,
I have tried workspace sizes upto 128MB and the program usually falls over
once the array gets up around 120,000 rows.  The array I'm working with has
43 odd columns, so using your example, that would translate to 34.4MB (or
64.8MB required for sorting).

Part of the problem is that there are a number of other large variables that
are using up a decent amount of the workspace, and I'm sure I could increase
the size available for sorting by storing some of them to disk.  Up until
now though I'd been trying to stay away from that to keep things simple &
quick.
I'm using APL+Win version 3.6.
Regards,
Ric


Quote:
> Hi Ric,

> If you have 100,000 rows and (say) 10 columns of floating-point data (8
> bytes for each number), this array will consume 8Mb. APL will need 2
copies
> of the data while it reorders the array, so you will need a workspace in
> excess of 16Mb to complete the operation. If you use an external file, and
> write then read "chunks" of the array you can reduce the space requirement
> somewhat, but before we get into that, perhaps you can answer a couple of
> questions:

> 1) How big is your workspace?
> 2) Which APL system are you using?

> Regards,

> Morten Kromberg

> -----Original Message-----

> Gateway
> Sent: 5. december 2001 02:31

> Subject: Space efficient sorting of arrays


> Hi all,
> I've been trundling along fine up until now, sorting a 2d array by one of
> its columns as follows:
> myarray{<-}myarray[{gradeup}myarray[;3];]

> Now I'm trying to sort large arrays (>100,000 rows) and am running into
> WSFULL problems.
> Does anyone have any tips on how to make this sorting more efficient?

> Cheers,
> Ric Sherlock



Sun, 23 May 2004 11:02:46 GMT  
 Space efficient sorting of arrays
1) If you have 43 columns, am i right when assuming you have lots of mixed
data types? If you have only numeric data, did you know that the most
space-consuming data type (8-byte float) sets the required space consumption
for every other cell? I.e. if you have 10 by 10 of integers between 0 and
127, the array size is 120 bytes (Dyalog APL). Then, if you set the value of
any single cell to a decimal value, the size increases to 820 bytes... just
because of that single float. Solution: Split up the matrix into a few
separate ones by column data type. Sort them separately. But note that this
may apply if you have text data in the matrix as well, or other levels of
enclosure.

2) Maybe better solution: Do not sort at all. Just keep the sort order
vector around and interact with the matrix through that. This is a very thin
layer of rearranging. To retrieve the first 5 rows from the 'sorted' matrix,
have something like:

M[order[1 2 3 4 5];]

You may have to update the sort vector frequently, but if you sort by one
single column, that should be fast enough.
/ Tomas



Quote:
> Hi all,
> I've been trundling along fine up until now, sorting a 2d array by one of
> its columns as follows:
> myarray{<-}myarray[{gradeup}myarray[;3];]

> Now I'm trying to sort large arrays (>100,000 rows) and am running into
> WSFULL problems.
> Does anyone have any tips on how to make this sorting more efficient?

> Cheers,
> Ric Sherlock



Sun, 23 May 2004 18:07:49 GMT  
 Space efficient sorting of arrays

Quote:

> Hi all,
> I've been trundling along fine up until now, sorting a 2d array by one of
> its columns as follows:
> myarray{<-}myarray[{gradeup}myarray[;3];]

> Now I'm trying to sort large arrays (>100,000 rows) and am running into
> WSFULL problems.
> Does anyone have any tips on how to make this sorting more efficient?

> Cheers,
> Ric Sherlock

These days 100,000 rows is a pretty `small' problem. Aren't you in danger of
wasting a lot of time and energy trying to do this in APL when it would be
a very short task in lots of other languages/environments?


Sun, 23 May 2004 21:53:05 GMT  
 Space efficient sorting of arrays
(Forgot a small but important "not" in my previous post)



Quote:
> 1) If you have 43 columns bla bla bla...
> But note that this
> may ***NOT*** apply if you have text data in the matrix as well, or other
levels of
> enclosure.



Mon, 24 May 2004 00:26:46 GMT  
 Space efficient sorting of arrays
As Tomas Gustafsson has pointed out, the problem is actually not the upgrade
or "sort", which only uses one of your 43 columns, but the subsequent
reordering of the array, which makes a complete copy of the array. You have
a number of choices:

1) Find more workspace by purchasing more memory, or getting rid of some of
the other things in the workspace.

2) If possible, use Tomas' idea of partitioning the array by data type. If
some of your columns are NOT floating point, but could be represented as
integers or even booleans, there may be a substantial reduction in the
memory requirement (and the speed of many operations) if you split it into
separate Boolean, Integer and Floating arrays.

3) Also suggested by Tomas suggests, leave the array as it is and use the
"order" index to reorder bits of it as required by your application.

4) The quick and dirty solution, if none of the above appeal: Reorder your
array bit by bit, for example:

   order <- {upgrade} Data[;3]
   :For i :In {iota}43
        Data[;i] <- Data[order;i]
   :EndFor

... this will only require duplication of one column at a time. Just make
sure to reorder ALL of the columns :).

5) Combinations of the above...

Good luck! / Morten

Quote:
-----Original Message-----

Gateway
Sent: 5. december 2001 04:35

Subject: Re: Space efficient sorting of arrays


Hi Morten,
I have tried workspace sizes upto 128MB and the program usually falls over
once the array gets up around 120,000 rows.  The array I'm working with has
43 odd columns, so using your example, that would translate to 34.4MB (or
64.8MB required for sorting).

Part of the problem is that there are a number of other large variables that
are using up a decent amount of the workspace, and I'm sure I could increase
the size available for sorting by storing some of them to disk.  Up until
now though I'd been trying to stay away from that to keep things simple &
quick.
I'm using APL+Win version 3.6.
Regards,
Ric



Mon, 24 May 2004 05:28:59 GMT  
 Space efficient sorting of arrays
Thank you all for your input,

David is probably right that there are other programs that could do this
relatively easily - however I don't think that I personally would save much
in terms of time and effort by trying to work out how to interface to some
other solution.

The block sorting (scan, search, merge) approach suggested by both Phil &
Mike sounds like it would be the "best" APL-based approach, but I confess
does sound a little more complicated than I wanted to get.

I haven't quite got my head around what Mike's code is actually doing yet
so, I'll have to investigate that further.

I will keep in mind the use the "order" index for other situations, but I
don't think it would work in this instance.

Which leaves me with the "sort one column at a time" or by datatype
approaches - these seem nice & simple but effective.  If I can sort out a
way of automatically working out which columns store only integers & which
store floating point numbers, I'll use the datatype approach, otherwise I'll
use the column at a time approach.

Thanks again for everyone's help,

Ric



Mon, 24 May 2004 11:03:19 GMT  
 Space efficient sorting of arrays

Quote:

> Thank you all for your input,

> David is probably right that there are other programs that could do this
> relatively easily - however I don't think that I personally would save much
> in terms of time and effort by trying to work out how to interface to some
> other solution.

At worst, write out your data and read it back in ascii.

I can understand, but perhaps it is worthwhile just establishing some
simple facts so you know what kind of a tradeoff you are actually making.

Just for test purposes I created a 24 MB file that had 780,000+ lines of junk
data. This file loaded into K in 2.9 sec, and it `sorted' in 9.2 sec. Writing
the result took 2.1sec.

I obviously have no idea how this compares with your real problem, but it
does suggest that a `solution' in K costs a grand total of less than 15 seconds,
The program, of course, is terse (about 30 chars).



Mon, 24 May 2004 11:54:19 GMT  
 Space efficient sorting of arrays
But... the problem here is NOT the sorting, it is reordering the array. So
unless a "piece by piece" approach is taken, formatting the array in order
to write it to an ASCII file is almost definitely going to WS FULL even
earlier than the current algorithm.

The general problem is that, once your array size is something like 20% of
available workspace, any APL algorithms must be built with care, since many
APL expressions will cause 2-4 copies of the data to be made, sometimes
more.

Alas, as soon as you start to split the array up, your life becomes more
complicated. WS FULL remains my least favourite error message.

Note that some APL systems (I know this is true for at least Dyalog APL and
J) allow you to have arrays which are external to the workspace, mapped to a
file. In theory, APL just slows down when working on these arrays, although
I suspect there are some limits to what sort of operations you can perform.

/ Morten

Quote:
-----Original Message-----

Gateway
Sent: 6. december 2001 05:30

Subject: Re: Space efficient sorting of arrays



> Thank you all for your input,

> David is probably right that there are other programs that could do this
> relatively easily - however I don't think that I personally would save
much
> in terms of time and effort by trying to work out how to interface to some
> other solution.

At worst, write out your data and read it back in ascii.

I can understand, but perhaps it is worthwhile just establishing some
simple facts so you know what kind of a tradeoff you are actually making.

Just for test purposes I created a 24 MB file that had 780,000+ lines of
junk
data. This file loaded into K in 2.9 sec, and it `sorted' in 9.2 sec.
Writing
the result took 2.1sec.

I obviously have no idea how this compares with your real problem, but it
does suggest that a `solution' in K costs a grand total of less than 15
seconds,
The program, of course, is terse (about 30 chars).



Tue, 25 May 2004 02:39:01 GMT  
 Space efficient sorting of arrays

Quote:

> But... the problem here is NOT the sorting, it is reordering the array. So
> unless a "piece by piece" approach is taken, formatting the array in order
> to write it to an ASCII file is almost definitely going to WS FULL even
> earlier than the current algorithm.

> The general problem is that, once your array size is something like 20% of
> available workspace, any APL algorithms must be built with care, since many
> APL expressions will cause 2-4 copies of the data to be made, sometimes
> more.

> Alas, as soon as you start to split the array up, your life becomes more
> complicated. WS FULL remains my least favourite error message.

> Note that some APL systems (I know this is true for at least Dyalog APL and
> J) allow you to have arrays which are external to the workspace, mapped to a
> file. In theory, APL just slows down when working on these arrays, although
> I suspect there are some limits to what sort of operations you can perform.

I am at a bit of a loss about how to respond, particularly since my use of
English doesn't find much distinction between `sorting' and `reordering' in
this context.

But let me resist the temptation to respond line by line by just summarizing my
view that the kind of problem presented here is likely to be a `15 sec problem',
and if it causes _any_ consternation or thought about computational strategy, you
are probably working in the wrong environment. I well remember the days that
sorting a 300k file pushed the edge of PC technology, but those days are _long_
gone, and now <100MB problems are often `small'. So if a 200K line 40Mb task is
causing you a problem, there is at least a prima facie case that you are causing
yourself trouble by choosing an environment that doesn't fit well.

A measure of how well an environment `fits' a problem is how much tangential work
it forces you to do. In this case, it sounded to me---at least at the superficial
level of knowledge we have of the problem here---like we were seeing a mis-fit.



Tue, 25 May 2004 04:03:38 GMT  
 Space efficient sorting of arrays

Quote:

> A measure of how well an environment `fits' a problem is how much
tangential work
> it forces you to do. In this case, it sounded to me---at least at the
superficial
> level of knowledge we have of the problem here---like we were seeing a

mis-fit.

Perhaps I misunderstood what you were suggesting - I though you were
suggesting that an external sorting tool would help Ric avoid WS FULL. All I
was saying was that, as far as I can see, this suggestion does not solve the
fundamental issue of the array being a bit too big for "wholesale
manipulation"; the code to do the external sort would essentially encounter
the same issues.

In other words, the effort required to export and reimport the data without
encountering WS FULL seems to be at least as great as using some of the
other solutions, which all have much lower operational complexity.

We don't have enough information about what Ric is tryong to do with his
data do judge whether it makes any sense for him to consider a new
environment for his entire application. He did say he was doing fine except
for this problem. Somehow, I suspect that he is trying to do more than sort
16Mb of data.

/ Morten



Tue, 25 May 2004 05:45:15 GMT  
 Space efficient sorting of arrays

Quote:

> Perhaps I misunderstood what you were suggesting - I though you were
> suggesting that an external sorting tool would help Ric avoid WS FULL. All I
> was saying was that, as far as I can see, this suggestion does not solve the
> fundamental issue of the array being a bit too big for "wholesale
> manipulation"; the code to do the external sort would essentially encounter
> the same issues.

I think you understood me correctly. But what you say suggests to me a `poor'
environment. If it costs a whole lot to `roll-out' and `roll-in' I'd pass on
using the environment at all.

Quote:
> In other words, the effort required to export and reimport the data without
> encountering WS FULL seems to be at least as great as using some of the
> other solutions, which all have much lower operational complexity.

Strikes me as a rather awful limitation of a programming environment.
I go into and out of environments all the time, generally at a trivial cost
(like a second or two to read/write 50mb files), thus allowing me the luxury
of using perl to handle perl tasks, k to handle k tasks etc. perl, k, j are
all pretty good at doing this kind of stuff. I can't afford any of the
fancy apls so I don't know how good they are at this, but I'd sure hope
you could get out and get back in easily.

Quote:
> We don't have enough information about what Ric is tryong to do with his
> data do judge whether it makes any sense for him to consider a new
> environment for his entire application. He did say he was doing fine except
> for this problem. Somehow, I suspect that he is trying to do more than sort
> 16Mb of data.

I sure hope so. So far the problem sounds _way_ too easy to run into _any_ problem.


Tue, 25 May 2004 06:51:32 GMT  
 Space efficient sorting of arrays

Quote:

> I think you understood me correctly. But what you say suggests to me a
`poor'
> environment. If it costs a whole lot to `roll-out' and `roll-in' I'd pass
on
> using the environment at all.

It is indeed trivial to roll in and roll out. But - as far as I can see - it
does not solve the original problem which was posed to the newsgroup by Ric
Sherlock.

Maybe I'm a bit slow today. Perhaps your point is actually "If you wanted to
go there, you wouldn't be starting from here"?

:-)

?

Morten Kromberg



Tue, 25 May 2004 08:24:25 GMT  
 
 [ 24 post ]  Go to page: [1] [2]

 Relevant Pages 

1. Space efficient sorting of arrays

2. J/APL memory use [was Re: SV: Space efficient sorting of arrays]

3. J/APL memory use [was Re: SV: Space efficient sorting of

4. re efficient sorting of arrays

5. space-efficient top-N algorithm

6. Anybody know a fast efficient sorting Algorithm

7. GC efficient sort

8. Array#sort! returns nil when array empty

9. Efficient Mutable Arrays

10. Is Ruby Array#shift/unshift Efficient?

11. macro for efficient array operations

12. An efficient way to pass 3-D arrays to functions

 

 
Powered by phpBB® Forum Software