Am I on the correct track? 
 Am I on the correct track?

Quote:

> Hi All

> I have a list of 100,000 people.
> Some eat peanuts, some eat cashew nuts and peanuts, some only eat
> cashew nuts, some eat Brazil nuts....
> I started by declaring a byte variable called nuts. I then
> created the following masks:

> Peanuts     EQUATE(00000001b)
> Cashew      EQUATE(00000010b)
> Brazilian   EQUATE(00000100b)
> Almond      EQUATE(00001000b)
> Walnut      EQUATE(00010000b)
> etc

> Put them all together in a variable using the BOR function.
> Assuming there are say 8 types of nuts, I can easily filter the
> nuts variable in a browse with the BAND function, but that would mean
> going through each record. I think that would take too long.
> Creating 8 keys would be another option, where a range could come into
> effect as well, but I think it would be wasteful to have a key just
> for this.
> Any ideas?

> Regards
> Ric

Ric,

I suggest approaching the problem with three files in mind: one for people
info, another for the types of nuts, and finally a file which will hold a
PeopleID and a NutsID. This third file will have two keys, one on PeopleID
and the other on NutsID. Then you will be able to know who is eating what
and what a particular person eats.



Sun, 10 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Ric,

Re the topic: Yes you are <g>.

The two alternatives you listed are correct. The following is a list of all
alternatives that I could think of (including yours).

Alternative 1: Filter it. Problem: Slow.

Alternative 2: Have many keys. Problem: needs n! keys to cover every sort
order of n features (2! = 2*1, 3! = 3*2*1, 4! = 4*3*2*1, etc. - this is
worse than exponential growth).

Alternative 3: A combination of 1 and 2: Have a key for those features (or
feature combinations) that have the best (most uniform) value distribution.
Filter the remaining conditions. Problem: Requires some coding that is
specific to the actual distribution of keys. That distribution may not be
known before the program is actually in use, or it may change over time.

Alternative 4: Have a key on each feature. Expand the browser templates to
allow a range on each key that is present. Problem: Requires some heavy
rewriting of template code that is already complicated enough.

Alternative 5: BOR the features and have a single key. This is essentially
the same as Alternative 4, but requires only one key. However, you have to
do some heavy bit manipulation to find the proper range limits. Plus the
number of bits that can be conveniently handled is limited to size(long) =
32, so you have some built-in restrictions.
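The "proper range limits" in Alternative 5 are awkward because, with a single key on the mask field, the records that have a given bit set do not form one contiguous key range but a union of ranges. A small sketch (in Python for illustration; the thread's code is Clarion, and `ranges_with_bit_set` is a hypothetical helper) of how those ranges fall out:

```python
def ranges_with_bit_set(bit, width=8):
    """Return the contiguous [low, high] ranges of width-bit key values
    that have the given bit set."""
    block = 1 << bit        # length of each run of values with the bit set
    period = block * 2      # the bit toggles every `block` values
    ranges = []
    for start in range(block, 1 << width, period):
        ranges.append((start, start + block - 1))
    return ranges

# The top bit is set in one contiguous range...
print(ranges_with_bit_set(7))       # [(128, 255)]
# ...but the bottom bit is set in every other value: 128 separate ranges.
print(len(ranges_with_bit_set(0)))  # 128
```

So a range-based browse on the mask key works nicely for high bits and degenerates toward a full scan for low bits, which is exactly the "heavy bit manipulation" cost.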

HTH
-Joachim
-----------------------------------------------------------------------
Joachim Durchholz, +49(9132)82-3150, speaking for himself, not for INA.
KOA, INA Werk Schaeffler, Industriestr.1-3, D-91074 Herzogenaurach



Sun, 10 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Hi All

I have a list of 100,000 people.
Some eat peanuts, some eat cashew nuts and peanuts, some only eat
cashew nuts, some eat Brazil nuts....
I started by declaring a byte variable called nuts. I then
created the following masks:

Peanuts     EQUATE(00000001b)
Cashew      EQUATE(00000010b)
Brazilian   EQUATE(00000100b)
Almond      EQUATE(00001000b)
Walnut      EQUATE(00010000b)
etc

Put them all together in a variable using the BOR function.
Assuming there are say 8 types of nuts, I can easily filter the
nuts variable in a browse with the BAND function, but that would mean
going through each record. I think that would take too long.
Creating 8 keys would be another option, where a range could come into
effect as well, but I think it would be wasteful to have a key just
for this.
Any ideas?

Regards
Ric
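The scheme Ric describes maps directly onto ordinary bitwise operators. A minimal sketch in Python (hypothetical data; Clarion's BOR and BAND correspond to `|` and `&`):

```python
# One bit per nut type, as in the EQUATEs above.
PEANUTS   = 0b00000001
CASHEW    = 0b00000010
BRAZILIAN = 0b00000100
ALMOND    = 0b00001000
WALNUT    = 0b00010000

# One byte per person, flags combined with | (Clarion: BOR).
people = {
    "Ann":  PEANUTS | CASHEW,
    "Bob":  BRAZILIAN,
    "Cleo": PEANUTS | ALMOND | WALNUT,
}

def eats(mask, nut):
    """Clarion equivalent: BAND(nuts, nut) <> 0"""
    return mask & nut != 0

# Filtering a browse this way means testing every record:
peanut_eaters = [name for name, nuts in people.items() if eats(nuts, PEANUTS)]
print(peanut_eaters)  # ['Ann', 'Cleo']
```

The storage is compact (one byte covers 8 nut types), but, as the thread goes on to discuss, the filter is a sequential scan over all 100,000 records.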



Sun, 10 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Quote:
> Thanks for your reply. If I understand you correctly, this 3rd file
> could end up with more than 500,000 records, as most people eat more
> than one type of nut.

That's the nature of an N-to-N relation.  If ALL the people liked ALL the
kinds of nuts, then the third file would contain (number of nuts) * (number
of people) records.

Quote:
> Maybe having as many keys as nut types would
> still result in less storage space?

Na, you don't want to do that.

Tom Ruby



Mon, 11 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Quote:


> >I suggest approaching the problem with three files in mind: one for people
> >info, another for the types of nuts, and finally a file which will hold a
> >PeopleID and a NutsID. This third file will have two keys, one on PeopleID
> >and the other on NutsID. Then you will be able to know who is eating what
> >and what a particular person eats.

> Thanks for your reply. If I understand you correctly, this 3rd file
> could end up with more than 500,000 records, as most people eat more
> than one type of nut. Maybe having as many keys as nut types would
> still result in less storage space?
> I still think there must be another way, maybe using some tricks?

> Regards
> Ric

Ric,

The third file will simply have two fields: PeopleID and NutsID. There will
be two keys only, one on PeopleID and the other on NutsID.

The file layouts could look like this

People          file
kPeopleID         key(PeopleID)
record            record
PeopleID            long
Name                string(25)
etc...
                  end
                end

Nuts            file
kNutsID           key(NutsID)
record            record
NutsID              long
Description         string(15)
                  end
                end

NutsAndPeople   file
kPeopleID         key(PeopleID)
kNutsID           key(NutsID)
record            record
PeopleID            long
NutsID              long
                  end
                end

I understand that using BOR and BAND can save some space, but this is, in my
opinion, at the expense of flexibility. This way, the number of different
kinds of nuts is not important; it will always work, as long as you give a
unique ID to the nuts as well as the people.
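This three-file layout is the classic many-to-many design. As an illustration only (the thread targets Clarion/TopSpeed files, not SQL), the same schema and the two lookups the post describes can be sketched in SQLite:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE People (PeopleID INTEGER PRIMARY KEY, Name TEXT);
    CREATE TABLE Nuts   (NutsID   INTEGER PRIMARY KEY, Description TEXT);
    CREATE TABLE NutsAndPeople (PeopleID INTEGER, NutsID INTEGER);
    -- the two keys of the third file
    CREATE INDEX kPeopleID ON NutsAndPeople (PeopleID);
    CREATE INDEX kNutsID   ON NutsAndPeople (NutsID);
""")
con.executemany("INSERT INTO People VALUES (?, ?)", [(1, "Ric"), (2, "Tom")])
con.executemany("INSERT INTO Nuts VALUES (?, ?)", [(1, "Peanut"), (2, "Cashew")])
con.executemany("INSERT INTO NutsAndPeople VALUES (?, ?)",
                [(1, 1), (1, 2), (2, 2)])

# "Who is eating what": everybody who eats nut 2 (Cashew).
rows = con.execute("""
    SELECT p.Name FROM People p
    JOIN NutsAndPeople np ON np.PeopleID = p.PeopleID
    WHERE np.NutsID = 2 ORDER BY p.Name
""").fetchall()
print([r[0] for r in rows])  # ['Ric', 'Tom']
```

The reverse lookup ("what a particular person eats") is the mirror-image join through kPeopleID.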



Mon, 11 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Quote:

>I suggest approaching the problem with three files in mind: one for people
>info, another for the types of nuts, and finally a file which will hold a
>PeopleID and a NutsID. This third file will have two keys, one on PeopleID
>and the other on NutsID. Then you will be able to know who is eating what
>and what a particular person eats.

Thanks for your reply. If I understand you correctly, this 3rd file
could end up with more than 500,000 records, as most people eat more
than one type of nut. Maybe having as many keys as nut types would
still result in less storage space?
I still think there must be another way, maybe using some tricks?

Regards
Ric



Mon, 11 Jan 1999 03:00:00 GMT  
 Am I on the correct track?


Quote:
>> Thanks for your reply. If I understand you correctly, this 3rd file
>> could end up with more than 500,000 records, as most people eat more
>> than one type of nut.
>That's the nature of an N-to-N relation.  If ALL the people liked ALL the
>kinds of nuts, then the third file would contain (number of nuts) * (number
>of people) records.
>> Maybe having as many keys as nut types would
>> still result in less storage space?
>Na, you don't want to do that.
>Tom Ruby

Hi

That's what I was afraid of.
I am looking to have a single browse with, say, 15 tabs, one for each nut
type. My different options are:

Option 1
Filtering is very slow, so that's out, except maybe for reports.

Option 2
Create a 2nd and 3rd file. Disadvantage: lots of extra storage space is
needed, because most of the people eat most of the nuts. This would
introduce lookups, and if you want to search for people's names with a
locator, you need to introduce an alphabetical element into the 3rd file's
key, which again means more overhead?

Surely there must be another way?

Regards
Ric



Mon, 11 Jan 1999 03:00:00 GMT  
 Am I on the correct track?


Quote:
> Hi All

> I have a list of 100,000 people.
> Some eat peanuts, some eat cashew nuts and peanuts, some only eat
> cashew nuts, some eat Brazil nuts....
> I started by declaring a byte variable called nuts. I then
> created the following masks:

> Peanuts     EQUATE(00000001b)
> Cashew      EQUATE(00000010b)
> Brazilian   EQUATE(00000100b)
> Almond      EQUATE(00001000b)
> Walnut      EQUATE(00010000b)
> etc

> Put them all together in a variable using the BOR function.
> Assuming there are say 8 types of nuts, I can easily filter the
> nuts variable in a browse with the BAND function, but that would mean
> going through each record. I think that would take too long.
> Creating 8 keys would be another option, where a range could come into
> effect as well, but I think it would be wasteful to have a key just
> for this.
> Any ideas?

> Regards
> Ric

Well, you've been worried about the size of the "cross reference" file
which maps people to nuts...  You said it'd have like 500,000 records.
Assuming each person likes on average 4 kinds (a wild assumption), that
makes 125,000 people.

Let's see just how big a .tps file that makes...

I made an application to write person-nut records.  The file has a person
number (ULONG), and a nut number (BYTE).  One index on person number, one
on nut number.  It assigned person numbers sequentially, and for each of
the 8 possible nut numbers, decided randomly to put in a record or not.

It made 500,592 records and the file is 7,468,032 bytes.  It took just a
few minutes (I didn't time it).  It was a 16 bit program.
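The generation logic Tom describes can be sketched in a few lines (Python here, for illustration; his program was a Clarion app writing a .tps file, and the seed and 50% coin-flip are assumptions):

```python
import random

random.seed(42)  # reproducible run; Tom's exact counts will differ

records = []     # (person number ULONG, nut number BYTE) pairs
person = 0
while len(records) < 500_000:
    person += 1
    # For each of the 8 possible nut numbers, randomly decide
    # whether to put in a record or not.
    for nut in range(8):
        if random.random() < 0.5:
            records.append((person, nut))

# With a 50% chance per nut, that's ~4 records per person, so roughly
# 125,000 people for 500,000 records - matching the arithmetic above.
print(len(records), person)
```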

I'll attach a nuts.zip (3k) if anybody's interested.

Tom Ruby

P.S.  Unfortunately, my internet connection timed out while I was doing
this experiment, and I don't seem to be able to reconnect.  You may never
see this.




Fri, 15 Jan 1999 03:00:00 GMT  
 Am I on the correct track?

Alternative 6:  Since the number of nuts is so limited, he could make a
separate "third" file for each nut listing the people.

Tom Ruby



Fri, 15 Jan 1999 03:00:00 GMT  
 Am I on the correct track?


Quote:
> Hi

> That's what I was afraid of.
> I am looking to have a single browse with, say, 15 tabs, one for each nut
> type. My different options are:

> Option 1
> Filtering is very slow, so that's out, except maybe for reports.

> Option 2
> Create a 2nd and 3rd file. Disadvantage: lots of extra storage space is
> needed, because most of the people eat most of the nuts. This would
> introduce lookups, and if you want to search for people's names with a
> locator, you need to introduce an alphabetical element into the 3rd file's
> key, which again means more overhead?

> Surely there must be another way?

> Regards
> Ric

Well, since you have a restricted number of nuts (sounds like a
psychological problem to me), you could...

Make a separate file for each kind of nut, listing just the people who
like that nut.

This could get tricky, but perhaps, since most customers like most of the
nuts, you could just record the ones they DON'T like?

Your "likes" file could be kept fairly small, 10 MB or so, by including
just a 32 bit person identifier and a 1 byte nut identifier.  Now, to
list alphabetically everybody who likes a specific nut, you list the
CUSTOMERS alphabetically, using a lookup from the "likes" file in the
filter.

This all would be pretty easy to do in 1 powerbrowse.  Use the "CW Tabs"
sort order option.  Make all 15 sort by the same key, but include
different filter expressions.

Now, thinking about it, since your application is a little bit of a
"special case" in that most of the people would be included in most of the
lists, you'd probably have reasonable performance if you put a boolean for
each nut type in the customer record, and used that in a filter.

Here's the issue with filters, and why they're often slow...

A table is just a list.  An index imposes an order on the list.  If you
don't have an index, and you want to find the 5 people out of the 50,000
where BrazilNut = TRUE, the program will have to look at all 50,000
records to see if BrazilNut is TRUE.

If you have an index on BrazilNut, it would sort the FALSE records from
the TRUE records.  Your program could do a search to find the first
person where BrazilNut = TRUE, and step through them till it gets one
where BrazilNut = FALSE, or hits the end of the file.  The program can
produce the list with only 6 record retrievals (the 5 TRUEs and the first
FALSE).  In most B-Tree implementations, this index would degenerate to a
linked list of all the TRUEs and a linked list of all the FALSEs, which is
pretty efficient.  Of course, if you then wanted them alphabetically
sorted, you would add the name to each record's index entry, making it
LOTS bigger.

If you're doing the reverse query, and trying to retrieve the 49,995
records where BrazilNut = FALSE, then a filter is very reasonable, as the
program only has to retrieve 5 extra records.

Since you're looking at a situation where 25,000 or more of these 50,000
records would have BrazilNut = TRUE, you've probably got a good candidate
for a filter lookup since the skipped records are a fairly small
percentage of the retrieved records.  In fact, it might be faster than the
overhead of looking up a record in another file using an index.
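Tom's argument boils down to counting record retrievals. A toy cost model (Python, an illustration only; real lookups add per-record index overhead, which is his closing point):

```python
def filter_cost(total_records):
    """A filter is a sequential scan: it reads every record."""
    return total_records

def index_cost(matching_records):
    """An index on the boolean steps through the TRUEs
    until the first FALSE: matches + 1 retrievals."""
    return matching_records + 1

# 5 BrazilNut lovers out of 50,000: the index wins by a huge margin.
assert index_cost(5) == 6              # "only 6 record retrievals"
assert filter_cost(50_000) == 50_000

# 25,000 matches out of 50,000: the filter reads only about 2x the
# matching records, so the index's per-lookup overhead can make the
# plain filter competitive.
print(filter_cost(50_000) / index_cost(25_000))  # roughly 2.0
```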

SQL jockeys often forget this.  Remember, the server has to go through
the same process the client would in a non-SQL setup.

Rats, my stinking internet connection disconnected and won't reconnect...

Tom Ruby



Mon, 18 Jan 1999 03:00:00 GMT  
 