Large-scale file reads/writes and efficiency 

Hello again,

I'm still working on this same project, and I've recently hit a major
conceptual snag.  This relates to perl only in that perl is what the
project itself is written in, so this may not be the right place to ask, but:

The project stores the bulk of its data (user-input updates / journal
entries / what-have-you) in distinct files, each of which represents
the happenings on one specific date, as given by the filename.

In other words, the file 19990101 contains the entry for 01 Jan 1999;
19990102 is 02 Jan 1999, etc.

The problem is this: I find it necessary to take some data from each
file in order to give the user useful information about these files.  I
liken this to a rudimentary search engine.

The problem is, I would like to avoid building a "lookup table" that
contains all of the relevant information about all entries, because
this would be an inefficient use of disk space.

However, since the entire thing is being accessed from the web, parsing
through all of the files to retrieve the data is an inefficient use of
processor time and may cause unnecessary wear and tear on the webhost's
disks.

To give some idea as to the scope of the problem, we're talking about a
maximum of 1000 entries (roughly three years of writing every day, or
five years writing every other), and entries with an average of, say,
5K (my average is 2K).

Worst case, we're talking about 5MB of files.  If, say, 100 people are
reading on a daily basis (uh, yeah right, but anyway), then the actual
data the script is sifting through shoots to 500MB a day.

Now granted, this is all worst-case scenario; there are things that
would cut the data sifting in half or even almost eliminate it that I'm
not taking into consideration.

The point is, how efficient is perl's filehandling?  Would it show an
obvious performance hit on the webserver if I were opening and closing
files repeatedly?  No writes would be done, only reads.

I don't think it's possible to go to a database solution, which would
be the logical choice.

Is it possible to run a CGI script as a daemon?  Would that be
applicable here?

--
-Stephen Deken, type designer and general geek




 Large-scale file reads/writes and efficiency
On Tue, 30 Nov 1999 02:11:41 GMT, Stephen Deken wrote:

Quote:
> The problem is this: I find it necessary to take some data from each
> file in order to give the user useful information about these files.  I
> liken this to a rudimentary search engine.

> The problem is, I would like to avoid building a "lookup table" that
> contains all of the relevant information about all entries, because
> this would be an inefficient use of disk space.

Well... you may _want_ to avoid that, but it still is the better
solution to your problem.

Quote:
> However, since the entire thing is being accessed from the web, parsing
> through all of the files to retrieve the data is an inefficient use of
> processor time and may cause unnecessary wear and tear on the webhost's
> disks.

Indeed. It's slow, wastes CPU cycles and I/O. The last one is the
worst.

Quote:
> To give some idea as to the scope of the problem, we're talking about a
> maximum of 1000 entries (roughly three years of writing every day, or
> five years writing every other), and entries with an average of, say,
> 5K (my average is 2K).

So, you'd need one index file with 1000 entries of roughly 80
characters each? That's about 80 kB. Double that for a bit more
information per entry and you'd have 160 kB. Hardly much.
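For example, a small script run from cron (or whenever an entry is
added) could rebuild the whole index in one pass. This is only a rough,
untested sketch; the directory layout, the index file name and the
tab-separated "date, size, first line" record format are all my own
assumptions:

    #!/usr/bin/perl -w
    # Rebuild a one-line-per-entry index of the dated journal files.
    # Assumes entries live in ./entries and are named YYYYMMDD.
    use strict;

    my $dir = 'entries';
    opendir my $dh, $dir or die "opendir $dir: $!";
    my @files = sort grep { /^\d{8}$/ } readdir $dh;
    closedir $dh;

    open my $idx, '>', "$dir/index.txt" or die "write index: $!";
    for my $file (@files) {
        open my $fh, '<', "$dir/$file" or die "open $file: $!";
        my $title = <$fh>;           # use the first line as a summary
        close $fh;
        $title = '' unless defined $title;
        chomp $title;
        print $idx join("\t", $file, -s "$dir/$file", $title), "\n";
    }
    close $idx;

Your CGI script then reads that one small file instead of touching a
thousand entry files on every hit.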

Quote:
> Worst case, we're talking about 5MB of files.  If, say, 100 people are
> reading on a daily basis (uh, yeah right, but anyway), then the actual
> data the script is sifting through shoots to 500MB a day.

The amount of data is less relevant than the number of files. Sure, a
large file system cache will make it less horrible, but the number of
system calls to open and close hundreds of files is really large.

Quote:
> Now granted, this is all worst-case scenario; there are things that
> would cut the data sifting in half or even almost eliminate it that I'm
> not taking into consideration.

> The point is, how efficient is perl's filehandling?  Would it show an
> obvious performance hit on the webserver if I were opening and closing
> files repeatedly?  No writes would be done, only reads.

Perl will be mostly as efficient as the file system is. And for each
file the system has to do quite a bit of work: it has to resolve the
path, stat the file, open the file, and so on. That's a lot of system
calls and I/O. Since you seem to have up to about 1000 files (three
years' worth?), that is an enormous amount of work.

In general, the really expensive things you want to avoid are file
opens.
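
If you want to convince yourself of that, a quick Benchmark comparison
along these lines would show it. Untested sketch again, reusing the
made-up entries/ layout and index.txt file from above:

    #!/usr/bin/perl -w
    # Compare slurping every dated entry file against reading the
    # single pre-built index file.
    use strict;
    use Benchmark qw(timethese);

    timethese(10, {
        many_opens => sub {
            for my $file (glob 'entries/????????') {
                open my $fh, '<', $file or die "open $file: $!";
                my @lines = <$fh>;
                close $fh;
            }
        },
        one_index => sub {
            open my $fh, '<', 'entries/index.txt' or die "open index: $!";
            my @lines = <$fh>;
            close $fh;
        },
    });

The difference is almost entirely in the opens and stats, not in the
amount of data read.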

Quote:
> I don't think it's possible to go to a database solution, which would
> be the logical choice.

Well... you don't have to install an RDBMS to have a database :) A
simple index file, or even two or three index files indexed in
different ways, would get you a long way towards cataloguing your
data efficiently.
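
As one concrete (and purely illustrative) possibility, Perl's bundled
dbm support gives you a keyed index with no external database at all.
The file name and record format here are invented for the example:

    #!/usr/bin/perl -w
    # A dbm file tied to a hash, keyed by date, holding a short
    # tab-separated summary of each entry.
    use strict;
    use Fcntl;
    use SDBM_File;

    tie my %by_date, 'SDBM_File', 'entries/by_date', O_RDWR|O_CREAT, 0644
        or die "tie: $!";

    # Writing (done once, whenever an entry is added or changed):
    $by_date{'19990101'} = "New Year's resolutions\t2048";

    # Reading (what the CGI script does on each hit):
    my ($title, $size) = split /\t/, $by_date{'19990101'};
    print "$title ($size bytes)\n";

    untie %by_date;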

You could also look at one of the free text search engines out there,
which may be able to do most of what you want. It's not entirely clear
what sort of indexing you want to provide.

Quote:
> Is it possible to run a CGI script as a daemon?  Would that be
> applicable here?

No. You can't run a CGI program as a daemon. You can, however, run a
program as a daemon and have your CGI processes contact that. The
daemon would build an index, or multiple indexes, in memory or in some
fast-access (and/or mmapped) files, and serve lookups from those. You'd
still want index files, though.
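
Purely as an illustration of the shape of that, with the socket path
and the trivial one-line query protocol invented for the example:

    #!/usr/bin/perl -w
    # Toy daemon: load the index once, then answer date lookups over a
    # Unix-domain socket so the CGI never has to open the entry files.
    use strict;
    use Socket qw(SOCK_STREAM);
    use IO::Socket::UNIX;

    my %index;
    open my $idx, '<', 'entries/index.txt' or die "open index: $!";
    while (<$idx>) {
        chomp;
        my ($date, $size, $title) = split /\t/, $_, 3;
        next unless defined $title;   # skip malformed lines
        $index{$date} = "$size\t$title";
    }
    close $idx;

    unlink '/tmp/journal.sock';
    my $server = IO::Socket::UNIX->new(
        Type   => SOCK_STREAM,
        Local  => '/tmp/journal.sock',
        Listen => 5,
    ) or die "listen: $!";

    while (my $client = $server->accept) {
        my $date = <$client>;
        next unless defined $date;
        chomp $date;
        my $reply = exists $index{$date} ? $index{$date} : '';
        print $client "$reply\n";
        close $client;
    }

The CGI side is then just one connect per hit: open the socket, print
the date it wants, read back the answer. But this only pays off if the
plain index files turn out not to be fast enough on their own.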

The algorithms and techniques for all this stay the same regardless
of the programming language, and in the case of daemon processes they
can get very system-specific, so I won't go into those. I would advise
you to download one of the free search engines out there, and look at
how they do it.

Martien
--
Martien Verbruggen              |
Interactive Media Division      | If at first you don't succeed, try
Commercial Dynamics Pty. Ltd.   | again. Then quit; there's no use
NSW, Australia                  | being a damn fool about it.



