simple problem 
 simple problem

This is the problem:
I'm in Unix
I have a huge amount (1 GB) of data like this:

foobar|312,434,222
Hello |  315,423
World| 314

and I need to convert it to this result:

FOOBAR|312434222
HELLO|315423
WORLD|314

The conversion must satisfy these 3 rules:
1-all uppercase
2-delete all spaces near the "|" separator
3-delete all thousand separators

I think the best tools for resolving it are TR, SED[&|]AWK
can someone tell me the FASTEST complete solution to perform the task
please?

byeee
Giovanni




Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:

>This is the problem:
>I'm in Unix
>I have a huge amount (1 GB) of data like this:

>foobar|312,434,222
>Hello |  315,423
>World| 314

>and I need to convert it to this result:

>FOOBAR|312434222
>HELLO|315423
>WORLD|314

>The conversion must satisfy these 3 rules:
>1-all uppercase
>2-delete all spaces near the "|" separator
>3-delete all thousand separators

>I think the best tools for resolving it are TR, SED[&|]AWK
>can someone tell me the FASTEST complete solution to perform the task
>please?

tr '[a-z]' '[A-Z]' < infile | sed 's/,//g;s/ *\| */|/' > outfile

which, with this infile:

foobar|312,434,222
Hello |  315,423
World| 314
Hello world|  315,423

produces this outfile

FOOBAR|312434222
HELLO|315423
WORLD|314
HELLO WORLD|315423

I noted you only wanted spaces "near" the | to be deleted.

Chuck Demas
Needham, Mass.

--
  Eat Healthy    |   _ _   | Nothing would be done at all,

  Die Anyway     |    v    | That no one could find fault with it.



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:


> > This is the problem:
> > I'm in Unix
> > I have a huge amount (1 GB) of data like this:

> > foobar|312,434,222
> > Hello |  315,423
> > World| 314

> > and I need to convert it to this result:

> > FOOBAR|312434222
> > HELLO|315423
> > WORLD|314

> > The conversion must satisfy these 3 rules:
> > 1-all uppercase
> > 2-delete all spaces near the "|" separator
> > 3-delete all thousand separators

> > I think the best tools for resolving it are TR, SED[&|]AWK
> > can someone tell me the FASTEST complete solution to perform the task
> > please?

> tr '[a-z]' '[A-Z]' < infile | sed 's/,//g;s/ *\| */|/' > outfile

> which, with this infile:

> foobar|312,434,222
> Hello |  315,423
> World| 314
> Hello world|  315,423

> produces this outfile

> FOOBAR|312434222
> HELLO|315423
> WORLD|314
> HELLO WORLD|315423

> I noted you only wanted spaces "near" the | to be deleted.

You missed the part about the "huge amount ... of data" and the
need for the "FASTEST complete solution." Processing the file twice,
once with tr and again with sed, will likely not be as fast as
processing it once with sed. And though you correctly noted that
only specific spaces (those surrounding vertical bars) should be
removed, you overlooked the possibility that commas, too, could
occur in places other than as "thousand[s] separators." For example,
this

    Hello, world|  315,423

should probably become this

    HELLO, WORLD|315423

and not this

    HELLO WORLD|315423

The following sed script is untested:

    s/ *\| */|/
    s/\([0-9]\),\([0-9][0-9]\)/\1\2/g
    y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

Globally replacing all occurrences of n,nn with nnn ain't perfect,
but it may be the simplest, fastest solution that doesn't wreak
havoc with commas that aren't thousands separators. (By the way,
removing thousands separators from numbers is an example of a text
substitution for which Perl's lookahead and lookbehind assertions
are useful; e.g.:  s/(?<=\d),(?=\d\d\d)//g  .)

It might be instructive to benchmark test these two functionally
equivalent sed scripts using the OP's real data and version of sed:

    Script 1           Script 2
    -----------        -----------
    s/ *\| */|/        s/  *\|/|/
                       s/\|  */|/

Whether Script 2 is faster than Script 1 depends primarily on
whether sed uses a DFA or an NFA regular expression engine.
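A minimal harness for that comparison might look like the sketch below. The sample records, the temporary files, and the function names script1 and script2 are assumptions made for illustration; the OP's real file would replace the generated one. The vertical bar is written unescaped here because a bare | is a literal in a POSIX basic regular expression, whereas GNU sed treats \| as alternation.

```shell
#!/bin/sh
# Sketch only: sample data and names are hypothetical stand-ins for the OP's file.

script1() { sed 's/ *| */|/'; }
script2() { sed -e 's/  *|/|/' -e 's/|  */|/'; }

sample=$(mktemp)
out1=$(mktemp)
out2=$(mktemp)

# Repeat a few representative records to get a file worth timing.
awk 'BEGIN {
    for (i = 0; i < 50000; i++) {
        print "foobar|312,434,222"
        print "Hello |  315,423"
        print "World| 314"
    }
}' > "$sample"

script1 < "$sample" > "$out1"
script2 < "$sample" > "$out2"

# The two scripts should agree before their timings are compared.
if cmp -s "$out1" "$out2"; then
    same=yes
else
    same=no
fi
echo "outputs identical: $same"

rm -f "$sample" "$out1" "$out2"
```

To get actual timings, each function call can be prefixed with time in a shell that provides it, with output sent to /dev/null so that disk writes do not dominate the measurement.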

--
Jim Monty

Tempe, Arizona USA



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:

> ...
> The conversion must satisfy these 3 rules:
> 1-all uppercase
> 2-delete all spaces near the "|" separator
> 3-delete all thousand separators

> I think the best tools for resolving it are TR, SED[&|]AWK
> can someone tell me the FASTEST complete solution to perform the task
> please?

The 'fastest' would be in C.
If you've got to convert those data only ONCE, the 'fastest' will
be the one that forces you to think & code less (real world rule).
'One' of the possible solutions in awk (as this is what matters
in this group) could be:

BEGIN{
  FS=OFS="|"
}

{
  gsub(/ *\| */,"|") # remove spaces before or after "|"
  $1=toupper($1)     # toupper() as in gawk, consider locale if avail.
  gsub(/,/,"",$2)    # remove commas in 2nd field
  print
}

--
  All true believers shall break their eggs at the convenient end.



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:
> you overlooked the possibility that commas, too, could
> occur in places other than as "thousand[s] separators." For example,
> this

>     Hello, world|  315,423

> should probably become this

>     HELLO, WORLD|315423

yes!  correct :)

Quote:
> and not this

>     HELLO WORLD|315423
> The following sed script is untested:

>     s/ *\| */|/
>     s/\([0-9]\),\([0-9][0-9]\)/\1\2/g
>     y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

s/\([0-9]\),\([0-9][0-9]\)/\1\2/g   is only a partial solution.
The first field sometimes contains numeric strings, and I must not delete the
commas from them.

Quote:
>     Script 1           Script 2
>     -----------        -----------
>     s/ *\| */|/        s/  *\|/|/
>                        s/\|  */|/

> Whether Script 2 is faster than Script 1 depends primarily on
> whether sed uses a DFA or an NFA regular expression engine.

That's interesting. I have heard of them before. Which engine do you think the
latest GNU sed has?

thanks,
Giovanni



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:
> BEGIN{
>   FS=OFS="|"
> }
> {
>   gsub(/ *\| */,"|") # remove spaces before or after "|"
>   $1=toupper($1)     # toupper() as in gawk, consider locale if avail.
>   gsub(/,/,"",$2)    # remove commas in 2nd field
>   print
> }

wonderful!!! this is the most clean solution I've ever seen :)
I hope it's fast too. Do you think the speed of gsub() is comparable to
sed (considering that using sed means reading the file twice)?

thank you very much!
bye
Giovanni



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:

> > BEGIN{
> >   FS=OFS="|"
> > }
> > {
> >   gsub(/ *\| */,"|") # remove spaces before or after "|"
> >   $1=toupper($1)     # toupper() as in gawk, consider locale if avail.
> >   gsub(/,/,"",$2)    # remove commas in 2nd field
> >   print
> > }

> wonderful!!! this is the most clean solution I've ever seen :)

Just about average. (thanks:-)

Quote:
> I hope it's fast too. Do you think the speed of gsub() is comparable to
> sed (considering that using sed means reading the file twice)?

Considering only the need for temp space & time for processing
the file twice I'd go for the 'just one pass in awk', but there are
many implementation factors: underlying system, disk speed, swap
space, etc...  
Have a nice time running it.  Ciao.

--
  All true believers shall break their eggs at the convenient end.



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:

> > > BEGIN{
> > >   FS=OFS="|"
> > > }
> > > {
> > >   gsub(/ *\| */,"|") # remove spaces before or after "|"
> > >   $1=toupper($1)     # toupper() as in gawk, consider locale if avail.
> > >   gsub(/,/,"",$2)    # remove commas in 2nd field
> > >   print
> > > }

> > wonderful!!! this is the most clean solution I've ever seen :)

> Just about average. (thanks:-)

You can forego the first gsub() [shouldn't that have been sub(),
not gsub(), anyway?] by setting the value of FS to a regular
expression pattern that includes spaces, if any, surrounding the
vertical bar; thus (untested):

    BEGIN {
        FS  = " *| *"
        OFS = "|"
    }

    {
        $1 = toupper($1)  # toupper() as in gawk, consider locale if avail.
        gsub(/,/, "", $2) # remove commas in 2nd field
        print
    }

Quote:
> > I hope it's fast too. Do you think the speed of gsub() is comparable
> > to sed (considering that using sed means reading the file twice)?

Huh?

The sed script I posted works in a single pass. Its shortcoming
is that it doesn't restrict the removal of thousands separators to
values in the second field.

Though it may be marginally slower to use awk, it seems you have
to for this application because you have two fields that you want
to address separately. Awk does this; sed does not.

Quote:
> Considering only the need for temp space & time for processing
> the file twice I'd go for the 'just one pass in awk', but there are
> many implementation factors: underlying system, disk speed, swap
> space, etc...

Benchmark tests from the OP (Giovanni Loc) might be instructive
and would at least be interesting.

--
Jim Monty

Tempe, Arizona USA



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:

> > The following sed script is untested:

> >     s/ *\| */|/
> >     s/\([0-9]\),\([0-9][0-9]\)/\1\2/g
> >     y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/

> s/\([0-9]\),\([0-9][0-9]\)/\1\2/g   is only a partial solution.
> The first field sometimes contains numeric strings, and I must not delete
> the commas from them.

That is effectively the death knell for sed in this case. You might
be able to simulate field splitting by using sed's hold buffer and
a whole lot of chicanery, but I wouldn't recommend it.
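For what it's worth, one alternative that avoids the hold buffer is a substitution loop that only ever removes a comma found after the vertical bar. This is a sketch of that different technique, not of the hold-buffer approach. It assumes exactly one | per record, the function name convert_sed is made up for illustration, and the bar is left unescaped because GNU sed reads \| as alternation in a basic regular expression.

```shell
#!/bin/sh
# Sketch: restrict comma removal to the text after the "|".
# Assumes one "|" per record; convert_sed is a hypothetical name.
convert_sed() {
    sed -e 's/ *| */|/' \
        -e ':a' \
        -e 's/\(|.*\),/\1/' \
        -e 'ta' \
        -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/'
}

# Commas in the first field survive; commas in the second do not.
printf '%s\n' '1,2 items|  315,423' 'Hello, world | 312,434,222' | convert_sed
```

The loop runs once per comma in the second field, so it may well be slower than a single awk pass on a large file; it is shown only to illustrate that the field restriction is expressible in sed.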

Quote:
> >     Script 1           Script 2
> >     -----------        -----------
> >     s/ *\| */|/        s/  *\|/|/
> >                        s/\|  */|/

> > Whether Script 2 is faster than Script 1 depends primarily on
> > whether sed uses a DFA or an NFA regular expression engine.

> That's interesting. I have heard of them before. Which engine do you think
> the latest GNU sed has?

Traditional NFA. Silly me. Sed could not support backreferences
if it used a DFA engine. I tested GNU sed on a Linux machine and
determined that it uses a traditional (i.e., not POSIX) NFA engine.

--
Jim Monty

Tempe, Arizona USA



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem


<snip>

Quote:
>     BEGIN {
>         FS  = " *| *"
>         OFS = "|"
>     }

<snip>

Shouldn't you backslash the vertical bar, or do you mean (zero or more
spaces) or (zero or more spaces)?
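An unescaped | in an awk regular expression is alternation, so " *| *" does read as "zero or more spaces, or zero or more spaces". A sketch of one way to make the bar literal is a bracket expression, which sidesteps the question of how many backslashes survive inside a string constant. The function name convert_awk and the sample records are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: field separator that absorbs spaces around a literal "|".
# convert_awk is a hypothetical name; [|] matches only a vertical bar.
convert_awk() {
    awk '
    BEGIN {
        FS  = " *[|] *"
        OFS = "|"
    }
    {
        $1 = toupper($1)   # assigning a field also rebuilds $0 with OFS
        gsub(/,/, "", $2)  # commas are removed from the 2nd field only
        print
    }'
}

printf '%s\n' 'Hello, world |  315,423' 'World| 314' | convert_awk
```

With this separator the spaces around the bar never reach the fields, so no separate substitution on the whole record is needed.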




Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem


Quote:
>> BEGIN{
>>   FS=OFS="|"
>> }
>> {
>>   gsub(/ *\| */,"|") # remove spaces before or after "|"
>>   $1=toupper($1)     # toupper() as in gawk, consider locale if avail.
>>   gsub(/,/,"",$2)    # remove commas in 2nd field
>>   print
>> }

>wonderful!!! this is the most clean solution I've ever seen :)

If you are concerned about run times, I have found that gsub can be very
time-consuming. You can save some time by first checking whether there is
a comma:
   if ($2 ~ /,/) gsub(/,/,"",$2)    # remove commas in 2nd field

Rgds
Mark
---
Mark Katz
Mark-it, London. Delivering MR-IT/Internet solutions
Tel: (44) 20-8731 7516, Fax: (44) 20-8458 9554
For latest information about ISPC/ITE - see http://www.e-tabs.com



Fri, 24 Jan 2003 03:00:00 GMT  
 simple problem

Quote:


> >  gsub(/ *\| */,"|") # remove spaces before or after "|"

> wouldn't it be better to write
> gsub(/ *| */,"|",$0 ) because this is just one tried match instead of two?

. No need to specify $0

. You should backslash that vertical bar in your regexp

. As Jim Monty already pointed out, this one gsub(),
  which is a one-shot try, would be better as a sub()
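Putting those three points together, the script might be sketched as follows. The wrapper name convert_sub and the sample records are assumptions for illustration; the awk body is the one from earlier in the thread with gsub() on the record swapped for sub().

```shell
#!/bin/sh
# Sketch: the earlier awk script with the three points applied.
# convert_sub is a hypothetical name.
convert_sub() {
    awk '
    BEGIN { FS = OFS = "|" }
    {
        sub(/ *\| */, "|")  # one separator per record, so sub() suffices
        $1 = toupper($1)    # uppercase the 1st field
        gsub(/,/, "", $2)   # remove commas in the 2nd field only
        print
    }'
}

printf '%s\n' 'foobar|312,434,222' 'Hello |  315,423' 'World| 314' | convert_sub
```

Whether sub() is measurably faster than gsub() here would depend on the awk implementation, so timing both on the real data is the only reliable answer.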

--
  All true believers shall break their eggs at the convenient end.



Sat, 25 Jan 2003 03:00:00 GMT  
 