awk not printing single quotes 
Author Message
 awk not printing single quotes

Hello,

I have a huge list of oracle insert statements (3.7 million records,
1.5GB file). One of the fields is "Name". Some of the fields, like
O'Henry and D'Angelo, have single-quotes within the name. The rest of
the name is surrounded by single-quotes, so this messes up the oracle
syntax.

How can I make awk print the single quotes that are in the source
document?
Here is an example the source file:
insert into table (acctno, phone, name) values (12345, 1234567890,
'O'Henry')

If I do a simple awk comand to print the line, the quotes print:
awk '{print $0}' source.sql

insert into table (acctno, phone, name) values (12345, 1234567890,
'O'Henry')

but if I search through the file for only those lines with an odd
number of single-quotes and print the line, the quotes don't print:
awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
n%2==1 {print $0} {n=0}' source.sql >errors.sql
insert into table (acctno, phone, name) values (12345, 1234567890,
OHenry)

I've also tried outputting just the line number, and feeding it to
head, sed, or awk, but it takes literally days for these three
commands to search through a 1.5GB file, whereas the above awk runs
very quickly.
awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
n%2==1 {print NR, n} {n=0}' source.sql >errors.lst

awk '{NR==3500000}' errors.lst >errors.sql
I've also tried:
awk 'NR==3500000 {print; exit}'
sed -n '3500000p;3500000q'
They are all incredibly slow.

Thanks

Martin Willingham



Sat, 24 Apr 2004 22:49:36 GMT  
 awk not printing single quotes

Quote:

> Hello,

> I have a huge list of oracle insert statements (3.7 million records,
> 1.5GB file). One of the fields is "Name". Some of the fields, like
> O'Henry and D'Angelo, have single-quotes within the name. The rest of
> the name is surrounded by single-quotes, so this messes up the oracle
> syntax.

> How can I make awk print the single quotes that are in the source
> document?
> Here is an example the source file:
> insert into table (acctno, phone, name) values (12345, 1234567890,
> 'O'Henry')

> If I do a simple awk comand to print the line, the quotes print:
> awk '{print $0}' source.sql

> insert into table (acctno, phone, name) values (12345, 1234567890,
> 'O'Henry')

> but if I search through the file for only those lines with an odd
> number of single-quotes and print the line, the quotes don't print:
> awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
> n%2==1 {print $0} {n=0}' source.sql >errors.sql
> insert into table (acctno, phone, name) values (12345, 1234567890,
> OHenry)

> I've also tried outputting just the line number, and feeding it to
> head, sed, or awk, but it takes literally days for these three
> commands to search through a 1.5GB file, whereas the above awk runs
> very quickly.
> awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
> n%2==1 {print NR, n} {n=0}' source.sql >errors.lst

> awk '{NR==3500000}' errors.lst >errors.sql
> I've also tried:
> awk 'NR==3500000 {print; exit}'
> sed -n '3500000p;3500000q'
> They are all incredibly slow.

> Thanks

> Martin Willingham

I think your error is in this part:
---
{n += gsub(/'\''/,"")}
---

Notice that you are replacing the string "'" (' character) with
the string "" (null).  To keep yoyr script working as written
try changing it to replace with the ' character:
---
{n += gsub(/'\''/,"'\''")}
---

You also seem to have a problem in your BEGIN area.  If I
am correct you want to set OFMT and CONVFMT there but
only OFMT is set there.  CONVFMT is set for every input
line.  You can typically avoid this confusion and all of the
single quote/shell escape crud by putting your script in a file:
(I've taken some liberties)
(Note: I use mawk but it will work with other awks with line 1 changes)

-----
#!/usr/bin/mawk -f
BEGIN {
              OFMT = "%.10g"
              CONVFMT = "%.10g"
              }
{
n = gsub(/'/,"'")                      # NOTE no escapes needed on ' now
if (n%2==1) {print $0}         # NOTE in a expr block can't use pat-act
pairs.

Quote:
}

-----

Regards,

Michael Witkowski



Sat, 24 Apr 2004 23:19:09 GMT  
 awk not printing single quotes

Quote:

> if I search through the file for only those lines with an odd
> number of single-quotes and print the line, the quotes don't print:
> awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
> n%2==1 {print $0} {n=0}' source.sql >errors.sql

Obviously, since you are changing all single quotes to nothing with
that gsub.

How about simply matching lines with three or more quotes:

awk "/'.*'.*'/" OFMT="%.10g" CONVFMT="%.10g" source.sql

If you really want all lines with an odd number of quotes,
there are several ways. First, if you don't need to do
any field-based operations in the same time, you can use
quote as field separator and count the fields:

awk -F\' 'NF%2==0'  OFMT="%.10g" CONVFMT="%.10g" source.sql

That should be about as fast as it gets.

If you can't change FS, you can use split instead of gsub:

awk 'split($0,a,"'\''")%2==0' OFMT="%.10g" CONVFMT="%.10g" source.sql

Of course you can also use gsub, just change the quotes
back to themselves:

awk 'gsub(/'\''/,"'\''")%2' OFMT="%.10g" CONVFMT="%.10g" source.sql

You could also use a regexp that matches only lines with
an odd number of quotes:

awk "/^([^']*'[^']*')*[^']*'[^']*\$/" OFMT="%.10g" CONVFMT="%.10g" source.sql

There are probably other ways, too.

--
Tapani Tarvainen



Sat, 24 Apr 2004 23:32:57 GMT  
 awk not printing single quotes


...

Quote:
>-----
>#!/usr/bin/mawk -f
>BEGIN {
>              OFMT = "%.10g"
>              CONVFMT = "%.10g"
>              }
>{
>n = gsub(/'/,"'")                      # NOTE no escapes needed on ' now
>if (n%2==1) {print $0}         # NOTE in a expr block can't use pat-act
>pairs.
>}

Yes, but anyone with any self-respect would code it thus:

        BEGIN { OFMT = CONVFMT = "%.10g" }
        gsub(/'/,"&") % 2 # I suppose the point is to print lines
                                # with an odd number of single quotes, right?

And be done with it...

(I have no idea what this script it supposed to do, not any comment on
whether or not it does it.  I'm just commenting on style.)



Sat, 24 Apr 2004 23:54:43 GMT  
 awk not printing single quotes

Quote:



> ...
> >-----
> >#!/usr/bin/mawk -f
> >BEGIN {
> >              OFMT = "%.10g"
> >              CONVFMT = "%.10g"
> >              }
> >{
> >n = gsub(/'/,"'")                      # NOTE no escapes needed on ' now
> >if (n%2==1) {print $0}         # NOTE in a expr block can't use pat-act
> >pairs.
> >}

> Yes, but anyone with any self-respect would code it thus:

>         BEGIN { OFMT = CONVFMT = "%.10g" }
>         gsub(/'/,"&") % 2       # I suppose the point is to print lines
>                                 # with an odd number of single quotes, right?

> And be done with it...

> (I have no idea what this script it supposed to do, not any comment on
> whether or not it does it.  I'm just commenting on style.)

Yup!  I'm due some lashes!  That's at least another 25%
increase in efficiency, probably a bit more.

Michael Witkowski



Sun, 25 Apr 2004 00:22:45 GMT  
 awk not printing single quotes

% I have a huge list of oracle insert statements (3.7 million records,
% 1.5GB file). One of the fields is "Name". Some of the fields, like

Leaving aside your question, if your goal is to insert this into
a database, I suggest you use sql loader in direct path mode. Apart
from eliminating this problem (you can get rid of the quotes if there
are no commas in your character strings, use double-quotes, or change
the delimiter to some value that doesn't appear in your data), it will
save you time on the load.

% but if I search through the file for only those lines with an odd
% number of single-quotes and print the line, the quotes don't print:
% awk 'BEGIN {OFMT = "%.10g"}{CONVFMT = "%.10g" } {n += gsub(/'\''/,"")}
% n%2==1 {print $0} {n=0}' source.sql >errors.sql
% insert into table (acctno, phone, name) values (12345, 1234567890,
% OHenry)

For dealing with files this size, I suggest installing mawk. It is much
faster than other implementations. Rather than playing tricks with gsub,
how about playing tricks with FS:

 mawk -F"'" '!(NF % 2)' source.sql > errors.sql

or, perhaps better:

 mawk -F"'" '!(NF % 2) { print > "errors.sql"; next } { print > "clean.sql" }' source.sql

You could also take a stab at cleaning the data up (in a script, since
it's a pain dealing with single quotes on the command-line):

 #!/usr/local/bin/mawk -f
 BEGIN { FS = OFS = "'" }

 # if there are an odd-number of fields, things are OK.
 NF % 2 { print; next }

 # things must not be OK if we reach this line
 # we need to leave the odd-numbered quotes alone, but double-up the
 # even-numbered ones
 {
   printf "%s'%s", $1, $2
   for (i = 3; i <= NF; i += 2) {
      printf "''%s'%s", $i, $(i+1)
   }
 }
--

Patrick TJ McPhee
East York  Canada



Sun, 25 Apr 2004 01:15:45 GMT  
 awk not printing single quotes

On 6 Nov 2001 at 06:49, Martin Willingham so generously added:

Quote:
> Hello,

> I have a huge list of oracle insert statements (3.7 million records,
> 1.5GB file). One of the fields is "Name". Some of the fields, like
> O'Henry and D'Angelo, have single-quotes within the name. The rest of
> the name is surrounded by single-quotes, so this messes up the oracle
> syntax.

> How can I make awk print the single quotes that are in the source
> document?
> Here is an example the source file:
> insert into table (acctno, phone, name) values (12345, 1234567890,
> 'O'Henry')

I'm sure the guys here can come up with something much better. I'm fairly
new to awk. But this does do what you want. (And if I'm not mistaken, the
escape character in sql for a single quote is another single quote, so you
want 'O''Henry' I believe. If not, you can change this character)

From input file:
insert into table (acctno, phone, name) values ('12345', 1234567890,
'O'Henry')
insert into table (acctno, phone, name) values ('12345'', 1234567890,
'Joes'')
insert into table (acctno, phone, name) values (''12345', 1234567890,
''Mac')

I run:
{
    for (i=1;i<=NF;i++)
    {
        n = index($i,"'");
        if(n == 1 || n == 2) {
            # Found something which needs escaping
            # add the possible "(" or "'"
            a = substr($i,0,n);
            for(j=n+1;j<length($i)-1;j++){
                b = substr($i,j,1);
                if(b == "'") {
                    a = a "'";
                }
                a = a b;
            }
            a = a substr($i,j,length($i));
            $i = a;
        }
    }
    print $0;

Quote:
}

and get:

insert into table (acctno, phone, name) values ('12345', 1234567890,
'O''Henry')
insert into table (acctno, phone, name) values ('12345''', 1234567890,
'Joes''')
insert into table (acctno, phone, name) values ('''12345', 1234567890,
'''Mac')

--
Jason Simpson

  "To avoid slow performance, Apple suggests that the amount of virtual  memory you select be less than the system RAM."  -- INFOWORLD



Sun, 02 May 2004 04:43:48 GMT  
 awk not printing single quotes


Quote:
> If you really want all lines with an odd number of quotes,

All solutions involving an odd number of quotes are a really bad idea.

insert into table (a y z) values ('O'Malley's Pub', 300, 400)

Needs to be escaped properly.

--
Jason Simpson

  "To avoid slow performance, Apple suggests that the amount of virtual  memory you select be less than the system RAM."  -- INFOWORLD



Sun, 02 May 2004 04:50:23 GMT  
 awk not printing single quotes

Quote:


>> If you really want all lines with an odd number of quotes,

> All solutions involving an odd number of quotes are a really bad idea.

> insert into table (a y z) values ('O'Malley's Pub', 300, 400)

> Needs to be escaped properly.

Have you tried the escaped octal notation?
\047  I think..

`man ascii' has a handy list to check it.



Sun, 02 May 2004 12:32:12 GMT  
 
 [ 9 post ] 

 Relevant Pages 

1. printing single quote (apostrophe) with awk

2. printing a single quote in awk

3. Substituting single quotes from Perl via shell using awk and gsub

4. Getting awk (or sed or anything else) to put single quotes into file

5. How to print out single quoted columns?

6. Printing single quotes

7. replace single quote to double quote with an example

8. single-quoted string conversion to triple-quoted string

9. When is a quote not a quote?

10. REGULAR EXPRESSIONS; PERL/AWK/SED; SINGLE QUOTES/DOUBLE QUOTES

11. gawk win32 binary & single quote invalid char

12. parsing lines with quoted strings as single fields

 

 
Powered by phpBB® Forum Software