Awk Beyond Ascii? 
 Awk Beyond Ascii?

I was somewhat abstractly wondering if it is possible to use awk (and sed)
with Unicode or character sets beyond the standard ascii. What do Greeks and
Russians do?

Not urgent, just curious.

Elisa Francesca Roselli



Mon, 08 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

Quote:
> I was somewhat abstractly wondering if it is possible to use awk (and sed)
> with Unicode or character sets beyond the standard ascii. What do Greeks and
> Russians do?
> Not urgent, just curious.
> Elisa Francesca Roselli

Awk works very well beyond Ascii :-).
Russians use code page 866 and appropriate screen and keyboard drivers
for it. But awk doesn't know what letters we see on a display screen
and on keyboard keys, because it works with their codes (bytes). We can't
use awk programs such as alphabetical sorting that are based on the English
alphabet, but we can write our own programs for this purpose or accommodate
existing programs, and we do it very easily :-).

Alexander

Ukraine
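
As a rough illustration of the point above about writing one's own sorting program: a minimal sketch in Python, assuming the text is in code page 866 as described. The names ALPHABET, sort_key, sort_russian and sort_by_bytes are made up for this example. Sorting by raw byte value, which is all a byte-oriented tool can do, is compared with sorting by an explicit alphabet table.

```python
# Russian alphabet in dictionary order; note that "ё" follows "е".
ALPHABET = "абвгдеёжзийклмнопрстуфхцчшщъыьэюя"
RANK = {letter: index for index, letter in enumerate(ALPHABET)}


def sort_key(word):
    # Assumption: characters outside the table sort after all known letters.
    return [RANK.get(ch, len(ALPHABET)) for ch in word.lower()]


def sort_russian(words):
    """Sort words by position in the alphabet table."""
    return sorted(words, key=sort_key)


def sort_by_bytes(words, encoding="cp866"):
    """Sort words by their encoded byte values, as a byte-oriented tool would."""
    return sorted(words, key=lambda w: w.encode(encoding))


words = ["ёж", "яма", "еда", "жук"]
print(sort_by_bytes(words))  # "ёж" lands last: its code lies outside the main letter range
print(sort_russian(words))   # "ёж" lands right after "еда"
```

In code page 866 most letters already sit in alphabetical order, so the table mainly matters for letters such as "ё" whose code lies apart from the rest.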



Mon, 08 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

Quote:

> > I was somewhat abstractly wondering if it is possible to use awk (and sed)
> > with Unicode or character sets beyond the standard ascii. What do Greeks and
> > Russians do?

> > Not urgent, just curious.

> > Elisa Francesca Roselli

> Awk works very well beyond Ascii :-).
> Russians use code page 866 and appropriate screen and keyboard drivers
> for it. But awk doesn't know what letters we see on a display screen
> and on keyboard keys, because it works with their codes (bytes). We can't
> use awk programs such as alphabetical sorting that are based on the English
> alphabet, but we can write our own programs for this purpose or accommodate
> existing programs, and we do it very easily :-).

But how about Unicode and such wide char sets? I'm curious too now. =)
/P
--
-= Spam safe(?) e-mail address: pez68 at netscape.net =-




Mon, 08 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

Quote:
>> > I was somewhat abstractly wondering if it is possible to use awk (and sed)
>> > with Unicode or character sets beyond the standard ascii. What do Greeks and
>> > Russians do?

>> > Not urgent, just curious.

>> > Elisa Francesca Roselli

>> Awk works very well beyond Ascii :-).
>> Russians use code page 866 and appropriate screen and keyboard drivers
>> for it. But awk doesn't know what letters we see on a display screen
>> and on keyboard keys, because it works with their codes (bytes). We can't
>> use awk programs such as alphabetical sorting that are based on the English
>> alphabet, but we can write our own programs for this purpose or accommodate
>> existing programs, and we do it very easily :-).

> But how about Unicode and such wide char sets? I'm curious too now. =)


It's a question not only for Russians but for all other peoples too.
As far as I know, Unicode uses a 2-byte coding scheme for characters, but
I don't use it, because 256 characters are quite sufficient for
me. Windows and maybe other OSes use it, but I don't know who
needs it.

Alexander



Mon, 08 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

I see - so basically it is not a matter of integrating the whole Unicode set
but merely of redefining Ascii to a different code page. It is like, Ascii
is a small window onto the world of all possible characters, but it is in a
rotating room, and depending on where one is, one can direct that window to
different areas of the complete character set.

I suppose the downside of the arrangement is the need to configure your
whole system to take one or the other. Different code page, different
keyboard, different fonts, etc. This could make it troublesome, for example,
to use sed or awk for transliteration purposes (which I guess is what I was
thinking of when I asked this question.)
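
To make the transliteration idea concrete, here is a minimal sketch in Python of a table-driven Cyrillic-to-Latin pass. The mapping is a simplified, lowercase-only table invented for illustration and is not any official transliteration standard. The names TABLE, transliterate and transliterate_bytes are likewise hypothetical. The second function assumes the input bytes are in code page 866.

```python
# Illustrative, simplified lowercase table; not a standard scheme.
TABLE = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "е": "e", "ё": "yo",
    "ж": "zh", "з": "z", "и": "i", "й": "y", "к": "k", "л": "l", "м": "m",
    "н": "n", "о": "o", "п": "p", "р": "r", "с": "s", "т": "t", "у": "u",
    "ф": "f", "х": "kh", "ц": "ts", "ч": "ch", "ш": "sh", "щ": "shch",
    "ъ": "", "ы": "y", "ь": "", "э": "e", "ю": "yu", "я": "ya",
}
TRANS = str.maketrans(TABLE)


def transliterate(text):
    """Replace each Cyrillic letter found in TABLE; leave everything else as is."""
    return text.translate(TRANS)


def transliterate_bytes(data, encoding="cp866"):
    """Decode bytes from the given code page, then transliterate the text."""
    return transliterate(data.decode(encoding))


print(transliterate("привет"))
print(transliterate_bytes("мир".encode("cp866")))
```

The decode step is where the code page matters: the same table works for any 8-bit Cyrillic encoding once the bytes have been turned into characters, so only the encoding name needs to change.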

I got interested in Unicode and in computer transliteration between Latin
and Cyrillic alphabets during a Java project that I had to do for a course.
Ultimately, I decided it would be too hard for a beginner and undertook a
different project, but it raised questions which have lingered.

Levin A.A. wrote in message

Quote:
>Awk works very well beyond Ascii :-).
>Russians use code page 866 and appropriate screen and keyboard drivers
>for it. But awk doesn't know what letters we see on a display screen
>and on keyboard keys, because it works with their codes (bytes). We can't
>use awk programs such as alphabetical sorting that are based on the English
>alphabet, but we can write our own programs for this purpose or accommodate
>existing programs, and we do it very easily :-).



Tue, 09 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?
Yes, it is an interesting question in mechanical terms, quite beyond the
linguistic terms. As you say, awk works directly with the bytes, and 256
characters are sufficient for Western, alphabet languages even if one uses
different sets of 256. But two-byte coding is needed when we get into Asian,
ideographic languages. Java, one of the youngest of the great programming
languages, has whole reams of classes and IO and other devices dedicated to
handling Unicode in all its splendor. By contrast, awk and sed are senior
citizens among programming tools, and that brings in a whole historical
dimension to the debate. I'm given to understand that Unicode itself is a
relative newcomer in the world of computing, and that as a standard it has
not yet been completely defined. When I last looked into this issue, they
were still fighting with version numbers.

I notice that Jeffrey E.F. Friedl, who wrote the O'Reilly on MASTERING
REGULAR EXPRESSIONS, is based in Japan. Among the people mentioned in his
acknowledgments is Ken Lunde, author of CJKV INFORMATION PROCESSING
(Chinese, Japanese, Korean and Vietnamese). So this must be an area that is
being discussed somewhere, even if not immediately here....

EFR

Levin A.A. wrote in message

Quote:
>It's a question not only for Russians but for all other peoples too.
>As far as I know, Unicode uses a 2-byte coding scheme for characters, but
>I don't use it, because 256 characters are quite sufficient for
>me. Windows and maybe other OSes use it, but I don't know who
>needs it.

>Alexander



Tue, 09 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

Quote:
>                                    ... Different code page, different
> keyboard, different fonts, etc. This could make it troublesome, for example,
> to use sed or awk for transliteration purposes ...

It seems to me that it is not necessary to have different keyboards,
different fonts and so on for this purpose. Various drivers or emulation
programs exist for this. And generally speaking, I don't think it is
a problem for the German, French, Russian and, naturally, American or English
members of this conference that they use different keyboards and screen
drivers.


Quote:
>                                                             ... and 256
> characters are sufficient for Western alphabet languages ...

And for eastern ones too :-)

For example for Russian, which consists of 32 characters, or other
languages with Cyrillic characters.

As for the Japanese or Chinese languages, I think Unicode can't help us
communicate freely with their speakers, because a very big
difference exists between our languages.

And finally, I'm curious: who invented Unicode?
Maybe Microsoft? And was it a realisation of Bill Gates' dream
to sell his Windoze all over the world?

Alexander



Tue, 09 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?


Quote:
>I was somewhat abstractly wondering if it is possible to use awk (and sed)
>with Unicode or character sets beyond the standard ascii. What do Greeks and
>Russians do?

Much discussion so far. I'd guess that languages with 'letters' which represent
single sounds that can be used in combination to represent other sounds
(loosely speaking when it comes to English) can generally be handled without
much difficulty as bytes. As for dealing with eastern Asian languages, it would
require a thorough rewrite of awk and possibly the use of unicode rather than
standard C libraries to compile the awk executable. For instance,
substr(string, i, 1) would presumably need to return two bytes rather than one,
and length would operate on two-byte characters. So a unicode awk would be an
entirely different creature than an ASCII awk (or an EBCDIC awk?).

I suppose that begs the question whether or not there are unicode unix variants
yet.
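
The substr and length point can be shown with a small sketch in Python, used here only because it makes the difference between bytes and characters easy to see. The helper names char_length and substr_chars are invented for the example, and the two encodings (UTF-8 and big-endian UTF-16) are just convenient choices for the demonstration.

```python
text = "日本語"                      # three characters
utf8 = text.encode("utf-8")         # multi-byte, variable width
utf16 = text.encode("utf-16-be")    # two bytes per character for this text


def char_length(data, encoding):
    """Length in characters rather than bytes."""
    return len(data.decode(encoding))


def substr_chars(data, start, count, encoding):
    """awk-style substr (1-based start), counted in characters, returned as bytes."""
    decoded = data.decode(encoding)
    return decoded[start - 1:start - 1 + count].encode(encoding)


print(len(utf8), len(utf16))              # byte counts differ per encoding
print(char_length(utf8, "utf-8"))         # character count does not
print(len(substr_chars(utf16, 2, 1, "utf-16-be")))  # one character, two bytes
```

A byte-oriented length reports the first line's numbers, while a character-aware one reports the second. Taking a single byte out of the middle of either encoding would not give a usable character.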



Wed, 10 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

% I see - so basically it is not a matter of integrating the whole Unicode set
% but merely of redefining Ascii to a different code page. It is like, Ascii

Not exactly. There are 256 possible characters in an 8-bit byte. ASCII
defines the first 128 of them. There are potentially 128! ways to fill in the
other 128 characters, and all of them are in use somewhere. ISO has defined
(in standard 8859) between 10 and 20 standard 8-bit encodings, covering
most alphabetic languages. ISO also has a standard numbered, I think, 10646,
which defines additional code points. Unicode is a 16-bit representation
of 10646.
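
A small sketch in Python of what "filling in the other 128 characters" means in practice: the same byte value is decoded under several 8-bit encodings. The function name decode_under and the choice of these four encodings are assumptions made for the illustration.

```python
def decode_under(byte_value, encodings=("cp866", "iso8859_5", "koi8_r", "latin_1")):
    """Return the character that one byte value stands for under each encoding."""
    raw = bytes([byte_value])
    return {name: raw.decode(name) for name in encodings}


# A byte in the ASCII half means the same thing everywhere.
print(decode_under(0x41))

# A byte in the upper half depends entirely on the code page in use.
print(decode_under(0xC1))
```

The lower half agrees across all four, which is why ASCII-only awk programs are unaffected by the choice of code page.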

The advantage of the 8-bit representation is that it saves space. Storage
is becoming less expensive, but there's still an advantage to using
1/2 the space to store data, 1/2 the bandwidth to send it, and so on,
if that's all that's required (which is the case for most alphabetic
languages). The advantage of Unicode is that it covers the languages
which can't be handled by 8-bit data, and it covers documents which
mix languages which require different alphabets.

awk will support the underlying character set used by the C compiler which
compiled the program. My guess is that you could recompile gawk on
a machine which uses 2-byte chars, and it would handle Unicode data
correctly.
--

Patrick TJ McPhee
East York  Canada



Wed, 10 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

% much difficulty as bytes. As for dealing with eastern Asian languages, it would
% require a thorough rewrite of awk and possibly the use of unicode rather than
% standard C libraries to compile the awk executable. For instance,

C allows chars to be as many bytes as you want, and all the sizing operations
work in chars rather than bytes, so really, a recompile ought to handle it.

% entirely different creature than an ASCII awk (or an EBCDIC awk?).

An EBCDIC awk will be a different beast, since the character ordering will
be different, but the first 128 chars in Unicode are just ASCII.
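
Both halves of that remark can be checked with a short sketch in Python. The EBCDIC variant used here, code page 037, is an assumption picked for the example (there are several EBCDIC code pages), and sort_by_encoding is a made-up helper name.

```python
def sort_by_encoding(chars, encoding):
    """Sort characters by the byte values they get in the given encoding."""
    return sorted(chars, key=lambda c: c.encode(encoding))


sample = ["a", "A", "0"]
ascii_order = sort_by_encoding(sample, "ascii")
ebcdic_order = sort_by_encoding(sample, "cp037")  # cp037 is one EBCDIC code page
print(ascii_order)   # digits, then uppercase, then lowercase
print(ebcdic_order)  # lowercase, then uppercase, then digits

# Every 7-bit ASCII code is the same number as its Unicode code point.
ascii_is_prefix = all(bytes([n]).decode("ascii") == chr(n) for n in range(128))
print(ascii_is_prefix)
```

Any awk comparison such as "a" < "B" therefore gives different answers on the two kinds of system, even though the program text is identical.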

% I suppose that begs the question whether or not there are unicode
% unix variants

There's Plan 9.
--

Patrick TJ McPhee
East York  Canada



Wed, 10 Jul 2002 03:00:00 GMT  
 Awk Beyond Ascii?

Quote:

> I suppose that begs the question whether or not there are unicode unix variants
> yet.

You might want to check out the "UTF-8 and Unicode FAQ for Unix/Linux" at
<http://www.cl.cam.ac.uk/~mgk25/unicode.html>

Regards...
                Michael



Wed, 10 Jul 2002 03:00:00 GMT  
 