character classes 
 character classes

Hello

Suppose I have

(defun foo (c)
  (ecase c
    (#\_ "underscore")
    (#\- "hyphen")
    ((#\a #\s #\d #\f) "asdf")))

(#\a #\s #\d #\f) works well for a small number of choices.
What is the conventional (or frequently used) way to do this
when the choices can be generated automatically? E.g.
I'd like to return "asdf" for all English lowercase letters.

Thanks in advance

--



Wed, 13 Aug 2003 03:29:56 GMT  
 character classes

Quote:

> Hello

> Suppose I have

> (defun foo (c)
>   (ecase c
>     (#\_ "underscore")
>     (#\- "hyphen")
>     ((#\a #\s #\d #\f) "asdf")))

> (#\a #\s #\d #\f) fits well for small number of choices.
> What is the conventional (or frequently used) way for such
> things when choices can be automatically generated. E.g.
> I'd like to return "asdf" for all english small letters.

I am not sure I understand you correctly, but it is easy to write a function
(or macro) that generates a predicate function doing the minimal number
of comparisons needed to decide whether a character is a member of some set.
Remember that to check whether a given character is one of the English
(latin???) small letters you only need to test whether it lies between the
boundaries.
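
Something along these lines is what I have in mind. It is only a sketch: the names define-char-range-predicate, english-small-letter-p and foo-by-range are made up for illustration, and it assumes a character set in which nothing else sits between #\a and #\z (true for ASCII, not for EBCDIC).

```lisp
;; Hypothetical helper: defines NAME as a predicate that is true for
;; characters lying between LOW and HIGH, inclusive.
(defmacro define-char-range-predicate (name low high)
  `(defun ,name (c)
     (char<= ,low c ,high)))

(define-char-range-predicate english-small-letter-p #\a #\z)

(defun foo-by-range (c)
  (cond ((char= c #\_) "underscore")
        ((char= c #\-) "hyphen")
        ((english-small-letter-p c) "asdf")
        ;; mimic ECASE: signal an error when nothing matches
        (t (error "~S fell through FOO-BY-RANGE" c))))
```

With CHAR<= the test costs two comparisons no matter how many letters the range covers.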

Regards,
Jochen



Wed, 13 Aug 2003 03:28:21 GMT  
 character classes
Here are three versions.

I.

(defun foo (c)
  (ecase c
    (#\_ "underscore")
    (#\- "hyphen")
    (#.(loop for i from (char-code #\A) to (char-code #\z)
             collect (code-char i)) "asdf")))
(Thanks to Paul Foley. I would never have thought of #. myself.)

II.

(deftype underscore () `(eql #\_))
(deftype hyphen () `(eql #\-))

(deftype letter () `(member ,@(loop for i from (char-code #\A)
                                    to (char-code #\z)
                                    collect (code-char i))))

(defun foo2 (c)
  (etypecase c
    (underscore "underscore")
    (hyphen "hyphen")
    (letter "letter")))

III.

(defun foo3 (c)
  (cond ((char= c #\_) "underscore")
        ((char= c #\-) "hyphen")
        ((member c (loop for i from (char-code #\A) to (char-code #\z)
                         collecting (code-char i))) "letter")))

Which of the above would be considered good Lisp style?
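
One thing to check in all three versions: the range runs from #\A to #\z, and on an ASCII-based implementation that span also picks up the six characters [ \ ] ^ _ and ` that sit between Z and a. A small sketch of the difference, assuming ASCII-compatible character codes; chars-between, *wide-range* and *letters* are names invented for this example:

```lisp
;; Hypothetical helper: all characters whose codes lie between the
;; codes of FIRST and LAST, inclusive.
(defun chars-between (first last)
  (loop for code from (char-code first) to (char-code last)
        collect (code-char code)))

;; The range used in the three versions above.
(defparameter *wide-range* (chars-between #\A #\z))

;; The same range with everything that is not a letter filtered out.
(defparameter *letters* (remove-if-not #'alpha-char-p *wide-range*))
```

Because #\_ ends up in the wide range, the letter clause and the underscore clause overlap, and only the clause order decides which one wins.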

Another question: to compare two chars I have at least three
options:
  char=
  eq
  eql
It seems eql is the least efficient. What is the difference between
char= and eq for comparing chars? Should the most specific
comparison be chosen in all cases?
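
For reference, a small sketch of how the portable predicates behave. As far as the standard goes, eq on characters is not guaranteed to be true even for the same character, so it is left out here; eql is what case and ecase use, and char= additionally requires its arguments to be characters. The function compare-chars is just an illustration:

```lisp
;; Returns the result of each portable comparison for two characters.
(defun compare-chars (a b)
  (list :eql (eql a b)                 ; same character, case-sensitive
        :char= (char= a b)             ; same, but arguments must be characters
        :char-equal (char-equal a b))) ; ignores case
```

Calling (compare-chars #\a #\A) shows that only char-equal treats the two as the same.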

--



Thu, 14 Aug 2003 00:23:52 GMT  
 character classes

Quote:
> Remember that to check if a given character is a member of the english
> (latin???) small letters you need only to test if it's between the
> boundaries.

Actually, this is only true in certain character code sets.  Having recently
been working on an IBM AS/400 (no, not in Lisp), this point has been well
and fully driven home to me.  It also fails for the ISO Western code set
(sorry, don't remember the ISO spec number) as per your "latin???" comment,
as letters with cedillas, accent grave, etc. are not properly interspersed.

The standard also places no restriction on mapping of codes to characters
other than the following ordering predicates on their codes:

 A<B<C<D<E<F<G<H<I<J<K<L<M<N<O<P<Q<R<S<T<U<V<W<X<Y<Z
 a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<s<t<u<v<w<x<y<z
 0<1<2<3<4<5<6<7<8<9
 either 9<A or Z<0
 either 9<a or z<0

Immediately following this in Section 13.1.6 of the Hyperspec comes the
explicit statement:

This implies that, for standard characters, alphabetic ordering holds within
each case (uppercase and lowercase), and that the numeric characters as a
group are not interleaved with alphabetic characters. However, the ordering
or possible interleaving of uppercase characters and lowercase characters is
implementation-defined.
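
These guarantees can be checked directly in any conforming implementation. A minimal sketch; strictly-increasing-p is a name made up for this example:

```lisp
;; True when the characters of STRING are in strictly increasing
;; order according to CHAR<.
(defun strictly-increasing-p (string)
  (apply #'char< (coerce string 'list)))

(defun standard-ordering-holds-p ()
  (and (strictly-increasing-p "ABCDEFGHIJKLMNOPQRSTUVWXYZ")
       (strictly-increasing-p "abcdefghijklmnopqrstuvwxyz")
       (strictly-increasing-p "0123456789")
       ;; digits are not interleaved with either case
       (or (char< #\9 #\A) (char< #\Z #\0))
       (or (char< #\9 #\a) (char< #\z #\0))))
```

Note that nothing here says whether other characters may fall between, say, #\i and #\j, which is exactly what happens in EBCDIC.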

In general, the best way to see if an item is a lower-case character is to
use the predicate lower-case-p.  The same statement applies for other
well-defined character categories for which the standard has provided us
handy (if not always so easy to remember) predicates.
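
Applied to the original question, a sketch could look like this. The names english-small-letter-p and foo-by-predicate are invented here. Since lower-case-p may also be true for accented letters in implementations that support them, the sketch combines it with standard-char-p, which restricts the match to the lowercase letters among the standard characters, a through z:

```lisp
;; True only for the 26 lowercase letters among the standard characters.
(defun english-small-letter-p (c)
  (and (standard-char-p c)
       (lower-case-p c)))

(defun foo-by-predicate (c)
  (cond ((char= c #\_) "underscore")
        ((char= c #\-) "hyphen")
        ((english-small-letter-p c) "asdf")
        ;; mimic ECASE: signal an error when nothing matches
        (t (error "~S fell through FOO-BY-PREDICATE" c))))
```

This makes no assumption about character codes at all.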

Now, all that being said, given that the ISO Western character set (sorry, I
still don't remember the spec number) is a superset of 7-bit ASCII, and that
most implementations use this character code set as their base character
set, as a first order approximation, comparing the range with char< (or its
brethren) is usually safe :-).

faa

P.S.  Is it just me, or does anyone else think that the entire character set
issue has not been very well thought out with respect to programming in any
language?



Wed, 13 Aug 2003 23:36:06 GMT  
 character classes

Quote:
Erik Naggum writes:
>   I'd like Common Lisp to have the first real solution to this problem.  I
>   have worked on this off and on for a very long time, scrapping more
>   designs than I feel comfortable enumerating.  Breaking out of the notions
>   that are so hard-wired into our "modern" operating systems is very hard.

I've only started thinking about this, but could you give a brief
illustration of your state?  Site-dependent character sets, encodings,
and upper/lower mappings are simple enough, but variable-width
characters puzzle me.  By variable-width, I mean things such as accent
modifiers, which change the following character but are not characters
themselves, and the German Eszett, which is a single character (B)
in lower-case, but two (SS) in upper.  (Pardon the lack of a real
character, but I have no idea how to generate it on my system.)

The separate concepts of byte-streams and true strings seem important.
If a string is an array of characters, length is no longer a good
concept; does it refer to the number of characters, or to the length
of the textual representation?  Clearly, there should be two separate
functions to decide this.

In addition, even the parsing of string literals is a hard thing.
Should "HEISSEN" be an array containing #\H #\E #\I #\SS #\E #\N, or
should the #\SS be #\S #\S instead?  I would argue that it depends on
the language encodings currently in operation, but that seems
difficult.

I'd be inclined to have a character be a conceptual character, with
all accents, state, and so on encoded within it.  But now I am
curious.  Where would I go to find out more about issues like this?

--Johann

--



Thu, 14 Aug 2003 04:43:31 GMT  
 character classes

Quote:

> ...
>In addition, even the parsing of string literals is a hard thing.
>Should "HEISSEN" be an array containing #\H #\E #\I #\SS #\E #\N, or
>should the #\SS be #\S #\S instead?  I would argue that it depends on
>the language encodings currently in operation, but that seems
>difficult.

wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
vs. "die heissen oefen")?

hs



Thu, 14 Aug 2003 06:24:43 GMT  
 character classes

Quote:
Hartmann Schaffer writes:

> wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> vs. "die heissen oefen")?

Good point.  I think you're right, but my German is very rusty.  Hm.

--



Thu, 14 Aug 2003 08:02:27 GMT  
 character classes

Quote:

> wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> vs. "die heissen oefen")?

Norwegian has similar issues with our letters "æøå" (ae ligature, o
with slash across, a with ring above) which are sometimes transcribed
as "ae", "oe" and "aa".  And of course these letter combinations can
occur in other words as well...

After many years with all sorts of square wheels reinvented all over
the place, I think the only conclusion is that you should never try to
make the computer handle this in a "smart" manner.  It will always fail.

Stig Hemmer,
Jack of a Few Trades.



Thu, 14 Aug 2003 20:46:58 GMT  
 character classes

Quote:


> > wouldn't that depend on which "heissen" you are parsing ("wie heissen sie?"
> > vs. "die heissen oefen")?

> Norwegian has similar issues with our letters "æøå" (ae ligature, o
> with slash across, a with ring above) which are sometimes transcribed
> as "ae", "oe" and "aa".  And of course these letter combinations can
> occur in other words as well...

i don't know about norwegian, but swedes consider å, ä and ö to be
distinct letters rather than modifications of roman letters.  the ae
oe and aa are just poor man's spellings for when you are stuck with a
limited set of characters.

i don't know what the official word is, but as i see it, the ß is
really two esses.  if you look closely (depending on the font) you can
see a long (integral sign type) s followed by the usual s.  it's a
ligature in the same way a font merges f and i to be a unit "fi" in
which the f hangs over and becomes the dot to the i.  correct me if i
am wrong, but ß isn't considered a distinct letter of its own.

Gauß and Gauss are the same person, and if you capitalize all the
letters, both become GAUSS.  it's kind of weird when capitalization
changes the number of characters you need.  i don't know how german
text would be best represented and manipulated given this quirk.  does
anyone have experience with this?

i don't see norwegian and swedish having this problem since capital å
is Å &c.  they all take up the same amount of room.  and if you're
stuck with ae then of course AE would be available.  i don't think
it's common to have ä but not Ä.

Quote:
> After many years with all sorts of square wheels reinvented all over
> the place, I think the only conclusion is that you should never try to
> make the computer handle this is a "smart" manner.  It will always
> fail.

the other recourse is to abstraction.  sometimes WYSIWYG just isn't
enough.  there's more behind what you see than just what you can see.
where you have a logical representation which indicates double ss and
then a printing mechanism which renders it as ß or SS depending upon
capitalization.  it would presumably also keep distinct cases where
two ss just happen to come together but ß would be illegal.

--
J o h a n  K u l l s t a m

Don't Fear the Penguin!



Fri, 15 Aug 2003 12:42:31 GMT  
 character classes

Quote:

> i don't know what the official word is, but as i see it, the ß is
> really two esses.  if you look closely (depending on the font) you can
> see a long (integral sign type) s followed by the usual s.  it's a
> ligature in the same way a font merges f and i to be a unit "fi" in
> which the f hangs over and becomes the dot to the i.  correct me if i
> am wrong, but ß isn't considered a distinct letter of its own.

> Gauß and Gauss are the same person, and if you capitalize all the
> letters, both become GAUSS.  it's kind of weird when capitalization
> changes the number of characters you need.  i don't know how german
> text would be best represented and manipulated given this quirk.  does
> anyone have experience with this?

If we're talking about the German letter that looks a bit like a
`beta' in many typefaces (I can't make it on this terminal for some
reason), then I thought it was historically an s-z ligature, not an
s-s ligature.  Indeed it kind of looks like that -- long-s with a z
against it.  A native German speaker can probably correct me.

--tim



Fri, 15 Aug 2003 18:51:41 GMT  
 character classes

Note: If emacs will stand by me, this message contains latin-1
characters coded according to iso-8859-1 , which is the de facto usenet
standard. The letters having char-code<128 coincide with ASCII.

Quote:



> > > wouldn't that depend on which "heissen" you are parsing ("wie
> > > heissen sie?" vs. "die heissen oefen")?
> > Norwegian has similar issues with our letters "æøå" (ae ligature,
> > o with slash across, a with ring above) which are sometimes
> > transcribed as "ae", "oe" and "aa". And of course these letter
> > combinations can occur in other words as well...

> i don't know about norwegian, but swedes consider å, ä and ö to be
> distinct letters rather than modifications of roman letters. the ae
> oe and aa are just poor man's spellings for when you are stuck with
> a limited set of characters.

Same in Norwegian, but the problems illustrate a point, namely that
ligatures that are more or less mandatory, or even proper letters in
their own right, have to be handled by people. They can not be dealt
with _correctly_ in a purely programmatic fashion. If one had an
exhaustive dictionary, one might come close, but no dictionary can
ever be exhaustive, because new words are coined all the time.

Example: If you write Norwegian without having the letter ø available,
you'd have:

- hoest for høst (autumn)
- moene for møne (top of peaked roof)
- Moen for Moen (the name, note no change)

Going the other way (reinserting the proper ø) would require human
intervention (possibly aided by dictionary), because the letters oe
appear as separate letters next to each other on occasion. A
dictionary alone would not be able to see the difference between møne
and moene ("the grasslands", sing.indef.:mo), so it would need to flag
collisions in some way for a human to resolve. These same
considerations hold for the German ß, vs. two separate s-es that
happen to be next to each other.
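
To make the asymmetry concrete, here is a rough sketch of the easy direction. The names *transliterations* and transliterate are invented for this post, only the lowercase letters are handled, and the character codes 230, 248 and 229 assume an implementation whose char codes follow Latin-1 or Unicode:

```lisp
;; Assumed Latin-1/Unicode codes: 230 = ae ligature, 248 = o with
;; slash, 229 = a with ring above.
(defparameter *transliterations*
  (list (cons (code-char 230) "ae")
        (cons (code-char 248) "oe")
        (cons (code-char 229) "aa")))

;; Replace each special letter by its two-letter spelling and copy
;; every other character unchanged.
(defun transliterate (string)
  (with-output-to-string (out)
    (loop for ch across string
          for expansion = (cdr (assoc ch *transliterations*))
          do (if expansion
                 (write-string expansion out)
                 (write-char ch out)))))
```

Both the word for the roof top and the word for the grasslands come out as "moene", so no function of the output alone can restore the original spelling.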

This is just to illustrate a point, since these "ligatures" are
mandatory, and have codes in the common local character encodings. It
becomes interesting when you consider the f-ligatures, that are not
mandatory (except at good quality typesetters) , and which don't exist
on any ordinary keyboard.

Should a computer language care about those ligatures? I'd say maybe.
The f-ligatures also need to be verified by a human, but it should be
possible to represent them in a string, and leave it up to the display
engine whether to show one or two glyphs. Making it possible to
represent the ligatures in a string is one of the considerations when
designing the system for representing strings in a computer-program.

--
Håkon Alstadheim, Montreal, Quebec, Canada



Sat, 16 Aug 2003 01:53:58 GMT  
 character classes

Quote:

> If we're talking about the German letter that looks a bit like a
> `beta' in many typefaces (I can't make it on this terminal for some
> reason), then I thought it was historically an s-z ligature, not an
> s-s ligature.  Indeed it kind of looks like that -- long-s with a z
> against it.  A native German speaker can probably correct me.

That's true.  Unfortunately, the capitalization of `ß' is always `SS',
whereas the lowercase of `SS' is either `ss' or `ß'.  Without knowledge
of the word's semantics you cannot decide which lowercase
representation to use.
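
A sketch of the one direction that can be automated. The standard char-upcase maps one character to one character, so it cannot produce `SS' by itself; the function upcase-german and the variable *sharp-s* below are invented for illustration, and code 223 assumes Latin-1 or Unicode character codes:

```lisp
;; Assumed Latin-1/Unicode code for the German sharp s.
(defparameter *sharp-s* (code-char 223))

;; Upcase STRING, expanding each sharp s into two capital S characters.
;; The result may therefore be longer than the input.
(defun upcase-german (string)
  (with-output-to-string (out)
    (loop for ch across string
          do (if (char= ch *sharp-s*)
                 (write-string "SS" out)
                 (write-char (char-upcase ch) out)))))
```

Both spellings of the name end up as GAUSS, which is why the way back needs a dictionary or a human.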

Krid



Sat, 16 Aug 2003 03:28:31 GMT  
 character classes

Quote:

> ...
>> > > wouldn't that depend on which "heissen" you are parsing ("wie
>> > > heissen sie?" vs. "die heissen oefen")?
>dictionary alone would not be able to see the difference between møne
>and moene ("the grasslands", sing.indef.:mo), so it would need to flag
>collisions in some way for a human to resolve. These same
>considerations hold for the German ß, vs. two separate s-es that
>happen to be next to each other.

actually, as far as i remember (it has been too long), the "heissen"
examples are somewhat different.  in one case the "s-z" is mandatory (in
case you have it on your keyboard), in the other case it is an error.
and in the second case it is not an accidental juxtaposition.

Quote:
> ...
>The f-ligatures also need to be verified by a human, but it should be
>possible to represent them in a string, and leave it up to the display
>engine whether to show one or two glyphs. Making it possible to
>represent the ligatures in a string is one of the considerations when
>designing the system for representing strings in a computer-program.

why?  this is simply a typographic convention that best be left to type-
setting programs.

hs



Sat, 16 Aug 2003 08:29:53 GMT  
 character classes

Quote:

> >The f-ligatures also need to be verified by a human, but it should be
> >possible to represent them in a string, and leave it up to the display
> >engine whether to show one or two glyphs. Making it possible to
> >represent the ligatures in a string is one of the considerations when
> >designing the system for representing strings in a computer-program.

> why?  this is simply a typographic convention that best be left to type-
> setting programs.

Actually, this is a more complicated problem than most people assume.
Not all instances of "ff", "fi", etc., should be ligated.  I can't
think of an example word off the top of my head, but this is mentioned
in the TeXbook, along with how to prevent automatic ligation
("affable" v. "af{}fable", I believe).  If it's possible for a program
to determine when f-ligatures are appropriate and when they aren't, I
don't think I've seen it implemented.  So it would actually seem (to
me anyway) to be useful to have both an "f" and an "ff" glyph -- then
display engines that do ff ligation could display the ligated form
when appropriate, and two f's when appropriate.  Display engines that
don't care could render both "ff" and two "f"s the same.


Sat, 16 Aug 2003 09:13:43 GMT  
 