delimiters (was: SOKOBAN.F Bug) 


WB> [about his conversion tool]

WB> In the Standard in A.6.2.2000 it is written:
[..]
WB> In the OpenFirmware Standard it is written:
[..]

WB> In modern source files `-TRAILING`, `SKIP`, and `SCAN` are no longer
WB> appropriate.  Expanding their meaning will bite you.  For some time
WB> I have replaced them with `TRIM-TRAILING`, `LEFT-JUSTIFY`,
WB> `SPLIT-AT-SPACE`, and `SPLIT-AT-CHAR`.  I use them in my program to
WB> canonize spelling.

I understand three issues from your input.

3. As for the last one, I find '-TRAILING' confusing, insofar as it does
not in all instances do what the name suggests. I hardly use it. Mostly
it can be expressed by '-SKIP' or '-LEX-SKIP', whether for zeros, blanks,
anything in the range 0..32, or specific combinations.

2. For characters kept in memory, I'd regard it as a big loss if the
contents of memory space had to be treated as hidden. If that were so,
the question would arise what programming should be about at all - it
would reduce to serving the APIs of OSes.

Instead, the bulk of what is done by programming is not about serving
APIs, but about conforming to other specs which come from elsewhere,
not from OS vendors. So in programming, most of the time, the first
requirement is to know clearly what is held in memory space.

If it is, then SKIP and SCAN (plus -SKIP and -SCAN, and all four again
with a 'LEX-' prefix to handle groups of characters) _are_ basic
building tools. Show me a real-world secondary implementation of WORD or
PARSE which is defined without those building tools...

I think there is a misunderstanding creeping in: should Forth only be
good for _using_ protocols for streams of characters, or should it be
understood as fit to _implement_ such protocols? Or, the other way
round: should we consider OS (and API) builders fit enough to provide
each and every data streaming protocol that will ever be invented? I
hardly believe this can be answered with 'yes' in real-world terms,
leaving aside the totalitarian approaches of some OS vendors.

I guess this is no longer the time to expect very basic conventions to
be settled. Data streams _are_ streams of octets now. The rest has
become marketing politics, not to be taken seriously any more in terms
of settling long-term conventions.

Yes, if someone says 'programming' and just means 'calling APIs', then
maybe SKIP and SCAN are 'outdated' for the user (better say 'believer')
of some OS. For real-world tasks this is a negligible issue, especially
when taking into account the field of micro-processing.

We cannot neglect the fact that part of the data will be streams of
encoded text, and a straight definition of how this is encoded in octets
will always be given. Or it will be pure data octets, meaning the value
range 0..255. Likewise, for two-byte and four-byte combinations no one
can specify data streams without also specifying the encoding in terms
of octets.

There remains a minority of spec writers who want to write
human-readable text but act as if they were writing for a compiler. In
such cases the result is a poor specification which falls back on the
behaviour of some other compiler (a bad habit met sometimes, mainly when
'C' programmers have to describe a protocol they implemented). It is not
enough then to say 'use the compiled API'; it is better to call a bad
communication habit by its name (instead of hiding it behind terms like
'modern' and the like).

I'd ask not to confuse all this with the representation of characters
at the other end, mainly how a CPU maintains its storage.

1. The Forth word WORD expects a delimiter character as parameter. From
the naming alone I'd expect PARSE-WORD to do the same. Where the
delimiter is built in, I use WORD-NEXT (or NAME, following the usage
found in some places), or PARSE-NEXT.



Thu, 25 May 2000 03:00:00 GMT  

There are two points that I had hoped to make.

The major point is that `PARSE-WORD` has a well-adopted meaning, and
different semantics will raise confusion.  

The minor point is that words in natural language and program code could
be delimited by white space, not just blanks.

(I have also run into Forths where `SKIP` or `SCAN` has had a different
meaning from what you expect, but that's not important in a system you're
building yourself.)

--
Let us go forth in peace.   Wil Baden Costa Mesa, California





Thu, 25 May 2000 03:00:00 GMT  

WB> The major point is that `PARSE-WORD` has a well-adopted meaning, and
WB> different semantics will raise confusion.

If it clashes with a well-adopted meaning, then somewhere a misnaming
is involved, under however famous a title.

WB> The minor point is that words in natural language and program code
WB> could be delimited by white space, not just blanks.

This is what my LEX-... words can be applied to. But it should not play
a role for Forth sources. A line fed to the interpreter from
INCLUDE-FILE may be expected to have its white space (that is, tabs and
spaces) converted to spaces only. Sources from block files (and
EVALUATE), by contrast, are to be dealt with as raw input.

WB> (I have also run into Forths where `SKIP` or `SCAN` has had a
WB> different meaning from what you expect, but that's not important in
WB> a system you're building yourself.)

They can be found with a consistent meaning at least from the
wide-spread Forth 83 implementations onward. There, skip means to
advance in a string _past_ a delimiting character; scan means to advance
_to_ a given character. The result is only the remaining string, which
might oblige a newcomer to spend some thought on it. When they are
defined as secondaries, the building tool for advancing in a string is
/STRING - and on many CPUs this kind of parsing is supported very
efficiently at low level.
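
As an illustration of the skip/scan semantics just described, here is a minimal sketch of both words as secondaries built on /STRING. The stack effects are an assumption based on common Forth-83-era usage; as noted elsewhere in the thread, individual systems may define them differently.

```forth
\ Sketch only: SKIP and SCAN as secondaries built on /STRING.
\ Assumed stack effect: ( c-addr u char -- c-addr' u' ).

: SCAN ( c-addr u char -- c-addr' u' )   \ advance TO the first char
   >R
   BEGIN  DUP WHILE                      \ while characters remain
      OVER C@ R@ <> WHILE                \ and the current one is not char
      1 /STRING                          \ step one character forward
   REPEAT THEN
   R> DROP ;

: SKIP ( c-addr u char -- c-addr' u' )   \ advance PAST leading chars
   >R
   BEGIN  DUP WHILE                      \ while characters remain
      OVER C@ R@ = WHILE                 \ and the current one equals char
      1 /STRING
   REPEAT THEN
   R> DROP ;
```

If the character is not found, SCAN returns the end of the string with a count of zero, which is why only the remaining string comes back.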

To offer the LEX.. versions:

: LEXSKIP       ( a/string n a/list n -- a/string+ n-)

                2swap
   begin        dup
   while        dup >R 2over
      0 ?do     count swap >R skip R>
      loop      drop R> over =
   until then   2swap 2drop ;

: LEXSCAN       ( a/string n a/list n -- a/string+ n-)

            2over 2swap
   0 ?do    >R 2dup
            \ next line lost in this archived copy; reconstructed from the stack effect
            R@ c@ scan nip -
            R> char+
   loop     drop nip /string ;

: LEXSPLIT  ( a/s n a/l n -- a/s n- a/s n- 0|#)

        2over 2over
        lexscan dup
        \ next line lost in this archived copy; reconstructed from the stack effect
   if   over c@
        >R 2swap tuck
        R> scan nip -
        >R dup
        >R 1 /string 2swap
        R> - R> char+ exit
   then 2swap 2drop 0 ;
\ ''''''''''

S" ABC DEF" S" " LEXSPLIT should return the string parts 'ABC' and 'DEF'
(if the interpreter supports more than one S" .." in one turn) plus the
offset of the hit into the list-string plus one, that is 1 [chars]
here.



Sat, 27 May 2000 03:00:00 GMT  

I like `SKIP` and `SCAN`.  With `/STRING` they were presented at
FORML by Klaus Schleisieck just before Forth 83.  I used them for
years, and was disappointed when they were rejected by the Standard
Committee.  

Your LEX-words are very powerful.  I'll file them away for future
reference.
--
Wil



Sat, 27 May 2000 03:00:00 GMT  

Quote:

>To offer the LEX.. versions:
>: LEXSKIP       ( a/string n a/list n -- a/string+ n-)
>                2swap
>   begin        dup
>   while        dup >R 2over
>      0 ?do     count swap >R skip R>
>      loop      drop R> over =
>   until then   2swap 2drop ;

...

I'd propose using arrays of 256 chars, where a non-zero value
means a delimiter.
This would work a bit faster, especially if CODEd.
OTOH, special words will be needed to manipulate
these arrays, while the only problem with strings is
having control characters in them (neither is
complex).

One more problem with an array which can hold all characters
is Unicode: the size of such an array for Unicode will be 64K.
OTOH, if we note that special symbols are only among the first 128
chars, we can add a check that the char is less than the size of the
array... and so on. If there are problems, they are solvable.

Are char attribute arrays more convenient than strings?
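
To make the proposal concrete, here is a minimal sketch of such an attribute table, including the range check suggested above for character codes beyond the table. The names DELIMS, +DELIM, DELIM? and TABLE-SCAN are hypothetical, and the sketch assumes one character per address unit.

```forth
\ Sketch only: a 256-entry attribute table, non-zero = delimiter.
\ DELIMS, +DELIM, DELIM? and TABLE-SCAN are hypothetical names.

CREATE DELIMS 256 ALLOT   DELIMS 256 ERASE

: +DELIM ( char -- )   DELIMS + 1 SWAP C! ;   \ mark char as a delimiter

: DELIM? ( char -- flag )                     \ codes >= 256 are never delimiters
   DUP 256 U< IF  DELIMS + C@ 0<>  ELSE  DROP FALSE  THEN ;

: TABLE-SCAN ( c-addr u -- c-addr' u' )       \ advance to the first delimiter
   BEGIN  DUP WHILE
      OVER C@ DELIM? 0= WHILE
      1 /STRING
   REPEAT THEN ;

BL +DELIM   9 +DELIM   10 +DELIM   13 +DELIM  \ space, tab, LF, CR
```

The string is walked once regardless of how many delimiters are marked; the price is the 256 bytes of table and the words needed to maintain it.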



Sat, 27 May 2000 03:00:00 GMT  

Quote:
>And one more problem with an array which can contain all characters
>is Unicode: the size of such array for Unicode will be 64K.
>OTOH, if we note that special symbols are only among the 1st 128
>chars, we can add a  check if char is less than the size of the
>array... and so on. If there are problems, they are solvable.

They've already been solved.  UTF-8 is an 8-bit encoding of Unicode,
designed so that it is backward compatible with existing 8-bit versions
of strcmp() and like functions.

For the 16-bit range the mapping is as follows:

Unicode        8-bit binary encoding            Octet range
--------------------------------------------------------------------------
0000-007F      0xxxxxxx                         (00)-(7F)
0080-07FF      110xxxxx 10xxxxxx                (C2,80)-(DF,BF)
0800-FFFF      1110xxxx 10xxxxxx 10xxxxxx       (E0,A0,80)-(EF,BF,BF)
--------------------------------------------------------------------------
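
The mapping in the table can be sketched in Forth as follows. UTF8! is a hypothetical name, the sketch only covers code points 0..FFFF as in the table, it does not reject surrogates, and it assumes one character per address unit.

```forth
\ Sketch only: store code point u (0..FFFF) as UTF-8 at c-addr.
\ UTF8! is a hypothetical name; c-addr' points past the last octet.
HEX
: UTF8! ( u c-addr -- c-addr' )
   >R
   DUP 80 U< IF                               \ 0xxxxxxx
      R@ C!  R> 1+ EXIT
   THEN
   DUP 800 U< IF                              \ 110xxxxx 10xxxxxx
      DUP 6 RSHIFT 0C0 OR  R@ C!
      3F AND 80 OR  R@ 1+ C!
      R> 2 + EXIT
   THEN                                       \ 1110xxxx 10xxxxxx 10xxxxxx
   DUP 0C RSHIFT 0E0 OR  R@ C!
   DUP 6 RSHIFT 3F AND 80 OR  R@ 1+ C!
   3F AND 80 OR  R@ 2 + C!
   R> 3 + ;
DECIMAL
```

Because every octet of a multi-octet sequence has its high bit set, a table of delimiters restricted to the first 128 codes keeps working on UTF-8 text without change.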

==========================================================================
                            -| TEAM DOLPHIN |-
                    Chief Architect and Project Founder
                       (web page under construction)

                 PGP 5.0 Public Key Available Upon Request.



Mon, 29 May 2000 03:00:00 GMT  


Quote:

>To offer the LEX.. versions:

[..]

MG> I'd propose to use arrays of size 256 of char, non-zero value
MG> meaning a delimiter.
MG> this would work a bit faster, especially if CODEd.
MG> OTOH, special words will be needed to manipulate
MG> these arrays, while the only problem with strings is
MG> to have control characters in them (both are
MG> not complex).

This should work - though it is not easy to check whether it is worth
the overhead of managing additional data fields. The
READ-COMPARE-INCREMENT combination found on CPUs is usually quite fast.
If a 'lex' list is expected to contain more than, say, 6 or 8 items (and
is run very often in its context), your mapping method might become a
good deal faster. I'd regard it as a bitmapping technique, where, as I
understand it, you propose bytes holding flags in place of bits. With
bit-setting and bit look-ups supported more or less directly by a CPU, a
bit array for 256 choices could be held in 32 bytes. This way an input
string would have to be scanned only once (instead of, in the worst
case, once for each item in the list), but each scanning step (including
the array look-up) would certainly become a good deal slower.
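
A minimal sketch of the 32-byte bit array mentioned above, in high-level Forth rather than with CPU bit instructions. CHARSET, BIT-ADDR, +MEMBER, MEMBER? and SET-SCAN are hypothetical names; the sketch assumes character codes 0..255 and one character per address unit.

```forth
\ Sketch only: a 256-bit set packed into 32 bytes.
\ CHARSET, BIT-ADDR, +MEMBER, MEMBER? and SET-SCAN are hypothetical names.

CREATE CHARSET 32 ALLOT   CHARSET 32 ERASE

: BIT-ADDR ( char -- mask addr )
   DUP 7 AND 1 SWAP LSHIFT        \ bit mask within the byte
   SWAP 3 RSHIFT CHARSET + ;      \ address of the byte holding that bit

: +MEMBER ( char -- )       BIT-ADDR DUP >R C@ OR R> C! ;
: MEMBER? ( char -- flag )  BIT-ADDR C@ AND 0<> ;

: SET-SCAN ( c-addr u -- c-addr' u' )   \ advance to the first member
   BEGIN  DUP WHILE
      OVER C@ MEMBER? 0= WHILE
      1 /STRING
   REPEAT THEN ;

BL +MEMBER   9 +MEMBER              \ space and tab
```

The shift and mask in BIT-ADDR are the extra per-character cost the post refers to; whether that beats one pass per list item depends on the list length and the CPU.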



Fri, 02 Jun 2000 03:00:00 GMT  

WB> I like `SKIP` and `SCAN`.  With `/STRING` they were presented at
WB> FORML by Klaus Schleisieck just before Forth 83.  I used them for
WB> years, and was disappointed when they were rejected by the Standard
WB> Committee.

I did not know this reference. Thank you for pointing it out. Back at
the time when 'Basis 14' appeared, I wondered why there was a LEX in
there (which motivated me to think about 'LEXSCAN' and the like), while
SCAN and SKIP were missing. It looked like a compromise, but that one
disappeared as well.



Fri, 02 Jun 2000 03:00:00 GMT  



0 [IF] COMMENT

Martin Tracy used `LEX` for several years before proposing it.  It was
rejected because it was not considered simple enough for a universal primitive.

I didn't care for it because it stepped over the delimiter, and from the
returned values with 0 characters remaining, you can't tell whether the
delimiter has been found.  You handle that with an extra return value, but I
simply stop at the delimiter -- I use `1 /STRING` to step over the delimiter
when needed.

In my experience skipping and scanning is not done on sets of mixed characters
but on closely related characters -- digits, lower case, punctuation, white
space, printable, _etc_.

(There are exceptions.  I have some applications which need to recognize
vowels.  For speed I use a character table.)

I don't understand why `WITHIN` is so popular.  To convert a character to
upper case without a character table or using CODE, surely this is preferable.
`
: UPC ( c -- C ) DUP [CHAR] A BL OR - 26 U< BL AND - ;
`
Instead of `LEXSPLIT` I factor out the splitting condition.  The following
works for me.

END COMMENT [THEN]


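\ The definition of >SPLIT is missing from this archived copy; the line below
\ is a reconstruction inferred from SPLIT> and the stack comments of the examples.
MACRO >SPLIT " 2DUP BEGIN DUP WHILE OVER C@ "

MACRO SPLIT> " 0= WHILE 1 /STRING REPEAT THEN DUP >R 2SWAP R> - "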

( Examples )

: split-at-space ( a n -- a+i n-i a i )  >SPLIT BL > NOT SPLIT> ;

: letters-end ( a n -- a+i n-i ) >SPLIT BL OR [CHAR] a - 26 U< NOT SPLIT> ;

(
--
Wil
)



Sat, 03 Jun 2000 03:00:00 GMT  



Quote:
> WB> The minor point is that words in natural language and program code
> WB> could be delimited by white space, not just blanks.
> This is what my LEX-... things may apply to. But it should not play a
> role for Forth sources. A line fed to the interpreter from INCLUDE-FILE
> may be expected to have white space (that is tabs and spaces) converted
> to spaces only. Sources from block files (and EVALUATE) instead are to
> be dealt with as raw input.

Hmm, it looks more and more as if WORD is only really usable in the
phrase "BL WORD", and PARSE is to be preferred for all other uses.
WORD in this context treats control characters as white space.  They
cannot in any case be dealt with portably in any other way.  For
example, an S" string may not portably include a control character.

So there are two possible strategies:

1.  Convert all control chars to blanks before WORD sees them.

2.  Have WORD treat them as blanks.  This is easy with ASCII - simply
have WORD check for char <= 32 instead of equality.
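
The second strategy can be sketched outside of WORD itself, on an ordinary string. WHITE? and NEXT-TOKEN are hypothetical names, not part of any system mentioned here, and the sketch assumes ASCII and one character per address unit.

```forth
\ Sketch only: split the next token off a string, treating every
\ char <= 32 (BL) as white space.
\ WHITE? and NEXT-TOKEN are hypothetical names.

: WHITE? ( char -- flag )   BL 1+ U< ;        \ true for codes 0..32

: NEXT-TOKEN ( c-addr u -- c-addr' u' tok-addr tok-len )
   BEGIN  DUP WHILE                           \ skip leading white space
      OVER C@ WHITE? WHILE  1 /STRING
   REPEAT THEN
   OVER >R                                    \ remember the token start
   BEGIN  DUP WHILE                           \ advance to the token end
      OVER C@ WHITE? 0= WHILE  1 /STRING
   REPEAT THEN
   OVER R@ -  R> SWAP ;                       \ token length, then address
```

With this test a tab, CR or any other control code ends a token exactly as a space does, so no prior conversion pass over the source is needed.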

I have taken the second option.  It allows me to use control
characters in a screen editor for formatting source.  The most
obvious difference from normal screens is that I now have
variable-length lines, but it is possible to use the control
characters to provide colour syntax highlighting, or whatever else you like.

The same interpreter can handle both formatted and plain block
source.  There is no difference in the load words except that I use

: % 13 PARSE 2DROP ; IMMEDIATE

instead of \ in the formatted source.  Given this definition, a
formatted block file can also be interpreted by most Forths that only
load text files!

--
Jack Brien
Prove all things - hold fast to that which is good
(1 Thessalonians 5:21)
www.users.zetnet.co.uk/aborigine/forth



Sat, 03 Jun 2000 03:00:00 GMT  
 