Atomic Strings

Introduction

We will devise a number of words to implement so called "atomic strings". This data type augments Forth's more machine level string handling with something higher level. Hereafter atomic strings will simply be referred to as atoms. The central properties of atoms are:

The utility of atoms will become apparent given some examples.

Using Atoms

Atoms with the same string are equal:

testing atoms +≡
atom" foo" atom" foo" = assert

Atoms with different strings are of course, not equal:

testing atoms +≡
atom" bar" atom" foo" <> assert

Atoms can be concatenated:

testing atoms +≡
atom" testing" atom" 123" atom+ atom" testing123" = assert

Atoms can have a meaning assigned to them using atom+=$ (to append a literal string) or atom+=ref (to append a reference to the meaning of another atom).

testing atoms +≡
atom" abc" atom" bar" atom+=$
atom" def" atom" bar" atom+=$
atom" 1234" atom" foo" atom+=$
atom" bar" atom" foo" atom+=ref
atom" 5678 9" atom" foo" atom+=$
atom" bar" atom" foo" atom+=ref
atom" foo" means atom" 1234abcdef5678 9abcdef" = assert

Structure of an Atom

Conveniently, because atoms have a single numerical value per string value, we can implement meaning without the need for a lookup data structure. Each atom's value will be the address of a structure:

Some words to read these values are useful:

implement atoms +≡
: atom-length@ ( A -- n )
    1 cells + @ ;
: atom-data@ ( A -- a )
    2 cells + @ ;
: atom-string@ ( A -- $ )
    dup atom-data@ swap atom-length@ ;
: atom-meaning-head ( A -- A[head] )
    3 cells + ;

Off of each atom's primary structure, a chain of "meaning" links. When determining the "meaning" of an atom, the expansion of each link in the chain is concatenated. There are two types of link:

The format of the meaning links is:

Implementing Atoms

A list of all atoms will be kept chained off atom-root . Whenever an atom is needed, this list should be consulted before a new atoms is created (as an existing one may exist and must be used).

implement atoms +≡
linked-list atom-root

We will create new unchained atoms either from a string that can safely be assumed to persist:

implement atoms +≡
: $atom-new ( $ -- A )
    >r >r 0 0 r> r> 4 atom-root chain atom-root cell+ @ ;

Or from one that is transitory (parse region for example).

implement atoms +≡
: atom-new ( $ -- A )
    $clone $atom-new ;

Comparison for equality with a normal string is needed in order to seek out a match from the existing pool of atoms.

implement atoms +≡
: atom= ( $ A -- f )
    atom-string@ compare 0= ;

We then need a way to look through all atoms for a match.

implement atoms +≡
: atom-find' ( $ A -- A )
    begin
       dup 0= if nip nip exit then
       3dup atom= if nip nip exit then
       ->next
    again ;
: atom-find ( $ -- A )
    atom-root @ atom-find' ;

Now we can implement two versions of atom lookup. $atom for atoms based on persistent strings.

implement atoms +≡
: $atom ( $ -- A )
    2dup atom-find dup if nip nip else drop $atom-new then ;

And atom for atoms based on non-persistent strings.

implement atoms +≡
: atom ( $ -- A )
    2dup atom-find dup if nip nip else drop atom-new then ;

Printing an atom is provided (mainly for debugging).

implement atoms +≡
: atom. ( A -- )
    atom-string@ type ;

As is printing all atoms.

implement atoms +≡
: atoms. ( -- )
    atom-root @ begin dup while
    dup atom. cr ->next repeat drop ;

We provide two different stringing words for atoms. One based on quotes, the other braces.

implement atoms +≡
: atom" ( -- A )
    [char] " parse
    state @ if postpone sliteral postpone atom
    else atom then ; immediate
: atom{ ( -- A )
    [char] } parse
    state @ if postpone sliteral postpone atom
    else atom then ; immediate

As well as a word for an empty atom.

implement atoms +≡
: atom"" ( -- A ) 0 0 atom ;

While atoms are fixed, once created, their "meanings" can be accumulated gradually. The two words for this are atom+=$ and atom+=ref.

implement atoms +≡
: atom-append ( A n Ad -- )
    atom-meaning-head 2 swap chain ;
: atom+=$ ( A Ad -- )
    0 swap atom-append ;
: atom+=ref ( A Ad -- )
    1 swap atom-append ;

We then provide a way to extract the "meaning" of an atom.

implement atoms +≡
implement means tools
: means ( A -- A' )
    dup atom-walk-length dup allocate 0= assert
    swap 2dup >r >r drop
    atom-walk-gather r> r> $atom ;

Using this plumbing.

implement means tools +≡
: ref-parts ( ref -- A ref? )
    cell+ dup cell+ @ swap @ ;
: atom-walk ( fn A -- )
     atom-meaning-head @ begin dup while
         2dup >r >r
         ref-parts if recurse else swap execute then
         r> r>
         ->next
     repeat 2drop ;
: tally-length ( n A -- n )
    atom-length@ + ;
: gather-string ( a A -- a' )
    2dup atom-string@ >r swap r> move tally-length ;
: atom-walk-length ( A -- n )
    0 swap ['] tally-length swap atom-walk ;
: atom-walk-gather ( a A -- )
    swap ['] gather-string swap atom-walk drop ;

We provide atom concatenation.

implement atoms +≡
: atom>>$ ( A d -- d' )
    2dup >r atom-string@ r> swap move swap atom-length@ + ;
: atom+ ( A A -- A )
    swap 2dup atom-length@ swap atom-length@ + dup >r
    allocate 0= assert dup >r
    atom>>$ atom>>$ drop r> r> $atom ;

And a way to get an atom from one character.

implement atoms +≡
: atom-ch ( ch -- A )
    1 allocate 0= assert 2dup c! nip 1 atom ;

This allows us to add a shorthand for carriage returns and concatenation of carriage returns.

implement atoms +≡
10 atom-ch constant atom-cr
: atom-cr+ ( A -- A )
    atom-cr atom+ ;

We can then apply the tests above.

implement atoms +≡
testing atoms

And some words that depend on atoms.

implement atoms +≡
post atom utility words

HTML Escaping

A critical feature is to be able to html escape an atom. We convert the following:

implement atoms +≡
: escape-ch ( ch -- )
   dup [char] < = if [char] & c, [char] l c, [char] t c,
                     [char] ; c, drop exit then
   dup [char] > = if [char] & c, [char] g c, [char] t c,
                     [char] ; c, drop exit then
   dup [char] " = if [char] & c, [char] q c, [char] u c, [char] o c,
                     [char] t c, [char] ; c, drop exit then
   dup [char] & = if [char] & c, [char] a c, [char] m c, [char] p c,
                     [char] ; c, drop exit then
   c, ;
: escape-each ( A -- )
    atom-string@ 0 ?do dup i + c@ escape-ch loop drop ;
: here! ( a -- )
    here - allot ;
: escape ( A -- A )
    here dup >r swap escape-each here over - atom r> here! ;