Frere_de_la_Quote

joined 1 year ago
 

As long as you don't test it you are in a quantum state, the same as for the Schrödinger's cat. Your code is either buggy or is not.

Hence, bugs are in fact triggered by users.

I tried to solve some of the last Advent of Code enigmas with LispE, and I discovered a plethora of problems, which I didn't think would erupt after so many years of tests.

 

Old school text generation

The method we will present here belongs to the old school, the one before chatGPT and other language models. But sometimes, if you don't have a dozen GPUs at hand and a good terabyte of hard disk, well, the normal configuration of a computer scientist today... Ah, this is not your case... I didn't know... Sorry...

So there are some traditional methods to achieve similar results. Well, it won't write your year-end report...

I want to show you how you can implement one with LispE. (Yeah... I know again)

The grammar

In the case of an old-fashioned generation, you need a generation grammar.

For those in the know, the parser that I will describe next is called a Chart Parser.

For example:

  S : NP VP
  PP : PREP NNP
  NP : DET NOUN
  NP : DET NOUN PP
  NP : DET ADJ NOUN
  NNP : DET NLOC
  NNP : DET ADJ NLOC
  VP : VERB NP
  DET : "the" "a" "this" "that
  NOUN : "cat" "dog" "bat" "man" "woman" "child" "puppy
  NLOC : "house" "barn" "flat" "city" "country
  VERB: "eats" "chases" "dicks" "sees"
  PREP: "of" "from" "in"
  ADJ: "big" "small" "blue" "red" "yellow" "petite"

There is also a grammar for French, but it is admittedly a bit complicated to read, especially because of the agreement rules.

Compile this thing

This grammar is rather simple to read. We start with a sentence node "S", which is composed of a nominal group and a verbal group. The rules that follow give the different forms that each of these groups can take. Thus a nominal group: NNP can be broken down into a determiner followed by an adjective and a noun.

The compilation of this grammar consists in creating a large dictionary indexed on the left parts of these rules:

{
   %ADJ:("big" "small" "blue" "red" "yellow" "petite")
   %DET:("the" "a" "this" "that")
   %NLOC:("house" "barn" "flat" "city" "country")
   %NOUN:("cat" "dog" "bat" "man" "woman" "child" "puppy")
   %PREP:("of" "from" "in")
   %VERB:("eats" "chases" "bites" "sees")
   ADJ:"%ADJ"
   DET:"%DET"
   NLOC:"%NLOC"
   NNP:(("DET" "NLOC") ("DET" "ADJ" "NLOC"))
   NOUN:"%NOUN"
   NP:(
      ("DET" "NOUN")
      ("DET" "NOUN" "PP")
      ("DET" "ADJ" "NOUN")
   )
   PP:(("PREP" "NNP"))
   PREP:"%PREP"
   S:(("NP" "VP"))
   VERB:"%VERB"
   VP:(("VERB" "NP"))
}

Some lines are simple copy/paste of the rules above, except for the lexical rules which are preceded by a "%". The goal is to be able to differentiate between applying a rule and generating words.

Analyze and generate with the same grammar

This is certainly the nice thing about the approach we propose here.

We will use this grammar in both directions, which means that we can feed it a piece of sentence and let it finish.

For example, if we start with: a cat, it can then propose its own continuations.

Note that here, the continuations will draw random words from the word lists. This can result in completely ridiculous sentences... or not.

The first step

The user provides the beginning of a sentence, but also, and this is fundamental, the initial symbol corresponding to what (s)he wants to produce.

This symbol is an entry point in our grammar. We will choose: S.

In other words, we will ask the system to produce a sentence.

In the first step we have two lists in parallel:

   Words   Categories
("a "cat")  ("S")

The replacement

S is an entry point in the grammar whose value is: ("NP" "VP")

So we replace the structure above to reflect this possibility.

  Words     Categories
("a "cat") ("NP" "VP")

The head of the category list is now: NP.

Since there are several possible rules for NP, we'll just loop around to find the one that covers our list of words:

  Words        Categories
("a "cat") ("DET" "Noun" "VP")

Now our head is DET which points to a lexical item. We just have to check that "a" belongs to the list associated with "DET".

This is the case, we can then eliminate elements from both lists:

  Words  Categories
("cat") ("Noun" "VP")

We can do the same operation for "Noun", the word list is then empty.

Words Categories
()     ("VP")

We then switch to the generation mode.

Generation

VP returns a list with only one element: ("Verb" "NP")

     Categories             Words
  ("Verb" "NP")          ("a" "cat")

Note that "Generated" contains as initial value the words coming from our sentence.

Since Verb is a lexical item, we draw a word at random from our list of verbs:

     Categories             Words
      ("NP")         ("a "cat" "chases")

We then draw a rule at random from those associated with NP:

     Categories             Words
("Det" "Adj" "Noun")    ("a "cat" "chases")

The job is now very simple, just draw a determiner, an adjective and a noun at random from their respective list:

     Categories                Words
        ()         ("a "cat" "chases" "a" "big" "dog")

Since the list of categories is now empty we stop there and returns our sentence.

Implementation detail in LispE

If you take a quick look at the code of the parser, you will observe the presence of two functions: match and generate. These functions are based on the extensive use of defpat, the pattern programming functions in LispE.

match

match is used to check if the words in a sentence can be parsed by the grammar. The conditions for match to succeed are twofold:

  • Either the word list and the category list are empty
  • Either the word list is empty and the system continues in generation mode on the remaining categories

; We have used up all our words and categories
; No need to go further
(defpat match ([] [] consume) 
   (nconcn consume "$$") 
)

; We stop and generate, the word list is empty
(defpat match ( current_pos [] consume)   
   (generate current_pos consume)
)

; We check the rule associated to the leading category
; consp checks if an object is a list. If it is not the case, it is a lexical rule.
; If not, we loop over the possible rules. 
(defpat match ( [POS $ current_pos] [w $ sentence] consume)
   (setq rule (key grammar POS))
   (if (consp rule) ; if it is a group of rules, we loop to find the right one
      (loop r rule
         (setq poslst (match (nconcn r current_pos) (cons w sentence) consume)
         (if poslst
            (return poslst) ; we find one we stop
         )
      )
      (if (in (key grammar rule) w) ; otherwise it is a lexical rule and we check if the current word is part of it
         (match current_pos sentence (nconcn consume w))
      )
   )
)

Note that "$" is the tail separator operator. Hence "match((NP Verb NP))" will return "POS = NP" and "current_pos = (Verb NP)".

generate

Generation is the final step. Thanks to pattern programming, this operation is reduced to two functions.

; Generating a word
; We are looking for a rule
; This one is either a normal rule (consp) or a lexical rule
(defpat generate([POS $ current_pos] tree)
   (setq r (key grammar POS))
   (if (consp r)
      ; here places the categories of a randomly drawn rule on top
      (generate (nconcn (random_choice 1 r 30) current_pos) tree) 
      ; here we add a word drawn at random
      (generate current_pos (nconc tree (random_choice 1 (key grammar r) 30)) 
   )
)  

; There are no more categories available, we place an end-of-sequence symbol to indicate that 
; all was generated
(defpat generate ([] tree) (nconc tree "%%") )

Conclusion

For those who have already had the opportunity to work with Prolog, this way of designing a program should seem very familiar. For others, this way of programming may seem rather confusing. The use of a pattern to distinguish different functions with the same name but different arguments is called "polymorphism". This kind of operation is also available in C++:

    Element* provideString(wstring& c);
    Element* provideString(string& c);
    Element* provideString(wchar_t c);
    Element* provideString(u_uchar c);

For example, these lines of code come from the interpreter LispE itself.

What distinguishes defpat here from the example above, however, is the richness and complexity of the patterns that can be dynamically used to parse a list of words and categories. Instead of a static compiled call, we have here a very flexible method that allows us to concentrate on the code specific to the detected pattern.

In particular, this method allows tree or graph traversal without the programmer ever getting lost in the tangle of special cases. If the list of elements evolves, it is often enough to add an additional function to take these new elements into account without redesigning the rest of the program.

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago

This is a very complex problem. Maybe this will have some interest for you:

https://github.com/naver/lispe/wiki/6.1-Pattern-Functions

Here is how fizzbuzz is implemented:

; check if the rest of v/d is 0
(defun checkmod (v d) (eq (% v d) 0))

; if the element can be divided by 15 we return fizzbuzz
(defpat fizzbuzz ( [integer_ (not (% x 15))] ) 'fizzbuzz)

; if the element can be divided by 3 we return fizz
(defpat fizzbuzz ( [integer_ (checkmod x 3)] ) 'fizz)

; if the element can be divided by 5 we return buzz
(defpat fizzbuzz ( [integer_ (checkmod x 5)] ) 'buzz)

; rollback function, no type, we return x
; This function will only be called, if none of the above did apply
(defpat fizzbuzz (x) x)

; we apply 'fizzbuzz' to the first 100 elements.
; The argument type is: integer_

(mapcar 'fizzbuzz (range 1 100 1))
 

Someone asked me recently, what was the motivation behind LispE. Here is an example of the kind of functionalities that LispE proposes.

Adventures games were very popular at the beginning of the PC revolution back at the beginning of the 80s. Implementing a game is not very difficult in itself, but it is far from being a trivial task.

However, function programming concepts such as data types and pattern matching can prove incredibly handy to implement such a game.

For instance, you can transform each of your different actions into a Data Type:

(data Command   
   [Move atom_]
   [Break atom_ atom_]
   [Open atom_ atom_]
   [Kill atom_ atom_]
   [Pick atom_ atom_]
   [Take atom_]
   [Drop atom_]
)

(Note that [..] are equivalent to (..), we can freely choose one or the other for more code clarity)

A Data Type provides some very interesting way of controlling how your input is handled in your code. For instance, in each of these definitions, we define the number of arguments and their basic type. In this case, each should be an atom.

However, a data type should not be confused with a struct as in languages such as C++ or Java. It does not record any data structures in memory. Functional Programming tries to keep away from side-effect, while struct in imperative languages wouldn't care less.

If we take the following C struct:

struct {
   int i;
   char c;
} foo;

foo.i = 10;
foo.c = 'a';

Nothing prevents us from modifying the internal values of that struct, which of course might provoke some side-effect.

So what's the point?

These structures shine when used in collaboration with pattern functions.

(defpat action ( [Pick 'up x] ) ...)
(defpat action ( [Take x] ) ...)
(defpat action ( [Drop x] ... )
(defpat action ( [Move direction] ...)

The function action in this context is polymorphic and can be triggered with different data types.

(action [Move 'left])
(action [Pick 'up 'stone])

Each of these actions will then be matched against the corresponding function definitions above.

If you have a look on: minizork_en.lisp, you'll see that we take a string as input and then apply a few transformations:

(setq dialog "GET THE STONE")

; car and cdr have also been repurposed for strings
; GET THE STONE is transformed into: Get the stone

(setq dialog (+ (upper . car dialog) . lower . cdr dialog))

; We replace synonyms with their official value: Get --> Take
; We transform each of our strings into atoms. 
; Note that select returns the first non null value
; We remove stopwords (replaced with "~")
; We eventually obtain: (Take stone)

(setq commands 
  (filterlist '(neq '~) ; we remove stopwords
     (maplist ; we replace words with their synonyms or remove stopwords
         (λ(x) (atom . select (key filterwords x) x)) 
         (split dialog " "))))

Once this transformation has been completed, we can try our new command:

(maybe
   (action command) ; (action [Take stone])
   (println "Wrong command")
)

Basically, if there is an error, we catch it and execute the last line.

The use of pattern functions with data types simplify quite a lot the whole code.

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago

In the case of calling C++ from LispE, there is a directory template, which you can use to create a Makefile and a stub to compile your oan library

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago

Actually, you have two different ways to execute your program: as a string with "execute" or with a file with "load". In both cases, when you execute your code, it returns an Element object that can be anything that your program may return. You can then test if it is a list with isList(), or a string or a number. You have then many way to translate this object into what you want. The method "toString" is just one way to transform your output in a string to display it on screen:

Element* e = lisp->execute("(numbers 1 2 3 4 5)") returns a vector of doubles. You can then iterate on it to extract your values:

vector res; if (e->label() == t_numbers) { ((Numbers*)e)->liste.to_vector(res); }

Another way, which is much more generic:

if (e->isList()) { for (long i = 0; i < e->size(); i++) { res.push_back(e->index(i)->asNumber()); } }

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago (3 children)

For info, here is how you integrate LispE in a C++ environment:

see: https://github.com/naver/lispe/wiki/3.3-Use-LispE-in-your-C---programs

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago (1 children)

Hello,

I have implemented a full Lisp, which my company has accepted to put in open source. The documentation is here: https://github.com/naver/lispe/wiki.

The list of all available functions are here:

https://github.com/naver/lispe/wiki/5.-Description-of-Functions,-Operators-and-Libraries

And you can find the code over here: https://github.com/naver/lispe

The engine is implemented in C++11 and can compile on any platforms: Windows, Mac and Linux. I also provide installers for Windows and Mac OS, and a Web Assembly version.

You can use it as an external library in a C++ program or in a Python program.

In the documentation, I have documented how the internal engine is built.

You can very easily add external libraries and extend the instruction set as you wish.

The engine also proposes many methods to handle Unix commands and retrieve their content as a list of strings.

[–] Frere_de_la_Quote@alien.top 1 points 11 months ago (1 children)

Hum... The answer might surprise you, but there is a real reason why I implemented LispE in the first place, and it was not to create a new Lisp dialect, this came later as a side-effect.

I'm researcher in Computational Linguistics, and I implemented a syntactic analyser during my PhD thesis, which was used as the main tool for research in linguistics in my laboratory for 17 years (from 1999 to 2016). If you look for XIP or Xerox Incremental Parser, you'll find European projects, articles and patents, which all rely on it.

Let's be honest, the advent of LLMs has made the whole endeavor quite moot.

Dealing with texts was quite complicated back then, and we had to deal with heterogeneous encodings all the time. Texts were often encoded in Latin, instead of UTF-8, and the Japanese team used a local encoding that was not UTF-8 neither... Big, big mess... I tried to use Python, but it sucked (and still suck) at handling multi-encoded strings, something that was quite common back then, as almost no dataset had been curated for encodings... A real nightmare, especially for languages such as French or Spanish, which use accented letters.

So I implemented my own language, which is called Tamgu now, to handle these problems (see https://github.com/naver/tamgu). There was a lot of work on string manipulation, with many of the main functions such as find or replace being implemented with SIMD instructions. For instance, encoding conversions for Latin, UTF-8, UTF-16 and UTF-32 rely on this SIMD for fast responses.

Tamgu is now used in production and has become a huge project. However, since the language implementation is now too large, I decided to create a very simple demonstrator in C++ to show how Tamgu actually works, to help other people contribute to it.

And Lisp is the easiest language to demonstrate how an AST interpreter works.

However, I also became very interested in Haskell and Array Languages (such as APL) . But since certain concepts eluded me, I used LispE as a platform to experiment with these concepts. For instance if you have a look on: https://github.com/naver/lispe/wiki/5.4-A-la-Haskell, you'll see how I managed to replicate function composition.

I also experimented with many of the different APL operators such as scan, reduce, rho etc. which I implemented to also better understand their internal working.

Furthermore, LispE is also a Python library, which was a way to better understand Python API. Actually, LispE has been created in such a way that creating external libraries is really simple. Much much simpler than Python by the way. It basically consists of deriving a new class from LispE base class.

LispE also proved simple enough to experiment with Web Assembly.

Finally, LispE is also a very fast multithreaded language, which is almost lock free.

I love LispE, because I can tinker with it as much as I want. Adding a new instruction to the langage takes a couple of minutes. It has many features that are absent from Common Lisp, such as an array based implementation for lists, that makes it the perfect language for Advent of Code or Leetcode.

See: https://github.com/naver/lispe/wiki/6.20-Conway-Game-of-Life-in-LispE for an example of how LispE can help you now.

You can even used it as a terminal to handle Unix commands: https://github.com/naver/lispe/wiki/7.-Shell, especially that I developed my own terminal editor (see https://github.com/naver/lispe/wiki/1.2-Jag:-Terminal-Editor-With-Mouse-Support-and-Colour-Highlighting)

I have more than 30 years of professional experience in implementing parsers and languages in C++ and I thought that at the end of my career it will be fun to share this experience with others.

[–] Frere_de_la_Quote@alien.top 1 points 1 year ago

You are right, I implemented other methods to access elements in a list, which do not rely on cdr or car (see nth operator). Actually, I also implemented a parallel Linked Lists, where these functions make more sense. You can choose which structure is more adapted.

(list 1 2 3) ; array list

(llist 1 2 3) ; linked list

The "llist" behaves exactly as traditional lists.

I don't use std::vector under the hood... :-)

[–] Frere_de_la_Quote@alien.top 1 points 1 year ago

Not exactly. Javascript strings are encoded as UTF-16 characters. Each element in this array of integers is a UTF-16 code point... So basically most characters are encoded on int16_t, except for very large codes, such as emojis, which are usually implemented on two int16_t.

[–] Frere_de_la_Quote@alien.top 1 points 1 year ago (2 children)

Very interesting remark. Basically, a gap buffer is a linked list in disguise, which I wanted to avoid since most implementations become much slower when your data are no longer contiguous in memory.

And what you propose is basically what I have implemented. My list is implemented as two different objects: ITEM and LIST, where ITEM contains a buffer that can be extended at will and LIST contains a pointer to ITEM and an _offset_ value.

When I do a _cdr_, I build a new LIST object that shares the same ITEM pointer as the current list but with an offset incremented by 1.

The implementation is here: https://github.com/naver/lispe/blob/master/include/listes.h

[–] Frere_de_la_Quote@alien.top 1 points 1 year ago (3 children)

I'm really interested if you have some news about this very topic. Seriously!!! Handling strings was really a hassle in WASM and if there is now a better solution I'm really excited to this prospect.

Do yo have any pointers?

 

When you implement your own programming language, one of the problems is to implement a GUI to play with it. If you compile with WASM, you can solve this issue and a as a bonus you will be able to demonstrate your language across the Internet quite easily.

However, there are some real issues. One is that Web Assembly handle strings as arrays of integers.

If you are interested, I compiled my own language: LispE as a WASM library.

I have written a few blogs about how to do it: https://github.com/naver/lispe/wiki/6.18-Handling-strings-in-WASM-without-burning-yourself

Here is also a guide on how to use this library: https://github.com/naver/lispe/wiki/6.17-A-WebAssembly-version-of-LispE

1
LispE (alien.top)
 

I have been working for a few years on my own implementation of Lisp. I have extensively described the language syntax here: https://github.com/naver/lispe/wiki.

The real reason of this version of Lisp is that I wanted to explore array languages such as APL, so I implemented a version of Lisp that is based on arrays and not linked lists. Thanks to this implementation, I was able to implement many operators that are very similar to APL, but which would be implemented on top of a Lisp syntax. However, you can still use traditional Lisp instructions such as car and cdr.

This Lisp also provides many specific types: such as tensors, matrices or dictionaries.

For instance, you can create a tensor in one line of code: (rho 4 5 6 (iota 120)).

I have written a blog on this topic: https://github.com/naver/lispe/wiki/6.20-Conway-Game-of-Life-in-LispE

There are also some implementations for the resolution of Advent of Code enigma in the examples directory. That will give you some flavors of what I try to achieve.

I also provide installers for mac (intel and apple silicon) and windows: https://github.com/naver/lispe/tree/master/binaries. I also compile a version for WebAssembly: See the WASM directory for an example of how to execute LispE in a browser.

For those who are interested, I also implemented my own terminal editor, which is part of the language itself but can also be used independently: jag.

https://github.com/naver/lispe/wiki/1.2-Jag:-Terminal-Editor-With-Mouse-Support-and-Colour-Highlighting

It is implemented as two C++ classes, that can be derived for your own purpose, even though the internal documentation might be a bit lacking.

The whole project is open-source, and was developped from within Naver corporation, however, there are not strings attached.

I have also implemented a mechanism to implement external libraries that can be loaded within the language itself.