Showing posts with label Factor. Show all posts

Tuesday, November 01, 2011

Code coverage tool

Motivation for writing a coverage tool

In Factor, the old code-is-data adage is true and can be used to our advantage when writing runtime tools. Code is stored in what we call a quotation, which is just a sequence of functions (words) and data literals and can be called at runtime. A function (or word) in Factor is simply a named quotation, and typing that word name causes the quotation to run.

Since quotations in Factor do not terminate early unless an exception is thrown, we know that if each quotation gets run, then we have run all of the code. While this won't tell us if there are logic errors, and while code can still fail on unexpected inputs, at least we can be sure that all of the code is getting exercised and there are no paths that never run. We can use the results of this tool to write better unit tests.

Demo of the coverage tool

As a simple demo, there's a fun algorithm that produces a sequence of numbers from a start number, where the sequence always seems to end at one. The Collatz conjecture states that for any integer greater than zero, this sequence will eventually reach the number one. The algorithm goes as follows: take any counting number (1, 2, ...); if it is odd, multiply by three and add one; if it is even, divide by two; record the sequence of numbers until you reach one. The conjecture has been verified experimentally for every number up to 10^18, but no proof exists that it holds for all possible inputs.

For example, the Collatz sequence for 3 is: { 3 10 5 16 8 4 2 1 }
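If you don't read Factor yet, here's the algorithm sketched in Python for reference (the function names are mine, chosen to mirror the Factor words that follow):

```python
def next_collatz(n):
    # Halve even numbers; send odd numbers to 3n + 1.
    return n // 2 if n % 2 == 0 else 3 * n + 1

def collatz(n):
    # Record every number visited, from n down to 1.
    seq = [n]
    while n > 1:
        n = next_collatz(n)
        seq.append(n)
    return seq
```

For example, `collatz(3)` produces the sequence `[3, 10, 5, 16, 8, 4, 2, 1]`.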

We can come up with a solution pretty easily (partially taken from the Project Euler #14 solution in extra/).

We do the command: "collatz" scaffold-work

Click on the link, then edit the file ~/factor/work/collatz/collatz.factor
USING: combinators.short-circuit kernel make math ;
IN: collatz

: next-collatz ( n -- n )
dup even? [ 2 / ] [ 3 * 1 + ] if ;

: collatz-unsafe ( n -- seq )
[ [ dup 1 > ] [ dup , next-collatz ] while , ] { } make ;

ERROR: invalid-collatz-input n ;

: collatz ( n -- seq )
dup { [ integer? ] [ 1 >= ] } 1&&
[ collatz-unsafe ]
[ invalid-collatz-input ] if ;
We're going to be extra careful here to demonstrate the code coverage tool, so I added error checking to make sure it's an integer greater than or equal to one.

Let's write some unit tests to make sure it works.

Run the command: "collatz" scaffold-tests
Now click on the link to edit the file it created, ~/factor/work/collatz/collatz-tests.factor
USING: tools.test collatz ;
IN: collatz.tests

[ { 1 } ] [ 1 collatz ] unit-test
[ { 2 1 } ] [ 2 collatz ] unit-test
[ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] unit-test
If we run "collatz" test, we see that all tests pass.

Now, let's see how well we did with code coverage.

We run the command:
IN: scratchpad USE: tools.coverage "collatz" test-coverage .
Loading resource:work/collatz/collatz-tests.factor
Unit Test: { [ { 1 } ] [ 1 collatz ] }
Unit Test: { [ { 2 1 } ] [ 2 collatz ] }
Unit Test: { [ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] }
{
{ next-collatz { } }
{ collatz { [ invalid-collatz-input ] } }
{
invalid-collatz-input
{ [ \ invalid-collatz-input boa throw ] }
}
{ collatz-unsafe { } }
}
What this tells us is that the collatz and invalid-collatz-input words contain quotations that never got called. Of course -- we never passed them anything other than valid inputs. How about passing a string or a negative integer?

Now the result looks better:
IN: scratchpad "collatz" test-coverage .
Loading resource:work/collatz/collatz-tests.factor
Unit Test: { [ { 1 } ] [ 1 collatz ] }
Unit Test: { [ { 2 1 } ] [ 2 collatz ] }
Unit Test: { [ { 3 10 5 16 8 4 2 1 } ] [ 3 collatz ] }
Must Fail With: { [ "hello world" collatz ] [ invalid-collatz-input? ] }
Must Fail With: { [ -50 collatz ] [ invalid-collatz-input? ] }
{
{ next-collatz { } }
{ collatz { } }
{ invalid-collatz-input { } }
{ collatz-unsafe { } }
}
We can even get a number for how well tests cover a vocabulary:
"collatz" %coverage .
1.0


The implementation of the coverage tool

Every word in Factor stores its definition in the `def' slot. If we examine this slot, we see that it's a quotation that may contain other quotations. Using the annotation vocabulary, we can add code that executes before and after the code in the quotation. What the coverage tool does is add a container holding a boolean flag to the beginning of each quotation; when the quotation gets run, the flag is set to true. The tool can be turned on and off, independent of the containers being in place, with the coverage-on and coverage-off words.
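The mechanism translates readily to other languages. Here's a rough Python sketch of the same idea -- a wrapper that flips a per-definition flag the first time the body runs (all names are mine, and this simplification tracks one flag per function rather than one per nested quotation):

```python
coverage_flags = {}  # definition name -> was the body ever entered?

def add_coverage(fn):
    # Analogue of Factor's annotation: wrap the definition so entering
    # the body marks it as covered, like the T{ coverage } container.
    coverage_flags[fn.__name__] = False
    def wrapper(*args, **kwargs):
        coverage_flags[fn.__name__] = True
        return fn(*args, **kwargs)
    return wrapper

@add_coverage
def double(x):
    return 2 * x

def uncovered():
    # The report: every instrumented definition that never ran.
    return [name for name, hit in coverage_flags.items() if not hit]
```

Before `double` is called, `uncovered()` reports it; afterward the report is empty.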

Here's a word that turns a Roman numeral into an integer:
\ roman> def>> .
[ >lower [ roman>= ] monotonic-split [ (roman>) ] map-sum ]
After annotating it, the new definition looks like:
\ roman> add-coverage \ roman> def>> .
[
T{ coverage } flag-covered >lower
[ T{ coverage } flag-covered roman>= ] monotonic-split
[ T{ coverage } flag-covered (roman>) ] map-sum
]
The flags are all set to false right now. After turning coverage on and running the word, we see a change:
coverage-on "iii" roman> drop \ roman> def>> .
[
T{ coverage { executed? t } } flag-covered >lower
[ T{ coverage { executed? t } } flag-covered roman>= ]
monotonic-split
[ T{ coverage { executed? t } } flag-covered (roman>) ]
map-sum
]
Notice that all of the coverage containers have been executed. To generate the report, we simply iterate over each word and collect all of the quotations where this flag has not been set -- these quotations never ran.

In writing this article, I realized that each quotation should have a flag on exit as well, in case an exception gets thrown in the middle of executing this quotation and control never reaches the end. Partially-executed quotations will soon be reported by the tool, after I make this fix.

I hope you can use this tool to improve your Factor code. Have fun!

Saturday, September 04, 2010

Filling a file with zeros

In this blog post I'll demonstrate a few ways to fill a file with zeros in Factor. The goal is to write some number of bytes to a file in the least amount of time and using only a small amount of RAM; writing a large file should not fail.

Filling a file with zeros by seeking

The best way of writing a file full of zeros is to seek to one byte from the end of the file, write a zero, and close the file. Here's the code:
: (zero-file) ( n path -- )
binary
[ 1 - seek-absolute seek-output 0 write1 ] with-file-writer ;

ERROR: invalid-file-size n path ;

: zero-file ( n path -- )
{
{ [ over 0 < ] [ invalid-file-size ] }
{ [ over 0 = ] [ nip touch-file ] }
[ (zero-file) ]
} cond ;
The first thing you'll notice about zero-file is that we special-case negative and zero file sizes. Special-casing a zero file length is necessary to avoid seeking to -1, which does everything correctly but throws an error in the process instead of returning normally. Special-casing negative file sizes is important because a negative size is always an error, and even though the operation fails overall, the file-system can otherwise become littered with zero-length files created before the exception is thrown.
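The same seek trick works in any language with positional file I/O. A Python sketch (the function name is mine), including the same special cases:

```python
def zero_file(n, path):
    # Create an n-byte file of zeros by seeking to one byte from the
    # end and writing a single zero byte.
    if n < 0:
        raise ValueError("invalid file size: %d" % n)
    with open(path, "wb") as f:   # "wb" truncates, so n == 0 needs no seek
        if n > 0:
            f.seek(n - 1)         # like seek-absolute
            f.write(b"\x00")      # like write1
```

On file-systems that support sparse files, the unwritten region costs no disk blocks.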

To call the new word:
123,456,789 "/Users/erg/zeros.bin" zero-file
"/Users/erg/zeros.bin" file-info size>> .
123456789

Copying a zero-stream

With Factor's stream protocol, you can write new kinds of streams that, when read from or written to, do whatever you want. I wrote a read-only zero-stream below that returns zeros whenever you read from it. Wrapping a limit-stream around it, you can give the inexhaustible zero-stream an artificial length, so that copying it reaches an end and terminates.
TUPLE: zero-stream ;

C: <zero-stream> zero-stream

M: zero-stream stream-read drop <byte-array> ;
M: zero-stream stream-read1 drop 0 ;
M: zero-stream stream-read-partial stream-read ;
M: zero-stream dispose drop ;

:: zero-file2 ( n path -- )
<zero-stream> n limit-stream
path binary <file-writer> stream-copy ;
The drawback to this approach is that it creates 8kb byte-arrays in memory that it immediately writes to disk.
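The stream-with-an-artificial-length idea can be sketched in Python too (all names are mine; a real implementation would plug into a stream protocol rather than taking a path):

```python
class ZeroStream:
    # An inexhaustible read-only source of zero bytes, a stand-in for
    # the Factor zero-stream tuple.
    def read(self, n):
        return bytes(n)  # bytes(n) is n zero bytes

def copy_stream_n(src, path, n, chunk=8192):
    # Give the endless stream an artificial length, like limit-stream:
    # stop copying after n bytes.
    with open(path, "wb") as out:
        remaining = n
        while remaining > 0:
            block = src.read(min(chunk, remaining))
            out.write(block)
            remaining -= len(block)
```

As in the Factor version, each chunk is materialized in memory before being written out.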

Setting the contents of a file directly

Using the set-file-contents word, you can just assign a file's contents to be a sequence. However, this sequence has to fit into memory, so this solution is not as good for our use case.
:: zero-file3 ( n path -- )
n <byte-array> path binary set-file-contents ;

Bonus: writing random data to a file

The canonical way of copying random data to a file in Unix systems is to use the dd tool to read from /dev/urandom and write to a file. But what about on Windows, where there is no /dev/urandom? We can come up with a cross-platform solution that uses method number two from above, but instead of a zero-stream, we have a random-stream. But then what about efficiency? Well, it turns out that Factor's Mersenne Twister implementation generates random numbers faster than /dev/urandom on my Macbook -- writing a 100MB file from /dev/urandom is about twice as slow as a Factor-only solution. So not only is the Factor solution cross-platform, it's also more efficient.
TUPLE: random-stream ;

C: <random-stream> random-stream

M: random-stream stream-read drop random-bytes ;
M: random-stream stream-read1 drop 256 random ;
M: random-stream stream-read-partial stream-read ;
M: random-stream dispose drop ;

:: stream-copy-n ( from to n -- )
from n limit-stream to stream-copy ;

:: random-file ( n path -- )
<random-stream>
path binary <file-writer> n stream-copy-n ;

! Read from /dev/urandom
:: random-file-urandom ( n path -- )
[
<random-stream> path
binary <file-writer> n stream-copy-n
] with-system-random ;
Here are the results:
$ dd if=/dev/urandom of=here.bin bs=100000000 count=1
1+0 records in
1+0 records out
100000000 bytes transferred in 17.384370 secs (5752294 bytes/sec)

100,000,000 "there.bin" random-file
Running time: 5.623136439 seconds
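For comparison, the chunked approach is easy to express in Python as well; CPython's random module also happens to use the Mersenne Twister. This sketch is mine, and random.Random.randbytes requires Python 3.9+:

```python
import random

def random_file(n, path, chunk=8192):
    # Stream pseudo-random bytes to a file in fixed-size chunks, so the
    # whole payload never sits in memory at once.
    rng = random.Random()  # Mersenne Twister
    with open(path, "wb") as out:
        remaining = n
        while remaining > 0:
            block = min(chunk, remaining)
            out.write(rng.randbytes(block))  # Python 3.9+
            remaining -= block
```

No claim about relative speed is made here; the timings above are for the Factor version.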

Conclusion

Since Factor has high-level libraries that wrap the low-level libc and system calls used for nonblocking i/o, we don't have to deal with platform-specific quirks at this level of abstraction like handling EINTR, error codes, or resource cleanup at the operating system level. When calls get interrupted, when errno is set to EINTR after the call returns, the i/o operation is simply tried again behind the scenes, and only serious i/o errors get thrown. There are many options for correct resource cleanup should an error occur, but the error handling code we used here is incorporated into the stream-copy and with-file-writer words -- resources are cleaned up regardless of what happens. We also demonstrated that a Factor word is preferable to a shell script or the dd command for making files full of random data because it's more portable and faster, and that custom streams are easy to define.

Finally, there's actually a faster way to create huge files full of zeros, and that's by using sparse files. Sparse files can start off using virtually no file-system blocks, but can appear to be as large as you wish, and only start to consume more blocks as parts of the file are written. However, support for this is file-system dependent and, overall, sparse files are of questionable use. On Unix file-systems that support sparse files, the first method above should automatically create them with no extra work. Note that on MacOSX, sparse file-systems are supported but not enabled by default. On Windows, however, you have to make a call to DeviceIoControl. If someone wants to make a small contribution to the Factor project, they are welcome to implement creation of sparse files for Windows.

Edit: Thanks to one of the commenters, I rediscovered that there's a Unix syscall truncate that creates zero-length files in constant time on my Mac. This is indeed the best solution for making files full of zeros, and although unportable, a Factor library would have no problem using a hook on the OS variable to call truncate on Unix and another method on Windows.

Monday, August 17, 2009

Twin Cities Lisp Factor Presentation

At the next TC Lisp meeting, I'll be explaining as much as I can about Factor in a one-hour presentation. Arrive before 6pm for the happy hour beer prices. All are welcome!

Place: Common Roots Cafe
Date: August 18, 2009
Time: 6pm


View Larger Map

Friday, May 29, 2009

A new library to manage client connections, and a proof of concept chat server in Factor

Some thirty seconds into play-testing Joe Groff's terrain demo game, I realized there were no NPCs, no double-barreled shotguns or hand-cannons, and most disturbingly, no other players. Sure you can fly around, but you can't gib your friends! So I decided to do something about it and wrote a managed server vocabulary with a generic protocol that you can extend for writing your own servers. Hey, I'm not much of an artist -- the guns will have to wait.

Managed-server infrastructure

The HTTP server already uses a library called io.servers.connections which implements a threaded-server with SSL support in 164 lines of code. A threaded-server listens for a set number of clients on an open port and handles each one individually; no client needs to know about any other. To get the code so concise, it uses libraries for concurrency, logging, sockets, and SSL that are themselves reused elsewhere.
Features of the threaded-server vocabulary:
  • the best nonblocking I/O code on every platform (completion ports, kqueue, epoll, not select())
  • connection/error logging, log rotation
  • correct error handling and resource cleanup
  • SSL support on Unix platforms
  • IPv4 and IPv6 support by default
For an HTTP or FTP server, handling each connection individually is what you want. However, for games or chat servers, you really want your users to interact. Building on top of this threaded-server, I made a new tuple called a managed-server that tracks a list of connecting and disconnecting clients. You still get all of the features threaded-server implements, but now there's a new client handler that maintains a list of connected clients keyed by a username, plus utility words to send data to all clients.
You can also use this code to make custom binary protocols, and I'm most of the way through implementing an SRP6 library to allow secure logins over unencrypted connections after you create an account through an SSL connection. UDP support for first-person shooters and faster-paced games will also be added when someone needs it.

The implementation of managed-server

A managed-server inherits from the threaded-server class and adds a new slot called clients to store connections. Each connection's state -- the input/output streams, the local/remote socket addresses, username, a slot for passing quit messages, and a quit flag -- is wrapped inside a managed-client tuple and stored into the clients hashtable with the username as the key. In this way, it's easy to look up another client's stream and send it a message:
"wally" "hi wally!" send-client
You can also send a message to all connected clients with send-everyone, or to all but yourself:
"This one goes out to all the ladies." send-everyone-else
Here's what the tuple classes code looks like:
TUPLE: managed-server < threaded-server clients ;

TUPLE: managed-client
input-stream output-stream local-address remote-address
username object quit? ;
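The bookkeeping is easy to picture outside of Factor, too. A rough Python sketch of the clients table and the send words (the names mirror the Factor ones; the stream objects are stand-ins for real connections):

```python
clients = {}  # username -> a writable stream for that client

def send_client(username, message):
    # Look up one client's output stream by username and send it a line.
    clients[username].write(message + "\n")

def send_everyone(message):
    for username in clients:
        send_client(username, message)

def send_everyone_else(sender, message):
    # Everyone but the sending client.
    for username in clients:
        if username != sender:
            send_client(username, message)
```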

The managed-server protocol

A managed-server has some generics in place to guide you in creating your own servers. The first two generics are required, but the others default to no-ops unless you want to handle these events. Of course, the clients are still tracked no matter what your method does on the client-join or client-disconnect generics. The default method for handle-already-logged-in throws an error to prevent a new client from taking over the other client's session or logging in multiple times. You can override this behavior with your own perversions.
Here's the protocol:

HOOK: handle-login threaded-server ( -- username )
HOOK: handle-managed-client* managed-server ( -- )
HOOK: handle-already-logged-in managed-server ( -- )
HOOK: handle-client-join managed-server ( -- )
HOOK: handle-client-disconnect managed-server ( -- )

The implementation of a chat server using managed-server

Eventually someone will use managed-server for the networking code in a game, but until then I've implemented a simple chat server. Writing the chat server was fun and helped me to iron out a couple of bugs, which I wrote about below.

A walkthrough of the chat server protocol


The chat server code begins by inheriting from the managed-server tuple:
TUPLE: chat-server < managed-server ;
From here you go about implementing required parts of the protocol, handle-login and handle-managed-client*, so let's start there.
M: chat-server handle-login
"Username: " write flush
readln ;
The current input/output streams are bound to the client connection, so calling write will send them the login prompt. To read back the username, readln reads until a newline is sent. If you were to connect with telnet at this point, you would see the prompt and could send back a username. Then the server would kick you off because there's no implementation of handle-managed-client*.
M: chat-server handle-managed-client*
readln dup f = [ t client (>>quit?) ] when
[
"/" ?head [ handle-command ] [ handle-chat ] if
] unless-empty ;
This word handles every other message the client sends apart from the login code. Calling readln reads the client's message one line at a time and returns false when the stream closes. The quit flag is set in that case and will be explained later; for now, suffice it to say that you're quitting if readln returns false. Next, the message is checked for content -- both false and an empty string can be safely ignored here by the unless-empty combinator. Inside the quotation, the leading slash, if any, is stripped from the input, and the boolean returned by ?head decides whether the message was intended for the server or the chat room.
: handle-command ( string -- )
dup " " split1 swap >lower commands get at* [
call( string -- ) drop
] [
2drop "Unknown command: " prepend print flush
] if ;
Commands sent to the server are normalized by converting to lower case and then looked up in the commands table. If you send a successful command such as /who or /nick then it gets executed; if not you get the generic "Unknown command" error.
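The dispatch scheme is just a hashtable of named handlers. A Python sketch of the same idea (names are mine, and handlers here return their reply instead of printing, for easy testing):

```python
commands = {}  # command name -> handler taking the argument string

def add_command(handler, name):
    # Register a slash command under its (lower-case) name.
    commands[name] = handler

def handle_command(line):
    # "who rest of line" -> name "who", argument "rest of line";
    # names are case-insensitive, mirroring the >lower normalization.
    name, _, arg = line.partition(" ")
    handler = commands.get(name.lower())
    if handler is None:
        return "Unknown command: " + name
    return handler(arg)

add_command(lambda arg: "pong", "ping")
```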
: handle-chat ( string -- )
[
[ username ": " ] dip
] "" append-outputs-as send-everyone ;
Sending a message to the chat room is the alternative to server commands. I'm using append-outputs-as here to append together a bunch of strings, although I could easily have used 3append instead. I left this in because it's easier to change the look of the chat if you don't have to keep track of how many strings you're appending and you just let the compiler infer. Please take note: smart combinators in Factor are analogous to applying a function to a list of parameters in Lisp in that you don't need to know the number of parameters. The following two snippets will demonstrate what I mean:
(+ 1 2 10 1200)
[ 1 2 10 1200 ] sum-outputs
That's pretty much the essence of the chat server since everything else was just added for fun.

Fun with default encodings

Default encodings are terrible! Of course, you can change the encoding of a stream whenever you want, but the encoding for threaded-servers defaulted to ASCII until I changed it this evening. When I made my chat server yesterday, I forgot to set the encoding to what I wanted -- UTF8. Sending a character above 127 caused the server to throw an exception since ASCII is only 7 bits wide, and the sender would get disconnected. The FTP server I wrote started out with this bug as well, before I changed it to latin1. But now that threaded-server takes an encoding on the stack, this bug can never happen again.
So what's wrong with picking a different default encoding, maybe UTF8? Well, if I'm making a binary server, the UTF8 decoder will replace bad bit sequences with replacement characters -- another latent bug! What about binary as the default encoding, i.e. no encoding? Binary is the best option for a default, but then people who need to use UTF8 or latin1 might not know that the stream protocol supports encodings at all, and will end up doing a lot of work by hand which should be handled by the stream implementation. So not having a default encoding 1) prevents latent bugs and 2) forces the programmer to think about what they really want in each situation -- surely a good idea.

Quitting gracefully with quit flag

My first thought was just to throw an exception when I wanted to disconnect a client and cause the error handler to clean up the resources. Hopefully it's common knowledge that control flow implemented with exceptions is inefficient and bad design, in the general case. Maybe just this once? Nope, in my case the logging framework logs all exceptions, so the majority of the logs would be filled up with useless disconnect error messages. Clearly something better was needed -- the quit flag. Managed clients have a quit flag slot that is checked every time the server processes some data. Clients can quit gracefully by setting this flag to true and returning control back to the stream reader loop, and quits caused by exceptions are logged and worthy of further investigation.

Live coding on the server

After the chat server was up and running, I could add features without restarting. One of the first requested features was "/help", which required a redesign of how slash commands were handled. Instead of a case statement, now there's a word add-command that takes the implementation, the documentation, and the name of the command you want to add. Adding a command stores the code and the docs in symbols holding hashtables, indexed by the name of the command.
SYMBOL: commands
commands [ H{ } clone ] initialize

SYMBOL: chat-docs
chat-docs [ H{ } clone ] initialize

:: add-command ( quot docs key -- )
quot key commands get set-at
docs key chat-docs get set-at ;
I added a time command for fun:
[ drop gmt timestamp>rfc822 print flush ]
<" Syntax: /time
Returns the current GMT time."> "time" add-command
Someone else wanted a "/who" command -- easy enough.
[ drop clients keys [ "``" "''" surround ] map ", " join print flush ]
<" Syntax: /who
Shows the list of connected users.">
"who" add-command
The last feature I implemented was a way to change your nickname without reconnecting:
: handle-nick ( string -- )
[
"nick" usage
] [
dup clients key? [
username-taken-string print flush
] [
[ username swap warn-name-changed ]
[ username clients rename-at ]
[ client (>>username) ] tri
] if
] if-empty ;

[ handle-nick ]
<" Syntax: /nick nickname
Changes your nickname.">
"nick" add-command
Changing your nickname is straightforward but takes the most steps of all the commands I implemented. Try to understand the code -- "string" in the stack effect is the requested nickname and the clients word returns a hashtable of connections, indexed by nicknames. If the user didn't supply a nickname, remind them of the syntax for using /nick. If they did supply a nickname, check if it's in use and, if so, refuse to change their name. Otherwise, the nick change succeeded, so tell all the users of the nickname change, apply the nick change in the clients hashtable, and set the new nickname in the client.
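The three cases above -- empty request, name taken, successful rename -- boil down to a small amount of logic. A Python sketch (names and messages are mine, and the connection objects are stand-ins):

```python
clients = {"erg": "conn-a", "wally": "conn-b"}  # nickname -> connection

def handle_nick(current, requested):
    # Empty request: remind the user of the syntax.
    if not requested:
        return "Syntax: /nick nickname"
    # Requested name already in use: refuse the change.
    if requested in clients:
        return "That username is taken."
    # Otherwise re-key the connection under the new name,
    # like rename-at on the clients hashtable.
    clients[requested] = clients.pop(current)
    return current + " is now known as " + requested
```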

Chat server running live

You can try out the chat server by downloading Factor and running this command:
USING: managed-server.chat io.servers.connection ; 8889 <chat-server> start-server
Or you can connect to my running chat server:
telnet trifocus.net 8889
It's just a demo and I didn't implement any limits on your nickname or what you can send, though it would be easy enough to do so. Have fun, and please let me know if you can find any bugs.

Sunday, May 24, 2009

An ID3 Parser in Factor

A new contributor, Tim Wawrzynczak, wrote an ID3 parser as his first Factor program a couple of months back. The code looked pretty good, so it was easy to refactor the way ID3v1 and ID3v2 tags are represented and to add some utility words for managing directories of MP3s. The library still needs some work, but now it can take a directory tree and recursively parse all of the ID3 headers. I also realized that Factor's mmap implementation always tried to map a file read/write, so for ID3 parsing I added a read-only mmap. The finished code is here.

ID3v1 format


ID3 tags come in two flavors -- the old ID3v1 format and the newer ID3v2 one. ID3v1 has a fixed length of 128 bytes and, if present, occupies the last 128 bytes of an MP3 file. The tag begins with "TAG", followed by the song title (30 bytes), artist (30 bytes), album (30 bytes), year (4 bytes), comment (30 bytes), and genre (1 byte). The problem with this approach is that the length of every field is limited, as is what data you can represent.

To implement the parser, we use the Memory-mapped files vocabulary to open the file. You can treat the file as an array of bytes that obeys the sequence protocol. Checking if a file has ID3v1 headers becomes:
: id3v1? ( mmap -- ? )
{ [ length 128 >= ] [ 128 tail-slice* "TAG" head? ] } 1&& ;
The logic here is straightforward: the sequence (mmapped file) is checked to make sure it's long enough to contain a header, and then the start of the last 128 bytes is checked for the magic bytes "TAG". Since this is a short-circuit combinator, if the length test fails then the computation ends early. In this way, you can string together long computations that use short-circuit and and or. Factor's usual and/or words take two already-evaluated objects, so the short-circuit behavior is implemented as a library of macros instead.
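For comparison, here's the same check, plus the fixed-offset field layout described above, sketched in Python (function names are mine):

```python
def is_id3v1(data):
    # The file must be at least 128 bytes long, and the last 128 bytes
    # must start with the magic "TAG".
    return len(data) >= 128 and data[-128:].startswith(b"TAG")

def parse_id3v1(data):
    tag = data[-128:]
    def field(a, b):
        # Fields are fixed-width, padded with NULs or spaces.
        return tag[a:b].rstrip(b"\x00 ").decode("latin-1")
    return {
        "title":   field(3, 33),
        "artist":  field(33, 63),
        "album":   field(63, 93),
        "year":    field(93, 97),
        "comment": field(97, 127),
        "genre":   tag[127],
    }
```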

ID3v2 format


The newer standard, ID3v2, has more room for metadata and can store anything up to 256 MB. Its use is indicated when the MP3 begins with "ID3" or when the bytes "3DI" are present 10 bytes from the end of the file or 10 bytes before the ID3v1 header. For parsing the header, the first two bytes are the version, then a flag byte, and finally four bytes for the size of the header. The size bytes are encoded as a synchsafe number, which means that the top bit is discarded and the lower seven bits are the data. In our case, 28 bits of data are the length, which is why the maximum length is 256MB.
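The synchsafe decoding is easy to sketch in Python (the function name is mine): each byte contributes seven bits, high byte first.

```python
def synchsafe(b):
    # Combine four 7-bit groups; the top bit of each byte is discarded.
    return (b[0] << 21) | (b[1] << 14) | (b[2] << 7) | b[3]
```

The largest representable value is 2^28 - 1, which is where the 256 MB limit comes from.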

The actual data we want to parse is stored in frames. Each frame has a tag, which is four ascii or utf16 bytes, followed by another 4-byte synchsafe integer for the length, two bytes of flag data, and the frame data. There are many different types of frames, but some of the more common ones are tagged TALB (album title), TIT2 (title), TRCK (track), COMM (comment), and TPE1 (performer). All of the frames are added to a hashtable keyed by the tag. Looking up the title frame becomes as easy as "TIT2" find-id3-frame on an ID3 tuple.

Future work


The ID3v1 "TAG+" format is not supported yet and apparently it's hardly ever used. ID3v2 tags are only looked for at the beginning of an MP3 but may be present at the end too as of version four. Most of the ID3 tags are not parsed into meaningful data besides the raw bytes. Lastly, writing out ID3 tags is not implemented yet and would be a good first program to write in Factor. Volunteers?

Wednesday, September 03, 2008

Vocabulary and documentation tool: tools.scaffold

The Factor project is on a documentation binge until all of the core and basis vocabularies are documented. I noticed, as anyone tasked with writing a bunch of documentation would, that quite a bit of it could be automated. To begin writing docs for a vocabulary that already exists, you can just run the tool on it and edit the generated help file.

I actually ended up writing two tools -- one for the docs, and another to create new vocabularies with a simple command.

Scaffold tool to create new vocabularies


First, an aside. Factor's module system is based on a directory layout, with several vocabulary roots which are stored in the vocab-roots symbol.
( scratchpad ) USE: vocabs.loader vocab-roots get .
V{ "resource:core" "resource:basis" "resource:extra" "resource:work" }
Knowing this, you can now create a new vocabulary and boilerplate empty Factor files from inside of Factor and click the links to edit them in your favorite editor.
( scratchpad ) "extra" "alchemy" scaffold-vocab
Creating scaffolding for P" resource:extra/alchemy/alchemy.factor"
Creating scaffolding for P" resource:extra/alchemy/alchemy-tests.factor"
Creating scaffolding for P" resource:extra/alchemy/authors.txt"

( scratchpad ) "extra" "alchemy" scaffold-help
Creating scaffolding for P" resource:extra/alchemy/alchemy-docs.factor"
The scaffold tool is programmed not to overwrite files if they already exist.

Scaffold tool for documentation



Say I have a Factor word that turns lead into gold:
: lead>gold ( lead -- gold ) ... ;

The implementation of this word is left as an exercise to the reader. We don't have to understand how it works to document what it does.

Without an automation tool, you would begin by creating a new file (extra/alchemy/alchemy-docs.factor) in the same directory as your code file (extra/alchemy/alchemy.factor). What follows is the standard boilerplate for documentation. Notice that documentation goes in the same vocabulary as your actual code, but comes from a different file on disk.

! Copyright (C) 2008 Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: help.markup help.syntax ;
IN: alchemy
Every doc file has such a header based on the date, who you are, and the vocabulary name. With those three pieces of information, the above can be auto-generated. I added a new symbol developer-name in tools.scaffold that should be set to your name when writing docs.

Now let's look at the documentation for lead>gold:
HELP: lead>gold
{ $values { "lead" "boring lead" } { "gold" "shiny gold" } }
{ $description "Turns lead into gold." } ;


Notice that we already know the word name, the value names in the $values array (taken from the stack effect), and the markup structure.

Here is the generated scaffold:
HELP: lead>gold
{ $values { "lead" null } { "gold" null } }
{ $description } ;
This is much less wrist punishment than typing it manually. The scaffold tool can even fill in the Factor types in the documentation, given correct code and standard value names like "quot" and "hashtable". Types that are not known are given the null class. Documentation correctness tools can later search for null objects and report them as errors, since no words should operate on this type.

I'll finish by including a snippet of the output of running the scaffold help generator on itself.
! Copyright (C) 2008 Doug Coleman.
! See http://factorcode.org/license.txt for BSD license.
USING: arrays help.markup help.syntax io.streams.string kernel strings words ;
IN: tools.scaffold

HELP: check-root
{ $values
{ "string" string }
{ "string" string } }
{ $description } ;

HELP: check-scaffold
{ $values
{ "vocab-root" "a vocabulary root string" } { "string" string }
{ "vocab-root" "a vocabulary root string" } { "string" string } }
{ $description } ;

HELP: check-vocab-name
{ $values
{ "string" string }
{ "string" string } }
{ $description } ;

HELP: developer-name
{ $description } ;

HELP: help.
{ $values
{ "word" word } }
{ $description } ;

HELP: lookup-type
{ $values
{ "string" string }
{ "object/string" null } { "?" "a boolean" } }
{ $description } ;

HELP: main-file-string
{ $values
{ "vocab" "a vocabulary specifier" }
{ "string" string } }
{ $description } ;

HELP: not-a-vocab-root
{ $values
{ "string" string } }
{ $description } ;

HELP: not-a-vocab-root?
{ $values
{ "object" object }
{ "?" "a boolean" } }
{ $description } ;

HELP: root?
{ $values
{ "string" string }
{ "?" "a boolean" } }
{ $description } ;

HELP: scaffold-authors
{ $values
{ "path" "a pathname string" } }
{ $description } ;

HELP: scaffold-copyright
{ $description } ;

HELP: tests-file-string
{ $values
{ "vocab" "a vocabulary specifier" }
{ "string" string } }
{ $description } ;

HELP: using
{ $description } ;

HELP: vocab>scaffold-path
{ $values
{ "vocab-root" "a vocabulary root string" } { "string" string }
{ "path" "a pathname string" } }
{ $description } ;

ARTICLE: "tools.scaffold" "tools.scaffold"
;

ABOUT: "tools.scaffold"
I hope this new vocabulary makes documentation a lot less tedious and error-prone. Now to go write some docs...

Monday, April 28, 2008

Word renaming (part 2)

Here are the word names that have changed:
  • find* -> find-from
  • find-last* -> find-last-from
  • index* -> index-from
  • last-index* -> last-index-from
  • subset -> filter
New shorthand words:
  • 1 tail -> rest
  • 1 tail-slice -> rest-slice
  • swap compose -> prepose
Changes to existing word behavior:
  • reverse stack effect of assoc-diff, diff
  • before? after? before=? after=? are now generic
  • min, max can compare more objects than before, such as timestamps
  • between? can compare more objects
  • <=> returns symbols
There are several motivations at work here. One is that a word named foo* is a variant of foo, but beyond that the * gives no hint of what the word actually does differently. We're trying to move away from word names with stars toward something more meaningful.

Code with common patterns that come up a lot, like 1 tail and swap compose, is more clearly understood if these patterns are given a single name.

Factor's subset is not equivalent to the mathematical definition of a subset, so it was renamed to filter to avoid confusion. Along the same lines, diff and assoc-diff are now more mathematically intuitive; you can think of diff as set subtraction: seq1 seq2 diff is like seq1 - seq2.
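For readers coming from other languages, the new diff semantics are easy to mirror. Here is a hypothetical Python sketch (the function name echoes the Factor word; it is not Factor's implementation):

```python
def diff(seq1, seq2):
    """Elements of seq1 that do not appear in seq2, preserving order,
    i.e. the 'seq1 - seq2' reading of Factor's diff."""
    exclude = set(seq2)
    return [x for x in seq1 if x not in exclude]

print(diff([1, 2, 3, 4], [2, 4]))  # [1, 3]
```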

Finally, the "UFO operator" <=> now returns symbols +lt+ +eq+ +gt+ instead of negative, zero, and positive numbers. The before? and after? words can compare anything that defines a method on this operator. Since between?, min, and max
are defined in terms of these comparison words, they also work on more objects.

Please let me know if you have any more suggestions for words that have awkward argument orders or imprecise names, or if you can suggest alternate names for words with stars in their names.

Sunday, April 13, 2008

Adding a new primitive

I added two primitives to the Factor VM to allow setting and unsetting of environment variables. It's not that hard to do, but you have to edit several C files in the VM and a couple of .factor files in the core. Really they should not be primitives, so eventually they will be moved out of the VM and implemented in the core.

The primitives that I added are defined as follows:
IN: system
PRIMITIVE: set-os-env ( value key -- )
PRIMITIVE: unset-os-env ( key -- )

Adding a primitive to vm/

Since Factor's datatypes are not the same as C's datatypes, and because of the garbage collector, there are C functions for accessing and manipulating Factor objects. The data conversion functions are not documented yet, so here's a sampling of a few of them:
  • unbox_u16_string() - pop a Factor string off the datastack and return it as an F_CHAR*
  • from_u16_string() - convert a C string to a Factor object
  • REGISTER_C_STRING() - register a C string with Factor's garbage collector
  • UNREGISTER_C_STRING() - unregister a registered C string
  • dpush() - push an object onto the datastack
  • dpop() - pop an object off of the datastack
Registering a C string with the garbage collector is required when VM code calls code that may trigger a garbage collection (gc).  Any call to Factor from the VM might trigger a gc, and if that happened the object could be moved, thus invalidating your C pointer.  When a pointer is unregistered, it's popped from a gc stack with the corrected pointer value.

Here is the call to set-os-env:
DEFINE_PRIMITIVE(set_os_env)
{
    F_CHAR *key = unbox_u16_string();
    REGISTER_C_STRING(key);
    F_CHAR *value = unbox_u16_string();
    UNREGISTER_C_STRING(key);
    if(!SetEnvironmentVariable(key, value))
        general_error(ERROR_IO, tag_object(get_error_message()), F, NULL);
}
The function is defined with a macro DEFINE_PRIMITIVE that takes only the function name.  A corresponding DECLARE_PRIMITIVE goes in run.h as your function declaration.  Not all primitives use these C preprocessor macros, for instance bignums don't because it doesn't improve the performance.  Parameters to your primitive are popped or unboxed off the data stack, so a primitive's declaration expands to:
INLINE void primitive_set_os_env_impl(void);

F_FASTCALL void primitive_set_os_env(CELL word, F_STACK_FRAME *callstack_top)
{
    save_callstack_top(callstack_top);
    primitive_set_os_env_impl();
}
F_FASTCALL is a wrapper around FASTCALL, which on x86 will pass the first two arguments in registers as an optimization.  Note that while it declares that it takes no arguments (void), most primitives will do something to the data stack.

Since unbox_u16_string() allocates memory for the Factor object, it could trigger a gc, so the key is registered as a string.  You can also register values using REGISTER_ROOT for cells, REGISTER_BIGNUM for bignums, and REGISTER_UNTAGGED for arrays, words, and other Factor object pointers for which the type is known.  The key string can be unregistered immediately after unboxing the next stack value, since the rest of the function will not cause a gc.  If the Win32 call fails, there's a function general_error() that throws an exception.  In this case, it's an ERROR_IO that calls a helper function to return the Windows error message.

Now that the function is written, you have to add it to the list of primitives in primitives.c.  The important thing is that this list remains in the same order as the list in core/ which you will edit in the next section.  Also, add a prototype to the run.h file.

Adding a primitive to core/

Everything in Factor compiles down to primitives.  Because they are by definition "primitive", the compiler cannot infer the stack effect and argument types. To make a primitive's stack effect "known", edit core/inference/known-words/known-words.factor:
\ set-os-env { string string } { } <effect> set-primitive-effect
The next step is to put your word in the file core/bootstrap/primitives.factor in the same order as in vm/primitives.c.

Sometime in the future there might be a PRIMITIVE: word that will reduce the number of different places to edit to add a primitive. If it used Factor's FFI, you could add a new primitive without even having to bootstrap again.

Wednesday, March 12, 2008

Singletons in Factor

A singleton is a design pattern for a class that is instantiated exactly once. In other languages, singletons often carry global data, but why not just use a global variable to store global state instead?

In Factor, a singleton is simply a predicate class whose predicate tests if the class is identical to itself. Singletons are defined like this:
SINGLETON: factor
That's it! No messy boilerplate to copy and paste, no subtle reasons why your singleton might be wrong in some corner case. Using the singleton, we can replace some of the less elegant parts of the Factor core code and implement things in a simpler way.

For instance, until now the core words os and cpu have returned strings based on how the Factor binary is compiled. Soon, these strings will be parsed into whichever singleton they represent, allowing for generic dispatch. I wanted to have a cross-platform library for finding out basic hardware information about your computer, like how many cpus/cores, what speed, and how much RAM is in a machine. To do this without singletons, I had to redefine information already available as strings (the cpu and os words) as symbols. With singletons, this duplication can be removed. The same applies to the compiler code and loading libraries.

Here is the way the core supported operating systems can be defined using singletons.
SINGLETON: winnt
SINGLETON: wince
SINGLETON: macosx
SINGLETON: linux
UNION: windows winnt wince ;
Now we can dispatch on these to find the number of cores:
HOOK: #cpus os ( -- n )
M: macosx #cpus { 6 3 } sysctl-query-uint ;
M: winnt #cpus system-info SYSTEM_INFO-dwNumberOfProcessors ;
For loading code, the current idiom is this:
<< "alut" {
    { [ win32? ] [ "alut.dll" ] }
    { [ macosx? ] [ "/System/Library/Frameworks/OpenAL.framework/OpenAL" ] }
    { [ unix? ] [ "libalut.so" ] }
} cond "cdecl" add-library >>
Using singletons, we can shorten this to:
"alut" {
    { win32 "alut.dll" }
    { macosx "/System/Library/Frameworks/OpenAL.framework/OpenAL" }
    { unix "libalut.so" }
} add-library
The library is assumed to be 'cdecl', but if it were 'stdcall' you could specify this by adding a "stdcall" string after the library name, making a triple instead of a pair. The amount of boilerplate is reduced, so the programmer can be more productive and write fewer bugs.

The implementation of singleton is:
: define-singleton-class ( class -- )
    \ word swap
    dup [ eq? ] curry define-predicate-class ;
This generates code that looks like:
PREDICATE: word winnt \ winnt eq? ;
It makes a predicate class with a superclass of 'word' that you can dispatch on, and of which only a single instance exists. Why are singletons so hard to define in some other languages?
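As a rough analog in another language, you can get the same "unique dispatch token" effect in Python by using class objects themselves as values and comparing with identity. This is only a sketch of the idea, not how Factor implements it, and all the names are mine:

```python
# Each class object is itself the unique value, like a Factor singleton.
class winnt: pass
class wince: pass
class macosx: pass
class linux: pass

windows = (winnt, wince)  # rough stand-in for UNION: windows winnt wince ;

def cpu_count_source(os):
    # Dispatch on the token by identity, analogous to Factor's M: methods.
    if os is macosx:
        return "sysctl { 6 3 } query"
    if os in windows:
        return "SYSTEM_INFO dwNumberOfProcessors"
    raise NotImplementedError(os)

print(cpu_count_source(macosx))
```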

Thursday, March 06, 2008

Google Summer of Code 2008 Project Ideas

The Factor project is applying for Summer of Code. Here are our project ideas.

Applications

  • Write a structure editor
  • Write a text editor
  • Write an IRC client
  • Write a package manager for Factor
  • Write a file manager

Enterprise Factor

  • Update the bindings to MySQL, Oracle, ODBC for the latest DB framework in extra/db
  • Add bindings for DB2, Informix, SQL Server, Sybase, etc
  • Write a MySQL binding in Factor instead of calling the C library to permit nonblocking I/O operation and avoid GPL licensing issues
  • Write high-level wrapper for OpenSSL binding which implements encrypted Factor streams
  • Write a SOAP stack
  • Improve the continuation-based web framework with ideas from Seaside

Embedded Languages

  • Improve the Prolog implementation in extra/prolog
  • Revive extra/lisp
  • Write an assembler to a microcontroller and cross-compile applications from within Factor
  • Write an infix math DSL
  • Write a 'language X' to Factor compiler, where X is C, Ruby, Python, Javascript, Elisp, etc

Miscellaneous Libraries

  • Write a binding and high level interface to zlib or implement gzip compression in Factor
  • Finish the tar library
  • Finish the packet sniffer libraries
  • Implement the math algorithm of your choice for extra/math
  • Implement the data structures described in Purely Functional Data Structures
  • Write a COM interface
  • Improve the JNI library
  • Write a JNI-like bridge for Android
  • Write a wrapper for Windows Mobile libraries to make calls and send SMSs
  • Integrate the zoneinfo timezone database with Factor's calendar library
  • Add more calendars to the library, such as Persian, Hebrew, Islamic, Chinese, Hindu, Buddhist

Factor VM

  • Write a GC profiler
  • Make GC memory areas shrink after garbage collection
  • Implement dtrace probes
  • Make continuations serializable

User Interface

  • Write a binding to an image manipulation library
  • Finish bitmap library
  • Improve the look of Factor's UI gadgets
  • Add drag and drop support
  • Add any of the following gadgets: tabs, syntax highlighting text editor using the xmode library, combo boxes, tables, trees, spinners
  • Implement native font rendering on Windows

Thursday, February 14, 2008

Disassembler Vocabulary "ported" to Windows

Slava wrote a vocabulary that sends gdb a process id and a range of addresses to disassemble, in order to streamline the process of writing compiler optimizations. The original code only worked on Unix, requiring a call to unix:getpid, which is a Unix system call. The Windows equivalent is GetCurrentProcessId. Since we need to call the Unix version on Mac OS X and Linux, and the Windows API call on Windows XP, we use a Factor design pattern -- the HOOK:.

The HOOK:



The word in question is called make-disassemble-cmd. Behold its majesty:

QUALIFIED: unix

M: pair make-disassemble-cmd
    in-file [
        "attach " write
        unix:getpid number>string print
        "disassemble " write
        [ number>string write bl ] each
    ] with-file-out ;


You can see where it's calling unix:getpid. Knowing the Windows API call from above, it's easy to write a version that works on Windows:

M: pair make-disassemble-cmd
    in-file [
        "attach " write
        GetCurrentProcessId number>string print
        "disassemble " write
        [ number>string write bl ] each
    ] with-file-out ;


Obviously, this is unacceptable, because now it doesn't work on Unix! If we rename the make-disassemble-cmd word for the new platform, then there are still two copies of the exact same word, and you'll be loading them both on platforms where they shouldn't be loaded. We really just want to rename the one word that changed, so...

Let's make a HOOK:

! in io.launcher
HOOK: current-process-handle io-backend ( -- handle )

! in io.unix.launcher
M: unix-io current-process-handle ( -- handle ) getpid ;

! in io.windows.launcher
M: windows-io current-process-handle ( -- handle ) GetCurrentProcessId ;

! in tools.disassembler
M: pair make-disassemble-cmd
    in-file [
        "attach " write
        current-process-handle number>string print
        "disassemble " write
        [ number>string write bl ] each
    ] with-file-out ;


Now, there is just one word that will do the work on both platforms. The relevant code is only loaded into the image on the correct platform, and the problem is solved without renaming lots of words, without #ifdefs, and without copy/pasting.

Tuesday, February 12, 2008

Text Editor Integration and Directory Traversal

To edit a word in Factor you pass the edit word a defspec, like \ + edit or { float + } edit, or you press CTRL-SHIFT-E on a word. If an editor path is already set, the word definition should pop up in your editor at the line of the definition. However, if you haven't configured your text editor, a restart will be thrown and you can select an editor from a list. If you selected gvim, it previously used a braindead algorithm that made a list of all paths under \Program Files and then traversed that list looking for the gvim.exe binary. This caused it to hang while doing tons of disk IO.

Now it's better in two ways: first, it only looks in \Program Files\vim\ (oops), and second, you can search with find-file-breadth or find-file-depth. The implementation is pretty simple: breadth-first search uses a dlist, pushes newly found directories onto the back of the list, and traverses the list in order. Depth-first also pushes newly found directories onto the end of the list, but it pops them from the end.
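The dlist trick described above, push new directories onto the back and then pop from the front for breadth-first or from the back for depth-first, can be sketched in Python with a deque (the filesystem layout and names here are hypothetical):

```python
from collections import deque

def find_file(tree, name, depth_first=False):
    """tree: {entry: subtree-dict for directories, None for files}.
    Newly found directories go on the back of a deque; pop from the back
    for depth-first search, or from the front for breadth-first."""
    work = deque([("", tree)])
    while work:
        path, entries = work.pop() if depth_first else work.popleft()
        for entry, children in entries.items():
            full = path + "/" + entry
            if entry == name:
                return full
            if children is not None:          # a directory: queue it
                work.append((full, children))
    return None

fs = {"Program Files": {"vim": {"gvim.exe": None}, "other": {}}}
print(find_file(fs, "gvim.exe"))  # /Program Files/vim/gvim.exe
```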

The next step is to extract out the depth/breadth first algorithm so it's generally useful, but I need to continue working on the database library.

Saturday, February 02, 2008

Parsing HTTP headers in Factor with multi-assocs

The implementation of setting and parsing http headers in Factor has previously used a hashtable with a single key/value pair. However, this is broken because certain fields can be sent twice, e.g. set-cookie. The new implementation is a hashtable with keys/vectors to store multiple values for the same key.

I originally tried to make this obey the assoc protocol so that you could convert from a hashtable of vectors back to any type of assoc (hashtable/alist/AVL tree/etc), but this turned out to be a really bad idea: not only was it not useful, it breaks the semantics of the assoc protocol if set-at inserts an element instead of setting it.

So the implementation is in assocs.lib as a few helper words:

: insert-at ( value key assoc -- )
    [ ?push ] change-at ;

: peek-at* ( key assoc -- obj ? )
    at* dup [ >r peek r> ] when ;

: peek-at ( key assoc -- obj )
    peek-at* drop ;

: >multi-assoc ( assoc -- new-assoc )
    [ 1vector ] assoc-map ;

: multi-assoc-each ( assoc quot -- )
    [ with each ] curry assoc-each ; inline

: insert ( value variable -- ) namespace insert-at ;


Of course, set-at and at still set and access the values, but there are a couple new utility words. The insert-at word has the same stack effect as set-at but pushes a value instead of setting it. peek-at will give you the last value set for a given key, and this is the standard way of accessing values when you only care about the last one.

To turn an assoc into a multi-assoc, call >multi-assoc. To iterate over all the key/value pairs, use multi-assoc-each.

The insert word is for use with the make-assoc word, which executes inside a new namespace and outputs the variables you set as a hashtable.
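The multi-assoc idea is easy to mirror in Python, where the assoc is a dict of lists. This is a sketch for illustration; the function names follow the Factor words above:

```python
def insert_at(value, key, assoc):
    """Like set-at, but appends instead of overwriting (Factor's insert-at)."""
    assoc.setdefault(key, []).append(value)

def peek_at(key, assoc):
    """Return the last value inserted for key (Factor's peek-at)."""
    return assoc[key][-1]

def to_multi_assoc(assoc):
    """Wrap every value in a one-element list (Factor's >multi-assoc)."""
    return {k: [v] for k, v in assoc.items()}

headers = {}
insert_at("skin=noskin", "set-cookie", headers)
insert_at("session-id=002", "set-cookie", headers)
print(peek_at("set-cookie", headers))  # the last value set wins
```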

Here's an example of what the headers look like for a website:

( scratchpad ) USE: http.client "amazon.com" http-get drop .
H{
    { "connection" V{ "close" } }
    { "content-type" V{ "text/html; charset=ISO-8859-1" } }
    { "server" V{ "Server" } }
    { "x-amz-id-2" V{ "L0oid1yo1Z6cuq+VgwWCv0G/UdPov/0v" } }
    { "x-amz-id-1" V{ "15CPXN68HXB35FXE62CX" } }
    {
        "set-cookie"
        V{
            "skin=noskin; path=/; domain=.amazon.com; expires=Sun, 03-Feb-2008 04:57:59 GMT"
            "session-id-time=1202544000l; path=/; domain=.amazon.com; expires=Sat Feb 09 08:00:00 2008 GMT"
            "session-id=002-3595241-4867224; path=/; domain=.amazon.com; expires=Sat Feb 09 08:00:00 2008 GMT"
        }
    }
    { "vary" V{ "Accept-Encoding,User-Agent" } }
    { "date" V{ "Sun, 03 Feb 2008 04:57:59 GMT" } }
}


I have normalized the keys by converting them to all lower case. For some reason, Amazon sends two headers as Set-Cookie and the last one as Set-cookie, which is pretty weird.

Since the prettyprinter outputs valid Factor code, you can copy/paste the above headers into a Factor listener and run some of the multi-assoc words on them.

Wednesday, November 07, 2007

Doubly-linked Lists, Queues, and Heaps

Doubly-linked Lists and Queues


Factor has had doubly-linked lists for years now, but they were not well-documented or polished. Now, they're documented and have replaced the queues library.

An example of a dlist usage:
<dlist> "red" over push-front "blue" over push-front dup pop-back .
"red"


You can add/remove nodes of a dlist with push-front, push-back, pop-front, pop-back, delete-node, and search with dlist-find, dlist-contains?.

Finding the length of a dlist is O(1) since it stores the length as dlist-length, a tuple slot.

Heaps


Heaps have been updated to allow for <min-heap> and <max-heap> data structures. Adding elements to a heap is achieved with heap-push ( value key heap -- ), while popping elements is heap-pop ( heap -- value key ).

Factor's green threads implementation had been using a hack for the sleep queue: each time a new entry was added, it would modify a non-growable array, which was then sorted by the smallest timeout. Adding a sequence of sleep continuations would take O(n^2 log n) time! Running 10000 [ [ 100 sleep ] in-thread ] times should spawn 10000 threads and sleep for 100 ms in each one; with the old sleep-queue implementation this takes over a minute on my MacBook. Now it's just O(n log n), which takes a second or two.
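The O(n log n) behavior comes from keeping the pending wakeups in a binary min-heap keyed on deadline instead of re-sorting an array on every insert. A rough Python sketch of such a sleep queue, using the stdlib heapq (this is an illustration, not Factor's code):

```python
import heapq
import itertools

class SleepQueue:
    """Min-heap of (wakeup-time, tie-breaker, thread); push and pop are
    each O(log n), versus re-sorting the whole array on every insert."""
    def __init__(self):
        self.heap = []
        self.counter = itertools.count()  # tie-breaker for equal deadlines

    def push(self, wakeup, thread):
        heapq.heappush(self.heap, (wakeup, next(self.counter), thread))

    def pop(self):
        wakeup, _, thread = heapq.heappop(self.heap)
        return wakeup, thread

q = SleepQueue()
for name, deadline in [("a", 300), ("b", 100), ("c", 200)]:
    q.push(deadline, name)
print(q.pop())  # smallest deadline first: (100, 'b')
```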

Monday, September 24, 2007

Windows SEH Revisited

It turns out there are Win32 API calls to set up a structured exception handler (SEH) on Windows NT, which is easier and more reliable to use than inline assembly. To install an exception handler, call AddVectoredExceptionHandler. An exception handler gets passed a PEXCEPTION_POINTERS struct, which is a PEXCEPTION_RECORD and a CONTEXT*, both of which are defined in the header files. The context structure is different on every platform because it lists the contents of the CPU registers at the time of the exception, while the ExceptionRecord struct is the same, and contains the ExceptionCode and the address where it occurred (in ExceptionInformation[1]).

The job of the exception handler is to determine where program execution should proceed after it returns. For Factor, the errors we handle are memory access errors, which happen when a stack overflows or underflows and hits a guard page, division by zero, and 'any other error'. We can't just jump to these Factor exception handlers inside the Win32 exception handler, but we can set the instruction pointer, the EIP register, to our handler and return EXCEPTION_CONTINUE_EXECUTION. In this way, we can let the operating system catch errors and report them to the user as "Data stack underflow", "Division by zero", and continue running Factor without expensive checks for each memory access or division.

The stack pointer at the time of the exception is also important. If we were executing compiled Factor code, as determined by checking the fault address against the C predicate in_code_heap_p, then we set a global with this address to continue execution after the exception is handled. However, if we are in C code, then the global is left as NULL.

Here is the code--much cleaner, and more correct, than before.

void c_to_factor_toplevel(CELL quot)
{
    AddVectoredExceptionHandler(0, (void*)exception_handler);
    c_to_factor(quot);
    RemoveVectoredExceptionHandler((void*)exception_handler);
}

long exception_handler(PEXCEPTION_POINTERS pe)
{
    PEXCEPTION_RECORD e = (PEXCEPTION_RECORD)pe->ExceptionRecord;
    CONTEXT *c = (CONTEXT*)pe->ContextRecord;

    if(in_code_heap_p(c->Eip))
        signal_callstack_top = (void*)c->Esp;
    else
        signal_callstack_top = NULL;

    if(e->ExceptionCode == EXCEPTION_ACCESS_VIOLATION)
    {
        signal_fault_addr = e->ExceptionInformation[1];
        c->Eip = (CELL)memory_signal_handler_impl;
    }
    else if(e->ExceptionCode == EXCEPTION_FLT_DIVIDE_BY_ZERO
        || e->ExceptionCode == EXCEPTION_INT_DIVIDE_BY_ZERO)
    {
        signal_number = ERROR_DIVIDE_BY_ZERO;
        c->Eip = (CELL)divide_by_zero_signal_handler_impl;
    }
    else
    {
        signal_number = 11;
        c->Eip = (CELL)misc_signal_handler_impl;
    }

    return EXCEPTION_CONTINUE_EXECUTION;
}

void memory_signal_handler_impl(void)
{
    memory_protection_error(signal_fault_addr,signal_callstack_top);
}

void divide_by_zero_signal_handler_impl(void)
{
    general_error(ERROR_DIVIDE_BY_ZERO,F,F,signal_callstack_top);
}

void misc_signal_handler_impl(void)
{
    signal_error(signal_number,signal_callstack_top);
}

Wednesday, June 20, 2007

Roman Numeral Conversion in Factor

I whipped up some words to convert integers to Roman numerals. The word <PRIVATE changes the IN: vocabulary to roman.private to hide the implementation from the library user. Of course you can access these private words, like all words in Factor, with a USE:.

The algorithm is simple. Going from high to low, iterate the Roman numeral values and /mod (integer division with remainder) each with the input, outputting the divisor and replacing the input with the remainder. This algorithm treats 4s and 9s as digits, just as it treats single letters as digits (i, v, x, etc). Without the 4s and 9s, you end up getting longer answers that, while logical, are wrong, e.g. 9 is "ix", not "viv". (I found this bug in the first iteration while writing unit tests.)
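As a cross-check of the algorithm just described, here is the same high-to-low /mod loop in Python (a sketch independent of the Factor code below; divmod plays the role of /mod):

```python
DIGITS = ["m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"]
VALUES = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1]

def to_roman(n):
    """Going from high to low, divide with remainder by each value,
    emitting that many copies of its digit."""
    if not 1 <= n <= 3999:
        raise ValueError(n)
    out = []
    for value, digit in zip(VALUES, DIGITS):
        count, n = divmod(n, value)
        out.append(digit * count)
    return "".join(out)

print(to_roman(9))  # "ix", not "viv"
```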

The words we care about, >roman and >ROMAN, are placed IN: roman because of the PRIVATE> word, which drops back to the public vocabulary roman. The > in a word's name is a convention for words that do conversions; the parentheses around the word (>roman) mean it's an implementation word; you should never have a (>roman) without also having a >roman. Picking these names is done purely by convention--the only forbidden word names are numbers and words that start with a ", which parse as strings. Everything until whitespace is a word name.

The conversion from Roman numerals back to integers and roman+, roman*, etc are in roman.factor.

USING: arrays assocs kernel math math.vectors namespaces
quotations sequences sequences.private strings ;
IN: roman

<PRIVATE

: roman-digits ( -- seq )
    { "m" "cm" "d" "cd" "c" "xc" "l" "xl" "x" "ix" "v" "iv" "i" } ;

: roman-values ( -- seq )
    { 1000 900 500 400 100 90 50 40 10 9 5 4 1 } ;

TUPLE: roman-range-error n ;

: roman-range-check ( n -- )
    dup 1 3999 between? [
        drop
    ] [
        <roman-range-error> throw
    ] if ;

: (>roman) ( n -- )
    roman-values roman-digits [
        >r /mod swap r> <repetition> concat %
    ] 2each drop ;

PRIVATE>

: >roman ( n -- str )
    dup roman-range-check [
        (>roman)
    ] "" make ;

: >ROMAN ( n -- str ) >roman >upper ;

Friday, April 27, 2007

Windows CE SEH for ARM with GCC

Structured exception handling (SEH) is a low-level way to handle software and hardware exceptions in Microsoft Windows. Other exception handling mechanisms are built on top of SEH. Factor uses SEH in order to report stack underflow/overflow errors on Windows. Microsoft Visual Studio (VS) would usually take care of the tricky implementation details by supplying __try, __except, and __finally keywords, but Factor's compiler relies on globally assigning the datastack and retainstack to specific registers, a feature which VS does not support. Conversely, GCC does not support __try, so we are left to implement the exception handler in assembly.

The internal implementation is both OS and architecture dependent, meaning that we have to support SEH on Windows NT on x86, and Windows CE for x86 and ARM. Digging through MSDN leads to a page called SEH in RISC Environments, where RISC stands for "Reduced Instruction Set Computer". (You are just supposed to know that ARM is RISC and x86 is CISC.) I recommend reading about SEH on MSDN, but it basically says that SEH on RISC uses Virtual Unwinding and that you will need to set the PDATA structure for your function in order to set up the exception handler. Virtual unwinding traverses the framestack until it finds a frame with an exception handler, at which time it will actually unwind the framestack to that point and call the exception handler.

All we need to do is define void run_toplevel(void){...} to install our exception_handler and call Factor's "main" subroutine, run.

vm/os-windows-ce.c

long exception_handler(PEXCEPTION_RECORD rec, void *frame, void *ctx, void *dispatch)
{
    memory_protection_error(
        rec->ExceptionInformation[1] & 0x1ffffff, // mask off process slot
        native_stack_pointer());
    return -1; /* unreachable */
}


vm/os-windows-ce-arm.S

.text

.globl run_toplevel

.word exception_handler
.word 0

run_toplevel:
ldr pc, _Prun

_Prun: .word run

.section .pdata
.word run_toplevel
.word 0xc0000002 | (0xFFFFF << 8)


The C code passes the memory fault address, stored in ExceptionInformation[1], and the native stack pointer (another assembly function that simply moves ESP -> EAX) to a function that converts the fault address into a Factor stack error message, e.g. "Datastack underflow". The fault address is actually a slotized address, meaning that it is relative to some process slot which must be masked off. Annoyingly, the exceptions happen at different address ranges on the emulator and on a real mobile device. Slotized addresses in this context are not well documented on MSDN, but the Windows CE Blog Team quickly answered my email. (Thanks!)

The assembly code is mostly boilerplate. The .text directive starts us in the executable code section. We're going to be exporting void run_toplevel(void){...} in order to call it from C, so we declare it as a .globl and start the definition. We simply call our "main" function, void run(void){...}, which is the platform-dependent run function to start the Factor interpreter.

Hopefully, if someone else has to do this it will take much less time.

Thursday, April 12, 2007

Building Factor in Cygwin now supported

Factor has only compiled on MinGW since we made the Windows Factor port a couple of years ago. The problem with Cygwin has been its dependence on cygwin1.dll, which slows your program. Also, the Cygwin installer doesn't add it to your path, so that is one more step (and not standard). Additionally, structured exception handling (SEH) works the first time but hangs the second time it triggers with the dll, e.g. on the second stack underflow.

However, Eric Mertens, a new Factor user, found a compiler flag that removes the cygwin1.dll dependency:
-mno-cygwin

Three cheers for new users!

This flag fixed both the dll dependency and the SEH bug, allowing all unit tests to pass. The only other change I had to make for the port was to #include <wchar.h> for the compiler to find wcslen().

So now you can compile Factor with either MinGW or Cygwin, and both versions are officially supported.

Saturday, December 30, 2006

Serialization To Disk

For fun, I wrote a TinyURL clone in Factor. You can try it out at http://wee-url.com. I needed a way to save the data to a file, so I used libs/serialize to serialize the hashtable containing the links. There was yet another bug involving vector literals -- the serialization code would reuse a vector, so the first time you serialized, the output was correct, but the next time it would produce garbage.
  scratchpad SYMBOL: aa H{ { 1 2 } } aa set
scratchpad aa get .
H{ { 1 2 } }
scratchpad aa get [ [ serialize ] with-serialized ] string-out .
"hp\0\0\0\u0001Dap\0\0\0\u0001Ep\0\0\0\u0001\u0001ap\0\0..."
scratchpad aa get [ [ serialize ] with-serialized ] string-out .
"op\0\0\0\u0001:"
The fix was to change a V{ } to a V{ } clone.


Serialization to Disk Code

: save-fstore ( variable path -- )
    [ [ get serialize ] with-serialized ] with-stream ;

: load-fstore ( variable path -- )
    dup exists? [
        [ [ deserialize ] with-serialized ] with-stream
        swap
    ] [
        drop >r H{ } clone r>
    ] if set-global ;

: fstore-set ( variable fstore -- )
    get >r [ get ] keep r> set-hash ;

: fstore-get ( default variable fstore -- )
    get dupd hash* [ swap set-global drop ] [ drop set-global ] if ;


save-fstore and load-fstore save/load a hashtable to disk. The hashtable keys are variable names and the values are the variable values. fstore-set puts your variable in the hashtable, so you should call this directly before saving to disk. fstore-get loads the data from your store. If there is no fstore or if the key does not exist, it uses your default value.
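The same save/load pattern translates almost one-to-one to, say, Python's pickle module. This is a hypothetical sketch (the names follow the Factor words above; pickle stands in for libs/serialize):

```python
import os
import pickle
import tempfile

def save_fstore(store, path):
    """Serialize the whole variable -> value dictionary to disk."""
    with open(path, "wb") as f:
        pickle.dump(store, f)

def load_fstore(path):
    """Load the dictionary, or start empty if the file does not exist yet."""
    if not os.path.exists(path):
        return {}
    with open(path, "rb") as f:
        return pickle.load(f)

path = os.path.join(tempfile.gettempdir(), "wee-url.fstore")
store = {"wee-shortcuts": {"abc": "http://example.com/long-url"}}
save_fstore(store, path)
print(load_fstore(path))
```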

The loading code is pretty simple:
: fstore-name "wee-url.fstore" ;
SYMBOL: wee-shortcuts
SYMBOL: wee-fstore
wee-fstore fstore-name load-fstore
H{ } clone wee-shortcuts wee-fstore fstore-get
The variable wee-shortcuts stores the encoded URLs, while wee-fstore is the hashtable that the above code reads/writes to disk.

Finally, to write the store:
: save-shortcuts ( -- )
wee-shortcuts wee-fstore fstore-set
wee-fstore fstore-name save-fstore ;

Notice that we add wee-shortcuts to the hashtable before saving to disk.

I'm going to rewrite the wee-url code to use Furnace instead of callback-responder. Once I get it looking nice, I'll show you how it works.

Thursday, November 23, 2006

Compiling Factor on the Intel Mac

To compile Factor for Mac you will need:
  • Installation DVDs that came with your Mac (for X11)
  • darwin-ports
  1. Run the installation disk that came with your Mac in order to install X11. Since Apple does not offer X11 as a download for OS X 10.4 for Intel computers, you have to install from the DVD.
  2. Download the darcs binary and place it in /usr/local/bin
  3. Open terminal (Finder->Applications->Utilities->Terminal) and go to your home directory. To get the Factor source code, run:
    darcs get http://factorcode.org/repos
  4. To compile, move into the directory darcs created with cd repos/Factor and run the command make macosx-x86
  5. Download the boot image at http://factorcode.org/images/latest/boot.image.pentium4
  6. ./f boot.image.pentium4
  7. make macosx.app
  8. (Optional) Create a shell script to start Factor in your terminal window. I called mine go.
    #!/bin/sh
    DIR=~/repos/Factor/Factor.app/Contents
    $DIR/MacOS/Factor $@

    Save the file and run chmod a+x go
  9. (Optional) You can instead run open Factor.app and use Console.app to view runtime output.
To start the UI, run ./go or open Factor.app; run ./f to start Factor in the shell. If you run Factor in the terminal, I suggest using rlwrap, which adds a history buffer.

Now you can code anything!