Protocol

As already stated, GLASS is composed of a series of modules that communicate with each other via a common protocol: a simple stream of tagged tokens. The tokens are written in 7-bit (could be 8-bit) ASCII text, one token per line, with tabs separating the token's components. Each token holds one piece of information and any additional attributes of that information.

a 2	21
b 5	banks	2
b 5	after
b 8	takeover
b 3	bid	1

This is an example stream of five tokens.
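
To make the line format concrete, the following is a minimal Python sketch of a reader for such a stream. It is not taken from GLASS itself: the names Token and parse_tokens are hypothetical, and the sketch assumes only what is stated above, namely one token per line with tab-separated components. The first component (for example "b 5") is kept as an opaque string, since its internal layout is not specified here. In the example stream the number after the tag happens to equal the length of the following field, but the sketch does not rely on that.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One line of the stream (hypothetical in-memory form)."""
    head: str          # first tab-separated component, e.g. "b 5"
    fields: List[str]  # remaining components, e.g. ["banks", "2"]


def parse_tokens(lines: Iterable[str]) -> Iterator[Token]:
    """Yield one Token per non-empty line, splitting components on tabs."""
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no token
        head, *fields = line.split("\t")
        yield Token(head, fields)


# The five-token example stream from the text.
example = [
    "a 2\t21\n",
    "b 5\tbanks\t2\n",
    "b 5\tafter\n",
    "b 8\ttakeover\n",
    "b 3\tbid\t1\n",
]
tokens = list(parse_tokens(example))
```

Keeping the components as plain strings mirrors the protocol's emphasis on readability; a real module would presumably convert them to whatever internal representation it needs.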

Although the token format is verbose and the modules that process it are inefficient (each converts ASCII input into an internal representation and then converts it back to ASCII for output), such considerations are less important when building an experimental IR system. What I feel is more important is the readability of the format, which allows the experimenter to examine the flow of data between modules, to help debug the system, or to redirect intermediate data into a file. In addition, many UNIX text processing commands can be used to manipulate a token stream, for example to count how many documents there are in a collection:

cat tokenised_col_file | grep '^a' | wc -l

Making use of these existing commands reduces the number of modules that need to be written.
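
Where no existing command fits, a module reduces to a small filter that reads tokens, transforms them, and writes them back out in the same format. The sketch below illustrates that round trip under stated assumptions: run_module and drop_short_words are hypothetical names, not part of GLASS, and the example transform (dropping "b" tokens whose second component is shorter than four characters) is invented purely for illustration.

```python
from typing import Callable, Iterable, List, Optional, TextIO

Components = List[str]


def run_module(src: Iterable[str], dst: TextIO,
               transform: Callable[[Components], Optional[Components]]) -> None:
    """Read tab-separated tokens, apply transform, write the survivors."""
    for line in src:
        line = line.rstrip("\n")
        if not line:
            continue  # assumption: blank lines carry no token
        result = transform(line.split("\t"))  # ASCII -> internal form
        if result is not None:
            dst.write("\t".join(result) + "\n")  # internal form -> ASCII


def drop_short_words(parts: Components, min_len: int = 4) -> Optional[Components]:
    """Hypothetical transform: discard "b" tokens with a short second component."""
    if parts[0].startswith("b") and len(parts) > 1 and len(parts[1]) < min_len:
        return None
    return parts
```

Called as run_module(sys.stdin, sys.stdout, drop_short_words), such a filter could sit in a pipeline alongside the standard commands shown above.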