Protocol
As already stated, GLASS is composed of a series of modules that communicate with each other via a common protocol: a simple stream of tagged tokens. The tokens are written in 7-bit (could be 8) ASCII text, one token per line, with tabs used to separate the token's components. Each token holds one piece of information and any additional attributes of that information.
a 2 21
b 5 banks 2
b 5 after
b 8 takeover
b 3 bid 1
This is an example stream of five tokens.
- The first component of a token, a single letter, indicates that token's type. There isn't any strong typing in this protocol; the aim of the type char is to distinguish tokens from each other within a token stream. Having said that, there is one strong type, and a sort of convention has started up.
- The strong type is the comment token type '#': all tokens starting with the char '#' are ignored.
- The convention is that 'a' tokens indicate the start of a document, 'b' tokens are terms, and 'h' tokens are the end of a document.
Based on this scant information, you can see that the tokens shown above are the start of a document: the 'a' token is a document start tag containing the document id, and the 'b' tokens are the first four words.
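To make the role of the type char concrete, here is a minimal Python sketch that tallies tokens by type and skips comment tokens. It is only an illustration, not part of GLASS: the function name `count_token_types` is hypothetical, the leading comment line in the sample is added for the demonstration, and skipping blank lines is an assumption.

```python
from collections import Counter

def count_token_types(lines):
    """Tally tokens by their type character, skipping '#' comment tokens."""
    counts = Counter()
    for line in lines:
        if not line.strip():
            continue  # assumption: a blank line carries no token
        kind = line[0]  # the type is the first character of the line
        if kind == "#":
            continue  # comment tokens are ignored
        counts[kind] += 1
    return counts

# The example stream from above, preceded by an illustrative comment token.
example = [
    "# comment tokens are ignored",
    "a\t2\t21",
    "b\t5\tbanks\t2",
    "b\t5\tafter",
    "b\t8\ttakeover",
    "b\t3\tbid\t1",
]
counts = count_token_types(example)
print(counts["a"], counts["b"])  # prints: 1 4
```

Because only the first character is inspected, the same loop works for any token type a module chooses to emit.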
- The number immediately following a token type indicates the character length of the token's main attribute.
- This number is followed by the main attribute. The length thing was intended for other purposes and isn't really used. I should probably get rid of it one day. The rule for main attribute strings is that they can be composed of any character except a TAB or a NEWLINE character. TABs and NEWLINEs are used as separators. Main attributes can be up to 1K in length.
- Sometimes a token may require additional attributes, for example a tf score for word tokens. Up to 3000 additional attributes can be added to a token, each separated by tabs. The first additional attribute can, like the main attribute, be up to 1K in length; the other 2999 can be up to 32 bytes long. Additional attributes can, again like the main attribute, be composed of any character except TABs and NEWLINEs. In the example, the word tokens 'banks' and 'bid' have one additional attribute each.
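The layout described in the list above (type, length, main attribute, optional additional attributes, all tab-separated) can be sketched as a small parser and serialiser. This is a Python illustration under stated assumptions, not the GLASS implementation: the names `Token`, `parse_token` and `format_token` are hypothetical, the length check is an optional sanity test added for the sketch (the length is not really used), the size limits are not enforced, and comment tokens are expected to be filtered out by the caller.

```python
from typing import List, NamedTuple

class Token(NamedTuple):
    kind: str          # single-letter token type, e.g. 'a', 'b', 'h'
    main: str          # main attribute
    extras: List[str]  # additional attributes, possibly empty

def parse_token(line: str) -> Token:
    """Split one tab-separated line into type, main attribute and extras."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError(f"expected at least 3 fields, got {len(fields)}")
    kind, length, main = fields[0], fields[1], fields[2]
    # Assumption for this sketch: verify the declared length, even though
    # the protocol itself does not rely on it.
    if int(length) != len(main):
        raise ValueError(f"length {length} does not match {main!r}")
    return Token(kind, main, fields[3:])

def format_token(token: Token) -> str:
    """Serialise a Token back to its one-line ASCII form (no newline)."""
    fields = [token.kind, str(len(token.main)), token.main, *token.extras]
    for field in fields:
        if "\t" in field or "\n" in field:
            raise ValueError("attributes may not contain TAB or NEWLINE")
    return "\t".join(fields)

token = parse_token("b\t5\tbanks\t2")
print(token.kind, token.main, token.extras)
print(format_token(token) == "b\t5\tbanks\t2")  # prints: True
```

Parsing and then formatting a line returns the original text, which is the ASCII round trip every module performs.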
Although the token format is verbose and the modules that process it are inefficient (each converts ASCII input into an internal representation and then converts it back to ASCII for output), such considerations are less important when building an experimental IR system. What I feel is more important is the readability of the format, which allows the experimenter to examine the flow of data between modules to help debug the system, or to redirect intermediate data into a file. In addition, many UNIX text processing commands can be used to manipulate a token stream. For example, the following counts how many documents there are in a collection:
cat tokenised_col_file | grep '^a' | wc -l
Making use of these existing commands reduces the number of modules that need to be written.
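Where no existing command fits, a module can be a small filter over the token stream. The following Python sketch shows the general shape, assuming a hypothetical stop-word module: the function `filter_stopwords` and the stop list are invented for illustration and are not taken from GLASS.

```python
import io

STOPWORDS = {"after", "the", "of"}  # hypothetical stop list

def filter_stopwords(infile, outfile, stopwords=STOPWORDS):
    """Copy a token stream, dropping 'b' (term) tokens found in stopwords."""
    for line in infile:
        fields = line.rstrip("\n").split("\t")
        is_term = fields[0] == "b" and len(fields) >= 3
        if is_term and fields[2] in stopwords:
            continue  # drop this term token
        outfile.write(line)  # every other token passes through unchanged

# Run the filter over the example stream, using in-memory files.
source = io.StringIO(
    "a\t2\t21\n"
    "b\t5\tbanks\t2\n"
    "b\t5\tafter\n"
    "b\t8\ttakeover\n"
    "b\t3\tbid\t1\n"
)
sink = io.StringIO()
filter_stopwords(source, sink)
print(sink.getvalue(), end="")
```

Passing `sys.stdin` and `sys.stdout` instead of the in-memory files would let such a filter sit between two other modules in a pipeline, and its output stays readable with the same UNIX commands.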