3.0 GLASS’s index file structures and modules explained

This section describes the structure of index files in GLASS and then gives an overview of the most commonly used modules.

3.1 Index files

GLASS uses two types of index file: term index files and doc id index files. Each type is identified by a suffix on its file name. Both types share the same basic format, which is explained first, followed by a description of each type in turn.

3.1.1 Basic format of index files

An index file consists of two main parts: a header and a set of entries. The header is composed of three tokens, as shown in Figure 6. The lines of this header carry the following information.

The entries of an index file are type ‘g’ tokens. All entry tokens of an index file have the same character length.
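One likely consequence of fixed-length entries is that the n-th entry can be located by arithmetic rather than by scanning the file. The following Python sketch illustrates that idea only; the function name, the header length, and the sample data are hypothetical and do not reflect the real GLASS token syntax.

```python
def nth_entry(data: bytes, header_len: int, entry_len: int, n: int) -> bytes:
    """Return the n-th (zero-based) fixed-length entry that follows the header."""
    start = header_len + n * entry_len
    end = start + entry_len
    if n < 0 or end > len(data):
        raise IndexError("entry number out of range")
    return data[start:end]


# Hypothetical index data: an 8-byte header followed by three 6-byte entries.
index = b"HEADER\n\n" + b"aaaaa\n" + b"bbbbb\n" + b"ccccc\n"

if __name__ == "__main__":
    print(nth_entry(index, header_len=8, entry_len=6, n=1))
```

Because every entry has the same length, the cost of reaching an entry does not depend on how many entries precede it.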

3.1.2 Doc id index files (suffix ‘.dsi’, ‘.dii’)

The doc id index files are used to fetch the content of documents from the file in which they occur. The files are created with ‘di’ and accessed with ‘get_dsi’. Documents are indexed by their id. The id must be a positive integer, and the documents of a collection must have a contiguous set of ids.

A doc id index consists of two files, the ‘.dsi’ file and the ‘.dii’ file. The ‘.dsi’ file holds the complete set of doc ids (sorted in numerical order) for the collection it is indexing. Figure 7 shows the first few tokens of such a file. The main attribute of an entry token is a doc id; the two additional attributes are the file position and the length of the document corresponding to that id. In Figure 7 the document with id 2 starts at the 486th byte of a document collection file and is 13,735 bytes long.

FIGURE 7. Example ‘.dsi’ file
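The way a ‘.dsi’ entry is used can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the implementation of ‘get_dsi’: the names `DsiIndex` and `fetch_document` are hypothetical, the entries are held in a plain dictionary rather than parsed from tokens, and file positions are treated as zero-based byte offsets, which the text does not confirm.

```python
import io
from typing import BinaryIO, Dict, Tuple

# Hypothetical in-memory form of '.dsi' entries: doc id -> (file position, length).
DsiIndex = Dict[int, Tuple[int, int]]


def fetch_document(collection: BinaryIO, dsi: DsiIndex, doc_id: int) -> bytes:
    """Seek to the document's position and read exactly its length in bytes."""
    position, length = dsi[doc_id]
    collection.seek(position)  # assumption: position is a zero-based byte offset
    return collection.read(length)


# A toy collection file holding three documents separated by '|'.
collection = io.BytesIO(b"first doc|second document|third")
dsi: DsiIndex = {1: (0, 9), 2: (10, 15), 3: (26, 5)}

if __name__ == "__main__":
    print(fetch_document(collection, dsi, 2))
```

Storing both position and length means a document can be read with a single seek and a single read, without parsing the documents that precede it.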

Many document collections are stored in multiple files, and the ‘.dii’ index is used to keep track of which documents reside in which file. Figure 8 shows the beginning of a ‘.dii’ file. Here the main attribute of an entry token is a file name; the additional attributes are the doc ids at the start and end of that file. A document cannot span a file boundary in GLASS.

FIGURE 8. Example ‘.dii’ file
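Since each ‘.dii’ entry records the first and last doc id held in a file, finding the file for a given document amounts to a range lookup. The Python sketch below shows one way to do this with a binary search; the tuple layout, the file names, and the function `file_for_doc` are assumptions made for illustration and are not taken from GLASS itself.

```python
from bisect import bisect_right
from typing import List, Tuple

# Hypothetical '.dii' entries: (file name, first doc id, last doc id),
# sorted by first doc id.
DiiEntry = Tuple[str, int, int]


def file_for_doc(dii: List[DiiEntry], doc_id: int) -> str:
    """Return the name of the file whose doc id range contains doc_id."""
    starts = [first for _, first, _ in dii]
    i = bisect_right(starts, doc_id) - 1  # last entry starting at or before doc_id
    if i < 0 or doc_id > dii[i][2]:
        raise KeyError(f"doc id {doc_id} is not covered by any file")
    return dii[i][0]


dii: List[DiiEntry] = [
    ("part1.txt", 1, 250),
    ("part2.txt", 251, 600),
    ("part3.txt", 601, 900),
]

if __name__ == "__main__":
    print(file_for_doc(dii, 251))
```

The lookup relies on the two properties stated above: doc ids are contiguous, and no document crosses a file boundary, so every id falls in exactly one range.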

3.1.3 Term index files (suffix ‘.ti’, ‘.dol’)

A term index is used to list which tokens occur in which documents. Its files are created with the module ‘idx’ and accessed using ‘occs’. The ‘.ti’ file contains an alphabetically sorted list of tokens. Each token in that file has a corresponding entry list in a ‘.dol’ file; that list is composed of tuples. A tuple consists of a doc id and any additional information associated with the token’s occurrence in the document. Both the doc id and the additional information must be in integer form. One example of the additional information one might store in a tuple is a term frequency (tf) weight.

Figure 9 shows the first few tokens of a ‘.ti’ file. The main attribute of an entry token is the name of the item that was indexed. The four additional attributes of the token are its inverse document frequency (idf), the position of its entry list in the ‘.dol’ file, the number of documents in which it occurs, and the size of all tuples in its entry list. In Figure 9 the entry token ‘abandon’ has an idf of 50, occurs in three documents, and has one additional piece of information associated with its occurrence in each of these three documents.

FIGURE 9. Example ‘.ti’ file

The ‘.dol’ file, which holds the token entry lists, is the only index file stored in a binary format; this was done to speed GLASS up. Because the format is not human-readable, there is little more to be said here about the structure of this type of file.
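Although the actual binary layout of the ‘.dol’ file is not documented here, the relationship between a ‘.ti’ entry and its entry list can be sketched in Python. Everything specific in this sketch is an assumption made for illustration: integers are encoded as 4-byte little-endian values, the tuple size is read as the number of integers per tuple, and the names `TiEntry` and `read_entry_list`, the token ‘abbey’, and the sample values are hypothetical. It is not a description of how ‘occs’ works.

```python
import struct
from typing import Dict, List, NamedTuple, Tuple


class TiEntry(NamedTuple):
    idf: int
    dol_position: int  # byte offset of the entry list in the '.dol' data
    doc_count: int     # number of documents in which the token occurs
    tuple_size: int    # integers per tuple (assumed meaning of "size")


INT_SIZE = struct.calcsize("<i")  # assumed encoding: 4-byte little-endian int


def read_entry_list(dol: bytes, entry: TiEntry) -> List[Tuple[int, ...]]:
    """Decode the tuples of one token's entry list from the binary data."""
    tuples = []
    offset = entry.dol_position
    for _ in range(entry.doc_count):
        values = struct.unpack_from(f"<{entry.tuple_size}i", dol, offset)
        tuples.append(values)
        offset += INT_SIZE * entry.tuple_size
    return tuples


# Toy data: each tuple is (doc id, tf weight).
dol = struct.pack("<6i", 4, 2, 17, 1, 52, 3) + struct.pack("<2i", 9, 1)
ti: Dict[str, TiEntry] = {
    "abandon": TiEntry(idf=50, dol_position=0, doc_count=3, tuple_size=2),
    "abbey": TiEntry(idf=80, dol_position=24, doc_count=1, tuple_size=2),
}

if __name__ == "__main__":
    print(read_entry_list(dol, ti["abandon"]))
```

Keeping the position, document count, and tuple size in the ‘.ti’ entry means the entry list itself needs no delimiters or per-list header, which fits the stated aim of making access fast.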