Chapter 4 Index Construction
This chapter introduces and compares several ways to construct an index.
In comparing the BSBI and the SPIMI, the SPIMI is better than the BSBI in
several aspects:
1. The SPIMI needs not a data structure for mapping terms to term IDs which takes great
memory space in large collections.
2. The SPIMI does not sort the tokens; therefore, it runs faster than the BSBI.
However, I also noticed that, the SPIMI actually sorts the terms in lexicographic order
before writes the index of the block into the disk to help merge the blocks later.
Although the SPIMI seems better than the BSBI, I am curious about in what situations,
the BSBI is more suitable than the SPIMI. Or is the SPIMI always faster than the BSBI?
Sunday, January 17, 2010
Unit 2: muddiest point
I am still not clear about why loss of recall is more a problem
in a smaller corpora compared to that in a very large corpora
while using the phrasal indexing.
in a smaller corpora compared to that in a very large corpora
while using the phrasal indexing.
Sunday, January 10, 2010
Unit 2: reading notes
Chapter 1.2
In this part, the auther introduced two different ways to construct the inverted index: the singly linked lists and the variable length array. The former have the advantage of easier insertion of documents compared to the latter. It is also said that "If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse". The former have overheads of pointers, thus, they need more storage space. Moreover, because the former do not occupy a continuous storage space, they need more time to traverse from one node to another. My question is, can Prof introduce more about the variable length array, such as how an item is inserted, deleted to the array, and if there is a limit for the size of variable length array. If there is a limit for the size of the array, what the system will do when the number of items exceeds the limit? Therefore, we could have a better understanding about the weakness and strenths for these two ways.
In this part, the auther introduced two different ways to construct the inverted index: the singly linked lists and the variable length array. The former have the advantage of easier insertion of documents compared to the latter. It is also said that "If updates are relatively infrequent, variable length arrays will be more compact and faster to traverse". The former have overheads of pointers, thus, they need more storage space. Moreover, because the former do not occupy a continuous storage space, they need more time to traverse from one node to another. My question is, can Prof introduce more about the variable length array, such as how an item is inserted, deleted to the array, and if there is a limit for the size of variable length array. If there is a limit for the size of the array, what the system will do when the number of items exceeds the limit? Therefore, we could have a better understanding about the weakness and strenths for these two ways.
Subscribe to:
Posts (Atom)