The Performance of Boolean Retrieval and Vector Space Model in Textual Information Retrieval

—Boolean Retrieval (BR) and Vector Space Model (VSM) are very popular methods in information retrieval for creating an inverted index and querying terms. The BR method retrieves exact matches without ranking the results; the VSM method retrieves and ranks the results. This study empirically compares the two methods using a sample corpus obtained from Reuters. The experimental results show that the times required to produce an inverted index by the two methods are nearly the same; however, a difference exists in querying the index. The results also show that the number of generated indexes, the sizes of the generated files, and the duration of reading and searching an index are proportional to the number and size of the files in the corpus.


I. INTRODUCTION
Internet users usually use the World Wide Web to retrieve data and information from current large-scale sources [1]. Unfortunately, the presented information is sometimes less relevant. In the field of information retrieval, users expect to obtain very accurate results, yet several existing approaches may provide less accurate queries [2].
Many researchers in information retrieval use Boolean Retrieval (BR) [3] and Vector Space Model (VSM) [4] for creating an inverted index and querying terms. Previous studies have built search engines for English or Arabic collections and shown that the BR and VSM methods were optimal [3, 4]. Beyond text search, information retrieval can also query multimedia elements, such as pictures [5, 6] or sounds [7, 8]. Information retrieval methods can be optimized with other algorithms, such as the Genetic Algorithm (GA) [3, 4, 9] or Particle Swarm Optimization (PSO) [2, 10].
The BR method searches for exact results [11]. It does not rank by the number of times a term appears in a document because it only tests whether the term exists (boolean) in each document; the result is an unranked list of documents containing the terms. On the other hand, the VSM method searches for exact results with ranking [11]. It counts the number of times a term appears in a document and how many documents contain that term. The calculation uses vectors, and the search produces an ordered list. However, it is possible for two search engine methods to retrieve highly different documents, or to rank similar documents in a very different order [8]. The advantages and disadvantages of the BR and VSM methods are described in Table I [11]. Due to the popularity of the BR and VSM methods, there is a need to understand their relative performance. Previous work compared the Naïve Bayes (NB) method and the SVM method and found that the former was better for the case of an external knowledge base [12]; the NB method correctly classified 79.44% of instances compared to the SVM method [13]. Reference [4] found that SVM with GA was better than SVM alone for similarity measures. SVM with a Finite State Transducer was better than SVM with Latent Semantic Analysis for automatic speech recognition [8]. Comparisons on an Arabic data collection showed that BR with GA was better than BR alone [3], BR with an adaptive GA was better (55.1%) than BR with a traditional GA, and SVM with an adaptive GA was better (42.1%) than SVM with a traditional GA [9]. (Cite this article as: B. Yulianto, W. Budiharto, and I. H. Kartowisastro, "The Performance of Boolean Retrieval and Vector Space Model in Textual Information Retrieval", CommIT (Communication & Information Technology) Journal 11(1), 33-39, 2017.)
In this study, the researchers compare the performance of the BR and VSM methods in textual information retrieval through a search engine application written in Python. The application takes keyword input from the user and displays the search results as document IDs and names. Throughout this study, the researchers also explain the experimental steps in simple and clear ways, using tables, graphics, arithmetic equations, and representative source code, so that readers or other researchers can reproduce the experiment for studying, teaching, validation, or further work. At the end of this research, the performance of both methods is compared and conclusions are drawn. Some terminologies used in this study are explained below on the basis of Ref. [14]. A corpus is a collection of documents, for example the articles on Wikipedia; examples of popular corpora are Gutenberg, Brown, and Reuters. A term is a unique word contained in a document, derived from the tokenize process. Generally, the document is first cleared of stop-words to obtain more specific terms, and then stemmed to strip affixes. Tokenize is the process of converting a sentence into words (terms); the results are generally stored in an array, set, or list. The sentence "Mr. Widodo is a professor who teaches the course Information Retrieval" is tokenized into the terms "Mr.", "Widodo", "is", "a", "professor", "who", "teaches", "the", "course", "Information", "Retrieval". A stop-word is a common word that carries little meaning for the search process, such as "the", "a", "an", or "with". In the tokenize example above, removing stop-words yields ("Widodo", "professor", "teaches", "course", "Information", "Retrieval"); the words "Mr.", "is", "a", "who", and "the" are discarded because they are function words that contribute little to the search. Stemming is the process of reducing a word to its base form by removing suffixes; for example, the word "teaches" becomes "teach". Often-used stemming methods are the Porter Stemmer and the Lancaster Stemmer (their differences are not discussed in this study). An inverted index is a mapping of terms to the documents containing them and the positions (indexes) of the terms within each document. For example, the inverted index entry "budi-1:17,30,63;3:1,4,8" means that the term 'budi' occurs in document ID '1' at positions 17, 30, and 63, and in document '3' at positions 1, 4, and 8. The posting-list is the value of the mapped term; in the example above, the posting-list for the term 'budi' is '1:17,30,63;3:1,4,8'. Term frequency (tf) is the number of times a term appears in a document. Document frequency (df) is the number of documents that contain a term. Inverse document frequency (idf) is the total number of documents divided by the document frequency (df).
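The terminology above can be illustrated with a small, self-contained sketch in plain Python. The stop-word list and the suffix-stripping rule here are simplified stand-ins (a real system would use a full stop-word list and a Porter or Lancaster stemmer), and the output follows the paper's "term-doc:pos,...;doc:pos,..." posting-list notation:

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = {"mr.", "is", "a", "an", "the", "who", "with"}

def tokenize(text):
    """Convert a sentence into lowercase terms."""
    return re.findall(r"[a-z0-9.]+", text.lower())

def stem(term):
    """Naive suffix stripping, a stand-in for the Porter/Lancaster stemmers."""
    for suffix in ("es", "ing", "ed", "s"):
        if term.endswith(suffix) and len(term) - len(suffix) > 2:
            return term[:-len(suffix)]
    return term

def build_inverted_index(docs):
    """Map each term to {doc_id: [positions]}, skipping stop-words."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, token in enumerate(tokenize(text)):
            if token in STOP_WORDS:
                continue
            index[stem(token)][doc_id].append(position)
    return index

docs = {1: "Mr. Widodo is a professor who teaches the course Information Retrieval"}
index = build_inverted_index(docs)
# Print entries in the paper's "term-doc:pos,pos;doc:pos" style:
for term, postings in index.items():
    plist = ";".join(f"{d}:{','.join(map(str, p))}" for d, p in postings.items())
    print(f"{term}-{plist}")
```

Running this on the example sentence prints, among others, `teach-1:6`, showing stop-word removal ("is", "the", ... are gone) and stemming ("teaches" to "teach") applied before indexing.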

II. RESEARCH METHOD
This study uses an experimental method to obtain and analyze quantitative data. The researchers use Python to create and query the index for both the BR and VSM methods. The applications used for creating and querying the inverted index are compared in Table II.
After writing the source code, the researchers execute it to create an inverted index of the Reuters corpus. The application performs tokenization, stop-word removal, stemming, and inverted indexing, and generates the files "titleIndex.dat" and "testIndex.dat". Both files are read in the query process to show the results. The process is described in Fig. 1.

A. Creating Inverted Index
The first step is creating the inverted index. The application reads the corpus and processes the content into words (tokenize), removes stop-words, and strips affixes (stemming). It then builds the posting-list (inverted index); for VSM, the posting-list is additionally provided with term frequency (tf) and inverse document frequency (idf). Finally, it writes all document titles and the inverted index into files. The process is described in Fig. 2. The four programs compared in Table II are:

Boolean Retrieval
Create index. Python source code: "createIndex.py". Function: creates the mapping of terms to their posting-lists (inverted index) from the available file collection. Input: the files/documents of a corpus. Output: "testIndex.dat", containing terms and their posting-lists (document IDs and index positions), and "titleIndex.dat", containing document IDs and their names.
Query index. Python source code: "queryIndex.py". Function: displays search results. Input: keywords and the files "testIndex.dat" and "titleIndex.dat". Output: an unranked list of search results.

Vector Space Model
Create index. Python source code: "createIndex_tfidf.py". Function: creates the mapping of terms to their posting-lists (inverted index) from the available file collection, and additionally provides term frequency (tf) and inverse document frequency (idf). Input: the files/documents of a corpus. Output: "testIndex.dat", containing terms, their posting-lists (document IDs and index positions), tf, and idf, and "titleIndex.dat", containing document IDs and their names.
Query index. Python source code: "queryIndex_tfidf.py". Function: displays search results. Input: keywords and the files "testIndex.dat" and "titleIndex.dat". Output: a list of search results ranked by relevance (the most relevant documents first).
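The file-writing step can be sketched as follows. This is a minimal illustration, assuming the line format that appears in the paper's worked query example ("term|doc:pos,...;doc:pos,...|tf,...|idf"); the actual layouts produced by "createIndex.py" and "createIndex_tfidf.py" may differ in detail, and the file names in the demo are hypothetical:

```python
import os
import tempfile

def write_index(title_path, index_path, titles, index, tf=None, idf=None):
    """Write titleIndex.dat (document IDs and names) and testIndex.dat.
    Each testIndex.dat line holds a term and its posting-list; for the VSM
    variant, two extra fields hold the tf values and the idf value."""
    with open(title_path, "w") as f:
        for doc_id, name in titles.items():
            f.write(f"{doc_id} {name}\n")
    with open(index_path, "w") as f:
        for term, postings in index.items():
            plist = ";".join(f"{d}:{','.join(map(str, p))}"
                             for d, p in postings.items())
            line = f"{term}|{plist}"
            if tf is not None and idf is not None:  # VSM variant only
                line += "|" + ",".join(f"{tf[term][d]:.4f}" for d in postings)
                line += f"|{idf[term]:.4f}"
            f.write(line + "\n")

# Demo using the posting-list quoted later in the paper for the term 'digit'
# (document names are invented for the demo):
tmp = tempfile.mkdtemp()
titles = {224: "reut-224.txt", 1090: "reut-1090.txt"}
index = {"digit": {224: [90], 1090: [46]}}
tf = {"digit": {224: 0.0392, 1090: 0.0556}}
idf = {"digit": 736.5}
write_index(os.path.join(tmp, "titleIndex.dat"),
            os.path.join(tmp, "testIndex.dat"), titles, index, tf, idf)
print(open(os.path.join(tmp, "testIndex.dat")).read().strip())
# digit|224:90;1090:46|0.0392,0.0556|736.5000
```

The printed line matches the posting-list format used in the querying example later in the paper.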
In the VSM method, the term frequency (tf) is obtained from the number of times a term appears in a document using the formula

tf(term,doc) = N(term,doc),

where N(term,doc) is the frequency of occurrence of the word in the document. If the term 'authoris' appears 7 times in document A, then tf(authoris,A) = 7. Now assume that document A contains 7 occurrences of 'authoris' and that document B contains exactly twice the content of A, including 14 occurrences of 'authoris'; then tf(authoris,A) = 7 and tf(authoris,B) = 14, although both documents carry the same meaning. To compensate, the term frequency is normalized by

tf(term,doc) = N(term,doc) / E(doc),

where E(doc) denotes the Euclidean norm of the document's term-count vector. It has been implemented by Ref. [15]. In searching for two or more terms, such as 'authoris buckey', it is necessary to rank the two terms, and ranking based on term frequency alone is not sufficient because different terms can produce the same term frequency. It is also necessary to calculate the document frequency (df), the number of documents that contain the term. The more documents contain the term, the less important the term becomes for distinguishing documents, because the term is too common. Therefore, an inverse document frequency is defined according to the formula

idf(term) = N(doc) / df(term),

where N(doc) is the total number of documents in the corpus. The statistic idf(term) reflects how important a word is to a document in a corpus. The statistic is often expressed on the log scale by

idf(term) = log( N(doc) / df(term) ).
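The tf and idf computations above can be checked numerically with a short sketch (log base 10, which matches the idf values in the paper's 'authoris'/'buckey' example; the Euclidean normalization follows the E(doc) definition):

```python
import math

def normalized_tf(counts):
    """counts: raw term counts of one document; returns tf normalized by the
    Euclidean norm E(doc) of the document's term-count vector."""
    e_doc = math.sqrt(sum(c * c for c in counts.values()))
    return {t: c / e_doc for t, c in counts.items()}

def idf(n_docs, df):
    """Inverse document frequency on the log scale (base 10)."""
    return math.log10(n_docs / df)

# Document B is exactly twice document A, so after normalization the tf of
# 'authoris' agrees even though the raw counts (7 vs 14) differ:
doc_a = {"authoris": 7, "other": 2}
doc_b = {"authoris": 14, "other": 4}
print(round(normalized_tf(doc_a)["authoris"], 4) ==
      round(normalized_tf(doc_b)["authoris"], 4))   # True

# idf for a corpus of 100 documents, as in the paper's example:
print(round(idf(100, 7), 3))    # 1.155
print(round(idf(100, 14), 3))   # 0.854
```

The two idf values reproduce the paper's log(100/7) = 1.155 and log(100/14) = 0.854.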

B. Querying Index
Querying the index is done by reading the files that contain the document titles and the inverted index. A query can be a single word, multiple words (free text), or exact (phrase) words. The query process is described in Fig. 3. Using the log-scale formula with a corpus of 100 documents, we obtain idf(authoris) = log(100/7) = 1.155 and idf(buckey) = log(100/14) = 0.854. With these values, the documents containing the term 'authoris' are still more important than those containing 'buckey', but the gap is not as harsh.
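The three query types (single word, free text, phrase) can be sketched over a small positional index. This is a simplified stand-in for "queryIndex.py": only the free-text type ("ftq") is named in the paper's example, so the single-word and phrase variants below are illustrative, and the sample posting-lists are taken from the paper's 'digit'/'controversi' example:

```python
# index: term -> {doc_id: [positions]}
index = {
    "digit":       {224: [90], 1090: [46]},
    "controversi": {46: [337, 475], 984: [28], 1090: [24]},
}

def one_word(index, term):
    """Single-word query: the documents in the term's posting-list."""
    return set(index.get(term, {}))

def free_text(index, terms):
    """Free-text query: union of the posting-lists of all query terms."""
    docs = set()
    for t in terms:
        docs |= one_word(index, t)
    return docs

def phrase(index, terms):
    """Phrase query: documents where the terms occur at consecutive positions."""
    if not terms:
        return set()
    docs = set.intersection(*(one_word(index, t) for t in terms))
    hits = set()
    for d in docs:
        # Try each start position of the first term; the i-th term must
        # appear exactly i positions later.
        if any(all(p + i in index[t][d] for i, t in enumerate(terms))
               for p in index[terms[0]][d]):
            hits.add(d)
    return hits

print(sorted(free_text(index, ["digit", "beye", "controversi"])))
# [46, 224, 984, 1090]
```

The free-text query returns the union of the candidate documents; 'beye' contributes nothing because it has no posting-list, mirroring the worked example.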

III. RESULTS AND DISCUSSION

III-2. Querying Index
For example, we search for the sentence 'digit beye controversi' (searching type: ftq). From the file "testIndex.dat", the posting-list of each term is obtained:

digit|224:90;1090:46|0.0392,0.0556|736.5000
controversi|46:337,475;984:28;1090:24|0.0420,0.0340,0.0556|491.0000

In the VSM method, there is a ranking process for the relevant documents. Assume term-index 0 for the term digit, term-index 1 for the term beye, and term-index 2 for the term controversi. The query vector holds the idf value of each term: queryVector[0] = 736.5000, queryVector[1] = 0 (since there is no idf for the term beye), and queryVector[2] = 491.0000, so queryVector = [736.5000, 0, 491.0000]. This is implemented in the file "queryIndex_tfidf.py". The ranking of the relevant documents is obtained from the dot product of each document vector with the query vector:

rank = sum over terms of tf(term,doc) × idf(term)

For example, the document vector of DocID 984 is [0, 0, 0.0340], and its element-wise product with the query vector [736.5000, 0, 491.0000] is [0 × 736.5000, 0 × 0, 0.0340 × 491.0000] = [0, 0, 16.694], which sums to 16.694. The dot products measure the closeness of the document vectors (DocID: 46, 224, 984, and 1090) to the query vector: the docVector closest to the queryVector is the most relevant document, indicated by a smaller cosine angle towards the queryVector (Fig. 4), or equivalently a greater summed dot-product value. To obtain the ranking of relevant documents, the summed dot-product values are sorted in descending order. From the results, DocID 1090 (68.249) is the most relevant document, followed by DocID 224 (28.87), DocID 46 (20.622), and DocID 984 (16.694). The sorting and the dot-product function are implemented in the file "queryIndex_tfidf.py". To display the vectors in 3D Cartesian coordinates, the docVector values often need to be adjusted (when too large) by a common multiplication factor; this does not change the angles of the vectors, it only pulls the vertices towards the origin. The adjusted values are displayed in Table III and drawn in Fig. 4.

In addition to the comparison of the concepts and the algorithms above, a performance test is conducted using the Reuters corpus. The data contain 2610 files, divided into 4 collection sets on a size scale of 0.5 MB (Table IV); set 1 is a subset of sets 2, 3, and 4. Each performance test is conducted 12 times, the highest and lowest results are removed, and the remaining 10 results are averaged. The results of the performance tests are presented in the following tables.
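The worked ranking for the query 'digit beye controversi' can be reproduced numerically. The tf and idf values below are taken from the posting-lists quoted in the example, and `dot_product` mirrors the paper's `dotProduct` function:

```python
# Term order: 0 = digit, 1 = beye, 2 = controversi.
query_vector = [736.5, 0.0, 491.0]          # idf per term (beye has no idf)
doc_vectors = {
    46:   [0.0,    0.0, 0.0420],
    224:  [0.0392, 0.0, 0.0],
    984:  [0.0,    0.0, 0.0340],
    1090: [0.0556, 0.0, 0.0556],
}

def dot_product(vec1, vec2):
    """Sum of element-wise products, as in the paper's dotProduct."""
    return sum(x * y for x, y in zip(vec1, vec2))

# Score every candidate document and sort descending by score.
scores = sorted(((dot_product(v, query_vector), d)
                 for d, v in doc_vectors.items()), reverse=True)
for score, doc in scores:
    print(doc, round(score, 4))
# 1090 68.249
# 224 28.8708
# 46 20.622
# 984 16.694
```

The output reproduces the ranking in the text: DocID 1090 is the most relevant, followed by 224, 46, and 984.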
The results of creating the inverted index are shown in Table V. They show that the time required to create the inverted index increases with the file collection size. The time required by the VSM method is just slightly lower than that of the BR method.
The results for the generated file "testIndex.dat" show that the larger the collection, the more indexes are generated. However, the percentage of generated index entries relative to the total number of words in the collection decreases, because words (terms) appearing in later documents have already been indexed from earlier documents. The BR and VSM methods show similar results, confirming the consistency of the algorithms. They also show that the larger the collection, the greater the size of the generated index file. The BR method generates an index file smaller than the collection of source files, while the VSM method generates an index file larger than the collection of source files (see Table VI).
The test results of reading/querying the index are shown in Table VII. They suggest that the larger the collection, the longer it takes to read the previously generated index file, the longer it takes to perform the query, and the more search results are found. The BR method requires less time than the VSM method, and both show similar results, confirming the consistency of the algorithms. The results of the VSM method are already ranked using the tf-idf formula (term frequency with inverse document frequency).
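The timing protocol used for these tables (12 runs, discard the highest and lowest, average the remaining 10) can be sketched as follows; the workload in the demo is a hypothetical stand-in for an index-reading or query call:

```python
import time

def timed_trimmed_mean(func, runs=12, trim=1):
    """Time func over `runs` executions, drop the `trim` slowest and `trim`
    fastest wall-clock times, and return the mean of the rest, mirroring
    the paper's measurement protocol."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        times.append(time.perf_counter() - start)
    times.sort()
    kept = times[trim:len(times) - trim]
    return sum(kept) / len(kept)

# Example: time a small dummy workload standing in for an index query.
avg = timed_trimmed_mean(lambda: sum(range(100_000)))
print(f"average of the 10 middle runs: {avg:.6f} s")
```

Trimming one result from each end makes the average robust against one-off outliers such as operating-system scheduling hiccups.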
In summary, the process of creating the index and querying the term 'digit beye controversi' can be demonstrated through the results presented in Tables VIII and IX for a study of 1473 documents.
The results are also consistent with the generated file. The ranking of document relevance is calculated through the dot product; the results are displayed in Table X.
The results also match the output of the executed application.

IV. CONCLUSIONS AND FUTURE WORK
The algorithms implemented in Python for both BR and VSM work properly when compared against manual calculations. Using the Reuters corpus (2610 documents, 2 MB), we find no significant time difference between the two methods in creating the inverted index; the differences come from querying the index. The number of generated indexes, the generated file size, the duration of reading and searching the index, and the number of results found grow in line with the number and size of the files in the corpus. Ultimately, choosing between the BR and VSM methods depends on whether document ranking is needed.
The comparison of BR and VSM implementations can be applied in many other fields. Further research may try non-Latin languages, such as Chinese or Arabic [3, 9]. Larger corpora, such as the Reuters Corpora (RCV1, RCV2, TRC2) [16], can be used in future experiments and combined with MapReduce (Hadoop) [17, 18]. Validation of recall, precision, accuracy, F-measure (F-score), and error rate are also interesting topics for future work [3, 11].

Fig. 2. Flowchart of Creating Inverted Index.
The query-vector construction is implemented in the file "queryIndex_tfidf.py" as follows:

for termIndex, term in enumerate(terms):
    ...
    queryVector[termIndex] = self.idf[term]

The posting-lists of the terms digit and controversi together cover the DocIDs [46, 224, 984, 1090]. The document vectors are then generated from the term frequency of each term per DocID. For term-index 0 (term digit), DocIDs 224 and 1090 give docVector[224][0] = 0.0392 and docVector[1090][0] = 0.0556. For term-index 2 (term controversi), DocIDs 46, 984, and 1090 give docVector[46][2] = 0.0420, docVector[984][2] = 0.0340, and docVector[1090][2] = 0.0556. This is implemented in the file "queryIndex_tfidf.py" as follows:

for termIndex, term in enumerate(terms):
    ...
    for docIndex, (doc, postings) in \
            enumerate(self.index[term]):
        if doc in docs:
            docVectors[doc][termIndex] = \
                self.tf[term][docIndex]

The resulting dot products are: docVector[46] · queryVector = 20.622, docVector[224] · queryVector = 28.8708, docVector[984] · queryVector = 16.694, and docVector[1090] · queryVector = 68.249.
The scoring, the sorting, and the dot-product function are implemented in the file "queryIndex_tfidf.py" as follows:

docScores = [[self.dotProduct(curDocVec, queryVector), doc]
             for doc, curDocVec in docVectors.iteritems()]
docScores.sort(reverse=True)

def dotProduct(self, vec1, vec2):
    ...
    return sum([x * y for x, y in zip(vec1, vec2)])

Table II. A comparison of creating and querying index.

Table VI. The size of the generated index file in KB.

Table VII. The test of reading and searching index.

Table VIII. Term appearance in documents: the number of times each term appears, with tf = N(term,doc)/E(doc) (term frequency).