Example: A Search Engine - Maple Programming Help


Example: A Search Engine

Introduction

 Search engines are used in a variety of contexts to locate documents that match or satisfy a query. Queries consist of sequences of search terms -- words that are meant to prescribe the subject matter of matching documents. Examples of search engines include Web search engines and database interfaces. However, search engines can be adapted to search a wide variety of structured or unstructured data.
 This example illustrates how a simple search engine, based on the vector space model, can be constructed using Maple. Various Maple packages are used to pre- and post-process the data to be searched, as well as the queries, and the LinearAlgebra package is used to effect the computations necessary to locate relevant documents.

Introduction to Searching

 Prototypically, a document is a file of structured or unstructured text, but this section treats documents as abstract data items.  A document can be a scientific article in LaTeX format, a web page, an integral formula, an image, or a simple string of text, as used here for illustrative purposes.  The document is a raw piece of data that can be identified as a whole.  Each document is equipped with a document ID, that is, an identifier that is used as a handle for the document data. The document ID is small, while the document (containing data) may be arbitrarily large.
 For searching, documents are organized into corpora.  A corpus is a collection of documents that can be searched collectively using search terms.  A query is a list of search terms. A search term is typically a (natural language) word or phrase. The purpose of a search is to identify, among documents in a given corpus, those which match the search terms. Thus, search input consists of a corpus to search, and one or more search terms.  The output is a ranked list of documents that the search criteria judge to be relevant to the search terms.
 For this example, consider the following simple corpus of documents. Each document is a short string of text, which serves as both the document ID and the document data.
 > doc1 := "The mathematician's patterns, like the painter's "   "or the poet's must be beautiful; the ideas, like "   "the colors or the words must fit together in a "   "harmonious way. Beauty is the first test: there "   "is no permanent place in this world "   "for ugly mathematics.": # Hardy
 > doc2 := "God does arithmetic.":        # Karl Friedrich Gauss
 > doc3 := "Anyone  who  cannot cope with mathematics is not"   " fully human.  At best he is a tolerable subhuman "   "who has learned to wear shoes, bathe, and not make "   "messes in the house.": # Robert A. Heinlein
 > doc4 := "Things should be made as simple as possible, "   "but no simpler.":
 > doc5 := "I don't believe in mathematics.":
 > doc6 := "Imagination is more important than knowledge.":
 > doc7 := "The most beautiful thing we can experience is "   "the mysterious.  It is the source of all true "   "art and science.":
 > doc8 := "Common sense is the collection of prejudices "   "acquired by age eighteen.":
 > doc9 := "God does not care about our mathematical "   "difficulties. He integrates empirically.": # A. Einstein
 The corpus consists of these documents. To facilitate searches, the corpus is preprocessed to construct a search index.  The index aids searches by making it easy to quickly locate documents in the corpus that are relevant to one or more of the search terms in a query.

Inverted Term Occurrence Indexing

 An inverted term occurrence index is a simple search index. It records every potential search term present in the corpus and, for each term, maintains the set of documents in which that term occurs. Construction of an inverted term occurrence index begins with a document-term mapping. For simple text strings, construct a list of document terms as follows.
 > DocumentTerms := proc( text::string )   StringTools:-Words( text ) end proc:
 > DocumentTerms( doc1 );
 $\left[{"The"}{,}{"mathematician\text{'}s"}{,}{"patterns"}{,}{"like"}{,}{"the"}{,}{"painter\text{'}s"}{,}{"or"}{,}{"the"}{,}{"poet\text{'}s"}{,}{"must"}{,}{"be"}{,}{"beautiful"}{,}{"the"}{,}{"ideas"}{,}{"like"}{,}{"the"}{,}{"colors"}{,}{"or"}{,}{"the"}{,}{"words"}{,}{"must"}{,}{"fit"}{,}{"together"}{,}{"in"}{,}{"a"}{,}{"harmonious"}{,}{"way"}{,}{"Beauty"}{,}{"is"}{,}{"the"}{,}{"first"}{,}{"test"}{,}{"there"}{,}{"is"}{,}{"no"}{,}{"permanent"}{,}{"place"}{,}{"in"}{,}{"this"}{,}{"world"}{,}{"for"}{,}{"ugly"}{,}{"mathematics"}\right]$ (1)
 Using this, construct an inverted term occurrence index.
 > BuildIndex := proc( corpus::list(string) )
     local docterms, corpusterms, index, term, doc;
     use StringTools in
       # Construct all terms in the corpus.
       docterms := table( [seq]( doc = DocumentTerms( doc ), doc = corpus ) );
       corpusterms := `union`( seq( {op}( docterms[ doc ] ), doc = corpus ) );
       # Map each term to the set of documents containing it.
       index := table();
       for doc in corpus do
         for term in docterms[ doc ] do
           if assigned( index[ term ] ) then
             index[ term ] := index[ term ] union { doc };
           else
             index[ term ] := { doc };
           end if;
         end do;
       end do;
     end use;
     # Return the table.
     eval( index, 1 );
   end proc:
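 The same indexing scheme can be sketched outside Maple as well. The following Python sketch (for illustration only; the whitespace tokenizer is a naive stand-in for StringTools:-Words, and the corpus is a hypothetical two-document fragment) builds the term-to-documents mapping and performs a union-of-postings search.

```python
def document_terms(text):
    """Split a document into terms; punctuation handling is omitted."""
    return text.lower().split()

def build_index(corpus):
    """Map each term to the set of document IDs in which it occurs."""
    index = {}
    for doc_id, text in corpus.items():
        for term in document_terms(text):
            index.setdefault(term, set()).add(doc_id)
    return index

def search(index, query_terms):
    """Return the union of the posting sets for each query term."""
    results = set()
    for term in query_terms:
        results |= index.get(term, set())
    return results

corpus = {
    "doc2": "god does arithmetic",
    "doc5": "i don't believe in mathematics",
}
index = build_index(corpus)
print(search(index, ["arithmetic", "mathematics"]))  # both documents match
```

 As in the Maple version, a term absent from the index simply contributes nothing to the result set.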
 > Index := BuildIndex( [ doc || ($1..9) ] ):
 > nops( {indices}( Index ) );
 ${104}$ (2)
 Searching with this index is simple. Search for the terms "mathematical" and "beauty".
 > Search := proc( query::list(string) )
     global Index;
     local results, term;
     results := {};
     for term in query do
       if assigned( Index[ term ] ) then
         results := results union Index[ term ];
       end if;
     end do;
     results;
   end proc:
 > Search( [ "mathematical", "beauty" ] );
 $\left\{{"God does not care about our mathematical difficulties. He integrates empirically."}\right\}$ (3)
 > nops( (3) );
 ${1}$ (4)
 This index has several problems. One problem is that the index is quite large relative to the size of the corpus. Many words that occur in the corpus convey little or no information about the content of the documents in which they occur. This can lead to irrelevant results, especially for poorly chosen search terms.
 > nops( Search( [ "the" ] ) );
 ${4}$ (5)
 This problem can be solved by removing unimportant terms from the index. This set of unwanted terms is called a stop list.
 > STOP_WORDS := { "a", "i", "an", "the", "in", "to", "which", "that", "is", "and", "we", "it", "of", "all", "can", "does", "don't", "most", "true", "thing" }:
 A second problem is synonymy in the corpus. That is, many distinct terms in the corpus have the same meaning in the context of searching. For example, searching for the term "mathematics" should return documents that contain the terms "mathematical", "math", and "mathematically", even if the exact term "mathematics" does not occur in the document. To solve this problem, map each term onto its stem, or root term.
 To implement these changes in the DocumentTerms procedure, use the StringTools package.
 > DocumentTerms := proc( text::string )
     global STOP_WORDS;
     local words;
     use StringTools in
       words := map( LowerCase, Words( text ) );
       words := remove( type, words, STOP_WORDS );
       map( Stem, words );
     end use;
   end proc:
 The LowerCase function removes case distinctions, so that matching is case-insensitive. Apply the same preprocessing to search terms: convert them to lowercase, remove stop words, and map the remaining words onto their stems.
 > Search := proc( query::list(string) )
     global Index;
     local results, term, terms;
     use StringTools in
       terms := map( LowerCase, query );
       terms := remove( type, terms, STOP_WORDS );
       terms := map( Stem, terms );
     end use;
     results := {};
     for term in terms do
       if assigned( Index[ term ] ) then
         results := results union Index[ term ];
       end if;
     end do;
     results;
   end proc:
 Because BuildIndex uses DocumentTerms, rebuild the index.
 > Index := BuildIndex( [ doc || ($1..9) ] ):
 > nops( {indices}( Index ) );
 ${83}$ (6)
 This has substantially reduced the size of the index.
 > Search( [ "the" ] );
 ${\varnothing }$ (7)
 Because the stop word "the" has been removed from the index, it returns no matches. This is also a user interface enhancement because queries can contain stop words without affecting the results.
 > Search( [ "mathematical", "beauty" ] );
 $\left\{{"I don\text{'}t believe in mathematics."}{,}{"God does not care about our mathematical difficulties. He integrates empirically."}{,}{"The most beautiful thing we can experience is the mysterious. It is the source of all true art and science."}{,}{"Anyone who cannot cope with mathematics is not fully human. At best he is a tolerable subhuman who has learned to wear shoes, bathe, and not make messes in the house."}{,}{"The mathematician\text{'}s patterns, like the painter\text{'}s or the poet\text{'}s must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics."}\right\}$ (8)
 > nops( (8) );
 ${5}$ (9)
 The new process returns many more documents relevant to the query.
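 The lowercase/stop-word/stem pipeline above can be sketched as follows. This is an illustrative Python sketch: the suffix-stripping `stem` is a deliberately naive stand-in for StringTools:-Stem (its output will not always agree with Maple's stemmer), and the stop list is abbreviated.

```python
STOP_WORDS = {"a", "i", "an", "the", "in", "to", "is", "and", "of"}

def stem(word):
    # Extremely naive suffix stripper, for illustration only.
    for suffix in ("ical", "ics", "ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def normalize(words):
    """Lowercase, drop stop words, then stem -- the same three steps
    applied to both documents and queries."""
    words = [w.lower() for w in words]
    words = [w for w in words if w not in STOP_WORDS]
    return [stem(w) for w in words]

print(normalize(["The", "mathematical", "beauty"]))  # → ['mathemat', 'beauty']
```

 Applying the identical normalization to documents and to queries is what makes stemmed terms such as "mathemat" match across both.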

The Vector Space Model

 It is possible to model the search by using a vector space model of the term-document indices for corpora. After collecting the relevant search terms for a corpus, represent each document by a vector in a Euclidean space ${E}^{n}$, where n is the number of distinct terms in the corpus. Each coordinate in this vector space represents a term in the corpus. First determine an ordering of the corpus terms. Then map each document to a term vector in this Euclidean space. A simple mapping is to represent a document by a term vector whose $i$th coordinate is equal to 1 if the $i$th term occurs in the document, and is equal to 0 otherwise. For each query, construct a term vector. The dot product of the query term vector with a document term vector is the number of query terms that appear in the document. A document containing more query terms might reasonably be judged more relevant than one containing fewer query terms. To rank the documents, sort the dot products of the query term vector with the document term vectors.
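 The binary model just described can be sketched as follows (an illustrative Python sketch over a hypothetical mini-corpus; the stemmed term names mirror those appearing later in this section).

```python
def term_vector(terms, corpus_terms):
    """0/1 vector over a fixed ordering of corpus terms."""
    return [1 if t in terms else 0 for t in corpus_terms]

# Fixed term ordering and two documents, given as sets of stemmed terms.
corpus_terms = ["arithmet", "beauti", "believ", "god", "mathemat"]
docs = {
    "doc2": {"god", "arithmet"},
    "doc5": {"believ", "mathemat"},
}
query = {"mathemat", "beauti"}

# Rank documents by the dot product with the query vector: the score is
# the number of query terms that appear in the document.
qv = term_vector(query, corpus_terms)
scores = {
    doc_id: sum(q * d for q, d in zip(qv, term_vector(terms, corpus_terms)))
    for doc_id, terms in docs.items()
}
print(scores)  # doc5 shares one query term with the query, doc2 none
```

 Sorting the documents by these dot products yields the ranked result list.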
 For example, consider a smaller corpus comprising only the documents doc2, doc5, and doc7.
 > SmallIndex := BuildIndex( [ doc2, doc5, doc7 ] ):
 The ordered list of corpus terms is:
 > SmallCorpusTerms := sort( [op]( map( op, {indices}(      SmallIndex ) ) ) );
 ${\mathrm{SmallCorpusTerms}}{≔}\left[{"arithmet"}{,}{"art"}{,}{"beauti"}{,}{"believ"}{,}{"experi"}{,}{"god"}{,}{"mathemat"}{,}{"mysteri"}{,}{"scienc"}{,}{"sourc"}\right]$ (10)
 Note: Terms in the documents are replaced by their stems, and stop words are removed from the index.
 The document term vector can be computed using the following procedure.
 > DocumentTermVector := proc( doc )
     global SmallCorpusTerms;
     local terms;
     terms := DocumentTerms( doc );
     Vector[row]( 1 .. nops( SmallCorpusTerms ),
       i -> `if`( member( SmallCorpusTerms[ i ], terms ), 1, 0 ) );
   end proc:
 > doc5Vector := DocumentTermVector( doc5 );
 ${\mathrm{doc5Vector}}{≔}\left[\begin{array}{cccccccccc}{0}& {0}& {0}& {1}& {0}& {0}& {1}& {0}& {0}& {0}\end{array}\right]$ (11)
 Compute the query term vector for the search term "mathematical beauty".
 > queryVector := DocumentTermVector( "mathematical beauty" );
 ${\mathrm{queryVector}}{≔}\left[\begin{array}{cccccccccc}{0}& {0}& {1}& {0}& {0}& {0}& {1}& {0}& {0}& {0}\end{array}\right]$ (12)
 The inner product is:
 > queryVector . doc5Vector;
 ${1}$ (13)
 which indicates that one of the query terms (corresponding to "mathematical") appears in the document.
 To rank the documents in the corpus for their relevance to the query, apply this process to each document in the corpus.
 > queryVector . DocumentTermVector( doc2 ); queryVector . DocumentTermVector( doc5 ); queryVector . DocumentTermVector( doc7 );
 ${0}$
 ${1}$
 ${1}$ (14)
 It is more efficient to represent the entire corpus by a term-document matrix, in which the rows are the term vectors for the documents in the corpus. First determine a fixed ordering of the documents in the corpus. The dot products can then be computed by forming the product of the term-document matrix representing the corpus with the term vector for a query.
 > TermDocumentMatrix := Matrix(   map( [s->s], map( DocumentTermVector,     [ doc2, doc5, doc7 ] ) ) );
 ${\mathrm{TermDocumentMatrix}}{≔}\left[\begin{array}{cccccccccc}{1}& {0}& {0}& {0}& {0}& {1}& {0}& {0}& {0}& {0}\\ {0}& {0}& {0}& {1}& {0}& {0}& {1}& {0}& {0}& {0}\\ {0}& {1}& {1}& {0}& {1}& {0}& {0}& {1}& {1}& {1}\end{array}\right]$ (15)
 > Scores := TermDocumentMatrix . queryVector^+;
 ${\mathrm{Scores}}{≔}\left[\begin{array}{c}{0}\\ {1}\\ {1}\end{array}\right]$ (16)
 There is a geometric interpretation of the search process. The dot product of two vectors, $u$ and $v$, is related to the angle between them: the cosine of that angle is given by the formula

$\frac{⟨u,v⟩}{||u||·||v||}$

 Consider the projection of the document term vectors onto the hyperplane defined by the component vectors of the query vector, and then introduce appropriate scaling, using vector norms. In this context, searching can be viewed as a process of determining those vectors representing documents in a corpus for which the angle between their projections and the query vector is small.
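 This geometric view can be sketched directly. The following Python sketch computes the cosine of the angle between the query vector and the term vector for doc5, using the vectors shown in (11) and (12).

```python
import math

def cosine(u, v):
    """Cosine of the angle between u and v: <u,v> / (||u|| * ||v||)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

doc5 = [0, 0, 0, 1, 0, 0, 1, 0, 0, 0]   # document term vector, as in (11)
query = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]  # query term vector, as in (12)
print(cosine(query, doc5))  # ≈ 0.5: one shared term, each vector has norm sqrt(2)
```

 A larger cosine means a smaller angle, so ranking by cosine orders documents by how closely their term vectors point in the query's direction.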

Term Weighting

 A document that contains many instances of a given term is generally considered more relevant to queries containing that term than a document with fewer instances. To improve rankings, therefore, record not only the presence of a term in a document, but also its frequency. This is a simple change to the DocumentTermVector procedure.  The term vector of a document becomes the vector in ${E}^{n}$ whose $i$th entry is equal to the number of times the $i$th corpus term occurs in the document.
 > DocumentTermVector := proc( doc )
     global SmallCorpusTerms;
     local terms;
     terms := DocumentTerms( doc );
     Vector[row]( 1 .. nops( SmallCorpusTerms ),
       i -> numboccur( terms, SmallCorpusTerms[ i ] ) );
   end proc:
 This can lead to significantly improved results when searching a larger corpus.
 To improve this method, scale the number of instances by the size (the total number of terms with multiplicities) of the document. For example, a book about cats is not more relevant than a short paper on cats merely because the term "cats" appears more often in the book than in the short article.
 > DocumentTermVector := proc( doc )
     global SmallCorpusTerms;
     local terms;
     terms := DocumentTerms( doc );
     Vector[row]( 1 .. nops( SmallCorpusTerms ),
       i -> evalf( numboccur( terms, SmallCorpusTerms[ i ] ) / nops( terms ) ) );
   end proc:
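 The length-scaled frequency weighting can be sketched as follows (illustrative Python; the term lists are hypothetical stemmed documents, not drawn from the corpus above).

```python
def tf_vector(terms, corpus_terms):
    """The i-th entry is the count of the i-th corpus term in the
    document, scaled by the document's length so that long documents
    are not favored merely for being long."""
    n = len(terms)
    return [terms.count(t) / n for t in corpus_terms]

corpus_terms = ["believ", "mathemat"]
doc_short = ["believ", "mathemat"]           # short document, both terms once
doc_long = ["mathemat"] * 3 + ["other"] * 9  # long document, term repeated

print(tf_vector(doc_short, corpus_terms))  # [0.5, 0.5]
print(tf_vector(doc_long, corpus_terms))   # [0.0, 0.25]
```

 Although the long document mentions "mathemat" three times, its scaled weight (0.25) is lower than the short document's (0.5), matching the book-versus-article intuition above.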
 With this change, recompute the term-document matrix and the matrix of scores, which represents the search results. Also recompute the query term vector.
 > TermDocumentMatrix := Matrix( map( [s->s],    map( DocumentTermVector, [ doc2, doc5, doc7 ] ) ) ):
 > queryVector := DocumentTermVector( "mathematical beauty" ):
 > Scores := TermDocumentMatrix . queryVector^+;
 ${\mathrm{Scores}}{≔}\left[\begin{array}{c}{0.}\\ {0.250000000000000}\\ {0.0833333333500000}\end{array}\right]$ (17)
 According to these results, the second document in the corpus (doc5) is the most relevant to the query, followed by the third document (doc7). The first document (doc2) is judged to have no relevance to the query.

Building a Search Engine Package

 The next step is to design a search engine package that includes all the features described to this point. The search engine must also be as generic as possible, and accept a variety of document and term types.
 The package manages two kinds of data objects: documents and corpora. Each is represented by an object that supports certain features.
 A document is abstracted as a record object with three slots. The constructor is:
 > Document := proc( id, fetch, filter )   Record(     ':-id' = id,     ':-fetch' = fetch,     ':-filter' = filter ); end proc:
 The id slot is the document ID. The document ID must uniquely represent the document within the corpus. For efficiency, document IDs are small, while their referents may be quite large.  While enormous documents can be handled in the design, for simplicity, it is assumed that any document can reasonably be read into memory in its entirety.  The fetch slot is a procedure that returns the body of a document, given its document ID.  The filter slot contains a procedure that generates a list of terms appearing in the document, with multiplicities.  Several accessor procedures for documents are also provided in the package.
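 The three-slot record just described can be sketched as follows (an illustrative Python sketch, not the package's actual implementation): an ID, a fetch procedure mapping the ID to the document body, and a filter procedure mapping the body to its list of terms.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Document:
    id: str                          # small handle, unique within the corpus
    fetch: Callable[[str], str]      # document ID -> document body
    filter: Callable[[str], List[str]]  # document body -> list of terms

# For the string corpus used here, the ID is the text itself, fetch is
# the identity, and the filter simply splits on whitespace.
doc = Document(id="God does arithmetic.",
               fetch=lambda i: i,
               filter=lambda body: body.lower().split())
print(doc.filter(doc.fetch(doc.id)))  # ['god', 'does', 'arithmetic.']
```

 Keeping fetch and filter as slots means a corpus of web pages or LaTeX files differs only in the procedures supplied, not in the record shape.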
 A corpus is a collection of documents. It is represented by an object that supports methods for accessing, indexing, and searching the collection.
 > SearchEngine := module()
     description "a simple search engine";
     option package;
     export
       Filters,              # subpackage of filters
       Document,             # document constructor
       DocumentID,           # accessor procedure
       FilterDocument,       # accessor procedure
       FetchDocument,        # accessor procedure
       Corpus,               # corpus constructor
       NumberOfDocuments,    # returns number of documents in corpus
       BuildIndex,           # index a corpus
       GetDocumentIdByIndex, # retrieve a document ID
       Search;               # perform a search
     local
       documentTermVector,   # construct document term vector
       queryTermVector;      # construct query term vector

     # Exports

     # Document constructor
     Document := proc( id, fetch, filter )
       Record(
         ':-id' = id,
         ':-fetch' = fetch,
         ':-filter' = filter );
     end proc;

     # Accessor routines for documents.
     DocumentID := doc -> doc:-id;
     FetchDocument := doc -> doc:-fetch( doc:-id );
     FilterDocument := doc -> doc:-filter( FetchDocument( doc ) );

     # Corpus constructor. Called with a list of document IDs, a
     # fetcher, a document filter routine, and an optional query
     # filter routine.
     Corpus := proc( listOfIds::list, fetch, filter, _qfilter )
       local docs, qfilter;
       # Process arguments.
       if nargs = 0 then
         error "expecting corpus description";
       elif nargs > 3 then
         # Allow the query filter to differ from the document filter.
         qfilter := eval( _qfilter, 2 );
       else
         # If no query filter is specified, use the document filter.
         qfilter := eval( filter, 2 );
       end if;
       # Construct the list of documents.
       docs := map( Document, listOfIds, fetch, filter );
       # Build the corpus.
       module()
         export search, buildIndex,
           numberOfDocuments,
           getDocumentIdByIndex;
         local ids, corpusTerms,
           documents,
           term_document_matrix;

         ids := listOfIds;
         documents := docs;

         numberOfDocuments := () -> nops( documents );

         getDocumentIdByIndex := proc( idx::posint )
           if idx <= numberOfDocuments() then
             ids[ idx ];
           else
             error "there are fewer than %1 documents in the corpus", idx;
           end if;
         end proc;

         buildIndex := proc()
           local docterms;
           # Construct the corpus terms.
           docterms := map( FilterDocument, documents );
           corpusTerms := sort( [op]( `union`( op( map( {op}, docterms ) ) ) ) );
           # Build the term-document matrix.
           term_document_matrix := Matrix( map( [s -> s],
             map( documentTermVector, docs, corpusTerms ) ),
             'datatype' = 'float'[ 8 ], 'storage' = 'sparse' );
           eval( thismodule, 1 );
         end proc;

         search := proc( query, numberOfResults::posint )
           local qt, qv, scores;
           if not type( term_document_matrix, 'Matrix' ) then
             error "corpus not yet indexed";
           end if;
           qt := qfilter( query );
           qv := queryTermVector( qt, corpusTerms );
           scores := (term_document_matrix . qv)^+;
         end proc;
       end module;
     end proc;

     NumberOfDocuments := corpus -> corpus:-numberOfDocuments();
     GetDocumentIdByIndex := ( corpus, idx )
       -> corpus:-getDocumentIdByIndex( idx );
     BuildIndex := corpus -> corpus:-buildIndex();
     Search := ( corpus, query ) -> corpus:-search( query );

     # Locals
     documentTermVector := proc( doc, corpusTerms::list )
       local terms;
       terms := FilterDocument( doc );
       Vector[row]( 1 ..
 nops( corpusTerms ),
         i -> evalf( numboccur( terms, corpusTerms[ i ] ) / nops( terms ) ),
         'datatype' = 'float'[ 8 ],
         'storage' = 'sparse' );
     end proc;

     queryTermVector := proc( queryTerms::list, corpusTerms::list )
       Vector[column]( 1 .. nops( corpusTerms ),
         i -> evalf( numboccur( queryTerms, corpusTerms[ i ] ) / nops( queryTerms ) ),
         'datatype' = 'float'[ 8 ],
         'storage' = 'sparse' );
     end proc;

     # The Filters subpackage
     Filters := module()
       description "filter subpackage";
       option package;
       export Text;
       local stopWords;
       stopWords := { "a", "i", "an", "the", "in", "to",
         "which", "that", "is", "and", "we", "it", "of",
         "all", "can", "does", "don't", "most", "true",
         "thing" };
       Text := proc( text::string )
         local words;
         use StringTools in
           words := map( LowerCase, Words( text ) );
           words := remove( type, words, stopWords );
           map( Stem, words );
         end use;
       end proc;
     end module;
   end module:
 > with( SearchEngine ):
 > corpus := Corpus( [ doc || ($1..9) ], s -> s, Filters:-Text ):
 > NumberOfDocuments( corpus );
 ${9}$ (18)
 > GetDocumentIdByIndex( corpus, 1 );
 ${"The mathematician\text{'}s patterns, like the painter\text{'}s or the poet\text{'}s must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics."}$ (19)
 > BuildIndex( corpus ):
 > Search( corpus, "mathematical beauty" );
 $\left[\begin{array}{ccccccccc}{0.0483870967750000}& {0.}& {0.0208333333350000}& {0.}& {0.250000000000000}& {0.}& {0.0833333333500000}& {0.}& {0.0500000000000000}\end{array}\right]$ (20)

Latent Semantic Analysis

 The simple vector space model described previously has shortcomings. Suppose a user searches a corpus of documents about pets for the term "cat". The simple vector space model search engine locates all the documents that contain the term "cat". However, it does not locate documents that contain the word "feline" but not the word "cat" (or any term whose stem is "cat"). This issue is called synonymy -- the problem that distinct search terms may represent the same concept. One way to circumvent this problem is to have a human domain expert prepare lists of search terms for each document in the corpus: all documents referring either to "cats" or to "felines" would then include the search term "cat" in their lists. This solution is, however, very expensive.
 An automatic indexing procedure known as latent semantic analysis (LSA) helps to discover relations between terms and documents that are latent in the corpus by analyzing the corpus as a whole. The process is related to factor analysis, and is essentially a statistical technique that relies heavily on linear algebra. When used to prepare a search index for a corpus, LSA is referred to as latent semantic indexing (LSI).
 A thorough discussion of LSI is beyond the scope of this manual, so only the operational details necessary to construct the SearchEngine package are discussed.
 LSI computes a lower rank approximation to the term-document matrix, by using the singular value decomposition of that matrix. The singular value decomposition of a matrix $A$ is a factorization

$A=U·S·{V}^{T}$

 where $U$ and $V$ are column orthogonal matrices, and $S$ is a diagonal matrix whose diagonal entries are arranged in descending order. The diagonal entries of $S$ are called the singular values of $A$, the columns of $U$ are the left singular vectors of $A$, and the columns of $V$ are the right singular vectors of $A$. If $A$ is a term-document matrix, the columns of $U$ can be interpreted as an orthogonal basis for the semantic factors of term-term correlations in the corpus, while the columns of $V$ can be interpreted as an orthogonal basis for the semantic factors of the correlations between documents. For large corpora, the rank of $S$ may be on the order of a few thousand. To obtain the rank $k$ approximation of $A$ that is closest in the least squares sense, choose a rank $k$ smaller than the rank of $S$ (say, on the order of a few hundred), and form the matrix

${A}_{k}={U}_{k}·{S}_{k}·{{V}_{k}}^{T}$

 where ${U}_{k}$ consists of the first $k$ columns of $U$, ${V}_{k}$ consists of the first $k$ columns of $V$, and ${S}_{k}$ is the leading $k×k$ submatrix of $S$.
 When $A$ is a term-document matrix, its approximation ${A}_{k}$ is used as a surrogate corpus index for searching. It can be argued that ${A}_{k}$ better captures the correlations between terms, so that searches using it can approximate the results obtained by human expert indexing.
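 The low-rank approximation idea can be sketched in pure Python for a tiny matrix. Rather than a full SVD, this sketch uses power iteration on ${A}^{T}·A$ to recover the dominant singular triple $(s_1, u_1, v_1)$; the matrix $s_1\,u_1 v_1^T$ is then the closest rank-1 approximation to $A$ in the least squares sense. The 2x2 matrix here is a contrived example chosen to have rank 1, so the approximation reproduces it exactly.

```python
import math

def transpose(A):
    return [list(col) for col in zip(*A)]

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def top_singular_triple(A, iterations=200):
    """Dominant singular value and singular vectors via power iteration
    on A^T.A (illustration only; a real SVD routine is preferable)."""
    At = transpose(A)
    v = [1.0] * len(A[0])
    for _ in range(iterations):
        w = matvec(At, matvec(A, v))       # one power-iteration step
        norm = math.sqrt(sum(c * c for c in w))
        v = [c / norm for c in w]          # normalized right singular vector
    Av = matvec(A, v)
    s = math.sqrt(sum(c * c for c in Av))  # dominant singular value
    u = [c / s for c in Av]                # left singular vector
    return s, u, v

# A rank-1 "term-document" matrix: the second column is a double copy of
# the first, so a single semantic factor explains the whole matrix.
A = [[1.0, 2.0],
     [2.0, 4.0]]
s, u, v = top_singular_triple(A)
A1 = [[s * ui * vj for vj in v] for ui in u]  # rank-1 reconstruction
print(round(s, 6))  # 5.0
```

 For a genuine term-document matrix, one would keep the top $k$ triples instead of one, which is exactly what the SingularValues-based buildIndex below does.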
 For example, in a corpus about pets, some documents may contain the term "cat", others the term "feline", and still others both. Because the two terms co-occur in some documents, LSI places documents containing only one of them closer together in the lower dimensional projection. In practice, the rank $k$ chosen for the approximation is determined empirically, based on the quality of search results. Because LSI is a statistical technique, it generally produces useful results only for large corpora.

The Search Engine Package

 This section modifies the SearchEngine package to incorporate LSI without changing the interface designed for the simpler indexing scheme. The updated package also contains more filters, to allow for a greater variety of corpora.
 > SearchEngine := module()
     description "a generic search engine package";
     option package;
     export
       Filters,              # subpackage of filters
       Document,             # document constructor
       DocumentID,
       FilterDocument,
       FetchDocument,
       Corpus,               # corpus constructor
       NumberOfDocuments,
       BuildIndex,
       GetDocumentIdByIndex,
       Search;
     local
       Tools,                # private submodule
       documentTermVector,
       queryTermVector;

     # Exports

     # Document constructor
     Document := proc( id, fetch, filter )
       description "document constructor";
       Record(
         ':-id' = id,
         ':-fetch' = fetch,
         ':-filter' = filter );
     end proc;

     # Document accessors.
     DocumentID := doc -> doc:-id;
     FetchDocument := doc -> doc:-fetch( doc:-id );
     FilterDocument := doc -> doc:-filter( FetchDocument( doc ) );

     # Corpus constructor. Called with a list of document IDs, a
     # fetcher, a document filter routine, and an optional query
     # filter routine.
     Corpus := proc( listOfIds::list, fetch, filter, _qfilter )
       description "corpus constructor";
       local docs, qfilter;
       # Process arguments.
       if nargs = 0 then
         error "expecting corpus description";
       elif nargs > 3 then
         # Allow the query filter to differ from the document filter.
         qfilter := eval( _qfilter, 2 );
       else
         # If no query filter is specified, use the document filter.
         qfilter := eval( filter, 2 );
       end if;
       # Construct the list of documents.
       docs := map( Document, listOfIds, fetch, filter );
       # Build the corpus.
       module()
         export search, buildIndex,
           numberOfDocuments,
           getDocumentIdByIndex;
         local ids, corpusTerms,
           documents,
           term_document_matrix;

         # Initialize private data.
         ids := listOfIds;
         documents := docs;

         # Accessor methods.
         numberOfDocuments := () -> nops( docs );
         getDocumentIdByIndex := proc( idx::posint )
           if idx <= numberOfDocuments() then
             ids[ idx ];
           else
             error "there are fewer than %1 documents in the corpus", idx;
           end if;
         end proc;

         # Construct an index based on a _k-dimensional approximation
         # to the term-document matrix.
         buildIndex := proc( _k::posint )
           local docterms, k, u, s, v;
           # Construct the corpus terms.
           docterms := map( FilterDocument, documents );
           corpusTerms := sort( [op]( `union`( op( map( {op}, docterms ) ) ) ) );
           # Build the term-document matrix.
           term_document_matrix := Matrix( map( [s -> s],
             map( documentTermVector, docs, corpusTerms ) ),
             'datatype' = 'float'[ 8 ], 'storage' = 'sparse' );
           use LinearAlgebra in
             u, s, v := SingularValues( term_document_matrix,
               'output' = [ ':-U', ':-S', ':-Vt' ] );
             v := v^+;
             if nargs > 0 then
               k := _k;
             else
               # Use a default if no dimension is provided.
               k := floor( Rank( DiagonalMatrix( s ) ) * 0.7 );
             end if;
             u := u[ 1 .. -1, 1 .. k ];
             v := v[ 1 .. k, 1 .. -1 ];
             s := DiagonalMatrix( s[ 1 ..
 k ] );
             # Replace the term-document matrix with its rank k
             # approximation.
             term_document_matrix := MatrixMatrixMultiply( u,
               MatrixMatrixMultiply( s, v ) );
           end use;
           eval( thismodule, 1 );
         end proc;

         search := proc( query, numberOfResults::posint )
           local qt, qv, scores;
           if not type( term_document_matrix, 'Matrix' ) then
             error "corpus not yet indexed";
           end if;
           qt := qfilter( query );
           qv := queryTermVector( qt, corpusTerms );
           scores := (term_document_matrix . qv)^+;
           Tools:-permSort( scores );
         end proc;
       end module;
     end proc;

     NumberOfDocuments := corpus -> corpus:-numberOfDocuments();
     GetDocumentIdByIndex := ( corpus, idx )
       -> corpus:-getDocumentIdByIndex( idx );
     BuildIndex := ( corpus, k )
       -> `if`( nargs = 1, corpus:-buildIndex(), corpus:-buildIndex( k ) );
     Search := ( corpus, query ) -> corpus:-search( query );

     # Locals
     documentTermVector := proc( doc, corpusTerms::list )
       local terms, norm;
       terms := FilterDocument( doc );
       Vector[row]( 1 .. nops( corpusTerms ),
         i -> evalf( numboccur( terms, corpusTerms[ i ] ) / nops( corpusTerms ) ),
         'datatype' = 'float'[ 8 ],
         'storage' = 'sparse' );
     end proc;

     queryTermVector := proc( queryTerms::list, corpusTerms::list )
       Vector[column]( 1 ..
 nops( corpusTerms ),
         i -> evalf( numboccur( queryTerms, corpusTerms[ i ] ) / nops( corpusTerms ) ),
         'datatype' = 'float'[ 8 ],
         'storage' = 'sparse' );
     end proc;

     # The Tools submodule
     Tools := module()
       export permSort;
       permSort := proc( V::Vector )
         local partition, quickSort, n, P, i;
         partition := proc( a, lb, ub )
           local i, j, k, v;
           i, j, k := lb, 1 + ub, lb;
           v := a[ k ];
           while i < j do
             i := 1 + i;
             while i < j and a[ i ] < v do
               i := 1 + i;
             end do;
             j := j - 1;
             while a[ j ] > v do
               j := j - 1;
             end do;
             if i < j then
               P[ i ], P[ j ] := P[ j ], P[ i ];
               a[ i ], a[ j ] := a[ j ], a[ i ];
             end if;
           end do;
           P[ k ], P[ j ] := P[ j ], P[ k ];
           a[ k ], a[ j ] := a[ j ], a[ k ];
           j;
         end proc;
         quickSort := proc( a, lb, ub )
           local k;
           if lb < ub then
             k := partition( a, lb, ub );
             procname( a, lb, k - 1 );
             procname( a, k + 1, ub );
           end if;
           a;
         end proc;
         n := LinearAlgebra:-Dimensions( V );
         P := Array( 1 .. n, [$ 1 .. n ], 'datatype' = 'integer[ 4 ]' );
         quickSort( V, 1, n );
         [seq]( P[ i ], i = 1 ..
 n );
       end proc;
     end module;

     # The Filters subpackage
     Filters := module()
       option package;
       export Text, XML, Maple, WorksheetTerms;
       local stopWords;
       stopWords := {
         # The 48 most common English words
         "i", "a", "all", "an", "and", "are",
         "as", "at", "be", "been", "but", "by",
         "can", "do", "for", "from", "had", "has",
         "have", "he", "his", "if", "in", "is",
         "it", "not", "of", "on", "or", "she",
         "that", "the", "their", "there", "they", "this",
         "to", "was", "we", "were", "what", "which",
         "who", "will", "with", "would", "you",
         # a few others
         "thing", "true", "most", "does", "don't",
         NULL };
       Text := proc( text::string )
         description "compute the terms in a text string";
         local words;
         use StringTools in
           words := map( LowerCase, Words( text ) );
           words := remove( type, words, stopWords );
           map( Stem, words );
         end use;
       end proc;
       XML := proc( xml )
         description "compute the terms in an XML document";
         local t, count, rec, i;
         rec := proc( xml, t::table )
           local cm, texts, text, others;
           use XMLTools in
             if IsElement( xml ) then
               cm := ContentModel( xml );
               texts, others := selectremove( IsText, cm );
               for text in texts do
                 count := 1 + count;
                 t[ count ] := text;
               end do;
               map( procname, others, t );
             end if;
           end use;
         end proc;
         count := 0;
         t := table();
         rec( xml, t );
         [seq]( t[ i ], i = 1 .. count );
       end proc;
       Maple := proc( expr )
         description "compute the terms in a Maple expression";
         local fns, terms, nocc, i, j;
         fns := [op]( map2( op, 0, indets( expr, 'function' ) ) );
         nocc := map2( numboccur, expr, fns );
         terms := [seq]( [ seq( fns[ i ], j = 1 .. nocc[ i ] ) ],
           i = 1 ..
nops( fns ) );       sort( map( op, terms ), 'lexorder' );     end proc;     WorksheetTerms := proc( mws )       description "compute the terms in a worksheet";       local  rec, wks, count, t, i;       rec := proc( x, t::table )         local  cm, texts, text, others;         use XMLTools in           if IsElement( x ) and ElementName( x ) = "Text-field" then             cm := ContentModel( x );             texts, others := selectremove( IsText, cm );             for text in texts do               count := 1 + count;               t[ count ] := text;             end do;             map( procname, others, t );           elif not type( x, 'string' ) then             map( procname, args );           end if;         end use;       end proc;       use XMLTools in         if IsDocument( mws ) then           wks := select( IsElement, [op]( mws ) );           if nops( wks ) = 1 then             wks := wks[ 1 ];           else             error "ill-formed worksheet `%1'", fname;           end if;         end if;       end use;       count := 0;       t := table();       rec( wks, t );       t := [seq]( t[ i ], i = 1 .. count );       use XMLTools in         t := map( TextText, t );         t := cat( op( t ) );         Filters:-Text( t );       end use;     end proc;   end module; end module:
 The revised package contains several new document filters. Document formats that are not directly supported can be handled by composing these filters with custom code. Rather than returning a vector of raw scores, the Search command in the package now returns a permutation of the document list that ranks the documents by relevance. This permutation can be passed directly to the GetDocumentIdByIndex routine.
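 For example, a corpus of HTML files might be handled by composing the Text filter with a routine that strips markup first. The following sketch is illustrative only: the StripHTML name and the "MyHTMLDocs" directory are hypothetical, and the regular-expression approach to tag removal is a crude approximation rather than part of the package.

 > StripHTML := html -> StringTools:-RegSubs( "<[^>]*>" = " ", html ):
 > corpus := Corpus(
       remove( type, listdir( "MyHTMLDocs" ), { ".", ".." } ),
       fname -> readbytes( fname, 'TEXT', infinity ),
       html -> Filters:-Text( StripHTML( html ) ) ):

 The composed filter removes tags before the standard text processing (stop-word removal and stemming) is applied.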

Using the Package

 The package can be used with a variety of corpora. This subsection demonstrates two examples. The first is the corpus of short text strings used previously in this section.
 > with( SearchEngine ):
 > corpus := Corpus( [ doc || ($1..9) ], id -> id, Filters:-Text ):
 > NumberOfDocuments( corpus );

 ${9}$ (21)

 > GetDocumentIdByIndex( corpus, 1 ); ranking := Search( BuildIndex( corpus ), "mathematical beauty" );

 ${"The mathematician\text{'}s patterns, like the painter\text{'}s or the poet\text{'}s must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics."}$

 ${\mathrm{ranking}}{≔}\left[{3}{,}{1}{,}{7}{,}{5}{,}{6}{,}{2}{,}{9}{,}{8}{,}{4}\right]$ (22)

 > map2( GetDocumentIdByIndex, corpus, ranking[ 1 .. 3 ] );

 $\left[{"Anyone who cannot cope with mathematics is not fully human. At best he is a tolerable subhuman who has learned to wear shoes, bathe, and not make messes in the house."}{,}{"The mathematician\text{'}s patterns, like the painter\text{'}s or the poet\text{'}s must be beautiful; the ideas, like the colors or the words must fit together in a harmonious way. Beauty is the first test: there is no permanent place in this world for ugly mathematics."}{,}{"The most beautiful thing we can experience is the mysterious. It is the source of all true art and science."}\right]$ (23)

 The second example corpus is a database of formulae. The intent is to be able to locate formulae relevant to a search query consisting of function names. For the formula database, generate a list of identities among elementary functions using the FunctionAdvisor command.

 > Formulae := map2( FunctionAdvisor, 'identities', FunctionAdvisor( 'elementary', 'quiet' ), 'quiet' ):
 > Formulae := map( op, sort( Formulae, 'length' ) ):
 > nops( Formulae );

 ${136}$ (24)

 Use each formula as both the document and its ID. The Maple filter in the Filters subpackage extracts the terms from each document.
 > corpus2 := Corpus( Formulae, id -> id, Filters:-Maple, query -> [op]( {op}( query ) ) ):
 > BuildIndex( corpus2 ):

 It is possible to locate formulae relevant to, for example, the sin and cos functions.

 > ranking := Search( corpus2, [ 'sin', 'cos' ] );

 ${\mathrm{ranking}}{≔}\left[{89}{,}{93}{,}{116}{,}{38}{,}{92}{,}{117}{,}{88}{,}{26}{,}{32}{,}{122}{,}{111}{,}{115}{,}{91}{,}{23}{,}{97}{,}{126}{,}{105}{,}{131}{,}{21}{,}{12}{,}{121}{,}{94}{,}{98}{,}{110}{,}{15}{,}{68}{,}{63}{,}{120}{,}{125}{,}{106}{,}{17}{,}{27}{,}{130}{,}{124}{,}{104}{,}{127}{,}{118}{,}{58}{,}{6}{,}{4}{,}{62}{,}{75}{,}{100}{,}{108}{,}{99}{,}{107}{,}{35}{,}{10}{,}{39}{,}{13}{,}{74}{,}{22}{,}{2}{,}{18}{,}{82}{,}{86}{,}{83}{,}{85}{,}{81}{,}{33}{,}{84}{,}{16}{,}{14}{,}{9}{,}{34}{,}{76}{,}{54}{,}{7}{,}{3}{,}{64}{,}{132}{,}{109}{,}{87}{,}{11}{,}{1}{,}{28}{,}{5}{,}{79}{,}{80}{,}{134}{,}{78}{,}{135}{,}{136}{,}{29}{,}{50}{,}{43}{,}{49}{,}{24}{,}{67}{,}{36}{,}{69}{,}{59}{,}{95}{,}{96}{,}{90}{,}{57}{,}{30}{,}{72}{,}{51}{,}{44}{,}{19}{,}{42}{,}{112}{,}{113}{,}{114}{,}{119}{,}{65}{,}{71}{,}{37}{,}{61}{,}{70}{,}{60}{,}{25}{,}{31}{,}{123}{,}{129}{,}{128}{,}{55}{,}{77}{,}{103}{,}{102}{,}{101}{,}{53}{,}{47}{,}{20}{,}{45}{,}{46}{,}{52}{,}{40}{,}{66}{,}{133}{,}{56}{,}{8}{,}{48}{,}{41}{,}{73}\right]$ (25)

 > map2( GetDocumentIdByIndex, corpus2, ranking[ 1 .. 3 ] );

 $\left[{\mathrm{tan}}{}\left({z}\right){=}\frac{{2}{}{\mathrm{tan}}{}\left(\frac{{z}}{{2}}\right)}{{1}{-}{{\mathrm{tan}}{}\left(\frac{{z}}{{2}}\right)}^{{2}}}{,}{\mathrm{tan}}{}\left({z}\right){=}\frac{{2}}{{\mathrm{cot}}{}\left(\frac{{z}}{{2}}\right){-}{\mathrm{tan}}{}\left(\frac{{z}}{{2}}\right)}{,}{\mathrm{cot}}{}\left({z}\right){=}\frac{{1}{-}{{\mathrm{tan}}{}\left(\frac{{z}}{{2}}\right)}^{{2}}}{{2}{}{\mathrm{tan}}{}\left(\frac{{z}}{{2}}\right)}\right]$ (26)

 Construct a similar corpus using a different choice for the document IDs and the fetcher routine passed to the constructor.
 Instead of using formulae for both the document and its ID, use the position of the formula in the global list Formulae as the document ID, and pass a suitable fetcher routine.

 > corpus3 := Corpus( [ $1 .. nops( Formulae ) ], id -> Formulae[ id ], Filters:-Maple, query -> [op]( {op}( query ) ) ):
 > ranking := Search( BuildIndex( corpus3 ), [ 'sin', 'cos' ] );
 ${\mathrm{ranking}}{≔}\left[{89}{,}{93}{,}{116}{,}{38}{,}{92}{,}{117}{,}{88}{,}{26}{,}{32}{,}{122}{,}{111}{,}{115}{,}{91}{,}{23}{,}{97}{,}{126}{,}{105}{,}{131}{,}{21}{,}{12}{,}{121}{,}{94}{,}{98}{,}{110}{,}{15}{,}{68}{,}{63}{,}{120}{,}{125}{,}{106}{,}{17}{,}{27}{,}{130}{,}{124}{,}{104}{,}{127}{,}{118}{,}{58}{,}{6}{,}{4}{,}{62}{,}{75}{,}{100}{,}{108}{,}{99}{,}{107}{,}{35}{,}{10}{,}{39}{,}{13}{,}{74}{,}{22}{,}{2}{,}{18}{,}{82}{,}{86}{,}{83}{,}{85}{,}{81}{,}{33}{,}{84}{,}{16}{,}{14}{,}{9}{,}{34}{,}{76}{,}{54}{,}{7}{,}{3}{,}{64}{,}{132}{,}{109}{,}{87}{,}{11}{,}{1}{,}{28}{,}{5}{,}{79}{,}{80}{,}{134}{,}{78}{,}{135}{,}{136}{,}{29}{,}{50}{,}{43}{,}{49}{,}{24}{,}{67}{,}{36}{,}{69}{,}{59}{,}{95}{,}{96}{,}{90}{,}{57}{,}{30}{,}{72}{,}{51}{,}{44}{,}{19}{,}{42}{,}{112}{,}{113}{,}{114}{,}{119}{,}{65}{,}{71}{,}{37}{,}{61}{,}{70}{,}{60}{,}{25}{,}{31}{,}{123}{,}{129}{,}{128}{,}{55}{,}{77}{,}{103}{,}{102}{,}{101}{,}{53}{,}{47}{,}{20}{,}{45}{,}{46}{,}{52}{,}{40}{,}{66}{,}{133}{,}{56}{,}{8}{,}{48}{,}{41}{,}{73}\right]$ (27)
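 Other queries against the formula corpus work the same way. For instance, the following would rank formulae involving the exp and ln functions (the output is omitted here, since the ranking depends on the generated Formulae list; the name ranking2 is merely illustrative):

 > ranking2 := Search( BuildIndex( corpus3 ), [ 'exp', 'ln' ] ):
 > map2( GetDocumentIdByIndex, corpus3, ranking2[ 1 .. 3 ] );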
 The common and practical case, in which a corpus represents a collection of files to be indexed, can be handled by using a constructor call such as the following.
 > corpus := Corpus(
       remove( type, listdir( "MyDocuments" ), { ".", ".." } ),
       fname -> readbytes( fname, 'TEXT', infinity ),
       Filters:-Text ):
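 Such a corpus is then indexed and searched exactly as in the earlier examples. Because "MyDocuments" is a placeholder directory name, no output is shown; the query string here is likewise only an example.

 > ranking := Search( BuildIndex( corpus ), "vector space model" ):
 > map2( GetDocumentIdByIndex, corpus, ranking[ 1 .. 3 ] );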
 If the documents contain structured text encoded as XML, then a similar invocation can be used.
 > corpus := Corpus(
       remove( type, listdir( "MyDocuments" ), { ".", ".." } ),
       fname -> XMLTools:-ParseFile( fname ),
       Filters:-XML ):
 Finally, a directory of Maple worksheets can be represented by a corpus constructed as follows.
 > corpus := Corpus(
       remove( type, listdir( "MyDocuments" ), { ".", ".." } ),
       fname -> Worksheet:-ReadFile( fname ),
       Filters:-WorksheetTerms ):
 A client of the SearchEngine package can provide a specialized filter routine when constructing a corpus object to represent a collection of documents of a specific type. Generic interfaces, together with careful hiding of representational details, give clients considerable flexibility and allow the package's implementation to evolve.
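 As a final illustration, a client-supplied filter need only map a document to its list of search terms. A hypothetical corpus in which each document is a comma-separated keyword string could be supported as follows (KeywordFilter and keywordDocs are illustrative names, not part of the package):

 > KeywordFilter := doc -> map( StringTools:-LowerCase,
       map( StringTools:-Trim, StringTools:-Split( doc, "," ) ) ):
 > corpus := Corpus( keywordDocs, id -> id, KeywordFilter ):

 Because the package treats filters generically, this corpus is indexed and searched with the same BuildIndex and Search calls used throughout this section.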