Thanks, Andi,
Pagequery is a great plugin and very much suited to this purpose. With it, you can query the whole wiki in a variety of ways, e.g. by tags, words, etc., and sort and group the results by date. So, on every page in the wiki--in the sidebar, even--you can create what I've been thinking of as 'presence' and 'recency'. Rather like the stacks of paper on my desk right now: I can be working on one task while, in my peripheral vision, I'm aware of other stuff, with the more recent and more important material nearest me. DokuWiki lets me brain-dump stuff I don't want to forget, and perhaps, with pagequery, it can also help me 'remember to remember'.
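For anyone reading along, a pagequery call is a double-curly-brace macro placed in a page or sidebar. Going from my (possibly faulty) memory of the plugin docs, a recency-style query would look something like the following -- treat the exact option names as approximate:

    {{pagequery>@notes; fulltext; sort=mdate:desc; group; limit=10}}

i.e. a full-text search restricted to the notes namespace, sorted newest-first by modification date, grouped, and capped at ten results.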
However, based on my quick glance at pagequery, I think it does have one piece missing. Basically, it's a query of what's there, and just like any search query, you have to know what you are looking for (e.g. "stories about coconuts", "lecture notes on subjectivity"), or at least the parameters of what you're looking for ("lecture notes written in the last year, or this time last year..."), in order to find it. Likewise, its report of the full inventory depends in part on having filed things in a meaningful way. My desk in front of me is a living testament to my inability to do this... :) And yet, my desk in front of me is actually very well organized! It's just that I didn't organize it! Much as Conway's Game of Life shows us that cellular automata can bootstrap higher-order complexity, so too are the complex assemblages piled before my eyes pregnant with an immanent ordering--one that is highly practical, if a bit fuzzy. It has presence, recency, and also relevance.
Would it be possible, then, for pagequery to access a similar kind of implicit information about the temporal and conceptual relationships in a collection of wiki articles? Yes, I think so. I understand now that this has been an area of active research for some decades, so I recognize that it's my ignorance that's freeing me to speak boldly. Anyways...
So, the docs on the full text index at
https://www.dokuwiki.org/devel:fulltextindex describe the contents of the index files. Basically, i<wordlength>.idx contains a list of words and, for each word, a series of page IDs (stored elsewhere) and the number of occurrences of the word in each of those pages. Thus, for a given page, we already know the term frequency of each word, and a count of the pages listed under a word gives us its document frequency. The total number of documents is also known, so the TF*IDF weight is easily calculable. Perhaps the search function already does this, or assigns some other weight. (The documentation seemed to imply that there was weighting, but based only on frequency.)
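To make that concrete, here is a minimal sketch (in Python rather than PHP, and assuming the .idx files have already been parsed into a word => {pageID: count} mapping -- the names here are mine, not DokuWiki's):

    import math

    # index: {word: {page_id: occurrences}}, parsed from the i<wordlength>.idx files
    # n_docs: total number of pages in the wiki
    def tfidf_weights(index, n_docs):
        weights = {}  # page_id -> {word: tf * idf}
        for word, postings in index.items():
            df = len(postings)           # document frequency: pages containing the word
            idf = math.log(n_docs / df)  # rare words get a big boost, common ones ~0
            for page_id, tf in postings.items():
                weights.setdefault(page_id, {})[word] = tf * idf
        return weights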
A new development could be to take these weights for each term and pull together another idx table with page IDs and an array of the TF*IDF weights for each term. This can be treated as a vector, and for two article vectors we can calculate the cosine of the angle between them. Based on my 72 hours of frantic reading on this topic, this is apparently a good measure of how similar the information of two documents is. A small angle (cosine near 1) means that the two documents have many words in common that are otherwise relatively rare in the context of the wiki. When you compute cosine similarities for every pairing of documents in a collection, you get a matrix of cosines, one per pair: a relative score of their informational distance.
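The cosine itself is just the normalized dot product of the two weight vectors. Continuing the sketch above, with sparse dicts rather than dense arrays, since any one page uses only a fraction of the wiki's vocabulary:

    import math

    # a, b: {word: weight} vectors for two pages, as built by tfidf_weights()
    def cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # one cosine per unordered pair of pages
    def similarity_matrix(weights):
        ids = list(weights)
        return {frozenset((p, q)): cosine(weights[p], weights[q])
                for i, p in enumerate(ids) for q in ids[i + 1:]}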
I used Perl modules to do this for a collection of 343 research notes. It took about five minutes for the computer to create the matrix and then print out an HTML page for each note, with its content and a list of related pages. I think it stood up really well. All of these notes also contained a list of keywords in a separate field, and, for a given note, the rankings of the pages based on the cosine similarity of shared tags and of shared words were actually pretty close. If anything, the ranking of relevance based on the full text seemed better than my tagging of the documents. It brought together notes about the same people, which was not something I used as a keyword. Proper names are relatively rare occurrences, and so they get a strong weighting. I found stuff I had forgotten about, or hadn't scrupulously tagged when I wrote it. It was awesome!
But there was a lot of computational power being applied. So, much as the indexer web bug silently updates the full-text .idx files, that would be the better way to implement this kind of search engine in DW. indexer.php updates the existing idx files with new data on frequency and location. A modified indexer, if it does not do so already, could also update another set of records holding the vector of term weights for each page. When it has a free moment, it could compute the cosine similarity matrix and write it to another idx file.
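In pseudo-Python, the incremental step I have in mind would be something like this (the hook name is hypothetical -- I don't yet know what indexer.php actually exposes):

    import math

    # imagined hook, called after the normal index update for a saved page;
    # reuses cosine() from the sketch above
    def on_page_indexed(page_id, index, n_docs, vectors, sim):
        # rebuild just this page's weight vector from the fresh index...
        vectors[page_id] = {w: postings[page_id] * math.log(n_docs / len(postings))
                            for w, postings in index.items() if page_id in postings}
        # ...then recompute only the pairs involving this page
        for other in vectors:
            if other != page_id:
                sim[frozenset((page_id, other))] = cosine(vectors[page_id], vectors[other])
        # caveat: a save also shifts the idf of every word it touches, so other
        # pages' vectors drift slightly stale until their own next reindex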
Then, when a user browses a page that's been indexed, something like pagequery could run, except its query would be based on these additional pieces of information from the full-text indexes. Say we place the pagequery double-curly-brace syntax on every page. Say also it can grab the ID of the current page. It could then look this page up in the cosine similarity matrix, sort the entries according to its criteria, and return a list of other pages ranked by relevance. Each article would be accompanied by a list of 'related articles', their degree of relevance already having been calculated when the page content was added to the full-text indexes.
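At browse time, the plugin's job then reduces to a lookup and a sort. A sketch, again with made-up names:

    # sim: the precomputed pairwise cosine matrix; current: ID of the page being viewed
    def related_pages(sim, current, limit=10):
        scores = [(next(iter(pair - {current})), s)
                  for pair, s in sim.items() if current in pair]
        scores.sort(key=lambda entry: entry[1], reverse=True)
        return scores[:limit]  # [(page_id, similarity), ...], most relevant first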
I will keep reading about this, and also try to look at indexer.php and fulltext.php to see what's really going on there. Are weights assigned to words? Is this information stored? Are there other theories of the informational relationships between documents that would be worth considering, from either practical or conceptual perspectives?
Cheers,
Ryan