Indexing n-grams via plugin?

rschram

Hi everyone,

Currently, the inc/Search/Indexer.php class has a tokenizer that explodes a page on whitespace to collect a list of terms for indexing (e.g. "Papua New Guinea" is tokenized as "papua", "new", "guinea"). I'd like to also create an index of bigrams (two-word phrases, i.e. "papua new", "new guinea"). The only way I've managed to do this is to rewrite the tokenizer function.
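To make the goal concrete, here is a minimal sketch of the word list versus the bigram list for the example above. It is written in Python purely for illustration, not as DokuWiki code; the function names tokenize and ngrams are hypothetical, and the real tokenizer does more than split on whitespace.

```python
def tokenize(text):
    """Lowercase the text and split it on whitespace (a simplification of the real tokenizer)."""
    return text.lower().split()


def ngrams(tokens, n=2):
    """Return every run of n consecutive tokens, joined by a single space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


words = tokenize("Papua New Guinea")
print(words)             # ['papua', 'new', 'guinea']
print(ngrams(words, 2))  # ['papua new', 'new guinea']
```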

Could the list of bigrams and/or n-grams be generated via a plugin by using an existing event? The tokenizer triggers "INDEXER_TEXT_PREPARE" and passes the full text as a string to the event handler, but I guess I'm not clear on whether handlers can replace instructions or just insert additional instructions before or after an event is triggered.

Would it be advisable to instead define a new event and insert it into the tokenizer after the word list is generated?

Best wishes,
Ryan

gerardnico

From what I can see, the tokenizer event lets you split words by newline.

https://www.dokuwiki.org/devel:event:indexer_text_prepare

If you do that with one bigram per line, does it not work?

rschram

Thanks for pointing this out. I missed this on the devel pages before and it clarifies what's going on with the strtr in Indexer.php.

Yes, I believe that would be one possible approach, or at least part of it. Simply inserting newlines into $text at INDEXER_TEXT_PREPARE would mean indexing by two-word stretches instead of any other units. One would also have to remove whitespace within each stretch of text, e.g. s/[^0-9a-zA-Z]//U (the only allowed interword characters are [0-9a-zA-Z]). Then, the newlines after every bigram in $text at INDEXER_TEXT_PREPARE would be converted back into \s (by the strtr line in Indexer.php) and then exploded on spaces to create the array of tokens, $wordlist. So for $text = "Papua New Guinea", $wordlist would be array("papuanew", "guinea").
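The effect described above can be simulated roughly as follows. This is a Python sketch of the idea only: prepare_text stands in for a hypothetical INDEXER_TEXT_PREPARE handler and tokenize for the newline-to-space conversion and split in the core tokenizer; neither is the actual DokuWiki code.

```python
import re


def prepare_text(text):
    """Hypothetical handler: squash each two-word stretch into one token per line."""
    words = text.split()
    stretches = []
    for i in range(0, len(words), 2):
        stretch = " ".join(words[i:i + 2])
        # Drop everything except [0-9a-zA-Z], including the space inside the stretch.
        stretches.append(re.sub(r"[^0-9a-zA-Z]", "", stretch))
    return "\n".join(stretches)


def tokenize(prepared):
    """Stand-in for the core steps: newlines become spaces, then split on whitespace."""
    return prepared.replace("\n", " ").lower().split()


print(tokenize(prepare_text("Papua New Guinea")))  # ['papuanew', 'guinea']
```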

If you wanted a list of true "bigrams" ("papua new", "new guinea"), one could also manipulate $text further, but then you're basically tokenizing the input string in order to tokenize it on spaces later. So we come to my other questions. Can one modify or replace instructions via plugin events or merely insert new instructions? And, at what point does one opt instead to rewrite a core function rather than use a plugin (for instance if one wanted to allow other interword characters like apostrophes or hyphens)?

Thanks again,
Ryan

gerardnico

I haven't checked the code yet because I'm on my mobile, but if you can't tweak the indexer, you can just add a new one via the INDEXER_TASKS_RUN event (it's just a cron event) or make a pull request (not sure if it would be accepted).

andi

TBH I wonder if it's worth the effort. It might be easier to switch to the ElasticSearch plugin for a full blown search engine, when the builtin one is too simple.

But I'd be interested to see how your n-gram based index performs. Even just doing 2 word combinations would blow up the index size a lot and you might run into memory limits soon...

gerardnico

andi easier to switch to the ElasticSearch plugin

And yes, when DokuWiki is too simple 🙂 I saw your answer after writing mine.

gerardnico

For the search part:

You can always add or remove search results with the search_query_pagelookup event, it seems.
You can create a new action (it seems) with tpl_act_unknown.
You may also use an AJAX event to tweak the dropdown in the search bar.

The whole search is executed here
I didn't find any way to hook into this part so that it doesn't use the original index.

rschram

Thanks for these comments. I will look at ElasticSearch to see if I can make use of it.

So far, the effort has been worthwhile. I'm learning about Dokuwiki, and PHP. I've become very reliant on Dokuwiki for a lot of things, and so I'm happy to now start learning how it works inside.

My goal is to find "similar pages" based on the TFIDF values of the terms in the text. The TFIDF value for each term in an article can be calculated from the data in i<n>.idx files. If one represents an article as a vector of TFIDF values, then you can also find the cosine between two articles and use this as a measure of their similarity. It works quite well, even with just a vector of 1-grams. Bigrams or higher n-grams aren't needed to determine similarity then.
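As a rough illustration of the calculation, here is a Python sketch that builds TFIDF vectors from token lists and compares them by cosine. It assumes one common formulation (term frequency times log(N / document frequency)); it does not read the i<n>.idx files, and the page names and function names are made up for the example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """docs maps page id -> list of tokens. Returns page id -> {term: TFIDF weight}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs.values():
        doc_freq.update(set(tokens))  # count each term once per page
    vectors = {}
    for page, tokens in docs.items():
        counts = Counter(tokens)
        vectors[page] = {
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        }
    return vectors


def cosine(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * b[term] for term, weight in a.items() if term in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up pages for illustration only.
pages = {
    "png": "papua new guinea highlands kinship".split(),
    "fiji": "fiji islands kinship exchange".split(),
    "php": "php plugin indexer tokenizer".split(),
}
vectors = tfidf_vectors(pages)
# "png" and "fiji" share a term, so they score higher than "png" and "php".
print(cosine(vectors["png"], vectors["fiji"]) > cosine(vectors["png"], vectors["php"]))
```

Note that a term appearing on every page gets a weight of zero under this formulation, which is one reason the IDF part is useful for comparing pages.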

At the bottom of this page, you can see what I've got working so far. It lists the most similar pages based on their TFIDF cosines. For each linked page, the gray terms are terms that are shared between the current page and the linked page, and the TFIDF value of that term on the linked page. As you can see they are not surprising results, but it's nice.

This is where 2-grams and above come in. The TFIDF cosines of pairs of 2-gram vectors aren't actually all that great. I was able to get that working in an earlier version, and the resulting list of similar pages was much narrower and tightly focused on a specific topic, which sort of defeats the purpose of making connections across the wiki. But 2-grams and 3-grams that are shared between two highly similar pages are interesting. For one, since I'm writing in English, high TFIDF 2-grams and 3-grams have a better chance of being proper names and abstract noun phrases. So relating pages based on overall similarity of their TFIDF vectors and then drilling down into the intersection of those pages at the level of larger units brings to light specific phrases and ideas that they have in common. As you can see, the syntax component has prelinked all these terms. The idea is that when this plugin identifies a key topic, I am prompted to extend the wiki.
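The drill-down step could look something like the following Python sketch, which only intersects the 2-grams and 3-grams of two pages that were already judged similar. It does not rank the shared phrases by TFIDF as the plugin does; the function names and sample sentences are hypothetical.

```python
def ngram_set(tokens, n):
    """Set of all runs of n consecutive tokens, joined by a space."""
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def shared_ngrams(tokens_a, tokens_b, sizes=(2, 3)):
    """Return n-grams present in both token lists, longest first, then alphabetically."""
    shared = set()
    for n in sizes:
        shared |= ngram_set(tokens_a, n) & ngram_set(tokens_b, n)
    return sorted(shared, key=lambda gram: (-len(gram.split()), gram))


page_a = "exchange in papua new guinea highlands".split()
page_b = "kinship in papua new guinea".split()
print(shared_ngrams(page_a, page_b))
# ['in papua new', 'papua new guinea', 'in papua', 'new guinea', 'papua new']
```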

But the similarity calculation needs a complete index of the TFIDF of each unique term on each page. This is stored as one file, and then the cosines of one page with other pages are stored as page metadata. The TFIDF index is 2 MB on a wiki of fewer than 350 pages. The page metadata files are each about 250K. So this works for now, but perhaps not as the wiki continues to grow. And indexing both 1-grams and 2-grams would blow it and the i<n>.idx files up even more. (When I indexed the site by 1..3-grams, it churned out i<1..137>.idx, although I had also rewritten the tokenizer to permit hyphens and apostrophes in terms and did not strip URLs or stopwords. Still, the full version of this extension as I currently imagine it will have a very bulky store of data.)

Would there be a way to exploit search results to find similar pages and identify shared prominent words and phrases? Would that save processing time and disk quota?

Best wishes,
Ryan

gerardnico

rschram Would there be a way to exploit search results to find similar pages and identify shared prominent words and phrases?

I already answered that there (comment 6).

gerardnico

Is the TFIDF not the same for each term?

From my understanding of TFIDF, it is a weight created from the frequency of a term on the page (TF) set against its frequency across the corpus (IDF), i.e. words used a lot across the corpus have a low TFIDF.

You can always create a vector of word frequencies to represent a document, but I don't understand why you use the IDF. For filtering?

Similarity calculations also often include a graph component based on the edges (i.e. links) between the pages (PageRank-style, à la Larry Page). Pages that are closest in the graph are more similar. Did you take that into account?

I don't know how you traverse a file block by block in PHP, but since this is more of an intermediate computation for the similarity calculation, loading the whole file into memory could be avoided.
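The general idea of reading a file piece by piece can be sketched like this in Python (PHP has line-by-line file reading as well). The tab-separated "page, term, weight" layout is an assumption made up for this example; it is not the format the plugin actually stores.

```python
import os
import tempfile


def iter_records(path):
    """Yield (page, term, weight) one line at a time instead of reading the whole file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line:
                continue
            page, term, weight = line.split("\t")
            yield page, term, float(weight)


def vector_for(path, wanted_page):
    """Collect only the weights of one page; every other line is discarded as it streams by."""
    return {
        term: weight
        for page, term, weight in iter_records(path)
        if page == wanted_page
    }


# Small demonstration with a throwaway file in the assumed layout.
with tempfile.TemporaryDirectory() as folder:
    index_path = os.path.join(folder, "tfidf.tsv")
    with open(index_path, "w", encoding="utf-8") as out:
        out.write("png\tpapua\t0.5\npng\tkinship\t0.1\nfiji\tkinship\t0.2\n")
    print(vector_for(index_path, "png"))  # {'papua': 0.5, 'kinship': 0.1}
```

Only one line is held in memory at a time, plus the vector for the page being compared.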

In any case, beautiful project, well done.

rschram

gerardnico I don't know how you traverse a file block by block in PHP, but since this is more of an intermediate computation for the similarity calculation, loading the whole file into memory could be avoided.

I think we are now getting to the point where I can only make improvements to the design and implementation with more skill and understanding. I'm sure there's lots of ways I can do what I want to do better, but I don't know how because I don't really understand the basics.

Anyways, thanks in part to your advice here, the plugin works on both Greebo and Hogfather. Could it work better? Most likely. Is it ready for anyone to use on any Dokuwiki wiki? I don't know, but I believe it isn't. It's half-done. I'm using it now for myself, and I'm sharing it with everyone to see if anyone wants to pick up the ball and run.

https://github.com/rschram/similarities

There's a lot of things big and small that need to be changed in order for this to be useful to anyone but me. Besides the question of memory use, this current half-baked version hard-codes the number of results returned at 6 and the number of shared terms listed for each result at 5. These should be user-configurable variables. Also, even though I do use quickaclcheck() when generating the results list, it doesn't work 100% of the time, I think because the metadata cache is not cleared and that's because I don't fully understand how that works or how to do it.

Still maybe someone will see this and know how to improve it. Thanks again for your advice.

gerardnico

The INDEXER_TASKS_RUN event is triggered on every page request.

DokuWiki uses a lock to make sure that the process is not running twice, eating CPU for nothing.

Your memory problem may come from the fact that you are running the similarity calculation for every page visited.
You could also create a lock mechanism of your own and start the calculation once a week.
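A lock plus a weekly schedule could be sketched like this. It is a Python illustration of the pattern only, not DokuWiki's own locking code; run_if_due, the lock file, and the timestamp file are all hypothetical names, and a stale lock left behind by a crashed process is not handled.

```python
import os
import time

WEEK_SECONDS = 7 * 24 * 60 * 60


def read_last_run(stamp_path):
    """Return the time of the last completed run, or None if there is none recorded."""
    try:
        with open(stamp_path, encoding="utf-8") as stamp:
            return float(stamp.read())
    except (OSError, ValueError):
        return None


def run_if_due(lock_path, stamp_path, job, interval=WEEK_SECONDS, now=None):
    """Run job() at most once per interval and never concurrently. Returns True if it ran."""
    now = time.time() if now is None else now
    last_run = read_last_run(stamp_path)
    if last_run is not None and now - last_run < interval:
        return False  # ran recently enough, nothing to do on this request
    try:
        # O_EXCL makes creation fail if another request already holds the lock.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    try:
        job()
        with open(stamp_path, "w", encoding="utf-8") as stamp:
            stamp.write(str(now))
        return True
    finally:
        os.close(fd)
        os.remove(lock_path)
```

Called on every page request, this returns almost immediately except for the one request per week that actually does the work.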

I'm still a newbie at monitoring processes in PHP. For now I only monitor at the server level.
