Thanks for these comments. I will look at Elasticsearch to see if I can make use of it.
So far, the effort has been worthwhile. I'm learning about DokuWiki and PHP. I've become very reliant on DokuWiki for a lot of things, so I'm happy to start learning how it works internally.
My goal is to find "similar pages" based on the TF-IDF values of the terms in the text. The TF-IDF value for each term in an article can be calculated from the data in the i<n>.idx files. If an article is represented as a vector of TF-IDF values, then the cosine between two articles can be used as a measure of their similarity. This works quite well even with just a vector of 1-grams; bigrams or higher n-grams aren't needed to determine similarity.
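For concreteness, here is a minimal sketch of that calculation in PHP. The `$df` document-frequency map and the toy term counts are made up for illustration; in the plugin they would be derived from the i<n>.idx data:

```php
<?php
// Sketch: cosine similarity of two pages' TF-IDF vectors.
// $counts: term => raw count on one page; $df: term => number of
// pages containing the term; $numPages: total pages in the wiki.

function tfidf_vector(array $counts, array $df, int $numPages): array {
    $vec = [];
    foreach ($counts as $term => $tf) {
        if (empty($df[$term])) continue;
        $vec[$term] = $tf * log($numPages / $df[$term]);
    }
    return $vec;
}

function cosine(array $a, array $b): float {
    $dot = 0.0;
    foreach ($a as $term => $w) {
        if (isset($b[$term])) $dot += $w * $b[$term];
    }
    $normA = sqrt(array_sum(array_map(fn($w) => $w * $w, $a)));
    $normB = sqrt(array_sum(array_map(fn($w) => $w * $w, $b)));
    return ($normA && $normB) ? $dot / ($normA * $normB) : 0.0;
}

// Toy data: a three-page wiki. Note that a term appearing on every
// page ('the') gets weight zero and contributes nothing to the cosine.
$numPages = 3;
$df = ['the' => 3, 'plugin' => 2, 'tfidf' => 1, 'cosine' => 2];
$a = tfidf_vector(['the' => 10, 'plugin' => 3, 'tfidf' => 2], $df, $numPages);
$b = tfidf_vector(['the' => 8, 'plugin' => 1, 'cosine' => 1], $df, $numPages);
echo cosine($a, $b), "\n"; // ~0.34
```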
At the bottom of this page, you can see what I've got working so far. It lists the most similar pages based on their TF-IDF cosines. For each linked page, the gray text shows the terms shared between the current page and the linked page, together with each term's TF-IDF value on the linked page. The results are not surprising, but it's nice to see.
This is where 2-grams and above come in. The TF-IDF cosines of pairs of 2-gram vectors aren't actually all that useful on their own: I had that working in an earlier version, and the resulting list of similar pages was much narrower and tightly focused on a specific topic, which rather defeats the purpose of making connections across the wiki. But 2-grams and 3-grams that are shared between two highly similar pages are interesting. For one, since I'm writing in English, high-TF-IDF 2-grams and 3-grams have a better chance of being proper names or abstract noun phrases. So relating pages by the overall similarity of their 1-gram TF-IDF vectors, and then drilling down into the intersection of those pages at the level of larger units, brings to light specific phrases and ideas that they have in common (see the sketch below). As you can see, the syntax component has prelinked all these terms. The idea is that when this plugin identifies a key topic, I am prompted to extend the wiki.
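The drill-down step itself is simple; roughly something like this, where the `$ngramsA`/`$ngramsB` arrays of phrase => TF-IDF weight are hypothetical stand-ins for the two pages' 2-/3-gram vectors:

```php
<?php
// Sketch: list the n-grams two similar pages have in common,
// ranked by the linked page's TF-IDF weight, highest first.

function shared_phrases(array $ngramsA, array $ngramsB, int $limit = 10): array {
    $shared = array_intersect_key($ngramsB, $ngramsA); // keep linked page's weights
    arsort($shared);                                   // highest TF-IDF first
    return array_slice($shared, 0, $limit, true);
}

$ngramsA = ['noun phrase' => 3.2, 'similar pages' => 2.9, 'page metadata' => 1.1];
$ngramsB = ['similar pages' => 4.0, 'page metadata' => 2.2, 'disk quota' => 1.8];
print_r(shared_phrases($ngramsA, $ngramsB));
// => ['similar pages' => 4.0, 'page metadata' => 2.2]
```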
But the similarity calculation needs a complete index of the TF-IDF of each unique term on each page. This is stored as a single file, and the cosines of each page against all the others are stored as page metadata. The TF-IDF index is 2 MB on a wiki of fewer than 350 pages, and the page metadata files are each about 250 KB. So this works for now, but perhaps not as the wiki continues to grow. And indexing both 1-grams and 2-grams would blow it, and the i<n>.idx files, up even more. (When I indexed the site by 1..3-grams, it churned out i<1..137>.idx, though I had also rewritten the tokenizer to permit hyphens and apostrophes in terms and did not strip URLs or stopwords. Still, the full version of this extension as I currently imagine it will have a very bulky data store.)
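One way the per-page metadata might be kept small is to persist only the top N cosines rather than every pairwise value. A rough sketch, assuming DokuWiki's p_set_metadata()/p_get_metadata() helpers; the 'similar' key and the `$cosines` array are hypothetical:

```php
<?php
// Sketch: cache only the top 20 neighbours in the page's metadata
// instead of the full cosine list, to keep the .meta files small.

$id = 'wiki:some_page';
$cosines = ['wiki:other_page' => 0.72, 'wiki:third_page' => 0.41];

arsort($cosines); // highest cosine first
p_set_metadata($id, ['similar' => array_slice($cosines, 0, 20, true)]);

// Later, when rendering the "similar pages" list:
$similar = p_get_metadata($id, 'similar');
```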
Would there be a way to exploit search results to find similar pages and identify shared prominent words and phrases? Would that save processing time and disk quota?