Thanks, Andi,
Pagequery is a great plugin and very much suited to this purpose. With it, you can query the whole wiki in a variety of ways, e.g. by tags, words, etc., and sort and group the results by date. So, on every page in the wiki--in the sidebar, even--you can create what I've been thinking of as 'presence' and 'recency'. Rather like the stacks of paper on my desk right now: I can be working on one task while, in my peripheral vision, I'm aware of other stuff, with the more recent and more important material nearest me. DokuWiki lets me brain-dump stuff I don't want to forget, and perhaps, with pagequery, it can also help me 'remember to remember'.
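For anyone reading along, a pagequery call is a double-curly-brace macro placed in a page or sidebar. Going from my (possibly faulty) memory of the plugin docs, a recency-style query would look something like the following -- treat the exact option names as approximate:

    {{pagequery>@notes; fulltext; sort=mdate:desc; group; limit=10}}

i.e. a full-text search restricted to the notes namespace, sorted newest-first by modification date, grouped, and capped at ten results.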
However, based on my quick glance at pagequery, I think it does have one piece missing. Basically, it's a query of what's there, and just like any search query, you have to know what you are looking for (e.g. "stories about coconuts", "lecture notes on subjectivity"), or at least the parameters of what you're looking for ("lecture notes written in the last year, or this time last year..."), in order to find it. Likewise, its report of the full inventory depends in part on having filed things in a meaningful way. My desk in front of me is a living testament to my inability to do this... :) And yet, my desk in front of me is actually very well organized! It's just that I didn't organize it! Much as Conway's Game of Life shows us that cellular automata can bootstrap higher-order complexity, so too are the complex assemblages piled before my eyes pregnant with an immanent ordering--one that is highly practical, if a bit fuzzy. It has presence, recency, and also relevance.
Would it be possible, then, for pagequery to access a similar kind of implicit information about the temporal and conceptual relationships in a collection of wiki articles? Yes, I think so. I understand now that this has been an area of active research for some decades, so I recognize that it's my ignorance that's freeing me to speak boldly. Anyways...
So, the docs on the full text index at
https://www.dokuwiki.org/devel:fulltextindex describe the contents of the index files. Basically, i<wordlength>.idx contains a list of words and, for each word, a series of page IDs (stored elsewhere) and the number of occurrences of the word in each of those pages. Thus, for a given page, we already know the term frequency of each word, and a count of the pages listed under a word gives us its document frequency. The total number of documents is also known, so the TF*IDF weight is easily calculable. Perhaps the search function already does this, or assigns some other weight. (The documentation seemed to imply that there was weighting, but based only on frequency.)
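To make that concrete, here is a minimal sketch (in Python rather than PHP, and assuming the .idx files have already been parsed into a word => {pageID: count} mapping -- the names here are mine, not DokuWiki's):

    import math

    # index: {word: {page_id: occurrences}}, parsed from the i<wordlength>.idx files
    # n_docs: total number of pages in the wiki
    def tfidf_weights(index, n_docs):
        weights = {}  # page_id -> {word: tf * idf}
        for word, postings in index.items():
            df = len(postings)           # document frequency: pages containing the word
            idf = math.log(n_docs / df)  # rare words get a big boost, common ones ~0
            for page_id, tf in postings.items():
                weights.setdefault(page_id, {})[word] = tf * idf
        return weights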
A new development could be to take these weights for each term and pull together another idx table with page IDs and an array of the TF*IDF weights for each term. This can be treated as a vector, and for two article vectors we can calculate the cosine of the angle between them. Based on my 72 hours of frantic reading on this topic, this is apparently a good measure of how similar the information of two documents is. A small angle (cosine near 1) means that the two documents have many words in common that are otherwise relatively rare in the context of the wiki. When you compute cosine similarities for every pairing of documents in a collection, you get a matrix of cosines, one per pair: a relative score of their informational distance.
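The cosine itself is just the normalized dot product of the two weight vectors. Continuing the sketch above, with sparse dicts rather than dense arrays, since any one page uses only a fraction of the wiki's vocabulary:

    import math

    # a, b: {word: weight} vectors for two pages, as built by tfidf_weights()
    def cosine(a, b):
        dot = sum(w * b[t] for t, w in a.items() if t in b)
        norm_a = math.sqrt(sum(w * w for w in a.values()))
        norm_b = math.sqrt(sum(w * w for w in b.values()))
        if norm_a == 0 or norm_b == 0:
            return 0.0
        return dot / (norm_a * norm_b)

    # one cosine per unordered pair of pages
    def similarity_matrix(weights):
        ids = list(weights)
        return {frozenset((p, q)): cosine(weights[p], weights[q])
                for i, p in enumerate(ids) for q in ids[i + 1:]}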
I used Perl modules to do this for a collection of 343 research notes. It took about five minutes for the computer to create the matrix and then print out an HTML page for each note, with its content and a list of related pages. I think it stood up really well. All of these notes also contained a list of keywords in a separate field, and, for a given note, the rankings of the pages based on the cosine similarity of shared tags and of shared words were actually pretty close. If anything, the ranking of relevance based on the full text seemed better than my tagging of the documents. It brought together notes about the same people, which was not something I used as a keyword. Proper names are relatively rare occurrences, and so they get a strong weighting. I found stuff I had forgotten about, or hadn't scrupulously tagged when I wrote it. It was awesome!
But there was a lot of computational power being applied. So, much as the indexer web bug silently updates the full-text .idx files, that would be the better way to implement this kind of search engine in DW. indexer.php updates the existing idx files with new data on frequency and location. A modified indexer, if it does not do so already, could also update another set of records holding the vector of term weights for each page. When it has a free moment, it could compute the cosine similarity matrix and write it to another idx file.
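In pseudo-Python, the incremental step I have in mind would be something like this (the hook name is hypothetical -- I don't yet know what indexer.php actually exposes):

    import math

    # imagined hook, called after the normal index update for a saved page;
    # reuses cosine() from the sketch above
    def on_page_indexed(page_id, index, n_docs, vectors, sim):
        # rebuild just this page's weight vector from the fresh index...
        vectors[page_id] = {w: postings[page_id] * math.log(n_docs / len(postings))
                            for w, postings in index.items() if page_id in postings}
        # ...then recompute only the pairs involving this page
        for other in vectors:
            if other != page_id:
                sim[frozenset((page_id, other))] = cosine(vectors[page_id], vectors[other])
        # caveat: a save also shifts the idf of every word it touches, so other
        # pages' vectors drift slightly stale until their own next reindex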
Then, when a user browses a page that's been indexed, something like pagequery could run, except its query would be based on these additional pieces of information from the full-text indexes. Say we place the pagequery double-curly-brace syntax on every page. Say also it can grab the ID of the current page. It could then look this page up in the cosine similarity matrix, sort the entries according to its criteria, and return a list of other pages ranked by relevance. Each article would be accompanied by a list of 'related articles', their degree of relevance already having been calculated when the page content was added to the full-text indexes.
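At browse time, the plugin's job then reduces to a lookup and a sort. A sketch, again with made-up names:

    # sim: the precomputed pairwise cosine matrix; current: ID of the page being viewed
    def related_pages(sim, current, limit=10):
        scores = [(next(iter(pair - {current})), s)
                  for pair, s in sim.items() if current in pair]
        scores.sort(key=lambda entry: entry[1], reverse=True)
        return scores[:limit]  # [(page_id, similarity), ...], most relevant first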
I will keep reading about this, and also try to look at indexer.php and fulltext.php to see what's really going on there. Are weights assigned to words? Is this information stored? Are there other theories of the informational relationships between documents that would be worth considering, from either practical or conceptual perspectives?
Cheers,
Ryan