Syntax plugin to match every word on the page
zioth #1
Member since Jul 2011 · 77 posts
For my autolink4 plugin, I'm trying to collect all page titles on the site and link them wherever they appear on other pages. This could be done with a ton of addSpecialPattern() matches, but I'd quickly reach the limit of Doku's lexer, which can't handle more than 500-1000 patterns. Instead, I'd like to make my own parser, which searches every bit of text on the page not already processed by another syntax plugin, and looks for strings that match page titles.
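
To make the "own parser" idea concrete, here is a minimal sketch in JavaScript (same language as the debugging examples further down), assuming the plain text has already been isolated from anything other plugins handle. The names findTitles and maxWords are hypothetical and nothing here is DokuWiki API. It checks runs of words against a set of titles instead of registering one lexer pattern per title.

```javascript
// Hypothetical helper: find known page titles in plain text without
// registering one lexer pattern per title.
function findTitles(text, titles, maxWords = 5) {
  // Lower-cased lookup table, so each candidate phrase is one set lookup.
  const known = new Set(titles.map(t => t.toLowerCase()));

  // Record where each word starts and ends in the text.
  const words = [];
  const wordRe = /[\w'\-]+/g;
  let m;
  while ((m = wordRe.exec(text)) !== null) {
    words.push({ start: m.index, end: m.index + m[0].length });
  }

  const hits = [];
  let i = 0;
  while (i < words.length) {
    let matched = 0;
    // Try the longest candidate first so "upon a time" beats "upon".
    for (let n = Math.min(maxWords, words.length - i); n >= 1; n--) {
      const phrase = text.slice(words[i].start, words[i + n - 1].end);
      if (known.has(phrase.toLowerCase())) {
        hits.push({ title: phrase, index: words[i].start });
        matched = n;
        break;
      }
    }
    // Skip past a matched phrase, otherwise advance by one word.
    i += matched || 1;
  }
  return hits;
}

console.log(findTitles('Once upon a time there were three bears',
                       ['upon a time', 'Bears']));
```

The cost depends on the number of words on the page and on maxWords, not on the number of titles, which is the point of avoiding one pattern per title. Because the candidate phrase is sliced straight out of the text, punctuation between words prevents a match.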

Is there a way to do this? I tried calling addSpecialPattern('(?:[\w\'\-]+\s*)+') to match all phrases, stopping at punctuation. I gave it a getSort() of 1000, in the hope that it wouldn't interfere with other syntax plugins, but it does. My other plugin with a getSort() of 999 no longer functions.

I also tried `function accepts($mode) {return true;}`. That helped a little, but still interfered with certain kinds of special patterns. I doubt it's the right solution anyway.

The alternative is to create an action plugin that post-processes the page, using Doku's lexer pattern to omit parts of the page that have already been worked on by syntax plugins, but that's very error-prone, if it's possible at all.

Another alternative is for me to submit a pull request to chunk the pattern in Doku's lexer, allowing it to accept more than 1000 special patterns. Though if there are 100,000 pages on the site, that's 100,000 special patterns, which means running through at least 100 huge regexes on every pass through the page.
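
A rough JavaScript sketch of what chunking would mean, just to show the arithmetic. This is not Doku's lexer code; buildChunkedPatterns, chunkSize and the "Page N" titles are made up for illustration.

```javascript
// Escape regex metacharacters so a title is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: split titles into groups of `chunkSize` and build
// one alternation regex per group.
function buildChunkedPatterns(titles, chunkSize = 1000) {
  const patterns = [];
  for (let i = 0; i < titles.length; i += chunkSize) {
    const group = titles.slice(i, i + chunkSize).map(escapeRegExp);
    patterns.push(new RegExp('\\b(?:' + group.join('|') + ')\\b', 'g'));
  }
  return patterns;
}

// 100,000 made-up titles at 1,000 per chunk gives 100 regexes,
// each of which would have to be run over the page text.
const demoTitles = Array.from({ length: 100000 }, (_, i) => 'Page ' + i);
const demoPatterns = buildChunkedPatterns(demoTitles);
console.log(demoPatterns.length); // 100
```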
zioth #2
This also didn't help:
    // Allow every mode type (the keys of $PARSER_MODES) inside this plugin's matches.
    function getAllowedTypes() {
        global $PARSER_MODES;
        return array_keys($PARSER_MODES);
    }
zioth #3
I did some more debugging, and found that the problem is with the way regular expressions work. When two expressions both match in the lexer, the one that starts earlier wins.

Since (?:[\w'\-]+ *)+ always matches starting with the first word in a sentence, it always wins against my specific-phrase patterns, unless those patterns start at the beginning of the sentence too. I tried (?:[\w'\-]+ *)+? to make it non-greedy, but then it competes with itself, so it always matches exactly one word.

I want a regex that is non-greedy, so it never wins against other regexes, but I want it to be greedy when compared against itself. Any regex masters know whether that's possible?

I've also tried:
(?:(?:[\w'\-]+ *){1,1000})+?
(?:(?:[\w'\-]+ *)+)+?

Edit: More detail:

Here's an example of what I'm trying to do. The '|' separates my two plugins, and my example is in JavaScript for convenient debugging:

  • 'Once upon a time there were three bears'.match(/(\bupon a time\b)|(?:[\w\'\-]+ *)+?/g)
  • Returns: ["Once ", "upon a time", "there ", "were ", "three ", "bears"].
  • Expected: ["Once", "upon a time", "there were three bears"]
  • The first expression works, but the second breaks the string down into single words, rather than phrases.

  • 'Once upon a time there were three bears'.match(/(\bupon a time\b)|(?:[\w\'\-]+ *)+/g)
  • Returns: ["Once upon a time there were three bears"].
  • Expected: ["Once", "upon a time", "there were three bears"]
  • The first expression never gets a chance to match, because the catch-all starts at the first word and consumes the whole sentence.

  • 'Once upon a time there were three bears'.match(/(\bOnce upon a time\b)|(?:[\w\'\-]+ *)+/g)
  • Returns: ["Once upon a time", "there were three bears"]
  • This is exactly what I want, but it only works if the first expression matches the beginning of the sentence.
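
One possible direction, sketched in the same JavaScript style as the examples above: put a negative lookahead inside the catch-all so it refuses to start a word at any position where the specific pattern would match. The variable names specific and guarded are made up for this sketch, and it is untested against Doku's lexer, so treat it as an assumption rather than a fix.

```javascript
// Hypothetical sketch: `specific` stands in for the other plugin's pattern.
const specific = "\\bupon a time\\b";

// Catch-all that stays greedy against itself, but stops before any
// position where the specific pattern would match.
const guarded = new RegExp(
  "(" + specific + ")|(?:(?!" + specific + ")[\\w'\\-]+ *)+", "g");

const parts = "Once upon a time there were three bears".match(guarded);
console.log(parts);
// ["Once ", "upon a time", "there were three bears"]
```

The catch-all still keeps the trailing space after "Once", so the result would need trimming to match the expected output exactly. The catch is that the lookahead has to contain every specific pattern, so with thousands of titles it runs into the same size problem as before.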
This post was edited on 2018-12-28, 01:22 by zioth.