Strange behaviour of syntax recognition patterns with german umlauts
Albrecht #1
Using the pattern '\b[_A-ZÄÖÜ][A-ZÄÖÜ][_a-zäöüßA-ZÄÖÜ\-]*\b' with Lexer::addSpecialPattern, I would expect to get matches for words like this:
ALBERT EINSTEIN
_NABU
JOHAN_de JONG
ÄUSSERUNG
The first 3 lines are matched, as expected. For the 4th line, I only get a match on USSERUNG: the Ä is handled as a separate word with a word boundary after it. Maybe this is a problem with PHP 5.3.

What is really strange are matches like these:
Bücher
Bütte
Häuser
while words like
Geäußert
Grasbüschel
don't match. The rule seems to be: one of [äöüÄÖÜß] at position 2 causes a false match.

I use PHP 5.3 and my file encoding is UTF-8.

What is the reason?
Albrecht #2
Now I changed the pattern to
$this->Lexer->addSpecialPattern('\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b', $mode, 'plugin_smallcaps_substitute');
and it still doesn't work correctly.

Running
<php>
echo(preg_match('/\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b/', 'HÜrbine') ? 'match' : 'no match');
</php>
in preview prints 'no match', while
<php>
echo(preg_match('/\b_?\p{Lu}\p{M}*\p{Lu}\p{M}*[-_\p{L}\p{M}]*\b/u', 'HÜrbine') ? 'match' : 'no match');
</php>
matches correctly.

Is it possible that the DokuWiki recognition pattern constructed via addSpecialPattern is not compiled as a Unicode pattern?

At least, this assumption would explain some matching problems with my pattern.
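
For readers who want to see why a missing u modifier would explain both symptoms, here is a small sketch. It assumes the script is saved as UTF-8; firstMatch() is a hypothetical helper made up for this illustration, not a DokuWiki function. Without u, PCRE works on single bytes, so a class like [A-ZÄÖÜ] becomes a set of bytes (A-Z plus the bytes 0xC3, 0x84, 0x96, 0x9C), and \b sees the two bytes of Ä as non-word characters in the default C locale. The byte-mode results can differ under other locales, so no expected output is given for them.

```php
<?php
// Sketch only: compare byte-mode and UTF-8-mode matching of the original pattern.
// Assumption: this file is saved as UTF-8, like the plugin source in the thread.

// Hypothetical helper: returns the first match or null.
// The pattern must not contain the "/" delimiter.
function firstMatch($pattern, $flags, $subject)
{
    return preg_match('/' . $pattern . '/' . $flags, $subject, $m) === 1 ? $m[0] : null;
}

$pattern = '\b[_A-ZÄÖÜ][A-ZÄÖÜ][_a-zäöüßA-ZÄÖÜ\-]*\b';

// In UTF-8, each umlaut is two bytes starting with 0xC3.
echo 'Ä = ', bin2hex('Ä'), ', ü = ', bin2hex('ü'), "\n";

foreach (array('ÄUSSERUNG', 'Bücher', 'Geäußert') as $word) {
    $byteMode = firstMatch($pattern, '', $word);   // bytes; result depends on locale
    $utf8Mode = firstMatch($pattern, 'u', $word);  // whole code points
    echo $word,
        ': without u = ', ($byteMode === null ? '(no match)' : $byteMode),
        ', with u = ', ($utf8Mode === null ? '(no match)' : $utf8Mode), "\n";
}
```

In byte mode the second class can match the 0xC3 lead byte of ü in "Bücher", which would produce exactly the "umlaut at position 2" false matches described in the first post.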
Albrecht #3
In fact: recognition patterns do not use the u pattern modifier:
lexer.php(216):
function _getPerlMatchingFlags() {
     return ($this->_case ? "msS" : "msSi");
}

What is the reason?


After fixing _getPerlMatchingFlags() to

function _getPerlMatchingFlags() {
     return ($this->_case ? "umsS" : "umsSi");
}

my pattern recognition works. I hope it won't have bad side effects.
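
One side effect that is worth knowing about: with the u modifier, PCRE validates the subject as UTF-8, and preg_match() returns false (not 0) when the input contains invalid byte sequences. The following is a minimal sketch of how that can be detected. matchUtf8() is a hypothetical helper for illustration only; it is not how the DokuWiki lexer calls PCRE.

```php
<?php
// Sketch only: shows what happens when a /u pattern meets invalid UTF-8 input.
// matchUtf8() is a made-up helper; the pattern must not contain the "/" delimiter.
function matchUtf8($pattern, $subject)
{
    $result = preg_match('/' . $pattern . '/umsS', $subject, $m);
    if ($result === false) {
        // With the u modifier, a subject that is not valid UTF-8 is an error,
        // reported through preg_last_error() rather than as "no match".
        return preg_last_error() === PREG_BAD_UTF8_ERROR ? 'invalid UTF-8' : 'regex error';
    }
    return $result === 1 ? $m[0] : null;
}

var_dump(matchUtf8('\p{Lu}+', "\xC3\x84B")); // valid UTF-8: "ÄB"
var_dump(matchUtf8('\w+', "\xC3\x28"));      // 0xC3 without a continuation byte
```

So pages or plugin input that are not valid UTF-8 would fail to match at all instead of matching byte by byte.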
This post was edited 2 times, last on 2014-02-23, 21:00 by Albrecht.
s.sahara #4
The 'u' (PCRE_UTF8) modifier should be set so that DokuWiki syntax patterns containing UTF-8 sequences are treated as UTF-8.

Fortunately, DokuWiki release 2018-04-22 (Greebo) requires at least PHP 5.6, and the default value of the PHP directive default_charset has been "UTF-8" since PHP 5.6.0. I therefore guess that adding the 'u' (PCRE_UTF8) modifier in _getPerlMatchingFlags() in inc/parser/lexer.php will no longer cause PHP errors in most local installations.