This seems to be a multi-faceted issue. One reason is that Thai script is broken up as "one character, one word", which seems to make it hard to index larger pages. Anyone who is generally interested in this issue can follow the topic
here.
Quote from the inc/indexer.php file, line 18 and following:
// Asian characters are handled as words. The following regexp defines the
// Unicode-Ranges for Asian characters
// Ranges taken from http://en.wikipedia.org/wiki/Unicode_block
// I'm no language expert. If you think some ranges are wrongly chosen or
// a range is missing, please contact me
define('IDX_ASIAN1','[\x{0E00}-\x{0E7F}]'); // Thai
It's not clear why it is done this way. Admittedly, it can sometimes seem as if there are no separate words, but in certain Asian scripts the words are usually separated by zero-width spaces if the text is written properly. One remaining issue is, of course, punctuation and other "special" characters.
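To make the difference concrete, here is a rough PHP sketch. It is not the code from inc/indexer.php, and the function names and the THAI_RANGE constant are made up for illustration. It only reuses the Thai range quoted above and compares "one character, one word" with splitting at zero-width spaces (U+200B), assuming the text actually contains them.

```php
<?php
// Hypothetical illustration only; this is NOT DokuWiki's tokenizer.
// Same Thai range as in the quoted IDX_ASIAN1 define.
const THAI_RANGE = '[\x{0E00}-\x{0E7F}]';

// "One character, one word": every code point in the Thai block
// becomes its own token.
function tokens_per_character(string $text): array
{
    preg_match_all('/' . THAI_RANGE . '/u', $text, $matches);
    return $matches[0];
}

// Alternative (an assumption, not something the indexer does):
// split at ZERO WIDTH SPACE (U+200B) or ordinary whitespace.
function tokens_by_zwsp(string $text): array
{
    return preg_split('/[\x{200B}\s]+/u', $text, -1, PREG_SPLIT_NO_EMPTY);
}

// Two Thai words ("language" and "Thai") separated by a zero-width space.
$sample = "ภาษา\u{200B}ไทย";

echo count(tokens_per_character($sample)), "\n"; // prints 7
echo count(tokens_by_zwsp($sample)), "\n";       // prints 2
```

With per-character tokens, the number of index entries grows with the number of characters rather than the number of words, which may be part of why large pages are a problem. The second function only helps if the page text really contains zero-width spaces.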