
Item5529: SEARCHes of type word do not work if word is non-English and with TWiki running utf-8

Item Form Data

AppliesTo: Engine
Component:
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: patch
ReleasedIn: 4.2.1, 5.0.0


Detail

SEARCHes of type word do not work if the word is non-English and TWiki is set up for UTF-8.

This Danish text: Danmarks måske kommende statsminister Lars Løkke Rasmussen er ikke så indlysende en kandidat til posten som for blot et par uger siden, skriver Morgenavisen Jyllands-Posten.

followed by these searches

First Regex which works

Searched: Løkke

Results from Bugs web retrieved at 10:08 (GMT)

SEARCHes of type word do not work if word is non English and the TWIki is setup for UTF8 This Danish text: Danmarks m...
SEARCHes of type query using text ~ has been broken since the released version of 4.2.0 First I thought it was a UTF 8 issue but I have confirmed that this issue is...
Number of topics: 2

Then query

Searched: text ~ '*Løkke*'

Results from Bugs web retrieved at 10:08 (GMT)

SEARCHes of type word do not work if word is non English and the TWIki is setup for UTF8 This Danish text: Danmarks m...
SEARCHes of type query using text ~ has been broken since the released version of 4.2.0 First I thought it was a UTF 8 issue but I have confirmed that this issue is...
Number of topics: 2

And finally word

Searched: Løkke

Results from Bugs web retrieved at 10:08 (GMT)

SEARCHes of type word do not work if word is non English and the TWIki is setup for UTF8 This Danish text: Danmarks m...
SEARCHes of type query using text ~ has been broken since the released version of 4.2.0 First I thought it was a UTF 8 issue but I have confirmed that this issue is...
Number of topics: 2

Note that here on Bugs we do not run UTF-8. You have to copy the examples to a UTF-8 TWiki to see the problem.

It also seems that the query search with text ~ does not really work at all here on Bugs, which runs iso-8859. It does work on my T42 with utf-8.

-- TWiki:Main/KennethLavrsen - 13 Apr 2008

The problem is still there after SVN16656 as well.

Regex works. But both word and query searches fail if the word you search for contains non-English characters and TWiki runs UTF-8.

-- TWiki:Main.KennethLavrsen - 13 Apr 2008

Would this be a simple (workaround) fix: scan for punctuation and whitespace instead of Perl word boundaries?
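
For readers who want to see what this suggestion might look like, here is a minimal sketch in plain Perl. It is an illustration only, not code from TWiki: the helper name delimited_word_pattern, the delimiter class and the sample string are all assumptions, and the strings are treated as raw UTF-8 bytes.

```perl
use strict;
use warnings;

# Hypothetical helper: build a pattern that delimits the word with
# whitespace, ASCII punctuation, or the string ends instead of \b.
sub delimited_word_pattern {
    my ($word) = @_;
    # Escape regex meta-characters by hand rather than with quotemeta.
    (my $escaped = $word) =~ s/([\[\]|\/\\\$\^*()+{};\@?.])/\\$1/g;
    my $delimiter = '[\s[:punct:]]';
    return qr/(?:\A|$delimiter)$escaped(?:\z|$delimiter)/;
}

# Part of the Danish sample text as raw UTF-8 bytes (C3 B8 is the letter ø).
my $text = "statsminister Lars L\xC3\xB8kke Rasmussen er ikke";

# The whole word is surrounded by spaces; the fragment "kke" is not.
my $whole_word = $text =~ delimited_word_pattern("L\xC3\xB8kke") ? 1 : 0;
my $fragment   = $text =~ delimited_word_pattern("kke")          ? 1 : 0;
print "whole word: $whole_word, fragment: $fragment\n";
```

One visible trade-off of this approach is that the underscore counts as punctuation here, whereas \b treats it as part of a word.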

-- TWiki:Main.PeterThoeny - 12 May 2008

I do not understand the idea behind this workaround.

We are talking about searching for plain simple words.

If you cannot search for plain words in languages that do not use only A-Z (that is, the majority of this world), then the search is in practice totally worthless. This needs to be fixed if people are to be able to use UTF-8.

-- TWiki:Main.KennethLavrsen - 13 May 2008

Search also does not work when the search word contains a single quote, as in TWiki's.

-- TWiki:Main.ArthurClemens - 24 May 2008

Query using text ~ "something" does not work with English words either, nor in ISO-8859-1. It seems Query is simply broken now.

-- TWiki:Main.KennethLavrsen - 26 May 2008

After having fixed 5529 (Sven still needs to check in the fix on SVN) I have been able to debug this one further and I know the exact root cause.

The problem is in lib\TWiki\Store\SearchAlgorithms\Forking.pm.

It only occurs in a search where we are looking for word boundaries, but it is not the \b that is the problem.

These are the code lines:

    if ($options->{wordboundaries} ) {
        $searchString = '\b'.quotemeta( $searchString ).'\b';
    }

The problem is the quotemeta( $searchString ), which screws up the string when it contains Unicode characters.

Crawford, you added this code originally. What is the quotemeta supposed to do? We obviously need to do a similar operation in a different way, but before I just remove the function I need to understand what it is doing and what to watch out for.

-- TWiki:Main.KennethLavrsen - 30 May 2008

I'm really surprised to hear that quotemeta fails with UTF-8 encoding. quotemeta is a standard Perl function used to escape regular expression meta-characters in the search string. However, when you read the doc in detail, you can see that it is absolute shit. I quote: "all characters not matching /[A-Za-z_0-9]/ will be preceded by a backslash in the returned string, regardless of any locale settings". Note the "regardless of any locale settings" bit, which ensures it won't work for any multibyte character encoding.
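
The documented behaviour can be seen in a small standalone sketch. It assumes the search string arrives as a plain byte string holding the UTF-8 encoding of Løkke (no Perl UTF-8 flag, and no `use feature 'unicode_strings'` in effect); the variable names are made up for the illustration.

```perl
use strict;
use warnings;

# UTF-8 encoding of "Løkke" as raw bytes: ø (U+00F8) is the two bytes C3 B8.
my $search_bytes = "L\xC3\xB8kke";

# quotemeta backslash-escapes every byte outside [A-Za-z_0-9],
# so both bytes of the ø sequence get their own backslash.
my $quoted = quotemeta($search_bytes);

# Show the result byte by byte so the inserted backslashes (5C) are visible.
my $hex = join ' ', map { sprintf '%02X', ord($_) } split //, $quoted;
print "$hex\n";    # prints: 4C 5C C3 5C B8 6B 6B 65
```

A backslash now sits between the two bytes of the multibyte character, so anything that reads the pattern as UTF-8 no longer sees a valid ø, which is presumably why the word search stops matching.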

The simplest solution I can think of is to replace quotemeta with a method that actually recognises valid grep meta-characters.

    if ($options->{wordboundaries} ) {
        $searchString =~ s#([][|/\\$^*()+{};@?.{}])#\\$1#g; # Can't use quotemeta because $searchString may be UTF8 encoded
        $searchString = '\b'.$searchString.'\b';
    }

If the above code doesn't work, try converting the string to unicode first, by adding

    $searchString = Encode::decode($TWiki::cfg{Site}{CharSet}, $searchString) if $TWiki::cfg{Site}{CharSet};

as the first line in the condition block. If this causes a Wide character in print error, then add

    $searchString = Encode::encode($TWiki::cfg{Site}{CharSet}, $searchString) if $TWiki::cfg{Site}{CharSet};

as the last line in the condition block.
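
As a rough standalone illustration of the decode/encode variant, the sketch below runs the same steps outside TWiki. The variable $charset is a hypothetical stand-in for $TWiki::cfg{Site}{CharSet}, and the sketch assumes a reasonably recent Perl, where quotemeta leaves non-ASCII letters alone in decoded (character) strings.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical stand-in for $TWiki::cfg{Site}{CharSet}.
my $charset = 'utf-8';

# "Løkke" as it would arrive on a UTF-8 site: raw bytes, not characters.
my $search_string = "L\xC3\xB8kke";

# Decode to Perl characters so quotemeta sees ø as one letter, not two bytes.
$search_string = Encode::decode($charset, $search_string);
$search_string = '\b' . quotemeta($search_string) . '\b';

# Encode back to bytes before the pattern leaves Perl.
$search_string = Encode::encode($charset, $search_string);

# Dump the bytes of the final pattern for inspection.
print join(' ', map { sprintf '%02X', ord($_) } split //, $search_string), "\n";
```

With the decode step in place the two bytes of ø stay adjacent in the final pattern; only the surrounding \b markers are added.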

Note that all uses of quotemeta in the code that operate on data that is potentially UTF-8 encoded will be similarly affected. I think this problem would "just go away" if TWiki used Unicode strings internally. This is a problem specific to multibyte encodings such as UTF-8.

-- CrawfordCurrie - 30 May 2008

Working on this.

Tried the first solution and it works.

Tried the 2nd solution with Encode::decode. It also works. I did not need to use the Encode::encode.

I only tried with my test topic and only in utf-8.

I will try different other searches and combinations before I check in a fix.

For the moment I am mostly keen on the 2nd solution because it seems less of a hack.

I cannot help thinking that the $searchString = Encode::decode($TWiki::cfg{Site}{CharSet}, $searchString) if $TWiki::cfg{Site}{CharSet}; operation should happen a lot earlier in the code to prevent other bugs that we have not seen reveal themselves yet.

Something for me to investigate a little further this weekend.

Thanks for following up on my questions Crawford.

-- TWiki:Main.KennethLavrsen - 31 May 2008

Note that you will have to test with at least one multibyte encoding (e.g. UTF-8) with a multibyte search string, with at least one high-bit encoding such as iso-8859-1 checking high-bit characters, and with normal 7-bit ASCII. You should also really test all legal meta-characters in regex searches.
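
A small table-driven check along these lines might look like the sketch below. It is only a starting point under stated assumptions: escape_meta is a hypothetical helper that mirrors the character set of the substitution proposed earlier in this thread, and each category is represented by a single made-up sample rather than an exhaustive set.

```perl
use strict;
use warnings;

# Hypothetical helper: escape only ASCII regex meta-characters and leave
# every other byte (including multibyte UTF-8 sequences) untouched.
sub escape_meta {
    my ($string) = @_;
    $string =~ s/([\[\]|\/\\\$\^*()+{};\@?.])/\\$1/g;
    return $string;
}

# One sample per category: [label, input bytes, expected result].
my @cases = (
    [ 'UTF-8 multibyte',     "L\xC3\xB8kke", "L\xC3\xB8kke" ],
    [ 'iso-8859-1 high bit', "L\xF8kke",     "L\xF8kke" ],
    [ '7-bit ASCII',         'TWiki',        'TWiki' ],
    [ 'meta-characters',     'a.b*(c)',      'a\.b\*\(c\)' ],
);

my $failures = 0;
for my $case (@cases) {
    my ($label, $input, $expected) = @$case;
    my $ok = escape_meta($input) eq $expected;
    $failures++ unless $ok;
    printf "%-22s %s\n", $label, $ok ? 'ok' : 'FAILED';
}
```

This only checks the escaping step in isolation. The real test still has to run the searches end to end on sites configured for each encoding.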

-- TWiki:Main.CrawfordCurrie - 31 May 2008

I tried to run $searchString through Encode much earlier in the code. It seems to have a negative effect on the non-word types of search, resulting in search results containing garbage. So it is a bit of a can of worms.

I will continue learning all I can, but we probably have to settle for the fix that targets this particular problem for 4.2.1.

-- TWiki:Main.KennethLavrsen - 31 May 2008

I decided to go for the solution that does not use Encode. When we later change TWiki to use utf-8 in general, additional hidden Encode conversions could do harm, so the regex substitution is a better short-term solution which will still work when we go utf-8.

-- TWiki:Main.KennethLavrsen - 02 Jun 2008

Search works with utf-8 and non-English words only if '\b' is removed from the following (I'm sure it must work with \sword\s):

$searchString = '\b'.quotemeta( $searchString ).'\b';

or

$searchString = '\b'.$searchString.'\b';

-- TWiki:Main.VictorKasatkin - 26 Apr 2009

Please open up a new bug report to report an issue and proposed fix.

-- TWiki:Main.PeterThoeny - 30 Apr 2009

ItemTemplate
Summary SEARCHes of type word do not work if word is non-English and with TWiki running utf-8
ReportedBy TWiki:Main.KennethLavrsen
Codebase ~twiki4
SVN Range TWiki-5.0.0, Thu, 03 Apr 2008, build 16612
AppliesTo Engine
Component

Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:16871 TWikirev:16873
TargetRelease patch
ReleasedIn 4.2.1, 5.0.0
Topic revision: r17 - 2009-04-30 - PeterThoeny
 
This site is powered by the TWiki collaboration platform. Copyright © 2008-2019 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.