Item3652: I18N: Urls to file attachments that has umlauts only works in some browsers

Item Form Data

AppliesTo: Engine
Component: I18N
Priority: Urgent
CurrentState: Closed
WaitingFor:
TargetRelease: minor
ReleasedIn: 4.2.0

Detail

This fix, included in TWiki 4.1.2, is broken for all character sets except ISO-8859-1 and EBCDIC - see comments at end. -- RD

How to reproduce:

  1. create a topic with umlauts in its name by clicking on Sandbox.TästTopic
  2. save it
  3. attach a file to it, optionally create a link in the topic text to it
  4. try to retrieve the attachment ... you can't
The store creates a directory in the PubDir for Sandbox.TästTopic using utf8 encoding for the ä. So your attachment is not really lost. It has just been put into the wrong directory: /pub/Sandbox/TästTopic (with the ä UTF-8 encoded). The rest of the system assumes the attachments to be in pub/Sandbox/TästTopic (with the ä not UTF-8 encoded) ... where they aren't. This happens both with UseLocale switched on or off. There's always the danger that users create topics with umlauts in their names even though you switched UseLocale off.

-- TWiki:Main/MichaelDaum - 16 Feb 2007

Here is a patch that solves the problem for me:
--- RcsFile.pm  (revision 12897)
+++ RcsFile.pm  (working copy)
@@ -83,6 +83,11 @@
             $this->{rcsFile} = $TWiki::cfg{DataDir}.'/'.
               $web.$rcsSubDir.'/'.$topic.'.txt,v';
         }
+
+        # remove utf8 encodings from filenames
+        utf8::downgrade($this->{attachment}) if $attachment && utf8::is_utf8($this->{attachment});
+        utf8::downgrade($this->{file}) if utf8::is_utf8($this->{file});
+        utf8::downgrade($this->{rcsFile}) if utf8::is_utf8($this->{rcsFile});
     }

     return $this;

Can anybody else with UTF8 knowledge please check back?

-- TWiki:Main/MichaelDaum

I tested your code with Perl 5.6.1 and this patch would break 5.6.1 support.

Undefined subroutine utf8::is_utf8 called at /var/www/twiki/lib/TWiki/Store/RcsFile.pm line 89

It would be silly to require people who are NOT using utf8 to install a utf8 CPAN library just to turn off a utf8 feature.

I am sure there are other ways to do the same.

So back to the drawing board for a new and better fix.

-- TWiki:Main.KennethLavrsen - 26 Feb 2007

I have looked at the other code using utf8.

And one thing is for sure. It is a perl 5.8 thing. So we can do what is done elsewhere in TWiki.

        # remove utf8 encodings from filenames
        if( $] >= 5.008 ) {
            utf8::downgrade($this->{attachment}) if $attachment && utf8::is_utf8($this->{attachment});
            utf8::downgrade($this->{file}) if utf8::is_utf8($this->{file});
            utf8::downgrade($this->{rcsFile}) if utf8::is_utf8($this->{rcsFile});
        }

This should work in 5.8 and not break 5.6 but naturally the utf8 fixes will not work either then.
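
As a rough illustration of the version guard above, the following standalone sketch wraps the downgrade in a helper. The helper name downgrade_name and the sample topic name are assumptions made for this example, not TWiki code; the second argument to utf8::downgrade (FAIL_OK) is also added here, so that the call returns false instead of dying on characters above 0xFF.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of TWiki): downgrade a name only on Perl 5.8+,
# so that Perl 5.6 never reaches the utf8:: calls at run time.
sub downgrade_name {
    my ($name) = @_;
    if ( $] >= 5.008 ) {
        # FAIL_OK (second argument) makes downgrade return false rather
        # than die when the string holds characters above 0xFF.
        utf8::downgrade( $name, 1 ) if utf8::is_utf8($name);
    }
    return $name;
}

my $topic = "T\x{e4}stTopic";    # sample name containing an a-umlaut
utf8::upgrade($topic);           # force Perl's UTF-8 internal representation
my $fixed = downgrade_name($topic);
print utf8::is_utf8($fixed) ? "still flagged\n" : "downgraded\n";
```

On Perl 5.6 the helper returns the name unchanged, which matches the note above that the utf8 fix simply does not apply there.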

Micha try and see if this works for you.

-- TWiki:Main.KennethLavrsen - 26 Feb 2007

I have tested above code in a non UTF environment and it seems to work fine and not disturb anything.

I also have tested that things are silent with respect to errors in perl 5.6.1.

So if the above fix works for you, Michael, let me know and I will check in the change.

-- TWiki:Main.KennethLavrsen - 27 Feb 2007

Your perl check works but I'd prefer to fix that issue for perl 5.6.1 also. The is_utf8 function has been moved during 5.8 from Encode::is_utf8 to utf8::is_utf8. It is still available in both places in 5.8 and functionally equivalent. So the following should work for both:

Index: lib/TWiki/Store/RcsFile.pm
===================================================================
--- lib/TWiki/Store/RcsFile.pm  (revision 12981)
+++ lib/TWiki/Store/RcsFile.pm  (working copy)
@@ -44,6 +44,7 @@
 use Assert;
 use TWiki::Time;
 use TWiki::Sandbox;
+use Encode;

 =pod

@@ -83,6 +84,11 @@
             $this->{rcsFile} = $TWiki::cfg{DataDir}.'/'.
               $web.$rcsSubDir.'/'.$topic.'.txt,v';
         }
+
+        # remove utf8 encodings from filenames
+        utf8::downgrade($this->{attachment}) if $attachment && Encode::is_utf8($this->{attachment});
+        utf8::downgrade($this->{file}) if Encode::is_utf8($this->{file});
+        utf8::downgrade($this->{rcsFile}) if Encode::is_utf8($this->{rcsFile});
     }

Note, that I added a use Encode; This is part of the perl distro. Can you check if it is also part of perl 5.6.1?
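
One way to avoid a hard compile-time dependency is to load Encode at run time and fall back when it is missing. This is only a sketch: have_encode and is_flagged_utf8 are hypothetical helper names, and returning 0 when Encode is unavailable is an assumption about what a site without Encode would want.

```perl
use strict;
use warnings;

# Hypothetical helper: try to load Encode at run time instead of "use Encode;".
# Returns 1 if the module could be loaded, 0 otherwise.
sub have_encode {
    return eval { require Encode; 1 } ? 1 : 0;
}

# Hypothetical helper: report whether a string carries Perl's UTF-8 flag.
# Without Encode (assumed here to mean a pre-5.8 Perl) it reports 0.
sub is_flagged_utf8 {
    my ($str) = @_;
    return 0 unless have_encode();
    return Encode::is_utf8($str) ? 1 : 0;
}

my $name = "T\x{e4}st.txt";
print "before upgrade: ", is_flagged_utf8($name), "\n";
utf8::upgrade($name);
print "after upgrade:  ", is_flagged_utf8($name), "\n";
```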

-- TWiki:Main.MichaelDaum - 27 Feb 2007

Encode was not shipped in 5.6.1. And most distributions did not have utf8 as default either. I have no problem uploading Tästfile.txt to my 5.6.1 based TWiki and downloading it again.

It makes no sense to add a requirement for Encode CPAN module when everything else in TWiki related to utf8 is inside if ( $] >= 5.008)

So I think my proposal is the best compromise.

-- TWiki:Main.KennethLavrsen - 27 Feb 2007

Per agreement with Michael - checked in the 5.8 only fix.

Ready for release

-- TWiki:Main.KennethLavrsen - 27 Feb 2007

Many linux distributions run with UTF8 as default/global LANG setting now (Fedora, Redhat EL, Suse ..) - I am unsure how this patch will affect those installations?

If I run with either

  • LANG=da_DK.ISO-8859-15 or
  • LANG=da_DK.UTF8
and set $TWiki::cfg{Site}{CharSet} accordingly, the two sets of settings should result in differently encoded names for .txt files and pub directories on disk - this is what I would expect.

Another thing that could have influenced this test is that some browsers (e.g. Internet Explorer) by default encode all links in UTF8 (attachments go to one directory) while other browsers (e.g. Firefox) by default encode all links in ISO-8859-x (attachments go to another directory) - but only when uploading to an I18N-named topic, i.e. one with an umlaut.

BEWARE: There are usually no problems when uploading to US-ASCII-named topics; encodings for attachments are handled more properly in that case.

Sandbox already has this code for handling filenames at upload time (only iso-8859-1 or -15):

    $fileName =~ s/ /_/go;
    # If in iso8859 surroundings and Unicode::Normalize is available, let's get rid of 8-bit chars in filenames
    if ( $TWiki::cfg{Site}{CharSet} =~ /^iso-?8859-?15?$/i ) {
        if( $] >= 5.008 && eval { require Unicode::Normalize } ) {
            require Encode;
            eval { use Unicode::Normalize };
            # Some normalizations need to be intercepted early
            $fileName =~ s/\xc4/AE/g;
            $fileName =~ s/\xc5/AA/g;
            $fileName =~ s/\xd6/OE/g;
            $fileName =~ s/\xdc/UE/g;
            $fileName =~ s/\xe4/ae/g;
            $fileName =~ s/\xe5/aa/g;
            $fileName =~ s/\xf6/oe/g;
            $fileName =~ s/\xfc/ue/g;
            #  convert to Unicode
            $fileName = NFD( $fileName );  # decompose (Unicode Normalization Form D)
            $fileName =~ s/\pM//g;         # strip combining characters
            # normalizations, Latin-1
            $fileName =~ s/\x{00c6}/AE/g;
            $fileName =~ s/\x{00d8}/OE/g;
            $fileName =~ s/\x{00df}/ss/g;
            $fileName =~ s/\x{00e6}/ae/g;
            $fileName =~ s/\x{00f8}/oe/g;
            $fileName =~ s/\x{0152}/OE/g;
            $fileName =~ s/\x{0153}/ae/g;
            # clear everything left that is 8-bit
            $fileName =~ s/[^\0-\x80]//g;
        }
    } 

(Perhaps these two pieces of code could be refactored into the same procedure; they are trying to do somewhat the same thing - the Sandbox procedure maps some chars (i.e. "ö") into double chars (i.e. "oe"), my guess would be that the utf8::downgrade simply maps it to single chars (i.e. "o")).
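
To compare what the two pieces of code do to a name such as "schön.txt", here is a small standalone sketch (the file name is made up for the example). As described in perldoc utf8, utf8::downgrade only changes Perl's internal representation of the string and keeps the ö, whereas decomposing with NFD and stripping combining marks turns it into a plain o, and a hard-coded substitution like the ones in the Sandbox code turns it into oe.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

my $name = "sch\x{f6}n.txt";    # made-up file name with an o-umlaut (0xF6)

# 1. upgrade + downgrade: the internal representation changes back and
#    forth, but the string still contains the same character.
my $down = $name;
utf8::upgrade($down);
utf8::downgrade($down);

# 2. NFD decomposition, then strip combining marks: the umlaut is dropped.
my $stripped = NFD($name);
$stripped =~ s/\pM//g;

# 3. Hard-coded substitution applied before any normalization.
( my $mapped = $name ) =~ s/\xf6/oe/g;

print "downgrade keeps the character: ", ( $down eq $name ? "yes" : "no" ), "\n";
print "NFD + strip:  $stripped\n";
print "substitution: $mapped\n";
```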

Anyway, before releasing this, please test that everything is OK with at least both Firefox and Internet Explorer and get an idea how it works with attachments created earlier on - I would say that a release note with at least an explanation of the spec change would be in order (some installations will need to rename their existing directories to not lose existing attachments). Remember to retrieve the files using both viewfile and directly through the /pub location (I recognize this Item is about the direct /pub links, my guess is viewfile links are not affected).

Note to self: This Item is along the lines of the utf2iso-workaround mentioned in TWiki:Plugins.ImagePluginDev, 3-5 Nov.

-- TWiki:Main.SteffenPoulsen - 27 Feb 2007

I will need help with this task. I would not know exactly what to look for. I do not have much knowledge how the locales work.

-- TWiki:Main.KennethLavrsen - 28 Feb 2007

I'm afraid I'm not much help either - I only see the problem, not the resolution. Perhaps TWiki:Main.RichardDonkin has an all-round view of the issue and can give a few hints.

-- TWiki:Main.SteffenPoulsen - 28 Feb 2007

Note that Item3698 made me revert the fix. It was not robust enough.

-- TWiki:Main.KennethLavrsen - 01 Mar 2007

Ok, I checked in a better fix. I found out that the is_utf8 check is not needed. The problem also occurs if your system uses utf8 thoroughly. You can't upload a regular file to a topic whose name is utf8 encoded. Note, that the upload code is broken as it creates the wrong pub directory triggered by the utf8 encoded characters in the filename. ATTACHURL (direct links) and viewfile both expect the same directory. The issue occurs with Firefox, Internet Explorer and konqueror. Direct links to attachments at utf8 encoded topics are still broken using Internet Explorer. Viewfile works for all browsers. So at least with the current code, topics don't seem to get lost anymore.

-- TWiki:Main.MichaelDaum - 02 Mar 2007

I would like to have someone with an installation running utf8 test this Item, but the patch works alright for my iso8859-installation after some testing.

It even made using the ImagePlugin possible without the utf2iso-workaround (with suggested patch at TWiki:Plugins.ImagePluginDev), so I am pretty comfortable with it now.

-- TWiki:Main.SteffenPoulsen - 02 Mar 2007

Closed with release of 4.1.2

KJL

Summary - fix that was committed to 4.1.2 is broken.

Just noticed this bug... I know I don't check the bugs web enough, but it's hard to just get the I18N bugs in a filter, and this one passed me by - a quick email would be helpful if anyone notices an I18N problem on which I haven't commented. There are some serious issues with this fix that will break many non-European sites using I18N.

A few comments on this:

  • If you have users who are likely to create topics using I18N characters, you should just turn UseLocale on - otherwise you will get this sort of problem, and a lack of WikiWords supporting I18N characters, broken sorting, and more. There is really no point trying to write I18N code for the case where UseLocale is off.
  • I believe Michael is running in ISO-8859-{1,15} but it would be useful to know for sure. His code will only work for ISO-8859-1 or EBCDIC according to perldoc utf8, which specifically mentions this restriction on downgrade.
  • Any UTF-8 conversions should simply re-use the TWiki.pm code, i.e. UTF82SiteCharSet - this does exactly what's needed here, i.e. converting the UTF-8 URL into the site charset. The current fix will break badly for non-ISO-8859-1 character sets such as KOI8-R (corrupting directory names), and would also corrupt the Euro sign in ISO-8859-15.
  • UTF82SiteCharSet is already coded to work for Perl 5.6 and 5.8, works for virtually any character set, and does a dynamic require of Encode (the use Encode introduces an unnecessary dependency on Perl 5.8, which is the only version that includes CPAN:Encode).
  • TWiki:Codev.EncodeURLsWithUTF8 did specifically address attachments and the Firefox vs IE UTF-8 encoding issue, of which I was well aware - the idea was to URL-encode in site character set at the TWiki end, so that the attachment folder, which is served by Apache of course, uses the site character set regardless of browser.
  • The right place to fix this regression is in the code for ATTACHURLPATH in TWiki.pm - this will make the browser send a non-UTF-8 request URL.
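
The difference between downgrading and converting to a site character set, as raised in the points above, can be sketched with a single Cyrillic letter. This example uses the Encode module directly as a stand-in for a real charset conversion; it does not call UTF82SiteCharSet, whose interface is not shown here, and the sample character is chosen only for illustration.

```perl
use strict;
use warnings;
use Encode ();

my $cyrillic = "\x{418}";    # CYRILLIC CAPITAL LETTER I, as a character string

# utf8::downgrade can only represent characters up to 0xFF, so it cannot
# handle this character. FAIL_OK (second argument) returns false, not die.
my $copy = $cyrillic;
my $ok = utf8::downgrade( $copy, 1 );
print "downgrade ", ( $ok ? "succeeded" : "failed" ), "\n";

# An explicit conversion into the site character set yields one KOI8-R byte.
my $koi8 = Encode::encode( 'koi8-r', $cyrillic );
printf "KOI8-R byte: 0x%02X (length %d)\n", ord($koi8), length($koi8);
```

The same limit applies to the Euro sign (U+20AC) on an ISO-8859-15 site, since it also lies above 0xFF.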
The issue now is to make sure the TWiki:Codev.EncodeURLsWithUTF8 feature works properly for attachments again - it definitely did work without problems when I coded this feature (see the feature page section where I specifically mentioned this was working). Although if you don't turn on UseLocale all bets are off, as this code won't be used... The place to fix this is in the TWiki.pm code, which has a nice SMELL comment on urlEncode that I think highlights the reason this is broken (although the comment is wrong - all that's needed is to encode the URL in current site character set, e.g. KOI8-R, as that's how attachment URLs are supposed to work). The CairoRelease code was the last time this worked, I think, so it's worth having a look at URL encoding in TWiki.pm back then. Not sure of exact fix but this is the place to look.

Unfortunately I don't have time to keep track of I18N stuff, let alone write new code, but I can still detect some serious regressions going on here due to various changes to the charset related I18N code.

Separate comment: Nobody except Japanese/Chinese/Korean sites should be using UTF-8 ever as the {Site}{Charset} - there is no point for European languages as it breaks things such as WikiWords. Just because your Linux server uses UTF-8 by default doesn't mean you should use this for TWiki. Real TWiki:Codev.UnicodeSupport would help but is a significant amount of work, as discussed recently on TWiki:Codev.InternationalisationGuidelines and the related Item3679 (which is another regression BTW). This fix sort-of works with UTF-8 but that's no use for anyone who wants WikiWords, sorting or searching of I18N characters to work properly.

If someone would like to step up to be the owner of Unicode support and I18N generally I'm happy to provide pointers and some of the code I did a while back, as well as test pages for KOI8-R and other cases - SergejZnamenskij has already done some Unicode patches and might be interested, but it would be great to have someone with more TWiki development experience to run this. If we don't get Unicode support done, more and more people will end up using UTF-8 because it's more common these days, so this problem does need addressing.

Another comment on Sandbox.pm: I'm a bit surprised that we have this rather weird looking usage of TWiki:Codev.UnicodeNormalisation within the sandbox code, of all places (ref: Item3163) - would never have thought to look there and Unicode normalisation is not really a good way to do this. Now that I've looked at this, it's doing it in a complex and slow way that only works on Perl 5.8, and only when this module is installed (it's not in core Perl), so it normally won't do anything much apart from using the hard-coded s/// statements. It also stops use of quite valid I18N characters in filenames for uploaded attachments, though only for ISO-8859-1 and -15 - not sure why this was considered a good idea as these are not a security issue.

UPDATE: I found the piece of code that was 'refactored' out of Cairo and whose omission is causing this bug, although it needs rework to fit into the new TWiki code - basically it makes sure that we URL encode attachment URLs (in fact it's only called for ATTACHURLPATH etc). This code also makes TWikiOnMainframe I18N work. It works by ensuring that for most site character sets other than UTF-8 and EBCDIC, we URL-encode the URL in the site character set so it's pre-encoded in ISO-8859-15, KOI8-R, or whatever, when it hits the web server (not TWiki) in a GET request.


=pod

---++ sub handleNativeUrlEncode ( $theStr, $doExtract )

Perform URL encoding into native charset ($siteCharset) - for use when
viewing attachments via browsers that generate UTF-8 URLs, on sites running
with non-UTF-8 (Native) character sets.  Aim is to prevent UTF-8 URL
encoding.  For mainframes, we assume that UTF-8 URLs will be translated
by the web server to an EBCDIC character set.

=cut

sub handleNativeUrlEncode {
    my( $theStr, $doExtract ) = @_;

    my $isEbcdic = ( 'A' eq chr(193) );         # True if Perl is using EBCDIC

    if( $siteCharset eq "utf-8" or $isEbcdic ) {
        # Just strip double quotes, no URL encoding - let browser encode to
        # UTF-8 or EBCDIC based $siteCharset as appropriate
        $theStr =~ s/^"(.*)"$/$1/;
        return $theStr;
    } else {
        return handleUrlEncode( $theStr, $doExtract );
    }
}
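
The effect of encoding the URL in the site character set rather than in UTF-8 can be sketched as follows. url_encode_in_charset is a hypothetical stand-in written for this example (it is not the handleUrlEncode routine, which is not shown here), the set of characters left unescaped is an assumption, and it relies on the Encode module, so it needs Perl 5.8 or later.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: convert a character string to bytes in the given
# charset, then percent-encode every byte outside a small safe set.
sub url_encode_in_charset {
    my ( $text, $charset ) = @_;
    my $bytes = Encode::encode( $charset, $text );
    $bytes =~ s/([^A-Za-z0-9_.~\/-])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

my $path = "/pub/Sandbox/T\x{e4}stTopic/file.txt";

# Site charset ISO-8859-1: the a-umlaut becomes the single byte E4.
print url_encode_in_charset( $path, 'iso-8859-1' ), "\n";
# → /pub/Sandbox/T%E4stTopic/file.txt

# UTF-8: the same character becomes the two bytes C3 A4.
print url_encode_in_charset( $path, 'utf-8' ), "\n";
# → /pub/Sandbox/T%C3%A4stTopic/file.txt
```

Because the link is already percent-encoded when the page is served, the browser has no non-ASCII characters left to encode in its own preferred charset, and the request names the directory as it is stored on disk.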

Another thing that would help is if people ping me before they remove code that is clearly I18N related, and actually took a while to get right despite it looking rather hackish :-)

RD

This bug item was open from 16 Feb 2007 till release date 03 Mar 2007.

It has been on the short release blocker list all the time.

And it was discussed at the release meeting.

So why does this critique surface the day after the release from a core team member???

Core team members should - in my view - follow the bug reports and code checkins and participate at release meetings whenever possible.

You say it is broken. How is it broken? What happens when you run with another locale? I have an English Linux and it runs utf8. And I run my production TWikis in international environments - i.e. English. So I do not use the locale feature because I do not have any regional language to set it to that would make sense. So I do not know how this problem surfaces. Is it impossible to upload attachments? Or are they just renamed to English characters? Is this so serious that we need a 4.1.3 within a week?

When you run an English language site people may still upload attachments with funny characters and I could see that people could upload files that could not be downloaded afterwards so there was a bug to be fixed.

-- TWiki:Main.KennethLavrsen - 04 Mar 2007

Just tried with {Site}{Locale} = da_DK.utf8

I do not understand any of this mysterious code but I assume that is the way to set my site to Danish and utf8. And when I upload a file it works fine. My Danish letters are converted to ae, oe, aa which I guess is what I should expect and how it also works without the fix.

I do not see anything broken. Or do I need to set up more? The whole localization setup is so complex that I doubt any of our users understand anything.

-- TWiki:Main.KennethLavrsen - 04 Mar 2007

Like I said, I don't follow the bugs web very closely, which is unfortunate, but nobody else who knows I18N picked this up, and I was away all last week. I would be happy to hand I18N over to someone more active.

Running TWiki in UTF-8 is a bad idea for non-Japanese/Chinese users - see my comments, you won't get any WikiWord support. Using en_US.ISO-8859-1 is a better idea if you are using English only. It's possible this works for a site charset of UTF-8 but since that's not recommended for European users I didn't address that.

Try it with KOI8-R and some Russian text, or ISO-8859-* (not -1 or -15) with some accented characters. The symptoms will be that it converts the URL from UTF-8 to ISO-8859-1, so all KOI8-R Russian characters (8th bit set) in the URL will be corrupted. The upload may work but with a weird filename. When the user tries to view the attachment, they'll be using an ISO-8859-1 encoded URL not KOI8-R.

Trust me, this really is broken for Russian and many non-European users, it's just that it sort-of works for ISO-8859-1 and UTF-8. Locale setup is really not that complex if you read the installation guide and have working locales.

-- TWiki:Main.RichardDonkin - 05 Mar 2007

My knowledge of Russian and Japanese is very limited (zero) - I know Danish, Swedish, Norwegian, English, German and French - so when I have tested I have only seen that æ becomes ae, ø becomes oe, and å becomes aa, which I guess we Danes are used to.

Can you perhaps define a good test case with all the locale relevant settings that we latin alphabet users are able to understand and test with. That would be great.

-- TWiki:Main.KennethLavrsen - 05 Mar 2007

OK, here's a test case that I think will show this:

  1. Set the TWiki {Site}{Locale} to ru_RU.KOI8-R (check this locale exists on server with that spelling, and check for errors in configure)
  2. Load TWiki site in your browser and check it's using Cyrillic/KOI8-R in Encodings - no other setup such as fonts needed if you test with Firefox
  3. Take the attached RussianText.txt file and paste contents into a new topic with the name "Иностранная" - includes some Russian text in KOI8-R as well.
  4. WikiWord in the only bullet in page should be linked OK (КодОбменаИнформацией)
  5. Attach this file, using the name "Информацией.txt", and create a link
  6. View page for any corruption in name of attachment links, and try downloading the file through those links
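
For step 1, one possible way to check that the locale exists on the server is to ask Perl's POSIX module to switch to it. This is a sketch only; the locale spelling is taken from the step above and may differ on your system (locale -a lists what is installed).

```perl
use strict;
use warnings;
use POSIX qw(setlocale LC_CTYPE);

# Locale name as spelled in step 1; adjust to the spelling on your server.
my $wanted = 'ru_RU.KOI8-R';

# setlocale returns the locale name on success and undef if the locale
# is not available on this system.
my $result = setlocale( LC_CTYPE, $wanted );

if ( defined $result ) {
    print "locale available: $result\n";
}
else {
    print "locale '$wanted' is not installed under that name\n";
}
```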

-- TWiki:Main.RichardDonkin - 05 Mar 2007

Richard, if the current code fails on your test case, then please open up a new bug item. Can you come up with a code fix, please, as you are the only one to know what's going on?

-- TWiki:Main.MichaelDaum - 06 Mar 2007

I'm working on a fix - already opened up a new bug Item3714 for this issue, will submit fix there.

-- TWiki:Main.RichardDonkin - 07 Mar 2007

Quick update: a lot of useful I18N code disappeared between Cairo and 4.1 - not sure where, but the net effect is that the semantics of {Site}{Charset} are broken (should be derived from {Site}{Locale}), bad charsets are not filtered out, {Site}{Lang} is not calculated, and so on. Not surprisingly, there are some problems as a result, and TWiki:TWiki:InstallationWithI18N doc is broken because you do need to set {Site}{Charset}.

Having installed 4.1.2, I also notice that configure sets a locale based on ISO-8859-1 as default, and the charset to ISO-8859-15 - one basic rule is that these should mean the same charset, even if spelling is different on server locale vs browser. So setting -1 and -15 on a single site doesn't make much sense, although in this case it sort-of works for the common characters that are the same between the two settings.

-- TWiki:Main.RichardDonkin - 07 Mar 2007

Michael - can you confirm your setup (attaching LocalSite.cfg would be great) when you had this error?

Kenneth - I don't know Russian or Japanese either, I just use cut and paste...

-- TWiki:Main.RichardDonkin - 12 Mar 2007

I have a patch to 4.1.2 almost ready, fixing this issue for Russian topic names and attachment names. Need to run the unit tests etc though, and for some reason the attachments are done via viewfile post-Cairo, which is a performance hit and not necessary given this URL-encoding chicanery. However, this is quite promising.

There are some other changes I'd like to make but will commit those separately.

Once ready, should I commit this to MAIN and Patch04x01? Is there some sort of approval process as per TWiki:Codev.PatchReleaseMaintenanceSVN, i.e. should I commit to MAIN first for review?

-- TWiki:Main.RichardDonkin - 13 Mar 2007

Kenneth is TWikis release manager at the moment and he will make sure that changes are properly applied to the Patch branches if suitable (or rolled back from it :-)).

I am very happy to see that you are able to give this area some attention still, Richard.

-- TWiki:Main.SteffenPoulsen - 14 Mar 2007

One oddity is that we seem in Attach.pm to be using viewfile for some attachment serving (via explicit link in text), but not for links in attachment table. I think avoiding viewfile is best for performance reasons, as with the original attachment I18N code. Anyone who disagrees, please speak up!

-- TWiki:Main.RichardDonkin - 15 Mar 2007

I agree.

Please ping Peter about dropping viewfile in attachments tables as he had the strongest reasons (I can't imagine which) for using viewfile in there.

-- TWiki:Main.MichaelDaum - 15 Mar 2007

Perhaps a good time to YACP (Yet Another Configuration Parameter)?

The reasons for using viewfile could include

The reasons for using pub could include
  • Performance
  • No "local filename" problems (current wget chooses a not-so-useful name for the local file when retrieving a file using viewfile)
Having a few webs using pub (Main, TWiki) and the rest using viewfile (per default) has been suggested previously.

-- TWiki:Main.SteffenPoulsen - 15 Mar 2007

The access control link you provided suggests Apache mod_rewrite to convert non-viewfile links into viewfile links. And I am quite happy that I18N filenames can be handled OK using the existing non-viewfile links. So I think we should simply use non-viewfile links everywhere and specify use of mod_rewrite as now if this needs to be changed. This could be made configurable of course, but I thought we were trying to get rid of config options!

-- TWiki:Main.RichardDonkin - 15 Mar 2007

Using pub is perfectly acceptable for me. Setting this waiting for feedback from customer advocate.

-- TWiki:Main.SteffenPoulsen - 17 Mar 2007

Not sure who the customer advocate is here but since the existing code is not configurable between viewfile and non-viewfile (pub), I think making it all non-viewfile is a reasonable way of fixing this bug. Making it configurable would be a new feature.

-- TWiki:Main.RichardDonkin - 17 Mar 2007

viewfile + mod_rewrite results in a serious delay loading the page as lots of static files have to jump through the viewfile needle eye. Apache does a better job serving static files. It only doesn't know about TWiki acls. This can be implemented using a mod_perl access handler for the pub directory hierarchy. But that's another battle to fight.

-- TWiki:Main.MichaelDaum - 17 Mar 2007

OK - I think we have a consensus, will go for non-viewfile as discussed.

UPDATE: I've looked at this code more closely and have (hopefully) fixed the 'link within topic' case with just a few lines of code in SVN:lib/TWiki/Attach.pm. I have left the viewfile code used for the attachments table for now, as it's possible this also handles the 'link to specific attachment version case' that is part of the 'view all versions' code. In I18N terms there's no reason to change that code since it works fine with I18N characters.

The non-viewfile code will be used, as now, for the 'link within topic' cases, i.e. for all attached images, which is the biggest performance issue.

-- TWiki:Main.RichardDonkin - 18 Mar 2007

Steffen. I thought you were one of the two customer advocates?? ;-)

Making the default in the attachment table a direct link to the pub dir is something I very much favour. You can always use the re-write feature in Apache if you need access rights to attachments. You will need this anyway because otherwise people just manually type the pub URL.

But this is a separate thing!!!. When you view older versions of an attachment it has to go through viewfile so somehow we need to deal with the I18N issue anyway.

The process for feature change is not to discuss it in a bug item but to create a topic on Codev and add a link to it from TWiki:Codev.TWikiFeature04x02

If such a topic has a developer that commits to implement it (this one is really simple) then the rule is

  • If all agree - do it
  • If someone has a concern we vote at release meeting
  • If people ignore it - then it is automatically an OK after 14 days.
-- TWiki:Main.KennethLavrsen - 22 Mar 2007

I wasn't intending to do a new feature here, so I will just maintain the status quo re attachments - probably won't have time to do the 'go via pub always' feature but it's not hard to do once I18N fixed. I'm currently struggling with making the filename filtering work better, but most of the patch is ready.

-- TWiki:Main.RichardDonkin - 24 Mar 2007

Clarification: I am for using pub instead of viewfile, not the other way around as stated above. See Item3575.

-- TWiki:Main.PeterThoeny - 24 Mar 2007

I've done a patch that mostly fixes this issue, but there's unfortunately one chunk of code that won't work, in the sanitizeAttachmentName routine in Sandbox.pm - however, we could simply revert to the older filtering-in code that is still part of that, which would not be ideal but would still work. The security people might have something to say about this of course as it means just filtering-in.

The patch is attached - the non-working stuff is all in that routine, marked by warning comments, in case it helps someone. Not committed because it's not final, although quite close if we can agree on what to do with the sanitizeAttachmentName routine. What do people think should be done - can we revert to filtering in (near end of routine)?

I'd like to drop the Unicode::Normalize stuff as it really breaks I18N a lot, and only works for a few European languages as well as introducing some Unicode dependencies before TWiki:Codev.UnicodeSupport is implemented.

I've tried testing this patch against the latest SVN MAIN (r14213), to make sure it at least runs as well as it did 3 months ago when I worked on it - however, I get this error:

Software error:
Can't locate object method "supportsRegistration" via package "TWiki::Users::TWikiUserMapping" at /home/twikidev/lib/TWiki/Users.pm line 212.

Once that bug's fixed, could someone ping me here, or preferably by email, so I can re-test?

-- TWiki:Main.RichardDonkin - 19 Jun 2007

Thanks Richard for working on this! Hopefully someone finds time to support you.

-- TWiki:Main.PeterThoeny - 20 Jun 2007

Key question for all reading this - can we just go back to filtering-in in the sanitizeAttachmentName routine?

If the answer is yes, I can finalise and re-test this patch on my machine, and commit to SVN.

If nobody answers in a day or two, I will assume filtering-in is OK.

-- TWiki:Main.RichardDonkin - 20 Jun 2007

Generally if you want feedback on something, you should flip it to "Waiting for Feedback" :-)

I believe in filtering in as a general principle, so I say yes.

All the tests pass on MAIN, so you should be OK....

-- TWiki:Main.CrawfordCurrie - 21 Jun 2007

Sorry - I got the question wrong - my proposal is to revert to the less-secure filtering-out. I did code a filtering-in approach (included in the patch, see the comment starting WARNING), but couldn't get it to work due to I18N locale problems (may be a Perl bug, or the Sandbox environment, or something else...).

So - in the interests of having an upload attachment approach that works OK for I18N, could we re-instate filtering-out? If this is too much of a security risk for most sites, we could make this automatically configured for I18N sites only, i.e. do filtering-in (more secure) for sites without I18N and filtering-out for sites with I18N enabled.

Note that filtering-out was the approach when the I18N attachment code was originally written. Also, once we do TWiki:Codev.UnicodeSupport, we can basically dump locales (except for older sites still using them) and just use Unicode - filtering-in is very easy with Unicode and may well work more reliably in the Sandbox environment.
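
For readers unfamiliar with the two terms, the difference can be sketched as below. Both routines and their character lists are hypothetical and written only for this illustration; they are not the code in Sandbox.pm or in the attached patch.

```perl
use strict;
use warnings;

# Filtering-in (more secure): keep only characters on an allow-list.
# With this ASCII-only list, I18N characters are removed as well.
sub filter_in {
    my ($name) = @_;
    $name =~ s/[^A-Za-z0-9_.-]//g;
    return $name;
}

# Filtering-out (less secure): remove only characters assumed to be
# dangerous and keep everything else, including I18N characters.
sub filter_out {
    my ($name) = @_;
    $name =~ s{[\x00-\x1f\$\\/:*?"<>|;&]}{}g;
    return $name;
}

my $upload = "T\x{e4}st;file.txt";    # made-up upload name, a-umlaut as 0xE4
print "kept by filtering-in:  ", length( filter_in($upload) ),  " characters\n";
print "kept by filtering-out: ", length( filter_out($upload) ), " characters\n";
```

Filtering-in is only as I18N-friendly as its allow-list: a locale-aware or Unicode-aware list would keep the umlaut, which is the part reported above as not working in the Sandbox environment.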

-- TWiki:Main.RichardDonkin - 22 Jun 2007

To move this Item forward I second the suggestion on filtering-out for I18N and leaving filtering-in as is for non-I18N.

Thanks for working on this, Richard!

-- TWiki:Main.SteffenPoulsen - 23 Jun 2007

OK, have started fixing the patch to do this.

-- TWiki:Main.RichardDonkin - 23 Jun 2007

Updated patch is attached - would like to check this in but I can't test this currently due to crash bugs in other bits of TWiki... See other bug reports made today, but particularly Item4295 which is blocking me.

-- RichardDonkin - 24 Jun 2007

We assume that you will follow up when you have tested now that MAIN seems pretty stable - Richard

-- TWiki:Main.KennethLavrsen - 02 Jul 2007

I will do - I think the problem was my not running pseudo-install.pl, and possibly some other Linux problems (apt is broken on my Ubuntu box at present). I was on a business trip last week but hope to get on with this next week.

-- TWiki:Main.RichardDonkin - 03 Jul 2007

I have tried exercising the patch a bit, and it seems stable, both in L10N and non-L10N surroundings.

Committed to MAIN, testcases pass - everybody please test it further.

-- TWiki:Main.SteffenPoulsen - 21 Aug 2007

On my setup (using iso8859 on the file level) I still seem to have problems retrieving files with IE (Firefox is OK).

Testcase:

  • Upload a file named ÆØÅ.txt
  • Try to retrieve it using Firefox and IE (using viewfile) - IE fails (at my installation)
    • IE has an option to send links as UTF-8 or not, the non-default setting probably works
-- TWiki:Main.SteffenPoulsen - 22 Aug 2007
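
To make the browser difference concrete, the sketch below shows how the same file name looks in a URL when the browser sends it as UTF-8 versus ISO-8859-1. The `percent_encode` helper is hypothetical, written only for this illustration.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Percent-encode every byte outside the unreserved URL character set.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9._~-])/sprintf('%%%02X', ord($1))/ge;
    return $bytes;
}

# "ÆØÅ.txt" written with escapes so the script does not depend on file encoding.
my $name = "\x{C6}\x{D8}\x{C5}.txt";

my $as_utf8   = percent_encode( encode('UTF-8',      $name) );
my $as_latin1 = percent_encode( encode('ISO-8859-1', $name) );

print "UTF-8:      $as_utf8\n";     # %C3%86%C3%98%C3%85.txt
print "ISO-8859-1: $as_latin1\n";   # %C6%D8%C5.txt
```

A server that stores names in ISO-8859-1 and compares the incoming bytes unchanged will only find the file for the second form.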

Thanks for the feedback - I now have a working TWiki SVN based test setup at home, so can re-test this. Will try IE from a different PC, though it could be a few days before I get time to resolve this.

-- TWiki:Main.RichardDonkin - 27 Aug 2007

#templates/messages.tmp:351 explains that I18N characters are replaced by US-ASCII chars - it should be updated once this item is closed.

-- TWiki:Main.SteffenPoulsen - 04 Sep 2007

To be a bit more specific, the testcase above (the ÆØÅ.txt-file) actually still only fails when the Web name or the Topic name has I18N chars also. If it is uploaded to webs without international chars it can be downloaded just fine through viewfile, using both IE and Firefox.

So the problem is still with the encoding/decoding of the path to the file, not the encoding of the filename itself.

The filenames / pathnames on the disk are encoded as they are supposed to be in my case (iso8859).

-- TWiki:Main.SteffenPoulsen - 05 Sep 2007

That's what the original bug report was all about :-(

-- TWiki:Main.MichaelDaum - 05 Sep 2007

As far as I can tell the culprit is now lines 439/439 in lib/TWiki/UI/View.pm, which read:

    my $topic = pop( @path );
    my $webName = join('.', @path );

These differ per browser, but are used directly as-is in attachmentExists (UTF-8-encoded if IE was asking, ISO-encoded if Firefox was asking).

If these can be normalized (UTF82SiteCharSet?) I think the problem that at least I am seeing is solved.

Next problem is of course that the default behaviour of TWiki is to use pub links for the attachment tables; it should automatically switch to using viewfile if L10N is turned on. I will report this as another bug once this one is solved.

-- TWiki:Main.SteffenPoulsen - 05 Sep 2007
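
A minimal sketch of the normalisation suggested above, assuming the path arrives as raw bytes and the site charset is ISO-8859-1. The `to_site_charset` helper is hypothetical and is not the fix that was committed. It relies on a heuristic: non-ASCII input that decodes cleanly as UTF-8 is treated as UTF-8, and anything else is assumed to be in the site charset already. A few ISO-8859-1 byte sequences are also valid UTF-8, so the heuristic can occasionally guess wrong.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: return $bytes in the site charset, converting from
# UTF-8 only when the input is non-ASCII and decodes cleanly as UTF-8.
sub to_site_charset {
    my ($bytes, $site_charset) = @_;
    return $bytes if $bytes !~ /[\x80-\xff]/;      # pure ASCII needs no work

    my $copy  = $bytes;    # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    return $bytes unless defined $chars;           # not UTF-8: leave untouched
    return encode($site_charset, $chars);
}

my $from_ie      = "/Sandbox/T\xC3\xA4stTopic";    # UTF-8 bytes
my $from_firefox = "/Sandbox/T\xE4stTopic";        # ISO-8859-1 bytes

for my $path_info ($from_ie, $from_firefox) {
    my @path    = grep { length } split m{/}, to_site_charset($path_info, 'iso-8859-1');
    my $topic   = pop @path;
    my $webName = join '.', @path;
    printf "%s / %d-byte topic name\n", $webName, length $topic;   # Sandbox / 9-byte topic name
}
```

Both requests end up with the same web and topic bytes, so a later lookup on disk no longer depends on which browser asked.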

I have committed a fix that seems to work for my locale (da_DK.iso88591), apparently even with pub links (whee!).

Please test other locales.

Thanks Richard for your help on this one.

-- TWiki:Main.SteffenPoulsen - 05 Sep 2007

Hopefully last peculiarity: Seems there are some errors in the apache log when files are uploaded to a web with international chars and the topic they are attached to is viewed (functionality is ok).

[Wed Sep  5 23:17:25 2007] view: Use of uninitialized value in list assignment at ...lib/TWiki/Store.pm line 187.
Use of uninitialized value in string eq at ...lib/TWiki/Meta.pm line 177.
Use of uninitialized value in string eq at ...lib/TWiki/Meta.pm line 177.
Use of uninitialized value in list assignment at ...lib/TWiki/Store.pm line 187.
Use of uninitialized value in string eq at ...lib/TWiki/Meta.pm line 177.
Use of uninitialized value in string eq at 
...

-- TWiki:Main.SteffenPoulsen - 05 Sep 2007

I cannot provoke the apache errors anymore, I assume they have been cleared with some of the other items that have been checked in meanwhile.

Long-running bug this, glad to have it finally closed!

-- TWiki:Main.SteffenPoulsen - 08 Sep 2007

I've tested this with Cyrillic topic name, attachment name and attachment comment (KOI8-R charset) under Firefox and Opera on Linux, and it all works fine, and with no errors in Apache log. Glad to see this finally resolved, though it could do with more testing just in case.

The problem with the pop lines was exactly that the URLs were being re-parsed for web and topic names, without going through the TWiki:Codev.EncodeURLsWithUTF8 charset conversion - another I18N regression fixed, good to see as well.

However... I have noticed a lot of path_info calls sprinkled around TWiki, in the scripts rest and twiki, as well as here. None of these seem to do the correct UTF8 character set conversion, so this sort of URL-related problem could crop up again. In fact, all these calls to path_info should be replaced with a call to a shared TWiki routine that does this conversion once and memoizes the result - the code needed is in TWiki.pm.

-- TWiki:Main.RichardDonkin - 10 Sep 2007
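
The "convert once and memoize" idea can be sketched as follows. This is an illustration only: `site_path_info`, the cache, and the counter are hypothetical names, the conversion uses the same decode-as-UTF-8 heuristic as a stand-in, and none of it reproduces the code in TWiki.pm mentioned above.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my %path_cache;        # raw path_info bytes => path in the site charset
my $conversions = 0;   # counts real conversions, for demonstration only

# Hypothetical shared routine: convert once, then serve from the cache.
# The cache is keyed on the raw path alone, assuming one site charset per process.
sub site_path_info {
    my ($raw, $site_charset) = @_;
    return $path_cache{$raw} if exists $path_cache{$raw};

    $conversions++;
    my $copy  = $raw;    # decode() with a CHECK value may modify its argument
    my $chars = eval { decode('UTF-8', $copy, Encode::FB_CROAK) };
    # Valid UTF-8 is re-encoded; anything else is assumed to be in the site charset.
    my $path  = defined $chars ? encode($site_charset, $chars) : $raw;
    return $path_cache{$raw} = $path;
}

my $raw    = "/Sandbox/T\xC3\xA4stTopic/file.txt";
my $first  = site_path_info($raw, 'iso-8859-1');
my $second = site_path_info($raw, 'iso-8859-1');
print "conversions: $conversions\n";   # conversions: 1
```

Routing every caller through one routine like this means the charset handling lives in a single place, which is the point of the suggestion.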

ItemTemplate
Summary I18N: Urls to file attachments that has umlauts only works in some browsers
ReportedBy TWiki:Main.MichaelDaum
Codebase 4.1.1, ~twiki4
SVN Range TWiki-4.1.1, Fri, 16 Feb 2007, build 12896
AppliesTo Engine
Component I18N
Priority Urgent
CurrentState Closed
WaitingFor

Checkins TWikirev:12990 TWikirev:12991 TWikirev:13020 TWikirev:13025 TWikirev:14577 TWikirev:14578 TWikirev:14581 TWikirev:14742 TWikirev:14745 TWikirev:14750 TWikirev:14763 TWikirev:14779
TargetRelease minor
ReleasedIn 4.2.0
Topic attachments
Attachment | History | Size | Date | Who | Comment
RussianText.txt | r1 | 1.5 K | 2007-03-05 - 08:04 | UnknownUser | Test file for this bug
near-complete.patch | r1 | 18.5 K | 2007-06-19 - 16:04 | UnknownUser | Patch for this problem, not quite complete, see comment
updated-Item3652-v2.patch | r1 | 14.7 K | 2007-06-24 - 09:58 | UnknownUser | Updated version of 'near-complete' patch - ready for testing
Topic revision: r86 - 2008-01-22 - KennethLavrsen