(ThYang - I've updated the summary of your bug to give a better description of a specific bug that developers should be able to reproduce.)
Broken Chinese problem.
This one may relate to
Bugs.Item5248
You can input Chinese text and save it successfully (left & middle picture). But if you edit the topic again, the characters turn into other codes ... (right picture)
My settings in LocalSite.cfg:
$TWiki::cfg{Site}{Locale} = 'zh_TW.UTF-8';
$TWiki::cfg{Site}{CharSet} = 'UTF-8';
$TWiki::cfg{Site}{Lang} = 'zh';
$TWiki::cfg{Site}{FullLang} = 'zh-tw';
Related issues
--
TWiki:Main/ThYang
- 02 Feb 2008
I have seen this problem too. It exists for CJK - Chinese, Japanese, Korean.
Installation: TWiki 4.2.0
Steps to Reproduce:
1) Install a default TWiki 4.2.0 installation
2) Edit a page in WYSIWYG
3) Add a CJK string, for example: 中國字 (this says "Chinese characters") 비 (this is a Korean character)
(I'm using Firefox; you might need East Asian font support to see this on Windows XP)
4) Save the page -> do this step for easy reproduction.
5) Re-edit the page in WYSIWYG. The Chinese characters are still there, PROPERLY ENCODED.
6) Hit "Edit TWiki Markup" in WYSIWYG.
Result: the Chinese characters are mangled into a single-byte encoding and show up in the TWiki markup editor as gibberish.
Expected Result: -> In TWiki markup, the UTF-8 encoded Chinese characters are preserved.
Workaround: -> Never Ever Ever EVER hit "TWiki Markup Editor" in WYSIWYG. -> Use Raw HTML editor.
--
TWiki:Main.TimothyChen
- 15 Feb 2008
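The corruption described in the steps above is consistent with UTF-8 bytes being reinterpreted as a single-byte charset. A minimal Python sketch of that effect follows. It is not TWiki's Perl code, and the choice of ISO-8859-1 as the single-byte charset is an assumption, since the report does not say which one was involved.

```python
# Illustrative only: shows how UTF-8 text turns into gibberish when its
# bytes are decoded with a single-byte charset (ISO-8859-1 assumed here).
text = "中國字"

utf8_bytes = text.encode("utf-8")       # 3 bytes per character, 9 in total
garbled = utf8_bytes.decode("latin-1")  # every byte becomes its own character

print(len(text), len(garbled))          # character counts before and after

# The damage can be undone only while the garbled text is left untouched;
# saving or re-encoding it again usually makes the loss permanent.
restored = garbled.encode("latin-1").decode("utf-8")
```

Each three-byte character becomes three unrelated characters, which matches the "gibberish" seen in the markup editor.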
Tim,
The patch in Bugs.Item4946 solved the problem.
--
TWiki:Main.ThYang
- 04 Mar 2008
According to the report in Item4946 the last patch still has open issues.
Can a Chinese speaker explain to me how you enter Chinese characters? One of the major reasons I cannot attack this one is that I have no clue how to write Chinese on a Danish or English keyboard. Do you type a percent sign, a u and 4 characters?
Can you supply someone like me with some simple ways to enter Chinese words, including a picture of the result, so I can check that it remains correct?
What we really need here is a Chinese language programmer to give a hand. That would be the best.
--
TWiki:Main.KennethLavrsen
- 06 Mar 2008
See also
Item5457 for the same problems in Cyrillic, so a Russian language programmer would do just as well. Whichever it is, we desperately need a programmer who uses these character sets on a daily basis (or an expert on UTF-8 like
TWiki:Main.RichardDonkin
) to help resolve this!
Note that the more I think about it, the more I think the open issue against the patch in
Item4946 is a red herring. The problem only occurs if invalid UTF-8 is fed to it.
CC
I don't read/write Chinese, so the way I tested Chinese text and other languages was to simply find some Chinese text on the web and copy/paste it (as Unicode which is default on Windows) into a TWiki edit form. The browser should convert it from any source character set (GBK and GB2312 are common for Chinese, as well as UTF-8) into the target character set based on whatever encoding is used for the TWiki page (subject to what TMCE does of course). Sites such as Yahoo China should also be a good source of Chinese text. You could also copy/paste from the HTML rendered text above e.g. 中國字.
Any corruption is very likely to just wreck the Chinese text visibly rather than subtly corrupt it into another Chinese character, so in practice it's quite easy to check whether something is broken.
Do check
TWiki:Codev.JapaneseAndChineseSupport
for some restrictions on character sets that will work on server side - basically only EUC variants and UTF-8 will work.
--
TWiki:Main.RichardDonkin
- 27 Mar 2008
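As a small illustration of the charset conversion described above: the same character has different byte sequences in GB2312 and UTF-8, so text has to be transcoded rather than copied byte for byte. This is a Python sketch only; the helper name `transcode` is hypothetical and nothing here is taken from TWiki or TMCE.

```python
# Illustrative only: convert text from one charset to another by decoding
# to Unicode first and then encoding in the target charset.
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source charset and re-encode in the target."""
    return data.decode(source).encode(target)

gb_bytes = "中".encode("gb2312")                     # two bytes in GB2312
utf8_bytes = transcode(gb_bytes, "gb2312", "utf-8")  # three bytes in UTF-8
```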
Another thought: if the issue is with invalid UTF-8, there's already a handy regex in the TWiki code that checks for this - used in the
TWiki:Codev.EncodeURLsWithUTF8
code. So it should not be too hard to re-use this regex whether on server side or TMCE, if that's the problem. Having said that, it's quite hard to get invalid UTF-8 i.e. something that conforms to the basic UTF-8 encoding approach yet is either overlong or using illegal codepoints.
--
RichardDonkin - 28 Mar 2008
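For readers unfamiliar with this kind of check, here is a hedged Python sketch of a regex that accepts only well-formed UTF-8. It is not the regex from the TWiki code (that one is Perl and is not quoted in this report); the pattern below follows the byte ranges of the UTF-8 definition, and the name `is_valid_utf8` is hypothetical.

```python
import re

# Byte-level pattern for well-formed UTF-8: rejects overlong forms,
# UTF-16 surrogates and code points above U+10FFFF.
VALID_UTF8 = re.compile(
    rb"""(?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no overlongs
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, no overlongs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3-byte, general
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, no surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4-byte, no overlongs
      | [\xF1-\xF3][\x80-\xBF]{3}          # 4-byte, general
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4-byte, up to U+10FFFF
    )*""",
    re.VERBOSE,
)

def is_valid_utf8(data: bytes) -> bool:
    """Return True if every byte of data is part of well-formed UTF-8."""
    return VALID_UTF8.fullmatch(data) is not None
```

The same ranges could be applied on the server side or in the editor, as suggested above; where exactly the check belongs is left open here.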
Actually, it's not all that hard. The specific case used in testing
Item4946 occurred when someone fed text containing
%ACTION
into the decoder. %AC turned into an illegal character. If the encoding is handled correctly, it should never happen.
--
TWiki:Main.CrawfordCurrie
- 28 Mar 2008
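The %ACTION case can be reproduced outside TWiki. The Python sketch below uses the standard library's URL decoder as a stand-in for whatever decoder was involved (an assumption); the helper `decodes_as_utf8` is hypothetical.

```python
from urllib.parse import unquote_to_bytes

# "%AC" looks like a URL escape, so the decoder turns it into the byte 0xAC
# and leaves "TION" as it is.
raw = unquote_to_bytes("%ACTION")

def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data can be decoded as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

A lone 0xAC is a continuation byte with no lead byte before it, so the strict decode fails, while a properly escaped character such as `%E4%B8%AD` decodes cleanly.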
I believe I finally found the solution. I ended up having to convert octets to UTF-8 to stop HTML::Parser falling over, then convert UTF-8 wide chars to HTML entities to stop print falling over (STDOUT is not opened :utf8). I was able to engineer the fix without needing to touch the core code.
Many thanks to
TWiki:Main/ThYang
,
TWiki:Main.TimothyChen
,
TWiki:Main/OlegButovich
and
TWiki:Main.RichardDonkin
for exploring around the problem and proposing investigative procedures.
--
CrawfordCurrie - 31 Mar 2008
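The two-step idea in the fix above (decode octets to characters, then replace wide characters with entities before output) can be sketched in Python as follows. This is only an analogy to the approach described, not the actual Perl patch, and the function name `octets_to_entities` is hypothetical.

```python
def octets_to_entities(octets: bytes) -> str:
    """Decode UTF-8 octets, then express non-ASCII characters as entities."""
    # Step 1: turn raw octets into real characters so a parser sees text.
    text = octets.decode("utf-8")
    # Step 2: replace every non-ASCII character with a numeric character
    # reference, so the result is safe to write to a byte-oriented stream.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

print(octets_to_entities("中".encode("utf-8")))
```

Note that this sketch only converts non-ASCII characters; it does not escape `&`, `<` or `>`.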
Almost, but not quite;
TWiki:Main.KwangErnLieuw
discovered that entering Chinese characters in pickaxe mode didn't work, so I had to fix that too.
--
CrawfordCurrie - 31 Mar 2008