ActiveState Community

File encoding problem

Posted by degouville on 2009-10-29 02:12
OS: OS X

Hi all,

I am new to Komodo and have been trying to write a small HTML file. The file contains some French letters. When I save the file, the French letters are changed into weird symbols in the HTML text file. I am sure I am missing something in the preferences panel to set the save language to French. Can you help me?

Thanks

ericp | Thu, 2009-10-29 10:22

Global settings:

[Edit|Preferences|Internationalization]:
- see Default Editor Encoding, and Language-specific encoding

Your default is most likely Mac Roman, which shouldn't cause these
problems, but you could try setting it to UTF-8.

This setting applies only to newly created files. For the
current file, you would need to right-click on the editor's tab,
choose [Properties & Settings], then click the [Properties] tab
in the pref box, and then set the Encoding. Again, if it's
Mac Roman, I would change it to UTF-8. And if it's not Mac Roman,
I would try setting it to that.

- Eric

degouville | Thu, 2009-10-29 12:52

Thanks for your long and detailed answer. I have deleted everything and started again and I still have a problem but a slightly different one. In [Edit/Preferences/Internationalization], I have "Custom Encoding" set as Mac Roman and Language-specific encoding set as "Default Encoding".

In [Properties & Settings] of my file, Encoding is set as UTF-8...

With these settings, the preview in Komodo's browser tab works. But when I open the file in my web browser (Safari), I still get the weird symbols instead of what I see in Komodo's browser tab.

I have also run into a new strange behaviour. When I want to save my current file, I click on the small save icon and it seems that nothing is saved. I have to close the tab to save the file, at which point I am asked whether I want to save the changes or not. Do you know if this is normal behaviour?

Sorry for my English :-)

ericp | Thu, 2009-10-29 13:34

There are at least three factors that have to be kept in sync:

1. What is the actual encoding of the file? If you use a terminal
shell command like 'hexdump -C [filename] | more', you can tell by
looking for an occurrence of a known non-ASCII character in your file.

For example, I have an HTML file that contains the text
"cpfe école", and hexdump shows it like so:

000051b0 63 70 66 65 20 e9 63 6f 6c 65 20 6c 65 ea 76 65 |cpfe .cole le.ve|

The "é" shows up as hex e9, which is the latin1 value for that character,
so I know it was saved as latin1 (a single-byte encoding like MacRoman,
although MacRoman stores "é" as hex 8e).

If I had saved the file as UTF-8, I would have seen a hexdump like this:

00000240 69 70 74 22 3e 0d 0a 67 63 70 66 65 20 c3 a9 63 |ipt">..gcpfe ..c|

Here the "é" is stored as the two bytes hex c3-a9 (which looks like "Ã©"
when the bytes are misread as latin1)
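
If reading a hex dump is a chore, the same comparison can be sketched in a few lines of Python. This is only an illustration, not something Komodo runs: it uses Python's standard codec names ("latin-1", "utf-8", "mac_roman") to show how "é" is stored under each encoding, and what the UTF-8 bytes turn into when they are decoded with the wrong one.

```python
# How "é" is stored under the three encodings discussed in this thread.
ch = "é"
latin1_bytes = ch.encode("latin-1")      # one byte: e9
utf8_bytes = ch.encode("utf-8")          # two bytes: c3 a9
macroman_bytes = ch.encode("mac_roman")  # one byte: 8e

# The "weird symbols": UTF-8 bytes decoded with a single-byte encoding.
as_latin1 = utf8_bytes.decode("latin-1")
as_macroman = utf8_bytes.decode("mac_roman")

for label, data in [("latin-1", latin1_bytes),
                    ("utf-8", utf8_bytes),
                    ("mac_roman", macroman_bytes)]:
    print(label, data.hex())
print("UTF-8 read as latin-1:  ", as_latin1)
print("UTF-8 read as mac_roman:", as_macroman)
```

Two symbols appearing where one accented letter should be is the usual sign that UTF-8 bytes are being read as a single-byte encoding.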

So you're wondering if this is what anyone who uses non-ASCII text
has to go through. Usually not... let's look at the other two parts.

2. When you send files to a browser, the browser needs to know how the
file is encoded. There are two ways of putting this in an HTML file:

2.1. If it's an XHTML file, there will be an XML declaration at the
top:

<?xml version="1.0" encoding="..." ?>

If the file is UTF-8, the encoding value should be "UTF-8".
If it's ISO-latin1, it should be "ISO-8859-1".
I'm not sure what it should be for Mac-Roman, but as you can
see, I wouldn't recommend using this encoding for documents
destined for the web.

2.2 The meta tag

Also, the HTML file should have a meta tag in the head section
that also reports the document's character set. Here's an
example for UTF-8:

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

As far as I'm aware, the specs don't specify which takes precedence:
the XML declaration, or the meta tag. Best to use both, if
you're serving XML. If you're serving non-XML HTML, then there
won't be an XML declaration.
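
To see which charset a file actually reports, a rough Python sketch like the following can pull the value out of the XML declaration and the meta tag. The function name declared_charsets and the two regular expressions are my own illustration and are deliberately simple; this is not how a browser parses the page, just a quick way to check what the file claims.

```python
import re

# Simplified patterns; a real HTML parser is more forgiving than this.
XML_DECL = re.compile(
    rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', re.I)
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9._-]+)', re.I)

def declared_charsets(head: bytes) -> dict:
    """Return the charsets named by the XML declaration and the meta tag."""
    found = {}
    match = XML_DECL.search(head)
    if match:
        found["xml"] = match.group(1).decode("ascii").lower()
    match = META_CHARSET.search(head)
    if match:
        found["meta"] = match.group(1).decode("ascii").lower()
    return found

sample = (b'<?xml version="1.0" encoding="UTF-8" ?>\n'
          b'<html><head><meta http-equiv="Content-Type" '
          b'content="text/html; charset=UTF-8" /></head></html>')
print(declared_charsets(sample))
```

An empty result means the file reports no encoding at all, and the browser is left to guess.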

3. The server

The server might also need to send a header reporting the charset of
the output of the request. When you're viewing local files, there isn't a
server, so the browser only has the file's own declarations to go on.

4. Putting it together:

My guess is that either the actual encoding of the document and the
reported encoding are out of sync, or there is no reported encoding.
The Komodo embedded browser is correctly guessing the encoding,
while Safari is most likely assuming Latin1 or Mac-Roman, and is
being given UTF-8.
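
That guess can be turned into a rough check. The sketch below is only a heuristic under my own assumptions (the names is_valid_utf8 and diagnose are made up for illustration): it tests whether the bytes decode cleanly as UTF-8 and compares that with the charset the file reports. It cannot tell Latin1 from Mac-Roman, since any byte sequence is valid in both.

```python
def is_valid_utf8(data: bytes) -> bool:
    """True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

def diagnose(data: bytes, declared=None) -> str:
    """Classify a file as 'ascii', 'undeclared', 'ok', or 'mismatch'."""
    if all(b < 0x80 for b in data):
        return "ascii"        # no accented letters, nothing to get wrong
    if declared is None:
        return "undeclared"   # the browser has to guess
    declared_utf8 = declared.lower().replace("-", "") == "utf8"
    # Heuristic: real UTF-8 should be declared as UTF-8, and vice versa.
    return "ok" if is_valid_utf8(data) == declared_utf8 else "mismatch"

page = "<p>cpfe école</p>".encode("utf-8")
print(diagnose(page))                # no charset reported
print(diagnose(page, "ISO-8859-1"))  # reported charset disagrees with the bytes
print(diagnose(page, "UTF-8"))       # reported charset agrees with the bytes
```

A result of "undeclared" or "mismatch" would be consistent with a page that looks fine in Komodo's preview but shows weird symbols in Safari.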

Boy, that's a long-winded explanation. Someone (me) should write
it up as a FAQ (if it helps).

- Eric