ActiveState Community

How can I generate a big-endian Unicode file?

Posted by umb365 on 2009-02-22 03:11

Hi all.
I want to write some strings to a file as big-endian Unicode.
But when I use "fconfigure $channelID -encoding unicode", little-endian Unicode characters are written.

How can I handle this?

Thank you.

jeffh | Sun, 2009-02-22 20:47

There is currently no set encoding for big-endian unicode. This is a feature request which I think may be addressed before 8.6 final.

umb365 | Sun, 2009-02-22 23:20

Oops! Are you Jeff Hobbs? ;-)
I also think that this problem can't be solved with fconfigure.
So how can I solve it now, in Tcl 8.4.x?
Thank you, Jeff!

jeffh | Thu, 2009-02-26 12:41

The simple way to do this is using the Tcl [binary] command to swap LE for BE shorts:

binary scan $input s* vals ; binary format S* $vals

That will convert LE $input to BE output.
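As a minimal sketch of that swap on an in-memory string (the sample bytes and the variable names output and hex are only illustrative assumptions), the result of [binary format] has to be captured, and a hex dump shows the new byte order:

```tcl
# Sample little-endian input: BOM (FF FE) followed by U+7845 stored as 45 78.
set input [binary format H* fffe4578]

# s* reads every 16-bit value as little-endian; S* writes them big-endian.
binary scan $input s* vals
set output [binary format S* $vals]

# Dump the result as hex to check the byte order.
binary scan $output H* hex
puts $hex
# prints feff7845
```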

umb365 | Sun, 2009-03-01 08:35

Thanks, JeffH.
But I guess it doesn't work. :-P

Let me test it and reply to you later.

umb365 | Tue, 2009-03-03 21:33

Hi, jeffH.
I have tested it.
Something looks quite weird.

My Tcl version is 8.5:
% info tclversion
8.5

A demo little-endian Unicode input file, viewed in hex, is:
FF FE 45 78 0D 00 0A 00 45 78 0D 00 0A 00
where "45 78" (U+7845) is the Simplified Chinese character for "silicon".

This file is made up of 5 parts:
FF FE : the byte order mark (BOM) at the start of the file.
45 78 : "silicon" in Simplified Chinese.
0D 00 0A 00: line break (CR LF).
45 78 : another "silicon" in Simplified Chinese.
0D 00 0A 00: another line break (CR LF).

# demo TCL code:
set srcFile [ open unicode-littleendian.txt r ]
set dstFile [ open unicode-bigendian.txt w ]
fconfigure $srcFile -encoding binary
fconfigure $dstFile -encoding binary

while { 0 <= [ gets $srcFile str ] } {
    #puts $dstFile $str
    binary scan $str s* vals
    puts $dstFile [ binary format S* $vals ]
}

close $dstFile
close $srcFile
# end TCL code.

The result file, viewed in hex, is:
FE FF 78 45 0D 0A 0D 0A 45 00 0D 0A 0D 0A 0D 0A

Now we can see that the 1st line is correct, but in the 2nd line the 0x78 has been cut off.

jeffh | Wed, 2009-03-04 17:25

I would try (untested) the following. The difference is that it makes no assumption of newline handling, which may be the confusion above. For large files, you'd probably want to chunk the data handling.

proc le2be {fle fbe} {
    set srcFile [open $fle r]
    set dstFile [open $fbe w]
    fconfigure $srcFile -encoding binary
    fconfigure $dstFile -encoding binary

    binary scan [read $srcFile] s* vals
    puts -nonewline $dstFile [binary format S* $vals]

    close $dstFile
    close $srcFile
}

umb365 | Wed, 2009-03-04 17:50

I will test it later.

umb365 | Tue, 2009-03-10 08:13

Sorry for the delay.
I think I understand Tcl's behavior now.
The 0x00 in the "0A 00" interferes with "binary scan $str s* vals":
the 2nd line becomes "00 45 78".
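The misalignment can be reproduced without any file. This is only a rough sketch: it uses [split] on the 0A byte as a stand-in for line-oriented reading, and the name hexLines is made up for the illustration. With the default "auto" input translation, [gets] may also end a line at the 0D byte, which would give the shorter "00 45 78" seen above.

```tcl
# The demo file's bytes: BOM, U+7845, CR LF, U+7845, CR LF (all UTF-16LE).
set data [binary format H* fffe45780d000a0045780d000a00]

# Split at every 0A byte, roughly what reading line by line does.
set hexLines {}
foreach line [split $data \n] {
    binary scan $line H* hex
    lappend hexLines $hex
}
puts $hexLines
# prints fffe45780d00 0045780d00 00
```

The second chunk starts with the 00 that belonged to the line break and has an odd length, so every 16-bit value scanned from it with s* is shifted by one byte.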

So I rewrote the code to check for a leading 0x00:
# demo TCL code ############################################
set srcFile [ open unicode-littleendian.txt r ]
set dstFile [ open unicode-bigendian1b.txt w ]
fconfigure $srcFile -encoding binary
fconfigure $dstFile -encoding binary -translation binary

while { 0 <= [ gets $srcFile str ] } {
    set length [ string length $str ]
    binary scan [ string index $str 0 ] c tmp
    if { ( 0 == $tmp ) && ( 1 == [ expr { $length % 2 } ] ) } {
        set str [ string range $str 1 end ]
    }
    binary scan $str s* vals
    puts -nonewline $dstFile [ binary format S* $vals ]
    puts -nonewline $dstFile [ format %c%c 0x00 0x0A ]
}

close $dstFile
close $srcFile
# end TCL code. ############################################
But with either your code or my new code, the newlines are still not handled quite properly.
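
One thing that may be worth trying is to turn off end-of-line translation on both channels, since "-encoding binary" alone leaves the default "auto" translation in place on the input side. The following is only a sketch under that assumption, not a confirmed fix from this thread; the proc names swap16 and le2beBinary are invented for the example.

```tcl
# Swap the byte order of every 16-bit unit in a binary string.
proc swap16 {data} {
    set vals {}
    binary scan $data s* vals
    return [binary format S* $vals]
}

# Convert a UTF-16LE file to UTF-16BE with no line-ending translation.
proc le2beBinary {srcPath dstPath} {
    set src [open $srcPath r]
    set dst [open $dstPath w]
    # -translation binary passes CR and LF bytes through untouched
    # and also sets the channel encoding to binary.
    fconfigure $src -translation binary
    fconfigure $dst -translation binary
    # Whole-file read; large files would need to be processed in chunks.
    puts -nonewline $dst [swap16 [read $src]]
    close $dst
    close $src
}
```

It would be called as: le2beBinary unicode-littleendian.txt unicode-bigendian.txt. For the demo file above, the expected bytes are FE FF 78 45 00 0D 00 0A 78 45 00 0D 00 0A.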