Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175421 is a reply to message #175404]
Tue, 20 September 2011 22:14
Thomas 'PointedEars'
Peter H. Coffin wrote:
> On Wed, 14 Sep 2011 14:07:27 +0200, Thomas 'PointedEars' Lahn wrote:
>> Peter H. Coffin wrote:
>>> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>>>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>>>> much other stuff) are the bytes shown here (first column is the
>>>> character, the second is ord(character), the third and fourth are
>>>> binary and hexadecimal representations of the character, respectively).
>>>>
>>>> P / 80 / 01010000 / 50
>>>> l / 108 / 01101100 / 6c
>>>> z / 122 / 01111010 / 7a
>>>> e / 101 / 01100101 / 65
>>>> \303 / 195 / 11000011 / c3
>>>> \205 / 133 / 10000101 / 85
>>>> \313 / 203 / 11001011 / cb
>>>> \206 / 134 / 10000110 / 86
>>>>
>>>> This is supposed to be ISO-8859-1 encoded, and should encode the
>>>> character U+0148 (\v{n}; Latin small letter n with caron).
>>>>
>>>> Does anybody have any idea how I could decode this (or how it was
>>>> encoded in the first place)? Any suggestions would be greatly
>>>> appreciated.
>>> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
>>> CP1251, actually) [???]
>> Windows-125_2_ (Western) corresponds largely with ISO-8859-1.
>> Windows-1251, which is the proper name for that character set and
>> encoding, is Cyrillic above 0x7F, and corresponds largely with
>> ISO-8859-5.
>
> Yeah, I know that. But there's 0x8n values in the hex that don't
> represent in 8859-1 but do in CP1251. And there's a LOT more
> charset-unaware stuff out there that assumes all the world is CP1251
> than assumes everything is 8859-1.

You are missing the point. Windows-125*1* (or "CP1251", as you put it) is
not remotely the same as ISO-8859-1; Windows-125_2_ is the one that largely
corresponds to it.
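
For what it is worth, the OP's bytes can be used to check which code page is involved. The following is only a sketch in Python, under the assumption (taken from the "UTF-8 encoded representation" description quoted above) that the text was UTF-8, was misread as a Windows code page, and was then encoded as UTF-8 again. The helper name `undo` and the variable names are mine, not anything from the OP's file.

```python
# Bytes from the OP's table that follow "Plze": c3 85 cb 86
raw = bytes([0xC3, 0x85, 0xCB, 0x86])

# The bytes are valid UTF-8, but they decode to mojibake:
# U+00C5 (A with ring above) followed by U+02C6 (modifier circumflex).
mojibake = raw.decode("utf-8")

def undo(text, codepage):
    """Re-encode text in codepage and read the result as UTF-8.

    Returns None if the text cannot be encoded in that code page or
    if the resulting bytes are not valid UTF-8.
    """
    try:
        return text.encode(codepage).decode("utf-8")
    except UnicodeError:
        return None

via_1252 = undo(mojibake, "cp1252")     # Windows-1252
via_1251 = undo(mojibake, "cp1251")     # Windows-1251
via_latin1 = undo(mojibake, "latin-1")  # ISO-8859-1

print(repr(via_1252), repr(via_1251), repr(via_latin1))
```

Under that assumption only the Windows-1252 round trip yields U+0148. Windows-1251 has no U+00C5 at all, and ISO-8859-1 has no U+02C6, so both of those attempts fail at the encoding step.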

It is also misleading to say that 0x85 and 0x86 have no representation in
ISO-8859-1. That is true only of the rarely used ISO/IEC 8859-1, which is
_not_ equivalent to ISO-8859-1, the encoding that the OP named and that you
referred to. In ISO-8859-1, 0x85 represents NEL (ISO C1 Next Line, which
marks end-of-line on some IBM mainframes) and 0x86 represents SSA (ISO C1
Start of Selected Area, used by block-oriented terminals).
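
The difference is easy to see in any environment that implements these encodings. A minimal sketch in Python follows; it assumes that Python's "latin-1" codec stands in for ISO-8859-1 with the C1 controls included, and the variable names are only illustrative.

```python
import unicodedata

data = bytes([0x85, 0x86])

latin1 = data.decode("latin-1")  # ISO-8859-1
cp1252 = data.decode("cp1252")   # Windows-1252
cp1251 = data.decode("cp1251")   # Windows-1251

# In ISO-8859-1 both bytes map to C1 control characters (category Cc).
latin1_points = [f"U+{ord(c):04X}" for c in latin1]
latin1_cats = [unicodedata.category(c) for c in latin1]

# In both Windows code pages the same bytes are printable punctuation.
cp1252_names = [unicodedata.name(c) for c in cp1252]
cp1251_names = [unicodedata.name(c) for c in cp1251]

# NEL (U+0085) is treated as a line boundary by str.splitlines().
nel_split = "a\x85b".splitlines()

print(latin1_points, latin1_cats)
print(cp1252_names, cp1251_names)
print(nel_split)
```

So the bytes do decode under ISO-8859-1; they simply decode to control characters rather than to anything printable, whereas both Windows code pages turn them into an ellipsis and a dagger.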
PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)