Trying to decode text that is supposed to be ISO-8859-1 [message #175395]
Wed, 14 September 2011 01:56
Bart Kastermans
I have downloaded a file that claims to be ISO-8859-1. In it (among
much other stuff) are the bytes shown here (the first column is the
character, the second is ord(character), and the third and fourth are the
binary and hexadecimal representations of the character, respectively):
P / 80 / 01010000 / 50
l / 108 / 01101100 / 6c
z / 122 / 01111010 / 7a
e / 101 / 01100101 / 65
\303 / 195 / 11000011 / c3
\205 / 133 / 10000101 / 85
\313 / 203 / 11001011 / cb
\206 / 134 / 10000110 / 86
This is supposed to be ISO-8859-1 encoded, and should encode the
character U+0148 (\v{n}; Latin small letter n with caron).
Does anybody have any idea how I could decode this (or how it was
encoded in the first place)? Any suggestions would be greatly
appreciated.

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175397 is a reply to message #175395]
Wed, 14 September 2011 03:42
Peter H. Coffin
On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
> I have downloaded a file that claims to be ISO-8859-1. In it (among
> much other stuff) are the bytes shown here (the first column is the
> character, the second is ord(character), and the third and fourth are the
> binary and hexadecimal representations of the character, respectively):
>
> P / 80 / 01010000 / 50
> l / 108 / 01101100 / 6c
> z / 122 / 01111010 / 7a
> e / 101 / 01100101 / 65
> \303 / 195 / 11000011 / c3
> \205 / 133 / 10000101 / 85
> \313 / 203 / 11001011 / cb
> \206 / 134 / 10000110 / 86
>
> This is supposed to be ISO-8859-1 encoded, and should encode the
> character U+0148 (\v{n}; Latin small letter n with caron).
>
> Does anybody have any idea how I could decode this (or how it was
> encoded in the first place)? Any suggestions would be greatly
> appreciated.
It's the UTF-8 encoding of a false ISO-8859-1 (? probably CP1251,
actually) display of the CORRECT UTF-8 for the string you're hoping
for. Someone basically doubled up on the conversion to make that.
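To make the doubled-up conversion concrete, here is a minimal Python sketch
that undoes it for the bytes quoted above. It assumes the intermediate
charset was Windows-1252 (Python's "cp1252" codec), which is a guess from
the byte values and not something the file declares; the variable names are
only illustrative.

```python
# Bytes from the original post: "Plze" followed by c3 85 cb 86.
raw = bytes([0x50, 0x6C, 0x7A, 0x65, 0xC3, 0x85, 0xCB, 0x86])

# Step 1: the file is really UTF-8; decoding yields the mojibake text
# ("Plze" followed by U+00C5 and U+02C6).
mojibake = raw.decode("utf-8")

# Step 2: map the mojibake characters back to the bytes they were misread from.
# ASSUMPTION: the wrong charset was Windows-1252 ("cp1252").
inner = mojibake.encode("cp1252")      # b'Plze\xc5\x88'

# Step 3: those bytes are the correct UTF-8 for the intended text.
fixed = inner.decode("utf-8")

print(ascii(fixed))                     # 'Plze\u0148'
```

If the encode step raises UnicodeEncodeError elsewhere in the file, the
cp1252 assumption does not hold for that part of the text.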
--
82. I will not shoot at any of my enemies if they are standing in front
of the crucial support beam to a heavy, dangerous, unbalanced
structure.
--Peter Anspach's list of things to do as an Evil Overlord

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175399 is a reply to message #175395]
Wed, 14 September 2011 10:25
alvaro.NOSPAMTHANX
On 14/09/2011 3:56, Bart Kastermans wrote:
> I have downloaded a file that claims to be ISO-8859-1. In it (among
> much other stuff) are the bytes shown here (the first column is the
> character, the second is ord(character), and the third and fourth are the
> binary and hexadecimal representations of the character, respectively):
>
> P / 80 / 01010000 / 50
> l / 108 / 01101100 / 6c
> z / 122 / 01111010 / 7a
> e / 101 / 01100101 / 65
> \303 / 195 / 11000011 / c3
> \205 / 133 / 10000101 / 85
> \313 / 203 / 11001011 / cb
> \206 / 134 / 10000110 / 86
>
> This is supposed to be ISO-8859-1 encoded, and should encode the
> character U+0148 (\v{n}; Latin small letter n with caron).
Funny... I think that character (ň) does not even exist in ISO-8859-1:
http://www.fileformat.info/info/unicode/char/148/index.htm
http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout
And in fact the 0x85 and 0x86 positions are empty in ISO-8859-1.
The mb_detect_encoding() function suggests that the string is actually
in UTF-8 and contains two chars: 0xC385 and 0xCB86 (Åˆ). The "Åˆ" string
is exactly what you get if you encode "ň" in UTF-8 and try to display it as
ISO-8859-1, so I guess that's what the data creator is doing.
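The two-character reading can be checked with a short Python sketch (Python
is used here only for brevity; mb_detect_encoding() itself is PHP). Note that
0xC385 and 0xCB86 are the UTF-8 byte pairs, not code points; the code points
come out as U+00C5 and U+02C6. Since U+02C6 is not in ISO-8859-1, the sketch
assumes the display charset was Windows-1252 ("cp1252"), which is an
assumption and not something the file states.

```python
import unicodedata

# The four non-ASCII bytes from the file.
tail = bytes([0xC3, 0x85, 0xCB, 0x86])

# Decoded as UTF-8, the bytes are valid and yield two characters.
chars = tail.decode("utf-8")
for ch in chars:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
# U+00C5 LATIN CAPITAL LETTER A WITH RING ABOVE
# U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT

# Going the other way: encode U+0148 as UTF-8, then misread the bytes.
# ASSUMPTION: the misreading charset is Windows-1252.
wanted = "\u0148"
misread = wanted.encode("utf-8").decode("cp1252")
print(misread == chars)   # True
```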
> Does anybody have any idea how I could decode this (or how it was
> encoded in the first place)? Any suggestions would be greatly
> appreciated.
To begin with, you cannot use ISO-8859-1 as target encoding if you want
to use U+0148.
Now, if you decide to switch to UTF-8... well, I'll report back if I
find something more precise :)
--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My site of satin-finish humor: http://www.demogracia.com
--

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175401 is a reply to message #175397]
Wed, 14 September 2011 12:07
Thomas 'PointedEars'
Peter H. Coffin wrote:
> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>> much other stuff) are the bytes shown here (the first column is the
>> character, the second is ord(character), and the third and fourth are the
>> binary and hexadecimal representations of the character, respectively):
>>
>> P / 80 / 01010000 / 50
>> l / 108 / 01101100 / 6c
>> z / 122 / 01111010 / 7a
>> e / 101 / 01100101 / 65
>> \303 / 195 / 11000011 / c3
>> \205 / 133 / 10000101 / 85
>> \313 / 203 / 11001011 / cb
>> \206 / 134 / 10000110 / 86
>>
>> This is supposed to be ISO-8859-1 encoded, and should encode the
>> character U+0148 (\v{n}; Latin small letter n with caron).
>>
>> Does anybody have any idea how I could decode this (or how it was
>> encoded in the first place)? Any suggestions would be greatly
>> appreciated.
>
> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
> CP1251, actually) […]
Windows-125_2_ (Western) corresponds largely with ISO-8859-1. Windows-1251,
which is the proper name for that character set and encoding, is Cyrillic
above 0x7F, and corresponds largely with ISO-8859-5.
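One way to test which charset sat in the middle is to try each candidate and
see which one maps the mojibake back to valid UTF-8. A rough Python sketch;
try_repair is a hypothetical helper, not a library function, and the three
codec names are just the candidates mentioned in this thread.

```python
# The four non-ASCII bytes from the file, read as UTF-8 (the mojibake text).
mojibake = bytes([0xC3, 0x85, 0xCB, 0x86]).decode("utf-8")


def try_repair(text, codec):
    """Return the repaired string, or None if `codec` cannot explain `text`.

    Hypothetical helper: re-encode with the candidate charset, then check
    whether the resulting bytes are valid UTF-8.
    """
    try:
        return text.encode(codec).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


results = {codec: try_repair(mojibake, codec)
           for codec in ("latin-1", "cp1251", "cp1252")}
for codec, repaired in results.items():
    print(codec, ascii(repaired))
# latin-1 None
# cp1251 None
# cp1252 '\u0148'
```

Only cp1252 round-trips for these four bytes, which fits Windows-1252 rather
than Windows-1251 as the intermediate step, at least for this sample.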
PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175404 is a reply to message #175401]
Wed, 14 September 2011 13:37
Peter H. Coffin
On Wed, 14 Sep 2011 14:07:27 +0200, Thomas 'PointedEars' Lahn wrote:
> Peter H. Coffin wrote:
>
>> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>>> much other stuff) are the bytes shown here (the first column is the
>>> character, the second is ord(character), and the third and fourth are the
>>> binary and hexadecimal representations of the character, respectively):
>>>
>>> P / 80 / 01010000 / 50
>>> l / 108 / 01101100 / 6c
>>> z / 122 / 01111010 / 7a
>>> e / 101 / 01100101 / 65
>>> \303 / 195 / 11000011 / c3
>>> \205 / 133 / 10000101 / 85
>>> \313 / 203 / 11001011 / cb
>>> \206 / 134 / 10000110 / 86
>>>
>>> This is supposed to be ISO-8859-1 encoded, and should encode the
>>> character U+0148 (\v{n}; Latin small letter n with caron).
>>>
>>> Does anybody have any idea how I could decode this (or how it was
>>> encoded in the first place)? Any suggestions would be greatly
>>> appreciated.
>>
>> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
>> CP1251, actually) […]
>
> Windows-125_2_ (Western) corresponds largely with ISO-8859-1. Windows-1251,
> which is the proper name for that character set and encoding, is Cyrillic
> above 0x7F, and corresponds largely with ISO-8859-5.
Yeah, I know that. But there are 0x8n values in the hex that don't
represent anything in 8859-1 but do in CP1251. And there's a LOT more
charset-unaware stuff out there that assumes all the world is CP1251
than assumes everything is 8859-1.
--
A government big enough to give you everything you want is a government
big enough to take from you everything you have.
-- Gerald Ford in an address to Congress on August 12, 1974

Re: Trying to decode text that is supposed to be ISO-8859-1 [message #175421 is a reply to message #175404]
Tue, 20 September 2011 22:14
Thomas 'PointedEars'
Peter H. Coffin wrote:
> On Wed, 14 Sep 2011 14:07:27 +0200, Thomas 'PointedEars' Lahn wrote:
>> Peter H. Coffin wrote:
>>> On Tue, 13 Sep 2011 19:56:20 -0600, Bart Kastermans wrote:
>>>> I have downloaded a file that claims to be ISO-8859-1. In it (among
>>>> much other stuff) are the bytes shown here (the first column is the
>>>> character, the second is ord(character), and the third and fourth are the
>>>> binary and hexadecimal representations of the character, respectively):
>>>>
>>>> P / 80 / 01010000 / 50
>>>> l / 108 / 01101100 / 6c
>>>> z / 122 / 01111010 / 7a
>>>> e / 101 / 01100101 / 65
>>>> \303 / 195 / 11000011 / c3
>>>> \205 / 133 / 10000101 / 85
>>>> \313 / 203 / 11001011 / cb
>>>> \206 / 134 / 10000110 / 86
>>>>
>>>> This is supposed to be ISO-8859-1 encoded, and should encode the
>>>> character U+0148 (\v{n}; Latin small letter n with caron).
>>>>
>>>> Does anybody have any idea how I could decode this (or how it was
>>>> encoded in the first place)? Any suggestions would be greatly
>>>> appreciated.
>>> It's UTF-8 encoded representation of a false ISO-8859-1(? probably
>>> CP1251, actually) […]
>> Windows-125_2_ (Western) corresponds largely with ISO-8859-1.
>> Windows-1251, which is the proper name for that character set and
>> encoding, is Cyrillic above 0x7F, and corresponds largely with
>> ISO-8859-5.
>
> Yeah, I know that. But there are 0x8n values in the hex that don't
> represent anything in 8859-1 but do in CP1251. And there's a LOT more
> charset-unaware stuff out there that assumes all the world is CP1251
> than assumes everything is 8859-1.
You are missing the point. Windows-125*1* (or "CP1251" as you put it) is
not remotely the same as ISO-8859-1; Windows-125_2_ is.
It is also misleading to state that 0x85 and 0x86 had no representation in
the widely unused ISO/IEC 8859-1 because that encoding is _not_ equivalent
to ISO-8859-1, which is what the OP stated and you referred to instead. In
ISO-8859-1, 0x85 represents NEL (ISO C1 Next Line, marks end-of-line on some
IBM Mainframes) and 0x86 represents SSA (ISO C1 Start of Selected Area, used
by block-oriented terminals).
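The difference is easy to see in Python, whose "latin-1" codec maps each byte
to the code point of the same value, C1 controls included. This is only a
small sketch; note that in the file at hand 0x85 and 0x86 occur as UTF-8
continuation bytes, not as standalone characters.

```python
import unicodedata

pair = bytes([0x85, 0x86])

# "latin-1" maps every byte 0x00-0xFF to the code point of the same number,
# so 0x85 and 0x86 decode to C1 control characters (category "Cc").
latin1 = pair.decode("latin-1")
print([f"U+{ord(ch):04X} {unicodedata.category(ch)}" for ch in latin1])
# ['U+0085 Cc', 'U+0086 Cc']

# Windows-1252 assigns printable characters to the same two bytes.
cp1252 = pair.decode("cp1252")
print([f"U+{ord(ch):04X} {unicodedata.name(ch)}" for ch in cp1252])
# ['U+2026 HORIZONTAL ELLIPSIS', 'U+2020 DAGGER']
```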
PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)