Re: DOMDocument loadHTML and double UTF8 encode [message #170265 is a reply to message #170263] Sat, 23 October 2010 14:20
roger21
roger21 wrote:
> Robert Hairgrove wrote:
>> roger21 wrote:
>>> hi,
>>>
>>> i use DOMDocument loadHTML to parse pages from this forum
>>> http://forum.hardware.fr/ (the forum is in utf8). my problem is that
>>> some pages are correctly seen as utf8, like this one
>>> http://forum.hardware.fr/hfr/Hardware/liste_sujet-1.htm and some are
>>> not, like this one
>>> http://forum.hardware.fr/hfr/HardwarePeripheriques/liste_sujet-1.htm
>>> and so the second kind ends up doubly utf8 encoded
>>>
>>> so i test whether the page is doubly encoded and, if so, i utf8_decode
>>> the text value i want, but there are some side effects: for example,
>>> the euro sign is not doubly encoded, so it turns into crap when
>>> utf8_decoded ...
>>> and i don't know if there are other characters like that
>>>
>>> so i am lost (and pissed). any idea how i should handle all that?
>>
>> Strange, because when I load those pages, both are seen as being UTF-8
>> encoded by Mozilla Firefox 3.6.11 on my system (Linux Ubuntu Hardy
>> 8.04 LTS), and everything seems to display correctly.
>>
>> But, as I recently discovered, a lot will depend on whether the
>> forum's server actually sends an HTTP Content-Type header with a UTF-8
>> charset declaration (for example via PHP's header() function) instead
>> of merely using an HTML meta tag. If not, the page might still be
>> displayed as some other
>> character set, for example ISO-8859-1 ... which wouldn't be so bad if
>> the extended/accented characters were correctly translated as HTML
>> entities by the forum software, and obviously they are not (i.e. the
>> source for the pages shows "á" in plain text instead of "&aacute;").
>>
>> Of course, the forum member's client browser might not send messages
>> encoded in UTF-8, and the result can be garbage -- Russian text
>> encoded as Windows 1251, for example, and copied and pasted from an
>> editor into the browser will sometimes display as ISO-8859-1 extended
>> characters in some forums I have seen.
>>
>> One thing you might want to try is to compare some original text with
>> the result of a utf8_encode(utf8_decode(orig_text)) combination, or
>> maybe vice-versa. If they are the same, then the page is being
>> interpreted as UTF-8; if not, it is most likely being interpreted as
>> ISO-8859-1 or some other character set.
>>
>> Also, the user comments posted here might prove to be helpful:
>> http://ch2.php.net/manual/en/domdocument.loadhtml.php
>
> thank you for your answer. when i say "seen" i mean seen by the loadhtml
> function, which converts everything to utf8 if it thinks that is not
> already the case. of course both pages are in utf8 (per the headers,
> the meta tags and the content, hence my double encoding problem), and
> that's why i'm a bit upset (but i won't say the forum's pages are clean,
> they are pretty crappy overall)
>
> and my problem is that when i utf8_decode my doubly encoded pages i get
> character issues that i don't have when the page is not over-encoded by
> loadhtml (with the same chars)
>
> and i don't want to decode the text first because i want to stay in
> utf8 (and i will lose characters if i decode it first)
>
> and i already checked the comments. they all seem to be related, but i
> tried most of them and the result is either the same or worse (hence i
> ask here :p)
>
> maybe i should try to understand why some pages are doubly encoded; they
> may have some crap that i could fix before giving it to loadhtml
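
To make the utf8_decode side effect quoted above concrete, here is a minimal sketch. It assumes the mbstring extension is available and uses mb_convert_encoding in place of utf8_decode (which is deprecated as of PHP 8.2). The helper names doubleEncode and decodeOnce and the sample characters are made up for illustration, and it assumes the euro sign reached the text as a genuine U+20AC rather than as misread bytes.

```php
<?php
// Simulate double encoding: UTF-8 bytes misread as ISO-8859-1 and
// converted to UTF-8 a second time.
function doubleEncode(string $utf8): string
{
    return mb_convert_encoding($utf8, 'UTF-8', 'ISO-8859-1');
}

// Undo one layer of encoding (the same conversion utf8_decode performs).
function decodeOnce(string $s): string
{
    return mb_convert_encoding($s, 'ISO-8859-1', 'UTF-8');
}

$accent = "\u{00E9}"; // "é", doubly encoded in this scenario
$euro   = "\u{20AC}"; // "€", assumed to be correctly encoded already

// A doubly encoded "é" is repaired by decoding once.
var_dump(decodeOnce(doubleEncode($accent)) === $accent);

// "€" has no ISO-8859-1 equivalent, so decoding replaces it with "?".
var_dump(decodeOnce($euro));
```

Any other character outside ISO-8859-1 that was not doubly encoded would be lost the same way, which is one reason to fix the input before parsing rather than decode afterwards.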

ok, one of the comments seems useful: the title tag comes before the
charset meta tag, and when a title has accents loadhtml over-encodes.
if i move the meta tags before the title tag that should fix it
(according to the comment). i will try that
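
A rough sketch of that reordering, before handing the markup to loadHTML. The function names hoistCharsetMeta and loadReorderedHtml are hypothetical, the regular expressions are a heuristic rather than a real HTML parser, and whether this actually stops the over-encoding depends on the libxml version in use, so treat it as something to try rather than a confirmed fix.

```php
<?php
// Hypothetical helper (not from the thread): move the first charset <meta>
// tag to the very start of <head>, so it is seen before <title>.
function hoistCharsetMeta(string $html): string
{
    // Matches <meta http-equiv=... content="...charset=..."> and <meta charset=...>.
    if (!preg_match('/<meta\b[^>]*charset\s*=[^>]*>/i', $html, $meta, PREG_OFFSET_CAPTURE)) {
        return $html; // no charset declaration found; leave the input untouched
    }
    [$tag, $pos] = $meta[0];

    // Cut the tag out of its current position.
    $html = substr_replace($html, '', $pos, strlen($tag));

    // Re-insert it right after the opening <head> tag, or at the start if there is none.
    $insertAt = 0;
    if (preg_match('/<head\b[^>]*>/i', $html, $head, PREG_OFFSET_CAPTURE)) {
        $insertAt = $head[0][1] + strlen($head[0][0]);
    }
    return substr_replace($html, $tag, $insertAt, 0);
}

// Hypothetical wrapper: parse the reordered markup with DOMDocument.
function loadReorderedHtml(string $html): DOMDocument
{
    $doc = new DOMDocument();
    libxml_use_internal_errors(true); // collect parser warnings instead of printing them
    $doc->loadHTML(hoistCharsetMeta($html));
    libxml_clear_errors();
    return $doc;
}
```

Usage would be something like loadReorderedHtml($html), where $html holds the raw page source; text values can then be read from the returned document as before and compared against the output without the reordering.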