FUDforum: comp.lang.php » DOMDocument loadHTML and double UTF8 encode

Re: DOMDocument loadHTML and double UTF8 encode [message #170261 is a reply to message #170254]

Sat, 23 October 2010 12:38

Robert Hairgrove

roger21 wrote:
> hi,
>
> i use DOMDocument loadHTML to parse pages from this forum
> http://forum.hardware.fr/ (the forum is in utf8) my problem is some
> pages are actually seen as utf8 like this one
> http://forum.hardware.fr/hfr/Hardware/liste_sujet-1.htm and some are not
> like this one
> http://forum.hardware.fr/hfr/HardwarePeripheriques/liste_sujet-1.htm
> and so the second kind results in a double utf8 encoding
>
> so i test if the page is doubly encoded and if yes i utf8_decode the
> text value i want but there are some side effects, for exemple the euro
> sign is not doubly encode so this one become crap when utf8_decoded ...
> and i don't know if there are other signs like that
>
> so i am lost (and pissed) any idea how i should manage all that ?

Strange, because when I load those pages, both are seen as being UTF-8
encoded by Mozilla Firefox 3.6.11 on my system (Linux Ubuntu Hardy 8.04
LTS), and everything seems to display correctly.

But, as I recently discovered, a lot depends on whether the forum's
server actually sends an HTTP Content-Type header with a UTF-8 charset
declaration (from PHP, via header()) before any page output, instead of
merely using an HTML meta tag. If not, the page might still be treated
as some other character set, for example ISO-8859-1 ... which wouldn't
be so bad if the extended/accented characters were correctly translated
to HTML entities by the forum software, and obviously they are not
(i.e. the source for the pages shows "á" in plain text instead of
"&aacute;").
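
To make the header-versus-meta-tag point concrete, here is a minimal sketch of reading the charset out of a Content-Type header value. The function name charsetFromContentType() is hypothetical (it is not part of PHP), and the regular expression is a simplification that only covers the common "charset=..." form.

```php
<?php
// Hypothetical helper: extract the charset parameter from a
// Content-Type header value such as "text/html; charset=UTF-8".
// Returns the charset in lower case, or null when none is declared.
function charsetFromContentType(string $headerValue): ?string
{
    // Simplified pattern: optional quotes around the value, stop at ';' or whitespace.
    if (preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', $headerValue, $m)) {
        return strtolower($m[1]);
    }
    return null;
}

var_dump(charsetFromContentType('text/html; charset=UTF-8'));
var_dump(charsetFromContentType('text/html'));
```

The header value itself could come from get_headers() or from whatever HTTP client fetched the page. If the function returns null, the only charset hint left is the meta tag inside the HTML.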

Of course, the forum member's client browser might not send messages
encoded in UTF-8, and the result can be garbage -- Russian text encoded
as Windows-1251, for example, and copied and pasted from an editor into
the browser, will sometimes show up as ISO-8859-1 extended characters
in some forums I have seen.

One thing you might want to try is to compare some original text with
the result of a utf8_encode(utf8_decode($orig_text)) combination, or
maybe vice-versa. If they are the same, then the page is being
interpreted as UTF-8; if not, it is most likely being interpreted as
ISO-8859-1 or some other character set.
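
A minimal sketch of that round-trip comparison follows. It assumes the mbstring extension is available and uses mb_convert_encoding() in place of utf8_decode()/utf8_encode(), which are deprecated as of PHP 8.2. The function name survivesLatin1RoundTrip() is hypothetical.

```php
<?php
// Hypothetical helper: true when $text comes back unchanged after
// UTF-8 -> ISO-8859-1 -> UTF-8. Assumes the mbstring extension.
function survivesLatin1RoundTrip(string $text): bool
{
    $latin1 = mb_convert_encoding($text, 'ISO-8859-1', 'UTF-8');
    $back   = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
    return $back === $text;
}

$utf8   = "caf\xC3\xA9";   // "café" as UTF-8 bytes
$latin1 = "caf\xE9";       // "café" as ISO-8859-1 bytes (not valid UTF-8)
$euro   = "\xE2\x82\xAC";  // euro sign: valid UTF-8, but not in ISO-8859-1

var_dump(survivesLatin1RoundTrip($utf8));    // bool(true)
var_dump(survivesLatin1RoundTrip($latin1));  // bool(false)
var_dump(survivesLatin1RoundTrip($euro));    // bool(false)
var_dump(mb_check_encoding($euro, 'UTF-8')); // bool(true)
```

Note the third case: the euro sign is perfectly valid UTF-8 but has no ISO-8859-1 equivalent, so the round trip fails for it. That limitation may be related to the euro-sign trouble described in the original post; checking validity with mb_check_encoding() first avoids that particular false alarm.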

Also, the user comments posted here might prove to be helpful:
http://ch2.php.net/manual/en/domdocument.loadhtml.php
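
For what it's worth, one workaround often suggested for loadHTML() is to prepend an XML encoding declaration so that the parser treats the input as UTF-8 instead of falling back to ISO-8859-1. The sketch below assumes the page bytes really are UTF-8; loadUtf8Html() is a hypothetical wrapper name, and behaviour may vary with the libxml2 version behind your PHP build.

```php
<?php
// Hypothetical wrapper: parse an HTML string that is known to be UTF-8.
function loadUtf8Html(string $html): DOMDocument
{
    $doc = new DOMDocument();
    // Collect parser warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The encoding declaration is the hint; the rest is the original page.
    $doc->loadHTML('<?xml encoding="UTF-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $doc;
}

$doc = loadUtf8Html("<html><body><p>caf\xC3\xA9</p></body></html>");
$text = $doc->getElementsByTagName('p')->item(0)->textContent;
var_dump($text === "caf\xC3\xA9");
```

If the comparison at the end prints true, the text came out of the DOM as single-encoded UTF-8 and no utf8_decode() pass is needed afterwards.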

[Message index]

		DOMDocument loadHTML and double UTF8 encode By: roger21 on Fri, 22 October 2010 22:09
		Re: DOMDocument loadHTML and double UTF8 encode By: Robert Hairgrove on Sat, 23 October 2010 12:38
		Re: DOMDocument loadHTML and double UTF8 encode By: roger21 on Sat, 23 October 2010 13:35
		Re: DOMDocument loadHTML and double UTF8 encode By: roger21 on Sat, 23 October 2010 14:20
