Home » Imported messages » comp.lang.php » How do I force PHP to assume UTF-8 for $_GET?
How do I force PHP to assume UTF-8 for $_GET? [message #174416] Fri, 10 June 2011 18:32
Martin Kotulla
Hello,

I have a Windows application (which I cannot change) that calls my PHP
script and passes it parameters via GET. The application sends properly
encoded UTF-8 parameters, but does not indicate the UTF-8 character set
in the HTTP headers (it does not indicate any character set at all).

When I now access the parameters via $_GET in my PHP script, they are
interpreted as ISO-8859-1 instead of UTF-8.

For example, the Chinese character U+3563 is represented by the UTF-8
bytes 0xE3, 0x95, 0xA3. PHP sees the two ISO-8859-1 characters 0xE3 and
0xA3, and silently drops 0x95, presumably because it is outside of the
range of valid characters in ISO-8859-1.

So, how can I force PHP to regard the parameters as UTF-8-encoded?

Any insights appreciated.

-mk
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174417 is a reply to message #174416] Fri, 10 June 2011 18:54
Tim Streater
In article <95f6abF3qjU1(at)mid(dot)individual(dot)net>,
Martin Kotulla <mk999(at)gmx(dot)de> wrote:

> I have a Windows application (which I cannot change) that calls my PHP
> script and passes it parameters via GET. The application sends properly
> encoded UTF-8 parameters, but does not indicate the UTF-8 character set
> in the HTTP headers (it does not indicate any character set at all).
>
> When I now access the parameters via $_GET in my PHP script, they are
> interpreted as ISO-8859-1 instead of UTF-8.
>
> For example, the Chinese character U+3563 is represented by the UTF-8
> bytes 0xE3, 0x95, 0xA3. PHP sees the two ISO-8859-1 characters 0xE3 and
> 0xA3, and silently drops 0x95, presumably because it is outside of the
> range of valid characters in ISO-8859-1.
>
> So, how can I force PHP to regard the parameters as UTF-8-encoded?

How do you know it's dropped anything? IIRC PHP knows nothing about
character sets, just bytes. Have you tried something like the following:

$mystr = $_GET["mystr"];
$utfstr = iconv ("iso-8859-1", "UTF-8//IGNORE", $mystr);
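
A hedged sketch of a more cautious variant: the conversion above assumes the input really is ISO-8859-1, so it would re-encode bytes that are already valid UTF-8. The helper below only converts when the input is not valid UTF-8. The name ensure_utf8 is made up for illustration, the literal stands in for $_GET["mystr"], and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical helper (not from the thread): return $value unchanged if it is
// already valid UTF-8, otherwise treat it as ISO-8859-1 and convert it.
function ensure_utf8(string $value): string
{
    // With the /u modifier, preg_match() returns false when the subject
    // is not valid UTF-8, and 1 when the empty pattern matches.
    if (preg_match('//u', $value) === 1) {
        return $value;
    }
    $converted = iconv('ISO-8859-1', 'UTF-8', $value);
    // iconv() returns false on failure; fall back to an empty string.
    return $converted === false ? '' : $converted;
}

$mystr = "\xE3\x95\xA3";        // stand-in for $_GET["mystr"]
$utfstr = ensure_utf8($mystr);
echo bin2hex($utfstr), "\n";    // the three input bytes, shown as hex
```

Printing the result with bin2hex() sidesteps any terminal or editor that might hide or alter non-printable bytes.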

--
Tim

"That excessive bail ought not to be required, nor excessive fines imposed,
nor cruel and unusual punishments inflicted" -- Bill of Rights 1689
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174418 is a reply to message #174417] Fri, 10 June 2011 19:00
Martin Kotulla
On 10.06.2011 20:54, Tim Streater wrote:
>
> How do you know it's dropped anything? IIRC PHP knows nothing about
> character sets, just bytes. Have you tried something like the following:
>
> $mystr = $_GET["mystr"];
> $utfstr = iconv ("iso-8859-1", "UTF-8//IGNORE", $mystr);
>

I tried that before. But in the example that I gave, the three UTF-8
bytes came out as two ISO-8859-1 characters. The 0x95 was dropped,
presumably because it's in the range 0x80 to 0x9F which is not defined
in ISO-8859-1.

-mk
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174420 is a reply to message #174418] Fri, 10 June 2011 20:56
Tim Streater
In article <95f7trFehmU1(at)mid(dot)individual(dot)net>,
Martin Kotulla <mk999(at)gmx(dot)de> wrote:

> On 10.06.2011 20:54, Tim Streater wrote:
>>
>> How do you know it's dropped anything? IIRC PHP knows nothing about
>> character sets, just bytes. Have you tried something like the following:
>>
>> $mystr = $_GET["mystr"];
>> $utfstr = iconv ("iso-8859-1", "UTF-8//IGNORE", $mystr);

> I tried that before. But in the example that I gave, the three UTF-8
> bytes came out as two ISO-8859-1 characters. The 0x95 was dropped,
> presumably because it's in the range 0x80 to 0x9F which is not defined
> in ISO-8859-1.

OK, maybe I was asking the wrong question. What do you mean by "access"
in "When I now access the parameters via $_GET in my PHP script, ..."?
Have you checked that when the Win-app sends U+3563 you get three bytes
of data (0xE3, 0x95, 0xA3) in your PHP script? What are you then
intending to do with those Chinese characters?

--
Tim

"That excessive bail ought not to be required, nor excessive fines imposed,
nor cruel and unusual punishments inflicted" -- Bill of Rights 1689
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174421 is a reply to message #174420] Fri, 10 June 2011 21:59
Martin Kotulla
On 10.06.2011 22:56, Tim Streater wrote:
>
> OK, maybe I was asking the wrong question. What do you mean by "access"
> in "When I now access the parameters via $_GET in my PHP script, ..."?
> Have you checked that when the Win-app sends U+3563 you get three bytes
> of data (0xE3, 0x95, 0xA3) in your PHP script? What are you then
> intending to do with those Chinese characters?
>

The Windows app sends three UTF-8 encoded characters. The PHP $_GET
array only returns two, one being dropped by PHP because it considers
out-of-range characters (in its ISO-8859-1 mind) incorrect.

My script needs to accept arbitrary UTF-8 sequences but PHP drops these
characters because it (correctly, in its ISO-8859-1 mind) thinks they
are invalid. But in the UTF-8 system, they are valid.

I need a way to tell PHP that the sequence is UTF-8 before I access it.
As soon as my script reads something that PHP considers ISO-8859-1, the
damage is done. iconv back to UTF-8 won't help me; I cannot unscramble
scrambled eggs.

-mk
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174422 is a reply to message #174421] Fri, 10 June 2011 22:28
Tim Streater
In article <95fidnFk6U1(at)mid(dot)individual(dot)net>,
Martin Kotulla <mk999(at)gmx(dot)de> wrote:

> On 10.06.2011 22:56, Tim Streater wrote:
>>
>> OK, maybe I was asking the wrong question. What do you mean by "access"
>> in "When I now access the parameters via $_GET in my PHP script, ..."?
>> Have you checked that when the Win-app sends U+3563 you get three bytes
>> of data (0xE3, 0x95, 0xA3) in your PHP script? What are you then
>> intending to do with those Chinese characters?

> The Windows app sends three UTF-8 encoded characters. The PHP $_GET
> array only returns two, one being dropped by PHP because it considers
> out-of-range characters (in its ISO-8859-1 mind) incorrect.

This may not be PHP's fault. Re-reading your first post I see that your
win-app does not indicate UTF-8 in the HTTP headers. That may mean that
the byte stream is wrong before PHP even sees it. But I'm just guessing.

You may want to subscribe to the PHP General mailing list and ask your
question there. Send a mail to:

<php-general-subscribe(at)lists(dot)php(dot)net>

--
Tim

"That excessive bail ought not to be required, nor excessive fines imposed,
nor cruel and unusual punishments inflicted" -- Bill of Rights 1689
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174424 is a reply to message #174416] Fri, 10 June 2011 23:20
Jerry Stuckle
On 6/10/2011 2:32 PM, Martin Kotulla wrote:
> Hello,
>
> I have a Windows application (which I cannot change) that calls my PHP
> script and passes it parameters via GET. The application sends properly
> encoded UTF-8 parameters, but does not indicate the UTF-8 character set
> in the HTTP headers (it does not indicate any character set at all).
>
> When I now access the parameters via $_GET in my PHP script, they are
> interpreted as ISO-8859-1 instead of UTF-8.
>
> For example, the Chinese character U+3563 is represented by the UTF-8
> bytes 0xE3, 0x95, 0xA3. PHP sees the two ISO-8859-1 characters 0xE3 and
> 0xA3, and silently drops 0x95, presumably because it is outside of the
> range of valid characters in ISO-8859-1.
>
> So, how can I force PHP to regard the parameters as UTF-8-encoded?
>
> Any insights appreciated.
>
> -mk

Get whoever wrote the Windows application to fix it. If it is sending
UTF-8 characters, it must indicate so.

It may not even be PHP's fault - the characters may be filtered out by
the server. It's the old story - GIGO - and you have garbage going in.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174425 is a reply to message #174422] Sat, 11 June 2011 03:28
legalize+jeeves
[Please do not mail me a copy of your followup]

Tim Streater <timstreater(at)waitrose(dot)com> spake the secret code
<timstreater-3C1F83(dot)23282710062011(at)news(dot)individual(dot)net> thusly:

> In article <95fidnFk6U1(at)mid(dot)individual(dot)net>,
> Martin Kotulla <mk999(at)gmx(dot)de> wrote:
>
>> On 10.06.2011 22:56, Tim Streater wrote:
>>>
>>> OK, maybe I was asking the wrong question. What do you mean by "access"
>>> in "When I now access the parameters via $_GET in my PHP script, ..."?
>>> Have you checked that when the Win-app sends U+3563 you get three bytes
>>> of data (0xE3, 0x95, 0xA3) in your PHP script? What are you then
>>> intending to do with those Chinese characters?
>
>> The Windows app sends three UTF-8 encoded characters. The PHP $_GET
>> array only returns two, one being dropped by PHP because it considers
>> out-of-range characters (in its ISO-8859-1 mind) incorrect.
>
> This may not be PHP's fault. Re-reading your first post I see that your
> win-app does not indicate UTF-8 in the HTTP headers. That may mean that
> the byte stream is wrong before PHP even sees it. But I'm just guessing.

You can test that idea by running Fiddler2 on the machine where the
Windows application is running. You should be able to see the entire
byte stream between application and remote server.
<http://www.fiddler2.com/fiddler2/>
--
"The Direct3D Graphics Pipeline" -- DirectX 9 draft available for download
<http://legalizeadulthood.wordpress.com/the-direct3d-graphics-pipeline/>

Legalize Adulthood! <http://legalizeadulthood.wordpress.com>
Re: How do I force PHP to assume UTF-8 for $_GET? [message #174541 is a reply to message #174424] Fri, 17 June 2011 07:27
Martin Kotulla
On 11.06.2011 01:20, Jerry Stuckle wrote:
>
> Get whoever wrote the Windows application to fix it. If it is sending
> UTF-8 characters, it must indicate so.
>
> It may not even be PHP's fault - the characters may be filtered out by
> the server. It's the old story - GIGO - and you have garbage going in.
>

Jerry, Richard:

Actually, it's another old story: Hunting purported bugs in the wrong
place hinders fixing the real bugs... :-)

My assumption that PHP was seeing only two bytes instead of three was
wrong. I wrote the string to a log file and looked at it with 'tail'. It
was 'tail' that dropped the out-of-range character 0x95, not PHP. When I
wrote strlen($name) to the log file, the value turned out to be 3.
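
A minimal sketch of that kind of check, for anyone hitting the same thing: logging the byte count together with a hex dump means no terminal tool can silently swallow a byte. The helper name describe_bytes is hypothetical, and the literal stands in for the value read from $_GET.

```php
<?php
// Hypothetical helper: describe a raw byte string in plain ASCII so the
// result looks the same in any terminal, pager, or log viewer.
function describe_bytes(string $value): string
{
    // strlen() counts bytes, not characters; bin2hex() renders each byte
    // as two lowercase hex digits.
    return sprintf('len=%d hex=%s', strlen($value), bin2hex($value));
}

$name = "\xE3\x95\xA3";              // stand-in for $_GET['name']
echo describe_bytes($name), "\n";    // len=3 hex=e395a3
```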

The real bug was in a completely different place: The PHP script called
a Python script, and I forgot to urlencode the parameters. Now that this
is done, the script combo works fine.
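
Roughly what the fix looks like, as a sketch only: the URL and the function name build_script_url are placeholders invented here, and the sketch assumes the Python script is reached through a URL with a query string. The point is just that every key and value goes through urlencode() before it is put into the URL.

```php
<?php
// Hypothetical helper: append percent-encoded parameters to a base URL.
function build_script_url(string $base, array $params): string
{
    $pairs = [];
    foreach ($params as $key => $value) {
        // urlencode() turns every byte outside [A-Za-z0-9_.-] into %XX
        // (and a space into '+'), so raw UTF-8 bytes survive the trip.
        $pairs[] = urlencode((string) $key) . '=' . urlencode((string) $value);
    }
    return $base . '?' . implode('&', $pairs);
}

$name = "\xE3\x95\xA3";  // the UTF-8 bytes for U+3563
echo build_script_url('http://localhost/handler.py', ['name' => $name]), "\n";
// http://localhost/handler.py?name=%E3%95%A3
```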

Thank you to both of you!

Best,

-mk