FUDforum
Fast Uncompromising Discussions. FUDforum will get your users talking.

Home » Imported messages » comp.lang.php » reading files with accents in the filename from PHP
Re: reading files with accents in the filename from PHP [message #183121 is a reply to message #183119] Wed, 09 October 2013 22:41
Thomas 'PointedEars'
Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Thomas Mlynarczyk wrote:
>>> Erwin Moller schrieb:
>>>> How can PHP open files on the local filesystem that contain certain
>>>> characters, like umlauts, accents, etc?
>>>
>>> $path = __DIR__ . '\Eugène.txt';
>>> var_dump( PHP_VERSION, file_exists( $path ) );
>>>
>>> Works on my Windows XP, PHP 5.4.8, *if* the PHP file is stored in ANSI
>>> (="Windows") encoding.
>>
>> There is no “ANSI encoding”. Usually “ANSI encoding” means Windows-1252.
>> [0] It would be either coincidence or strange if this worked, because
>> FAT32 uses the “OEM character set”, i. e. one of the various IBM code
>> pages, 437 for English, and NTFS uses UTF-16BE [1]. The letter “è” has
>> Windows-1252 code 0xE6, IBM437/IBM850 code 0x8A, and Unicode code point
>> U+00E8 [2] (encoded in UTF-16 as 0xE8 [3]). It follows that you cannot
>> mean Windows-1252 by “ANSI”.
>
> The letter "è" is encoded in CP-1252 as /0xE8/[1].

You are correct (to some extent); I must have slipped into the wrong row.

My point is, however, that _Windows_-1252 is very likely _not_ what is
expected by the filesystem. By “coincidence”, the code *points* for
Windows-1252 and Unicode are the same from U+00A0 to U+00FF, and the
character used is within that range. This code will break for characters
whose Unicode code point is above U+007F but outside this range. In
general, it will be unreliable because Windows-1252 does not have the
interleaved zero-octet that UTF-16 has (NTFS), and Windows-1252 and IBM437
& friends (FAT32) are incompatible above 0x7F.
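
A minimal sketch of that range argument, assuming the iconv extension is
available and accepts the charset name "CP1252"; the helper name
cp1252CodePoint is made up for illustration. It maps single Windows-1252
bytes to Unicode code points and checks where byte value and code point
coincide.

```php
<?php
// Assumption: ext/iconv is loaded and knows the charset name "CP1252".

/** Unicode code point of a single Windows-1252 byte (hypothetical helper). */
function cp1252CodePoint(int $byte): int
{
    $utf16le = iconv('CP1252', 'UTF-16LE', chr($byte));
    // 'v' reads one unsigned 16-bit little-endian code unit.
    return unpack('v', $utf16le)[1];
}

// Inside 0xA0..0xFF the byte value and the code point should coincide.
$coincide = true;
for ($byte = 0xA0; $byte <= 0xFF; $byte++) {
    $coincide = $coincide && (cp1252CodePoint($byte) === $byte);
}
var_dump($coincide);

// "è" lies inside the range, the euro sign (byte 0x80) does not.
printf("0xE8 -> U+%04X\n", cp1252CodePoint(0xE8));
printf("0x80 -> U+%04X\n", cp1252CodePoint(0x80));
```

Byte 0xE8 maps to U+00E8, while byte 0x80 maps to U+20AC, so a file name
containing the latter can no longer be matched by treating code points and
byte values as interchangeable.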

> In UTF-16 it is encoded by *two* bytes:

Two octets, to be precise. I was aware of that (as you could see further
below) but I oversimplified here.

> 0x00 0xE8 (or vice versa, depending on the endianness).

Because NTFS uses UTF-16_LE_ (as Windows uses _little-endian_ throughout),
it is E8 00 there.

> I have created a file "tèst" on a German Windows XP on NTFS, and started
> a PHP shell:
>
>>>> $fs = glob('t?st')
>>>> $fs[0]
> 't\350st'
>
> Apparently, the file name is *read* by PHP as if it was encoded in
> CP-1252.

Interesting. 0350 would correspond to 232 and 0xE8, indeed.

> Either the description on MSDN[2] is wrong,

Unlikely.

> or PHP uses a Windows API that converts the filename's encoding.

It would suffice if it discarded all zero-octets in *this* case as the code
would be {74 00} {E8 00} {73 00} {74 00}.
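
The octet sequence above can be reproduced with a short sketch that needs
no extensions. It only illustrates the arithmetic; it makes no claim about
what PHP or the Windows API actually do internally.

```php
<?php
// Code points of "tèst": t, è (U+00E8), s, t.
$codePoints = [0x74, 0xE8, 0x73, 0x74];

// 'v*' packs each value as an unsigned 16-bit little-endian code unit.
$utf16le = pack('v*', ...$codePoints);
echo bin2hex($utf16le), "\n";   // 7400e80073007400

// Dropping every zero octet leaves one octet per character.
$stripped = str_replace("\x00", '', $utf16le);
echo bin2hex($stripped), "\n";  // 74e87374

// "\350" is the octal escape for octet 0xE8 (decimal 232).
var_dump($stripped === "t\350st");
```

The final comparison yields true, which matches the 't\350st' seen in the
PHP shell, but only because every character of this name is below U+0100.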

> I presume the latter, being aware (but not (yet) convinced) that there
> might be another reason for this behavior.

It would be interesting to see how this works with NTFS with characters
outside the specified range whose Unicode code point is above U+007F. For
example, U+0100 (“Ā”; LATIN CAPITAL LETTER A WITH MACRON) would be encoded
in one UTF-16 code unit, 0100, which would be encoded in UTF-16LE as 00 01.
Just stripping the zero-octets would result in <SOH> (whose code point is
0x01, which is 001 in octal). Just reading the octet with the lower address
would result in 0x00, which terminates a C string. If the result is _not_
something equivalent to 't\001st' or 't', something else is happening.
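
A sketch of the two hypotheses for a hypothetical file name "tĀst",
assuming UTF-16LE, where the low octet of each code unit comes first so
that U+0100 becomes the octets 00 01. The variable names are illustrative
only, and the code simulates the two strategies rather than showing what
PHP really does.

```php
<?php
// Hypothetical file name "tĀst": t, U+0100, s, t, as UTF-16LE.
$utf16le = pack('v*', 0x74, 0x0100, 0x73, 0x74);
echo bin2hex($utf16le), "\n";    // 7400000173007400

// Hypothesis 1: every zero octet is simply dropped.
$stripped = str_replace("\x00", '', $utf16le);

// Hypothesis 2: only the octet at the lower address of each code unit is
// read, and the result is cut at the first NUL like a C string.
$lowOctets = '';
foreach (str_split($utf16le, 2) as $unit) {
    $lowOctets .= $unit[0];
}
$asCString = explode("\x00", $lowOctets)[0];

echo bin2hex($stripped), "\n";   // 74017374
echo bin2hex($asCString), "\n";  // 74
```

If glob() returned neither of these two byte strings for such a file, that
would suggest a real charset conversion rather than octet stripping.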


PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300dec7(at)news(dot)demon(dot)co(dot)uk>