Re: Processing accented characters submitted from forms [message #184498 is a reply to message #184490] Fri, 03 January 2014 14:30
Ben Bacarisse
JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes:

> On Fri, 03 Jan 2014 12:37:27 +0000, Ben Bacarisse wrote:
>
>> JohnT <john-sospam(at)jtresponse(dot)co(dot)uk> writes: <snip>
>>> We're already using iso-8859-1 for the whole website. It will be a lot
>>> of work to change all that, so I guess we'll have to put up with the
>>> odd Turkish I causing problems.
>>
>> It's not clear (to me at least) what's happening to the data, but as far
>> as any normal set of HTML pages is concerned (PHP generated or
>> otherwise) you don't have to put up with a dotted I causing problems on
>> an ISO-8859-1 encoded page. You can represent any Unicode character in
>> a page using character entities (browser and font support is always an
>> issue but not nowadays for anything as ordinary as İ).
>
> I think it must be the browser that is encoding the character because İ
> is not supported by iso-8859-1.

Note that the browser behaviour can be altered by form attributes
(specifically accept-charset). You can have a form that accepts UTF-8
on an ISO-8859-1 served page.
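
As a rough sketch of what that looks like (the helper name, field name and
action URL are made up for illustration), a page served as ISO-8859-1 can
still emit a form that asks the browser to submit UTF-8:

```php
<?php
// Hypothetical helper: builds a form that asks the browser to submit UTF-8,
// even though the surrounding page is served as ISO-8859-1.
function render_comment_form(string $action): string
{
    // Escape the action URL using the page's own encoding.
    $safeAction = htmlspecialchars($action, ENT_QUOTES, 'ISO-8859-1');

    return '<form method="post" action="' . $safeAction . '" accept-charset="UTF-8">'
         . '<input type="text" name="comment">'
         . '<input type="submit" value="Send">'
         . '</form>';
}

// In a real page you would first send something like:
//   header('Content-Type: text/html; charset=ISO-8859-1');
echo render_comment_form('/submit.php'), "\n";
```

With this in place, the receiving script should see the raw UTF-8 bytes for İ
in $_POST rather than a numeric entity.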

> It arrives in the request data as the html numeric entity code, as that
> is the only way it can be transmitted.
>
> This causes issues:
>
> As I always htmlencode user-entered data before display, it means that it
> gets encoded twice. I'll have to add the 'disable double encode' flag
> throughout my code :-)

Sure. One way or another you need to get the right encoding. Disabling
double encoding is not perfect, though, since a user typing the literal
text &#304; into a form may not expect a dotted I to come out.
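
A small sketch of the problem, assuming the 'disable double encode' flag means
the fourth ($double_encode) argument of htmlspecialchars(); the variable names
are only for illustration:

```php
<?php
// What the browser submits for İ from an ISO-8859-1 form, and what a user
// sends after literally typing the six characters "&#304;".
$fromBrowser = '&#304;';
$typedByUser = '&#304;';

// Default behaviour: the ampersand is escaped again, so the reader sees
// the entity as literal text.
$twice = htmlspecialchars($fromBrowser, ENT_QUOTES, 'ISO-8859-1');

// double_encode = false: existing entities are left alone.
$once = htmlspecialchars($fromBrowser, ENT_QUOTES, 'ISO-8859-1', false);

echo $twice, "\n"; // &amp;#304;
echo $once, "\n";  // &#304;

// The server cannot tell the two submissions apart.
var_dump($fromBrowser === $typedByUser); // bool(true)
```

The last line is the catch: once the browser has substituted the entity, the
two inputs are the same string.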

The best method is (probably) to:
(a) Give UTF-8 as the form's accept-charset.
(b) Encode with htmlentities(), giving UTF-8 as the encoding. This should
leave UTF-8 characters that have no named entity (such as İ) as UTF-8.
(c) Use mb_convert_encoding($etext, "HTML-ENTITIES", "UTF-8") to make
the string displayable in a page regardless of the page's character
encoding.
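
Steps (b) and (c) could be sketched roughly as below. The function name is
hypothetical and the mbstring extension is assumed. Note one substitution:
the "HTML-ENTITIES" target of mb_convert_encoding() is deprecated as of
PHP 8.2, so this sketch uses mb_encode_numericentity() for step (c) instead,
which always emits numeric entities.

```php
<?php
// Hypothetical helper; assumes the mbstring extension is loaded.
function utf8_to_page_safe_html(string $utf8Text): string
{
    // Step (b): escape markup characters, telling PHP the input is UTF-8.
    $etext = htmlentities($utf8Text, ENT_QUOTES, 'UTF-8');

    // Step (c): turn every remaining non-ASCII character into a numeric
    // entity. This stands in for
    //   mb_convert_encoding($etext, "HTML-ENTITIES", "UTF-8")
    // which is deprecated since PHP 8.2.
    $map = [0x80, 0x10FFFF, 0, 0x1FFFFF];
    return mb_encode_numericentity($etext, $map, 'UTF-8');
}

// "\u{130}" is the dotted capital I.
echo utf8_to_page_safe_html("<b>\u{130}stanbul</b>"), "\n";
// &lt;b&gt;&#304;stanbul&lt;/b&gt;
```

The result is pure ASCII, so it displays the same whether the page is served
as ISO-8859-1 or UTF-8.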

> Secondly, it will be added to the database as the entity code, so this
> will break searching the database etc...

If you take the approach of accepting UTF-8 from the form, you can put
that directly into the database.
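
For illustration only, here is a sketch using an in-memory SQLite database
through PDO (this assumes the pdo_sqlite extension; the table and column names
are invented, and your real database will need its own connection charset set
to UTF-8):

```php
<?php
// Assumes pdo_sqlite; table and column names are hypothetical.
$db = new PDO('sqlite::memory:');
$db->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
$db->exec('CREATE TABLE comments (body TEXT)');

// Store the raw UTF-8 exactly as received from the form: no entities.
$insert = $db->prepare('INSERT INTO comments (body) VALUES (?)');
$insert->execute(["\u{130}stanbul"]);

// Searching now works on the real character rather than on "&#304;".
$find = $db->prepare('SELECT COUNT(*) FROM comments WHERE body LIKE ?');
$find->execute(["%\u{130}%"]);
$matches = (int) $find->fetchColumn();

echo $matches, "\n";
```

Entity conversion then happens only on output, so the stored data stays
searchable.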

> I think the proper fix would be to convert to UTF-8.
> But that's a lot of work. For now, I think I'll just manually transliterate
> the codes that cause issues.

You really only need UTF-8 in the database. The page encoding is not
that important.

--
Ben.