Re: strip_tags function [message #178752 is a reply to message #178750] |
Sat, 28 July 2012 10:42 |
Thomas 'PointedEars'
Messages: 701 Registered: October 2010
Tim Fardell wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Tim Fardell wrote:
>>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> However, am I right in thinking that the strip_tags() function simply
>>>> assumes that any less-than character (<) occurring within a string is
>>>> the beginning of a tag?
>>>>
>>>> I hope I'm wrong, because that would be completely crap and useless :-)
>>>
>>> [...]
>>> I think I am correct that strip_tags assumes any '<' character to be the
>>> beginning of a tag -
>>
>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>> sensitive:
>>
>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>> foo
>
> Possibly true,
What do you mean – "possibly"? I tested it there.
> but that example does not demonstrate it.
But it does. The `<' is not considered the beginning of a tag when in an
attribute value (whose delimiters needed to be escaped here because the
apostrophe was already used as PHP single-quoted string literal delimiter).
This complies with the current HTML5 Working Draft parsing rules, which
attempt to codify existing parser behavior in Web browsers:
<http://www.w3.org/TR/2012/WD-html5-20120329/tokenization.html#attribute-value-single-quoted-state>
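For anyone who wants to reproduce this without the shell-level escaping, here is a minimal sketch of the same test as a script. The variable names are only illustrative; the expected result is the one shown above for PHP 5.3.10.

```php
<?php
// A '<' inside a quoted attribute value is not taken as the start of a tag.
// Using a double-quoted PHP string avoids escaping the apostrophes.
$markup = "<a title='<'>foo</a>";

$text = strip_tags($markup);
echo $text, "\n"; // foo
```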
> Try:
>
> $ php -r "echo strip_tags('<a title=\'<\'>f<o<o</a>');"
^
> Output should be
>
> f<o<o
No, it should not. The markup is syntactically invalid in the first place
(an STAGO [1] may not be used within a tag outside of an attribute value;
XML well-formedness even forbids it unescaped within an attribute value), so
the outcome of tag-stripping is undefined. As a result, the output in PHP
5.3.10 is
f
because error-correction inserts the missing `>', and removes all tags from
<a title='<'>f<o><o></a>
(I am making an assumption here based on the behavior of existing markup
parsers.)
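A rough way to compare the two forms side by side, as a sketch only: `$corrected` is my assumed error-corrected markup from above, not something PHP exposes, and the variable names are illustrative.

```php
<?php
// Markup as posted: the "<o" tags are never closed (syntactically invalid).
$invalid = "<a title='<'>f<o<o</a>";

// Assumed error-corrected form, with the missing '>' characters inserted.
$corrected = "<a title='<'>f<o><o></a>";

// Reported above as "f" on PHP 5.3.10; other versions may differ,
// since the outcome for invalid markup is undefined.
echo strip_tags($invalid), "\n";

// Four well-delimited tags are removed, leaving only the text content.
$fromCorrected = strip_tags($corrected);
echo $fromCorrected, "\n"; // f
```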
> Definitely doesn't work in PHP 5.3.3, and according to php.net the
> function hasn't changed since 5.0.0.
It would be prudent if you learned about SGML-based markup languages before
you attempted to pass judgement on the correctness of their parsers.
>>> But this doesn't actually matter, since the input string should be HTML
>>> encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>> all actual '<' characters will indeed be tags :-)
>>
>> You are not making sense. The *input* data should *never* be "HTML
>> encoded".
>
> Then I must have completely misunderstood something here. I thought the
> whole point of strip_tags() was to remove all HTML tags from the input
> string.
Yes, HTML *tags*.
> Therefore the input has to be HTML or the function is pointless.
You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
(Which is a common, and semantically correct definition of the term.)
So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there are
no tags in them.
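To illustrate the difference, a small sketch: htmlspecialchars() stands in here for whatever "HTML encoding" step is applied, which is an assumption on my part, and the variable names are illustrative.

```php
<?php
$markup  = '<b>bold</b>';              // HTML markup: contains tags
$encoded = htmlspecialchars($markup);  // '&lt;b&gt;bold&lt;/b&gt;': no tags left

// Tags are removed from markup ...
$stripped = strip_tags($markup);
echo $stripped, "\n";                  // bold

// ... but an encoded string has no '<' in it, so nothing is removed.
$unchanged = strip_tags($encoded);
echo $unchanged, "\n";                 // &lt;b&gt;bold&lt;/b&gt;
```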
> Unless the idea is to remove rogue HTML tags from a plain text string,
ISTM that is the general idea. Some HTML elements (which consist of start
tag, optional content and optional end tag, depending on the element type)
are potentially detrimental to the expected display and functionality of a
Web document. For example, consider that people were allowed to use
text-formatting elements in a blog comment, but were not allowed to insert
`script' elements (to avoid XSS) or `img' and `object' elements (to avoid
reduction of loading speed and interference with other multimedia on the
site); you would only list the text-formatting elements in the function's
second parameter then.
Other people might want to remove HTML tags altogether, leaving only the
text content of the document (fragment).
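Both uses can be sketched as follows. The comment text and the particular list of allowed elements are made up for illustration. Note that only the *tags* are removed: the content of the `script' element stays in the output as plain text.

```php
<?php
// Hypothetical blog comment containing allowed and disallowed elements.
$comment = '<p>Hello <em>world</em><script>alert(1)</script></p>';

// Keep only some text-formatting elements (second parameter).
$formatted = strip_tags($comment, '<em><strong><b><i>');
echo $formatted, "\n"; // Hello <em>world</em>alert(1)

// Remove all tags, leaving only the text content.
$plain = strip_tags($comment);
echo $plain, "\n";     // Hello worldalert(1)
```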
> in which case my original point about it assuming all < symbols are tags
The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
markup languages. It delimits a start tag on the left-hand side (a `>'
character delimits it on the right-hand side). Obviously, therefore it
cannot be a tag itself.
> remains valid - it's crap and useless for this.
Evidently, you do not know what you are talking about.
PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300dec7(at)news(dot)demon(dot)co(dot)uk>