Re: strip_tags function [message #178775 is a reply to message #178771] |
Wed, 01 August 2012 04:49 |
Thomas 'PointedEars'
Tim Fardell wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Tim Fardell wrote:
>>> Thomas 'PointedEars' Lahn wrote:
>>>> Tim Fardell wrote:
>>>> > On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>>> > <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> >> However, am I right in thinking that the strip_tags() function simply
>>>> >> assumes that any less-than character (<) occurring within a string is
>>>> >> the beginning of a tag?
>>>> >>
>>>> >> I hope I'm wrong, because that would be completely crap and useless
>>>> >> :-)
>>>> >
>>>> > […]
>>>> > I think I am correct that strip_tags assumes any '<' character to be
>>>> > the beginning of a tag -
>>>>
>>>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>>>> sensitive:
>>>>
>>>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>>>> foo
>>>
>>> Possibly true,
>>
>> What do you mean "possibly"? I tested it there.
>
> By "possibly" I mean that what you say may well be true, and I do not
> dispute that it is. All I am saying is that your example does not
> demonstrate it, and thus does not prove it.
A matter of interpretation.
>>> but that example does not demonstrate it.
>> But it does.
>
> No, it does not. A function that assumes every occurrence of the '<'
> character is the beginning of a tag, and removes it and all text from it
> up to and including the next '>' character in the string would yield
> exactly the same output. You have not demonstrated context sensitivity.
An example that proves my point and applies to your now-clarified assumption
can be easily created:
$ php -r "echo strip_tags('<a title=\'>\'>foo</a>');"
foo
If it were as you assumed, then the output would have to be something
similar to
'>foo
It is not.
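The difference can be made explicit with a minimal sketch. The function naive_strip() below is hypothetical: it is not part of PHP and only models the assumed behaviour (every `<' starts a tag that ends at the next `>'). The expected strip_tags() results are the ones reported above for PHP 5.3.10.

```php
<?php
// Hypothetical model of the assumed behaviour: every '<' starts a tag,
// and everything up to and including the next '>' is removed.
function naive_strip($s)
{
    return preg_replace('/<[^>]*>/', '', $s);
}

$first  = "<a title='<'>foo</a>";  // '<' inside the attribute value
$second = "<a title='>'>foo</a>";  // '>' inside the attribute value

// Both approaches agree here, so this input cannot tell them apart.
echo naive_strip($first), "\n";   // foo
echo strip_tags($first), "\n";    // foo

// Here they differ: the naive version ends the tag at the quoted '>'.
echo naive_strip($second), "\n";  // '>foo
echo strip_tags($second), "\n";   // foo
```

Only the second input distinguishes the two, which is why it is the relevant test case.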
>> The markup is syntactically invalid in the first place
>
> What markup? If, as you suggest later on, and also implied in your
> previous post, the input text is supposed to be plain, unformatted text,
> which may unintentionally contain rogue HTML tags that should not be
> there, then the input text is unlikely to be syntactically correct markup
> anyway.
You still misunderstand. The input is obviously supposed to be markup,
containing tags; it needs to be parsed.
However, you have (well-)defined "HTML-encoded" as content where the "<" of
tags would be replaced with "&lt;". That would not be markup anymore as
markup requires at least one tag. (`&lt;' is _not_ a tag.)
>>> Definitely doesn't work in PHP 5.3.3, and according to php.net the
>>> function hasn't changed since 5.0.0.
>>
>> It would be prudent if you learned about SGML-based markup languages
>> before you attempted to pass on judgement on the correctness of their
>> parsers.
>
> Actually I think what I have been saying all along is that the function
> behaves correctly if the input text is HTML.
No, you assumed the function would consider *all* `<'s to be a start of a
tag, which it obviously does not.
> I believe the function is supposed to take correct HTML as input, in which
> case it works exactly as I would expect.
If "correct" means *syntactically* correct, then you are right. However,
that does not mean that the function could not process syntactically invalid
markup correctly; the result just might not be what you expect, as there is
no specified definition of correctness with regard to parsing syntactically
invalid markup.
>>>> > But this doesn't actually matter, since the input string should be HTML
>>>> > encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>>> > all actual '<' characters will indeed be tags :-)
>>>>
>>>> You are not making sense. The *input* data should *never* be "HTML
>>>> encoded".
>
> Wait, didn't you just say the opposite?
No, I did not.
>>> Then I must have completely misunderstood something here. I thought the
>>> whole point of strip_tags() was to remove all HTML tags from the input
>>> string.
>>
>> Yes, HTML *tags*.
>
> Good, we agree on something.
But do you *really* know what an HTML tag is? It sure does not look like
that.
>>> Therefore the input has to be HTML or the function is pointless.
>>
>> You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
>> (Which is a common, and semantically correct definition of the term.)
>>
>> So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there
>> are no tags in them.
>
> Agreed again.
So, AISB, *HTML-encoded strings* (which are a different animal than HTML
markup) are not supposed to be the input to this function or any form of
server-side PHP processing. You should not get HTML-encoded strings from
the client and you should not store HTML-encoded strings in your database.
Instead, you should store the plain markup, and HTML-encode it for output if
necessary. The best approach is, of course, not to store HTML markup at all
in a database (data storage should be independent of output) but that is not
always possible.
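A minimal sketch of that order of operations, assuming a hypothetical comment string in $stored and UTF-8 output (neither appears earlier in this thread):

```php
<?php
// Hypothetical text as received from the client and stored unmodified.
$stored = 'Use <b> for bold & <i> for italics';

// HTML-encode only at output time, when the text goes into an HTML document.
$forHtml = htmlspecialchars($stored, ENT_QUOTES, 'UTF-8');
echo $forHtml, "\n";
// Use &lt;b&gt; for bold &amp; &lt;i&gt; for italics

// The encoded string contains no tags, so strip_tags() has nothing to remove.
var_dump(strip_tags($forHtml) === $forHtml);  // bool(true)
```

This also shows why an HTML-encoded string is pointless as input to strip_tags(): the function returns it unchanged.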
>>> Unless the idea is to remove rogue HTML tags from a plain text string,
>> ISTM that is the general idea.
>
> Oh dear, we're going to start disagreeing again here.
Define: rogue HTML tags.
>> Some HTML elements (which consist of start tag, optional content and
>> optional end tag, depending on the element type), are potentially
>> detrimental to the expected display and functionality of a
>> Web document. For example, consider that people were allowed to use
>> text-formatting elements in a blog comment, but were not allowed to
>> insert `script' elements (to avoid XSS) or `img' and `object' elements
>> (to avoid reduction of loading speed and interference with other
>> multimedia on the site); you would only list the text-formatting elements
>> in the function's second parameter then.
>>
>> Other people might want to remove HTML tags altogether, leaving only the
>> text content of the document (fragment).
>
> OK, so given no parameters other than the input string, the behaviour of
> strip_tags() is to remove all HTML tags from the input string. Correct?
Correct.
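Both uses can be sketched as follows. The comment text in $comment is a hypothetical example, and the list of kept tags is only one possible choice:

```php
<?php
// Hypothetical blog comment with formatting and a script element.
$comment = '<p>Hello <b>world</b><script>alert(1)</script></p>';

// No second argument: all tags are removed, the text content remains.
echo strip_tags($comment), "\n";
// Hello worldalert(1)

// Second argument: the listed tags are kept, all others are removed.
echo strip_tags($comment, '<p><b>'), "\n";
// <p>Hello <b>world</b>alert(1)</p>
```

Note that only the tags are removed: the content of the `script' element stays in the output as plain text, and attributes of the kept tags are not touched.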
> You have just said that you believe the input string should be plain text,
No, I did not. You misunderstood, probably because you cannot tell what is
an HTML tag or what you mean by "HTML-encoded string".
>>> in which case my original point about it assuming all < symbols are tags
>>
>> The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
>> markup languages. It delimits a start tag on the left-hand side (a `>'
>> character delimits it on the right-hand side). Obviously, therefore it
>> cannot be a tag itself.
>
> Sorry but that is extremely pedantic.
That is a luser's attitude. Correct and unambiguous terminology is
paramount to understanding and learning. `<' is _not_ a tag; it is a tag
*delimiter*.
> I think it's obvious what I meant -
I do not think you know what you are talking about, so you are modifying
your terminology as you go (like, "a tag is `<…>'", "a tag is <", "a tag
is `<'"). Of course, this is where misconceptions are created, very common
in beginners.
> Just to clarify, I meant to say "...in which case my original point about
> it assuming all < symbols *are the beginning of* tags..."
ACK.
>>> remains valid - it's crap and useless for this.
>> Evidently, you do not know what you are talking about.
>
> No need for rudeness is there?
Given your statements, it is a matter of fact. (A fact that can be changed
by you, and eventually *only* you.) If you cannot deal with that, please
stop wasting my time.
On a side note, you can tell a troll from a regular when you see the former
contributing nothing but ad-hominem attack and misinformation. Forged
address headers are also a strong indication of a troll. Any form of
continued anti-social behavior, really.
--
PointedEars