strip_tags function [message #178737] Thu, 26 July 2012 17:18
Tim Fardell
Hello all!

I'm very much a PHP beginner, so please bear with me :-)

I have a need to remove all HTML tags from a string, and rather than re-invent
the wheel, I thought I'd see if there was an existing function that could do
this. Superficially, strip_tags() appears to do exactly what I need.

However, am I right in thinking that the strip_tags() function simply assumes
that any less-than character (<) occurring within a string is the beginning of a
tag?
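
A quick way to probe this is to feed the function a few strings that contain '<' without being markup. This is only a minimal sketch; the sample strings are made up for illustration, and the results in the comments are what current PHP versions are expected to print.

```php
<?php
// '<' followed by whitespace is kept as a literal character.
echo strip_tags('1 < 2 and 3 > 2'), "\n"; // 1 < 2 and 3 > 2

// '<' followed by a letter opens a "tag" that runs to the next '>'.
echo strip_tags('a<b and c>d'), "\n";     // ad

// With no closing '>', everything after the '<' is dropped.
echo strip_tags('x<y'), "\n";             // x
```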

I hope I'm wrong, because that would be completely crap and useless :-)

--
Please remove all-your-clothes before replying.
Re: strip_tags function [message #178738 is a reply to message #178737] Thu, 26 July 2012 17:27
Tim Fardell
On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
<tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:

> However, am I right in thinking that the strip_tags() function simply assumes
> that any less-than character (<) occurring within a string is the beginning of a
> tag?
>
> I hope I'm wrong, because that would be completely crap and useless :-)

Apologies for following-up my own post, but I just realised the answer to my own
question.

I think I am correct that strip_tags assumes any '<' character to be the
beginning of a tag - but this doesn't actually matter, since the input string
should be HTML encoded anyway, so all '<' characters should be escaped as '&lt;'
- so all actual '<' characters will indeed be tags :-)
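
As a small sketch of that situation (the fragment below is a made-up example, not taken from any real page): when the literal less-than sign is already written as an entity, the only raw '<' characters left are the ones that open tags, and the entity passes through untouched.

```php
<?php
// Hypothetical HTML fragment: the text part was escaped with
// htmlspecialchars() before being wrapped in markup.
$html = '<p>' . htmlspecialchars('2 < 3') . ' is <b>true</b></p>';

echo $html, "\n";             // <p>2 &lt; 3 is <b>true</b></p>
echo strip_tags($html), "\n"; // 2 &lt; 3 is true
```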

So I wasn't wrong, but it's not as crap and useless as I thought!
--
Please remove all-your-clothes before replying.
Re: strip_tags function [message #178739 is a reply to message #178738] Thu, 26 July 2012 18:24
The Natural Philosopher
Tim Fardell wrote:
> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>
>> However, am I right in thinking that the strip_tags() function simply assumes
>> that any less-than character (<) occurring within a string is the beginning of a
>> tag?
>>
>> I hope I'm wrong, because that would be completely crap and useless :-)
>
> Apologies for following-up my own post, but I just realised the answer to my own
> question.
>
> I think I am correct that strip_tags assumes any '<' character to be the
> beginning of a tag - but this doesn't actually matter, since the input string
> should be HTML encoded anyway, so all '<' characters should be escaped as '&lt;'
> - so all actual '<' characters will indeed be tags :-)
>
> So I wasn't wrong, but it's not as crap and useless as I thought!
I was about to say 'so does your browser'

There is a function that only strips certain listed tags, but I can't
remember what it is.


--
To people who know nothing, anything is possible.
To people who know too much, it is a sad fact
that they know how little is really possible -
and how hard it is to achieve it.
Re: strip_tags function [message #178740 is a reply to message #178739] Thu, 26 July 2012 21:54
Gregor Kofler
On 2012-07-26 20:24, The Natural Philosopher wrote:
> Tim Fardell wrote:
>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>
>>> However, am I right in thinking that the strip_tags() function simply
>>> assumes
>>> that any less-than character (<) occurring within a string is the
>>> beginning of a
>>> tag?
>>> I hope I'm wrong, because that would be completely crap and useless :-)
>>
>> Apologies for following-up my own post, but I just realised the answer
>> to my own
>> question.
>> I think I am correct that strip_tags assumes any '<' character to be the
>> beginning of a tag - but this doesn't actually matter, since the input
>> string
>> should be HTML encoded anyway, so all '<' characters should be escaped
>> as '&lt;'
>> - so all actual '<' characters will indeed be tags :-)
>>
>> So I wasn't wrong, but it's not as crap and useless as I thought!
> I was about to say 'so does your browser'
>
> There is a function that only strips certain listed tags, but I can't
> remember what it is

strip_tags? It's just the other way round: You have to supply the
allowed tags.
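
A minimal sketch of that second parameter, using a made-up fragment: the tags listed in the string are kept, everything else is stripped.

```php
<?php
// Hypothetical input fragment.
$html = '<p>Hello <b>World</b>, <i>again</i></p>';

echo strip_tags($html), "\n";           // Hello World, again
echo strip_tags($html, '<b>'), "\n";    // Hello <b>World</b>, again
echo strip_tags($html, '<b><i>'), "\n"; // Hello <b>World</b>, <i>again</i>
```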

Gregor
Re: strip_tags function [message #178741 is a reply to message #178738] Fri, 27 July 2012 06:25
alvaro.NOSPAMTHANX
On 26/07/2012 19:27, Tim Fardell wrote:
> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>
>> However, am I right in thinking that the strip_tags() function simply assumes
>> that any less-than character (<) occurring within a string is the beginning of a
>> tag?
>>
>> I hope I'm wrong, because that would be completely crap and useless :-)
>
> Apologies for following-up my own post, but I just realised the answer to my own
> question.
>
> I think I am correct that strip_tags assumes any '<' character to be the
> beginning of a tag - but this doesn't actually matter, since the input string
> should be HTML encoded anyway, so all '<' characters should be escaped as '&lt;'
> - so all actual '<' characters will indeed be tags :-)
>
> So I wasn't wrong, but it's not as crap and useless as I thought!

As you've already noticed yourself, the strip_tags() function makes
sense when applied to HTML. As such, I think it's good enough, though
you won't get predictable results if the HTML is not valid (which is
normally what happens with web browsers too).

Sadly, it's pretty often used on plain text, leading to annoyances and
data loss. Have you ever posted a comment on a programming web site,
just to find out that your code snippets were ruined by the forum? There
you are.
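
A sketch of that kind of data loss, with a made-up comment as input. The comment is plain text, so tag stripping cuts it off at the first '<' that is followed by a letter, while escaping for output keeps every character.

```php
<?php
// Hypothetical forum comment containing a code snippet as plain text.
$comment = 'Use if (a<b) { swap(a, b); } to order them.';

// Tag stripping treats "<b) { swap..." as an unterminated tag.
echo strip_tags($comment), "\n";
// Use if (a

// Escaping instead of stripping preserves the snippet for display.
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8'), "\n";
// Use if (a&lt;b) { swap(a, b); } to order them.
```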


--
-- http://alvaro.es - Álvaro G. Vicario - Burgos, Spain
-- My site about web programming: http://borrame.com
-- My site of satin humor: http://www.demogracia.com
--
Re: strip_tags function [message #178742 is a reply to message #178738] Fri, 27 July 2012 09:52
Thomas 'PointedEars' Lahn
Tim Fardell wrote:

> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>> However, am I right in thinking that the strip_tags() function simply
>> assumes that any less-than character (<) occurring within a string is the
>> beginning of a tag?
>>
>> I hope I'm wrong, because that would be completely crap and useless :-)
>
> […]
> I think I am correct that strip_tags assumes any '<' character to be the
> beginning of a tag -

No, you are not. The function, at least as of PHP 5.3.10, is context-
sensitive:

$ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
foo

> but this doesn't actually matter, since the input string should be HTML
> encoded anyway, so all '<' characters should be escaped as '&lt;' - so all
> actual '<' characters will indeed be tags :-)

You are not making sense. The *input* data should *never* be "HTML
encoded".


PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
Re: strip_tags function [message #178750 is a reply to message #178742] Fri, 27 July 2012 17:25
Tim Fardell
On Fri, 27 Jul 2012 11:52:29 +0200, Thomas 'PointedEars' Lahn
<PointedEars(at)web(dot)de> wrote:

> Tim Fardell wrote:
>
>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>> However, am I right in thinking that the strip_tags() function simply
>>> assumes that any less-than character (<) occurring within a string is the
>>> beginning of a tag?
>>>
>>> I hope I'm wrong, because that would be completely crap and useless :-)
>>
>> […]
>> I think I am correct that strip_tags assumes any '<' character to be the
>> beginning of a tag -
>
> No, you are not. The function, at least as of PHP 5.3.10, is context-
> sensitive:
>
> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
> foo

Possibly true, but that example does not demonstrate it. Try:

$ php -r "echo strip_tags('<a title=\'<\'>f<o<o</a>');"

Output should be

f<o<o

Definitely doesn't work in PHP 5.3.3, and according to php.net the function
hasn't changed since 5.0.0.

>> but this doesn't actually matter, since the input string should be HTML
>> encoded anyway, so all '<' characters should be escaped as '&lt;' - so all
>> actual '<' characters will indeed be tags :-)
>
> You are not making sense. The *input* data should *never* be "HTML
> encoded".

Then I must have completely misunderstood something here. I thought the whole
point of strip_tags() was to remove all HTML tags from the input string.
Therefore the input has to be HTML or the function is pointless.

Unless the idea is to remove rogue HTML tags from a plain text string, in which
case my original point about it assuming all < symbols are tags remains valid -
it's crap and useless for this.
--
Please remove all-your-clothes before replying.
Re: strip_tags function [message #178752 is a reply to message #178750] Sat, 28 July 2012 10:42
Thomas 'PointedEars' Lahn
Tim Fardell wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Tim Fardell wrote:
>>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> However, am I right in thinking that the strip_tags() function simply
>>>> assumes that any less-than character (<) occurring within a string is
>>>> the beginning of a tag?
>>>>
>>>> I hope I'm wrong, because that would be completely crap and useless :-)
>>>
>>> […]
>>> I think I am correct that strip_tags assumes any '<' character to be the
>>> beginning of a tag -
>>
>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>> sensitive:
>>
>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>> foo
>
> Possibly true,

What do you mean – "possibly"? I tested it there.

> but that example does not demonstrate it.

But it does. The `<' is not considered the beginning of a tag when in an
attribute value (whose delimiters needed to be escaped here because the
apostrophe was already used as PHP single-quoted string literal delimiter).

This complies with the current HTML5 Working Draft parsing rules, which
attempt to codify existing parser behavior in Web browsers:

<http://www.w3.org/TR/2012/WD-html5-20120329/tokenization.html#attribute-value-single-quoted-state>

> Try:
>
> $ php -r "echo strip_tags('<a title=\'<\'>f<o<o</a>');"
^
> Output should be
>
> f<o<o

No, it should not. The markup is syntactically invalid in the first place
(an STAGO [1] may not be used within a tag outside of an attribute value;
XML well-formedness even forbids it unescaped within an attribute value), so
the outcome of tag-stripping is undefined. As a result, the output in PHP
5.3.10 is

f

because error-correction inserts the missing `>', and removes all tags from

<a title='<'>f<o><o></a>

(I am making an assumption here based on the behavior of existing markup
parsers.)
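
The two inputs can be compared directly. This is only a sketch: the first result is the one reported above for PHP 5.3.10, and the second string is a hypothetical variant in which the literal less-than signs in the text have been escaped first.

```php
<?php
// The disputed input: raw '<' characters inside the text content.
$raw = "<a title='<'>f<o<o</a>";
echo strip_tags($raw), "\n";     // f

// Hypothetical variant with the literal '<' written as &lt;.
$escaped = "<a title='&lt;'>f&lt;o&lt;o</a>";
echo strip_tags($escaped), "\n"; // f&lt;o&lt;o
```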

> Definitely doesn't work in PHP 5.3.3, and according to php.net the
> function hasn't changed since 5.0.0.

It would be prudent if you learned about SGML-based markup languages before
you attempted to pass on judgement on the correctness of their parsers.

>>> but this doesn't actually matter, since the input string should be HTML
>>> encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>> all actual '<' characters will indeed be tags :-)
>>
>> You are not making sense. The *input* data should *never* be "HTML
>> encoded".
>
> Then I must have completely misunderstood something here. I thought the
> whole point of strip_tags() was to remove all HTML tags from the input
> string.

Yes, HTML *tags*.

> Therefore the input has to be HTML or the function is pointless.

You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
(Which is a common, and semantically correct definition of the term.)

So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there are
no tags in them.

> Unless the idea is to remove rogue HTML tags from a plain text string,

ISTM that is the general idea. Some HTML elements (which consist of start
tag, optional content and optional end tag, depending on the element type),
are potentially detrimental to the expected display and functionality of a
Web document. For example, consider that people were allowed to use
text-formatting elements in a blog comment, but were not allowed to insert
`script' elements (to avoid XSS) or `img' and `object' elements (to avoid
reduction of loading speed and interference with other multimedia on the
site); you would only list the text-formatting elements in the function's
second parameter then.

Other people might want to remove HTML tags altogether, leaving only the
text content of the document (fragment).
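
A sketch of the blog-comment case, with a made-up comment and an allow-list chosen only for illustration. Two things are worth noticing in the result: the content of a stripped element stays behind as text, and attributes on allowed tags are not touched, so an allow-list by itself may not be enough against XSS.

```php
<?php
// Hypothetical blog comment and allow-list.
$comment = '<b onclick="evil()">Hi</b> <script>alert(1)</script><img src="big.png">';
$allowed = '<b><i><em><strong>';

echo strip_tags($comment, $allowed), "\n";
// <b onclick="evil()">Hi</b> alert(1)
```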

> in which case my original point about it assuming all < symbols are tags

The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
markup languages. It delimits a start tag on the left-hand side (a `>'
character delimits it on the right-hand side). Obviously, therefore it
cannot be a tag itself.

> remains valid - it's crap and useless for this.

Evidently, you do not know what you are talking about.


PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300dec7(at)news(dot)demon(dot)co(dot)uk>
Re: strip_tags function [message #178771 is a reply to message #178752] Tue, 31 July 2012 19:17
Tim Fardell
On Sat, 28 Jul 2012 12:42:59 +0200, Thomas 'PointedEars' Lahn
<PointedEars(at)web(dot)de> wrote:

> Tim Fardell wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Tim Fardell wrote:
>>>> On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>>> <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> > However, am I right in thinking that the strip_tags() function simply
>>>> > assumes that any less-than character (<) occurring within a string is
>>>> > the beginning of a tag?
>>>> >
>>>> > I hope I'm wrong, because that would be completely crap and useless :-)
>>>>
>>>> […]
>>>> I think I am correct that strip_tags assumes any '<' character to be the
>>>> beginning of a tag -
>>>
>>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>>> sensitive:
>>>
>>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>>> foo
>>
>> Possibly true,
>
> What do you mean – "possibly"? I tested it there.

By "possibly" I mean that what you say may well be true, and I do not dispute
that it is. All I am saying is that your example does not demonstrate it, and
thus does not prove it.

>> but that example does not demonstrate it.
>
> But it does.

No, it does not. A function that assumes every occurrence of the '<' character
is the beginning of a tag, and removes it and all text from it up to and
including the next '>' character in the string would yield exactly the same
output. You have not demonstrated context sensitivity.

>> Try:
>>
>> $ php -r "echo strip_tags('<a title=\'<\'>f<o<o</a>');"
> ^
>> Output should be
>>
>> f<o<o
>
> No, it should not.

If the function is intended to be used to process plain unformatted ASCII text
and remove any unwanted HTML tags from it, then yes, it should. If the function
is intended to remove all HTML tags from an HTML document, then I agree it
should not. Which is what I said to start with in the followup to my original
post.

> The markup is syntactically invalid in the first place

What markup? If, as you suggest later on, and also implied in your previous
post, the input text is supposed to be plain, unformatted text, which may
unintentionally contain rogue HTML tags that should not be there, then the
input text is unlikely to be syntactically correct markup anyway.

>> Definitely doesn't work in PHP 5.3.3, and according to php.net the
>> function hasn't changed since 5.0.0.
>
> It would be prudent if you learned about SGML-based markup languages before
> you attempted to pass on judgement on the correctness of their parsers.

Actually I think what I have been saying all along is that the function behaves
correctly if the input text is HTML. I believe the function is supposed to take
correct HTML as input, in which case it works exactly as I would expect.

>>>> but this doesn't actually matter, since the input string should be HTML
>>>> encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>>> all actual '<' characters will indeed be tags :-)
>>>
>>> You are not making sense. The *input* data should *never* be "HTML
>>> encoded".

Wait, didn't you just say the opposite? If the input is not HTML-encoded then it
is *critical* that any '<' character which does not form part of a tag is
ignored. It is not, as my example above proves.

>> Then I must have completely misunderstood something here. I thought the
>> whole point of strip_tags() was to remove all HTML tags from the input
>> string.
>
> Yes, HTML *tags*.

Good, we agree on something.

>> Therefore the input has to be HTML or the function is pointless.
>
> You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
> (Which is a common, and semantically correct definition of the term.)
>
> So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there are
> no tags in them.

Agreed again.

>> Unless the idea is to remove rogue HTML tags from a plain text string,
>
> ISTM that is the general idea.

Oh dear, we're going to start disagreeing again here.

> Some HTML elements (which consist of start
> tag, optional content and optional end tag, depending on the element type),
> are potentially detrimental to the expected display and functionality of a
> Web document. For example, consider that people were allowed to use
> text-formatting elements in a blog comment, but were not allowed to insert
> `script' elements (to avoid XSS) or `img' and `object' elements (to avoid
> reduction of loading speed and interference with other multimedia on the
> site); you would only list the text-formatting elements in the function's
> second parameter then.
>
> Other people might want to remove HTML tags altogether, leaving only the
> text content of the document (fragment).

OK, so given no parameters other than the input string, the behaviour of
strip_tags() is to remove all HTML tags from the input string. Correct?

You have just said that you believe the input string should be plain text, i.e.
it *should* not contain any tags or formatting information of any kind, but may
contain rogue HTML tags which need to be removed. You also said that the input
string should *never* be HTML-encoded. Therefore, the input string could quite
easily contain '<' characters which are not part of HTML tags, as '<' is a
perfectly valid ASCII character. Therefore, by your reasoning, it is important
that any '<' character which does not form part of a tag is ignored.

This is not the case.

>> in which case my original point about it assuming all < symbols are tags
>
> The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
> markup languages. It delimits a start tag on the left-hand side (a `>'
> character delimits it on the right-hand side). Obviously, therefore it
> cannot be a tag itself.

Sorry but that is extremely pedantic. I think it's obvious what I meant - Just
to clarify, I meant to say "...in which case my original point about it assuming
all < symbols *are the beginning of* tags..."

>> remains valid - it's crap and useless for this.
>
> Evidently, you do not know what you are talking about.

No need for rudeness is there?

I think basically what I'm saying is that the strip_tags() function is great and
is really useful if its input text is correct HTML.

I therefore believe this function is intended to take correct HTML as input. It
is therefore crap and useless if you pass it anything other than correct HTML,
which is what you seem to be disputing.

--
Please remove all-your-clothes before replying.
Re: strip_tags function [message #178772 is a reply to message #178771] Tue, 31 July 2012 19:30
Jerry Stuckle
On 7/31/2012 3:17 PM, Tim Fardell wrote:
> On Sat, 28 Jul 2012 12:42:59 +0200, Thomas 'PointedEars' Lahn
> <PointedEars(at)web(dot)de> wrote:
>

<snip>

>>>>
>>>> You are not making sense. The *input* data should *never* be "HTML
>>>> encoded".
>
> Wait, didn't you just say the opposite? If the input is not HTML-encoded then it
> is *critical* that any '<' character which does not form part of a tag is
> ignored. It is not, as my example above proves.
>

Tim,

You should have stopped right here. "Pointed Head" is a well known
troll in several newsgroups. He doesn't understand it is completely
valid to have HTML encoded strings - for instance, when using cURL to
retrieve a web page.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================
Re: strip_tags function [message #178773 is a reply to message #178772] Tue, 31 July 2012 19:37
Tim Fardell
On Tue, 31 Jul 2012 15:30:54 -0400, Jerry Stuckle <jstucklex(at)attglobal(dot)net>
wrote:

> On 7/31/2012 3:17 PM, Tim Fardell wrote:
>> On Sat, 28 Jul 2012 12:42:59 +0200, Thomas 'PointedEars' Lahn
>> <PointedEars(at)web(dot)de> wrote:
>>
>
> <snip>
>
>>>> >
>>>> > You are not making sense. The *input* data should *never* be "HTML
>>>> > encoded".
>>
>> Wait, didn't you just say the opposite? If the input is not HTML-encoded then it
>> is *critical* that any '<' character which does not form part of a tag is
>> ignored. It is not, as my example above proves.
>>
>
> Tim,
>
> You should have stopped right here. "Pointed Head" is a well known
> troll in several newsgroups. He doesn't understand it is completely
> valid to have HTML encoded strings - for instance, when using cURL to
> retrieve a web page.

Ah - that explains it :-)

I did wonder to be honest, which is why I didn't reply for a couple of days, but
it got the better of me and I just *had* to respond in the end.

I'll leave it there - apologies for feeding the troll.
--
Please remove all-your-clothes before replying.
Re: strip_tags function [message #178775 is a reply to message #178771] Wed, 01 August 2012 04:49
Thomas 'PointedEars' Lahn
Tim Fardell wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Tim Fardell wrote:
>>> Thomas 'PointedEars' Lahn wrote:
>>>> Tim Fardell wrote:
>>>> > On Thu, 26 Jul 2012 18:18:44 +0100, Tim Fardell
>>>> > <tim(dot)fardell(dot)all-your-clothes(at)virgin(dot)net> wrote:
>>>> >> However, am I right in thinking that the strip_tags() function simply
>>>> >> assumes that any less-than character (<) occurring within a string is
>>>> >> the beginning of a tag?
>>>> >>
>>>> >> I hope I'm wrong, because that would be completely crap and useless
>>>> >> :-)
>>>> >
>>>> > […]
>>>> > I think I am correct that strip_tags assumes any '<' character to be
>>>> > the beginning of a tag -
>>>>
>>>> No, you are not. The function, at least as of PHP 5.3.10, is context-
>>>> sensitive:
>>>>
>>>> $ php -r "echo strip_tags('<a title=\'<\'>foo</a>');"
>>>> foo
>>>
>>> Possibly true,
>>
>> What do you mean – "possibly"? I tested it there.
>
> By "possibly" I mean that what you say may well be true, and I do not
> dispute that it is. All I am saying is that your example does not
> demonstrate it, and thus does not prove it.

A matter of interpretation.

>>> but that example does not demonstrate it.
>> But it does.
>
> No, it does not. A function that assumes every occurrence of the '<'
> character is the beginning of a tag, and removes it and all text from it
> up to and including the next '>' character in the string would yield
> exactly the same output. You have not demonstrated context sensitivity.

An example that proves my point and applies to your now-clarified assumption
can be easily created:

$ php -r "echo strip_tags('<a title=\'>\'>foo</a>');"
foo

If it were as you assumed, then the output would have to be something
similar to

'>foo

It is not.
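
The difference can be sketched side by side. The `naive_strip()` helper below is hypothetical, written only to model the assumed behavior (delete everything from any '<' up to and including the next '>'); it is not part of PHP.

```php
<?php
// Hypothetical naive stripper: every '<' starts a tag that ends at the
// next '>', regardless of quoting.
function naive_strip($s)
{
    return preg_replace('/<[^>]*>/', '', $s);
}

$lt = "<a title='<'>foo</a>"; // '<' inside the attribute value
$gt = "<a title='>'>foo</a>"; // '>' inside the attribute value

echo naive_strip($lt), "\n"; // foo   (cannot tell the two approaches apart)
echo strip_tags($lt), "\n";  // foo
echo naive_strip($gt), "\n"; // '>foo (tag cut short at the quoted '>')
echo strip_tags($gt), "\n";  // foo   (quoted '>' is skipped)
```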

>> The markup is syntactically invalid in the first place
>
> What markup? If, as you suggest later on, and also implied in your
> previous post, the input text is supposed to be plain, unformatted text,
> which may unintentionally contain rogue HTML tags that should not be
> there, then the input text is unlikely to be syntactically correct markup
> anyway.

You still misunderstand. The input is obviously supposed to be markup,
containing tags; it needs to be parsed.

However, you have (well-)defined "HTML-encoded" as content where the "<" of
tags would be replaced with "&lt;". That would not be markup anymore as
markup requires at least one tag. (`&lt;' is _not_ a tag.)

>>> Definitely doesn't work in PHP 5.3.3, and according to php.net the
>>> function hasn't changed since 5.0.0.
>>
>> It would be prudent if you learned about SGML-based markup languages
>> before you attempted to pass on judgement on the correctness of their
>> parsers.
>
> Actually I think what I have been saying all along is that the function
> behaves correctly if the input text is HTML.

No, you assumed the function would consider *all* `<'s to be a start of a
tag, which it obviously does not.

> I believe the function is supposed to take correct HTML as input, in which
> case it works exactly as I would expect.

If "correct" means *syntactically* correct, then you are right. However,
that does not mean that the function could not process syntactically invalid
markup correctly; the result just might not be what you expect, as there is
no specified definition of correctness with regard to parsing syntactically
invalid markup.

>>>> > but this doesn't actually matter, since the input string should be HTML
>>>> > encoded anyway, so all '<' characters should be escaped as '&lt;' - so
>>>> > all actual '<' characters will indeed be tags :-)
>>>>
>>>> You are not making sense. The *input* data should *never* be "HTML
>>>> encoded".
>
> Wait, didn't you just say the opposite?

No, I did not.

>>> Then I must have completely misunderstood something here. I thought the
>>> whole point of strip_tags() was to remove all HTML tags from the input
>>> string.
>>
>> Yes, HTML *tags*.
>
> Good, we agree on something.

But do you *really* know what a HTML tag is? It sure does not look like
that.

>>> Therefore the input has to be HTML or the function is pointless.
>>
>> You have defined "HTML-encoded" above to mean that "<" would be "&lt;".
>> (Which is a common, and semantically correct definition of the term.)
>>
>> So HTML-*encoded* strings are _not_ HTML *markup*; by definition, there
>> are no tags in them.
>
> Agreed again.

So, AISB, *HTML-encoded strings* (which is a different animal than HTML
markup) are not supposed to be the input to this function or any form of
server-side PHP processing. You should not get HTML-encoded strings from
the client and you should not store HTML-encoded strings in your database.

Instead, you should store the plain markup, and HTML-encode it for output if
necessary. The best approach is, of course, not to store HTML markup at all
in a database (data storage should be independent of output) but that is not
always possible.
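
A minimal sketch of "store as-is, encode for output", assuming the stored value is meant to be displayed as text rather than rendered as markup. The `render_comment()` helper and the sample value are hypothetical; `htmlspecialchars()` is the standard PHP function.

```php
<?php
// Hypothetical output helper: the stored string is left untouched in
// storage and encoded only at the moment it is written into a page.
function render_comment($stored)
{
    return '<p>' . htmlspecialchars($stored, ENT_QUOTES, 'UTF-8') . '</p>';
}

// Hypothetical value as it would sit in the database (not encoded).
$stored = 'Tom & Jerry say: 1 < 2 <script>';

echo render_comment($stored), "\n";
// <p>Tom &amp; Jerry say: 1 &lt; 2 &lt;script&gt;</p>
```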

>>> Unless the idea is to remove rogue HTML tags from a plain text string,
>> ISTM that is the general idea.
>
> Oh dear, we're going to start disagreeing again here.

Define: rogue HTML tags.

>> Some HTML elements (which consist of start tag, optional content and
>> optional end tag, depending on the element type), are potentially
>> detrimental to the expected display and functionality of a
>> Web document. For example, consider that people were allowed to use
>> text-formatting elements in a blog comment, but were not allowed to
>> insert `script' elements (to avoid XSS) or `img' and `object' elements
>> (to avoid reduction of loading speed and interference with other
>> multimedia on the site); you would only list the text-formatting elements
>> in the function's second parameter then.
>>
>> Other people might want to remove HTML tags altogether, leaving only the
>> text content of the document (fragment).
>
> OK, so given no parameters other than the input string, the behaviour of
> strip_tags() is to remove all HTML tags from the input string. Correct?

Correct.

> You have just said that you believe the input string should be plain text,

No, I did not. You misunderstood, probably because you cannot tell what is
a HTML tag or what you mean by "HTML-encoded string".

>>> in which case my original point about it assuming all < symbols are tags
>>
>> The `<' character is an STAGO (STart Tag Open) delimiter in SGML-based
>> markup languages. It delimits a start tag on the left-hand side (a `>'
>> character delimits it on the right-hand side). Obviously, therefore it
>> cannot be a tag itself.
>
> Sorry but that is extremely pedantic.

That is a luser's attitude. Correct and unambiguous terminology is
paramount to understanding and learning. `<' is _not_ a tag; it is a tag
*delimiter*.

> I think it's obvious what I meant -

I do not think you know what you are talking about, so you are modifying
your terminology as you go (like, "a tag is `<…>'", "a tag is &lt;", "a tag
is `<'"). Of course, this is where misconceptions are created, very common
in beginners.

> Just to clarify, I meant to say "...in which case my original point about
> it assuming all < symbols *are the beginning of* tags..."

ACK.

>>> remains valid - it's crap and useless for this.
>> Evidently, you do not know what you are talking about.
>
> No need for rudeness is there?

Given your statements, it is a matter of fact. (A fact that can be changed
by you, and eventually *only* you.) If you cannot deal with that, please
stop wasting my time.

On a side note, you can tell a troll from a regular when you see the former
contributing nothing but ad-hominem attack and misinformation. Forged
address headers are also a strong indication of a troll. Any form of
continued anti-social behavior, really.

--
PointedEars
Re: strip_tags function [message #178776 is a reply to message #178773] Wed, 01 August 2012 04:56
Thomas 'PointedEars' Lahn
Tim Fardell wrote:

> Jerry Stuckle wrote:
>> On 7/31/2012 3:17 PM, Tim Fardell wrote:
>>> On Sat, 28 Jul 2012 12:42:59 +0200, Thomas 'PointedEars' Lahn
>>> <PointedEars(at)web(dot)de> wrote:
>>>> >> You are not making sense. The *input* data should *never* be "HTML
>>>> >> encoded".
>>>
>>> Wait, didn't you just say the opposite? If the input is not HTML-encoded
>>> then it is *critical* that any '<' character which does not form part of
>>> a tag is ignored. It is not, as my example above proves.
>>
>> You should have stopped right here. "Pointed Head" is a well known
>> troll in several newsgroups. He doesn't understand it is completely
>> valid to have HTML encoded strings - for instance, when using cURL to
>> retrieve a web page.
>
> Ah - that explains it :-)

Your misconceptions?

> I did wonder to be honest, which is why I didn't reply for a couple of
> days, but it got the better of me and I just *had* to respond in the end.

Which got you a constructive discussion in the end that you can *really*
learn from, if you are willing. Some people have jobs (which happen to be
in this industry), and cannot always reply in time.

> I'll leave it there - apologies for feeding the troll.

If you are this naive to blindly believe the first "opinion" that comes
along just because the topic turns out to be a little more complex than you
expected, then you better leave here, indeed.


PointedEars
--
Prototype.js was written by people who don't know javascript for people
who don't know javascript. People who don't know javascript are not
the best source of advice on designing systems that use javascript.
-- Richard Cornford, cljs, <f806at$ail$1$8300dec7(at)news(dot)demon(dot)co(dot)uk>
Re: strip_tags function [message #178777 is a reply to message #178773] Wed, 01 August 2012 05:15
Thomas 'PointedEars' Lahn
Tim Fardell wrote:

> On Tue, 31 Jul 2012 15:30:54 -0400, Jerry Stuckle
>> […] it is completely valid to have HTML encoded strings - for instance,
>> when using cURL to retrieve a web page.

JFTR: Utter nonsense. "HTML-encoded string" was (well-)defined *by the OP*
as a string where "<a>…</a>" would become at least "&lt;a>…&lt;/a>".

If you use cURL to retrieve a "web page" (correct: a web _document_, like an
HTML document), the input is always *markup*, like "<a>…</a>". It may also
include character references (`&#215;') or character entity references
(`&lt;'), but it will contain at least one tag (because the "web page" would
otherwise be useless).
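
As a sketch of handling such markup once it has been fetched: the string below is a made-up stand-in for a retrieved document body (no network request is made). Tags are stripped first, and the remaining character references are decoded afterwards to get plain text.

```php
<?php
// Hypothetical stand-in for the body of a fetched document.
$html = '<p>3 &lt; 5 and 2 &#215; 2 = 4</p>';

// Step 1: remove the tags; references such as &lt; are left alone.
$text = strip_tags($html);
echo $text, "\n"; // 3 &lt; 5 and 2 &#215; 2 = 4

// Step 2: decode the references into the characters they stand for.
echo html_entity_decode($text, ENT_QUOTES, 'UTF-8'), "\n"; // 3 < 5 and 2 × 2 = 4
```

Decoding before stripping would turn `&lt;` back into a raw '<' and change what the tag stripper sees, which is why the order matters in this sketch.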


PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee