FUDforum: comp.lang.php » Help with regex

Help with regex [message #184950]

Wed, 19 February 2014 09:18

knal
Messages: 3
Registered: February 2014

Karma: 0

Junior Member

Hi there,

I'd like to parse users' textfield HTML input with a regex. The user writes regular HTML, including links with a hash (mysite.com/somedocument/file.html#thehash).
I want to extract all unique <a href>s that match #thehash, but I don't know how to build the regex (so I can sum them up in another location).

Any help would be greatly appreciated.

Thanks in advance,
Knal

Re: Help with regex [message #184951 is a reply to message #184950]

Wed, 19 February 2014 14:20

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

knal wrote:
^^^^
Please use your real name here.

> I'd like to parse users' textfield HTML input with a regex. The user
> writes regular html,
^^^^^^^^^^^^
What do you mean by that?

> including links with a hash
> (mysite.com/somedocument/file.html#thehash) I want to extract all unique
> <a hrefs> that match #thehash but wouldn't know how to build the regex.
> (so i can sum them up in another location)

RTFM: <http://php.net/pcre>

PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann

Re: Help with regex [message #184952 is a reply to message #184951]

Wed, 19 February 2014 14:45

knal
Messages: 3
Registered: February 2014

Karma: 0

Junior Member

Hi PointedEars,

By regular HTML I meant basic HTML, submitted with basic WYSIWYG features like links, basic text styles (b, i, u, etc.), images, some divs, etc.
I understand this wasn't clear, sorry for that.

Regarding the RTFM: I've read up on regex, and also understood it's not intended for what I want to use it for.
I've come here for help because I was unable to find a similar situation that could lead me to a solution, nor was I able to find any other method I could use.

Kind regards,
John Knal Peters,

On Wednesday, 19 February 2014 15:20:00 UTC+1, Thomas 'PointedEars' Lahn wrote:
> knal wrote:
> ^^^^
> Please use your real name here.
>
>> I'd like to parse users' textfield HTML input with a regex. The user
>> writes regular html,
> ^^^^^^^^^^^^
> What do you mean by that?
>
>> including links with a hash
>> (mysite.com/somedocument/file.html#thehash) I want to extract all unique
>> <a hrefs> that match #thehash but wouldn't know how to build the regex.
>> (so i can sum them up in another location)
>
> RTFM: <http://php.net/pcre>
>
> PointedEars
> --
> realism: HTML 4.01 Strict
> evangelism: XHTML 1.0 Strict
> madness: XHTML 1.1 as application/xhtml+xml
> -- Bjoern Hoehrmann

Re: Help with regex [message #184953 is a reply to message #184952]

Wed, 19 February 2014 15:11

Christoph Michael Becker
Messages: 207
Registered: June 2013

Karma: 0

Senior Member

knal wrote:

> With regular HTML i meant basic HTML, submitted with basic
> WYSIWYG-features like links, basic text styles (b, i, u, etc.),
> images, some divs, etc. I understand this wasn't clear, sorry for
> that.

I assume you are aware of potential security issues wrt. user supplied
input containing markup.

> Regarding the RTFM: I've read on the regex, and also understood it's
> not intended for what I want to use it for. I've come here for help
> here because I was unable to find a similar situation which could aid
> me to my solution nor was i able to find any other method I could
> use.

Well, you can use a regex for this purpose, but you can also use PHP's
DOM extension[1], which is less prone to subtle errors.
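
As a rough illustration of the regex route, here is a minimal sketch. The sample markup and the variable names are made up for this example, and the pattern assumes that every href value is quoted and sits inside a single <a ...> tag. Input that breaks those assumptions (unquoted attributes, a ">" inside an attribute value, links inside comments) will be missed or mismatched, which is the kind of subtle error meant above.

```php
<?php
// Hypothetical sample input standing in for the submitted markup.
$html = '<p><a href="http://mysite.com/somedocument/file.html#thehash">one</a> '
      . '<a href=\'/other.html#second\'>two</a> '
      . '<a href="http://mysite.com/somedocument/file.html#thehash">dup</a> '
      . '<a href="/nohash.html">three</a></p>';

// Capture quoted href values of <a> tags that contain a "#" fragment.
// Group 1 is the opening quote, group 2 the whole href value.
$pattern = '/<a\b[^>]*?\bhref\s*=\s*(["\'])([^"\'#]*#[^"\']+)\1/i';
preg_match_all($pattern, $html, $matches);

// Drop duplicates and reindex the result.
$links = array_values(array_unique($matches[2]));
print_r($links);
```

Here $links ends up holding each fragment-bearing href once, in order of first appearance.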

And please don't top-post.

[1] <http://de3.php.net/manual/en/book.dom.php>

--
Christoph M. Becker

Re: Help with regex [message #184954 is a reply to message #184953]

Wed, 19 February 2014 15:17

knal
Messages: 3
Registered: February 2014

Karma: 0

Junior Member

Hi Christoph,

Thanks for your reply.

I am aware of the security issues, but the users are part of a restricted group submitting to a restricted part of the website.
I do however filter the input for SQL injections etc, just to be sure.

Could you please give a little more direction on the DOM?

Thank you in advance.

Kind regards,
John Knal Peters,

Re: Help with regex [message #184955 is a reply to message #184952]

Wed, 19 February 2014 15:30

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

knal wrote:
^^^^
Please fix.

> Hi PointedEars,

Hello. This is a one-to-many (or many-to-many) communications medium.
Therefore, addressing only one person is impolite towards the other
potential readers.

> With regular HTML i meant basic HTML, submitted with basic
> WYSIWYG-features like links, basic text styles (b, i, u, etc.), images,
> some divs, etc. I understand this wasn't clear, sorry for that.

But even such “basic HTML” is not a regular language because many element
types can be part of their own content.

> Regarding the RTFM: I've read on the regex, and also understood it's not
> intended for what I want to use it for.

That depends on *how* you use it. Because HTML is not a regular language,
unless you would accept only a subset of HTML that is regular, a single
application of a single regular expression will not suffice (even if that
expression is not a regular one in the computer-science sense of the term).

> I've come here for help here because I was unable to find a similar
> situation which could aid me to my solution nor was i able to find any
> other method I could use.

In general, you are looking for a markup parser (which is an implementation
of a push-down automaton, a PDA). Either write one yourself (you *can* use
regular expressions there) or use those that are available (Christoph has
pointed out a reasonable one).

> [Top post]

<http://www.netmeister.org/news/learn2quote.html>

--
PointedEars

Re: Help with regex [message #184956 is a reply to message #184954]

Wed, 19 February 2014 15:37

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

knal wrote:
^^^^
Unless you (try to) fix this, this will be the last I see of your postings.

> I am aware of the security issues, but the users are part of a restricted
> group submitting to a restricted part of the website. I do however filter
> the input for SQL injections etc, just to be sure.
>
> Could you please give a little more direction on the DOM?

I presume he could. Ask what you want to know instead.

Reformulating your stupid question to the smart one, “How can I use the DOM
here?”, the answer becomes “RTFM: loadHTML”.

Next time, post what you have tried before you ask. This is not free
customer support for lazy people.

<http://www.catb.org/~esr/faqs/smart-questions.html>

Not quoting anything is even worse than quoting everything because now the
reader has nothing to refer to. That is _not_ what the two of us meant when
we asked you to stop *top*-posting. Learn to quote.

PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee

Re: Help with regex [message #184957 is a reply to message #184951]

Wed, 19 February 2014 17:42

The Natural Philosopher
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

On 19/02/14 14:20, Thomas 'PointedEars' Lahn wrote:
> knal wrote:
> ^^^^
> Please use your real name here.
'PointedEars'

^^^^
Please use your real name here.

--
Ineptocracy

(in-ep-toc’-ra-cy) – a system of government where the least capable to
lead are elected by the least capable of producing, and where the
members of society least likely to sustain themselves or succeed, are
rewarded with goods and services paid for by the confiscated wealth of a
diminishing number of producers.

Re: Help with regex [message #184958 is a reply to message #184954]

Wed, 19 February 2014 18:48

Jerry Stuckle
Messages: 2598
Registered: September 2010

Karma: 0

Senior Member

On 2/19/2014 10:17 AM, knal wrote:
> Hi Christoph,
>
> Thanks for your reply.
>
> I am aware of the security issues, but the users are part of a restricted group submitting to a restricted part of the website.
> I do however filter the input for SQL injections etc, just to be sure.
>
> Could you please give a little more direction on the DOM?
>
> Thank you in advance.
>
> Kind regards,
> John Knal Peters,
>

Knal,

Check out http://www.php.net/manual/en/book.dom.php. PHP's DOM can be a
bit complicated at first, but you can catch on. Despite the comments,
it works with HTML, also. And it will be a whole lot easier than trying
to parse HTML yourself.

If the HTML is well-formed, you might try the simpleXML extension
(http://www.php.net/manual/en/book.simplexml.php), but I've found it
doesn't deal well with malformed input; the DOM extension does much better.
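
To give an idea of how the DOM extension copes with sloppy input, here is a small sketch. The markup is an invented example with unclosed tags, and the variable names are only illustrative. libxml_use_internal_errors() is used so that the parser's complaints about the broken markup are collected instead of being raised as PHP warnings.

```php
<?php
// Hypothetical malformed input: <b> and <a> are never closed.
$html = '<div><p>Unclosed <b>bold <a href="file.html#thehash">a link</p></div>';

// Collect parser errors internally instead of emitting warnings.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Walk all <a> elements the parser recovered and read their href.
$hrefs = [];
foreach ($doc->getElementsByTagName('a') as $a) {
    $hrefs[] = $a->getAttribute('href');
}
print_r($hrefs);
```

The parser repairs the tree as best it can, so the link is still found even though the markup is not well-formed.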

P.S. Don't mind "Pointed Head". He's just being his usual pedantic
self. He gets this way when someone asks a question he doesn't
understand (which seems to be pretty often).

--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex(at)attglobal(dot)net
==================

Re: Help with regex [message #184960 is a reply to message #184956]

Wed, 19 February 2014 20:32

Christoph Michael Becker
Messages: 207
Registered: June 2013

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:

> knal wrote:
>
>> Could you please give a little more direction on the DOM?
>
> I presume he could.

Actually, I have not yet used PHP's DOM extension. However, it should
not be too different from other DOM implementations, so basically it
comes down to loading an HTML document and then using XPath to get the
desired information. The mentioned keywords should be sufficient to
search the web -- at least for some first steps.
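
A minimal sketch of that approach, assuming the DOMDocument::loadHTML() and DOMXPath classes from PHP's DOM extension; the sample markup and the variable names are invented for illustration. It collects the unique fragment identifiers (the part after "#") of all links.

```php
<?php
// Hypothetical sample input standing in for the submitted markup.
$html = '<p><a href="http://mysite.com/somedocument/file.html#thehash">one</a> '
      . '<a href="/other.html#second">two</a> '
      . '<a href="http://mysite.com/somedocument/file.html#thehash">again</a> '
      . '<a href="/nohash.html">three</a></p>';

// Keep parser complaints about sloppy markup out of the output.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a> element whose href contains a "#".
$xpath = new DOMXPath($doc);
$fragments = [];
foreach ($xpath->query('//a[contains(@href, "#")]') as $a) {
    // parse_url() splits off the fragment without any hand-written regex.
    $fragment = parse_url($a->getAttribute('href'), PHP_URL_FRAGMENT);
    if (is_string($fragment) && $fragment !== '') {
        $fragments[] = $fragment;
    }
}

// Drop duplicates and reindex the result.
$fragments = array_values(array_unique($fragments));
print_r($fragments);
```

To collect whole URLs instead of fragments, store the href attribute itself rather than the parse_url() result.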

> Ask what you want to know instead.

Is there really so much difference between directly asking the question
"How can I use the DOM here?" and asking politely for "some more
direction on the DOM"?

> Reformulating your stupid question to the smart one, “How can I use the DOM
> here?”, the answer becomes “RTFM: loadHTML”.
>
> Next time, post what you have tried before you ask. This is not free
> customer support for lazy people.
>
> <http://www.catb.org/~esr/faqs/smart-questions.html>

I also recommend reading it, if only to better understand the manner of
expression of others. Even if "RTFM" might seem to be an impolite
reply, it is not meant this way.[1]

> Not quoting anything is even worse than quoting everything because now the
> reader has nothing to refer to. That is _not_ what the two of us meant when
> we asked you to stop *top*-posting. Learn to quote.

I may add: proper quoting does not serve as an end in itself, but is
rather vital for *Usenet* (even if you, John[2], are using Google Groups
which seems to be more like a bulletin board), as not everybody might
receive, let alone read every single message. But reading a message out
of context can be confusing.

[1] <http://www.catb.org/~esr/faqs/smart-questions.html#rtfm>
[2] A good example for the missing context. Other readers might wonder
to whom I am referring, as there are only Thomas 'PointedEars' and knal
attributed to in this message.

--
Christoph M. Becker

Re: Help with regex [message #184961 is a reply to message #184960]

Wed, 19 February 2014 21:28

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> knal wrote:
>>> Could you please give a little more direction on the DOM?
>> I presume he could.
>
> Actually, I have not yet used PHP's DOM extension. However, it should
> not be too different from other DOM implementations, so basically it
> comes down to load an HTML document and than use XPath to get the
> desired information. The mentioned keywords should be sufficient to
> search the web -- at least for some first steps.

However, implementations of the W3C DOM Core API do not require a way so
that a document tree can be built from source code, nor is XPath support
required. So you had better RTFM yourself before making the suggestion.
There are few if any occasions where such wild guesses are useful. Lucky
for you, you hit the nail this time: there are corresponding methods in
this (PHP’s) implementation (so there is no need to search the Web).

>> Ask what you want to know instead.
>
> Is there really so much difference between directly asking the question
> "How can I use the DOM here?" and asking politely for "some more
> direction on the DOM"?

Yes, …

>> Reformulating your stupid question to the smart one, “How can I use the
>> DOM here?”, the answer becomes “RTFM: loadHTML”.
>>
>> Next time, post what you have tried before you ask. This is not free
>> customer support for lazy people.
>>
>> <http://www.catb.org/~esr/faqs/smart-questions.html>

… as explained in there.

HTH

PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)

Re: Help with regex [message #184963 is a reply to message #184957]

Wed, 19 February 2014 23:38

Scott Johnson
Messages: 196
Registered: January 2012

Karma: 0

Senior Member

On 2/19/14, 9:42 AM, The Natural Philosopher wrote:
> On 19/02/14 14:20, Thomas 'PointedEars' Lahn wrote:
>> knal wrote:
>> ^^^^
>> Please use your real name here.
> 'PointedEars'
>
> ^^^^
> Please use your real name here.
>

+1

Scotty

Re: Help with regex [message #184965 is a reply to message #184954]

Thu, 20 February 2014 06:29

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Wed, 19 Feb 2014 07:17:17 -0800 (PST), knal wrote:

> I am aware of the security issues, but the users are part of a
> restricted group submitting to a restricted part of the website. I do
> however filter the input for SQL injections etc, just to be sure.

Okay, if they're part of a limited and presumably trusted group, just
let 'em loose to write whatever HTML they want. Or don't give them any
at all.

The core point is that the thing that you're asking to do is a Very Hard
Problem. It's so hard, in fact, that most places that would otherwise
allow limited markup, like you're proposing to do, tend to do it by using
their OWN markup tags, invalidating actual HTML (a la htmlentities();),
and then parsing for their own tags and substituting real HTML tags on
the output side. Because that's WAY LESS WORK and a lot more reliable
than what you're hoping to do.
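
A rough sketch of what that could look like. The tag syntax ([b], [i], [u], [url=...]) and the function name render_markup() are made up for this example and not taken from any particular forum software.

```php
<?php
// Hypothetical renderer for a site-specific markup:
// [b]...[/b], [i]...[/i], [u]...[/u] and [url=http://...]...[/url].
function render_markup(string $input): string
{
    // Step 1: invalidate any real HTML in the input.
    $safe = htmlentities($input, ENT_QUOTES, 'UTF-8');

    // Step 2: substitute real HTML for the site's own tags.
    $safe = preg_replace('~\[(b|i|u)\](.*?)\[/\1\]~s', '<$1>$2</$1>', $safe);

    // Only http(s) URLs are accepted as link targets.
    $safe = preg_replace(
        '~\[url=(https?://[^\]\s]+)\](.*?)\[/url\]~s',
        '<a href="$1">$2</a>',
        $safe
    );

    return $safe;
}

echo render_markup('[b]bold[/b] <script>'), "\n";
echo render_markup('[url=http://example.com/a#x]link[/url]'), "\n";
```

Note that a single pass like this does not handle tags nested inside other tags; a real implementation would need repeated passes or a small parser.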

--
10. I will not interrogate my enemies in the inner sanctum -- a small
hotel well outside my borders will work just as well.
--Peter Anspach's list of things to do as an Evil Overlord

Re: Help with regex [message #184967 is a reply to message #184961]

Thu, 20 February 2014 21:43

Christoph Michael Becker
Messages: 207
Registered: June 2013

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:

> Christoph Michael Becker wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> knal wrote:
>>>> Could you please give a little more direction on the DOM?
>>> I presume he could.
>>
>> Actually, I have not yet used PHP's DOM extension. However, it should
>> not be too different from other DOM implementations, so basically it
>> comes down to load an HTML document and than use XPath to get the
>> desired information. The mentioned keywords should be sufficient to
>> search the web -- at least for some first steps.
>
> However, implementations of the W3C DOM Core API do not require a way so
> that a document tree can be built from source code, nor is XPath support
> required.

I was not explicitly speaking about the DOM *Core* API, but rather of
DOM implementations in a broad sense. As you surely are aware, there is
the DOM Level 3 Load and Save Specification[1] as well as the XML Path
Language (XPath) Version 1.0[2], both of which have been W3C
recommendations for a long time. While XPath support might be missing
from some "DOM" implementations, either implicit or explicit loading of
a document source most likely won't (otherwise the implementation would
not allow working with existing document sources, and as such would not
be too useful).

> So you had better RTFM yourself before making the suggestion.

I had glanced over the manual and found DOMDocument::loadHTML()[3] as
well as DOMXPath[4]...

> There are few if any occasions where such wild guesses are useful.

... and I have used DOM implementations of other languages (TCL,
FreePascal/Lazarus) so IMHO this wasn't a wild guess, but rather an at
least somewhat educated one.

> Lucky
> for you, you hit the nail this time: there are corresponding methods in
> this (PHP’s) implementation (so there is no need to search the Web).

For someone unaccustomed to the DOM and particularly XPath, searching
the Web for some introduction may well be necessary. The PHP
documentation regarding these topics is rather terse, and the user
comments might not make up for it.

[1] <http://www.w3.org/TR/2004/REC-DOM-Level-3-LS-20040407/>
[2] <http://www.w3.org/TR/xpath/>
[3] <http://www.php.net/manual/en/domdocument.loadhtml.php>
[4] <http://www.php.net/manual/en/class.domxpath.php>

--
Christoph M. Becker

Re: Help with regex [message #184968 is a reply to message #184967]

Fri, 21 February 2014 01:25

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Christoph Michael Becker wrote:

> Thomas 'PointedEars' Lahn wrote:
>> Christoph Michael Becker wrote:
>>> Thomas 'PointedEars' Lahn wrote:
>>>> knal wrote:
>>>> > Could you please give a little more direction on the DOM?
>>>> I presume he could.
>>>
>>> Actually, I have not yet used PHP's DOM extension. However, it should
>>> not be too different from other DOM implementations, so basically it
>>> comes down to load an HTML document and than use XPath to get the
>>> desired information. The mentioned keywords should be sufficient to
>>> search the web -- at least for some first steps.
>>
>> However, implementations of the W3C DOM Core API do not require a way so
>> that a document tree can be built from source code, nor is XPath support
>> required.
>
> I was not explicitly speaking about the DOM *Core* API,

But that is PHP’s implementation. DOMXPath resembles, but does not conform
to, DOM3 XPath.

> but rather of DOM implementations in a broad sense. As you surely are
> aware, there is the DOM Level 3 Load and Save Specification[1] as well as
> the XML Path Language (XPath) Version 1.0[2], both of which are W3C
> recommendations since a long time.

And both of which are *optional*. PHP implements none of them, but it
supports alternative ways.

> While XPath support might be missing from some "DOM"
> implementations, either implicit or explicit loading of a document
> source most likely won't (otherwise the implementation would not allow
> to work with existing document sources, and as such would not be too
> useful).

Your logic is flawed. For example, the W3C DOM implementation in browsers
does not always have such a way.

>> So you had better RTFM yourself before making the suggestion.
>
> I had glimpsed over the manual and found DOMDocument::loadHTML()[3] as
> well as DOMXPath[4]...

Non sequitur.

PointedEars
--
> If you get a bunch of authors […] that state the same "best practices"
> in any programming language, then you can bet who is wrong or right...
Not with javascript. Nonsense propagates like wildfire in this field.
-- Richard Cornford, comp.lang.javascript, 2011-11-14

Re: Help with regex [message #184989 is a reply to message #184968]

Sun, 23 February 2014 16:57

Christoph Michael Becker
Messages: 207
Registered: June 2013

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:

> Christoph Michael Becker wrote:
>
>> Thomas 'PointedEars' Lahn wrote:
>>> Christoph Michael Becker wrote:
>>>> Thomas 'PointedEars' Lahn wrote:
>>>> > knal wrote:
>>>> >> Could you please give a little more direction on the DOM?
>>>> > I presume he could.
>>>>
>>>> Actually, I have not yet used PHP's DOM extension. However, it should
>>>> not be too different from other DOM implementations, so basically it
>>>> comes down to load an HTML document and than use XPath to get the
>>>> desired information. The mentioned keywords should be sufficient to
>>>> search the web -- at least for some first steps.
>>>
>>> However, implementations of the W3C DOM Core API do not require a way so
>>> that a document tree can be built from source code, nor is XPath support
>>> required.
>>
>> I was not explicitly speaking about the DOM *Core* API,
>
> But that is PHP’s implementation. DOMXPath resembles, but does not conform
> to, DOM3 XPath.

I believe you're right, even if the PHP manual doesn't state compliance
with the W3C DOM API, let alone any particular level or (for lack of a
better word) "section" (with this I mean "Core", "Events" etc.).

>> but rather of DOM implementations in a broad sense. As you surely are
>> aware, there is the DOM Level 3 Load and Save Specification[1] as well as
>> the XML Path Language (XPath) Version 1.0[2], both of which are W3C
>> recommendations since a long time.
>
> And both of which are *optional*. PHP implements none of them, but it
> supports alternative ways.
>
>> While XPath support might be missing from some "DOM"
>> implementations, either implicit or explicit loading of a document
>> source most likely won't (otherwise the implementation would not allow
>> to work with existing document sources, and as such would not be too
>> useful).
>
> Your logic is flawed. For example, the W3C DOM implementation in browsers
> does not always have such a way.

With "implicit loading" I meant that a browser makes the current
document accessible via DOM objects and methods.

>>> So you had better RTFM yourself before making the suggestion.
>>
>> I had glimpsed over the manual and found DOMDocument::loadHTML()[3] as
>> well as DOMXPath[4]...
>
> Non sequitur.

I didn't mean to imply that the existence of these features means that
any particular W3C DOM API is available.

--
Christoph M. Becker
