Help with regex [message #184950] |
Wed, 19 February 2014 04:18  |
knal
Messages: 3 Registered: February 2014
Karma: 0
|
Junior Member |
Hi there,
I'd like to parse users' textfield HTML input with a regex. The user writes regular html, including links with a hash (mysite.com/somedocument/file.html#thehash)
I want to extract all unique <a hrefs> that match #thehash but wouldn't know how to build the regex. (so i can sum them up in another location)
Any help would be greatly appreciated.
Thanks in advance,
Knal
|
|
|
Re: Help with regex [message #184951 is a reply to message #184950] |
Wed, 19 February 2014 09:20   |
|
knal wrote:
^^^^
Please use your real name here.
> I'd like to parse users' textfield HTML input with a regex. The user
> writes regular html,
^^^^^^^^^^^^
What do you mean by that?
> including links with a hash
> (mysite.com/somedocument/file.html#thehash) I want to extract all unique
> <a hrefs> that match #thehash but wouldn't know how to build the regex.
> (so i can sum them up in another location)
RTFM: <http://php.net/pcre>
PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann
|
|
|
Re: Help with regex [message #184952 is a reply to message #184951] |
Wed, 19 February 2014 09:45   |
knal
Hi PointedEars,
With regular HTML I meant basic HTML, submitted with basic WYSIWYG-features like links, basic text styles (b, i, u, etc.), images, some divs, etc.
I understand this wasn't clear, sorry for that.
Regarding the RTFM: I've read on the regex, and also understood it's not intended for what I want to use it for.
I've come here for help because I was unable to find a similar situation that could help me toward a solution, nor was I able to find any other method I could use.
Kind regards,
John Knal Peters,
On Wednesday, 19 February 2014 15:20:00 UTC+1, Thomas 'PointedEars' Lahn wrote:
> knal wrote:
> ^^^^
> Please use your real name here.
>
>> I'd like to parse users' textfield HTML input with a regex. The user
>> writes regular html,
> ^^^^^^^^^^^^
> What do you mean by that?
>
>> including links with a hash
>> (mysite.com/somedocument/file.html#thehash) I want to extract all unique
>> <a hrefs> that match #thehash but wouldn't know how to build the regex.
>> (so i can sum them up in another location)
>
> RTFM: <http://php.net/pcre>
>
> PointedEars
> --
> realism: HTML 4.01 Strict
> evangelism: XHTML 1.0 Strict
> madness: XHTML 1.1 as application/xhtml+xml
> -- Bjoern Hoehrmann
|
|
|
|
Re: Help with regex [message #184954 is a reply to message #184953] |
Wed, 19 February 2014 10:17   |
knal
Hi Christoph,
Thanks for your reply.
I am aware of the security issues, but the users are part of a restricted group submitting to a restricted part of the website.
I do however filter the input for SQL injections etc, just to be sure.
Could you please give a little more direction on the DOM?
Thank you in advance.
Kind regards,
John Knal Peters,
|
|
|
Re: Help with regex [message #184955 is a reply to message #184952] |
Wed, 19 February 2014 10:30   |
|
knal wrote:
^^^^
Please fix.
> Hi PointedEars,
Hello. This is a one-to-many (or many-to-many) communications medium.
Therefore, addressing only one person is impolite towards the other
potential readers.
> With regular HTML I meant basic HTML, submitted with basic
> WYSIWYG-features like links, basic text styles (b, i, u, etc.), images,
> some divs, etc. I understand this wasn't clear, sorry for that.
But even such “basic HTML” is not a regular language because many element
types can be part of their own content.
> Regarding the RTFM: I've read on the regex, and also understood it's not
> intended for what I want to use it for.
That depends on *how* you use it. Because HTML is not a regular language,
unless you would accept only a subset of HTML that is regular, a single
application of a single regular expression will not suffice (even if that
expression is not a regular one in the computer-science sense of the term).
> I've come here for help because I was unable to find a similar
> situation that could help me toward a solution, nor was I able to find any
> other method I could use.
In general, you are looking for a markup parser (which is an implementation
of a push-down automaton, a PDA). Either write one yourself (you *can* use
regular expressions there) or use those that are available (Christoph has
pointed out a reasonable one).
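
For illustration, a minimal sketch of the parser-based route for the original question, assuming PHP's bundled DOM extension is available. The function name `extract_links_with_hash` and the sample markup are hypothetical; the thread itself gives no code. It lets the HTML parser find the `<a>` elements and compares each URL fragment with `parse_url()` instead of a regex:

```php
<?php
/**
 * Return the unique href values of <a> elements whose URL fragment
 * equals $hash (given without the leading "#").
 * Hypothetical helper, not taken from the thread.
 */
function extract_links_with_hash(string $html, string $hash): array
{
    if (trim($html) === '') {
        return []; // loadHTML() rejects empty input
    }

    // User-submitted HTML is rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    foreach ($doc->getElementsByTagName('a') as $anchor) {
        $href = $anchor->getAttribute('href');
        // parse_url() also handles scheme-less URLs such as "file.html#x".
        if (parse_url($href, PHP_URL_FRAGMENT) === $hash) {
            $found[$href] = true; // array keys keep the list unique
        }
    }
    return array_keys($found);
}

// Usage with made-up input, including an unclosed <b>:
$sample = '<p><a href="mysite.com/somedocument/file.html#thehash">one</a> '
        . '<a href="mysite.com/somedocument/file.html#thehash">again</a> '
        . '<a href="other.html#else">other</a> <b>unclosed</p>';
echo implode("\n", extract_links_with_hash($sample, 'thehash')), "\n";
```

Using the hrefs as array keys is just one simple way to drop duplicates while keeping document order.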
> [Top post]
<http://www.netmeister.org/news/learn2quote.html>
--
PointedEars
|
|
|
Re: Help with regex [message #184956 is a reply to message #184954] |
Wed, 19 February 2014 10:37   |
|
knal wrote:
^^^^
Unless you (try to) fix this, this will be the last I see of your postings.
> I am aware of the security issues, but the users are part of a restricted
> group submitting to a restricted part of the website. I do however filter
> the input for SQL injections etc, just to be sure.
>
> Could you please give a little more direction on the DOM?
I presume he could. Ask what you want to know instead.
Reformulating your stupid question to the smart one, “How can I use the DOM
here?”, the answer becomes “RTFM: loadHTML”.
Next time, post what you have tried before you ask. This is not free
customer support for lazy people.
<http://www.catb.org/~esr/faqs/smart-questions.html>
Not quoting anything is even worse than quoting everything because now the
reader has nothing to refer to. That is _not_ what the two of us meant when
we asked you to stop *top*-posting. Learn to quote.
PointedEars
--
Anyone who slaps a 'this page is best viewed with Browser X' label on
a Web page appears to be yearning for the bad old days, before the Web,
when you had very little chance of reading a document written on another
computer, another word processor, or another network. -- Tim Berners-Lee
|
|
|
|
Re: Help with regex [message #184958 is a reply to message #184954] |
Wed, 19 February 2014 13:48   |
Jerry Stuckle
Messages: 2598 Registered: September 2010
Karma: 0
|
Senior Member |
On 2/19/2014 10:17 AM, knal wrote:
> Hi Christoph,
>
> Thanks for your reply.
>
> I am aware of the security issues, but the users are part of a restricted group submitting to a restricted part of the website.
> I do however filter the input for SQL injections etc, just to be sure.
>
> Could you please give a little more direction on the DOM?
>
> Thank you in advance.
>
> Kind regards,
> John Knal Peters,
>
Knal,
Check out http://www.php.net/manual/en/book.dom.php. PHP's DOM can be a
bit complicated at first, but you can catch on. Despite the comments,
it works with HTML, also. And it will be a whole lot easier than trying
to parse HTML yourself.
If the HTML is well-formed, you might try the simpleXML extension
(http://www.php.net/manual/en/book.simplexml.php), but I've found it
doesn't deal well with malformed input; the DOM extension does much better.
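
A small sketch of that difference, assuming both the SimpleXML and DOM extensions are enabled; the malformed fragment and the variable names are made up for illustration. SimpleXML expects well-formed XML, whereas `DOMDocument::loadHTML()` uses a forgiving HTML parser:

```php
<?php
// Hypothetical malformed input: the <b> element is never closed.
$fragment = '<p>An <b>unclosed element</p>';

// Keep libxml warnings out of the output while experimenting.
$previous = libxml_use_internal_errors(true);

// SimpleXML: returns false when the input is not well-formed XML.
$xml = simplexml_load_string($fragment);
$simpleXmlOk = ($xml !== false);

// DOM: the HTML parser repairs the markup and still builds a tree.
$doc = new DOMDocument();
$domOk = $doc->loadHTML($fragment);
$boldCount = $doc->getElementsByTagName('b')->length;

libxml_clear_errors();
libxml_use_internal_errors($previous);

var_dump($simpleXmlOk, $domOk, $boldCount);
```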
P.S. Don't mind "Pointed Head". He's just being his usual pedantic
self. He gets this way when someone asks a question he doesn't
understand (which seems to be pretty often).
--
==================
Remove the "x" from my email address
Jerry Stuckle
jstucklex(at)attglobal(dot)net
==================
|
|
|
|
Re: Help with regex [message #184961 is a reply to message #184960] |
Wed, 19 February 2014 16:28   |
|
Christoph Michael Becker wrote:
> Thomas 'PointedEars' Lahn wrote:
>> knal wrote:
>>> Could you please give a little more direction on the DOM?
>> I presume he could.
>
> Actually, I have not yet used PHP's DOM extension. However, it should
> not be too different from other DOM implementations, so basically it
> comes down to loading an HTML document and then using XPath to get the
> desired information. The mentioned keywords should be sufficient to
> search the web -- at least for some first steps.
However, implementations of the W3C DOM Core API are not required to provide
a way to build a document tree from source code, nor is XPath support
required. So you had better RTFM yourself before making the suggestion.
There are few if any occasions where such wild guesses are useful. Lucky
for you, you hit the nail this time: there are corresponding methods in
this (PHP’s) implementation (so there is no need to search the Web).
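
As a hedged sketch of those methods, the following combines `DOMDocument::loadHTML()` with `DOMXPath::query()`. The function name `xpath_links_with_hash` is hypothetical, and the fragment `thehash` is hard-coded in the expression purely to keep the example short; a variable fragment would need careful quoting before being placed in the XPath string:

```php
<?php
/**
 * Return unique href values of <a> elements whose fragment is "thehash",
 * selected with an XPath 1.0 expression. Hypothetical helper.
 */
function xpath_links_with_hash(string $html): array
{
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html); // expects non-empty input
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    // substring-after() yields the text following the first "#" in the href.
    $nodes = $xpath->query('//a[substring-after(@href, "#") = "thehash"]/@href');

    $hrefs = [];
    foreach ($nodes as $attribute) {
        $hrefs[] = $attribute->nodeValue;
    }
    return array_values(array_unique($hrefs));
}
```

XPath 1.0 has no `ends-with()`, which is why the sketch uses `substring-after()` for the comparison.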
>> Ask what you want to know instead.
>
> Is there really so much difference between directly asking the question
> "How can I use the DOM here?" and asking politely for "some more
> direction on the DOM"?
Yes, …
>> Reformulating your stupid question to the smart one, “How can I use the
>> DOM here?”, the answer becomes “RTFM: loadHTML”.
>>
>> Next time, post what you have tried before you ask. This is not free
>> customer support for lazy people.
>>
>> <http://www.catb.org/~esr/faqs/smart-questions.html>
… as explained in there.
HTH
PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)
|
|
|
|
Re: Help with regex [message #184965 is a reply to message #184954] |
Thu, 20 February 2014 01:29   |
Peter H. Coffin
Messages: 245 Registered: September 2010
Karma: 0
|
Senior Member |
On Wed, 19 Feb 2014 07:17:17 -0800 (PST), knal wrote:
> I am aware of the security issues, but the users are part of a
> restricted group submitting to a restricted part of the website. I do
> however filter the input for SQL injections etc, just to be sure.
Okay, if they're part of a limited and presumably trusted group, just
let 'em loose to write whatever HTML they want. Or don't give them any
at all.
The core point is that the thing that you're asking to do is a Very Hard
Problem. It's so hard, in fact, that most places that would otherwise
allow limited markup, like you're proposing to do, tend to do it by using
their OWN markup tags, invalidating actual HTML (a la htmlentities()),
and then parsing for their own tags and substituting real HTML tags on
the output side. Because that's WAY LESS WORK and a lot more reliable
than what you're hoping to do.
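
A rough sketch of that approach, under stated assumptions: the bracket tags `[b]`, `[i]` and `[url=...]` and the function name `render_custom_markup` are invented for illustration and are not something the thread specifies. All real HTML is neutralised with `htmlentities()` first, and only the site's own tags are turned into HTML on output:

```php
<?php
/**
 * Escape all user-supplied HTML, then translate a tiny, made-up
 * bracket markup into real tags. Hypothetical helper.
 */
function render_custom_markup(string $input): string
{
    // Step 1: invalidate actual HTML so "<script>" becomes inert text.
    $out = htmlentities($input, ENT_QUOTES, 'UTF-8');

    // Step 2: substitute the site's own tags with real HTML.
    $out = preg_replace('~\[b\](.*?)\[/b\]~s', '<b>$1</b>', $out);
    $out = preg_replace('~\[i\](.*?)\[/i\]~s', '<i>$1</i>', $out);

    // Links: only http(s) URLs are accepted, so "javascript:" stays plain text.
    $out = preg_replace(
        '~\[url=(https?://[^\]\s]+)\](.*?)\[/url\]~s',
        '<a href="$1">$2</a>',
        $out
    );

    return $out;
}

// Usage with made-up input:
echo render_custom_markup('<script>x</script> [b]bold[/b]'), "\n";
```

Because escaping happens before substitution, the only HTML that reaches the page is what the replacement patterns emit.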
--
10. I will not interrogate my enemies in the inner sanctum -- a small
hotel well outside my borders will work just as well.
--Peter Anspach's list of things to do as an Evil Overlord
|
|
|
|
Re: Help with regex [message #184968 is a reply to message #184967] |
Thu, 20 February 2014 20:25   |
|
Christoph Michael Becker wrote:
> Thomas 'PointedEars' Lahn wrote:
>> Christoph Michael Becker wrote:
>>> Thomas 'PointedEars' Lahn wrote:
>>>> knal wrote:
>>>> > Could you please give a little more direction on the DOM?
>>>> I presume he could.
>>>
>>> Actually, I have not yet used PHP's DOM extension. However, it should
>>> not be too different from other DOM implementations, so basically it
>>> comes down to loading an HTML document and then using XPath to get the
>>> desired information. The mentioned keywords should be sufficient to
>>> search the web -- at least for some first steps.
>>
>> However, implementations of the W3C DOM Core API are not required to provide
>> a way to build a document tree from source code, nor is XPath support
>> required.
>
> I was not explicitly speaking about the DOM *Core* API,
But that is PHP’s implementation. DOMXPath resembles, but does not conform
to, DOM3 XPath.
> but rather of DOM implementations in a broad sense. As you surely are
> aware, there is the DOM Level 3 Load and Save Specification[1] as well as
> the XML Path Language (XPath) Version 1.0[2], both of which have been W3C
> Recommendations for a long time.
And both of which are *optional*. PHP implements neither of them, but it
supports alternative ways.
> While XPath support might be missing from some "DOM"
> implementations, either implicit or explicit loading of a document
> source most likely won't be (otherwise the implementation would not allow
> working with existing document sources, and as such would not be too
> useful).
Your logic is flawed. For example, the W3C DOM implementation in browsers
does not always have such a way.
>> So you had better RTFM yourself before making the suggestion.
>
> I had glanced over the manual and found DOMDocument::loadHTML()[3] as
> well as DOMXPath[4]...
Non sequitur.
PointedEars
--
> If you get a bunch of authors […] that state the same "best practices"
> in any programming language, then you can bet who is wrong or right...
Not with javascript. Nonsense propagates like wildfire in this field.
-- Richard Cornford, comp.lang.javascript, 2011-11-14
|
|
|
|