PHP functions to convert markup efficiently [message #183811]
Thu, 21 November 2013 18:30
James Harris
I am looking for a way to mark up text in a way that PHP would be able to
efficiently and quickly convert to HTML.
I could either use an existing markup language or design a new one but I
wanted to know which PHP functions would be ideal to use to process it
most efficiently. To a large extent that will guide the choice of markup
tags if I have to design it myself.
For example, it looks like I could choose between PHP's expand(), fgets()
and regular expression handling.
Of course, implementations may differ slightly but, on average, are there
certain PHP approaches that could be expected to be faster than others? What
is the accepted wisdom?
James

Re: PHP functions to convert markup efficiently [message #183815 is a reply to message #183813]
Thu, 21 November 2013 19:10
Christoph Michael Bec
James Harris wrote:
> "The Natural Philosopher" <tnp(at)invalid(dot)invalid> wrote in message
> news:l6lk04$e54$1(at)news(dot)albasani(dot)net...
>> On 21/11/13 18:30, James Harris wrote:
>>> I am looking for a way to mark up text in a way that PHP would be able to
>>> efficiently and quickly convert to HTML.
>>>
>>
>> mark the text up with HTML!
>
> No good. Too complex to vet, not secure (people other than me could add
> markup)
There are tools which help with this, e.g. <http://htmlpurifier.org/>.
> and would not allow enhanced functions that I anticipate needing to
> add. It needs to be markup I can control.
Then you may consider XML. There are several libraries dealing with XML
that come bundled with PHP[1].
>>> I could either use an existing markup language or design a new one but I
>>> wanted to know of which PHP functions would be ideal to use to process it
>>> most efficiently. To a large extent that will guide the choice of markup
>>> tags if I have to design it myself.
>>>
>>> For example, it looks like I could choose between PHP's expand(), fgets()
>>> and regular expression handling.
I have never heard of a PHP function called expand(). Anyway, if you want to
use your own markup, regular expressions are most likely the way to go,
as scanning by characters might be too slow in pure PHP.
>>> Of course, implementations may differ slightly but, on average, are there
>>> certain PHP approaches that could be expected to be faster than others?
Most likely those which are programmed in C.
[1] <http://php.net/manual/en/refs.xml.php>
--
Christoph M. Becker

Re: PHP functions to convert markup efficiently [message #183817 is a reply to message #183816]
Thu, 21 November 2013 20:18
Christoph Michael Bec
James Harris wrote:
>>>> Of course, implementations may differ slightly but, on average, are there
>>>> certain PHP approaches that could be expected to be faster than others?
>>>
>>> Most likely those which are programmed in C.
> Is there a way to identify those?
The core of PHP and all bundled extensions are written in C, as well as
all PECL packages. PEAR packages and many other libraries are
programmed in pure PHP.
> That may not be the only consideration. ISTM that even if the regular
> expression handler is programmed in C simple string handling (hopefully also
> programmed in C) should be much faster as long as it is written sensibly.
Of course, a strpos() call is faster than the corresponding preg_match(), but
simple string functions are not as powerful as regular expressions, so
more calls are needed and their cost adds up. Scanning character by
character, in turn, might involve a lot of conditional statements. For
relatively small texts, however, the performance might not matter at all.
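A rough way to check this on a particular installation is a micro-benchmark along the following lines. It is only a sketch: the sample text, the pattern and the iteration count are assumptions chosen for illustration, and the timings will differ between PHP versions and machines.

```php
<?php
// Micro-benchmark sketch: strpos() versus an equivalent preg_match().
// Sample text, pattern and iteration count are illustrative assumptions.
$text = str_repeat("plain text without markup ", 200) . "@(b)bold@(/b)";
$iterations = 20000;

// Time the plain string search.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    $pos = strpos($text, "@(");
}
$strposSeconds = microtime(true) - $start;

// Time the regular expression search for the same two characters.
$start = microtime(true);
for ($i = 0; $i < $iterations; $i++) {
    preg_match('/@\(/', $text, $m, PREG_OFFSET_CAPTURE);
    $pregPos = $m[0][1]; // byte offset of the match
}
$pregSeconds = microtime(true) - $start;

printf("strpos:     %.4f s (offset %d)\n", $strposSeconds, $pos);
printf("preg_match: %.4f s (offset %d)\n", $pregSeconds, $pregPos);
```

Both loops find the same offset, so any difference in the printed times comes from the search function rather than from the input.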
--
Christoph M. Becker

Re: PHP functions to convert markup efficiently [message #183876 is a reply to message #183855]
Sat, 23 November 2013 16:38
James Harris
"Arno Welzel" <usenet(at)arnowelzel(dot)de> wrote in message
news:528F7B7D(dot)4030701(at)arnowelzel(dot)de...
> Am 22.11.2013 13:48, schrieb James Harris:
>
>> Having looked at Markdown and Smarty that people suggested it might be
>> better if I bite the bullet and develop my own markup language. It would
>> then be easier both to limit what it allows and also to add features as
>> needed. I have been running some syntax tests and it seems quite easy to
>> do and fast to process even though the markup is currently a little
>> cumbersome to write.
>
> I think this is not a good idea.
>
> Existing markup languages are well documented and there are existing
> implementations to parse the markup and convert it to HTML (or other
> formats) which are not only used by one person - so it is likely that
> bugs will be fixed as well within a reasonable time.
Noted.
> Besides Markdown and Smarty you can also try DokuWiki - its markup can
> be extended using syntax plugins (and you can create your own plugins as
> well to handle block level or inline elements etc.).
I took a look at DokuWiki (having previously looked at the others). It's
good but does things I don't want. At first glance I couldn't see a way to
take them away and to restrict the formatting it accepts. I'm not sure I
want to invest the time needed to learn any of those packages when, at least
for now, it seems that native PHP is easy enough to use and is far more
flexible.
>>> What kind of markup is needed?
>>
>> Quite a lot. Things like these:
>> * bold, italic
>> * links to local pages and remote URLs
>> * images, code
>> * headers, lists, line breaks
>> * tables
>> * other things will likely be required but are not defined yet
>
> So - if HTML is too complicated you should really try DokuWiki first.
>
> And if even this syntax is too complicated you should not use a markup
> language at all but a WYSIWYG editor which produces valid HTML (for
> example TinyMCE or CKEditor).
I am not trying to avoid complexity. Using PHP to convert markup to HTML
allows me to do things like these:
* restrict the elements that can be used (for security)
* add features such as a server-side TOC
* pull in data from various sources
* choose where to place elements such as footnotes
* make each page of the site a consistent structure
Basically, the combination of HTML, CSS, PHP and my own markup codes seems
ideal. Aside from having to devise the coding the rest is completely
standard and incredibly lightweight. As such, there will be no packages and
associated bugfixes to install and it should be very fast.
I'll keep in mind that there are prebuilt options, though, in case I run
into difficulties as I work on this.
James

Re: PHP functions to convert markup efficiently [message #183877 is a reply to message #183811]
Sat, 23 November 2013 19:21
James Harris
"James Harris" <james(dot)harris(dot)1(at)gmail(dot)com> wrote in message
news:l6ljfj$fue$1(at)dont-email(dot)me...
> I am looking for a way to mark up text in a way that PHP would be able to
> efficiently and quickly convert to HTML.
In case anyone is interested, here is what I have come up with so far.
The markup is designed to be fast to parse rather than to be beautiful.
However, it doesn't look too bad, IMO. I'll explain the markup first and
then the PHP which carries out the conversion.
This is very much experimental at this stage. I may well have to change any
of this including the tag formats. But it is working code as it stands.
There are simple tags which have a one-to-one translation. Here are some
examples. The markup is on the left and what it translates to on the right.
@(hr) --> <hr>
@(b) --> <b>
@(/b) --> </b>
@(nl) --> <br>
@(at) --> @
For example, "Please @(b)STOP@(/b) here" will print STOP in bold and the
rest non-bold.
There are markup tags with simple parameters such as these.
@(h,2) --> <h2>
@(/h,2) --> </h2>
And there are tags which are more inclusive such as these.
@(sect,2,Section X) --> <h2>Section X</h2>
@(link,Local Page) --> <a href="Local Page">Local Page</a>
@(link,http://xe.com,XE) --> <a href="http://xe.com">XE</a>
As you can see, a markup tag is identified by an @ sign followed by an
opening delimiter. The opening delimiter is "(" in all the above cases but
could be a different character. Each opening delimiter character has a
corresponding closing delimiter. For most punctuation characters the closing
delimiter is the same as the opening delimiter but for pairable bracket
characters the logical closing bracket is used. Therefore the following all
mean the same.
@(i)
@[i]
@|i|
@*i*
The point is that the person writing the code can and must choose a closing
delimiter that does not appear in the text between the delimiters. This is
to help recognition speed; the complete tag can be isolated without needing
to consider context such as quoted strings.
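The delimiter rule described above could be sketched as a small lookup. The function name and the exact set of bracket pairs are assumptions made for illustration, and the type declarations assume PHP 7 or later. "<" is left out on purpose because htmlspecialchars() would already have turned it into &lt; by the time a delimiter is read.

```php
<?php
// Hypothetical helper: map an opening delimiter to its closing delimiter.
function closing_delimiter(string $open): string
{
    // Pairable brackets close with their logical partner
    // (which brackets count as pairable is an assumption).
    static $pairs = ['(' => ')', '[' => ']', '{' => '}'];

    // Any other character closes with itself.
    return $pairs[$open] ?? $open;
}

// Print the pairing for the four example delimiters from the post.
foreach (['(', '[', '|', '*'] as $open) {
    echo $open, ' closes with ', closing_delimiter($open), "\n";
}
```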
I haven't performed timing comparisons but I took Christoph's advice for
speed and chose to use PHP's built-in functions, which are likely written in C.
I try to avoid calling them repeatedly so as to avoid call overhead. As a
result, the markup parsing works as follows. Feel free to criticise.
First, the page of marked-up text has htmlspecialchars() applied and then is
split on @ symbols using a single call to PHP's explode(). This creates an
array of strings which, for the sake of something to name them, I call
Sections. The PHP code is, in essence, as follows.
$contents = file_get_contents($target_page);
$contents = htmlspecialchars($contents, ENT_NOQUOTES);
$sects = explode("@", $contents);
$contents = ""; //Original text no longer needed
The first section, $sects[0], is what preceded the first @ sign. It is not
marked up so it is written verbatim and then split off using the following
code.
echo $sects[0];
$sects = array_slice($sects, 1);
Second, for each remaining section the initial character (which followed an
@ sign) is taken as an opening delimiter and a matching closing delimiter is
chosen. Then explode() is called with a limit of 2 to split the section into
just two parts: before and after the closing delimiter. The most important
part of that is
$sectparts = explode($delimiter, substr($sect, 1), 2);
This converts each section into two parts: a tag and some text.
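As a small worked example of that split, using the "Please @(b)STOP@(/b) here" text from earlier in this post (the variable values are filled in by hand here purely for illustration):

```php
<?php
// One section, i.e. the text that followed an "@" sign.
$sect = "(b)STOP";
$delimiter = ")";   // closing partner of the leading "("

// Drop the opening delimiter, then split once at the closing delimiter.
$sectparts = explode($delimiter, substr($sect, 1), 2);
// $sectparts[0] is "b" (the tag), $sectparts[1] is "STOP" (the text)

// The limit of 2 leaves later occurrences of the delimiter in the text part.
$sectparts2 = explode(")", substr("(i)see (note) here", 1), 2);
// $sectparts2[0] is "i", $sectparts2[1] is "see (note) here"

var_dump($sectparts, $sectparts2);
```

If the closing delimiter is missing, explode() returns a one-element array, so any code that reads $sectparts[1] would need a guard for that case.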
Third, so that tag parameters can include whatever is necessary, especially
where they include commas in quoted strings, I use the CSV module as
follows.
$tagparts = str_getcsv($sectparts[0]);
That divides the complete tag into manageable parts. All that's left is to
deal with each part as in
switch ($tagparts[0]) {
    case "at": echo "@"; break;
    case "b": echo "<b>"; break;
    // ... one case for each remaining tag ...
}
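The str_getcsv() step can be tried on its own. In this sketch the quoted title is a made-up example, and the separator, enclosure and escape arguments are spelled out explicitly only because recent PHP versions complain when the escape default is relied upon.

```php
<?php
// A plain tag: parameters separated by commas.
$tagparts = str_getcsv('link,http://xe.com,XE', ',', '"', '\\');
// ['link', 'http://xe.com', 'XE']

// A quoted parameter may itself contain a comma (hypothetical title).
$quoted = str_getcsv('sect,2,"Costs, in detail"', ',', '"', '\\');
// ['sect', '2', 'Costs, in detail']

print_r($tagparts);
print_r($quoted);
```

This only works because htmlspecialchars() was called with ENT_NOQUOTES earlier, so the double quotes are still literal quotes when the tag is parsed.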
Finally, once the tag has been written the following non-tag text is written
with
echo $sectparts[1];
That's it so far. I may have missed something fundamental but so far it
seems to work well. It is simple and flexible and the code is very short. No
need for a complex package. There are a few functions I would rather have
not had to use but PHP seems to require them. In any case, the code avoids
things which might slow it down such as large packages, char-by-char
processing (except, presumably, in the CSV module) and regular expressions.
So it should be fast as it stands.
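Pulled together, the steps described above might look like the following sketch. It is not the original code: the function name, the handling of unknown or unterminated tags, the choice of bracket pairs and the escaping of quotes in link targets are all assumptions, and it builds a string instead of echoing as it goes. It assumes PHP 7 or later.

```php
<?php
// Hypothetical end-to-end sketch of the conversion described in this post.
function convert_markup(string $source): string
{
    $pairs = ['(' => ')', '[' => ']', '{' => '}'];   // assumed bracket pairs
    $sects = explode('@', htmlspecialchars($source, ENT_NOQUOTES));
    $out = array_shift($sects);          // text before the first "@"

    foreach ($sects as $sect) {
        if ($sect === '') {              // stray "@" with nothing after it
            $out .= '@';
            continue;
        }
        $close = $pairs[$sect[0]] ?? $sect[0];
        $parts = explode($close, substr($sect, 1), 2);
        if (count($parts) < 2) {         // no closing delimiter: keep as text
            $out .= '@' . $sect;
            continue;
        }
        $tag = str_getcsv($parts[0], ',', '"', '\\');
        switch ($tag[0]) {
            case 'at': $out .= '@'; break;
            case 'hr': $out .= '<hr>'; break;
            case 'nl': $out .= '<br>'; break;
            case 'b':  $out .= '<b>'; break;
            case '/b': $out .= '</b>'; break;
            case 'h':  $out .= '<h' . (int)($tag[1] ?? 1) . '>'; break;
            case '/h': $out .= '</h' . (int)($tag[1] ?? 1) . '>'; break;
            case 'link':
                // ENT_NOQUOTES left quotes alone, so escape them for the attribute.
                $href = str_replace('"', '&quot;', $tag[1] ?? '');
                $label = $tag[2] ?? $href;
                $out .= '<a href="' . $href . '">' . $label . '</a>';
                break;
            default:                     // unknown tag: leave it visible
                $out .= '@' . $sect;
                continue 2;              // skip the trailing-text append below
        }
        $out .= $parts[1];               // non-tag text after the tag
    }
    return $out;
}

echo convert_markup('Please @(b)STOP@(/b) here, or mail me@(at)example.org.'), "\n";
```

The sketch does not validate link targets, so a scheme check would still be needed before accepting URLs from other people.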
James

Re: PHP functions to convert markup efficiently [message #183879 is a reply to message #183876]
Sat, 23 November 2013 21:24
Richard Yates
On Sat, 23 Nov 2013 16:38:45 -0000, "James Harris"
<james(dot)harris(dot)1(at)gmail(dot)com> wrote:
> "Arno Welzel" <usenet(at)arnowelzel(dot)de> wrote in message
> news:528F7B7D(dot)4030701(at)arnowelzel(dot)de...
>> Am 22.11.2013 13:48, schrieb James Harris:
>>
>>> Having looked at Markdown and Smarty that people suggested it might be
>>> better if I bite the bullet and develop my own markup language. It would
>>> then be easier both to limit what it allows and also to add features as
>>> needed. I have been running some syntax tests and it seems quite easy to
>>> do and fast to process even though the markup is currently a little
>>> cumbersome to write.
>>
>> I think this is not a good idea.
>>
>> Existing markup languages are well documented and there are existing
>> implementations to parse the markup and convert it to HTML (or other
>> formats) which are not only used by one person - so it is likely that
>> bugs will be fixed as well within a reasonable time.
>
> Noted.
>
>> Besides Markdown and Smarty you can also try DokuWiki - its markup can
>> be extended using syntax plugins (and you can create your own plugins as
>> well to handle block level or inline elements etc.).
>
> I took a look at DokuWiki (having previously looked at the others). It's
> good but does things I don't want. At first glance I couldn't see a way to
> take them away and to restrict the formatting it accepts. I'm not sure I
> want to invest the time needed to learn any of those packages when, at least
> for now, it seems that native PHP is easy enough to use and is far more
> flexible.
>
>>>> What kind of markup is needed?
>>>
>>> Quite a lot. Things like these:
>>> * bold, italic
>>> * links to local pages and remote URLs
>>> * images, code
>>> * headers, lists, line breaks
>>> * tables
>>> * other things will likely be required but are not defined yet
>>
>> So - if HTML is too complicated you should really try DokuWiki first.
>>
>> And if even this syntax is too complicated you should not use a markup
>> language at all but a WYSIWYG editor which produces valid HTML (for
>> example TinyMCE or CKEditor).
>
> I am not trying to avoid complexity. Using PHP to convert markup to HTML
> allows me to do things like these:
> * restrict the elements that can be used (for security)
> * add features such as a server-side TOC
> * pull in data from various sources
> * choose where to place elements such as footnotes
> * make each page of the site a consistent structure
>
> Basically, the combination of HTML, CSS, PHP and my own markup codes seems
> ideal. Aside from having to devise the coding the rest is completely
> standard and incredibly lightweight. As such, there will be no packages and
> associated bugfixes to install and it should be very fast.
>
> I'll keep in mind that there are prebuilt options, though, in case I run
> into difficulties as I work on this.
Can you use HTML codes, plus any markup you invent, but sanitize the
input by stripping any HTML or other tags that you do not want or that
could be a risk?
I have a page where users can enter raw MySQL queries to generate
reports. The first thing that happens to input is to check that only
SELECT queries are processed (plus a lot of other safeguards). I also
devised a 'COPY from table where index=x' command that allows copying
one record easily. So, the page uses a limited form of a standard
markup, supplemented with extras, and is completely safe.
Seems you could do the same with HTML.

Re: PHP functions to convert markup efficiently [message #183880 is a reply to message #183879]
Sat, 23 November 2013 22:07
James Harris
"Richard Yates" <richard(at)yatesguitar(dot)com> wrote in message
news:co6299tplsjmmmr0uovsldejl3431363l5(at)4ax(dot)com...
....
>> I am not trying to avoid complexity. Using PHP to convert markup to HTML
>> allows me to do things like these:
>> * restrict the elements that can be used (for security)
>> * add features such as a server-side TOC
>> * pull in data from various sources
>> * choose where to place elements such as footnotes
>> * make each page of the site a consistent structure
>>
>> Basically, the combination of HTML, CSS, PHP and my own markup codes
>> seems ideal. Aside from having to devise the coding the rest is completely
>> standard and incredibly lightweight. As such, there will be no packages
>> and associated bugfixes to install and it should be very fast.
>>
>> I'll keep in mind that there are prebuilt options, though, in case I run
>> into difficulties as I work on this.
>
> Can you use HTML codes, plus any markup you invent, but sanitize the
> input by stripping any HTML or other tags that you do not want or that
> could be a risk?
Theoretically yes but that would be hard to do and much slower. Consider
that if you see <p> on a page you don't know whether it is an HTML paragraph
tag or not unless you know its context. It might be part of a Java program,
for example, as in
f<p>();
or it could be just an insignificant piece of text that should appear as
written. The only way to tell for sure is to parse the file from the top and
recognise every element that precedes it. That would be a lot of work.
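The ambiguity is easy to reproduce with PHP's own strip_tags(), which has no idea whether <p> is markup or part of the text. This is only a sketch with a made-up input string, contrasted with the htmlspecialchars() call used in the custom-markup approach.

```php
<?php
// Made-up input in which "<p>" is meant literally, not as an HTML tag.
$text = 'Call f<p>(); to start.';

// Stripping treats "<p>" as a tag and silently changes the code.
$stripped = strip_tags($text);
// 'Call f(); to start.'

// Allowing <p> keeps it, but the browser would then open a paragraph.
$kept = strip_tags($text, '<p>');
// 'Call f<p>(); to start.'

// Escaping everything keeps the text exactly as written when displayed.
$escaped = htmlspecialchars($text, ENT_NOQUOTES);
// 'Call f&lt;p&gt;(); to start.'

echo $stripped, "\n", $kept, "\n", $escaped, "\n";
```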
> I have a page where users can enter raw MySQL queries to generate
> reports. The first thing that happens to input is to check that only
> SELECT queries are processed (plus a lot of other safeguards). I also
> devised a 'COPY from table where index=x' command that allows copying
> one record easily. So, the page uses a limited form of a standard
> markup, supplemented with extras, and is completely safe.
>
> Seems you could do the same with HTML.
It is possible but would require lots of parsing code. By contrast,
converting markup to HTML can be made much easier. FWIW, I found I could do
something that works much more simply and wrote it up in a post made just a
few hours ago.
James