Re: preg_match() oddities and question [message #176064 is a reply to message #176061] Tue, 22 November 2011 11:47
In article <mr-5B96D1(dot)12212022112011(at)News(dot)Individual(dot)NET>,
Sandman <mr(at)sandman(dot)net> wrote:
> So I have this regexp:
>
> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){

You don't need the commas in the character class, unless you want to
match a literal comma, in which case you only need it once.
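
A quick sketch of the point about commas, using made-up test strings: inside a character class a comma is an ordinary literal, so the class as posted also accepts commas. The variable names here are only for illustration.

```php
<?php
// Inside [...] a comma is a literal character, not a separator.
$withCommas    = '/^[A-Z,a-z,-]*$/';  // character class as posted
$withoutCommas = '/^[A-Za-z-]*$/';    // same class minus the commas

var_dump(preg_match($withCommas, 'a,b'));     // int(1): the comma is accepted
var_dump(preg_match($withoutCommas, 'a,b'));  // int(0): the comma is rejected
var_dump(preg_match($withoutCommas, 'a-b'));  // int(1): the trailing "-" is a literal hyphen
```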

> $streetname = uc_words($m[1]);
> $streetnumber = trim($m[2]);
> $streetletter = strtoupper($m[3]);
> $search = trim($streetname . SPACE . $streetnumber .
> $streetletter);
> }
>
> The desired result is taking the input ($search) and splitting it into
> its parts as an address, right? $search can be, for example, "foo
> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".

What about "foo street"? (i.e. with a space, but no number)

> So, if I print_r($m) with different input I get:
>
> Array
> (
> [0] => foo street 34
> [1] => foo street
> [2] => 34
> [3] =>
> )
> Array
> (
> [0] => longstreet 45b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
> Array
> (
> [0] => longstreet 45 b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
>
> You get the idea. But problems arise when I search for the streetname
> alone:
>
> Array
> (
> [0] => longstreet
> [1] =>
> [2] =>
> [3] => longstreet
> )

And you would also get:

Array
(
[0] => foo street
[1] => foo
[2] =>
[3] => street
)
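
To reproduce these results, something like the following should do. It runs the pattern exactly as posted; the constant and the helper name split_original() are invented for this sketch.

```php
<?php
// The pattern exactly as posted. Every group may match the empty string,
// so the engine is free to hand a trailing word to the "letter" group.
const ORIGINAL_PATTERN = '/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/';

// Hypothetical helper: returns [name, number, letter] as captured.
function split_original(string $search): array
{
    preg_match(ORIGINAL_PATTERN, $search, $m);
    return [$m[1], $m[2], $m[3]];
}

print_r(split_original('foo street 34')); // 'foo street', '34', ''
print_r(split_original('longstreet'));    // '', '', 'longstreet'
print_r(split_original('foo street'));    // 'foo', '', 'street'
```

The first group is non-greedy, so it stops as soon as the rest of the pattern can match, and a trailing run of letters satisfies the last group just as well as a house letter would.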

> As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
> search term since there are no digits and the first group is
> non-greedy. And if I make the first group greedy, "longstreet" is
> matched correctly, but it also catches the entire "longstreet 45b"
> when searching for that.

Yes, you need to define your rules more closely. Not at the regex level,
but actually at the logic/decision level. If you can make rules that
can unambiguously specify how all kinds of input should be parsed,
then you can look at how to represent that in regexes. You might need
some additional logic to operate on the parsed result.
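
As an illustration only, here is one possible set of rules written down first and then turned into a pattern. The rules themselves are an assumption on my part, not something stated in the original post, and parse_address() is a hypothetical name:

```php
<?php
// Assumed rules (adjust to your real data):
//   1. The street name is everything up to an optional trailing house number.
//   2. A house number is a run of digits preceded by whitespace.
//   3. A single letter may follow the number, with or without a space.
function parse_address(string $search): ?array
{
    // The number/letter part is one optional group, so a lone word
    // can only ever be a street name.
    $pattern = '/^(.+?)(?:\s+(\d+)\s*([A-Za-z]?))?$/';
    if (!preg_match($pattern, trim($search), $m)) {
        return null; // e.g. empty input
    }
    return [
        'name'   => ucwords($m[1]),
        'number' => $m[2] ?? '',             // absent when no number was given
        'letter' => strtoupper($m[3] ?? ''), // absent or empty when no letter was given
    ];
}

print_r(parse_address('longstreet 45 b')); // name=Longstreet, number=45, letter=B
print_r(parse_address('foo street'));      // name=Foo Street, number and letter empty
```

Because a letter is only allowed after a number, "longstreet" and "foo street" can no longer end up in the letter group. Inputs these rules do not cover would need further rules or post-processing.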

> Also, when searching for a term in swedish characters, I get this:
>
> Array
> (
> [0] => vikavägen
> [1] => vikavä
> [2] =>
> [3] => gen
> )
>
> Which is quite odd to me, why isn't "vikavägen" matched the same
> (undesired) way as "longstreet"? I have tried the /u modifier, and
> made sure that it was utf8-encoded, but it didn't make a difference
> (incoming encoding is ISO 8859-1).
>
> Why the difference, and how do I correctly parse out parts as needed?

That's because ä is not in the set A-Za-z. If you want a character class
that properly recognises locale-specific letters, you need to change your
character class above to this:

[[:alpha:]\-]
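
What [[:alpha:]] matches beyond ASCII depends on the locale in effect, so an alternative that does not rely on the locale is the Unicode letter property \p{L} together with the /u modifier. That needs the subject to be valid UTF-8, so ISO 8859-1 input has to be converted first. A minimal sketch, assuming the mbstring extension is available for the conversion:

```php
<?php
// "vikavägen" as ISO 8859-1 bytes (ä is the single byte 0xE4).
$latin1 = "vikav\xE4gen";

// Assumption: mbstring is installed; iconv() would be another option.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

$asciiOnly = '/^[A-Za-z-]+$/';  // ASCII letters and hyphen only
$anyLetter = '/^[\p{L}-]+$/u';  // any Unicode letter or hyphen; /u needs UTF-8 input

var_dump(preg_match($asciiOnly, $utf8));   // int(0): ä is outside A-Za-z
var_dump(preg_match($anyLetter, $utf8));   // int(1)
var_dump(preg_match($anyLetter, $latin1)); // bool(false): subject is not valid UTF-8
```

The last line shows why adding /u alone changes nothing useful while the subject is still ISO 8859-1.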

Hope this helps!
Tony
--
Tony Mountifield
Work: tony(at)softins(dot)co(dot)uk - http://www.softins.co.uk
Play: tony(at)mountifield(dot)org - http://tony.mountifield.org