Re: preg_match() oddities and question [message #176065 is a reply to message #176064]
Tue, 22 November 2011 12:12
Sandman
In article <jag24m$4nj$1(at)softins(dot)clara(dot)co(dot)uk>,
tony(at)mountifield(dot)org (Tony Mountifield) wrote:
>> So I have this regexp:
>>
>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>
> You don't need the commas in the character class, unless you want to
> match a literal comma, in which case you only need it once.
Right, thanks :)
>> $streetname = uc_words($m[1]);
>> $streetnumber = trim($m[2]);
>> $streetletter = strtoupper($m[3]);
>> $search = trim($streetname . SPACE . $streetnumber .
>> $streetletter);
>> }
>>
>> The desired result is taking the input ($search) and splitting it into
>> its parts as an address, right? $search can be, for example, "foo
>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>
> What about "foo street"? (i.e. with a space, but no number)
Exactly, that gets this:
Array
(
[0] => foo street
[1] => foo
[2] =>
[3] => street
)
Which is incorrect. In fact, the last group SHOULD be defined as
([A-Za-z]{0,1}) but that still messes it up like:
Array
(
[0] => foo street
[1] => foo stree
[2] =>
[3] => t
)
So I've tried variations for that as well.
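
For reference, a small reproduction of the two results above. It is only a sketch; the variable names are made up for the example. Because every piece of the pattern is allowed to match nothing, the engine settles on the first split that reaches the end of the string, and that split is not the intended one.

```php
<?php
$search = 'foo street';

// Original pattern: lazy first group, optional digits, lazy letter group.
preg_match('/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/', $search, $lazy);

// Variation with the last group limited to at most one letter.
preg_match('/^(.*?)\s*(\d*?)\s*([A-Za-z]{0,1})$/', $search, $one_letter);

print_r($lazy);        // [1] => foo,       [2] => (empty), [3] => street
print_r($one_letter);  // [1] => foo stree, [2] => (empty), [3] => t
```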
<snip>
> And you would also get:
>
> Array
> (
> [0] => foo street
> [1] => foo
> [2] =>
> [3] => street
> )
>
>> As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
>> search term since there are no digits and the first group is
>> non-greedy. And if I make the first group greedy, "longstreet" is
>> matched correctly, but it also catches the entire "longstreet 45b"
>> when searching for that.
>
> Yes, you need to define your rules more closely. Not at the regex level,
> but actually at the logic/decision level. If you can make rules that
> can unambiguously specify how all kinds of input should be parsed,
> then you can look at how to represent that in regexes. You might need
> some additional logic to operate on the parsed result.
What you're basically suggesting is a series of regexps to find out
what "style" an address is given in, and then parse out the parts?
Because I'm not sure how I would be able to do it without a series of
if/else preg_match() calls?
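
One possible reading of that advice, as a rough sketch: fix the rule first, then write the pattern. The rule assumed here is that a letter only counts as a street letter when a house number comes directly before it. Under that assumption a single pattern with an optional "number plus letter" tail may be enough, so a chain of if/else calls is not necessarily needed. The function name parse_address() and the array keys are invented for the example.

```php
<?php
// Hypothetical helper, not from the thread.
// Assumed rule: a letter suffix is only recognised after a house number.
function parse_address(string $search): ?array
{
    // Group 1: street name (lazy). Optional tail: whitespace, digits,
    // optional whitespace, optional single letter.
    $pattern = '/^(.*?)(?:\s+(\d+)\s*([A-Za-z])?)?$/';

    if (!preg_match($pattern, trim($search), $m)) {
        return null; // e.g. input containing a line break
    }

    // Trailing groups that did not take part in the match are absent from $m.
    return [
        'name'   => $m[1],
        'number' => $m[2] ?? '',
        'letter' => strtoupper($m[3] ?? ''),
    ];
}

$inputs = ['foo street 34', 'longstreet 45b', 'longstreet 45 b', 'longstreet', 'foo street'];
foreach ($inputs as $input) {
    print_r(parse_address($input));
}
```

With this rule "foo street" stays together as the name, and "longstreet 45b" and "longstreet 45 b" both give number 45 and letter B. Addresses that follow some other convention would need their own rule.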
>> Also, when searching for a term in swedish characters, I get this:
>>
>> Array
>> (
>> [0] => vikavägen
>> [1] => vikavä
>> [2] =>
>> [3] => gen
>> )
>>
>> Which is quite odd to me, why isn't "vikavägen" matched the same
>> (undesired) way as "longstreet"? I have tried the /u modifier, and
>> made sure that it was utf8-encoded, but it didn't make a difference
>> (incoming encoding is ISO 8859-1).
>>
>> Why the difference, and how do I correctly parse out parts as needed?
>
> That's because ä is not in the set A-Za-z. If you want a character class
> that properly recognises locale-specific letters, you need to change your
> character class above to this:
>
> [[:alpha:]\-]
>
> Hope this helps!
That explains the difference, thank you very much for that. Now I
still need to figure out a global parse routine or criteria for
parsing out the address parts...
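
A side note on the /u attempt, as a sketch rather than a recommendation: the modifier on its own does not change what [A-Za-z] means, which would explain why it made no difference. An alternative to the POSIX class that was not suggested in the thread is the Unicode property \p{L} together with /u, after converting the input to UTF-8. The sample string and variable names are assumptions for the example.

```php
<?php
// Assumption: the search term arrives as ISO 8859-1, as mentioned in the thread.
$latin1 = "vikav\xE4gen";                         // 0xE4 is a-umlaut in ISO 8859-1
$utf8   = iconv('ISO-8859-1', 'UTF-8', $latin1);  // /u requires valid UTF-8 input

$ascii_only = preg_match('/^[A-Za-z]+$/u', $utf8); // 0: the umlaut is outside A-Za-z
$any_letter = preg_match('/^\p{L}+$/u', $utf8);    // 1: \p{L} matches any Unicode letter

var_dump($ascii_only, $any_letter);
```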
--
Sandman[.net]