Re: preg_match() oddities and question [message #176095 is a reply to message #176088]
Wed, 23 November 2011 13:53
Peter H. Coffin
On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
> Lahn <PointedEars(at)web(dot)de> wrote:
>
>>>> "10 East 42nd Street, New York, NY 10017, USA".
>>>
>>> That wouldn't be a normal Swedish address, no. :)
>>
>> You had not limited the country or the language of your street
>> addresses.
>
> Well, in my defense, the subject line was "preg_match() and swedish
> characters" until I changed it. I hadn't changed it when I wrote my
> examples.
>
>> My point is that parsing a street name and a house number from a
>> street address is a hard problem that cannot be solved only by
>> applying one regular expression.
>
> Right, but your example is not a valid argument for that conclusion.
> My examples contained the variations of addresses that I wanted
> to match. Or are you saying that there is no way to use regular
> expressions to catch the examples I gave? Because I have a hard time
> believing that.

Address matching is a hard task. I did it professionally for a decade
(as part of a job, not the sole function). It's not easy to do well
for even one postal system, and writing a generalized matcher is
basically impossible to manage in one lifetime. The best *simple* way
to manage it is to take a field, blow it out into individual words,
and standardize all the words you can find without trying to sort out
what they are (which is the Very Hard part of that task). Throw the
alphabetic ones into Soundex or NYSIIS, make a loose match by a chunk
of postal code or city code or province, then pick the item(s) with
the greatest number of matches between the incoming record and the
loose-match record on the numeric and NYSIIS-encoded alphabetic
elements. If you weight
things like "numeric match = 1, dictionary word that matches under
NYSIIS = 2, non-dictionary text that matches under NYSIIS = 3", and do
that for NAME as well as ADDRESS, you get about as good a result as
you can get without buying someone else's work. And that's STILL a lot
of effort to write. Regexp alone for address matching is a snipe hunt.
It looks obviously right and you can spend a lot of time playing with
it, but it ends up being a dead end.
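
For illustration only, here is a rough sketch of that scoring scheme in
Python. Several things in it are assumptions, not part of any real
system: the tiny ABBREVIATIONS and DICTIONARY tables are made up,
Soundex stands in for NYSIIS because it is shorter to write out, only
ASCII letters are encoded (Swedish characters would need extra
handling), and the candidate list is assumed to have already been
narrowed by the loose postal-code match.

```python
import re

# Hypothetical table for the "standardize the words" step.
ABBREVIATIONS = {"st": "street", "str": "street", "ave": "avenue", "rd": "road"}

# Hypothetical dictionary of common words. Matches on these are less
# distinctive than matches on non-dictionary text such as proper names.
DICTIONARY = {"street", "avenue", "road", "east", "west", "north", "south"}

# Letter -> digit table for American Soundex.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv", "2": "cgjkqsxz", "3": "dt", "4": "l", "5": "mn", "6": "r",
    }.items()
    for letter in letters
}


def soundex(word):
    """American Soundex: first letter plus three digits (ASCII letters only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    code = word[0].upper()
    prev = SOUNDEX_CODES.get(word[0], "")
    for ch in word[1:]:
        digit = SOUNDEX_CODES.get(ch, "")
        if digit and digit != prev:
            code += digit
        if ch not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def tokens(field):
    """Blow a field out into lowercase words and standardize known ones."""
    return [ABBREVIATIONS.get(w, w) for w in re.findall(r"\w+", field.lower())]


def elements(field):
    """Map each comparable element of a field to its weight."""
    result = {}
    for word in tokens(field):
        if any(ch.isdigit() for ch in word):
            result[word] = 1  # numeric element, compared verbatim
        else:
            key = soundex(word)
            if key:
                # dictionary word = 2, non-dictionary text = 3
                result[key] = 2 if word in DICTIONARY else 3
    return result


def score(incoming, candidate):
    """Sum the weights of the elements the two fields have in common."""
    a, b = elements(incoming), elements(candidate)
    return sum(min(a[k], b[k]) for k in a.keys() & b.keys())


def best_matches(incoming, candidates):
    """Return the candidate(s) with the highest non-zero score."""
    scored = [(score(incoming, c), c) for c in candidates]
    top = max((s for s, _ in scored), default=0)
    return [c for s, c in scored if s == top and s > 0]
```

You would call best_matches() with the incoming field and the loosely
matched records, and do the same for the NAME field, adding the two
scores together. Note that nothing in it tries to decide which word is
the street name and which is the house number, which is the whole point.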
--
_ o
|/)