Re: preg_match() oddities and question [message #176098 is a reply to message #176095]
Wed, 23 November 2011 18:01
Sandman
Messages: 32 Registered: August 2011
In article <slrnjcpunb(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
"Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:
> On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
>
>> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
>> Lahn <PointedEars(at)web(dot)de> wrote:
>>
>>>> > "10 East 42nd Street, New York, NY 10017, USA".
>>>>
>>>> That wouldn't be a normal Swedish address, no. :)
>>>
>>> You had not limited the country or the language of your street
>>> addresses.
>>
>> Well, in my defense, the subject line was "preg_match() and swedish
>> characters" until I changed it. I hadn't changed it when I wrote my
>> examples.
>>
>>> My point is that parsing a street name and a house number from a
>>> street address is a hard problem that cannot be solved only by
>>> applying one regular expression.
>>
>> Right, but your example is not a valid argument for that conclusion.
>> My examples contained the variations of addresses that I wanted
>> to match. Or are you saying that there is no way to use regular
>> expressions to catch the examples I gave? Because I have a hard time
>> believing that.
>
> Address-matching is a hard task. I did that for a decade professionally
> (as part of a job, not the sole function), and it's not easy to do well
> for even one postal system, and trying to write a generalized one is
> basically impossible to manage in one lifetime. The best *simple* way
> to manage it is to take a field, blow it out into individual words,
> standardize all the words you can find without trying to sort out
> what they are (which is the Very Hard part of that task), throw the
> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
> postal code or city code or province, then pick the item(s) that have
> the greatest number of matches between incoming and loose-match record
> of the numeric and nysiis-encoded alphabetical elements. If you weight
> things like "numeric match = 1, plaintext that's in a dictionary that
> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
> and do that for NAME as well as ADDRESS, you get about as good as you
> can get without buying someone else's work. And that's STILL a lot of
> effort to write. Regexp alone for address matching is a snipe-hunt. It
> looks obviously right and you can spend a lot of time playing with it,
> but it ends up being a dead end.
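
For anyone following along, the word-scoring approach described in the quote above could be sketched roughly like this. It is only an illustration under several assumptions: it is written in Python, it uses a plain ASCII Soundex in place of NYSIIS (so å, ä and ö are simply dropped), and the names ABBREVIATIONS, tokens, score and best_matches as well as the record layout (dicts with "address" and "postcode") are made up for the example.

```python
import re

# Soundex digit per consonant; vowels, h, w and y carry no code.
_CODES = {c: d for d, letters in {
    "1": "bfpv", "2": "cgjkqsxz", "3": "dt",
    "4": "l", "5": "mn", "6": "r"}.items() for c in letters}

# Hypothetical standardisation table; a real one would be far larger.
ABBREVIATIONS = {"st": "street", "ave": "avenue", "rd": "road"}


def soundex(word):
    """Classic four-character Soundex key (ASCII letters only)."""
    word = re.sub(r"[^a-z]", "", word.lower())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        if ch not in "hw":  # h and w do not separate equal codes
            prev = code
    return (key + "000")[:4]


def tokens(address):
    """Blow the field out into lowercase words and standardise the known ones."""
    words = re.findall(r"\w+", address.lower())
    return [ABBREVIATIONS.get(w, w) for w in words]


def score(incoming, candidate, dictionary):
    """Weights from the quoted post: numeric match = 1, dictionary word
    with a phonetic match = 2, non-dictionary word with a phonetic match = 3."""
    cand = tokens(candidate)
    cand_numbers = {t for t in cand if t.isdigit()}
    cand_keys = {soundex(t) for t in cand if not t.isdigit()}
    total = 0
    for t in tokens(incoming):
        if t.isdigit():
            if t in cand_numbers:
                total += 1
        else:
            key = soundex(t)
            if key and key in cand_keys:
                total += 2 if t in dictionary else 3
    return total


def best_matches(incoming, records, dictionary, prefix_len=3):
    """Loose-match on a chunk of the postal code, then keep the top scorers."""
    prefix = incoming["postcode"][:prefix_len]
    pool = [r for r in records if r["postcode"].startswith(prefix)]
    scored = [(score(incoming["address"], r["address"], dictionary), r)
              for r in pool]
    top = max((s for s, _ in scored), default=0)
    return [r for s, r in scored if s == top and s > 0]
```

With records for "Storgatan 12", "Storgatan 14" and "Kungsgatan 12" that share the first three postcode digits, a misspelled incoming "Storgaten 12" should land on the first record only, since both the street key and the house number agree. Note that nothing here tries to decide which word is the street name and which is the house number, which is the part the quote calls Very Hard.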
Thank you for your input, but I still maintain that my examples
could be parsed with a regular expression, and I won't concede
otherwise until someone shows me a counterexample :-D
No offense, though.
--
Sandman[.net]