FUDforum: comp.lang.php » preg_match() oddities and question


preg_match() oddities and question [message #176061]

Tue, 22 November 2011 11:21

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

So I have this regexp:

if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
$streetname = uc_words($m[1]);
$streetnumber = trim($m[2]);
$streetletter = strtoupper($m[3]);
$search = trim($streetname . SPACE . $streetnumber .
$streetletter);
}

The desired result is to take the input ($search) and split it into
its parts as an address, right? $search can be, for example, "foo
street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".

So, if I print_r($m) with different input I get:

Array
(
[0] => foo street 34
[1] => foo street
[2] => 34
[3] =>
)
Array
(
[0] => longstreet 45b
[1] => longstreet
[2] => 45
[3] => b
)
Array
(
[0] => longstreet 45 b
[1] => longstreet
[2] => 45
[3] => b
)

You get the idea. But problems arise when I search for the streetname
alone:

Array
(
[0] => longstreet
[1] =>
[2] =>
[3] => longstreet
)

As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
search term since there are no digits and the first group is
non-greedy. And if I make the first group greedy, "longstreet" is
matched correctly, but it also catches the entire "longstreet 45b"
when searching for that.

Also, when searching for a term with Swedish characters, I get this:

Array
(
[0] => vikavägen
[1] => vikavä
[2] =>
[3] => gen
)

Which is quite odd to me: why isn't "vikavägen" matched the same
(undesired) way as "longstreet"? I have tried the /u modifier, and
made sure that it was utf8-encoded, but it didn't make a difference
(incoming encoding is ISO 8859-1).

Why the difference, and how do I correctly parse out parts as needed?

Any help is appreciated.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176062 is a reply to message #176061]

Tue, 22 November 2011 11:26

The Natural Philosoph
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

Sandman wrote:

>
> Why the difference, and how do I correctly parse out parts as needed?
>

I have always found that working out the correct regular expression
takes longer than writing my own filters in whatever language I happen
to be using....

Life is too short for regexps.

> Any help is appreciated.

Re: preg_match() oddities and question [message #176063 is a reply to message #176062]

Tue, 22 November 2011 11:36

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <jag0sr$49h$1(at)news(dot)albasani(dot)net>,
The Natural Philosopher <tnp(at)invalid(dot)invalid> wrote:

>> Why the difference, and how do I correctly parse out parts as needed?
>
> I have always found establishing the correct regexp expression to take
> longer than writing my own filters in whatever language I happened to
> be using....
>
> Life is too short for regexps.

I've never had much problem (time-wise) with regexps. I'm just stumped
about the difference in execution of this one, and need a little help
figuring out the syntax for the other part. In short, regexps are
rarely a problem for me, and I don't know how I would solve my
situation without using one. If you have any suggestions, please share
:)

--
Sandman[.net]

Re: preg_match() oddities and question [message #176064 is a reply to message #176061]

Tue, 22 November 2011 11:47

tony
Messages: 19
Registered: December 2010

Karma: 0

Junior Member

In article <mr-5B96D1(dot)12212022112011(at)News(dot)Individual(dot)NET>,
Sandman <mr(at)sandman(dot)net> wrote:
> So I have this regexp:
>
> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){

You don't need the commas in the character class, unless you want to
match a literal comma, in which case you only need it once.
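
To illustrate the point about commas: inside a character class a comma has no special meaning, so the original class also accepts literal commas. A small sketch (the variable names and test strings are made up for illustration):

```php
<?php
// Inside [...] a comma is just another literal character.
$withCommas    = '/^[A-Z,a-z,-]*$/';
$withoutCommas = '/^[A-Za-z-]*$/';

var_dump(preg_match($withCommas, 'a,b'));    // int(1): the comma is accepted
var_dump(preg_match($withoutCommas, 'a,b')); // int(0): the comma is rejected
var_dump(preg_match($withoutCommas, 'a-b')); // int(1): a trailing "-" in the class is a literal hyphen
```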

> $streetname = uc_words($m[1]);
> $streetnumber = trim($m[2]);
> $streetletter = strtoupper($m[3]);
> $search = trim($streetname . SPACE . $streetnumber .
> $streetletter);
> }
>
> The desired result is to take the input ($search) and split it into
> its parts as an address, right? $search can be, for example, "foo
> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".

What about "foo street"? (i.e. with a space, but no number)

> So, if I print_r($m) with different input I get:
>
> Array
> (
> [0] => foo street 34
> [1] => foo street
> [2] => 34
> [3] =>
> )
> Array
> (
> [0] => longstreet 45b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
> Array
> (
> [0] => longstreet 45 b
> [1] => longstreet
> [2] => 45
> [3] => b
> )
>
> You get the idea. But problems arise when I search for the streetname
> alone:
>
> Array
> (
> [0] => longstreet
> [1] =>
> [2] =>
> [3] => longstreet
> )

And you would also get:

Array
(
[0] => foo street
[1] => foo
[2] =>
[3] => street
)

> As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
> search term since there are no digits and the first group is
> non-greedy. And if I make the first group greedy, "longstreet" is
> matched correctly, but it also catches the entire "longstreet 45b"
> when searching for that.

Yes, you need to define your rules more closely. Not at the regex level,
but actually at the logic/decision level. If you can make rules that
can unambiguously specify how all kinds of input should be parsed,
then you can look at how to represent that in regexes. You might need
some additional logic to operate on the parsed result.
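
As one possible illustration of "rules first, regex second", here is a minimal sketch. The rules are assumptions, not something stated in the original post: the house number is a run of digits at the end of the input, a letter is a single ASCII letter that is only allowed after a number, and everything before that is the street name. The function name parseStreetAddress() is hypothetical.

```php
<?php
// Hypothetical helper: splits "name [number[letter]]" into its parts.
// Assumed rules (not from the original post):
//   - the house number is a run of digits at the end of the input
//   - the letter is a single ASCII letter, allowed only after a number
//   - everything before that is the street name
function parseStreetAddress(string $search): ?array
{
    // The number and letter live together in one optional group, so a
    // trailing word such as "street" can never be taken for the letter.
    $pattern = '/^(.+?)(?:\s+(\d+)\s*([A-Za-z])?)?$/';
    if (!preg_match($pattern, trim($search), $m)) {
        return null;
    }
    return [
        'name'   => $m[1],
        'number' => $m[2] ?? '',              // unset when no number was given
        'letter' => strtoupper($m[3] ?? ''),  // unset when no letter was given
    ];
}

$inputs = ['foo street 34', 'longstreet 45b', 'longstreet 45 b', 'longstreet', 'foo street'];
foreach ($inputs as $input) {
    print_r(parseStreetAddress($input));
}
```

The design choice is that the digits and the letter are optional only as a unit, which removes the ambiguity between "street name ending in letters" and "letter suffix". Inputs outside the assumed rules (a number in front, several numbers, postal codes) would still need extra logic.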

> Also, when searching for a term in swedish characters, I get this:
>
> Array
> (
> [0] => vikavägen
> [1] => vikavä
> [2] =>
> [3] => gen
> )
>
> Which is quite odd to me: why isn't "vikavägen" matched the same
> (undesired) way as "longstreet"? I have tried the /u modifier, and
> made sure that it was utf8-encoded, but it didn't make a difference
> (incoming encoding is ISO 8859-1).
>
> Why the difference, and how do I correctly parse out parts as needed?

That's because ä is not in the set A-Za-z. If you want a character class
that properly recognises locale-specific letters, you need to change your
character class above to this:

[[:alpha:]\-]
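
A note on the class above: how [[:alpha:]] treats bytes outside ASCII depends on the locale in effect and on the encoding of the input, so results can differ between servers. One alternative, sketched here under the assumption that the input really arrives as ISO 8859-1 and is converted to UTF-8 first, is the Unicode property \p{L} together with the /u modifier. The sample string and the pattern are illustrative, not taken from the original post.

```php
<?php
// "vikavägen 12b" as ISO 8859-1 bytes: ä is the single byte 0xE4.
$latin1 = "vikav\xE4gen 12b";

// Convert to UTF-8 before matching with /u.
$utf8 = iconv('ISO-8859-1', 'UTF-8', $latin1);

// \p{L} matches any Unicode letter when the /u modifier is set.
$pattern = '/^([\p{L} -]+?)(?:\s+(\d+)\s*(\p{L})?)?$/u';

// With /u, a subject that is not valid UTF-8 is an error, not a "no match".
var_dump(preg_match($pattern, $latin1));   // bool(false)

var_dump(preg_match($pattern, $utf8, $m)); // int(1)
print_r($m);                               // name, number and letter in $m[1..3]
```

With /u, preg_match() returns false for a subject that is not valid UTF-8, which may be one reason why adding /u on its own appears to make no difference.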

Hope this helps!
Tony
--
Tony Mountifield
Work: tony(at)softins(dot)co(dot)uk - http://www.softins.co.uk
Play: tony(at)mountifield(dot)org - http://tony.mountifield.org

Re: preg_match() oddities and question [message #176065 is a reply to message #176064]

Tue, 22 November 2011 12:12

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <jag24m$4nj$1(at)softins(dot)clara(dot)co(dot)uk>,
tony(at)mountifield(dot)org (Tony Mountifield) wrote:

>> So I have this regexp:
>>
>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>
> You don't need the commas in the character class, unless you want to
> match a literal comma, in which case you only need it once.

Right, thanks :)

>> $streetname = uc_words($m[1]);
>> $streetnumber = trim($m[2]);
>> $streetletter = strtoupper($m[3]);
>> $search = trim($streetname . SPACE . $streetnumber .
>> $streetletter);
>> }
>>
>> The desired result is to take the input ($search) and split it into
>> its parts as an address, right? $search can be, for example, "foo
>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>
> What about "foo street"? (i.e. with a space, but no number)

Exactly, that gets this:

Array
(
[0] => foo street
[1] => foo
[2] =>
[3] => street
)

Which is incorrect. In fact, the last group SHOULD be defined as
([A-Za-z]{0,1}) but that still messes it up, like so:

Array
(
[0] => foo street
[1] => foo stree
[2] =>
[3] => t
)

So I've tried variations for that as well.

<snip>

> And you would also get:
>
> Array
> (
> [0] => foo street
> [1] => foo
> [2] =>
> [3] => street
> )
>
>> As you can see, the last group "([A-Z,a-z,-]*?)" matches the entire
>> search term since there are no digits and the first group is
>> non-greedy. And if I make the first group greedy, "longstreet" is
>> matched correctly, but it also catches the entire "longstreet 45b"
>> when searching for that.
>
> Yes, you need to define your rules more closely. Not at the regex level,
> but actually at the logic/decision level. If you can make rules that
> can unambiguously specify how all kinds of input should be parsed,
> then you can look at how to represent that in regexes. You might need
> some additional logic to operate on the parsed result.

What you're basically suggesting is a series of regexps to find out
what "style" an address is given in, and then parse out the parts?
Because I'm not sure how I would be able to do it without a series of
if/else preg_match() calls.

>> Also, when searching for a term in swedish characters, I get this:
>>
>> Array
>> (
>> [0] => vikavägen
>> [1] => vikavä
>> [2] =>
>> [3] => gen
>> )
>>
>> Which is quite odd to me: why isn't "vikavägen" matched the same
>> (undesired) way as "longstreet"? I have tried the /u modifier, and
>> made sure that it was utf8-encoded, but it didn't make a difference
>> (incoming encoding is ISO 8859-1).
>>
>> Why the difference, and how do I correctly parse out parts as needed?
>
> That's because ä is not in the set A-Za-z. If you want a character class
> that properly recognises locale-specific letters, you need to change your
> character class above to this:
>
> [[:alpha:]\-]
>
> Hope this helps!

That explains the difference, thank you very much for that. Now I
still need to figure out a global parse routine or criteria for
parsing out the address parts...

--
Sandman[.net]

Re: preg_match() oddities and question [message #176068 is a reply to message #176062]

Tue, 22 November 2011 12:22

Jerry Stuckle
Messages: 2598
Registered: September 2010

Karma: 0

Senior Member

On 11/22/2011 6:26 AM, The Natural Philosopher wrote:
> Sandman wrote:
>
>>
>> Why the difference, and how do I correctly parse out parts as needed?
>>
>
> I have always found establishing the correct regexp expression to take
> longer than writing my own filters in whatever language I happened to be
> using....
>
>
> Life is too short for regexps.
>

That's because regexes take intelligence - which you don't have.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================

Re: preg_match() oddities and question [message #176069 is a reply to message #176061]

Tue, 22 November 2011 12:30

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> So I have this regexp:
>
> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
> $streetname = uc_words($m[1]);
> $streetnumber = trim($m[2]);
> $streetletter = strtoupper($m[3]);
> $search = trim($streetname . SPACE . $streetnumber .
> $streetletter);
> }
>
> The desired result is to take the input ($search) and split it into
> its parts as an address, right? $search can be, for example, "foo
> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".

"10 East 42nd Street, New York, NY 10017, USA".

PointedEars
--
When all you know is jQuery, every problem looks $(olvable).

Re: preg_match() oddities and question [message #176071 is a reply to message #176069]

Tue, 22 November 2011 12:55

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <1670168(dot)aK4W3vaeNJ(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

> Sandman wrote:
>
>> So I have this regexp:
>>
>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>> $streetname = uc_words($m[1]);
>> $streetnumber = trim($m[2]);
>> $streetletter = strtoupper($m[3]);
>> $search = trim($streetname . SPACE . $streetnumber .
>> $streetletter);
>> }
>>
>> The desired result is to take the input ($search) and split it into
>> its parts as an address, right? $search can be, for example, "foo
>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>
> "10 East 42nd Street, New York, NY 10017, USA".

That wouldn't be a normal Swedish address, no. :)

--
Sandman[.net]

Re: preg_match() oddities and question [message #176077 is a reply to message #176071]

Tue, 22 November 2011 16:56

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:
>> Sandman wrote:
>>> So I have this regexp:
>>>
>>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>>> $streetname = uc_words($m[1]);
>>> $streetnumber = trim($m[2]);
>>> $streetletter = strtoupper($m[3]);
>>> $search = trim($streetname . SPACE . $streetnumber .
>>> $streetletter);
>>> }
>>>
>>> The desired result is to take the input ($search) and split it into
>>> its parts as an address, right? $search can be, for example, "foo
>>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>>
>> "10 East 42nd Street, New York, NY 10017, USA".
>
> That wouldn't be a normal swedish address, no. :)

You had not limited the country or the language of your street addresses.

My point is that parsing a street name and a house number from a street
address is a hard problem that cannot be solved only by applying one regular
expression.

PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann

Re: preg_match() oddities and question [message #176079 is a reply to message #176077]

Tue, 22 November 2011 17:30

The Natural Philosoph
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:
> Sandman wrote:
>
>> Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:
>>> Sandman wrote:
>>>> So I have this regexp:
>>>>
>>>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>>>> $streetname = uc_words($m[1]);
>>>> $streetnumber = trim($m[2]);
>>>> $streetletter = strtoupper($m[3]);
>>>> $search = trim($streetname . SPACE . $streetnumber .
>>>> $streetletter);
>>>> }
>>>>
>>>> The desired result is to take the input ($search) and split it into
>>>> its parts as an address, right? $search can be, for example, "foo
>>>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>>> "10 East 42nd Street, New York, NY 10017, USA".
>> That wouldn't be a normal swedish address, no. :)
>
> You had not limited the country or the language of your street addresses.
>
> My point is that parsing a street name and a house number from a street
> address is a hard problem that cannot be solved only by applying one regular
> expression.
>

Quite right. It is worse than you can possibly imagine, at least here in the
UK, where addresses can be as short as 2 lines or as long as 6.

So

10 Wonkers place, LONDON EC3 7QY is a typical TOWN address

Out in the sticks you might get

Apartment 4b, the Old Town House, Shire Lane, Recketts Green, Nr
Stonehouse, Gloucestershire GL13 6AH

And if that comes at you without commas, god help you.

I have spent DAYS taking name/address fields and parsing them *manually*
into structured tables...

>
> PointedEars

Re: preg_match() oddities and question [message #176084 is a reply to message #176079]

Tue, 22 November 2011 23:20

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Tue, 22 Nov 2011 17:30:45 +0000, The Natural Philosopher wrote:
> Quite right. It is worse than you can possibly imagine, at least here in the
> UK, where addresses can be as short as 2 lines or as long as 6.
>
> So
>
> 10 Wonkers place, LONDON EC3 7QY is a typical TOWN address
>
> Out in the sticks you might get
>
> Apartment 4b, the Old Town House, Shire Lane, Recketts Green, Nr
> Stonehouse, Gloucestershire GL13 6AH
>
>
> And if that comes at you without commas, god help you.
>
> I have spent DAYS taking name/address fields and parsing them *manually*
> into structured tables...

It is at this point that most people that have an actual need to solve
these kinds of problems turn to the available commercial software and
decide to solve it with money instead of manpower.

--
They got rid of it because they judged it more trouble than it was
worth. (And considering they'd gone to great lengths to minimize its
worth, I suppose they were right.)
-- J. D. Baldwin

Re: preg_match() oddities and question [message #176085 is a reply to message #176084]

Tue, 22 November 2011 23:59

The Natural Philosoph
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

Peter H. Coffin wrote:
> On Tue, 22 Nov 2011 17:30:45 +0000, The Natural Philosopher wrote:
>> Quite right. It is worse than you can possibly imagine, at least here in the
>> UK, where addresses can be as short as 2 lines or as long as 6.
>>
>> So
>>
>> 10 Wonkers place, LONDON EC3 7QY is a typical TOWN address
>>
>> Out in the sticks you might get
>>
>> Apartment 4b, the Old Town House, Shire Lane, Recketts Green, Nr
>> Stonehouse, Gloucestershire GL13 6AH
>>
>>
>> And if that comes at you without commas, god help you.
>>
>> I have spent DAYS taking name/address fields and parsing them *manually*
>> into structured tables...
>
> It is at this point that most people that have an actual need to solve
> these kinds of problems turn to the available commercial software and
> decide to solve it with money instead of manpower.
>
There is no AI that can match a human brain in decoding human idiocy... yet.

Re: preg_match() oddities and question [message #176086 is a reply to message #176084]

Wed, 23 November 2011 00:59

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Peter H. Coffin wrote:

> On Tue, 22 Nov 2011 17:30:45 +0000, The Natural Philosopher wrote:
>> Quite right. It is worse than you can possibly imagine, at least here in the
>> UK, where addresses can be as short as 2 lines or as long as 6.
>>
>> So
>>
>> 10 Wonkers place, LONDON EC3 7QY is a typical TOWN address
>>
>> Out in the sticks you might get
>>
>> Apartment 4b, the Old Town House, Shire Lane, Recketts Green, Nr
>> Stonehouse, Gloucestershire GL13 6AH
>>
>>
>> And if that comes at you without commas, god help you.
>>
>> I have spent DAYS taking name/address fields and parsing them *manually*
>> into structured tables...
>
> It is at this point that most people that have an actual need to solve
> these kinds of problems turn to the available commercial software and
> decide to solve it with money instead of manpower.

Which raises the question: how come the data was not requested and
stored in a structured form to begin with? That is, for example, why
only a single address field in the form, and not separate fields for
street, house number and so on? ISTM that we are seeing here an example
of a mistake made at the beginning whose overall cost naturally grows
larger and larger as the project nears completion.

PointedEars
--
Danny Goodman's books are out of date and teach practices that are
positively harmful for cross-browser scripting.
-- Richard Cornford, cljs, <cife6q$253$1$8300dec7(at)news(dot)demon(dot)co(dot)uk> (2004)

Re: preg_match() oddities and question [message #176088 is a reply to message #176077]

Wed, 23 November 2011 08:55

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

>>>> So I have this regexp:
>>>>
>>>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>>>> $streetname = uc_words($m[1]);
>>>> $streetnumber = trim($m[2]);
>>>> $streetletter = strtoupper($m[3]);
>>>> $search = trim($streetname . SPACE . $streetnumber .
>>>> $streetletter);
>>>> }
>>>>
>>>> The desired result is to take the input ($search) and split it into
>>>> its parts as an address, right? $search can be, for example, "foo
>>>> street 34", "longstreet 45b", "longstreet 45 b" or just "longstreet".
>>>
>>> "10 East 42nd Street, New York, NY 10017, USA".
>>
>> That wouldn't be a normal swedish address, no. :)
>
> You had not limited the country or the language of your street addresses.

Well, in my defense, the subject line was "preg_match() and swedish
characters" until I changed it. I hadn't changed it when I wrote my
examples.

> My point is that parsing a street name and a house number from a street
> address is a hard problem that cannot be solved only by applying one regular
> expression.

Right, but your example is not a valid argument for that conclusion.
My examples contained the variations of addresses that I wanted to
match. Or are you saying that there is no way to use regular
expressions to catch the examples I gave? Because I have a hard time
believing that.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176089 is a reply to message #176086]

Wed, 23 November 2011 08:58

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <1780410(dot)D2KPVSPYKU(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

>> It is at this point that most people that have an actual need to solve
>> these kinds of problems turn to the available commercial software and
>> decide to solve it with money instead of manpower.
>
> Where the question must be allowed: How came that the data has not been
> requested and stored in a structured form to begin with? That is, for
> example, why only an address field in a form – why not a street, house
> number aso. field?

Convenience for the user, of course.

This is a form that says "Are you connected to the citynet?" and then
you just enter your address to search the database. If the user has to
provide street name, street number and street letter in separate
fields, it's inconvenient for them.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176092 is a reply to message #176086]

Wed, 23 November 2011 09:35

The Natural Philosoph
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

Thomas 'PointedEars' Lahn wrote:
> Peter H. Coffin wrote:
>
>> On Tue, 22 Nov 2011 17:30:45 +0000, The Natural Philosopher wrote:
>>> Quite right. It is worse than you can possibly imagine, at least here in the
>>> UK, where addresses can be as short as 2 lines or as long as 6.
>>>
>>> So
>>>
>>> 10 Wonkers place, LONDON EC3 7QY is a typical TOWN address
>>>
>>> Out in the sticks you might get
>>>
>>> Apartment 4b, the Old Town House, Shire Lane, Recketts Green, Nr
>>> Stonehouse, Gloucestershire GL13 6AH
>>>
>>>
>>> And if that comes at you without commas, god help you.
>>>
>>> I have spent DAYS taking name/address fields and parsing them *manually*
>>> into structured tables...
>> It is at this point that most people that have an actual need to solve
>> these kinds of problems turn to the available commercial software and
>> decide to solve it with money instead of manpower.
>
> Where the question must be allowed: How came that the data has not been
> requested and stored in a structured form to begin with? That is, for
> example, why only an address field in a form – why not a street, house
> number aso. field? ISTM that we are seeing here an example of a mistake
> made at the beginning which overall cost naturally grows larger and larger
> as the project is nearing completion.
>
IME this happens when you move from a crappy old database system to a
properly designed one, and the data migration begins...

>
> PointedEars

Re: preg_match() oddities and question [message #176095 is a reply to message #176088]

Wed, 23 November 2011 13:53

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:

> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
> Lahn <PointedEars(at)web(dot)de> wrote:
>
>>>> "10 East 42nd Street, New York, NY 10017, USA".
>>>
>>> That wouldn't be a normal swedish address, no. :)
>>
>> You had not limited the country or the language of your street
>> addresses.
>
> Well, to my defense, the subject line was "preg_match() and swedish
> characters" until I changed it. I hadn't changed it when I wrote my
> examples.
>
>> My point is that parsing a street name and a house number from a
>> street address is a hard problem that cannot be solved only by
>> applying one regular expression.
>
> Right, but your example is not a valid argument for that conclusion.
> My examples contained the variations of addresses that I wanted
> to match. Or are you saying that there is no way to use regular
> expressions to catch the examples I gave? Because I have a hard time
> believing that.

Address-matching is a hard task. I did that for a decade professionally
(as part of a job, not the sole function), and it's not easy to do well
for even one postal system, and trying to write a generalized one is
basically impossible to manage in one lifetime. The best *simple* way
to manage it is to take a field, blow it out into individual words,
standardize all the words you can find without trying to sort out
what they are (which is the Very Hard part of that task), throw the
alphabetic ones into soundex or nysiis, make a loose match by a chunk of
postal code or city code or province, then pick the item(s) that have
the greatest number of matches between incoming and loose-match record
of the numeric and nysiis-encoded alphabetical elements. If you weight
things like "numeric match = 1, plaintext that's in a dictionary that
matches when nysiis = 2, nondictionary text that matches nysiis = 3",
and do that for NAME as well as ADDRESS, you get about as good as you
can get without buying someone else's work. And that's STILL a lot of
effort to write. Regexp alone for address matching is a snipe-hunt. It
looks obviously right and you can spend a lot of time playing with it,
but it ends up being a dead end.
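
The token-and-score approach described above could be sketched roughly as follows. This is a simplified illustration under several assumptions: PHP's built-in soundex() stands in for NYSIIS, the dictionary distinction is dropped (numbers weigh 1, phonetic word matches weigh 2), the loose pre-match on postal code is left out, tokenising is ASCII-only, and all function names and sample records are hypothetical.

```php
<?php
// Hypothetical sketch of token-based fuzzy matching; all names are illustrative.

// Split a free-text field into lower-cased ASCII word tokens, dropping punctuation.
function tokenize(string $field): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($field), -1, PREG_SPLIT_NO_EMPTY);
    return $tokens === false ? [] : $tokens;
}

// Reduce each token to a comparable key: numbers as-is, words via soundex().
function keysFor(string $field): array
{
    $keys = [];
    foreach (tokenize($field) as $token) {
        $keys[] = ctype_digit($token) ? "n:$token" : 'w:' . soundex($token);
    }
    return array_unique($keys);
}

// Weighted overlap: numeric matches count 1, phonetic word matches count 2
// (a simplification of the 1/2/3 weighting described above).
function score(string $incoming, string $candidate): int
{
    $score = 0;
    foreach (array_intersect(keysFor($incoming), keysFor($candidate)) as $key) {
        $score += ($key[0] === 'n') ? 1 : 2;
    }
    return $score;
}

// Made-up records; the misspelled "Plaice" still matches "Place" phonetically.
$candidates = ['10 Wonkers Place London', '12 Shire Lane Stonehouse'];
$incoming   = '10 Wonkers Plaice, LONDON';

$best = null;
$bestScore = -1;
foreach ($candidates as $candidate) {
    $s = score($incoming, $candidate);
    if ($s > $bestScore) {
        $best = $candidate;
        $bestScore = $s;
    }
}
echo "$best ($bestScore)\n";
```

Note that nothing here tries to decide which token is a street, a town or a house name; the candidates are only ranked by how many normalised tokens they share with the input.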

--
_ o
|/)

Re: preg_match() oddities and question [message #176098 is a reply to message #176095]

Wed, 23 November 2011 18:01

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <slrnjcpunb(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
"Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:

> On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
>
>> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
>> Lahn <PointedEars(at)web(dot)de> wrote:
>>
>>>>> "10 East 42nd Street, New York, NY 10017, USA".
>>>>
>>>> That wouldn't be a normal swedish address, no. :)
>>>
>>> You had not limited the country or the language of your street
>>> addresses.
>>
>> Well, to my defense, the subject line was "preg_match() and swedish
>> characters" until I changed it. I hadn't changed it when I wrote my
>> examples.
>>
>>> My point is that parsing a street name and a house number from a
>>> street address is a hard problem that cannot be solved only by
>>> applying one regular expression.
>>
>> Right, but your example is not a valid argument for that conclusion.
>> My examples contained the variations of addresses that I wanted
>> to match. Or are you saying that there is no way to use regular
>> expressions to catch the examples I gave? Because I have a hard time
>> believing that.
>
> Address-matching is a hard task. I did that for a decade professionally
> (as part of a job, not the sole function), and it's not easy to do well
> for even one postal system, and trying to write a generalized one is
> basically impossible to manage in one lifetime. The best *simple* way
> to manage it is to take a field, blow it out into individual words,
> standardize all the words you can find without trying to sort out
> what they are (which is the Very Hard part of that task), throw the
> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
> postal code or city code or province, then pick the item(s) that have
> the greatest number of matches between incoming and loose-match record
> of the numeric and nysiis-encoded alphabetical elements. If you weight
> things like "numeric match = 1, plaintext that's in a dictionary that
> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
> and do that for NAME as well as ADDRESS, you get about as good as you
> can get without buying someone else's work. And that's STILL a lot of
> effort to write. Regexp alone for address matching is a snipe-hunt. It
> looks obviously right and you can spend a lot of time playing with it,
> but it ends up being a dead end.

I thank you for your input, but I still maintain that my examples
could be parsed using a regular expression, and I won't concede
otherwise unless shown counterexamples :-D

No offense, though.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176100 is a reply to message #176098]

Wed, 23 November 2011 18:54

The Natural Philosoph
Messages: 993
Registered: September 2010

Karma: 0

Senior Member

Sandman wrote:
> In article <slrnjcpunb(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
> "Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:
>
>> On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
>>
>>> In article <3004614(dot)SPkdTlGXAF(at)PointedEars(dot)de>, Thomas 'PointedEars'
>>> Lahn <PointedEars(at)web(dot)de> wrote:
>>>
>>>>>> "10 East 42nd Street, New York, NY 10017, USA".
>>>>> That wouldn't be a normal swedish address, no. :)
>>>> You had not limited the country or the language of your street
>>>> addresses.
>>> Well, to my defense, the subject line was "preg_match() and swedish
>>> characters" until I changed it. I hadn't changed it when I wrote my
>>> examples.
>>>
>>>> My point is that parsing a street name and a house number from a
>>>> street address is a hard problem that cannot be solved only by
>>>> applying one regular expression.
>>> Right, but your example is not a valid argument for that conclusion.
>>> My examples contained the variations of addresses that I wanted
>>> to match. Or are you saying that there is no way to use regular
>>> expressions to catch the examples I gave? Because I have a hard time
>>> believing that.
>> Address-matching is a hard task. I did that for a decade professionally
>> (as part of a job, not the sole function), and it's not easy to do well
>> for even one postal system, and trying to write a generalized one is
>> basically impossible to manage in one lifetime. The best *simple* way
>> to manage it is to take a field, blow it out into individual words,
>> standardize all the words you can find without trying to sort out
>> what they are (which is the Very Hard part of that task), throw the
>> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
>> postal code or city code or province, then pick the item(s) that have
>> the greatest number of matches between incoming and loose-match record
>> of the numeric and nysiis-encoded alphabetical elements. If you weight
>> things like "numeric match = 1, plaintext that's in a dictionary that
>> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
>> and do that for NAME as well as ADDRESS, you get about as good as you
>> can get without buying someone else's work. And that's STILL a lot of
>> effort to write. Regexp alone for address matching is a snipe-hunt. It
>> looks obviously right and you can spend a lot of time playing with it,
>> but it ends up being a dead end.
>
> I thank you for your input, but I still maintain that my examples
> could be parsed by using a regular expression, and unless explicitly
> told so by using examples will I admit otherwise :-D
>
> No offense, though.
>
>
after three days, you could have done the data conversion by hand...

Re: preg_match() oddities and question [message #176102 is a reply to message #176100]

Wed, 23 November 2011 19:23

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <jajfi0$pmj$3(at)news(dot)albasani(dot)net>,
The Natural Philosopher <tnp(at)invalid(dot)invalid> wrote:

>>> Address-matching is a hard task. I did that for a decade professionally
>>> (as part of a job, not the sole function), and it's not easy to do well
>>> for even one postal system, and trying to write a generalized one is
>>> basically impossible to manage in one lifetime. The best *simple* way
>>> to manage it is to take a field, blow it out into individual words,
>>> standardize all the words you can find without trying to sort out
>>> what they are (which is the Very Hard part of that task), throw the
>>> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
>>> postal code or city code or province, then pick the item(s) that have
>>> the greatest number of matches between incoming and loose-match record
>>> of the numeric and nysiis-encoded alphabetical elements. If you weight
>>> things like "numeric match = 1, plaintext that's in a dictionary that
>>> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
>>> and do that for NAME as well as ADDRESS, you get about as good as you
>>> can get without buying someone else's work. And that's STILL a lot of
>>> effort to write. Regexp alone for address matching is a snipe-hunt. It
>>> looks obviously right and you can spend a lot of time playing with it,
>>> but it ends up being a dead end.
>>
>> I thank you for your input, but I still maintain that my examples
>> could be parsed by using a regular expression, and unless explicitly
>> told so by using examples will I admit otherwise :-D
>>
>> No offense, though.
>
> after three days, you could have done the data conversion by hand...

Huh? What data conversion? The data to be searched does not need to be
converted into anything. It is neatly separated and also kept in a
combined form. It's the *in-data* (i.e. the search terms provided by
visitors to the sites) that I need to massage :)

And, three days wouldn't have gotten me far, the database contains
over 600,000 posts of addresses :)

But, as I said, the database is neatly formatted.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176103 is a reply to message #176098]

Wed, 23 November 2011 18:58

Peter H. Coffin
Messages: 245
Registered: September 2010

Karma: 0

Senior Member

On Wed, 23 Nov 2011 19:01:14 +0100, Sandman wrote:
> In article <slrnjcpunb(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
> "Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:
>
>> On Wed, 23 Nov 2011 09:55:22 +0100, Sandman wrote:
>>
>>> Right, but your example is not a valid argument for that conclusion.
>>> My examples contained the variations of addresses that I wanted
>>> to match. Or are you saying that there is no way to use regular
>>> expressions to catch the examples I gave? Because I have a hard time
>>> believing that.
>>
>> Address-matching is a hard task. I did that for a decade professionally
>> (as part of a job, not the sole function), and it's not easy to do well
>> for even one postal system, and trying to write a generalized one is
>> basically impossible to manage in one lifetime. The best *simple* way
>> to manage it is to take a field, blow it out into individual words,
>> standardize all the words you can find without trying to sort out
>> what they are (which is the Very Hard part of that task), throw the
>> alphabetic ones into soundex or nysiis, make a loose match by a chunk of
>> postal code or city code or province, then pick the item(s) that have
>> the greatest number of matches between incoming and loose-match record
>> of the numeric and nysiis-encoded alphabetical elements. If you weight
>> things like "numeric match = 1, plaintext that's in a dictionary that
>> matches when nysiis = 2, nondictionary text that matches nysiis = 3",
>> and do that for NAME as well as ADDRESS, you get about as good as you
>> can get without buying someone else's work. And that's STILL a lot of
>> effort to write. Regexp alone for address matching is a snipe-hunt. It
>> looks obviously right and you can spend a lot of time playing with it,
>> but it ends up being a dead end.
>
> I thank you for your input, but I still maintain that my examples
> could be parsed by using a regular expression, and unless explicitly
> told so by using examples will I admit otherwise :-D

*grin* Any given (note: given) example set can be parsed with a
sufficiently complicated regexp. If your task is small enough and clean
enough, it might even not be THAT hard to accomplish. It's impossible to
provide advice about it, though, without having that complete example
set as well. The incoming data, however, is almost always going to
contain data that is not clean enough and will also probably end up
containing stuff that does not match your parsing rules, in a "because
fools are so ingenious" sense.

And, at that point, you'll want to be looking at how you handle those
exceptions: reject, pass, send for clerical review, and what those
categories mean for your process.
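
The word-by-word scoring approach quoted at the top of this message could be sketched roughly like this. It is a simplification under stated assumptions, not a working matcher: PHP's built-in soundex() stands in for NYSIIS (which PHP does not ship), the dictionary/non-dictionary distinction is dropped, the weights (1 for a numeric match, 2 for a phonetic match) are placeholders, and the function names address_tokens() and match_score() are made up for illustration.

```php
<?php
// Illustrative only: soundex() replaces NYSIIS, and the weights are placeholders.

// Blow a field out into lower-cased, whitespace-separated words.
function address_tokens(string $text): array
{
    return preg_split('/\s+/', strtolower(trim($text)), -1, PREG_SPLIT_NO_EMPTY);
}

// Score how well an incoming term matches one loose-match candidate record.
function match_score(string $incoming, string $candidate): int
{
    $score = 0;
    $candidateTokens = address_tokens($candidate);
    foreach (address_tokens($incoming) as $token) {
        foreach ($candidateTokens as $other) {
            if (ctype_digit($token) || ctype_digit($other)) {
                // Numeric tokens only count when they are identical (weight 1).
                if ($token === $other) {
                    $score += 1;
                    break;
                }
            } elseif (soundex($token) === soundex($other)) {
                // Alphabetic tokens count when their phonetic codes agree (weight 2).
                $score += 2;
                break;
            }
        }
    }
    return $score;
}
```

The candidate with the highest score among the loose matches would be the one to pick; what to do with ties or low scores is the reject/pass/review question above.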

> No offense, though.

None to take.

--
58. If it becomes necessary to escape, I will never stop to pose
dramatically and toss off a one-liner.
--Peter Anspach's list of things to do as an Evil Overlord

Re: preg_match() oddities and question [message #176105 is a reply to message #176089]

Wed, 23 November 2011 21:02

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:
>>> It is at this point that most people that have an actual need to solve
>>> these kinds of problems turn to the available commercial software and
>>> decide to solve it with money instead of manpower.
>>
>> Where the question must be allowed: How came that the data has not been
>> requested and stored in a structured form to begin with? That is, for
>> example, why only an address field in a form – why not a street, house
>> number aso. field?
>
> Convenience for the user, of course.

You can't be serious.

> This is a form that says "Are you connected to the citynet?" and then
> you just enter your address to search the database. If the user has to
> provide street name, street number and street letter in separate
> fields, it's inconvenient for them.

No, it's not. With separate controls they can be sure where to enter what;
it is accessible, and you have no problem processing the data. With one
control, neither applies.

PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)

Re: preg_match() oddities and question [message #176108 is a reply to message #176105]

Thu, 24 November 2011 07:20

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <4778042(dot)ypaU67uLZW(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

>>>> It is at this point that most people that have an actual need to solve
>>>> these kinds of problems turn to the available commercial software and
>>>> decide to solve it with money instead of manpower.
>>>
>>> Where the question must be allowed: How came that the data has not been
>>> requested and stored in a structured form to begin with? That is, for
>>> example, why only an address field in a form – why not a street, house
>>> number aso. field?
>>
>> Convenience for the user, of course.
>
> You can't be serious.

I can :)

>> This is a form that says "Are you connected to the citynet?" and then
>> you just enter your address to search the database. If the user has to
>> provide street name, street number and street letter in separate
>> fields, it's inconvenient for them.
>
> No, it's not.

Actually, yes it is :)

> With separate controls they can be sure where to enter what;
> it is accessible, and you have no problem processing the data. With one
> control, neither applies.

Well, I have been doing this for about ten years now, and received tons
of feedback from my clients on things like this. When I say it's
inconvenient for the end user, it's not something I make up on the
spot to be obnoxious.

Just like with my examples, I have a pretty clear picture of what
problem I need to solve. I find it curious that no one in CLP even
attempts to look at that, instead trying to find other examples or
claiming that the frontend should be changed.

Makes me think that you guys deem the examples I gave as unsolvable,
which of course I refuse to agree with. :)

No offense though.

--
Sandman[.net]

Re: preg_match() oddities and question [message #176109 is a reply to message #176103]

Thu, 24 November 2011 07:28

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <slrnjcqghr(dot)85q(dot)hellsop(at)nibelheim(dot)ninehells(dot)com>,
"Peter H. Coffin" <hellsop(at)ninehells(dot)com> wrote:

>> I thank you for your input, but I still maintain that my examples
>> could be parsed by using a regular expression, and unless explicitly
>> told so by using examples will I admit otherwise :-D
>
> *grin* Any given (note: given) example set can be parsed with a
> sufficiently complicated regexp. If your task is small enough and clean
> enough, it might even not be THAT hard to accomplish. It's impossible to
> provide advice about it, though, without having that complete example
> set as well.

Huh? It was given in my OP.

> The incoming data, however, is almost always going to
> contain data that is not clean enough and will also probably end up
> containing stuff that does not match your parsing rules, in a "because
> fools are so ingenious" sense.

Sure, but then there will be no matches of course. I'm trying to deal
with the 99.9% of "correctly" formed search terms, though :)

> And, at that point, you'll want to be looking at how you handle those
> exceptions: reject, pass, send for clerical review, and what those
> categories mean for your process.

Nah, if a user searches for "34 Vikavägen B" they will get no hits,
because no addresses in Sweden look like that. In such cases, we
encourage the user to check their spelling of the address and try
again, of course :)

Basically, I have a number of examples. I want to parse them with a
regexp. Someone here helped me solve why swedish characters broke it,
but none yet have even tried to look at my regexp and suggest
alterations to fit my examples.

Not that I could *expect* anything, it was just a comment on how
common it is in groups like these for the readers to assume stupidity
on the part of the OP and make edge cases (some more wild than others)
where the proposed scenario would break.

I'm not stupid though, and I have a database of over 1 million
historical search terms and looking through that I know *exactly* how
and what people search for and now I'm looking to make sure that I can
parse it better and with more certainty.

As a bit of background, the DB that is to be searched has four
relevant fields, one for street name, one for street number and one
for street letter, and then one where the three are combined. It's
this combined field I've so far used for loose string matching on the
incoming search field, but there are still searches that won't match
that should, so I need to parse and massage the incoming search terms.

Which brings us full circle :-D

--
Sandman[.net]

Re: preg_match() oddities and question [message #176121 is a reply to message #176108]

Thu, 24 November 2011 12:55

Denis McMahon
Messages: 634
Registered: September 2010

Karma: 0

Senior Member

On Thu, 24 Nov 2011 08:20:45 +0100, Sandman wrote:

> Well, I have been doing this for about ten years now, and received tons
> of feedback from my clients on things like this. When I say it's
> inconvenient for the end user, it's not something I make up on the spot
> to be obnoxious.

It's a point that a lot of website UI designers overlook - what we think
is the best and most obvious layout, and even "the way everyone does it",
isn't always what the website user finds most user friendly.

As to the OPs problem, perhaps it needs a different approach:

Presumably for any given city you can obtain a list of valid streets?

Why not:

foreach ($streets_of_city as $street_name) {
    if (($street_start = strpos($address, $street_name)) !== false) {
        foreach ($districts_of_city as $district_name) {
            if (($district_start = strpos($address, $district_name,
                    $street_start)) !== false) {
                break;
            }
        }
        break;
    }
}

You may need to allow for "district creep" and will need to allow for
streets crossing district boundaries.

Another option, and one that is becoming popular in some areas, might be
to allow entry of the postcode, and then generate a drop down list,
perhaps populated using an XMLHttpRequest, of addresses matching the
postcode - this would require access to a copy of the relevant postcode
database which might involve some cost, and of course you need to
implement a mechanism for handling updates to that database.

Possibly a similar sort of approach, based on a select menu for district,
then a select menu for street names, and then a text entry field for
"apartment number, building name and / or number, or house number as
appropriate".

I don't think you're ever going to parse a single line address entry with
regex, because there are too many different formats that might get thrown
at you. You either need to get the data in a different format, or find a
different method of processing the data that you are getting.

Rgds

Denis McMahon

Re: preg_match() oddities and question [message #176125 is a reply to message #176108]

Thu, 24 November 2011 21:41

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:
>>>> > It is at this point that most people that have an actual need to
>>>> > solve these kinds of problems turn to the available commercial
>>>> > software and decide to solve it with money instead of manpower.
>>>>
>>>> Where the question must be allowed: How came that the data has not
>>>> been
>>>> requested and stored in a structured form to begin with? That is, for
>>>> example, why only an address field in a form – why not a street, house
>>>> number aso. field?
>>>
>>> Convenience for the user, of course.
>>
>> You can't be serious.
>
> I can :)

Apparently.

>>> This is a form that says "Are you connected to the citynet?" and then
>>> you just enter your address to search the database. If the user has to
>>> provide street name, street number and street letter in separate
>>> fields, it's inconvenient for them.
>>
>> No, it's not.
>
> Actually, yes it is :)

No, it is not.

>> With separate controls they can be sure where to enter what;
>> it is accessible, and you have no problem processing the data. With one
>> control, neither applies.
>
> Well, I have been doing this for about ten years now,

I have been doing this for about fourteen years now. So what? There are
basic accessibility guidelines that no amount of development experience can
substitute (although studying usability, as I did, can help). Many of which
must be followed per legislation in some countries.

> and received tons of feedback from my clients on things like this. […]

There remains the possibility that you did it wrong in another way all the
time.

> When I say it's inconvenient for the end user, it's not something I make
> up on the spot to be obnoxious.

Nevertheless, your logic is flawed.

PointedEars
--
Use any version of Microsoft Frontpage to create your site.
(This won't prevent people from viewing your source, but no one
will want to steal it.)
-- from <http://www.vortex-webdesign.com/help/hidesource.htm> (404-comp.)

Re: preg_match() oddities and question [message #176127 is a reply to message #176125]

Fri, 25 November 2011 08:26

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <4340380(dot)gzYE47dVVZ(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

>>>> > > It is at this point that most people that have an actual need to
>>>> > > solve these kinds of problems turn to the available commercial
>>>> > > software and decide to solve it with money instead of manpower.
>>>> >
>>>> > Where the question must be allowed: How came that the data has not
>>>> > been
>>>> > requested and stored in a structured form to begin with? That is, for
>>>> > example, why only an address field in a form – why not a street, house
>>>> > number aso. field?
>>>>
>>>> Convenience for the user, of course.
>>>
>>> You can't be serious.
>>
>> I can :)
>
> Apparently.

Indeed. :)

>>>> This is a form that says "Are you connected to the citynet?" and then
>>>> you just enter your address to search the database. If the user has to
>>>> provide street name, street number and street letter in separate
>>>> fields, it's inconvenient for them.
>>>
>>> No, it's not.
>>
>> Actually, yes it is :)
>
> No, it is not.

Actually, yes it is :)

>>> With separate controls they can be sure where to enter what;
>>> it is accessible, and you have no problem processing the data. With one
>>> control, neither applies.
>>
>> Well, I have been doing this for about ten years now,
>
> I have been doing this for about fourteen years now. So what?

You have monitored swedish address search terms for fourteen years? We
should compare notes.

> There are basic accessibility guidelines that no amount of
> development experience can substitute (although studying usability,
> as I did, can help). Many of which must be followed per
> legislation in some countries.

This.. has nothing to do with the topic at hand.

>> and received tons of feedback from my clients on things like this. […]
>
> There remains the possibility that you did it wrong in another way all the
> time.

Also, there is a possibility that you don't have enough information
about the details of my situation to make any judgmental comments at
all.

>> When I say it's inconvenient for the end user, it's not something I make
>> up on the spot to be obnoxious.
>
> Nevertheless, your logic is flawed.

Well, as long as you're merely *saying* that instead of actually, you
know, *substantiating* that opinion, I have no idea what you expect me
to do with it.

Words are easy :)

--
Sandman[.net]

Re: preg_match() oddities and question [message #176128 is a reply to message #176121]

Fri, 25 November 2011 08:36

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <4ece3eb7$0$28723$a8266bb1(at)newsreader(dot)readnews(dot)com>,
Denis McMahon <denismfmcmahon(at)gmail(dot)com> wrote:

>> Well, I have been doing this for about ten years now, and received tons
>> of feedback from my clients on things like this. When I say it's
>> inconvenient for the end user, it's not something I make up on the spot
>> to be obnoxious.
>
> It's a point that a lot of website UI designers overlook - what we think
> is the best and most obvious layout, and even "the way everyone does it",
> isn't always what the website user finds most user friendly.

Couldn't agree with you more here.

> As to the OPs problem, perhaps it needs a different approach:

(I am the OP, just for your information) :)

> Presumably for any given city you can obtain a list of valid streets?

No, not really. I have a database of addresses that the search should
match against. The relevant DB fields are these:

streetname "Stora gatan"
streetnumber "34"
streetletter "B"
address "Stora gatan 34B"

In-data variants I am concerned with are:

"Stora gatan"
"Stora gatan 34"
"Stora gatan 34b"
"Stora gatan 34 b"

And I need to build a regexp to extract the three parts from all these
in-data versions to match against the "address" field (or, maybe even
against the three discrete fields).

This is where I'm stuck, since the regexp I use doesn't adequately
match the different versions of how the street letter is sent.
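
For reference, one way the four in-data variants listed above could be split is sketched below. It is a minimal sketch under assumptions, not the regexp from the original post: the helper names parse_street_search() and normalized_address() are made up, the street letter is assumed to be a single A-Z character, and the search term is assumed to have been converted to valid UTF-8 before matching (with the /u modifier, preg_match() fails on invalid UTF-8).

```php
<?php
// Hypothetical helper (not from the thread): split a search term into
// street name, number and letter. Assumes $search is valid UTF-8.
function parse_street_search(string $search): ?array
{
    // Lazy street name, then optionally: whitespace, digits, optional single letter.
    $pattern = '/^(.+?)(?:\s+(\d+)\s*([a-z])?)?$/iu';
    if (preg_match($pattern, trim($search), $m) !== 1) {
        return null; // empty input or invalid UTF-8
    }
    return [
        'streetname'   => $m[1],
        // Trailing groups that did not take part in the match are absent from $m.
        'streetnumber' => $m[2] ?? '',
        'streetletter' => strtoupper($m[3] ?? ''),
    ];
}

// Rebuild the combined form used by the "address" field, e.g. "Stora gatan 34B".
function normalized_address(array $parts): string
{
    return trim($parts['streetname'] . ' ' . $parts['streetnumber'] . $parts['streetletter']);
}
```

Because the name group is lazy and the number/letter part is optional as a whole, a term without digits falls through to "name only" instead of ending up in the letter group. Input that does not fit (say "Stora gatan 34bc") is returned whole as the street name, so it will simply find no hit.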

> Another option, and one that is becoming popular in some areas, might be
> to allow entry of the postcode, and then generate a drop down list,
> perhaps populated using an XMLHttpRequest, of addresses matching the
> postcode - this would require access to a copy of the relevant postcode
> database which might involve some cost, and of course you need to
> implement a mechanism for handling updates to that database.

I have all the post codes, but that's not helping me here. In short,
if the same address exists in two post codes, I would show both and
the user would select which one.

It's the address part that is my current problem. And splitting the
search box into three boxes (one for name, number and letter) is not a
desirable option for this application, unfortunately.

> I don't think you're ever going to parse a single line address entry with
> regex, because there are too many different formats that might get thrown
> at you.

Yeah, people here keep claiming that while ignoring that I am only
interested in capturing the above formats, that make up the vast vast
vast majority of all searches being done to this database. If I don't
match "34 Storgatan b", that's just fine by me. I have a very specific
case of searches that currently fail that I feel strongly could be
averted by a regular expression on the receiving end.

--
Sandman[.net]

SOLVED: Re: preg_match() oddities and question [message #176130 is a reply to message #176061]

Fri, 25 November 2011 09:32

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <mr-5B96D1(dot)12212022112011(at)News(dot)Individual(dot)NET>,
Sandman <mr(at)sandman(dot)net> wrote:

> So I have this regexp:
>
> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
> $streetname = uc_words($m[1]);
> $streetnumber = trim($m[2]);
> $streetletter = strtoupper($m[3]);
> $search = trim($streetname . SPACE . $streetnumber .
> $streetletter);
> }

It's funny how asking for help on the net works. Being an "oldie" I
usually turn to IRC first, mostly because it's the most direct medium,
knowledgeable people may be online right there and then.

My next option usually is USENET, because I like the format and it's a
lot like mail.

But I think I need to change all of that thinking. The last few years
I've gotten the most effective help from sites such as stackoverflow,
really.

So with my problem above, I didn't find any regexp wizard online on
IRC so I came here asking it, using examples. That was two days ago. I
received plenty of responses, some really helpful, some not so much. I
wouldn't expect immediate help and salvation from CLP really, but it's
not totally uncommon for people to at least want to help in some
way.

Since most replies were about the in-data or the database being
incorrect in the first place, instead of just focusing on the actual
question asked, I turned to stackoverflow. I posted a question today
at 10 am.

At 10:19am, I got a clean cut, no frills solution to my actual problem:

preg_match("/([a-zA-Z ]+) ?([0-9]+)? ?([a-zA-Z]+)?/"...)

That captures *all* my examples in the OP, and formats it *exactly*
like I wanted to.
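
A quick way to see what that pattern returns for the examples from the OP is sketched below. Only the pattern string is taken from the line above; the wrapper function split_with_quoted_pattern(), the trim() and the padding are assumptions added for illustration.

```php
<?php
// The pattern exactly as quoted above; everything else here is illustrative.
$pattern = "/([a-zA-Z ]+) ?([0-9]+)? ?([a-zA-Z]+)?/";

function split_with_quoted_pattern(string $pattern, string $search): array
{
    preg_match($pattern, $search, $m);
    // Trailing groups that did not take part in the match are missing
    // from $m, so pad the array to four entries before reading them.
    $m = array_pad($m, 4, '');
    // Group 1 is greedy and its class includes the space, so trim it.
    return [trim($m[1]), $m[2], strtoupper($m[3])];
}

foreach (['foo street 34', 'longstreet 45b', 'longstreet 45 b', 'longstreet'] as $term) {
    echo $term, ' => ', json_encode(split_with_quoted_pattern($pattern, $term)), "\n";
}
```

Two things to keep in mind with the pattern as written: it is not anchored, and its character classes are ASCII-only, so a name such as "vikavägen" is cut off at the first non-ASCII letter unless the class is widened.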

I'm not trying to disrespect anyone here, but there is a bit too much
"elitism" and "You're doing it wrong" mentality here, and there always
has been. And sometimes it's justified, when pure newbies come here and
ask how to create a guestbook in php. But it seems this mentality
spills over to posters like me that aren't newbies but still need help.

I've been in this group for about ten years:

<http://groups.google.com/groups/profile?show=more&enc_user=YhtCLA4AAABX2xnmqiyRi8RpNstwOyTw&group=comp.lang.php>

But I'm proficient enough in PHP that I don't post all that often
about needing help so I'm not a "regular" here and probably easily
mistaken for a newbie.

See this as a plea to think beyond your own preconceptions about the
proficiency of a poster, you know the entire "innocent until proven
guilty" :)

Or, just ignore it altogether. I got my solution so I'm happy either
way :)

--
Sandman[.net]

Re: preg_match() oddities and question [message #176138 is a reply to message #176127]

Fri, 25 November 2011 14:44

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:
>>>> With separate controls they can be sure where to enter what;
>>>> it is accessible, and you have no problem processing the data. With
>>>> one control, neither applies.
>>> Well, I have been doing this for about ten years now,
>> I have been doing this for about fourteen years now. So what?
>
> You have monitored swedish address search terms for fourteen years? We
> should compare notes.

You are missing the point. The kind of structured data that you need to
enter in a form does not matter. Using cursor keys to move the text cursor
between delimiters in a running text always causes more accessibility
and usability problems, and consequently information processing problems,
than tabbing (or otherwise moving the focus) from one control to another.

>> There are basic accessibility guidelines that no amount of
>> development experience can substitute (although studying usability,
>> as I did, can help). Many of which must be followed per
>> legislation in some countries.
>
> This.. has nothing to do with the topic at hand.

You are just not seeing how much it has to do with the topic at hand.
Changing the way people put in data towards one that is *actually* easier
for them solves, at least, three problems at once, including the one that
you have been asking about. Because when structured data is being entered
in a structured way, you have almost no trouble storing it in a structured
way. (BTW, in case you do not know, moving the focus from one input control
to the next while typing text can be assisted with client-side scripting.)

You should read, for example, <http://www.useit.com/> which contains many
ideas on how you can make Web sites better usable and therefore, almost in
passing, more accessible.

>>> When I say it's inconvenient for the end user, it's not something I
>>> make up on the spot to be obnoxious.
>> Nevertheless, your logic is flawed.
>
> Well, as long as you're merely saying that instead of actually, you
> know, substantiating that opinion, I have no idea what you expect me
> to do with it.
>
> Words are easy :)

This is about as much as I will discuss this here because your *actual*
problem has nothing to do with PHP, and little to do with Regular
Expressions.

HTH

PointedEars
--
var bugRiddenCrashPronePieceOfJunk = (
navigator.userAgent.indexOf('MSIE 5') != -1
&& navigator.userAgent.indexOf('Mac') != -1
) // Plone, register_function.js:16

Re: preg_match() oddities and question [message #176139 is a reply to message #176138]

Fri, 25 November 2011 15:34

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <1438797(dot)UceJUlZ0hu(at)PointedEars(dot)de>,
Thomas 'PointedEars' Lahn <PointedEars(at)web(dot)de> wrote:

>> You have monitored swedish address search terms for fourteen years? We
>> should compare notes.
>
> You are missing the point.

I disagree. :)

> The kind of structured data that you need to
> enter in a form does not matter. Using cursor keys to move the text cursor
> between delimiters in a running text always causes more accessibility
> and usability problems, and consequently information processing problems,
> than tabbing (or otherwise moving the focus) from one control to another.

Whatever gave you the idea that I am proposing a solution where the
user would have to "use cursor keys to move the text cursor between
delimiters in a running text"? I literally have no idea what you are
talking about, and it has absolutely nothing to do with this thread.

>>> There are basic accessibility guidelines that no amount of
>>> development experience can substitute (although studying usability,
>>> as I did, can help). Many of which must be followed per
>>> legislation in some countries.
>>
>> This.. has nothing to do with the topic at hand.
>
> You are just not seeing how much it has to do with the topic at hand.

And you seem to be failing to explain how it does :)

> Changing the way people put in data towards one that is *actually* easier
> for them solves, at least, three problems at once, including the one that
> you have been asking about.

Adding form fields to make it more cumbersome for them to search for
their address, however, does not.

My *experience* (i.e. not guessing, but rather - having done it
exactly the way you propose) tells me that adding form fields to this
situation adds more illegal search terms. Users tend to make more
mistakes the more details you expect them to provide. Users often
entered "X", "?" or "-" in the letter search field, or tried to type
"none". Most of the time, however, they continued to type their entire
address in the first field and then hit return.

When you find yourself having to educate or expect the user to provide
data in a specific way is when you fail as a software engineer. You
have to make it as easy as possible for them to search for their
address - and the thing you're dealing with here is *Google*. People
know how to Google, they Google whatever shit they can and Google just
always manages to figure out pretty much exactly what they need - with
one input field. That's the level your visitors are on.

They shouldn't have to read labels or instructions to search for their
address. Also the meaning of the street letter may be different
depending on what kind of property you live in and whether you're a
company or a private person. All that has to be explained for these
users, and my experience (i.e. actual facts provided by years and
years on monitoring exactly this) shows me that this is not the
correct way to deal with input data.

You are free to disagree all you want, and perhaps visitors to your
applications and your digested search term analysis show you something
else, but mine does not. When I wrote the OP and provided the examples
it wasn't something out of the blue.

>>>> When I say it's inconvenient for the end user, it's not something I
>>>> make up on the spot to be obnoxious.
>>> Nevertheless, your logic is flawed.
>>
>> Well, as long as you're merely saying that instead of actually, you
>> know, substantiating that opinion, I have no idea what you expect me
>> to do with it.
>>
>> Words are easy :)
>
> This is about as much as I will discuss this here because your *actual*
> problem has nothing to do with PHP, and little to do with Regular
> Expressions.

If you saw my followup to my OP you may have seen that I found the
solution elsewhere and that it indeed was solved using PHP and regular
expressions - just as I knew it could be :)

--
Sandman[.net]

Re: preg_match() oddities and question [message #176140 is a reply to message #176139]

Fri, 25 November 2011 22:23

Thomas 'PointedEars'
Messages: 701
Registered: October 2010

Karma: 0

Senior Member

Sandman wrote:

> In article <1438797(dot)UceJUlZ0hu(at)PointedEars(dot)de>,
>> The kind of structured data that you need to enter in a form does not
>> matter. Using cursor keys to move the text cursor between delimiters in
>> a running text always causes more accessibility and usability
>> problems, and consequently information processing problems, than tabbing
>> (or otherwise moving the focus) from one control to another.
>
> Whatever gave you the idea that I am proposing a solution where the
> user would have to "use cursor keys to move the text cursor between
> delimiters in a running text"? I literally have no idea what you are
> talking about, and it has absolutely nothing to do with this thread.

I am imagining a visitor of your Web site entering "foo street 34" (perhaps
as part of a longer text; your next question about phone numbers indicates
just that). Seeing that the "34" needed to be "42" instead, they would tab
back, upon which the entire text field is usually selected, and they have to
press the End key (or Alt+Arrow Right on Macs, IIRC) to get to the end.
Then they would perhaps press the Backspace key twice and type "42". Or
perhaps they needed it to be "134", so they would perhaps press the Arrow
Left or Backspace key twice (or Ctrl/Compose+Left) before they could make
their edit. (It would be comparably tedious with a pointing device, so do
not get me started on that.)

Now, it would be so much easier for them to work with the form, and easier
for you to process the data they entered, if you had the street name and the
house number in separate controls. Then they would tab back to the house
number control, whose text would be selected, and they could immediately fix
their mistake. In addition, you could set up access keys and labels for the
controls with pure HTML so that users can use either way to focus the
control they want to edit. This would work with a graphical browser, a text
browser, a screen reader etc. Lately, it would work best with a mobile
device (anyone who has experienced first-hand how tedious it is to position
the text cursor with a mobile device, especially with a touchscreen, knows
what I mean). You cannot do that with one control.

This is just a simple example, of course, but it shows rather clearly the
benefits associated with providing separate form controls for the components
of structured data. Another good example is separate inputs for the
components of a date where you can, regardless of date format and component
order in that date format, always be sure what is the day, the month, and
the year *as meant by the user*; it is also one instance where client-side
scripts can assist the user in entering data in several ways.

EOD here or, if you want to, F'up2 comp.infosystems.www.authoring.misc

PointedEars
--
realism: HTML 4.01 Strict
evangelism: XHTML 1.0 Strict
madness: XHTML 1.1 as application/xhtml+xml
-- Bjoern Hoehrmann


Re: SOLVED: Re: preg_match() oddities and question [message #176141 is a reply to message #176130]

Fri, 25 November 2011 23:55

Jerry Stuckle
Messages: 2598
Registered: September 2010

Karma: 0

Senior Member

On 11/25/2011 4:32 AM, Sandman wrote:
> In article <mr-5B96D1(dot)12212022112011(at)News(dot)Individual(dot)NET>,
> Sandman <mr(at)sandman(dot)net> wrote:
>
>> So I have this regexp:
>>
>> if (preg_match("/^(.*?)\s*(\d*?)\s*([A-Z,a-z,-]*?)$/", $search, $m)){
>> $streetname = uc_words($m[1]);
>> $streetnumber = trim($m[2]);
>> $streetletter = strtoupper($m[3]);
>> $search = trim($streetname . SPACE . $streetnumber .
>> $streetletter);
>> }
>
> It's funny how asking for help on the net works. Being an "oldie" I
> usually turn to IRC first, mostly because it's the most direct medium;
> knowledgeable people may be online right there and then.
>
> My next option usually is USENET, because I like the format and it's a
> lot like mail.
>
> But I think I need to change all of that thinking. The last few years
> I've gotten the most effective help from sites such as stackoverflow,
> really.
>
> So with my problem above, I didn't find any regexp wizard online on
> IRC so I came here asking it, using examples. That was two days ago. I
> received plenty of responses, some really helpful, some not so much. I
> wouldn't expect immediate help and salvation from CLP really, but it's
> not totally uncommon for people to at least want to help in some
> way.
>
> Since most replies were about the in-data or the database being
> incorrect in the first place, instead of just focusing on the actual
> question asked, I turned to stackoverflow. I posted a question today
> at 10 am.
>
> At 10:19am, I got a clean cut, no frills solution to my actual problem:
>
> preg_match("/([a-zA-Z ]+) ?([0-9]+)? ?([a-zA-Z]+)?/"...)
>
> That captures *all* my examples in the OP, and formats it *exactly*
> like I wanted to.
>
> I'm not trying to disrespect anyone here, but there is a bit too much
> "elitism" and "You're doing it wrong" mentality here, and there always
> has been. And sometimes it's justified, when pure newbies come here and
> ask how to create a guestbook in php. But it seems this mentality
> spills over to posters like me that aren't newbies but still need help.
>
> I've been in this group for about ten years:
>
> <http://groups.google.com/groups/profile?show=more&enc_user=YhtCLA4AAABX2xnmqiyRi8RpNstwOyTw&group=comp.lang.php>
>
> But I'm proficient enough in PHP that I don't post all that often
> about needing help so I'm not a "regular" here and probably easily
> mistaken for a newbie.
>
> See this as a plea to think beyond your own preconceptions about the
> proficiency of a poster, you know, the entire "innocent until proven
> guilty" thing :)
>
> Or, just ignore it altogether. I got my solution so I'm happy either
> way :)
>
>

Look at who most of your answers were from - TNP and "Pointed Head".

Both known trolls in this newsgroup.

I would have tried to help, but I'm not that great with regexes myself.
Normally I try to find other ways of solving my problem. Usually I can
do so.

Please don't condemn usenet because of a couple of well-known trolls.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex(at)attglobal(dot)net
==================


Re: SOLVED: Re: preg_match() oddities and question [message #176147 is a reply to message #176141]

Sat, 26 November 2011 10:21

Sandman
Messages: 32
Registered: August 2011

Karma: 0

Member

In article <jap9ti$onj$1(at)dont-email(dot)me>,
Jerry Stuckle <jstucklex(at)attglobal(dot)net> wrote:

>> Or, just ignore it altogether. I got my solution so I'm happy either
>> way :)
>
> Look at who most of your answers were from - TNP and "Pointed Head".
>
> Both known trolls in this newsgroup.

One drawback of not being a regular is not knowing who is a troll or
not. But the point still stands - I've yet to run into a troll on
stackoverflow.

> I would have tried to help, but I'm not that great with regexes myself.
> Normally I try to find other ways of solving my problem. Usually I can
> do so.
>
> Please don't condemn usenet because of a couple of well-known trolls.

It was not my intention to do that, and I apologize if that's how it
could be interpreted. It was directed at those that did reply but were
unable to help or unable to stay on topic. I have no problems
accepting that these people were trolls.

--
Sandman[.net]

