PHP - Issues with ungreedy match
In PHP, I'm matching the text here http://pastebin.com/pfjegqpd with the following regex:
preg_match('#(.*(?s))(particella |particelle |p\.|part\.|p |part |mappale |mapp\.|mapp |n\.|\*) *(\d+[\d /\p{Pd}]*)($|.{0,20}(?s)(graffati|particella |particelle |p\.|.*part\.|p |part |mappale |mapp\.|mapp |n\.|subalterno |subalterni |sub\.|s\.|sub |s |\bcat\b|\bcategoria\b|\brendita\b|\bvani\b|\bconsistenza\b|\br\.c\.\b))#i', $txt, $matches, PREG_OFFSET_CAPTURE, $offset);
With $offset = 944, I'm getting the output in $matches shown below. I expected the number group to match 1184, but it matches 4 instead. I tried (?sU) with no luck.
$matches = array(6) {
  [0]=> array(2) { [0]=> string(59) "* 1184 sub.702, vioolo san vincenzo n.4, piano t, categoria" [1]=> int(1226) }
  [1]=> array(2) { [0]=> string(36) "* 1184 sub.702, vioolo san vincenzo " [1]=> int(1226) }
  [2]=> array(2) { [0]=> string(2) "n." [1]=> int(1262) }
  [3]=> array(2) { [0]=> string(1) "4" [1]=> int(1264) }
  [4]=> array(2) { [0]=> string(20) ", piano t, categoria" [1]=> int(1265) }
  [5]=> array(2) { [0]=> string(9) "categoria" [1]=> int(1276) }
}
$offset = int(944)
Turning my comment into an answer: the point is that there are greedy subpatterns in your pattern: .* and .{0,20}. They should be turned into lazy subpatterns. Otherwise the leading greedy subpattern "gobbles" as much text as it can and only gives back the minimum the following groups require, so the match lands on the last possible marker (here n.) and the number group captures just 4 instead of 1184.
See the IDEONE demo, and use
$re = '~(.*?(?s))(particella |particelle |p\.|part\.|p |part |mappale |mapp\.|mapp |n\.|\*) *(\d+[\d /\p{Pd}]*)($|.{0,20}?(?s)(graffati|particella |particelle |p\\.|.*part\\.|p |part |mappale |mapp\.|mapp |n\.|subalterno |subalterni |sub\.|s\.|sub |s |\bcat\b|\bcategoria\b|\brendita\b|\bvani\b|\bconsistenza\b|\br\.c\.\b))~';
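To make the greedy versus lazy difference easier to see, here is a minimal sketch on the sample line from the question's dump. The two patterns are deliberately cut down to two markers (n. and *) and are illustrative assumptions, not the full pattern above.

```php
<?php
// Sample line copied from the $matches dump in the question.
$line = '* 1184 sub.702, vioolo san vincenzo n.4, piano t, categoria';

// Hypothetical, simplified patterns: only the "n." and "*" markers are kept.
$greedy = '~(.*)(n\.|\*) *(\d+)~';   // .* runs to the end, then backtracks
$lazy   = '~(.*?)(n\.|\*) *(\d+)~';  // .*? expands one character at a time

preg_match($greedy, $line, $g);
preg_match($lazy, $line, $l);

echo $g[3], "\n"; // 4    (last marker followed by digits wins)
echo $l[3], "\n"; // 1184 (first marker followed by digits wins)
```

The greedy version backtracks from the end of the line and stops at the first position where the rest of the pattern fits, which is n.4. The lazy version tries the shortest prefix first and stops at * 1184.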
Since the pattern is fragile, it can be optimized a bit, and the literal spaces can be replaced with \s everywhere, since the intent is to match whitespace in those places:
(?s)(.*?)(particell[ea]\s+|p(?:art)?[.\s]+|mapp(?:(?:ale)?\s+|\.)|n\.|\*)\s*(\d+[\d\s/\p{Pd}]*)($|.{0,20}?(graffati|particell[ae]\s+|p(?:art)?[.\s]+|mapp(?:(?:ale)?\s+|\.)|n\.|subaltern[oi]\s+|s(?:ub)?[.\s]+|\bcat(?:egoria)?\b|\brendita\b|\bvani\b|\bconsistenza\b|\br\.c\.\b))
See the regex demo and the IDEONE demo.
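As a rough usage sketch, the optimized pattern can be run with preg_match and PREG_OFFSET_CAPTURE as in the question. The sample line is the one from the question's dump, and the ~ delimiters and variable names are assumptions made for illustration. The real input is the full pastebin text searched from $offset = 944, so offsets there will differ.

```php
<?php
// Sample line copied from the $matches dump in the question.
$line = '* 1184 sub.702, vioolo san vincenzo n.4, piano t, categoria';

// The optimized pattern, wrapped in ~ delimiters (an assumption).
$re = '~(?s)(.*?)(particell[ea]\s+|p(?:art)?[.\s]+|mapp(?:(?:ale)?\s+|\.)|n\.|\*)'
    . '\s*(\d+[\d\s/\p{Pd}]*)'
    . '($|.{0,20}?(graffati|particell[ae]\s+|p(?:art)?[.\s]+|mapp(?:(?:ale)?\s+|\.)'
    . '|n\.|subaltern[oi]\s+|s(?:ub)?[.\s]+|\bcat(?:egoria)?\b|\brendita\b|\bvani\b'
    . '|\bconsistenza\b|\br\.c\.\b))~';

if (preg_match($re, $line, $m, PREG_OFFSET_CAPTURE)) {
    // Each group is a [text, byte offset] pair because of PREG_OFFSET_CAPTURE.
    $number = trim($m[3][0]); // group 3 can carry trailing whitespace
    $start  = $m[3][1];
    echo $number, ' at offset ', $start, "\n"; // 1184 at offset 2
}
```

Group 3 allows whitespace inside the number run, so it can end with a trailing space. Trimming the capture before using it avoids surprises.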