Bash: Regex matching on multiple lines simultaneously and extracting captured content
I have an XML file in the following format:
<starttag name="aaa" >
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="yyy"/>
</starttag>
<starttag name="bbb" >
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="xxx"/>
</starttag>
<starttag name="ccc" >
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="yyy"/>
</starttag>
..
..
..
I want to extract the name attribute of every starttag that contains an innertag whose value is yyy.
So for the file above, the output should be aaa, ccc. Can I use regex matching for this? I suppose it is possible using lookaheads, but I am not able to create regex patterns that span multiple lines. I know how to use regex on a single line and tried the same approach, but I am not getting the expected output. Any headway on this would be appreciated.
Edit: Though I have used an XML example, I am really trying to learn multiline regex matching, and what I have tried on this file is failing. Please avoid XML-parsing-related solutions.
Update: As per Steven's suggestion, the following worked:
pcregrep -M '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="yyy"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml
grep -Pzo '<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="yyy"\/>(\s|<innertag[^>]*>)*<\/starttag>' file.xml
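Commands of this kind print each whole matching block. To print only the name attributes, one possible sketch is shown below. It assumes GNU grep built with PCRE support (the -P option); the temporary sample file is a hypothetical, shortened stand-in for file.xml. The \K escape drops everything matched so far from the output, and the lookahead checks for a yyy innertag in the same block without printing it.

```shell
# Hypothetical, shortened sample mirroring the structure in the question.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<starttag name="aaa" >
  <innertag name="xxx" value="xxx"/>
  <innertag name="xxx" value="yyy"/>
</starttag>
<starttag name="bbb" >
  <innertag name="xxx" value="xxx"/>
</starttag>
<starttag name="ccc" >
  <innertag name="xxx" value="yyy"/>
</starttag>
EOF

# -P: Perl-compatible regex
# -z: treat the whole file as one record, so a match can span lines
# -o: print only the matched text
# \K discards what was matched before it; (?=...) is a lookahead that
# requires a yyy innertag before the block's closing tag.
names=$(grep -Pzo '<starttag name="\K[^"]*(?="[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="yyy"/>)' "$sample" | tr '\0' '\n')

# With -z, grep separates matches with NUL bytes; tr turns them into newlines.
printf '%s\n' "$names"   # prints aaa and ccc, one per line

rm -f "$sample"
```

Because the lookahead only allows whitespace and innertag elements between the opening tag and the yyy innertag, it cannot run past a closing starttag into the next block.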
An XML parser, especially one that supports XPath, is going to be far easier and more stable, but if you must insist on using regex, here's a pattern that will work for the sample input you provided:
<starttag name="([^"]*)"[^>]*>(\s|<innertag[^>]*>)*<innertag name="[^"]*" value="yyy"\/>(\s|<innertag[^>]*>)*<\/starttag>
It's not going to work for all variations of well-formed XML documents, but as long as yours is consistently formatted like your example, it should be "okay".
By default, a regex matches across multiple lines. There is an option you can set to tell it to process one line at a time, but that's not turned on by default. The real trick is that the . pattern does not match newline characters, so if you want to match any character, including newlines, you need to use .|\n or a negated character class such as [^>].
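The difference between these three forms can be checked with a small experiment. This is only a sketch, assuming GNU grep built with PCRE support (-P); try is a hypothetical helper name, and -z is used so that grep sees the two-line input as a single record.

```shell
# try PATTERN: report whether PATTERN matches the two-line input "a<newline>b".
# -P: Perl-compatible regex, -z: whole input is one record, -q: quiet.
try() {
  if printf 'a\nb' | grep -Pzq "$1"; then
    echo "match"
  else
    echo "no match"
  fi
}

dot=$(try 'a.b')        # . does not match the newline between a and b
alt=$(try 'a(.|\n)b')   # the explicit \n alternative covers the newline
neg=$(try 'a[^>]b')     # a negated class matches any character except >

echo "a.b      -> $dot"
echo "a(.|\\n)b -> $alt"
echo "a[^>]b   -> $neg"
```

Under these assumptions the first pattern reports no match and the other two report a match, which is why the pattern above relies on \s and [^>] rather than on the dot.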