java - JSoup - Parse HTML tag by tag
I'm developing a text parser in Java, and I've been asked to enhance it so that it can parse HTML as well. The parser's purpose is to divide the parsed file into 3 other files: one with the words contained in the file, one with the sentences, and one with the questions.
The *.txt part works perfectly, but I've got a problem when parsing HTML.
I create a temporary file with a *.txt extension and pass it to the text parser. If I pass a URL and the linked HTML file is formed like this:
<!doctype html>
<html>
<head>
  ... html here ...
</head>
<body>
  <ul class="some_menu">
    <li class="some_menu_item">n1</li>
    <li class="some_menu_item">n2</li>
    <li class="some_menu_item">n3</li>
  </ul>
  <div>
    question ? sentence . ... other text ...
  </div>
</body>
</html>
The questions file is filled with: n1 n2 n3 question
So I was wondering, is there a way to parse with Jsoup tag by tag, so that I can add a line feed each time a block is closed?
If you need any more information, don't hesitate to ask!
Edit: I should have 3 output files, which are, for example:
One for words:
n1 n2 n3 question sentence ... other words ...
One for sentences:
this sentence
One for questions:
this question
timmym
To get the text in the HTML body, you can use:
Document doc = Jsoup.connect(url).get();
Elements body = doc.select("body");
// Elements is a list, not an array, so use get(0) rather than [0]
String allText = body.get(0).text();
You can then split the text to get each word separately. To get the text in the div tag, you can use:
Elements div = doc.select("div");
String divText = div.get(0).text();
You can then split divText to get each sentence.
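The splitting step mentioned above could look roughly like the following. This is only a minimal sketch using the standard library: the class name TextSplitter and its three methods are made up for illustration, and it assumes that a sentence ends with '.', '!' or '?' followed by whitespace, which will not hold for abbreviations and similar cases.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Jsoup or of the asker's parser.
public class TextSplitter {

    /** Splits text on runs of whitespace; blank input gives an empty list. */
    public static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Splits text into chunks ending with '.', '!' or '?' (terminator kept). */
    public static List<String> sentences(String text) {
        List<String> result = new ArrayList<>();
        // Split on whitespace that directly follows a terminator.
        for (String chunk : text.trim().split("(?<=[.!?])\\s+")) {
            if (!chunk.isEmpty()) {
                result.add(chunk.trim());
            }
        }
        return result;
    }

    /** Keeps only the chunks that end with a question mark. */
    public static List<String> questions(String text) {
        List<String> result = new ArrayList<>();
        for (String sentence : sentences(text)) {
            if (sentence.endsWith("?")) {
                result.add(sentence);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String divText = "Is this a question? This is a sentence.";
        System.out.println(words(divText));
        System.out.println(sentences(divText));
        System.out.println(questions(divText));
    }
}
```

Note that this only works well if the text of different blocks has already been separated; otherwise menu items such as n1 n2 n3 end up glued to the first sentence, which is the problem described in the question.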
Notice that the return type of a select query is a list of Element, i.e., Elements. That's because there can be more than one element matching the select query. In your case, since there is only one element in each case, you can access it by getting index 0 of the returned list.
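To illustrate the point about Elements being a list, here is a small sketch that parses an inline HTML string instead of fetching a URL. The class name SelectExample and its methods are invented for the example; first() is used instead of index 0 so that an empty selection does not throw.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// Hypothetical helper class for illustration only.
public class SelectExample {

    /** Text of the first element matching the selector, or "" if nothing matches. */
    public static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        // first() returns null when the selection is empty
        Element first = matches.first();
        return first == null ? "" : first.text();
    }

    /** Text of every element matching the selector, in document order. */
    public static List<String> allTexts(String html, String cssQuery) {
        List<String> texts = new ArrayList<>();
        // Elements is a List<Element>, so it can be used in a for-each loop
        for (Element match : Jsoup.parse(html).select(cssQuery)) {
            texts.add(match.text());
        }
        return texts;
    }

    public static void main(String[] args) {
        String html = "<ul><li>n1</li><li>n2</li><li>n3</li></ul><div>question ? sentence .</div>";
        System.out.println(firstText(html, "div"));
        System.out.println(allTexts(html, "li"));
    }
}
```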
Edit: In order to iterate through all the elements, check this answer. Basically:
Elements elements = doc.body().select("*");
for (Element element : elements) {
    System.out.println(element.text());
}
There might be elements with no text, though, so you can put a check on that.
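For the original goal of adding a line feed each time a block is closed, one possible approach (a sketch, not necessarily what the answer above had in mind) is to walk the tree with Jsoup's NodeVisitor and append a newline in tail() whenever a block-level element ends. The class name BlockTextExtractor and the method toLines are made up for this example. It puts a space between adjacent text fragments, which may not be what you want for inline markup in the middle of a word.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

// Hypothetical helper class for illustration only.
public class BlockTextExtractor {

    /** Returns the body text with one line per closed block element. */
    public static String toLines(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();

        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is entered: collect non-blank text
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Called when a node is left: end the line after a block element
                if (node instanceof Element && ((Element) node).isBlock()) {
                    if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                        out.append('\n');
                    }
                }
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String html = "<ul><li>n1</li><li>n2</li><li>n3</li></ul><div>question ? sentence .</div>";
        System.out.print(toLines(html));
    }
}
```

With the sample HTML, each menu item and the div text should end up on its own line, so the temporary *.txt file can be fed to the existing text parser without n1 n2 n3 being merged into the first question.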