java - JSoup - Parse HTML tag by tag -

- July 15, 2010

I'm developing a text parser in Java, and I've been asked to enhance it so that it can also parse HTML. The parser's purpose is to divide the parsed file into three other files: one with the words contained in the file, one with the sentences, and one with the questions.

The *.txt part works perfectly, but I have a problem when parsing HTML.

I create a temporary file with a *.txt extension and pass it to the text parser. The problem appears if I pass a URL whose HTML file is formed like this:

    <!doctype html>
    <html>
        <head>
            ... html here ...
        </head>
        <body>
            <ul class="some_menu">
                <li class="some_menu_item">n1</li>
                <li class="some_menu_item">n2</li>
                <li class="some_menu_item">n3</li>
            </ul>
            <div>
                question ?
                sentence .
                ... other text ...
            </div>
        </body>
    </html>

The questions file is then filled with: n1 n2 n3 question

So I was wondering: is there a way to parse with jsoup tag by tag, so that I can add a line feed each time a block is closed?

If you need more information, don't hesitate to ask!

Edit: I should have three output files, which are, for example:

One for the words:

n1 n2 n3 question sentence ... other words ...

One for the sentences:
```
this sentence 
```
One for the questions:
```
this question 
```

timmym

To get the text in the HTML body, you can use:

    Document doc = Jsoup.connect(url).get();
    Elements body = doc.select("body");
    String allText = body.get(0).text();

You can then split allText to get each word separately. To get the text in the div tag, you can use:

    Elements div = doc.select("div");
    String divText = div.get(0).text();

You can then split divText to get each sentence.
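As a rough sketch of that splitting step, the snippet below uses only the Java standard library. The class name TextSplitter and its two methods are made up for illustration, and the rule "a sentence ends at '.', '!' or '?'" is an assumption that will not hold for abbreviations or decimal numbers.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TextSplitter {
    // Assumption: a sentence is a run of characters ending with '.', '!' or '?'.
    private static final Pattern SENTENCE = Pattern.compile("[^.!?]+[.!?]");

    /** Splits on whitespace and strips punctuation around each token. */
    public static List<String> words(String text) {
        List<String> result = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            String word = token.replaceAll("^\\p{Punct}+|\\p{Punct}+$", "");
            if (!word.isEmpty()) {
                result.add(word);
            }
        }
        return result;
    }

    /** Returns the sentences whose last character is the given terminator. */
    public static List<String> sentencesEndingWith(String text, char terminator) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SENTENCE.matcher(text);
        while (matcher.find()) {
            String sentence = matcher.group().trim();
            if (sentence.charAt(sentence.length() - 1) == terminator) {
                result.add(sentence);
            }
        }
        return result;
    }
}
```

Calling sentencesEndingWith(divText, '?') would then feed the questions file and sentencesEndingWith(divText, '.') the sentences file, while words(allText) feeds the words file.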

Notice that the return type of a select query is a list of Element objects, i.e., Elements. That's because more than one element can match the select query. In your case, since there is only one matching element each time, you can access it at index 0 of the returned list.
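If a query might match nothing, get(0) throws an exception, so a slightly more defensive variant may be useful. The sketch below assumes jsoup is on the classpath; FirstMatchText and firstText are hypothetical names used only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class FirstMatchText {
    /** Returns the text of the first element matching cssQuery, or "" if nothing matches. */
    public static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Elements matches = doc.select(cssQuery);
        // first() returns null when the list is empty, unlike get(0), which throws.
        Element first = matches.first();
        return first == null ? "" : first.text();
    }
}
```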

Edit: in order to iterate through all the elements, check this answer. Basically:

    Elements elements = doc.body().select("*");
    for (Element element : elements) {
        System.out.println(element.text());
    }

There might be elements with no text, though, so you can add a check for that (for example with element.hasText()).
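One caveat with the loop over select("*"): element.text() also includes the text of all descendants, so the same words are printed once for every ancestor. For the line feed asked about in the question, a possible alternative is to walk the tree with jsoup's NodeVisitor and append a newline whenever a block-level element closes. This is only a sketch under assumptions: BlockTextExtractor and extract are invented names, and it relies on jsoup's own notion of a block tag (Element.isBlock()), which may not match every page's CSS.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;
import org.jsoup.select.NodeVisitor;

public class BlockTextExtractor {
    /** Returns the body text with a line feed after each closed block element. */
    public static String extract(String html) {
        Document doc = Jsoup.parse(html);
        final StringBuilder out = new StringBuilder();
        doc.body().traverse(new NodeVisitor() {
            @Override
            public void head(Node node, int depth) {
                // Called when a node is first reached: collect non-blank text.
                if (node instanceof TextNode) {
                    String text = ((TextNode) node).text().trim();
                    if (!text.isEmpty()) {
                        if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                            out.append(' ');
                        }
                        out.append(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Called when a node is closed: end the line after a block element.
                if (node instanceof Element && ((Element) node).isBlock()
                        && out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                    out.append('\n');
                }
            }
        });
        return out.toString();
    }
}
```

With the sample page from the question, each li item ends up on its own line, so the menu entries are no longer glued to the first question when the result is written to the temporary *.txt file.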
