How to make the Nutch crawler crawl only specific URLs?

- January 15, 2013

I'm aware that a regex can be used to restrict which pages are downloaded. But I want to crawl only the pages whose anchor link, on a given page in my set of URLs, matches certain words. For example, I have an array of words ['computer','software','hardware','operating system','thread'], and I want to crawl only the URLs whose anchor text contains one of the words in this array. How should I implement this kind of logic in Nutch? Thank you.

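As pointed out, URL filters are of no use here because they deal with the URL string only. You can achieve what you described by implementing a custom HtmlParseFilter, in which you'd have access to the ParseData for the current document. It contains the outlinks, which you can filter based on their anchor value.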

There are loads of examples online of how to write a plugin and/or a custom HtmlParseFilter; see, for instance, MetaTagsParser.
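
The anchor-matching step itself can be sketched in plain Java, independent of Nutch. This is only a minimal illustration under stated assumptions: the `Link` class is a hypothetical stand-in for a Nutch outlink (a target URL plus its anchor text), and `AnchorKeywordFilter`, `matches`, and `keepMatching` are made-up names. In a real plugin, the same check would presumably run inside the custom HtmlParseFilter over the outlinks taken from the document's ParseData, keeping only those that pass.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnchorKeywordFilter {

    /** Hypothetical stand-in for a Nutch outlink: target URL plus anchor text. */
    public static final class Link {
        public final String toUrl;
        public final String anchor;

        public Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    // The word list from the question.
    private static final List<String> KEYWORDS = Arrays.asList(
            "computer", "software", "hardware", "operating system", "thread");

    /** True if the anchor text contains at least one keyword (case-insensitive substring match). */
    public static boolean matches(String anchor) {
        if (anchor == null) {
            return false;
        }
        String lower = anchor.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (lower.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    /** Returns only the links whose anchor text matches, preserving their order. */
    public static List<Link> keepMatching(List<Link> outlinks) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (matches(link.anchor)) {
                kept.add(link);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Link> outlinks = Arrays.asList(
                new Link("http://example.com/a", "Operating System basics"),
                new Link("http://example.com/b", "Cooking recipes"),
                new Link("http://example.com/c", "Multi-threaded code"));
        // Print the target URL of every link that survives the filter.
        for (Link link : keepMatching(outlinks)) {
            System.out.println(link.toUrl);
        }
    }
}
```

Because this is a substring match, "thread" also accepts anchors such as "threads" or "multi-threaded"; a word-boundary regex would be stricter if that is too permissive.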
