How to make the Nutch crawler crawl only specific URLs?


I'm aware that a regex can be used to restrict the pages that are downloaded. But I want to crawl only the pages whose anchor link in a given page is in a set of URLs. For example, I have an array of words ['computer','software','hardware','operating system','thread'], and I want to crawl only the URLs whose anchor text contains one of the words in this array. How should I implement this kind of logic in Nutch? Thank you.

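As you pointed out, URL filters aren't of use here, as they deal with the URL string only. You can achieve what you described by implementing a custom HtmlParseFilter, in which you'd have access to the ParseData for the current document. It contains the outlinks, which you can filter based on the anchor value.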
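To make the filtering step concrete, here is a minimal standalone sketch of the anchor-matching logic only. It does not use the Nutch API: `Link` is a hypothetical stand-in for Nutch's `Outlink` (target URL plus anchor text), and `keepMatching` is an illustrative helper name, not something Nutch provides. In a real plugin, I'd assume the same loop would run inside your HtmlParseFilter over the outlinks taken from the document's ParseData, with the filtered list written back.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class AnchorKeywordFilter {

    /** Hypothetical stand-in for a Nutch outlink: target URL plus anchor text. */
    static final class Link {
        final String toUrl;
        final String anchor;

        Link(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }
    }

    /** Keeps only the links whose anchor text contains at least one keyword (case-insensitive). */
    static List<Link> keepMatching(List<Link> outlinks, List<String> keywords) {
        List<Link> kept = new ArrayList<>();
        for (Link link : outlinks) {
            if (link.anchor == null) {
                continue; // links without anchor text can never match
            }
            String anchor = link.anchor.toLowerCase(Locale.ROOT);
            for (String keyword : keywords) {
                if (anchor.contains(keyword.toLowerCase(Locale.ROOT))) {
                    kept.add(link);
                    break; // one matching keyword is enough
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> keywords = Arrays.asList(
                "computer", "software", "hardware", "operating system", "thread");
        List<Link> outlinks = Arrays.asList(
                new Link("http://example.com/os", "Operating System basics"),
                new Link("http://example.com/cooking", "Best pasta recipes"),
                new Link("http://example.com/none", null));

        // Print the target URL of every outlink that survives the filter.
        for (Link link : keepMatching(outlinks, keywords)) {
            System.out.println(link.toUrl);
        }
    }
}
```

Note that this is plain substring matching, so "thread" would also match an anchor such as "threads" or "threadbare". If you need whole-word matching, a word-boundary regex would be one option.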

There are loads of examples online on how to write a plugin and/or a custom HtmlParseFilter; see for instance MetaTagsParser.

