Strip parsed HTML text from HTML comments using AngleSharp


I use the code below to strip specific HTML tags from parsed HTML using AngleSharp, since using regular expressions for such jobs is not recommended. (AngleSharp is maintained and HtmlAgilityPack is not, hence I have been moving away from the latter.)

It works great, but I want to remove HTML comments as well, meaning whatever is found between the <!-- and --> delimiters.

How can this be achieved using AngleSharp? QuerySelector does not seem suited to this, since selectors match elements and a comment is not an element.

    private string ExtractContentFromHtml(string input)
    {
        List<string> tagsToRemove = new List<string>
        {
            "script",
            "style",
            "img"
        };

        var config = Configuration.Default.WithJavaScript();

        HtmlParser hp = new HtmlParser(config);
        List<IElement> tags = new List<IElement>();
        List<string> nodeTypes = new List<string>();
        var hpResult = hp.Parse(input);

        try
        {
            // Collect every matching element first, then remove them all.
            foreach (var tagToRemove in tagsToRemove)
                tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));

            foreach (var tag in tags)
                tag.Remove();
        }
        catch (Exception ex)
        {
            _errors.Add(string.Format("error in cleaning html. {0}", ex.Message));
        }

        var content = hpResult.QuerySelector("body");

        return content.InnerHtml;
    }

After playing with the code above and AngleSharp's API, I came up with the following working solution. I thought about replacing the tag-removing code and relying solely on text nodes, but that is not recommendable: text nodes can be generated on the fly by JavaScript code, so the script nodes need to be removed anyway. I left the style and img removals in as well.

It is worth mentioning that the DOM classifies nodes according to their types, and one is able to find comments by searching for nodes of type 8 (the comment node type).

    private string ExtractContentFromHtml(string input)
    {
        List<string> tagsToRemove = new List<string>
        {
            "script",
            "style",
            "img"
        };

        var config = Configuration.Default.WithJavaScript();

        HtmlParser hp = new HtmlParser(config);
        List<IElement> tags = new List<IElement>();
        List<string> nodeTypes = new List<string>();
        var hpResult = hp.Parse(input);

        List<string> textNodesValues = new List<string>();
        try
        {
            foreach (var tagToRemove in tagsToRemove)
                tags.AddRange(hpResult.QuerySelectorAll(tagToRemove));

            foreach (var tag in tags)
                tag.Remove();

            /*
            The following does not work, because text nodes that are not
            immediate children are not considered:
            textNodesValues = hpResult.All.Where(n => n.NodeType == NodeType.Text).Select(n => n.TextContent).ToList();
            */

            // Walk the whole document and visit text nodes only, so
            // comment nodes never reach the output.
            var treeWalker = hpResult.CreateTreeWalker(hpResult, FilterSettings.Text);

            var textNode = treeWalker.ToNext();
            while (textNode != null)
            {
                textNodesValues.Add(textNode.TextContent);
                textNode = treeWalker.ToNext();
            }
        }
        catch (Exception ex)
        {
            _errors.Add(string.Format("error in cleaning html. {0}", ex.Message));
        }

        return string.Join(" ", textNodesValues);
    }
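
If the markup itself should be kept and only the comments dropped, the node-type remark above suggests a more direct route: walk the document with a filter that visits comment nodes and detach each one. The following is only a minimal sketch, not the author's solution. CommentStripper and RemoveComments are hypothetical names, and it assumes the same AngleSharp 0.9.x API used above (HtmlParser in AngleSharp.Parser.Html with a Parse method); newer releases appear to have moved the parser to AngleSharp.Html.Parser and renamed the method to ParseDocument.

```csharp
using System;
using System.Collections.Generic;
using AngleSharp.Dom;
using AngleSharp.Parser.Html; // assumption: AngleSharp 0.9.x namespace layout

// Hypothetical helper class, not part of AngleSharp or of the answer above.
public static class CommentStripper
{
    public static string RemoveComments(string html)
    {
        var document = new HtmlParser().Parse(html);

        // Visit comment nodes only (node type 8), anywhere in the document.
        var walker = document.CreateTreeWalker(document, FilterSettings.Comment);

        // Collect first and remove afterwards, so the tree is not
        // modified while the walker is still traversing it.
        var comments = new List<INode>();
        for (var node = walker.ToNext(); node != null; node = walker.ToNext())
            comments.Add(node);

        foreach (var comment in comments)
            comment.Parent?.RemoveChild(comment);

        return document.Body.InnerHtml;
    }

    public static void Main()
    {
        // The printed markup should no longer contain either comment.
        Console.WriteLine(RemoveComments("<p>Hello<!-- hidden --> world</p><!-- trailing -->"));
    }
}
```

Unlike the text-node approach, this keeps the element structure intact, so it can be combined with the script, style and img removal before reading InnerHtml from the body.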
