Parsing a complex PDF document with C#


See the attached K-1 document. I have attempted numerous tweaks with the iTextSharp library but haven't had any success in loading the data correctly.

Ideally I would like to parse the document similar to how humans read it: one text box at a time, reading its contents.

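       using System;
       using System.Text;
       using iTextSharp.text.pdf;
       using iTextSharp.text.pdf.parser;

       // file and password are supplied by the caller
       var reader = new PdfReader(file, Encoding.ASCII.GetBytes(password));
       var strategy = new LocationTextExtractionStrategy();
       // Extract the text of page 1, ordered by position on the page
       string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
       string[] lines = currentPageText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);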

I tried playing with annotation parsing but didn't have any luck.

I'm a newbie and may be looking in the wrong place. Can anyone guide me in the right direction?

Thanks a lot.

[Image: the attached K-1 form]

The first question is whether the form is electronic or a scanned one. The latter makes data extraction harder, because it would have to involve OCR too.
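As a rough first check, you can test whether a page yields any extractable text at all. The sketch below assumes iTextSharp 5.x; the names `PdfKind`, `LooksScanned` and `PageLooksScanned` are made up for illustration. An empty result is only a heuristic: a scanned page may still carry an OCR text layer, and an electronic page can simply be blank.

```csharp
using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

static class PdfKind
{
    // Heuristic (an assumption, not a guarantee): no extractable text
    // suggests the page is a scanned image and would need OCR.
    public static bool LooksScanned(string extractedText)
    {
        return string.IsNullOrWhiteSpace(extractedText);
    }

    // Extracts the text of one page (1-based page number) and applies the heuristic.
    public static bool PageLooksScanned(PdfReader reader, int pageNumber)
    {
        string text = PdfTextExtractor.GetTextFromPage(reader, pageNumber);
        return LooksScanned(text);
    }
}
```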

In case you have an electronic PDF, and if you have many similar forms, why not use the following strategy:

  • Store the coordinates of each "box" in a config file.
  • Process the documents and extract the text from every "box" (i.e. region).
  • Additionally, process the extracted text with regular expressions to separate the name from the address (or you may set the region to read the text line by line).
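The three steps above could be sketched roughly as follows with iTextSharp 5.x, using its `RegionTextRenderFilter` and `FilteredTextRenderListener` classes. The config format (`name=llx,lly,urx,ury`), the types `FormRegion` and `RegionExtractor`, and the rule "first line is the name, the rest is the address" are all assumptions made for this example, not something the form or the library prescribes. Coordinates are in PDF points, with the origin at the bottom-left corner of the page.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.Text.RegularExpressions;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// One named box on the form (hypothetical type for this sketch).
class FormRegion
{
    public string Name;
    public float Llx, Lly, Urx, Ury; // lower-left and upper-right corners
}

static class RegionExtractor
{
    // Step 1: read regions from config lines shaped like "name=llx,lly,urx,ury".
    public static List<FormRegion> ParseConfig(IEnumerable<string> lines)
    {
        var regions = new List<FormRegion>();
        foreach (string line in lines)
        {
            if (string.IsNullOrWhiteSpace(line)) continue;
            string[] parts = line.Split(new[] { '=' }, 2);
            string[] c = parts[1].Split(',');
            regions.Add(new FormRegion
            {
                Name = parts[0].Trim(),
                Llx = float.Parse(c[0].Trim(), CultureInfo.InvariantCulture),
                Lly = float.Parse(c[1].Trim(), CultureInfo.InvariantCulture),
                Urx = float.Parse(c[2].Trim(), CultureInfo.InvariantCulture),
                Ury = float.Parse(c[3].Trim(), CultureInfo.InvariantCulture)
            });
        }
        return regions;
    }

    // Step 2: extract the text that falls inside each region of one page.
    public static Dictionary<string, string> Extract(
        PdfReader reader, int pageNumber, IEnumerable<FormRegion> regions)
    {
        var result = new Dictionary<string, string>();
        foreach (FormRegion r in regions)
        {
            var box = new iTextSharp.text.Rectangle(r.Llx, r.Lly, r.Urx, r.Ury);
            var strategy = new FilteredTextRenderListener(
                new LocationTextExtractionStrategy(),
                new RegionTextRenderFilter(box));
            result[r.Name] =
                PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy).Trim();
        }
        return result;
    }

    // Step 3: treat the first line of a box as the name and the
    // remaining lines as the address (an assumption about the layout).
    public static void SplitNameAndAddress(
        string boxText, out string name, out string address)
    {
        string[] lines = Regex.Split(boxText.Trim(), @"\s*\r?\n\s*");
        name = lines[0];
        address = string.Join(", ", lines, 1, lines.Length - 1);
    }
}
```

A config line might look like `partnerName=36,560,300,620`; those numbers are invented and would have to be measured on the real form.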

In case you have a few variations of the form, you may check the first box to extract the name of the form and then load the appropriate settings file (which contains the set of regions for that variation).

This approach should work with any PDF library.
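Picking the settings file could be as simple as a lookup on the text of that first box. This is only a sketch: the class `FormSettings`, the title fragments and the file names are hypothetical and would need to match your actual forms.

```csharp
using System;
using System.Collections.Generic;

static class FormSettings
{
    // Hypothetical mapping: fragment of the form title -> settings file.
    static readonly Dictionary<string, string> SettingsByTitle =
        new Dictionary<string, string>
        {
            { "Form 1065", "k1-form1065-regions.cfg" },
            { "Form 1041", "k1-form1041-regions.cfg" }
        };

    // Returns the settings file for the first known fragment found in
    // the title box text, or null when the variation is not recognised.
    public static string ChooseSettingsFile(string titleBoxText)
    {
        foreach (KeyValuePair<string, string> pair in SettingsByTitle)
        {
            if (titleBoxText.IndexOf(pair.Key, StringComparison.OrdinalIgnoreCase) >= 0)
                return pair.Value;
        }
        return null;
    }
}
```

Returning null for an unknown title lets the caller decide whether to skip the document or flag it for manual review.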

