Parsing a Complex PDF Document with C#

- June 15, 2014

See the attached K-1 document. I have attempted numerous tweaks with the iTextSharp library, but I haven't had any success in loading the data correctly.

Ideally, I would like to parse the document the way humans read it: one text box at a time, reading its contents.

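    // Requires: using System; using System.Text;
    //           using iTextSharp.text.pdf; using iTextSharp.text.pdf.parser;
    var reader = new PdfReader(file, Encoding.ASCII.GetBytes(password));
    string[] lines;
    var strategy = new LocationTextExtractionStrategy();
    // Extract the text of page 1, ordered by position on the page
    string currentPageText = PdfTextExtractor.GetTextFromPage(reader, 1, strategy);
    lines = currentPageText.Split(new string[] { "\r\n", "\n" }, StringSplitOptions.None);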

I tried playing with annotation parsing but didn't have any luck.

I'm a newbie and may be looking in the wrong place. Can anyone guide me in the right direction?

Thanks a lot.

The first question is whether the form is an electronic one or a scanned one. The latter makes data extraction harder, because it has to involve OCR too.

If you have an electronic PDF and the forms are all similar, why not use the following strategy:

Store the coordinates of each "box" in a config file.
Process the documents and extract the text from every "box" (i.e. region).
Additionally, process the extracted text with regular expressions to separate the name from the address (or set the region to be read line by line).
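
As a rough illustration of these steps with iTextSharp 5, the sketch below restricts text extraction to one rectangle by wrapping LocationTextExtractionStrategy in a RegionTextRenderFilter. The RegionExtractor class, the box names, and the coordinates are made up for the example; real values would have to be measured on the actual form and loaded from your config file.

```csharp
using System;
using System.Collections.Generic;
using iTextSharp.text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper, not part of iTextSharp.
public static class RegionExtractor
{
    // Placeholder region table: { llx, lly, urx, ury } in PDF points.
    // The names and numbers are assumptions; load the real ones from a config file.
    public static readonly Dictionary<string, float[]> Regions =
        new Dictionary<string, float[]>
        {
            { "BoxA", new float[] { 36f, 600f, 300f, 660f } },
            { "BoxB", new float[] { 36f, 520f, 300f, 590f } },
        };

    // Returns only the text whose position falls inside the given rectangle.
    public static string ExtractRegion(PdfReader reader, int page, float[] box)
    {
        var rect = new Rectangle(box[0], box[1], box[2], box[3]);
        var filters = new RenderFilter[] { new RegionTextRenderFilter(rect) };
        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
            new LocationTextExtractionStrategy(), filters);
        return PdfTextExtractor.GetTextFromPage(reader, page, strategy);
    }

    // Splits a region's text into lines so that it can be read line by line.
    public static string[] SplitLines(string text)
    {
        return text.Split(new[] { "\r\n", "\n" },
                          StringSplitOptions.RemoveEmptyEntries);
    }
}
```

Calling RegionExtractor.ExtractRegion(reader, 1, RegionExtractor.Regions["BoxA"]) would then return the text of that one box. Keep in mind that PDF coordinates are measured from the bottom-left corner of the page by default, so boxes near the top of the form have the larger y values.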

If you have a few variations of the form, you can check the first box to extract the name of the form and then load the appropriate settings file (the one that contains the set of regions for that variation).
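
A minimal sketch of that lookup, using only the .NET base library, might look like the following. The FormSettings class, the form titles, and the settings file names are illustrative assumptions, not something taken from the attached document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper that maps the text of the first box to a settings file.
public static class FormSettings
{
    // Assumed form titles and file names; replace them with your own variations.
    private static readonly Dictionary<string, string> SettingsByForm =
        new Dictionary<string, string>
        {
            { "Schedule K-1 (Form 1065)", "k1-1065.regions.json" },
            { "Schedule K-1 (Form 1120S)", "k1-1120s.regions.json" },
        };

    // Returns the settings file for the first known form title found in the
    // extracted text, or null if no known title is present.
    public static string ResolveSettingsFile(string firstBoxText)
    {
        // Collapse line breaks and repeated spaces, since extracted text
        // often wraps a title across several lines.
        string normalized = string.Join(" ",
            (firstBoxText ?? string.Empty).Split(
                (char[])null, StringSplitOptions.RemoveEmptyEntries));

        foreach (var entry in SettingsByForm)
        {
            if (normalized.IndexOf(entry.Key, StringComparison.OrdinalIgnoreCase) >= 0)
            {
                return entry.Value;
            }
        }
        return null;
    }
}
```

Returning null for an unknown title lets the caller decide whether to skip the document or flag it for manual review.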

This approach should work with any PDF library.
