+ 1

Extract data from the PDF

I am using the PyPDF2 library and need to extract individual fields (such as date of post, likes, and shares) from my PDF and save that data in CSV or Excel format. The PDF is full of website data. How can I achieve this? Please suggest some solutions.

python

16th Mar 2017, 5:02 AM

Pravi Praveen

2 Answers

+ 7

Yes, it would be best if you knew the structure of the PDF document in question. Otherwise, you still have regex to help with the data mining, but it might not prove 100% successful...
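
To make the regex idea concrete, here is a minimal sketch. It assumes the extracted text contains entries shaped like "Date: 12 Mar 2017 Likes: 45 Shares: 7"; that layout, the pattern, and the helper names (`pdf_to_text`, `parse_posts`, `write_csv`) are made up for illustration, so the pattern would need adjusting to whatever the real PDF text looks like. The `pdf_to_text` helper assumes a PyPDF2 release that provides `PdfReader` and `extract_text()`; older releases used different names.

```python
import csv
import io
import re

# Hypothetical layout: the real PDF text may look quite different,
# so treat this pattern as a starting point, not a finished parser.
POST_PATTERN = re.compile(
    r"Date:\s*(?P<date>\d{1,2} \w{3} \d{4})\s+"
    r"Likes:\s*(?P<likes>\d+)\s+"
    r"Shares:\s*(?P<shares>\d+)"
)


def pdf_to_text(path):
    """Join the text of every page (assumes a PyPDF2 version with PdfReader)."""
    from PyPDF2 import PdfReader  # lazy import: the parsing below works without it

    reader = PdfReader(path)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def parse_posts(text):
    """Return one dict per match with 'date', 'likes' and 'shares' keys."""
    return [match.groupdict() for match in POST_PATTERN.finditer(text)]


def write_csv(rows, out):
    """Write the parsed rows to an open text file (or any file-like object)."""
    writer = csv.DictWriter(out, fieldnames=["date", "likes", "shares"])
    writer.writeheader()
    writer.writerows(rows)


# Demo on a made-up sample string instead of a real PDF.
sample = "Date: 12 Mar 2017 Likes: 45 Shares: 7\nDate: 13 Mar 2017\nLikes: 3 Shares: 0\n"
rows = parse_posts(sample)
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

With a real file you would call `parse_posts(pdf_to_text("your_file.pdf"))` and pass `open("posts.csv", "w", newline="")` to `write_csv`; Excel opens the resulting CSV directly. Entries that do not fit the pattern are silently skipped, which is where the "not 100% successful" part comes in.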

16th Mar 2017, 6:24 AM

Kuba Siekierzyński

+ 3

If you have the full version of Adobe Acrobat, it makes this quite easy. Without it, the process is labor-intensive, because the way tables are embedded in a PDF turns the extracted data into a jumble. There are workarounds (such as copy-pasting into Excel and writing a quick macro to get the data somewhat structured), but without knowing the format of the file before it was converted to a PDF, tables and line breaks complicate things and cause headaches. Maybe try a free trial of Adobe Acrobat and see if that helps? Best of luck.
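
If you stay in Python, one cheap way to take some of the pain out of stray line breaks is to collapse all whitespace before searching the text. This is only a sketch: the `normalize` helper and the sample string are invented for illustration, and flattening the text also throws away any row and column layout, so it helps with simple "label: value" fields more than with real tables.

```python
import re


def normalize(text):
    """Collapse runs of whitespace (including line breaks) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()


# Made-up example of text broken up the way PDF extraction often leaves it.
raw = "Date of post:\n16 Mar\n2017   Likes:\n 12"
clean = normalize(raw)
print(clean)
```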

16th Mar 2017, 5:26 AM

Austin Semerad