How to extract the text of all the pages of a PDF using pdfplumber? | Sololearn: Learn to code for FREE!
0

How to extract the text of all the pages of a PDF using pdfplumber?

Can anyone help me? I need the source code.

28th Mar 2021, 4:37 AM
Ujjawal Gupta
13 Answers
+ 1
    import pdfplumber as pdfp
    from gtts import gTTS

    pdfToString = ""
    with pdfp.open('/storage/emulated/0/Download/filename.pdf') as pdf:
        for page in pdf.pages:
            # extract_text() may return None or '' for a page without text
            text = page.extract_text() or ""
            print(text)
            pdfToString += text

    # Convert the collected text to speech and save it as an MP3 file
    pdfToSpeech = gTTS(pdfToString, lang='de')
    pdfToSpeech.save('/storage/emulated/0/Music/pdfToSpeech_deutsch.mp3')

This is what I put together quickly from the gtts documentation. You can choose the language via the lang parameter of gTTS (for me it is German, 'de'; English would be 'en'). For more languages, see the documentation.
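The text-to-speech step above can also be wrapped in a small function. This is only a sketch: the name text_to_mp3 and its parameters are made up for illustration, and it assumes the gtts package is installed. It checks for empty text first, since a PDF without a text layer would leave nothing to speak.

```python
def text_to_mp3(text: str, out_path: str, lang: str = "en") -> None:
    """Save `text` as spoken audio in an MP3 file (hypothetical helper)."""
    cleaned = text.strip()
    if not cleaned:
        # Nothing was extracted from the PDF, so there is nothing to speak
        raise ValueError("no text to speak")

    # Imported here so the empty-text check works even without gtts installed
    from gtts import gTTS

    # lang is a language code such as 'en' or 'de'
    gTTS(cleaned, lang=lang).save(out_path)
```

For example, text_to_mp3(pdfToString, 'speech.mp3', lang='de') would do the same as the last two lines of the snippet above, with 'speech.mp3' being a placeholder path.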
30th Mar 2021, 5:46 AM
G B
+ 1
Hi Ujjawal Gupta, try this:

    import pdfplumber as pdfp

    with pdfp.open('/storage/emulated/0/Download/filename.pdf') as pdf:
        for page in pdf.pages:
            print(page.extract_text())

Of course, you should adjust the file path passed to the open() method. Hope this helps.
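If you want the text of all pages as one string instead of printing it page by page, something like the following might work. It is a minimal sketch: join_page_texts and extract_pdf_text are names invented here, not part of pdfplumber, and the `or ""` guard assumes that extract_text() can return None or an empty string for a page without text.

```python
def join_page_texts(pages, separator="\n"):
    """Concatenate the text of every page, treating empty pages as ''."""
    parts = []
    for page in pages:
        # Fall back to '' when a page yields None or an empty string
        parts.append(page.extract_text() or "")
    return separator.join(parts)


def extract_pdf_text(path):
    """Open the PDF at `path` and return the text of all its pages."""
    # Imported here so join_page_texts can be used without pdfplumber
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```

Calling extract_pdf_text with the path to your file would then return the whole document as a single string, with one line break between pages.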
28th Mar 2021, 10:09 PM
G B
+ 1
Thank you, buddy. You just saved my life!
30th Mar 2021, 3:23 AM
Ujjawal Gupta
+ 1
On my PC, pyttsx3 runs about 6 times faster than gtts, although the speech from gtts is much nicer and sounds more natural.
31st Mar 2021, 7:09 AM
G B
0
Hey G B, can you please tell me how I can convert this text into speech using the gtts module?
30th Mar 2021, 4:33 AM
Ujjawal Gupta
0
Hey bro, why is gtts so slow? It takes too much time to execute. Is there any way to reduce the time?
30th Mar 2021, 12:22 PM
Ujjawal Gupta
0
Hi, gtts uses the internet; the processing takes place in the cloud. So to speed this up, you may need a faster internet connection. Alternatively, you could try pyttsx3, an offline text-to-speech library. Unfortunately, it does not work on Android. Here's a little reference code:

    import pdfplumber as pdfp
    from time import time
    from gtts import gTTS
    import pyttsx3

    pdfToString = ""
    with pdfp.open('/storage/emulated/0/Download/file.pdf') as pdf:
        for page in pdf.pages:
            # extract_text() may return None or '' for a page without text
            text = page.extract_text() or ""
            print(text)
            pdfToString += text

    # Time the online engine (gtts)
    print("starting gtts...")
    start = time()
    pdfToSpeech = gTTS(pdfToString)
    pdfToSpeech.save('/storage/emulated/0/Music/pdfToSpeech.mp3')
    stop = time()
    print(f"gtts finished after {stop - start} seconds")

    # Time the offline engine (pyttsx3)
    print("starting pyttsx3...")
    start = time()
    engine = pyttsx3.init()
    engine.save_to_file(pdfToString, '/storage/emulated/0/Music/pdfToSpeech2.mp3')
    engine.runAndWait()
    stop = time()
    print(f"pyttsx3 finished after {stop - start} seconds")
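The start/stop pattern used for both engines above could also be pulled into one small helper, so each measurement is a single call. This is just a sketch using the standard library; the name timed is made up here, and perf_counter is swapped in for time() because it is intended for measuring short durations.

```python
from time import perf_counter


def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = perf_counter()
    result = func(*args, **kwargs)
    elapsed = perf_counter() - start
    return result, elapsed


# Demo with a built-in function; any callable can be measured the same way
total, seconds = timed(sum, range(1_000_000))
print(f"sum finished after {seconds:.4f} seconds")
```

With this, the gtts step might be measured as timed(pdfToSpeech.save, path), where path stands for your own output file.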
31st Mar 2021, 7:08 AM
G B
0
Yes, you are right.
31st Mar 2021, 7:10 AM
Ujjawal Gupta
0
But is there any way to reduce the runtime?
31st Mar 2021, 7:11 AM
Ujjawal Gupta
0
Thank you so much
31st Mar 2021, 4:01 PM
Ujjawal Gupta
0
No problem :) I'm afraid I don't know a way to reduce the runtime other than using pyttsx3.
1st Apr 2021, 10:01 AM
G B
0
Is the internet the only cause of the problem?
1st Apr 2021, 10:02 AM
Ujjawal Gupta
0
I think so, yes. But you only have limited influence on this. Data always travels slower over the internet than inside your PC or smartphone. So even with the fastest possible internet connection, I think gtts will never become as fast as pyttsx3.
2nd Apr 2021, 10:44 AM
G B