How to extract the text of all the pages of a PDF using pdfplumber?

Can anyone help me? I need the source code.

28th Mar 2021, 4:37 AM

Ujjawal Gupta

13 Answers

+ 1

import pdfplumber as pdfp
from gtts import gTTS

pdfToString = ""
with pdfp.open('/storage/emulated/0/Download/filename.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() may give None for a page without extractable text
        text = page.extract_text() or ""
        print(text)
        # keep a line break between pages so words do not run together
        pdfToString += text + "\n"

# convert the collected text to speech and save it as an MP3 file
pdfToSpeech = gTTS(pdfToString, lang='de')
pdfToSpeech.save('/storage/emulated/0/Music/pdfToSpeech_deutsch.mp3')

This is what I put together quickly from the gtts documentation. You can choose a language via the lang parameter of gTTS (for me it is German, 'de'; English would be 'en'). For more languages, see the documentation.

30th Mar 2021, 5:46 AM

G B

+ 1

Hi Ujjawal Gupta, try this:

import pdfplumber as pdfp

with pdfp.open('/storage/emulated/0/Download/filename.pdf') as pdf:
    for page in pdf.pages:
        print(page.extract_text())

You will of course need to adjust the file path passed to open(). Hope this helps.
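If you want the text as one string instead of printing it page by page, the loop above could be wrapped roughly like this. This is only a minimal sketch: the function names join_page_texts and extract_pdf_text are made up for the example, and only pdfplumber.open, pdf.pages and page.extract_text() come from pdfplumber. It assumes that a page without extractable text should count as an empty string.

```python
from typing import Iterable


def join_page_texts(pages: Iterable, separator: str = "\n") -> str:
    """Join the text of page-like objects that offer extract_text()."""
    texts = []
    for page in pages:
        # Fall back to "" when a page yields no text (None or empty string).
        texts.append(page.extract_text() or "")
    return separator.join(texts)


def extract_pdf_text(path: str) -> str:
    """Open a PDF with pdfplumber and return the text of all its pages."""
    # Imported here so join_page_texts can be used without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path) as pdf:
        return join_page_texts(pdf.pages)
```

You would then call something like extract_pdf_text('/storage/emulated/0/Download/filename.pdf') with your own path and pass the result on to whatever needs the text.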

28th Mar 2021, 10:09 PM

G B

+ 1

Thank you, buddy. You just saved my life!

30th Mar 2021, 3:23 AM

Ujjawal Gupta

+ 1

On my PC, pyttsx3 runs about 6 times faster than gtts, although the speech from gtts is much nicer and sounds more natural.

31st Mar 2021, 7:09 AM

G B

Hey G B, can you please tell me how I can convert this text into speech using the gtts module?

30th Mar 2021, 4:33 AM

Ujjawal Gupta

Hey bro, why is gtts so slow? It takes too much time to execute. Is there any way to reduce the time?

30th Mar 2021, 12:22 PM

Ujjawal Gupta

Hi, gtts uses the Internet. The processing takes place in the cloud. So to speed this up, you may need faster Internet access. Alternatively, you could try pyttsx3. This is an offline text-to-speech library. Unfortunately, it does not work on Android. Here's a little reference code:

import pdfplumber as pdfp
from time import time
from gtts import gTTS
import pyttsx3

pdfToString = ""
with pdfp.open('/storage/emulated/0/Download/file.pdf') as pdf:
    for page in pdf.pages:
        # extract_text() may give None for a page without extractable text
        text = page.extract_text() or ""
        print(text)
        pdfToString += text + "\n"

# time the online conversion with gtts
print("starting gtts...")
start = time()
pdfToSpeech = gTTS(pdfToString)
pdfToSpeech.save('/storage/emulated/0/Music/pdfToSpeech.mp3')
stop = time()
print(f"gtts finished after {stop - start} seconds")

# time the offline conversion with pyttsx3
print("starting pyttsx3...")
start = time()
engine = pyttsx3.init()
engine.save_to_file(pdfToString, '/storage/emulated/0/Music/pdfToSpeech2.mp3')
engine.runAndWait()
stop = time()
print(f"pyttsx3 finished after {stop - start} seconds")
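The start/stop timing pattern used for both libraries above could also be pulled into a small helper so it is not repeated. This is just a sketch under some assumptions: the name timed is made up, and it uses perf_counter from the standard library in place of time(), since that clock is intended for measuring short durations. The dummy work stands in for the gtts or pyttsx3 calls.

```python
from contextlib import contextmanager
from time import perf_counter


@contextmanager
def timed(label, results=None):
    """Print how long the enclosed block took; optionally store it in results."""
    start = perf_counter()
    try:
        yield
    finally:
        elapsed = perf_counter() - start
        if results is not None:
            results[label] = elapsed
        print(f"{label} finished after {elapsed:.2f} seconds")


if __name__ == "__main__":
    durations = {}
    with timed("dummy work", durations):
        # Stand-in for gTTS(...).save(...) or engine.runAndWait()
        sum(range(1_000_000))
```

With this, each library call goes inside its own with timed("gtts", durations): block, and the durations dict holds both measurements for comparison afterwards.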

31st Mar 2021, 7:08 AM

G B

Yes, you are right.

31st Mar 2021, 7:10 AM

Ujjawal Gupta

But is there any way to reduce the runtime?

31st Mar 2021, 7:11 AM

Ujjawal Gupta

Thank you so much

31st Mar 2021, 4:01 PM

Ujjawal Gupta

No problem :) I'm afraid I don't know a way to reduce the runtime other than using pyttsx3.

1st Apr 2021, 10:01 AM

G B

Is the internet the only cause of the problem?

1st Apr 2021, 10:02 AM

Ujjawal Gupta

I think yes. But you only have limited influence on this. Data always travels more slowly over the Internet than inside your PC or smartphone. So even with the fastest possible internet access, I think gtts will never be as fast as pyttsx3.

2nd Apr 2021, 10:44 AM

G B