Google Search results scraping - captcha problem | Sololearn: Learn to code for FREE!

+18

Google Search results scraping - captcha problem

I am working on a web scraper in Python and have it almost ready. I focused on extracting hyperlinks and it works alright. I have a problem though - the website I am trying to scrape detects my script and demands a captcha test after a couple of attempts. Do any of you have experience with Google News search and scraping? Would making the script wait 10 seconds before the next attempt help at all? Or should I do some random routing? I don't want to make a crawler, just a single batch search :(
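A fixed 10-second wait tends to be easy to fingerprint, because the traffic arrives on a perfectly regular beat; a randomized delay is usually the first thing to try. A minimal sketch (the `fetch` in the usage comment is a hypothetical placeholder, not a real function):

```python
import random
import time

def polite_delay(base=10.0, jitter=5.0):
    """Return a randomized wait of base +/- jitter seconds.

    A constant sleep produces perfectly regular request timing, which is
    itself a bot signal; jitter makes successive requests arrive unevenly.
    """
    return base + random.uniform(-jitter, jitter)

# Usage between requests (fetch() is a hypothetical placeholder):
# for query in queries:
#     fetch(query)
#     time.sleep(polite_delay())
```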

8/8/2017 9:06:14 PM

Kuba Siekierzyński

12 Answers


+11

Back on this issue - ultimately I found a workaround of sorts, by shuffling the search terms, which allowed me to scrape 100 entries at a time with no limit. Since my recently completed project scraped images of a particular kind, this proved effective enough. Thanks to all for the hints! 👍
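A minimal sketch of the shuffling workaround described above (assuming `terms` is whatever list of search terms you are querying; the batch size of 100 matches the post):

```python
import random

def shuffled_batches(terms, batch_size=100):
    """Yield the search terms in random order, batch_size at a time.

    Shuffling means no two runs hit the server with the same sequence
    of queries, so no single request pattern repeats.
    """
    order = list(terms)        # copy, so the caller's list is untouched
    random.shuffle(order)
    for i in range(0, len(order), batch_size):
        yield order[i:i + batch_size]
```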

+8

Got it working, guys!!! :) Perhaps it will take them a while to spot my parser; hopefully I'll get my task complete before that. @Venkatesh I thought about that and will try this approach, too. As for an API, I couldn't find one that searches Google News only. I don't want to search the whole web or other places, just the news. They had one going, but discontinued it in 2011 or so...

+7

I agree with @seamiki's last answer: I have made some crawling attempts with Python against Google search pages (many pages, one after another), giving the request a user-agent header close to the one he suggested, and did not encounter such a limitation ;) Even though I think the user-agent given by @seamiki works well, this is the one I used successfully: req.add_header('User-agent','Mozilla/5.0 (Linux; Android 6.0; P027 Build/MRA58L) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/57.0.2987.132 Safari/537.36') (using 'request' from the 'urllib' module in the Python 3 implementation of QPython3 ^^)
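The same header trick as a self-contained Python 3 sketch (standard library only; the example URL in the usage comment is an assumption, not part of the original post):

```python
import urllib.request

# The exact user-agent string from the post (an Android Chrome browser)
UA = ('Mozilla/5.0 (Linux; Android 6.0; P027 Build/MRA58L) '
      'AppleWebKit/537.36 (KHTML, like Gecko) '
      'Chrome/57.0.2987.132 Safari/537.36')

def build_request(url, user_agent=UA):
    """Build a urllib request that identifies itself as a real browser."""
    req = urllib.request.Request(url)
    req.add_header('User-agent', user_agent)
    return req

# Usage (performs a real network call):
# with urllib.request.urlopen(build_request('https://www.google.com/search?q=python')) as resp:
#     html = resp.read()
```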

+6

Hey Kuba, what are you using for the scraping?

+6

Check out the code for my CNN image classifier. It also contains the web-scraping part I used to procure over 40k images from Google Search: https://github.com/kuba-siekierzynski/CarL-CNN

+5

I will check it out later today and let you know the result, thanks! :)

+4

Now that I think about it... I once used a bash script to save mp3 files from Google's TTS service. The script worked for a while, until Google blocked that possibility. The workaround was to send the request impersonating a browser. It was still working as of a few weeks ago. The user agent should be something like this: UA="Mozilla/5.0 (Windows NT 6.2; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/29.0.1547.2 Safari/537.36", but this worked just fine: "wget -q -U Mozilla -O output.mp3....." Hope it helps.

+3

Same here, I also get that. Now I'm using Startpage. According to the terms and conditions you're not allowed to crawl, so yeah, they limit your requests. P.S. If you really pay attention to our Discord, we were actually talking about this like 9 hours ago.

+3

Once I got a reply from the server: "This is not how you're supposed to access our data. Pls stop. I'm not a robot". Not a captcha, but a pissed-off admin. Afaik the usual text-based anti-captcha services also have limits for Cyrillic. I don't know whether the paid services work with the new image-recognition captchas. Developing my own solution was and is way beyond my capabilities.

+2

It's simpler to get it through the API they may provide. In most cases you will already be authenticated, so when you exceed the number of requests per minute you are simply put on hold and can then continue. I have scraped a few sites that did not expose an API but had these limits and CAPTCHAs in place. I slowed down, I think, to 1 or 2 requests per minute. You can also use a VPN or other means of getting a new IP and MAC combination every minute or so. That way you can go full steam ahead while your requests all look like they come from many different places. Until the server is attended by an admin, this will get you maximum speed.
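The "put on hold, then continue" behaviour can be approximated client-side by backing off and retrying when the server answers HTTP 429 (Too Many Requests). A sketch using only the standard library; the status code, delays, and retry count are assumptions, not anything the post specifies:

```python
import time
import urllib.error
import urllib.request

def fetch_with_backoff(url, max_tries=5, base_delay=30.0):
    """Fetch url, waiting and retrying with doubled delays when rate-limited.

    HTTP 429 is the server saying "too many requests"; anything else is
    re-raised immediately as a genuine error.
    """
    delay = base_delay
    for _ in range(max_tries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise              # not a rate limit -> genuine failure
            time.sleep(delay)      # honour the hold...
            delay *= 2             # ...and back off exponentially
    raise RuntimeError('still rate-limited after %d tries' % max_tries)
```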

0

@Kuba: Can you share your code or some technical details, such as the sleep time between requests, the URL you use for the request, and the headers? I use random user agents from this lib: https://github.com/lorien/user_agent together with the requests lib to query https://google.com/search?q={keyword} with a 10-second delay between requests, but I still got a captcha after a few requests.

0

Thank you Kuba, I will check your code!