how to create a web scraper in python? | Sololearn: Learn to code for FREE!

+1

how to create a web scraper in python?

Just figuring it out.

12/31/2016 7:20:46 AM

David Toscano

9 Answers


+4

from urllib.request import urlopen
import bs4

def scrape():
    base_url = 'https://python.org'
    response = urlopen(base_url)
    html = response.read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for i in bs_obj.find_all('a'):
        print(i['href'])
    # looking for specific traits of the content:
    # printing paragraphs with the class "info" (text only, without tags)
    for i in bs_obj.find_all('p', {'class': 'info'}):
        print(i.text)

scrape()

+3

my simple example of a web scraper:

from urllib.request import urlopen
import bs4

def scrape():
    base_url = 'https://python.org'
    html = urlopen(base_url).read()
    print(html)  # raw source code
    bs_obj = bs4.BeautifulSoup(html, 'html.parser')
    # printing all links
    for i in bs_obj.find_all('a'):
        print(i['href'])

scrape()

+1

Following this post. Very interesting, and I would like to understand how a web scraper works.

+1

Thanks for the answer. For something more specific, such as finding particular words inside one page, or across an entire site or group of sites, do you have any suggestions?

+1

David Toscano Web scraper in Python:

If you have basic knowledge of Python, you know pip. With pip you install modules that are not built in. You need at least the following modules: urllib3 and bs4. Even better, ask the server for a certificate, so you need certifi as well (pip install certifi).

We import what we need:

import urllib3
import certifi
from bs4 import BeautifulSoup

What's going on? Python now has to do what your browser usually does: read HTML documents. This is also called parsing.

We create an object:

http = urllib3.PoolManager(cert_reqs="CERT_REQUIRED", ca_certs=certifi.where())

Now we give Python the site to browse:

site = http.request("GET", "https://www.sololearn.com")
soup = BeautifulSoup(site.data, "html.parser")

We told Python to parse the HTML document, provided the server has a valid certificate. Now you can do different things, e.g. look for a specific HTML element:

result = soup.find_all("h3", class_="text")  # only an example; go to the website and look up the actual element

OK, to show the results:
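To see the find_all call in isolation, without touching the network, here is a minimal sketch that parses a hard-coded HTML string (the tag contents and class name are invented for the demo; it assumes bs4 is installed):

```python
from bs4 import BeautifulSoup

# Tiny hard-coded document standing in for site.data (invented for the demo)
page = """
<html><body>
  <h3 class="text">First headline</h3>
  <h3 class="text">Second headline</h3>
  <h3 class="other">Ignored</h3>
</body></html>
"""

soup = BeautifulSoup(page, "html.parser")

# class_ (with a trailing underscore) avoids clashing with the class keyword
result = soup.find_all("h3", class_="text")
for tag in result:
    print(tag.text)
```

Only the two h3 elements with class "text" are matched; the third is ignored.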

+1

Row by row, you can use a list. There are many solutions, I guess; this is mine at the moment:

list_results = []  # empty list

def show_results(x, y):
    x.append(y)
    for i in x:
        print(i)

show_results(list_results, result)

(Sorry, I did not want to write the function out again; we don't love typing ^^)

Or you look for a specific word. After giving Python the site to parse, we give it another job: search the first 50 characters for the string "DOCTYPE HTML":

import re

check_site = str(site.data[0:50])  # we must convert to the right data type
pattern = r"<!DOCTYPE HTML"
if re.search(pattern, check_site):
    print("Site is HTML 5")
else:
    pattern2 = r"<!doctype html"
    if re.search(pattern2, check_site):
        print("Doctype HTML is labeled")
    else:
        print("Doctype HTML isn't labeled")

Hope this helps a bit :)

+1

Correction to the loop in the function above: the line should read for i in x: otherwise it raises an error :)
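With that correction applied, here is a self-contained version of the helper that can actually be run (the sample strings are invented for illustration):

```python
# Self-contained version of the show_results helper from the answer above,
# with the corrected loop variable. The sample strings are illustrative.
def show_results(x, y):
    x.append(y)      # add the new result to the list
    for i in x:      # corrected loop: iterate over the list itself
        print(i)

list_results = []
show_results(list_results, "first result")
show_results(list_results, "second result")
# list_results now holds both strings
```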

0

There are a lot of examples online, but I need an answer that is more practical than theoretical.

0

Use lxml; it is much faster than bs4. For large projects, use Scrapy with its powerful threading. Stack Overflow has tons of examples on this.
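For comparison, a minimal lxml sketch of the same kind of extraction (assumes the lxml package is installed; the sample HTML is invented for the demo):

```python
from lxml import html

# Sample HTML invented for the demo; normally this would be a downloaded page
page = """
<html><body>
  <a href="https://python.org">Python</a>
  <p class="info">Hello</p>
</body></html>
"""

tree = html.fromstring(page)

# XPath queries take the place of BeautifulSoup's find_all
links = tree.xpath('//a/@href')
texts = tree.xpath('//p[@class="info"]/text()')

print(links)
print(texts)
```

XPath is terse but powerful; the same two queries cover both of the examples (links and class-filtered paragraphs) from the earlier answers.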