Python is a powerful programming language and a popular choice for writing web crawlers. This article walks through seven example crawlers written in Python, each with brief usage notes. Note that the HTML class names in the examples are illustrative placeholders: real sites change their markup frequently, several of the sites below require login or render content with JavaScript, and any crawl should respect the target site's robots.txt and terms of service.
1. A minimal crawler
The minimal crawler takes only a few lines of code. It fetches a page and gives you its raw HTML, from which you can extract text, image URLs, and so on. Usage:
```python
import requests

url = 'http://www.example.com'
response = requests.get(url)
html = response.text
print(html)
```
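In practice a bare `requests.get` call is fragile: many sites reject requests that lack a browser-like `User-Agent`, and a stalled connection will block forever without a timeout. A minimal hardened sketch (the header string and URL are illustrative):

```python
import requests

# A browser-like User-Agent; many sites refuse the default 'python-requests/x.y' one
HEADERS = {'User-Agent': 'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36'}

def fetch(url, timeout=10):
    """Fetch a page, raising for HTTP errors instead of returning error pages."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # turn 4xx/5xx responses into exceptions
    return response.text

# Usage (performs a real network request):
# html = fetch('http://www.example.com')
```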
2. Crawling a news site
A news crawler extracts articles from a news site, such as each item's headline, body text, and publication time. The class names below depend entirely on the target site's markup. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com/news'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# 'news-item' and 'time' are placeholder class names -- adjust them to the real site
news_list = soup.find_all('div', class_='news-item')
for news in news_list:
    title = news.find('h3').text
    content = news.find('p').text
    pub_time = news.find('span', class_='time').text
    print(title, content, pub_time)
```
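The `find_all`/`find` pattern above can be tried offline against a small HTML snippet, which is also a good way to debug selectors before pointing the crawler at a live site. The snippet below reuses the same placeholder class names; `'html.parser'` is bundled with Python, so no lxml install is needed:

```python
from bs4 import BeautifulSoup

# A tiny stand-in for a news listing page, using the same placeholder class names
html = """
<div class="news-item">
  <h3>Headline A</h3><p>Body A</p><span class="time">2024-01-01</span>
</div>
<div class="news-item">
  <h3>Headline B</h3><p>Body B</p><span class="time">2024-01-02</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')
items = [(n.find('h3').text, n.find('span', class_='time').text)
         for n in soup.find_all('div', class_='news-item')]
print(items)  # [('Headline A', '2024-01-01'), ('Headline B', '2024-01-02')]
```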
3. Crawling Taobao products
A product crawler collects listing data from Taobao, such as product names, prices, and image URLs. Be aware that Taobao renders its pages with JavaScript and actively blocks plain HTTP clients, so the snippet below illustrates the parsing pattern rather than working against the live site. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.taobao.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# 'product', 'name', and 'price' are placeholder class names; Taobao's real pages
# are rendered by JavaScript, so a plain request will not see markup like this
products_list = soup.find_all('div', class_='product')
for product in products_list:
    name = product.find('p', class_='name').text
    price = product.find('span', class_='price').text
    img = product.find('img')['src']
    print(name, price, img)
```
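Before crawling a commercial site like Taobao at all, it is worth checking what its robots.txt permits. The standard library can do this; the robots file below is an illustrative stand-in (fetch the real one from `https://<site>/robots.txt`):

```python
from urllib.robotparser import RobotFileParser

# An illustrative robots.txt -- real sites publish theirs at /robots.txt
robots_txt = """\
User-agent: *
Disallow: /private/
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch('*', 'https://www.example.com/products'))   # True
print(rp.can_fetch('*', 'https://www.example.com/private/x'))  # False
```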
4. Crawling Douban movies
A Douban crawler gathers movie information, such as titles, ratings, and one-line blurbs. The selectors below resemble Douban's Top 250 list pages rather than the homepage URL shown, so adjust the URL and selectors together. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Douban typically rejects requests without a browser-like User-Agent header
movies_list = soup.find_all('div', class_='item')
for movie in movies_list:
    name = movie.find('span', class_='title').text
    score = movie.find('span', class_='rating_num').text
    intro = movie.find('span', class_='inq').text
    print(name, score, intro)
```
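Printing scraped rows is fine for experimenting, but results are usually persisted. A stdlib sketch that writes hypothetical `(name, score, intro)` rows, the shape produced by the loop above, to CSV (it writes to an in-memory buffer here; use `open('movies.csv', 'w', newline='')` for a real file):

```python
import csv
import io

# Hypothetical rows in the (name, score, intro) shape produced by the scraping loop
rows = [('Movie A', '9.2', 'A classic'), ('Movie B', '8.7', 'A favorite')]

buf = io.StringIO()  # stand-in for a real file handle
writer = csv.writer(buf)
writer.writerow(['name', 'score', 'intro'])  # header row
writer.writerows(rows)

print(buf.getvalue())
```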
5. Crawling Zhihu questions
A Zhihu crawler collects question data, such as titles, answer counts, and follower counts. Zhihu requires login for most content, so in practice you will need to send valid session cookies with your requests. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Placeholder class names; logged-out requests are redirected to a login page
questions_list = soup.find_all('div', class_='question-item')
for question in questions_list:
    title = question.find('h2').text
    answer_num = question.find('span', class_='num').text
    follow_num = question.find('div', class_='follow-num').text
    print(title, answer_num, follow_num)
```
6. Crawling Weibo users
A Weibo crawler collects user profiles, such as usernames, follower counts, and recent posts. Like Zhihu, Weibo sits behind a login wall, and the class names below are placeholders. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://weibo.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Placeholder class names; Weibo serves little content without valid login cookies
users_list = soup.find_all('div', class_='user-item')
for user in users_list:
    name = user.find('span', class_='name').text
    fans_num = user.find('span', class_='fans-num').text
    content = user.find('div', class_='content').text
    print(name, fans_num, content)
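For login-walled sites like Zhihu and Weibo, the usual approach is to reuse a single `requests.Session` so cookies and headers persist across requests. A sketch, where the cookie name and value are placeholders (copy a real cookie from your browser after logging in, and keep it private):

```python
import requests

# One Session keeps cookies and headers across all requests it makes
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})

# 'SUB' is a hypothetical cookie name; the value would come from a logged-in browser
session.cookies.set('SUB', 'placeholder-login-cookie')

print(session.headers['User-Agent'])
print(session.cookies.get('SUB'))
```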
7. Crawling GitHub repositories
A GitHub crawler collects repository information, such as repository names, star counts, and descriptions. For anything beyond a quick experiment, GitHub's official REST API (api.github.com) is a far more reliable source than scraping HTML. Usage:
```python
import requests
from bs4 import BeautifulSoup

url = 'https://github.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')

# Placeholder class names -- GitHub's real markup differs and changes over time
repositories_list = soup.find_all('div', class_='repo-list-item')
for repository in repositories_list:
    name = repository.find('a', class_='repo-list-name').text
    star_num = repository.find('a', class_='muted-link').text
    description = repository.find('p', class_='repo-list-description').text
    print(name, star_num, description)
```
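Crawling listing pages on a large site usually means paginating, and a polite crawler pauses between requests. A sketch of both ideas; the `page` query-parameter name and the URL are hypothetical, since every site paginates differently:

```python
import time

def page_urls(base_url, pages):
    """Build paginated URLs; the 'page' parameter name is a common but hypothetical choice."""
    return [f'{base_url}?page={i}' for i in range(1, pages + 1)]

def crawl(urls, delay=1.0, fetch=None):
    """Apply a fetch function to each URL, sleeping between requests to avoid
    hammering the server. 'fetch' is injected so this sketch stays offline-testable."""
    results = []
    for url in urls:
        if fetch is not None:
            results.append(fetch(url))
        time.sleep(delay)
    return results

print(page_urls('https://github.com/search', 3))
```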
Those are seven example crawlers in Python. Together they cover the basic workflow: fetch a page, parse it with BeautifulSoup, and pull out the fields you need. Remember to throttle your requests and to check each site's robots.txt and terms of service before crawling.