Python is a powerful programming language that is well suited to building web crawlers of all kinds. This article walks through 7 crawler examples implemented in Python, each with usage notes.
1. A simple crawler
The simplest crawler takes only a few lines of code. It fetches the content of a page, such as text or images. Usage:
import requests

url = 'http://www.example.com'
response = requests.get(url)
html = response.text
print(html)
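Since the section mentions images as well as text: binary resources such as images must be read from `response.content` (raw bytes) rather than `response.text` (a decoded string). A minimal sketch, where the URL and file path are whatever the caller supplies:

```python
import requests

def download_image(url, path):
    """Fetch a binary resource and write it to disk.

    Images must be read from response.content (raw bytes),
    not response.text (a decoded string).
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # fail loudly on 4xx/5xx instead of saving an error page
    with open(path, 'wb') as f:
        f.write(response.content)
    return path
```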
2. Crawling a news site
A news-site crawler extracts items such as headlines, body text, and publication time. The `news-item` and `time` class names below are placeholders; inspect the target site's actual markup and adjust the selectors accordingly. Usage:
import requests
from bs4 import BeautifulSoup

url = 'http://www.example.com/news'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
news_list = soup.find_all('div', class_='news-item')
for news in news_list:
    title = news.find('h3').text
    content = news.find('p').text
    pub_time = news.find('span', class_='time').text  # avoid shadowing the stdlib time module
    print(title, content, pub_time)
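The `find_all`/`find` pattern above can be tried offline on a small HTML snippet, which makes it easy to see what each call returns before pointing the crawler at a live site. The markup below is made up for illustration:

```python
from bs4 import BeautifulSoup

# A made-up snippet mimicking the structure the news crawler expects.
html = """
<div class="news-item">
  <h3>Headline A</h3><p>Body A</p><span class="time">2024-01-01</span>
</div>
<div class="news-item">
  <h3>Headline B</h3><p>Body B</p><span class="time">2024-01-02</span>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')  # html.parser needs no extra install
items = soup.find_all('div', class_='news-item')
titles = [item.find('h3').text for item in items]
print(titles)  # ['Headline A', 'Headline B']
```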
3. Crawling Taobao products
A product crawler would extract information such as product names, prices, and images. Note that Taobao renders its listings with JavaScript and blocks anonymous scraping, so the code below is a structural sketch only; the `product`, `name`, and `price` class names are placeholders. Usage:
import requests
from bs4 import BeautifulSoup

url = 'https://www.taobao.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
products_list = soup.find_all('div', class_='product')
for product in products_list:
    name = product.find('p', class_='name').text
    price = product.find('span', class_='price').text
    img = product.find('img')['src']
    print(name, price, img)
4. Crawling Douban movies
A Douban movie crawler extracts information such as titles, ratings, and one-line summaries. The selectors below follow the markup of Douban's movie list pages (e.g. the Top 250 list) rather than the homepage, and Douban rejects requests that lack a browser-like User-Agent header. Usage:
import requests
from bs4 import BeautifulSoup

url = 'https://movie.douban.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
movies_list = soup.find_all('div', class_='item')
for movie in movies_list:
    name = movie.find('span', class_='title').text
    score = movie.find('span', class_='rating_num').text
    intro = movie.find('span', class_='inq').text
    print(name, score, intro)
5. Crawling Zhihu questions
A Zhihu crawler extracts question information such as titles, answer counts, and follower counts. Zhihu requires login for most content and renders its pages with JavaScript, so the code below is a structural sketch; the class names are placeholders. Usage:
import requests
from bs4 import BeautifulSoup

url = 'https://www.zhihu.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
questions_list = soup.find_all('div', class_='question-item')
for question in questions_list:
    title = question.find('h2').text
    answer_num = question.find('span', class_='num').text
    follow_num = question.find('div', class_='follow-num').text
    print(title, answer_num, follow_num)
6. Crawling Weibo users
A Weibo crawler extracts user information such as usernames, follower counts, and posts. Weibo requires login and renders its content with JavaScript, so the code below is a structural sketch; the class names are placeholders. Usage:
import requests
from bs4 import BeautifulSoup

url = 'https://weibo.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
users_list = soup.find_all('div', class_='user-item')
for user in users_list:
    name = user.find('span', class_='name').text
    fans_num = user.find('span', class_='fans-num').text
    content = user.find('div', class_='content').text
    print(name, fans_num, content)
7. Crawling GitHub repositories
A GitHub crawler extracts repository information such as names, star counts, and descriptions. In practice GitHub's official REST API is the more reliable route; the class names in the HTML-scraping sketch below are placeholders. Usage:
import requests
from bs4 import BeautifulSoup

url = 'https://github.com/'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'lxml')
repositories_list = soup.find_all('div', class_='repo-list-item')
for repository in repositories_list:
    name = repository.find('a', class_='repo-list-name').text
    star_num = repository.find('a', class_='muted-link').text
    description = repository.find('p', class_='repo-list-description').text
    print(name, star_num, description)
These 7 examples show how quickly a basic requests + BeautifulSoup crawler can be put together in Python. Before pointing any of them at a real site, check the site's robots.txt and terms of service, and throttle your requests.
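All of the examples above call `requests.get` with no headers, timeout, or error handling, and real sites often reject such requests. A hedged sketch of a more polite fetch helper follows; the User-Agent string and delay value are arbitrary choices for illustration, not requirements of any particular site:

```python
import time
import requests

HEADERS = {
    # Many sites reject requests that lack a browser-like User-Agent.
    'User-Agent': 'Mozilla/5.0 (compatible; example-crawler/1.0)',
}

def fetch_html(url, delay=1.0):
    """Fetch a page politely: identify ourselves, time out, and throttle."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # raise on 4xx/5xx instead of parsing an error page
    time.sleep(delay)            # pause so consecutive requests don't hammer the server
    return response.text
```

Any of the `requests.get(url)` calls in the sections above can be swapped for `fetch_html(url)` without other changes.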