2024: Master Python Web Scraping in One Day [Basics], covering requests, beautifulsoup, and selenium:
https://www.bilibili.com/video/BV1Ju4y1Y7k6/
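Let's scrape the information for every post on the https://www.cnblogs.com/ home page, including each post's title, URL, and author.
First fetch the page with requests, then parse it with bs4.
Reference code: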
import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/"
r = requests.get(url, timeout=10)
r.raise_for_status()  # stop early on an HTTP error status
# Set the encoding used to decode the response body
r.encoding = "utf-8"
# print(r.text)

# The 'lxml' parser requires the lxml package to be installed
soup = BeautifulSoup(r.text, "lxml")
# Select every <article class="post-item"> element, one per post
article_list = soup.select("article.post-item")
# print(article_list)
for article in article_list:
    print("==========")
    author = article.find("a", class_="post-item-author")
    print(author.get_text())
    link = article.find("a", class_="post-item-title")
    print(link.get_text())
    print(link.attrs["href"])
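
The parsing step can also be tried without any network access. The sketch below is only an illustration: SAMPLE_HTML is hand-written markup that assumes the same class names as the selectors above (it is not copied from the real site), parse_posts is a hypothetical helper name, and Python's built-in "html.parser" is swapped in for lxml so nothing extra needs installing. It also skips posts without a title link and tolerates a missing author instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup (an assumption), reusing the class names from the selectors above.
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://example.com/post/1">First post</a>
  <a class="post-item-author" href="https://example.com/u/alice"><span>alice</span></a>
</article>
<article class="post-item">
  <a class="post-item-title" href="https://example.com/post/2">Second post</a>
</article>
"""


def parse_posts(html):
    """Return one dict per post with its title, URL, and author (None if absent)."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.select("article.post-item"):
        link = article.find("a", class_="post-item-title")
        if link is None:
            # No title link: skip this element rather than crash.
            continue
        author = article.find("a", class_="post-item-author")
        posts.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            "author": author.get_text(strip=True) if author else None,
        })
    return posts


posts = parse_posts(SAMPLE_HTML)
for post in posts:
    print(post)
```

Returning a list of dicts rather than printing inside the loop keeps fetching and parsing separate, so the same function could be given r.text from the request above, assuming the live page still uses these class names.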