Python Knowledge Sharing Network (Python222) - a professional Python learning site
Scraping and parsing cnblogs home page post data with Python BeautifulSoup
Published: 2023-10-29 20:53:00

2024: Master Python web scraping in one day [Basics], covering requests, beautifulsoup, and selenium

https://www.bilibili.com/video/BV1Ju4y1Y7k6/

 

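We will scrape information on every post listed on the https://www.cnblogs.com/ home page, including each post's title, URL, and author.

First we fetch the page with requests, then we parse it with bs4.

Reference code: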

import requests
from bs4 import BeautifulSoup

url = "https://www.cnblogs.com/"

# Fetch the home page
r = requests.get(url)

# Set the encoding of the response object
r.encoding = "utf-8"

# print(r.text)

# Parse the HTML with the lxml parser (requires the lxml package)
soup = BeautifulSoup(r.text, 'lxml')

# Select every <article class="post-item"> element, one per post
article_list = soup.select("article.post-item")
# print(article_list)

for article in article_list:
    print("==========")
    # Post author
    author = article.find("a", class_="post-item-author")
    print(author.get_text())
    # Post title and URL
    link = article.find("a", class_="post-item-title")
    print(link.get_text())
    print(link.attrs["href"])
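
The same parsing step can also be tried offline and made to return data instead of printing it. The following is only a minimal sketch under a few assumptions: the function name parse_posts and the SAMPLE_HTML markup are made up for illustration (the sample just mirrors the class names used in the selectors above, not the real page), and html.parser is swapped in for lxml so that nothing beyond bs4 is needed. It also skips a post with no title link and returns None for a missing author, rather than raising an error.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; it only mirrors the class names used above
# and is not a copy of the real cnblogs page.
SAMPLE_HTML = """
<article class="post-item">
  <a class="post-item-title" href="https://example.com/post/1">First post</a>
  <a class="post-item-author" href="https://example.com/u/alice"><span>alice</span></a>
</article>
<article class="post-item">
  <a class="post-item-title" href="https://example.com/post/2">Second post</a>
</article>
"""


def parse_posts(html):
    """Return a list of dicts with the title, URL, and author of each post."""
    # html.parser ships with Python, so no extra parser has to be installed.
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for article in soup.select("article.post-item"):
        link = article.find("a", class_="post-item-title")
        if link is None:
            # No title link found: skip this element.
            continue
        author = article.find("a", class_="post-item-author")
        posts.append({
            "title": link.get_text(strip=True),
            "url": link.get("href"),
            # The author is None when the author link is missing.
            "author": author.get_text(strip=True) if author else None,
        })
    return posts


if __name__ == "__main__":
    for post in parse_posts(SAMPLE_HTML):
        print(post)
```

To run it against the live page, r.text from the reference code could be passed in place of SAMPLE_HTML.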

 

Reprinted from: