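A Practical Guide to Parsing Static HTML of AITO M7 Owner Reviews with Python + BeautifulSoup (Including Anti-Scraping Avoidance and Structure Adaptation)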

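⚠️ **Legal and technical notice up front**:

> ❗ **Autohome (autohome.com.cn) explicitly prohibits unauthorized automated crawling.** Its `robots.txt` file ([https://www.autohome.com.cn/robots.txt](https://www.autohome.com.cn/robots.txt)) restricts crawling of most dynamic pages (such as the review pages), and the site uses **front-end rendering + asynchronous AJAX loading + anti-scraping mechanisms (such as User-Agent checks, rate limiting, CAPTCHAs, and encrypted parameters)**.
>
> ✅ **`requests + BeautifulSoup` alone cannot reliably, lawfully, and completely scrape the AITO M7 (问界M7) review page**, because the review list is loaded dynamically by JavaScript (XHR requests that return JSON), and BeautifulSoup **cannot execute JS; it only parses static HTML**. The page source you fetch contains almost none of the real review data.

---

### ✅ A compliant, workable approach once the requirement is understood correctly (Python + BeautifulSoup only)

#### ✅ Scope (lawful and feasible):

- You have **manually downloaded** the **static HTML file** of the AITO M7 review page (for example: open [https://k.autohome.com.cn/6192/](https://k.autohome.com.cn/6192/) in a browser → scroll to the bottom so that all reviews load → press `Ctrl+S` and save it as `m7_koubei.html`);
- Or you have managed to obtain an **HTML snapshot that contains real review data** with `requests` (extremely rare: it requires getting past the anti-scraping measures, and the page must happen to be server-rendered);
- Goal: **extract the visible fields from that HTML file: review title, score, publication date, user ID, and content**.

---

### ✅ Complete runnable code (Python + BeautifulSoup4)

```python
# requirements: pip install beautifulsoup4
import csv
import re

from bs4 import BeautifulSoup


def parse_m7_koubei_html(html_path_or_content: str, is_file_path: bool = True) -> list:
    """
    Parse static HTML of AITO M7 reviews (a local file path or an HTML string).

    Returns a list of dicts:
    [{"title": "...", "score": 4.5, "time": "2024-03-15", "user": "车友123456", "content": "..."}, ...]
    """
    if is_file_path:
        with open(html_path_or_content, 'r', encoding='utf-8') as f:
            html = f.read()
    else:
        html = html_path_or_content

    soup = BeautifulSoup(html, 'html.parser')
    koubei_list = []

    # 🔍 Locate all review items (typical structure: <div class="koubei-list-item">)
    items = soup.select('div.koubei-list-item') or soup.select('li.kb-item')

    for item in items:
        try:
            # Title
            title_elem = item.select_one('h3.fz18 a, .kb-item-title a')
            title = title_elem.get_text(strip=True) if title_elem else ""

            # Score (commonly <span class="score">4.5</span> or a data-score attribute).
            # A non-numeric value raises ValueError, which the except below reports.
            score_elem = item.select_one('.score, [data-score]')
            score = float(score_elem.get('data-score') or score_elem.get_text(strip=True)) if score_elem else 0.0

            # Time (common class names: .date, .time, .tip-time)
            time_elem = item.select_one('.date, .time, .tip-time, .kb-item-time')
            time_text = time_elem.get_text(strip=True) if time_elem else ""
            # Simple cleanup: extract dates like "2024-03-15" or "2024年3月15日"
            time_match = re.search(r'(\d{4}-\d{1,2}-\d{1,2})|(\d{4}年\d{1,2}月\d{1,2}日)', time_text)
            if time_match:
                pub_time = time_match.group(1) or (
                    time_match.group(2).replace('年', '-').replace('月', '-').replace('日', '')
                )
            else:
                pub_time = ""

            # User ID (commonly <a class="name"> or .user-name)
            user_elem = item.select_one('.name, .user-name, .kb-item-user a')
            user = user_elem.get_text(strip=True) if user_elem else ""

            # Content (first matching block, whitespace stripped, capped at 200 characters)
            content_elem = item.select_one('.txt-con, .kb-item-con, .content p, .review-content')
            if not content_elem:
                content_elem = item.select_one('.txt-con div, .content div')
            if content_elem:
                text = content_elem.get_text(strip=True)
                # Append "..." only when the text was actually truncated
                content = text[:200] + "..." if len(text) > 200 else text
            else:
                content = ""

            koubei_list.append({
                "title": title,
                "score": score,
                "time": pub_time,
                "user": user,
                "content": content,
            })
        except Exception as e:
            print(f"[Warning] Failed to parse a review item: {e}")
            continue

    return koubei_list


# ✅ Usage example (save the review page as a local HTML file first!)
if __name__ == "__main__":
    # Replace with the path to your local HTML file
    html_file = "m7_koubei.html"  # ← the file you saved manually!
    results = parse_m7_koubei_html(html_file, is_file_path=True)

    print(f"Parsed {len(results)} reviews:\n")
    for i, kb in enumerate(results[:5], 1):  # print the first 5 as a sample
        print(f"{i}. [{kb['title']}] ({kb['score']}/5.0) - {kb['user']} | {kb['time']}")
        print(f"   💬 {kb['content']}\n")

    # ✅ Export to CSV (optional)
    if results:
        with open("m7_koubei_parsed.csv", "w", newline="", encoding="utf-8-sig") as f:
            writer = csv.DictWriter(f, fieldnames=["title", "score", "time", "user", "content"])
            writer.writeheader()
            writer.writerows(results)
        print("✅ Exported to m7_koubei_parsed.csv")
```

---

### ⚠️ Key caveats (required reading for developers)

| Item | Notes |
|------|------|
| 🚫 **Do not request the live URL directly** | `requests.get("https://k.autohome.com.cn/6192/")` returns a skeleton page with no review data (the list is rendered by JS), so BeautifulSoup finds nothing to extract. |
| 🛑 **A headless browser is outside this question's scope** | The question is limited to `BeautifulSoup`, so Selenium/Playwright are not recommended here (they would solve the JS rendering problem but go against the stated requirement). |
| 🧩 **The HTML structure changes easily** | Autohome updates its front end often, so selectors such as `.koubei-list-item` may stop matching → inspect the saved page with the browser developer tools (F12) and adjust the `select()` expressions accordingly. |
| 🌐 **Encoding and Chinese text** | Open the HTML with `encoding='utf-8'`, otherwise the text will be garbled; `.get_text(strip=True)` strips leading and trailing whitespace from each text fragment. |
| 📜 **Follow robots.txt and the User Agreement** | Before fetching anything, confirm the use is personal study or research, and throttle requests (for example with `time.sleep(1)`) to avoid burdening the server. |

---

### ✅ Further suggestions (if you need real dynamic data)

If you **must obtain live reviews from the site**, the lawful routes are:

- ✅ Check whether Autohome offers an **official API** (usually not open to the public);
- ✅ Use `selenium` + `chromedriver` to render the page, take `page_source`, and then hand it to BeautifulSoup for parsing (⚠️ beyond the scope of this question);
- ✅ Contact Autohome about a business partnership to obtain a data interface (an enterprise-level option).

---

###### [AI写代码神器 (AI code-writing tool) | 1839-point answer | 2026-05-25 13:47:30]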
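One detail of the date cleanup step in the parser above: a Chinese-style date such as `2024年3月15日` becomes `2024-3-15`, which does not sort or compare consistently with `2024-03-15`. The sketch below shows one possible way to normalize both forms to zero-padded `YYYY-MM-DD` using only the standard library. The name `normalize_date` and the range check are illustrative assumptions, not part of the original script.

```python
import re
from typing import Optional

# Matches "2024-3-15" as well as "2024年3月15日" (the trailing 日 is simply ignored).
_DATE_RE = re.compile(r"(\d{4})[-年](\d{1,2})[-月](\d{1,2})")


def normalize_date(text: str) -> Optional[str]:
    """Return the first date found in `text` as zero-padded YYYY-MM-DD, or None.

    Hypothetical helper: it only checks that month and day fall in a plausible
    range; it does not validate the calendar (e.g. it accepts February 30).
    """
    match = _DATE_RE.search(text)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    if not (1 <= month <= 12 and 1 <= day <= 31):
        return None
    return f"{year:04d}-{month:02d}-{day:02d}"


if __name__ == "__main__":
    for raw in ("发表于 2024年3月15日", "2024-03-15 10:20", "no date here"):
        print(repr(raw), "->", normalize_date(raw))
```

To use it, the result of `normalize_date(time_text) or ""` could stand in for the regex-and-replace logic that computes `pub_time`, keeping the empty-string fallback the parser already uses.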
