Three Hundred Tang Poems: end-to-end code for data collection, TF-IDF feature vectorization, and statistical analysis

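Task 1: Collect the content of the Three Hundred Tang Poems from the website "https://so.gushiwen.cn/gushi/tangshi.aspx".

Analysis: First fetch the page with Python's requests library, then parse it with BeautifulSoup and extract each poem's type, title, content, and author. Finally, save the extracted records to a txt file, one tab-separated record per line. The selectors below are assumptions about the page markup and should be checked against the live page; if the listing page only carries titles and links, the poem text has to be fetched from each linked page instead. The code is as follows:

```python
import requests
from bs4 import BeautifulSoup

url = "https://so.gushiwen.cn/gushi/tangshi.aspx"
# A browser-like User-Agent and a timeout make the request more robust.
response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
response.raise_for_status()
response.encoding = 'utf-8'

soup = BeautifulSoup(response.text, 'html.parser')
poems = soup.find_all('div', class_='typecont')

with open('poems.txt', 'w', encoding='utf-8') as f:
    for poem in poems:
        # NOTE: these tag and class names are assumptions; verify them in the browser.
        type_tag = poem.find('strong')
        title_tag = poem.find('h3')
        content_tag = poem.find('div', class_='contson')
        source_tag = poem.find('p', class_='source')
        if not (type_tag and title_tag and content_tag and source_tag):
            continue  # skip blocks that do not match the expected layout
        links = source_tag.find_all('a')
        if len(links) < 2:
            continue
        poem_type = type_tag.text.strip()
        title = title_tag.text.strip()
        # Remove all whitespace so each record stays on a single line.
        content = ''.join(content_tag.text.split())
        author = links[1].text.strip()  # assumes the second link is the author
        f.write(f'{poem_type}\t{title}\t{content}\t{author}\n')
```

Task 2: Based on the output of Task 1, compute a text feature vectorization (tf-idf values) of the poem content.

Analysis: First read the txt file produced in Task 1, then segment the Chinese text with jieba, and compute the tf-idf values with sklearn's TfidfVectorizer class. Finally, save the result to a txt file. The code is as follows:

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

with open('poems.txt', 'r', encoding='utf-8') as f:
    # Drop the trailing newline and skip blank lines.
    poems = [line.rstrip('\n').split('\t') for line in f if line.strip()]
poems = [poem for poem in poems if len(poem) == 4]  # keep well-formed records only

# Segment each poem with jieba and join the tokens with spaces for the vectorizer.
contents = [' '.join(jieba.cut(poem[2])) for poem in poems]

# The default token_pattern drops single-character tokens, which are common in
# classical Chinese, so tokens of length 1 are allowed here.
vectorizer = TfidfVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(contents)

with open('poems_tfidf.txt', 'w', encoding='utf-8') as f:
    for i, poem in enumerate(poems):
        tfidf_values = ' '.join(str(value) for value in X[i].toarray()[0])
        f.write(f'{poem[0]}\t{poem[1]}\t{poem[2]}\t{poem[3]}\t{tfidf_values}\n')
```

Task 3: Based on the output of Task 1, compute statistics over the poems.

Analysis: First read the txt file produced in Task 1, then count the number of poems per poem type and per author, and save each result to an Excel file. Then segment the poem content and render a word cloud. The code is as follows:

```python
import jieba
import pandas as pd
from wordcloud import WordCloud
import matplotlib.pyplot as plt

with open('poems.txt', 'r', encoding='utf-8') as f:
    poems = [line.rstrip('\n').split('\t') for line in f if line.strip()]

# Count the number of poems per poem type
poem_types = {}
for poem in poems:
    if poem[0] not in poem_types:
        poem_types[poem[0]] = 0
    poem_types[poem[0]] += 1
df_poem_types = pd.DataFrame(list(poem_types.items()), columns=['Poem Type', 'Count'])
# Writing .xlsx requires an Excel engine such as openpyxl to be installed.
df_poem_types.to_excel('poem_types_count.xlsx', index=False)

# Count the number of poems per author
authors = {}
for poem in poems:
    if poem[3] not in authors:
        authors[poem[3]] = 0
    authors[poem[3]] += 1
df_authors = pd.DataFrame(list(authors.items()), columns=['Author', 'Count'])
df_authors.to_excel('authors_count.xlsx', index=False)

# Render the word cloud
contents = [' '.join(jieba.cut(poem[2])) for poem in poems]
words = ' '.join(contents)
# font_path must point to a font with Chinese glyphs; 'simhei.ttf' must exist locally.
wordcloud = WordCloud(font_path='simhei.ttf', background_color='white').generate(words)
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```

###### [AI写代码神器 | answer, 1159 points | 2024-06-23 11:09:38]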
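As a rough illustration of what the tf-idf values in Task 2 represent, the sketch below computes them by hand in plain Python. It assumes the smoothed IDF formula, ln((1 + n) / (1 + df)) + 1, and the L2 row normalization that scikit-learn documents as the defaults for TfidfVectorizer. The function name `tfidf_rows` and the sample segmentation are hypothetical and exist only for this example; real jieba output may split the lines differently.

```python
import math
from collections import Counter


def tfidf_rows(tokenized_docs):
    """Return (vocabulary, rows), where rows[i][j] is the TF-IDF weight of
    vocabulary[j] in document i (smoothed IDF, L2-normalized rows)."""
    vocabulary = sorted({tok for doc in tokenized_docs for tok in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents that contain each token.
    doc_freq = Counter(tok for doc in tokenized_docs for tok in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1
           for tok in vocabulary}

    rows = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        raw = [counts[tok] * idf[tok] for tok in vocabulary]
        # Scale each row to unit Euclidean length.
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm if norm else 0.0 for v in raw])
    return vocabulary, rows


# Hypothetical, already-segmented lines standing in for jieba output.
docs = [
    ["床前", "明月", "光"],
    ["举头", "望", "明月"],
    ["低头", "思", "故乡"],
]
vocabulary, rows = tfidf_rows(docs)

# Show the non-zero weights of the first document.
for tok, weight in zip(vocabulary, rows[0]):
    if weight > 0:
        print(tok, round(weight, 4))
```

A token that appears in several documents, such as "明月" here, receives a lower weight than a token unique to one document, which is why tf-idf highlights the words that distinguish one poem from the others.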
