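A Step-by-Step Guide to Building a Trend-Finder Tool with Python: Web Scraping, NLP (Sentiment Analysis and Topic Modeling), and Word Cloud Visualization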

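Monitoring and extracting trends from web content has become essential for market research, content creation, and staying ahead in your field. This tutorial is a practical guide to building your own trend-finder tool in Python. Without external APIs or complex setup, you will learn how to scrape publicly accessible websites, apply natural language processing (NLP) techniques such as sentiment analysis and topic modeling, and visualize emerging trends with word clouds.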

import requests
from bs4 import BeautifulSoup

# List of URLs to scrape
urls = ["https://en.wikipedia.org/wiki/Natural_language_processing",
        "https://en.wikipedia.org/wiki/Machine_learning"]

collected_texts = []  # Stores the text of each page

for url in urls:
    # A timeout keeps the loop from hanging on an unresponsive server
    response = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=10)
    if response.status_code == 200:
        soup = BeautifulSoup(response.text, 'html.parser')
        # Extract the text of every paragraph
        paragraphs = [p.get_text() for p in soup.find_all('p')]
        page_text = " ".join(paragraphs)
        collected_texts.append(page_text.strip())
    else:
        print(f"Failed to retrieve {url}")

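First, the snippet above shows a simple way to scrape text from publicly accessible websites with Python's requests and BeautifulSoup. It fetches each URL, extracts the paragraphs from the HTML, and joins them into one string per page, ready for NLP analysis.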
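To make the paragraph-extraction step concrete without touching the network, here is a minimal sketch that runs on an inline HTML string. It swaps BeautifulSoup for the standard library's `html.parser` so it has no dependencies; the names `ParagraphExtractor`, `extract_page_text`, and `sample_html` are hypothetical and not part of the tutorial's code.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Hypothetical helper: collects the text inside <p> elements."""

    def __init__(self):
        super().__init__()
        self._inside_p = False   # True while reading a <p> element
        self._chunks = []        # text pieces of the current paragraph
        self.paragraphs = []     # finished paragraphs, in document order

    def flush_paragraph(self):
        # Collapse runs of whitespace and store the paragraph if non-empty.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.flush_paragraph()  # tolerate a previous <p> left unclosed
            self._inside_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self.flush_paragraph()
            self._inside_p = False

    def handle_data(self, data):
        if self._inside_p:
            self._chunks.append(data)


def extract_page_text(html):
    """Return all paragraph text of an HTML document as one string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    parser.flush_paragraph()
    return " ".join(parser.paragraphs)


sample_html = """
<html><body>
  <h1>Ignored heading</h1>
  <p>Natural language processing is a <b>subfield</b> of computer science.</p>
  <div>Ignored div text</div>
  <p>It studies how computers handle human language.</p>
</body></html>
"""

page_text = extract_page_text(sample_html)
print(page_text)
```

Headings and other non-paragraph elements are skipped, which is the same effect as keeping only `soup.find_all('p')` in the scraping loop.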

import re
import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # Fetch the stop-word lists on first run

stop_words = set(stopwords.words('english'))

cleaned_texts = []
for text in collected_texts:
    # Replace non-letter characters with spaces and lowercase the text
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    # Remove stop words
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

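Next, we clean the scraped text by lowercasing it, removing punctuation, digits, and other special characters, and filtering out common English stop words with NLTK. This preprocessing leaves text that is clean, focused, and ready for meaningful NLP analysis.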
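The effect of the cleaning step is easiest to see on a single sentence. The sketch below applies the same regular expression and filtering to one string; `clean_text` and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration, standing in for NLTK's much larger English list.

```python
import re

# Tiny stand-in stop-word set (an assumption for this sketch); the tutorial
# uses NLTK's English stop-word list instead.
DEMO_STOP_WORDS = {"the", "is", "a", "of", "and", "in", "to", "it"}


def clean_text(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, replace non-letter characters with spaces, drop stop words."""
    letters_only = re.sub(r"[^A-Za-z\s]", " ", text).lower()
    return " ".join(w for w in letters_only.split() if w not in stop_words)


raw = "The model's accuracy is 95%, and it improved in 2024!"
cleaned = clean_text(raw)
print(cleaned)  # model s accuracy improved
```

The stray "s" comes from the apostrophe in "model's" being replaced by a space. A fuller stop-word list would typically filter such fragments, and the output also shows that digits are discarded along with punctuation.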

from collections import Counter

# To analyze overall trends, merge all texts into one:
all_text = " ".join(cleaned_texts)
word_counts = Counter(all_text.split())
common_words = word_counts.most_common(10)  # The 10 most frequent words
print("Top 10 keywords:", common_words)

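Now we compute word frequencies from the cleaned text and identify the 10 most frequent keywords. This highlights dominant trends and recurring themes, giving a quick view of which topics are prominent in the scraped content.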
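Raw counts are hard to compare across corpora of different sizes, so one possible refinement is to report each keyword's share of all tokens alongside its count. This is a sketch under that assumption; `top_keywords` and `demo_docs` are hypothetical names, and the two short strings stand in for `cleaned_texts`.

```python
from collections import Counter


def top_keywords(cleaned_docs, n=10):
    """Return (word, count, share_of_all_tokens) for the n most frequent words."""
    counts = Counter()
    for doc in cleaned_docs:
        counts.update(doc.split())
    total = sum(counts.values())
    return [(word, count, count / total) for word, count in counts.most_common(n)]


# Stand-in for cleaned_texts (already lowercased and stop-word filtered)
demo_docs = [
    "language models learn language patterns",
    "learning algorithms learn patterns data",
]

for word, count, share in top_keywords(demo_docs, n=3):
    print(f"{word}: {count} ({share:.0%})")
```

Updating the counter document by document gives the same totals as joining everything into one string first, and it avoids building a large intermediate string for bigger corpora.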

# Notebook (Colab/Jupyter) shell command; in a terminal run: pip install textblob
!pip install textblob
from textblob import TextBlob

for i, text in enumerate(cleaned_texts, 1):
    polarity = TextBlob(text).sentiment.polarity
    if polarity > 0.1:
        sentiment = "Positive 😀"
    elif polarity < -0.1:
        sentiment = "Negative 🙁"
    else:
        sentiment = "Neutral 😐"
    print(f"Document {i} sentiment: {sentiment} (polarity={polarity:.2f})")

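We run sentiment analysis on each cleaned document with TextBlob, a Python library built on NLTK. It scores the overall tone of each document as positive, negative, or neutral, and prints the label together with the numeric polarity score, giving a quick indication of the general mood or attitude in the text.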
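The labeling rule itself is independent of TextBlob and can be isolated as a small function. The sketch below assumes a polarity in the range -1.0 to 1.0, which is the range TextBlob reports, and reuses the tutorial's 0.1 cutoff; `label_sentiment` is a hypothetical name.

```python
def label_sentiment(polarity, threshold=0.1):
    """Map a polarity score in [-1.0, 1.0] to a coarse sentiment label."""
    if not -1.0 <= polarity <= 1.0:
        raise ValueError(f"polarity out of range: {polarity}")
    if polarity > threshold:
        return "positive"
    if polarity < -threshold:
        return "negative"
    return "neutral"  # scores within +/- threshold, boundaries included


for score in (0.35, 0.05, -0.1, -0.4):
    print(f"{score:+.2f} -> {label_sentiment(score)}")
```

Keeping the threshold as a parameter makes it easy to widen or narrow the neutral band. Encyclopedic pages such as the two Wikipedia articles used here are written in a neutral register, so their scores are likely to sit near zero.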

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Adjust these parameters for your corpus
vectorizer = CountVectorizer(max_df=1.0, min_df=1, stop_words="english")
doc_term_matrix = vectorizer.fit_transform(cleaned_texts)

# Fit LDA to find topics (for example, 3 topics)
lda = LatentDirichletAllocation(n_components=3, random_state=42)
lda.fit(doc_term_matrix)

feature_names = vectorizer.get_feature_names_out()

for idx, topic in enumerate(lda.components_):
    # argsort()[:-11:-1] yields the indices of the 10 highest-weighted terms
    top_terms = [feature_names[i] for i in topic.argsort()[:-11:-1]]
    print(f"Topic {idx + 1}: ", top_terms)

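Then we apply Latent Dirichlet Allocation (LDA), a popular topic modeling algorithm, to discover latent topics in the text corpus. The code first converts the cleaned texts into a numeric document-term matrix with scikit-learn's CountVectorizer, then fits an LDA model to identify the main topics. The output lists the top keywords of each discovered topic, a concise summary of the key concepts in the collected data.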
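For readers unfamiliar with the document-term matrix that LDA consumes, the following dependency-free sketch builds one by hand and shows how top terms are read off a row of topic weights. It only roughly mirrors CountVectorizer (whose default tokenizer, for instance, also drops single-character tokens); `build_doc_term_matrix`, `top_terms`, and the example weights are made up for illustration and are not LDA output.

```python
from collections import Counter


def build_doc_term_matrix(cleaned_docs):
    """Return (vocabulary, matrix) where matrix[d][t] counts term t in doc d."""
    vocabulary = sorted({word for doc in cleaned_docs for word in doc.split()})
    matrix = []
    for doc in cleaned_docs:
        counts = Counter(doc.split())
        matrix.append([counts[term] for term in vocabulary])
    return vocabulary, matrix


def top_terms(weights, vocabulary, k=3):
    """Return the k vocabulary terms with the highest weights, best first."""
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    return [vocabulary[i] for i in ranked[:k]]


demo_docs = ["data model data", "model training"]
vocabulary, matrix = build_doc_term_matrix(demo_docs)
print(vocabulary)  # ['data', 'model', 'training']
print(matrix)      # [[2, 1, 0], [0, 1, 1]]

# Hypothetical weights for one topic, one value per vocabulary term
example_topic_weights = [0.2, 1.5, 0.9]
print(top_terms(example_topic_weights, vocabulary, k=2))  # ['model', 'training']
```

Each row of `lda.components_` plays the role of `example_topic_weights`: one weight per vocabulary term, with the largest weights naming the topic. Note that the tutorial's corpus holds only two documents, so the three fitted topics should be read as a demonstration rather than a stable result.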

# Assumes collected_texts from the scraping step; combined_text is built below
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import nltk
from nltk.corpus import stopwords
import re

nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

# Preprocess and clean the text:
cleaned_texts = []
for text in collected_texts:
    text = re.sub(r'[^A-Za-z\s]', ' ', text).lower()
    words = [w for w in text.split() if w not in stop_words]
    cleaned_texts.append(" ".join(words))

# Build the combined text
combined_text = " ".join(cleaned_texts)

# Generate the word cloud
wordcloud = WordCloud(width=800, height=400, background_color="white", colormap='viridis').generate(combined_text)

# Display the word cloud
plt.figure(figsize=(10, 6))  # Set the figure size
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title("Word Cloud of Scraped Text", fontsize=16)
plt.show()

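Finally, we generate a word cloud that shows the prominent keywords in the combined, cleaned text. By visually emphasizing the most frequent terms, it makes the main trends and themes in the collected web content easy to explore at a glance.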
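If you want direct control over which words reach the cloud, one option is to compute the frequencies yourself and hand them to the library. The sketch below only builds the frequency dictionary, so it runs without the wordcloud package; `word_frequencies` and its `exclude` parameter are hypothetical, and the excluded word is just an example of site-specific noise.

```python
from collections import Counter


def word_frequencies(text, exclude=(), max_words=200):
    """Count words in text, skipping excluded ones, keeping the top max_words."""
    excluded = set(exclude)
    counts = Counter(w for w in text.split() if w not in excluded)
    return dict(counts.most_common(max_words))


demo_text = "trend data retrieved trend model data trend retrieved"
frequencies = word_frequencies(demo_text, exclude={"retrieved"})
print(frequencies)

# With the wordcloud package installed, the dict could be rendered with:
# WordCloud(width=800, height=400).generate_from_frequencies(frequencies)
```

Passing explicit frequencies keeps the cloud consistent with the keyword counts computed earlier, since no additional tokenization is applied to the text.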

Figure: Word cloud output for the scraped websites

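In conclusion, we have built a simple yet capable trend-finder tool. The exercise gave you hands-on experience with web scraping, NLP analysis, topic modeling, and intuitive visualization with word clouds. With this approach you can keep tracking industry trends, draw insights from social media and blog content, and make informed decisions based on current data.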

Here is the Colab notebook.


This article was compiled by AI 台灣 using AI technology. The content is for reference only; please verify the information yourself.