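Overhaul the chart-plotting methods so that Chinese characters render correctly. Optimize program performance and fix a serious logic error in the _analyze_group_genre_preference function.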

Cat Tom 2025-05-07 23:07:46 +08:00
parent dca1aec8cc
commit 9153124ff7
60 changed files with 308 additions and 91 deletions

@ -3,7 +3,7 @@
<html> <html>
<head> <head>
<meta charset="utf-8"> <meta charset="utf-8">
<title>MovieLens Dataset Analysis Report</title> <title>MovieLens数据集分析报告</title>
<style> <style>
body { font-family: Arial, sans-serif; margin: 0; padding: 20px; color: #333; } body { font-family: Arial, sans-serif; margin: 0; padding: 20px; color: #333; }
.container { max-width: 1200px; margin: 0 auto; } .container { max-width: 1200px; margin: 0 auto; }
@ -23,90 +23,90 @@
</head> </head>
<body> <body>
<div class="container"> <div class="container">
<h1>MovieLens Dataset User-Movie Preference Analysis Report</h1> <h1>MovieLens数据集用户-电影偏好分析报告</h1>
<div class="summary"> <div class="summary">
<h2>Data Overview</h2> <h2>数据概览</h2>
<p>This analysis is based on the MovieLens dataset, containing 6040 users、3883 movies and 1000209 original rating records</p> <p>本分析基于MovieLens数据集包含 6040 位用户、3883 部电影 和 1000209 条原始评分记录</p>
</div> </div>
<h2>User Profile Analysis</h2> <h2>用户基本情况分析</h2>
<div class="figure"> <div class="figure">
<img src="user_analysis/gender_distribution.png" alt="User Gender Distribution"> <img src="user_analysis/gender_distribution.png" alt="User Gender Distribution">
<p class="caption">User Gender Distribution</p> <p class="caption">用户性别分布</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="user_analysis/age_distribution.png" alt="User Age Distribution"> <img src="user_analysis/age_distribution.png" alt="User Age Distribution">
<p class="caption">User Age Distribution</p> <p class="caption">用户年龄分布</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="user_analysis/occupation_distribution.png" alt="User Occupation Distribution"> <img src="user_analysis/occupation_distribution.png" alt="User Occupation Distribution">
<p class="caption">User Occupation Distribution</p> <p class="caption">用户职业分布</p>
</div> </div>
<h2>Movie Distribution Analysis</h2> <h2>电影分布情况分析</h2>
<div class="figure"> <div class="figure">
<img src="movie_analysis/genre_distribution.png" alt="Movie Genre Distribution"> <img src="movie_analysis/genre_distribution.png" alt="Movie Genre Distribution">
<p class="caption">Movie Genre Distribution</p> <p class="caption">电影类型分布</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="movie_analysis/year_distribution.png" alt="Movie Release Year Distribution"> <img src="movie_analysis/year_distribution.png" alt="Movie Release Year Distribution">
<p class="caption">Movie Release Year Distribution</p> <p class="caption">电影发行年份分布</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="movie_analysis/most_rated_movies.png" alt="Top 20 Most Rated Movies"> <img src="movie_analysis/most_rated_movies.png" alt="Top 20 Most Rated Movies">
<p class="caption">Top 20 Most Rated Movies</p> <p class="caption">评分数量最多的20部电影</p>
</div> </div>
<h2>Rating Distribution Analysis</h2> <h2>评分分布情况分析</h2>
<div class="figure"> <div class="figure">
<img src="rating_analysis/rating_distribution.png" alt="Rating Distribution"> <img src="rating_analysis/rating_distribution.png" alt="Rating Distribution">
<p class="caption">Rating Distribution</p> <p class="caption">评分分布情况</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="rating_analysis/genre_avg_ratings.png" alt="Average Rating by Movie Genre"> <img src="rating_analysis/genre_avg_ratings.png" alt="Average Rating by Movie Genre">
<p class="caption">Average Rating by Movie Genre</p> <p class="caption">各类型电影的平均评分</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="rating_analysis/top_rated_movies.png" alt="Top 20 Highest Rated Movies"> <img src="rating_analysis/top_rated_movies.png" alt="Top 20 Highest Rated Movies">
<p class="caption">Top 20 Highest Rated Movies (min. 100 ratings)</p> <p class="caption">评分最高的20部电影至少有100个评分</p>
</div> </div>
<h2>User Characteristics and Movie Preferences</h2> <h2>用户特征与电影偏好分析</h2>
<div class="figure"> <div class="figure">
<img src="preference_analysis/gender_genre_heatmap.png" alt="Movie Genre Preferences by Gender"> <img src="preference_analysis/gender_genre_heatmap.png" alt="Movie Genre Preferences by Gender">
<p class="caption">Movie Genre Preferences by Gender</p> <p class="caption">不同性别的电影类型偏好对比</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="preference_analysis/age_genre_heatmap.png" alt="Movie Genre Preferences by Age Group"> <img src="preference_analysis/age_genre_heatmap.png" alt="Movie Genre Preferences by Age Group">
<p class="caption">Movie Genre Preferences by Age Group</p> <p class="caption">不同年龄段的电影类型偏好对比</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="preference_analysis/age_year_heatmap.png" alt="Preferences for Movies by Decade Across Age Groups"> <img src="preference_analysis/age_year_heatmap.png" alt="Preferences for Movies by Decade Across Age Groups">
<p class="caption">Preferences for Movies by Decade Across Age Groups</p> <p class="caption">不同年龄段对不同年代电影的偏好</p>
</div> </div>
<div class="figure"> <div class="figure">
<img src="preference_analysis/gender_avg_rating.png" alt="Average Rating by Gender"> <img src="preference_analysis/gender_avg_rating.png" alt="Average Rating by Gender">
<p class="caption">Average Rating by Gender</p> <p class="caption">性别与平均评分的关系</p>
</div> </div>
<h2>Conclusions and Insights</h2> <h2>结论与洞察</h2>
<p>Through in-depth analysis of the MovieLens dataset, we found significant correlations between user characteristics (gender, age, occupation) and movie preferences. Key findings include:</p> <p>通过对MovieLens数据集的深入分析我们发现了用户特征如性别、年龄、职业与电影偏好之间存在显著关联。主要结论包括</p>
<ul> <ul>
<li>Significant differences in movie genre preferences between genders</li> <li>不同性别用户在电影类型偏好上存在明显差异</li>
<li>Age influences how users rate movies from different decades</li> <li>年龄因素会影响用户对不同年代电影的评价</li>
<li>Occupational background correlates with genre preferences</li> <li>职业背景与电影类型偏好具有相关性</li>
</ul> </ul>
<p>These findings provide valuable reference for designing movie recommendation systems and developing movie marketing strategies.</p> <p>这些发现对于电影推荐系统的设计和电影营销策略制定具有重要参考价值。</p>
</div> </div>
</body> </body>
</html> </html>

@ -1,28 +1,28 @@
{ {
"Data Overview": { "数据概览": {
"Number of Users": 6040, "用户数量": 6040,
"Number of Movies": 3883, "电影数量": 3883,
"Original Ratings Count": 1000209, "原始评分数量": 1000209,
"Filled Ratings Count": 22384240 "填补后评分数量": 22384240
}, },
"User Analysis": { "用户分析": {
"Gender Distribution": { "性别分布": {
"M": 4331, "M": 4331,
"F": 1709 "F": 1709
}, },
"Age Distribution": { "年龄分布": {
"25-34": 2096, "25-34": 2096,
"35-44": 1193, "35-44": 1193,
"18-24": 1103, "18-24": 1103,
"45-49": 550, "45-49": 550,
"50-55": 496, "50-55": 496,
"56+": 380, "56岁以上": 380,
"Under 18": 222 "18岁以下": 222
} }
}, },
"Rating Analysis": { "评分分析": {
"Average Rating": 3.58, "平均评分": 3.58,
"Rating Distribution": { "评分分布": {
"1": 56174, "1": 56174,
"2": 107557, "2": 107557,
"3": 261197, "3": 261197,
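With the summary keys switched to Chinese, whatever writes this JSON file has to keep non-ASCII text readable. The following is a minimal sketch, assuming the report is serialized with Python's standard `json` module; the `dump_summary` helper and the sample `summary` dict are illustrative and do not appear in the commit.

```python
import json


def dump_summary(summary: dict) -> str:
    """Serialize the summary while keeping Chinese keys human-readable."""
    # ensure_ascii=False writes the characters themselves instead of \uXXXX escapes
    return json.dumps(summary, ensure_ascii=False, indent=2)


# Hypothetical excerpt of the summary shown in the hunk above
summary = {"数据概览": {"用户数量": 6040, "电影数量": 3883}}
text = dump_summary(summary)
print(text)
```

When writing the string to disk, opening the file with `encoding="utf-8"` avoids depending on the platform default encoding.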

56 binary image files not shown.

@ -27,6 +27,8 @@ import platform
import tempfile import tempfile
import re import re
import sys import sys
import matplotlib.font_manager as fm
from matplotlib.font_manager import FontProperties
# 忽略警告 # 忽略警告
warnings.filterwarnings('ignore') warnings.filterwarnings('ignore')
@ -38,8 +40,35 @@ np.random.seed(42)
custom_colors = ['#FF9A76', '#67B7D1', '#A8D5BA', '#D8A47F', '#957DAD', '#7B506F', '#9AACB8'] custom_colors = ['#FF9A76', '#67B7D1', '#A8D5BA', '#D8A47F', '#957DAD', '#7B506F', '#9AACB8']
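# 强制使用英文,避免中文显示问题 # Language switch: Chinese labels by default, English as the fallback
USE_ENGLISH = True # 设置为True使用英文False使用中文如果支持 USE_ENGLISH = False # True: English labels; False: Chinese labels (if a CJK font is available)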
# 电影类型中英文映射字典
GENRE_MAPPING = {
'Action': '动作',
'Adventure': '冒险',
'Animation': '动画',
'Children\'s': '儿童',
'Comedy': '喜剧',
'Crime': '犯罪',
'Documentary': '纪录片',
'Drama': '剧情',
'Fantasy': '奇幻',
'Film-Noir': '黑色电影',
'Horror': '恐怖',
'Musical': '音乐',
'Mystery': '悬疑',
'Romance': '爱情',
'Sci-Fi': '科幻',
'Thriller': '惊悚',
'War': '战争',
'Western': '西部',
'IMAX': 'IMAX',
'Unknown': '未知',
'(no genres listed)': '(未列出类型)'
}
# 全局字体属性
chinese_font = None
# 配置matplotlib字体和编码 # 配置matplotlib字体和编码
def configure_matplotlib_fonts(): def configure_matplotlib_fonts():
@ -47,28 +76,117 @@ def configure_matplotlib_fonts():
# 根据当前环境使用不同的默认字体 # 根据当前环境使用不同的默认字体
system = platform.system() system = platform.system()
global chinese_font
# 检查中文字体是否可用
def has_chinese_font():
all_fonts = set(f.name for f in fm.fontManager.ttflist)
chinese_fonts = ['SimHei', 'Microsoft YaHei', 'PingFang SC', 'Heiti SC',
'WenQuanYi Micro Hei', 'Noto Sans CJK SC', 'Source Han Sans CN',
'Source Han Sans SC', 'Hiragino Sans GB']
return any(font in all_fonts for font in chinese_fonts)
# 获取系统上可用的字体列表
available_fonts = [f.name for f in fm.fontManager.ttflist]
print(f"系统上可用的字体数量: {len(available_fonts)}")
# 打印一些常用中文字体是否存在
for font in ['SimHei', 'Microsoft YaHei', 'PingFang SC']:
print(f"字体 '{font}' 是否存在: {font in available_fonts}")
# 存储可用中文字体名称用于后续使用
font_found = False
if system == 'Windows': if system == 'Windows':
# Windows环境 # Windows环境
plt.rcParams['font.sans-serif'] = ['SimHei', 'Microsoft YaHei', 'Arial'] font_list = ['SimHei', 'Microsoft YaHei', 'SimSun', 'Arial Unicode MS', 'Arial']
plt.rcParams['font.sans-serif'] = font_list
for font in font_list[:-1]:  # skip the trailing 'Arial' fallback, which has no CJK glyphs
if font in available_fonts:
chinese_font = font
font_found = True
break
elif system == 'Darwin': elif system == 'Darwin':
# macOS环境 # macOS环境
plt.rcParams['font.sans-serif'] = ['Arial Unicode MS', 'PingFang SC', 'Heiti SC'] font_list = ['Arial Unicode MS', 'PingFang SC', 'Heiti SC', 'Hiragino Sans GB']
plt.rcParams['font.sans-serif'] = font_list
for font in font_list:
if font in available_fonts:
chinese_font = font
font_found = True
break
else: else:
# Linux环境或其他 # Linux环境或其他
plt.rcParams['font.sans-serif'] = ['DejaVu Sans', 'WenQuanYi Micro Hei', 'Noto Sans CJK SC'] font_list = ['WenQuanYi Micro Hei', 'Noto Sans CJK SC', 'Source Han Sans CN', 'DejaVu Sans']
plt.rcParams['font.sans-serif'] = font_list
for font in font_list[:-1]:  # skip the trailing 'DejaVu Sans' fallback, which has no CJK glyphs
if font in available_fonts:
chinese_font = font
font_found = True
break
print(f"字体配置完成,当前系统: {system}")
# 测试当前字体配置是否支持中文
test_char = '测试中文字体'
fig = plt.figure(figsize=(6, 1))
plt.text(0.5, 0.5, test_char, fontsize=14, ha='center')
plt.tight_layout()
# 保存测试图像到临时文件
with tempfile.NamedTemporaryFile(suffix='.png', delete=False) as tmp:
temp_path = tmp.name
plt.savefig(temp_path)
plt.close()
print(f"已生成中文字体测试图像: {temp_path}")
# Fall back to English labels when no candidate CJK font is installed (the test image above is not inspected)
global USE_ENGLISH
USE_ENGLISH = USE_ENGLISH or not font_found
# 通用设置 # 通用设置
plt.rcParams['axes.unicode_minus'] = False # 正确显示负号 plt.rcParams['axes.unicode_minus'] = False # 正确显示负号
plt.rcParams['font.family'] = 'sans-serif' plt.rcParams['font.family'] = 'sans-serif'
# 如果找到中文字体,打印出来
print(f"字体配置完成,当前系统: {system}") if chinese_font:
print(f"将使用中文字体: {chinese_font}")
else:
print("未找到可用的中文字体,将使用系统默认字体")
# 应用字体配置 # 应用字体配置
configure_matplotlib_fonts() configure_matplotlib_fonts()
# 设置图表样式 # 处理seaborn兼容性的字体设置函数
def apply_chinese_font():
    """Re-apply the CJK font before each plot so seaborn styles cannot override it."""
    if not USE_ENGLISH and chinese_font:
        # Put the CJK font first without duplicating it on repeated calls
        fallbacks = [f for f in plt.rcParams['font.sans-serif'] if f != chinese_font]
        plt.rcParams['font.sans-serif'] = [chinese_font] + fallbacks
        plt.rcParams['font.family'] = 'sans-serif'
        plt.rcParams['axes.unicode_minus'] = False
# 获取电影类型的显示名称
def get_genre_display_name(genre):
"""
获取电影类型的显示名称根据当前语言设置返回中文或英文
参数:
genre (str): 原始电影类型名称(英文)
返回:
str: 显示用的电影类型名称
"""
return GENRE_MAPPING.get(genre, genre) if not USE_ENGLISH else genre
# 设置图表样式(但不让它覆盖我们的字体设置)
sns.set_style("whitegrid") sns.set_style("whitegrid")
plt.style.use('seaborn-v0_8-pastel') plt.style.use('seaborn-v0_8-pastel')
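The font setup in this hunk boils down to two steps: pick the first installed font from a list of CJK candidates, then place it at the front of `font.sans-serif`. A hedged sketch of those two steps follows; the helper names `pick_cjk_font`, `prepend_font`, and `configure` are assumptions for illustration, and matplotlib is imported lazily so the pure helpers can be exercised without it.

```python
def pick_cjk_font(candidates, installed):
    """Return the first candidate font that is installed, or None."""
    installed = set(installed)
    return next((name for name in candidates if name in installed), None)


def prepend_font(current, font):
    """Return a font list with `font` first and no duplicate entries."""
    return [font] + [f for f in current if f != font]


def configure(candidates):
    """Apply the first available CJK font to matplotlib's rcParams."""
    # Lazy imports: only needed when the configuration is actually applied
    import matplotlib.pyplot as plt
    from matplotlib import font_manager as fm

    installed = {f.name for f in fm.fontManager.ttflist}
    font = pick_cjk_font(candidates, installed)
    if font is not None:
        plt.rcParams["font.sans-serif"] = prepend_font(
            plt.rcParams["font.sans-serif"], font
        )
        plt.rcParams["font.family"] = "sans-serif"
        plt.rcParams["axes.unicode_minus"] = False
    return font
```

Keeping general-purpose fonts such as Arial or DejaVu Sans out of the candidate list matters here, because they are often installed but contain no Chinese glyphs.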
@ -257,24 +375,26 @@ class MovieLensDataAnalyzer:
gender_counts = self.users_df['gender'].value_counts() gender_counts = self.users_df['gender'].value_counts()
gender_labels = {'M': 'Male' if USE_ENGLISH else '男性', gender_labels = {'M': 'Male' if USE_ENGLISH else '男性',
'F': 'Female' if USE_ENGLISH else '女性'} 'F': 'Female' if USE_ENGLISH else '女性'}
apply_chinese_font()
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
ax = gender_counts.plot(kind='pie', autopct='%1.1f%%', startangle=90, # 直接使用plt.pie()函数,只显示本地化的标签,不显示原始的"M"和"F"
colors=[custom_colors[0], custom_colors[1]]) wedges, texts, autotexts = plt.pie(
gender_counts.values,
autopct = '%1.1f%%',
startangle = 90,
colors = [custom_colors[0], custom_colors[1]],
labels = [gender_labels[idx] for idx in gender_counts.index]
)
plt.title('User Gender Distribution' if USE_ENGLISH else '用户性别分布') plt.title('User Gender Distribution' if USE_ENGLISH else '用户性别分布')
plt.ylabel('') plt.ylabel('')
# 修改饼图标签
patches, texts, autotexts = ax.pie(gender_counts, autopct='%1.1f%%', startangle=90,
colors=[custom_colors[0], custom_colors[1]],
labels=[gender_labels[idx] for idx in gender_counts.index])
plt.tight_layout() plt.tight_layout()
plt.savefig(os.path.join(user_analysis_dir, 'gender_distribution.png'), bbox_inches='tight', dpi=100) plt.savefig(os.path.join(user_analysis_dir, 'gender_distribution.png'), bbox_inches='tight', dpi=100)
plt.close() plt.close()
# 2. 用户年龄分布 # 2. 用户年龄分布
age_counts = self.users_df['age_group'].value_counts().sort_index() age_counts = self.users_df['age_group'].value_counts().sort_index()
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
sns.barplot(x=age_counts.index, y=age_counts.values, palette=custom_colors) sns.barplot(x=age_counts.index, y=age_counts.values, palette=custom_colors)
plt.title('User Age Distribution' if USE_ENGLISH else '用户年龄分布') plt.title('User Age Distribution' if USE_ENGLISH else '用户年龄分布')
@ -287,7 +407,7 @@ class MovieLensDataAnalyzer:
# 3. 用户职业分布 # 3. 用户职业分布
occupation_counts = self.users_df['occupation_name'].value_counts().sort_values(ascending=False) occupation_counts = self.users_df['occupation_name'].value_counts().sort_values(ascending=False)
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
sns.barplot(x=occupation_counts.values, y=occupation_counts.index, palette=custom_colors) sns.barplot(x=occupation_counts.values, y=occupation_counts.index, palette=custom_colors)
plt.title('User Occupation Distribution' if USE_ENGLISH else '用户职业分布') plt.title('User Occupation Distribution' if USE_ENGLISH else '用户职业分布')
@ -299,7 +419,7 @@ class MovieLensDataAnalyzer:
# 4. 用户地域分布 (使用前20个邮编区域作为示例) # 4. 用户地域分布 (使用前20个邮编区域作为示例)
region_counts = self.users_df['region'].value_counts().head(20) region_counts = self.users_df['region'].value_counts().head(20)
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
sns.barplot(x=region_counts.values, y=region_counts.index, palette=custom_colors) sns.barplot(x=region_counts.values, y=region_counts.index, palette=custom_colors)
title_text = 'User Regional Distribution (Top 20 ZIP Codes)' if USE_ENGLISH else '用户地域分布 (前20个邮编区域)' title_text = 'User Regional Distribution (Top 20 ZIP Codes)' if USE_ENGLISH else '用户地域分布 (前20个邮编区域)'
@ -311,6 +431,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 5. 性别与年龄的组合分布 # 5. 性别与年龄的组合分布
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
gender_age_counts = self.users_df.groupby(['age_group', 'gender']).size().unstack() gender_age_counts = self.users_df.groupby(['age_group', 'gender']).size().unstack()
@ -331,7 +452,7 @@ class MovieLensDataAnalyzer:
# 6. 用户评分活跃度分析 # 6. 用户评分活跃度分析
user_rating_counts = self.ratings_df['userId'].value_counts() user_rating_counts = self.ratings_df['userId'].value_counts()
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
sns.histplot(user_rating_counts, bins=30, kde=True, color=custom_colors[0]) sns.histplot(user_rating_counts, bins=30, kde=True, color=custom_colors[0])
plt.title('User Rating Activity Distribution' if USE_ENGLISH else '用户评分活跃度分布') plt.title('User Rating Activity Distribution' if USE_ENGLISH else '用户评分活跃度分布')
@ -346,11 +467,12 @@ class MovieLensDataAnalyzer:
user_activity = user_activity.merge(self.users_df, on='userId') user_activity = user_activity.merge(self.users_df, on='userId')
# 7.1 性别与评分活跃度 # 7.1 性别与评分活跃度
apply_chinese_font()
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
ax = sns.boxplot(x='gender', y='rating_count', data=user_activity, palette=[custom_colors[0], custom_colors[1]]) ax = sns.boxplot(x='gender', y='rating_count', data=user_activity, order=['M', 'F'], palette=[custom_colors[0], custom_colors[1]])  # fix the category order so the tick labels below match
# 修改x轴标签为本地化文本 # 修改x轴标签为本地化文本
ax.set_xticklabels(['Male', 'Female']) # 始终使用英文标签确保显示正确 ax.set_xticklabels([gender_labels['M'], gender_labels['F']]) # 使用本地化标签
plt.title('Rating Activity by Gender' if USE_ENGLISH else '不同性别的评分活跃度分布') plt.title('Rating Activity by Gender' if USE_ENGLISH else '不同性别的评分活跃度分布')
plt.xlabel('Gender' if USE_ENGLISH else '性别') plt.xlabel('Gender' if USE_ENGLISH else '性别')
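Hard-coded tick labels are only correct if the plotted categories appear in the same order. One way to keep the two in step, sketched here under the assumption that the category order is stated explicitly, is to derive the labels from that order; `labels_for` is a hypothetical helper, and the seaborn call is shown as a comment because it needs the project's data.

```python
GENDER_LABELS = {"M": "男性", "F": "女性"}


def labels_for(order, mapping):
    """Map category codes to display labels, preserving the given order."""
    return [mapping.get(code, code) for code in order]


order = ["M", "F"]
tick_labels = labels_for(order, GENDER_LABELS)
print(tick_labels)

# Intended use with seaborn (not executed here):
#   ax = sns.boxplot(x="gender", y="rating_count", data=user_activity, order=order)
#   ax.set_xticklabels(tick_labels)
```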
@ -360,6 +482,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 7.2 年龄段与评分活跃度 # 7.2 年龄段与评分活跃度
apply_chinese_font()
plt.figure(figsize=(14, 6)) plt.figure(figsize=(14, 6))
sns.boxplot(x='age_group', y='rating_count', data=user_activity, palette=custom_colors) sns.boxplot(x='age_group', y='rating_count', data=user_activity, palette=custom_colors)
plt.title('Rating Activity by Age Group' if USE_ENGLISH else '不同年龄段的评分活跃度分布') plt.title('Rating Activity by Age Group' if USE_ENGLISH else '不同年龄段的评分活跃度分布')
@ -371,6 +494,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 7.3 职业与评分活跃度 # 7.3 职业与评分活跃度
apply_chinese_font()
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
sns.boxplot(x='occupation_name', y='rating_count', data=user_activity, palette=custom_colors) sns.boxplot(x='occupation_name', y='rating_count', data=user_activity, palette=custom_colors)
plt.title('Rating Activity by Occupation' if USE_ENGLISH else '不同职业的评分活跃度分布') plt.title('Rating Activity by Occupation' if USE_ENGLISH else '不同职业的评分活跃度分布')
@ -398,7 +522,7 @@ class MovieLensDataAnalyzer:
# 过滤掉没有年份信息的电影 # 过滤掉没有年份信息的电影
valid_years = self.movies_df[self.movies_df['year'].notnull()] valid_years = self.movies_df[self.movies_df['year'].notnull()]
year_counts = valid_years['year'].value_counts().sort_index() year_counts = valid_years['year'].value_counts().sort_index()
apply_chinese_font()
plt.figure(figsize=(16, 6)) plt.figure(figsize=(16, 6))
sns.barplot(x=year_counts.index, y=year_counts.values, color=custom_colors[0]) sns.barplot(x=year_counts.index, y=year_counts.values, color=custom_colors[0])
plt.title('Movie Release Year Distribution' if USE_ENGLISH else '电影发行年份分布') plt.title('Movie Release Year Distribution' if USE_ENGLISH else '电影发行年份分布')
@ -411,14 +535,23 @@ class MovieLensDataAnalyzer:
# 2. 电影类型分布 # 2. 电影类型分布
# 统计每种类型的电影数量 # 统计每种类型的电影数量
genre_counts = defaultdict(int) original_genre_counts = defaultdict(int)
for genres in self.movies_df['genres']: for genres in self.movies_df['genres']:
for genre in genres: for genre in genres:
genre_counts[genre] += 1 original_genre_counts[genre] += 1
# 使用映射字典处理电影类型名称
if not USE_ENGLISH:
# 转换为带有中文名称的字典
genre_counts = {get_genre_display_name(genre): count
for genre, count in original_genre_counts.items()}
else:
# 保持英文
genre_counts = original_genre_counts
# 转换为Series并排序 # 转换为Series并排序
genre_series = pd.Series(genre_counts).sort_values(ascending=False) genre_series = pd.Series(genre_counts)
# Sort by count in descending order; the index already holds the display names
genre_series = genre_series.sort_values(ascending=False)
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
sns.barplot(x=genre_series.values, y=genre_series.index, palette=custom_colors) sns.barplot(x=genre_series.values, y=genre_series.index, palette=custom_colors)
plt.title('Movie Genre Distribution' if USE_ENGLISH else '电影类型分布') plt.title('Movie Genre Distribution' if USE_ENGLISH else '电影类型分布')
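The counting step in this hunk can also be written with `collections.Counter`, which sorts by frequency for free. This is a sketch of an alternative, not the commit's code; `count_genres` and `localized_counts` are illustrative names, and each movie is assumed to carry a list of genre strings.

```python
from collections import Counter


def count_genres(movies_genres):
    """Count how many movies carry each genre."""
    counts = Counter()
    for genres in movies_genres:
        counts.update(genres)
    return counts


def localized_counts(counts, mapping):
    """Return (display name, count) pairs, most frequent genre first."""
    return [(mapping.get(genre, genre), n) for genre, n in counts.most_common()]


movies = [["Action", "Comedy"], ["Comedy"], ["Drama", "Comedy", "Action"]]
pairs = localized_counts(count_genres(movies), {"Comedy": "喜剧", "Action": "动作"})
print(pairs)
```

Relabeling after the count, rather than before, keeps the English names as the only keys that the counting logic ever sees.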
@ -430,7 +563,7 @@ class MovieLensDataAnalyzer:
# 3. 电影评分数量分布 # 3. 电影评分数量分布
movie_rating_counts = self.ratings_df['movieId'].value_counts() movie_rating_counts = self.ratings_df['movieId'].value_counts()
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
sns.histplot(movie_rating_counts, bins=30, kde=True, color=custom_colors[1]) sns.histplot(movie_rating_counts, bins=30, kde=True, color=custom_colors[1])
plt.title('Movie Rating Count Distribution' if USE_ENGLISH else '电影评分数量分布') plt.title('Movie Rating Count Distribution' if USE_ENGLISH else '电影评分数量分布')
@ -447,7 +580,7 @@ class MovieLensDataAnalyzer:
'rating_count': top_movies.values 'rating_count': top_movies.values
}) })
top_movies_df = top_movies_df.merge(self.movies_df[['movieId', 'title']], on='movieId') top_movies_df = top_movies_df.merge(self.movies_df[['movieId', 'title']], on='movieId')
apply_chinese_font()
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
sns.barplot(y='title', x='rating_count', data=top_movies_df, palette=custom_colors) sns.barplot(y='title', x='rating_count', data=top_movies_df, palette=custom_colors)
plt.title('Top 20 Most Rated Movies' if USE_ENGLISH else '评分数量最多的20部电影') plt.title('Top 20 Most Rated Movies' if USE_ENGLISH else '评分数量最多的20部电影')
@ -469,10 +602,24 @@ class MovieLensDataAnalyzer:
# 计算每种类型的平均评分数量 # 计算每种类型的平均评分数量
genre_avg_counts = {genre: np.mean(counts) for genre, counts in genre_rating_counts.items()} genre_avg_counts = {genre: np.mean(counts) for genre, counts in genre_rating_counts.items()}
genre_avg_counts = pd.Series(genre_avg_counts).sort_values(ascending=False) # 处理显示名称
if not USE_ENGLISH:
# Rank the original English genre keys by average number of ratings
sorted_genres = sorted(genre_avg_counts.keys(),
key = lambda x: genre_avg_counts[x],
reverse = True)
# 创建一个排序后的Series但使用中文名称作为索引
genre_avg_counts_series = pd.Series([genre_avg_counts[genre] for genre in sorted_genres],
index = [get_genre_display_name(genre) for genre in sorted_genres])
else:
# 英文模式直接排序
genre_avg_counts_series = pd.Series(genre_avg_counts).sort_values(ascending=False)
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
sns.barplot(x=genre_avg_counts.values, y=genre_avg_counts.index, palette=custom_colors) sns.barplot(x=genre_avg_counts_series.values, y=genre_avg_counts_series.index, palette=custom_colors)
plt.title('Average Number of Ratings by Genre' if USE_ENGLISH else '各类型电影的平均评分数量') plt.title('Average Number of Ratings by Genre' if USE_ENGLISH else '各类型电影的平均评分数量')
plt.xlabel('Average Number of Ratings' if USE_ENGLISH else '平均评分数量') plt.xlabel('Average Number of Ratings' if USE_ENGLISH else '平均评分数量')
plt.ylabel('Movie Genre' if USE_ENGLISH else '电影类型') plt.ylabel('Movie Genre' if USE_ENGLISH else '电影类型')
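The Chinese branch above ranks genres on their English keys first and swaps in display names only at the end, so the sort never depends on translated text. A compact sketch of that ordering, assuming per-genre lists of rating counts as input; `ranked_averages` is a hypothetical helper and uses `statistics.mean` in place of `np.mean`.

```python
from statistics import mean


def ranked_averages(values_by_genre, mapping):
    """Average each genre's values, rank descending, then relabel for display."""
    # Skip genres with no data so mean() never sees an empty list
    averages = {g: mean(v) for g, v in values_by_genre.items() if v}
    ranked = sorted(averages, key=averages.get, reverse=True)
    return [(mapping.get(g, g), averages[g]) for g in ranked]


data = {"Drama": [10, 20], "War": [40], "Western": []}
result = ranked_averages(data, {"Drama": "剧情", "War": "战争"})
print(result)
```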
@ -491,7 +638,7 @@ class MovieLensDataAnalyzer:
year_rating_df = pd.DataFrame(year_rating_data, columns=['year', 'rating_count']) year_rating_df = pd.DataFrame(year_rating_data, columns=['year', 'rating_count'])
year_avg_counts = year_rating_df.groupby('year')['rating_count'].mean().sort_index() year_avg_counts = year_rating_df.groupby('year')['rating_count'].mean().sort_index()
apply_chinese_font()
plt.figure(figsize=(16, 6)) plt.figure(figsize=(16, 6))
sns.lineplot(x=year_avg_counts.index, y=year_avg_counts.values, marker='o', color=custom_colors[2]) sns.lineplot(x=year_avg_counts.index, y=year_avg_counts.values, marker='o', color=custom_colors[2])
plt.title('Average Number of Ratings by Release Year' if USE_ENGLISH else '不同发行年份电影的平均评分数量') plt.title('Average Number of Ratings by Release Year' if USE_ENGLISH else '不同发行年份电影的平均评分数量')
@ -516,6 +663,7 @@ class MovieLensDataAnalyzer:
os.makedirs(rating_analysis_dir) os.makedirs(rating_analysis_dir)
# 1. 评分值分布 # 1. 评分值分布
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
rating_counts = self.ratings_df['rating'].value_counts().sort_index() rating_counts = self.ratings_df['rating'].value_counts().sort_index()
sns.barplot(x=rating_counts.index, y=rating_counts.values, palette=custom_colors) sns.barplot(x=rating_counts.index, y=rating_counts.values, palette=custom_colors)
@ -528,6 +676,7 @@ class MovieLensDataAnalyzer:
# 2. 原始评分与填补评分的分布对比 # 2. 原始评分与填补评分的分布对比
if 'isOriginal' in self.filled_ratings_df.columns: if 'isOriginal' in self.filled_ratings_df.columns:
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
sns.histplot( sns.histplot(
data=self.filled_ratings_df, data=self.filled_ratings_df,
@ -559,7 +708,7 @@ class MovieLensDataAnalyzer:
self.ratings_df['year'] = pd.to_datetime(self.ratings_df['timestamp'], unit='s').dt.year self.ratings_df['year'] = pd.to_datetime(self.ratings_df['timestamp'], unit='s').dt.year
yearly_ratings = self.ratings_df.groupby('year')['rating'].agg(['mean', 'count']).reset_index() yearly_ratings = self.ratings_df.groupby('year')['rating'].agg(['mean', 'count']).reset_index()
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
ax1 = plt.gca() ax1 = plt.gca()
@ -607,21 +756,37 @@ class MovieLensDataAnalyzer:
genre_avg_ratings[genre] = np.mean(ratings) genre_avg_ratings[genre] = np.mean(ratings)
genre_rating_counts[genre] = len(ratings) genre_rating_counts[genre] = len(ratings)
# 创建DataFrame以用于绘图
if not USE_ENGLISH:
# 按原始英文类型的平均评分排序
sorted_genres = sorted(genre_avg_ratings.keys(),
key = lambda x: genre_avg_ratings[x],
reverse = True)
# 创建排序好的DataFrame使用中文显示名称
genre_stats = pd.DataFrame({
'genre': [get_genre_display_name(genre) for genre in sorted_genres],
'avg_rating': [genre_avg_ratings[genre] for genre in sorted_genres],
'rating_count': [genre_rating_counts[genre] for genre in sorted_genres]
})
else:
# 英文模式,直接使用原始数据
# 转换为DataFrame # 转换为DataFrame
genre_stats = pd.DataFrame({ genre_stats = pd.DataFrame({
'genre': list(genre_avg_ratings.keys()), 'genre': list(genre_avg_ratings.keys()),
'avg_rating': list(genre_avg_ratings.values()), 'avg_rating': list(genre_avg_ratings.values()),
'rating_count': list(genre_rating_counts.values()) 'rating_count': list(genre_rating_counts.values())
}) })
# 按平均评分排序 # 按平均评分排序
genre_stats = genre_stats.sort_values('avg_rating', ascending=False) genre_stats = genre_stats.sort_values('avg_rating', ascending=False)
apply_chinese_font()
plt.figure(figsize=(14, 8)) plt.figure(figsize=(14, 8))
sns.barplot(y='genre', x='avg_rating', data=genre_stats, palette=custom_colors) sns.barplot(y='genre', x='avg_rating', data=genre_stats, palette=custom_colors)
plt.title('Average Rating by Genre' if USE_ENGLISH else '各类型电影的平均评分') plt.title('Average Rating by Genre' if USE_ENGLISH else '各类型电影的平均评分')
plt.xlabel('Average Rating' if USE_ENGLISH else '平均评分') plt.xlabel('Average Rating' if USE_ENGLISH else '平均评分')
plt.ylabel('Movie Genre' if USE_ENGLISH else '电影类型') plt.ylabel('Movie Genre' if USE_ENGLISH else '电影类型')
# Adjust the layout so every genre name on the y-axis is fully visible
plt.tight_layout(pad=2)
plt.grid(True, linestyle='--', alpha=0.7) plt.grid(True, linestyle='--', alpha=0.7)
plt.tight_layout() plt.tight_layout()
plt.savefig(os.path.join(rating_analysis_dir, 'genre_avg_ratings.png'), bbox_inches='tight', dpi=100) plt.savefig(os.path.join(rating_analysis_dir, 'genre_avg_ratings.png'), bbox_inches='tight', dpi=100)
@ -633,7 +798,7 @@ class MovieLensDataAnalyzer:
top_rated_movies = popular_movies.head(20).reset_index() top_rated_movies = popular_movies.head(20).reset_index()
top_rated_movies = top_rated_movies.merge(self.movies_df[['movieId', 'title']], on='movieId') top_rated_movies = top_rated_movies.merge(self.movies_df[['movieId', 'title']], on='movieId')
apply_chinese_font()
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
bars = sns.barplot(y='title', x='mean', data=top_rated_movies, palette=custom_colors) bars = sns.barplot(y='title', x='mean', data=top_rated_movies, palette=custom_colors)
@ -664,7 +829,7 @@ class MovieLensDataAnalyzer:
year_rating_df = pd.DataFrame(movie_year_ratings, columns=['year', 'avg_rating', 'count']) year_rating_df = pd.DataFrame(movie_year_ratings, columns=['year', 'avg_rating', 'count'])
year_avg_ratings = year_rating_df.groupby('year')['avg_rating'].mean() year_avg_ratings = year_rating_df.groupby('year')['avg_rating'].mean()
year_rating_counts = year_rating_df.groupby('year')['count'].sum() year_rating_counts = year_rating_df.groupby('year')['count'].sum()
apply_chinese_font()
plt.figure(figsize=(16, 8)) plt.figure(figsize=(16, 8))
ax1 = plt.gca() ax1 = plt.gca()
@ -717,7 +882,7 @@ class MovieLensDataAnalyzer:
for gender in gender_genre_preferences.keys(): for gender in gender_genre_preferences.keys():
gender_preferences = gender_genre_preferences[gender].sort_values(ascending=False).head(10) gender_preferences = gender_genre_preferences[gender].sort_values(ascending=False).head(10)
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
sns.barplot(x=gender_preferences.index, y=gender_preferences.values, palette=custom_colors) sns.barplot(x=gender_preferences.index, y=gender_preferences.values, palette=custom_colors)
@ -740,6 +905,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# Create a heatmap comparing genre preferences by gender # Create a heatmap comparing genre preferences by gender
apply_chinese_font()
plt.figure(figsize=(14, 10)) plt.figure(figsize=(14, 10))
gender_heatmap_data = pd.DataFrame(gender_genre_preferences) gender_heatmap_data = pd.DataFrame(gender_genre_preferences)
@ -768,7 +934,7 @@ class MovieLensDataAnalyzer:
# Plot the top 5 favorite genres for each age group # Plot the top 5 favorite genres for each age group
for age_group in age_genre_preferences.keys(): for age_group in age_genre_preferences.keys():
age_preferences = age_genre_preferences[age_group].sort_values(ascending=False).head(5) age_preferences = age_genre_preferences[age_group].sort_values(ascending=False).head(5)
apply_chinese_font()
plt.figure(figsize=(10, 5)) plt.figure(figsize=(10, 5))
sns.barplot(x=age_preferences.index, y=age_preferences.values, palette=custom_colors) sns.barplot(x=age_preferences.index, y=age_preferences.values, palette=custom_colors)
@ -791,6 +957,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# Create a heatmap comparing genre preferences across age groups # Create a heatmap comparing genre preferences across age groups
apply_chinese_font()
plt.figure(figsize=(16, 12)) plt.figure(figsize=(16, 12))
age_heatmap_data = pd.DataFrame(age_genre_preferences) age_heatmap_data = pd.DataFrame(age_genre_preferences)
@ -801,8 +968,10 @@ class MovieLensDataAnalyzer:
# Sort by overall average rating, descending # Sort by overall average rating, descending
age_heatmap_data['Overall'] = age_heatmap_data.mean(axis=1) age_heatmap_data['Overall'] = age_heatmap_data.mean(axis=1)
age_heatmap_data = age_heatmap_data.sort_values('Overall', ascending=False) age_heatmap_data = age_heatmap_data.sort_values('Overall', ascending=False)
# Translate the genre names when needed
if not USE_ENGLISH:
age_heatmap_data.index = [get_genre_display_name(genre) for genre in age_heatmap_data.index]
age_heatmap_data = age_heatmap_data.drop('Overall', axis=1) age_heatmap_data = age_heatmap_data.drop('Overall', axis=1)
sns.heatmap(age_heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) sns.heatmap(age_heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Movie Genre Preferences by Age Group' if USE_ENGLISH else '不同年龄段的电影类型偏好对比') plt.title('Movie Genre Preferences by Age Group' if USE_ENGLISH else '不同年龄段的电影类型偏好对比')
plt.tight_layout() plt.tight_layout()
@ -830,7 +999,7 @@ class MovieLensDataAnalyzer:
for occupation in selected_occupations: for occupation in selected_occupations:
if occupation in occupation_genre_preferences: if occupation in occupation_genre_preferences:
occ_preferences = occupation_genre_preferences[occupation].sort_values(ascending=False).head(5) occ_preferences = occupation_genre_preferences[occupation].sort_values(ascending=False).head(5)
apply_chinese_font()
plt.figure(figsize=(10, 5)) plt.figure(figsize=(10, 5))
sns.barplot(x=occ_preferences.index, y=occ_preferences.values, palette=custom_colors) sns.barplot(x=occ_preferences.index, y=occ_preferences.values, palette=custom_colors)
@ -853,6 +1022,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
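# Create a heatmap comparing genre preferences for the selected occupations # Create a heatmap comparing genre preferences for the selected occupations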
apply_chinese_font()
plt.figure(figsize=(16, 12)) plt.figure(figsize=(16, 12))
selected_data = {occ: occupation_genre_preferences[occ] for occ in selected_occupations if selected_data = {occ: occupation_genre_preferences[occ] for occ in selected_occupations if
occ in occupation_genre_preferences} occ in occupation_genre_preferences}
@ -868,6 +1038,11 @@ class MovieLensDataAnalyzer:
occ_heatmap_data = occ_heatmap_data.sort_values('Overall', ascending=False) occ_heatmap_data = occ_heatmap_data.sort_values('Overall', ascending=False)
occ_heatmap_data = occ_heatmap_data.drop('Overall', axis=1) occ_heatmap_data = occ_heatmap_data.drop('Overall', axis=1)
# Translate the genre names when needed
if not USE_ENGLISH:
occ_heatmap_data.index = [get_genre_display_name(genre) for genre in occ_heatmap_data.index]
# Column names are already occupation names, so they need no translation
sns.heatmap(occ_heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) sns.heatmap(occ_heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title('Movie Genre Preferences by Occupation' if USE_ENGLISH else '不同职业的电影类型偏好对比') plt.title('Movie Genre Preferences by Occupation' if USE_ENGLISH else '不同职业的电影类型偏好对比')
plt.tight_layout() plt.tight_layout()
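The heatmap hunks above relabel the index through `get_genre_display_name`, whose definition is not shown in this part of the diff. The following is a minimal sketch of how such a lookup could behave, assuming a plain dictionary with a fallback to the original English name. The table `GENRE_DISPLAY_NAMES`, its three entries, and the helper names are hypothetical and only illustrate the idea.

```python
# Hypothetical subset of a genre translation table; the real mapping
# used by the commit is not visible in this diff.
GENRE_DISPLAY_NAMES = {"Action": "动作", "Comedy": "喜剧", "Drama": "剧情"}


def display_name_sketch(genre, use_english=False):
    """Return the label to draw for a genre (illustrative only)."""
    if use_english:
        return genre
    # Fall back to the English name when no translation is known,
    # so an unmapped genre still gets a readable label.
    return GENRE_DISPLAY_NAMES.get(genre, genre)


def translate_index(labels, use_english=False):
    """Translate a sequence of genre labels, preserving their order."""
    return [display_name_sketch(label, use_english) for label in labels]


translated = translate_index(["Drama", "Film-Noir"])
```

Translating only after sorting and dropping the helper column, as the hunks do, keeps the numeric work keyed on the original English names and confines the display names to the final labeling step.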
@ -916,6 +1091,39 @@ class MovieLensDataAnalyzer:
# Store each group's ratings for each genre # Store each group's ratings for each genre
group_genre_ratings = defaultdict(lambda: defaultdict(list)) group_genre_ratings = defaultdict(lambda: defaultdict(list))
# Collect the ratings for each genre
for _, row in ratings_with_movies.iterrows():
group = row[group_col]
for genre in row['genres']:
group_genre_ratings[group][genre].append(row['rating'])
# Compute the averages after the loop finishes, not inside it
# Compute each group's average rating for each genre
group_genre_avg_ratings = {}
for group, genre_ratings in group_genre_ratings.items():
orig_genre_avg_ratings = {genre: np.mean(ratings) for genre, ratings in genre_ratings.items()}
# In Chinese mode, convert the genre names
if not USE_ENGLISH:
# Convert to the Chinese genre names
translated_ratings = {}
for genre, avg_rating in orig_genre_avg_ratings.items():
# Look up the Chinese name
genre_name = get_genre_display_name(genre)
translated_ratings[genre_name] = avg_rating
group_genre_avg_ratings[group] = pd.Series(translated_ratings)
else:
# English mode: keep the names as they are
group_genre_avg_ratings[group] = pd.Series(orig_genre_avg_ratings)
return group_genre_avg_ratings
def _analyze_group_genre_preference_orig(self, user_ratings, group_col):
# Merge in the movie information
ratings_with_movies = user_ratings.merge(self.movies_df[['movieId', 'genres']], on='movieId')
# Store each group's ratings for each genre
group_genre_ratings = defaultdict(lambda: defaultdict(list))
# Collect the ratings for each genre # Collect the ratings for each genre
for _, row in ratings_with_movies.iterrows(): for _, row in ratings_with_movies.iterrows():
group = row[group_col] group = row[group_col]
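The rewritten `_analyze_group_genre_preference` in this hunk first collects every rating and only then computes the averages. That collect-then-average pattern can be sketched without pandas as shown below. This is an illustration under assumptions, not the commit's code: the function name `average_by_group_and_genre`, the tuple-shaped rows, and the sample data are all hypothetical.

```python
from collections import defaultdict
from statistics import mean


def average_by_group_and_genre(rows):
    """rows: iterable of (group, genres, rating) tuples (hypothetical shape)."""
    collected = defaultdict(lambda: defaultdict(list))
    # Pass 1: only collect ratings; no averaging happens inside this loop.
    for group, genres, rating in rows:
        for genre in genres:
            collected[group][genre].append(rating)
    # Pass 2: average each list once, after every rating has been collected.
    return {
        group: {genre: mean(ratings) for genre, ratings in by_genre.items()}
        for group, by_genre in collected.items()
    }


sample_rows = [
    ("M", ["Action", "Drama"], 4),
    ("M", ["Drama"], 2),
    ("F", ["Drama"], 5),
]
averages = average_by_group_and_genre(sample_rows)
```

Averaging inside the collection loop would recompute the means on every row and could return values based on partial data, which is why the two passes are kept separate here.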
@ -1001,7 +1209,8 @@ class MovieLensDataAnalyzer:
# 7. Draw the heatmap # 7. Draw the heatmap
print(f"开始绘制热力图...") print(f"开始绘制热力图...")
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
sns.heatmap(pivot_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) apply_chinese_font()
ax = sns.heatmap(pivot_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title( plt.title(
'Preferences for Movies by Decade Across Age Groups' if USE_ENGLISH else '不同年龄段对不同年代电影的偏好') 'Preferences for Movies by Decade Across Age Groups' if USE_ENGLISH else '不同年龄段对不同年代电影的偏好')
plt.xlabel('Movie Release Decade' if USE_ENGLISH else '电影发行年代') plt.xlabel('Movie Release Decade' if USE_ENGLISH else '电影发行年代')
@ -1067,6 +1276,7 @@ class MovieLensDataAnalyzer:
heatmap_data.loc[age, '2000s'] += 0.4 heatmap_data.loc[age, '2000s'] += 0.4
# 5. Draw the heatmap # 5. Draw the heatmap
apply_chinese_font()
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5) sns.heatmap(heatmap_data, annot=True, cmap='coolwarm', fmt='.2f', linewidths=0.5)
plt.title( plt.title(
@ -1100,13 +1310,15 @@ class MovieLensDataAnalyzer:
user_stats = user_rating_stats.merge(self.users_df, on='userId') user_stats = user_rating_stats.merge(self.users_df, on='userId')
# 1. Average rating by gender # 1. Average rating by gender
apply_chinese_font()
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
ax = sns.boxplot(x='gender', y='avg_rating', data=user_stats, palette=[custom_colors[0], custom_colors[1]]) ax = sns.boxplot(x='gender', y='avg_rating', data=user_stats, palette=[custom_colors[0], custom_colors[1]])
# Override the x-axis tick labels # Override the x-axis tick labels
ax.set_xticklabels(['Male', 'Female']) # Force English labels so they are guaranteed to render gender_labels = {'M': 'Male' if USE_ENGLISH else '男性', 'F': 'Female' if USE_ENGLISH else '女性'}
ax.set_xticklabels([gender_labels['M'], gender_labels['F']]) # Use the appropriate labels
plt.title('Average Rating by Gender' if USE_ENGLISH else '性别与平均评分的关系') plt.title('Average Rating by Gender' if USE_ENGLISH else '性别与平均评分的关系', fontproperties=FontProperties(fname=mpl.font_manager.findfont(chinese_font)) if chinese_font else None)
plt.xlabel('Gender' if USE_ENGLISH else '性别') plt.xlabel('Gender' if USE_ENGLISH else '性别')
plt.ylabel('Average Rating' if USE_ENGLISH else '平均评分') plt.ylabel('Average Rating' if USE_ENGLISH else '平均评分')
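The call `ax.set_xticklabels([gender_labels['M'], gender_labels['F']])` in this hunk assumes the boxplot draws `M` before `F`. Unless an explicit `order` is passed to `sns.boxplot`, the category order may instead follow the data, so it can be safer to build the labels from the order actually drawn. A minimal sketch of that idea follows, where the helper `localized_tick_labels` and its `category_order` argument are hypothetical names:

```python
def localized_tick_labels(category_order, use_english=True):
    """Map gender codes to display labels in the order they are drawn."""
    labels = {
        "M": "Male" if use_english else "男性",
        "F": "Female" if use_english else "女性",
    }
    # Unknown codes fall through unchanged rather than raising KeyError.
    return [labels.get(code, code) for code in category_order]


# If the plot happens to draw 'F' first, the labels follow that order.
labels_for_plot = localized_tick_labels(["F", "M"], use_english=False)
```

One possible use is to pass `order=['M', 'F']` to `sns.boxplot` and give the same list to this helper, so that the ticks and their labels cannot drift apart.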
@ -1124,6 +1336,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 2. Average rating by age group # 2. Average rating by age group
apply_chinese_font()
plt.figure(figsize=(14, 6)) plt.figure(figsize=(14, 6))
sns.boxplot(x='age_group', y='avg_rating', data=user_stats, palette=custom_colors) sns.boxplot(x='age_group', y='avg_rating', data=user_stats, palette=custom_colors)
plt.title('Average Rating by Age Group' if USE_ENGLISH else '年龄组与平均评分的关系') plt.title('Average Rating by Age Group' if USE_ENGLISH else '年龄组与平均评分的关系')
@ -1135,11 +1348,12 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 3. Rating standard deviation by gender (rating consistency) # 3. Rating standard deviation by gender (rating consistency)
apply_chinese_font()
plt.figure(figsize=(10, 6)) plt.figure(figsize=(10, 6))
ax = sns.boxplot(x='gender', y='rating_std', data=user_stats, palette=[custom_colors[0], custom_colors[1]]) ax = sns.boxplot(x='gender', y='rating_std', data=user_stats, palette=[custom_colors[0], custom_colors[1]])
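# Override the x-axis tick labels # Override the x-axis tick labels
ax.set_xticklabels(['Male', 'Female']) # Force English labels so they are guaranteed to render ax.set_xticklabels([gender_labels['M'], gender_labels['F']]) # Use the localized labels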
plt.title('Rating Standard Deviation by Gender' if USE_ENGLISH else '性别与评分标准差的关系') plt.title('Rating Standard Deviation by Gender' if USE_ENGLISH else '性别与评分标准差的关系')
plt.xlabel('Gender' if USE_ENGLISH else '性别') plt.xlabel('Gender' if USE_ENGLISH else '性别')
@ -1149,6 +1363,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 4. Rating standard deviation by age group # 4. Rating standard deviation by age group
apply_chinese_font()
plt.figure(figsize=(14, 6)) plt.figure(figsize=(14, 6))
sns.boxplot(x='age_group', y='rating_std', data=user_stats, palette=custom_colors) sns.boxplot(x='age_group', y='rating_std', data=user_stats, palette=custom_colors)
plt.title('Rating Standard Deviation by Age Group' if USE_ENGLISH else '年龄组与评分标准差的关系') plt.title('Rating Standard Deviation by Age Group' if USE_ENGLISH else '年龄组与评分标准差的关系')
@ -1160,6 +1375,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 5. Average rating across all occupations # 5. Average rating across all occupations
apply_chinese_font()
plt.figure(figsize=(16, 10)) plt.figure(figsize=(16, 10))
sns.boxplot(x='occupation_name', y='avg_rating', data=user_stats, palette=custom_colors) sns.boxplot(x='occupation_name', y='avg_rating', data=user_stats, palette=custom_colors)
plt.title('Average Rating by Occupation' if USE_ENGLISH else '职业与平均评分的关系') plt.title('Average Rating by Occupation' if USE_ENGLISH else '职业与平均评分的关系')
@ -1171,6 +1387,7 @@ class MovieLensDataAnalyzer:
plt.close() plt.close()
# 6. Rating count versus average rating # 6. Rating count versus average rating
apply_chinese_font()
plt.figure(figsize=(12, 6)) plt.figure(figsize=(12, 6))
# Localize the gender labels # Localize the gender labels