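A textbook I was reading shows a chart, compiled by researchers, of the genres that viewers watch in the MovieLens dataset. It was drawn from the standard MovieLens 100K data. Below I use AI to help me reproduce it.

The simplest and most direct approach is to attach the image to the question and ask:

User interests in the MovieLens dataset show signs of evolving: over the 8 months from September 1997 to April 1998, the proportion of different movie genres that users watched kept changing. Please use Python's matplotlib to show a similar effect.

The generated code is fairly long and uses numpy, but it runs:

```python
import matplotlib.pyplot as plt
import numpy as np

# Define the months and movie genres
months = ['Oct', 'Nov', 'Dec', 'Jan', 'Feb', 'Mar', 'Apr', 'May']
genres = ['Thriller', 'Sci-Fi', 'Romance', 'Horror', 'Drama', 'Crime', 'Comedy', 'Adventure', 'Action']

# Data (percentages) compiled from the image; the missing Action values are filled with 0 or another reasonable value
# Note: some cells in the original data read "-10%" or "10%", which is interpreted here as a change relative to some baseline
# For the demo, I extracted the proportions actually displayed (the percentage figures in the image)
data = {
    'Oct': [45, 10, 10, 10, 10, 10, 10, 10, 0],  # Action has no explicit value, use 0 for now
    'Nov': [30, 10, 10, 10, 15, 10, 10, 10, 0],
    'Dec': [15, 10, 10, 10, 10, 10, 10, 10, 0],
    'Jan': [20, 10, 10, 10, 15, 15, 10, 10, 0],
    'Feb': [20, 10, 10, 10, 10, 10, 10, 10, 0],
    'Mar': [20, 10, 10, 10, 10, 10, 10, 10, 0],  # the Mar Drama cell shows "-10%", hard to infer directly, treated as 10
    'Apr': [20, 10, 10, 10, 10, 10, 10, 10, 0],  # Apr Crime shows "10%", treated as 10
    'May': [10, 10, 10, 10, 10, 10, 10, 10, 0],  # May Drama shows "±10%", treated as 10
}

# Convert to a numpy array for computation
values = np.array([data[m] for m in months])

# The Action column is all zeros; it could be dropped, but is kept here for completeness
# Make sure each row sums to 100 (check and adjust)
for i, row in enumerate(values):
    total = np.sum(row)
    if total != 100:
        # Adjust the largest item so the row sums to 100 (simplified, for demonstration only)
        diff = 100 - total
        max_idx = np.argmax(row)
        row[max_idx] += diff

# Plot
fig, ax = plt.subplots(figsize=(12, 6))

# Draw a stacked area chart with stackplot
ax.stackplot(months, values.T, labels=genres, alpha=0.8, colors=plt.cm.tab20.colors)  # tab20 color set

# Chart properties
ax.set_xlabel('Month (1997-1998)', fontsize=12)
ax.set_ylabel('Percentage of Movies Watched (%)', fontsize=12)
ax.set_title('Evolution of User Genre Preferences Over Months', fontsize=14, fontweight='bold')
ax.legend(loc='upper left', bbox_to_anchor=(1, 1), fontsize=10)
ax.grid(axis='y', linestyle='--', alpha=0.7)

# Set the x-axis ticks
ax.set_xticks(range(len(months)))
ax.set_xticklabels(months)

plt.tight_layout()
plt.show()
```

It runs and draws a chart, but the code is not correct, because it uses made-up data. So the next step is to combine the code with the actual MovieLens dataset.

This time I start by getting the MovieLens data ready. Since we need to count how many times each genre was watched (which here really means rated), the rating data u.data has to be joined with the movie data u.item:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
print(result)
```

This is also common programming wisdom: take it one step at a time. The data is not complete enough yet, though, because we ultimately want to aggregate by year and month, and so far we only have the full date of each rating. You could ask the AI to do this, but if you know pandas it is easy to steer the AI toward the to_period function, which maps the timestamps to a coarser granularity. The added code is:

```python
ratings['time'] = ratings['time'].dt.to_period(freq='M')
```

The complete code is:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['time'] = ratings['time'].dt.to_period(freq='M')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
print(result)
```

Next we need to count how often each genre appears in each year-month period. I have tried many AI tools, and for this kind of task most of them produce very complicated implementations. If you know pandas, it is easy to steer the AI toward the agg function and do it in a single statement. Each genre column in u.item is a 0/1 flag, so summing a column counts the ratings that belong to that genre. The added code is:

```python
result = result[['time', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime']].groupby('time').agg({'Action': 'sum', 'Adventure': 'sum', 'Animation': 'sum', 'Children': 'sum', 'Comedy': 'sum', 'Crime': 'sum'})
```

The complete code is:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['time'] = ratings['time'].dt.to_period(freq='M')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
result = result[['time', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime']].groupby('time').agg({'Action': 'sum', 'Adventure': 'sum', 'Animation': 'sum', 'Children': 'sum', 'Comedy': 'sum', 'Crime': 'sum'})
print(result)
```

The output is:

```
            Action  Adventure  Animation  Children  Comedy  Crime
time
1997-09-01    1892       1031        297       530    2091    590
1997-10-01    2560       1461        431       825    3276    818
1997-11-01    6053       3378        839      1611    7188   1912
1997-12-01    3174       1712        425       855    3471    956
1998-01-01    3740       1981        527      1049    4228   1100
1998-02-01    2723       1367        377       789    3238    920
1998-03-01    3088       1607        397       856    3577    986
1998-04-01    2359       1216        312       667    2763    773
```

The aggregated result is now clearly visible. Borrowing from the numpy-based stacked area chart the AI produced earlier, you can call stackplot directly, or steer the AI toward it with a prompt such as:

Use stackplot to draw a stacked area chart of result.

The generated code is:

```python
plt.stackplot(result.index, result.values.T)
plt.show()
```

The complete code is:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['time'] = ratings['time'].dt.to_period(freq='M')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
# Count how often each genre appears in each year-month period
result = result[['time', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime']].groupby('time').agg({'Action': 'sum', 'Adventure': 'sum', 'Animation': 'sum', 'Children': 'sum', 'Comedy': 'sum', 'Crime': 'sum'})
# Use stackplot to draw a stacked area chart of result
plt.stackplot(result.index, result.values.T)
plt.show()
```

Running it, however, raises an error:

```
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
```

Handing the error message straight to the AI explains the cause: result.index is of Period type, because line 6 uses to_period(freq='M'), and matplotlib's stackplot function cannot do numeric computation on Period data directly, which leads to the "ufunc 'isfinite' not supported for the input types" error. The AI also gives a concrete fix. The added code is:

```python
result.index = result.index.to_timestamp()
```

The complete code is:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['time'] = ratings['time'].dt.to_period(freq='M')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
# Count how often each genre appears in each year-month period
result = result[['time', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime']].groupby('time').agg({'Action': 'sum', 'Adventure': 'sum', 'Animation': 'sum', 'Children': 'sum', 'Comedy': 'sum', 'Crime': 'sum'})
result.index = result.index.to_timestamp()
# Use stackplot to draw a stacked area chart of result
plt.stackplot(result.index, result.values.T)
plt.show()
```

This time it runs and draws a chart. At first glance the chart looks right, but it is not what was asked for: the vertical axis shows absolute counts rather than relative percentages. You can steer the AI further with the prompt:

Divide the value of each cell in a row by the total of that row to get its percentage.

The resulting code is:

```python
result = result.apply(lambda x: x / x.sum(), axis=1)
```

The complete code is:

```python
import pandas as pd
import matplotlib.pyplot as plt

ratings = pd.read_csv('MovieLens/u.data', sep='\t', header=None, names=['uid', 'mid', 'rating', 'timestamp'])
ratings['time'] = pd.to_datetime(ratings['timestamp'], unit='s')
ratings['time'] = ratings['time'].dt.to_period(freq='M')
mnames = ['mid', 'title', 'date1', 'date2', 'url', 'unknown', 'Action', 'Adventure',
          'Animation', 'Children', 'Comedy', 'Crime', 'Documentary', 'Drama', 'Fantasy',
          'Film-Noir', 'Horror', 'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller',
          'War', 'Western']
movies = pd.read_csv('MovieLens/u.item', sep='|', encoding='ISO-8859-1', names=mnames)
result = pd.merge(ratings, movies)
# Count how often each genre appears in each year-month period
result = result[['time', 'Action', 'Adventure', 'Animation', 'Children', 'Comedy', 'Crime']].groupby('time').agg({'Action': 'sum', 'Adventure': 'sum', 'Animation': 'sum', 'Children': 'sum', 'Comedy': 'sum', 'Crime': 'sum'})
result.index = result.index.to_timestamp()
# Divide the value of each cell in a row by the total of that row to get its percentage
result = result.apply(lambda x: x / x.sum(), axis=1)
# Use stackplot to draw a stacked area chart of result
plt.stackplot(result.index, result.values.T)
plt.show()
```

Running this version draws the final chart, with each genre's share of the six selected genres on the vertical axis instead of absolute counts.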
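To try the normalization step without the MovieLens files on disk, here is a minimal sketch that rebuilds the monthly counts from the table printed earlier in this post and converts them to percentages. It swaps the apply/lambda call for DataFrame.div, which should give the same row-wise shares; the helper names to_percent and plot_shares are my own for illustration and do not come from the AI-generated code. The sketch also adds a legend and a labelled axis, which the chart above lacks.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Monthly rating counts copied from the table printed earlier in this post.
counts = pd.DataFrame(
    {
        "Action": [1892, 2560, 6053, 3174, 3740, 2723, 3088, 2359],
        "Adventure": [1031, 1461, 3378, 1712, 1981, 1367, 1607, 1216],
        "Animation": [297, 431, 839, 425, 527, 377, 397, 312],
        "Children": [530, 825, 1611, 855, 1049, 789, 856, 667],
        "Comedy": [2091, 3276, 7188, 3471, 4228, 3238, 3577, 2763],
        "Crime": [590, 818, 1912, 956, 1100, 920, 986, 773],
    },
    index=pd.period_range("1997-09", periods=8, freq="M"),
)
counts.index.name = "time"


def to_percent(df):
    """Divide each row by its own total and scale to 0-100 (hypothetical helper)."""
    return df.div(df.sum(axis=1), axis=0) * 100


def plot_shares(shares):
    """Draw a labelled stacked area chart and return the Axes (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(10, 5))
    # stackplot cannot handle a PeriodIndex, so convert it to timestamps first
    x = shares.index.to_timestamp()
    ax.stackplot(x, shares.to_numpy().T, labels=list(shares.columns))
    ax.set_ylabel("Share of ratings (%)")
    ax.set_ylim(0, 100)
    ax.legend(loc="upper left", bbox_to_anchor=(1, 1))
    fig.tight_layout()
    return ax


shares = to_percent(counts)
print(shares.round(1))
```

Calling plot_shares(shares) followed by plt.show() draws the chart. Keep in mind that the shares are relative to these six genres only, so adding more genre columns to the selection would change every percentage.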