Abstract
In the era of big data, the rapid growth of high-dimensional tourism data degrades the performance of traditional clustering algorithms. The entropy weighting k-means subspace algorithm handles high-dimensional data effectively and captures the influence of each feature on different clusters, which improves the clustering result. This paper uses Python to crawl a portion of the Yunnan travel notes published on Tongcheng (同程网) and obtain travelogue information about tourism in Yunnan. Natural language processing techniques, including Chinese word segmentation, keyword extraction, and part-of-speech recognition, are combined with the Baidu Map API to mine the travelogue information and build the required tourism data matrices. Based on the user-keyword matrix, the entropy weighting k-means subspace algorithm is used to cluster tourist attractions and authors. The clustering results are evaluated with two internal indices, the Dunn index and the silhouette coefficient. The evaluation shows that, when the entropy weighting k-means subspace algorithm is applied to the Yunnan travelogue data, the clustering is best with 3 clusters.
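The clustering step described above can be illustrated with a minimal sketch of entropy-weighted k-means in its commonly published form: each cluster keeps its own feature-weight vector, and a feature's weight shrinks exponentially with that feature's within-cluster dispersion. This is not the authors' implementation. The function name `ewkm`, the parameter `gamma` (the entropy regularization strength), the random initialization, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def ewkm(X, k, gamma=1.0, n_iter=50, seed=0):
    """Illustrative entropy-weighted k-means; returns labels, centers, weights."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Assumption: initialize centers from k distinct random rows of X.
    centers = X[rng.choice(n, size=k, replace=False)].astype(float)
    # Start with uniform feature weights in every cluster.
    weights = np.full((k, d), 1.0 / d)
    labels = np.full(n, -1)
    for _ in range(n_iter):
        # Weighted squared distance from every point to every center: (n, k).
        sq = (X[:, None, :] - centers[None, :, :]) ** 2
        dist = np.einsum("nkd,kd->nk", sq, weights)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments stopped changing
        labels = new_labels
        for l in range(k):
            members = X[labels == l]
            if len(members) == 0:
                continue  # empty cluster: keep its previous center and weights
            centers[l] = members.mean(axis=0)
            # Per-feature dispersion inside the cluster.
            disp = ((members - centers[l]) ** 2).sum(axis=0)
            # Softmax of -dispersion / gamma, shifted for numerical stability.
            logits = -disp / gamma
            logits -= logits.max()
            w = np.exp(logits)
            weights[l] = w / w.sum()
    return labels, centers, weights


# Synthetic stand-in for a user-keyword matrix: 90 rows, 4 features.
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(30, 4)) for c in (0.0, 3.0, 6.0)])
labels, centers, weights = ewkm(X, k=3, gamma=2.0)
print(weights.round(3))  # one row of feature weights per cluster
```

Each row of `weights` sums to 1 and indicates which features matter most for that cluster. A larger `gamma` spreads weight more evenly across features, while a smaller one concentrates it on the few features with the lowest dispersion.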
Keywords
entropy weighting subspace algorithm /
tourism data /
subspace clustering /
data mining
Chen Dan, Chu Hongwei, Wu Yaqin, Hu Jun.
Clustering Analysis of Tourism Data Based on Entropy Weighting k-means Subspace Algorithm. Tourism Research. 2021, 13(5): 18-31