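Affiliation: College of Computer Science and Software Engineering, Shenzhen University
Source: Journal of Tsinghua University (Science and Technology), 2023, No. 5, pp. 740-753 (14 pages)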
Abstract: Missing value imputation (MVI), an important research branch of data mining, aims to provide high-quality data for training machine learning algorithms. Unlike existing MVI algorithms, which are oriented toward improving algorithm performance, this paper proposes a distribution consistency-based MVI (DC-MVI) algorithm, oriented toward restoring the data structure, to impute missing values in large-scale data effectively. First, DC-MVI constructs an objective function for determining the optimal imputation values based on the principle of probability distribution consistency. Second, a derived feasible optimization rule for missing values is used to obtain imputation values that have the greatest distribution consistency with, and the closest variance to, the original complete values. Finally, in a distributed environment, DC-MVI is trained in parallel on the random sample partition (RSP) data blocks of the big data set to obtain the imputation values for the missing values of the large-scale data. Experimental results show that DC-MVI not only generates imputation values whose probability distribution is consistent with that of the original complete values at a given significance level, but also achieves faster imputation and better imputation results than five classical and three state-of-the-art MVI algorithms, confirming that DC-MVI is a feasible MVI algorithm for large-scale data.
[Objective] As a significant research branch in the field of data mining, missing value imputation (MVI) aims to provide high-quality data support for the training of machine learning algorithms. However, MVI results for large-scale data sets are not ideal in terms of restoring the data distribution and improving prediction accuracy. To improve the performance of existing MVI algorithms, we propose a distribution consistency-based MVI (DC-MVI) algorithm that attempts to restore the original data structure by imputing the missing values of large-scale data sets.
[Methods] First, the DC-MVI algorithm develops an objective function to determine the optimal imputation values based on the principle of probability distribution consistency. Second, the data set is preprocessed by randomly initializing the missing values and normalizing the data, and a feasible missing value update rule is derived to obtain the imputation values with the closest variance to, and the greatest consistency with, the complete original values. Next, in a distributed environment, the large-scale data set is divided into multiple groups of random sample partition (RSP) data blocks that have the same distribution as the entire data set, taking into account the statistical properties of the large-scale data set. Finally, the DC-MVI algorithm is trained in parallel to obtain the imputation values corresponding to the missing values of the large-scale data set while preserving distribution consistency with the non-missing values. The rationality experiments verify the convergence of the objective function and the contribution of DC-MVI to distribution consistency. In addition, the effectiveness experiments assess the performance of DC-MVI and eight other MVI algorithms (mean, KNN, MICE, RF, EM, SOFT, GAIN, and MIDA) using three indicators: distribution consistency, time complexity, and classification accuracy.
[Results] The experimental results on seven selected large-scale data sets showed that: 1) The objective function of the DC-MVI method was effective, and the missing value update rule