Advisor: Liu Bo (刘波)
Discipline: H1203
Degree awarded: Master's
Author:
Institution: Jinan University (暨南大学)
Abstract: The Deep Web refers to data hidden beneath the Surface Web that can be retrieved only when a user fills in a form and submits a query. It far exceeds the Surface Web in volume and holds highly valuable information. For this reason, mining the massive data in the Deep Web has become a research hotspot, and research on Deep Web information integration is particularly important. The first step in Deep Web data integration is discovering Web databases, that is, discovering their query interfaces. However, Deep Web data is spread across a large number of Web databases that change constantly, and their interfaces may change with them, which makes the data harder to obtain. The most prominent technical difficulties are: first, Web databases are widely distributed and very numerous, so pages containing query interfaces need to be retrieved more efficiently; second, query interfaces exist as forms, but not all forms are query interfaces, so improving classification accuracy is also a serious problem. To address these two problems in Deep Web query interface discovery, this thesis does the following work. First, it surveys Deep Web research, including the concept, scale, forms of existence, and access methods of the Deep Web, as well as the research directions and the content of this thesis. Second, it reviews the techniques previously used for query interface discovery, including the commonly used DOM parsing and heuristic rules, and then analyzes and compares the main query interface discovery algorithms. Third, to obtain domain-oriented Deep Web query interfaces more efficiently, it presents query interface discovery algorithms in single-threaded and multithreaded versions; a comparison of the test results shows that the multithreaded algorithm significantly improves efficiency.
Finally, to identify Deep Web query interfaces among forms correctly, this thesis builds on previous studies and proposes a K-Nearest Neighbor algorithm based on heuristic rules. For the experiments, query interfaces and non-query interfaces were collected in a variety of ways from a number of domains. The results show that the algorithm significantly improves Deep Web query interface discovery, with clear gains in accuracy, precision, and recall.
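The abstract does not list the heuristic rules or the features used, so the following is only a minimal sketch of how heuristic form cues could feed a K-Nearest Neighbor classifier. The four features (counts of text inputs, select boxes, and password fields, plus a "search" keyword on the submit button), the toy training vectors in `TRAIN`, and the names `FormFeatures`, `features`, and `knn_predict` are all assumptions made for illustration, not the thesis's actual method.

```python
import math
from collections import Counter
from html.parser import HTMLParser


class FormFeatures(HTMLParser):
    """Counts heuristic cues in the HTML fed to it (assumed to be one form)."""

    def __init__(self):
        super().__init__()
        self.text_inputs = 0
        self.selects = 0
        self.passwords = 0
        self.has_search_word = 0

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "input":
            kind = (a.get("type") or "text").lower()
            if kind in ("text", "search"):
                self.text_inputs += 1
            elif kind == "password":
                # Password fields usually indicate login or registration forms.
                self.passwords += 1
            elif kind == "submit" and "search" in (a.get("value") or "").lower():
                self.has_search_word = 1
        elif tag == "select":
            self.selects += 1


def features(form_html):
    """Turn a form's HTML into a feature vector (hypothetical feature set)."""
    parser = FormFeatures()
    parser.feed(form_html)
    return (parser.text_inputs, parser.selects,
            parser.passwords, parser.has_search_word)


def knn_predict(train, x, k=3):
    """Majority vote among the k training vectors closest to x (Euclidean)."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], x))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy labeled examples, invented for this sketch only.
TRAIN = [
    ((1, 2, 0, 1), "query"),
    ((2, 3, 0, 1), "query"),
    ((1, 0, 0, 1), "query"),
    ((1, 0, 1, 0), "non-query"),
    ((3, 0, 2, 0), "non-query"),
    ((1, 0, 0, 0), "non-query"),
]

search_form = ('<form action="/s" method="get"><input type="text" name="q">'
               '<select name="cat"></select>'
               '<input type="submit" value="Search"></form>')
login_form = ('<form method="post"><input type="text" name="user">'
              '<input type="password" name="pw">'
              '<input type="submit" value="Log in"></form>')

print(knn_predict(TRAIN, features(search_form)))
print(knn_predict(TRAIN, features(login_form)))
```

In this sketch the heuristic rules only produce the feature vector, and the nearest-neighbor vote makes the final decision; a real system would need a much larger labeled set drawn from several domains, as the abstract describes.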
Classification number: [TP3 F27]
Field: [Automation and Computer Technology] [Economics and Management]