CAJ | 학술논문

实体解析是数据融合和数据清洗的关键步骤，旨在从大量的数据集中找出描述相同实体的记录。当前主要有两种基本的解决思路，一种是穷尽式的实体解析，通过两两比较数据集中的所有记录，然后再合并相似的记录，从而找到描述某一个实体的若干记录集合。然而，该方法的计算复杂度比较高（O（n2），其中 n 表示数据集合的规模），难以处理大型数据集合。另一种思路是基于分块的实体解析，它调用特定的分块函数（如哈希函数、滑动窗口技术等）将集合中较为相似的记录划分到同一个块中，再仅对属于同一块中的记录进行两两比较。这种方法显著降低了运行时间，但会损失部分精度，因为某些描述同一实体的记录可能没有被分到同一个块中。文中提出了一种基于模式的实体解析算法，通过将相似的记录合并成记录集合并尝试生成对应的记录模式，然后进行模式之间的两两比较来产生一个边界值，以确定对应的记录集合是否需要进行进一步的精确比较，从而判断是否属于同一个实体。与第一种方法相比，该方法可有效地过滤部分不可能相似的记录，从而避免了针对所有数据记录进行两两比较，显著地降低了时间复杂度；与第二种方法相比，该方法并不损失任何精度。基于真实和模拟数据集合的实验结果验证了新方法的执行效率和有效性。
실체해석시수거융합화수거청세적관건보취，지재종대량적수거집중조출묘술상동실체적기록。당전주요유량충기본적해결사로，일충시궁진식적실체해석，통과량량비교수거집중적소유기록，연후재합병상사적기록，종이조도묘술모일개실체적약간기록집합。연이，해방법적계산복잡도비교고（O（n2），기중 n 표시수거집합적규모），난이처리대형수거집합。령일충사로시기우분괴적실체해석，타조용특정적분괴함수（여합희함수、활동창구기술등）장집합중교위상사적기록화분도동일개괴중，재부대속우동일괴중적기록진행량량비교。저충방법현저강저료운행시간，단회손실부분정도，인위모사묘술동일실체적기록가능몰유피분도동일개괴중。문중제출료일충기우모식적실체해석산법，통과장상사적기록합병성기록집합병상시생성대응적기록모식，연후진행모식지간적량량비교래산생일개변계치，이학정대응적기록집합시부수요진행진일보적정학비교，종이판단시부속우동일개실체。여제일충방법상비，해방법가유효지과려부분불가능상사적기록，종이피면료침대소유수거기록진행량량비교，현저지강저료시간복잡도；여제이충방법상비，해방법병불손실임하정도。기우진실화모의수거집합적실험결과험증료신방법적집행효솔화유효성。
As a critical step in data integration and data cleaning,entity resolution (ER)aims at identifying groups of records that refer to the same real-world entity.Currently,there mainly exist two typical methods to handle this issue.One is exhaustive entity resolution,which compares all record pairs to determine the entity they belong to.However,its complexity (O(n2 ),n stands for the size of dataset)is too high to handle big volume dataset.The other is blocking-based entity resolution,which maps similar records to the same block by a specific method (e.g.,hash function, sliding window,etc).Then only the records in the same block need to be compared.This method improves the efficiency while sacrifices the effectiveness.Since some records refer to the same entity may not in the same block.In this paper we propose a pattern-based entity resolution, which represents the similar records by a record pattern,then we will generate a bound by comparing record patterns.With this bound,we can decide if the two patterns’corresponding records need to be precisely compared to verify whether they refer to the same entity.In this way,we can both dramatically accelerate the process of entity resolution by filtering dissimilar records and ensure its correctness.Experiments on real and synthetic dataset show the efficiency and effectiveness of our method.