Paper Title
(Almost) All of Entity Resolution
Paper Authors
Paper Abstract
Whether the goal is to estimate the number of people who live in a congressional district, to estimate the number of individuals who have died in an armed conflict, or to disambiguate individual authors using bibliographic data, all these applications share a common theme: integrating information from multiple sources. Before such questions can be answered, databases must be cleaned and integrated in a systematic and accurate way, a process commonly known as record linkage, de-duplication, or entity resolution. In this article, we review motivating applications and seminal papers that have led to the growth of this area. Specifically, we review the foundational work that began in the 1940s and 1950s and led to modern probabilistic record linkage. We review clustering approaches to entity resolution, semi- and fully supervised methods, and canonicalization, which are used throughout industry and academia in applications such as human rights, official statistics, medicine, and citation networks, among others. Finally, we discuss current research topics of practical importance.