Paper Title

UniMorph 4.0: Universal Morphology

Authors

Batsuren, Khuyagbaatar, Goldman, Omer, Khalifa, Salam, Habash, Nizar, Kieraś, Witold, Bella, Gábor, Leonard, Brian, Nicolai, Garrett, Gorman, Kyle, Ate, Yustinus Ghanggo, Ryskina, Maria, Mielke, Sabrina J., Budianskaya, Elena, El-Khaissi, Charbel, Pimentel, Tiago, Gasser, Michael, Lane, William, Raj, Mohit, Coler, Matt, Samame, Jaime Rafael Montoya, Camaiteri, Delio Siticonatzi, Sagot, Benoît, Rojas, Esaú Zumaeta, Francis, Didier López, Oncevay, Arturo, Bautista, Juan López, Villegas, Gema Celeste Silva, Hennigen, Lucas Torroba, Ek, Adam, Guriel, David, Dirix, Peter, Bernardy, Jean-Philippe, Scherbakov, Andrey, Bayyr-ool, Aziyana, Anastasopoulos, Antonios, Zariquiey, Roberto, Sheifer, Karina, Ganieva, Sofya, Cruz, Hilaria, Karahóǧa, Ritván, Markantonatou, Stella, Pavlidis, George, Plugaryov, Matvey, Klyachko, Elena, Salehi, Ali, Angulo, Candy, Baxi, Jatayu, Krizhanovsky, Andrew, Krizhanovskaya, Natalia, Salesky, Elizabeth, Vania, Clara, Ivanova, Sardana, White, Jennifer, Maudslay, Rowan Hall, Valvoda, Josef, Zmigrod, Ran, Czarnowska, Paula, Nikkarinen, Irene, Salchak, Aelita, Bhatt, Brijesh, Straughn, Christopher, Liu, Zoey, Washington, Jonathan North, Pinter, Yuval, Ataman, Duygu, Wolinski, Marcin, Suhardijanto, Totok, Yablonskaya, Anna, Stoehr, Niklas, Dolatian, Hossep, Nuriah, Zahroh, Ratan, Shyam, Tyers, Francis M., Ponti, Edoardo M., Aiton, Grant, Arora, Aryaman, Hatcher, Richard J., Kumar, Ritesh, Young, Jeremiah, Rodionova, Daria, Yemelina, Anastasia, Andrushko, Taras, Marchenko, Igor, Mashkovtseva, Polina, Serova, Alexandra, Prud'hommeaux, Emily, Nepomniashchaya, Maria, Giunchiglia, Fausto, Chodroff, Eleanor, Hulden, Mans, Silfverberg, Miikka, McCarthy, Arya D., Yarowsky, David, Cotterell, Ryan, Tsarfaty, Reut, Vylomova, Ekaterina

Abstract

The Universal Morphology (UniMorph) project is a collaborative effort providing broad-coverage instantiated normalized morphological inflection tables for hundreds of diverse world languages. The project comprises two major thrusts: a language-independent feature schema for rich morphological annotation and a type-level resource of annotated data in diverse languages realizing that schema. This paper presents the expansions and improvements made on several fronts over the last couple of years (since McCarthy et al. (2020)). Collaborative efforts by numerous linguists have added 67 new languages, including 30 endangered languages. We have implemented several improvements to the extraction pipeline to tackle some issues, e.g. missing gender and macron information. We have also amended the schema to use a hierarchical structure that is needed for morphological phenomena like multiple-argument agreement and case stacking, while adding some missing morphological features to make the schema more inclusive. In light of the last UniMorph release, we also augmented the database with morpheme segmentation for 16 languages. Lastly, this new release makes a push towards inclusion of derivational morphology in UniMorph by enriching the data and annotation schema with instances representing derivational processes from MorphyNet.
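To make the "type-level resource" and the hierarchical schema a little more concrete, here is a minimal Python sketch of reading one entry. It assumes the tab-separated lemma / inflected form / feature-bundle layout with semicolon-separated features, and it assumes hierarchical features are written as a head with parenthesized, comma-separated arguments (for example NOM(1,SG)). The names Entry, parse_feature, and parse_line are hypothetical helpers for illustration, not part of any UniMorph tooling, and the sample strings are illustrative rather than taken from the released data.

```python
import re
from typing import NamedTuple


class Entry(NamedTuple):
    lemma: str
    form: str
    features: list  # plain tags as str, hierarchical tags as (head, args)


# A tag is either a flat label ("PRS") or, by assumption, a head with
# parenthesized comma-separated arguments ("NOM(1,SG)").
_FEATURE = re.compile(r"^([^()]+)(?:\((.*)\))?$")


def parse_feature(tag: str):
    """Return a flat tag unchanged, or (head, args) for a hierarchical tag."""
    match = _FEATURE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized feature: {tag!r}")
    head, args = match.group(1), match.group(2)
    if args is None:
        return head
    return (head, tuple(arg.strip() for arg in args.split(",")))


def parse_line(line: str) -> Entry:
    """Parse one 'lemma<TAB>form<TAB>features' line into an Entry."""
    lemma, form, bundle = line.rstrip("\n").split("\t")
    return Entry(lemma, form, [parse_feature(tag) for tag in bundle.split(";")])


if __name__ == "__main__":
    # Illustrative input, not copied from the dataset.
    print(parse_line("run\trunning\tV;V.PTCP;PRS"))
    print(parse_feature("NOM(1,SG)"))
```

Keeping flat tags as strings and hierarchical tags as (head, args) tuples lets older, flat annotations and the newer nested ones share one code path; a real loader would also need to handle blank lines and any extra columns such as segmentation.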
