Paper title
So Much in So Little: Creating Lightweight Embeddings of Python Libraries
Paper authors
Paper abstract
In software engineering, different approaches and machine learning models leverage different types of data: source code, textual information, and historical data. An important part of any project is its dependencies. The list of dependencies is relatively small but carries a lot of semantics, which can be used to compare projects or make judgements about them. In this paper, we focus on Python projects and their PyPI dependencies, as specified in requirements.txt files. We compile a dataset of 7,132 Python projects and their dependencies, and use Git to retrieve their versions from previous years. Using this data, we build 32-dimensional embeddings of libraries by applying Singular Value Decomposition to the co-occurrence matrix of projects and libraries. We then cluster the embeddings and study their semantic relations. To showcase the usefulness of such lightweight library embeddings, we introduce a prototype tool for suggesting relevant libraries to a given project. The tool computes project embeddings and uses the dependencies of projects with similar embeddings to form suggestions. To compare different library recommenders, we have created a benchmark based on the evolution of dependency sets in open-source projects. Approaches based on the created embeddings significantly outperform the baseline of showing the most popular libraries in a given year. We have also conducted a user study, which showed that the suggestions differ in quality across project domains and that even relevant suggestions might not be particularly useful. Finally, to facilitate potentially more useful recommendations, we extended the recommender system with an option to suggest rarer libraries.
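The pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the six toy projects, the function names, the choice to embed a project as the mean of its libraries' embeddings, and the use of cosine similarity with similarity-weighted voting are all assumptions made here for illustration. The abstract states only that SVD is applied to the project-library co-occurrence matrix and that suggestions come from the dependencies of projects with similar embeddings.

```python
from collections import Counter

import numpy as np


def build_library_embeddings(projects, dim=32):
    """Embed libraries via SVD of a binary project-by-library matrix.

    `projects` maps a project name to its set of dependencies
    (hypothetical input format, e.g. parsed from requirements.txt).
    """
    libraries = sorted({lib for deps in projects.values() for lib in deps})
    column = {lib: j for j, lib in enumerate(libraries)}
    matrix = np.zeros((len(projects), len(libraries)))
    for i, deps in enumerate(projects.values()):
        for lib in deps:
            matrix[i, column[lib]] = 1.0
    # Thin SVD: matrix = U @ diag(s) @ Vt.
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    k = min(dim, len(s))  # toy data has fewer than 32 singular values
    # One row per library: right singular vectors scaled by singular values.
    return column, vt[:k].T * s[:k]


def embed_project(deps, column, lib_emb):
    """Assumption: a project embedding is the mean of its library embeddings."""
    rows = [lib_emb[column[lib]] for lib in deps if lib in column]
    if not rows:
        return np.zeros(lib_emb.shape[1])
    return np.mean(rows, axis=0)


def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0


def suggest(query_deps, projects, column, lib_emb, neighbours=3, top_n=2):
    """Suggest libraries used by the most similar projects (weighted vote)."""
    query = embed_project(query_deps, column, lib_emb)
    ranked = sorted(
        ((cosine(query, embed_project(deps, column, lib_emb)), name)
         for name, deps in projects.items()),
        reverse=True,
    )
    scores = Counter()
    for similarity, name in ranked[:neighbours]:
        for lib in projects[name] - set(query_deps):
            scores[lib] += similarity
    return [lib for lib, _ in scores.most_common(top_n)]


# Toy dataset (invented for illustration, not taken from the paper).
projects = {
    "web_a": {"flask", "jinja2", "requests"},
    "web_b": {"flask", "jinja2", "sqlalchemy"},
    "web_c": {"flask", "requests", "sqlalchemy"},
    "ml_a": {"numpy", "pandas", "scikit-learn"},
    "ml_b": {"numpy", "pandas", "matplotlib"},
    "ml_c": {"numpy", "scikit-learn", "matplotlib"},
}
column, lib_emb = build_library_embeddings(projects)
suggestions = suggest({"flask", "jinja2"}, projects, column, lib_emb)
print(suggestions)
```

On this toy data the query project shares libraries only with the web projects, so the suggestions are drawn from their remaining dependencies. At the scale reported in the abstract, a sparse matrix and a truncated SVD routine would likely be a better fit than the dense decomposition used here.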