Paper Title
Deep Lake: a Lakehouse for Deep Learning
Paper Authors
Paper Abstract
Traditional data lakes provide critical data infrastructure for analytical workloads by enabling time travel, running SQL queries, ingesting data with ACID transactions, and visualizing petabyte-scale datasets on cloud storage. They allow organizations to break down data silos, unlock data-driven decision-making, improve operational efficiency, and reduce costs. However, as deep learning usage increases, traditional data lakes are not well designed for applications such as natural language processing (NLP), audio processing, computer vision, and others involving non-tabular datasets. This paper presents Deep Lake, an open-source lakehouse for deep learning applications developed at Activeloop. Deep Lake maintains the benefits of a vanilla data lake with one key difference: it stores complex data, such as images, videos, and annotations, as well as tabular data, in the form of tensors and rapidly streams the data over the network to (a) Tensor Query Language, (b) an in-browser visualization engine, or (c) deep learning frameworks without sacrificing GPU utilization. Datasets stored in Deep Lake can be accessed from PyTorch, TensorFlow, and JAX, and integrate with numerous MLOps tools.