MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

Paper title

MemPool: A Shared-L1 Memory Many-Core Cluster with a Low-Latency Interconnect

Authors

Matheus Cavalcante, Samuel Riedel, Antonio Pullini, Luca Benini

Abstract

A key challenge in scaling shared-L1 multi-core clusters towards many-core (more than 16 cores) configurations is to ensure low-latency and efficient access to the L1 memory. In this work, we demonstrate that it is possible to scale up the shared-L1 architecture: we present MemPool, a 32-bit many-core system with 256 fast RV32IMA "Snitch" cores featuring application-tunable execution units, running at 700 MHz in typical conditions (TT/0.80 V/25°C). MemPool is easy to program, with all the cores sharing a global view of a large L1 scratchpad memory pool, accessible within at most 5 cycles. In MemPool's physical-aware design, we emphasized the exploration, design, and optimization of the low-latency processor-to-L1-memory interconnect. We compare three candidate topologies, analyzing them in terms of latency, throughput, and back-end feasibility. The chosen topology keeps the average latency below 6 cycles, even for a heavy injected load of 0.33 requests/core/cycle. We also propose a lightweight addressing scheme that maps each core's private data to a memory bank accessible within one cycle, which leads to performance gains of up to 20% in real-world signal processing benchmarks. The addressing scheme is also highly efficient in terms of energy consumption, since requests to local banks consume only half of the energy required to access remote banks. Our design achieves competitive performance with respect to an ideal, non-implementable full-crossbar baseline.
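
The abstract does not spell out the addressing scheme, so the following Python sketch is only an illustration of the general idea: a fully word-interleaved mapping spreads consecutive words across all banks, whereas a hybrid mapping keeps a small region per tile inside that tile's own banks so that private data (for example a stack) stays local. Every constant here (64 tiles, 16 banks per tile, a 2 KiB local region per tile) and the function names `interleaved`, `hybrid`, and `tiles_touched` are assumptions made for the example, not values or names taken from the paper.

```python
# Illustrative sketch only: all sizes below are assumed, not taken from the paper.
WORD_BYTES = 4
NUM_TILES = 64               # assumed number of tiles
BANKS_PER_TILE = 16          # assumed number of banks per tile
NUM_BANKS = NUM_TILES * BANKS_PER_TILE
SEQ_BYTES_PER_TILE = 2048    # assumed size of each tile's local (sequential) region
SEQ_REGION_BYTES = NUM_TILES * SEQ_BYTES_PER_TILE


def interleaved(addr):
    """Fully word-interleaved mapping: returns (tile, bank, row)."""
    word = addr // WORD_BYTES
    bank = word % NUM_BANKS
    return bank // BANKS_PER_TILE, bank, word // NUM_BANKS


def hybrid(addr):
    """Hybrid mapping: the low region stays tile-local, the rest is interleaved."""
    if addr >= SEQ_REGION_BYTES:
        return interleaved(addr)
    tile = addr // SEQ_BYTES_PER_TILE
    word = (addr % SEQ_BYTES_PER_TILE) // WORD_BYTES
    # Interleave only across the banks of the owning tile.
    bank = tile * BANKS_PER_TILE + word % BANKS_PER_TILE
    return tile, bank, word // BANKS_PER_TILE


def tiles_touched(mapper, base, size):
    """Set of tiles accessed when reading `size` bytes starting at `base`."""
    return {mapper(a)[0] for a in range(base, base + size, WORD_BYTES)}
```

Under these assumptions, a 256-byte buffer at address 0 touches tiles 0 to 3 with `interleaved` but only tile 0 with `hybrid`, which is the kind of locality the abstract credits for the latency and energy savings on local-bank accesses.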
