Journal of the Packaging Branch, China Semiconductor Industry Association

Journal of the Electronic Manufacturing and Packaging Technology Branch, Chinese Institute of Electronics

Electronics & Packaging ›› 2023, Vol. 23 ›› Issue (4): 040305. doi: 10.16257/j.cnki.1681-1070.2023.0024

• Circuits and Systems •

Acceleration System Design for Matrix Computation

SUN Changjiang, LI Huang, WANG Wenqing

  1. Shenzhen Statemicro Electronics Co., Ltd., Shenzhen 518057, Guangdong, China
  • Received: 2022-06-27 Published: 2023-04-27 Released online: 2023-03-17
  • About the author: SUN Changjiang (born 1981), male, from Liaocheng, Shandong, holds a master's degree and works on SoC/SiP system architecture and high-performance core processing devices.

Abstract: A general-purpose central processing unit (CPU) usually spends most of its resources on cache management and control logic, leaving only a small share for computation. It has therefore become common to add dedicated computing modules, such as graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other programmable logic units, to a system as accelerators, building a heterogeneous multi-core system with higher computing performance. Following this trend, an acceleration system for matrix computation is proposed. It speeds up matrix operations through an in-house dedicated instruction set, a specially designed hardware accelerator array and an optimized storage architecture. In addition, a mailbox mechanism handles communication with other systems after heterogeneous integration. A performance verification platform is built with Python and the Universal Verification Methodology (UVM) to carry out performance verification at the register transfer level (RTL). The results show that, at an operating frequency of 500 MHz, the subsystem in the proposed scheme reaches a peak performance of 32 GFLOPS, and that the computational efficiency of the general matrix multiplication (GEMM) operator is improved by 12 times compared with a coprocessor scheme that relies on a 2D systolic array alone for acceleration.

Key words: matrix computation, heterogeneous, hardware accelerator, operator mapping
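The headline figures in the abstract can be unpacked with some quick arithmetic. The Python sketch below is illustrative only and is not taken from the paper: it assumes the common convention that one multiply-accumulate counts as two floating-point operations, and the function names and the 64 × 64 × 64 problem size are hypothetical. It derives the operations per clock cycle implied by 32 GFLOPS at 500 MHz, a lower bound on GEMM cycle count at that peak rate, and a naive GEMM of the kind a golden model in a verification platform might use.

```python
import math


def flops_per_cycle(peak_gflops: float, freq_mhz: float) -> float:
    """Floating-point operations per clock cycle implied by a peak rate."""
    return (peak_gflops * 1e9) / (freq_mhz * 1e6)


def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOPs for C = A @ B with A of shape (m, k) and B of shape (k, n).

    Assumption: each inner-product term costs one multiply and one add.
    """
    return 2 * m * n * k


def min_cycles(m: int, n: int, k: int,
               peak_gflops: float = 32.0, freq_mhz: float = 500.0) -> int:
    """Lower bound on cycles for a GEMM if the peak rate were sustained."""
    return math.ceil(gemm_flops(m, n, k) / flops_per_cycle(peak_gflops, freq_mhz))


def gemm_reference(a, b):
    """Naive triple-loop GEMM on lists of lists, usable as a golden model."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    return [[sum(a[i][p] * b[p][j] for p in range(inner)) for j in range(cols)]
            for i in range(rows)]
```

Under these assumptions, `flops_per_cycle(32.0, 500.0)` evaluates to 64 operations per cycle, which would correspond to 32 multiply-accumulate units kept busy every cycle. Real cycle counts will be higher than `min_cycles` suggests, because data movement and control overhead are ignored here.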

CLC number: