Acceleration System Design for Matrix Computation

doi:10.16257/j.cnki.1681-1070.2023.0024

Electronics & Packaging, 2023, Vol. 23, Issue 4: 040305. doi: 10.16257/j.cnki.1681-1070.2023.0024

SUN Changjiang, LI Huang, WANG Wenqing

Shenzhen Statemicro Electronics Co., Ltd., Shenzhen 518057, China

Received: 2022-06-27  Online: 2023-04-27  Published: 2023-03-17
About the author: SUN Changjiang (born 1981), male, from Liaocheng, Shandong Province, holds a master's degree and currently works on SoC/SiP system architecture and high-performance core processing devices.

Abstract

A general-purpose central processing unit (CPU) usually spends most of its resources on cache management and control logic, and only a small portion on computation. It has therefore become a trend to build heterogeneous multi-core systems in which dedicated computing modules, such as graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs) and other programmable logic units, are added as accelerators to raise computing performance. Following this trend, an acceleration system for matrix computation is proposed. It speeds up matrix operations through a self-developed dedicated instruction set, a specially designed hardware accelerator array, and an optimized storage architecture. In addition, a mailbox mechanism handles communication with other systems after heterogeneous integration. A performance verification platform is built with Python and the UVM verification methodology to carry out performance verification at the register transfer level (RTL). The results show that, at an operating frequency of 500 MHz, the subsystem in the proposed scheme reaches a peak performance of 32 GFLOPS, and that the computational efficiency of the general matrix multiplication (GEMM) operator improves 12-fold compared with a coprocessor scheme that relies on a 2D systolic array alone for acceleration.

Key words: matrix computation, heterogeneous, hardware accelerator, operator mapping
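
As a rough sanity check on the headline figure, the sketch below converts the reported peak of 32 GFLOPS at 500 MHz into floating-point operations per clock cycle and estimates a lower bound on GEMM run time at that peak. The function names, the convention that one multiply-accumulate counts as two FLOPs, and the 256 x 256 x 256 problem size are assumptions chosen for illustration; none of them come from the paper.

```python
def flops_per_cycle(peak_flops: float, clock_hz: float) -> float:
    """Floating-point operations completed per clock cycle at peak."""
    return peak_flops / clock_hz


def gemm_flops(m: int, n: int, k: int) -> int:
    """FLOP count of C = A @ B with A (m x k) and B (k x n).

    Assumes the common convention of 2 FLOPs (one multiply, one add)
    per multiply-accumulate.
    """
    return 2 * m * n * k


def ideal_gemm_seconds(m: int, n: int, k: int, peak_flops: float) -> float:
    """Lower bound on GEMM run time if the peak rate were sustained."""
    return gemm_flops(m, n, k) / peak_flops


PEAK_FLOPS = 32e9   # 32 GFLOPS, as reported in the abstract
CLOCK_HZ = 500e6    # 500 MHz, as reported in the abstract

per_cycle = flops_per_cycle(PEAK_FLOPS, CLOCK_HZ)
macs_per_cycle = per_cycle / 2  # under the 2-FLOPs-per-MAC assumption

print(f"FLOPs per cycle: {per_cycle:.0f}")
print(f"MACs per cycle:  {macs_per_cycle:.0f}")
# Hypothetical problem size, chosen only to illustrate the calculation.
print(f"Ideal 256^3 GEMM time: {ideal_gemm_seconds(256, 256, 256, PEAK_FLOPS):.6f} s")
```

Under these assumptions the reported peak corresponds to 64 FLOPs, or 32 multiply-accumulates, per cycle. The abstract does not say how the accelerator array is organized to reach that rate, so the sketch makes no claim about the actual hardware layout.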

CLC number: TP302.1

SUN Changjiang, LI Huang, WANG Wenqing. Acceleration System Design for Matrix Computation[J]. Electronics & Packaging, 2023, 23(4): 040305.
