Electronics & Packaging

• Circuits and Systems •


Design and Implementation of High Efficiency LSTM Hardware Accelerator

CHEN Kai1, HE Bang2, TENG Ziheng1, FU Yuxiang2, LI Shiping1   

  1. Jiangsu Huachuang Microsystems Co., Ltd., Nanjing 211899, China; 2. School of Integrated Circuits, Nanjing University, Suzhou 215163, China
  • Received: 2025-02-17  Revised: 2025-03-06  Online: 2025-04-02  Published: 2025-04-02
  • Corresponding author: CHEN Kai
  • Funding: Joint Fund for Enterprise Innovation and Development of the National Natural Science Foundation of China (U21B2032)


Abstract: Compared with traditional recurrent neural networks (RNNs), long short-term memory (LSTM) networks add multiple gating units and memory cells, which effectively mitigate the vanishing- and exploding-gradient problems of traditional RNNs. Owing to this advantage in modeling complex sequential dependencies, LSTM networks are widely used in natural language processing tasks such as machine translation, sentiment analysis, and text classification. As intelligent applications grow more complex and the number of layers and hidden-layer nodes in LSTM networks increases, the demands on the storage capacity, memory access bandwidth, and processing performance of edge-side processing devices also rise sharply. This paper analyzes the characteristics of the LSTM algorithm, designs a highly parallel, pipelined gate computation unit, proposes a multi-level shared data path method, and optimizes the control of the hardware execution flow of the LSTM algorithm, resulting in an LSTM hardware accelerator with a peak computing power of 2.144 TOPS. The accelerator is physically implemented in a FinFET process. Board-level test results after tape-out show that the LSTM hardware accelerator achieves a computational efficiency above 95%, with an inference frame rate per TOPS more than 2.8 times that of an NVIDIA GTX 1080 Ti GPU.
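
For reference, a minimal sketch of the standard LSTM cell recurrence (the generic textbook formulation, not necessarily the exact dataflow of this accelerator) that the gate computation units must evaluate at every time step; W, U and b denote input weights, recurrent weights and biases, σ is the logistic sigmoid, and ⊙ is the element-wise product:

\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t \\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}

Each time step is dominated by the eight matrix-vector products W_* x_t and U_* h_{t-1} across the four gates, presumably the target of the MAC-based computational clusters listed in the keywords; at the reported efficiency above 95%, sustained throughput on the 2.144 TOPS design would be roughly 0.95 × 2.144 ≈ 2.04 TOPS.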

Key words: long short-term memory, parallel pipeline, hardware acceleration, NLP, MAC, computational cluster, storage array