Official journal of the Packaging Branch, China Semiconductor Industry Association

Official journal of the Electronic Manufacturing and Packaging Technology Branch, Chinese Institute of Electronics


Electronics & Packaging ›› 2025, Vol. 25 ›› Issue (9): 090302. doi: 10.16257/j.cnki.1681-1070.2025.0103

• Circuits and Systems •


Design and Implementation of High-Efficiency LSTM Hardware Accelerator

CHEN Kai1, HE Bang2, TENG Ziheng1, FU Yuxiang2, LI Shiping1   

  1. Jiangsu Huachuang Microsystems Co., Ltd., Nanjing 211899, China; 2. School of Integrated Circuits, Nanjing University, Suzhou 215163, China
  • Received: 2025-02-15 Online: 2025-09-28 Published: 2025-04-02
  • About the author: CHEN Kai (b. 1979), male, a native of Nanjing, Jiangsu; M.S., senior engineer. His main research interests include energy-efficient hardware implementation of signal processing and artificial intelligence algorithms, and heterogeneous multi-core SoC chip architecture design.


Abstract: Compared with traditional recurrent neural networks (RNNs), long short-term memory (LSTM) networks add multiple gating units and memory cells, effectively mitigating the vanishing- and exploding-gradient problems of traditional RNNs. Owing to their advantages in modeling complex sequential dependencies, LSTM networks are widely used in natural language processing (NLP) tasks such as machine translation, sentiment analysis, and text classification. As intelligent applications grow more complex, the number of LSTM layers and hidden-layer nodes increases, placing significantly higher demands on the storage capacity, memory-access bandwidth, and processing performance of edge devices. This work analyzes the characteristics of the LSTM algorithm, designs highly parallel, pipelined gate-computation units, proposes a multi-level shared data path method, and optimizes the control flow of the hardware implementation of the LSTM algorithm. An LSTM hardware accelerator with a peak computing power of 2.144 TOPS is designed and physically implemented in a fin field-effect transistor (FinFET) process. Board-level test results after tape-out show that the accelerator achieves a computational efficiency above 95%, with an inference frame rate per TOPS more than 2.8 times that of an NVIDIA GTX 1080 Ti GPU.
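
For context, the gating structure the abstract refers to is the standard LSTM cell update. A minimal sketch of the common formulation is given below; the exact variant implemented by the accelerator (e.g., peephole connections or fused gate matrices) is not specified in the abstract:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

The four gate computations share the same inputs $x_t$ and $h_{t-1}$ and are mutually independent, which is what makes the highly parallel, pipelined gate units and the multi-level shared data path described in the abstract effective: the gate matrix-vector products can proceed in parallel while operand fetches of $x_t$ and $h_{t-1}$ are shared across them.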

Key words: long short-term memory network, parallel pipeline, hardware acceleration, computational cluster
