Official journal of the Packaging Branch, China Semiconductor Industry Association

Official journal of the Electronic Manufacturing and Packaging Technology Branch, Chinese Institute of Electronics


Electronics & Packaging ›› 2025, Vol. 25 ›› Issue (9): 090302. doi: 10.16257/j.cnki.1681-1070.2025.0103

• Circuits and Systems •


Design and Implementation of High-Efficiency LSTM Hardware Accelerator

CHEN Kai1, HE Bang2, TENG Ziheng1, FU Yuxiang2, LI Shiping1   

  1. Jiangsu Huachuang Microsystems Co., Ltd., Nanjing 211899, China; 2. School of Integrated Circuits, Nanjing University, Suzhou 215163, China
  • Received: 2025-02-15 Online: 2025-09-28 Published: 2025-04-02
  • About the author: CHEN Kai (b. 1979), male, a native of Nanjing, Jiangsu; M.S., senior engineer. His main research interests include energy-efficient hardware implementation of signal processing and artificial intelligence algorithms, and heterogeneous multi-core SoC chip architecture design.


Abstract: Compared with traditional recurrent neural networks (RNNs), long short-term memory (LSTM) networks add multiple gating units and memory cells, effectively mitigating the vanishing- and exploding-gradient problems of traditional RNNs. Owing to their advantages in modeling complex sequential dependencies, LSTM networks are widely used in natural language processing (NLP) tasks such as machine translation, sentiment analysis, and text classification. As intelligent applications grow more complex, the number of LSTM layers and hidden-layer nodes increases, placing significantly higher demands on the storage capacity, memory-access bandwidth, and processing performance of edge devices. This work analyzes the characteristics of the LSTM algorithm, designs highly parallel, pipelined gate-computation units, proposes a multi-level shared data path method, and optimizes the control flow of the hardware implementation of the LSTM algorithm. An LSTM hardware accelerator with a peak computing power of 2.144 TOPS is designed and physically implemented in a fin field-effect transistor (FinFET) process. Board-level test results after tape-out show that the accelerator achieves a computational efficiency above 95%, with an inference frame rate per TOPS more than 2.8 times that of an NVIDIA GTX 1080 Ti GPU.
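
For context, the gating structure the abstract refers to is the standard LSTM cell update. A minimal sketch of the common formulation is given below; the exact variant implemented by the accelerator (e.g., peephole connections or fused gate matrices) is not specified in the abstract:

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate memory)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(memory cell update)}\\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}
$$

The four gate computations share the same inputs $x_t$ and $h_{t-1}$ and are mutually independent, which is what makes the highly parallel, pipelined gate units and the multi-level shared data path described in the abstract effective: the gate matrix-vector products can proceed in parallel while operand fetches of $x_t$ and $h_{t-1}$ are shared across them.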

Key words: long short-term memory network, parallel pipeline, hardware acceleration, computational cluster
