Dynamic IR drop analysis, which finds the maximum IR drop occurring during the actual operation of a circuit, is a very time-consuming process. Therefore, this work proposes a method that performs dynamic IR drop analysis quickly using U-net, a type of image-to-image translation neural network. The U-net takes as input an image clip whose channels are maps of the effective resistance to each gate, the current consumption of each gate over time, and the distance to the nearest power pad. For faster IR drop prediction, instead of predicting every clip, prediction is performed only for time windows with a high probability of large IR drop, and a method that quickly approximates the PDN resistance is applied. Experimental results show that the proposed IR drop prediction method is about 20x faster than actual dynamic IR drop analysis, with an error of about 15%.
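To make the setup concrete, the following is a minimal sketch of how the three maps could be stacked into a U-net input clip, assuming a PyTorch environment; the clip size, the tiny encoder-decoder standing in for the U-net, and all variable names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn

H = W = 64  # illustrative clip size

# The three input channels described above: effective resistance to each
# gate, current drawn by each gate in the time window, and distance to the
# nearest power pad (random placeholders here).
eff_resistance = torch.rand(H, W)
current_map    = torch.rand(H, W)
pad_distance   = torch.rand(H, W)
clip = torch.stack([eff_resistance, current_map, pad_distance]).unsqueeze(0)  # (1, 3, H, W)

# Tiny encoder-decoder standing in for the U-net: one downsampling and one
# upsampling stage, mapping the 3-channel clip to a 1-channel IR-drop map.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(16, 1, kernel_size=4, stride=2, padding=1),
)
ir_drop_map = model(clip)  # (1, 1, H, W): one IR-drop estimate per location
print(ir_drop_map.shape)
```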
The KAIST ITRC Artificial Intelligence Semiconductor System (AISS) Research Center is being launched with Professor Joo-Young Kim of our department as its director. The center was newly selected under the 2020 University ICT Research Center program, which is administered by the Institute of Information & Communications Technology Planning & Evaluation (IITP) under the Ministry of Science and ICT. Professor Kim plans to lead the project through 2025 with a total budget of over 5 billion KRW, with the goal of 'developing convergent semiconductor system technologies for a contact-free, AI-driven society.'
The center will be located in Daejeon as a hub research center linking Seoul, Daejeon, and Ulsan, and will carry out joint research with Yonsei University, Ewha Womans University, and UNIST.
A research team led by Professor Hoi-Jun Yoo of our department has developed an artificial intelligence (AI) semiconductor that processes generative adversarial networks (GANs) efficiently and at low power.
The developed AI semiconductor can process multiple deep neural networks and train them even on low-power mobile devices. With this chip, the team succeeded in running generative AI techniques such as image synthesis, style transfer, and damaged-image restoration on mobile devices.
The results, with Ph.D. student Sang-Hoon Kang as first author, were presented on February 17 at the International Solid-State Circuits Conference (ISSCC) in San Francisco, which gathered about 3,000 semiconductor researchers. (Paper title: GANPU: A 135TFLOPS/W Multi-DNN Training Processor for GANs with Speculative Dual-Sparsity Exploitation)
Although various accelerators have recently been developed to bring AI to mobile devices, previous work supported only the inference step or was limited to training a single deep neural network. The team developed GANPU (Generative Adversarial Networks Processing Unit), an AI semiconductor that can process not only a single deep neural network but also multi-DNN workloads such as GANs, and can train them on mobile devices, broadening the range of AI applications available on mobile devices.
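To see why GAN training is inherently a multi-DNN workload, here is a minimal sketch of one GAN training step, assuming PyTorch; the tiny models and hyperparameters are illustrative only. Every step runs forward and backward passes through two networks, which is the kind of workload GANPU is built to handle.

```python
import torch
import torch.nn as nn

# Two networks: a generator G and a discriminator D (toy sizes).
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 32))
D = nn.Sequential(nn.Linear(32, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, 32)  # stand-in for a batch of real data
z = torch.randn(8, 16)     # latent noise

# 1) Discriminator step: forward/backward through D (and forward through G).
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# 2) Generator step: gradients flow through D back into G.
g_loss = bce(D(G(z)), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```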
Because the developed chip can train a GAN on-device without sending data to a server, it is notable as a privacy-preserving processor. Using its own newly developed techniques, GANPU achieved 4.8x higher energy efficiency than the previous state-of-the-art DNN training chip.
As a demonstration of GANPU, the team built an application that lets a user directly edit photos taken with a tablet camera: a face-editing system in which, given additions, deletions, or modifications to 17 facial attributes such as hair, glasses, and eyebrows, GANPU automatically completes the edited face in real time.
Recently, super-resolution algorithms based on convolutional neural networks (SR-CNN) have been broadly utilized to enable mobile devices to support a better user experience (UX) through video quality enhancement or far-object recognition. However, SR-CNN's distinct architecture makes it hard to meet the high throughput requirement on conventional hardware targeting classification CNNs. This is because the intermediate feature maps of SR do not shrink as they pass through the layers, whereas a classification CNN's feature maps shrink due to pooling or strided convolutions. Because of the huge feature maps in SR-CNN, it requires more external memory access (EMA), a larger on-chip memory footprint, and a heavier computation workload than a classification CNN.
In this work, we propose a high-throughput SR-CNN processor which minimizes the amount of EMA and the on-chip memory footprint with three key features: 1) a selective caching based layer fusion (SCLF) algorithm to reduce the overall memory cost (the product of on-chip memory size and EMA), 2) a memory compaction scheme to reduce the on-chip memory footprint further, and 3) a cyclic ring core architecture to increase PE utilization for SCLF. As a result, the implemented processor achieves 60 frames-per-second throughput in generating full-HD images.
Figure 1. An illustration of the proposed SR computing algorithm and the proposed ring core architecture
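To make the cost metric concrete, below is a toy calculation of how uniform layer fusion trades EMA against on-chip line buffers, with the memory cost defined as their product as in the abstract; all numbers are made up, and SCLF's actual selective-caching policy is more refined than the uniform fusion shown here.

```python
# Toy illustration of the memory-cost metric (on-chip memory size x EMA)
# for different layer-fusion depths in a made-up 4-layer full-HD SR network.
W, H, BYTES = 1920, 1080, 2          # full-HD FP16 feature maps
LAYERS, KROWS = 4, 3                 # toy depth, 3-row sliding window/layer
fmap = W * H * BYTES                 # one intermediate feature map in DRAM

for fused in (1, 2, 4):              # how many consecutive layers are fused
    ema = 2 * fmap * (LAYERS // fused)    # write+read at each fusion boundary
    on_chip = W * BYTES * KROWS * fused   # line buffers for the fused layers
    print(f"fuse={fused}: EMA={ema / 1e6:.1f}MB "
          f"on-chip={on_chip / 1024:.0f}KB cost={ema * on_chip:.2e}")
```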
Title: 1.32 TOPS/W Energy Efficient Deep Neural Network Learning Processor with Direct Feedback Alignment based Heterogeneous Core Architecture
Authors: Dong-Hyeon Han, Jin-Su Lee, Jin-Mook Lee and Hoi-Jun Yoo
An energy-efficient deep neural network (DNN) learning processor is proposed using direct feedback alignment (DFA).
The proposed processor achieves 2.2x faster learning speed than previous learning processors through pipelined DFA (PDFA). Since the computation direction of back-propagation (BP) is the reverse of inference, the gradient of the 1st layer cannot be generated until the errors have propagated from the last layer back to the 1st layer. In contrast, the proposed processor applies DFA, which propagates the errors directly from the last layer to every layer. This means that PDFA can propagate errors during the next inference computation, so the weight update of the 1st layer does not need to wait for error propagation through all the layers. To enhance energy efficiency by 38.7%, the heterogeneous learning core (LC) architecture is optimized with an 11-stage pipelined datapath, which shows 2x longer data reuse than conventional BP. Furthermore, the direct error propagation core (DEPC) utilizes random number generators (RNGs) to remove the external memory access (EMA) caused by error propagation (EP), improving energy efficiency by another 19.9%.
The proposed PDFA-based learning processor is evaluated on an object tracking (OT) application and, as a result, shows 34.4 frames-per-second (FPS) throughput with 1.32 TOPS/W energy efficiency.
Figure 1. Back-propagation vs Pipelined DFA
Figure 2. Layer Level vs Neuron-level vs Partial-sum Level Pipeline
Figure 3. Overall Architecture of Proposed Processor
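A minimal NumPy sketch of the contrast the abstract draws, using a toy 3-layer MLP: BP must push the error backwards layer by layer through transposed weights, while DFA projects the output error to every hidden layer at once through fixed random matrices (which is what lets the DEPC regenerate them from RNGs instead of fetching them from memory). Shapes, data, and the network are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [8, 16, 16, 4]  # toy layer widths
W = [rng.standard_normal((m, n)) * 0.1 for n, m in zip(sizes[:-1], sizes[1:])]
# DFA: one fixed random matrix per hidden layer, projecting the output error.
B = [rng.standard_normal((n, sizes[-1])) for n in sizes[1:-1]]

x = rng.standard_normal(sizes[0])
h = [x]
for Wi in W:                                   # forward pass (ReLU layers)
    h.append(np.maximum(Wi @ h[-1], 0.0))
e = h[-1] - np.ones(sizes[-1])                 # output error vs a dummy target

# BP: the error must travel through every layer's transpose, strictly in order.
delta, bp_grads = e * (h[-1] > 0), []
for i in reversed(range(len(W))):
    bp_grads.insert(0, np.outer(delta, h[i]))
    if i > 0:
        delta = (W[i].T @ delta) * (h[i] > 0)  # blocked until layer i+1 is done

# DFA: every hidden layer receives its error immediately from the output
# error, so all layer updates can start in parallel -- what PDFA pipelines.
dfa_grads = [np.outer((B[i] @ e) * (h[i + 1] > 0), h[i]) for i in range(len(B))]
dfa_grads.append(np.outer(e * (h[-1] > 0), h[-2]))  # last layer uses e directly
```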
Title: LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16
Authors: Jin-Su Lee, Ju-Hyoung Lee, Dong-Hyeon Han, Jin-Mook Lee, Gwang-Tae Park, and Hoi-Jun Yoo
Recently, deep neural network (DNN) hardware accelerators have been reported for energy-efficient deep learning (DL) acceleration. Most previous DNN inference accelerators train their DNN parameters at a cloud server using public datasets and download the parameters to the device to implement AI. However, local DNN learning with domain-specific and private data is required to adapt to each user's preferences on edge or mobile devices. Since edge and mobile devices have only limited computation capability on battery power, an energy-efficient DNN learning processor is necessary. In this paper, we present an energy-efficient on-chip learning accelerator. Its data precision is optimized while maintaining training accuracy through fine-grained mixed precision (FGMP) of FP8-FP16, which reduces external memory access (EMA) and enhances throughput at high accuracy. In addition, sparsity is exploited with intra-channel as well as inter-channel accumulation to support the 3 DNN learning steps with higher throughput and better energy efficiency. Also, an input load balancer (ILB) is integrated to improve PE utilization under the unbalanced amounts of input data caused by irregular sparsity. External memory access is reduced by 38.9% and energy efficiency is improved 2.08x for ResNet-18 training. The fabricated chip occupies 16mm² in 65nm CMOS, and its energy efficiency is 3.48TFLOPS/W (FP8) at 0.0% sparsity and 25.3TFLOPS/W (FP8) at 90% sparsity.
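As a rough illustration of the FGMP idea, the sketch below keeps a small chunk of data in FP8 when its values fit FP8's narrow dynamic range and promotes the chunk to FP16 otherwise, halving the bytes moved for FP8 chunks; the crude e4m3 simulation and the per-chunk rule are our assumptions, not the chip's exact mechanism.

```python
import numpy as np

FP8_MAX = 448.0  # max normal magnitude of an e4m3 FP8 value

def quantize_fp8_e4m3(x):
    """Crude e4m3 FP8 simulation: clamp to range, keep 4 significant bits."""
    x = np.clip(x, -FP8_MAX, FP8_MAX)
    mant, exp = np.frexp(x)                  # mant in [0.5, 1)
    return np.ldexp(np.round(mant * 16) / 16, exp)

def fgmp_encode(tensor, chunk=16):
    """Per-chunk format choice: FP8 when the chunk fits, FP16 otherwise."""
    out, fmt = [], []
    for c in np.split(tensor, len(tensor) // chunk):
        if np.max(np.abs(c)) <= FP8_MAX:
            out.append(quantize_fp8_e4m3(c)); fmt.append("FP8")   # 1 byte/value
        else:
            out.append(c.astype(np.float16)); fmt.append("FP16")  # 2 bytes/value
    return np.concatenate(out), fmt

x = np.random.default_rng(1).standard_normal(64) * 200.0
_, fmts = fgmp_encode(x)
print(fmts)  # mostly FP8; FP16 only where a chunk exceeds FP8's range
```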
Title: A 2.1TFLOPS/W Mobile Deep RL Accelerator with Transposable PE Array and Experience Compression
Authors: Chang-Hyeon Kim, Sang-Hoon Kang, Don-Joo Shin, Sung-Pill Choi, Young-Woo Kim and Hoi-Jun Yoo
Recently, deep neural networks (DNNs) have been actively used for action control so that autonomous systems, such as robots, can perform human-like behaviors and operations. Unlike recognition tasks, real-time operation is essential in action control, and remote learning on a server reached over a network is too slow. New learning techniques, such as reinforcement learning (RL), are needed to determine and select the correct robot behavior locally. In this paper, we propose a DRL accelerator with a transposable PE array and an experience compressor to realize real-time DRL operation of autonomous agents in dynamic environments. It supports on-chip data compression and decompression, so that ~10,000 DRL experiences can be compressed by 65%, and it enables adaptive data reuse for inference and training, which reduces power and peak memory bandwidth by 31% and 41%, respectively. The proposed DRL accelerator is fabricated in 65nm CMOS technology and occupies a 4×4 mm² die area. It is the first fully trainable DRL processor, and it achieves 2.16 TFLOPS/W energy efficiency at 0.73V with 16b weights at 50MHz.
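The sketch below shows the shape of an experience-replay buffer that compresses transitions on insertion and decompresses them on sampling, which is the role the on-chip experience compressor plays; zlib, the transition layout, and the buffer capacity here are illustrative stand-ins for the hardware mechanism.

```python
import pickle
import random
import zlib
import numpy as np

class CompressedReplayBuffer:
    """Toy DRL replay buffer that stores transitions in compressed form."""
    def __init__(self, capacity=10_000):
        self.buf, self.capacity = [], capacity

    def push(self, state, action, reward, next_state):
        # Compress each (s, a, r, s') transition before it is stored.
        blob = zlib.compress(pickle.dumps((state, action, reward, next_state)))
        if len(self.buf) >= self.capacity:
            self.buf.pop(0)                  # evict the oldest experience
        self.buf.append(blob)

    def sample(self, n):
        # Decompress only the transitions drawn for a training batch.
        return [pickle.loads(zlib.decompress(b)) for b in random.sample(self.buf, n)]

buf = CompressedReplayBuffer()
s = np.zeros((84, 84), dtype=np.uint8)       # toy observation
for t in range(100):
    buf.push(s, action=t % 4, reward=0.0, next_state=s)
batch = buf.sample(8)                        # decompressed transitions for training
```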
Title: CNNP-v2: An Energy Efficient Memory-Centric Convolutional Neural Network Processor Architecture
Authors: Sung-Pill Choi, Kyeong-Ryeol Bong, Dong-Hyeon Han, and Hoi-Jun Yoo
An energy-efficient memory-centric convolutional neural network (CNN) processor architecture is proposed for smart devices such as wearable or internet-of-things (IoT) devices. To achieve energy-efficient processing, it has 2 key features. First, 1-D shift convolution PEs with a fully distributed memory architecture achieve 3.1TOPS/W energy efficiency: even though the design has 1024 massively parallel MAC units, it reaches high energy efficiency by scaling the supply voltage down to 0.46V, which its fully locally routed design makes possible. Second, a fully configurable 2-D mesh core-to-core interconnection supports various input feature sizes to maximize utilization. The proposed architecture is evaluated on a 16mm² chip fabricated in a 65nm CMOS process, and it performs real-time face recognition with only 9.4mW at 10MHz and 0.48V.
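One way to read the "1-D shift convolution" idea is to decompose a K×K convolution into K row-wise 1-D convolutions over vertically shifted input slices, which keeps each PE's data access local to a row; the sketch below verifies that decomposition numerically and is our illustrative interpretation, not the chip's exact dataflow.

```python
import numpy as np

def conv2d_reference(x, k):
    """Direct 2-D valid convolution (correlation) for checking."""
    K = k.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.sum(x[i:i + K, j:j + K] * k)
    return out

def conv2d_as_1d_shifts(x, k):
    """K x K conv as K row-wise 1-D convolutions over shifted input rows."""
    K = k.shape[0]
    H, W = x.shape[0] - K + 1, x.shape[1] - K + 1
    out = np.zeros((H, W))
    for r in range(K):                 # each kernel row -> one 1-D pass
        rows = x[r:r + H]              # vertically shifted input slice
        for i in range(H):
            # 1-D valid correlation of one input row with one kernel row
            out[i] += np.convolve(rows[i], k[r][::-1], mode="valid")
    return out

x = np.random.default_rng(2).standard_normal((8, 8))
k = np.random.default_rng(3).standard_normal((3, 3))
assert np.allclose(conv2d_reference(x, k), conv2d_as_1d_shifts(x, k))
```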