KAIST EE

Highlights

Dae Hyun Kang & Seung Hoon Kim (Integrated M.S. & Ph.D., Professor Byung Jin Cho’s Lab) Win Best Oral Presentation Award at the ‘KIEEME Summer Conference 2025’

< (From left) Integrated M.S. & Ph.D. candidates Dae Hyun Kang and Seung Hoon Kim >

Students Dae Hyun Kang and Seung Hoon Kim (integrated M.S. & Ph.D. program) from Professor Byung Jin Cho’s Research Lab have been honored with the Best Oral Presentation Award at the 2025 Summer Conference of the Korean Institute of Electrical and Electronic Material Engineers (KIEEME).

The KIEEME Summer Conference is one of the most prestigious academic events in Korea for the fields of electronic materials and semiconductors. It serves as a key venue for sharing the latest research achievements, discussing industrial trends, and promoting academic–industrial collaboration.

Dae Hyun Kang presented a paper titled “Performance Enhancement of Charge Trap Flash Memory via Silicon-Doped Boron Nitride Energy Barrier,” while Seung Hoon Kim presented “Analysis of Disturbance Behavior through Lanthanum Interface Treatment in Hafnium Oxide Ferroelectric-Based FeFET Memory.”

Both presentations received high evaluations for originality, technical completeness, and contributions to both academia and industry, leading them to be jointly awarded the Best Oral Presentation Award.

This achievement is particularly significant as it highlights the breakthrough potential of advanced charge trap memory and FeFET memory technologies in overcoming performance limitations and improving device reliability, gaining strong recognition from both academia and industry.

Ph.D. Candidate Carmela Michelle Esteban from Professor Seunghyup Yoo’s Lab Receives Young Researcher Award at ISFOE25

< (Third from the left) Ph.D. candidate Carmela Michelle Esteban >

Carmela Michelle Esteban, a Ph.D. candidate in the research group of Professor Seunghyup Yoo at KAIST School of Electrical Engineering, received the Young Researcher Award for Best Poster Presentation at the 18th International Symposium on Flexible Organic Electronics (ISFOE25), held from July 7 to 10 in Thessaloniki, Greece.

Michelle was recognized for her outstanding research presentation titled “Multi-Functional Polymeric Substrate with Integrated Optical Layers for Flexible Organic Photodetectors.”

ISFOE is a prestigious international symposium in the field of flexible organic and printed electronics, held annually to foster innovation in next-generation electronics. Each year, the Young Researcher Awards are presented to graduate students who demonstrate academic excellence and exceptional research achievements in the field.

Awardees receive a certificate and one complimentary publication in Nanomaterials, a journal published by MDPI.

Professor Hoi-Jun Yoo of KAIST Elected as a New Member of the National Academy of Sciences, Republic of Korea

Professor Hoi-Jun Yoo, a faculty member in the School of Electrical Engineering at KAIST and an ICT Endowed Chair Professor, has been elected as a new member of the National Academy of Sciences, Republic of Korea (NAS) for the year 2025. His official appointment was confirmed at the NAS general assembly held on July 11, in recognition of his continued research excellence and academic contributions in the field of electronic engineering. He received his membership certificate during the induction ceremony held on July 18 at the NAS headquarters in Seocho-gu, Seoul.

Established in 1954 under the Ministry of Education, the NAS is a national academic institution that annually selects a very limited number of new members through a rigorous screening process, honoring distinguished scholars who have significantly contributed to academic advancement in Korea. This year, only eight scholars nationwide were selected, with Professor Yoo being the sole appointee in Division 3 (Engineering) of the Natural Sciences category.

The NAS selects scholars with exceptional academic achievements and contributions to the development of their fields. It supports their research, provides academic policy advice, promotes international academic exchange, designates outstanding academic books, and presents the NAS Awards. Membership is granted based not only on research accomplishments but also on long-term contributions to academia, representing the highest level of academic prestige in Korea. As of 2025, the total membership is limited to approximately 150 scholars across both natural and social sciences, with around 70 members in the natural sciences division nationwide.

NAS members are regarded as “nationally recognized representatives of academia,” tasked with serving the country and society through scholarly work. The NAS also acts as a hub for international academic collaboration by cooperating with major academies around the world.

Professor Yoo is a globally recognized researcher in the fields of semiconductor design and convergent systems, including AI semiconductors, neuromorphic chips, ultra-low power SoCs (System on Chip), and wearable semiconductors. He currently serves as a professor in the Department of Electrical Engineering at KAIST, ICT Endowed Chair Professor, Director of the Graduate School of AI Semiconductors, Director of the Institute for IT Convergence, and Head of the Research Center for PIM Semiconductor Design.

Notably, Professor Yoo developed the world’s first 256M SDRAM in 1995 and published a related paper, marking the beginning of a prolific research career. Between 2000 and 2023, he published 62 key academic papers, covering a wide range of topics such as semiconductor design, AI semiconductors, wearable AR chips, low-power wireless communication chips, and biomedical ICs. In 2014, he announced the world’s first deep neural network (DNN) accelerator chip, and by 2025, he had published 18 research papers on AI semiconductors.

He is also a Fellow of the Institute of Electrical and Electronics Engineers (IEEE) and was named one of the “Top 5 Most Prolific Authors” at the 70th anniversary of the International Solid-State Circuits Conference (ISSCC). He was the only Asian scholar included in this list, affirming his international research prominence.

Earlier in his career, Professor Yoo led the development of surface-emitting lasers at Bell Labs and directed the development of 256M DRAM at Hyundai Electronics (now SK hynix). In 2005, he also contributed to national policy as an advisor to the Ministry of Information and Communication, helping shape SoC and next-generation computing technology strategies.

To date, he has authored or edited over 250 papers and five technical books. He has served as a committee member and TPC chair for leading international conferences such as ISSCC, A-SSCC, and ISWC, and has been active as an IEEE SSCS Distinguished Lecturer.

The KAIST School of Electrical Engineering described Professor Yoo’s appointment as a recognition of his continued academic contributions and growing international stature in the fields of semiconductors and electronic engineering. The school expressed expectations for his continued research achievements and mentorship of the next generation.

His election to the NAS stands as a nationally recognized testament to Professor Yoo’s long-standing research accomplishments and academic impact in the field of electronic and system semiconductor design, marking a meaningful milestone in the acknowledgment of his expertise and sustained scholarly activity.

Professor Shinhyun Choi’s Team Develops Next-Generation Neuromorphic Semiconductor Based Artificial Sensory Nervous System

< (Left to right) See‑On Park, MS-PhD Integrated student, KAIST School of Electrical Engineering; Jongwon Lee, Professor, Department of Semiconductor Convergence, Chungnam National University; Shinhyun Choi, Professor, School of Electrical Engineering, KAIST >

With the joint advancement of artificial intelligence and robotics technologies, enabling robots to perceive and respond to their environments as efficiently as humans has become a critical challenge. Recently, a Korean research team has attracted attention by newly implementing an artificial sensory nervous system that mimics biological sensory nerves without any complex software or circuitry. This technology minimizes energy consumption while intelligently reacting to external stimuli, promising applications in ultra‑miniature robots, prosthetic hands, and robotics for medical or extreme environments.

A joint research team led by Shinhyun Choi, KAIST Endowed Chair Professor, and Jongwon Lee, Professor in the Department of Semiconductor Convergence at Chungnam National University, together with See‑On Park of the integrated MS-PhD program in the KAIST School of Electrical Engineering, has developed a next‑generation, neuromorphic‑semiconductor‑based artificial sensory nervous system. They experimentally demonstrated a novel robotic system that responds efficiently to external stimuli.

Animals, including humans, ignore safe or familiar stimuli but respond selectively and sensitively to important ones, thus preventing energy waste while focusing on crucial signals for swift reaction to environmental changes. For example, one soon tunes out the hum of an air conditioner or the feeling of clothes on the skin, yet quickly focuses on hearing one’s name called or sensing a sharp object touching the skin. This is regulated by the sensory nervous system’s functions of “habituation” and “sensitization,” and many have sought to apply these biological features to robots for more efficient, human‑like environmental responses.

However, implementing complex features such as habituation and sensitization in robots has required separate software or intricate circuitry, hindering miniaturization and energy efficiency. In particular, efforts using memristors, neuromorphic semiconductor elements whose resistance depends on the history of current flow, have been limited by conventional memristors’ simple conductance changes, which failed to replicate the sensory system’s complexity.

To overcome these limitations, the team engineered a new memristor in which opposing conductance‑changing layers coexist within a single device. This structure enables the realistic emulation of habituation and sensitization, as seen in biological sensory nerves.

< Figure 1. Physical appearance and schematic of the new memristor capable of mimicking habituation and sensitization in sensory nerves (top), and comparison of the simple conductance‑change behavior of conventional memristors versus the complex conductance patterns of the developed device (bottom). >

This device gradually reduces its response upon repeated stimuli and, when a danger signal is detected, becomes sensitized again, faithfully reproducing the complex synaptic response patterns of real nervous systems.

Using these memristors, the researchers built a memristor‑based artificial sensory nervous system for touch and pain detection, and attached it to a robotic hand to test its efficiency. When safe tactile stimuli were repeatedly applied, the robotic hand, which was initially sensitive to the novel touch, began to ignore it, demonstrating habituation. Later, when an electric shock (a danger signal) accompanied the touch, the system recognized the danger and regained sensitivity, confirming the sensitization function.
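The habituation and sensitization behavior described above can be illustrated with a toy numerical model. The sketch below is a behavioral illustration only and does not model the memristor's physics; the function name `simulate_response`, the decay factor, and the reset rule are all assumptions made for this example.

```python
def simulate_response(stimuli, decay=0.5, baseline=1.0):
    """Toy habituation/sensitization model (illustrative assumption only).

    stimuli: sequence of "touch" (safe) or "shock" (danger) events.
    Returns the response strength produced for each event.
    """
    sensitivity = baseline
    responses = []
    for event in stimuli:
        if event == "shock":
            # Danger signal: restore full sensitivity (sensitization).
            sensitivity = baseline
            responses.append(sensitivity)
        else:
            # Safe stimulus: respond, then weaken the next response (habituation).
            responses.append(sensitivity)
            sensitivity *= decay
    return responses


events = ["touch", "touch", "touch", "shock", "touch"]
print(simulate_response(events))  # [1.0, 0.5, 0.25, 1.0, 1.0]
```

In this toy model the response to repeated safe touches halves each time, and the first touch after a shock is answered at full strength again.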

< Figure 2. Experimental results of the robotic hand equipped with the memristor‑based artificial sensory nervous system. By ignoring unimportant stimuli, the system improves energy efficiency and reduces processor load. >

These experiments demonstrate that robots can respond to stimuli as efficiently as humans without complex software or processors, validating the feasibility of energy‑efficient, neuro‑inspired robots.

See‑On Park, first author of the study, stated, “By emulating the human sensory nervous system with next‑generation semiconductors, we’ve opened the door to a new class of robots that respond more intelligently and with greater energy efficiency to their environments. We expect applications in ultra‑miniature robots, military robots, and medical prostheses, where the convergence of advanced semiconductors and robotics is critical.”

This research was published online on July 1, 2025, in the international journal Nature Communications.

Paper title: Experimental demonstration of third‑order memristor‑based artificial sensory nervous system for neuro‑inspired robotics

DOI: https://doi.org/10.1038/s41467-025-60818-x

This research was supported by the National Research Foundation of Korea’s Next‑Generation Intelligent Semiconductor Technology Development Project, Mid‑Career Research Program, PIM AI Semiconductor Core Technology Development Project, Outstanding Young Researcher Program, and the Nano Comprehensive Technology Institute’s Nanomedical Devices Project.

Professor Kyung Cheol Choi and Professor Hyunjoo J. Lee’s Team Presents ‘Game-Changing’ Technology for Intractable Brain Disease Treatment Using Micro OLEDs

〈(From left) Professor Kyung Cheol Choi, Professor Hyunjoo J. Lee, and Dr. Somin Lee from the School of Electrical Engineering〉

Optogenetics is a technique that controls neural activity by stimulating neurons expressing light-sensitive proteins with specific wavelengths of light. It has opened new possibilities for identifying causes of brain disorders and developing treatments for intractable neurological diseases. Because this technology requires precise stimulation inside the human brain with minimal damage to soft brain tissue, it must be integrated into a neural probe—a medical device implanted in the brain. EE researchers have now proposed a new paradigm for neural probes by integrating micro OLEDs into thin, flexible, implantable medical devices.

In joint research, Professor Kyung Cheol Choi and Professor Hyunjoo J. Lee from the School of Electrical Engineering have succeeded in developing an optogenetic neural probe integrated with flexible micro OLEDs.

Optical fibers have been used for decades in optogenetic research to deliver light to deep brain regions from external light sources. Recently, research has focused on flexible optical fibers and ultra-miniaturized neural probes that integrate light sources for single-neuron stimulation.

The research team focused on micro OLEDs due to their high spatial resolution and flexibility, which allow for precise light delivery to small areas of neurons. This enables detailed brain circuit analysis while minimizing side effects and avoiding restrictions on animal movement. Moreover, micro OLEDs offer precise control of light wavelengths and support multi-site stimulation, making them suitable for studying complex brain functions.

〈Figure 1. Flexible Neural Probe for Integrated Optogenetics Using Micro-OLEDs (a) Schematic Diagram (b) Multilayer Structure (c) Demonstration of Individual Micro-OLED Pixel Operation (d) Electro-Optical Characteristics Graph of Micro-OLEDs Integrated on the Probe〉

However, the electrical properties of OLEDs degrade easily in the presence of moisture or water, which has limited their use as implantable bioelectronics. Furthermore, optimizing the high-resolution integration process on thin, flexible probes remained a challenge.

To address this, the team enhanced the operational reliability of OLEDs in moist, oxygen-rich environments and minimized tissue damage during implantation. They patterned an ultrathin, flexible encapsulation layer* composed of aluminum oxide and parylene-C (Al₂O₃/parylene-C) at widths of 260–600 micrometers (μm) to maintain biocompatibility. *Encapsulation layer: A barrier that completely blocks oxygen and water molecules from the external environment, ensuring the longevity and reliability of the device.

When integrating the high-resolution micro OLEDs, the researchers also used parylene-C, the same biocompatible material as the encapsulation layer, to maintain flexibility and safety. To eliminate electrical interference between adjacent OLED pixels and spatially separate them, they introduced a pixel define layer (PDL), enabling the independent operation of eight micro OLEDs.

Furthermore, they precisely controlled the residual stress and thickness in the multilayer film structure of the device, ensuring its flexibility even in biological environments. This optimization allowed for probe insertion without bending or external shuttles or needles, minimizing mechanical stress during implantation.

〈Front cover of Advanced Functional Materials: conceptual diagram of a flexible neural probe for integrated optogenetics (Micro-OLED)〉

As a result, the team developed a flexible neural probe with integrated micro OLEDs capable of emitting more than one milliwatt per square millimeter (mW/mm²) at 470 nanometers (nm), the optimal wavelength for activating channelrhodopsin-2. This is a significantly high light output for optogenetics and biomedical stimulation applications.

The ultrathin flexible encapsulation layer exhibited a low water vapor transmission rate of 2.66×10⁻⁵ g/m²/day, allowing the device to maintain functionality for over 10 years. The parylene-C-based barrier also demonstrated excellent performance in biological environments, successfully enabling the independent operation of the integrated OLEDs without electrical interference or bending issues.

Dr. Somin Lee, the lead author from Professor Choi’s lab, stated, “We focused on fine-tuning the integration process of highly flexible, high-resolution micro OLEDs onto thin flexible probes, enhancing their biocompatibility and application potential. This is the first reported development of such flexible OLEDs in a probe format and presents a new paradigm for using flexible OLEDs as implantable medical devices for monitoring and therapy.”

This study, with Dr. Somin Lee as the first author, was published online on March 26 in Advanced Functional Materials (IF 18.5), a leading international journal in the field of nanotechnology, and was selected as the cover article for the upcoming July issue.

※ Title: Advanced Micro-OLED Integration on Thin and Flexible Polymer Neural Probes for Targeted Optogenetic Stimulation

※ DOI: https://doi.org/10.1002/adfm.202420758

The research was supported by the Ministry of Science and ICT and the National Research Foundation of Korea through the Electronic Medicine Technology Development Program (Project title: Development of Core Source Technologies and In Vivo Validation for Brain Cognition and Emotion-Enhancing Light-Stimulating Electronic Medicine).

AI Manipulating Public Opinion? Technology to Detect Korean “AI-Generated Comments”

〈 (Left to right) KAIST School of Electrical Engineering Professor Yongdae Kim, Sungkyunkwan University Professor Hyoungshick Kim, KAIST School of Computing Professor Alice Oh, National Security Research Institute Senior Researcher Wooyoung Go 〉

As generative AI technology advances, so do concerns about its potential misuse in manipulating online public opinion. Although detection tools for AI-generated text have been developed previously, most are based on long, standardized English texts and therefore perform poorly on short (average 51 characters), colloquial Korean news comments. A research team from KAIST has made headlines by developing the first technology to detect AI-generated comments in Korean.

A research team led by Professor Yongdae Kim from KAIST’s School of Electrical Engineering, in collaboration with the National Security Research Institute, has developed XDAC, the world’s first system for detecting AI-generated comments in Korean.

Recent generative AI can adjust sentiment and tone to match the context of a news article and can automatically produce hundreds of thousands of comments within hours, enabling large-scale manipulation of public discourse. Based on the pricing of OpenAI’s GPT-4o API, generating a single comment costs approximately 1 KRW. At this rate, producing the average 200,000 daily comments on major news platforms would cost only about 200,000 KRW (approx. USD 150) per day. Anyone running publicly available LLMs on their own GPU infrastructure can generate massive volumes of comments at virtually no cost.
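The cost estimate above is simple multiplication, and can be checked in a few lines. The per-comment cost and daily volume are the approximate figures quoted in the article; the exchange rate `KRW_PER_USD` is an assumed round number used only for illustration.

```python
COST_PER_COMMENT_KRW = 1    # approximate figure quoted in the article
DAILY_COMMENTS = 200_000    # average daily comment volume quoted in the article
KRW_PER_USD = 1_350         # assumed exchange rate, for illustration only


def daily_cost_krw(comments, cost_per_comment):
    """Total daily cost in KRW of generating `comments` comments."""
    return comments * cost_per_comment


cost_krw = daily_cost_krw(DAILY_COMMENTS, COST_PER_COMMENT_KRW)
cost_usd = cost_krw / KRW_PER_USD
print(f"{cost_krw:,} KRW per day (about {cost_usd:.0f} USD)")
```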

The team conducted a human evaluation to see whether people could distinguish AI-generated comments from human-written ones. Of 210 comments tested, participants mistook 67% of AI-generated comments for human-written, while only 73% of genuine human comments were correctly identified. In other words, even humans find it difficult to accurately tell AI comments apart. Moreover, AI-generated comments scored higher than human comments in relevance to article context (95% vs. 87%), fluency (71% vs. 45%), and exhibited a lower perceived bias rate (33% vs. 50%).

Until now, AI-generated text detectors have relied on long, formal English prose and fail to perform well on brief, informal Korean comments. Such short comments lack sufficient statistical features and abound in nonstandard colloquial elements, such as emojis, slang, and repeated characters, to which existing models do not generalize well. Additionally, realistic datasets of Korean AI-generated comments have been scarce, and simple prompt-based generation methods produced limited diversity and authenticity.

To overcome these challenges, the team developed an AI comment generation framework that employs four core strategies: 1) leveraging 14 different LLMs, 2) enhancing naturalness, 3) fine-grained emotion control, and 4) reference-based augmented generation, to build a dataset mirroring real user styles. A subset of this dataset has been released as a benchmark. By applying explainable AI (XAI) techniques to fine-grained linguistic analysis, they uncovered linguistic and stylistic features unique to AI-generated comments.

< Figure 1. AI Comment Generation Framework >

For example, AI-generated comments tended to use formal expressions like “것 같다” (“it seems”) and “에 대해” (“about”), along with a high frequency of conjunctions, whereas human commentators favored repeated characters (ㅋㅋㅋㅋ), emotional interjections, line breaks, and special symbols.

In the use of special characters, AI models predominantly employed globally standardized emojis, while real humans incorporated culturally specific characters including Korean consonants (ㅋ, ㅠ, ㅜ) and symbols (ㆍ, ♡, ★, •).

Notably, 26% of human comments included formatting characters (line breaks, multiple spaces), compared to just 1% of AI-generated ones. Similarly, repeated-character usage (e.g. ㅋㅋㅋㅋ, ㅎㅎㅎㅎ, etc.) appeared in 52% of human comments but only 12% of AI comments.

XDAC captures these distinctions to boost detection accuracy. It transforms formatting characters (line breaks, spaces) and normalizes repeated-character patterns into machine-readable features. It also learns each LLM’s unique linguistic fingerprint, enabling it to identify which model generated a given comment.
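To make the idea of turning formatting and repetition cues into machine-readable features concrete, here is a minimal sketch. It is not XDAC's actual preprocessing, which the article does not detail; the function names `extract_features` and `normalize_repeats`, the three-character repetition threshold, and the feature set are all hypothetical.

```python
import re

REPEAT_RUN = re.compile(r"(.)\1{2,}")   # the same character three or more times in a row
MULTI_SPACE = re.compile(r" {2,}")      # two or more consecutive spaces


def extract_features(comment):
    """Count formatting and repetition cues in one comment (hypothetical features)."""
    return {
        "line_breaks": comment.count("\n"),
        "multi_space_runs": len(MULTI_SPACE.findall(comment)),
        "repeat_runs": len(REPEAT_RUN.findall(comment)),
    }


def normalize_repeats(comment, keep=2):
    """Collapse each run of a repeated character down to `keep` copies."""
    return REPEAT_RUN.sub(lambda m: m.group(1) * keep, comment)


sample = "ㅋㅋㅋㅋ 진짜\n대박 ㅎㅎㅎ"
print(extract_features(sample))
print(normalize_repeats(sample))
```

A classifier could consume the counts directly, while the normalized text keeps the signal that a repetition occurred without letting its exact length fragment the vocabulary.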

With these optimizations, XDAC achieves a 98.5% F1 score in detecting AI-generated comments, a 68% improvement over previous methods, and records an 84.3% F1 score in identifying the specific LLM used for generation.

< Figure 2. XDAC Demo: Detection and Identification in Action >

Professor Yongdae Kim emphasized, “This study is the world’s first to detect short comments written by generative AI with high accuracy and to attribute them to their source model. It lays a crucial technical foundation for countering AI-based public opinion manipulation.”

The team also notes that the mere existence of XDAC’s detection capability may act as a deterrent, much like sobriety checkpoints, drug testing, or CCTV installation, reducing the incentive to misuse AI.

Platform operators can deploy XDAC to monitor and respond to suspicious accounts or coordinated manipulation attempts, with strong potential for expansion into real-time surveillance systems or automated countermeasures.

The core contribution of this work is the XAI-driven detection framework. It has been accepted to the main conference of ACL 2025, the premier venue in natural language processing, taking place on July 27.

※Paper Title:
XDAC: XAI-Driven Detection and Attribution of LLM-Generated News Comments in Korean

※Full Paper:
https://github.com/airobotlab/XDAC/blob/main/paper/250611_XDAC_ACL2025_camera_ready.pdf

This research was conducted under the supervision of Professor Yongdae Kim at KAIST, with Senior Researcher Wooyoung Go (NSR and PhD candidate at KAIST) as the first author, and Professors Hyoungshick Kim (Sungkyunkwan University) and Alice Oh (KAIST) as co-authors.

EE Professor Jung-Woo Choi’s Research Team Wins the IEEE DCASE 2025 Challenge, the World’s Leading Acoustic AI Competition

<(Left to right) Younghoo Kwon (Integrated Master’s and Ph.D. program), Dohwan Kim (Master’s program), Professor Jung-Woo Choi, Dongheon Lee (Ph.D.)>

Acoustic source separation and classification is a key next-generation AI technology for early detection of anomalies in drone operations, piping faults, or border surveillance, and for enabling spatial audio editing in AR/VR content production.

Professor Jung-Woo Choi’s research team from the School of Electrical Engineering won first place in the “Spatial Semantic Segmentation of Sound Scenes” task of the “IEEE DCASE Challenge 2025.”

This year’s challenge featured 86 teams competing across six tasks. In their first-ever participation, KAIST’s team ranked first in Task 4: Spatial Semantic Segmentation of Sound Scenes—a highly demanding task requiring the analysis of spatial information in multi-channel audio signals with overlapping sound sources. The goal was to separate individual sounds and classify them into 18 predefined categories. The team, composed of Dr. Dongheon Lee, integrated MS-PhD student Younghoo Kwon, and MS student Dohwan Kim, will present their results at the DCASE Workshop in Barcelona this October.

Earlier this year, Dr. Dongheon Lee developed a state-of-the-art sound source separation AI combining Transformer and Mamba architectures. For the challenge, the team, led by Younghoo Kwon, established a chain-of-inference architecture that first separates waveforms and identifies source types, and then refines the estimates by using the estimated waveforms and classes as clues for target signal extraction in the next stage.

< Figure 1. Example of an acoustic scene with multiple mixed sounds >

This chain-of-inference approach is inspired by the human auditory scene analysis mechanism, which isolates individual sounds by focusing on incomplete clues such as sound type, rhythm, or direction.

In the evaluation metric CA-SDRi (Class-aware Signal-to-distortion Ratio improvement)*, the team was the only participant to achieve a double-digit improvement of 11 dB, demonstrating their technical excellence. *CA-SDRi (Class-aware Signal-to-distortion Ratio improvement) measures how much clearer and less distorted the target sound is compared with the original mix.
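As a rough illustration of what a signal-to-distortion ratio improvement measures, the sketch below computes a plain SDR improvement for a single source in pure Python. It is a simplified stand-in: the class-aware handling in the challenge's CA-SDRi metric is not reproduced, and the function names and toy signals are assumptions for this example.

```python
import math


def sdr_db(reference, estimate):
    """Signal-to-distortion ratio in dB between a reference and an estimate.

    Assumes the estimate is not identical to the reference (nonzero error).
    """
    signal_power = sum(r * r for r in reference)
    error_power = sum((r - e) ** 2 for r, e in zip(reference, estimate))
    return 10.0 * math.log10(signal_power / error_power)


def sdr_improvement_db(reference, mixture, estimate):
    """How many dB closer to the reference the estimate is than the raw mixture."""
    return sdr_db(reference, estimate) - sdr_db(reference, mixture)


reference = [1.0, -1.0, 1.0, -1.0]       # clean target source (toy signal)
mixture = [1.5, -0.5, 1.5, -0.5]         # target plus interference
estimate = [1.05, -0.95, 1.05, -0.95]    # separated output with a small residual

print(round(sdr_improvement_db(reference, mixture, estimate), 2))  # 20.0
```

Here the residual error amplitude drops by a factor of ten relative to the mixture, which corresponds to a 20 dB improvement.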

Professor Choi remarked, “I am proud that the world-leading acoustic separation AI models our team has developed over the past three years have now received formal recognition. Despite the greatly increased difficulty and the limited development window due to other conference schedules and final exams, each member demonstrated focused research that led to first place.”

< Figure 2. Time-frequency patterns of separated sound sources >

The “IEEE DCASE Challenge 2025” was held online, with submissions accepted from April 1 to June 15 and results announced on June 30. Since its inception in 2013 under the IEEE Signal Processing Society, the challenge has served as a global stage for AI models in the acoustic field.

This research was supported by the National Research Foundation of Korea’s Mid-Career Researcher Program and STEAM Research Project, funded by the Ministry of Education, and the Future Defense Research Center, funded by the Defense Acquisition Program Administration and the Agency for Defense Development.

< Figure 3. AI architecture for sound separation and classification >

< Competition results and rankings. A higher CA-SDRi indicates a better score (unit: decibels, dB) >

Ph.D. candidate Se Jin Park from Professor Yong Man Ro’s lab develops ‘SpeechSSM,’ opening up possibilities for a 24-hour AI voice assistant

<(From left) Prof. Yong Man Ro and Ph.D. candidate Se Jin Park>

Recently, Spoken Language Models (SLMs) have been spotlighted as next-generation technology that surpasses the limitations of text-based language models by learning human speech without text to understand and generate linguistic and non-linguistic information. However, existing models showed significant limitations in generating long-duration content required for podcasts, audiobooks, and voice assistants. Now, a KAIST researcher has succeeded in overcoming these limitations by developing ‘SpeechSSM,’ which enables consistent and natural speech generation without time constraints.

Ph.D. candidate Se Jin Park from Professor Yong Man Ro’s research team in the School of Electrical Engineering has developed ‘SpeechSSM,’ a spoken language model capable of generating long-duration speech.

<Figure 1. Overview of SpeechSSM. The hybrid state-space model of SpeechSSM is trained with a language modeling objective on semantic tokens (USM-v2) that are encoded using overlapping fixed-size windows. The non-autoregressive speech decoder (SoundStorm) converts these overlapping semantic token windows into acoustic codec tokens (SoundStream), conditioned on speaker identity.>

A major advantage of Spoken Language Models (SLMs) is their ability to directly process speech without intermediate text conversion, leveraging the unique acoustic characteristics of human speakers, allowing for the rapid generation of high-quality speech even in large-scale models.

However, existing models faced difficulties in maintaining semantic and speaker consistency for long-duration speech due to increased ‘speech token resolution’ and memory consumption when capturing very detailed information by breaking down speech into fine fragments.

To solve this problem, Se Jin Park developed ‘SpeechSSM,’ a spoken language model using a Hybrid State-Space Model, designed to efficiently process and generate long speech sequences.

This model employs a ‘hybrid structure’ that alternately places ‘attention layers’ focusing on recent information and ‘recurrent layers’ that remember the overall narrative flow (long-term context). This allows the story to flow smoothly without losing coherence even when generating speech for a long time. Furthermore, memory usage and computational load do not increase sharply with input length, enabling stable and efficient learning and the generation of long-duration speech.

SpeechSSM effectively processes unbounded speech sequences by dividing speech data into short, fixed units (windows), processing each unit independently, and then combining them to create long speech.
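The windowing idea can be sketched as follows: split a long token sequence into fixed-size windows that overlap slightly, process each window on its own, and stitch the results back together. This is only an illustration of the general technique; how SpeechSSM actually sizes its windows and reconciles overlaps is not described here, and `split_into_windows`, `merge_windows`, and the drop-the-duplicate merge rule are assumptions.

```python
def split_into_windows(tokens, window, overlap):
    """Split a sequence into windows of `window` tokens sharing `overlap` tokens.

    The final window may be shorter when the length does not divide evenly.
    """
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    step = window - overlap
    windows = []
    start = 0
    while True:
        windows.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break
        start += step
    return windows


def merge_windows(windows, overlap):
    """Rebuild one sequence by dropping the overlapping head of each later window."""
    merged = list(windows[0])
    for w in windows[1:]:
        merged.extend(w[overlap:])
    return merged


tokens = list(range(10))
windows = split_into_windows(tokens, window=4, overlap=1)
print(windows)                      # [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
print(merge_windows(windows, 1))    # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```

Because each window has a bounded size, the memory needed to process any one of them does not grow with the total length of the sequence.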

Additionally, in the speech generation phase, it uses a ‘Non-Autoregressive’ audio synthesis model (SoundStorm), which rapidly generates multiple parts at once instead of slowly creating one character or one word at a time, enabling the fast generation of high-quality speech.

Whereas existing spoken language models were typically evaluated on short speech clips of about 10 seconds, Se Jin Park created new evaluation tasks for long-form speech generation based on a self-built benchmark dataset, ‘LibriSpeech-Long,’ which supports the evaluation of generated speech up to 16 minutes long.

Because PPL (perplexity), an existing evaluation metric for speech models, only indicates grammatical correctness, she proposed new evaluation metrics such as ‘SC-L (semantic coherence over time)’ to assess content coherence over time, and ‘N-MOS-T (naturalness mean opinion score over time)’ to evaluate naturalness over time, enabling more effective and precise evaluation.

< Figure 2. Maximum sequence length considered in various Spoken Language Models (SLMs).
Whereas conventional SLMs have been trained and evaluated on sequences up to 200 seconds in length, SpeechSSM is capable of training and evaluating speech up to 16 minutes. While the proposed model can theoretically generate speech of infinite length with constant memory usage, the experiments were limited to 16 minutes for evaluation purposes.>

Through these new evaluations, it was confirmed that speech generated by SpeechSSM consistently featured the specific individuals mentioned in the initial prompt, and that new characters and events unfolded naturally and in a contextually consistent manner, even over long-duration generation. This contrasts sharply with existing models, which tended to lose their topic and fall into repetition during long-duration generation.

< Figure 3. Semantic similarity between a 10-second prompt and each 100-word segment of 16-minute generated speech, measured using embedding similarity (SC-L). Unlike prior models whose semantic consistency degrades as the length of generated speech increases, SpeechSSM maintains semantic coherence over long durations, exhibiting trends similar to real human speech.>
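The measurement described in the Figure 3 caption, similarity between the prompt and each successive segment, can be sketched with cosine similarity over embedding vectors. The embedding model used for SC-L is not specified here, so the example uses hand-made two-dimensional vectors; `cosine_similarity` and `coherence_over_time` are hypothetical helper names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, nonzero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def coherence_over_time(prompt_embedding, segment_embeddings):
    """Similarity of each generated segment to the prompt, in generation order."""
    return [cosine_similarity(prompt_embedding, seg) for seg in segment_embeddings]


prompt = [1.0, 0.0]                                  # toy prompt embedding
segments = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]      # toy segments drifting off topic
scores = coherence_over_time(prompt, segments)
print([round(s, 3) for s in scores])                 # [1.0, 0.707, 0.0]
```

A curve of such scores that stays flat over time indicates that later segments remain on the prompt's topic, while a falling curve indicates drift.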

Ph.D. candidate Se Jin Park explained, “Existing spoken language models had limitations in long-duration generation, so our goal was to develop a spoken language model capable of generating long-duration speech for actual human use.” She added, “This research achievement is expected to greatly contribute to various types of voice content creation and voice AI fields like voice assistants, by maintaining consistent content in long contexts and responding more efficiently and quickly in real time than existing methods.”

This research, with Se Jin Park as the first author, was conducted in collaboration with Google DeepMind and is scheduled to be presented as an oral presentation at ICML (International Conference on Machine Learning) 2025 on July 16th.

Paper Title: Long-Form Speech Generation with Spoken Language Models
DOI: 10.48550/arXiv.2412.18603

Ph.D. candidate Se Jin Park has demonstrated outstanding research capabilities as a member of Professor Yong Man Ro’s MLLM (multimodal large language model) research team, through her work integrating vision, speech, and language. Her achievements include a spotlight paper presentation at 2024 CVPR (Computer Vision and Pattern Recognition) and an Outstanding Paper Award at 2024 ACL (Association for Computational Linguistics).

<Figure 4. Computational Efficiency of SpeechSSM. (Left) Maximum batch decoding throughput by model and generation length on TPU v5e. (Right) Time taken to decode a single sample (batch size 1) up to the target length on TPU v5e.>

For more information, you can refer to the publication and accompanying demo: SpeechSSM Publications.

SEMINAR & EVENT

Date:

Speaker:

Prof. Marilyn Wolf (University of Nebraska-Lincoln)

Place:

E3-2 Woori Byul Seminar Room(2201)

Date:

Speaker:

Professor Yong-Jin Kim

Place:

School of Electrical Engineering(E3-2), 2219

Date:

Speaker:

Prof. Nam Sung Kim

Place:

Wooribyul Seminar Room 2201, E3-2, KAIST

Date:

Speaker:

Istvan Szerdahelyi (The Ambassador of Hungary to the Republic of Korea)

Place:

KI Building(E4), Matrix Hall(2nd Floor)