{"id":189625,"date":"2025-03-26T17:07:15","date_gmt":"2025-03-26T08:07:15","guid":{"rendered":"http:\/\/ee.presscat.kr\/?post_type=research-achieve&#038;p=189625"},"modified":"2026-04-13T07:00:56","modified_gmt":"2026-04-12T22:00:56","slug":"ee-professor-minsoo-rhus-research-team-develops-a-simulation-framework-called-vtrain","status":"publish","type":"research-achieve","link":"http:\/\/ee.presscat.kr\/en\/research-achieve\/ee-professor-minsoo-rhus-research-team-develops-a-simulation-framework-called-vtrain\/","title":{"rendered":"EE Professor Minsoo Rhu\u2019s research team develops a simulation framework called vTrain"},"content":{"rendered":"<figure id=\"attachment_189621\" aria-describedby=\"caption-attachment-189621\" style=\"width: 911px\" class=\"wp-caption aligncenter\"><img fetchpriority=\"high\" decoding=\"async\" class=\"size-full wp-image-189621\" src=\"http:\/\/ee.presscat.kr\/wp-content\/uploads\/2025\/03\/\uc720\ubbfc\uc218-\uad50\uc218\ub2d8-900_enhancer.jpg\" alt=\"\" width=\"911\" height=\"435\" title=\"\"><figcaption id=\"caption-attachment-189621\" class=\"wp-caption-text\">\u3008 (From left) Professor Minsoo Rhu, Ph.D. candidate Jehyeon Bang, and Dr. Yujeong \u3009<\/figcaption><\/figure>\n<p><span style=\"font-size: 14pt;color: #000000\">Large AI models such as ChatGPT and DeepSeek are gaining attention as they&#8217;re being applied across diverse fields. These large language models (LLMs) require training on massive distributed systems composed of tens of thousands of data center GPUs. For example, the cost of training GPT-4 is estimated at approximately 140 billion won. 
A team of Korean researchers has developed a technology that optimizes parallelization configurations to increase GPU efficiency and significantly reduce training costs.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p class=\"\" data-start=\"59\" data-end=\"370\"><span style=\"font-size: 14pt;color: #000000\">An EE research team led by Professor Minsoo Rhu, in collaboration with the Samsung Advanced Institute of Technology (SAIT), has developed a simulation framework called vTrain, which accurately predicts and optimizes the training time of LLMs in large-scale distributed environments.<\/span><\/p>\n<p data-start=\"59\" data-end=\"370\">\u00a0<\/p>\n<p class=\"\" data-start=\"59\" data-end=\"370\"><span style=\"font-size: 14pt;color: #000000\">To efficiently train LLMs, it&#8217;s crucial to identify the optimal distributed training strategy. However, the vast number of potential strategies makes real-world testing prohibitively expensive and time-consuming. As a result, companies currently rely on a limited number of empirically validated strategies, causing inefficient GPU utilization and unnecessary increases in training costs. The absence of suitable large-scale simulation technology has significantly hindered companies from effectively addressing this issue.<\/span><\/p>\n<p data-start=\"59\" data-end=\"370\">\u00a0<\/p>\n<p><span style=\"font-size: 14pt;color: #000000\">To overcome this limitation, Professor Rhu\u2019s team developed vTrain, which can accurately predict training time and quickly evaluate various parallelization strategies. 
Through experiments conducted in multi-GPU environments, vTrain&#8217;s predictions were compared against actual measured training times, resulting in a mean absolute percentage error (MAPE) of 8.37% on single-node systems and 14.73% on multi-node systems.<\/span><\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_189615\" aria-describedby=\"caption-attachment-189615\" style=\"width: 643px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" class=\"size-full wp-image-189615\" src=\"http:\/\/ee.presscat.kr\/wp-content\/uploads\/2025\/03\/\uadf8\ub9bc-1.-vTrain-\uc2dc\ubbac\ub808\uc774\ud130-\uad6c\uc870-\ubaa8\uc2dd\ub3c4.png\" alt=\"\" width=\"643\" height=\"154\" title=\"\"><figcaption id=\"caption-attachment-189615\" class=\"wp-caption-text\">\u3008 Figure 1. Schematic diagram of the vTrain simulator architecture \u3009<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><span style=\"font-size: 14pt\">In collaboration with SAIT, the team has also released the vTrain framework, along with a dataset of over 1,500 real-world training time measurements, as open-source software (<a href=\"https:\/\/github.com\/VIA-Research\/vTrain\" rel=\"noopener\">https:\/\/github.com\/VIA-Research\/vTrain<\/a>) for free use by AI researchers and companies.<\/span><\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_189617\" aria-describedby=\"caption-attachment-189617\" style=\"width: 646px\" class=\"wp-caption aligncenter\"><img decoding=\"async\" class=\"size-full wp-image-189617\" src=\"http:\/\/ee.presscat.kr\/wp-content\/uploads\/2025\/03\/\uadf8\ub9bc-2.-\ub2e8\uc77c-\ub178\ub4dc-\uc2dc\uc2a4\ud15c\uc88c-\ubc0f-\ub2e4\uc911-\ub178\ub4dc-\uc2dc\uc2a4\ud15c\uc6b0\uc5d0-\ub300\ud55c-\ud559\uc2b5-\uc2dc\uac04-\uce21\uc815\uac12\uacfc-\uc608\uce21\uac12\uc758-\ube44\uad50.png\" alt=\"\" width=\"646\" height=\"195\" title=\"\"><figcaption id=\"caption-attachment-189617\" class=\"wp-caption-text\">\u3008 Figure 2. 
Comparison of measured and predicted training times for single-node (left) and multi-node (right) systems \u3009<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><span style=\"font-size: 14pt;font-family: georgia, palatino, serif;color: #000000\">Professor Rhu commented, \u201cvTrain utilizes a profiling-based simulation approach to explore training strategies that enhance GPU utilization and reduce training costs compared to conventional empirical methods. With the open-source release, companies can now efficiently cut the costs associated with training ultra-large AI models.\u201d<\/span><\/p>\n<p>&nbsp;<\/p>\n<figure id=\"attachment_189619\" aria-describedby=\"caption-attachment-189619\" style=\"width: 514px\" class=\"wp-caption aligncenter\"><img loading=\"lazy\" decoding=\"async\" class=\"size-full wp-image-189619\" src=\"http:\/\/ee.presscat.kr\/wp-content\/uploads\/2025\/03\/\uadf8\ub9bc-3.-\ub2e4\uc591\ud55c-\ubcd1\ub82c\ud654-\uae30\ubc95\uc5d0-\ub530\ub978-MT-NLG-\ud559\uc2b5-\uc2dc\uac04-\ubc0f-GPU-\uc0ac\uc6a9\ub960-\ubcc0\ud654.png\" alt=\"\" width=\"514\" height=\"218\" title=\"\"><figcaption id=\"caption-attachment-189619\" class=\"wp-caption-text\">\u3008 Figure 3. Changes in MT-NLG training time and GPU utilization with various parallelization techniques \u3009<\/figcaption><\/figure>\n<p>&nbsp;<\/p>\n<p><span style=\"font-size: 14pt;color: #000000\">This research, with Ph.D. candidate Jehyeon Bang as the first author, was presented last November at MICRO, the joint International Symposium on Microarchitecture hosted by IEEE and ACM, one of the premier conferences in computer architecture. 
(Paper title: \u201cvTrain: A Simulation Framework for Evaluating Cost-Effective and Compute-Optimal Large Language Model Training\u201d, <a style=\"color: #000000\" href=\"https:\/\/doi.org\/10.1109\/MICRO61859.2024.00021\" rel=\"noopener\">https:\/\/doi.org\/10.1109\/MICRO61859.2024.00021<\/a>)<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"font-size: 14pt;color: #000000\">This work was supported by the Ministry of Science and ICT, the National Research Foundation of Korea, the Information and Communication Technology Promotion Agency, and Samsung Electronics, as part of the SW Star Lab project for the development of core technologies in the SW computing industry.<\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>672<\/p>\n","protected":false},"featured_media":189623,"template":"","research_category":[347],"class_list":["post-189625","research-achieve","type-research-achieve","status-publish","has-post-thumbnail","hentry","research_category-ai-machine-learning-en"],"acf":[],"_links":{"self":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/research-achieve\/189625","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/research-achieve"}],"about":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/types\/research-achieve"}],"wp:featuredmedia":[{"embeddable":true,"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/media\/189623"}],"wp:attachment":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/media?parent=189625"}],"wp:term":[{"taxonomy":"research_category","embeddable":true,"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/research_category?post=189625"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}