{"id":165792,"date":"2024-06-20T10:12:40","date_gmt":"2024-06-20T01:12:40","guid":{"rendered":"http:\/\/ee.presscat.kr\/?post_type=research-achieve&#038;p=165792"},"modified":"2026-04-07T05:37:57","modified_gmt":"2026-04-06T20:37:57","slug":"professor-yongman-ros-research-team-develops-a-multimodal-large-language-model-that-surpasses-the-performance-of-gpt-4v","status":"publish","type":"research-achieve","link":"http:\/\/ee.presscat.kr\/en\/research-achieve\/professor-yongman-ros-research-team-develops-a-multimodal-large-language-model-that-surpasses-the-performance-of-gpt-4v\/","title":{"rendered":"Professor YongMan Ro&#8217;s research team develops a multimodal large language model that surpasses the performance of GPT-4V"},"content":{"rendered":"<p><span style=\"color: #000000\"><strong>Professor YongMan Ro&#8217;s research team develops a multimodal large language model that surpasses the performance of GPT-4V<\/strong><\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #000000\"><img fetchpriority=\"high\" decoding=\"async\" class=\"alignnone size-full wp-image-165793\" src=\"http:\/\/ee.presscat.kr\/wp-content\/uploads\/2024\/06\/Inline-image-2024-05-31-15.36.55.915.jpg\" alt=\"\" width=\"686\" height=\"227\" title=\"\"><\/span><\/p>\n<div><span style=\"color: #000000\">&lt;(From left) Professor YongMan Ro, ph.d. candidate ByungKwan Lee, ph.d. candidate Beomchan Park(integrated), ph.d. candidate Chae Won Kim&gt;<\/span><\/div>\n<div>\u00a0<\/div>\n<p><span style=\"color: #000000\">On \u00a0June 20, 2024, Professor YongMan Ro&#8217;s research team announced that \u00a0they have developed and released an open-source multimodal large \u00a0language model that surpasses the visual performance of closed \u00a0commercial models like OpenAI&#8217;s ChatGPT\/GPT-4V and Google&#8217;s Gemini-Pro.\u00a0A \u00a0multimodal large language model refers to a massive language model \u00a0capable of processing not only text but also image data types.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #000000\">The \u00a0recent advancement of large language models (LLMs) and the emergence of \u00a0visual instruction tuning have brought\u00a0significant attention to \u00a0multimodal large language models. However, due to the support of \u00a0abundant computing resources by large overseas corporations, very large \u00a0models with parameters similar to the number of neural networks in the \u00a0human brain are being created. <\/span><\/p>\n<p><span style=\"color: #000000\">These models are all developed in \u00a0private, leading to an ever-widening performance and technology gap \u00a0compared to large language models developed at the academic level. In \u00a0other words, the open-source large language models developed so far have \u00a0not only failed to match the performance of closed large language \u00a0models like ChatGPT\/GPT-4V and Gemini-Pro, but also show a significant \u00a0performance gap.<\/span><\/p>\n<p>&nbsp;<\/p>\n<p><span style=\"color: #000000\">To \u00a0improve the performance of multimodal large language models, existing \u00a0open-source large language models have either increased the model size \u00a0to enhance learning capacity or expanded the quality of visual \u00a0instruction tuning datasets that handle various vision language tasks. 
However, these methods demand vast computational resources or intensive human labor, which highlights the need for new, efficient ways to improve the performance of multimodal large language models.

Professor YongMan Ro's research team has announced two technologies that significantly enhance the visual performance of multimodal large language models without substantially increasing model size and without constructing new high-quality visual instruction tuning datasets.

With the first technology, CoLLaVO, the team verified that the primary reason existing open-source multimodal large language models perform far below closed models is a markedly weaker capability in object-level image understanding. They further showed that this object-level image understanding ability correlates decisively with a model's ability to handle vision-language tasks.

[Figure - Crayon Prompt Training Methodology]

To strengthen this capability efficiently and improve performance on vision-language tasks, the team introduced a new visual prompt called the Crayon Prompt. It leverages a computer vision model known as panoptic segmentation to partition the image into background and object regions, and each segmented piece of information is fed directly into the multimodal large language model as input.

In addition, to ensure that the information learned through the Crayon Prompt is not lost during the visual instruction tuning phase, the team proposed a training strategy called Dual QLoRA. It trains object-level image understanding and vision-language task processing with separate sets of parameters, preventing the two capabilities from overwriting each other.
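As a rough picture of this pipeline, the sketch below serializes panoptic segmentation output into an object-level prompt and notes where Dual QLoRA would fit. It is a conceptual sketch only: all names are hypothetical, and rendering the segments as plain text is a simplification for illustration, not CoLLaVO's actual implementation.

```python
# Conceptual sketch of the Crayon Prompt idea (hypothetical names, not
# CoLLaVO's released code). A panoptic segmentation model partitions the
# image into object ("thing") and background ("stuff") segments; the
# Crayon Prompt passes that object-level information to the multimodal
# LLM together with the usual image features and the user's question.

from dataclasses import dataclass

@dataclass
class Segment:
    label: str       # class name, e.g. "dog" or "sky"
    is_object: bool  # True for an object, False for background

def crayon_prompt(segments: list[Segment]) -> str:
    """Serialize panoptic segments into an object-level prompt."""
    objects = [s.label for s in segments if s.is_object]
    background = [s.label for s in segments if not s.is_object]
    return (f"Objects: {', '.join(objects) or 'none'}. "
            f"Background: {', '.join(background) or 'none'}.")

if __name__ == "__main__":
    segments = [Segment("dog", True), Segment("frisbee", True),
                Segment("grass", False), Segment("sky", False)]
    # The prompt is combined with the user's question before it reaches
    # the multimodal large language model.
    print(crayon_prompt(segments))
    # Objects: dog, frisbee. Background: grass, sky.
```

In this picture, Dual QLoRA would correspond to keeping two separate sets of low-rank (QLoRA) adapters on a frozen backbone, one updated while learning object-level understanding and the other during visual instruction tuning, so that neither skill overwrites the other.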
As a result, the CoLLaVO multimodal large language model distinguishes background from objects within images far better than prior open-source models, lifting its visual discrimination ability to another level.

[Figure - CoLLaVO Multimodal LLM Performance Evaluation]

Following CoLLaVO, Professor YongMan Ro's research team developed and released a second multimodal large language model, MoAI. It is inspired by the elements of cognitive science that humans use to judge a scene: recognizing the presence, state, and interactions of objects, understanding the background, and reading text.

The team pointed out that existing multimodal large language models rely on vision encoders that are semantically aligned with text, leaving them short of detailed, comprehensive real-world scene understanding at the pixel level.

To incorporate these cognitive elements into a multimodal large language model, MoAI employs four computer vision models: panoptic segmentation, open-world object detection (which places no limit on which objects can be detected), scene graph generation, and optical character recognition (OCR). The outputs of these four models are translated into human-readable language and used directly as input to the multimodal large language model.
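This verbalization step can be pictured as follows: structured outputs from the four vision models are rendered as plain sentences that the language model receives as auxiliary input. The function below is an illustrative sketch with hypothetical names and input formats, not MoAI's released code.

```python
# Illustrative sketch of MoAI's verbalization step (hypothetical names
# and input formats, not the released MoAI code). Outputs of four
# computer vision models (panoptic segmentation, open-world object
# detection, scene graph generation, and OCR) are rendered as sentences
# for the multimodal LLM.

def verbalize(segments, detections, relations, ocr_text):
    """Render the four CV model outputs as human-readable sentences."""
    lines = []
    if segments:
        lines.append("Scene regions: " + ", ".join(segments) + ".")
    if detections:
        lines.append("Detected objects: " + ", ".join(detections) + ".")
    if relations:
        lines.append("Relations: " +
                     "; ".join(f"{s} {r} {o}" for s, r, o in relations) + ".")
    if ocr_text:
        lines.append(f'Text in the image: "{ocr_text}".')
    return "\n".join(lines)

if __name__ == "__main__":
    auxiliary = verbalize(
        segments=["road", "sidewalk", "sky"],
        detections=["person", "bicycle", "helmet"],
        relations=[("person", "riding", "bicycle")],
        ocr_text="STOP",
    )
    # 'auxiliary' is supplied to the multimodal LLM alongside the image
    # and the user's question.
    print(auxiliary)
```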
By combining CoLLaVO's simple and efficient Crayon Prompt plus Dual QLoRA approach with MoAI's array of computer vision models, the research team verified that their models outperform closed commercial models such as OpenAI's ChatGPT/GPT-4V and Google's Gemini-Pro.

[Figure - MoAI Multimodal LLM Performance Evaluation]

The two multimodal large language models, CoLLaVO and MoAI, were developed with ByungKwan Lee (Ph.D. student) as first author and Beomchan Park (integrated master's and Ph.D. student) and Chae Won Kim (Ph.D. student) as co-authors.

CoLLaVO was accepted on May 16, 2024, to Findings of the Association for Computational Linguistics (ACL Findings) 2024, a prestigious international venue in natural language processing (NLP). MoAI is currently under review at the European Conference on Computer Vision (ECCV) 2024, a top international conference in computer vision.

Professor YongMan Ro stated, "The open-source multimodal large language models developed by our research team, CoLLaVO and MoAI, have been featured on Hugging Face Daily Papers and are being recognized by researchers worldwide through various social media platforms. Since all the models have been released as open source, they will contribute to the advancement of multimodal large language models."

This research was conducted at the Future Defense Artificial Intelligence Specialization Research Center and the School of Electrical Engineering of the Korea Advanced Institute of Science and Technology (KAIST).

[1] CoLLaVO demo GIF video clip: https://github.com/ByungKwanLee/CoLLaVO

< CoLLaVO Demo GIF >

[2] MoAI demo GIF video clip: https://github.com/ByungKwanLee/MoAI

< MoAI Demo GIF >