{"id":118534,"date":"2021-11-01T03:05:43","date_gmt":"2021-10-31T18:05:43","guid":{"rendered":"http:\/\/175.125.95.178\/ai-in-signal\/18534\/"},"modified":"2026-04-07T10:48:25","modified_gmt":"2026-04-07T01:48:25","slug":"18534","status":"publish","type":"ai-in-signal","link":"http:\/\/ee.presscat.kr\/en\/ai-in-signal\/18534\/","title":{"rendered":"Kiran Ramnath, Leda Sari, Mark Hasegawa-Johnson and Chang Yoo. &quot;Worldly Wise (WoW) &#8211; Cross-Lingual Knowledge Fusion for Fact-based Visual Spoken-Question Answering&quot;, Annual Conference of the North American Chapter of the Association for Computational Li"},"content":{"rendered":"<p style=\"margin-bottom:11px\"><span style=\"font-size:11pt\"><span style=\"line-height:107%\"><span style=\"font-family:Calibri,sans-serif\">Although Question-Answering has long been of research interest, its accessibility to users through a speech interface and its support to multiple languages have not been addressed in prior studies. Towards these ends, we present a new task and a synthetically-generated dataset to do Fact-based Visual Spoken-Question Answering (FVSQA). FVSQA is based on the FVQA dataset, which requires a system to retrieve an entity from Knowledge Graphs (KGs) to answer a question about an image. In FVSQA, the question is spoken rather than typed. Three sub-tasks are proposed: (1) speech-to-text based, (2) end-to-end, without speech-to-text as an intermediate component, and (3) cross-lingual, in which the question is spoken in a language different from that in which the KG is recorded. The end-to-end and cross-lingual tasks are the first to require world knowledge from a multi-relational KG as a differentiable layer in an end-to-end spoken language understanding task, hence the proposed reference implementation is called WorldlyWise (WoW). WoW is shown to perform endto-end cross-lingual FVSQA at same levels of accuracy across 3 languages &#8211; English, Hindi, and Turkish.<\/span><\/span><\/span><\/p>\n<p style=\"margin-bottom:11px\"><span style=\"font-size:11pt\"><span style=\"line-height:107%\"><span style=\"font-family:Calibri,sans-serif\"><\/p>\n<div class=\"\"><img decoding=\"async\" class=\"\" src=\"\/wp-content\/uploads\/drupal\/\uc720\ucc3d\ub3d9\uad50\uc218\ub2d89.png\" alt=\"\" title=\"\"><\/div>\n<p>.<\/span><\/span><\/span><\/p>\n<p style=\"margin-bottom:11px\"><span style=\"font-size:11pt\"><span style=\"line-height:107%\"><span style=\"font-family:Calibri,sans-serif\">Figure 9: Architecture of the proposed method<\/span><\/span><\/span><\/p>\n","protected":false},"excerpt":{"rendered":"<p>677<\/p>\n","protected":false},"featured_media":0,"template":"","class_list":["post-118534","ai-in-signal","type-ai-in-signal","status-publish","hentry"],"acf":[],"_links":{"self":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/ai-in-signal\/118534","targetHints":{"allow":["GET"]}}],"collection":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/ai-in-signal"}],"about":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/types\/ai-in-signal"}],"wp:attachment":[{"href":"http:\/\/ee.presscat.kr\/en\/wp-json\/wp\/v2\/media?parent=118534"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}