VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks

Caren Han; Siqu Long; Siwen Luo; Kunze Wang; Josiah Poon

2020 COLING COLING 2020

VICTR: Visual Information Captured Text Representation for Text-to-Vision Multimodal Tasks

Abstract

AbstractText-to-image multimodal tasks, generating/retrieving an image from a given text description, are extremely challenging tasks since raw text descriptions cover quite limited information in order to fully describe visually realistic images. We propose a new visual contextual text representation for text-to-image multimodal tasks, VICTR, which captures rich visual semantic information of objects from the text input. First, we use the text description as initial input and conduct dependency parsing to extract the syntactic structure and analyse the semantic aspect, including object quantities, to extract the scene graph. Then, we train the extracted objects, attributes, and relations in the scene graph and the corresponding geometric relation information using Graph Convolutional Networks, and it generates text representation which integrates textual and visual semantic information. The text representation is aggregated with word-level and sentence-level embedding to generate both visual contextual word and sentence representation. For the evaluation, we attached VICTR to the state-of-the-art models in text-to-image generation.VICTR is easily added to existing models and improves across both quantitative and qualitative aspects.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — visual semantic information

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Caren Han , Siqu Long , Siwen Luo , Kunze Wang , Josiah Poon

Topics

Machine Learning > Core Methods > Representation Learning Deep Learning > Architectures > Graph Neural Networks Computer Vision > Generation > Image Generation

Keywords

multimodal learning text representation scene graph graph convolutional network visual semantic information

Download PDF

Related papers

Persuasiveness of News Editorials depending on Ideology and Personality 2020

A Graph Representation of Semi-structured Data for Web Question Answering 2020

Span-based Joint Entity and Relation Extraction with Attention-based Span-specific and Contextual Semantic Representations 2020

Hierarchical Chinese Legal event extraction via Pedal Attention Mechanism 2020

End-to-End Emotion-Cause Pair Extraction with Graph Convolutional Network 2020