Modularized Textual Grounding for Counterfactual Resilience

Zhiyuan Fang; Shu Kong; Charless Fowlkes; Yezhou Yang

CVPR 2019

Abstract

Computer vision applications often require a textual grounding module that is precise, interpretable, and resilient to counterfactual inputs and queries. To achieve high grounding precision, current textual grounding methods rely heavily on large-scale training data with manual pixel-level annotations. Such annotations are expensive to obtain and thus severely narrow the models' scope of real-world application. Moreover, most of these methods sacrifice interpretability and generalizability, and they neglect the importance of resilience to counterfactual inputs. To address these issues, we propose a visual grounding system that is 1) end-to-end trainable in a weakly supervised fashion with only image-level annotations, and 2) counterfactually resilient owing to its modular design. Specifically, we decompose textual descriptions into three levels (entity, semantic attribute, and color information) and perform compositional grounding progressively. We validate our model through a series of experiments and demonstrate its improvement over state-of-the-art methods. In particular, our model not only surpasses other weakly supervised and unsupervised methods and approaches the strongly supervised ones, but is also interpretable in its decision making and performs much better than all the others when faced with counterfactual classes.

Topics

Machine Learning > Learning Types > Weakly Supervised Learning
Computer Vision > Processing > Image Segmentation
Natural Language Processing > Applications > Visual Question Answering
Deep Learning > Learning Types > Weakly Supervised Learning
Computer Vision > Analysis > Visual Question Answering

Keywords

weakly supervised learning, visual grounding, textual grounding, counterfactual resilience, compositional grounding
