CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation

Aditya Sanghi; Hang Chu; Joseph G. Lambourne; Ye Wang; Chin-Yi Cheng; Marco Fumero; Kamal Rahimi Malekshan

2022 CVPR CVPR 2022

CLIP-Forge: Towards Zero-Shot Text-To-Shape Generation

Abstract

Generating shapes using natural language can enable new ways of imagining and creating the things around us. While significant recent progress has been made in text-to-image generation, text-to-shape generation remains a challenging problem due to the unavailability of paired text and shape data at a large scale. We present a simple yet effective method for zero-shot text-to-shape generation that circumvents such data scarcity. Our proposed method, named CLIP-Forge, is based on a two-stage training process, which only depends on an unlabelled shape dataset and a pre-trained image-text network such as CLIP. Our method has the benefits of avoiding expensive inference time optimization, as well as the ability to generate multiple shapes for a given text. We not only demonstrate promising zero-shot generalization of the CLIP-Forge model qualitatively and quantitatively, but also provide extensive comparative evaluations to better understand its behavior.

🌉 Interdisciplinary Bridge — Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — text-to-shape generation

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Aditya Sanghi , Hang Chu , Joseph G. Lambourne , Ye Wang , Chin-Yi Cheng , Marco Fumero , Kamal Rahimi Malekshan

Topics

Machine Learning > Learning Types > Zero-Shot Learning Deep Learning > Models > Generative Models Computer Vision > Analysis > 3D Vision Computer Vision > Generation > Image Generation Deep Learning > Learning Types > Representation Learning Deep Learning > Learning Types > Zero-Shot Learning Deep Learning > Models > Vision-Language Models

Keywords

zero-shot learning latent representation clip model latent embedding text-to-shape generation

Download PDF

Related papers

UniCoRN: A Unified Conditional Image Repainting Network 2022

Why Discard if You Can Recycle?: A Recycling Max Pooling Module for 3D Point Cloud Analysis 2022

All-in-One Image Restoration for Unknown Corruption 2022

Stability-Driven Contact Reconstruction From Monocular Color Images 2022

Forecasting Characteristic 3D Poses of Human Actions 2022