Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Sixian Zhang; Bohan Wang; Junqiang Wu; Yan Li; Tingting Gao; Di Zhang; Zhongyuan Wang

2024 CVPR CVPR 2024

Learning Multi-Dimensional Human Preference for Text-to-Image Generation

Abstract

Current metrics for text-to-image models typically rely on statistical metrics which inadequately represent the real preference of humans. Although recent work attempts to learn these preferences via human annotated images they reduce the rich tapestry of human preference to a single overall score. However the preference results vary when humans evaluate images with different aspects. Therefore to learn the multi-dimensional human preferences we propose the Multi-dimensional Preference Score (MPS) the first multi-dimensional preference scoring model for the evaluation of text-to-image models. The MPS introduces the preference condition module upon CLIP model to learn these diverse preferences. It is trained based on our Multi-dimensional Human Preference (MHP) Dataset which comprises 918315 human preference choices across four dimensions (i.e. aesthetics semantic alignment detail quality and overall assessment) on 607541 images. The images are generated by a wide range of latest text-to-image models. The MPS outperforms existing scoring methods across 3 datasets in 4 dimensions enabling it a promising metric for evaluating and improving text-to-image generation. The model and dataset will be made publicly available to facilitate future research.

🌉 Interdisciplinary Bridge — Artificial Intelligence and Computer Vision and Deep Learning and Machine Learning

🧭 Keyword Pioneer — multi-dimensional preference

🐣 Hot Topic Early Bird — image quality assessment

🐝 Cross-Pollinator — Artificial Intelligence, Computer Science, Computer Vision, Data Science & Analytics, Deep Learning, Healthcare & Medicine, Interdisciplinary, Knowledge & Reasoning, Machine Learning, Mathematics & Optimization, Natural Language Processing, Reinforcement Learning, Robotics, Security & Privacy, Speech & Audio

Authors

Sixian Zhang , Bohan Wang , Junqiang Wu , Yan Li , Tingting Gao , Di Zhang , Zhongyuan Wang

Topics

Artificial Intelligence > Core AI > Multimodal Learning Deep Learning > Models > Generative Models Computer Vision > Generation > Image Generation Machine Learning > Learning Types > Representation Learning Machine Learning > Application Areas > Recommender Systems Machine Learning > Learning Types > Preference Learning Deep Learning > Models > Vision-Language Models

Keywords

preference learning text-to-image generation semantic alignment image quality assessment clip model human preference aesthetic evaluation multi-dimensional preference preference scoring

Download PDF

Related papers

DUSt3R: Geometric 3D Vision Made Easy 2024

Bezier Everywhere All at Once: Learning Drivable Lanes as Bezier Graphs 2024

NeRFDeformer: NeRF Transformation from a Single View via 3D Scene Flows 2024

Unleashing Unlabeled Data: A Paradigm for Cross-View Geo-Localization 2024

DIMAT: Decentralized Iterative Merging-And-Training for Deep Learning Models 2024