Kai Yu

120 papers · 2006–2026 · 13 top CS/AI conferences

Achievements

🏃 Academic Marathon (19) 🧭 Keyword Pioneer 🌍 Conference Polyglot (13) 🗺️ Taxonomy Completionist (38) 🐣 Hot Topic Early Bird 🌈 Renaissance Researcher (9) 🌉 Interdisciplinary Bridge 🐝 Cross-Pollinator (7) 🏠 Conference Loyalist (36) 🌟 Keyword Trendsetter Combo (5) 🔬 Deep Specialist (20) 🌱 Topic Pioneer 🏆 Keyword Champion 🧬 Topic Evolution 👥 Mega-Team (23) 🤝 Dynamic Duo (48) 📈 Trend Setter 🚀 Conference Pioneer 🔥 Unstoppable (14) ❓ The Questioner (4) ⚡ Prolific Year (17) 💎 Century Club (116) 🗃️ Keyword Collector (116)

Conferences

INTERSPEECH (36) ACL (18) EMNLP (17) NIPS (12) AAAI (9) COLING (7) NAACL (5) ICCV (4) ICML (4) IJCNLP (3) CVPR (2) EACL (2) MICCAI (1)

Top co-authors

Lu Chen (48) Su Zhu (19) Ruisheng Cao (18) Xie Chen (13) Hongshen Xu (11) Yanmin Qian (11) Zhi Chen (10) Shuai Wang (10) Mengyue Wu (9) Yanbin Zhao (7)

Keywords

large language model (15) semantic parsing (9) speech synthesis (7) domain adaptation (6) data augmentation (6) speaker verification (5) automatic speech recognition (5) speaker embedding (5) knowledge distillation (5) graph neural network (5) long short-term memory (4) model compression (4) vector quantization (4) connectionist temporal classification (4) text-to-speech synthesis (4) speech recognition (4) dialogue state tracking (4) transfer learning (4) unsupervised learning (4) semi-supervised learning (4)

Papers

MergeDNA: Context-Aware Genome Modeling with Dynamic Tokenization Through Token Merging AAAI 2026

BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction EACL 2026

AHAMask: Reliable Task Specification for Large Audio Language Models Without Instructions AAAI 2026

Phased One-Step Adversarial Equilibrium for Video Diffusion Models AAAI 2026

Alignment for Efficient Tool Calling of Large Language Models EMNLP 2025

When Long Helps Short: How Context Length in Supervised Fine-tuning Affects Behavior of Large Language Models EMNLP 2025

MobA: Multifaceted Memory-Enhanced Adaptive Planning for Efficient Mobile Task Automation NAACL 2025

Converging to a Lingua Franca: Evolution of Linguistic Regions and Semantics Alignment in Multilingual Large Language Models COLING 2025

GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement ACL 2025

NeuSym-RAG: Hybrid Neural Symbolic Retrieval with Multiview Structuring for PDF Question Answering ACL 2025

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching ACL 2025

SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training ACL 2025

From Generalist to Specialist: A Survey of Large Language Models for Chemistry COLING 2025

VQTalker: Towards Multilingual Talking Avatars Through Facial Motion Tokenization AAAI 2025

Reducing Tool Hallucination via Reliability Alignment ICML 2025

ChatCite: LLM Agent with Human Workflow Guidance for Comparative Literature Summary COLING 2025

Heads up! Large Language Models Can Perform Tasks Without Your Instruction via Selective Attention Head Masking ICML 2025

Bitrate-Controlled Diffusion for Disentangling Motion and Content in Video ICCV 2025

URO-Bench: Towards Comprehensive Evaluation for End-to-End Spoken Dialogue Models EMNLP 2025

Enhancing Speech-to-Speech Dialogue Modeling with End-to-End Retrieval-Augmented Generation EMNLP 2025

AlignSum: Data Pyramid Hierarchical Fine-tuning for Aligning with Human Summarization Preference EMNLP 2024

Multilingual Brain Surgeon: Large Language Models Can Be Compressed Leaving No Language behind COLING 2024

SPEADO: Segmentation and Punctuation for Ancient Chinese Texts via Example Augmentation and Decoding Optimization COLING 2024

DiffusionGAN3D: Boosting Text-guided 3D Generation and Domain Adaptation by Combining 3D GANs and Diffusion Priors CVPR 2024

UrFound: Towards Universal Retinal Foundation Models via Knowledge-Guided Masked Modeling MICCAI 2024

DiveSound: LLM-Assisted Automatic Taxonomy Construction for Diverse Audio Generation INTERSPEECH 2024

Spider2-V: How Far Are Multimodal Agents From Automating Data Science and Engineering Workflows? NIPS 2024

On the Effectiveness of Acoustic BPE in Decoder-Only TTS INTERSPEECH 2024

Text-aware Speech Separation for Multi-talker Keyword Spotting INTERSPEECH 2024

FakeSound: Deepfake General Audio Detection INTERSPEECH 2024

UniCATS: A Unified Context-Aware Text-to-Speech Framework with Contextual VQ-Diffusion and Vocoding AAAI 2024

SciEval: A Multi-Level Large Language Model Evaluation Benchmark for Scientific Research AAAI 2024

Evolving Subnetwork Training for Large Language Models ICML 2024

CoE-SQL: In-Context Learning for Multi-Turn Text-to-SQL with Chain-of-Editions NAACL 2024

IBSEN: Director-Actor Agent Collaboration for Controllable and Interactive Drama Script Generation ACL 2024

Sparsity-Accelerated Training for Large Language Models ACL 2024

Is LLM a Reliable Reviewer? A Comprehensive Evaluation of LLM on Automatic Paper Reviewing Tasks COLING 2024

UnSE: Unsupervised Speech Enhancement Using Optimal Transport INTERSPEECH 2023

PointGPT: Auto-regressively Generative Pre-training from Point Clouds NIPS 2023

Large Language Models Are Semi-Parametric Reinforcement Learning Agents NIPS 2023

TeCS: A Dataset and Benchmark for Tense Consistency of Machine Translation ACL 2023

SPM: A Split-Parsing Method for Joint Multi-Intent Detection and Slot Filling ACL 2023

Exploring Schema Generalizability of Text-to-SQL ACL 2023

CSS: A Large-scale Cross-schema Chinese Text-to-SQL Medical Dataset ACL 2023

ACT-SQL: In-Context Learning for Text-to-SQL with Automatically-Generated Chain-of-Thought EMNLP 2023

Towards Instance-adaptive Inference for Federated Learning ICCV 2023

Diverse Data Augmentation with Diffusions for Effective Test-time Prompt Tuning ICCV 2023

DSE-TTS: Dual Speaker Embedding for Cross-Lingual Text-to-Speech INTERSPEECH 2023

Improving Code-Switching and Name Entity Recognition in ASR with Speech Editing based Data Augmentation INTERSPEECH 2023

How ChatGPT is Robust for Spoken Language Understanding? INTERSPEECH 2023

ReCLR: Reference-Enhanced Contrastive Learning of Audio Representation for Depression Detection INTERSPEECH 2023

Enhance Temporal Relations in Audio Captioning with Sound Event Detection INTERSPEECH 2023

AdapterShare: Task Correlation Modeling with Adapter Differentiation EMNLP 2022

TIE: Topological Information Enhanced Structural Reading Comprehension on Web Pages NAACL 2022

The AISP-SJTU Translation System for WMT 2022 EMNLP 2022

VQTTS: High-Fidelity Text-to-Speech Synthesis with Self-Supervised VQ Acoustic Feature INTERSPEECH 2022

MSDWild: Multi-modal Speaker Diarization Dataset in the Wild INTERSPEECH 2022

The AISP-SJTU Simultaneous Translation System for IWSLT 2022 ACL 2022

D4: a Chinese Dialogue Dataset for Depression-Diagnosis-Oriented Chat EMNLP 2022

Efficient Speech Enhancement with Neural Homomorphic Synthesis INTERSPEECH 2022

META-GUI: Towards Multi-modal Conversational Agents on Mobile GUI EMNLP 2022

WebSRC: A Dataset for Web-Based Structural Reading Comprehension EMNLP 2021

LET: Linguistic Knowledge Enhanced Graph Transformer for Chinese Short Text Matching AAAI 2021

Glyph Enhanced Chinese Character Pre-Training for Lexical Sememe Prediction EMNLP 2021

Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL IJCNLP 2021

Rich Prosody Diversity Modelling with Phone-Level Mixture Density Network INTERSPEECH 2021

Class-Based Neural Network Language Model for Second-Pass Rescoring in ASR INTERSPEECH 2021

A Lightweight Framework for Online Voice Activity Detection in the Wild INTERSPEECH 2021

Decoupled Dialogue Modeling and Semantic Parsing for Multi-Turn Text-to-SQL ACL 2021

LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations ACL 2021

LGESQL: Line Graph Enhanced Text-to-SQL Model with Mixed Local and Non-Local Relations IJCNLP 2021

ShadowGNN: Graph Projection Neural Network for Text-to-SQL Parser NAACL 2021

Dual-Adversarial Domain Adaptation for Generalized Replay Attack Detection INTERSPEECH 2020

Neural Homomorphic Vocoder INTERSPEECH 2020

Unsupervised Dual Paraphrasing for Two-stage Semantic Parsing ACL 2020

Efficient Context and Schema Fusion Networks for Multi-Domain Dialogue State Tracking EMNLP 2020

Schema-Guided Multi-Domain Dialogue State Tracking with Graph Attention Neural Networks AAAI 2020

Neural Graph Matching Networks for Chinese Short Text Matching ACL 2020

Semi-Supervised Text Simplification with Back-Translation and Asymmetric Denoising Autoencoders AAAI 2020

Line Graph Enhanced AMR-to-Text Generation with Mix-Order Graph Attention Networks ACL 2020

Voice Activity Detection in the Wild via Weakly Supervised Sound Event Detection INTERSPEECH 2020

Jointly Encoding Word Confusion Network and Dialogue Context with BERT for Spoken Language Understanding INTERSPEECH 2020

On the Usage of Phonetic Information for Text-Independent Speaker Embedding Extraction INTERSPEECH 2019

Semantic Parsing with Dual Learning ACL 2019

Data Augmentation with Atomic Templates for Spoken Language Understanding IJCNLP 2019

The SJTU Robust Anti-Spoofing System for the ASVspoof 2019 Challenge INTERSPEECH 2019

Data Augmentation Using Variational Autoencoder for Embedding Based Speaker Verification INTERSPEECH 2019

Joint Decoding of CTC Based Systems for Speech Recognition INTERSPEECH 2019

Cross-Domain Replay Spoofing Attack Detection Using Domain Adversarial Training INTERSPEECH 2019

Data Augmentation with Atomic Templates for Spoken Language Understanding EMNLP 2019

Binarized LSTM Language Model NAACL 2018

Knowledge Distillation for Sequence Model INTERSPEECH 2018

Towards Universal Dialogue State Tracking EMNLP 2018

Angular Softmax for Short-Duration Text-independent Speaker Verification INTERSPEECH 2018

Structured Dialogue Policy with Graph Neural Networks COLING 2018

High-quality Voice Conversion Using Spectrogram-Based WaveNet Vocoder INTERSPEECH 2018

Structured Word Embedding for Low Memory Neural Network Language Model INTERSPEECH 2018

On-line Dialogue Policy Learning with Companion Teaching EACL 2017

What Does the Speaker Embedding Encode? INTERSPEECH 2017

Comparison of Modeling Target in LSTM-RNN Duration Model INTERSPEECH 2017

Discrete Duration Model for Speech Synthesis INTERSPEECH 2017

Binary Deep Neural Networks for Speech Recognition INTERSPEECH 2017

Agent-Aware Dropout DQN for Safe and Efficient On-line Dialogue Policy Learning EMNLP 2017

Affordable On-line Dialogue Policy Learning EMNLP 2017

Hybrid Dialogue State Tracking for Real World Human-to-Human Dialogues INTERSPEECH 2016

Unrestricted Vocabulary Keyword Spotting Using LSTM-CTC INTERSPEECH 2016

Phone Synchronous Decoding with CTC Lattice INTERSPEECH 2016

Text Flow: A Unified Text Detection System in Natural Scene Images ICCV 2015

Deep Multiple Instance Learning for Image Classification and Auto-Annotation CVPR 2015

Communication Efficient Distributed Machine Learning with the Parameter Server NIPS 2014

Smooth Sparse Coding via Marginal Regression for Learning Sparse Representations ICML 2013

Deep Learning of Invariant Features via Simulated Fixations in Video NIPS 2012

Deep Coding Network NIPS 2010

Phrase-Based Statistical Language Generation Using Graphical Models and Active Learning ACL 2010

Nonlinear Learning using Local Coordinate Coding NIPS 2009

Stochastic Relational Models for Large-scale Dyadic Data using MCMC NIPS 2008

Deep Learning with Kernel Regularization for Visual Recognition NIPS 2008

Predictive Matrix-Variate t Models NIPS 2007

Gaussian Process Models for Link Analysis and Transfer Learning NIPS 2007

Stochastic Relational Models for Discriminative Link Prediction NIPS 2006