Shinji Watanabe

182 papers · 2013–2026 · 11 top CS/AI conferences

Achievements

🏃 Academic Marathon (12) 🐝 Cross-Pollinator (9) 🌉 Interdisciplinary Bridge 🗺️ Taxonomy Completionist (39) 🌍 Conference Polyglot (11) 🧭 Keyword Pioneer 🐣 Hot Topic Early Bird 🌈 Renaissance Researcher (5) 🌟 Keyword Trendsetter Combo (8) 🏠 Conference Loyalist (22) 👑 Domain Dominant (51) 🤝 Dynamic Duo (35) 👑 Triple Crown 🌱 Topic Pioneer 🔬 Deep Specialist (27) 🧬 Topic Evolution 🏆 Keyword Champion (6) 🏆 Grand Slam 👥 Mega-Team (76) 💎 Century Club (180) 🚀 Conference Pioneer 🔥 Unstoppable (10) ❓ The Questioner (3) ⚡ Prolific Year (31) 🗃️ Keyword Collector (199) 📈 Trend Setter

Conferences

INTERSPEECH (120) ACL (22) NAACL (12) EMNLP (6) EACL (4) ICLR (4) ICML (4) AAAI (3) IJCNLP (3) IJCAI (2) NIPS (2)

Top co-authors

Jiatong Shi (36) Brian Yan (27) Siddhant Arora (25) Yifan Peng (23) Xuankai Chang (23) William Chen (19) Jinchuan Tian (14) Karen Livescu (13) Siddharth Dalmia (12) Emiru Tsunoo (10)

Research topics

Speech & Audio (1) Processing (1)

Keywords

automatic speech recognition (51) speech recognition (30) self-supervised learning (22) end-to-end speech recognition (21) speech translation (21) speech enhancement (16) end-to-end model (16) spoken language understanding (15) connectionist temporal classification (12) beam search (10) attention mechanism (9) end-to-end learning (9) speech processing (9) neural network (9) speaker diarization (8) speech separation (8) speech synthesis (8) language model (8) data augmentation (7) transfer learning (7)

Papers

CSPB: Conversational Speech Processing Benchmark for Self-supervised Speech Models EACL 2026

BSCodec: A Band-Split Neural Codec for High-Quality Universal Audio Reconstruction EACL 2026

Enhancing Audiovisual Speech Recognition Through Bifocal Preference Optimization AAAI 2025

ESPnet-SDS: Unified Toolkit and Demo for Spoken Dialogue Systems NAACL 2025

VERSA: A Versatile Evaluation Toolkit for Speech, Audio, and Music NAACL 2025

ESPnet-SpeechLM: An Open Speech Language Model Toolkit NAACL 2025

VoiceTextBlender: Augmenting Large Language Models with Speech Capabilities via Single-Stage Joint Speech-Text Supervised Fine-Tuning NAACL 2025

Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment NAACL 2025

OWLS: Scaling Laws for Multilingual Speech Recognition and Translation Models ICML 2025

Dynamic-SUPERB Phase-2: A Collaboratively Expanding Benchmark for Measuring the Capabilities of Spoken Language Models with 180 Tasks ICLR 2025

Talking Turns: Benchmarking Audio Foundation Models on Turn-Taking Dynamics ICLR 2025

Summarizing Speech: A Comprehensive Survey EMNLP 2025

Context-aware Dynamic Pruning for Speech Foundation Models ICLR 2025

SpeechIQ: Speech-Agentic Intelligence Quotient Across Cognitive Levels in Voice Understanding by Large Language Models ACL 2025

Cross-Talk Reduction IJCAI 2024

AudioGPT: Understanding and Generating Speech, Music, Sound, and Talking Head AAAI 2024

Wav2Gloss: Generating Interlinear Glossed Text from Speech ACL 2024

OWSM-CTC: An Open Encoder-Only Speech Foundation Model for Speech Recognition, Translation, and Language Identification ACL 2024

On the Evaluation of Speech Foundation Models for Spoken Language Understanding ACL 2024

Findings of the IWSLT 2024 Evaluation Campaign ACL 2024

CMU’s IWSLT 2024 Simultaneous Speech Translation System ACL 2024

CMU’s IWSLT 2024 Offline Speech Translation System: A Cascaded Approach For Long-Form Robustness ACL 2024

Towards Robust Speech Representation Learning for Thousands of Languages EMNLP 2024

FastAdaSP: Multitask-Adapted Efficient Inference for Large Speech Language Model EMNLP 2024

Multi-Convformer: Extending Conformer with Multiple Convolution Kernels INTERSPEECH 2024

OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer INTERSPEECH 2024

Neural Blind Source Separation and Diarization for Distant Speech Recognition INTERSPEECH 2024

Finding Task-specific Subnetworks in Multi-task Spoken Language Understanding Model INTERSPEECH 2024

ML-SUPERB 2.0: Benchmarking Multilingual Speech Models Across Modeling Constraints, Languages, and Datasets INTERSPEECH 2024

Beyond Performance Plateaus: A Comprehensive Study on Scalability in Speech Enhancement INTERSPEECH 2024

Singing Voice Data Scaling-up: An Introduction to ACE-Opencpop and ACE-KiSing INTERSPEECH 2024

The Interspeech 2024 Challenge on Speech Processing Using Discrete Units INTERSPEECH 2024

MMM: Multi-Layer Multi-Residual Multi-Stream Discrete Speech Representation from Self-supervised Learning Model INTERSPEECH 2024

Convolution-Augmented Parameter-Efficient Fine-Tuning for Speech Recognition INTERSPEECH 2024

Self-training ASR Guided by Unsupervised ASR Teacher INTERSPEECH 2024

Rapid Language Adaptation for Multilingual E2E Speech Recognition Using Encoder Prompting INTERSPEECH 2024

To what extent can ASV systems naturally defend against spoofing attacks? INTERSPEECH 2024

Contextualized End-to-end Automatic Speech Recognition with Intermediate Biasing Loss INTERSPEECH 2024

On the Effects of Heterogeneous Data Sources on Speech-to-Text Foundation Models INTERSPEECH 2024

EFFUSE: Efficient Self-Supervised Feature Fusion for E2E ASR in Low Resource and Multilingual Scenarios INTERSPEECH 2024

DiscreteSLU: A Large Language Model with Self-Supervised Discrete Speech Units for Spoken Language Understanding INTERSPEECH 2024

ESPnet-SPK: full pipeline speaker embedding toolkit with reproducible recipes, self-supervised front-ends, and off-the-shelf models INTERSPEECH 2024

Decoder-only Architecture for Streaming End-to-end Speech Recognition INTERSPEECH 2024

Self-Supervised Speech Representations are More Phonetic than Semantic INTERSPEECH 2024

Can you Remove the Downstream Model for Speaker Recognition with Self-Supervised Speech Features? INTERSPEECH 2024

URGENT Challenge: Universality, Robustness, and Generalizability For Speech Enhancement INTERSPEECH 2024

EARS: An Anechoic Fullband Speech Dataset Benchmarked for Speech Enhancement and Dereverberation INTERSPEECH 2024

SpeechBERTScore: Reference-Aware Automatic Evaluation of Speech Generation Leveraging NLP Evaluation Metrics INTERSPEECH 2024

UniverSLU: Universal Spoken Language Understanding for Diverse Tasks with Natural Language Instructions NAACL 2024

BASS: Block-wise Adaptation for Speech Summarization INTERSPEECH 2023

Exploration of Efficient End-to-End ASR using Discretized Input from Self-Supervised Learning INTERSPEECH 2023

Integration of Frame- and Label-synchronous Beam Search for Streaming Encoder-decoder Speech Recognition INTERSPEECH 2023

ML-SUPERB: Multilingual Speech Universal PERformance Benchmark INTERSPEECH 2023

Integrating Pretrained ASR and LM to Perform Sequence Generation for Spoken Language Understanding INTERSPEECH 2023

Tensor decomposition for minimization of E2E SLU model toward on-device processing INTERSPEECH 2023

Prompting the Hidden Talent of Web-Scale Speech Models for Zero-Shot Task Generalization INTERSPEECH 2023

DPHuBERT: Joint Distillation and Pruning of Self-Supervised Speech Models INTERSPEECH 2023

Time-synchronous one-pass Beam Search for Parallel Online and Offline Transducers with Dynamic Block Training INTERSPEECH 2023

CTC Alignments Improve Autoregressive Translation EACL 2023

Reducing Barriers to Self-Supervised Learning: HuBERT Pre-training with Academic Compute INTERSPEECH 2023

A Vector Quantized Approach for Text to Speech Synthesis on Real-World Spontaneous Speech AAAI 2023

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units ACL 2023

Incremental Blockwise Beam Search for Simultaneous Speech Translation with Controllable Quality-Latency Tradeoff INTERSPEECH 2023

UNSSOR: Unsupervised Neural Speech Separation by Leveraging Over-determined Training Mixtures NIPS 2023

ESPnet-ST-v2: Multipurpose Spoken Language Translation Toolkit ACL 2023

4D ASR: Joint modeling of CTC, Attention, Transducer, and Mask-Predict decoders INTERSPEECH 2023

Exploration on HuBERT with Multiple Resolution INTERSPEECH 2023

A Comparative Study on E-Branchformer vs Conformer in Speech Recognition, Translation, and Understanding Tasks INTERSPEECH 2023

Learning to Speak from Text: Zero-Shot Multilingual Text-to-Speech with Unsupervised Text Pretraining IJCAI 2023

Findings of the IWSLT 2023 Evaluation Campaign ACL 2023

Efficient Sequence Transduction by Jointly Predicting Tokens and Durations ICML 2023

CMU’s IWSLT 2023 Simultaneous Speech Translation System ACL 2023

SigMoreFun Submission to the SIGMORPHON Shared Task on Interlinear Glossing ACL 2023

Deep Speech Synthesis from MRI-Based Articulatory Representations INTERSPEECH 2023

Bayes Risk Transducer: Transducer with Controllable Alignment Prediction INTERSPEECH 2023

Bayes Risk CTC: Controllable CTC Alignment in Sequence-to-Sequence Tasks ICLR 2023

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks ACL 2023

A New Benchmark of Aphasia Speech Recognition and Detection Based on E-Branchformer and Multi-task Learning INTERSPEECH 2023

Findings of the IWSLT 2022 Evaluation Campaign ACL 2022

BERT Meets CTC: New Formulation of End-to-End Speech Recognition with Pre-trained Masked Language Model EMNLP 2022

Token-level Sequence Labeling for Spoken Language Understanding using Compositional End-to-End Models EMNLP 2022

SUPERB-SG: Enhanced Speech processing Universal PERformance Benchmark for Semantic and Generative Capabilities ACL 2022

Branchformer: Parallel MLP-Attention Architectures to Capture Local and Global Context for Speech Recognition and Understanding ICML 2022

Zero-shot Learning for Grapheme to Phoneme Conversion with Language Ensemble ACL 2022

Self-supervised Representation Learning for Speech Processing NAACL 2022

CMU’s IWSLT 2022 Dialect Speech Translation System ACL 2022

TriniTTS: Pitch-controllable End-to-end TTS without External Aligner INTERSPEECH 2022

Deep Speech Synthesis from Articulatory Representations INTERSPEECH 2022

Attention Weight Smoothing Using Prior Distributions for Transformer-Based End-to-End ASR INTERSPEECH 2022

Audio-Visual Wake Word Spotting in MISP2021 Challenge: Dataset Release and Deep Analysis INTERSPEECH 2022

VQ-T: RNN Transducers using Vector-Quantized Prediction Network States INTERSPEECH 2022

Blockwise Streaming Transformer for Spoken Language Understanding and Simultaneous Speech Translation INTERSPEECH 2022

Audio-Visual Speech Recognition in MISP2021 Challenge: Dataset Release and Deep Analysis INTERSPEECH 2022

Minimum latency training of sequence transducers for streaming end-to-end speech recognition INTERSPEECH 2022

Updating Only Encoders Prevents Catastrophic Forgetting of End-to-End ASR Models INTERSPEECH 2022

Online Continual Learning of End-to-End Speech Recognition Models INTERSPEECH 2022

Improving Speech Enhancement through Fine-Grained Speech Characteristics INTERSPEECH 2022

Two-Pass Low Latency End-to-End Spoken Language Understanding INTERSPEECH 2022

Combining Spectral and Self-Supervised Features for Low Resource Speech Recognition and Translation INTERSPEECH 2022

When Is TTS Augmentation Through a Pivot Language Useful? INTERSPEECH 2022

End-to-End Integration of Speech Recognition, Speech Enhancement, and Self-Supervised Learning Representation INTERSPEECH 2022

Residual Language Model for End-to-end Speech Recognition INTERSPEECH 2022

SingAug: Data Augmentation for Singing Voice Synthesis with Cycle-consistent Training Strategy INTERSPEECH 2022

Muskits: an End-to-end Music Processing Toolkit for Singing Voice Synthesis INTERSPEECH 2022

Memory-Efficient Training of RNN-Transducer with Sampled Softmax INTERSPEECH 2022

Streaming Automatic Speech Recognition with Re-blocking Processing Based on Integrated Voice Activity Detection INTERSPEECH 2022

ASR2K: Speech Recognition for Around 2000 Languages without Audio INTERSPEECH 2022

Better Intermediates Improve CTC Inference INTERSPEECH 2022

ESPnet-SE++: Speech Enhancement for Robust Speech Recognition, Translation, and Understanding INTERSPEECH 2022

Data Augmentation Methods for End-to-End Speech Recognition on Distant-Talk Scenarios INTERSPEECH 2021

GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio INTERSPEECH 2021

Multi-Speaker ASR Combining Non-Autoregressive Conformer CTC and Conditional Speaker Chain INTERSPEECH 2021

Toward Streaming ASR with Non-Autoregressive Insertion-Based Model INTERSPEECH 2021

Layer Pruning on Demand with Intermediate CTC INTERSPEECH 2021

Streaming End-to-End ASR Based on Blockwise Non-Autoregressive Models INTERSPEECH 2021

Source and Target Bidirectional Knowledge Distillation for End-to-end Speech Translation NAACL 2021

Searchable Hidden Intermediates for End-to-End Models of Decomposable Sequence Tasks NAACL 2021

Highland Puebla Nahuatl Speech Translation Corpus for Endangered Language Documentation NAACL 2021

Acoustic Event Detection with Classifier Chains INTERSPEECH 2021

ESPnet-ST IWSLT 2021 Offline Speech Translation System IJCNLP 2021

Self-Guided Curriculum Learning for Neural Machine Translation IJCNLP 2021

End-to-end ASR to jointly predict transcriptions and linguistic annotations NAACL 2021

Leveraging End-to-End ASR for Endangered Language Documentation: An Empirical Study on Yolóxochitl Mixtec EACL 2021

ESPnet-ST IWSLT 2021 Offline Speech Translation System ACL 2021

Self-Guided Curriculum Learning for Neural Machine Translation ACL 2021

SUPERB: Speech Processing Universal PERformance Benchmark INTERSPEECH 2021

Rethinking End-to-End Evaluation of Decomposable Tasks: A Case Study on Spoken Language Understanding INTERSPEECH 2021

SPGISpeech: 5,000 Hours of Transcribed Financial Audio for Fully Formatted End-to-End Speech Recognition INTERSPEECH 2021

Auxiliary Loss Function for Target Speech Extraction and Recognition with Weak Supervision Based on Speaker Characteristics INTERSPEECH 2021

Speech Representation Learning Combining Conformer CPC with Deep Cluster for the ZeroSpeech Challenge 2021 INTERSPEECH 2021

Multi-Mode Transformer Transducer with Stochastic Future Context INTERSPEECH 2021

Differentiable Allophone Graphs for Language-Universal Speech Recognition INTERSPEECH 2021

Continuous Speech Separation Using Speaker Inventory for Long Recording INTERSPEECH 2021

Semi-Supervised Training with Pseudo-Labeling for End-To-End Neural Diarization INTERSPEECH 2021

Online Streaming End-to-End Neural Diarization Handling Overlapping Speech and Flexible Numbers of Speakers INTERSPEECH 2021

Leveraging Pre-Trained Language Model for Speech Sentiment Analysis INTERSPEECH 2021

Speaker Verification-Based Evaluation of Single-Channel Speech Separation INTERSPEECH 2021

Target-Speaker Voice Activity Detection with Improved i-Vector Estimation for Unknown Number of Speaker INTERSPEECH 2021

Mask CTC: Non-Autoregressive End-to-End ASR with CTC and Mask Predict INTERSPEECH 2020

End-to-End ASR with Adaptive Span Self-Attention INTERSPEECH 2020

Learning Speaker Embedding from Text-to-Speech INTERSPEECH 2020

Speaker-Conditional Chain Model for Speech Separation and Extraction INTERSPEECH 2020

End-to-End Far-Field Speech Recognition with Unified Dereverberation and Beamforming INTERSPEECH 2020

End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors INTERSPEECH 2020

Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals NIPS 2020

ESPnet-ST: All-in-One Speech Translation Toolkit ACL 2020

Insertion-Based Modeling for End-to-End Automatic Speech Recognition INTERSPEECH 2020

Massively Multilingual Adversarial Speech Recognition NAACL 2019

Auxiliary Interference Speaker Loss for Target-Speaker Speech Recognition INTERSPEECH 2019

End-to-End SpeakerBeam for Single Channel Target Speech Recognition INTERSPEECH 2019

Improving Transformer-Based End-to-End Speech Recognition with Connectionist Temporal Classification and Language Model Integration INTERSPEECH 2019

Speaker Recognition Benchmark Using the CHiME-5 Corpus INTERSPEECH 2019

Analysis of Multilingual Sequence-to-Sequence Speech Recognition Systems INTERSPEECH 2019

End-to-End Multilingual Multi-Speaker Speech Recognition INTERSPEECH 2019

Semi-Supervised Sequence-to-Sequence ASR Using Unpaired Speech and Text INTERSPEECH 2019

Vectorized Beam Search for CTC-Attention-Based Speech Recognition INTERSPEECH 2019

Study of the Performance of Automatic Speech Recognition Systems in Speakers with Parkinson’s Disease INTERSPEECH 2019

End-to-End Neural Speaker Diarization with Permutation-Free Objectives INTERSPEECH 2019

Pretraining by Backtranslation for End-to-End ASR in Low-Resource Settings INTERSPEECH 2019

Pre-Trained Text Embeddings for Enhanced Text-to-Speech Synthesis INTERSPEECH 2019

A Purely End-to-End System for Multi-speaker Speech Recognition ACL 2018

Multi-Head Decoder for End-to-End Speech Recognition INTERSPEECH 2018

The JHU/KyotoU Speech Translation System for IWSLT 2018 EMNLP 2018

Student-Teacher Learning for BLSTM Mask-based Speech Enhancement INTERSPEECH 2018

Diarization is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge INTERSPEECH 2018

Semi-Supervised End-to-End Speech Recognition INTERSPEECH 2018

Auxiliary Feature Based Adaptation of End-to-end ASR Systems INTERSPEECH 2018

Multi-Modal Data Augmentation for End-to-end ASR INTERSPEECH 2018

ESPnet: End-to-End Speech Processing Toolkit INTERSPEECH 2018

Effectiveness of Single-Channel BLSTM Enhancement for Language Identification INTERSPEECH 2018

Building State-of-the-art Distant Speech Recognition Using the CHiME-4 Challenge with a Setup of Speech Enhancement Baseline INTERSPEECH 2018

The Fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, Task and Baselines INTERSPEECH 2018

Joint CTC/attention decoding for end-to-end speech recognition ACL 2017

Multichannel End-to-end Speech Recognition ICML 2017

Semi-Supervised Learning of a Pronunciation Dictionary from Disjoint Phonemic Transcripts and Text INTERSPEECH 2017

Coupled Initialization of Multi-Channel Non-Negative Matrix Factorization Based on Spatial and Spectral Information INTERSPEECH 2017

Advances in Joint CTC-Attention Based End-to-End Speech Recognition with a Deep CNN Encoder and RNN-LM INTERSPEECH 2017

Single-Channel Multi-Speaker Separation Using Deep Clustering INTERSPEECH 2016

Improved MVDR Beamforming Using Single-Channel Mask Prediction Networks INTERSPEECH 2016

Context-Sensitive and Role-Dependent Spoken Language Understanding Using Bidirectional and Attention LSTMs INTERSPEECH 2016

Data Selection by Sequence Summarizing Neural Network in Mismatch Condition Training INTERSPEECH 2016

Statistical Dialogue Management using Intention Dependency Graph IJCNLP 2013