Fast String Kernels using Inexact Matching for Protein Sequences

Christina Leslie; Rui Kuang

JMLR, 2004

Abstract

We describe several families of $k$-mer based string kernels related to the recently presented mismatch kernel and designed for use with support vector machines (SVMs) for classification of protein sequence data. These new kernels -- restricted gappy kernels, substitution kernels, and wildcard kernels -- are based on feature spaces indexed by $k$-length subsequences ("$k$-mers") from the string alphabet $\Sigma$. However, for all kernels we define here, the kernel value $K(x, y)$ can be computed in $O(c_K(|x| + |y|))$ time, where the constant $c_K$ depends on the parameters of the kernel but is independent of the size $|\Sigma|$ of the alphabet. Thus the computation of these kernels is linear in the length of the sequences, like the mismatch kernel, but we improve upon the parameter-dependent constant $c_K = k^{m+1}|\Sigma|^m$ of the $(k, m)$-mismatch kernel. We compute the kernels efficiently using a trie data structure and relate our new kernels to the recently described transducer formalism. In protein classification experiments on two benchmark SCOP data sets, we show that our new faster kernels achieve SVM classification performance comparable to the mismatch kernel and the Fisher kernel derived from profile hidden Markov models, and we investigate the dependence of the kernels on parameter choice.
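To make the $(k, m)$-mismatch baseline mentioned in the abstract concrete, here is a minimal brute-force sketch in Python. It assumes the usual definition of the mismatch feature map, in which each $k$-mer in a sequence contributes a count to every $k$-mer within Hamming distance $m$ of it. The function names (`mismatch_neighborhood`, `mismatch_features`, `mismatch_kernel`) are illustrative and not taken from the paper, and the code deliberately enumerates neighborhoods explicitly rather than using the trie traversal the paper describes, so it is only practical for small $k$, $m$, and alphabets.

```python
from collections import Counter
from itertools import combinations, product


def mismatch_neighborhood(kmer, m, alphabet):
    """Return the set of all k-mers within Hamming distance <= m of `kmer`."""
    neighbors = {kmer}
    for d in range(1, m + 1):
        # Choose d positions to overwrite, then every letter assignment for them.
        for positions in combinations(range(len(kmer)), d):
            for letters in product(alphabet, repeat=d):
                candidate = list(kmer)
                for pos, letter in zip(positions, letters):
                    candidate[pos] = letter
                # Overwriting with the same letter gives a closer neighbor;
                # the set removes such duplicates.
                neighbors.add("".join(candidate))
    return neighbors


def mismatch_features(seq, k, m, alphabet):
    """Explicit (sparse) feature vector: counts indexed by k-mers."""
    features = Counter()
    for i in range(len(seq) - k + 1):
        for beta in mismatch_neighborhood(seq[i:i + k], m, alphabet):
            features[beta] += 1
    return features


def mismatch_kernel(x, y, k, m, alphabet):
    """Inner product of the two explicit feature vectors (brute force)."""
    fx = mismatch_features(x, k, m, alphabet)
    fy = mismatch_features(y, k, m, alphabet)
    return sum(count * fy[beta] for beta, count in fx.items())


if __name__ == "__main__":
    # Toy two-letter alphabet; real protein data would use the 20 amino acids.
    print(mismatch_kernel("ABAB", "BABA", k=2, m=0, alphabet="AB"))
    print(mismatch_kernel("ABAB", "BABA", k=2, m=1, alphabet="AB"))
```

With `m=0` this reduces to an exact-match $k$-mer count kernel. Each neighborhood grows with the alphabet size, which is the dependence on $|\Sigma|$ that the kernels in this paper are designed to avoid.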

Authors

Christina Leslie, Rui Kuang

Topics

Machine Learning > Core Methods > Classification

Machine Learning > Core Methods > Metric Learning

Machine Learning > Learning Types > Supervised Learning

Machine Learning > Core Methods > Kernel Methods

Keywords

protein sequence, support vector machine, string kernel, mismatch kernel, trie data structure

Related papers

Selective Rademacher Penalization and Reduced Error Pruning of Decision Trees 2004

Learning the Kernel Matrix with Semidefinite Programming 2004

Weather Data Mining Using Independent Component Analysis 2004

A Geometric Approach to Multi-Criterion Reinforcement Learning 2004

Distributional Scaling: An Algorithm for Structure-Preserving Embedding of Metric and Nonmetric Spaces 2004