Apache Mahout: Machine Learning on Distributed Dataflow Systems

Robin Anil; Gokhan Capan; Isabel Drost-Fromm; Ted Dunning; Ellen Friedman; Trevor Grant; Shannon Quinn; Paritosh Ranjan; Sebastian Schelter; Özgür Yılmazel

JMLR, 2020

Abstract

Apache Mahout is a library for scalable machine learning (ML) on distributed dataflow systems, offering implementations of classification, clustering, dimensionality reduction and recommendation algorithms. Mahout pioneered large-scale machine learning when it started in 2008, targeting MapReduce, the predominant abstraction for scalable computing in industry at that time. Mahout has been widely used by leading web companies and is part of several commercial cloud offerings. In recent years, Mahout migrated to a general framework that enables a mix of dataflow programming and linear algebraic computations on backends such as Apache Spark and Apache Flink. This design allows users to execute data preprocessing and model training in a single, unified dataflow system, instead of requiring a complex integration of several specialized systems. Mahout is maintained as a community-driven open source project at the Apache Software Foundation, and is available at https://mahout.apache.org.
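The mix of dataflow programming and linear algebra described in the abstract can be illustrated with a small sketch. It is plain Python, not Mahout's API: the function names (`partition_gram`, `add_matrices`, `distributed_gram`) and the in-memory list of partitions are assumptions made for illustration only. The sketch computes the Gram matrix A^T A of a row-partitioned matrix A by having each partition sum the outer products of its own rows (a map step) and then adding the partial results (a reduce step), which is one common way such a product can be expressed on a dataflow backend.

```python
from functools import reduce


def partition_gram(rows):
    """Map step: sum of outer products row^T * row over one partition's rows."""
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram


def add_matrices(x, y):
    """Reduce step: element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def distributed_gram(partitions):
    """Compute A^T A for a matrix A whose rows are spread across partitions."""
    return reduce(add_matrices, map(partition_gram, partitions))


# Hypothetical 3 x 2 matrix A, split into two row partitions.
partitions = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[5.0, 6.0]],
]
gram = distributed_gram(partitions)
print(gram)  # [[35.0, 44.0], [44.0, 56.0]]
```

Because each partial result is only n x n for a matrix with n columns, the reduce step stays small even when the number of rows is very large. That property is what makes this formulation a natural fit for a dataflow system.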

Topics

Machine Learning > Optimization & Theory > Distributed Learning
Machine Learning > Application Areas > Efficient Computing
Computer Science > Systems > Distributed Systems

Keywords

distributed computing, scalable machine learning, machine learning, model training, distributed dataflow, Apache Spark, linear algebraic computation
