mrjob
mrjob lets you write MapReduce jobs in Python 2.7/3.4+ and run them on several platforms. You can:
- Write multi-step MapReduce jobs in pure Python
- Test on your local machine
- Run on a Hadoop cluster
- Run in the cloud using Amazon Elastic MapReduce (EMR)
- Run in the cloud using Google Cloud Dataproc (Dataproc)
- Easily run Spark jobs on EMR or your own Hadoop cluster
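To make the "pure Python" claim above more concrete, here is a minimal sketch of the map, shuffle, and reduce data flow that a word-count job follows. It deliberately uses only the standard library and does not call mrjob's API; the names `mapper`, `reducer`, and `run_job` are hypothetical and chosen for illustration only. The tutorial describes how real jobs are defined and run.

```python
from collections import defaultdict


def mapper(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    for word in line.split():
        yield word.lower(), 1


def reducer(word, counts):
    """Reduce step: sum all the counts collected for a single word."""
    yield word, sum(counts)


def run_job(lines):
    """Run map, shuffle, and reduce in-process; return a dict of results.

    Hypothetical helper for illustration only, not part of mrjob.
    """
    # Shuffle: group every mapped value under its key.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            grouped[key].append(value)

    # Reduce: visit keys in sorted order and collect each reducer's output.
    results = {}
    for key in sorted(grouped):
        for out_key, out_value in reducer(key, grouped[key]):
            results[out_key] = out_value
    return results


if __name__ == "__main__":
    print(run_job(["the quick brown fox", "the lazy dog"]))
```

Writing the mapper and reducer as generators means each step can emit zero, one, or many key/value pairs per input, which is the shape a MapReduce step generally takes.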
mrjob is licensed under the Apache License, Version 2.0.
To get started, install with pip:
    pip install mrjob
and begin reading the tutorial below.
- Guides
- Why mrjob?
- Fundamentals
- Concepts
- Writing jobs
- Runners
- Spark
- Why use mrjob with Spark?
- mrjob spark-submit
- Writing your first Spark MRJob
- Running on your Spark cluster
- Using remote filesystems other than HDFS
- Other ways to run on Spark
- Passing in libraries
- Command-line options
- Uploading files to the working directory
- Archives and directories
- Multi-step jobs
- External Spark scripts
- Custom input and output formats
- Running “classic” MRJobs on Spark
- Config file format and location
- Options available to all runners
- Hadoop-related options
- Spark runner options
- Configuration quick reference
- Cloud runner options
- Job Environment Setup Cookbook
- Hadoop Cookbook
- Testing jobs
- Cloud Dataproc
- Elastic MapReduce
- Python 2 vs. Python 3
- Contributing to mrjob
- Reference
- mrjob.ami - building custom AMIs
- mrjob.cat - decompress files based on extension
- mrjob.cmd: The mrjob command-line utility
- mrjob.compat - Hadoop version compatibility
- mrjob.conf - parse and write config files
- mrjob.dataproc - run on Dataproc
- mrjob.emr - run on EMR
- mrjob.hadoop - run on your Hadoop cluster
- mrjob.inline - debugger-friendly local testing
- mrjob.job - defining your job
- mrjob.local - simulate Hadoop locally with subprocesses
- mrjob.parse - log parsing
- mrjob.protocol - input and output
- mrjob.spark.runner - run on any Spark cluster
- mrjob.retry - retry on transient errors
- mrjob.runner - base class for all runners
- mrjob.step - represent Job Steps
- mrjob.setup - job environment setup
- mrjob.util - general utility functions
- What’s New
- Glossary
Appendices