Using GitHub Actions for MLOps & Data Science

Written by GitHub Engineering / Original link on Jun. 17, 2020


Machine Learning Operations (or MLOps) enables data scientists to work in a more collaborative fashion by providing testing, lineage, versioning, and historical information in an automated way. Because the landscape of MLOps is nascent, data scientists are often forced to implement these tools from scratch. The closely related discipline of DevOps offers some help; however, many DevOps tools are generic and require the implementation of “ML awareness” through custom code. Furthermore, these platforms often require disparate tools that are decoupled from your code, leading to poor debugging and reproducibility.

To mitigate these concerns, we have created a series of GitHub Actions that integrate parts of the data science and machine learning workflow with a software development workflow. Furthermore, we provide components and examples that automate common tasks.

An Example Of MLOps Using GitHub Actions

Consider how an experiment tracking system can be integrated with GitHub Actions to enable MLOps. In the example below, we demonstrate how you can orchestrate a machine learning pipeline to run on the infrastructure of your choice, collect metrics using an experiment tracking system, and report the results back to a pull request.


A screenshot of this pull request.

For a live demonstration of the above example, please see this talk.
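
As a rough illustration of the last step described above (reporting results back to a pull request), the following Python sketch turns a list of experiment runs into a Markdown table that could serve as a PR comment body. The function name `format_metrics_comment` and the `run_id`, `accuracy`, and `loss` fields are hypothetical; a real pipeline would pull these values from whichever experiment tracking system it uses.

```python
from typing import Dict, List, Union

Metric = Union[str, float]


def format_metrics_comment(runs: List[Dict[str, Metric]]) -> str:
    """Render experiment runs as a Markdown table for a PR comment.

    Each run is a dict with the hypothetical keys run_id, accuracy, and loss.
    """
    header = "| Run | Accuracy | Loss |"
    divider = "| --- | --- | --- |"
    rows = [
        f"| {run['run_id']} | {run['accuracy']:.3f} | {run['loss']:.3f} |"
        for run in runs
    ]
    return "\n".join([header, divider, *rows])


if __name__ == "__main__":
    # Illustrative values only, not real experiment results.
    example_runs = [
        {"run_id": "baseline", "accuracy": 0.9, "loss": 0.31},
        {"run_id": "candidate", "accuracy": 0.92, "loss": 0.27},
    ]
    print(format_metrics_comment(example_runs))
```

A workflow step could run a script like this after training and pass the resulting string to whatever step posts the pull request comment.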

MLOps is not limited to the above example. Because GitHub Actions are composable, you can stack workflows in many ways that can help data scientists. Below is a concrete example of a very simple workflow that adds Binder links to pull requests:

name: Binder
on:
  pull_request:
    types: [opened, reopened]

jobs:
  create-binder-badge:  # job id is a placeholder; any valid id works
    runs-on: ubuntu-latest
    steps:
    - name: checkout pull request branch
      uses: actions/checkout@v2
      with:
        ref: ${{ github.event.pull_request.head.sha }}

    - name: comment on PR with Binder link
      uses: actions/github-script@v1
      with:
        github-token: ${{secrets.GITHUB_TOKEN}}
        script: |
          // Badge and launch URLs below assume the public mybinder.org service.
          var BRANCH_NAME = process.env.BRANCH_NAME;
          github.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: `[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/${context.repo.owner}/${context.repo.repo}/${BRANCH_NAME}) :point_left: Launch a binder notebook on this branch`
          })
      env:
        BRANCH_NAME: ${{ github.event.pull_request.head.ref }}

When the above YAML file is added to a repository’s .github/workflows directory, each newly opened or reopened pull request is annotated with a comment containing a Binder launch link [1].
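
The link posted by such a workflow follows a simple pattern: a badge image wrapped in a link to a Binder launch URL for the branch. The Python sketch below builds that Markdown, assuming the public mybinder.org service and its `/v2/gh/<owner>/<repo>/<ref>` URL layout. The function name `binder_badge_markdown` is hypothetical, and percent-encoding the ref is an added precaution for branch names containing slashes, not something the workflow itself does.

```python
from urllib.parse import quote

# Assumption: the standard badge image served by the public mybinder.org site.
BINDER_BADGE_URL = "https://mybinder.org/badge_logo.svg"


def binder_badge_markdown(owner: str, repo: str, ref: str) -> str:
    """Return Markdown for a Binder badge that launches `ref` of owner/repo."""
    # Percent-encode the ref so a branch like "feature/plots" stays a
    # single path segment in the launch URL.
    encoded_ref = quote(ref, safe="")
    launch_url = f"https://mybinder.org/v2/gh/{owner}/{repo}/{encoded_ref}"
    return f"[![Binder]({BINDER_BADGE_URL})]({launch_url})"


if __name__ == "__main__":
    # "octocat/demo" is a placeholder repository used only for illustration.
    print(binder_badge_markdown("octocat", "demo", "feature/plots"))
```

Keeping the URL construction in one small function makes it easy to check the link format locally before wiring it into a workflow.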


A Growing Ecosystem of MLOps & Data Science Actions

There is a growing number of Actions available for machine learning ops and data science. Below are some concrete examples that are in use today, categorized by topic.

Orchestrating Machine Learning Pipelines:

Jupyter Notebooks:

End-To-End Workflow Orchestration:

Experiment Tracking:

This is by no means an exhaustive list of the things you might want to automate with GitHub Actions with respect to data science and machine learning. You can follow our progress towards this goal on our page, which contains links to blog posts, GitHub Actions, talks, and examples that are relevant to this topic.

We invite the community to create other Actions that data scientists might find useful. Some ideas for getting started include data and model versioning, model deployment, and data validation, as well as expanding upon some of the areas mentioned above. A great place to start is the documentation for GitHub Actions, particularly the guidance on how to build Actions for the community!

Related Materials


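[1] This example workflow will not work on pull requests from forks, because workflows triggered by a fork's pull request receive a read-only GITHUB_TOKEN and cannot post comments. To support forks, the PR comment has to be triggered by a different event.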
