Automation has become an indispensable element for ensuring operational efficiency and reliability in modern software development. GitHub Actions, an integrated Continuous Integration and Continuous Deployment (CI/CD) tool within GitHub, has established its position in the software development industry by providing a comprehensive platform for automating development and deployment workflows. However, its functionalities extend beyond this. We will delve into the use of GitHub Actions in the data domain, demonstrating how it can streamline processes for developers and data professionals by automating data retrieval from external sources and data transformation operations.

GitHub Actions Benefits

GitHub Actions is already well known for its capabilities in the software development domain, and in recent years it has also been found to offer compelling benefits for streamlining data workflows.

GitHub Actions Building Blocks

GitHub Actions is a feature of GitHub that allows users to automate workflows directly within their repositories. These workflows are defined using YAML files and can be triggered by various events such as code pushes, pull requests, issue creation, or scheduled intervals. With its extensive library of pre-built actions and the ability to write custom scripts, GitHub Actions is a versatile tool for automating tasks.


4 Levels of GitHub Actions

We will demonstrate the implementation of GitHub Actions through four levels of difficulty, starting with the “minimum viable product” and progressively introducing additional components and customization at each level.

1. “Simple Workflow” with Python Script Execution

Start by creating a GitHub repository where you want to store your workflow and the Python script. In your repository, create a .github/workflows directory (please note that workflow files must be placed inside this directory for GitHub to detect and execute them). Inside this directory, create a YAML file (e.g., simple-workflow.yaml) that defines your workflow.

The following example shows a workflow file that executes the Python script hello_world.py based on a manual trigger.

name: simple-workflow

on: 
    workflow_dispatch:
    
jobs:
    run-hello-world:
      runs-on: ubuntu-latest
      steps:
          - name: Checkout repo content
            uses: actions/checkout@v4
          - name: run hello world
            run: python code/hello_world.py

It consists of three sections: First, name: simple-workflow defines the workflow name. Second, on: workflow_dispatch specifies the condition for running the workflow, which in this case is a manual trigger from the repository's Actions tab. Last, the job jobs: run-hello-world runs on an ubuntu-latest runner and breaks down into two steps: checking out the repository content with actions/checkout@v4, then running the script with python code/hello_world.py.
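
The workflow assumes the repository contains a script at code/hello_world.py. Its content is entirely up to you; the file below is only a minimal, hypothetical placeholder:

## hello_world.py

print("Hello world from GitHub Actions!")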

2. “Push Workflow” with Environment Setup

The first workflow demonstrated the minimum viable version of a GitHub Action, but it did not take full advantage of GitHub Actions' capabilities. At the second level, we add a bit more customization and functionality: automatically setting up the environment with Python version 3.11, installing required packages, and executing the script whenever changes are pushed to the main branch.

name: push-workflow

on: 
    push:
        branches:
            - main

jobs:
    run-hello-world:
      runs-on: ubuntu-latest
      steps:
          - name: Checkout repo content
            uses: actions/checkout@v4
          - name: Set up Python
            uses: actions/setup-python@v5
            with:
              python-version: '3.11' 
          - name: Install dependencies
            run: |
              python -m pip install --upgrade pip
              pip install -r requirements.txt
          - name: Run hello world
            run: python code/hello_world.py
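
Note that the “Install dependencies” step expects a requirements.txt file at the repository root. Its content depends on your project; a minimal, hypothetical example covering the package used by the data pipeline later in this guide:

## requirements.txt

pandas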

If you are interested in the basics of setting up a development environment for your data science projects, my previous blog post “7 Tips to Future-Proof Machine Learning Projects” provides a bit more explanation.

3. “Scheduled Workflow” with Argument Parsing

At the third level, we add more dynamics and complexity to make the workflow more suitable for real-world applications. We introduce scheduled jobs, as they bring even more benefits to a data science project, enabling periodic fetching of recent data and removing the need to manually run the script whenever a data refresh is required. Additionally, we use dynamic argument parsing to execute the script with different date range parameters according to the schedule.

name: scheduled-workflow

on: 
    workflow_dispatch:
    schedule:
        - cron: "0 12 1 * *" # run at 12:00 UTC on the 1st day of every month

jobs:
    run-data-pipeline:
        runs-on: ubuntu-latest
        steps:
            - name: Checkout repo content
              uses: actions/checkout@v4
            - name: Set up Python
              uses: actions/setup-python@v5
              with:
                python-version: '3.11'  # Specify your Python version here
            - name: Install dependencies
              run: |
                python -m pip install --upgrade pip
                pip install -r requirements.txt
            - name: Run data pipeline
              run: |
                  # First day of the previous month
                  PREV_MONTH_START=$(date -d "`date +%Y%m01` -1 month" +%Y-%m-%d)
                  # Last day of the previous month (one day before the 1st of the current month)
                  PREV_MONTH_END=$(date -d "`date +%Y%m01` -1 day" +%Y-%m-%d)
                  python code/fetch_data.py --start $PREV_MONTH_START --end $PREV_MONTH_END
            - name: Commit changes
              run: |
                  git config user.name 'github-actions'
                  git config user.email 'github-actions@github.com'
                  git add .
                  git commit -m "update data"
                  git push
## fetch_data.py

import argparse
import http.client
import json
import os
import urllib.parse

import pandas as pd

# Search term for the news API; the value here is purely illustrative.
search_term = 'machine learning'

def parse_news_json(url, date):
    # Helper sketch: the workflow references this function without showing it.
    # The host name below is an assumption inferred from the /v1/news/all path.
    conn = http.client.HTTPSConnection('api.thenewsapi.com')
    conn.request('GET', url)
    response = conn.getresponse()
    content = json.loads(response.read().decode('utf-8'))
    conn.close()
    content['date'] = date  # tag the payload with the date it was fetched for
    return content

def main(args=None):
    parser = argparse.ArgumentParser()
    parser.add_argument('--start', type=str)
    parser.add_argument('--end', type=str)
    args = parser.parse_args(args=args)
    print("Start Date is: ", args.start)
    print("End Date is: ", args.end)

    date_range = pd.date_range(start=args.start, end=args.end)
    content_lst = []

    for date in date_range:
        date = date.strftime('%Y-%m-%d')

        params = urllib.parse.urlencode({
            'api_token': '<NEWS_API_TOKEN>',  # hard-coded placeholder, replaced with a secret at level 4
            'published_on': date,
            'search': search_term,
        })
        url = '/v1/news/all?{}'.format(params)

        content_json = parse_news_json(url, date)
        content_lst.append(content_json)

    # Write the results as JSON Lines: one JSON object per line
    with open('data.jsonl', 'w') as f:
        for item in content_lst:
            json.dump(item, f)
            f.write('\n')

    return content_lst

if __name__ == '__main__':
    main()

When the command python code/fetch_data.py --start $PREV_MONTH_START --end $PREV_MONTH_END executes, it creates a date range between $PREV_MONTH_START and $PREV_MONTH_END. For each day in the date range, it generates a URL, fetches the daily news through the API, parses the JSON response, and collects all the content into a list. We then write this list to the file “data.jsonl”, one JSON object per line.
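
Since main() accepts an optional argument list, the same entry point can also be exercised without the command line, which is handy for quick local testing. A sketch, assuming it is run from the code/ directory with a valid API token in place (the dates are illustrative only):

from fetch_data import main

# Fetch one month of news; the returned list holds one payload per day.
content = main(['--start', '2024-04-01', '--end', '2024-04-30'])
print(len(content), 'daily payloads fetched')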

- name: Commit changes
  run: |
      git config user.name 'github-actions'
      git config user.email 'github-actions@github.com'
      git add .
      git commit -m "update data"
      git push

As shown above, the last step “Commit changes” configures the git user name and email, stages the changes, commits them, and pushes to the remote GitHub repository. This is a necessary step when running GitHub Actions that result in changes to the working directory (e.g., the output file “data.jsonl” is created). Otherwise, the output exists only in the ephemeral runner environment and is discarded when the job completes, so it appears as if no changes have been made after the action finishes.

4. “Secure Workflow” with Secrets and Environment Variables Management

The final level focuses on improving the security and performance of the GitHub workflow by addressing non-functional requirements.

name: secure-workflow

on: 
    workflow_dispatch:
    schedule:
        - cron: "34 23 1 * *" # run at 23:34 UTC on the 1st day of every month

jobs:
    run-data-pipeline:
        runs-on: ubuntu-latest
        steps:
            - name: Checkout repo content
              uses: actions/checkout@v4
            - name: Set up Python
              uses: actions/setup-python@v5
              with:
                python-version: '3.11'  # Specify your Python version here
            - name: Install dependencies
              run: |
                python -m pip install --upgrade pip
                pip install -r requirements.txt
            - name: Run data pipeline
              env:
                  NEWS_API_TOKEN: ${{ secrets.NEWS_API_TOKEN }} 
              run: |
                  PREV_MONTH_START=$(date -d "`date +%Y%m01` -1 month" +%Y-%m-%d)
                  PREV_MONTH_END=$(date -d "`date +%Y%m01` -1 day" +%Y-%m-%d)
                  python code/fetch_data.py --start $PREV_MONTH_START --end $PREV_MONTH_END
            - name: Check changes
              id: git-check
              run: |
                  git config user.name 'github-actions'
                  git config user.email 'github-actions@github.com'
                  git add .
                  git diff --staged --quiet || echo "changes=true" >> $GITHUB_ENV
            - name: Commit and push if changes
              if: env.changes == 'true'
              run: |
                  git commit -m "update data"
                  git push
                  

To improve workflow efficiency and reduce errors, we add a check before committing changes, ensuring that commits and pushes only occur when there are actual changes since the last commit. This is achieved through the command git diff --staged --quiet || echo "changes=true" >> $GITHUB_ENV: git diff --staged --quiet exits with a non-zero status when staged changes exist, so the echo after || runs only in that case, writing changes=true into the $GITHUB_ENV file where the next step can read it as env.changes.

Lastly, we introduce secrets, which enhance security and avoid exposing sensitive information (e.g., API tokens, personal access tokens) in the codebase. Additionally, environment secrets offer the benefit of separating development stages. This means you can have different secrets for different stages of your development and deployment pipeline. For example, a testing environment (e.g., used by the dev branch) can access only the test token, whereas the production environment (e.g., used by the main branch) can access the token linked to the production instance.

To set up a repository secret in GitHub:

  1. Go to your repository settings
  2. Navigate to Secrets and Variables > Actions
  3. Click “New repository secret”
  4. Add your secret name and value
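
The steps above create a repository secret, which is visible to all workflows in the repository. To separate stages as described earlier, you can instead define secrets under Settings > Environments and reference that environment in the job. A sketch, assuming an environment named production has been created:

jobs:
    run-data-pipeline:
        runs-on: ubuntu-latest
        # Secrets are now resolved from the "production" environment
        # rather than from repository-level secrets
        environment: production
        steps:
            - name: Run data pipeline
              env:
                  NEWS_API_TOKEN: ${{ secrets.NEWS_API_TOKEN }}
              run: python code/fetch_data.py --start 2024-04-01 --end 2024-04-30 # dates illustrative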

After setting up the secret in GitHub, we need to expose it to the workflow as an environment variable. For example, below we add ${{ secrets.NEWS_API_TOKEN }} to the step “Run data pipeline”.

- name: Run data pipeline
  env:
      NEWS_API_TOKEN: ${{ secrets.NEWS_API_TOKEN }} 
  run: |
      PREV_MONTH_START=$(date -d "`date +%Y%m01` -1 month" +%Y-%m-%d)
      PREV_MONTH_END=$(date -d "`date +%Y%m01` -1 day" +%Y-%m-%d)
      python code/fetch_data.py --start $PREV_MONTH_START --end $PREV_MONTH_END

We then update the Python script fetch_data.py to access the environment secret using os.environ.get().

import os

api_token = os.environ.get('NEWS_API_TOKEN')
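
With this change, the hard-coded placeholder in the request parameters gives way to the environment variable. A sketch of the updated call, with date and search_term as defined earlier in fetch_data.py:

params = urllib.parse.urlencode({
    'api_token': api_token,  # read from the NEWS_API_TOKEN secret at runtime
    'published_on': date,
    'search': search_term,
})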

Take-Home Message

This guide explores the implementation of GitHub Actions for building dynamic data pipelines, progressing through four different levels of workflow implementations:

  1. “Simple Workflow” with Python script execution, triggered manually
  2. “Push Workflow” with environment setup, triggered by pushes to the main branch
  3. “Scheduled Workflow” with argument parsing, running on a cron schedule
  4. “Secure Workflow” with secrets and environment variables management

Each level builds upon the previous one, demonstrating how GitHub Actions can be effectively utilized in the data domain to streamline data solutions and speed up the development lifecycle.
