
Glue databases


This documentation is out-of-date and needs updating #MPM-513

This project creates databases and tables in AWS Glue. These are based on folders in the metadata directory.

Once created, those Glue resources can be queried in AWS Athena and on the Analytical Platform.


This project contains three stacks:

  • glue-database-dev
  • glue-database-preprod
  • glue-database-prod


The glue-database-prod stack is deployed automatically by AWS CodeBuild. You should not normally deploy this stack manually.

To update either the glue-database-dev or glue-database-preprod stack manually:

  1. Create a shell session with temporary credentials for the restricted-admin@data-engineering role using AWS Vault:

    aws-vault exec restricted-admin@data-engineering

  2. Activate a virtual environment with the dependencies in requirements.txt installed.

  3. Log in to the cloud backend:

    pulumi login -c s3://analytical-platform-data-engineering-pulumi-backend

  4. Select the stack:

    pulumi stack select <stack>

  5. Preview any changes:

    pulumi preview --diff --policy-pack=../policy --policy-pack-config=../policy-config.json

  6. Create or update the resources:

    pulumi up --policy-pack=../policy --policy-pack-config=../policy-config.json
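
The steps above could be wrapped in a small helper so that the production stack is never updated by accident. This is only a sketch: the function name update_stack and the BACKEND variable are hypothetical and not part of this repository. It assumes aws-vault and pulumi are on your PATH, that the virtual environment is already active, and that you run it from the directory containing the Pulumi project.

```shell
# Hypothetical helper, not part of this repository.
BACKEND="s3://analytical-platform-data-engineering-pulumi-backend"

update_stack() {
  stack="$1"

  # Only the dev and preprod stacks are meant to be updated manually.
  case "$stack" in
    glue-database-dev|glue-database-preprod) ;;
    *)
      echo "refusing to update '$stack' manually" >&2
      return 1
      ;;
  esac

  # Run all Pulumi commands inside a single aws-vault session.
  # Inside the child shell, $1 is the backend URL and $2 is the stack name.
  aws-vault exec restricted-admin@data-engineering -- sh -c '
    set -eu
    pulumi login -c "$1"
    pulumi stack select "$2"
    pulumi preview --diff --policy-pack=../policy --policy-pack-config=../policy-config.json
    pulumi up --policy-pack=../policy --policy-pack-config=../policy-config.json
  ' _ "$BACKEND" "$stack"
}

# Usage (not run here): update_stack glue-database-dev
```

Because pulumi up still asks for confirmation before applying changes, the preview step remains a chance to stop before anything is modified.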

How to run PySpark tests locally

First, you'll need Java 8 installed. The easiest way to do this is to follow this guide.
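
To confirm which Java version is active before installing PySpark, a check along these lines may help. It is a sketch: java_major_version is a hypothetical helper name, and it assumes the usual java -version output, where the version appears in double quotes on the first line (for example "1.8.0_292" for Java 8).

```shell
# Hypothetical helper: reads `java -version` output on stdin and prints
# the major version (8 for "1.8.0_292", 11 for "11.0.11").
java_major_version() {
  # The version string is the quoted token on the first line.
  version=$(head -n 1 | sed 's/.*"\(.*\)".*/\1/')
  case "$version" in
    1.*) echo "$version" | cut -d. -f2 ;;  # legacy scheme: 1.8 means Java 8
    *)   echo "$version" | cut -d. -f1 ;;  # modern scheme: 11.0.11 means Java 11
  esac
}

# Usage (java -version writes to stderr, so redirect it):
#   java -version 2>&1 | java_major_version
```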

You will then need to install PySpark, preferably within a virtual environment of some kind. Installing PySpark also installs the latest version of Spark. To run a unit test with pytest, activate the virtual environment where PySpark and pytest are installed and run pytest as normal.

Tests specific to the glue_database directory are located in the tests_pyspark directory in the root of this repository. You can run the tests with the following command (-vv gives verbose output and --disable-warnings suppresses warning messages):

    pytest -vv --disable-warnings tests_pyspark/<>

Some of the tests mock S3 buckets. If you have trouble running these locally, try running export AWS_DEFAULT_REGION=us-east-1 first.
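
If you would rather not export the region for the whole shell session, the variable can be scoped to the test run instead. The wrapper below is a sketch with a hypothetical name (run_pyspark_tests). It keeps any region you have already set and falls back to us-east-1 otherwise.

```shell
# Hypothetical wrapper: sets AWS_DEFAULT_REGION only for the pytest call,
# without overriding a region that is already set in the environment.
run_pyspark_tests() {
  AWS_DEFAULT_REGION="${AWS_DEFAULT_REGION:-us-east-1}" \
    pytest -vv --disable-warnings "$@"
}

# Usage (not run here): run_pyspark_tests tests_pyspark/
```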

How to view application event logs in the Spark UI

When a job runs in AWS Glue it will generate Spark application event logs. Download the log files to /tmp/spark-events (if that directory doesn't exist, create it by running mkdir -p /tmp/spark-events). To view these logs in the Spark UI you'll need to start a history server. The script to do this is in the location where PySpark is installed:


After running that script you can access the web interface by navigating to:


To close the server run:


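
As a sketch of the three steps above, the helpers below locate the PySpark installation and call the history server scripts that Spark ships in its sbin directory. The function names and the HISTORY_SERVER_URL variable are hypothetical. The sketch assumes PySpark was installed with pip into the active virtual environment, that the installation includes sbin/start-history-server.sh and sbin/stop-history-server.sh, and that the history server uses Spark's defaults (port 18080, logs read from /tmp/spark-events).

```shell
# Hypothetical helpers; assume a pip-installed PySpark in the active environment.

# Print the directory where the pyspark package is installed.
spark_home() {
  python -c 'import os, pyspark; print(os.path.dirname(pyspark.__file__))'
}

# Start the history server, creating the default event log directory if needed.
start_history_server() {
  mkdir -p /tmp/spark-events
  bash "$(spark_home)/sbin/start-history-server.sh"
}

# Stop the history server.
stop_history_server() {
  bash "$(spark_home)/sbin/stop-history-server.sh"
}

# Default address of the history server web interface.
HISTORY_SERVER_URL="http://localhost:18080"
```

If your installation does not contain an sbin directory, or the server is configured with a different port or log directory, adjust the paths and URL accordingly.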
Last update: January 9, 2024
Created: January 9, 2024