
Glue databases

Warning

This documentation is out of date and needs updating #MPM-513

This project creates databases and tables in AWS Glue, based on the folders in the metadata directory.

Once created, those Glue resources can be queried in AWS Athena and on the Analytical Platform.
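
For orientation, the project declares these resources with Pulumi (see Usage below). A Glue database and table created with the pulumi_aws package might look like the following minimal sketch; every name, S3 location, and column here is a placeholder, not the project's actual metadata:

    import pulumi_aws as aws

    # Placeholder database; the real definitions are generated from the
    # folders in the metadata directory.
    database = aws.glue.CatalogDatabase(
        "example-database",
        name="example_database",
    )

    # Placeholder external table over CSV data in S3.
    table = aws.glue.CatalogTable(
        "example-table",
        database_name=database.name,
        name="example_table",
        table_type="EXTERNAL_TABLE",
        storage_descriptor=aws.glue.CatalogTableStorageDescriptorArgs(
            location="s3://example-bucket/example_database/example_table/",
            input_format="org.apache.hadoop.mapred.TextInputFormat",
            output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
            ser_de_info=aws.glue.CatalogTableStorageDescriptorSerDeInfoArgs(
                serialization_library="org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
            ),
            columns=[
                aws.glue.CatalogTableStorageDescriptorColumnArgs(name="id", type="bigint"),
                aws.glue.CatalogTableStorageDescriptorColumnArgs(name="name", type="string"),
            ],
        ),
    )

Once a table like this exists in the Glue catalog, it can be queried from Athena with ordinary SQL (for example, SELECT * FROM example_database.example_table LIMIT 10).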

Stacks

This project contains three stacks:

  • glue-database-dev
  • glue-database-preprod
  • glue-database-prod

Usage

The glue-database-prod stack is deployed automatically by AWS CodeBuild. You should not normally deploy this stack manually.

To update either the glue-database-dev or glue-database-preprod stack manually:

  1. Create a shell session with temporary credentials for the restricted-admin@data-engineering role using AWS Vault:

    aws-vault exec restricted-admin@data-engineering
    
  2. Activate a virtual environment with the dependencies in requirements.txt installed.

  3. Log in to the cloud backend:

    pulumi login -c s3://analytical-platform-data-engineering-pulumi-backend
    
  4. Select the stack:

    pulumi stack select <stack>
    
  5. Preview any changes:

    pulumi preview --diff --policy-pack=../policy --policy-pack-config=../policy-config.json
    
  6. Create or update the resources (see the policy pack sketch after this list):

    pulumi up --policy-pack=../policy --policy-pack-config=../policy-config.json
    
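The --policy-pack flags in steps 5 and 6 validate the planned resources against the policy pack in ../policy. Those policies aren't reproduced here, but a minimal Pulumi policy pack written with the pulumi_policy package looks something like this sketch; the rule shown is a generic example, not one of this project's policies:

    from pulumi_policy import (
        EnforcementLevel,
        PolicyPack,
        ReportViolation,
        ResourceValidationArgs,
        ResourceValidationPolicy,
    )

    def no_public_read(args: ResourceValidationArgs, report_violation: ReportViolation):
        # Flag S3 buckets created with a public-read ACL.
        if args.resource_type == "aws:s3/bucket:Bucket":
            if args.props.get("acl") == "public-read":
                report_violation("S3 buckets must not be publicly readable.")

    PolicyPack(
        name="example-policy-pack",
        enforcement_level=EnforcementLevel.MANDATORY,
        policies=[
            ResourceValidationPolicy(
                name="s3-no-public-read",
                description="Disallow public-read ACLs on S3 buckets.",
                validate=no_public_read,
            ),
        ],
    )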

How to run PySpark tests locally

First, you'll need Java 8 installed. To do this easily, follow this guide.

You will then need to install PySpark, ideally within a virtual environment of some kind; Spark itself is bundled with the package, so no separate Spark installation is needed. To run a unit test with pytest, activate the virtual environment where PySpark and pytest are installed and run pytest as normal.

Tests specific to the glue_database directory are located in the tests_pyspark directory in the root of this repository. You can run them with the following command (-vv gives verbose output and --disable-warnings suppresses warning messages):

    pytest -vv --disable-warnings tests_pyspark/<test_file.py>
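
For reference, a self-contained PySpark unit test follows the pattern below. This is a generic sketch rather than one of the tests in tests_pyspark; the fixture spins up a small local Spark session once per test run and tears it down afterwards:

    import pytest
    from pyspark.sql import SparkSession

    @pytest.fixture(scope="session")
    def spark():
        # A single-threaded local session is enough for unit tests.
        session = (
            SparkSession.builder
            .master("local[1]")
            .appName("pyspark-tests")
            .getOrCreate()
        )
        yield session
        session.stop()

    def test_doubles_ids(spark):
        df = spark.createDataFrame([(1,), (2,)], ["id"])
        result = df.withColumn("doubled", df.id * 2)
        assert [row.doubled for row in result.collect()] == [2, 4]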

Some of the tests mock S3 buckets. If you have trouble running these locally, try running export AWS_DEFAULT_REGION=us-east-1 first.
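
Assuming those tests use moto, a common library for mocking S3 (check the test imports to confirm), the pattern is roughly the following sketch, with placeholder bucket and key names:

    import boto3
    from moto import mock_aws  # moto >= 5; older releases expose mock_s3 instead

    @mock_aws
    def test_reads_from_mock_bucket():
        # Everything here talks to moto's in-memory S3, not real AWS.
        s3 = boto3.client("s3", region_name="us-east-1")
        s3.create_bucket(Bucket="example-bucket")
        s3.put_object(Bucket="example-bucket", Key="data.csv", Body=b"a,b\n1,2\n")
        body = s3.get_object(Bucket="example-bucket", Key="data.csv")["Body"].read()
        assert body.startswith(b"a,b")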

How to view application event logs in the Spark UI

When a job runs in AWS Glue, it generates Spark application event logs. Download the log files to /tmp/spark-events (if that directory doesn't exist, create it by running mkdir /tmp/spark-events). To view these logs in the Spark UI you'll need to start a history server. The script to do this is in the location where PySpark is installed:

    <location-to-pyspark>/sbin/start-history-server.sh

After running that script you can access the web interface by navigating to:

    http://localhost:18080

To stop the server, run:

    <location-to-pyspark>/sbin/stop-history-server.sh
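
The history server reads file:/tmp/spark-events by default, which is why the Glue logs are downloaded there. If you also want a locally run PySpark job to produce event logs you can inspect the same way, enable event logging when building the session; a minimal sketch (the app name is arbitrary):

    from pyspark.sql import SparkSession

    # Write application event logs where the history server looks by default.
    spark = (
        SparkSession.builder
        .appName("local-event-logs")
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "file:///tmp/spark-events")
        .getOrCreate()
    )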
