# Glue databases
> **Warning**
>
> This documentation is out-of-date and needs updating #MPM-513
This project creates databases and tables in AWS Glue, based on folders in the `metadata` directory. Once created, those Glue resources can be queried in AWS Athena and on the Analytical Platform.
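Conceptually, each stack declares Glue resources using Pulumi's Python SDK. The sketch below is purely illustrative (the resource names, bucket, and columns are assumptions, not this project's actual code):

```python
import pulumi_aws as aws

# Hypothetical example: a Glue database plus one external table whose
# schema would, in this project, come from a folder in the metadata directory.
database = aws.glue.CatalogDatabase(
    "example-database",
    name="example_database",
)

table = aws.glue.CatalogTable(
    "example-table",
    database_name=database.name,
    name="example_table",
    table_type="EXTERNAL_TABLE",
    storage_descriptor=aws.glue.CatalogTableStorageDescriptorArgs(
        location="s3://example-bucket/example_table/",  # assumed location
        columns=[
            aws.glue.CatalogTableStorageDescriptorColumnArgs(name="id", type="bigint"),
            aws.glue.CatalogTableStorageDescriptorColumnArgs(name="value", type="string"),
        ],
    ),
)
```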
## Stacks

This project contains three stacks:

- `glue-database-dev`
- `glue-database-preprod`
- `glue-database-prod`
## Usage
The `glue-database-prod` stack is deployed automatically by AWS CodeBuild; you should not normally deploy it manually.

To update either the `glue-database-dev` or `glue-database-preprod` stack manually:
1. Create a shell session with temporary credentials for the `restricted-admin@data-engineering` role using AWS Vault:

   ```
   aws-vault exec restricted-admin@data-engineering
   ```

2. Activate a virtual environment with the dependencies in `requirements.txt` installed.

3. Log in to the cloud backend:

   ```
   pulumi login -c s3://analytical-platform-data-engineering-pulumi-backend
   ```

4. Select the stack:

   ```
   pulumi stack select <stack>
   ```

5. Preview any changes:

   ```
   pulumi preview --diff --policy-pack=../policy --policy-pack-config=../policy-config.json
   ```

6. Create or update the resources:

   ```
   pulumi up --policy-pack=../policy --policy-pack-config=../policy-config.json
   ```
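If you prefer to script these steps, Pulumi's Automation API can select and preview a stack from Python. A minimal sketch, assuming the `pulumi` package from `requirements.txt` is installed and you have already authenticated and logged in to the backend (steps 1 and 3 above):

```python
import pulumi.automation as auto

# Select an existing stack in the project directory
# (equivalent to `pulumi stack select glue-database-dev`).
stack = auto.select_stack(stack_name="glue-database-dev", work_dir=".")

# Preview pending changes. Note the --policy-pack flags from the steps
# above are not applied here, so use the CLI when policy checks are required.
result = stack.preview()
print(result.stdout)
```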
## How to run PySpark tests locally
First, you'll need Java 8 installed; to do this easily, follow this guide. You will then need to install PySpark (it's recommended you do this within a virtual environment of some kind); the latest version of Spark is installed with it. To run a unit test with pytest, activate the virtual environment where PySpark and pytest are installed and run pytest as normal.
Tests specific to the `glue_database` directory are located in the `tests_pyspark` directory in the root of this repository. You can run the tests with the following command (`-vv` gives verbose output and `--disable-warnings` suppresses warning messages):

```
pytest -vv --disable-warnings tests_pyspark/<test_file.py>
```
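A typical test builds a small local `SparkSession` and asserts on a DataFrame. A minimal sketch of the shape such a test might take (the fixture and test names are illustrative, not taken from this repository):

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local session is enough for unit tests.
    session = (
        SparkSession.builder.master("local[2]")
        .appName("tests_pyspark")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_filter_keeps_only_positive_ids(spark):
    df = spark.createDataFrame([(1, "a"), (-2, "b")], ["id", "value"])
    result = df.filter(df.id > 0).collect()
    assert [row.id for row in result] == [1]
```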
Some of the tests mock S3 buckets. If you have trouble running these locally, try setting the default region first:

```
export AWS_DEFAULT_REGION=us-east-1
```
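S3 mocking of this kind is commonly done with a library such as moto, whose virtual buckets default to `us-east-1` (which is why the region variable above helps). A sketch of the pattern, assuming moto 5's `mock_aws` decorator; the bucket and key names are illustrative:

```python
import boto3
from moto import mock_aws  # moto >= 5; earlier versions used mock_s3


@mock_aws
def test_reads_object_from_mocked_bucket():
    # No real AWS calls are made; the bucket exists only inside the mock.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="example-bucket")
    s3.put_object(Bucket="example-bucket", Key="data.csv", Body=b"id\n1\n")

    body = s3.get_object(Bucket="example-bucket", Key="data.csv")["Body"].read()
    assert body.startswith(b"id")
```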
## How to view application event logs in the Spark UI
When a job runs in AWS Glue, it generates Spark application event logs. Download the log files to `/tmp/spark-events` (if that directory doesn't exist, create it by running `mkdir /tmp/spark-events`). To view these logs in the Spark UI you'll need to start a history server. The script to do this is in the location where PySpark is installed:

```
<location-to-pyspark>/sbin/start-history-server.sh
```
After running that script, you can access the web interface by navigating to:

```
http://localhost:18080
```
To stop the server, run:

```
<location-to-pyspark>/sbin/stop-history-server.sh
```
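If you also want locally-run PySpark jobs to write event logs into the same directory, so they show up in the history server alongside the downloaded Glue logs, you can enable event logging when building the session. A minimal sketch using standard Spark configuration properties:

```python
from pyspark.sql import SparkSession

# spark.eventLog.* are standard Spark properties; the history server
# reads from file:/tmp/spark-events by default.
spark = (
    SparkSession.builder.master("local[2]")
    .appName("event-log-demo")
    .config("spark.eventLog.enabled", "true")
    .config("spark.eventLog.dir", "file:///tmp/spark-events")
    .getOrCreate()
)

spark.range(10).count()  # run something so an event log is written
spark.stop()
```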