# Metadata
This folder consists of two parts:
- `get_metadata.py`, which uses mojap-metadata to extract table metadata into a collection of JSON files
- a Pulumi project to upload the extracted metadata to Amazon S3
Each collection of metadata you extract here will then:
- be uploaded to Amazon S3 using Pulumi (as part of this folder). This will then be used in the extraction validation (as part of pipelines)
- be used in the DMS data mappings (as part of pipelines)
- update the Glue data catalogue (as part of `glue_database` using etl_manager)
## Getting metadata
To extract metadata for the first time or update existing metadata:

1. Create a new virtual environment if you haven't already:

   ```
   python -m venv venv-sql
   ```

2. Activate the virtual environment:

   ```
   source venv-sql/bin/activate
   ```

3. Install the python dependencies:

   ```
   pip install -r requirements-sql.txt
   ```

4. Add an object list of the relevant tables to `object_lists.py` if it doesn't exist.

5. Specify the database metadata as a dictionary in `database_options.py`.

6. Create a `.env` file in the root directory with the database username and password:

   ```
   DB_USERNAME=<username>
   DB_PASSWORD=<password>
   ```

7. Open `get_metadata.py` and update the `database_options` import to select the relevant dictionary.

8. Run the script:

   ```
   python get_metadata.py
   ```

The result should be a new or updated folder of metadata as a subfolder of the `metadata` directory.
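
For step 7, the import switch might look like this minimal sketch; the dictionary name `example_db_options` is hypothetical, and loading the `.env` file via python-dotenv is an assumption:

```python
# Top of get_metadata.py (sketch)
from dotenv import load_dotenv  # assumption: python-dotenv loads the .env credentials

# Hypothetical name - swap in the dictionary you defined in database_options.py
from database_options import example_db_options as options

load_dotenv()  # makes DB_USERNAME and DB_PASSWORD available via os.environ
```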
## Specifying which objects to include
Objects consist of both tables and materialised views. These are stored in `object_lists.py`. To add or remove objects, update the relevant list for the database environment (a hedged example follows below).
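
As an illustration, a list in `object_lists.py` might simply name the tables and materialised views for one database environment. These names are placeholders, not real objects:

```python
# object_lists.py (hypothetical entry for one database environment)
example_dev_objects = [
    "CUSTOMERS",         # table
    "ORDERS",            # table
    "ORDER_SUMMARY_MV",  # materialised view
]
```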
## Writing an options dictionary
Your options dictionary should have the following keys (a hedged example follows the list):

- database (str): name of the database you're connecting to - this will become the name of the folder where your table lists will be stored.
- title (str): the name of this collection of metadata - this will become the name of the subfolder within `database` where your metadata will be stored.
- description (str): how this metadata collection should be described in the Glue catalogue.
- schema (str): the name of the database schema you're extracting from - that's schema in the Oracle sense of a subsection of the database.
- read_bucket (str): name of the S3 bucket where the actual data (not the metadata) for these tables will end up. The bucket name will end up in the `database.json` file for this metadata collection.
- s3_folder (str): prefix of the items in the read_bucket that are relevant to this metadata.
- objects (list): the names of the objects (both tables and materialised views) that you want included. If you've added a new list of objects, make sure you also import it at the top of `database_options.py`.
- dialect (str): the SQLAlchemy dialect; see https://docs.sqlalchemy.org/en/20/dialects/
- DSN (str): the combination of `{host}:{port}/{database}` used by SQLAlchemy to create an engine.
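
Putting it together, a dictionary in `database_options.py` might look like the sketch below. All values are placeholders, and the `oracle+cx_oracle` dialect is only an assumption based on the Oracle-style schema described above:

```python
# database_options.py (hypothetical example)
from object_lists import example_dev_objects  # hypothetical list defined in object_lists.py

example_db_options = {
    "database": "example_db",
    "title": "example_collection",
    "description": "Example metadata collection for the Glue catalogue",
    "schema": "EXAMPLE_SCHEMA",
    "read_bucket": "example-read-bucket",
    "s3_folder": "example/prefix",
    "objects": example_dev_objects,
    "dialect": "oracle+cx_oracle",  # assumption: see the SQLAlchemy dialects page
    "DSN": "db.example.com:1521/EXAMPLEDB",
}
```

The dialect and DSN combine with the credentials from `.env` to build a SQLAlchemy connection string, roughly `f"{dialect}://{username}:{password}@{DSN}"`, which is passed to `create_engine`.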
## Upload metadata to Amazon S3
For a given environment, metadata is uploaded to S3 to be used by the AWS Lambda function that performs validation checks against tables extracted by AWS Database Migration Service. If you are uploading new metadata for the first time, you will need to specify the source database and metadata folder (located in this project) within the Pulumi config file, as in the sketch below.
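
A minimal sketch of how the Pulumi program might consume those config values and upload the extracted files; the config keys `source_database` and `metadata_folder`, the resource names, and the bucket itself are hypothetical, not the project's actual definitions:

```python
# __main__.py (Pulumi program) - illustrative sketch only
import os

import pulumi
import pulumi_aws as aws

config = pulumi.Config()
source_database = config.require("source_database")  # hypothetical config key
metadata_folder = config.require("metadata_folder")  # hypothetical config key

bucket = aws.s3.Bucket("metadata-bucket")

# Upload each extracted JSON file under the source database's prefix
for filename in os.listdir(metadata_folder):
    aws.s3.BucketObject(
        f"{source_database}-{filename}",
        bucket=bucket.id,
        key=f"{source_database}/{filename}",
        source=pulumi.FileAsset(os.path.join(metadata_folder, filename)),
    )
```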
### Stacks
This project contains the following stacks:

- `metadata-dev`
- `metadata-preprod`
- `metadata-prod`
- `metadata-sandbox`
Created: July 8, 2024