Skip to content

Metadata

This folder consists of two parts:

  • get_metadata.py which extracts table metadata using mojap-metadata to a collection of json files
  • A Pulumi project to upload the extracted metadata to Amazon S3

Each collection of metadata you extract here will then:

  • be uploaded to Amazon S3 using pulumi (as part of this folder). This will then be used in the extraction validation (as part of pipelines)
  • be used in the DMS data mappings (as part of pipelines)
  • update the glue data catalogue (as part of glue_database using etl_manager)

Getting metadata

To extract metadata for the first time or update existing metadata:

  1. Create a new virtual environment if you haven't already

    python -m venv venv

  2. Activate the virtual environment

    source venv-sql/bin/activate

  3. Install the python dependencies

    pip install -r requirements-sql.txt

  4. Add an object list to object_lists.py of the relevant tables if it doesn't exist

  5. Specify the database metadata as a dictionary in database_options.py

  6. Create a .env file in the root directory with the database username and password:

    DB_USERNAME=<username>
    DB_PASSWORD=<password>
    

  7. Open get_metadata.py. Update the database_options import to select the relevant dictionary

  8. Run the python script

The result should be a new/updated folder of metadata as a subfolder of a metadata directory.

Specifying which objects to include

Objects consist of both tables and materialised views.

These are stored in object_lists.py. To add or remove objects update the relevant list for the database environment.

Writing an options dictionary

Your options dictionary should have the following keys:

  • database (str): name of the database you're connecting to - this will become the name of the folder where your table lists will be stored.
  • title (str): the name of this collection of metadata - this will become the name of the subfolder within 'database' where your metadata will be stored.
  • description (str): how this metadata collection should be described in the Glue catalogue.
  • schema (str): the name of the database schema you're extracting from - that's schema in the Oracle sense of a subsection of the database.
  • read_bucket (str): name of the S3 bucket where the actual data (not the metadata) for these tables will end up. The bucket name will end up in the database.json file for this metadata collection.
  • s3_folder (str): prefix of the items in the read_bucket that are relevant to this metadata.
  • objects (list): list that holds the names of the objects (both tables and
    materialised views) that you want included. If you've added a new list of objects, make sure you also import it at the top of database_options.py.
  • dialect: sqlalchemy dialect see https://docs.sqlalchemy.org/en/20/dialects/
  • DSN: The combination of the {host}:{port}/{database} used by sqlalchemy to create an engine

Upload metadata to Amazon S3

For a given environment, metadata is uploaded to S3 to be used by the AWS Lambda function that performs validation checks against tables extracted by AWS Datbase Migration Service. If you are uploading new metadata for the first time, you will need to specify the source database and metadata folder (located in this project) within the pulumi config file.

Stacks

This project contains the following stacks:

  • metadata-dev
  • metadata-preprod
  • metadata-prod
  • metadata-sandbox

Last update: July 8, 2024
Created: July 8, 2024