Architecture

What does Managed Pipelines do?

The following diagram illustrates the logical architecture of Managed Pipelines.

[Diagram: Managed Pipelines logical architecture]

Managed Pipelines deploys infrastructure and operates Extract, Transform, Load (ETL) pipelines which:

  • Connect to the source database
  • Extract the full data from the source database, then use change data capture (CDC) to extract ongoing changes. This is an efficient way of moving data between systems and allows downstream applications to track changes in the data
  • Extract the source database metadata to be used in the rest of the pipeline and update the metadata store
  • Validate the extracted data against the metadata
  • Curate the data to make it convenient for analysis
    • Determine which records were deleted since the last upload, in case the extraction step is unable to detect ongoing changes
    • Implement a Type 2 Slowly Changing Dimension (SCD2) to retain the full history of the data. When a row is updated or deleted on the source database, the current record on the AP is "closed" and a new record is inserted with the changed data values (see the sketch after this list)
  • Upload the converted data to the Analytical Platform data lake on a daily schedule
  • Expose the data and metadata for analytical services to use
  • Apply Logging, Monitoring and Alerting (LMA) in accordance with good practice
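
To make the two curation steps concrete, here is a minimal sketch, not the Managed Pipelines implementation: it assumes each table fits in a pandas DataFrame, that there is a single current row per key, and it uses invented bookkeeping column names (`is_current`, `valid_from`, `valid_to`).

```python
import pandas as pd

def apply_scd2(history: pd.DataFrame, snapshot: pd.DataFrame,
               key: str, extracted_at: pd.Timestamp) -> pd.DataFrame:
    """Close changed or deleted records and insert new versions.

    `history` is the SCD2 table; `snapshot` is the latest extract
    from the source database. Column names are illustrative only.
    """
    history = history.copy()
    current = history[history["is_current"]].set_index(key)
    latest = snapshot.set_index(key)
    value_cols = list(latest.columns)

    # Deletion detection: keys in the current history that are missing
    # from the latest snapshot were deleted at source since last upload.
    deleted_keys = current.index.difference(latest.index)

    # Changed rows: keys present in both whose values differ.
    shared = current.index.intersection(latest.index)
    changed_keys = [k for k in shared
                    if not current.loc[k, value_cols].equals(latest.loc[k, value_cols])]

    # "Close" the current record for every deleted or changed key.
    closing = history[key].isin(list(deleted_keys) + changed_keys) \
        & history["is_current"]
    history.loc[closing, "is_current"] = False
    history.loc[closing, "valid_to"] = extracted_at

    # Insert an open record for every changed or brand-new key.
    new_keys = latest.index.difference(current.index)
    inserts = latest.loc[changed_keys + list(new_keys)].reset_index()
    inserts["valid_from"] = extracted_at
    inserts["valid_to"] = pd.NaT
    inserts["is_current"] = True
    return pd.concat([history, inserts], ignore_index=True)
```

In practice Managed Pipelines performs this curation in Amazon Athena via create-a-derived-table and dbt (see below) rather than in memory; the sketch only illustrates the close-and-insert pattern.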

How is Managed Pipelines implemented?

Managed Pipelines makes use of various serverless AWS data analytics services. This means AWS handles the heavy lifting of:

  • Providing and managing scalable, resilient, secure, and cost-effective infrastructural components
  • Ensuring infrastructural components natively integrate with each other

Managed Pipelines uses the following AWS Services:

Managed Pipelines also:

  • Uses create-a-derived-table to curate the data in Amazon Athena, orchestrated with dbt
  • Uses different AWS accounts on the Analytical Platform to facilitate and isolate resource management
  • Provisions dev and preprod pipelines for testing deployment changes before deploying to production
  • Extracts metadata from the source database to be used in various places along the pipeline. Please refer to metadata for more details
  • Uses GitHub Actions to automate software workflows and run CI/CD pipelines. Please refer to deployment for more details
  • Uses Pulumi to define and deploy Infrastructure as Code (IaC); a minimal sketch follows this list. Please refer to using pulumi for more details
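
As an illustration of the last point, the sketch below declares a single hypothetical resource with Pulumi's Python SDK: an S3 landing bucket for extracted data. The resource name, bucket settings and tags are invented for the example and do not reflect the actual Managed Pipelines stacks.

```python
import pulumi
import pulumi_aws as aws

# Hypothetical S3 bucket where full loads and CDC files land.
raw_bucket = aws.s3.Bucket(
    "raw-history",                      # invented resource name
    acl="private",
    versioning=aws.s3.BucketVersioningArgs(enabled=True),
    tags={"environment": "dev", "component": "managed-pipelines"},
)

# Export the bucket name so other stacks (e.g. the curation step)
# can reference it.
pulumi.export("raw_bucket_name", raw_bucket.id)
```

Running `pulumi up` against a stack containing this program would create (or update) the bucket, which is how the dev, preprod and production environments can be provisioned from the same code.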

The following diagram summarises the physical architecture for a single database and environment:

[Diagram: Managed Pipelines physical architecture]

Please refer to components for a deeper dive into the individual components.

