Architecture
Find out about the data and analytics engineering architecture, key tools and services
Context
As explained in the About page, our data and analytics engineers develop analytical pipelines and self-service tools to acquire and transform data, making it available on the Analytical Platform. We implement a data lake-centric approach to manage our data, using a cloud-based object store for storage. The pipelines follow a standard ‘ELT’ process (Extract, Load, Transform), producing cleaner, more standardised data in the format that downstream analysts expect and can rely on.
Data engineers are responsible for extracting, standardising and loading source data into the Analytical Platform. Analytics engineers then apply data modelling techniques to further refine the data, making it more conformed and accessible. Data users can then consume this transformed data, apply further processing or machine learning tasks, and disseminate their findings through reports or dashboards.
We can summarise this workflow using the Medallion architecture, in which data transitions through Bronze, Silver, and Gold layers, increasing in structure and quality at each stage.
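To illustrate how data might move through these layers, here is a minimal sketch using pandas; the columns, cleaning rules and aggregation are hypothetical stand-ins for the standardisation and modelling steps described above, not our actual pipeline code.

```python
import pandas as pd

# Bronze: raw data largely as extracted from a (hypothetical) source system.
bronze = pd.DataFrame(
    {
        "case_id": ["A1", "A2", "A2", "A3"],
        "received_date": ["2024-01-03", "2024-01-05", "2024-01-05", "not recorded"],
        "region": ["  North ", "south", "south", "North"],
    }
)

# Silver: standardised types and values, duplicates removed.
silver = (
    bronze.drop_duplicates()
    .assign(
        received_date=lambda df: pd.to_datetime(df["received_date"], errors="coerce"),
        region=lambda df: df["region"].str.strip().str.title(),
    )
    .dropna(subset=["received_date"])
)

# Gold: modelled, analysis-ready output, e.g. distinct cases received per region.
gold = silver.groupby("region", as_index=False).agg(
    cases_received=("case_id", "nunique")
)

print(gold)
```

The general pattern is the same at scale: the Bronze layer holds data much as it was received, the Silver layer applies standardisation such as type conversion and de-duplication, and the Gold layer holds modelled outputs ready for analysis.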
Key Tools and Services
We use a variety of tools to extract and transform our data, depending on the data source, volume, frequency and other characteristics. To support this infrastructure, we rely on a mix of open source tools and AWS services to ensure scalability, resilience, security, and cost-effectiveness.
The data and analytics engineering container diagram summarises some of these tools and services:
- Data and metadata are collected from multiple data sources, both within the MoJ and external to it, including Amazon S3, file shares, relational databases, APIs, and Azure Blob Storage.
- This data is ingested into our data lake using an approach that depends on the data source: for example, AWS Database Migration Service (AWS DMS) for relational databases, SFTP for file shares, or Amazon API Gateway. Data in Amazon S3 can be imported directly (a minimal example of a direct import is sketched after this list).
- The data lake consists of Amazon S3 for data storage, the AWS Glue Data Catalog as a metadata repository, Apache Hive and Iceberg table formats to provide a SQL-like interface, and AWS IAM to secure access to the data (an example query against the data lake is sketched after this list).
- We use Amazon Athena along with dbt for SQL-based transformations. Data can be pre-processed using Python scheduled with Amazon Managed Workflows for Apache Airflow (a minimal scheduling sketch follows this list). We also make these transformation tools available to data users to run their own analytical workflows, including machine learning workflows. The transformed data is then saved back to the data lake.
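To make the ingestion step more concrete, the sketch below shows the simplest case mentioned in the list: importing data that already sits in Amazon S3 by copying it into the data lake with boto3. The bucket names and keys are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical source and data lake locations.
SOURCE_BUCKET = "example-source-bucket"
SOURCE_KEY = "exports/cases/2024-01-31/cases.csv"
LAKE_BUCKET = "example-data-lake-raw"
LAKE_KEY = "bronze/cases/extraction_date=2024-01-31/cases.csv"

# Server-side copy from the source bucket into the raw area of the data lake.
s3.copy_object(
    Bucket=LAKE_BUCKET,
    Key=LAKE_KEY,
    CopySource={"Bucket": SOURCE_BUCKET, "Key": SOURCE_KEY},
)
```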
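The example query mentioned in the data lake item could look like the following sketch, which uses the aws-sdk-pandas (awswrangler) library to inspect the Glue Data Catalog and run a query through Amazon Athena; the database and table names are hypothetical.

```python
import awswrangler as wr

# Hypothetical Glue database registered over objects in S3.
DATABASE = "example_curated"

# List the tables the Glue Data Catalog knows about in this database.
tables = wr.catalog.tables(database=DATABASE)
print(tables[["Table", "TableType"]])

# Run a SQL query through Athena and return the result as a pandas DataFrame.
df = wr.athena.read_sql_query(
    sql="SELECT region, COUNT(*) AS cases FROM cases GROUP BY region",
    database=DATABASE,
)
print(df.head())
```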
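Finally, the scheduling sketch referenced in the transformation item: a minimal Airflow 2-style DAG with a Python pre-processing task followed by a dbt run. The schedule, task bodies and dbt project path are hypothetical placeholders rather than our actual pipeline code.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def pre_process():
    # Placeholder for Python pre-processing, e.g. standardising raw files
    # before they are transformed with SQL.
    print("pre-processing raw data")


with DAG(
    dag_id="example_transform_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    pre_process_task = PythonOperator(
        task_id="pre_process",
        python_callable=pre_process,
    )

    # Run the dbt project that defines the SQL transformations.
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="cd /opt/dbt/example_project && dbt run",
    )

    pre_process_task >> dbt_run
```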
For more details about our technology stack, please visit our Technology Radar.