Unify Your Financial Data Without Moving It: The Open Source Federated Query Engine

Financial institutions grapple with critical data trapped in disconnected legacy systems—from core banking to trading platforms. Aggregating this data for essential regulatory reporting is a slow, manual, and error-prone process, risking costly delays and compliance penalties.


This playbook introduces a cost-effective, cloud-native approach that queries your data where it lives. With the Trino federated query engine, you can run analytical queries that join data across your legacy systems in a single statement, provided each source is reachable through a Trino connector, without undertaking a large, costly ETL project. Paired with Apache Superset for visualization and the AWS Glue Data Catalog for unified metadata, this stack provides a single, virtual source of truth for faster, more accurate reporting.

Expected Outcomes

  • Eliminate manual data aggregation and normalization tasks for reporting.
  • Achieve a unified, near real-time view of data across all legacy systems.
  • Accelerate regulatory reporting cycles and reduce the risk of compliance errors.
  • Significantly lower total cost of ownership by avoiding expensive licensing for proprietary data integration tools.
  • Empower business analysts with direct, self-service access to previously siloed information.

Core Tools in This Stack

Trino (formerly PrestoSQL)

Trino is a high-performance, distributed SQL query engine for big data analytics. It allows querying data where it lives, including Hadoop, S3, Cassandra, relational databases, and more, enabling a single query to access and join data from multiple disparate sources.
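As a rough illustration of a cross-source query, the sketch below composes a single Trino SQL statement that joins a table in one catalog with a table in another. The catalog, schema, table, and column names (corebank, trading, accounts, positions) are hypothetical placeholders and do not come from this playbook. The optional run_query helper assumes the third-party trino Python client is installed and a coordinator is reachable at the given host.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return a fully qualified Trino table name: catalog.schema.table."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Compose one SQL statement that joins tables from two catalogs.

    All catalog, schema, table, and column names are hypothetical examples.
    """
    accounts = qualified("corebank", "public", "accounts")
    positions = qualified("trading", "risk", "positions")
    return (
        "SELECT a.account_id, a.customer_name, SUM(p.market_value) AS exposure\n"
        f"FROM {accounts} AS a\n"
        f"JOIN {positions} AS p ON p.account_id = a.account_id\n"
        "GROUP BY a.account_id, a.customer_name"
    )


def run_query(sql: str, host: str = "localhost", port: int = 8080,
              user: str = "analyst"):
    """Execute the statement through the `trino` client (assumed installed)."""
    # Imported lazily so the rest of the sketch loads without the package.
    from trino.dbapi import connect

    conn = connect(host=host, port=port, user=user)
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


if __name__ == "__main__":
    print(build_federated_query())
```

Each three-part name tells Trino which connector to route that part of the query to, which is what lets one statement span several systems.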

Key Features
  • Cross-Source Federated Queries
  • Massively Parallel Processing (MPP) Architecture
  • Separation of Compute and Storage
  • Extensive Connector Ecosystem
  • ANSI SQL Compliant
  • High Performance for Interactive Analytics
  • Pluggable and Extensible

Ideal For

Company Size: Medium, Large

Industries: Technology & Software, Business & Professional Services, Retail & E-commerce, Creative & Media

Pricing

Model: Open Source, Commercial Support Available

Tier: Free

Ease of Use

Moderate


Apache Superset

Apache Superset is an open-source, modern data exploration and visualization platform that allows users of all skill levels to create interactive dashboards and beautiful visualizations from a wide variety of data sources.

Key Features
  • Interactive Dashboards
  • Wide Range of Visualizations
  • SQL Lab
  • No-Code Chart Builder
  • Lightweight Semantic Layer
  • Extensive Database Support
  • Cloud-Native Architecture
  • Extensible Security Model

Ideal For

Company Size: Small, Medium, Large

Industries: Technology & Software, Business & Professional Services, Retail & E-commerce, Creative & Media, Education & Non-Profit, Health & Wellness

Pricing

Model: Open Source

Tier: Free (Self-hosted)

Ease of Use

Moderate


AWS Glue Data Catalog

AWS Glue Data Catalog is a fully managed, persistent metadata store within the AWS ecosystem. It acts as a central repository for structural and operational metadata, enabling users to discover, search, and query data assets across various sources like Amazon S3, RDS, and Redshift. It is a core component of the broader AWS Glue serverless data integration service.

Key Features
  • Automatic Schema Discovery with Crawlers
  • Centralized Metadata Repository
  • Serverless Architecture (No infrastructure to manage)
  • Apache Hive Metastore Compatibility
  • Integration with AWS Analytics (Athena, EMR, Redshift Spectrum)
  • Fine-grained Access Control via AWS Lake Formation and IAM
  • Schema Versioning and Evolution Tracking
  • Built-in Data Quality Rules and Monitoring

Ideal For

Company Size: Small, Medium, Large

Industries: Technology & Software, Business & Professional Services, Retail & E-commerce, Creative & Media, Education & Non-Profit, Health & Wellness, Other

Pricing

Model: Pay-as-you-go, Free Tier

Tier: Variable

Ease of Use

Moderate

The Workflow

graph TD
  subgraph "Open Source Federated Query Engine"
    direction LR
    N0["Trino (PrestoSQL)"]
    N1["Apache Superset"]
    N2["AWS Glue Data Catalog"]
    N1 -- "Sends SQL queries" --> N0
    N0 -- "Uses as metastore for schema" --> N2
    N0 -- "Returns query results" --> N1
  end
  classDef blue fill:#3498db,stroke:#2980b9,stroke-width:2px,color:#fff;
  classDef green fill:#2ecc71,stroke:#27ae60,stroke-width:2px,color:#fff;
  classDef orange fill:#f39c12,stroke:#d35400,stroke-width:2px,color:#fff;
  class N0 blue;
  class N1 blue;
  class N2 blue;

Integration Logic

  • Trino Hive Connector

    This integration configures Trino's Hive connector to use the AWS Glue Data Catalog as its metastore, which lets Trino discover table schemas and data locations for files in an S3-based data lake. Apache Superset connects to the Trino cluster as a standard database source.

    When a user in Superset builds a chart or runs a query, Superset sends the SQL request to Trino. Trino consults the Glue Data Catalog to plan the query, reads the required data directly from Amazon S3, and executes the query across its distributed workers. The results are returned to Superset for visualization, enabling interactive analysis over large datasets without moving the data into a traditional data warehouse. Legacy relational systems are reached the same way through additional Trino connectors, each configured as its own catalog, so a single query can join them with the data in S3.
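The wiring described above might look roughly like the following. This sketch renders a Trino catalog properties file that points the Hive connector at Glue, and builds the SQLAlchemy URI that Superset would use to reach Trino. The keys connector.name, hive.metastore, and hive.metastore.glue.region are Trino Hive connector properties; the region, host, user, and the catalog name "lake" are assumptions for illustration. S3 file-system settings differ between Trino versions and are deliberately left out.

```python
from urllib.parse import quote


def render_glue_catalog(region: str = "us-east-1") -> str:
    """Render a Hive-connector catalog file that uses Glue as the metastore."""
    props = {
        "connector.name": "hive",
        "hive.metastore": "glue",
        "hive.metastore.glue.region": region,  # assumed region
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


def superset_trino_uri(user: str, host: str, port: int = 8080,
                       catalog: str = "lake") -> str:
    """Build the SQLAlchemy URI Superset uses for a Trino database.

    Host, port, user, and catalog are hypothetical values.
    """
    return f"trino://{quote(user)}@{host}:{port}/{catalog}"


if __name__ == "__main__":
    print(render_glue_catalog())
    print(superset_trino_uri("analyst", "trino.internal"))
```

In a typical deployment the rendered text would be saved as etc/catalog/lake.properties on the Trino nodes, since the file name determines the catalog name, and the URI would be entered when adding a database in Superset.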
