Seamless Data Sync: Snowflake and AWS Glue Catalog with Iceberg Tables

A Comprehensive Guide to Efficiently Managing Iceberg Tables with Real-Time Refresh Between Snowflake and AWS Glue Data Catalog

By André Santos

Published on July 13, 2024

Why Iceberg Tables in Snowflake?

In recent years, data engineering has embraced table formats like Iceberg, Hudi, and Delta Lake, especially with the rise of the Lakehouse Architecture. These technologies combine the scalability and cost-effectiveness of object storage (e.g., S3) with the reliability and transactional capabilities of data warehouses.

Iceberg tables have become widely used due to their advanced features like atomic commits, schema evolution, and efficient data organization. By applying ACID (Atomicity, Consistency, Isolation, Durability) principles to object storage, Iceberg tables provide a robust solution for managing large-scale data workloads. Earlier this year, Snowflake introduced support for Apache Iceberg tables. This integration allows Snowflake users to leverage Iceberg tables directly within Snowflake, eliminating the need for complex workarounds or manual synchronization processes, which can increase operational complexity and data duplication.

The integration also ensures a Single Source of Truth (SSOT) for datasets, which can be used across different use cases and query engines (e.g., Spark, Trino), ensuring consistent and reliable data access throughout the organization.

Data Catalog - Hadoop vs Snowflake vs External?

Snowflake offers different catalog options for integrating Iceberg tables, each with its own set of trade-offs and considerations.

Hadoop Catalog

The simplest catalog for Iceberg is the Hadoop catalog. Despite its name, it is essentially a collection of files containing metadata on how to read the data files. This method provides direct access without relying on external catalogs like AWS Glue or Snowflake's internal catalog, simplifying the setup and allowing for straightforward integration with existing Hadoop-based environments. However, it is not recommended for production use due to its limitations in managing large-scale data operations and lack of advanced features.
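
To make this concrete, here is a minimal pyiceberg sketch (the S3 path is a placeholder): with a file-based catalog, a reader can load a table straight from its current metadata file, with no catalog service involved.

from pyiceberg.table import StaticTable

# Load the table directly from its metadata file; no Glue or Snowflake
# catalog is consulted. The S3 path is a placeholder.
table = StaticTable.from_metadata(
    "s3://my-bucket/warehouse/db/customers/metadata/v3.metadata.json"
)
print(table.schema())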

Snowflake Catalog

One option is to use Snowflake itself as the catalog, allowing seamless integration with Iceberg tables. With this approach, Snowflake manages all table maintenance operations, such as compaction and vacuuming. This simplifies data management tasks, but it comes with limitations. For instance, only Snowflake can write to Iceberg tables using this method, restricting compatibility with other data engines. Additionally, not all engines support Snowflake's SDK integration, further limiting interoperability.

AWS Glue Data Catalog

Alternatively, organizations can opt for an external catalog like the AWS Glue Data Catalog. While this approach relinquishes Snowflake's control over table maintenance operations, it opens up opportunities for greater flexibility and interoperability. With an external catalog, datasets from across the company can be connected for analytics, reading, and transformation. This allows organizations to leverage the strengths of different data engines and tools, facilitating collaboration and data-driven decision-making across the enterprise.

After carefully weighing the pros and cons of each option, and recognizing the paramount importance of establishing a Single Source of Truth (SSOT) while avoiding vendor lock-in, I've embraced the external catalog strategy.
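
For reference, here is a minimal sketch of wiring Snowflake to the Glue Data Catalog (the namespace, account ID, and IAM role ARN are placeholders; the integration name matches the DDL example later in this post):

import os

import snowflake.connector

# Placeholder namespace, account ID, and role ARN; adjust to your setup.
ddl = """
CREATE CATALOG INTEGRATION iceberg_glue_catalog
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = 'catalog_sync_test'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/snowflake-glue-access'
  GLUE_CATALOG_ID = '123456789012'
  ENABLED = TRUE
"""

with snowflake.connector.connect(
        user=os.environ["SNOWFLAKE_USER"],
        password=os.environ["SNOWFLAKE_PASSWORD"],
        account=os.environ["SNOWFLAKE_ACCOUNT"]) as con:
    con.cursor().execute(ddl)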

AWS Glue Catalog Integration - Stay 'stale' or refresh it on your own

Snowflake's default behavior relies on manual ALTER ICEBERG TABLE <name> REFRESH SQL commands to capture updates from Iceberg tables. However, this approach can leave data "stale", since Snowflake may not reflect the most recent changes in real time.

Imagine a scenario where Iceberg tables constantly receive updates, especially in data streaming scenarios. Despite these continuous updates, Snowflake only reflects the state of the table as of the last executed refresh. This synchronization lag means users might be working with outdated data, potentially impacting the accuracy and reliability of their analyses and decisions.

So, the question arises: should we hand the responsibility of running this refresh for each table over to the users reading it? Of course not!

Workaround Solution

To maintain a synchronized data environment, integrating the AWS Glue Data Catalog with Snowflake is essential. The process involves four components:

  1. Glue Data Catalog: Central repository for dataset metadata. Updates to tables are registered here.
  2. CloudWatch Rule: Monitors for table update events in the Glue Data Catalog and triggers the next step (an event-pattern sketch follows this list).
  3. Lambda Function: Activated by the CloudWatch rule, this function executes a refresh operation in Snowflake.
  4. Snowflake: Receives the refresh command from the Lambda function, keeping its metadata in sync with the Glue Data Catalog.
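
For step 2, here is a minimal sketch of the event pattern such a rule can match on (the rule name and the typeOfChange filter are illustrative; the repository's OpenTofu code defines the actual rule):

import json

import boto3

events = boto3.client("events")

# Match table updates in the Glue Data Catalog. The rule name and the
# typeOfChange filter are illustrative.
pattern = {
    "source": ["aws.glue"],
    "detail-type": ["Glue Data Catalog Table State Change"],
    "detail": {"typeOfChange": ["UpdateTable"]},
}

events.put_rule(
    Name="glue-table-change",
    EventPattern=json.dumps(pattern),
)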

How It Works

  1. Update Table Events: Glue Data Catalog logs updates.
  2. Trigger Lambda: CloudWatch rule triggers the Lambda function on detecting updates.
  3. Execute Refresh: Lambda function refreshes Snowflake's metadata.

This process ensures that Snowflake's metadata is always current, reducing the risk of data staleness and enhancing data reliability.

The Lambda function essentially triggers a refresh command after receiving a Glue Catalog change event, with the following approach (connection settings are read from environment variables; the variable names are illustrative):

import os

import snowflake.connector

# Connection settings are read from environment variables
# (illustrative names; set them to match your deployment).
snowflake_user = os.environ["SNOWFLAKE_USER"]
snowflake_password = os.environ["SNOWFLAKE_PASSWORD"]
snowflake_account = os.environ["SNOWFLAKE_ACCOUNT"]
snowflake_database = os.environ["SNOWFLAKE_DATABASE"]
snowflake_role = os.environ["SNOWFLAKE_ROLE"]
snowflake_warehouse = os.environ["SNOWFLAKE_WAREHOUSE"]


def handler(event, context):
    # Extract the database and table names from the Glue event payload
    table = event['detail']['tableName']
    database = event['detail']['databaseName']

    # Refresh statement for the matching Snowflake Iceberg table
    query = f"ALTER ICEBERG TABLE {database}.{table} REFRESH"

    print(f"Query to be executed: {query}")

    with snowflake.connector.connect(
            user=snowflake_user,
            password=snowflake_password,
            account=snowflake_account,
            database=snowflake_database,
            role=snowflake_role,
            warehouse=snowflake_warehouse) as con:
        with con.cursor() as cur:
            result = cur.execute(query).fetchall()
        print(f"Query executed successfully: {result}")

How-to solution

Requirements

  • AWS Account
  • Snowflake Account
  • OpenTofu
  • Docker
  • AWS SDK

Note: Ensure you already have an Iceberg table configured in Snowflake using the Glue Catalog, as described here.

Setup Environment

  1. Clone the Repository

     git clone https://github.com/datainnutshell/snowflake-glue-catalog-refresh.git
  2. Authenticate Your Terminal with AWS Credentials
    Follow the instructions here to authenticate your terminal with AWS credentials. For a simple and safe way to manage SDK credentials, consider using aws-vault.

  3. Create And Set Snowflake Variables in settings.tfvars
    To add your Snowflake connection variables, run the following command:

     echo 'snowflake_account   = "your_snowflake_account"
     snowflake_user      = "your_snowflake_user"
     snowflake_password  = "your_snowflake_password"
     snowflake_database  = "your_snowflake_database"
     snowflake_role      = "your_snowflake_role"
     snowflake_warehouse = "your_snowflake_warehouse"' > settings.tfvars
  4. Build Lambda and Deploy with OpenTofu

     make all

Check Deployment

  • AWS Console

    • Go to Lambda > Functions and search for glue_catalog_change_handler.
    • Ensure the new function is created and connected to CloudWatch events for table changes in the Glue Catalog.
  • AWS Athena

    • Select the workgroup example-workgroup and locate the saved queries.
    • Execute the create table query in Athena.
  • Snowflake

    • Use the DDL instructions here to create a new Iceberg table integration.

      Example:

      CREATE ICEBERG TABLE ICEBERG_BLOG_DEMO.catalog_sync_test.customers
      EXTERNAL_VOLUME='iceberg_storage_volume'
      CATALOG='iceberg_glue_catalog'
      CATALOG_TABLE_NAME='customers';
  • Insert Data and Refresh

    • Execute the insert query in Athena (a scripted alternative is sketched after this list).
    • Snowflake's Iceberg table will be automatically refreshed.
  • Clean up environment

      make clean
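
If you prefer to script the insert from the "Insert Data and Refresh" step instead of using the Athena console, here is a minimal sketch (the two-column customers schema is hypothetical; use your table's actual columns):

import boto3

athena = boto3.client("athena")

# Kick off the insert that should trigger the refresh chain. The workgroup
# name comes from the walkthrough above; the column values assume a
# hypothetical two-column schema.
athena.start_query_execution(
    QueryString=(
        "INSERT INTO catalog_sync_test.customers "
        "VALUES (1, 'Ada Lovelace')"
    ),
    WorkGroup="example-workgroup",
)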

Conclusion

This approach serves as a workaround to address the current limitations in Snowflake's integration with external catalogs, particularly until the new Polaris catalog is fully implemented. Despite its interim nature, this method is highly versatile and can be adapted for various other integrations or automations using the Glue Catalog.

It's important to note that the refresh time can vary depending on the number of commits that haven't been refreshed.

Finally, be mindful of the costs associated with running these refresh commands. Run them on a warehouse that is already frequently in use, so that each refresh doesn't resume an idle warehouse and incur unexpected expenses. By following these guidelines, you can efficiently manage your data synchronization processes while keeping costs under control 😇

Thank you for following along with this guide.

If you found it helpful, please share it on social media to help others benefit from this information!