
AWS Snowflake Lakehouse: 2 Practical Apache Iceberg Integration Patterns

DEV Community · by Aki · March 31, 2026 · 14 min read


Original Japanese article: AWSのレイクハウス(Apache Iceberg)をSnowflakeと連携する2つのパターンを整理する (Organizing two patterns for integrating an AWS lakehouse (Apache Iceberg) with Snowflake)

Introduction

I'm Aki, an AWS Community Builder (@jitepengin).

In recent years, it has become increasingly common to build lakehouse architectures centered around Apache Iceberg.

Before the rise of lakehouse architecture, it was common to design systems where data was consolidated into a specific platform, such as:

  • an architecture centered around Amazon Redshift on AWS

  • an architecture centered around internal tables in Snowflake

However, with the advent of Apache Iceberg, this assumption is rapidly changing.

Now that data on Amazon S3 can be directly accessed from multiple engines, what matters is no longer simply product selection. Instead, the architecture design itself has become the central focus: where the data resides, who owns the write responsibility, and who holds governance authority.

In this article, focusing on the coexistence of AWS and Snowflake, I will organize the following:

  • two patterns based on S3 × Iceberg

  • connectivity with Power BI Service

  • future prospects including AI utilization

Why AWS × Snowflake Coexistence Is Necessary

The greatest value of Apache Iceberg lies in its ability to separate data from the query engine.

For example, while keeping the physical data stored in S3, the same data can be accessed from multiple tools such as Athena, AWS Glue / Spark, Redshift, and Snowflake.

In other words, the physical data can be consolidated in a single location while the most suitable analytics platform is chosen for each use case.

As a result, the architectural discussion has shifted from “which product should we use?” to “where should data sovereignty reside, and where should analytical ownership be placed?”

The benefits of using Snowflake here include:

  • user-friendly UI/UX

  • powerful SQL analytics capabilities

  • integration with Cortex AI

In particular, I believe an architecture where AWS serves as the data sovereignty layer and Snowflake as the analytics layer is a natural fit.

Two Patterns for Snowflake × S3 Integration

Here, I will organize the two commonly used patterns when integrating Snowflake with S3.

Pattern 1: Glue Catalog Integration

In this pattern, Iceberg tables stored on S3 are referenced from Snowflake through AWS Glue Data Catalog.

The advantages of this architecture are:

  • relatively simple configuration

  • S3 becomes the Single Source of Truth

  • because Snowflake cannot write, AWS retains sovereignty over data management (this can also be considered a disadvantage)

  • since user access is consolidated into Snowflake, access control can be centralized there

In other words, Snowflake focuses solely on the role of data analytics (query engine), while AWS retains authority over data management.

Setup Procedure

Step 1: Create an External Volume (S3 Access Configuration)

Run the following on the Snowflake side:

```sql
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<catalog S3 path>'
      STORAGE_AWS_ROLE_ARN = '<role ARN to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
```


Step 2: Create Glue Catalog Integration

Run the following on the Snowflake side:

```sql
CREATE OR REPLACE CATALOG INTEGRATION glue_catalog_int -- arbitrary name
  CATALOG_SOURCE = GLUE
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  TABLE_FORMAT = ICEBERG
  GLUE_AWS_ROLE_ARN = '<role ARN to use>'
  GLUE_CATALOG_ID = '<Glue catalog ID to use>'
  GLUE_REGION = 'ap-northeast-1'
  ENABLED = TRUE;
```


Step 3: Retrieve Required Information for AWS Trust Policy

```sql
DESC EXTERNAL VOLUME sample_iceberg_volume;
```


→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID

```sql
DESC CATALOG INTEGRATION glue_catalog_int;
```


→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID

Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.

Please configure the Trust Policy for the role used on the AWS side.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<STORAGE_AWS_IAM_USER_ARN>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>" }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<GLUE_AWS_IAM_USER_ARN>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>" }
      }
    }
  ]
}
```

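A common mistake here is pasting only one of the two statements, or reusing the same external ID for both. As a sanity check, the policy can be assembled programmatically. Below is a minimal sketch in Python; the ARNs and external IDs are hypothetical placeholders standing in for the values you copied from the two DESC commands:

```python
import json

def build_trust_policy(pairs):
    """Build an IAM trust policy with one sts:AssumeRole statement per
    (principal ARN, external ID) pair -- here, the storage pair and the
    Glue catalog pair returned by the two DESC commands."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": arn},
                "Action": "sts:AssumeRole",
                "Condition": {"StringEquals": {"sts:ExternalId": ext_id}},
            }
            for arn, ext_id in pairs
        ],
    }

# Hypothetical values -- replace with the real DESC output.
policy = build_trust_policy([
    ("arn:aws:iam::111122223333:user/storage-user", "STORAGE_EXT_ID"),
    ("arn:aws:iam::111122223333:user/glue-user", "GLUE_EXT_ID"),
])
print(json.dumps(policy, indent=2))
```

This guarantees the two statements stay structurally identical and differ only in principal and external ID.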

Step 4: Create Database

Run the following on the Snowflake side:

```sql
CREATE DATABASE IF NOT EXISTS icebergdb;
```


Step 5: Create Table

Run the following on the Snowflake side:

```sql
CREATE OR REPLACE ICEBERG TABLE icebergdb.public.yellow_tripdata -- arbitrary name
  EXTERNAL_VOLUME = 'sample_iceberg_volume'
  CATALOG = 'glue_catalog_int'
  CATALOG_TABLE_NAME = 'yellow_tripdata'
  AUTO_REFRESH = TRUE;
```


Step 6: Verify Data

Run the following on the Snowflake side:

```sql
SELECT * FROM icebergdb.public.yellow_tripdata LIMIT 5;
```


We were able to read the data from S3!

Pattern 2: Catalog-Linked Database (Iceberg)

This pattern integrates Glue Catalog using the REST Catalog approach and manages Iceberg tables directly from the Snowflake side.

The advantages of this architecture are:

  • read and write operations from Snowflake are possible

  • the physical data remains stored in S3

  • Snowflake users can perform SQL-based updates and analytics

  • easier integration with Power BI and Cortex AI

In other words, the biggest feature is that analysis and updates can be performed from Snowflake while keeping the physical data in S3.

However, governance must be considered from both the AWS side and the Snowflake side.

Setup Procedure

Step 1: Create an External Volume (S3 Access Configuration)

```sql
CREATE EXTERNAL VOLUME IF NOT EXISTS sample_iceberg_volume
  STORAGE_LOCATIONS = (
    (
      NAME = 'my-s3-location'
      STORAGE_PROVIDER = 'S3'
      STORAGE_BASE_URL = '<catalog S3 path>'
      STORAGE_AWS_ROLE_ARN = '<role ARN to use>'
      STORAGE_AWS_EXTERNAL_ID = 'my_external_id'
    )
  );
```


Step 2: Create Glue Iceberg REST Catalog Integration

```sql
CREATE OR REPLACE CATALOG INTEGRATION glue_rest_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = '<Glue catalog namespace>'
  REST_CONFIG = (
    CATALOG_URI = '<Glue catalog URI>'
    CATALOG_API_TYPE = AWS_GLUE
    CATALOG_NAME = '<AWS account ID>'
  )
  REST_AUTHENTICATION = (
    TYPE = SIGV4
    SIGV4_IAM_ROLE = '<role ARN to use>'
    SIGV4_SIGNING_REGION = 'ap-northeast-1'
  )
  ENABLED = TRUE;
```


Step 3: Retrieve Required Information for AWS Trust Policy

```sql
DESC EXTERNAL VOLUME sample_iceberg_volume;
```


→ Note down STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID

```sql
DESC CATALOG INTEGRATION glue_rest_catalog_int;
```


→ Note down GLUE_AWS_IAM_USER_ARN and GLUE_AWS_EXTERNAL_ID

Since the two External IDs above will have different values, be sure to add both to the AWS IAM role Trust Policy.

Please configure the Trust Policy for the role used on the AWS side.

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<STORAGE_AWS_IAM_USER_ARN>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<STORAGE_AWS_EXTERNAL_ID>" }
      }
    },
    {
      "Effect": "Allow",
      "Principal": { "AWS": "<GLUE_AWS_IAM_USER_ARN>" },
      "Action": "sts:AssumeRole",
      "Condition": {
        "StringEquals": { "sts:ExternalId": "<GLUE_AWS_EXTERNAL_ID>" }
      }
    }
  ]
}
```


Step 4: Create Catalog Linked Database (Read/Write Enabled)

```sql
CREATE DATABASE my_iceberg_linked_db
  LINKED_CATALOG = (
    CATALOG = 'glue_rest_catalog_int',
    ALLOWED_WRITE_OPERATIONS = ALL
  )
  EXTERNAL_VOLUME = 'sample_iceberg_volume';
```


Step 5: Tables Are Automatically Discovered (Synced Every 30 Seconds)

```sql
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata" LIMIT 5;
```


We were able to read the data from S3!
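Because discovery runs on a sync interval rather than instantly, a freshly created table may not be visible for up to about 30 seconds. In automation it helps to poll rather than fail on the first miss. A generic polling sketch; `table_exists` below is a stand-in for whatever check you actually run (for example a `SHOW TABLES` query through your Snowflake connector):

```python
import time

def wait_until(check, timeout_s=90.0, interval_s=5.0, _sleep=time.sleep):
    """Poll check() until it returns True or timeout_s elapses.
    Returns True on success, False on timeout."""
    waited = 0.0
    while waited <= timeout_s:
        if check():
            return True
        _sleep(interval_s)
        waited += interval_s
    return False

# Stand-in check that succeeds on the third poll, simulating the
# table appearing after a sync cycle.
calls = {"n": 0}
def table_exists():
    calls["n"] += 1
    return calls["n"] >= 3

found = wait_until(table_exists, timeout_s=90, interval_s=0.01)
```

A timeout comfortably above the 30-second sync interval avoids spurious failures right after table creation.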

Step 6: Write Test

```sql
INSERT INTO my_iceberg_linked_db."icebergdb"."yellow_tripdata"
  (vendorid, tpep_pickup_datetime, tpep_dropoff_datetime, passenger_count,
   trip_distance, ratecodeid, store_and_fwd_flag, pulocationid, dolocationid,
   payment_type, fare_amount, extra, mta_tax, tip_amount, tolls_amount,
   improvement_surcharge, total_amount, congestion_surcharge, airport_fee)
VALUES
  (1, '2025-01-01 13:00:00', '2025-01-01 13:30:00', 2,
   3.5, 1, 'N', 100, 200,
   1, 15.0, 2.5, 0.5, 3.0, 0,
   1.0, 22.0, 2.5, 0);
```

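Hard-coded literals are fine for a smoke test like the statement above; from application code you would normally parameterize the insert instead. A sketch in the `%s`-placeholder style used by snowflake-connector-python; `build_insert` is a hypothetical helper, not part of any library, and the actual connection is omitted:

```python
def build_insert(table: str, row: dict):
    """Return a parameterized INSERT statement plus its bind values,
    keeping column order identical to the dict's insertion order."""
    cols = list(row)
    placeholders = ", ".join(["%s"] * len(cols))
    sql = f'INSERT INTO {table} ({", ".join(cols)}) VALUES ({placeholders})'
    return sql, list(row.values())

sql, params = build_insert(
    'my_iceberg_linked_db."icebergdb"."yellow_tripdata"',
    {"vendorid": 1, "passenger_count": 2, "trip_distance": 3.5},
)
# With a live connection this would be: cursor.execute(sql, params)
```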

Step 7: Verify Write

```sql
SELECT * FROM my_iceberg_linked_db."icebergdb"."yellow_tripdata"
WHERE tpep_pickup_datetime = '2025-01-01 13:00:00';
```


We were able to write data into S3!

On the Athena side, screenshots confirm the table contents before and after the write.

Which Pattern Should You Choose?

For practical use, the following breakdown makes the choice straightforward.

| Use case | Recommended pattern |
| --- | --- |
| AWS-led ETL workloads, with Snowflake primarily used for read/query access | Pattern 1 |
| BI / AI / SQL updates driven primarily from Snowflake | Pattern 2 |
| Governance needs to be centralized on AWS | Pattern 1 |
| Operations are mainly led by Snowflake users | Pattern 2 |
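The criteria above can be condensed into a toy decision helper. This is my own summary of the article's guidance, not an official rule, and the two boolean inputs are simplifications of richer organizational questions:

```python
def recommend_pattern(writes_from_snowflake: bool,
                      aws_centralized_governance: bool) -> int:
    """Return 1 (Glue Catalog integration, read-only from Snowflake)
    or 2 (catalog-linked database, read/write from Snowflake)."""
    if writes_from_snowflake:
        # Pattern 1 cannot write from Snowflake at all, so writes force Pattern 2.
        return 2
    if aws_centralized_governance:
        return 1
    # Snowflake-led operations without AWS-only governance: Pattern 2.
    return 2
```

The key asymmetry it encodes: write access from Snowflake is a hard requirement that only Pattern 2 satisfies, while everything else is a governance trade-off.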

Integration with Power BI Service

Traditionally, when referencing an S3-based lakehouse from Power BI Service, many architectures used Redshift as an intermediary.

In such cases, securely connecting directly from Power BI Service often required provisioning an EC2 instance and configuring an on-premises data gateway.

This introduces additional operational costs such as EC2 management.

Now, how about Snowflake?

Power BI provides a native connector for Snowflake, allowing direct authentication and connection from Power BI Service.

This eliminates the need for relay servers or on-premises gateways that are often required in Redshift-based architectures.

In other words, it becomes possible to exclude costly operational components such as on-premises gateways.

In addition, since the semantically organized Gold layer on the Snowflake side can be directly connected to Power BI, this also improves usability for BI users.

Thinking in Terms of Medallion Architecture

Personally, I consider the following architecture to be highly practical for real-world use.

  • Bronze: S3 + Iceberg

  • Silver: S3 + Iceberg (with Snowflake integration as needed)

  • Gold: S3 + Iceberg → Snowflake integration

By aligning the Gold layer with Snowflake in particular, it becomes easier to provide a semantic layer that BI users and business departments can readily consume.

Depending on the use case, the Silver layer can also be utilized for more detailed analysis.

In other words, this enables a separation of responsibilities between data management and data analytics.
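The layering above can be written down as a small configuration sketch. The layer names and storage follow the article; the engine assignments and the dict shape are my own illustrative assumptions (for instance, "glue-spark" standing in for AWS-led ETL):

```python
# Medallion layers: storage stays S3 + Iceberg throughout; only the
# engines allowed to touch each layer change.
MEDALLION = {
    "bronze": {"storage": "s3+iceberg", "engines": ["glue-spark"]},
    "silver": {"storage": "s3+iceberg", "engines": ["glue-spark", "snowflake"]},
    "gold":   {"storage": "s3+iceberg", "engines": ["snowflake"]},
}

def engines_for(layer: str) -> list:
    """Which query engines may operate on a given layer."""
    return MEDALLION[layer]["engines"]
```

Writing the split down like this makes the separation of responsibilities explicit: data management stays uniform across layers, while analytical access narrows toward Snowflake at the Gold layer.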

Thinking in Terms of Separation of Ownership

This is the most important point in this article.

| Domain | Ownership |
| --- | --- |
| Physical data | AWS |
| Governance | AWS / Snowflake |
| Analytics | Snowflake |
| AI interaction | Snowflake |

What matters is not the product itself, but how ownership is separated.

By clearly defining this separation, it becomes easier to organize the scope of responsibilities across data engineering, BI, and AI utilization, which also provides benefits from an organizational management perspective.

The Change Brought by Snowflake Cortex AI

AI utilization will become even more important going forward.

As with other data platforms, AI adoption is progressing rapidly in Snowflake as well.

By leveraging Snowflake Cortex AI, it becomes possible to query Iceberg tables on S3 using natural language.

In other words, the data platform is evolving from “a platform for writing SQL” into “a platform for conversation.”

AI utilization is expected to continue evolving in many aspects.

One key point will be preparing data that is easier for AI to use—in other words, AI-ready data.

Conclusion

In this article, I organized two patterns for integrating an AWS lakehouse (Apache Iceberg) with Snowflake.

In recent data utilization scenarios, it is increasingly common not only to rely solely on AWS, but also to integrate with platforms such as Databricks and Snowflake as introduced here.

As mentioned earlier, what matters is not the product itself, but how ownership is separated.

Depending on which service takes responsibility for data, governance, and analytics, both the architecture and configuration will change.

In any case, what is truly important is not the product itself, but the perspective of how to design a data platform that users will continue to use over time.

Going forward, it will become even more important to design architectures not only from the perspective of where data is stored, but also from the viewpoint of who owns responsibility for each layer and how that responsibility connects to user value.

I hope this article will be helpful for those considering a combination of AWS and Snowflake.
