In the era of big data, the ability to efficiently store, process, and analyze data is crucial for businesses. Amazon Web Services (AWS) offers a wide range of cloud services, while Snowflake provides a powerful data warehousing solution. Integrating Snowflake with AWS combines the best of both worlds, allowing seamless data flow and robust analytics capabilities. In this blog, we'll explore how to set up and integrate Snowflake with AWS Cloud through a sample project.
Step 1: Set Up AWS S3 Bucket
Create an S3 Bucket:
Log in to your AWS Management Console.
Navigate to S3 service.
Click on "Create bucket" and follow the prompts to create a new bucket (e.g.,
my-nyc-taxi-bucket
).
Upload Data to S3 Bucket:
Download the NYC Taxi dataset (e.g., yellow_tripdata_2022-01.parquet) from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page. The data files are provided in Parquet format and need to be converted to CSV; use the code below to convert the Parquet file to CSV.
Install the required libraries:
pip install pandas pyarrow
Convert the Parquet file to CSV:
import pandas as pd

# Load the Parquet file
df = pd.read_parquet('yellow_tripdata_2022-01.parquet')

# Save it as a CSV file
df.to_csv('yellow_tripdata_2022-01.csv', index=False)
Upload the converted CSV file to your newly created S3 bucket.
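If you prefer the command line over the console, a one-line upload with the AWS CLI looks like this (assuming the CLI is installed and configured with credentials that can write to the bucket; the bucket and file names are just the examples used above):
aws s3 cp yellow_tripdata_2022-01.csv s3://my-nyc-taxi-bucket/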
Step 2: Configure IAM Roles
Create an IAM Role for Snowflake:
Navigate to the IAM service in the AWS Management Console.
Click on "Roles" and then "Create role".
Select "Another AWS account" and enter Snowflake's AWS account ID.
Attach the AmazonS3ReadOnlyAccess policy or create a custom policy with the necessary permissions (a sample custom policy is sketched below).
Name the role (e.g., SnowflakeS3AccessRole) and save it.
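If you go the custom-policy route instead of AmazonS3ReadOnlyAccess, a minimal read-only sketch for the example bucket might look like the following (the bucket name is the example from Step 1; adjust it to your own):
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:GetObjectVersion"],
      "Resource": "arn:aws:s3:::my-nyc-taxi-bucket/*"
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetBucketLocation"],
      "Resource": "arn:aws:s3:::my-nyc-taxi-bucket"
    }
  ]
}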
Get the IAM Role ARN:
Copy the Role ARN (e.g., arn:aws:iam::123456789012:role/SnowflakeS3AccessRole).
Step 3: Set Up Snowflake
Create a Snowflake Account:
Sign up for a Snowflake account if you don’t have one.
Refer to https://vipinmp.hashnode.dev/snowflake-a-beginners-guide for how to get started with Snowflake.
Log in to the Snowflake web interface.
Create a Database and Schema:
CREATE DATABASE nyc_taxi_db;
USE DATABASE nyc_taxi_db;
CREATE SCHEMA nyc_taxi_schema;
USE SCHEMA nyc_taxi_schema;
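Loading and querying data also requires a running virtual warehouse. If you don't already have one, a minimal sketch is below (the warehouse name and size are just examples):
CREATE WAREHOUSE IF NOT EXISTS nyc_taxi_wh
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
USE WAREHOUSE nyc_taxi_wh;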
Create a Table:
CREATE OR REPLACE TABLE nyc_taxi_data (
    VendorID STRING,
    tpep_pickup_datetime DATETIME,
    tpep_dropoff_datetime DATETIME,
    passenger_count NUMBER,
    trip_distance NUMBER,
    RatecodeID STRING,
    store_and_fwd_flag STRING,
    PULocationID STRING,
    DOLocationID STRING,
    payment_type STRING,
    fare_amount NUMBER,
    extra NUMBER,
    mta_tax NUMBER,
    tip_amount NUMBER,
    tolls_amount NUMBER,
    improvement_surcharge NUMBER,
    total_amount NUMBER,
    congestion_surcharge NUMBER,
    airport_fee NUMBER
);
Step 4: Create a Storage Integration in Snowflake
Create the Storage Integration:
CREATE OR REPLACE STORAGE INTEGRATION s3_integration
  TYPE = EXTERNAL_STAGE
  STORAGE_PROVIDER = 'S3'
  ENABLED = TRUE
  STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::058264339195:role/SnowflakeS3AccessRole'
  STORAGE_ALLOWED_LOCATIONS = ('s3://my-nyc-taxi-bucket/');
Get the External ID:
DESC INTEGRATION s3_integration;
Note the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values from the output.
Update the IAM Role Trust Policy:
Go back to the IAM role in AWS Management Console.
Update the trust relationship so the principal is the STORAGE_AWS_IAM_USER_ARN and the sts:ExternalId condition matches the STORAGE_AWS_EXTERNAL_ID from the DESC INTEGRATION output.
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"AWS": "arn:aws:iam::YOUR_ACCOUNT_ID:root"
},
"Action": "sts:AssumeRole",
"Condition": {
"StringEquals": {
"sts:ExternalId": "EXTERNAL_ID_FROM_SNOWFLAKE"
}
}
}
]
}
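If you'd rather script this step than edit the trust relationship in the console, the AWS CLI can apply the same document (assuming it is saved locally as trust-policy.json and the role name matches the one created in Step 2):
aws iam update-assume-role-policy --role-name SnowflakeS3AccessRole --policy-document file://trust-policy.json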
Step 5: Create an External Stage in Snowflake
Create the External Stage:
CREATE OR REPLACE STAGE my_s3_stage
  URL = 's3://my-nyc-taxi-bucket/'
  STORAGE_INTEGRATION = s3_integration;
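Before loading, it's worth confirming that Snowflake can actually reach the bucket through the integration; listing the stage should show the uploaded CSV file:
LIST @my_s3_stage;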
Step 6: Load Data from S3 into Snowflake
Copy Data into Snowflake Table:
COPY INTO nyc_taxi_data
FROM @my_s3_stage
FILE_FORMAT = (
    TYPE = CSV
    FIELD_OPTIONALLY_ENCLOSED_BY = '"'
    SKIP_HEADER = 1
    FIELD_DELIMITER = ','
    EMPTY_FIELD_AS_NULL = TRUE
    NULL_IF = ('NULL', '')
)
ON_ERROR = 'CONTINUE';
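Because ON_ERROR = 'CONTINUE' silently skips bad rows, you may want to inspect what was rejected in the most recent load. One way is Snowflake's VALIDATE table function:
SELECT * FROM TABLE(VALIDATE(nyc_taxi_data, JOB_ID => '_last'));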
Step 7: Analyze Data in Snowflake
Run Basic Queries:
SELECT * FROM nyc_taxi_data LIMIT 10;

SELECT COUNT(*) FROM nyc_taxi_data;

SELECT AVG(trip_distance) AS avg_trip_distance,
       AVG(total_amount)  AS avg_total_amount
FROM nyc_taxi_data
WHERE tpep_pickup_datetime BETWEEN '2022-01-01' AND '2022-01-31';
Generate Insights:
- Use the Snowflake UI or export the data to visualization tools like Tableau, Power BI, or any other BI tool to create dashboards and reports.
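As a starting point for a dashboard, a simple aggregation over the loaded table, for example trip volume and average amounts by payment type, could look like this:
SELECT payment_type,
       COUNT(*)         AS trip_count,
       AVG(fare_amount) AS avg_fare,
       AVG(tip_amount)  AS avg_tip
FROM nyc_taxi_data
GROUP BY payment_type
ORDER BY trip_count DESC;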