Step-by-Step Guide: Integrating Snowflake with AWS

In the era of big data, the ability to efficiently store, process, and analyze data is crucial for businesses. Amazon Web Services (AWS) offers a wide range of cloud services, while Snowflake provides a powerful data warehousing solution. Integrating Snowflake with AWS combines the best of both worlds, allowing seamless data flow and robust analytics capabilities. In this blog, we'll explore how to set up and integrate Snowflake with AWS Cloud through a sample project.

Step 1: Set Up AWS S3 Bucket

  1. Create an S3 Bucket:

    • Log in to your AWS Management Console.

    • Navigate to S3 service.

    • Click on "Create bucket" and follow the prompts to create a new bucket (e.g., my-nyc-taxi-bucket).

  2. Upload Data to S3 Bucket:

    • Download the NYC Taxi dataset (e.g., yellow_tripdata_2022-01.parquet) from https://www.nyc.gov/site/tlc/about/tlc-trip-record-data.page

    • The TLC publishes the trip files in Parquet format, so they need to be converted to CSV for this example. Use the code below for the conversion.

    • Install the required libraries:

        pip install pandas pyarrow
      
    • Convert the Parquet file to CSV:

        import pandas as pd
      
        # Load the Parquet file
        df = pd.read_parquet('yellow_tripdata_2022-01.parquet')
      
        # Save it as a CSV file
        df.to_csv('yellow_tripdata_2022-01.csv', index=False)
      
    • Upload the dataset to your newly created S3 bucket.
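Before uploading, a quick sanity check on the converted file can save a failed load later: confirm the CSV header matches the column order of the Snowflake table defined in Step 3. A small sketch (the `csv_columns` helper is ours, not a library function):

```python
import csv

def csv_columns(path):
    """Return the header row (list of column names) of a CSV file."""
    with open(path, newline="") as f:
        return next(csv.reader(f))

# e.g. csv_columns("yellow_tripdata_2022-01.csv") should begin with
# VendorID, tpep_pickup_datetime, tpep_dropoff_datetime, ...
```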

Step 2: Configure IAM Roles

  1. Create an IAM Role for Snowflake:

    • Navigate to the IAM service in the AWS Management Console.

    • Click on "Roles" and then "Create role".

    • Select "Another AWS account" and, for now, enter your own AWS account ID as a placeholder (you will point the trust policy at Snowflake's IAM user in Step 4, once DESC INTEGRATION reveals it).

    • Attach the AmazonS3ReadOnlyAccess policy or create a custom policy with the necessary permissions.

    • Name the role (e.g., SnowflakeS3AccessRole) and save it.

  2. Get the IAM Role ARN:

    • Copy the Role ARN (e.g., arn:aws:iam::123456789012:role/SnowflakeS3AccessRole).
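If you work from the AWS CLI rather than the console, the ARN can be read back at any time (the role name follows the example above):

```shell
aws iam get-role --role-name SnowflakeS3AccessRole \
  --query 'Role.Arn' --output text
```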

Step 3: Set Up Snowflake

  1. Create a Snowflake Account:

    • Sign up at signup.snowflake.com if you don't already have one, choosing AWS as the cloud provider (ideally in the same region as your S3 bucket, to avoid cross-region data transfer charges).

  2. Create a Database and Schema:

     CREATE DATABASE nyc_taxi_db;
     USE DATABASE nyc_taxi_db;
     CREATE SCHEMA nyc_taxi_schema;
     USE SCHEMA nyc_taxi_schema;
    
  3. Create a Table:

     CREATE OR REPLACE TABLE nyc_taxi_data (
         VendorID STRING,
         tpep_pickup_datetime DATETIME,
         tpep_dropoff_datetime DATETIME,
         passenger_count NUMBER,
         trip_distance NUMBER,
         RatecodeID STRING,
         store_and_fwd_flag STRING,
         PULocationID STRING,
         DOLocationID STRING,
         payment_type STRING,
         fare_amount NUMBER,
         extra NUMBER,
         mta_tax NUMBER,
         tip_amount NUMBER,
         tolls_amount NUMBER,
         improvement_surcharge NUMBER,
         total_amount NUMBER,
         congestion_surcharge NUMBER,
         airport_fee NUMBER
     );
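One prerequisite the steps above gloss over: both loading and querying need a running virtual warehouse. If your account doesn't have one yet, a minimal example (the name nyc_taxi_wh is our choice):

```sql
CREATE WAREHOUSE IF NOT EXISTS nyc_taxi_wh
  WITH WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60
  AUTO_RESUME = TRUE;
USE WAREHOUSE nyc_taxi_wh;
```

AUTO_SUSPEND = 60 pauses the warehouse after a minute of inactivity, which keeps credit usage low for a tutorial project.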
    

Step 4: Create a Storage Integration in Snowflake

  1. Create the Storage Integration:

     CREATE OR REPLACE STORAGE INTEGRATION s3_integration
     TYPE = EXTERNAL_STAGE
     STORAGE_PROVIDER = 'S3'
     ENABLED = TRUE
     STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/SnowflakeS3AccessRole'
     STORAGE_ALLOWED_LOCATIONS = ('s3://my-nyc-taxi-bucket/');
    
  2. Get the External ID:

     DESC INTEGRATION s3_integration;
    
    • Note the STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID.

  3. Update the IAM Role Trust Policy:

    • Go back to the IAM role in AWS Management Console.

    • Update the trust relationship to reference Snowflake's IAM user and the external ID, either in the console's "Trust relationships" tab or with `aws iam update-assume-role-policy --role-name SnowflakeS3AccessRole --policy-document file://trust-policy.json`.

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "STORAGE_AWS_IAM_USER_ARN_FROM_SNOWFLAKE"
                },
                "Action": "sts:AssumeRole",
                "Condition": {
                    "StringEquals": {
                        "sts:ExternalId": "EXTERNAL_ID_FROM_SNOWFLAKE"
                    }
                }
            }
        ]
    }

Step 5: Create an External Stage in Snowflake

  1. Create the External Stage:

     CREATE OR REPLACE STAGE my_s3_stage
     URL = 's3://my-nyc-taxi-bucket/'
     STORAGE_INTEGRATION = s3_integration;
    
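Before loading anything, it's worth confirming the stage and integration are wired up correctly. If this lists your CSV file, the IAM trust policy, storage integration, and stage URL are all working:

```sql
LIST @my_s3_stage;
```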

Step 6: Load Data from S3 into Snowflake

  1. Copy Data into Snowflake Table:

     COPY INTO nyc_taxi_data
     FROM @my_s3_stage
     FILE_FORMAT = (
       TYPE = CSV 
       FIELD_OPTIONALLY_ENCLOSED_BY = '"' 
       SKIP_HEADER = 1
       FIELD_DELIMITER = ','
       EMPTY_FIELD_AS_NULL = TRUE
       NULL_IF = ('NULL', '')
     )
     ON_ERROR = 'CONTINUE';
    
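Because ON_ERROR = 'CONTINUE' skips rejected rows silently, consider a dry run first: VALIDATION_MODE checks the staged files against the table without loading anything and returns the errors a real load would hit.

```sql
COPY INTO nyc_taxi_data
FROM @my_s3_stage
FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
VALIDATION_MODE = RETURN_ERRORS;
```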

Step 7: Analyze Data in Snowflake

  1. Run Basic Queries:

     SELECT * FROM nyc_taxi_data LIMIT 10;
    
     SELECT COUNT(*) FROM nyc_taxi_data;
    
     SELECT AVG(trip_distance) AS avg_trip_distance, AVG(total_amount) AS avg_total_amount
     FROM nyc_taxi_data
     WHERE tpep_pickup_datetime >= '2022-01-01' AND tpep_pickup_datetime < '2022-02-01';
    

  2. Generate Insights:

    • Build dashboards directly in Snowsight (Snowflake's web UI), or connect visualization tools like Tableau, Power BI, or any other BI tool to Snowflake to create dashboards and reports.
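As a starting point for those dashboards, a daily trip-volume query against the loaded table:

```sql
SELECT DATE_TRUNC('DAY', tpep_pickup_datetime) AS trip_day,
       COUNT(*)                                AS trips,
       AVG(total_amount)                       AS avg_total_amount
FROM nyc_taxi_data
GROUP BY trip_day
ORDER BY trip_day;
```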