Learn Basic SQL with Databricks

Learn Basic SQL with Databricks

·

3 min read

Introduction

SQL (Structured Query Language) is a powerful language for managing and querying relational databases. Databricks, built on top of Apache Spark, offers a unified platform that simplifies big data and machine learning processes. In this blog, we’ll explore how to get started with SQL on Databricks.

Sample test data files for Hands-on

  1. Employee: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/employees.csv

  2. Orders: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/orders.csv

  3. Products: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/products.csv

  4. Sales: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/sales.csv

Getting Started with Databricks

Step 1: Sign Up and Set Up

  1. Refer this blog on how to Sign Up and Set Up databricks @ https://vipinmp.hashnode.dev/kickstart-your-spark-data-exploration-journey-with-databricks

Step 2: Upload Data

  1. Go to catalog tab and upload the test data csv file to practice the basic SQL commands.

  2. Writing SQL in Databricks Notebooks

Step 3: Create a Notebook

  1. Create a New Notebook: From your workspace, create a new notebook. You can choose the language of your notebook; select SQL for this example.

  2. Attach the Notebook to a Cluster: Make sure your notebook is attached to the cluster you created.

Basic SQL Queries

Now that our data is loaded into a table, let’s run some basic SQL queries.

  1. SELECT:

    This query selects all records from the sales table.

  2. WHERE

     SELECT * FROM sales WHERE amount > 100
    

    This query outputs only certain rows based on whether they meet certain conditions.

  3. AND

     SELECT * FROM sales WHERE amount > 100 AND amount < 200
    

    Logical operator AND is used to select rows that meet both of two specified conditions.

  4. OR

     SELECT * FROM sales WHERE amount > 100 AND amount < 200
    

    Logical OR operator is used to select rows that meet either of two specified conditions.

  5. NOT

     SELECT * FROM sales WHERE NOT amount=100
    

    Logical NOT operator is used to display records that do not meet a specified condition.

  6. BETWEEN

     SELECT * FROM sales WHERE amount BETWEEN 150 AND 200
    

    BETWEEN is a logical operator that selects values falling within a specified range.

  7. IN

     SELECT * FROM sales WHERE product_id IN (101,102)
    

    IN operator is used to specify the list of values in the WHERE clause.

  8. LIKE

     SELECT * FROM sales WHERE region LIKE 'We%'
    

    The LIKE operator is used in a WHERE clause to search for a specific pattern within a column.

  9. ORDER BY

     SELECT * FROM sales ORDER BY product_id DESC
    

    The ORDER BY statement in SQL is used to sort the fetched data in either ascending or descending

Conclusion

Databricks makes it easy to work with big data using SQL. By leveraging Databricks' powerful platform and the simplicity of SQL, you can quickly perform data analysis and gain valuable insights. Whether you’re just starting out or looking to enhance your data skills, Databricks provides the tools you need to succeed.