Introduction
SQL (Structured Query Language) is a powerful language for managing and querying relational databases. Databricks, built on top of Apache Spark, offers a unified platform that simplifies big data and machine learning processes. In this blog, we’ll explore how to get started with SQL on Databricks.
Sample test data files for Hands-on
Employee: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/employees.csv
Orders: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/orders.csv
Products: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/products.csv
Sales: https://github.com/vipinputhanveetil/ctb_blog_demo_files/blob/main/databricks_files/sales.csv
Getting Started with Databricks
Step 1: Sign Up and Set Up
- Refer this blog on how to Sign Up and Set Up databricks @ https://vipinmp.hashnode.dev/kickstart-your-spark-data-exploration-journey-with-databricks
Step 2: Upload Data
Go to catalog tab and upload the test data csv file to practice the basic SQL commands.
-
Writing SQL in Databricks Notebooks
Step 3: Create a Notebook
Create a New Notebook: From your workspace, create a new notebook. You can choose the language of your notebook; select SQL for this example.
Attach the Notebook to a Cluster: Make sure your notebook is attached to the cluster you created.
Basic SQL Queries
Now that our data is loaded into a table, let’s run some basic SQL queries.
SELECT:
This query selects all records from the
sales
table.WHERE
SELECT * FROM sales WHERE amount > 100
This query outputs only certain rows based on whether they meet certain conditions.
AND
SELECT * FROM sales WHERE amount > 100 AND amount < 200
Logical operator AND is used to select rows that meet both of two specified conditions.
OR
SELECT * FROM sales WHERE amount > 100 AND amount < 200
Logical OR operator is used to select rows that meet either of two specified conditions.
NOT
SELECT * FROM sales WHERE NOT amount=100
Logical NOT operator is used to display records that do not meet a specified condition.
BETWEEN
SELECT * FROM sales WHERE amount BETWEEN 150 AND 200
BETWEEN is a logical operator that selects values falling within a specified range.
IN
SELECT * FROM sales WHERE product_id IN (101,102)
IN operator is used to specify the list of values in the WHERE clause.
LIKE
SELECT * FROM sales WHERE region LIKE 'We%'
The LIKE operator is used in a WHERE clause to search for a specific pattern within a column.
ORDER BY
SELECT * FROM sales ORDER BY product_id DESC
The ORDER BY statement in SQL is used to sort the fetched data in either ascending or descending
Conclusion
Databricks makes it easy to work with big data using SQL. By leveraging Databricks' powerful platform and the simplicity of SQL, you can quickly perform data analysis and gain valuable insights. Whether you’re just starting out or looking to enhance your data skills, Databricks provides the tools you need to succeed.