How to use Databricks managed Delta tables in a Kedro project

kedro.org · July 5, 2023 · 1 min read

This post explains how to use a newly-released dataset for managed Delta tables in Databricks within your Kedro project.

In this blog post, I'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released ManagedTableDataSet.

What is Kedro?

Kedro is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.

What is Databricks?

Databricks is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on Apache Spark, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.

The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.

Why use Kedro on Databricks?

As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the recently-updated Kedro documentation for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party kedro-mlflow plugin integrates MLflow capabilities inside Kedro projects to enhance reproducibility for machine-learning experimentation.)

What are Kedro datasets?

Kedro datasets are abstractions for loading and saving data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.
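To make the decoupling idea concrete, here is a minimal sketch in plain Python. It does not use Kedro's own classes: JSONFileDataset and hottest_reading are hypothetical names invented for this illustration, and the only point is that the processing function never touches a file path.

```python
import json
import tempfile
from pathlib import Path


class JSONFileDataset:
    """Hypothetical stand-in for a dataset: it only knows how to load and save."""

    def __init__(self, filepath):
        self._filepath = Path(filepath)

    def load(self):
        with self._filepath.open() as f:
            return json.load(f)

    def save(self, data):
        with self._filepath.open("w") as f:
            json.dump(data, f)


def hottest_reading(readings):
    """Business logic: a pure function with no knowledge of files or tables."""
    return max(readings, key=lambda row: row["temperature"])


with tempfile.TemporaryDirectory() as tmp:
    readings = JSONFileDataset(Path(tmp) / "readings.json")
    readings.save([
        {"location": "London", "temperature": 27},
        {"location": "Bucharest", "temperature": 32},
    ])
    # The function receives plain data; where it came from is the dataset's concern
    hottest = hottest_reading(readings.load())

print(hottest["location"])
```

Swapping the storage backend would mean replacing the dataset object only, which is the property that Kedro's real datasets are designed to provide.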

What is managed data in Databricks?

To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.

The highest level of organisation in this hierarchy is the catalog. Catalogs are a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation in the Unity Catalog namespacing model. Databases are a collection of tables. The tables in a database are the third level of organisation in this hierarchy.
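As a rough illustration of this namespacing, the helper below joins the levels into a fully qualified table name. It is a sketch, not part of Kedro or Databricks: qualified_table_name is a hypothetical function, and "main" is an assumed catalog name used only for the example. Each part is wrapped in backticks so that names starting with a digit stay unambiguous in SQL.

```python
def qualified_table_name(table, database, catalog=None):
    """Build a backtick-quoted [catalog.]database.table identifier."""
    parts = [catalog, database, table] if catalog else [database, table]
    # Double any embedded backtick, then wrap each level in backticks
    return ".".join("`" + part.replace("`", "``") + "`" for part in parts)


# Two levels (database and table), as on a workspace without a catalog level
two_level = qualified_table_name("2023_06_22", "weather")

# Three levels, using the assumed catalog name "main"
three_level = qualified_table_name("2023_06_22", "weather", catalog="main")

print(two_level)
print(three_level)
```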

A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the Delta Lake format. Delta Lake is an open-source storage format that offers ACID transactions, time travel and audit history.

Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.

On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.
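The difference in drop behaviour can be modelled with a toy metastore in plain Python. This is only a simulation under simplifying assumptions: ToyMetastore and TableEntry are hypothetical names, local temporary directories stand in for cloud object storage, and nothing here calls Databricks.

```python
import shutil
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class TableEntry:
    name: str
    location: Path
    managed: bool


class ToyMetastore:
    """Toy model: keeps table metadata and, for managed tables, owns the files."""

    def __init__(self):
        self.tables = {}

    def register(self, entry):
        self.tables[entry.name] = entry

    def drop(self, name):
        # The metadata is always removed
        entry = self.tables.pop(name)
        if entry.managed:
            # Managed table: the underlying data is deleted as well
            shutil.rmtree(entry.location)


with tempfile.TemporaryDirectory() as root:
    managed_dir = Path(root) / "managed"
    external_dir = Path(root) / "external"
    for directory in (managed_dir, external_dir):
        directory.mkdir()
        (directory / "part-0000").write_text("placeholder data")

    metastore = ToyMetastore()
    metastore.register(TableEntry("managed_readings", managed_dir, managed=True))
    metastore.register(TableEntry("external_readings", external_dir, managed=False))

    metastore.drop("managed_readings")
    metastore.drop("external_readings")

    managed_data_survives = managed_dir.exists()
    external_data_survives = external_dir.exists()

print(managed_data_survives, external_data_survives)
```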

How to work with managed Delta tables using Kedro

Let's demonstrate how to use the ManagedTableDataSet with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.

Run every separate code snippet in this section in a new notebook cell.

The first steps are to set up your workspace by creating a weather database in your metastore and installing Kedro. Run the following SQL code to create the database:

```sql
%sql
create database if not exists weather;
```

To install Kedro and the ManagedTableDataSet, use the %pip magic:

```
%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]
```

The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: date, location, temperature, and humidity to store our weather data. Then, we'll use a new instance of ManagedTableDataSet to save our DataFrame to a Delta table called 2023_06_22 (the day of the readings) in the weather database.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import (StructField, StringType, IntegerType, StructType)
from kedro_datasets.databricks import ManagedTableDataSet

spark_session = SparkSession.builder.getOrCreate()

# Define schema
schema = StructType([
    StructField("date", StringType(), True),
    StructField("location", StringType(), True),
    StructField("temperature", IntegerType(), True),
    StructField("humidity", IntegerType(), True),
])

# Create DataFrame
data = [
    ('2023-06-22', 'London', 27, 39),
    ('2023-06-22', 'Warsaw', 28, 40),
    ('2023-06-22', 'Bucharest', 32, 38),
]
spark_df = spark_session.createDataFrame(data, schema)

# Create a ManagedTableDataSet instance using a new table named '2023_06_22'
weather = ManagedTableDataSet(database="weather", table="2023_06_22")

# Save the DataFrame to the table
weather.save(spark_df)
```

To load our data back into a dataframe, we use the load method on ManagedTableDataSet:

```python
# Load the table data into a DataFrame
reloaded = weather.load()

# Print the first 3 rows of the DataFrame
display(reloaded.take(3))
```

This code loads the data from the 2023_06_22 table in the weather database back into a Spark DataFrame and shows the first three rows of the data:

| date       | location  | temperature | humidity |
|:----------:|:---------:|:-----------:|:--------:|
| 2023-06-22 | Bucharest | 32          | 38       |
| 2023-06-22 | London    | 27          | 39       |
| 2023-06-22 | Warsaw    | 28          | 40       |

Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of ManagedTableDataSet initialised with "append" passed in as an argument to write_mode:

```python
# Append new rows to the data
new_rows = [
    ('2023-06-22', 'Cairo', 35, 25),
    ('2023-06-22', 'Lisbon', 28, 44),
]
spark_df = spark_session.createDataFrame(new_rows, schema)

weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    write_mode="append"
)
weather.save(spark_df)
```

The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.

The ManagedTableDataSet class allows for saving data with three different write modes: overwrite, append, and upsert:

overwrite mode will completely replace the current data in the table with the new data.
append mode will add new data to the existing table.
upsert mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save, the upsert mode behaves similarly to append, inserting data into a new table.
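The semantics of the three modes can be sketched on plain Python lists of dictionaries. This is an illustration only, not how ManagedTableDataSet is implemented: save_rows is a hypothetical helper, and it assumes a single-column primary key for simplicity.

```python
def save_rows(existing, new_rows, write_mode, primary_key=None):
    """Return the table contents after saving new_rows with the given mode."""
    if write_mode == "overwrite":
        # Replace everything with the new data
        return list(new_rows)
    if write_mode == "append":
        # Keep the existing rows and add the new ones
        return list(existing) + list(new_rows)
    if write_mode == "upsert":
        if primary_key is None:
            raise ValueError("upsert needs a primary_key")
        # Index existing rows by key; matching keys are updated, others inserted
        merged = {row[primary_key]: row for row in existing}
        for row in new_rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write_mode: {write_mode}")


table = [
    {"location": "London", "temperature": 27},
    {"location": "Warsaw", "temperature": 28},
]
update = [
    {"location": "London", "temperature": 29},
    {"location": "Cairo", "temperature": 35},
]

overwritten = save_rows(table, update, "overwrite")
appended = save_rows(table, update, "append")
upserted = save_rows(table, update, "upsert", primary_key="location")

print(len(overwritten), len(appended), len(upserted))
```

With an empty existing table, the upsert branch simply inserts every row, which mirrors the behaviour described above for a table that doesn't exist yet.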

Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The ManagedTableDataSet class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro Version and passing it into a new instance of ManagedTableDataSet:

```python
from kedro.io import Version

# Load version 0 of the table
weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    version=Version(load=0, save=None)
)
reloaded = weather.load()
display(reloaded)

# Load version 1 of the table
weather = ManagedTableDataSet(
    database="weather",
    table="2023_06_22",
    version=Version(load=1, save=None)
)
reloaded = weather.load()
display(reloaded)
```

You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the 2023_06_22 table, while the second corresponds to version 1:

| date       | location  | temperature | humidity |
|:----------:|:---------:|:-----------:|:--------:|
| 2023-06-22 | Bucharest | 32          | 38       |
| 2023-06-22 | London    | 27          | 39       |
| 2023-06-22 | Warsaw    | 28          | 40       |

| date       | location  | temperature | humidity |
|:----------:|:---------:|:-----------:|:--------:|
| 2023-06-22 | Bucharest | 32          | 38       |
| 2023-06-22 | London    | 27          | 39       |
| 2023-06-22 | Warsaw    | 28          | 40       |
| 2023-06-22 | Lisbon    | 28          | 44       |
| 2023-06-22 | Cairo     | 35          | 25       |
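If the version numbering feels abstract, the toy model below mimics it in plain Python: every save produces a new version, and any earlier version can be loaded by number. VersionedTable is a hypothetical class written for this illustration. It keeps a full snapshot per version for simplicity, whereas Delta Lake records changes in a transaction log.

```python
class VersionedTable:
    """Toy model of a table whose every save creates a new numbered version."""

    def __init__(self):
        self._versions = []

    def save(self, rows, write_mode="overwrite"):
        if write_mode == "append" and self._versions:
            # New version = latest version plus the new rows
            snapshot = self._versions[-1] + list(rows)
        else:
            snapshot = list(rows)
        self._versions.append(snapshot)
        # Versions are numbered from 0
        return len(self._versions) - 1

    def load(self, version=None):
        # Default to the latest version when no version is requested
        index = len(self._versions) - 1 if version is None else version
        return list(self._versions[index])


readings = VersionedTable()
first_version = readings.save(["London", "Warsaw", "Bucharest"])
second_version = readings.save(["Cairo", "Lisbon"], write_mode="append")

morning = readings.load(version=0)
latest = readings.load()

print(first_version, second_version, len(morning), len(latest))
```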

And that's it! We've put together a simple program to show some of the usual tasks that ManagedTableDataSet facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.

Conclusion

Databricks is a fast-growing deployment vector for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source ManagedTableDataSet that enables streamlined data I/O operations when deploying a Kedro project on Databricks. ManagedTableDataSet empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.

Original source

kedro.org

https://kedro.org/blog/managed-delta-tables-kedro-dataset
