Live
Black Hat USAAI BusinessBlack Hat AsiaAI BusinessMy forays into cyborgism: theory, pt. 1LessWrongThe Engineer as Reader: Why Literature Skills Matter for Software Engineers in the Age of AIMedium AIWhen Enterprises Build an Agent OS, the Operating Model Must Change TooMedium AIBuilding a RAG-Powered Smart AI Chatbot for E-commerce application using LangChainMedium AIIntelligence isn’t genetic it’s something to be built part 2Medium AIWhich AI Tool Should You Use for What?Medium AIAI and Authority: What Happens When Writing No Longer Proves ExpertiseMedium AIThe One-Person Unicorn Is Impossible Until AI Outputs Are Officially RecognizedMedium AIb8671llama.cpp ReleasesWashington state will require labels on AI images and set limits on chatbotsHacker News AI TopCan we ever trust AI to watch over itself?Hacker News AI TopAI models will scheme to protect other AI models from being shut downHacker News AI TopBlack Hat USAAI BusinessBlack Hat AsiaAI BusinessMy forays into cyborgism: theory, pt. 1LessWrongThe Engineer as Reader: Why Literature Skills Matter for Software Engineers in the Age of AIMedium AIWhen Enterprises Build an Agent OS, the Operating Model Must Change TooMedium AIBuilding a RAG-Powered Smart AI Chatbot for E-commerce application using LangChainMedium AIIntelligence isn’t genetic it’s something to be built part 2Medium AIWhich AI Tool Should You Use for What?Medium AIAI and Authority: What Happens When Writing No Longer Proves ExpertiseMedium AIThe One-Person Unicorn Is Impossible Until AI Outputs Are Officially RecognizedMedium AIb8671llama.cpp ReleasesWashington state will require labels on AI images and set limits on chatbotsHacker News AI TopCan we ever trust AI to watch over itself?Hacker News AI TopAI models will scheme to protect other AI models from being shut downHacker News AI Top
AI NEWS HUBbyEIGENVECTOREigenvector

How to use Databricks managed Delta tables in a Kedro project

kedro.orgJuly 5, 20231 min read0 views
Source Quiz

This post explains how to use a newly-released dataset for managed Delta tables in Databricks within your Kedro project.

In this blog post, I'll guide you through the specifics of building a Kedro project that uses managed Delta tables in Databricks using the newly-released ManagedTableDataSet.

What is Kedro?

Kedro is a toolbox for production-ready data science. It's an open-source Python framework that enables the development of clean data science code, borrowing concepts from software engineering and applying them to machine-learning projects. A Kedro project provides scaffolding for complex data and machine-learning pipelines. It enables developers to spend less time on tedious "plumbing" and focus on solving new problems.

What is Databricks?

Databricks is a unified data analytics platform designed for simplifying big data processing and free-form data exploration at any scale. Based on Apache Spark, an open-source distributed computing system, Databricks provides a collaborative cloud-based environment where users can process large amounts of data.

The platform provides collaborative workspaces (notebooks) and computational resources (clusters) to run code with. Clusters are groups of nodes that run Apache Spark. Notebooks are collaborative web-based interfaces where users can write and execute code on an attached cluster.

Why use Kedro on Databricks?

As we've described, Kedro offers a framework for building modular and scalable data pipelines, while Databricks provides a platform for running Spark jobs and managing data. You can combine Kedro and Databricks to build and deploy data pipelines and get the best of both worlds. Kedro's open-source framework will help you to build well-organised and maintainable pipelines, while Databricks' platform will provide you with the scalability you need to run your pipeline in production. Check out the recently-updated Kedro documentation for a set of workflow options for integrating Kedro projects and Databricks. (Additionally, the third-party kedro-mlflow plugin integrates mlflow capabilities inside Kedro projects to enhance reproducibility for machine learning experimentation).

What are Kedro datasets?

Kedro datasets are abstractions for reading and loading data, designed to decouple these operations from your business logic. These datasets manage reading and writing data from a variety of sources, while also ensuring consistency, tracking, and versioning. They allow users to maintain focus on core data processing, leaving data I/O tasks to Kedro.

What is managed data in Databricks?

To understand the concept of managed data in Databricks, it is first necessary to outline how Databricks organises data. At the highest level, Databricks uses metastores to store the metadata associated with data objects. Databricks Unity Catalog is one such metastore. It provides data governance and management across multiple Databricks workspaces. The metastore organises tables (where your data is stored) in a hierarchical structure.

The highest level of organisation in this hierarchy is the catalog. Catalogs are a collection of databases (also referred to as schemas in Databricks' terminology). A database is the second level of organisation in the Unity Catalog namespacing model. Databases are a collection of tables. The tables in a database are the third level of organisation in this hierarchy.

A table is structured data, stored as a directory of files on cloud object storage. By default, Databricks creates tables as Delta tables, which store data using the Delta Lake format. Delta Lake is an open-source storage format that offers ACID transactions, time travel and audit history.

Databricks tables belong to one of two categories: managed and unmanaged (external) tables. Databricks manages both the data and associated metadata of managed tables. If you drop a managed table, you will delete the underlying data. The data of a managed table resides in the location of the database to which it is registered.

On the other hand, for unmanaged tables, Databricks only manages the metadata. If you drop an unmanaged table, you will not delete the underlying data. These tables require a specified location during creation.

How to work with managed Delta tables using Kedro

Let's demonstrate how to use the ManagedTableDataSet with a simple example on Databricks. You'll need to open a new Databricks notebook and attach it to a cluster to follow along with the rest of this example, which runs on a workspace using a Hive metastore. We'll create a dataset containing weather readings, save it to a managed Delta table on Databricks, append some data, and access a specific table version to showcase Delta Lake's time travel capabilities.

Run every separate code snippet in this section in a new notebook cell.

The first steps are to set up your workspace by creating a weather database in your metastore and installing Kedro. Run the following SQL code to create the database:

1%sql 2create database if not exists weather;

To install Kedro and the ManagedTableDataSet, use the %pip magic:

1%pip install kedro kedro-datasets[databricks.ManagedTableDataSet]

The first part of our program will create some weather data. We'll create a Spark DataFrame with four columns: date, location, temperature, and humidity to store our weather data. Then, we'll use a new instance of ManagedTableDataSet to save our DataFrame to a Delta table called 2023_06_22 (the day of the readings) in the weather database.

1from pyspark.sql import SparkSession 2from pyspark.sql.types import (StructField, StringType, IntegerType, StructType) 3from kedro_datasets.databricks import ManagedTableDataSet 4 5spark_session = SparkSession.builder.getOrCreate() 6 7# Define schema 8schema = StructType([ 9 StructField("date", StringType(), True), 10 StructField("location", StringType(), True), 11 StructField("temperature", IntegerType(), True), 12 StructField("humidity", IntegerType(), True), 13]) 14 15# Create DataFrame 16data = [ 17 ('2023-06-22', 'London', 27, 39), 18 ('2023-06-22', 'Warsaw', 28, 40), 19 ('2023-06-22', 'Bucharest', 32, 38), 20] 21spark_df = spark_session.createDataFrame(data, schema) 22 23# Create a ManagedTableDataSet instance using a new table named '2023_06_22' 24weather = ManagedTableDataSet(database="weather", table="2023_06_22") 25 26# Save the DataFrame to the table 27weather.save(spark_df) 28

To load our data back into a dataframe, we use the load method on ManagedTableDataSet:

1# Load the table data into a DataFrame 2reloaded = weather.load() 3 4# Print the first 3 rows of the DataFrame 5display(reloaded.take(3))

This code loads the data from the weather table back into a Spark DataFrame and shows the first three rows of the data:

1| date | location | temperature | humidity | 2|:--------:|:--------:|:-----------:|:--------:| 3|2023-06-22|Bucharest | 32 | 38 | 4|2023-06-22| London | 27 | 39 | 5|2023-06-22| Warsaw | 28 | 40 |

Let's say we take some more weather readings later in the day and want to add them to our Delta table. To do this, we can write to it using a new instance of ManagedTableDataSet initialised with "append" passed in as an argument to write_mode:

1# Append new rows to the data 2new_rows = [ 3 ('2023-06-22', 'Cairo', 35, 25), 4 ('2023-06-22', 'Lisbon', 28, 44), 5] 6spark_df = spark_session.createDataFrame(new_rows, schema) 7 8weather = ManagedTableDataSet( 9 database="weather", 10 table="2023_06_22", 11 write_mode="append" 12) 13weather.save(spark_df)

The code above adds new rows for Cairo and Lisbon to our Delta table, which creates a new version of the table.

The ManagedTableDataSet class allows for saving data with three different write modes: overwrite, append, and upsert:

  • overwrite mode will completely replace the current data in the table with the new data.
  • append mode will add new data to the existing table.
  • upsert mode updates existing rows and inserts new rows, based on a specified primary key. Notably, if the table doesn't exist at save, the upsert mode behaves similarly to append, inserting data into a new table.

Suppose we later want to access our data as it appeared earlier in the day when we had only taken three readings. The ManagedTableDataSet class supports accessing different versions of the Delta table. We can access a specific version by defining a Kedro Version and passing it into a new instance of ManagedTableDataSet:

1from kedro.io import Version 2 3# Load version 0 of the table 4weather = ManagedTableDataSet( 5 database="weather", 6 table="2023_06_22", 7 version=Version(load=0, save=None) 8) 9reloaded = weather.load() 10display(reloaded) 11 12# Load version 1 of the table 13weather = ManagedTableDataSet( 14 database="weather", 15 table="2023_06_22", 16 version=Version(load=1, save=None) 17) 18reloaded = weather.load() 19display(reloaded)

You will see two rendered tables as the output of running this code. The first corresponds to version 0 of the 2023_06_22 table, while the second corresponds to version 1:

1| date | location | temperature | humidity | 2|:--------:|:--------:|:-----------:|:--------:| 3|2023-06-22|Bucharest | 32 | 38 | 4|2023-06-22| London | 27 | 39 | 5|2023-06-22| Warsaw | 28 | 40 | 6 7| date | location | temperature | humidity | 8|:--------:|:--------:|:-----------:|:--------:| 9|2023-06-22|Bucharest | 32 | 38 | 10|2023-06-22| London | 27 | 39 | 11|2023-06-22| Warsaw | 28 | 40 | 12|2023-06-22| Lisbon | 28 | 44 | 13|2023-06-22| Cairo | 35 | 25 |

And that's it! We've put together a simple program to show some of the usual tasks that ManagedTableDataSet facilitates, making it easy to save, load, and manage versions of your data in Delta tables on Databricks.

Conclusion

Databricks is a fast-growing deployment vector for Kedro projects. This blog post has demonstrated how to combine the power of both Kedro and Databricks with an open-source ManagedTableDataSet that enables streamlined data I/O operations when deploying a Kedro project on Databricks. ManagedTableDataSet empowers you to spend more time implementing the business logic of your data pipeline or machine learning workflow and less time manually handling data.

Was this article helpful?

Sign in to highlight and annotate this article

AI
Ask AI about this article
Powered by Eigenvector · full article context loaded
Ready

Conversation starters

Ask anything about this article…

Daily AI Digest

Get the top 5 AI stories delivered to your inbox every morning.

More about

release

Knowledge Map

Knowledge Map
TopicsEntitiesSource
How to use …releasekedro.org

Connected Articles — Knowledge Graph

This article is connected to other articles through shared AI topics and tags.

Knowledge Graph100 articles · 163 connections
Scroll to zoom · drag to pan · click to open

Discussion

Sign in to join the discussion

No comments yet — be the first to share your thoughts!