
Connecting to an Existing Databricks Account

Overview

Databricks is a hosted Spark platform that can be used by Tecton for compute workloads and notebook environments.

By default, Tecton creates and manages all Databricks resources automatically. However, if you are already a Databricks customer, you can use your existing Databricks Workspace instead.

Deployment Setup

To set up Tecton with a Databricks account, you'll need to provide the following:

  • A Databricks Workspace deployed on AWS.
  • The Databricks Workspace URL, i.e. the URL used to access Databricks, with a format similar to <workspaceID>.cloud.databricks.com.
  • The ID of the AWS VPC used by the Databricks Workspace.

At the moment, connecting your Tecton deployment to an existing Databricks instance must be done with the help of Tecton support since it requires updating Tecton-managed AWS resources.

Interactive Cluster Setup

Follow these steps to set up an interactive Databricks cluster once your Databricks instance is connected to a Tecton deployment.

Prerequisites

You'll need a Tecton API key. This can be obtained using the CLI by running:

$ tecton api-key create
Save this key - you will not be able to get it again
1234567890abcdefabcdefabcdefabcd
This key will be referred to as TECTON_API_KEY below.

1. Install Tecton SDK as a library

This must be done once per Cluster. On the Cluster configuration page:

  1. Go to the Libraries tab
  2. Click Install New
  3. Select PyPI under Library Source
  4. Set Package to tecton
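
If you prefer to script this step, the same library can be installed with the Databricks CLI. This is a minimal sketch assuming the legacy databricks CLI is installed and configured, and that <cluster-id> is the ID of your interactive cluster:

databricks libraries install --cluster-id <cluster-id> \
    --pypi-package tecton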

2. Install Tecton UDF Jar

This must be done once per Cluster. On the Cluster configuration page:

  1. Go to the Libraries tab
  2. Click Install New
  3. Select DBFS/S3 under Library Source
  4. Set File Path to s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar
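
This step can also be scripted with the Databricks CLI. A minimal sketch, again assuming the legacy databricks CLI and your cluster's <cluster-id>:

databricks libraries install --cluster-id <cluster-id> \
    --jar s3://tecton.ai.public/pip-repository/itorgation/tecton/tecton-udfs-spark-3.jar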

3. Configure SDK credentials using secrets

Tecton SDK credentials are configured using Databricks secrets. These should already be set up as part of the Tecton deployment, but they can be created manually if needed (for example, to access Tecton from another Databricks workspace). First, ensure the Databricks CLI is installed and configured. Next, create a secret scope and add the API service endpoint and the API key created in Prerequisites above.

The scope name is tecton for the production Tecton cluster associated with a workspace, and tecton-<clustername> otherwise (for example, a staging cluster created in the same account). If your cluster name already starts with tecton-, no additional prefix is added; the scope name is simply your cluster name.

databricks secrets create-scope --scope <scopename>
databricks secrets put --scope <scopename> \
    --key API_SERVICE --string-value https://foo.tecton.ai/api
databricks secrets put --scope <scopename> \
    --key TECTON_API_KEY --string-value <TOKEN>
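
To confirm the scope and keys were created as expected, you can list them back (secret values themselves are never displayed):

databricks secrets list-scopes
databricks secrets list --scope <scopename>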

Depending on your Databricks setup, you may need to configure ACLs for the tecton secret scope before it is usable. See Databricks documentation for more information. For example:

databricks secrets put-acl --scope <scopename> \
    --principal your@email.com --permission MANAGE

Additionally, depending on the data sources used, you may need to configure the following secrets:

  • <secret-scope>/REDSHIFT_USER
  • <secret-scope>/REDSHIFT_PASSWORD
  • <secret-scope>/SNOWFLAKE_USER
  • <secret-scope>/SNOWFLAKE_PASSWORD
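
These are created the same way as the Tecton credentials above. For example, a sketch for Redshift, where <redshift-user> and <redshift-password> are placeholders for your warehouse credentials:

databricks secrets put --scope <secret-scope> \
    --key REDSHIFT_USER --string-value <redshift-user>
databricks secrets put --scope <secret-scope> \
    --key REDSHIFT_PASSWORD --string-value <redshift-password>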

4. Additional Permissions

Additionally, if your Databricks workspace is in a different AWS account, you must configure AWS access so that Databricks can read all of the S3 buckets Tecton uses (these are in the data plane account and are prefixed with tecton-). Databricks also needs access to the underlying data sources Tecton reads in order for Tecton to have full functionality.
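
One way to grant the S3 access is to attach a read-only policy to the IAM role assumed by your Databricks clusters. The sketch below uses the AWS CLI; the role name and policy name are placeholders, and you should confirm the exact bucket names for your deployment with Tecton support:

aws iam put-role-policy \
    --role-name <databricks-cluster-role> \
    --policy-name tecton-s3-read-access \
    --policy-document '{
      "Version": "2012-10-17",
      "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:GetObject", "s3:ListBucket"],
        "Resource": ["arn:aws:s3:::tecton-*", "arn:aws:s3:::tecton-*/*"]
      }]
    }'

Note that for cross-account access, the bucket policies on the Tecton data plane side must also allow the Databricks account; Tecton support can help configure that side.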

5. Verify the connection

Create a notebook connected to a cluster with the Tecton SDK installed (see Step 1). Run the following in the notebook. If successful, you should see a list of workspaces, including the "prod" workspace.

import tecton
tecton.list_workspaces()