Hive Tables

Overview

Hive is a data warehouse system that provides an interface for data stored in Hadoop, S3, or other compatible locations.

One component of Hive is the metastore, which manages metadata of persistent relational entities (such as databases, tables, columns, and partitions) in a relational database.

The AWS Glue Data Catalog is an implementation of the Hive metastore for data stored in a number of AWS or JDBC-compatible sources. Using the Glue Catalog as the metastore lets AWS services, applications, and AWS accounts share a single metastore.

If your data catalog is already in Glue, Tecton can integrate with it to read data from Hive with almost no additional configuration.

This page describes how to integrate a Hive data store with Tecton.

Prerequisites

By design, a Tecton cluster has permissions to access the Glue catalog in the same account as the data plane. The instructions here apply only to cases where Tecton needs access to a Glue Data Catalog from an account other than the Tecton data plane account.

Before you begin, you must have:

  • A deployed Tecton cluster.
  • AWS administrator access to IAM roles and policies for the Tecton deployment AWS account.
  • The ID of the target Glue Data Catalog.
  • AWS administrator access to IAM roles and policies for the Glue Data Catalog AWS account.

Procedure

To add Hive from a separate account, follow these steps:

  1. Grant cross-account access to the role on the Glue Catalog
  2. Grant permissions to the role to use Glue
  3. Provide the Glue catalog and account IDs to Tecton
  4. If necessary, validate the permissions in Databricks or EMR

Granting Glue Access to Tecton's Role

Log in to the AWS account of the target Glue Catalog and go to the Glue Console. In Settings, paste the following policy into the Permissions box. Set the "AWS" principal to the ARN of the Tecton role acquired in Step 1 (in that example, the value is <123456789012:role/ABC-production-spark-node>). In "Resource", replace <aws-region-target-glue-catalog> and <aws-account-id-target-glue-catalog> with the region and account ID of the target Glue Catalog.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Example permission",
      "Effect": "Allow",
      "Principal": {
        "AWS": "<TECTON ROLE ARN>"
      },
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    }
  ]
}

Granting IAM Permissions to the Spark Role

Log in to your Tecton AWS account and go to the IAM Console. Create the following policy and attach it to your Tecton Spark role (created in Databricks Setup or EMR Setup).

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GlueAccess",
      "Effect": "Allow",
      "Action": [
        "glue:BatchGetPartition",
        "glue:GetDatabase",
        "glue:GetDatabases",
        "glue:GetPartition",
        "glue:GetPartitions",
        "glue:GetTable",
        "glue:GetTables",
        "glue:GetUserDefinedFunction",
        "glue:GetUserDefinedFunctions"
      ],
      "Resource": "arn:aws:glue:<aws-region-target-glue-catalog>:<aws-account-id-target-glue-catalog>:*"
    }
  ]
}
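The two policies above differ only in the "Principal" element, so both can be generated from the same template. The following is a minimal sketch, not part of Tecton or AWS tooling: the function name `glue_read_policy`, the default Sid, and the example region, account ID, and full role ARN form are illustrative assumptions to be replaced with your own values.

```python
import json

# Read-only Glue actions listed in both policies above.
GLUE_READ_ACTIONS = [
    "glue:BatchGetPartition",
    "glue:GetDatabase",
    "glue:GetDatabases",
    "glue:GetPartition",
    "glue:GetPartitions",
    "glue:GetTable",
    "glue:GetTables",
    "glue:GetUserDefinedFunction",
    "glue:GetUserDefinedFunctions",
]


def glue_read_policy(region, account_id, principal_arn=None, sid="GlueAccess"):
    """Build a policy document granting read-only access to one Glue catalog.

    With principal_arn set, the result has the shape of the resource policy
    pasted into the Glue Console. Without it, the result has the shape of the
    IAM policy attached to the Spark role.
    """
    statement = {"Sid": sid, "Effect": "Allow"}
    if principal_arn is not None:
        statement["Principal"] = {"AWS": principal_arn}
    statement["Action"] = list(GLUE_READ_ACTIONS)
    statement["Resource"] = f"arn:aws:glue:{region}:{account_id}:*"
    return {"Version": "2012-10-17", "Statement": [statement]}


if __name__ == "__main__":
    # Placeholder values (assumptions); substitute your own region, catalog
    # account ID, and Tecton role ARN.
    resource_policy = glue_read_policy(
        "us-west-2",
        "111122223333",
        principal_arn="arn:aws:iam::123456789012:role/ABC-production-spark-node",
    )
    print(json.dumps(resource_policy, indent=2))
```

Calling `glue_read_policy` without `principal_arn` produces the document for the IAM policy; generating both from one function keeps the action list and resource ARN from drifting apart.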

Provide the Target Glue Catalogs to Tecton

To use the new Glue Catalog, Tecton needs the <aws-account-id-target-glue-catalog> value from the previous step and, if the target region differs from the Tecton AWS region, the <aws-region-target-glue-catalog> value as well. Your deployment specialist will request these values.
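Before handing these values over, it can help to sanity-check their format. The sketch below is a hypothetical helper (the name `check_catalog_target` and the region pattern are assumptions, not an AWS or Tecton API). It relies on AWS account IDs being 12 digits and only checks that the region looks like a typical region code such as us-west-2; it does not confirm that the account or region exists.

```python
import re

# AWS account IDs are 12-digit strings (leading zeros are significant).
ACCOUNT_ID_RE = re.compile(r"[0-9]{12}")
# Loose shape check for region codes such as us-west-2 or us-gov-west-1.
# This pattern is an assumption; it does not verify the region exists.
REGION_RE = re.compile(r"[a-z]{2}(-[a-z]+)+-[0-9]")


def check_catalog_target(account_id, region=None):
    """Return a list of problems found; an empty list means the values look plausible."""
    problems = []
    if not ACCOUNT_ID_RE.fullmatch(account_id):
        problems.append(f"account ID {account_id!r} is not 12 digits")
    # The region is only needed when it differs from the Tecton AWS region.
    if region is not None and not REGION_RE.fullmatch(region):
        problems.append(f"region {region!r} does not look like an AWS region code")
    return problems


if __name__ == "__main__":
    # Placeholder values (assumptions); substitute your own.
    print(check_catalog_target("111122223333", "us-west-2"))
```

Pass the account ID as a string rather than an integer so that leading zeros are preserved.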

Validating Permissions

The preceding steps set up the AWS Glue Data Catalog policies, which grant Tecton access to the metadata only. Access to the related S3 buckets and objects is defined separately in S3 and can be more restrictive if required. For more information, see this AWS blog post.

Validating Permissions with Databricks

To validate the cross-account permissions with Databricks, launch a cluster with the *-spark-node role:

  1. Create a cluster.
  2. Click the Instances tab on the Cluster Creation page.
  3. In the Instance Profiles drop-down list, select the instance profile.
  4. Verify that you can access the Glue Catalog by running the following command in a notebook:

    show databases;
    

    If the command succeeds, the Tecton cluster is configured to use Glue.

Validating Permissions with Amazon EMR

To validate the cross-account permissions with Amazon EMR, follow these steps:

  1. Create an EMR cluster. A convenient way to do this is to clone an existing notebook cluster.
  2. Launch a notebook on this cluster.
  3. Verify that you can access the Glue Catalog by running the following command in the notebook:

    show databases;
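
In either a Databricks or an EMR notebook, the same check can be run from Python to confirm that specific databases from the target catalog are visible, rather than reading the output by eye. This is a minimal sketch that assumes the notebook provides a PySpark session named `spark`; the helper names and the database name in the usage comment are hypothetical.

```python
def list_catalog_databases(spark):
    """Return the database names visible through the session's metastore."""
    rows = spark.sql("SHOW DATABASES").collect()
    # The name is in the first column; its label differs across Spark
    # versions, so index by position instead of by column name.
    return sorted(row[0] for row in rows)


def missing_databases(spark, expected):
    """Return the expected database names that the session cannot see."""
    visible = set(list_catalog_databases(spark))
    return sorted(set(expected) - visible)


# Usage in a notebook cell ("sales_db" is a hypothetical database name):
# print(missing_databases(spark, ["sales_db"]))
```

An empty list means every expected database is visible through the Glue Catalog. A non-empty list points at either the cross-account policies or the cluster's metastore configuration.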
    

Limitations

Adding a Hive datastore has the following limitations:

  • You cannot dynamically switch between a Glue Catalog and a Hive metastore. You must restart the cluster for new Spark configurations to take effect.
  • For other considerations when using AWS Glue Data Catalogs, see the AWS Spark documentation.