
Batch Feature View

A BatchFeatureView defines row-level or aggregate transformations against a Batch Data Source (e.g. S3, Hive, Redshift, or Snowflake). Batch Feature Views run automatic backfills and can be scheduled to publish new feature data to the Online and Offline Feature Stores on a regular cadence.

Note: Many aggregations are already supported out of the box in a Batch Window Aggregate Feature View. These aggregations have been optimized for cost and efficiency and are a good place to start if you are looking to define time-windowed aggregations.

Use a BatchFeatureView if:

  • you have your raw events available in a Batch Data Source
  • you want to run simple row-level transformations on the raw data, or simply ingest raw data without further transformations
  • you want to define custom join and aggregation transformations
  • your use case can tolerate a feature freshness of > 1 hour
  • you want to ingest a dimension table (e.g. a user's attributes) for feature consumption

Common Examples:

  • determining if a user's credit score is over a pre-defined threshold
  • counting distinct transactions over a time window
  • batch ingesting precomputed feature values from an existing batch data source
  • batch ingesting a user's date of birth

Examples

To create a Batch Feature View, use the @batch_feature_view decorator on your Python function.

Row-Level Transformation
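
As a rough illustration of what "row-level" means here, the following plain-Python sketch derives one output row from each input row, using the credit score threshold case from the common examples. It is not Tecton SDK code: the function name, the field names (user_id, credit_score, timestamp) and the threshold of 670 are all hypothetical, and in a real Feature View this logic would live inside the function decorated with @batch_feature_view.

```python
from datetime import datetime


def credit_score_is_good(rows, threshold=670):
    """Row-level transformation: each output row depends on exactly one input row.

    `rows` is a list of dicts with hypothetical keys user_id, credit_score, timestamp.
    """
    return [
        {
            "user_id": row["user_id"],
            # 1 if the score is strictly above the threshold, otherwise 0.
            "credit_score_is_good": int(row["credit_score"] > threshold),
            "timestamp": row["timestamp"],
        }
        for row in rows
    ]


raw_rows = [
    {"user_id": "u1", "credit_score": 720, "timestamp": datetime(2021, 4, 1)},
    {"user_id": "u2", "credit_score": 610, "timestamp": datetime(2021, 4, 1)},
]
features = credit_score_is_good(raw_rows)
```

Because no row is combined with any other, a transformation of this shape needs no grouping or windowing logic.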

Custom Aggregation Transformation

Parameters

See the API reference for the full list of parameters.

The backfill_config parameter (under development) controls the grouping of the backfill jobs that Tecton spins up, and requires a matching form of the transformation. Currently, the only available value is BackfillConfig("multiple_batch_schedule_intervals_per_job"). More values will be supported in the future.

How it works

When online and offline materialization are both enabled, Tecton runs the BatchFeatureView transformation according to the defined batch_schedule. It publishes the latest feature values per entity key to the Online Feature Store and all historical values to the Offline Feature Store.
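
The difference between the two stores can be pictured with a small plain-Python sketch. This is only a conceptual model under simplifying assumptions, not how Tecton stores data: the publish function, the dict-based "online store" and the list-based "offline store" are hypothetical stand-ins.

```python
from datetime import datetime


def publish(rows):
    """Model the two stores: offline keeps every row, online keeps the latest row per key."""
    # Offline store (hypothetical): the full history, ordered by key and time.
    offline = sorted(rows, key=lambda r: (r["user_id"], r["timestamp"]))
    # Online store (hypothetical): one entry per entity key.
    online = {}
    for row in offline:
        # Rows are time-ordered per key, so later rows overwrite earlier ones.
        online[row["user_id"]] = row
    return online, offline


rows = [
    {"user_id": "u1", "timestamp": datetime(2021, 4, 1), "score": 640},
    {"user_id": "u1", "timestamp": datetime(2021, 4, 2), "score": 655},
    {"user_id": "u2", "timestamp": datetime(2021, 4, 1), "score": 700},
]
online, offline = publish(rows)
```

In this model a lookup for u1 in the online store returns only the April 2 value, while the offline store still holds both of u1's rows for building training data.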

These parameters in a Batch Feature View definition configure how Tecton will run the materialization jobs:

  1. batch_schedule (e.g. "1d"): Controls how often Tecton will materialize new feature values to the Feature Store.
  2. feature_start_time (e.g. datetime(2021, 4, 1)): Controls how far back Tecton will backfill feature data to the Feature Store once a new Feature View transformation is registered.
  3. window (e.g. "7d"): An optional parameter on each data source Input. It defaults to the Feature View's batch_schedule and determines the time range of raw data Tecton will supply to the transformation for a given materialization run (e.g. the most recent 7 days' worth of data). Tecton automatically filters out data outside of this window based on the Data Source timestamp_key.
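
To see how the three parameters interact, the sketch below enumerates the raw-data time range that each scheduled run would read. It is a simplified model, not Tecton's scheduler: the materialization_ranges function is hypothetical, and it assumes that runs are aligned to feature_start_time and that each range is half-open, [run_end - window, run_end).

```python
from datetime import datetime, timedelta


def materialization_ranges(feature_start_time, now, batch_schedule, window=None):
    """List the (start, end) raw-data range for every run between feature_start_time and now."""
    # Per the text above, window defaults to the batch_schedule.
    window = window or batch_schedule
    ranges = []
    run_end = feature_start_time + batch_schedule
    while run_end <= now:
        # Each run sees the trailing `window` of raw data ending at run_end.
        ranges.append((run_end - window, run_end))
        run_end += batch_schedule
    return ranges


start = datetime(2021, 4, 1)
now = datetime(2021, 4, 4)

# Default window: each daily run reads exactly one day of raw data.
daily = materialization_ranges(start, now, timedelta(days=1))

# 7-day window: each daily run reads the trailing 7 days of raw data.
trailing = materialization_ranges(start, now, timedelta(days=1), timedelta(days=7))
```

With a 7-day window, consecutive runs read overlapping ranges, which is why a windowed aggregation reprocesses most of the same raw data on every run.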

Using tecton_sliding_window for windowed aggregations

When aggregating over the time range set by the window parameter, we recommend using the tecton_sliding_window() transformation. See this notebook for more details on how tecton_sliding_window() works.

First, add the tecton_sliding_window() transformation to your transformation pipeline. tecton_sliding_window() has 3 primary inputs:

  • df: the input data.
  • timestamp_key: the timestamp column in your input data that represents the time of the event.
  • window_size: how far back in time the window should go. For example, if the feature is the number of distinct IDs in the last 30 days, then the window size is 30 days. Typically this value should match the window on your Input.

For a feature that counts the distinct merchants each user transacted with over the last 30 days, our transformation pipeline looks like this:

def user_distinct_merchant_transaction_count_30d(transactions_batch):
    return user_distinct_merchant_transaction_count_transformation(
        tecton_sliding_window(transactions_batch,
            timestamp_key=const('timestamp'),
            window_size=const('30d')))

In the next transformation in the pipeline, group by the window_end column along with any entity columns. For the same feature, our second transformation looks like this:

@transformation(mode='spark_sql')
def user_distinct_merchant_transaction_count_transformation(window_input_df):
    return f'''
        SELECT
            nameorig AS user_id,
            COUNT(DISTINCT namedest) AS distinct_merchant_count,
            window_end AS timestamp
        FROM {window_input_df}
        GROUP BY
            nameorig,
            window_end
    '''

And that's it! Tecton will now be able to calculate your feature that aggregates over the trailing 30 days.
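
The two steps above can be mimicked in plain Python to show why grouping by window_end yields a trailing-window aggregate. This is a conceptual sketch under stated assumptions, not the implementation of tecton_sliding_window(): it assumes each event is copied once into every window that contains it, that windows are half-open ([window_end - window_size, window_end)), and that window ends are aligned to a hypothetical slide_interval argument.

```python
from collections import defaultdict
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)


def sliding_window(events, timestamp_key, window_size, slide_interval):
    """Copy each event into every window that contains it, tagged with window_end."""
    out = []
    for event in events:
        ts = event[timestamp_key]
        # First aligned window boundary strictly after the event.
        window_end = EPOCH + ((ts - EPOCH) // slide_interval + 1) * slide_interval
        # Keep emitting while the window starting at window_end - window_size covers ts.
        while window_end - window_size <= ts:
            out.append({**event, "window_end": window_end})
            window_end += slide_interval
    return out


def distinct_merchant_count(windowed_rows):
    """Equivalent of COUNT(DISTINCT namedest) GROUP BY nameorig, window_end."""
    merchants = defaultdict(set)
    for row in windowed_rows:
        merchants[(row["nameorig"], row["window_end"])].add(row["namedest"])
    return {key: len(values) for key, values in merchants.items()}


events = [
    {"nameorig": "u1", "namedest": "m1", "timestamp": datetime(2021, 4, 1, 10)},
    {"nameorig": "u1", "namedest": "m2", "timestamp": datetime(2021, 4, 2, 9)},
    {"nameorig": "u1", "namedest": "m1", "timestamp": datetime(2021, 4, 2, 15)},
]
# A 3-day window sliding daily keeps the example small; the text above uses 30 days.
windowed = sliding_window(events, "timestamp", timedelta(days=3), timedelta(days=1))
counts = distinct_merchant_count(windowed)
```

Each event lands in three windows here (window_size divided by slide_interval), so every (user, window_end) group contains exactly the events from the trailing three days.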

Batch vs. Batch Window Aggregate Feature Views

A BatchFeatureView is the more flexible but less specialized alternative to a BatchWindowAggregateFeatureView. BatchWindowAggregateFeatureViews are highly recommended when running supported time-window aggregations. See the BatchWindowAggregateFeatureView documentation for a quick explanation of how Tecton supports these types of features by leveraging pre-computed and on-demand transformations.