Stream Feature View

A StreamFeatureView is used for simple row-level streaming feature transformations. It processes raw data from a streaming source (e.g. Kafka or Kinesis) and can be backfilled from any BatchDataSource (e.g. S3, Hive tables, Redshift) that contains a historical log of events.

Use a StreamFeatureView if:

  • your use case requires very fresh features (<1 minute) that update whenever a new raw event is available on the stream
  • you want to run simple row-level transformations on the raw data, or simply ingest it without further transformation
  • you have your raw events available on a stream

Common Examples:

  • Last transaction amount of a user's transaction stream
  • Stream ingesting precomputed feature values from an existing Kafka or Kinesis stream

Please see StreamWindowAggregateFeatureView for a specialized StreamFeatureView that supports efficiently calculated time window aggregations.

Example

For more examples, see the Examples page.
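As a minimal illustrative sketch (not taken from the Examples page), a row-level Stream Feature View for the "last transaction amount" case might look like the following. The module paths, the `transactions_stream` StreamDataSource, and the `user` entity are all assumptions; parameter names may vary by SDK version.

```python
from datetime import datetime, timedelta

from tecton import stream_feature_view

# Assumed to be defined elsewhere in the feature repository
# (hypothetical module paths for illustration only):
from ads.entities import user
from ads.data_sources import transactions_stream

@stream_feature_view(
    source=transactions_stream,
    entities=[user],
    mode="spark_sql",
    online=True,
    offline=True,
    feature_start_time=datetime(2022, 1, 1),
    batch_schedule=timedelta(days=1),
    ttl=timedelta(days=30),
    description="Amount of the user's most recent transaction.",
)
def last_transaction_amount(transactions):
    # Row-level transformation: one output row per input event,
    # with the entity ID, the feature value, and a timestamp.
    return f"""
        SELECT user_id, amount, timestamp
        FROM {transactions}
    """
```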

Parameters

See the API reference for the full list of parameters.

Transformation Pipeline

Stream Feature Views can use pyspark or spark_sql transformation types. You can configure mode=pipeline to construct a pipeline of those transformations, or use mode=pyspark or mode=spark_sql to define an inline transformation.
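As a sketch of the difference (a declarative config fragment, not verified against any particular SDK version): with mode=spark_sql the decorated function is itself the transformation, while mode=pipeline composes separately declared @transformation functions. The transformation names, sources, and entities below are illustrative assumptions.

```python
from tecton import transformation, stream_feature_view

# Hypothetical reusable row-level transformations:
@transformation(mode="spark_sql")
def filter_nonzero(events):
    return f"SELECT user_id, amount, timestamp FROM {events} WHERE amount > 0"

@transformation(mode="spark_sql")
def to_cents(events):
    return f"SELECT user_id, amount * 100 AS amount_cents, timestamp FROM {events}"

@stream_feature_view(
    source=transactions_stream,  # assumed StreamDataSource
    entities=[user],             # assumed Entity
    mode="pipeline",
)
def transaction_cents(transactions):
    # The pipeline chains the transformations declared above.
    return to_cents(filter_nonzero(transactions))
```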

The output of your transformation must include columns for the entity IDs and a timestamp. All other columns will be treated as features.
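To illustrate the rule, the helper below (purely illustrative, not part of the Tecton SDK) mimics how output columns are interpreted: anything that is not an entity join key or the timestamp is treated as a feature.

```python
def feature_columns(output_columns, join_keys, timestamp_field):
    """Illustrative only: non-key, non-timestamp columns become features."""
    reserved = set(join_keys) | {timestamp_field}
    return [c for c in output_columns if c not in reserved]

# Example output schema of a transformation:
cols = ["user_id", "timestamp", "amount", "merchant"]
print(feature_columns(cols, ["user_id"], "timestamp"))  # ['amount', 'merchant']
```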

Productionizing a Stream

For a Stream Feature View used in production where late data loss is unacceptable, it's recommended to set default_watermark_delay_threshold to your stream retention period, or to at least 24 hours. This configures Spark Structured Streaming not to drop data when events arrive late or out of order. The tradeoff of a longer watermark delay is a greater amount of in-memory state used by the streaming job.
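Under the hood, this setting corresponds to Spark Structured Streaming's event-time watermark. In raw Spark terms (a fragment for illustration only; `events_df` is assumed to be an existing streaming DataFrame with a `timestamp` column), a 24-hour delay threshold is equivalent to:

```python
# Spark Structured Streaming equivalent of a 24-hour watermark delay:
# events up to 24 hours late/out-of-order are still accepted into state.
events_with_watermark = events_df.withWatermark("timestamp", "24 hours")
```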

Usage Example

See how to use a Stream Feature View in a notebook here.

How it works

When materialized online, Tecton will run the StreamFeatureView transformation on each event that comes in from the underlying stream source, and write the result to the online store. Any previous values are overwritten, so the online store holds only the most recent value for each entity key.
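The online write path can be pictured as a last-write-wins upsert keyed by entity. The sketch below is pure Python for illustration, not actual Tecton code:

```python
# Sketch of the online store's behavior: each transformed event upserts
# the row for its entity key, so only the latest value is retained.
online_store = {}

def process_event(event, join_key="user_id"):
    # In Tecton, the row-level transformation runs on the event here;
    # this sketch simply passes the event through.
    online_store[event[join_key]] = {k: v for k, v in event.items() if k != join_key}

process_event({"user_id": "u1", "amount": 10.0, "timestamp": "2023-01-01T00:00:00"})
process_event({"user_id": "u1", "amount": 25.0, "timestamp": "2023-01-01T00:05:00"})

print(online_store["u1"]["amount"])  # 25.0 -- the earlier value was overwritten
```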

Streaming transformations are executed as Spark Structured Streaming jobs (additional compute engines will be supported soon).

Additionally, Tecton will run the same Stream Feature View transformation pipeline against the StreamDataSource's batch source (a historical log of stream events) when materializing feature values to the offline store. This enables you to create training data sets using the same feature definition that serves online.

Stream vs. Stream Window Aggregate Feature Views

A StreamFeatureView is the more general-purpose sibling of the specialized StreamWindowAggregateFeatureView. A StreamFeatureView is an abstraction on top of Spark Structured Streaming. Use a StreamWindowAggregateFeatureView whenever you need to run time window aggregations. See the StreamWindowAggregateFeatureView documentation for an explanation of how Tecton supports these types of features under the hood by leveraging Spark Structured Streaming as well as on-demand transformations.