Stream Feature View
StreamFeatureView is used for simple row-level transformation streaming features. It processes raw data from a streaming source (e.g. Kafka and Kinesis) and can be backfilled from any
BatchDataSource (e.g. S3, Hive Tables, Redshift) that contains a historical log of events.
- your use case requires very fresh features (<1 minute) that update whenever a new raw event is available on the stream
- you want to run simple row-level based transformation on the raw data, or simply ingest raw data without further transformations
- you have your raw events available on a stream
- Last transaction amount of a user's transaction stream
- Stream ingesting precomputed feature values from an existing Kafka or Kinesis stream
StreamWindowAggregateFeatureView for a specialized StreamFeatureView that supports efficiently calculated time window aggregations.
For more examples see Examples here.
See the API reference for the full list of parameters.
Stream Feature Views can use
spark_sql transformation types. You can configure
mode=pipeline to construct a pipeline of those transformations, or use
mode=spark_sql to define an inline transformation.
The output of your transformation must include columns for the entity IDs and a timestamp. All other columns will be treated as features.
Productionizing a Stream
For a stream FeatureView used in production where late data loss is unacceptable, it's recommended to set default_watermark_delay_threshold to your stream retention period, or at least 24 hours. This will configure Spark Structured Streaming to not drop data in the event that it processes the events late or out-of-order. The tradeoff of a longer watermark delay is greater amount of in-memory state used by the streaming job.
See how to use a Stream Feature View in a notebook here.
How it works
When materialized online, Tecton will run the
StreamFeatureView transformation on each event that comes in from the underlying stream source, and write it to the online store. Any previous values will be overwritten, so the online store only has the most recent value.
Streaming Transformations are executed as Spark Structured Streaming jobs (additional compute will be supported soon).
Additionally, Tecton will run the same Stream Feature View transformation pipeline against the
StreamDataSource's batch source (a historical log of stream events) when materializing feature values to the offline store. This offline batch source will enable you to create training data sets using the same feature definition as online.
Stream vs. Stream Window Aggregate Feature Views
StreamFeatureView is the more generic but less specialized sibling to a
StreamWindowAggregateFeatureView. A StreamFeatureView is an abstraction on top of Spark Structured Streaming. Use a
StreamWindowAggregateFeatureView whenever you care about running time window aggregations. See the
StreamWindowAggregateFeatureView documentation for a quick explanation of how Tecton supports these types of features under the hood by leveraging Spark Structured Streaming as well as on-demand transformation.