How to choose a storage option for a real-time streaming case study


This article belongs to a 3-part series about our case study analysis (you can check the case study details here):

It is basically about explaining the approach used to analyze those requirements.

  • Part II – Applying the 3-step procedure to our use case#1
  • Part III – Applying the 3-step procedure to our use case#2

Use Case#1: I want to retarget, within 3 minutes max, customers with an abandoned cart

The use case is about increasing sales conversion by reducing cart abandonment. The output, i.e. the retargeting action, is a personalized email sent to the subscriber offering free shipping.

More advanced retargeting method selection will be addressed in the future.

Step#1: Business problem analysis:

  1. What is my use case?

I want to retarget, within 3 minutes max, customers with an abandoned cart.

  2. Is it defined with specificity?

YES, I'm targeting customers who added products to their cart but didn't proceed with payment.

  3. What are my Service Level Objectives (SLOs)?

YES: retargeting needs to be done within 3 minutes max after session expiration.

Step#2: Technical requirements:

  1. What are my data sources?
  • Primary data source is an operational DB (MySQL) –> structured data
  • Third-party system to follow up user clicks –> semi-structured data (not used currently)
  2. Is the data structured, unstructured, or both?
  • Structured data
  3. What are the data models I will receive?
  • Data sources are in CSV format (row-oriented)
  • Session-based data, received in several chunks/events
  • Typical funnel: user view => add to cart => purchase
  4. What are the volumes of such data (per second / per hour / per day)? What are the volumes in busy hours?
  • Data volume: ~1 million events per month
  • Detailed volume:
    • per second: to be extracted later
    • per hour: to be extracted later
    • per day: to be extracted later
    • busy hour: to be extracted later
  5. What are the end-to-end availability time constraints?
  • The data needs to be processed in real time
  • Action executed within 45-60 seconds maximum after session expiration
  • Personalized email to be sent 3 minutes max after the cart session expires
  6. Do you need to store the data permanently after real-time processing?
  • The data needs to be stored permanently in a data lake for further analysis and aggregation with historical data

and then take into consideration the following rules:

main considerations:

  1. separate storage and compute
  2. scalability 
  3. data retention
  4. data storage for further analysis
  5. data storage costs

Step#3: Identify the storage options

Needed Components

The streaming pipeline that fits use case#1 contains the following components:

1- Message Broker

This component receives the raw data from its source and stores it in a queue under a specific topic until the processing component retrieves it.
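
As a concrete illustration, here is a minimal sketch of publishing one raw cart event to Pub/Sub with the Python client. The project id (my-project) and topic name (cart-events) are placeholder names, not resources from the case study:

```python
# Minimal Pub/Sub publish sketch -- project id and topic name are hypothetical.
from google.cloud import pubsub_v1

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "cart-events")

# One raw event, matching the CSV schema shown later in this article
# (empty category_code field between category_id and brand).
event = ("2019-12-01 00:00:00 UTC,view,5712790,1487580005268456287,"
         ",f.o.x,6.27,576802932,51d85cb0-897f-48d2-918b-ad63965c12dc")

# publish() returns a future; result() blocks until Pub/Sub acknowledges
# the message and returns its server-assigned message id.
future = publisher.publish(topic_path, event.encode("utf-8"))
print(future.result())
```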

2- Compute/Processing component

This is Apache Beam which, in the Google Cloud Platform context, runs on the managed service “Google Dataflow”.
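
Below is a hedged sketch, not the case study's actual pipeline, of what such a Beam job could look like in Python: it reads raw CSV events from Pub/Sub, groups them into per-user-session windows, flags sessions that added to the cart but never purchased, and writes those to BigQuery. The project, topic, and table names, the 30-minute session gap, and the "cart"/"purchase" event_type values are illustrative assumptions:

```python
# A sketch of the streaming abandonment-detection job; resource names are hypothetical.
import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions, StandardOptions
from apache_beam.transforms.window import Sessions

def parse_event(line):
    # Incoming CSV columns follow the sample schema shown later in the article.
    (event_time, event_type, product_id, category_id,
     category_code, brand, price, user_id, user_session) = line.split(",")
    return {"event_type": event_type, "user_id": user_id,
            "user_session": user_session}

def abandoned(keyed_session):
    # A session is "abandoned" if the cart was filled but never purchased.
    user_session, events = keyed_session
    events = list(events)
    types = {e["event_type"] for e in events}
    if "cart" in types and "purchase" not in types:
        yield {"user_session": user_session, "user_id": events[0]["user_id"]}

options = PipelineOptions()
options.view_as(StandardOptions).streaming = True

with beam.Pipeline(options=options) as p:
    (p
     | "ReadCartEvents" >> beam.io.ReadFromPubSub(
           topic="projects/my-project/topics/cart-events")
     | "Decode" >> beam.Map(lambda b: b.decode("utf-8"))
     | "Parse" >> beam.Map(parse_event)
     | "KeyBySession" >> beam.Map(lambda e: (e["user_session"], e))
     | "SessionWindow" >> beam.WindowInto(Sessions(gap_size=30 * 60))
     | "GroupBySession" >> beam.GroupByKey()
     | "DetectAbandonment" >> beam.FlatMap(abandoned)
     | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
           "my-project:retargeting.abandoned_carts",
           schema="user_session:STRING,user_id:STRING",
           write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND))
```

Running the same code on Dataflow essentially requires passing the DataflowRunner plus project/region pipeline options; the downstream emailing step (the 3-minute SLO) would consume the output table or a dedicated Pub/Sub topic.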

3- Storing received data in a Data Lake for further usage

This is our data lake, where received data is stored in real time for further processing/usage.
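
For illustration, landing a Parquet file produced by the pipeline into the data lake bucket can look like the following sketch (the bucket name, object path, and local file name are placeholders):

```python
# Sketch of uploading a Parquet file to the data lake bucket on GCS.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("my-datalake")                      # hypothetical bucket
blob = bucket.blob("events/dt=2019-12-01/part-0001.parquet")
blob.upload_from_filename("part-0001.parquet")             # file produced upstream
print(f"uploaded to gs://{bucket.name}/{blob.name}")
```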

4- Data Warehouse 

This is where the cleaned/processed data is stored for usage at the application level (visualisation, ML pipelines, …).

So, applying that to the Google Cloud context:

Requirement | Component | GCP Storage Option
Queuing | Message Broker | GCP Pub/Sub
Long-term storage with fast access for read-heavy workloads (Spark or an equivalent engine) | Data Lake storage | Google Cloud Storage with Parquet as the file format (check below why Parquet is used)
Storage of the real-time processing output + data analytics | Data Warehouse | BigQuery:
  • Scenario 1: data loaded into and residing in BigQuery
  • Scenario 2: data kept in GCS and accessed through federated queries

The performance of federated queries (querying an external data source from BigQuery, where the data does not reside in a BigQuery table) is not as high as querying data stored natively in BigQuery, because the federated query has to externally query Cloud Storage, return the data to a temporary BigQuery table, map the data to BigQuery data types, and then execute the query in BigQuery.

==> our choice is to use BigQuery as the data warehouse, with the data loaded into it.
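
To make the two scenarios concrete, here is a short sketch using the google-cloud-bigquery Python client; the bucket, project, dataset, and table names are hypothetical placeholders:

```python
# Scenario 1 vs Scenario 2 sketch; all resource names are placeholders.
from google.cloud import bigquery

client = bigquery.Client()
gcs_uri = "gs://my-datalake/events/*.parquet"

# Scenario 1: load the Parquet files from GCS into a native BigQuery table.
load_job = client.load_table_from_uri(
    gcs_uri,
    "my-project.retargeting.events",
    job_config=bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET),
)
load_job.result()  # wait for the load to finish

# Scenario 2: federated query -- the data stays in GCS and is read on the fly
# through a temporary external table definition.
external_config = bigquery.ExternalConfig("PARQUET")
external_config.source_uris = [gcs_uri]
query_job = client.query(
    "SELECT event_type, COUNT(*) AS n FROM events GROUP BY event_type",
    job_config=bigquery.QueryJobConfig(
        table_definitions={"events": external_config}),
)
for row in query_job:
    print(row.event_type, row.n)
```

Scenario 1 pays the loading step once and then queries native BigQuery storage; Scenario 2 avoids duplicating the data but, as described above, pays the external read on every query.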

Why Apache Parquet?

The data stored in Google Cloud Storage will be accessed mainly by :

  • Spark Jobs
  • BigQuery Engine 

and both of them will access this data with heavy read operations, in order to perform further processing, aggregations, and analytics.

It turns out that Apache Parquet is the most suitable format for such workloads.

Here is a sample of our data with the following columns:

event_time, event_type, product_id, category_id, category_code, brand, price, user_id, user_session

and here is how it is going to be stored in Parquet (columnar) format. The column data will be stored as follows (each column is put on a separate line for simplicity):

2019-12-01 00:00:00 UTC,2019-12-01 00:00:00 UTC;
remove_from_cart,view;
5712790,5764655;
1487580005268456287,1487580005411062629;
,,;
f.o.x,cnd;
6.27,29.05;
576802932,412120092;
51d85cb0-897f-48d2-918b-ad63965c12dc,8adff31e-2051-4894-9758-224bfa8aec18

The data is stored sequentially by column, from top to bottom, not by row from left to right,

and it’s best suited for:

  • performing operations on the entire dataset
  • easy computation on specific columns (for instance, it is much easier to evaluate the average cart value for the entire month, or the highest sales revenue day, using columnar data compared to row data, since only the date and sales amount columns need to be fetched); see the short sketch after this list
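
As a small illustration of that column-level advantage, here is a sketch with made-up file names (it assumes pandas plus a Parquet engine such as pyarrow is installed); only the columns needed for the computation are read back from the file:

```python
# Write the two sample rows as Parquet, then read back only two columns.
import pandas as pd

events = pd.DataFrame({
    "event_time": ["2019-12-01 00:00:00 UTC", "2019-12-01 00:00:00 UTC"],
    "event_type": ["remove_from_cart", "view"],
    "brand": ["f.o.x", "cnd"],
    "price": [6.27, 29.05],
    "user_id": ["576802932", "412120092"],
})
events.to_parquet("events.parquet")  # stored column by column on disk

# The columnar format lets the reader skip event_time, brand, user_id entirely
# and fetch just the columns the aggregation needs.
subset = pd.read_parquet("events.parquet", columns=["event_type", "price"])
print(subset.groupby("event_type")["price"].mean())
```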

For a detailed explanation of how to choose the file format, please refer to this article.

The architecture diagram is shown below:
