How to choose a storage option for a batch processing case study


This article belongs to a three-part series about our case study analysis (you can check the case study details here). It explains the approach used to analyze the requirements.

Use Case#2: I want to know which product will attract the most sales in the future, in order to anticipate supply

The use case is about forecasting future sales: based on past sales, the business needs a forecast that allows it to adjust inventory, delivery, shipping, and support.
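As a toy illustration of the forecasting idea (the production approach uses BigQuery ML, covered later), here is a minimal sketch in plain Python: fit a linear trend to hypothetical past monthly sales and extrapolate one period ahead. The sales figures are made up for illustration.

```python
# Minimal sketch: least-squares linear trend on past monthly sales,
# extrapolated one step ahead. Figures are hypothetical.

def forecast_next(sales):
    """Fit y = intercept + slope * x and predict the next period."""
    n = len(sales)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(sales) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, sales)) \
        / sum((x - mean_x) ** 2 for x in xs)
    intercept = mean_y - slope * mean_x
    return intercept + slope * n  # prediction for period n (the next one)

past_sales = [120, 135, 150, 160, 178]  # hypothetical monthly units sold
print(round(forecast_next(past_sales), 1))
```

The product whose trend line predicts the highest next-period sales would be the one to prioritize for supply.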

Step#1: Business problem analysis:

  1. What is my use case? 

I want to identify which product will attract the most sales, based on past sales

  1. Is it defined with specificity?

           YES: forecasting product sales using historical data

  1. What is/are my Service Level Objectives (SLOs)?

This information is needed once per week.

Step#2: Technical requirements:

  1.  What are my data sources?
    1. The primary data source is sales data residing in the data warehouse
  1. Is the data structured, unstructured, or both?
    1. Structured data
  1. What are the data models I will receive?
    1. Column-oriented: the data is already stored in BigQuery. Check this link for further details about its storage format:
  1. What are the volumes of such data (per second/per hour/per day)? What are the volumes in busy hours?

  1. Data volume: ~1 million events per month
  2. Detailed volume:
    1. per second: to be extracted later
    2. per hour: to be extracted later
    3. per day: to be extracted later
    4. busy hour: to be extracted later
  1. What are the end-to-end availability times/constraints?
    1. The data needs to be accessible to a machine learning forecasting model
    2. Phase I: training the model once, then retraining it once per month
    3. Phase II: forecasting sales once per week
  1. Do you need to store the data permanently after real-time processing?
    1. The predicted data needs to be stored in the data warehouse as well
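Although the per-second/per-hour/per-day figures are still to be extracted, the stated ~1 million events per month already implies back-of-the-envelope averages, assuming a 30-day month and uniformly distributed traffic (busy-hour peaks will be higher):

```python
# Average rates implied by ~1 million events per month, assuming a
# 30-day month and uniform traffic. Busy-hour figures still need to
# be extracted from real traffic data.

EVENTS_PER_MONTH = 1_000_000
per_day = EVENTS_PER_MONTH / 30
per_hour = per_day / 24
per_second = per_hour / 3600

print(f"per day:    ~{per_day:,.0f}")     # ~33,333
print(f"per hour:   ~{per_hour:,.0f}")    # ~1,389
print(f"per second: ~{per_second:.2f}")   # ~0.39
```

At well under one event per second on average, this is clearly a low-throughput batch workload rather than a streaming one.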

Step#3: Identify the storage option

Applying that to the Google Cloud context:

| Requirement | Component/Processed By | GCP Storage Option |
| --- | --- | --- |
| Data quality inspection | Processed by Dataproc | Data stored in Cloud Storage in Parquet format |
| Building a regression ML model for forecasting | BigQuery ML | BigQuery |
| Storage of the real-time processing output / data analytics | BigQuery | BigQuery. Scenario 1: data loaded into and residing in BigQuery. Scenario 2: data kept in GCS, queried via federated query |
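As a sketch of the two scenarios above: in Scenario 1 the Parquet output is loaded into a native BigQuery table, while in Scenario 2 it stays in Cloud Storage and BigQuery reads it through an external (federated) table. The project, dataset, table, and bucket names below are hypothetical; the statements are standard BigQuery SQL, assembled here only as strings for illustration.

```python
# Sketch of the two storage scenarios. All resource names are
# hypothetical; we only assemble the SQL strings for illustration.

# Scenario 1: data loaded into a native BigQuery table (e.g. via a
# load job), then queried directly.
scenario_1_query = """
SELECT product_id, SUM(quantity) AS total_sold
FROM `my_project.sales_dataset.sales`
GROUP BY product_id
"""

# Scenario 2: data stays in GCS as Parquet; BigQuery queries it
# through an external (federated) table definition.
scenario_2_ddl = """
CREATE EXTERNAL TABLE `my_project.sales_dataset.sales_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-sales-bucket/processed/*.parquet']
)
"""
```

Scenario 1 gives better query performance; Scenario 2 avoids duplicating the data but pays a federated-read penalty on every query.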

BigQuery ML simplifies building ML models: using only SQL on data residing in BigQuery, it takes care of feature transformation, model training, evaluation, and hosting.

To take advantage of this, we decided to use BigQuery both as the storage option for the data processed by Dataproc and as the machine learning engine.
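A hedged sketch of what that SQL-only workflow could look like with BigQuery ML's `CREATE MODEL` and `ML.PREDICT` statements. The project, dataset, table, and column names are hypothetical; the strings are assembled here purely for illustration, not executed against BigQuery.

```python
# Hypothetical BigQuery ML workflow: a linear regression model trained
# and queried with SQL only. Resource and column names are made up.

create_model = """
CREATE OR REPLACE MODEL `my_project.sales_dataset.sales_forecast`
OPTIONS (
  model_type = 'linear_reg',
  input_label_cols = ['units_sold']
) AS
SELECT product_id, month, units_sold
FROM `my_project.sales_dataset.sales`
"""

# ML.PREDICT exposes the prediction as predicted_<label_column>.
predict = """
SELECT product_id, predicted_units_sold
FROM ML.PREDICT(
  MODEL `my_project.sales_dataset.sales_forecast`,
  (SELECT product_id, month FROM `my_project.sales_dataset.next_month`)
)
"""
```

Retraining once per month (Phase I) is simply re-running the `CREATE OR REPLACE MODEL` statement; the weekly forecast (Phase II) re-runs the `ML.PREDICT` query and writes its output back to the warehouse.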

The architecture diagram is shown below: