This article belongs to a three-part series about our case study analysis (you can check the case study details here); it explains the approach used to analyze those requirements:
- Part II – Applying the 3-step procedure to Use Case #1
- Part III – Applying the 3-step procedure to Use Case #2
Use Case #2: I want to know which product will attract the most sales in the future, so I can anticipate the supply.
This use case is about forecasting future sales: based on past sales, the business needs a forecast that will allow it to adjust inventory, delivery, shipping, and support.
Step#1: Business problem analysis:
- What is my use case?
I want to identify which product will attract the most sales, based on past sales
- Is it defined with specificity?
YES, forecasting product sales using historical data
- What is/are my Service Level Objectives (SLOs)?
- The forecast is needed once per week
Step#2: Technical requirements:
- What are my data sources?
- The primary data source is sales data residing in the data warehouse
- Is the data structured, unstructured, or both?
- Structured data
- What are the data models I will receive?
- Column-oriented: the data is already stored in BigQuery, which uses a column-oriented storage format; check this link for further details about its storage format:
- What are the volumes of such data (per second/per hour/per day)? What are the volumes during busy hours?
- Data volume: ~1 million events per month
- Detailed volumes:
- per second: to be extracted later
- per hour: to be extracted later
- per day: to be extracted later
- busy hour: to be extracted later
- What are the end-to-end availability times/constraints?
- The data needs to be accessible to a machine learning forecasting model
- Phase I: train the model once, then retrain it once per month
- Phase II: forecast sales once per week
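As a sketch of Phase I, the monthly training/retraining could be expressed with a BigQuery ML time-series model. All names below (the `sales.history` table, the `product_forecast` model, and the `sale_date`/`product_id`/`amount` columns) are hypothetical, not from the case study:

```sql
-- Phase I (sketch): train/retrain a per-product time-series model once per month.
-- Dataset, table, and column names are assumptions for illustration only.
CREATE OR REPLACE MODEL `sales.product_forecast`
OPTIONS (
  model_type = 'ARIMA_PLUS',
  time_series_timestamp_col = 'sale_date',   -- when each sale happened
  time_series_data_col = 'amount',           -- the value to forecast
  time_series_id_col = 'product_id'          -- one series per product
) AS
SELECT sale_date, product_id, amount
FROM `sales.history`;
```

Retraining on fresh data is then just re-running this `CREATE OR REPLACE MODEL` statement on a monthly schedule.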
- Do you need to store the data permanently after real-time processing?
- The predicted data needs to be stored in the data warehouse as well
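Phase II (the weekly forecast) together with this storage requirement could be sketched as follows, assuming a previously trained BigQuery ML model with the hypothetical name `sales.product_forecast`:

```sql
-- Phase II (sketch): produce a weekly forecast and persist it in the warehouse.
-- Model and table names are hypothetical.
CREATE OR REPLACE TABLE `sales.weekly_forecast` AS
SELECT *
FROM ML.FORECAST(
  MODEL `sales.product_forecast`,
  STRUCT(7 AS horizon, 0.9 AS confidence_level)  -- forecast 7 steps ahead
);
```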
Step#3: Identify the storage option
Applying this to the Google Cloud context:
| Requirement | Component/Processed By | GCP Storage Option |
| --- | --- | --- |
| Data quality inspection | Dataproc | Data stored in Cloud Storage in Parquet format |
| Building a regression ML model for forecasting | BigQuery ML | BigQuery |
| Storage of the real-time processing output | BigQuery | Scenario 1: data loaded into and residing in BigQuery. Scenario 2: data kept in GCS and accessed via federated queries |
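For Scenario 2, the federated setup could look like the following external-table sketch; the dataset, table name, and bucket path are hypothetical:

```sql
-- Scenario 2 (sketch): keep the Parquet files in Cloud Storage and query them
-- from BigQuery through an external (federated) table.
CREATE OR REPLACE EXTERNAL TABLE `sales.history_external`
OPTIONS (
  format = 'PARQUET',
  uris = ['gs://my-sales-bucket/history/*.parquet']  -- hypothetical bucket path
);
```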
BigQuery ML simplifies building ML models: using only SQL, on data residing inside BigQuery, it takes care of feature transformation, model training, evaluation, and hosting. To take advantage of this, we decided to use BigQuery both as the storage option for the data processed by Dataproc and as the machine learning engine.
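Since evaluation is also handled through SQL, checking the model's quality could be sketched as a single query (the model name is the same hypothetical one used above):

```sql
-- Sketch: inspect evaluation metrics for the trained model via SQL.
SELECT *
FROM ML.EVALUATE(MODEL `sales.product_forecast`);
```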
The architecture diagram is below: