3 Steps to identify the storage option for your data pipeline

When it comes to choosing a storage option for your data pipeline, an initial analysis of the business and technical requirements is needed before proceeding.

This article belongs to a 3-part series about our case study analysis (you can check the case study details here):

  • Part I – The 3-step procedure to identify your storage options, which explains the approach to analyzing those requirements.
  • Part II – Applying the 3-step procedure to our use case #1
  • Part III – Applying the 3-step procedure to our use case #2

Step #1: Business problem analysis

This is the entry point for the analysis and consists of:

  • Defining your use cases and the business problem you would like to solve.
  • Making each use case specific and focused on one unique problem. Let’s explain this with the following example:

Saying “I would like to visualize my sales data” is not precise and is a poor formulation.

Instead, saying “I would like to visualize sales data per product within 30 to 45 seconds after the payment is done” is much more specific: it gives an indication of the required data availability time to take into consideration when designing the data pipeline.

  • Defining Service Level Objectives (SLOs), metrics, and KPIs. Different types of SLOs can be identified(1):
    • end-to-end SLOs
    • per-stage / per-component SLOs
    • timeliness vs. skewness vs. completeness SLOs

We will concentrate on end-to-end SLOs for our use cases.
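
To make the end-to-end SLO concrete, below is a minimal sketch of how you might check it for the sales example above. The event timestamps and the 45-second target are assumptions taken from the example, not a prescribed implementation.

```python
from datetime import datetime, timezone

# Hypothetical end-to-end SLO from the example above: sales data must
# become visible within 45 seconds of the payment event.
E2E_SLO_SECONDS = 45.0

def e2e_latency_seconds(payment_time: datetime, visible_time: datetime) -> float:
    """Seconds elapsed between the payment event and the moment the
    record becomes queryable in the storage layer."""
    return (visible_time - payment_time).total_seconds()

def within_slo(payment_time: datetime, visible_time: datetime) -> bool:
    return e2e_latency_seconds(payment_time, visible_time) <= E2E_SLO_SECONDS

# A payment processed at 12:00:00 UTC that becomes queryable at
# 12:00:38 UTC meets the 45-second end-to-end SLO.
paid = datetime(2021, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
seen = datetime(2021, 1, 1, 12, 0, 38, tzinfo=timezone.utc)
print(within_slo(paid, seen))  # True
```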

Step #2: Technical requirements

Once your use cases and SLOs are defined, we can translate them into technical requirements. To do that, we need to answer the following questions:

1- What are my data sources?

2- Is the data structured, unstructured, or both?

3- What data models will I receive?

  • relational
  • key-value
  • column-oriented
  • document-oriented

4- What are the volumes of such data (per second / per hour / per day)? What are the volumes at peak hours?

5- What is the end-to-end data availability time?

Then take into consideration the following rules:

  1. Separate storage and compute: data needs to be stored in a way that decouples it from the compute layer.
  2. Scalability: your data storage needs to scale when necessary.
  3. Data retention policy & costs: define how long the data must be kept and what storing it will cost.
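
One lightweight way to keep the answers to these questions together is to capture them in a small requirements record. This is just an illustrative sketch; the field names and example values are assumptions, not part of the procedure itself.

```python
from dataclasses import dataclass

@dataclass
class StorageRequirements:
    """Answers to the step #2 questions, gathered in one place."""
    data_sources: list[str]
    structured: bool
    unstructured: bool
    data_models: list[str]           # e.g. "relational", "key-value", ...
    volume_per_day_gb: float
    peak_volume_per_hour_gb: float
    e2e_availability_seconds: float  # derived from the end-to-end SLO
    retention_days: int

# Hypothetical values for the sales-visualization example above.
sales_requirements = StorageRequirements(
    data_sources=["payment-service"],
    structured=True,
    unstructured=False,
    data_models=["relational"],
    volume_per_day_gb=50.0,
    peak_volume_per_hour_gb=8.0,
    e2e_availability_seconds=45.0,
    retention_days=365,
)
```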

Step #3: Identify the storage option

This final step maps the requirements identified in step #2 to data storage options, according to your context: a public cloud provider or other options.

For now, we are going to use GCP and the storage options offered by this platform.

GCP offers various storage options, and choosing the appropriate one is up to you. Here is the general guidance:

Storage options offered by GCP (source: (2)):

1- Cloud SQL

Cloud SQL is a fully-managed database service that makes it easy to set up, maintain, manage, and administer your relational databases on Google Cloud Platform(3).

You can use Cloud SQL with MySQL, PostgreSQL, or SQL Server.
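
Because Cloud SQL exposes a standard database endpoint, an ordinary client driver is enough. Here is a minimal sketch using psycopg2 against a PostgreSQL instance reached through the Cloud SQL Auth proxy on localhost; the database name, credentials, and schema are placeholders for illustration.

```python
import psycopg2

# Assumes the Cloud SQL Auth proxy is running locally and forwarding
# to a hypothetical PostgreSQL instance; all credentials and the
# schema are placeholders.
conn = psycopg2.connect(
    host="127.0.0.1",
    port=5432,
    dbname="sales",
    user="pipeline",
    password="change-me",
)

with conn, conn.cursor() as cur:
    # An illustrative relational query over the sales example.
    cur.execute(
        "SELECT product_id, SUM(amount) FROM sales GROUP BY product_id"
    )
    for product_id, total in cur.fetchall():
        print(product_id, total)

conn.close()
```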

2- Datastore

Datastore is a NoSQL document database built for automatic scaling, high performance, and ease of application development(4).
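
A minimal sketch with the google-cloud-datastore Python client; the "Sale" kind and its properties are illustrative assumptions.

```python
from google.cloud import datastore

# Assumes application-default credentials; the "Sale" kind and its
# properties are hypothetical.
client = datastore.Client()

key = client.key("Sale")  # partial key; Datastore assigns the ID on put
entity = datastore.Entity(key=key)
entity.update({"product_id": "sku-42", "amount": 19.99})
client.put(entity)

# Query documents back by property.
query = client.query(kind="Sale")
query.add_filter("product_id", "=", "sku-42")
for sale in query.fetch(limit=10):
    print(sale["amount"])
```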

3- Bigtable

A fully managed, scalable NoSQL database service for large analytical and operational workloads(5).
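
A minimal write sketch with the google-cloud-bigtable Python client. The project, instance, and table names, the "metrics" column family, and the composite row-key scheme are all assumptions for illustration.

```python
from google.cloud import bigtable

# Project, instance, and table names are placeholders; the "metrics"
# column family is assumed to exist already.
client = bigtable.Client(project="my-project")
instance = client.instance("my-instance")
table = instance.table("sales-events")

# Bigtable rows are keyed; a common pattern is a composite row key
# that encodes the entity and the event time.
row = table.direct_row(b"sku-42#2021-01-01T12:00:00Z")
row.set_cell("metrics", "amount", b"19.99")
row.commit()
```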

4- BigQuery

Serverless, highly scalable, and cost-effective multi-cloud data warehouse designed for business agility.(6)
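
A minimal sketch with the google-cloud-bigquery Python client; the project, dataset, and table names are placeholders.

```python
from google.cloud import bigquery

# Assumes application-default credentials; the dataset and table
# names are placeholders.
client = bigquery.Client()

query = """
    SELECT product_id, SUM(amount) AS total
    FROM `my-project.analytics.sales`
    GROUP BY product_id
    ORDER BY total DESC
    LIMIT 10
"""

# query() starts the job; result() waits for it and returns the rows.
for row in client.query(query).result():
    print(row.product_id, row.total)
```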

5- Cloud Spanner

Fully managed relational database with unlimited scale, strong consistency & up to 99.999% availability.(7)
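
A minimal read sketch with the google-cloud-spanner Python client; the instance, database, and schema are placeholders.

```python
from google.cloud import spanner

# Project, instance, database, and schema are placeholders.
client = spanner.Client(project="my-project")
database = client.instance("my-instance").database("orders-db")

# Snapshot reads are strongly consistent by default.
with database.snapshot() as snapshot:
    rows = snapshot.execute_sql(
        "SELECT OrderId, Total FROM Orders ORDER BY Total DESC LIMIT 5"
    )
    for order_id, total in rows:
        print(order_id, total)
```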

In addition, there is also “temporary” storage on message-broker systems, and for that let’s mention:

6- Pub/Sub

Messaging and ingestion for event-driven systems and streaming analytics.(8)

  • Integrated with Dataflow and BigQuery to form the Google Cloud-native Stream Analytics solution
  • Auto-scaling and auto-provisioning with support for up to 100 GB/second
  • Independent quota and billing for publishers and subscribers
  • Global message routing to simplify multi-region systems
  • Push and pull message delivery
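
A minimal publish sketch with the google-cloud-pubsub Python client; the project, topic, and message payload are placeholders.

```python
from google.cloud import pubsub_v1

# Project and topic names are placeholders; assumes application-default
# credentials are available in the environment.
publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path("my-project", "sales-events")

# Messages are raw bytes; keyword arguments become string attributes.
future = publisher.publish(
    topic_path,
    data=b'{"product_id": "sku-42", "amount": 19.99}',
    origin="payment-service",
)
print(future.result())  # blocks until the server returns a message ID
```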

7- Cloud Storage

Cloud Storage is a service for storing your objects in Google Cloud. An object is an immutable piece of data consisting of a file of any format. You store objects in containers called buckets. All buckets are associated with a project, and you can group your projects under an organization.(9)
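
A minimal upload sketch with the google-cloud-storage Python client; the bucket and object names are placeholders. Note how the upload creates a new immutable object rather than mutating data in place.

```python
from google.cloud import storage

# Bucket and object names are placeholders for illustration.
client = storage.Client()
bucket = client.bucket("my-pipeline-bucket")

# Objects are immutable: each upload creates a new object (or a new
# generation of an existing one) instead of mutating data in place.
blob = bucket.blob("raw/sales/2021-01-01.csv")
blob.upload_from_string(
    "product_id,amount\nsku-42,19.99\n",
    content_type="text/csv",
)
print(f"gs://{bucket.name}/{blob.name}")
```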

How to choose the appropriate option?

Below is a diagram explaining, at a high level, how to choose a storage option on GCP among the different options listed above, according to your analysis in step #2.

Diagram source: (10)
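
To make the diagram's logic easier to follow, here is one way a simplified version of that decision tree could be encoded. The branch conditions paraphrase the commonly cited GCP guidance, and both the function and its inputs are illustrative assumptions rather than a transcription of the diagram.

```python
def choose_gcp_storage(
    structured: bool,
    analytical: bool,
    low_latency: bool,
    relational: bool,
    needs_horizontal_scale: bool,
) -> str:
    """A simplified rendering of the usual GCP storage decision tree:
    unstructured data goes to object storage, analytical workloads
    split on latency, and transactional workloads split on data model
    and scale."""
    if not structured:
        return "Cloud Storage"
    if analytical:
        return "Bigtable" if low_latency else "BigQuery"
    if relational:
        return "Cloud Spanner" if needs_horizontal_scale else "Cloud SQL"
    return "Datastore"

# The sales-visualization example: structured, analytical data whose
# 45-second availability target does not require millisecond reads.
print(choose_gcp_storage(True, True, False, True, False))  # BigQuery
```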

If your use case leads to Google Cloud Storage as the choice, you also need to consider the file format for your data; a full explanation and analysis of choosing the appropriate format is available here.
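
As a small illustration of why the file format matters, here is a sketch that writes the same records as CSV and as Parquet with pandas; a columnar, compressed Parquet file is typically smaller and faster to scan for analytics. The file names and data are assumptions.

```python
import pandas as pd

# Hypothetical sales records destined for a Cloud Storage bucket.
df = pd.DataFrame(
    {
        "product_id": ["sku-42", "sku-7", "sku-42"],
        "amount": [19.99, 5.50, 19.99],
    }
)

# Row-oriented and human-readable, but verbose to scan analytically.
df.to_csv("sales.csv", index=False)

# Columnar and compressed; usually a better fit for analytical reads.
# Requires the pyarrow (or fastparquet) package to be installed.
df.to_parquet("sales.parquet", index=False)
```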

Sources:

(1): https://landing.google.com/sre/workbook/chapters/data-processing/
(2): https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/lecture/EY31t/choosing-the-right-approach
(3): https://cloud.google.com/sql/docs
(4): https://cloud.google.com/datastore/docs/concepts/overview
(5): https://cloud.google.com/bigtable
(6): https://cloud.google.com/bigquery
(7): https://cloud.google.com/spanner
(8): https://cloud.google.com/pubsub
(9): https://cloud.google.com/storage/docs/introduction
(10): https://www.coursera.org/learn/gcp-big-data-ml-fundamentals/lecture/s3wa2/approach-move-from-on-premise-to-google-cloud-platform
