Building a data pipeline on GCP for increasing eCommerce sales conversion

The purpose of this project is to solve a business problem using a given dataset as input. The case addresses cart abandonment in order to increase sales conversion and gain real-time insights for the e-commerce industry.
The main goals are:

  • Analyze the business case and the given dataset
  • Identify the most suitable approach to solve the given business problems
  • Implement, build and deploy the target architecture and data pipeline
  • Check that the output conforms to the given business problems
  • Share the implementation details via blog posts and YouTube videos.

What will we build?

  • Data pipeline for streaming and batch processing
  • CI/CD chain to set up the infrastructure and the managed services used
  • Data visualization dashboard for business users

What will we learn?

  • How to choose the right option for data storage for streaming and batch processing
  • How to choose the right service/product for data processing to fit business requirements
  • How to implement streaming and batch processing data pipelines
  • How to implement machine learning models
  • How to communicate final results

Business requirements:

1- Business problem #1: Retarget customers with an abandoned cart within 3 minutes at most of the session expiration. The use case is to increase sales conversion by reducing cart abandonment. The output, i.e. the retargeting format, is a personalized email to the subscriber offering free shipping.

2- Business problem #2: Identify the products likely to attract the most sales in the future, in order to anticipate supply.

3- Business problem #3: Calculate customer lifetime value.

4- Business problem #4: Visualize sales data per product within 30 to 45 seconds once the payment is done.
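
To make business problem #1 more concrete, here is a minimal sketch of how abandoned carts could be detected from the event stream. It is only an illustration under stated assumptions, not the pipeline built in this series: the function name `find_abandoned_carts` and the 30-minute `SESSION_TIMEOUT` are hypothetical (the requirement does not say when a session expires), `event_time` is assumed to be already parsed into a `datetime`, the removal event is assumed to be spelled `remove_from_cart`, and any purchase is assumed to mark the whole session as converted.

```python
from collections import defaultdict
from datetime import datetime, timedelta

# Hypothetical inactivity timeout: the requirement does not define session expiration.
SESSION_TIMEOUT = timedelta(minutes=30)


def find_abandoned_carts(events, now, timeout=SESSION_TIMEOUT):
    """Return {user_session: product_ids still in the cart} for expired sessions.

    `events` is an iterable of dicts keyed by the dataset's field names.
    """
    carts = defaultdict(set)  # products currently in each session's cart
    last_seen = {}            # most recent event time per session
    purchased = set()         # sessions with at least one purchase (simplification)

    for e in sorted(events, key=lambda e: e["event_time"]):
        session = e["user_session"]
        last_seen[session] = e["event_time"]
        if e["event_type"] == "cart":
            carts[session].add(e["product_id"])
        elif e["event_type"] == "remove_from_cart":
            carts[session].discard(e["product_id"])
        elif e["event_type"] == "purchase":
            purchased.add(session)

    # Keep sessions that are expired, never purchased, and still hold products.
    return {
        s: products
        for s, products in carts.items()
        if products and s not in purchased and now - last_seen[s] >= timeout
    }
```

In a streaming setting the same logic would run per session window rather than over a full list, and the sessions returned are the ones that would trigger the free-shipping email.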

Technical requirements:

1- Set up the data pipeline for streaming and batch processing

  1. Ingest sales data in real time
  2. Clean up the data and ensure data quality (handle missing and negative values)
  3. Prepare the cleaned data for:
    1. Business analysts, for visualizations
    2. Data scientists and ML engineers, to build ML models, e.g. a probability estimate of a sale for each product
  4. React instantly to retarget customers with an abandoned cart: make a new offer (free coupon or free shipping) and send it to the subscriber by email within the next 3 minutes
  5. Archive historical data so data scientists can re-analyze it for future ML models
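
As an illustration of the data quality step (missing and negative values), here is a small sketch of a row-level cleaning function. The choices in it are assumptions, not decisions taken from this series: the name `clean_row`, the list in `REQUIRED_FIELDS`, and the policy of dropping bad rows instead of repairing them are all hypothetical. `brand` and `category_code` are left optional because the dataset description says they can be absent.

```python
import math

# Assumed set of mandatory fields; brand and category_code may legitimately be empty.
REQUIRED_FIELDS = ("event_time", "event_type", "product_id",
                   "user_id", "user_session", "price")


def clean_row(row):
    """Return a cleaned copy of a raw row (dict of strings), or None to drop it."""
    # Trim whitespace and treat None as an empty value.
    cleaned = {key: (value or "").strip() for key, value in row.items()}

    # Drop rows where a mandatory field is missing or empty.
    if any(not cleaned.get(field) for field in REQUIRED_FIELDS):
        return None

    # Drop rows whose price is not a number, is NaN, or is negative.
    try:
        price = float(cleaned["price"])
    except ValueError:
        return None
    if math.isnan(price) or price < 0:
        return None

    cleaned["price"] = price
    return cleaned
```

Rows rejected here could also be routed to a separate store for inspection rather than discarded, depending on how strict the quality rules need to be.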

2- Set up the CI/CD chain

  1. Automate the creation and destruction of the infrastructure and managed services
  2. Set up the continuous integration and continuous delivery chain


1- Description

This dataset contains behavior data for 5 months (Oct 2019 – Feb 2020) from a medium-sized online cosmetics store. Each row in the file represents an event. All events are related to products and users. Each event is like a many-to-many relation between products and users.

Note: if this dataset is too small for you, you can try the larger dataset from a multi-category store.

There are different types of events. See below.

Semantics (or how to read it): user user_id, during session user_session, added to the shopping cart (property event_type equals cart) product product_id of brand brand, of category category_code, with price price, at event_time.

2- Structure

event_time: Time when the event happened (in UTC).

event_type: Type of the event. Events can be:

  • view – a user viewed a product
  • cart – a user added a product to the shopping cart
  • remove_from_cart – a user removed a product from the shopping cart
  • purchase – a user purchased a product

Typical funnel: view => cart => purchase.
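
The funnel above can be turned into numbers with a short sketch like the following. It counts the distinct sessions that produced each event type and derives step-to-step ratios. The names `funnel_counts` and `step_conversion` are hypothetical, and the counting is deliberately loose: it does not check that the events happened in funnel order within a session.

```python
from collections import defaultdict

FUNNEL = ("view", "cart", "purchase")


def funnel_counts(events):
    """Count distinct sessions that produced each funnel event type."""
    sessions_by_type = defaultdict(set)
    for e in events:
        if e["event_type"] in FUNNEL:
            sessions_by_type[e["event_type"]].add(e["user_session"])
    return {stage: len(sessions_by_type[stage]) for stage in FUNNEL}


def step_conversion(counts):
    """Ratio of each stage to the previous one; None if the previous stage is empty."""
    rates = {}
    for prev, cur in zip(FUNNEL, FUNNEL[1:]):
        rates[f"{prev}->{cur}"] = counts[cur] / counts[prev] if counts[prev] else None
    return rates
```

A low cart-to-purchase ratio is the signal that the cart abandonment use case targets.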

product_id: ID of a product

category_id: Product’s category ID

category_code: Product’s category taxonomy (code name), when it was possible to derive one. Usually present for meaningful categories and omitted for various kinds of accessories.

brand: Lowercased brand name. Can be missing.

price: Price of a product.

user_id: Permanent user ID.

user_session: Temporary session ID of the user. It stays the same throughout a session and changes every time the user comes back to the online store after a long pause.
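
The structure described above can be sketched as a typed record, which is handy when reading the CSV file in Python. Several details here are assumptions rather than facts from the dataset description: the class name `Event`, the helpers `parse_event` and `read_events`, the timestamp layout in `TIME_FORMAT`, and the choice to keep all IDs as strings. Empty `category_code` and `brand` values are mapped to `None`, since both fields can be absent.

```python
import csv
import io
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Assumed timestamp layout, e.g. "2019-10-01 00:00:00 UTC"; adjust to the actual file.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S UTC"


@dataclass(frozen=True)
class Event:
    event_time: datetime
    event_type: str               # view, cart, remove_from_cart or purchase
    product_id: str
    category_id: str
    category_code: Optional[str]  # may be absent
    brand: Optional[str]          # may be absent
    price: float
    user_id: str
    user_session: str


def parse_event(row):
    """Build an Event from one CSV row given as a dict of strings."""
    return Event(
        event_time=datetime.strptime(row["event_time"], TIME_FORMAT)
        .replace(tzinfo=timezone.utc),
        event_type=row["event_type"],
        product_id=row["product_id"],
        category_id=row["category_id"],
        category_code=row["category_code"] or None,
        brand=row["brand"] or None,
        price=float(row["price"]),
        user_id=row["user_id"],
        user_session=row["user_session"],
    )


def read_events(text):
    """Parse CSV text with a header row into a list of Event records."""
    return [parse_event(row) for row in csv.DictReader(io.StringIO(text))]
```

For the real files, `csv.DictReader` can be pointed at an open file handle instead of an in-memory string.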

Target Architecture

The architecture below is the target to build on GCP. Component choices are explained in further articles in the series.