A few years ago, The Economist published an article titled
The world’s most valuable resource is no longer oil, but data
and the fact is that the digital economy is largely data-driven: enterprises all over the world collect and store data, either for immediate use in machine learning models or directly by the business to extract KPIs and insights.
In real-world scenarios, this data in its raw form can be a headache for the company. Here is a non-exhaustive list of the simplest issues:
- Duplicated data: personal information such as names and addresses can easily be duplicated for the same customer because of multi-channel contacts, acquisitions, and non-uniform inputs.
- Incomplete data: customer data is often incomplete at collection time, and there is no data enrichment process.
- Outdated data: contact information goes stale, so companies either lose their prospects or target them inaccurately.
- Broken, missing, or inconsistent data, and the list goes on…
Fixing all of this requires heavy data engineering to build and maintain the different data pipelines.
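To make the duplication problem above concrete, here is a minimal sketch in plain Python (the field names and sample records are hypothetical): free-text fields are normalized before comparison, so the same customer entered through two channels with different casing and spacing is flagged as a duplicate.

```python
def normalize(record):
    """Canonicalize free-text fields so trivially different inputs compare equal."""
    return tuple(" ".join(str(record[k]).lower().split()) for k in ("name", "email"))

def find_duplicates(records):
    """Return groups of records that normalize to the same key."""
    seen = {}
    for r in records:
        seen.setdefault(normalize(r), []).append(r)
    return [group for group in seen.values() if len(group) > 1]

# Hypothetical customer rows from two acquisition channels
customers = [
    {"name": "Jane  Doe", "email": "JANE@example.com"},  # web signup
    {"name": "jane doe",  "email": "jane@example.com"},  # call-center entry
    {"name": "John Roe",  "email": "john@example.com"},
]
dupes = find_duplicates(customers)
print(len(dupes))  # → 1 duplicate group: the two "Jane Doe" entries
```

Real deduplication is much harder (fuzzy matching, phonetic encodings, entity resolution), but exact matching on normalized keys is the usual first pass.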
Why it matters from a business perspective
As digital transformation accelerates, enterprises are embracing the new era and reinventing themselves: they introduce new services backed by machine learning models, creating larger and more complex data streams with greater data quality challenges.
According to research by Gartner, $14.2 million is lost annually as a result of poor data capture.
Data quality is therefore crucial for the business and can heavily affect the output of machine learning models.
If Your Data Is Bad, Your Machine Learning Tools Are Useless
That being said, the nature of the input data strongly influences predictions, and bad predictions are bad for business.
How can this situation be improved?
Defining and building an enterprise Data & AI strategy is the first step. Such a strategy basically includes:
- Business goals
- Data engineering management
- Machine learning at scale
Each of these needs to be aligned with business goals:
- What is the desired outcome?
- How are success and failure measured?
- Who are the users and contributors?
Data engineering management
This includes the design and implementation, at scale, of all data-related aspects:
- how to collect and ingest data from heterogeneous sources
- how to process it with scalable pipelines
- how to assess data quality
- whether to use batch or stream processing
- how to deliver accurate data for further processing
- how to store and archive data
- how to set up governance and GDPR compliance
- cost-efficiency considerations
- self-service data availability, for visualizing business insights or as input to machine learning pipelines
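Of the points above, data quality assessment is the easiest to sketch. The function below is a minimal, stdlib-only example of one common check, field completeness (the field names and rows are hypothetical; real pipelines typically use dedicated tooling and many more rules):

```python
def completeness_report(rows, required_fields):
    """Share of rows where each required field is present and non-empty."""
    total = len(rows)
    report = {}
    for field in required_fields:
        filled = sum(1 for r in rows if str(r.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report

# Hypothetical customer rows with gaps
rows = [
    {"name": "Jane", "phone": "555-0100", "city": "Paris"},
    {"name": "John", "phone": "",         "city": "Lyon"},
    {"name": "Ann",  "phone": None,       "city": ""},
]
report = completeness_report(rows, ["name", "phone", "city"])
print(report)  # name fully populated, phone and city partially missing
```

A pipeline can run such checks on every batch and refuse to publish data that falls below an agreed threshold, which is the basic idea behind data quality gates.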
Machine learning at scale
It is crucial to incorporate the model-building process into CI/CD, manage the full lifecycle, and be able to:
- automate training from already-prepared data
- automate retraining
- monitor deployed models and detect data drift
- maintain a central model repository
- track model versioning and usage
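Drift detection in particular can start very simply. The sketch below (a deliberately naive example, not a production method) flags drift when the mean of a live feature moves more than a chosen number of training standard deviations away from the training mean; the feature values and threshold are illustrative:

```python
import statistics

def mean_shift(train_values, live_values, threshold=0.5):
    """Naive drift signal: how many training standard deviations the
    live mean has moved from the training mean, plus a flag."""
    mu = statistics.mean(train_values)
    sigma = statistics.stdev(train_values)
    shift = abs(statistics.mean(live_values) - mu) / sigma
    return shift, shift > threshold

# Hypothetical feature values seen at training time vs. in production
train = [10.0, 11.0, 9.5, 10.5, 10.2]
live = [14.0, 13.5, 14.2, 13.8]
shift, drifted = mean_shift(train, live)
print(drifted)  # → True: the live distribution has clearly moved
```

Production systems use stronger distributional tests (e.g. Kolmogorov–Smirnov or population stability index), but the principle is the same: compare live data against a training-time reference and alert when they diverge.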
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It tackles four primary functions:
- Tracking experiments to record and compare parameters and results (MLflow Tracking).
- Packaging ML code in a reusable, reproducible form in order to share with other data scientists or transfer to production (MLflow Projects).
- Managing and deploying models from a variety of ML libraries to a variety of model serving and inference platforms (MLflow Models).
- Providing a central model store to collaboratively manage the full lifecycle of an MLflow Model, including model versioning, stage transitions, and annotations (MLflow Model Registry).
MLflow is library-agnostic. You can use it with any machine learning library, and in any programming language, since all functions are accessible through a REST API and CLI. For convenience, the project also includes a Python API, R API, and Java API.