Machine learning: how to build models as products using MLOps?
Machine learning is a branch of artificial intelligence (AI) that the market has discovered but does not yet know how to use, or what it is for. Indicium does, and will teach you.
The possibilities that arise in this field are countless, and knowing how to use data and data science to generate value is a complex task.
In fact, machine learning is already present in several everyday applications: in the social networks we check daily, in digital retail, in banking operations, in science and even in industry.
Let's put all of this into context before we move on to practice in this series on machine learning and MLOps.
Happy reading!
What is machine learning?
Expanding rapidly, the area is attracting the attention of the market and an increasing number of professionals.
The demand for qualified data scientists has led to a flood of short courses that teach the essentials of developing machine learning models. These courses typically apply little or no rigor to programming practices and focus mainly on statistics and machine learning applications.
The vast majority of courses (especially those based on Python) use Jupyter notebooks, for example, as the working interface. This is not a criticism of notebooks themselves, but they can encourage bad practices.
Furthermore, data science is often populated by professionals trained in fields other than IT, such as engineering, economics, business administration and other degrees with a mathematical foundation. There are also professionals from completely different areas, and even some without a university degree. And that's OK! This diversity is one of the strengths of the area: each different view of the same topic can bring new insights.
However, those who enter the data area via the “fastest route” tend to write disorganized, poorly structured code because of limited knowledge of software architecture, and to struggle when putting a project into production.
Yet machine learning applications are, by definition, information technology applications.
Therefore, to turn your model into a product, it must be embedded in a robust framework that provides both a development environment and a production environment, as well as integration with other applications (such as web applications and apps) and with all types of operations.
And that's where MLOps comes in: the child of machine learning (ML) and DevOps.
What is DevOps?
DevOps is a software engineering culture that seeks to bring those who develop the software closer to those who operate it. Its structure allows for greater automation and monitoring during project development.
This culture is already well established in software engineering, and even extends to the widely adopted agile project management methodologies. Now it’s machine learning’s turn to follow suit.
What is MLOps?
MLOps is a culture that, like DevOps, enables the development and deployment of machine learning systems in a scalable, sustainable and standardized way, in order to deliver high-performance models in production.
Breaking the previous sentence down, and adding robustness to the list, we can conclude that the product needs to be:
- scalable: the machine learning product must be able to expand its capacity to serve a greater number of users;
- sustainable: it must be maintainable, so that when faults are found they can be corrected quickly;
- standardized: the more data science teams grow, the greater the need for standardized code, so that code written by other collaborators is easy to read and interpret;
- robust: it must be possible to keep models in production alongside a development environment, so that new ideas and/or error corrections can be worked on without affecting the operation.
Such solutions demand more and more from data teams. Properly executing a project using MLOps concepts involves a large team of professionals, which can span development, data engineering, analytics engineering, data science, machine learning engineering and other areas.
Data scientists, as we said above, need to be prepared for this growing need to interact with development teams, and to build their products in a way that covers all the MLOps characteristics listed above.
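To make the "robust" characteristic a little more concrete, here is a minimal sketch in plain Python of keeping development and production settings apart. Every name in it (EnvironmentConfig, ENVIRONMENTS, load_config, can_deploy, the model paths) is a hypothetical choice made for illustration, not part of any framework used in this series.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EnvironmentConfig:
    """Settings that differ between environments (all names are hypothetical)."""
    name: str
    model_path: str
    allow_experimental_models: bool


# Hypothetical settings: development may try new ideas, production may not.
ENVIRONMENTS = {
    "development": EnvironmentConfig("development", "models/dev/model.pkl", True),
    "production": EnvironmentConfig("production", "models/prod/model.pkl", False),
}


def load_config(env_name: str) -> EnvironmentConfig:
    """Return the settings for one environment, failing loudly on a typo."""
    if env_name not in ENVIRONMENTS:
        raise ValueError(f"Unknown environment: {env_name!r}")
    return ENVIRONMENTS[env_name]


def can_deploy(model_is_experimental: bool, config: EnvironmentConfig) -> bool:
    """Experimental models are only allowed where the environment permits them."""
    return config.allow_experimental_models or not model_is_experimental
```

With this separation, an experimental model can be tried in the development environment while the production environment keeps serving the stable one, which is the behavior the "robust" item describes.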
Tools for implementing MLOps in your project
A number of tools standardize and simplify the implementation of MLOps in data science projects. To help maintain standardization and modularity, these frameworks encourage good programming practices by design. We will only use free, open-source frameworks.
In the next posts in this series, we will address each of them in turn. The frameworks will be:
- Kedro: a tool written in Python that allows the creation of standardized, modular and sustainable data science pipelines.
- MLflow: used to track model metrics and parameters.
- Model deployment in production: an open-source tool for deploying machine learning models (which you will only learn about in the last article in this series!).
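Before we meet the tools themselves, the two ideas behind the first two items (a pipeline built from small, named steps, and a record of the parameters and metrics of each run) can be sketched in plain Python. This is only an illustration under our own assumptions: it does not use the Kedro or MLflow APIs, and run_pipeline, the node functions, the "attack" column and the toy rows are all hypothetical.

```python
from typing import Any, Callable, Dict, List, Tuple

# A node is (function, names of its inputs, name of its output).
Node = Tuple[Callable[..., Any], List[str], str]


def run_pipeline(nodes: List[Node], catalog: Dict[str, Any]) -> Dict[str, Any]:
    """Run the nodes in order, storing each output under its name."""
    data = dict(catalog)  # copy so the caller's catalog is not mutated
    for func, input_names, output_name in nodes:
        data[output_name] = func(*(data[name] for name in input_names))
    return data


def select_target(rows: List[Dict[str, float]], target: str) -> List[float]:
    """Extract the target column from the raw rows."""
    return [row[target] for row in rows]


def fit_mean_model(values: List[float]) -> float:
    """A deliberately trivial 'model': always predict the mean."""
    return sum(values) / len(values)


def mean_absolute_error(values: List[float], prediction: float) -> float:
    """Average absolute distance between the values and the prediction."""
    return sum(abs(v - prediction) for v in values) / len(values)


PARAMS = {"target": "attack"}  # hypothetical column name
ROWS = [{"attack": 40.0}, {"attack": 60.0}, {"attack": 80.0}]  # made-up toy rows

NODES: List[Node] = [
    (select_target, ["rows", "target"], "y"),
    (fit_mean_model, ["y"], "model"),
    (mean_absolute_error, ["y", "model"], "mae"),
]

result = run_pipeline(NODES, {"rows": ROWS, **PARAMS})

# One record per run: the parameters used and the metrics obtained.
run_record = {"params": PARAMS, "metrics": {"mae": result["mae"]}}
```

Because each step is a small function with named inputs and outputs, steps can be tested and replaced independently, and keeping one record per run makes it possible to compare experiments later.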
Example project
To guide us through all these tools, we will use an example project in which we will implement a complete MLOps solution step by step. The dataset used will be the complete Pokémon dataset, available at this link. We will also use the Pokémon image dataset, available at this link.
The project repository is available on GitHub, at this link.
In the next post, part two of the series, we will start the project using Kedro.
Until then!
Daniel Avancini
Chief Data Officer