2023-02-13
MLOps is an abbreviation for ‘machine learning operations’. It is a set of practices combining machine learning, DevOps and data engineering that aims to deploy and maintain machine learning systems reliably and efficiently in production environments [1].
According to PwC Japan Group's 2022 AI Predictions (Japan) [2], bringing AI development and operation in-house is an important success factor in AI utilisation. In-house rates differ widely between companies, especially in the ‘improvement and operation (MLOps)’ phase. The operation phase is the most important phase for increasing the value of AI, so it can be inferred that companies able to run this phase in-house are more likely to enjoy the benefits of AI.
In this article, we introduce some MLOps practices used in a machine learning project for the back-office department of a telecommunications company. The client's policies imposed several prerequisites and constraints on the project.
We used robotic process automation (RPA) to ingest the data, online storage and local drives to store the data and programs, and periodic batch processing to execute AI model inference and retraining.
We decided to use an online storage service and local drive allocation for data storage. This made it possible to upload data and configuration files, download data, and validate and correct data remotely.
We used RPA to ingest data from different systems and store it on the local drive. The data is then synchronised to the online storage service.
The final AI application is also deployed to the online storage service and then synchronised to a local drive in the production environment. This makes it possible to update the AI application programs from anywhere as needed.
When the input data is ready, the operator can trigger the AI application manually by updating a trigger file (a plain text file containing the target folder name) on the local drive; the trigger file is then uploaded to the online storage service.
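The operator-side step above can be sketched as follows. The file path and folder-naming convention are hypothetical; in the actual project the file sits on a local drive that a sync client mirrors to the online storage service.

```python
from pathlib import Path

# Hypothetical path: a local-drive location that is synchronised
# to the online storage service by the sync client.
TRIGGER_FILE = Path("shared/trigger.txt")

def request_inference(target_folder: str) -> None:
    """Request an AI run by writing the target folder name
    into the plain-text trigger file."""
    TRIGGER_FILE.parent.mkdir(parents=True, exist_ok=True)
    TRIGGER_FILE.write_text(target_folder + "\n", encoding="utf-8")

# Example: ask the production server to process this folder.
request_inference("2023-02-batch")
```

Because the trigger is just a text file on a synchronised drive, the operator needs no direct access to the production server to start a run.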
In the on-premises server, the data to be inferred, the trigger file, and other configuration files are synchronised to the local drive.
In the on-premises server, periodic batch processing is used to execute the AI application periodically (e.g. at five-minute intervals).
The AI application first checks the content of the trigger file. If a valid execution condition is set, it proceeds as follows:
To prevent duplicate execution, the AI application resets the trigger file after reading it.
The AI application then runs inference on the new data and outputs the results to spreadsheet files, which contain both the inference result and the confidence level for each data item.
The operator can check the result data after it is synchronised to the local drive, prioritising checks of data with low confidence. The output spreadsheet files provide a format that allows the operator to fill in the correct answer.
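A minimal sketch of this output step, using CSV for simplicity (the actual project used spreadsheet files). The item IDs, labels and confidence values are made up; the point is the ascending confidence sort, which puts the rows the operator should check first at the top, and the blank correction column.

```python
import csv

# Hypothetical inference output: (item id, predicted label, confidence).
results = [
    ("item-001", "category_a", 0.97),
    ("item-002", "category_b", 0.62),
    ("item-003", "category_a", 0.88),
]

# Sort ascending by confidence so low-confidence rows appear first,
# matching the operator's review priority. A "correct_answer" column
# is left blank for the operator to fill in.
with open("inference_results.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "prediction", "confidence", "correct_answer"])
    for item_id, pred, conf in sorted(results, key=lambda r: r[2]):
        writer.writerow([item_id, pred, f"{conf:.2f}", ""])
```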
The automation of AI model retraining is an indispensable part of MLOps.
The retraining program is executed periodically (e.g. monthly) to read the correction data provided by the operator and merge those corrections with the current training data. The program then optimises the hyperparameters on the updated data. Finally, new AI models are generated and deployed to the AI application folder.
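The merge step can be sketched as follows. The file names, the `id`/`label` column layout and the override rule (a correction replaces the existing row with the same id, new ids are appended) are illustrative assumptions; the subsequent hyperparameter search and model fitting would then run on the merged file.

```python
import csv
from pathlib import Path

def merge_corrections(train_path, corrections_path, out_path):
    """Merge operator corrections into the current training data.

    Corrected rows override existing rows with the same id; ids seen
    only in the corrections file are appended as new training examples.
    """
    data = {}
    for path in (train_path, corrections_path):
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                data[row["id"]] = row["label"]  # later file wins
    with open(out_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["id", "label"])
        writer.writeheader()
        for item_id, label in sorted(data.items()):
            writer.writerow({"id": item_id, "label": label})

# Example with made-up data: item-002 is corrected, item-003 is new.
Path("train.csv").write_text(
    "id,label\nitem-001,a\nitem-002,b\n", encoding="utf-8")
Path("corrections.csv").write_text(
    "id,label\nitem-002,c\nitem-003,a\n", encoding="utf-8")
merge_corrections("train.csv", "corrections.csv", "train_merged.csv")
```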
By using a periodic scheduler on an on-premises server together with an online storage service and local drives, we implemented an architecture that enables the AI models to evolve continuously by obtaining new knowledge from humans. After a year of actual operation, the AI model accuracy was maintained at over 99.3%.
Of course, this architecture is not a complete MLOps practice; for example, deployment is not automated with continuous integration and continuous deployment (CI/CD). However, given the background and limitations described above, it represents a practical way to put MLOps into practice in an on-premises environment during the COVID-19 pandemic.
[1] Breuel, C. ‘ML Ops: Machine Learning as an Engineering Discipline.’ Towards Data Science. Accessed 7 December 2022. https://towardsdatascience.com/ml-ops-machine-learning-as-an-engineering-discipline-b86ca4874a3f
[2] ‘2022 AI Predictions (Japan).’ PwC Japan Group. Accessed 7 December 2022. https://www.pwc.com/jp/ja/knowledge/thoughtleadership/2022-ai-predictions.html
G. Zhao
Before joining PwC Consulting LLC, G. Zhao worked for a system development company. He is currently engaged in data analysis and AI tool development for the telecommunications industry, telework environment improvement surveys for government agencies, and management of the digital product (intelligent business analytics tool) development team.