Introduction
Machine Learning has seen a near-exponential uptake in products over the last decade. Models mimicking brain-like structures (neurons and synapses) have existed since the days of the early computers but were long confined mainly to the academic community. Over the years the field evolved, methods from traditional statistical analysis were incorporated, and it is now booming in the age of Big Data, where there are few competing alternatives for describing complex dependencies within large data sets. One of the main drivers has been the rapid increase in computing power based on a technological development that follows Moore’s law. This has enabled the processing of vast amounts of data with highly complex Machine Learning models (many nodes and layers), e.g. Deep Learning.
However, even to this date, development of Machine Learning models requires a fairly experimental way-of-working and relies on people with specific competences; Big Data and Data Analytics require similar skill sets for roles like Data Scientist and Data Engineer. Lately there have been efforts to bring Machine Learning development closer to traditional engineering disciplines by defining its development lifecycle, competence needs, tools and best practices. In other words, extending Systems Engineering (loosely described by the quote below) to also cover Machine Learning (ML). This is not only to support those who develop ML systems, but also to give the rest of the organization a chance to:
– Plan, develop and maintain products containing ML components
– Design ML components so that they fit with the other parts of a product
– Identify where they are in the development lifecycle and how to track and support development
Systems Engineering adds a system view, which is vital for ML solutions that are getting increasingly complex and require a holistic understanding.
“Systems engineering is the art and science of developing an operable system capable of meeting requirements within often opposed constraints. Systems engineering is a holistic, integrative discipline, wherein the contributions of … engineers … and many more disciplines are evaluated and balanced, one against another, to produce a coherent whole that is not dominated by the perspective of a single discipline.”
(NASA, 2019)
Machine Learning Development
Developing an ML system often resembles research more than traditional development. The solution – if one exists – is not given beforehand, and results are reached through experimentation and iteration. The most distinguishing feature of ML is, however, that the solution logic is not defined or coded but extracted from data. Data is therefore absolutely essential to an ML solution, and so is the ability to:
– Gather, store, clean (from e.g. noise or bias) and pre-process data
– Understand the data at hand, e.g. does it represent the problem being studied well enough?
– Monitor data over time and context, since changes in patterns should be expected
Activities like these have been simplified through a substantial increase in available tools and frameworks (many of them Open Source) that can be used both for data analysis and for the creation of almost infinitely complex ML models. Thus, practitioners are in a much better situation now than 10-15 years ago. However, it is still important to have a good understanding of the underlying models in order to account for their possible limitations and strengths, and to some extent understand what a model does (providing better “explainability”, meaning the extent to which the internal mechanics of an ML system can be explained in human terms).
Figure 1. The Machine Learning development
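The data activities listed above can be sketched in a few lines of code. The readings, the `None` markers for missing values and the `999.0` failure sentinel are all illustrative assumptions, not a real dataset:

```python
import statistics

# Hypothetical sensor readings; None marks missing values and 999.0 is
# assumed to be a known sentinel for sensor failure (both made up for
# illustration).
raw = [21.4, 21.9, None, 22.1, 999.0, 21.7, 21.5, None, 22.0]

# Cleaning: drop missing values and sentinel outliers.
clean = [x for x in raw if x is not None and x != 999.0]

# Understanding: basic summary statistics help judge whether the sample
# plausibly represents the process being studied.
mean = statistics.mean(clean)
spread = statistics.stdev(clean)
print(len(clean), round(mean, 2), round(spread, 2))  # → 6 21.77 0.28
```

In practice the same ideas scale up through libraries such as pandas, but the principle is unchanged: cleaning rules and summary checks are explicit, inspectable steps, not afterthoughts.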
After choosing a specific ML model, a “training” step follows, in which patterns and features are found in the data provided to the model. In a world where vast quantities of data are easily available, it is tempting to use more data as brute force to get models to converge. As a rule, however, good (high-quality) data and clever pre-processing are preferable. Hence, it is reasonable to expect to spend more time on data collection and preparation than on the actual construction of ML models.
Systems Architecture and Machine Learning
Typically, the ML model code constitutes a very small fraction of the complete system. Most of the development work can be described as “plumbing” between components that perform other functions (see Figure 2 below).
Figure 2. A conceptual view of an ML system as presented in [1]. The boxes describe tasks performed by different parts of the system, and the size of each box represents the effort spent in each area.
Like any system, ML systems may accumulate a considerable amount of technical debt. They not only suffer from all the traditional problems of maintaining code over time but also from problems uniquely associated with the fact that their behaviour is driven by data. It is not enough to trust traditional approaches like static and dynamic code analysis, modularization and APIs, since data flows across and affects the complete system. Examples of data-related technical debt that may arise if ML components are regarded as “black boxes” are:
- Training data dependencies that call for careful consideration before reusing or linking models. When used in a new context, a model (or linked models) generally needs to be re-optimized.
- Hidden assumptions that have been built into glue code or complicated data collection structures caused by an experimental way of working (so-called pipeline jungles) make the system difficult to analyse and maintain.
- Different models that have been trained based on the same original dataset will be connected in ways that can be very difficult to analyse.
- The environment that data is collected from is generally a dynamic system (the world around us) that is constantly changing, and this may also invalidate an ML model if it is not monitored and updated over time. When a model is in operation it may also influence its own future training data in complex feedback loops.
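The monitoring concern in the last point can be sketched as a simple drift alarm that compares live data against training-time statistics. The threshold rule and all numbers below are illustrative assumptions; real systems often use proper statistical tests such as Kolmogorov–Smirnov or the Population Stability Index:

```python
import statistics

def drift_alarm(train_sample, live_sample, threshold=3.0):
    """Flag drift when the live mean deviates from the training mean by
    more than `threshold` standard errors. A deliberately naive check,
    for illustration only."""
    mu = statistics.mean(train_sample)
    se = statistics.stdev(train_sample) / len(train_sample) ** 0.5
    return abs(statistics.mean(live_sample) - mu) > threshold * se

train = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.3, 10.1]   # training-time data
stable = [10.0, 10.1, 9.9, 10.2]                          # live data, unchanged world
shifted = [11.5, 11.7, 11.4, 11.6]                        # live data after the world changed

print(drift_alarm(train, stable), drift_alarm(train, shifted))  # → False True
```

The point is architectural rather than statistical: without some check like this running for the model's whole lifetime, a silently changing environment degrades the model with no visible code change.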
The Machine Learning Development Lifecycle
A good starting point for a non-practitioner wanting to learn how a Machine Learning model is developed is to understand the lifecycle of an ML component. At Microsoft the process is mapped as in Figure 3 below (note: alternative but similar process descriptions exist).
Figure 3. The Machine Learning development lifecycle process according to Microsoft [2].
First, the requirements for the model and the data it depends on are set. This is followed by the data steps: 1) identifying and collecting the data needed by the model, 2) cleaning the data from noise and bias, and 3) labelling the data so it can be identified and traced. Feature engineering basically entails pre-processing the data in a way that fits the ML model that has been chosen. Model training means using (parts of) the pre-processed data to create and converge an ML model; if no solution is found, it may be necessary to go back to the previous step. In the model evaluation, data not previously used (collected, constructed or both) is used to check the validity of the model. If validity is too low, there is a need to loop back to one of the earlier steps – often the first ones. Finally, the model is integrated and deployed in a product and monitored for the rest of its lifetime to secure validity and consistency.
Since there are several data-oriented process steps in the development lifecycle, it is reasonable to assume that many of the challenges lie in managing data (collecting, cleaning, versioning/labelling and monitoring), which is very different from traditional software development with its focus on coding. As mentioned earlier, data is at the centre of an ML solution, and an ML component therefore cannot be treated in the same way as a regular software component. This affects how adaptation, customization and reuse can be done. From a tool framework perspective, the different steps in the ML lifecycle all require their own specific tools. The lack of tool integration across the lifecycle has previously been identified as a challenge, but lately a lot of effort has been put into creating unified tool chains to overcome this.
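Several of the lifecycle steps above (holding out evaluation data, feature engineering, model training, model evaluation) can be illustrated in one compact sketch. The data and the nearest-centroid model are illustrative assumptions chosen for brevity, not the method of any particular framework:

```python
import statistics

# Illustrative labelled data (feature value, class label); every number
# here is made up for the sketch.
train = [(1.0, 0), (1.2, 0), (0.9, 0), (3.0, 1), (3.2, 1), (2.9, 1)]
test = [(1.1, 0), (3.1, 1)]            # held out for model evaluation

# Feature engineering: standardise using statistics from the training
# data only, so nothing leaks from the evaluation data.
mu = statistics.mean(x for x, _ in train)
sd = statistics.stdev(x for x, _ in train)

def scale(x):
    return (x - mu) / sd

# Model training: a nearest-centroid classifier, one centroid per class.
centroids = {c: statistics.mean(scale(x) for x, y in train if y == c)
             for c in {y for _, y in train}}

def predict(x):
    return min(centroids, key=lambda c: abs(scale(x) - centroids[c]))

# Model evaluation on the held-out data; in the lifecycle, a low score
# here triggers a loop back to an earlier step.
accuracy = sum(predict(x) == y for x, y in test) / len(test)
print(accuracy)  # → 1.0
```

Note how the evaluation score is the gate that decides whether to continue to deployment or loop back, which is exactly the feedback structure the lifecycle diagram describes.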
This is an area where a systems engineering view can provide support; moreover, the process steps themselves include areas, such as requirements engineering and verification and validation, that are well known to the systems engineering community.