Overcoming the Challenges of Adding Machine Learning to Your Products
Capturing the Data for Machine Learning
To Train a Machine Learning Model, You Need Data
Lots of Data
To train a model, you need a substantial amount of data, especially if you're working with deep learning. Images are a common example because image recognition is a popular application and easy to understand. To train a deep learning model, you might need thousands or even millions of data samples.
Data Coverage
The data coverage must be statistically significant; in other words, you need enough coverage to ensure the data represents all of the cases your model will see.
Diverse Data
The diversity of the data is also crucial. For instance, if certain events occur 80 percent of the time, they should represent approximately 80 percent of your data. You need to consider this when sampling your data to ensure you cover all different states of your system.
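As a rough sketch of preserving those proportions in practice, scikit-learn's train_test_split can stratify a sample by label. The data and labels below are synthetic placeholders, not drawn from any dataset discussed here.

```python
# Hypothetical example: keep class proportions intact when sampling a subset.
# Assumes scikit-learn is installed; X and y are placeholder data.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=0)
X = rng.normal(size=(1000, 8))                   # 1,000 samples, 8 features
y = rng.choice([0, 1], size=1000, p=[0.8, 0.2])  # event 0 occurs ~80% of the time

# stratify=y keeps the ~80/20 ratio in both the kept and held-out portions
X_subset, _, y_subset, _ = train_test_split(
    X, y, train_size=0.5, stratify=y, random_state=0
)

print(np.bincount(y_subset) / len(y_subset))     # roughly [0.8, 0.2]
```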
An interesting discussion often arises when an Artificial Intelligence/Machine Learning (AI/ML) system makes a prediction with 94 percent confidence. People may worry about the remaining 6 percent, but traditional algorithms also produce errors when they encounter states or data modes that were never considered. Therefore, you need representative data with sufficient coverage of all failure modes.
Quality Data
The quality of the data is paramount. It must be properly labeled to reflect what is actually happening because a single mislabeled piece of data can cause significant problems during model training.
Test and Validation Data
Additionally, you need data for testing and validation. A common approach is the 80/10/10 split, where 80 percent of the data is used for training, 10 percent for testing, and the remaining 10 percent for validation. Sometimes, a 70/15/15 split is used. The validation phase involves using data the model has never seen to assess its accuracy.
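A minimal sketch of such an 80/10/10 split, assuming scikit-learn and a placeholder dataset, might look like this:

```python
# Minimal sketch of an 80/10/10 split using two passes of train_test_split.
# X and y are placeholders for your own samples and labels.
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(seed=1)
X = rng.normal(size=(5000, 16))
y = rng.integers(0, 10, size=5000)

# First carve off 80% for training; the remaining 20% is split in half below.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, train_size=0.8, random_state=1)
X_test, X_val, y_test, y_val = train_test_split(X_rest, y_rest, test_size=0.5, random_state=1)

print(len(X_train), len(X_test), len(X_val))  # 4000, 500, 500
```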
Where Do You Get the Data?
A common question is where to obtain all this data. You can either acquire an existing dataset, for example by purchasing one, or collect the data yourself.
Database Examples
ImageNet
ImageNet is one of the most well-known image-based databases. It contains 14 million images, which gives you an idea of the vast amount of data available. All the images in the ImageNet database are collected from the web and cover a wide range of categories, including objects, animals, and scenes. Each image has been annotated with labels, allowing you to perform contextual analysis.
Interestingly, the license terms for ImageNet state that it is freely available for research purposes only. You should not produce a commercial product based on this data, as it is designed specifically for research. With 14 million images, it is a massive database, which is why much of the learning is done in data centers or on servers. Researchers often use subsets of ImageNet.
An interesting topic here is ethics. There is criticism regarding privacy concerns, as some of the data in the ImageNet database includes identifiable faces. It raises the question of how you are supposed to train an algorithm to recognize faces if they have been blurred out. There are also concerns about potential biases present in the dataset that could be propagated by machine learning models trained on it.
MS COCO
The Microsoft® Common Objects in Context (MS COCO) database is a large-scale dataset for object detection, segmentation, and captioning. It includes 80 object categories that cover a wide range of everyday scenes, closely resembling real-world scenarios. You will find over 200,000 labeled images with more than 1.5 million object instances.
What's interesting is that all of these captions and annotations have been made by human annotators. People are paid to sit and draw boxes around objects in pictures. You can't get a machine to do it because the machine hasn't learned what the objects are yet. A person has to do it.
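To give a feel for what those human-drawn annotations look like on disk, here is a hedged sketch of reading a COCO-style annotation file with plain Python. The file name shown is the commonly used one, and the fields reflect the standard COCO format, but treat the details as illustrative.

```python
# Hypothetical sketch: reading COCO-style annotations with the standard json module.
# "annotations/instances_val2017.json" is the usual file name, shown here as an assumption.
import json

with open("annotations/instances_val2017.json") as f:
    coco = json.load(f)

# Each annotation links an image to one object instance with a human-drawn box.
first = coco["annotations"][0]
print(first["image_id"])     # which image the box belongs to
print(first["category_id"])  # one of the 80 object categories
print(first["bbox"])         # [x, y, width, height] drawn by an annotator
```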
The license is free for research use only, so you shouldn't build commercial products based on this dataset.
MNIST
The Modified National Institute of Standards and Technology (MNIST) database contains 70,000 images of handwritten digits from zero to nine. It is widely used for training and testing in the field of machine learning and computer vision. Each image is 28 by 28 pixels.
The database is already split into 60,000 training samples and 10,000 test samples. It is frequently used to analyze how well an algorithm recognizes handwriting or numbers.
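For reference, the standard Keras helper downloads MNIST with exactly this split; a minimal sketch, assuming TensorFlow is installed:

```python
# Minimal sketch: loading the MNIST 60,000/10,000 split with the Keras helper.
# Assumes TensorFlow is installed; the download happens automatically on first use.
from tensorflow.keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

print(x_train.shape)  # (60000, 28, 28) - 28x28 pixel grayscale images
print(x_test.shape)   # (10000, 28, 28)
print(y_train[:5])    # labels are the digits 0-9
```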
Microchip provides an example that runs using a framework called TensorFlow™ Lite, by Google®, on a PIC32 microcontroller. You can draw on the screen, and the demo uses a model trained on the MNIST database to determine the digit you have just written.
It's important to note that each image in MNIST is labeled with the digit it represents, so every sample comes with the correct answer attached.
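How the Microchip demo itself was built is not described here, but as a rough sketch of the general TensorFlow Lite workflow, a trained Keras model can be converted to a flatbuffer for deployment on a microcontroller. The model below is a hypothetical stand-in and the file name is illustrative.

```python
# Hedged sketch: converting a hypothetical trained Keras model to a TensorFlow Lite
# flatbuffer, the format typically deployed to embedded targets.
import tensorflow as tf

# Hypothetical stand-in for a model trained on MNIST (28x28 grayscale inputs).
model = tf.keras.Sequential([
    tf.keras.layers.Flatten(input_shape=(28, 28)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # shrink the model for embedded use
tflite_model = converter.convert()

with open("mnist_model.tflite", "wb") as f:           # illustrative file name
    f.write(tflite_model)
```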
Vibration Databases
Now, let's delve into a more embedded area. Here's an example from Case Western Reserve University (CWRU). They have a database of bearing vibration samples, all sampled at 12 kilohertz. What's interesting is that, in order to get a bearing to fail, they had to deliberately damage it using Electrical Discharge Machining (EDM). This process uses electrical discharges (controlled voltage spikes, similar to tiny lightning strikes) to erode pits into the bearing surface and damage it. This deliberate damage allows you to start monitoring and measuring the bearings. Sometimes, when collecting your data, you need to intentionally cause a failure to capture all the modes.
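As a small illustration of how such vibration samples might be examined, the sketch below runs an FFT over a synthetic signal at the same 12 kHz sampling rate; the fault frequency is made up for the example, and real data would come from the CWRU files.

```python
# Hedged sketch: inspecting a vibration signal sampled at 12 kHz with an FFT.
# The signal here is synthetic; real bearing data would come from the CWRU dataset.
import numpy as np

fs = 12_000                            # sampling rate in Hz, as in the CWRU dataset
t = np.arange(0, 1.0, 1 / fs)          # one second of samples
# Synthetic stand-in: a 160 Hz fault tone buried in noise
signal = (0.3 * np.sin(2 * np.pi * 160 * t)
          + 0.1 * np.random.default_rng(0).normal(size=t.size))

spectrum = np.abs(np.fft.rfft(signal))
freqs = np.fft.rfftfreq(signal.size, d=1 / fs)

print(freqs[np.argmax(spectrum[1:]) + 1])  # peak frequency, ~160 Hz for this example
```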
There are other databases of machinery failure data that cover different failure modes for production-line machinery. NASA also has data on acoustics and vibration, which is great if you're building rockets.