As technology advances, so does the way we do things. By and large, technology is meant to make our lives easier. Just consider how our phones unlock by recognizing our faces, how we track our food delivery couriers as they arrive at our doorstep, and how we even have apps to help us identify the leaves and birds we encounter on hikes.
Just as it changes our everyday lives, technology also changes the way we do research. As researchers, however, we want to understand what our algorithms actually do, because far more happens behind the scenes than we realize when the aforementioned bird-identifying app makes a match or when an app tells us to hop up off the couch to receive our pizza.
Most of us have heard terms like ‘machine learning’, ‘image processing’, and ‘neural networks’ thrown around, some of them with a dark ‘artificial intelligence’ undertone. The idea is that the machine somehow learns on its own, leading to associations with the scary thought that it could outperform a human being.
At the same time, all of us have also at some point screamed at our ‘smart’ phone/TV/watch after a particularly frustrating experience, belittling the device for being decidedly un-smart! Understanding how machines learn enables us to use this technology effectively, and building that understanding is exactly what this blog post aims to do.
In this post, we delve into the different ways of training an algorithm – the core of what machine learning means. Broadly classified, there are two ways an algorithm can learn: supervised and unsupervised.
How to train your algorithm – with supervised and unsupervised learning methods
Let’s imagine you had to train algorithms to classify chocolates – and who wouldn’t want that? First, you want to train an algorithm so that it can distinguish what is and what isn’t chocolate. Following that, you want to train an additional algorithm to identify all the possible variations a bar of chocolate can have.
For the first “problem”, classifying what is and isn’t chocolate, you could start by feeding the algorithm a lot of data about various food items. You could walk into your nearest supermarket, pick up various food packets, and input their ingredients into the algorithm. You could then label every packet with the ingredient cocoa as ‘chocolate’ and everything else as ‘not-chocolate’.
This is what the algorithm would learn from you – if the ingredient list has cocoa, the product can be classified as chocolate. While simple enough, this may give you false positives – what if you just have nuts with a chocolate coating? Or a packet of drinking chocolate? You don’t want these to be classified alongside bars of chocolate. So you try giving your algorithm more sophisticated rules – for example, that the product needs to be a solid slab, needs to weigh a certain amount, or that a human coder needs to classify it as a ‘valid dessert option’, a ‘valid gift option’, and so on.
As you add criteria, your algorithm will learn. It learns from the criteria you give it, and you supervise what it learns – hence the name “supervised learning”. The algorithm will soon be smart enough to see a new product – one that was not in the list of packets you fed it from your local supermarket – and identify whether it is ‘chocolate’ or not. It will base this decision on the criteria learned from you and nothing more. This means that if you did not give it ruby chocolate to learn from, it will not classify ruby chocolate as chocolate. If your human coders hate white chocolate and do not classify it as a valid gift option, neither will your algorithm.
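To make this concrete, here is a minimal sketch of such a supervised classifier in Python with scikit-learn. The products, features, and labels are entirely made up for illustration – the point is only that the algorithm learns from examples we have labelled ourselves.

```python
# A minimal supervised-learning sketch (hypothetical data, not iMotions code).
# Each product is described by simple features; the labels are our supervision.
from sklearn.tree import DecisionTreeClassifier

# Features per product: [contains_cocoa, is_solid_slab, weight_in_grams]
products = [
    [1, 1, 100],   # milk chocolate bar
    [1, 0, 250],   # drinking chocolate powder
    [1, 0, 150],   # chocolate-coated nuts
    [0, 1, 100],   # granola bar
    [1, 1, 90],    # dark chocolate bar
]
labels = ["chocolate", "not-chocolate", "not-chocolate", "not-chocolate", "chocolate"]

# The algorithm derives its rules from the labelled examples we supervise it with.
model = DecisionTreeClassifier().fit(products, labels)

# A new product it has never seen: a solid 80 g slab that contains cocoa.
print(model.predict([[1, 1, 80]]))  # likely ['chocolate']
```

If ruby chocolate or white chocolate never appears in the training examples, the model has no way of handling them any better than the rules above suggest.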
Now imagine the second problem – you want your algorithm to classify every kind of filling or flavor a chocolate bar can have. You would again get a selection from your local supermarket – some with nuts, maybe orange, raisins, some liquor-filled ones – but this would not be an exhaustive list. It would be extremely tedious to create a supervised learning algorithm that could classify every chocolate it encountered based on its filling.
I, for one, would not have thought of giving my algorithm all the different kinds of licorice fillings before moving to Denmark – which means my supervised algorithm wouldn’t have been able to classify them. In this case, I may choose to create an unsupervised learning algorithm. To do this, I would give it as much variety as I could find – chocolates from different countries I’ve lived in, chocolates from countries I’ve visited, chocolates from countries my friends have traveled to.
By training the algorithm on all this, I hope it finds categories that cover all the possible fillings that can go into a chocolate bar. Maybe it settles on fruit, nut, and liquor, as I started with, or maybe it decides on 10 categories covering the entire range of its training chocolates. Either way, I will not know its learning process, only what classifications it arrived at.
While this kind of learning may feel like a black box, it is worth remembering that it is still based on the data we feed the algorithm. In this example, if I only give the algorithm ingredient lists to learn from, that is all it can use to make a decision. It cannot suddenly classify based on which of these make good housewarming presents. We are simply sparing ourselves the job of coding in every possible category, or every exception to a category, by hand.
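As a rough sketch of what the unsupervised version could look like, here each chocolate is described only by a few made-up ingredient features, and a clustering algorithm such as k-means groups them without any labels from us:

```python
# A minimal unsupervised-learning sketch (hypothetical data, not iMotions code).
# No labels are given; the algorithm groups chocolates by their filling profile.
from sklearn.cluster import KMeans

# Features per chocolate: [has_nuts, has_fruit, has_liquor, has_licorice]
chocolates = [
    [1, 0, 0, 0],  # hazelnut bar
    [1, 0, 0, 0],  # almond bar
    [0, 1, 0, 0],  # orange bar
    [0, 1, 0, 0],  # raisin bar
    [0, 0, 1, 0],  # rum-filled pralines
    [0, 0, 0, 1],  # the licorice filling I never thought to label
]

# Ask for three groups; choosing the number of clusters is itself a modelling decision.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(chocolates)
print(kmeans.labels_)  # one cluster id per chocolate – the categories it ‘decided’ on
```

Note that the clusters are still built entirely from the features we provided; nothing about gift-worthiness can appear in the output because nothing about gift-worthiness went in.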
Translating this knowledge to iMotions
When we start combining different algorithms, we can build smarter solutions to our problems. Think of training a group of algorithms to achieve a specific functionality (a pick-the-perfect-chocolate app, anyone?) as a funnel: you start by training an algorithm to handle the big questions, such as “What is the fundamental building block of chocolate?”. Then you train another algorithm to go further into what constitutes chocolate, and so on – your algorithms gradually become more and more granular until, together, they solve the very complex problem of what the perfect chocolate to gift someone is.
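As a toy illustration of this funnel idea (purely hypothetical, and not how any real iMotions pipeline is built), two simple stages can be chained so that the finer-grained step only ever sees what the coarser step lets through:

```python
# Hypothetical two-stage "funnel": a coarse classifier feeds a finer one.
def is_chocolate(product):
    # Stage 1: the big question – is this chocolate at all?
    return product.get("contains_cocoa", False) and product.get("solid_slab", False)

def gift_score(product):
    # Stage 2: only runs on items that passed stage 1 – how gift-worthy is it?
    return 2 * product.get("has_nuts", 0) + 3 * product.get("fancy_wrapper", 0)

shelf = [
    {"name": "dark bar", "contains_cocoa": True, "solid_slab": True, "has_nuts": 1, "fancy_wrapper": 1},
    {"name": "plain milk bar", "contains_cocoa": True, "solid_slab": True},
    {"name": "cocoa powder", "contains_cocoa": True, "solid_slab": False},
]

candidates = [p for p in shelf if is_chocolate(p)]  # stage 1: what is chocolate?
best = max(candidates, key=gift_score)              # stage 2: which one to gift?
print(best["name"])                                 # 'dark bar'
```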
At iMotions, we have a number of tricky questions we like to solve with neat machine learning algorithms. Several of our solutions involve machine learning, such as the Markov-model-based classification algorithm we use for webcam-based eye tracking – and even some of the algorithms from our facial expression analysis partners. Let’s look at the iMotions platform, and more specifically at webcam-based eye tracking, to see how machine learning powers eye tracking and improves your data quality.
When you create an online study in iMotions, you have the option of sending it to hundreds of participants via a link to collect their eye movements (and facial expressions) through their webcams. It may seem incredibly complicated, but this is nothing more than a series of machine learning algorithms applied step by step.
Step one: Where is the iris?
The first thing our algorithms do is identify where a participant’s face is. This does not mean it identifies your particular face the way your phone does to unlock itself. Remember how each algorithm is trained to do one specific thing? This one just identifies human faces – is it a face or not?
You can see this at work even before you start the study, when the Online platform runs a head check to ensure your participants are correctly seated. Once the face is identified, our algorithms can identify different parts of the face. Without this step, they wouldn’t know where to look for the eyes. These algorithms have also been trained on pupils, which allows them to identify a pupil irrespective of where the participant is looking on the screen – i.e. pupils captured by the camera at different angles.
Again, it is important to understand that these algorithms have been trained on fairly good data of people usually looking into the camera. So if your participant is lying on the sofa, in a barely lit room, with their laptop on their belly, the algorithm may barely identify the face, let alone the position of the few-pixels-wide iris.
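To give a feel for this kind of step, here is a rough sketch using OpenCV’s off-the-shelf Haar cascades. These are not the models iMotions uses – they simply illustrate the ‘find a face first, then look for the eyes inside it’ idea, and the input image is a hypothetical webcam frame:

```python
# A rough face-then-eyes detection sketch using OpenCV's bundled Haar cascades.
# Illustrative only – not the detector used in the iMotions platform.
import cv2

face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

frame = cv2.imread("webcam_frame.jpg")  # one frame from the webcam (hypothetical file)
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

# Step A: is there a face at all, and where is it?
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

for (x, y, w, h) in faces:
    # Step B: look for the eyes only inside the detected face region.
    face_region = gray[y:y + h, x:x + w]
    eyes = eye_cascade.detectMultiScale(face_region)
    print(f"face at ({x}, {y}); {len(eyes)} eye(s) found inside it")
```

A poorly lit, oddly angled frame will make the first step fail, and everything downstream fails with it – which is exactly the sofa-in-a-dark-room scenario described above.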
Step two: Calibration
Correctly identifying the iris is important because the accuracy of the next step – calibration – depends on it. Our next algorithm takes the data points collected at the calibration points and learns how this participant’s iris appears when looking at different positions on the screen.
During calibration, where we show participants dots at different positions on the screen, the algorithm learns how the eyes are positioned when fixating on these points, which span the entire viewing area. This then allows it to calculate the relative position of the eye to the screen when participants are viewing your actual stimuli. This is why it is important to use the recommended calibration procedure – it is how our algorithm gets the information it needs to estimate the eye movements your participants are making.
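Conceptually, calibration boils down to fitting a mapping from eye features to screen coordinates. Here is a minimal sketch under that assumption – the eye-feature numbers, the nine-dot grid, and the simple linear model are all hypothetical, and real gaze estimation is considerably more involved:

```python
# A minimal calibration sketch: learn a mapping from eye features to screen positions.
# All numbers are made up; this is not the iMotions calibration model.
import numpy as np
from sklearn.linear_model import LinearRegression

# Eye features (e.g. iris position within the eye region) recorded at each calibration dot.
eye_features = np.array([
    [0.10, 0.20], [0.50, 0.22], [0.90, 0.18],   # looking at the top row of dots
    [0.12, 0.55], [0.52, 0.53], [0.88, 0.57],   # middle row
    [0.11, 0.85], [0.49, 0.88], [0.91, 0.86],   # bottom row
])

# Known on-screen positions of those dots (normalised 0–1 coordinates).
screen_points = np.array([
    [0.1, 0.1], [0.5, 0.1], [0.9, 0.1],
    [0.1, 0.5], [0.5, 0.5], [0.9, 0.5],
    [0.1, 0.9], [0.5, 0.9], [0.9, 0.9],
])

# Fit the mapping during calibration...
mapping = LinearRegression().fit(eye_features, screen_points)

# ...then use it to estimate where the participant is looking during the actual stimuli.
print(mapping.predict([[0.50, 0.55]]))  # roughly the centre of the screen
```

With only a handful of calibration points, every one of them matters – skip or rush them and the fitted mapping degrades for the whole recording.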
Step three: Fixation detection
Even though collecting eye tracking data using nothing but a webcam is a feat in and of itself, we cannot discount the fact that this kind of data is still very noisy. If we were to apply our traditional methods of fixation and saccade classification to this sensor data, we would make the mistake of falsely classifying noise as fixations. Therefore, we need to lean on machine learning one last time before we can analyze our study – for fixation classification.
We use Markov models to classify fixations from webcam-based eye tracking data. Here, we tell our algorithm what fixations and saccades are (slow data points close to each other on the screen, or fast movements across larger distances, respectively), and let it look for these in the dataset at hand. This means the algorithm can (1) adjust the exact definition of what a fixation is per participant (how slow, how close), and (2) identify noise better than any classification system that is not allowed to ‘learn’ from the dataset.
A traditional duration-dispersion filter, which defines fixations as groups of adjacent points, may fail to catch low-level spatial noise, and a velocity-based filter, which simply classifies fast movements as saccades, may wrongly classify large, quick drifts. The Markov models, on the other hand, are much better at identifying these anomalies per participant, giving us a better chance of detecting true fixations.
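As a simplified illustration of the idea – not the actual iMotions implementation – a two-state hidden Markov model can be fit to the sample-to-sample velocities of the gaze data, with the low-velocity state read as fixation and the high-velocity state as saccade. The sketch below uses the hmmlearn library and synthetic gaze samples:

```python
# A simplified fixation-vs-saccade sketch with a two-state hidden Markov model.
# Illustrative only – not the Markov model implementation used in iMotions.
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)

# Synthetic gaze samples: two fixations (tight clusters) joined by a fast saccade.
fix1 = rng.normal([200, 200], 2, size=(50, 2))
saccade = np.linspace([200, 200], [600, 400], 8)
fix2 = rng.normal([600, 400], 2, size=(50, 2))
gaze = np.vstack([fix1, saccade, fix2])

# Observation: sample-to-sample velocity (distance between consecutive points).
velocity = np.linalg.norm(np.diff(gaze, axis=0), axis=1).reshape(-1, 1)

# Two hidden states: "slow" (fixation) and "fast" (saccade).
model = hmm.GaussianHMM(n_components=2, covariance_type="diag", n_iter=100, random_state=0)
model.fit(velocity)
states = model.predict(velocity)

# Whichever state has the lower mean velocity is interpreted as the fixation state.
fixation_state = int(np.argmin(model.means_.ravel()))
print("samples classified as fixation:", int(np.sum(states == fixation_state)))
```

Because the model’s velocity distributions are estimated from each participant’s own recording, the effective thresholds adapt per participant – which is exactly the advantage described above.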
While our webcam-based eye tracking may seem like a black box of magic, you now know that it is just a series of different algorithms – each one prone to mistakes if not given the good-quality data it expects, and each one taught to do its step of the process by our able developers who really understand eye tracking.