WTF is Sensor Fusion? Laying the mathematical foundation
Marko Cotra
Oct 14, 2017
A really cool branch of ML/signal processing is sensor fusion, sometimes also referred to as data fusion, target tracking, or filtering. Now, from personal experience it appears that not that many engineers are familiar with this field, or that they essentially hold the view that “Sensor Fusion = Kalman Filter”. This is unfortunate, since there’s a lot of cool stuff that can be accomplished with sensor fusion. So, as you might have guessed, the aim of this series is to explain what sensor fusion is.
In this post I’ll focus on the key mathematical concepts behind (pretty much) all sensor fusion algorithms, hopefully providing you with some intuition along the way. In the end it all boils down to just two equations. Good knowledge of these equations will help immensely later on when you start learning about the different algorithms.
Okay, so what is sensor fusion?
At its core it’s essentially about combining what you observe from a process with what you know about the process. A process is anything that evolves over time.
As a practical example you can consider a situation where you’re using a radar to track the position of an airplane. In this case you want to use the radar measurements y to gain knowledge of the state x of the airplane. You’re free to define the state however you like, but a reasonable choice for an airplane could be its position and velocity.
Now, using basic physics we know a couple of things about how airplanes behave:
- It takes a while for an airplane to change direction between radar measurements.
- It takes a while for an airplane to change its velocity between radar measurements.
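We can capture this knowledge in a motion model that describes how the state evolves from one time-step to the next. A common way of writing such a model is

x_{k} = f(x_{k-1}) + q_{k-1}

where f is a function describing the expected motion of the airplane.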
In addition to the previous state x_{k-1} there is another term q present in the formulation, and this represents the uncertainty in the model. This is important since it reflects that we can’t construct a perfect model describing the behavior of the airplane, or of anything else for that matter. There is always some uncertainty or randomness in real life. In the case of the airplane it might be because we don’t know exactly what actions the pilot is taking, so we can’t know for sure what’s going to happen. With q we introduce some randomness into the model, and we can choose q in such a way that it reflects what type of randomness is possible. In the case of the airplane, q might indicate that the speed may vary from what the motion model predicts, but that the variation can’t be too large. In other words, q might say that it’s more likely for the speed to differ by ±2 m/s than by ±10 m/s.
As such, we can also choose to represent the motion model as a probability distribution:
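p(x_{k} | x_{k-1})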
Both formulations are valid and say the same thing, that the state is stochastic and depends on the previous state. Viewing it as a probability distribution also puts emphasis on the fact that the current state is drawn from a distribution over possible current states, conditioned on the previous state.
If the model is imperfect, why not just rely on the radar?
We’re faced with a similar problem when it comes to our radar as well, since there is no such thing as a perfect sensor. As such, the measurements that we receive suffer from noise. We do however know that the measurements are a function of the current state x_{k} of the airplane. In other words, the radar measures the position of the airplane, which is part of the current state x_{k} of the airplane. Knowledge regarding how the sensor behaves as a function of the state is referred to as the measurement model in sensor fusion literature.
Mathematically we can express the measurement model as
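y_{k} = h(x_{k}) + r_{k}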
where r represents the uncertainty that’s present in the model of the sensor. Just like before, we can choose to represent the measurement model as a probability distribution:
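p(y_{k} | x_{k})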
Viewing the measurement model as a probability distribution tells us that the measurement we observe is drawn from a distribution over possible measurements, conditioned on the state.
Before we move on, it should be noted that the motion and measurement models are formulated using two assumptions:
- The state x is Markovian. This means that the current state x_{k} only depends on the previous state x_{k-1}.
- The measurement y_{k} only depends on the state x_{k}, and not on any previous measurements.
Now we arrive at the key problem that sensor fusion tries to address: how do you combine the uncertain motion and measurement models to infer as much as possible about the state? What we’re after is the following distribution
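p(x_{k} | y_{1:k})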
which essentially says “what can we say about the current state x_{k} if we consider all of the measurements that we’ve received up until now?”. This is called the posterior distribution over x_{k}. You could also view this as the optimal guess of what the state x_{k} is at time k.
By applying Bayes’ theorem and marginalization we end up with the following two equations:
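p(x_{k} | y_{1:k-1}) = ∫ p(x_{k} | x_{k-1}) p(x_{k-1} | y_{1:k-1}) dx_{k-1}

p(x_{k} | y_{1:k}) = p(y_{k} | x_{k}) p(x_{k} | y_{1:k-1}) / p(y_{k} | y_{1:k-1})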
If you’re pro and want to know how we ended up with these two equations, then check out chapter 2 of this book. It’s quite a fun exercise to try and do the math yourself!
This is a lot to take in, but don’t let these equations scare you away! They might look intimidating, but I’ll try and break them down into something understandable. Let’s start off with the first equation.
The predict equation
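p(x_{k} | y_{1:k-1}) = ∫ p(x_{k} | x_{k-1}) p(x_{k-1} | y_{1:k-1}) dx_{k-1}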
In sensor fusion literature this general equation is commonly referred to as the predict equation. If we look at the rightmost density we can see that it’s the posterior (or optimal guess) for the previous state x_{k-1}. The density in the middle should be familiar, since that’s the motion model. So, what’s happening here is that we’re taking the guess of the previous state x_{k-1} at time k-1 and combining it with the motion model to try and predict what the current state x_{k} will be. By marginalizing (i.e. integrating) over the state x_{k-1}, we’re effectively considering all potential outcomes of the previous state. This allows us to end up with the leftmost density, which does not depend on the previous state. The following animation might provide some more insight; it shows how this equation can be interpreted with Riemann sums.
The animation shows how the predict equation can be visualized (via Riemann sums). The colored slices in the middle represent the motion model density conditioned on different values of x_{k-1}. The rightmost density represents the posterior at time k-1. By taking the product of the motion model and the posterior for different values of x_{k-1} (incrementing with some step-size) we’re essentially weighting each motion model by how likely it is. All the different weighted models are outlined in gray in the leftmost plot. By summing all of these models (and scaling the sum with the step-size) we end up with the prediction density, illustrated in red.
To give a bit more intuition about the predict equation we can consider a case where the state is binary, i.e. x can either be 1 or 0. The previous posterior/guess indicates that
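p(x_{k-1} = 0 | y_{1:k-1}) = 0.7
p(x_{k-1} = 1 | y_{1:k-1}) = 0.3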
in other words, there is a 70 % chance that the state was equal to 0, and a 30 % chance that it was equal to 1. Now, if we have a motion model we can use it to predict what the current state x_{k} is. The motion model could be something like this:
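p(x_{k} = 0 | x_{k-1} = 0) = 0.1    p(x_{k} = 1 | x_{k-1} = 0) = 0.9
p(x_{k} = 0 | x_{k-1} = 1) = 0.2    p(x_{k} = 1 | x_{k-1} = 1) = 0.8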
which says that if x_{k-1} was equal to 1, then there’s an 80 % chance that it’ll remain equal to 1 in the current time-step as well. However, if it was equal to 0, then there’s only a 10 % chance that it’ll remain 0 in the current time-step.
If we plug this into the predict equation we get the following, where the integral is replaced by a summation since we’re dealing with a discrete state:
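p(x_{k} = 0 | y_{1:k-1}) = 0.1 · 0.7 + 0.2 · 0.3 = 0.13
p(x_{k} = 1 | y_{1:k-1}) = 0.9 · 0.7 + 0.8 · 0.3 = 0.87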
So, by considering all possible outcomes of x_{k-1} via the previous posterior, and combining those with the motion model, we end up with a prediction about x_{k}: there’s a 13 % chance that it’s equal to 0 and an 87 % chance that it’s equal to 1. Remember, this prediction does not depend on the previous guess anymore, since we’ve taken all of the outcomes into consideration!
In practice we’re usually dealing with states that are continuous, which means that the math becomes more complicated than in this toy example that I’ve just presented. But the mechanics of what we’re doing are still the same! It’s just that we perform integration instead of summation, and the densities are continuous instead of discrete.
The update equation
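p(x_{k} | y_{1:k}) = p(y_{k} | x_{k}) p(x_{k} | y_{1:k-1}) / p(y_{k} | y_{1:k-1})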
This equation is commonly referred to as the update equation in sensor fusion literature. The reason for that name is that we’re updating our prediction (the one we got via the predict equation) with the new information that we gain by observing the measurement y_{k}. The update equation might look familiar to you, and that’s because it’s actually Bayes’ theorem!
We’re using the resulting density from the predict equation as our prior. It’s called a prior because it represents our belief regarding the state before we’ve observed any new measurement. The likelihood in this case is the measurement model. Remember, the measurement model describes how the measurement is distributed conditioned on the state. We’re computing the product of these two densities, the prediction and the measurement model, and then dividing that by the probability of observing the current measurement conditioned on all previous measurements.
At this point it’s necessary to reflect on how we’re actually using the update equation. At the end of the day we’re trying to find the posterior density of the state, which is a distribution over possible states.
In other words, we’re never going to find a single specific value of the state x_{k} at time k. If we want to represent the state as a specific value we can, for example, compute the MMSE (minimum mean squared error) or MAP (maximum a posteriori) estimate from the posterior. But at the end of the day, we’ll never know exactly what the state is; we’re just able to express a region of plausible values with the posterior.
However, we do observe specific measurement values — we know what y_{k} is at every time-step, since we get that number via a sensor. This means that the divisor term that’s in the update rule,
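p(y_{k} | y_{1:k-1})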
is just a scalar number. Remember, a density assigns a value to every single point that we put into it, and that number represents the likelihood of that point. Since we’re observing the value of y_{k} we can plug it into the distribution and it will give us a number. We’ll then use this number to divide the numerator.
In order to (perhaps) make things clearer, we can rewrite the expression as
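p(y_{k} | y_{1:k-1}) = ∫ p(y_{k} | x_{k}) p(x_{k} | y_{1:k-1}) dx_{k}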
This emphasizes that what we’re doing is essentially computing a weighted average over all possible likelihood values. In the integral we’re evaluating all possible states x_{k}. The measurement density gives a number representing the likelihood, and the prediction density gives a number representing the weight associated with that likelihood.
Breaking it down in this way might also provide you with more insight into what’s happening in the numerator of the update equation. In the numerator we have the measurement density, which is our likelihood. We know what y_{k} is, but we don’t know what x_{k} is. The prior, which we got from the predict equation, is a density that reflects how likely different states are. Both the prior and the likelihood are functions of the state, i.e. you plug in the state x_{k} and you get a number out. Bayes’ rule says that if we take the product of these two functions, and divide it by the same product but with the state x_{k} marginalized out, then we end up with the posterior: a single function of the state x_{k}. This function is a density, the posterior, which describes how likely different states x_{k} are.
Okay, how do we start using these two equations?
Alright, now that we’ve become familiar with those two equations, we can write down the recipe for how they’ll be used:
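- Start with the prior p(x_{0}).
- Predict: use the motion model to compute p(x_{k} | y_{1:k-1}) from the previous posterior p(x_{k-1} | y_{1:k-1}).
- Update: once the measurement y_{k} arrives, use Bayes’ theorem to compute the posterior p(x_{k} | y_{1:k}).
- Repeat the predict and update steps for k = 1, 2, 3, ...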
When we perform the predict step at time k=1 we still haven’t received any measurements. That’s why we use the prior p(x_{0}); it describes a region of plausible starting states. Once we’ve received the first measurement we can proceed to calculate the update equation, which gives us the posterior distribution. For all measurements after k=1 we can just keep looping through the predict and update equations. In code, it might look something like this:
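Here is a minimal sketch in Python, reusing the binary-state toy example from before. The measurement model (a sensor that reports the correct state 75 % of the time) is my own assumption, added just so the loop runs end to end:

```python
import numpy as np

# Motion model p(x_k | x_{k-1}) from the toy example, as a transition matrix.
# Row i, column j holds p(x_k = j | x_{k-1} = i).
TRANSITION = np.array([[0.1, 0.9],   # x_{k-1} = 0: 10 % chance of staying 0
                       [0.2, 0.8]])  # x_{k-1} = 1: 80 % chance of staying 1

# Hypothetical measurement model p(y_k | x_k): the sensor reports the true
# state 75 % of the time. Row i, column j holds p(y_k = j | x_k = i).
LIKELIHOOD = np.array([[0.75, 0.25],
                       [0.25, 0.75]])

def predict(posterior):
    # Predict equation: sum over x_{k-1} of p(x_k | x_{k-1}) p(x_{k-1} | y_{1:k-1}).
    return TRANSITION.T @ posterior

def update(prediction, y):
    # Update equation: multiply the prior by the likelihood, then normalize by
    # the scalar p(y_k | y_{1:k-1}), i.e. the sum of the unnormalized terms.
    unnormalized = LIKELIHOOD[:, y] * prediction
    return unnormalized / unnormalized.sum()

posterior = np.array([0.7, 0.3])  # prior p(x_0)
for y in [1, 1, 0]:               # some observed measurements
    prediction = predict(posterior)    # p(x_k | y_{1:k-1})
    posterior = update(prediction, y)  # p(x_k | y_{1:k})
    print(prediction, posterior)
```

With the first measurement y=1 the predict step reproduces the [0.13, 0.87] prediction from the toy example, and the update step then sharpens it using the likelihood of the observed measurement.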
And that’s it. At a top level, this is how pretty much all sensor fusion algorithms work. Now, so far we’ve just looked at the general equations. In practice, we need to find a way to make both the predict and update equations analytically tractable, or find a good way to numerically approximate them.
This is what distinguishes the different sensor fusion algorithms: how they go about making the predict and update equations work in practice. The Kalman filter, for example, chooses to represent every density in the equations as a Gaussian density, since this allows for a closed-form analytical solution.
The specifics of different algorithms that approximate, or express analytically tractable solutions to, the predict and update equations will be explored in future posts. If you’re eager and choose to explore these topics on your own, then I hope this post has given you enough intuition about the underlying theory.