Needless to say, Kalman filtering is one of the most powerful estimation techniques in almost any engineering field. From robotic vacuums to satellite guidance, it is everywhere. Here I will explain the hows and whys of the Kalman Filter (KF) in our lives.
Many technological projects rely on this robust method for the final estimation of the position of an intelligent system. Fortunately, the available information can be represented as a Gaussian state.
Thanks to this property, it is possible to use Gaussian filters (the KF is one of them) to improve the final estimation.
Gaussian modelling is used throughout this explanation, so it is preferable to have that mathematical background (a simple understanding is enough) in order to follow the presented technique.
The estimation problem
Let us go straight to the problem with an example. Suppose we have a robot moving freely in a 2D Euclidean space (Cartesian coordinates), and we want to track its real position every time.
There are several methods to do so, and combining them adequately is the secret to the best estimation. The first is to mathematically estimate the position with simple physical models. Here classical physics is our big ally, and we can roughly use accelerations, velocities, etc.
At the beginning (time equal to zero), our robot is set at the origin and, thanks to our mathematical model, we know that in the next second it will move somewhere further in the field.
The position can be represented as a two-dimensional vector containing the position along the X- and Y-axes, respectively. This vector is known as the "state vector", because it contains the information of the robot in its current state.
Because this vector also represents an estimation, we will write it with a small "hat" as $\hat{\mathbf{x}}$. It is read as "x hat" and just means that this vector is an estimation (not precisely the true value).
For compliance with standard notation in engineering, we denote time with $t_k$ and the set of sampled states as $\{\hat{\mathbf{x}}_k\}_{k=0}^{N}$, where the index $k$ is non-negative (greater than or equal to zero).
This funny notation simply means that the index of the time begins at zero and increases until it reaches $N$.
Under this convention, if we set the beginning at zero and estimate 5 more states, the time set is represented as $\{t_0, t_1, t_2, t_3, t_4, t_5\}$. Here $N$ is equal to 5, and the number of states is 6 (because we include $t_0$).
After a mathematical calculation (with classical physics), the robot is estimated to be at a new location.
However, due to a plethora of external forces and effects, the estimation of its position cannot be entirely deterministic and is prone to errors.
The robot can be at the estimated position or somewhere around it, because the mentioned effects can drive it up, down, left, right, a mix of them, or even keep it in the same place. It could be anywhere around this original estimation.
This leads to the concept of variance. The variance represents how far a bunch of values is spread out.
For our robot, a higher variance means that its real position can be far from the original estimation, while a smaller variance means that it is likely closer to the estimated position.
Estimation with small variance 
Too many drawings of a robot. Let us use the “particle representation”. From now on, the mean position of the robot will be represented with a dot, and its variance with an ellipse around it. This ellipse is the most probable area, where the robot can be located.
As seen, using only mathematical models is not enough to locate the real position of our robot. But now, we are able to equip the robot with sensors.
These sensors can be integrated on the wheels (wheel encoders) and will measure the translation of our robot along the X- and Y-axes.
The exclusive use of sensors for position estimation and tracking is known as odometry. Simple odometry is widely used in basic implementations of tracking and is enough for easy examples, but not for real cases, like tracking our robot.
Using the information of the wheel encoders in the robot, we can measure how far the robot moved after one timestamp. This will allow us to calculate its position and store it in a vector commonly known as the measurement vector.
But here we find the same flaw as in the pure mathematical model: variance. No sensor is perfect, and all of them always include some noise that disturbs the real measurement. Luckily this noise is Gaussian and can be modelled as:

$$\mathbf{v} \sim \mathcal{N}(0, \mathbf{R})$$
This is a perturbation within the variance of the sensor that shifts the measurement around the real value, and it is assumed to be independent of any other parameter. Visually, the position estimation based merely on sensors can be seen as:
The dot in the centre is the estimated position given by the sensors (mean) and the ellipse around is the variance of the sensor (the real position is most probably anywhere inside it). The better the sensor, the smaller the variance.
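If you want to see this noise with your own eyes, here is a quick sketch (in Python rather than this article's MATLAB; the true position and the standard deviation are assumed values): each reading is the true value plus zero-mean Gaussian noise, and averaging many readings recovers the centre of the ellipse.

```python
import random

random.seed(0)  # for reproducibility

true_position = (2.0, 3.0)   # where the robot actually is (assumed)
sigma = 0.1                  # sensor standard deviation (assumed)

# Each reading is the true value disturbed by zero-mean Gaussian noise
readings = [(true_position[0] + random.gauss(0.0, sigma),
             true_position[1] + random.gauss(0.0, sigma))
            for _ in range(1000)]

# Averaging many readings recovers the mean (the centre of the ellipse)
mean_x = sum(r[0] for r in readings) / len(readings)
mean_y = sum(r[1] for r in readings) / len(readings)
print(mean_x, mean_y)
```

Any single reading can be off by a few sigma, but the cloud of readings is centred on the real position, which is exactly the Gaussian picture above.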
Both estimations, through odometry and classical physics, are not perfect and present some variance. This, of course, means that the computations are never going to be perfect. That same variance, however, can be used to estimate a better position if we apply the basic properties of combining Gaussian distributions.
The formal definitions are not in the scope of this article, but these Gaussian properties are what build the famous Kalman Filter.
The intuition behind KF
The product of two Gaussian distributions gives another Gaussian distribution (after normalization). Thanks to this property, we are able to construct as many distributions as desired, provided that we always use the proper ones.
The distributions used to build the “mixed distribution” are called priors, while the resulting new distribution is called a posterior.
This leads to the definition of covariance, which is a measure of how much two random variables change together. If two random variables are identical (almost impossible in real life), then their covariance is equal to their variance.
The variables show similar behaviour when the covariance is positive. Random variables whose covariance is zero are called uncorrelated.
For a further analysis on the development of these models, please refer to any documentation about Bayesian Inference. Here the understanding of the above definitions is enough.
The awesomeness of the KF lies in its capacity to take the values of the estimations (from odometry and models) and their covariances, and compute a new Gaussian distribution based only on the priors. That is, the KF takes calculations and measurements (as Gaussian distributions) and estimates a better approximation to the real value, also as a Gaussian.
So, back to our robot example. We begin in the origin (set to zero), we predict its position (a priori state) and make a measurement with our sensors (observation). This will give us two different mean values (centres) and their corresponding variances (ellipses).
What the KF will do is take these Gaussian representations, mix them, and estimate a new posterior with higher accuracy than the priors. This posterior is the final estimation of the position.
Obviously, a smaller covariance (ellipse) means a more accurate estimation. The best part is that the posterior always has a smaller covariance than either of the two priors. That is why it is always more accurate.
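That shrinking covariance can be checked with two lines of arithmetic. A Python sketch for the 1-D case (the two prior means and variances are just assumed values): fusing two Gaussians adds their precisions (inverse variances), so the posterior variance is always smaller than either prior.

```python
# Fusing two 1-D Gaussian estimates (prior values are assumed)
mu1, var1 = 1.0, 4.0   # e.g. model prediction: mean 1.0, variance 4.0
mu2, var2 = 2.0, 1.0   # e.g. sensor measurement: mean 2.0, variance 1.0

# Product of two Gaussians: precisions (1/variance) add,
# and the new mean is a precision-weighted average of the prior means
var_post = 1.0 / (1.0 / var1 + 1.0 / var2)
mu_post = var_post * (mu1 / var1 + mu2 / var2)

print(mu_post, var_post)   # the posterior sits between the priors
```

Note how the posterior mean (1.8) lies closer to the measurement, which had the smaller variance, and the posterior variance (0.8) is below both priors.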
If the estimations are Gaussian, the probability of their perturbations can be modelled as:

$$p(\mathbf{w}) \sim \mathcal{N}(0, \mathbf{Q})$$
$$p(\mathbf{v}) \sim \mathcal{N}(0, \mathbf{R})$$

The random variables $\mathbf{w}$ and $\mathbf{v}$ represent the process noise and measurement noise, respectively. They are assumed to be independent of each other and normally distributed.
The integration of the noises $\mathbf{w}$ and $\mathbf{v}$ into their corresponding models can be defined as:

$$\mathbf{x}_k = \mathbf{A}\mathbf{x}_{k-1} + \mathbf{w}_{k-1}$$
$$\mathbf{z}_k = \mathbf{H}\mathbf{x}_k + \mathbf{v}_k$$
In practice, the process noise covariance Q and the measurement noise covariance R might change over time, but here they are assumed to be constant. These noise covariance matrices define the shape of the ellipses in the example of our robot.
The precise way the KF estimates the a posteriori state is based on the blending factor, also known as the Kalman gain, represented with K. The Kalman gain minimizes the a posteriori error covariance by fusing the covariances into a new one based on their "heaviness".
A simple way of thinking about K: as the measurement covariance R approaches zero (towards noiseless sensors), the actual measurement z is "trusted" more and more, while the predicted measurement is trusted less and less.
On the other hand, as the a priori estimate error covariance approaches zero (towards perfect calculations), the actual measurement z is trusted less and less, while the predicted measurement is trusted more and more.
In simpler words: when K approaches zero, trust the calculations; when K approaches one, trust the sensor.
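This limiting behaviour is easy to verify in the scalar case. A tiny Python sketch (with an observation matrix H = 1, and arbitrary numbers just for illustration):

```python
def scalar_gain(P, R):
    """Scalar Kalman gain with H = 1: K = P / (P + R)."""
    return P / (P + R)

# Near-noiseless sensor (R -> 0): the gain goes to 1, trust the measurement
print(scalar_gain(1.0, 1e-9))
# Near-perfect prediction (P -> 0): the gain goes to 0, trust the model
print(scalar_gain(1e-9, 1.0))
```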
The comparison between both quantities is, however, not always a simple one-to-one relation where the calculations match perfectly with the measurements. Sometimes, in order to estimate positions, related operations on accelerations and velocities must be carried out, and these values are also stored and handled in the same vector. The sensors, on the other hand, give only one magnitude for the physical event they are measuring.
In order to match the corresponding values from calculations and measurements, it is necessary to use an observation matrix H that "pairs" the corresponding values. So, for example, if in our robot we use accelerometers instead of wheel encoders, we would need to estimate the position from the acceleration and then compare it with the mathematical estimations of our model. But for that we have to select only the part that the sensor actually measures.
Assuming that the state vector has three elements: estimated position ($\hat{x}$), estimated velocity ($\hat{\dot{x}}$) and estimated acceleration ($\hat{\ddot{x}}$), the only value that we can compare with a sensor is the estimated acceleration (if we only use accelerometers, of course). So we extract it from the state vector as:

$$\hat{z} = \mathbf{H}\hat{\mathbf{x}} = \begin{bmatrix} 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{x} \\ \hat{\dot{x}} \\ \hat{\ddot{x}} \end{bmatrix} = \hat{\ddot{x}}$$

Here we see that we just multiply the state vector with the observation matrix H to obtain only the last element of the state vector. Why? To directly compare it against the measurement from the sensor:

$$\mathbf{r} = \mathbf{z} - \mathbf{H}\hat{\mathbf{x}}$$
The resulting value r is the difference between the estimated measurement and the actual measurement from the sensors z. This vector r is called “Innovation”, “Measurement Residual” or just “Residual”, and its length is determined by the number of elements to be compared.
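A tiny numeric sketch of that extraction (Python, with assumed state and sensor values): H picks the acceleration out of the state vector, and the residual is the difference against the accelerometer reading.

```python
xhat = [1.5, 0.8, 0.2]   # [position, velocity, acceleration] (assumed values)
H = [0.0, 0.0, 1.0]       # observation matrix: picks out the acceleration
z = 0.25                  # accelerometer reading (assumed)

# Estimated measurement: H * xhat
z_est = sum(h * x for h, x in zip(H, xhat))

# Innovation / residual: actual measurement minus estimated measurement
r = z - z_est
print(r)
```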
Adding a position sensor (like the wheel encoders) would give us the magnitude of the translation. This value can also be added to the measurement vector z:

$$\mathbf{z} = \begin{bmatrix} z_{pos} \\ z_{acc} \end{bmatrix}$$

It yields an innovation vector r:

$$\mathbf{r} = \begin{bmatrix} z_{pos} \\ z_{acc} \end{bmatrix} - \begin{bmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} \hat{x} \\ \hat{\dot{x}} \\ \hat{\ddot{x}} \end{bmatrix}$$
H decides who fights whom, r checks who punches stronger, and K decides who won.
Once we have the difference of the estimated and measured positions, we have to know how much we trust this difference, so we can add it to our previous estimation.
Which element tells us the confidence of our estimations? The Kalman gain! So, we just update the state with the "weighted" difference:

$$\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_k^- + \mathbf{K}\mathbf{r}$$
If the innovation r is equal to zero, it means that the estimation and the measurement are in complete agreement (it almost never happens in reality). This would also mean that the a posteriori state is simply equal to the calculation from our model.
This state vector contains its individual parameters, including the final estimation for position. But remember that these estimations are also Gaussian, which means that we have to specify the covariance of this new estimation. For that matter, the Kalman gain is also going to be used.
In a similar way, we obtain the posterior covariance by using the Kalman gain and the measurement prediction covariance S (the denominator of the formula for K).
The a posteriori covariance of the state is the difference between the estimated covariance and the weighted measurement prediction covariance:

$$\mathbf{P}_k = \mathbf{P}_k^- - \mathbf{K}\mathbf{S}\mathbf{K}^T$$

Other literature writes this a posteriori covariance with a formulation using the identity matrix and the observation matrix H. The results, however, are exactly the same:

$$\mathbf{P}_k = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}_k^-$$
We finally obtained a more accurate estimation of the position and everything is stored in the state vector that will be now the new a priori state on the next timestamp. And we start the process again.
The time update equations can be thought of as predictor equations, while the measurement update equations can be thought of as corrector equations. In general the cycle of the KF can be simplified with two steps:
- Time Update (Predict). The a priori estimations are calculated.
- Measurement Update (Correct). The a posteriori estimations are updated.
The Kalman Filter conditions recursively the current estimate on all of the past measurements. The time update projects the current state estimate ahead in time, while the measurement update adjusts the projected estimate by an actual measure at that time.
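To see the whole predict/correct cycle in action, here is a minimal 1-D example in Python (not the article's MATLAB): a stationary robot, a noisy position sensor, and assumed values for Q and R. The estimate converges to the true position and the covariance shrinks well below the sensor variance.

```python
import random

random.seed(1)  # for reproducibility

# 1-D robot: constant-position model, noisy position sensor.
# A = 1, H = 1; Q, R and the true position are assumed tuning values.
Q, R = 1e-4, 0.1 ** 2
true_pos = 2.0

xhat, P = 0.0, 1.0                   # initial priors: state 0, covariance 1
for _ in range(100):
    # Time update (predict)
    xhat_minus = xhat                # A * xhat with A = 1
    P_minus = P + Q                  # A*P*A' + Q

    # Measurement update (correct)
    z = true_pos + random.gauss(0.0, 0.1)
    K = P_minus / (P_minus + R)      # Kalman gain
    xhat = xhat_minus + K * (z - xhat_minus)
    P = (1 - K) * P_minus

print(xhat, P)
```

After a hundred cycles the state estimate hovers around the true position, and P has dropped far below the raw sensor variance R: the filter is more confident than any single measurement.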
Kalman Filter Algorithm
Generally, statistical notation marks an a priori estimation with a "super minus" symbol next to the variable, so in our case the a priori estimations for the state and the covariance are noted as $\hat{\mathbf{x}}_k^-$ and $\mathbf{P}_k^-$, respectively. For the a posteriori estimations we do not use any special symbol; we merely drop the "super minus".
The timestamps are represented in discrete time usually with a subindex k starting from zero (initial state). Knowing this, in order to use the Kalman filter we always have to specify the initial priors for the state and covariance.
In the probabilistic approach, we define the initial state and initial covariance as the expected values:

$$\hat{\mathbf{x}}_0 = E[\mathbf{x}_0]$$
$$\mathbf{P}_0 = E[(\mathbf{x}_0 - \hat{\mathbf{x}}_0)(\mathbf{x}_0 - \hat{\mathbf{x}}_0)^T]$$

In practice we can start with simpler values:

$$\hat{\mathbf{x}}_0 = \mathbf{0}, \qquad \mathbf{P}_0 = \mathbf{I}$$

Note that the covariance matrix is a square matrix whose dimensions correspond to the length of the state vector. Initially we can set this a priori covariance matrix to the identity, which means that every state is only correlated with itself. Further estimations will update this matrix to a better fit for every element.
The simplest form of the Kalman Filter process can be summarized as:
Time Update
- Project the a priori State.

$$\hat{\mathbf{x}}_k^- = \mathbf{A}\hat{\mathbf{x}}_{k-1}$$

- Project the a priori Covariance.

$$\mathbf{P}_k^- = \mathbf{A}\mathbf{P}_{k-1}\mathbf{A}^T + \mathbf{Q}$$

Measurement Update
- Estimate the Measurement prediction covariance.

$$\mathbf{S} = \mathbf{H}\mathbf{P}_k^-\mathbf{H}^T + \mathbf{R}$$

- Compute the Kalman Gain.

$$\mathbf{K} = \mathbf{P}_k^-\mathbf{H}^T\mathbf{S}^{-1}$$

This is a long equation, I know, but a simpler way to see it is as the ratio explained above:

$$\mathbf{K} \approx \frac{\mathbf{P}_k^-}{\mathbf{P}_k^- + \mathbf{R}}$$

NOTE: Don't dare to show the latter notation to a Mathematician or you will be yelled at.
- Compute the innovation.

$$\mathbf{r} = \mathbf{z}_k - \mathbf{H}\hat{\mathbf{x}}_k^-$$

- Update the State estimate (a posteriori state).

$$\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_k^- + \mathbf{K}\mathbf{r}$$

- Update the Covariance (a posteriori covariance).

$$\mathbf{P}_k = \mathbf{P}_k^- - \mathbf{K}\mathbf{S}\mathbf{K}^T$$
Other literature computes this covariance with:

$$\mathbf{P}_k = (\mathbf{I} - \mathbf{K}\mathbf{H})\mathbf{P}_k^-$$

Both are equivalent and return the same result. Hypothetically, the second version requires fewer operations (the first adds a transpose), but I haven't personally checked this. You're welcome to do so.
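The equivalence is quick to check in the scalar case. A Python sketch with H = 1 and assumed numbers:

```python
P_minus, R, H = 0.7, 0.2, 1.0        # assumed scalar values

S = H * P_minus * H + R              # measurement prediction covariance
K = P_minus * H / S                  # Kalman gain

P_form1 = P_minus - K * S * K        # P - K*S*K'
P_form2 = (1.0 - K * H) * P_minus    # (I - K*H)*P

print(P_form1, P_form2)              # both forms give the same posterior
```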
Among the wide range of formulae, we observe that there are four elements remaining constant throughout the whole process:
- $\mathbf{A}$ = Transition Matrix. This matrix performs all linear operations between the previous state $\hat{\mathbf{x}}_{k-1}$ and the current a priori state $\hat{\mathbf{x}}_k^-$.
- $\mathbf{Q}$ = Process Noise Covariance Matrix. Defines the dynamic disturbance noise of our physical model calculations.
- $\mathbf{R}$ = Measurement Noise Covariance Matrix. Reflects the distribution of the noise in the measurements.
- $\mathbf{H}$ = Observation Matrix. Relates the measured values with the calculated values.
Its implementation in an M-file script (for MATLAB or Octave) is reduced to a few lines of code:
% Predict
xhat = A*xhat;
P = A*P*A' + Q;
% Correct
S = H*P*H' + R;
K = P*H'/S;
r = z - H*xhat;
xhat = xhat + K*r;
P = P - K*S*K';
Let us see how our robot accurately estimates its position using a Kalman Filter in every step:
1. Project the State ahead.
2. Project the Covariance ahead.
3. Make a measurement (don't forget its Covariance).
4. Compute the Kalman Gain.
5. Update the State estimate.
6. Update the Covariance.
7. Start again.
Model Parameters
The selection of the constant matrices $\mathbf{A}$, $\mathbf{H}$, $\mathbf{Q}$ and $\mathbf{R}$ defines the behaviour of our filtering. These matrices specify how our data will be handled, and the secret of our estimations lies inside them.
The structure of the transition matrix $\mathbf{A}$ depends directly on the linear predictions and on the number of elements in the state vector.
The observation matrix $\mathbf{H}$ depends on the number of elements in $\hat{\mathbf{x}}_k$ as well, but also on the number of sensors to be used.
Of interest now are the matrices $\mathbf{Q}$ and $\mathbf{R}$, the noise covariances. The measurement noise covariance $\mathbf{R}$ specifies the deviation of the sensor measurements. The easiest way to build it is from the parameters provided in the datasheet of the sensors.
If the datasheet of the wheel encoders in our robot specifies a tolerance of $\pm\sigma$, the variance of each encoder is $\sigma^2$. Using two wheel encoders (for the X- and Y-axes), the resulting measurement noise covariance is:

$$\mathbf{R} = \begin{bmatrix} \sigma^2 & 0 \\ 0 & \sigma^2 \end{bmatrix}$$

The size of $\mathbf{R}$ is determined by the number of sensors to be used.
On the other hand, the process noise covariance matrix $\mathbf{Q}$, indicating how "reliable" our mathematical model is, cannot be so easily estimated. For that we have to go back to the characterization of our model.
In our models, the transition matrix $\mathbf{A}$ performs the desired linear operations that compute the a priori estimated state, as represented in the first step of the KF. This is typically called the Process Model.
The formal solution for the distributions in a process model is given by the recursive Bayesian filtering equations, which relate the behaviour of our models in a time-invariant domain (the sampling frequency is assumed constant).
Thanks to this recursive relation in time, the model can also be equivalently expressed in continuous-time state equations as:

$$\dot{\mathbf{x}}(t) = \mathbf{F}\mathbf{x}(t) + \mathbf{L}\mathbf{w}(t)$$

$\mathbf{F}$ and $\mathbf{L}$ are constant matrices which characterize the changes of the process model. Notice that we introduce these two new terms for the construction of the model.
This might be difficult to grasp initially, but don't pay too much attention to it if you go into practice. The model above basically describes the behaviour of the changes in our process model. Understanding only that is enough.
We have defined our Process Model as a series of linear time-invariant operations:

$$\hat{\mathbf{x}}_k^- = \mathbf{A}\hat{\mathbf{x}}_{k-1}$$

And the model with the linear operations over $\mathbf{F}$ and $\mathbf{L}$ describes the changes in it, including the process noise $\mathbf{w}(t)$ with power spectral density $q$.
The relation between these matrices can be used to estimate the Process Noise Covariance that we need. And that’s the only reason I presented them here.
To get the discretized matrix $\mathbf{Q}$ you calculate it as:

$$\mathbf{Q} = \int_0^{\Delta t} e^{\mathbf{F}\tau}\,\mathbf{L}\,q\,\mathbf{L}^T\,e^{\mathbf{F}^T\tau}\,d\tau$$

where $\Delta t$ is the step size of the discretization. Boom!
For simpler cases, the calculation of $\mathbf{Q}$ can be made analytically. Attempting a stepwise calculation of this integral can easily lead to mistakes that are hard to identify. That's why I don't recommend it (unless you feel like a Hierophant of Maths. In that case do it and share your results :D).
But the linearity of the model allows an easy and fast computation of $\mathbf{Q}$ by simply applying the basic operations given in the formula above, using the following matrix fraction decomposition:

$$\begin{bmatrix} \mathbf{C} \\ \mathbf{D} \end{bmatrix} = \exp\left( \begin{bmatrix} \mathbf{F} & \mathbf{L}q\mathbf{L}^T \\ \mathbf{0} & -\mathbf{F}^T \end{bmatrix} \Delta t \right) \begin{bmatrix} \mathbf{0} \\ \mathbf{I} \end{bmatrix}$$

These two submatrices are used to build the process noise covariance matrix $\mathbf{Q}$, which is finally computed as:

$$\mathbf{Q} = \mathbf{C}\mathbf{D}^{-1}$$
Notice that the design matrix (the block matrix inside the exponential) is twice as big as the model matrix $\mathbf{F}$: if $\mathbf{F}$ is an $n \times n$ matrix, then the design matrix will be a $2n \times 2n$ matrix.
This does not severely slow down the process, because it is computed only once at the beginning of the filtering and assumed constant afterwards.
The simplified set of operations to obtain $\mathbf{Q}$ is reduced to the following lines:
% Q := Process Noise Covariance
n = length(xhat);
q = 0.01;                        % spectral density of the white noise
L = eye(size(F,1));
Phi = [F L*q*L'; zeros(n) -F'];  % design matrix of the decomposition
CD = expm(Phi*dt) * [zeros(n); eye(n)];
Q = CD(1:n,:) / CD((n+1):(2*n),:);
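As a sanity check of the discretization integral, here is the scalar case in Python (with assumed values for f, q and the step size): for a scalar model $\dot{x} = fx + w$ the integral has the closed form $q(e^{2f\Delta t} - 1)/(2f)$, and a brute-force numeric integration agrees with it.

```python
import math

f, q, dt = -0.5, 0.01, 0.1   # assumed scalar model and step size

# Closed form of  Q = ∫₀^Δt e^{fτ} q e^{fτ} dτ  for the scalar case
Q_exact = q * (math.exp(2 * f * dt) - 1.0) / (2 * f)

# Numerical check with a midpoint Riemann sum over the same integral
n = 10000
h = dt / n
Q_num = sum(math.exp(f * ((i + 0.5) * h)) ** 2 * q * h for i in range(n))

print(Q_exact, Q_num)
```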
Important observations should be made here. First, about the code: the most unusual function here is expm, which performs the matrix exponential.
This function is found in almost every mathematical library of any major programming language. Be aware that it does not perform an element-wise exponential of the matrix. Here is an example in MATLAB/Octave:
>> A = [1 1 0
        0 0 2
        0 0 -1];
>> expm(A)
ans =
   2.7183   1.7183   1.0862
        0   1.0000   1.2642
        0        0   0.3679
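If you want to verify those numbers without MATLAB, a truncated Taylor series $\sum_k \mathbf{A}^k/k!$ is enough for such a small matrix. A pure-Python sketch (the 3×3 matrix is the upper-triangular example used here):

```python
# Truncated Taylor series for the matrix exponential: expm(A) = Σ Aᵏ/k!
A = [[1.0, 1.0, 0.0],
     [0.0, 0.0, 2.0],
     [0.0, 0.0, -1.0]]

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def expm(A, terms=30):
    n = len(A)
    E = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]  # I
    term = [row[:] for row in E]        # running term Aᵏ/k!, starts at I
    for k in range(1, terms):
        term = mat_mul(term, A)
        term = [[x / k for x in row] for row in term]
        E = [[E[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return E

E = expm(A)
print(E[0][0], E[2][2])   # diagonal entries are e^1 and e^(-1)
```

Compare the entries with the MATLAB output above; an element-wise exp would instead give e on the diagonal of zeros, which is clearly not the same thing.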
The second observation to make in the pseudocode (actual MATLAB code) is about the value q, the spectral density of the white noise.
Unfortunately, we have no way to identify it automatically, and we must set it manually to a defined value. In the code above it is equal to 0.01 (a small variance in our mathematical model), but feel free to observe its influence on the system when it is changed.
This spectral density can also be defined as a vector with the same length as the state vector $\hat{\mathbf{x}}$, whose values match those of the state vector, so that each element is weighted independently.
Take the model of the robot again. We have specified that we will use accelerometers to track its position. So, when predicting the values ahead in the Process Model, the resulting values for the a priori state must update as:

$$\hat{x}_k^- = \hat{x}_{k-1} + \hat{\dot{x}}_{k-1}\Delta t + \tfrac{1}{2}\hat{\ddot{x}}_{k-1}\Delta t^2$$
$$\hat{\dot{x}}_k^- = \hat{\dot{x}}_{k-1} + \hat{\ddot{x}}_{k-1}\Delta t$$
$$\hat{\ddot{x}}_k^- = \hat{\ddot{x}}_{k-1}$$

In that case the transition matrix would look like:

$$\mathbf{A} = \begin{bmatrix} 1 & \Delta t & \tfrac{\Delta t^2}{2} \\ 0 & 1 & \Delta t \\ 0 & 0 & 1 \end{bmatrix}$$

Because:

$$\hat{\mathbf{x}}_k^- = \mathbf{A}\hat{\mathbf{x}}_{k-1}$$

The corresponding model matrix of this system is then:

$$\mathbf{F} = \begin{bmatrix} 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix}$$

It is not a coincidence if you spot a similarity between $\mathbf{F}$ and $\mathbf{A}$. As mentioned before, $\mathbf{F}$ describes the behaviour of the differential (the changes) of the state $\mathbf{x}$. It would be natural to think that it is also related to the transition between states. And it is! It turns out that they are related as:

$$\mathbf{A} = e^{\mathbf{F}\Delta t}$$
Thanks to this relation we can get the transition matrix by computing the matrix exponential of $\mathbf{F}$ scaled by the sampling step $\Delta t$.
% A := Transition Matrix
A = expm(F*dt);
Yes! One line of code will produce the entire transition matrix that governs all linear operations in our system. But I would suggest building $\mathbf{A}$ manually when you need better precision.
This is why it is important to define $\mathbf{F}$: through this matrix we can quickly compute the two important matrices $\mathbf{A}$ and $\mathbf{Q}$.
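For the constant-acceleration model this exponential can even be done by hand, because F is nilpotent (F³ = 0), so the Taylor series stops after three terms. A Python sketch with an assumed Δt = 0.1:

```python
dt = 0.1   # assumed sampling step

# Model matrix F for the [position, velocity, acceleration] state
F = [[0.0, 1.0, 0.0],
     [0.0, 0.0, 1.0],
     [0.0, 0.0, 0.0]]

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# F is nilpotent (F³ = 0), so expm(F*dt) = I + F*dt + (F*dt)²/2 exactly
Fdt = [[x * dt for x in row] for row in F]
Fdt2 = mat_mul(Fdt, Fdt)
A = [[(1.0 if i == j else 0.0) + Fdt[i][j] + Fdt2[i][j] / 2.0
      for j in range(3)] for i in range(3)]

print(A)   # [[1, dt, dt²/2], [0, 1, dt], [0, 0, 1]]
```

The result is exactly the transition matrix with 1, Δt and Δt²/2 entries derived above, so in this case expm(F*dt) and the hand-built matrix coincide.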
Extended Kalman Filter
The magic of Kalman filtering is possible thanks to the ability we have to handle our data with simple additions and multiplications. They are performed in large vectors and matrices but are still simple linear operations.
Life is not so simple in reality, and that is one big limitation of the Kalman Filter: it only works as a linear process.
Most systems work in a nonlinear space. We notice this as soon as we try to integrate angular displacements with simple linear translations, which leads to a series of nonlinear functions to update the state.
This might sound bad and we could try to throw everything to the ground, but before sadness invades our minds we can consider the KF again.
As said, the KF works in a linear space and needs linear inputs so it will return linear outputs. The Extended Kalman Filter (EKF) solves this problem by linearizing the nonlinear models around the current state.
The EKF calculates the Jacobian of the process function and the Jacobian of the observation function, so we can have the best approximate estimated state. A Jacobian is a matrix containing the partial derivatives of the elements of a vector function with respect to a set of variables.
Redefining the model
Although linearization is a demanding task, the steps of the Kalman filtering do not change; only their definitions and formulas do, which are now governed by functions and not simple linear operations.
The system still has the state vector $\mathbf{x}$ that contains the elements to estimate, but now the Process Model is defined by the nonlinear stochastic equation:

$$\mathbf{x}_k = f(\mathbf{x}_{k-1}, \mathbf{w}_{k-1})$$

with a measurement estimation also defined by a nonlinear equation:

$$\mathbf{z}_k = h(\mathbf{x}_k, \mathbf{v}_k)$$

In this case, the nonlinear function $f$ relates the state at the previous time step to the state at the current one, while the function $h$ relates the state to the measurement estimation (notice the difference between the estimated measurement $\hat{\mathbf{z}}_k$ and the real measurement $\mathbf{z}_k$).
It is important to note that the fundamental flaw of the EKF is that the distributions are no longer normal after undergoing their respective nonlinear transformations. The EKF simply approximates the optimality of the Bayesian probability by linearization.
The EKF works almost like a regular KF, iteratively predicting states ahead and correcting them with the measurements, except that $\mathbf{A}$ and $\mathbf{H}$ now vary in time based on the estimated state: they depend on the function $f$ and the function $h$, respectively.
Here the Jacobian matrices enter the game: they provide the linearization around the actual state, so that all evaluations happen over these linear approximations.
We want to use a linearized transition matrix like in our normal KF, and in order to get it we compute the Jacobian of the function $f$ with respect to the elements of the state vector:

$$[\mathbf{A}_k]_{ij} = \frac{\partial f_i}{\partial x_j}\bigg|_{\hat{\mathbf{x}}_{k-1}}$$

This Jacobian matrix is the linearized version of the function $f$ defining the transition between states.
If you noticed, the matrix now carries the subscript $k$: it will change for each timestamp, because it depends on $\hat{\mathbf{x}}_{k-1}$.
In the same way, the nonlinear function $h$ can be linearized and used through its Jacobian of the form:

$$[\mathbf{H}_k]_{ij} = \frac{\partial h_i}{\partial x_j}\bigg|_{\hat{\mathbf{x}}_k^-}$$
Do not be scared by such equations. They are simpler than they seem, and the only hard part is the derivation of each element if it is done manually.
You're welcome to do so, but if you prefer to save your time (and self-embarrassment) like me, I suggest using symbolic mathematical programs like Mathematica (or its free alternatives Maxima or SymPy), which will calculate all these elements in a fraction of a second.
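To make the Jacobian concrete, here is a small Python sketch with an assumed range-sensor model $h(x, y) = \sqrt{x^2 + y^2}$ (not a model from this article): the analytic Jacobian matches a finite-difference check.

```python
import math

# Nonlinear measurement: a range sensor h(x, y) = sqrt(x² + y²) (assumed example)
def h(x, y):
    return math.sqrt(x * x + y * y)

def H_jacobian(x, y):
    """Analytic Jacobian of h: [∂h/∂x, ∂h/∂y] evaluated at (x, y)."""
    r = h(x, y)
    return [x / r, y / r]

x0, y0 = 3.0, 4.0
H = H_jacobian(x0, y0)

# Finite-difference check of the analytic Jacobian
eps = 1e-6
H_num = [(h(x0 + eps, y0) - h(x0, y0)) / eps,
         (h(x0, y0 + eps) - h(x0, y0)) / eps]
print(H, H_num)
```

At (3, 4) the range is 5, so the Jacobian is [0.6, 0.8]; this row would play the role of H_k in the EKF equations at that state.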
Yes, now all the elements of $\mathbf{A}_k$ and $\mathbf{H}_k$ also need to be recomputed at every timestamp. The number of elements in the state vector and the complexity of your functions $f$ and $h$ will decide the time of these computations. The EKF is then summarized as:
TIME UPDATE
- Project the a priori State.

$$\hat{\mathbf{x}}_k^- = f(\hat{\mathbf{x}}_{k-1}, \mathbf{0})$$

- Project the a priori Covariance.

$$\mathbf{P}_k^- = \mathbf{A}_k\mathbf{P}_{k-1}\mathbf{A}_k^T + \mathbf{W}_k\mathbf{Q}\mathbf{W}_k^T$$

MEASUREMENT UPDATE
- Estimate the Measurement prediction covariance.

$$\mathbf{S} = \mathbf{H}_k\mathbf{P}_k^-\mathbf{H}_k^T + \mathbf{V}_k\mathbf{R}\mathbf{V}_k^T$$

- Compute the Kalman Gain.

$$\mathbf{K} = \mathbf{P}_k^-\mathbf{H}_k^T\mathbf{S}^{-1}$$

- Compute the innovation.

$$\mathbf{r} = \mathbf{z}_k - h(\hat{\mathbf{x}}_k^-, \mathbf{0})$$

- Update the State estimate (a posteriori state).

$$\hat{\mathbf{x}}_k = \hat{\mathbf{x}}_k^- + \mathbf{K}\mathbf{r}$$

- Update the Covariance (a posteriori covariance).

$$\mathbf{P}_k = \mathbf{P}_k^- - \mathbf{K}\mathbf{S}\mathbf{K}^T$$

Or, again, with:

$$\mathbf{P}_k = (\mathbf{I} - \mathbf{K}\mathbf{H}_k)\mathbf{P}_k^-$$
What the hell are the matrices $\mathbf{W}$ and $\mathbf{V}$?
Fear not, my friend! Those are the Jacobian matrices of the functions $f$ and $h$ with respect to the process noise $\mathbf{w}$ and the measurement noise $\mathbf{v}$, respectively:

$$[\mathbf{W}_k]_{ij} = \frac{\partial f_i}{\partial w_j}\bigg|_{\hat{\mathbf{x}}_{k-1}}, \qquad [\mathbf{V}_k]_{ij} = \frac{\partial h_i}{\partial v_j}\bigg|_{\hat{\mathbf{x}}_k^-}$$

As mentioned before, the actual values of the Gaussian noises are unknown in reality, but we can use the covariances $\mathbf{Q}$ and $\mathbf{R}$ in a similar way as in the standard Kalman filter by defining $\mathbf{W}$ and $\mathbf{V}$ as identities.
Its basic implementation in MATLAB looks like:
I = eye(size(P,1));
V = eye(size(R,1));
W = eye(size(Q,1));
% A and H are the Jacobians of f and h evaluated at the current state
% Predict
xhat = f(xhat);
P = A*P*A' + W*Q*W';
% Update
S = H*P*H' + V*R*V';
K = P*H'/S;
r = z - h(xhat);
xhat = xhat + K*r;
P = P - K*S*K';
Kaboom! Now you have the basic tools to go out and play with KFs. I know it is not easy to grasp, but reread the points you don't get, check other documentation to confirm, and, most importantly, apply what you see.
Any questions? Suggestions? What should I explain more?
Great References
- An Introduction to the Kalman Filter; Greg Welch, Gary Bishop; SIGGRAPH 2001.
- Optimal Filtering with Kalman Filters and Smoothers, a Manual for the Matlab Toolbox EKF/UKF (version 1.3); Jouni Hartikainen, Arno Solin, and Simo Särkkä.
- Sensor Fusion using the Kalman Filter; Katharina Buckl; CAMPAR TUM; 2005. http://campar.in.tum.de/Chair/KalmanFilter