The ideas and results contained in this document are part of my thesis, which will be published as a Stanford computer science technical report in June 1999. Similar ideas applied to the EMD under translation have already been published in the technical report [1]. Click here for a PostScript version of this entire document.
We begin some intuition, examples, and informal definitions.
The Earth Mover's Distance (EMD) is a distance measure between discrete, finite distributions
x = { (x_{1},w_{1}), (x_{2},w_{2}), ..., (x_{m},w_{m}) } | and |
y = { (y_{1},u_{1}), (y_{2},u_{2}), ..., (y_{n},u_{n}) }. |
The x distribution has an amount of mass or weight w_{i} at position x_{i} in R^{K}, i=1,...,m, while the y distribution has weight u_{j} at position y_{j}, j=1,...,n. An example pair of distributions in R^{2} is shown below.
Here x has m=2 masses and y has n=3 masses. The circle centers are the points (mass locations) of the distributions. The area of a circle is proportional to the weight at its center point. The total weight of x is w^{S}=sum_{i=1..m} w_{i}, and the total weight of y is u^{S}=sum_{j=1..n} u_{j}. In the example above, the distributions have equal total weight w^{S}=u^{S}=1.
Although the EMD does not require equal-weight distributions, let us begin our discussion with this assumption. The EMD between two equal-weight distributions is proportional to the amount of work needed to morph one distribution into the other. We morph x into y by transporting mass from the x mass locations to the y mass locations until x has been rearranged to look exactly like y. An example morph is shown below.
The amount of mass transported from x_{i} to y_{j} is denoted f_{ij}, and is called a flow between x_{i} and y_{j}. The work done to transport an amount of mass f_{ij} from x_{i} to y_{j} is the product of the f_{ij} and the distance d_{ij}=d(x_{i},y_{j}) between x_{i} and y_{j}. The total amount of work to morph x into y by the flow F=(f_{ij}) is the sum of the individual works:
In the above flow example, the total amount of work done is 0.23*155.7 + 0.51*252.3 + 0.26*316.3 = 246.7.
The flow in the previous example is not an optimal flow between the given distributions. A minimum work flow is shown below.
The total amount of work done by this flow is 0.23*155.7 + 0.26*277.0 + 0.25*252.3 + 0.26*198.2 = 222.4. The EMD between equal-weight distributions is the minimum work to morph one into the other, divided by the total weight of the distributions. The normalization by the total weight makes the EMD equal to the average distance travelled by mass during an optimal (i.e. work minimizing) flow. The EMD does not change if all the weights in both distributions are scaled by the same factor. In the previous example, the total weight is 1, so the EMD is equal to the minimum amount of work: EMD(x,y)=222.4.
When the distributions do not have equal total weights, it is not possible to rearrange the mass in one so that it exactly matches the other. The EMD between unequal-weight distributions is proportional to the minimum amount of work needed to cover the mass in the lighter distribution by mass from the heavier distribution. An example of a flow between unequal-weight distributions is given below.
Here w^{S}=1 and u^{S}=0.74, so x is heavier than y. The flow covers or matches all the mass in y with mass from x. The total amount of work to cover y by this flow is 0.51*252.3 + .23*292.9 = 196.0. In this example, 0.23 of the mass at x_{1} and 0.03 of the mass at x_{2} is not used in matching all the y mass.
In general, some of the mass (w^{S}-u^{S} if x is heavier than y) in the heavier distribution is not needed to match all the mass in the lighter distribution. For this reason, the case of matching unequal-weight distributions is called the partial matching case. The case when the distributions are equal-weight is called the complete matching case because all the mass in one distribution is matched to all the mass in the other distribution. Another way to visualize matching distributions is to imagine piles of dirt and holes in the ground. The dirt piles are located at the points in the heavier distribution, and the the holes are located at the points of the lighter distribution. The volume of a dirt pile or the volume of dirt missing from a hole is equal to the weight of its point. Matching distributions means filling all the holes with dirt. In the partial matching case, there will be some dirt leftover after all the holes are filled.
The flow between the unequal-weight distributions in the previous example is not optimal. A minimum work flow is shown below.
The total amount of work for this flow to cover y is 0.23*155.7 + 0.25*252.3 + 0.26*198.2 = 150.4. The EMD is the minimum amount of work to cover the mass in the lighter distribution by mass from the heavier distribution, divided by the weight of the lighter distribution (which is the total amount of mass moved). As in the complete matching case, the normalization of the minimum work makes the EMD equal to the average distance mass travels during an optimal flow. Again, scaling the weights in both distributions by a constant factor does not change the EMD. In the above example, the weight of the lighter distribution is u^{S}=0.74, so EMD(x,y)= 150.4/0.74 = 203.3.
top | Title, Table of Contents, The EMD |
next | Formal Definition |
Email comments to scohen@cs.stanford.edu.