We were given this table of pathological test results for three individuals.
| Name | Gender | Fever | Cough | Test-1 | Test-2 | Test-3 | Test-4 |
|---|---|---|---|---|---|---|---|
| Jack | M | Y | N | P | N | N | A |
| Mary | F | Y | N | P | A | P | N |
| Jim | M | Y | P | N | N | N | A |
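Each individual's test results form a set of binary variables: each variable takes a value of 0 or 1. For example, "Fever" is either N (0) or Y (1). The other symptom and test variables are either N (0), A (0) or P (1). Gender is not one of the compared variables, as the counts for Jack and Mary further down confirm.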
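
The encoding described above can be sketched in a few lines of Python. This is only an illustration: the names `ATTRIBUTES`, `RAW`, `POSITIVE` and `encode` are made up for this sketch, and leaving Gender out is an assumption based on how the exercise counts the variables.

```python
# Columns that take part in the comparison (Gender is left out; assumption).
ATTRIBUTES = ["Fever", "Cough", "Test-1", "Test-2", "Test-3", "Test-4"]

# Raw codes copied from the table, in the same order as ATTRIBUTES.
RAW = {
    "Jack": ["Y", "N", "P", "N", "N", "A"],
    "Mary": ["Y", "N", "P", "A", "P", "N"],
    "Jim":  ["Y", "P", "N", "N", "N", "A"],
}

# Y and P are encoded as 1; N and A are encoded as 0.
POSITIVE = {"Y", "P"}


def encode(row):
    """Map a row of Y/N/P/A codes to a list of 0/1 values."""
    return [1 if value in POSITIVE else 0 for value in row]


encoded = {name: encode(row) for name, row in RAW.items()}

for name, bits in encoded.items():
    print(name, bits)
```

Each person ends up as a list of six 0/1 values, one per compared column.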
The exercise was to calculate the Jaccard coefficient for each pair - Jack with Mary, Jack with Jim, and Mary with Jim.
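The Jaccard coefficient, as used in this exercise, is a measure of the dissimilarity between two sets A and B (this form is also known as the Jaccard distance). It is calculated with this formula:
$$J(A, B) = \frac{M_{01} + M_{10}}{M_{01} + M_{10} + M_{11}}$$
$M_{01}$ is the number of times A is 0 and B is 1.
$M_{10}$ is the number of times A is 1 and B is 0.
$M_{11}$ is the number of times A is 1 and B is 1.
Variables that are 0 in both A and B are not counted at all.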
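
As a rough sketch, the three counts defined above can be computed directly from two equal-length lists of 0/1 values. The function name `jaccard_dissimilarity` and the variable names `m01`, `m10` and `m11` are chosen for this example only. Returning 0 when both lists are all zeros is an assumption, since the formula is undefined in that case.

```python
def jaccard_dissimilarity(a, b):
    """Dissimilarity of two equal-length 0/1 lists, ignoring 0/0 positions."""
    if len(a) != len(b):
        raise ValueError("both lists must have the same length")

    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)  # A is 0, B is 1
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)  # A is 1, B is 0
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)  # A is 1, B is 1

    total = m01 + m10 + m11
    if total == 0:
        # Assumption: two all-zero lists are treated as identical.
        return 0.0
    return (m01 + m10) / total


# Toy input, unrelated to the table: one 1/1 match and two mismatches.
a = [1, 0, 1, 0]
b = [1, 1, 0, 0]
print(jaccard_dissimilarity(a, b))  # evaluates (1 + 1) / (1 + 1 + 1)
```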
Let us call Jack's test results set A and Mary's test results set B.
A is 0 and B is 1 for 1 variable ("Test-3").
A is 1 and B is 0 for 0 variables.
A is 1 and B is 1 for 2 variables ("Fever" and "Test-1").
$$\frac{1 + 0}{1 + 0 + 2} = \frac{1}{3} \approx 0.33$$
The Jaccard coefficient is about 0.33, which is closer to zero than to one, so Jack and Mary's results are not very dissimilar. (In other words, they are quite similar.)
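We can use the same method to calculate the Jaccard coefficient for Jack and Jim. This time let Jack's test results be set A and Jim's test results be set B.
A is 0 and B is 1 for 1 variable ("Cough").
A is 1 and B is 0 for 1 variable ("Test-1").
A is 1 and B is 1 for 1 variable ("Fever").
$$\frac{1 + 1}{1 + 1 + 1} = \frac{2}{3} \approx 0.67$$
The Jaccard coefficient is about 0.67, which is closer to one, so Jack and Jim's results are quite dissimilar.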
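Finally, we can use the same method to calculate the Jaccard coefficient for Mary and Jim, with Mary's test results as set A and Jim's test results as set B.
A is 0 and B is 1 for 1 variable ("Cough").
A is 1 and B is 0 for 2 variables ("Test-1" and "Test-3").
A is 1 and B is 1 for 1 variable ("Fever").
$$\frac{1 + 2}{1 + 2 + 1} = \frac{3}{4} = 0.75$$
The Jaccard coefficient is 0.75, the closest to one of the three pairs, so Mary and Jim's results are the most dissimilar.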
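
The three pairwise comparisons can also be checked with Python sets. In this sketch each person is represented by the set of variables whose value is 1 (Y or P), read off the table by hand. The dissimilarity is then the size of the symmetric difference divided by the size of the union, which gives the same number as counting mismatches against mismatches plus 1/1 matches. The names `positives`, `jaccard_dissimilarity` and `results` are illustrative only.

```python
from fractions import Fraction
from itertools import combinations

# Variables with value 1 (Y or P) for each person, taken from the table.
positives = {
    "Jack": {"Fever", "Test-1"},
    "Mary": {"Fever", "Test-1", "Test-3"},
    "Jim": {"Fever", "Cough"},
}


def jaccard_dissimilarity(a, b):
    """Size of the symmetric difference over size of the union, as a Fraction."""
    union = a | b
    if not union:
        # Assumption: two empty sets are treated as identical.
        return Fraction(0)
    return Fraction(len(a ^ b), len(union))


results = {
    (p, q): jaccard_dissimilarity(positives[p], positives[q])
    for p, q in combinations(positives, 2)
}

for (p, q), d in results.items():
    print(f"{p} vs {q}: {d} = {float(d):.2f}")
# Jack vs Mary: 1/3 = 0.33
# Jack vs Jim: 2/3 = 0.67
# Mary vs Jim: 3/4 = 0.75
```

`Fraction` is used so the exact values are visible next to the rounded decimals.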
The Jaccard coefficient is a fast way of comparing two sets of binary variables: it reduces the comparison to a single number between 0 and 1, which is much faster than visually comparing all the variables in the two sets.