#
#
#
#
# # Distance/Similarity
# PyCM's `distance` method provides a wide range of string distance/similarity metrics for evaluating a confusion matrix by measuring its distance to a perfect confusion matrix. Distance/similarity metrics measure the distance between two vectors of numbers, and a small distance between two objects indicates that they are similar. In PyCM's `distance` method, the measure is chosen from `DistanceType`. The measures' names follow the naming style suggested in [[1]](#ref1).
# In[1]:
from pycm import ConfusionMatrix, DistanceType
# In[2]:
cm = ConfusionMatrix(matrix={0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}})
# $$TP \rightarrow \text{True Positive}$$
# $$TN \rightarrow \text{True Negative}$$
# $$FP \rightarrow \text{False Positive}$$
# $$FN \rightarrow \text{False Negative}$$
# $$POP \rightarrow \text{Population}$$
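# As a minimal sketch (not PyCM's implementation), the per-class counts listed above can be derived from a nested-dict matrix in a one-vs-rest fashion. The helper name `one_vs_rest_counts` is hypothetical, and the sketch assumes that outer keys are actual classes and inner keys are predicted classes.

```python
def one_vs_rest_counts(matrix):
    """Return {class: {"TP", "FP", "FN", "TN", "POP"}} for a nested-dict matrix.

    Assumption: matrix[actual][predicted] holds the number of samples.
    """
    classes = list(matrix)
    pop = sum(sum(row.values()) for row in matrix.values())
    counts = {}
    for c in classes:
        tp = matrix[c][c]
        fn = sum(matrix[c].values()) - tp              # actual c, predicted as another class
        fp = sum(matrix[r][c] for r in classes) - tp   # predicted c, actually another class
        tn = pop - tp - fn - fp                        # everything that does not involve c
        counts[c] = {"TP": tp, "FP": fp, "FN": fn, "TN": tn, "POP": pop}
    return counts


matrix = {0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}}
counts = one_vs_rest_counts(matrix)
print(counts[0])
```

# Under that assumption, class 0 of the matrix above has TP=3, FP=2, FN=0, TN=7 and POP=12.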
# ## AMPLE
# AMPLE similarity [[2]](#ref2) [[3]](#ref3).
# $$sim_{AMPLE}=|\frac{TP}{TP+FP}-\frac{FN}{FN+TN}|$$
# In[3]:
cm.distance(metric=DistanceType.AMPLE)
#
#
# * Notice: new in version 3.8
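# To make the AMPLE formula concrete, the following hedged sketch evaluates it by hand for each class. The function name `ample_similarity` and the hard-coded one-vs-rest counts (derived from the matrix defined earlier, assuming its rows are actual classes) are illustrative and are not part of PyCM.

```python
def ample_similarity(tp, fp, fn, tn):
    """|TP/(TP+FP) - FN/(FN+TN)|, with no guard against zero denominators."""
    return abs(tp / (tp + fp) - fn / (fn + tn))


# (TP, FP, FN, TN) per class; assumed one-vs-rest counts for the example matrix.
per_class = {0: (3, 2, 0, 7), 1: (1, 1, 2, 8), 2: (3, 2, 3, 4)}
ample = {label: ample_similarity(*c) for label, c in per_class.items()}
print(ample)
```

# The sketch raises `ZeroDivisionError` when TP+FP or FN+TN is zero, so a real implementation would need to handle that case explicitly.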
#
# ## Anderberg's D
# Anderberg's D [[4]](#ref4).
# $$sim_{Anderberg} =
# \frac{(max(TP,FP)+max(FN,TN)+max(TP,FN)+max(FP,TN))-
# (max(TP+FP,FP+TN)+max(TP+FP,FN+TN))}{2\times POP}$$
# In[4]:
cm.distance(metric=DistanceType.Anderberg)
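# The formula above can be sketched in plain Python as follows. This is a term-by-term transcription of the formula exactly as it is written in this section, not PyCM's source code. The function name `anderberg_d` and the hard-coded counts (one-vs-rest counts for the example matrix, assuming rows are actual classes) are illustrative.

```python
def anderberg_d(tp, fp, fn, tn):
    """Anderberg's D, transcribed from the formula in this section."""
    part1 = max(tp, fp) + max(fn, tn) + max(tp, fn) + max(fp, tn)
    part2 = max(tp + fp, fp + tn) + max(tp + fp, fn + tn)
    pop = tp + fp + fn + tn
    return (part1 - part2) / (2 * pop)


# (TP, FP, FN, TN) per class; assumed one-vs-rest counts for the example matrix.
per_class = {0: (3, 2, 0, 7), 1: (1, 1, 2, 8), 2: (3, 2, 3, 4)}
anderberg = {label: anderberg_d(*c) for label, c in per_class.items()}
print(anderberg)
```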
#
# ## Baulieu IV
# Baulieu IV distance [[9]](#ref9).
# $$dist_{BaulieuIV} = \frac{FP+FN-(TP+\frac{1}{2})\times(TN+\frac{1}{2})\times TN \times k}{POP}$$
# In[12]:
cm.distance(metric=DistanceType.BaulieuIV)
# * The default value of k is Euler's number $e$
#
#
# * Notice: new in version 3.8
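# A hedged sketch of the Baulieu IV formula, with $k$ defaulting to Euler's number as stated above. The function name `baulieu_iv` and the hard-coded counts (one-vs-rest counts for the example matrix, assuming rows are actual classes) are illustrative and are not PyCM's API.

```python
import math


def baulieu_iv(tp, fp, fn, tn, k=math.e):
    """(FP + FN - (TP + 1/2) * (TN + 1/2) * TN * k) / POP."""
    pop = tp + fp + fn + tn
    return (fp + fn - (tp + 0.5) * (tn + 0.5) * tn * k) / pop


# (TP, FP, FN, TN) per class; assumed one-vs-rest counts for the example matrix.
per_class = {0: (3, 2, 0, 7), 1: (1, 1, 2, 8), 2: (3, 2, 3, 4)}
baulieu4 = {label: baulieu_iv(*c) for label, c in per_class.items()}
print(baulieu4)
```

# For these counts the product term dominates FP+FN, so every class gets a negative value. The measure is therefore not confined to the interval [0, 1].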
#
# ## Baulieu V
# Baulieu V distance [[9]](#ref9).
# $$dist_{BaulieuV} = \frac{FP+FN+1}{TP+FP+FN+1}$$
# In[13]:
cm.distance(metric=DistanceType.BaulieuV)
#
#
# * Notice: new in version 3.8
#
# ## Baulieu VI
# Baulieu VI distance [[9]](#ref9).
# $$dist_{BaulieuVI} = \frac{FP+FN}{TP+FP+FN+1}$$
# In[14]:
cm.distance(metric=DistanceType.BaulieuVI)
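# Baulieu V and Baulieu VI share a denominator and differ only by the $+1$ in the numerator, so they can be compared side by side. The sketch below is illustrative: the function names and the hard-coded counts (one-vs-rest counts for the example matrix, assuming rows are actual classes) are not part of PyCM.

```python
def baulieu_v(tp, fp, fn):
    """(FP + FN + 1) / (TP + FP + FN + 1)."""
    return (fp + fn + 1) / (tp + fp + fn + 1)


def baulieu_vi(tp, fp, fn):
    """(FP + FN) / (TP + FP + FN + 1)."""
    return (fp + fn) / (tp + fp + fn + 1)


# (TP, FP, FN) per class; TN is not used by either formula.
per_class = {0: (3, 2, 0), 1: (1, 1, 2), 2: (3, 2, 3)}
baulieu5 = {label: baulieu_v(*c) for label, c in per_class.items()}
baulieu6 = {label: baulieu_vi(*c) for label, c in per_class.items()}
print(baulieu5)
print(baulieu6)
```

# Neither formula depends on TN, and for every class the two values differ by exactly $\frac{1}{TP+FP+FN+1}$.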
#
# ## Kent & Foster I
# Kent & Foster I similarity [[37]](#ref37).
# $$sim_{KentFosterI} =
# \frac{TP-\frac{(TP+FP)\times(TP+FN)}{TP+FP+FN}}{TP-\frac{(TP+FP)\times(TP+FN)}{TP+FP+FN}+FP+FN}
# $$
# In[54]:
cm.distance(metric=DistanceType.KentFosterI)
#
#
# * Notice: new in version 3.9
#
# ## Kent & Foster II
# Kent & Foster II similarity [[37]](#ref37).
# $$sim_{KentFosterII} =
# \frac{TN-\frac{(FP+TN)\times(FN+TN)}{FP+FN+TN}}{TN-\frac{(FP+TN)\times(FN+TN)}{FP+FN+TN}+FP+FN}
# $$
# In[55]:
cm.distance(metric=DistanceType.KentFosterII)
#
#
# * Notice: new in version 3.9
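# The two Kent & Foster similarities have the same shape: an observed count minus a correction term, divided by that same difference plus FP+FN. Kent & Foster I builds the correction from TP, while Kent & Foster II builds it from TN, using the term $\frac{(FP+TN)\times(FN+TN)}{FP+FN+TN}$ in both numerator and denominator. The sketch below is illustrative: the function names and the hard-coded counts (one-vs-rest counts for the example matrix, assuming rows are actual classes) are not part of PyCM.

```python
def kent_foster_i(tp, fp, fn, tn):
    """Kent & Foster I: correction term built from TP (TN is unused)."""
    correction = (tp + fp) * (tp + fn) / (tp + fp + fn)
    return (tp - correction) / (tp - correction + fp + fn)


def kent_foster_ii(tp, fp, fn, tn):
    """Kent & Foster II: correction term built from TN (TP is unused)."""
    correction = (fp + tn) * (fn + tn) / (fp + fn + tn)
    return (tn - correction) / (tn - correction + fp + fn)


# (TP, FP, FN, TN) per class; assumed one-vs-rest counts for the example matrix.
per_class = {0: (3, 2, 0, 7), 1: (1, 1, 2, 8), 2: (3, 2, 3, 4)}
kf1 = {label: kent_foster_i(*c) for label, c in per_class.items()}
kf2 = {label: kent_foster_ii(*c) for label, c in per_class.items()}
print(kf1)
print(kf2)
```

# Both functions divide without guarding against a zero denominator, which a real implementation would need to handle.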
#
# ## References
#
1- C. C. Little, "Abydos Documentation," 2018.
#
#
2- V. Dallmeier, C. Lindig, and A. Zeller, "Lightweight defect localization for Java," in European conference on object-oriented programming, 2005: Springer, pp. 528-550.
#
#
3- R. Abreu, P. Zoeteweij, and A. J. Van Gemund, "An evaluation of similarity coefficients for software fault localization," in 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06), 2006: IEEE, pp. 39-46.
#
#
4- M. R. Anderberg, Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks. Academic press, 2014.
#
#
5- A. M. Andrés and P. F. Marzo, "Delta: A new measure of agreement between two raters," British journal of mathematical and statistical psychology, vol. 57, no. 1, pp. 1-19, 2004.
#
#
6- C. Baroni-Urbani and M. W. Buser, "Similarity of binary data," Systematic Zoology, vol. 25, no. 3, pp. 251-259, 1976.
#
#
7- V. Batagelj and M. Bren, "Comparing resemblance measures," Journal of classification, vol. 12, no. 1, pp. 73-90, 1995.
#
#
8- F. B. Baulieu, "A classification of presence/absence based dissimilarity coefficients," Journal of Classification, vol. 6, no. 1, pp. 233-246, 1989.
#
#
9- F. B. Baulieu, "Two variant axiom systems for presence/absence based dissimilarity coefficients," Journal of Classification, vol. 14, no. 1, pp. 159-170, 1997.
#
#
10- R. Benini, Principii di demografia. Barbera, 1901.
#
#
11- G. N. Lance and W. T. Williams, "Computer programs for hierarchical polythetic classification (“similarity analyses”)," The Computer Journal, vol. 9, no. 1, pp. 60-64, 1966.
#
#
12- G. N. Lance and W. T. Williams, "Mixed-Data Classificatory Programs I - Agglomerative Systems," Australian Computer Journal, vol. 1, no. 1, pp. 15-20, 1967.
#
#
13- P. W. Clement, "A formula for computing inter-observer agreement," Psychological Reports, vol. 39, no. 1, pp. 257-258, 1976.
#
#
14- V. Consonni and R. Todeschini, "New similarity coefficients for binary data," Match-Communications in Mathematical and Computer Chemistry, vol. 68, no. 2, p. 581, 2012.
#
#
15- S. F. Dennis, "The Construction of a Thesaurus Automatically From," in Statistical Association Methods for Mechanized Documentation: Symposium Proceedings, 1965, vol. 269: US Government Printing Office, p. 61.
#
#
16- P. G. Digby, "Approximating the tetrachoric correlation coefficient," Biometrics, pp. 753-757, 1983.
#
#
17- IBM Corp, "IBM SPSS Statistics Algorithms," ed: IBM Corp Armonk, NY, USA, 2017.
#
#
18- M. H. Doolittle, "The verification of predictions," Bulletin of the Philosophical Society of Washington, vol. 7, pp. 122-127, 1885.
#
#
19- H. Eyraud, "Les principes de la mesure des correlations," Ann. Univ. Lyon, III. Ser., Sect. A, vol. 1, no. 30-47, p. 111, 1936.
#
#
20- E. W. Fager, "Determination and analysis of recurrent groups," Ecology, vol. 38, no. 4, pp. 586-595, 1957.
#
#
21- E. W. Fager and J. A. McGowan, "Zooplankton Species Groups in the North Pacific: Co-occurrences of species can be used to derive groups whose members react similarly to water-mass types," Science, vol. 140, no. 3566, pp. 453-460, 1963.
#
#
22- D. P. Faith, "Asymmetric binary similarity measures," Oecologia, vol. 57, pp. 287-290, 1983.
#
#
23- J. L. Fleiss, B. Levin, and M. C. Paik, Statistical methods for rates and proportions. John Wiley & Sons, 2013.
#
#
24- S. A. Forbes, On the local distribution of certain Illinois fishes: an essay in statistical ecology. Illinois State Laboratory of Natural History, 1907.
#
#
25- A. Mozley, "The statistical analysis of the distribution of pond molluscs in western Canada," The American Naturalist, vol. 70, no. 728, pp. 237-244, 1936.
#
#
26- S. A. Forbes, "Method of determining and measuring the associative relations of species," Science, vol. 61, no. 1585, pp. 518-524, 1925.
#
#
27- E. G. Fossum and G. Kaskey, "Optimization and standardization of information retrieval language and systems," SPERRY RAND CORP PHILADELPHIA PA UNIVAC DIV, 1966.
#
#
28- N. Gilbert and T. C. Wells, "Analysis of quadrat data," The Journal of Ecology, pp. 675-685, 1966.
#
#
29- D. W. Goodall, "The distribution of the matching coefficient," Biometrics, pp. 647-656, 1967.
#
#
30- B. Austin and R. R. Colwell, "Evaluation of some coefficients for use in numerical taxonomy of microorganisms," International Journal of Systematic and Evolutionary Microbiology, vol. 27, no. 3, pp. 204-210, 1977.
#
#
31- L. A. Goodman and W. H. Kruskal, Measures of association for cross classifications. Springer, 1979.
#
#
32- L. Guttman, "An outline of the statistical theory of prediction," The prediction of personal adjustment, vol. 48, pp. 253-318, 1941.
#
#
33- U. Hamann, "Merkmalsbestand und verwandtschaftsbeziehungen der farinosae: ein beitrag zum system der monokotyledonen," Willdenowia, pp. 639-768, 1961.
#
#
34- F. C. Harris and B. B. Lahey, "A method for combining occurrence and nonoccurrence interobserver agreement scores," Journal of Applied Behavior Analysis, vol. 11, no. 4, pp. 523-527, 1978.
#
#
35- R. P. Hawkins and V. A. Dotson, "Reliability Scores That Delude: An Alice in Wonderland Trip Through the Misleading Characteristics of Inter-Observer Agreement Scores in Interval Recording," 1973.
#
#
36- M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81-93, 1938.
#
#
37- R. N. Kent and S. L. Foster, "Direct observational procedures: Methodological issues in naturalistic settings," Handbook of behavioral assessment, pp. 279-328, 1977.