#!/usr/bin/env python
# coding: utf-8

Please cite us if you use the software

# # Distance/Similarity
# PyCM's `distance` method provides a wide range of string distance/similarity metrics for evaluating a confusion matrix by measuring its distance to a perfect confusion matrix. Distance/similarity metrics measure the distance between two vectors of numbers; a small distance between two objects indicates similarity. In PyCM's `distance` method, a measure is chosen from `DistanceType`. The measures' names follow the naming style suggested in [[1]](#ref1).

# In[1]:

from pycm import ConfusionMatrix, DistanceType

# In[2]:

cm = ConfusionMatrix(matrix={0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}})

# $$TP \rightarrow True Positive$$
# $$TN \rightarrow True Negative$$
# $$FP \rightarrow False Positive$$
# $$FN \rightarrow False Negative$$
# $$POP \rightarrow Population$$

# ## AMPLE
# AMPLE similarity [[2]](#ref2) [[3]](#ref3).
# $$sim_{AMPLE}=|\frac{TP}{TP+FP}-\frac{FN}{FN+TN}|$$

# In[3]:

cm.distance(metric=DistanceType.AMPLE)

# ## Anderberg's D
# Anderberg's D [[4]](#ref4).
# $$sim_{Anderberg} =
# \frac{(max(TP,FP)+max(FN,TN)+max(TP,FN)+max(FP,TN))-
# (max(TP+FN,FP+TN)+max(TP+FP,FN+TN))}{2\times POP}$$

# In[4]:

cm.distance(metric=DistanceType.Anderberg)

# ## Andres & Marzo's Delta
# Andres & Marzo's Delta correlation [[5]](#ref5).
# $$corr_{AndresMarzo_\Delta} = \Delta =
# \frac{TP+TN-2 \times \sqrt{FP \times FN}}{POP}$$

# In[5]:

cm.distance(metric=DistanceType.AndresMarzoDelta)

# ## Baroni-Urbani & Buser I
# Baroni-Urbani & Buser I similarity [[6]](#ref6).
# $$sim_{BaroniUrbaniBuserI} =
# \frac{\sqrt{TP\times TN}+TP}{\sqrt{TP\times TN}+TP+FP+FN}$$

# In[6]:

cm.distance(metric=DistanceType.BaroniUrbaniBuserI)

# ## Baroni-Urbani & Buser II
# Baroni-Urbani & Buser II correlation [[6]](#ref6).
# $$corr_{BaroniUrbaniBuserII} =
# \frac{\sqrt{TP \times TN}+TP-FP-FN}{\sqrt{TP \times TN}+TP+FP+FN}$$

# In[7]:

cm.distance(metric=DistanceType.BaroniUrbaniBuserII)

# ## Batagelj & Bren
# Batagelj & Bren distance [[7]](#ref7).
# $$dist_{BatageljBren} =
# \frac{FP \times FN}{TP \times TN}$$

# In[8]:

cm.distance(metric=DistanceType.BatageljBren)

# ## Baulieu I
# Baulieu I distance [[8]](#ref8).
# $$dist_{BaulieuI} =
# \frac{(TP+FP) \times (TP+FN)-TP^2}{(TP+FP) \times (TP+FN)}$$

# In[9]:

cm.distance(metric=DistanceType.BaulieuI)

# ## Baulieu II
# Baulieu II similarity [[8]](#ref8).
# $$sim_{BaulieuII} =
# \frac{TP^2 \times TN^2}{(TP+FP) \times (TP+FN) \times (FP+TN) \times (FN+TN)}$$

# In[10]:

cm.distance(metric=DistanceType.BaulieuII)

# ## Baulieu III
# Baulieu III distance [[8]](#ref8).
# $$dist_{BaulieuIII} =
# \frac{POP^2 - 4 \times (TP \times TN-FP \times FN)}{2 \times POP^2}$$

# In[11]:

cm.distance(metric=DistanceType.BaulieuIII)

# ## Baulieu IV
# Baulieu IV distance [[9]](#ref9).
# $$dist_{BaulieuIV} = \frac{FP+FN-(TP+\frac{1}{2})\times(TN+\frac{1}{2})\times TN \times k}{POP}$$

# In[12]:

cm.distance(metric=DistanceType.BaulieuIV)

# * The default value of k is Euler's number $e$

# ## Baulieu V
# Baulieu V distance [[9]](#ref9).
# $$dist_{BaulieuV} = \frac{FP+FN+1}{TP+FP+FN+1}$$

# In[13]:

cm.distance(metric=DistanceType.BaulieuV)

# ## Baulieu VI
# Baulieu VI distance [[9]](#ref9).
# $$dist_{BaulieuVI} = \frac{FP+FN}{TP+FP+FN+1}$$

# In[14]:

cm.distance(metric=DistanceType.BaulieuVI)

# ## Baulieu VII
# Baulieu VII distance [[9]](#ref9).
# $$dist_{BaulieuVII} = \frac{FP+FN}{POP + TP \times (TP-4)^2}$$

# In[15]:

cm.distance(metric=DistanceType.BaulieuVII)

# ## Baulieu VIII
# Baulieu VIII distance [[9]](#ref9).
# $$dist_{BaulieuVIII} = \frac{(FP-FN)^2}{POP^2}$$

# In[16]:

cm.distance(metric=DistanceType.BaulieuVIII)

# ## Baulieu IX
# Baulieu IX distance [[9]](#ref9).
# $$dist_{BaulieuIX} = \frac{FP+2 \times FN}{TP+FP+2 \times FN+TN}$$

# In[17]:

cm.distance(metric=DistanceType.BaulieuIX)

# ## Baulieu X
# Baulieu X distance [[9]](#ref9).
# $$dist_{BaulieuX} = \frac{FP+FN+max(FP,FN)}{POP+max(FP,FN)}$$

# In[18]:

cm.distance(metric=DistanceType.BaulieuX)

# ## Baulieu XI
# Baulieu XI distance [[9]](#ref9).
# $$dist_{BaulieuXI} = \frac{FP+FN}{FP+FN+TN}$$

# In[19]:

cm.distance(metric=DistanceType.BaulieuXI)

# ## Baulieu XII
# Baulieu XII distance [[9]](#ref9).
# $$dist_{BaulieuXII} = \frac{FP+FN}{TP+FP+FN-1}$$

# In[20]:

cm.distance(metric=DistanceType.BaulieuXII)

# ## Baulieu XIII
# Baulieu XIII distance [[9]](#ref9).
# $$dist_{BaulieuXIII} = \frac{FP+FN}{TP+FP+FN+TP \times (TP-4)^2}$$

# In[21]:

cm.distance(metric=DistanceType.BaulieuXIII)

# ## Baulieu XIV
# Baulieu XIV distance [[9]](#ref9).
# $$dist_{BaulieuXIV} = \frac{FP+2 \times FN}{TP+FP+2 \times FN}$$

# In[22]:

cm.distance(metric=DistanceType.BaulieuXIV)

# ## Baulieu XV
# Baulieu XV distance [[9]](#ref9).
# $$dist_{BaulieuXV} = \frac{FP+FN+max(FP, FN)}{TP+FP+FN+max(FP, FN)}$$

# In[23]:

cm.distance(metric=DistanceType.BaulieuXV)

# ## Benini I
# Benini I correlation [[10]](#ref10).
# $$corr_{BeniniI} = \frac{TP \times TN-FP \times FN}{(TP+FN)\times(FN+TN)}$$

# In[24]:

cm.distance(metric=DistanceType.BeniniI)

# ## Benini II
# Benini II correlation [[10]](#ref10).
# $$corr_{BeniniII} = \frac{TP \times TN-FP \times FN}{min((TP+FN)\times(FN+TN), (TP+FP)\times(FP+TN))}$$

# In[25]:

cm.distance(metric=DistanceType.BeniniII)

# ## Canberra
# Canberra distance [[11]](#ref11) [[12]](#ref12).
# $$dist_{Canberra} =
# \frac{FP+FN}{(TP+FP)+(TP+FN)}$$

# In[26]:

cm.distance(metric=DistanceType.Canberra)

# ## Clement
# Clement similarity [[13]](#ref13).
# $$sim_{Clement} =
# \frac{TP}{TP+FP}\times\Big(1 - \frac{TP+FP}{POP}\Big) +
# \frac{TN}{FN+TN}\times\Big(1 - \frac{FN+TN}{POP}\Big)$$

# In[27]:

cm.distance(metric=DistanceType.Clement)

# ## Consonni & Todeschini I
# Consonni & Todeschini I similarity [[14]](#ref14).
# $$sim_{ConsonniTodeschiniI} =
# \frac{log(1+TP+TN)}{log(1+POP)}$$

# In[28]:

cm.distance(metric=DistanceType.ConsonniTodeschiniI)

# ## Consonni & Todeschini II
# Consonni & Todeschini II similarity [[14]](#ref14).
# $$sim_{ConsonniTodeschiniII} =
# \frac{log(1+POP)-log(1+FP+FN)}{log(1+POP)}$$

# In[29]:

cm.distance(metric=DistanceType.ConsonniTodeschiniII)

# ## Consonni & Todeschini III
# Consonni & Todeschini III similarity [[14]](#ref14).
# $$sim_{ConsonniTodeschiniIII} =
# \frac{log(1+TP)}{log(1+POP)}$$

# In[30]:

cm.distance(metric=DistanceType.ConsonniTodeschiniIII)

# ## Consonni & Todeschini IV
# Consonni & Todeschini IV similarity [[14]](#ref14).
# $$sim_{ConsonniTodeschiniIV} =
# \frac{log(1+TP)}{log(1+TP+FP+FN)}$$

# In[31]:

cm.distance(metric=DistanceType.ConsonniTodeschiniIV)

# ## Consonni & Todeschini V
# Consonni & Todeschini V correlation [[14]](#ref14).
# $$corr_{ConsonniTodeschiniV} =
# \frac{log(1+TP \times TN)-log(1+FP \times FN)}{log(1+\frac{POP^2}{4})}$$

# In[32]:

cm.distance(metric=DistanceType.ConsonniTodeschiniV)

# ## Dennis
# Dennis similarity [[15]](#ref15).
# $$sim_{Dennis} =
# \frac{TP-\frac{(TP+FP)\times(TP+FN)}{POP}}{\sqrt{\frac{(TP+FP)\times(TP+FN)}{POP}}}$$

# In[33]:

cm.distance(metric=DistanceType.Dennis)

# ## Digby
# Digby correlation [[16]](#ref16).
# $$corr_{Digby} =
# \frac{(TP \times TN) ^\frac{3}{4}-(FP \times FN)^\frac{3}{4}}{(TP \times TN)^\frac{3}{4}+(FP \times FN)^\frac{3}{4}}$$

# In[34]:

cm.distance(metric=DistanceType.Digby)

# ## Dispersion
# Dispersion correlation [[17]](#ref17).
# $$corr_{dispersion} =
# \frac{TP \times TN -FP \times FN}{POP^2}
# $$

# In[35]:

cm.distance(metric=DistanceType.Dispersion)

# ## Doolittle
# Doolittle similarity [[18]](#ref18).
# $$sim_{Doolittle} =
# \frac{(TP\times POP - (TP+FP)\times(TP+FN))^2}{(TP+FP)\times(TP+FN)\times(FP+TN)\times(FN+TN)}$$

# In[36]:

cm.distance(metric=DistanceType.Doolittle)

# ## Eyraud
# Eyraud similarity [[19]](#ref19).
# $$sim_{Eyraud} =
# \frac{TP-(TP+FP)\times(TP+FN)}{(TP+FP)\times(TP+FN)\times(FP+TN)\times(FN+TN)}$$

# In[37]:

cm.distance(metric=DistanceType.Eyraud)

# ## Fager & McGowan
# Fager & McGowan similarity [[20]](#ref20) [[21]](#ref21).
# $$sim_{FagerMcGowan} =
# \frac{TP}{\sqrt{(TP+FP)\times(TP+FN)}} - \frac{1}{2\sqrt{max(TP+FP, TP+FN)}}$$

# In[38]:

cm.distance(metric=DistanceType.FagerMcGowan)

# ## Faith
# Faith similarity [[22]](#ref22).
# $$sim_{Faith} =
# \frac{TP+\frac{TN}{2}}{POP}$$

# In[39]:

cm.distance(metric=DistanceType.Faith)

# ## Fleiss-Levin-Paik
# Fleiss-Levin-Paik similarity [[23]](#ref23).
# $$sim_{FleissLevinPaik} =
# \frac{2 \times TN}{2 \times TN + FP + FN}$$

# In[40]:

cm.distance(metric=DistanceType.FleissLevinPaik)

# ## Forbes I
# Forbes I similarity [[24]](#ref24) [[25]](#ref25).
# $$sim_{ForbesI} =
# \frac{POP \times TP}{(TP+FP)\times(TP+FN)}$$

# In[41]:

cm.distance(metric=DistanceType.ForbesI)

# ## Forbes II
# Forbes II correlation [[26]](#ref26).
# $$corr_{ForbesII} =
# \frac{FP \times FN-TP \times TN}{(TP+FP)\times(TP+FN) - POP \times min(TP+FP, TP+FN)}$$

# In[42]:

cm.distance(metric=DistanceType.ForbesII)

# ## Fossum
# Fossum similarity [[27]](#ref27).
# $$sim_{Fossum} =
# \frac{POP \times (TP-\frac{1}{2})^2}{(TP+FP)\times(TP+FN)}$$

# In[43]:

cm.distance(metric=DistanceType.Fossum)

# ## Gilbert & Wells
# Gilbert & Wells similarity [[28]](#ref28).
# $$sim_{GilbertWells} =
# ln \frac{POP^3}{2\pi (TP+FP)\times(TP+FN)\times(FP+TN)\times(FN+TN)} +
# 2ln \frac{POP! \times TP! \times FP! \times FN! \times TN!}{(TP+FP)! \times (TP+FN)! \times (FP+TN)! \times (FN+TN)!}$$

# In[44]:

cm.distance(metric=DistanceType.GilbertWells)

# ## Goodall
# Goodall similarity [[29]](#ref29) [[30]](#ref30).
# $$sim_{Goodall} =\frac{2}{\pi} \sin^{-1}\Big(
# \sqrt{\frac{TP + TN}{POP}}
# \Big)$$

# In[45]:

cm.distance(metric=DistanceType.Goodall)

# ## Goodman & Kruskal's Lambda
# Goodman & Kruskal's Lambda similarity [[31]](#ref31).
# $$sim_{GK_\lambda} =
# \frac{\frac{1}{2}((max(TP,FP)+max(FN,TN)+max(TP,FN)+max(FP,TN))-
# (max(TP+FP,FN+TN)+max(TP+FN,FP+TN)))}
# {POP-\frac{1}{2}(max(TP+FP,FN+TN)+max(TP+FN,FP+TN))}$$

# In[46]:

cm.distance(metric=DistanceType.GoodmanKruskalLambda)

# ## Goodman & Kruskal Lambda-r
# Goodman & Kruskal Lambda-r correlation [[31]](#ref31).
# $$corr_{GK_{\lambda_r}} =
# \frac{TP + TN - \frac{1}{2}(max(TP+FP,FN+TN)+max(TP+FN,FP+TN))}
# {POP - \frac{1}{2}(max(TP+FP,FN+TN)+max(TP+FN,FP+TN))}
# $$

# In[47]:

cm.distance(metric=DistanceType.GoodmanKruskalLambdaR)

# ## Guttman's Lambda A
# Guttman's Lambda A similarity [[32]](#ref32).
# $$sim_{Guttman_{\lambda_a}} =
# \frac{max(TP, FN) + max(FP, TN) - max(TP+FP, FN+TN)}{POP - max(TP+FP, FN+TN)}
# $$

# In[48]:

cm.distance(metric=DistanceType.GuttmanLambdaA)

# ## Guttman's Lambda B
# Guttman's Lambda B similarity [[32]](#ref32).
# $$sim_{Guttman_{\lambda_b}} =
# \frac{max(TP, FP) + max(FN, TN) - max(TP+FN, FP+TN)}{POP - max(TP+FN, FP+TN)}
# $$

# In[49]:

cm.distance(metric=DistanceType.GuttmanLambdaB)

# ## Hamann
# Hamann correlation [[33]](#ref33).
# $$corr_{Hamann} =
# \frac{TP+TN-FP-FN}{POP}
# $$

# In[50]:

cm.distance(metric=DistanceType.Hamann)

# ## Harris & Lahey
# Harris & Lahey similarity [[34]](#ref34).
# $$sim_{HarrisLahey} =
# \frac{TP}{TP+FP+FN} \times \frac{2TN+FP+FN}{2POP}+
# \frac{TN}{TN+FP+FN} \times \frac{2TP+FP+FN}{2POP}
# $$

# In[51]:

cm.distance(metric=DistanceType.HarrisLahey)

# ## Hawkins & Dotson
# Hawkins & Dotson similarity [[35]](#ref35).
# $$sim_{HawkinsDotson} =
# \frac{1}{2} \times \Big(\frac{TP}{TP+FP+FN}+\frac{TN}{FP+FN+TN}\Big)
# $$

# In[52]:

cm.distance(metric=DistanceType.HawkinsDotson)

# ## Kendall's Tau
# Kendall's Tau correlation [[36]](#ref36).
# $$corr_{KendallTau} =
# \frac{2 \times (TP+TN-FP-FN)}{POP \times (POP-1)}
# $$

# In[53]:

cm.distance(metric=DistanceType.KendallTau)

# ## Kent & Foster I
# Kent & Foster I similarity [[37]](#ref37).
# $$sim_{KentFosterI} =
# \frac{TP-\frac{(TP+FP)\times(TP+FN)}{TP+FP+FN}}{TP-\frac{(TP+FP)\times(TP+FN)}{TP+FP+FN}+FP+FN}
# $$

# In[54]:

cm.distance(metric=DistanceType.KentFosterI)

# ## Kent & Foster II
# Kent & Foster II similarity [[37]](#ref37).
# $$sim_{KentFosterII} =
# \frac{TN-\frac{(FP+TN)\times(FN+TN)}{FP+FN+TN}}{TN-\frac{(FP+TN)\times(FN+TN)}{FP+FN+TN}+FP+FN}
# $$

# In[55]:

cm.distance(metric=DistanceType.KentFosterII)

# ## References
1- C. C. Little, "Abydos Documentation," 2018.
# #
2- V. Dallmeier, C. Lindig, and A. Zeller, "Lightweight defect localization for Java," in European conference on object-oriented programming, 2005: Springer, pp. 528-550.
# #
3- R. Abreu, P. Zoeteweij, and A. J. Van Gemund, "An evaluation of similarity coefficients for software fault localization," in 2006 12th Pacific Rim International Symposium on Dependable Computing (PRDC'06), 2006: IEEE, pp. 39-46.
# #
4- M. R. Anderberg, Cluster analysis for applications: probability and mathematical statistics: a series of monographs and textbooks. Academic press, 2014.
# #
5- A. M. Andrés and P. F. Marzo, "Delta: A new measure of agreement between two raters," British journal of mathematical and statistical psychology, vol. 57, no. 1, pp. 1-19, 2004.
# #
6- C. Baroni-Urbani and M. W. Buser, "Similarity of binary data," Systematic Zoology, vol. 25, no. 3, pp. 251-259, 1976.
# #
7- V. Batagelj and M. Bren, "Comparing resemblance measures," Journal of classification, vol. 12, no. 1, pp. 73-90, 1995.
# #
8- F. B. Baulieu, "A classification of presence/absence based dissimilarity coefficients," Journal of Classification, vol. 6, no. 1, pp. 233-246, 1989.
# #
9- F. B. Baulieu, "Two variant axiom systems for presence/absence based dissimilarity coefficients," Journal of Classification, vol. 14, no. 1, pp. 159-170, 1997.
# #
10- R. Benini, Principii di demografia. Barbera, 1901.
# #
11- G. N. Lance and W. T. Williams, "Computer programs for hierarchical polythetic classification (“similarity analyses”)," The Computer Journal, vol. 9, no. 1, pp. 60-64, 1966.
# #
12- G. N. Lance and W. T. Williams, "Mixed-Data Classificatory Programs I - Agglomerative Systems," Australian Computer Journal, vol. 1, no. 1, pp. 15-20, 1967.
# #
13- P. W. Clement, "A formula for computing inter-observer agreement," Psychological Reports, vol. 39, no. 1, pp. 257-258, 1976.
# #
14- V. Consonni and R. Todeschini, "New similarity coefficients for binary data," Match-Communications in Mathematical and Computer Chemistry, vol. 68, no. 2, p. 581, 2012.
# #
15- S. F. Dennis, "The Construction of a Thesaurus Automatically From," in Statistical Association Methods for Mechanized Documentation: Symposium Proceedings, 1965, vol. 269: US Government Printing Office, p. 61.
# #
16- P. G. Digby, "Approximating the tetrachoric correlation coefficient," Biometrics, pp. 753-757, 1983.
# #
17- IBM Corp, "IBM SPSS Statistics Algorithms," ed: IBM Corp Armonk, NY, USA, 2017.
# #
18- M. H. Doolittle, "The verification of predictions," Bulletin of the Philosophical Society of Washington, vol. 7, pp. 122-127, 1885.
# #
19- H. Eyraud, "Les principes de la mesure des correlations," Ann. Univ. Lyon, III. Ser., Sect. A, vol. 1, no. 30-47, p. 111, 1936.
# #
20- E. W. Fager, "Determination and analysis of recurrent groups," Ecology, vol. 38, no. 4, pp. 586-595, 1957.
# #
21- E. W. Fager and J. A. McGowan, "Zooplankton Species Groups in the North Pacific: Co-occurrences of species can be used to derive groups whose members react similarly to water-mass types," Science, vol. 140, no. 3566, pp. 453-460, 1963.
# #
22- D. P. Faith, "Asymmetric binary similarity measures," Oecologia, vol. 57, pp. 287-290, 1983.
# #
23- J. L. Fleiss, B. Levin, and M. C. Paik, Statistical methods for rates and proportions. John Wiley & Sons, 2013.
# #
24- S. A. Forbes, On the local distribution of certain Illinois fishes: an essay in statistical ecology. Illinois State Laboratory of Natural History, 1907.
# #
25- A. Mozley, "The statistical analysis of the distribution of pond molluscs in western Canada," The American Naturalist, vol. 70, no. 728, pp. 237-244, 1936.
# #
26- S. A. Forbes, "Method of determining and measuring the associative relations of species," Science, vol. 61, no. 1585, pp. 518-524, 1925.
# #
27- E. G. Fossum and G. Kaskey, "Optimization and standardization of information retrieval language and systems," SPERRY RAND CORP PHILADELPHIA PA UNIVAC DIV, 1966.
# #
28- N. Gilbert and T. C. Wells, "Analysis of quadrat data," The Journal of Ecology, pp. 675-685, 1966.
# #
29- D. W. Goodall, "The distribution of the matching coefficient," Biometrics, pp. 647-656, 1967.
# #
30- B. Austin and R. R. Colwell, "Evaluation of some coefficients for use in numerical taxonomy of microorganisms," International Journal of Systematic and Evolutionary Microbiology, vol. 27, no. 3, pp. 204-210, 1977.
# #
31- L. A. Goodman and W. H. Kruskal, Measures of association for cross classifications. Springer, 1979.
# #
32- L. Guttman, "An outline of the statistical theory of prediction," The prediction of personal adjustment, vol. 48, pp. 253-318, 1941.
# #
33- U. Hamann, "Merkmalsbestand und verwandtschaftsbeziehungen der farinosae: ein beitrag zum system der monokotyledonen," Willdenowia, pp. 639-768, 1961.
# #
34- F. C. Harris and B. B. Lahey, "A method for combining occurrence and nonoccurrence interobserver agreement scores," Journal of Applied Behavior Analysis, vol. 11, no. 4, pp. 523-527, 1978.
# #
35- R. P. Hawkins and V. A. Dotson, "Reliability Scores That Delude: An Alice in Wonderland Trip Through the Misleading Characteristics of Inter-Observer Agreement Scores in Interval Recording," 1973.
# #
36- M. G. Kendall, "A new measure of rank correlation," Biometrika, vol. 30, no. 1/2, pp. 81-93, 1938.
# #
37- R. N. Kent and S. L. Foster, "Direct observational procedures: Methodological issues in naturalistic settings," Handbook of behavioral assessment, pp. 279-328, 1977.
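
The formulas in this document are all written in terms of TP, FP, FN, TN and POP, so a few of them can be cross-checked by hand against the example matrix. The sketch below is not PyCM's implementation: it assumes the matrix is read as {actual: {predicted: count}} and that each class is reduced to its own one-vs-rest 2x2 table. The helper names `one_vs_rest_counts`, `ample` and `hamann` are hypothetical and exist only for this illustration.

```python
# Example matrix from this document, read as {actual: {predicted: count}}
# (the orientation is an assumption of this sketch).
MATRIX = {0: {0: 3, 1: 0, 2: 0}, 1: {0: 0, 1: 1, 2: 2}, 2: {0: 2, 1: 1, 2: 3}}


def one_vs_rest_counts(matrix):
    """Return {class: (TP, FP, FN, TN)} from a one-vs-rest reduction."""
    classes = list(matrix)
    pop = sum(sum(row.values()) for row in matrix.values())
    counts = {}
    for c in classes:
        tp = matrix[c][c]
        fn = sum(matrix[c].values()) - tp             # rest of the actual row
        fp = sum(matrix[a][c] for a in classes) - tp  # rest of the predicted column
        tn = pop - tp - fn - fp
        counts[c] = (tp, fp, fn, tn)
    return counts


def ample(tp, fp, fn, tn):
    """AMPLE similarity as written above; no guard against zero denominators."""
    return abs(tp / (tp + fp) - fn / (fn + tn))


def hamann(tp, fp, fn, tn):
    """Hamann correlation as written above."""
    return (tp + tn - fp - fn) / (tp + fp + fn + tn)


counts = one_vs_rest_counts(MATRIX)
for label, quad in counts.items():
    print(label, quad, ample(*quad), hamann(*quad))
```

For class 1 this gives TP=1, FP=1, FN=2, TN=8, so AMPLE is |1/2 - 2/10| = 0.3 and Hamann is (1+8-1-2)/12 = 0.5. The same four counts can be plugged into any other formula in this document.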