
Computing text similarity with Python

1. difflib

Python 3.4.3 (default, Mar 26 2015, 22:03:40) 
[GCC 4.9.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import difflib
>>> difflib.SequenceMatcher(None, 'abcd', 'abc').ratio()
0.8571428571428571
>>> difflib.SequenceMatcher(None, 'abcd', 'abcd').ratio()
1.0
>>> difflib.SequenceMatcher(None, 'abcd', 'abdc').ratio()
0.75
>>> difflib.SequenceMatcher(None, 'abcd', 'abdcfasd').ratio()
0.6666666666666666
>>> difflib.SequenceMatcher(None, 'abcd', 'abdcfasdfasdf').ratio()
0.47058823529411764
>>> 
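
The numbers in this session can be reproduced by hand. The difflib documentation defines ratio() as 2.0*M/T, where M is the number of matched elements and T is the combined length of both sequences. The sketch below recomputes that from get_matching_blocks(); the helper name ratio_from_blocks is made up for illustration and is not part of difflib.

```python
import difflib


def ratio_from_blocks(a, b):
    """Recompute SequenceMatcher.ratio() as 2 * M / T (illustrative helper)."""
    matcher = difflib.SequenceMatcher(None, a, b)
    # get_matching_blocks() ends with a zero-length sentinel block,
    # which adds nothing to the sum.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # difflib defines the ratio of two empty sequences as 1.0.
    return 2.0 * matches / total if total else 1.0


for other in ('abc', 'abcd', 'abdc', 'abdcfasd', 'abdcfasdfasdf'):
    print(other, ratio_from_blocks('abcd', other))
```

For 'abcd' against 'abdc', for example, the matched blocks are 'ab' and 'c', so M = 3, T = 8, and the ratio is 0.75. The match is order-sensitive, which is why a transposition lowers the score.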

2. Levenshtein

>>> from Levenshtein import ratio
>>> ratio('abcd', 'abc')
0.8571428571428571
>>> ratio('abcd', 'abcd')
1.0
>>> from Levenshtein import distance
>>> distance('abcd', 'abc')
1
>>> distance('abcd', 'abcd')
0
>>> distance('abcd', 'abdc')
2
>>> distance('abcd', 'abdcfasd')
4
>>> distance('abcd', 'abdcfasdfasdf')
9