orsinium / textdistance
- Wednesday, March 28, 2018, 00:16:38
Python
Compute distance between sequences: 30+ algorithms, pure Python implementation, common interface.
TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms.
Features:
Edit based:

| Algorithm | Class | Functions |
|---|---|---|
| Hamming | Hamming | hamming |
| MLIPNS | Mlipns | mlipns |
| Levenshtein | Levenshtein | levenshtein |
| Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
| Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
| Strcmp95 | StrCmp95 | strcmp95 |
| Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
| Gotoh | Gotoh | gotoh |
| Smith-Waterman | SmithWaterman | smith_waterman |
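To illustrate what the edit-based family computes, here is a minimal, self-contained sketch of the classic Levenshtein dynamic-programming algorithm (independent of the library's own implementation):

```python
def levenshtein(a, b):
    # prev holds distances from every prefix of b to the current prefix of a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion from a
                cur[j - 1] + 1,            # insertion into a
                prev[j - 1] + (ca != cb),  # substitution (free on match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein('test', 'text'))  # 1 (one substitution: s -> x)
```

The two-row formulation keeps memory at O(len(b)) instead of the full O(len(a) * len(b)) matrix.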
Token based:

| Algorithm | Class | Functions |
|---|---|---|
| Jaccard index | Jaccard | jaccard |
| Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
| Tversky index | Tversky | tversky |
| Overlap coefficient | Overlap | overlap |
| Tanimoto distance | Tanimoto | tanimoto |
| Cosine similarity | Cosine | cosine |
| Monge-Elkan | MongeElkan | monge_elkan |
| Bag distance | Bag | bag |
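The token-based family works over sets (or multisets) of tokens rather than edit operations. A minimal sketch of the Jaccard index over character sets (not the library's implementation):

```python
def jaccard(a, b):
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty sequences are identical by convention
    # |intersection| / |union|
    return len(sa & sb) / len(sa | sb)

print(jaccard('test', 'text'))  # 0.5: {'e','t'} shared out of {'e','s','t','x'}
```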
Sequence based:

| Algorithm | Class | Functions |
|---|---|---|
| longest common subsequence similarity | LCSSeq | lcsseq |
| longest common substring similarity | LCSStr | lcsstr |
| Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |
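A sketch of the quantity behind lcsseq, the length of the longest common subsequence (the library's own implementation is assumed to differ in details and also reports the subsequence itself):

```python
from functools import lru_cache

def lcs_length(a, b):
    @lru_cache(maxsize=None)
    def rec(i, j):
        # longest common subsequence of a[i:] and b[j:]
        if i == len(a) or j == len(b):
            return 0
        if a[i] == b[j]:
            return 1 + rec(i + 1, j + 1)
        return max(rec(i + 1, j), rec(i, j + 1))
    return rec(0, 0)

print(lcs_length('test', 'text'))  # 3 ('tet')
```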
Compression based:

Work in progress. For now, these algorithms compare the two strings as arrays of bits, not as characters.

NCD -- normalized compression distance.

Functions:

- bz2_ncd
- lzma_ncd
- arith_ncd
- rle_ncd
- bwtrle_ncd
- zlib_ncd

Phonetic:

| Algorithm | Class | Functions |
|---|---|---|
| MRA | MRA | mra |
| Editex | Editex | editex |
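The compression-based NCD functions listed above all follow the same idea; a sketch using zlib, based on the standard NCD formula NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(s) is the compressed size of s (assumed here to match what the library's zlib_ncd computes):

```python
import zlib

def zlib_ncd(x, y):
    def c(s):
        # compressed size of s in bytes
        return len(zlib.compress(s.encode()))
    cx, cy, cxy = c(x), c(y), c(x + y)
    # similar strings compress well together, giving a small numerator
    return (cxy - min(cx, cy)) / max(cx, cy)

a, b = 'test' * 50, 'qwerty' * 50
print(zlib_ncd(a, a), zlib_ncd(a, b))
```

Note that for very short strings the compressor's fixed overhead dominates, so NCD is most meaningful on longer inputs.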
Simple:

| Algorithm | Class | Functions |
|---|---|---|
| Prefix similarity | Prefix | prefix |
| Postfix similarity | Postfix | postfix |
| Length distance | Length | length |
| Identity similarity | Identity | identity |
| Matrix similarity | Matrix | matrix |
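The simple family is mostly what the names suggest; for instance, prefix similarity boils down to the length of the shared prefix. A minimal sketch (an assumption about the underlying quantity, not the library's code):

```python
import os

def common_prefix_len(a, b):
    # os.path.commonprefix does plain character-wise prefix matching,
    # which is exactly what we want here despite the module name
    return len(os.path.commonprefix([a, b]))

print(common_prefix_len('test', 'text'))  # 2 ('te')
```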
Stable:

pip install textdistance

Dev:

pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance

All algorithms have 2 interfaces: a class with algorithm-specific parameters for customizing, and a class instance with default parameters for quick and simple usage.
All algorithms have some common methods:

- .distance(*sequences) -- calculate the distance between sequences.
- .similarity(*sequences) -- calculate the similarity for sequences.
- .maximum(*sequences) -- the maximum possible value for distance and similarity. For any sequences: distance + similarity == maximum.
- .normalized_distance(*sequences) -- normalized distance between sequences. Returns a float between 0 and 1, where 0 means equal and 1 means totally different.
- .normalized_similarity(*sequences) -- normalized similarity for sequences. Returns a float between 0 and 1, where 0 means totally different and 1 means equal.

Most common init arguments:
- qval -- q-value for splitting sequences into q-grams. Possible values: 1 (default) -- compare sequences by characters; 2 or more -- compare sequences by q-grams; None -- compare sequences by words.
- as_set -- for token-based algorithms: True -- t and ttt are equal; False (default) -- t and ttt are different.

For example, Hamming distance:
import textdistance
textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
All other algorithms have the same interface.
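Why Hamming(qval=2) above returns 2: with qval=2 each string is first split into overlapping bigrams, and the Hamming comparison runs over those bigram sequences. A sketch of the idea (not the library's code):

```python
def qgrams(s, q=2):
    # overlapping q-grams: 'test' -> ['te', 'es', 'st']
    return [s[i:i + q] for i in range(len(s) - q + 1)]

a, b = qgrams('test'), qgrams('text')
print(a)  # ['te', 'es', 'st']
print(b)  # ['te', 'ex', 'xt']
print(sum(x != y for x, y in zip(a, b)))  # 2 positions differ
```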