orsinium / textdistance
- Wednesday, March 28, 2018, 00:16:38
Python
Compute distance between sequences. 30+ algorithms, pure Python implementation, common interface.
TextDistance is a Python library for computing the distance between two or more sequences using many algorithms.
Features:
Edit based:

Algorithm | Class | Functions |
---|---|---|
Hamming | Hamming | hamming |
MLIPNS | Mlipns | mlipns |
Levenshtein | Levenshtein | levenshtein |
Damerau-Levenshtein | DamerauLevenshtein | damerau_levenshtein |
Jaro-Winkler | JaroWinkler | jaro_winkler, jaro |
Strcmp95 | StrCmp95 | strcmp95 |
Needleman-Wunsch | NeedlemanWunsch | needleman_wunsch |
Gotoh | Gotoh | gotoh |
Smith-Waterman | SmithWaterman | smith_waterman |
Token based:

Algorithm | Class | Functions |
---|---|---|
Jaccard index | Jaccard | jaccard |
Sørensen–Dice coefficient | Sorensen | sorensen, sorensen_dice, dice |
Tversky index | Tversky | tversky |
Overlap coefficient | Overlap | overlap |
Tanimoto distance | Tanimoto | tanimoto |
Cosine similarity | Cosine | cosine |
Monge-Elkan | MongeElkan | monge_elkan |
Bag distance | Bag | bag |
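To illustrate how the token-based family works on q-grams, here is a minimal Jaccard index sketch (an illustration under the assumption of qval=2, i.e. bigrams, not the library's actual implementation):

```python
def qgrams(s, q=2):
    # split a string into its set of overlapping q-grams
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a, b, q=2):
    # Jaccard index: |A & B| / |A | B| over the q-gram sets
    ga, gb = qgrams(a, q), qgrams(b, q)
    if not ga and not gb:
        return 1.0  # two empty sequences are identical
    return len(ga & gb) / len(ga | gb)

print(jaccard('test', 'text'))  # bigrams share only 'te': 1/5 = 0.2
```

Because it compares sets of tokens, order within the sequence matters only as far as it changes which q-grams appear.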
Sequence based:

Algorithm | Class | Functions |
---|---|---|
Longest common subsequence similarity | LCSSeq | lcsseq |
Longest common substring similarity | LCSStr | lcsstr |
Ratcliff-Obershelp similarity | RatcliffObershelp | ratcliff_obershelp |
Work in progress: for now, these algorithms compare two strings as arrays of bits, not by chars.
Compression based:

NCD -- normalized compression distance. Functions:
- bz2_ncd
- lzma_ncd
- arith_ncd
- rle_ncd
- bwtrle_ncd
- zlib_ncd
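The idea behind these functions can be sketched with the standard zlib module. This is an illustrative implementation of the general NCD formula, not the library's code (the arith/rle/bwtrle variants use different compressors):

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    # NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y))
    # where C(s) is the compressed size of s
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)
```

Intuitively, if y is similar to x, compressing x + y costs little more than compressing x alone, so the distance is close to 0.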
Phonetic:

Algorithm | Class | Functions |
---|---|---|
MRA | MRA | mra |
Editex | Editex | editex |
Simple:

Algorithm | Class | Functions |
---|---|---|
Prefix similarity | Prefix | prefix |
Postfix similarity | Postfix | postfix |
Length distance | Length | length |
Identity similarity | Identity | identity |
Matrix similarity | Matrix | matrix |
Stable:
pip install textdistance
Dev:
pip install -e git+https://github.com/orsinium/textdistance.git#egg=textdistance
All algorithms have 2 interfaces:
- a class with algorithm-specific parameters for customizing;
- a class instance with default parameters for quick and simple usage.

All algorithms have some common methods:
- .distance(*sequences) -- calculate distance between sequences.
- .similarity(*sequences) -- calculate similarity for sequences.
- .maximum(*sequences) -- maximum possible value for distance and similarity. For any sequences: distance + similarity == maximum.
- .normalized_distance(*sequences) -- normalized distance between sequences. The return value is a float between 0 and 1, where 0 means equal and 1 means totally different.
- .normalized_similarity(*sequences) -- normalized similarity for sequences. The return value is a float between 0 and 1, where 0 means totally different and 1 means equal.

Most common init arguments:
- qval -- q-value for splitting sequences into q-grams. Possible values: 1 (default) -- compare sequences by chars; 2 or more -- compare by q-grams; None -- split sequences by words.
- as_set -- for token-based algorithms: if True, t and ttt are equal; if False (default), t and ttt are different.

For example, Hamming distance:
```python
import textdistance

textdistance.hamming('test', 'text')
# 1
textdistance.hamming.distance('test', 'text')
# 1
textdistance.hamming.similarity('test', 'text')
# 3
textdistance.hamming.normalized_distance('test', 'text')
# 0.25
textdistance.hamming.normalized_similarity('test', 'text')
# 0.75
textdistance.Hamming(qval=2).distance('test', 'text')
# 2
```
All other algorithms have the same interface.
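To make the distance/similarity/maximum relationship concrete, here is a minimal pure-Python Hamming sketch. This is an illustration, not the library's internals; it assumes maximum is the length of the longer sequence, which matches the numbers in the example above:

```python
def hamming_distance(a, b):
    # count positions where the sequences differ; the longer
    # sequence's unmatched tail counts as all-different
    n = max(len(a), len(b))
    return sum(
        1 for i in range(n)
        if i >= len(a) or i >= len(b) or a[i] != b[i]
    )

def hamming_maximum(a, b):
    # maximum possible value for both distance and similarity
    return max(len(a), len(b))

def hamming_similarity(a, b):
    # invariant: distance + similarity == maximum
    return hamming_maximum(a, b) - hamming_distance(a, b)

def hamming_normalized_distance(a, b):
    m = hamming_maximum(a, b)
    return hamming_distance(a, b) / m if m else 0.0
```

With 'test' vs 'text', only one position differs, so distance is 1, similarity is 3, and the normalized distance is 1/4 = 0.25, in line with the output shown above.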