- isub(+Text1:text, +Text2:text, -Similarity:float, +Options:list) is det
- Similarity is a measure of the similarity/dissimilarity between
Text1 and Text2. E.g.
    ?- isub('E56.Language', 'languange', D, [normalize(true)]).
    D = 0.4226950354609929.                       % [-1,1] range

    ?- isub('E56.Language', 'languange', D, [normalize(true),zero_to_one(true)]).
    D = 0.7113475177304964.                       % [0,1] range

    ?- isub('E56.Language', 'languange', D, []).  % without normalization
    D = 0.19047619047619047.                      % [-1,1] range

    ?- isub(aa, aa, D, []).                       % does not work for short substrings
    D = -0.8.

    ?- isub(aa, aa, D, [substring_threshold(0)]). % works with short substrings
    D = 1.0.                                      % but may give unwanted values
                                                  % between e.g. 'store' and 'spore'.

    ?- isub(joe, hoe, D, [substring_threshold(0)]).
    D = 0.5315315315315314.

    ?- isub(joe, hoe, D, []).
    D = -1.0.
This is a new version of isub/4 which replaces the old version while providing backwards compatibility. This new version allows several options to tweak the algorithm.
- Arguments:
- Text1 and Text2
- are either an atom, a string, or a list of characters or character codes.
- Similarity
- is a float in the range [-1,1], where 1.0 means most similar. The range can be set to [0,1] with the zero_to_one option described below.
- Options
- is a list with elements described below. Please note that the options are processed at compile time using goal_expansion to provide much better speed. Supported options are:
- normalize(+Boolean)
- Applies string normalization as implemented by the original authors: Text1 and Text2 are mapped to lowercase and the characters "._ " are removed. Lowercase mapping is done with the C-library function towlower(). In general, the required normalization is domain dependent and is better left to the caller. See e.g. unaccent_atom/2. The default is to skip normalization (false).
- zero_to_one(+Boolean)
- The old isub implementation deviated from the original algorithm by returning a value in the [0,1] range. This new isub/4 implementation defaults to the original range of [-1,1], but this option can be set to true to change the output range to [0,1].
- substring_threshold(+Nonneg)
- The original algorithm was meant to compare terms in semantic web ontologies, and it had a hard-coded parameter so that only common substrings longer than 2 characters were considered. As a result, the similarity between, for example, 'aa' and 'aa' is -0.8, which is not what one would expect. This option allows the user to set any threshold, such as 0, so that the similarity between short substrings can be properly recognized. The default value is 2, which is what the original algorithm used.
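
As a rough sketch of how a caller might do its own normalization and range conversion instead of relying on normalize(true) and zero_to_one(true), the snippet below wraps isub/4 in two helpers. The names normalized_text/2 and similarity_01/3 are hypothetical and not part of library(isub). The sketch assumes that the [0,1] range corresponds to the linear mapping (S+1)/2, which agrees with the two normalized example values shown earlier (0.4226950354609929 and 0.7113475177304964) but is not stated by this documentation. The caller-side normalization uses the built-in downcase_atom/2 rather than towlower(), so its results need not match normalize(true) exactly.

```prolog
:- use_module(library(isub)).

% normalized_text(+Text, -Normalized)
% Hypothetical caller-side normalization: lowercase Text and drop the
% characters '.', '_' and ' '. This only approximates normalize(true).
normalized_text(Text, Normalized) :-
    downcase_atom(Text, Lower),
    atom_chars(Lower, Chars),
    exclude([C]>>memberchk(C, ['.', '_', ' ']), Chars, Kept),
    atom_chars(Normalized, Kept).

% similarity_01(+Text1, +Text2, -Similarity)
% Hypothetical wrapper: compare the caller-normalized texts with the
% default options and rescale the [-1,1] result to [0,1], assuming
% the mapping (S+1)/2.
similarity_01(Text1, Text2, Similarity) :-
    normalized_text(Text1, N1),
    normalized_text(Text2, N2),
    isub(N1, N2, S, []),
    Similarity is (S + 1) / 2.
```

A query such as `?- similarity_01('E56.Language', 'languange', S).` would then bind S to a float between 0 and 1. Swapping in a different normalized_text/2 (for example one that also strips accents) is the kind of domain-dependent choice the normalize option leaves to the caller.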