One of the most fundamental problems in the current era is being able to find and connect similar information. I challenge you to think of one application on the Web that would not benefit from being able to determine similarities between bits of information. Detection of duplicate documents is becoming increasingly important for businesses and data scientist. Near-duplicate pages found on the Web could be plagiarisms, or they could be mirrors with identical information but released on different outlets.
In this article we'll explore a range of techniques and algorithms to determine the similarity between items. We will also explain the how and the why, and have a look at some use cases.
How can we define similarity?
We can define similarity as a measure of how much 2 sets, collections or document have in common when we compare them. The more resemblance the 2 documents have to each other, the higher their similarity. To make this definition resemble a concrete value, we will introduce the Jaccard similarity.
The notion of how similar sets are is called "Jaccard similarity". The Jaccard similarity of two sets, A and B, can be calculated by the ratio of their intersection size divided by their union size. We shall denote the Jaccard similarity of A and B by:
J(A,B) = |A ∩ B| / |A ∪ B|
Using the above formula, we are able to calculate the exact similarity of 2 sets. To make this easy method transferable to other sources of information, we need techniques to represent these sources of data as sets. For example:
J(A,B) = |A ∩ B| / |A ∪ B| = |2| / |6 + 4 -2| = 2 / 8 = 0.25
The Jaccard similarity is a measure of how close sets are, albeit not really a distance measure. To calculate the distance between two sets utilising the Jaccard similarity calculation, we simply take 1 minus the calculated Jaccard similarity. This is called the Jaccard distance, the closer two sets, the higher the Jaccard similarity and thus the lower the Jaccard distance.
dJ (A,B) = 1 - J(A,B)
How to shingle documents
The best performing way to represent a document as a set, for the purpose of finding lexical matches, is to construct a set containing short strings of text that appear within the document. By doing so, same pieces of text will always be identified regardless of their position.
A common shingling technique to represent documents as sets is k-shingling. As a document can simply be seen as a string of characters, we can define a k-shingle of this document as all possible consecutive k length substrings existing in the document. Here you can see an example of a k=2 shingle:
Original: "Thank you for reading this blog" k=2 shingled: "Thank you" "you for" "for reading" "reading this" "this blog"
Choosing the best shingle size can be a daunting task, if you take your shingle size to large, you won't be able to find similarities between documents, only if they are a near perfect match. If you take your shingle size to small, as an extreme example k = 1, you will have a bias because the top 1000 most frequently used words represent around 85% of all word usage. All common used characters will be shared between most of the documents you will want to compare.
To avoid a bias towards too little or too much abstraction, you should pick your k size large enough that the probability of any given shingle appearing in any given document is low.
You could also decide to shingle on characters in favor of words, this way you could have a representation that is less defined by words but more by the order of words following each other.
Instead of using using string representations as shingles directly, you should favor the use of a hash function to determine a hash representing the shingle. This can lead to memory-usage reduction and faster comparison speeds. To select the best string hashing function, please have a look at this overview by Ozan Yigit.
Detecting plagiarised content relies on our ability to find textual similarities between content. Plagiarized documents may only contain some parts of the original piece mixed in and scrambled up between his own content, or content from even more sources. Determining whether documents are plagiarized can be a daunting task because a character by character comparison won't always yield correct results.
Same Source Articles
When using news aggregators, such as Google News, you are able to find multiple sources and news outlets reporting about the same event. Aggregators are able to cluster articles about the same topic using similarity scores. Reporter also publish articles through the Associated Press, which can then be redistributed by other websites.
- Jaccard Similarity: The Jaccard similarity coefficient is a statistic used to measure the similarity between sets. It is calculated by the ratio of the size of the intersection to the size of the union.
- Jaccard Distance: The Jaccard distance is a distance measure between sets, calculated as one minus the Jaccard Similarity
- Shingling: A way of representing text documents as a set of strings. By representing a document by its set of k-shingles, we form a set of k long characters that consecutively appear in the document. By calculating the Jaccard similarity of shingled sets, we are able to measure the textual similarity of documents.