# Effective similarity search in PostgreSQL

Hello my dear friends. In “PostgreSQL most useful extensions” I showed a list of some useful extensions for PostgreSQL. Of course, that article didn’t cover all useful extensions (in my humble opinion) and some extensions I want to describe in separate articles. Today we will talk about effective similarity search in PostgreSQL.Similarity search in large databases is an important issue in nowadays informational services, such as recommender systems, blogs (similar articles), online shops (similar products), image hosting services (similar images, search image duplicates) etc. PostgreSQL allows to make such things more easy. First you need to understand how we will calculate the similarity of two objects.

# Similarity

Any object can be described by a list of characteristics. For example, article in blog can be described by tags, product in online shop – by size, weight, color etc. It means that for each object we can create the digital signature – array of numbers, which describing the object (fingerprint, n-grams). Let’s create an array of not ordered numbers for every object. What should we do next?

# Similarity calculation

There are several methods for calculating the similarity of objects by signatures. First of all, the legend for calculations:Na, Nb – the number of unique elements in the arraysNu – the number of unique elements in the union of setsNi – the number of unique elements in the intersection of arraysOne of the simplest calculation of the similarity of two objects is the number of unique elements in the intersection of arrays divided by the number of unique elements in two arrays:$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{(N_{a} + N_{b})}$or only$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{N_{u}}$Pros:
• Easy to understand
• Speed of calculation: N * log(N)
• Works well at similar and large Na and Nb
Similarity can also be calculated using the formula of cosines:$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{sqrt{N_{a} * N_{b}}}$Pros:
• Speed of calculation: N * log(N)
• Works well for large N
But both of these metrics have common problems:
• Few elements -> large scatter of similarity
• Frequent elements -> weight below
TF/IDF metric avoids these problems to calculate the similarity:$LARGE dpi{120} fn_phv S(A,B) = frac{sum_{i < N_{a}, j < N_{b}, A_{i} = B_{j}}TF_{i} * TF_{j}}{sqrt{sum_{i < N_{a}}TF_{i}^{2} * sum_{j < N_{b}}TF_{j}^{2}}}$where$LARGE dpi{120} fn_phv IDF_{element} = log(frac{N_{objects}}{N_{objects with element}} + 1)$$LARGE dpi{120} fn_phv TF_{element} = IDF_{element} * N_{occurrences}$Now it would be great to use this knowledge in practice.

# Finding duplicate of images

For example, I select the search for duplicate images. Other options for shopping, blogs are implemented in a similar way. The algorithm helps to find similar images that are slightly different: desaturated images, add watermark, passed through the filters.In our algorithm, we will create a pixel matrix of each image. Let it be 15×15 pixels. The next step: we do not know the color of a pixel, but its intensity (the intensity is given by 0.299 * red + 0,587 * green + 0,114 * blue). Calculating the intensity will help us find the image is not paying attention to the colors of the images.Once you have to calculate the intensity of all pixels in a 15×15 matrix, we find the ratio of the intensity of each pixel to the mean intensity of the matrix, and generate a unique number for each cell (in the code to generate unique for each cell, I added the coordinates to intensity) and obtain an array of 225 elements length (15 * 15 = 225). Excellent.Below is the code to generate a digital signature for images on Ruby and PHP languages:Ruby PHP
 61,33% similarity
 23,56% similarity
 87,55% similarity
Excellent, but it is comparing two images. Now let’s use PostgreSQL for searching. In PostgreSQL is an array type of the field. We will write the digital signature of the image:It works, but search works without index. Let’s create it:orDifference between GiST and GIN indexes you can read here.Perfect! Let’s check performance.Let’s add sorting in SQL (the most similar images were the first), and add similarity as extra field:Added sorting did not complicate query execution.Here you can see all results in raw format.

# Conclusion

PostgreSQL extension smlar can be used in systems where we need search similar objects, like texts, images, videos.That’s all folks! Thank you for reading till the end.This is [re-post](http://leopard.in.ua/2012/06/01/effective-similarity-search-in-postgresql/) from my blog.