Effective similarity search in PostgreSQL

Effective similarity search in PostgreSQL

Hello my dear friends. In “PostgreSQL most useful extensions” I showed a list of some useful extensions for PostgreSQL. Of course, that article didn’t cover all useful extensions (in my humble opinion) and some extensions I want to describe in separate articles. Today we will talk about effective similarity search in PostgreSQL.Similarity search in large databases is an important issue in nowadays informational services, such as recommender systems, blogs (similar articles), online shops (similar products), image hosting services (similar images, search image duplicates) etc. PostgreSQL allows to make such things more easy. First you need to understand how we will calculate the similarity of two objects.

Similarity

Any object can be described by a list of characteristics. For example, article in blog can be described by tags, product in online shop – by size, weight, color etc. It means that for each object we can create the digital signature – array of numbers, which describing the object (fingerprint, n-grams). Let’s create an array of not ordered numbers for every object. What should we do next?

Similarity calculation

There are several methods for calculating the similarity of objects by signatures. First of all, the legend for calculations:Na, Nb – the number of unique elements in the arraysNu – the number of unique elements in the union of setsNi – the number of unique elements in the intersection of arraysOne of the simplest calculation of the similarity of two objects is the number of unique elements in the intersection of arrays divided by the number of unique elements in two arrays:simplestor onlysimplestPros:
  • Easy to understand
  • Speed of calculation: N * log(N)
  • Works well at similar and large Na and Nb
Similarity can also be calculated using the formula of cosines:simplestPros:
  • Speed of calculation: N * log(N)
  • Works well for large N
But both of these metrics have common problems:
  • Few elements -> large scatter of similarity
  • Frequent elements -> weight below
TF/IDF metric avoids these problems to calculate the similarity:simplestwheresimplestsimplestNow it would be great to use this knowledge in practice.

Smlar extension for PostgreSQL

Oleg Bartunov and Teodor Sigaev developed PostgreSQL extension, called smlar, which provides several methods to calculate sets similarity (all built-in data types supported), similarity operator with indexing support on the base of GiST and GIN frameworks. Let’s look how to work with this extension. First of all, we need install it (PostgreSQL should be already installed):On PostgreSQL 9.2 this extension should build without problem, for PostgreSQL 9.1 and earlier you need to make a little fix. In file “smlar_guc.c” on line 214 change callto this (delete last argument)Let’s test instaled extension:If you have the same output in console, you’ve installed extension successful. More information about this extension you can read in README file.Function smlar computes similary of two arrays (arrays should be the same type) and return float from 0 to 1 (0 – absolutely no similar objects, 1 – absolutely similar arrays, equal). Function can take the third argument – the formula, that calculates the similarity of the two arrays. Module provides several GUC variables and it’s highly recommended to add to postgesql.conf:Array’s with similarity lower than ‘smlar.threshold’ are not similar by % operation.GiST/GIN support for % operation. The parameter “similar.type” allows you to specify what kind of formula used to calculate the similarity: cosine (default), overlap or tfidf. For “tfidf” need to make additional configuration, but I will not consider this in the article (all can be found in the README file). Now let’s consider an example of using this extension.

Finding duplicate of images

For example, I select the search for duplicate images. Other options for shopping, blogs are implemented in a similar way. The algorithm helps to find similar images that are slightly different: desaturated images, add watermark, passed through the filters.In our algorithm, we will create a pixel matrix of each image. Let it be 15×15 pixels. The next step: we do not know the color of a pixel, but its intensity (the intensity is given by 0.299 * red + 0,587 * green + 0,114 * blue). Calculating the intensity will help us find the image is not paying attention to the colors of the images.Once you have to calculate the intensity of all pixels in a 15×15 matrix, we find the ratio of the intensity of each pixel to the mean intensity of the matrix, and generate a unique number for each cell (in the code to generate unique for each cell, I added the coordinates to intensity) and obtain an array of 225 elements length (15 * 15 = 225). Excellent.Below is the code to generate a digital signature for images on Ruby and PHP languages:Ruby PHP 
61,33% similarity
23,56% similarity
87,55% similarity
Excellent, but it is comparing two images. Now let’s use PostgreSQL for searching. In PostgreSQL is an array type of the field. We will write the digital signature of the image:It works, but search works without index. Let’s create it:orDifference between GiST and GIN indexes you can read here.Perfect! Let’s check performance.Let’s add sorting in SQL (the most similar images were the first), and add similarity as extra field:Added sorting did not complicate query execution.Here you can see all results in raw format.

Conclusion

PostgreSQL extension smlar can be used in systems where we need search similar objects, like texts, images, videos.That’s all folks! Thank you for reading till the end.This is re-post from my blog.