# Effective similarity search in PostgreSQL

by Alexey Vasiliev by May 10, 2012

Hello my dear friends. In “PostgreSQL most useful extensions” I showed a list of some useful extensions for PostgreSQL. Of course, that article didn’t cover all useful extensions (in my humble opinion) and some extensions I want to describe in separate articles. Today we will talk about effective similarity search in PostgreSQL.

Similarity search in large databases is an important issue in nowadays informational services, such as recommender systems, blogs (similar articles), online shops (similar products), image hosting services (similar images, search image duplicates) etc. PostgreSQL allows to make such things more easy. First you need to understand how we will calculate the similarity of two objects.

# Similarity

Any object can be described by a list of characteristics. For example, article in blog can be described by tags, product in online shop – by size, weight, color etc. It means that for each object we can create the digital signature – array of numbers, which describing the object (fingerprint, n-grams). Let’s create an array of not ordered numbers for every object. What should we do next?

# Similarity calculation

There are several methods for calculating the similarity of objects by signatures. First of all, the legend for calculations:

Na, Nb – the number of unique elements in the arrays

Nu – the number of unique elements in the union of sets

Ni – the number of unique elements in the intersection of arrays

One of the simplest calculation of the similarity of two objects is the number of unique elements in the intersection of arrays divided by the number of unique elements in two arrays:

$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{(N_{a} + N_{b})}$

or only

$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{N_{u}}$

Pros:

• Easy to understand
• Speed of calculation: N * log(N)
• Works well at similar and large Na and Nb

Similarity can also be calculated using the formula of cosines:

$LARGE dpi{120} fn_phv S(A,B) = frac{N_{i}}{sqrt{N_{a} * N_{b}}}$

Pros:

• Speed of calculation: N * log(N)
• Works well for large N

But both of these metrics have common problems:

• Few elements -> large scatter of similarity
• Frequent elements -> weight below

TF/IDF metric avoids these problems to calculate the similarity:

$LARGE dpi{120} fn_phv S(A,B) = frac{sum_{i < N_{a}, j < N_{b}, A_{i} = B_{j}}TF_{i} * TF_{j}}{sqrt{sum_{i < N_{a}}TF_{i}^{2} * sum_{j < N_{b}}TF_{j}^{2}}}$

where

$LARGE dpi{120} fn_phv IDF_{element} = log(frac{N_{objects}}{N_{objects with element}} + 1)$

$LARGE dpi{120} fn_phv TF_{element} = IDF_{element} * N_{occurrences}$

Now it would be great to use this knowledge in practice.

# Smlar extension for PostgreSQL

Oleg Bartunov and Teodor Sigaev developed PostgreSQL extension, called smlar, which provides several methods to calculate sets similarity (all built-in data types supported), similarity operator with indexing support on the base of GiST and GIN frameworks. Let’s look how to work with this extension. First of all, we need install it (PostgreSQL should be already installed):

On PostgreSQL 9.2 this extension should build without problem, for PostgreSQL 9.1 and earlier you need to make a little fix. In file “smlar_guc.c” on line 214 change call

to this (delete last argument)

Let’s test instaled extension:

Function smlar computes similary of two arrays (arrays should be the same type) and return float from 0 to 1 (0 – absolutely no similar objects, 1 – absolutely similar arrays, equal). Function can take the third argument – the formula, that calculates the similarity of the two arrays. Module provides several GUC variables and it’s highly recommended to add to postgesql.conf:

Array’s with similarity lower than ‘smlar.threshold’ are not similar by % operation.

GiST/GIN support for % operation. The parameter “similar.type” allows you to specify what kind of formula used to calculate the similarity: cosine (default), overlap or tfidf. For “tfidf” need to make additional configuration, but I will not consider this in the article (all can be found in the README file). Now let’s consider an example of using this extension.

# Finding duplicate of images

For example, I select the search for duplicate images. Other options for shopping, blogs are implemented in a similar way. The algorithm helps to find similar images that are slightly different: desaturated images, add watermark, passed through the filters.

In our algorithm, we will create a pixel matrix of each image. Let it be 15×15 pixels. The next step: we do not know the color of a pixel, but its intensity (the intensity is given by 0.299 * red + 0,587 * green + 0,114 * blue). Calculating the intensity will help us find the image is not paying attention to the colors of the images.

Once you have to calculate the intensity of all pixels in a 15×15 matrix, we find the ratio of the intensity of each pixel to the mean intensity of the matrix, and generate a unique number for each cell (in the code to generate unique for each cell, I added the coordinates to intensity) and obtain an array of 225 elements length (15 * 15 = 225). Excellent.

Below is the code to generate a digital signature for images on Ruby and PHP languages:

Ruby

PHP

 61,33% similarity
 23,56% similarity
 87,55% similarity

Excellent, but it is comparing two images. Now let’s use PostgreSQL for searching. In PostgreSQL is an array type of the field. We will write the digital signature of the image:

It works, but search works without index. Let’s create it:

or

Difference between GiST and GIN indexes you can read here.

Perfect! Let’s check performance.

Let’s add sorting in SQL (the most similar images were the first), and add similarity as extra field:

Added sorting did not complicate query execution.

Here you can see all results in raw format.

# Conclusion

PostgreSQL extension smlar can be used in systems where we need search similar objects, like texts, images, videos.

That’s all folks! Thank you for reading till the end.

This is [re-post](http://leopard.in.ua/2012/06/01/effective-similarity-search-in-postgresql/) from my blog.

• Tim Uckun

How can you use to detect rows that might be similar in a database?

• Alexey Vasiliev

This can be used:
– Search duplicate of images (photohosting)
– Search by fingerprint database most similar fingerprint
– Search similar texts in database (The antiplagiat system)
– Search and show similar products for current, what user watch
– Search for a misspelled

• Tim Uckun

What I mean is how can you check if two records in a table are similar without comparing every row to every other row. Can you cluster rows based on similarity and if so how?

• Alexey Vasiliev

This method used for search by some criterial (like full text search). It’s possible find similar in table by searching in full table rows, for my example:

SELECT * from images as a, images as b where a.image_array % b.image_array AND a.id != b.id ORDER BY smlar(a.image_array, b.image_array) LIMIT 10;

But this method not effective by time execution (for table in 1 milion records need several minutes). You can remove “ORDER BY”, then execution will be more quick, because you will seach by smlar index without ordering.

SELECT * from images as a, images as b where a.image_array % b.image_array AND a.id != b.id LIMIT 10;

• Oleg Bartunov

Interesting, where did you get our presentation ? I don’t remember we published it :)

• smlar is awesome, thanks a lot!

• I didn’t install PostgreSQL under system dir, so I have to change one line in the Makefile to pass the  compilation:

PG_CONFIG = pg_config

to

PG_CONFIG ?= pg_config

Then compile smlar: USE_PGXS=1 PG_CONFIG=/path/to/postgresql/bin/pg_config make

• JesteR

Greetings, I am trying for the last few days to compile smlar extension with no luck..I am using Mac OS X 10.7.4, I have successfully installed postgre and using it but cant get smlar to work!

The closest I could get using “Elyes Du” advice and submiting:
sudo USE_PGXS=1 PG_CONFIG=/Library/PostgreSQL/9.2/bin/pg_config make && make install ,brings out the following errors:

smlar_guc.c: In function ‘set_smlar_limit’:

smlar_guc.c:213: error: too few arguments to function ‘set_config_option’

smlar_guc.c: In function ‘set_smlar_limit’:

smlar_guc.c:213: error: too few arguments to function ‘set_config_option’

lipo: can’t figure out the architecture type of: /var/tmp//ccG6vY26.out

make: *** [smlar_guc.o] Error 1

I am hardly needing this for a project, so any advice is welcome!

• le0pard

Just use Linux for it. I think MacOS not supported by extension.

• Guest

Greetings, I am trying for the last few days to compile smlar extension with no luck..I am using Mac OS X 10.7.4, I have successfully installed postgre and using it but cant get smlar to work!

The closest I could get using “Elyes Du” advice and submiting:
sudo USE_PGXS=1 PG_CONFIG=/Library/PostgreSQL/9.2/bin/pg_config make && make install ,brings out the following errors:

smlar_guc.c: In function ‘set_smlar_limit’:

smlar_guc.c:213: error: too few arguments to function ‘set_config_option’

smlar_guc.c: In function ‘set_smlar_limit’:

smlar_guc.c:213: error: too few arguments to function ‘set_config_option’

lipo: can’t figure out the architecture type of: /var/tmp//ccG6vY26.out

make: *** [smlar_guc.o] Error 1

I am hardly needing this for a project, so any advice is welcome!

Do you use SkyTools in Railsware?

• le0pard

Sorry. I opened two articles and was confused one with other. This comment addressed to http://railsware.com/blog/2012/04/23/postgresql-most-useful-extensions/

• le0pard

Oh, ok. Yes, we are using SkyTools, very often Londiste for old PostgreSQL databases (version 8.4 or less). But SkyTools is more close to framework, not to extension. What is why you cannot find it in article :)

• Rob Silva

Can you provide a simple example of searching similar texts?

• le0pard

For this purpose you can try to use tsearch2, which build-in in PostgreSQL