Codxplore.com

Latest news:

Sahih al-Bukhari (সহীহ বুখারী) is a free Hadith application for android. This application is advertisement free. Download now https://play.google.com/store/apps/details?id=com.akramhossin.bukharisharif

Fuzzy rank based relevant search in PostgreSQL.


Fuzzy rank based relevant search in PostgreSQL.

Fuzzy rank based relevant search in PostgreSQL.

In this article, I'll share how I've improved a search feature by ranking fuzzy string matches by applying

to_tsvector, phraseto_tsquery, ts_rank, similarity to the searched term. 

let's say we need to search list of labs which name contains the character combination of 'medwhite'.

I'm saying contains since there might be names that are not an exact match to 'medwhite ' but can contain it - e.g. 'Med', 'White', 'Med White Gold', 'Med

Grey Lab' - I don't want to exclude those from my results. Other names similar to 'medwhite' can be valuable to whoever is searching.

This type of search is called fuzzy string matching and can be implemented with SQL using wildcards.

I'll be using PostgreSQL but similar SQLs can be written in other databases:

SELECT id, name_en, name_ar
FROM labs
AND labs.name_en ilike '%medwhite%'
OR labs.name_ar ilike '%medwhite%'
LIMIT 20

This query will return all the users whose name_en or name_ar contains 'medwhite' but I won't get them in any specific order. 

I could order them alphabetically by 'name_en' and 'name_ar' but that does not guarantee relevance.

PostgreSQL has an extension called pg_trgm

pg_trgm provides functions and operators for determining the similarity of alphanumeric text based on trigram matching, 

as well as index operator classes that support fast searching for similar strings.

A trigram is a group of three consecutive written units such as letters, syllables, or words. What Postgres will do is to split words into trigrams. 

For instance, the 'medwhite' trigram would be {"  m"," me",dwh,edw,hit,ite,med,"te ",whi}. Note that some trigrams have empty spaces. 

That's because each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.

Let's starts with the pg_trgm extension and sql to understand how it works.

at first we have add the extension by following sql statement

CREATE EXTENSION pg_trgm;

To learn more please go to this link https://www.postgresql.org/docs/current/pgtrgm.html

Now look into the following query.

select 
l.name_en,
rank_title_en,
rank_title_ar,
similarity
from 
labs l,
to_tsvector(l.name_en || l.name_ar) document, 
phraseto_tsquery('medwhite') query,
NULLIF(ts_rank(to_tsvector(l.name_en), query), 0) rank_title_en,
NULLIF(ts_rank(to_tsvector(l.name_ar), query), 0) rank_title_ar,
SIMILARITY('medwhite', l.name_en || l.name_ar) similarity
where ((l.name_en ilike '%medwhite%') OR (l.name_ar ilike '%medwhite%')) or ((query @@ document) OR (similarity > 0))
ORDER BY rank_title_en desc,similarity DESC NULLS LAST

using to_tsvector function, I'm creating a document using the columns i'm searching for, here i'm searching in name_en and name_ar columns.

using phraseto_tsquery function, i'm converting the users search term into tsquery. 

using ts_rank function, i'm trying to measure how relevant documents are to a particular query, so that when there are many matches the most relevant

ones can be shown first.

using similarity function, i'm trying to compare that how similar the two arguments are. ie compare users search terms against table column value.

To learn more please use this link https://www.postgresql.org/docs/current/textsearch-controls.html

Final result of this query will be like this
Untitled.png


Hope this helps!

 

 


Views : 1048

Subscribe Us


Follow Us