Latest news:
Sahih al-Bukhari (সহীহ বুখারী) is a free Hadith application for android. This application is advertisement free. Download now https://play.google.com/store/apps/details?id=com.akramhossin.bukharisharif
Fuzzy rank based relevant search in PostgreSQL.
In this article, I'll share how I've improved a search feature by ranking fuzzy string matches by applying
to_tsvector, phraseto_tsquery, ts_rank, similarity to the searched term.
let's say we need to search list of labs which name contains the character combination of 'medwhite'.
I'm saying contains since there might be names that are not an exact match to 'medwhite ' but can contain it - e.g. 'Med', 'White', 'Med White Gold', 'Med
Grey Lab' - I don't want to exclude those from my results. Other names similar to 'medwhite' can be valuable to whoever is searching.
This type of search is called fuzzy string matching and can be implemented with SQL using wildcards.
I'll be using PostgreSQL but similar SQLs can be written in other databases:
SELECT id, name_en, name_ar FROM labs AND labs.name_en ilike '%medwhite%' OR labs.name_ar ilike '%medwhite%' LIMIT 20
This query will return all the users whose name_en or name_ar contains 'medwhite' but I won't get them in any specific order.
I could order them alphabetically by 'name_en' and 'name_ar' but that does not guarantee relevance.
PostgreSQL has an extension called pg_trgm.
pg_trgm provides functions and operators for determining the similarity of alphanumeric text based on trigram matching,
as well as index operator classes that support fast searching for similar strings.
A trigram is a group of three consecutive written units such as letters, syllables, or words. What Postgres will do is to split words into trigrams.
For instance, the 'medwhite' trigram would be {" m"," me",dwh,edw,hit,ite,med,"te ",whi}. Note that some trigrams have empty spaces.
That's because each word is considered to have two spaces prefixed and one space suffixed when determining the set of trigrams contained in the string.
Let's starts with the pg_trgm extension and sql to understand how it works.
at first we have add the extension by following sql statement
CREATE EXTENSION pg_trgm;
To learn more please go to this link https://www.postgresql.org/docs/current/pgtrgm.html
Now look into the following query.
select l.name_en, rank_title_en, rank_title_ar, similarity from labs l, to_tsvector(l.name_en || l.name_ar) document, phraseto_tsquery('medwhite') query, NULLIF(ts_rank(to_tsvector(l.name_en), query), 0) rank_title_en, NULLIF(ts_rank(to_tsvector(l.name_ar), query), 0) rank_title_ar, SIMILARITY('medwhite', l.name_en || l.name_ar) similarity where ((l.name_en ilike '%medwhite%') OR (l.name_ar ilike '%medwhite%')) or ((query @@ document) OR (similarity > 0)) ORDER BY rank_title_en desc,similarity DESC NULLS LAST
using to_tsvector function, I'm creating a document using the columns i'm searching for, here i'm searching in name_en and name_ar columns.
using phraseto_tsquery function, i'm converting the users search term into tsquery.
using ts_rank function, i'm trying to measure how relevant documents are to a particular query, so that when there are many matches the most relevant
ones can be shown first.
using similarity function, i'm trying to compare that how similar the two arguments are. ie compare users search terms against table column value.
To learn more please use this link https://www.postgresql.org/docs/current/textsearch-controls.html
Final result of this query will be like this
Hope this helps!
Views : 1048