Deep Dive into ElasticSearch Querying: Filter vs Query Context
New to ElasticSearch and don’t know where to begin? Then this blog might be the right place. If not, the official reference definitely is: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
One way to describe ES is search at scale in near real time. ElasticSearch is one of the most popular open-source technologies at the moment. It provides tons of capabilities; to list a few:
- Search With Relevance Scoring
- Full-Text Search
- Analytics (Aggregations)
- Schema-less and Document-Oriented
- Horizontally Scalable
- Fault-Tolerant
The list does not even end there. It provides amazing search abilities and an out-of-the-box aggregation framework, making it a default choice to power your search-based systems.
I started working on ElasticSearch as a backend datastore and search engine at Cars24 to power their Sell and Buy Old Cars website. This was my first interaction with ES, and I had to do a lot of stack-overflowing and googling to get things working that were not covered by the official documentation.
In this article, I will write mostly about querying/searching an ElasticSearch cluster. There are numerous ways to accomplish more or less the same result, so I will try to explain the pros and cons of each method based on my experience with ES.
And if it helps, I will also talk about reindexing indexes, creating features like auto-complete, etc., in case you’re up for a little adventure. However, most of my focus will be on 2 important concepts, query and filter contexts, which are not well explained in the documentation. Based on my experience and test results, I will try to share which one you should go for.
Query context vs Filter context
There is always a relevance score when we talk about ES. The relevance score is a positive float that indicates how well each document satisfies the search criteria. Scores are only meaningful relative to the other documents returned for the same search: the higher the score, the more relevant the document is to the search criteria.
However, one needs to understand the basic difference before choosing between a filter-based search and a query-based search. For better understanding:
Filter Context is a yes/no option: a document either matches our query or it does not, similar to a WHERE clause in SQL, which returns strictly the rows matching the conditions provided. The most important thing to note here is that frequently used filters are cached automatically and that filters don’t contribute to the relevance score of the document.
Query Context, however, shows how well each document matches the query. Full-text queries in this context run the search text through Analysers before the matches are scored.
Now with this distinction in mind, when to go for Filter vs Query can be narrowed down:
- Use Filter whenever it is a yes or no kind of thing.
- For keyword searches and exact-value searches (e.g. range and numerical data), use a Filter Context.
- Query Context is predominantly used for Full-Text Searches.
- Relevance score searches are Query Context searches.
Unless it is a full-text search or a relevance score kind of search, always go for a Filter Context search. This comes from someone who had to rewrite everything while optimizing query performance, and you don’t want to do that later on.
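To make the difference concrete, here is a minimal Python sketch that builds the same exact-value condition twice, once in query context and once in filter context. The bodies are plain dictionaries of the kind sent to GET /_search. The field name bodyType is borrowed from the examples later in this post, and the helper names are my own, not part of any ElasticSearch client.

```python
import json

def in_query_context(clause):
    """Place the clause directly under "query": matching documents are scored."""
    return {"query": clause}

def in_filter_context(clause):
    """Place the clause under bool.filter: documents are only included or excluded."""
    return {"query": {"bool": {"filter": [clause]}}}

clause = {"term": {"bodyType": "SUV"}}

scored_body = in_query_context(clause)
filtered_body = in_filter_context(clause)

print(json.dumps(scored_body, indent=2))
print(json.dumps(filtered_body, indent=2))
```

Both bodies select the same documents; only the first one asks ES to compute a relevance score for them.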
Before we dive into how we’re going to use these queries, there are a few concepts like mappings and dynamic mapping in ES. The text below describes them in a very good way for a person new to ES to learn and keep in mind.
https://logz.io/blog/elasticsearch-mapping/ : Pre-read to understand what a schema is and its pros and cons, why mappings are important, and how ES creates mappings (or used to). All such questions will influence the performance of searches in your system. Note that decisions like which settings you need, which mappings and schema are required, and how you want to handle new attributes (allow dynamic mapping or not) are mandatory choices that should be made beforehand. Otherwise you may end up reindexing the entire index, which is not a good thing when your data is growing rapidly.
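As an illustration of making those choices up front, the sketch below writes an explicit mapping as a Python dictionary, the kind of body one would send when creating an index. The field names, their types, and the raw sub-field are assumptions made for a car listing index, not the actual Cars24 schema.

```python
import json

# Hypothetical mapping for a car listing index; names and types are assumptions.
car_index_body = {
    "mappings": {
        # "strict" rejects documents that contain fields missing from the mapping,
        # instead of letting dynamic mapping guess a type for them.
        "dynamic": "strict",
        "properties": {
            # Analyzed for full-text search, with an exact-value keyword sub-field.
            "make": {
                "type": "text",
                "fields": {"raw": {"type": "keyword"}},
            },
            # Exact values, suited to term/terms filters.
            "bodyType": {"type": "keyword"},
            # Numeric, suited to range filters.
            "listingPrice": {"type": "integer"},
        },
    }
}

print(json.dumps(car_index_body, indent=2))
```

Deciding between text and keyword here is what later determines whether match or term is the right query for a field.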
We will be looking into Structured vs Full-Text queries and will see where we need to go for a filter-based vs query context-based approach to query the data.
1. Structured Querying
- checks if a document should be selected or not.
- no need to compute relevance scores. It just states whether a document needs to be returned or not.
Term Queries
Term queries are still queries, so they will return a score.
- General Syntax
GET /_search
{
  "query": {
    "term": {
      "<field_name>": {
        "value": "<value_to_be_searched>"
      }
    }
  }
}
The term query runs in the query context by default and hence it will calculate the score. Even though the score will be identical for all documents returned, additional computing power is spent on scoring them.
Q. How can we speed up the query and optimise it?
A. Term Query with filter [constant_score filter]
GET /_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": { "<field_name>": "<value_to_be_searched>" }
      }
    }
  }
}
Benefits of using the above query:
- Faster
- Filter context, so frequently used filters can be cached automatically.
NOTE: Use match instead of term for text fields. Term queries look up the input in the inverted index directly, as an exact keyword, without any transformation, while the values of a text field are analysed into tokens at index time. Take an area name for example: searching for cars in Delhi (referring to the ones in Old Delhi) and in New Delhi can give totally different results.
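The toy Python sketch below shows why this matters. It is not how ES is implemented; it only mimics the idea with a deliberately simple analyzer (lowercase and split on whitespace, which is an assumption) and a tiny in-memory inverted index.

```python
from collections import defaultdict

def analyze(text):
    """Toy analyzer: lowercase and split on whitespace (a simplification)."""
    return text.lower().split()

docs = {1: "New Delhi", 2: "Old Delhi"}

# Build a tiny inverted index: token -> set of document ids.
inverted_index = defaultdict(set)
for doc_id, text in docs.items():
    for token in analyze(text):
        inverted_index[token].add(doc_id)

def term_search(value):
    """Look the value up as-is, with no analysis (the idea behind a term query)."""
    return set(inverted_index.get(value, set()))

def match_search(text):
    """Analyze the input first, then OR the tokens (the idea behind a match query)."""
    hits = set()
    for token in analyze(text):
        hits |= inverted_index.get(token, set())
    return hits

print(term_search("New Delhi"))   # the index holds no single token "New Delhi"
print(term_search("delhi"))       # an exact token that is in the index
print(match_search("New Delhi"))  # analyzed into "new" and "delhi" first
```

The term lookup for "New Delhi" finds nothing because only the tokens "new", "old" and "delhi" were indexed, while the match lookup finds both documents because its tokens are combined with OR.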
Terms Query
Returns documents that match at least one of the given terms.
- General Syntax
GET /_search
{
  "query": {
    "terms": {
      "bodyType": ["Luxury Sedan", "Sedan", "SUV"]
    }
  }
}
This will return any car whose bodyType matches one of the given values.
Range Query
Sample query URL:
{baseUrl}/listingPrice-range=470000-1350000
- General Syntax
GET /_search
{
  "query": {
    "range": {
      "listingPrice": {
        "gte": 470000,
        "lte": 1350000
      }
    }
  }
}
The range query has its own operators:
- gt: is greater than
- gte: is greater than or equal to
- lt: is less than
- lte: is less than or equal to
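A small Python sketch of how a URL parameter like the sample above could be turned into a range search in filter context. The parameter convention (field name, then -range=, then the two bounds separated by a plain hyphen) is my reading of the sample URL, and the function name is hypothetical.

```python
def range_filter_from_param(param):
    """Parse '<field>-range=<min>-<max>' into a filter-context range search body."""
    key, _, bounds = param.partition("=")
    suffix = "-range"
    field = key[:-len(suffix)] if key.endswith(suffix) else key
    low, _, high = bounds.partition("-")
    range_clause = {"range": {field: {"gte": int(low), "lte": int(high)}}}
    # A price range is a yes/no condition, so it goes under bool.filter.
    return {"query": {"bool": {"filter": [range_clause]}}}

body = range_filter_from_param("listingPrice-range=470000-1350000")
print(body)
```

Swapping gte/lte for gt/lt would make the bounds exclusive.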
Exists Query
Because ElasticSearch is schema-less (or has no strict schema limitation), it is fairly common for different documents to have different fields. As a result, it is often useful to know whether a document has a certain field or not.
GET /_search
{
  "query": {
    "exists": {
      "field": "<field_name_to_be_searched>"
    }
  }
}
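To picture what exists selects, here is a toy Python version that runs over plain dictionaries instead of an index. It treats a missing field, a null value and an empty list as "no value", which is my simplified reading of the exists semantics; the sample documents are made up.

```python
def has_field(doc, field):
    """Toy exists check: the field is present with a non-null, non-empty-list value."""
    value = doc.get(field)
    return value is not None and value != []

# Made-up documents with differing fields, as a schema-less index allows.
cars = [
    {"make": "Honda", "listingPrice": 470000},
    {"make": "Hyundai"},
    {"make": "Maruti", "listingPrice": None},
]

with_price = [car["make"] for car in cars if has_field(car, "listingPrice")]
print(with_price)
```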
2. Full-text Queries
Full-text queries work well with unstructured text data and take advantage of the analyzer. I will therefore briefly touch upon analyzers and share a resource for understanding them better.
By default, a full-text query uses the same analyzer that was used while indexing the data. More precisely, the text of your query goes through the same transformations as the text stored in the searched field, so that both are at the same level.
Resources :
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis.html
Match Query
The match query is the standard query for querying text fields.
We might call the match query the equivalent of the term query for text type fields (while term should be used solely for keyword type fields when working with text data).
- General Syntax :
GET /_search
{
  "query": {
    "match": {
      "<text_field>": {
        "query": "<value_to_be_searched>"
      }
    }
  }
}
You can specify a custom analyser to process the search text using the analyzer parameter in the query.
When one specifies text to be searched for, it is analyzed and the result is always a set of tokens. By default, ElasticSearch uses the OR operator between those tokens, which means that at least one should match (more matches give a higher score). You can switch this to AND with the operator parameter. In that case, all of the tokens have to be found in a document for it to be returned.
If one wants something in between OR and AND, one can specify the minimum_should_match parameter, which sets the number of clauses that should match. It can be given either as a number or as a percentage.
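The rules above can be sketched in a few lines of Python. This is a simplified model, not ES code: it uses a whitespace tokenizer, handles only positive integers and positive percentages for minimum_should_match, and rounds percentages down. The function names are my own.

```python
def required_matches(total_tokens, operator="or", minimum_should_match=None):
    """How many query tokens must be found in a document (simplified model)."""
    if minimum_should_match is not None:
        if isinstance(minimum_should_match, str) and minimum_should_match.endswith("%"):
            percent = int(minimum_should_match[:-1])
            return total_tokens * percent // 100  # percentages are rounded down
        return int(minimum_should_match)
    return total_tokens if operator == "and" else 1

def matches(doc_text, query_text, **options):
    """True if enough tokens of the query occur in the document."""
    doc_tokens = set(doc_text.lower().split())
    query_tokens = query_text.lower().split()
    found = sum(1 for token in query_tokens if token in doc_tokens)
    return found >= required_matches(len(query_tokens), **options)

doc = "Honda City ZX automatic"
print(matches(doc, "honda civic"))                     # OR: one token is enough
print(matches(doc, "honda civic", operator="and"))     # AND: every token is needed
print(matches(doc, "honda city civic", minimum_should_match=2))
print(matches(doc, "honda city civic diesel", minimum_should_match="75%"))
```

In the last call, 75% of four tokens means three must be present, and the document only contains two of them.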
The fuzziness parameter (optional) allows us to tolerate typos. Levenshtein distance is used for the calculation.
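For readers who have not met it before, Levenshtein distance counts the single-character insertions, deletions and substitutions needed to turn one string into another. Below is a plain textbook implementation in Python, offered only to show the idea; it is not the code ES runs, and the request body after it uses the make field from this post's examples as an assumption.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]

print(levenshtein("hunda", "honda"))
print(levenshtein("hond", "honda"))

# A match query body that tolerates such typos on the (assumed) make field.
fuzzy_body = {"query": {"match": {"make": {"query": "Hunda", "fuzziness": "AUTO"}}}}
```

Both misspellings are one edit away from "honda", which is why a small fuzziness value is usually enough to catch them.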
If one applies a match query to a keyword field, it will perform the same as a term query. More interestingly, if you pass the exact value of a token stored in the inverted index to a term query, it will return exactly the same result as the match query, but faster, as it goes straight to the inverted index.
Match Phrase Query
Same as match, but the order of the terms and their proximity matter.
- General Syntax :
GET /_search
{
  "query": {
    "match_phrase": {
      "make": {
        "query": "Hunda",
        "slop": 1
      }
    }
  }
}
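It will check the make field of the car for the query phrase Hunda (the user might have wanted to search for Honda). Note that match_phrase does not correct typos; tolerating a misspelling like this one is the job of the fuzziness parameter of the match query.
The match_phrase query has a slop parameter (default value 0) that controls how far apart the terms of the phrase may be. Therefore, if you specify slop equal to 1, the phrase still matches when one other word sits between two of its terms.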
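Here is a toy Python sketch of slop for the simple case where the phrase terms appear in the same order as in the document. It counts the extra positions inside the matched span and compares them with slop. Reordered terms, which ES can also match at a higher slop, are deliberately left out, and the sample text is made up.

```python
def phrase_matches(doc_tokens, phrase_tokens, slop=0):
    """In-order sloppy phrase check: extra positions in the matched span must be <= slop."""
    if not phrase_tokens:
        return False
    for start, token in enumerate(doc_tokens):
        if token != phrase_tokens[0]:
            continue
        position = start
        found_all = True
        for wanted in phrase_tokens[1:]:
            try:
                # Earliest occurrence after the previous term keeps the span minimal.
                position = doc_tokens.index(wanted, position + 1)
            except ValueError:
                found_all = False
                break
        if found_all:
            extra = (position - start) - (len(phrase_tokens) - 1)
            if extra <= slop:
                return True
    return False

doc_tokens = "honda city zx".split()
print(phrase_matches(doc_tokens, ["honda", "city"]))        # adjacent terms
print(phrase_matches(doc_tokens, ["honda", "zx"]))          # one word in between
print(phrase_matches(doc_tokens, ["honda", "zx"], slop=1))  # slop allows the gap
```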
Then there are Multi-Match Queries (various types and variations) and the IDs Query (to query for the primary ID of the document directly). I highly recommend going through the official documentation to check how exactly the score is calculated for each of them.
3. Compound Queries
Compound queries wrap together other queries. Compound queries:
- combine the score
- change the behavior of wrapped queries
- switch query context to filter context
- any of the above combined
Boolean Query
Boolean query combines together other queries. It is the most important compound query. Boolean query allows one to combine searches in query context with filter context searches. The boolean query has four occurrences (types) that can be combined together:
- must or “has to satisfy the clause”
- should or “additional points to relevance score if the clause is satisfied”
- filter or “has to satisfy the clause but relevance score is not calculated”
- must_not or “inverse to must, does not contribute to relevance score”
must and should → query context
filter and must_not → filter context
For those who are familiar with SQL, must behaves like the AND operator while should behaves like OR. Therefore, each query inside the must clause has to be satisfied.
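Putting the occurrences together, the Python sketch below builds one bool query where the full-text part is scored and the exact-value parts only filter. The fields make, bodyType and listingPrice come from the earlier examples; the sold field and the function name are hypothetical and used only for illustration.

```python
import json

def build_car_search(make_text, body_types, min_price, max_price):
    """Combine a scored full-text clause with unscored yes/no clauses."""
    return {
        "query": {
            "bool": {
                # Query context: contributes to the relevance score.
                "must": [{"match": {"make": {"query": make_text}}}],
                # Filter context: only includes or excludes documents.
                "filter": [
                    {"terms": {"bodyType": body_types}},
                    {"range": {"listingPrice": {"gte": min_price, "lte": max_price}}},
                ],
                # Hypothetical boolean field, used only for illustration.
                "must_not": [{"term": {"sold": True}}],
            }
        }
    }

body = build_car_search("Honda", ["Sedan", "SUV"], 470000, 1350000)
print(json.dumps(body, indent=2))
```

Only the match clause under must influences the ordering of the results; the rest narrows down which documents are eligible.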
Boosting Query
The boosting query is similar to the boost parameter available on most queries, but it is not the same. It returns documents that match the positive clause and reduces the score of those that also match the negative clause.
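A short Python sketch of the shape of a boosting query, plus a toy function showing the effect on the score. The clauses and the 0.5 value are assumptions chosen for illustration; negative_boost is the factor, between 0 and 1, by which the score of a document matching the negative clause is multiplied.

```python
def boosting_query(positive, negative, negative_boost=0.5):
    """Build a boosting query body from a positive and a negative clause."""
    return {
        "query": {
            "boosting": {
                "positive": positive,
                "negative": negative,
                "negative_boost": negative_boost,
            }
        }
    }

def boosted_score(score, matches_negative, negative_boost=0.5):
    """Toy model: a negative match stays in the results but with a reduced score."""
    return score * negative_boost if matches_negative else score

body = boosting_query(
    positive={"match": {"make": "Honda"}},
    negative={"term": {"bodyType": "SUV"}},
)
print(body)
print(boosted_score(2.0, matches_negative=True))
print(boosted_score(2.0, matches_negative=False))
```

Unlike must_not, the negative clause does not remove documents; it only pushes them down the result list.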
Constant Score Query
As we previously saw in the term query example, the constant_score query converts any query into a filter context with a relevance score equal to the boost parameter (default 1).
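Since this wrapping is mechanical, it can be captured in a tiny Python helper. The helper name and the sample clause are my own; the body it produces follows the constant_score syntax shown in the term query section, with the optional boost parameter added.

```python
def constant_score(filter_clause, boost=1.0):
    """Wrap a clause so it runs in filter context; every hit gets a score equal to boost."""
    return {"query": {"constant_score": {"filter": filter_clause, "boost": boost}}}

body = constant_score({"term": {"bodyType": "Sedan"}})
print(body)
```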
Conclusion :
To sum up, ElasticSearch serves many purposes nowadays, and as a developer it might be difficult to understand which tool to use and which not. The main thing for a developer to understand is not to rush to the latest technology to solve a use case. The trick is to analyze the problem statement, evaluate possible solutions, and choose the ones that best suit your needs.
If the relevance score is not needed and filtering is the best way to move forward, then switch to Filter Context, which we did for most of our search use cases after analyzing query performance under load tests. However, auto-suggested cars, auto-complete, and other such features use query context search as well.
References :
https://www.elastic.co/guide/en/elasticsearch/guide/master/_queries_and_filters.html
The average API turnaround time was reduced by more than 40% by switching to Filter Context queries wherever keyword-based searching and filtering were needed. Hence, choose wisely before you start to code.
I wanted to cover Indexing/Reindexing topics as well as the awesome Aggregation Framework provided by ElasticSearch, but that would be too much for the scope of this blog. I will try to share some blogs soon on how to reindex ES data with over 0.5 million records with no downtime, and on cool aggregations for dashboards and metrics that one can use. Additionally, I will cover how to use Kibana for analytics and custom dashboards in my next blog. Well, hope to see you there too. :)
About me:
I am Jatin Mahajan, working as a software developer. I have a keen interest in learning new programming languages, design concepts, sports, music, and PS4 gaming.