Elasticsearch: The Beginning

When we deal with a large volume of data and need fast, near-real-time search and analytics, the first tool that comes to mind is probably Elasticsearch. It is one of the most popular search engines currently used in the industry.

What is Elasticsearch

Elasticsearch is a highly scalable full-text search engine built on Apache Lucene that lets you store, search, and analyze large volumes of data quickly and in near real time. It stores schema-free JSON documents and exposes an extensive REST API for indexing and searching data.

Basic Terminologies

Let's go over some of the key terms and concepts used in Elasticsearch.

Index

An index is a collection of documents with similar characteristics. It is roughly equivalent to a database in the world of relational databases.

Type

A type represents a class of documents and was originally used to subdivide similar kinds of data within an index. It has traditionally been compared to a table in the world of relational databases. Note, however, that indices created in Elasticsearch 6.x (the version used in this article) can contain only a single type, conventionally named _doc, and that types are deprecated from 7.x onwards.

Document

Documents are JSON objects stored within an Elasticsearch index and are the base unit of storage. A document is roughly equivalent to a row of a table in the world of relational databases.

Field

A field is the smallest individual unit of data in Elasticsearch. Each field has a defined type and holds a piece of data such as a boolean, a string, or an array.

Mapping

A mapping defines the fields of the documents in an index and how those fields should be indexed and stored in Elasticsearch. Like a schema in the world of relational databases, it describes the data type of each field. A mapping can be defined explicitly or generated automatically when a document is indexed.
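
To make the idea of an automatically generated mapping more concrete, here is a minimal Python sketch of the kind of guess Elasticsearch makes for simple values. The function name infer_mapping is made up for this illustration, and the sketch ignores date detection, nested objects, and arrays, so treat it as an approximation rather than the real algorithm.

```python
def infer_mapping(document):
    """Guess a field-type mapping for a flat JSON-like document (illustrative only)."""
    properties = {}
    for field, value in document.items():
        # bool must be tested before int because bool is a subclass of int in Python.
        if isinstance(value, bool):
            properties[field] = {"type": "boolean"}
        elif isinstance(value, int):
            properties[field] = {"type": "long"}
        elif isinstance(value, float):
            properties[field] = {"type": "float"}
        elif isinstance(value, str):
            # Strings become analyzed text plus an exact-match "keyword" sub-field.
            properties[field] = {
                "type": "text",
                "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
            }
    return {"properties": properties}


mapping = infer_mapping({"first_name": "rojace", "age": 30, "active": True})
```

The keyword sub-field produced for strings is what makes field names such as country.keyword available later on without any explicit mapping.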

Shard

Elasticsearch provides the ability to subdivide the index into multiple pieces called shards. Each shard is in itself a fully-functional and independent index that can be hosted on any node within the cluster.

Index size is a common cause of Elasticsearch crashes. Since there is no limit on how many documents we can store in an index, an index may take up more disk space than the hosting server has available. As soon as an index approaches this limit, indexing begins to fail. To avoid this, we can split indices horizontally into shards, which lets us distribute operations across shards and nodes to improve performance.
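
Conceptually, Elasticsearch picks the shard for a document by hashing a routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The Python sketch below illustrates only that idea: pick_shard is a hypothetical helper, and zlib.crc32 stands in for the Murmur3-based hash that Elasticsearch actually uses, so the shard numbers it yields will not match a real cluster.

```python
import zlib


def pick_shard(routing_value, number_of_primary_shards):
    """Map a routing value to a shard number (illustrative stand-in hash)."""
    # zlib.crc32 is an assumption made for this sketch, not Elasticsearch's hash.
    hashed = zlib.crc32(routing_value.encode("utf-8"))
    return hashed % number_of_primary_shards


# Spread ten document IDs over five primary shards.
shard_ids = [pick_shard(str(doc_id), 5) for doc_id in range(1, 11)]
```

Because the result depends on the number of primary shards, that number cannot simply be changed after an index has been created without moving documents around.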

Replicas

Replicas are copies of our index’s shards and act as Elasticsearch’s fail-safe mechanism. They provide high availability in case a node fails, and they also let us scale out our search volume, since searches can be executed on all replicas in parallel.

Node

A node is a single server that is part of a cluster. It stores data and participates in the cluster’s indexing and search capabilities.

Cluster

A cluster is a collection of one or more nodes that together hold all of the data.

Setup And Installation

You can refer to the Elasticsearch installation documentation to set it up on your machine. I’m using macOS, so I’ll use brew to install it.

brew install elasticsearch

Once installed, you can run it by executing this command:

elasticsearch

You can verify that the server has started by opening http://localhost:9200. You should see something like this:

{
  "name": "yN1yzxV",
  "cluster_name": "elasticsearch_rojeshshrestha",
  "cluster_uuid": "-Rh4u3NtSFyskxIjeJVDGw",
  "version": {
    "number": "6.8.0",
    "build_flavor": "oss",
    "build_type": "tar",
    "build_hash": "65b6179",
    "build_date": "2019-05-15T20:06:13.172855Z",
    "build_snapshot": false,
    "lucene_version": "7.7.0",
    "minimum_wire_compatibility_version": "5.6.0",
    "minimum_index_compatibility_version": "5.0.0"
  },
  "tagline": "You Know, for Search"
}
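
If you would rather check the server from a script than from the browser, the following Python sketch reads the same root endpoint using only the standard library. The helper names fetch_root_info and version_tuple are made up for this example, and version_tuple assumes a plain release number such as 6.8.0 with no suffix.

```python
import json
import urllib.request


def fetch_root_info(base_url="http://localhost:9200"):
    """GET the root endpoint and decode its JSON body (needs a running node)."""
    with urllib.request.urlopen(base_url) as response:
        return json.loads(response.read().decode("utf-8"))


def version_tuple(info):
    """Turn the "version.number" string, e.g. "6.8.0", into a tuple of ints."""
    return tuple(int(part) for part in info["version"]["number"].split("."))


# A trimmed copy of the response shown above, so the parsing runs offline.
sample = {"version": {"number": "6.8.0"}, "tagline": "You Know, for Search"}
major, minor, patch = version_tuple(sample)
```

Against a live node, version_tuple(fetch_root_info()) gives the same kind of tuple, which is handy because several APIs in this article behave differently across major versions.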

Now, once set up, we can get down to business and get started with some basics.

Creating an index and Adding documents

Elasticsearch will automatically create an index (with basic settings and mappings) for us if we post the first document:

$ curl -X POST 'http://localhost:9200/user/_doc/1' \
  -H 'Content-Type: application/json' -d \
'{
  "email": "rojace@hotmail.com",
  "country": "Nepal",
  "first_name": "rojace",
  "last_name": "shrestha",
  "full_name": "rojace shrestha",
  "age": 30,
  "profession": "Software Engineer"
}'
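
The same request can be put together in Python with the standard library. This is only a sketch: build_index_request is a hypothetical helper, it builds the request without sending it, and the URL layout simply mirrors the curl command above. Elasticsearch 6.x expects a Content-Type header on requests that carry a body, so the sketch sets one explicitly.

```python
import json
import urllib.request


def build_index_request(index, doc_id, document, base_url="http://localhost:9200"):
    """Build (but do not send) a request that indexes one document."""
    return urllib.request.Request(
        url=f"{base_url}/{index}/_doc/{doc_id}",
        data=json.dumps(document).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_index_request("user", 1, {"first_name": "rojace", "age": 30})
# To actually send it to a running node: urllib.request.urlopen(request)
```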

You can see the settings (for the user index) and the mapping (for the _doc type) with:

$ curl -X GET 'http://localhost:9200/user/_settings'
$ curl -X GET 'http://localhost:9200/user/_doc/_mapping'

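We can also apply custom settings and mappings when creating the index, before any document is added. If the user index already exists from the previous step, delete it first with curl -X DELETE 'http://localhost:9200/user', because creating an index that already exists fails.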

$ curl -X PUT 'http://localhost:9200/user' \
  -H 'Content-Type: application/json' -d \
'{
  "mappings": {
    "_doc": {
      "properties": {
        "email": {
          "type": "text",
          "analyzer": "uax_url_email_analyzer",
          "fields": {
            "keyword": {
              "type": "keyword",
              "ignore_above": 256
            }
          }
        }
      }
    }
  },
  "settings": {
    "analysis": {
      "analyzer": {
        "uax_url_email_analyzer": {
          "tokenizer": "uax_url_email_tokenizer"
        }
      },
      "tokenizer": {
        "uax_url_email_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": "1024"
        }
      }
    }
  }
}'

In the above example, we define a custom analyzer for the email field so that an address containing the special character @ is indexed as a single token. Without it, the default standard analyzer splits rojace@hotmail.com at the @ into separate tokens, which can make a search for the full address behave unexpectedly.
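
The difference between the two tokenizers can be imitated with a couple of regular expressions. The Python sketch below is a rough simulation written for this article, not the tokenizers Elasticsearch ships: the patterns WORD and EMAIL and both function names are assumptions, and real tokenization follows the Unicode text segmentation rules much more closely.

```python
import re

# Simplified patterns, assumed for illustration only.
WORD = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)*"
EMAIL = r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"


def standard_like_tokens(text):
    """Split on anything that is not a word, so "@" acts as a separator."""
    return re.findall(WORD, text)


def email_aware_tokens(text):
    """Try to match a whole email address first, then fall back to words."""
    return re.findall(f"{EMAIL}|{WORD}", text)


plain = standard_like_tokens("rojace@hotmail.com")
aware = email_aware_tokens("rojace@hotmail.com")
```

Here plain ends up holding the two pieces on either side of the @, while aware keeps the address in one piece. On a running node, the _analyze API shows the real tokens for any analyzer.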

Retrieving Data

To check whether the document has been created, we can call the search API like this:

curl -X GET 'http://localhost:9200/user/_search'

It returns a response like the one below:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "user",
        "_type": "_doc",
        "_id": "1",
        "_score": 1,
        "_source": {
          "email": "rojace@hotmail.com",
          "country": "Nepal",
          "first_name": "rojace",
          "last_name": "shrestha",
          "full_name": "rojace shrestha",
          "age": 30,
          "profession": "Software Engineer"
        }
      }
    ]
  }
}
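
In application code we usually care about the original documents rather than the whole envelope. The Python sketch below pulls them out of a response shaped like the one above. Both helper names are made up for this example, and total_hits also accepts the object form of hits.total that Elasticsearch 7.x returns instead of a plain number.

```python
def extract_sources(search_response):
    """Return the original documents (_source) from a search response."""
    return [hit["_source"] for hit in search_response["hits"]["hits"]]


def total_hits(search_response):
    """Return the hit count for both the 6.x (int) and 7.x (object) formats."""
    total = search_response["hits"]["total"]
    return total["value"] if isinstance(total, dict) else total


# A trimmed copy of the response shown above.
response = {
    "hits": {
        "total": 1,
        "hits": [
            {"_id": "1", "_source": {"first_name": "rojace", "country": "Nepal"}}
        ],
    }
}
documents = extract_sources(response)
```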

Conditional Queries

The above example matches every document in the index (by default, only the first 10 hits are returned). Now, let's look at some example queries that retrieve only the data we want.

Fetching all the documents that match a query on a particular field:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "match": {
      "first_name": "rojace"
    }
  }
}'

Fetching all the documents that match a query on any of several fields:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "multi_match": {
      "query": "rojace",
      "fields": ["first_name", "full_name"]
    }
  }
}'
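
When queries are assembled in code rather than typed by hand, small builder functions keep the JSON bodies consistent. The Python sketch below produces the two bodies shown above; the function names match_query and multi_match_query are assumptions made for this illustration and are not part of any client library.

```python
import json


def match_query(field, text):
    """Body for a full-text match on a single field."""
    return {"query": {"match": {field: text}}}


def multi_match_query(text, fields):
    """Body for a full-text match across several fields."""
    return {"query": {"multi_match": {"query": text, "fields": list(fields)}}}


# Serialize to the JSON string that would be sent as the request body.
body = json.dumps(multi_match_query("rojace", ["first_name", "full_name"]))
```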

Range Query

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "range": {
      "age": { "gte": 25 }
    }
  }
}'
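
The range query accepts the bounds gt, gte, lt, and lte, which can be combined. As a mental model, the Python sketch below applies the same bounds to plain numbers in memory. The helper in_range is hypothetical and says nothing about how Elasticsearch evaluates ranges internally.

```python
import operator

# The four bound names accepted by the range query, mapped to Python comparisons.
RANGE_OPERATORS = {
    "gt": operator.gt,
    "gte": operator.ge,
    "lt": operator.lt,
    "lte": operator.le,
}


def in_range(value, bounds):
    """Return True when value satisfies every bound, e.g. {"gte": 25, "lt": 40}."""
    return all(RANGE_OPERATORS[name](value, limit) for name, limit in bounds.items())


ages = [18, 25, 30]
matching = [age for age in ages if in_range(age, {"gte": 25})]
```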

Bool Query

We can combine conditions with AND/OR/NOT logic to fine-tune our queries, using the bool query. The bool query accepts a must parameter (equivalent to AND), a must_not parameter (equivalent to NOT), and a should parameter (equivalent to OR). For example, to search for users whose first_name is fname or whose last_name is lname, but whose profession is not Doctor, we can do something like this:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "bool": {
      "must": {
        "bool": {
          "should": [
            { "match": { "first_name": "fname" }},
            { "match": { "last_name": "lname" }}
          ],
          "must_not": { "match": { "profession": "Doctor" }}
        }
      }
    }
  }
}'
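
The way the clauses combine can be mimicked on plain Python dictionaries. The sketch below is a simplified model written for this article: matches_bool and field_is are hypothetical helpers, scoring is ignored, and exact equality stands in for the analyzed match query. It does reflect one detail worth knowing, namely that when a bool query has no must (or filter) clause, at least one should clause has to match.

```python
def field_is(field, value):
    """Build a predicate that checks one field for an exact value."""
    return lambda document: document.get(field) == value


def matches_bool(document, must=(), should=(), must_not=()):
    """Evaluate must / should / must_not predicates against one document."""
    if not all(clause(document) for clause in must):
        return False
    if any(clause(document) for clause in must_not):
        return False
    # Without a must clause, at least one should clause is required.
    if should and not must:
        return any(clause(document) for clause in should)
    return True


clauses = {
    "should": [field_is("first_name", "fname"), field_is("last_name", "lname")],
    "must_not": [field_is("profession", "Doctor")],
}
engineer = {"first_name": "fname", "last_name": "x", "profession": "Engineer"}
doctor = {"first_name": "fname", "last_name": "x", "profession": "Doctor"}
stranger = {"first_name": "a", "last_name": "b", "profession": "Engineer"}
results = [matches_bool(doc, **clauses) for doc in (engineer, doctor, stranger)]
```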

Term/Terms Query

When we want to find exact matches, we can use the term query (or the terms query for several values). For example, it is useful for finding all users whose country is Nepal:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "term": {
      "country.keyword": "Nepal"
    }
  }
}'
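
The example above covers term only. For completeness, the Python sketch below builds both body shapes; the terms query takes a list and matches documents whose field equals any of the listed values. The function names are made up for this illustration, and India is just a hypothetical second value that does not appear in the sample data.

```python
def term_query(field, value):
    """Body for an exact match on a single value."""
    return {"query": {"term": {field: value}}}


def terms_query(field, values):
    """Body for an exact match on any one of several values."""
    return {"query": {"terms": {field: list(values)}}}


single = term_query("country.keyword", "Nepal")
# "India" is a hypothetical extra value used only to show the list form.
several = terms_query("country.keyword", ["Nepal", "India"])
```

Since term-level queries are not analyzed, the value has to match the stored keyword exactly, including case: Nepal matches, nepal does not.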

Wildcard Query

Wildcard queries let you specify a pattern to match instead of the entire term: ? matches any single character and * matches zero or more characters. For example, to find all users whose first name begins with the letter r, we can do something like this:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "wildcard": {
      "first_name": "r*"
    }
  }
}'
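
If the ? and * semantics feel familiar, it is because they mirror shell-style globbing. The Python sketch below uses fnmatchcase from the standard library as an analogy on a hypothetical list of names. It is not how Elasticsearch evaluates wildcards, and fnmatch also understands [seq] character classes, which the wildcard query does not.

```python
from fnmatch import fnmatchcase

# Hypothetical first names used only for this illustration.
names = ["rojace", "ram", "sam", "rita"]

# "*" matches zero or more characters.
starts_with_r = [name for name in names if fnmatchcase(name, "r*")]

# "?" matches exactly one character.
any_letter_then_am = [name for name in names if fnmatchcase(name, "?am")]
```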

Query String

The query_string query provides a concise shorthand syntax for expressing multi_match, bool, wildcard, regexp, and range queries. For example, to search for anything starting with either r or e in first_name and email, we can do something like this:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "query_string": {
      "query": "(r*) OR (e*)",
      "fields": ["first_name", "email"]
    }
  }
}'

Filtered Bool Query

We can narrow our results further by adding a filter clause to the bool query. For example, to find users with the string roj in either full_name or email who are Software Engineers by profession and are aged 20 or older, we can do something like this:

$ curl -X GET 'http://localhost:9200/user/_doc/_search' \
  -H 'Content-Type: application/json' -d \
'{
  "query": {
    "bool": {
      "must": {
        "query_string": {
          "query": "*roj*",
          "fields": ["full_name", "email"]
        }
      },
      "filter": [
        {
          "term": { "profession.keyword": "Software Engineer" }
        },
        {
          "range": { "age": { "gte": 20 } }
        }
      ]
    }
  }
}'
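
One way to think about this query is as a scored part (must) plus a list of yes/no conditions (filter). The Python sketch below assembles the same body from those two pieces; filtered_search is a hypothetical helper written for this article. Clauses under filter only include or exclude documents and do not contribute to the relevance score.

```python
import json


def filtered_search(pattern, fields, filters):
    """Combine a scored query_string clause with non-scoring filter clauses."""
    return {
        "query": {
            "bool": {
                "must": {"query_string": {"query": pattern, "fields": list(fields)}},
                "filter": list(filters),
            }
        }
    }


body = filtered_search(
    "*roj*",
    ["full_name", "email"],
    [
        {"term": {"profession.keyword": "Software Engineer"}},
        {"range": {"age": {"gte": 20}}},
    ],
)
payload = json.dumps(body)  # JSON string to send as the request body
```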

Conclusion

These are some of the most basic queries we can use to get started with Elasticsearch. There are also official API clients for many programming languages, such as Go, Ruby, Python, Java, and JavaScript, which make working with Elasticsearch much easier. I’m just scratching the surface here, and I hope this is helpful for anyone who wants to get started with Elasticsearch.
