My OCD's: Elastic Search

Elasticsearch is built on top of Lucene. It's a Java app, so you need Java installed on your machine. It's widely used in real-time analytics to speed up search. It's a NoSQL, text-based distributed database.

Installation

Prerequisite: Java

To install the Java runtime on Ubuntu:

add the repository: sudo add-apt-repository ppa:webupd8team/java
refresh the package list with sudo apt-get update
then install with sudo apt-get install oracle-java8-installer
check the version of Java installed with java -version

To install Elasticsearch on Ubuntu

use wget to download the package: wget https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/deb/elasticsearch/2.0.0-beta2/elasticsearch-2.0.0-beta2.deb
run the installation with: sudo dpkg -i elasticsearch-2.0.0-beta2.deb
make sure it's running with curl: curl http://localhost:9200 (you will receive a JSON response body with a few details of your Elasticsearch node)

To install Elasticsearch from the binary

download the binary with: wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.1.2.tar.gz
extract it: tar -xvf elasticsearch-5.1.2.tar.gz
change to the directory: cd elasticsearch-5.1.2/bin
start the server: ./elasticsearch
to verify that Elasticsearch is running: curl http://localhost:9200/

That's it, you've got your very first instance of Elasticsearch running successfully on your machine. By default it acts as both a master and a data node. You can fine-tune this as you want; we will see those options soon in this tutorial.

Getting into the theory at a basic level

It's a distributed system, which lets it scale horizontally.

Terminology to know

Indexes: How is data stored in Elasticsearch? It is stored in indexes. If you come from relational databases, you can think of an index as a database: it's a logical container of data which you can reference by name in a query.

An index contains all the JSON documents you add to it, plus some additional metadata about each document.

Shards: Indexes are stored in shards, each of which is itself a complete Lucene index. A shard is the smallest unit of scale in Elasticsearch. An index must have at least one shard. However, and this is an important concept to understand, an index can live in multiple shards.

For instance:

I've created an Elasticsearch server instance, which is called a node, and when I create an index I can tell it to spread its data across several shards. Having data split across multiple shards seems cool, but does it really buy us much in terms of performance and scalability? Not much, because we are still working on a single node with a finite set of resources: RAM, CPU cycles, disk capacity. This is where Elasticsearch's distributed nature comes in. When we add another node to the setup, several things happen automatically: the node joins the cluster as a peer of the other node, the nodes gossip with each other, and the new node learns all the information about the cluster. Some shards are automatically moved to the new node in order to balance out the data that is stored and served. This process is called rebalancing.

Now a single index is stored on multiple nodes.
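To make the rebalancing idea concrete, here is a toy sketch in Python. It is not how Elasticsearch's real allocator works; it just spreads shard numbers over node names round-robin, and the node names and shard count are made up for illustration.

```python
def allocate(shards, nodes):
    """Spread shard ids over node names round-robin (toy model only)."""
    layout = {node: [] for node in nodes}
    for position, shard in enumerate(shards):
        layout[nodes[position % len(nodes)]].append(shard)
    return layout


shards = list(range(5))  # hypothetical index split into 5 shards

# One node: every shard competes for the same RAM, CPU and disk.
one_node = allocate(shards, ["node-1"])

# A second node joins: the shards get spread over both nodes.
two_nodes = allocate(shards, ["node-1", "node-2"])

print(one_node)
print(two_nodes)
```

With one node all five shards sit on node-1. With two nodes, node-1 keeps three shards and node-2 takes two, so the same index is now served by two machines.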

Replication: In addition to shards, Elasticsearch has replication. You can think of a replica as a copy of a shard. It's there for redundancy.

How does an index get created? It's created with a simple HTTP call.
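As a rough sketch of that HTTP call, the Python below builds (but does not send) a PUT request that creates an index with a chosen number of shards and replicas. The index name "articles", the host and the helper name are assumptions for illustration; number_of_shards and number_of_replicas are the index settings Elasticsearch accepts in the request body.

```python
import json
import urllib.request


def create_index_request(base_url, index, shards, replicas):
    """Build the PUT request that creates an index (hypothetical helper)."""
    body = {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        }
    }
    return urllib.request.Request(
        url=f"{base_url}/{index}",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = create_index_request("http://localhost:9200", "articles", shards=3, replicas=1)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Against a running node you would send it with urllib.request.urlopen(req).
```

Building the request separately from sending it makes it easy to check what will go over the wire before you point it at a real cluster.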

Elasticsearch can be schemaless, but you can also create a schema up front to get a better idea of how your information is going to be indexed.

To form a cluster

1. cd config/ and open the config file: vim elasticsearch.yml
2. change cluster.name: clusterName. When a node joins the cluster, it must have the same cluster name
3. set node.name to identify the node when troubleshooting
4. set path.logs; in prod it would point to some log system
5. set bootstrap.mlockall: true (renamed bootstrap.memory_lock in 5.x)

This locks the process memory at startup, which prevents the JVM's memory from being swapped out while it runs. This is important because you don't want the JVM to swap, which leads to performance degradation.

[Figure: Elasticsearch architecture]

6. set network.host: 0.0.0.0
7. discovery

In Elasticsearch 1.x multicast is enabled by default; when it's enabled, nodes can find each other and form a cluster on their own. (From 2.0 onwards multicast is a separate plugin and unicast is the default.) In production you disable multicast and use unicast, which lets only a predefined set of nodes join the cluster:

discovery.zen.ping.unicast.hosts: ["host1", "host2"]
discovery.zen.minimum_master_nodes: 3 (to prevent split brain)
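If you manage several nodes, it can help to generate these lines instead of typing them by hand. The sketch below renders the settings from the steps above as "key: value" lines. The cluster and node names are made-up values, and the helper is just an illustration, not an Elasticsearch tool.

```python
import json

# Hypothetical values; only the setting names come from the steps above.
settings = {
    "cluster.name": "my-cluster",
    "node.name": "node-1",
    "bootstrap.mlockall": True,
    "network.host": "0.0.0.0",
    "discovery.zen.ping.unicast.hosts": ["host1", "host2"],
    "discovery.zen.minimum_master_nodes": 3,
}


def render_settings(settings):
    """Render flat settings as 'key: value' lines.

    Values are written as JSON, which YAML also accepts for
    strings, numbers, booleans and lists.
    """
    return "\n".join(f"{key}: {json.dumps(value)}" for key, value in settings.items())


config_text = render_settings(settings)
print(config_text)
```

The printed lines can be pasted into elasticsearch.yml; every node in the same cluster would share the cluster.name and unicast host list and differ only in node.name.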

Master node and Data node

The master node manages the cluster state: which nodes are in the cluster, where they are located, and so on. When a master leaves, a new one is elected. At this point we need a quorum of master-eligible nodes to take part in the election. Otherwise you can end up with two elected masters (a split brain), and it is hard to recover from that state. This quorum is defined by the minimum_master_nodes option.

If you have a cluster of 5 master-eligible nodes, a quorum would be 3 nodes. By default each node can become a master because node.master: true. If the same node also holds data, it might look unresponsive while the JVM is doing garbage collection. If it is unresponsive for a long time, it will be dropped from the cluster and a new master is elected.
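The quorum arithmetic is small enough to sketch in Python. The function names here are my own; the rule itself is simply "more than half of the master-eligible nodes".

```python
def quorum(master_eligible):
    """Smallest majority of the master-eligible nodes."""
    return master_eligible // 2 + 1


def can_elect(partition_size, master_eligible):
    """A network partition may elect a master only if it holds a quorum."""
    return partition_size >= quorum(master_eligible)


print(quorum(5))                          # 3
print(can_elect(3, 5), can_elect(2, 5))   # True False
print(quorum(3))                          # 2
```

If a 5-node cluster splits into groups of 3 and 2, only the group of 3 can elect a master, so you never get two masters at once.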

Typically in production we avoid data nodes being masters by setting node.master: false on them. Instead, we have dedicated master nodes with node.master: true and node.data: false.

These master nodes don't have to serve searches, so they are less likely to be dropped from the cluster. They can be small machines since they don't do a lot of work, so they add little cost to the overall cluster.

Load balancer node

If the cluster gets really big, you can add a load balancer node. This load balancer node would have

node.master: false
node.data: false

So it can forward your request to appropriate nodes.
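To keep the combinations straight, here is a small Python sketch that maps the two flags to the kind of node described in this post. The function and the role labels are just my own summary, not something Elasticsearch exposes.

```python
def node_role(master, data):
    """Name the node type implied by node.master and node.data."""
    if master and data:
        return "master-eligible data node (the default)"
    if master:
        return "dedicated master node"
    if data:
        return "data-only node"
    return "load balancer node"


for master, data in [(True, True), (True, False), (False, True), (False, False)]:
    print(f"node.master: {master}, node.data: {data} -> {node_role(master, data)}")
```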

You can set the amount of memory Elasticsearch uses with ES_HEAP_SIZE, for example ES_HEAP_SIZE=256m. (This applies to 1.x and 2.x. In 5.x the heap is set in config/jvm.options or through ES_JAVA_OPTS instead.)

Normally this goes into /etc/sysconfig/elasticsearch (rpm package) or /etc/default/elasticsearch (deb package). A good value to start with is half of the memory available on your system; then you probably want to monitor it and tweak it to suit your actual node.

So with the heap size selected, start your Elasticsearch node:

ES_HEAP_SIZE=256m bin/elasticsearch

cat API: to get details of the nodes: curl localhost:9200/_cat/nodes

At this point we can set the minimum master nodes option through the API as well. In this case we change it as a persistent setting, which will survive a full cluster restart.
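A minimal sketch of that API call in Python, assuming a node on localhost:9200 and a version that still uses zen discovery (the versions covered in this post). It only builds the request to the cluster settings endpoint; the helper name is hypothetical.

```python
import json
import urllib.request


def min_master_nodes_request(base_url, value):
    """Build a PUT to /_cluster/settings that persists minimum_master_nodes."""
    body = {"persistent": {"discovery.zen.minimum_master_nodes": value}}
    return urllib.request.Request(
        url=f"{base_url}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


settings_req = min_master_nodes_request("http://localhost:9200", 2)
print(settings_req.get_method(), settings_req.full_url)
print(settings_req.data.decode("utf-8"))
# Against a running cluster you would send it with urllib.request.urlopen(settings_req).
```

Putting the value under "persistent" rather than "transient" is what makes it stick across a full cluster restart.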

Most production setups have 3 master nodes with minimum_master_nodes set to 2 for high availability, plus as many data nodes as needed.

We can also have load balancer nodes; typically we will have 2 of them to maintain high availability for serving your requests.

Saturday, 21 January 2017

Elastic Search - Introduction
