My OCD's: December 2016

Saturday 31 December 2016

Trie Datastructure

Trie datastructure is used to store collection of strings. If you have a search problem with thousands and thousands of strings then you can use trie datastructure to store the string and perform search very efficiently on it. It helps to bring down the search complexity to the length of keys whereas if you use a well balanced binary search tree, it will take O(m*log n) where m is the maximum string length and n is the number of keys in the tree.

Its a good datastructure for storing the dictionary, used to do prefix based search and you can also use it to sort the strings lexographically another alternative for this is to use a hash table. But in a hash table you cannot do a prefix based search and a regular hash table takes more space than a trie.

Trie is a tree datastructure and its made up of collection of nodes. Every trie has two component a map where the key is the character and the value is a trieNode which is used to establish the parent child relationship. The below is the data structure for the trieNode.

TrieNode {
Map<Character, TrieNode> children
boolean endOfWord //Whether the character is the end of word or not
}

How to create a trieNode?

Lets look at the method to build a trie data structure for the word "abc", abd

Step 1: To form the trie tree for the word abc

Iteration 1

Iteration 2

Iteration 3

Step 2: To for trie tree for the word abd

Iteration 1

Code to form the Trie Node.

class TrieNode {
char c;
HashMap<Character, TrieNode> children = new HashMap<Character, TrieNode>();
boolean isLeaf;

public TrieNode() {}

public TrieNode(char c){
this.c = c;
}
}

public class TrieChain {
private TrieNode root;

public Trie() {
root = new TrieNode();
}

// Insert a string into the trie.
public void insert(String word) {
HashMap<Character, TrieNode> children = root.children;

for(int i=0; i<word.length(); i++){
char c = word.charAt(i);

TrieNode t;
if(children.containsKey(c)){
t = children.get(c);
}else{
t = new TrieNode(c);
children.put(c, t);
}

children = t.children;

//set leaf node
if(i==word.length()-1)
t.isLeaf = true;
}
}
}

Monday 26 December 2016

Regular Expression - Part 2

How regular expression works?

There are two type of regular expression engines. They are

Text-directed engine (or) Deterministic Finite Automatons (DFA)
Regex-directed engine (or) Nondeterministic Finite Automaton (NFA)

NFA is the commonly used interpreter as it is much more powerful for implementing advanced regular expressions. NFA is used in languages like Perl, Python, sed whereas DFA is commonly found in egrep and some implementation of awk. And they both differ in performance significantly, DFA works faster and more consistent than NFA.

In NFA the regular expression pattern will be compiled into byte-code and the engine executes the code one at a time, moving from one state to another. If the result in a state is failure then it backtracks to try another string to see if it matches the input. Lets go through an example to understand it better. So the job is to find a string "that" in a sentence.

Input sentence: The one who are crazy enough to think that they can change the world are the ones who do.

The regular expression would be: /that/

Give the task to match the string "that", the javascript engine moves to the first character to see if it matches the pattern, yes it is. Then it moves forward to the next character and see if it matches, yes it is again. Then it moves to the next character "e", oops! failed to match. So it backtrack to position 2 to see if the character "h" forms the start of a pattern match, Nope, its not... Similarly, it keeps moving back and forth until it find the exact match "that" at position 39-42. The regular expression engine is a eager algorithm which return a match as soon as it find one and terminate the program, unless /g is specified in the regex to keep moving forward.

Appendix

What is finite state automata?
Its a mathematical model for computation, represented using states. At any point in time, a program can be in only one state. They are extremely powerful for parsers which decide if a given string matches a specific pattern or not.

Regular Expression: Part 1

Regular expressions are set of symbols to find patterns in strings, its not a programming language.

History of regular expression

Knowing the history of regular expression help us to have a better understanding of some key points in the way regular expression works. Surprisingly, the idea was not originated in the field of computer science. It finds, its way into computer science from neuroscience.

In 1943, McCulloch and Pitts developed models describing how the human neverous system works [1] which was extended by Stephen Kleene in 1956, he described these models with an algebra notation that he called regular sets/regular expression.

Warren S. McCulloch Walter Pitts Stephen Kleen

In 1968, Ken Thompson a mathematician and one of the key developers of UNIX implemented regular expression inside the text editor called ed. That is the entry point of regX into computing world.

Ed Text Editor & History of Grep command: Ed has the functionality to search based on regular expression. For instance: The command like below was used to search for the word "pinky" in a text document.

g/pinky/p i.e.(g/regular expression/p)

g and p are modifiers where g was telling the editor to search for the word through out the document and p was to print the results to the screen; global regular expression print, in short now we are calling it as grep.

Alfred Aho

During 1970, Alfred Aho’s extended grep with whole bunch of new functionalities, modified syntax and named as egrep - extended grep. His program was able to translate any regular expression to a deterministic finite automaton.

Now we have 2 flavours of regular expression which resulted in lot more incompatibilities. So it was standardised as POSIX (Portable operating system interface and X is just because it's in UNIX), in 1986. POSIX is designed to ensure compatibility between different operating systems. POSIX standard organised regular expression under two categories.

1) Basic regular expression
2) Extended regular expressions

This division clearly defined set of rules about what symbols should means in each of these. However, its not expected that BRE's and ERE's ought to be interchangeable. BRE's was maintained for compatibility in old tools, and now it becomes mostly out of use.

Henry Spencer

Meanwhile, Henry Spencer was writing a regex library in C programming language which stabilised the regular expressions as many programming languages used that as a base.

Larry Wall

In 1987, Larry Wall introduced Perl Programming language which uses C regex library but over time much more powerful functionalities were added to it. Many languages like C, C++, .NET, Java, JavaScript, MySql, PHP, Python, Ruby are perl compatible. However, they do have subtle difference between them.

C and C++, Java, JavaScript and ActionScript are based on standard ECMAScript whereas .NET, Perl, PHP is based on the PCRE library. Apache version 1 used POSIX EREs, version 2 uses PCRE library, MySQL uses POSIX EREs and so on.

Nowadays, You can get the same behaviour of egrep inside grep using the E option in grep.

References:

1) A LOGICAL CALCULUS OF THE IDEAS IMMANENT IN. NERVOUS ACTIVITY*. ▫ WARREN S. MCCULLOCH AND WALTER PITTS

2) Repr esentation of events in nerve nets and finite automata by S.C.Kleene

3) Regular Expression Search Algorithm KEN THOMPSON Bell Telephone Laboratories, Inc., Murray Hill, New Jersey

Sunday 25 December 2016

Docker commands cheetsheet

Most frequently used commands in docker

To view running containers

sudo docker ps

To view the stopped and running containers

sudo docker ps -a

To view all the docker images

sudo docker images

To start the docker container

sudo docker start [container ID]

To stop the container

sudo docker stop [container ID]

To run tomcat image

sudo docker run -d -p 8081:8080 tomcat:8.0

[-d for detached mode, -p to sepcify the ports, -P to automatically assign the port]

To go inside the running container

sudo docker exec -i -t [container ID] /bin/bash

[-i to connect to the STDIN of the container, -t for pseudo terminal]

To view the container details

sudo docker inspect [container ID]

To copy a file from host to the container

sudo docker cp file-web.war [container ID]:/usr/local/webapps

To build an image from the dockerfile

sudo docker build -t imageName

To commit a docker image

sudo docker commit [containerID] [imageName]:[tag]

To kill a docker container

sudo docker kill [containerID]

Mystery calculator

I got the game binary card trick from a Christmas cracker. And I was really intrigued by the math behind it. Lets see how it works.

It consist of 6 cards, printed with a series of numbers. Show all the cards to your friends and ask them to select a number from one of the card. Then flash the card one by one and ask them if the selected number is present in the card. Take all the cards on which the number is present, add the number from the top left-hand corner on each card. The total is the number your friend has selected.

Table to form the card

	Binary	Card No.		Binary	Card No.
1	00000001	1	33	00100001	1, 6
2	00000010	2	34	00100010	2, 6
3	00000011	1, 2	35	00100011	1, 2, 6
4	00000100	3	36	00100100	3, 6
5	00000101	1, 3	37	00100101	1, 3, 6
6	00000110	2, 3	38	00100110	2, 3, 6
7	00000111	1, 2, 3	39	00100111	1, 2, 3, 6
8	00001000	4	40	00101000	4, 6
9	00001001	1, 4	41	00101001	1, 4, 6
10	00001010	2, 4	42	00101010	2, 4, 6
11	00001011	1, 2, 4	43	00101011	1, 2, 4, 6
12	00001100	3, 4	44	00101100	3, 4, 6
13	00001101	1, 3, 4	45	00101101	1, 3, 4, 6
14	00001110	2, 3, 4	46	00101110	2, 3, 4, 6
15	00001111	1, 2, 3, 4	47	00101111	1, 2, 3, 4, 6
16	00010000	5	48	00110000	5, 6
17	00010001	1, 5	49	00110001	1, 5, 6
18	00010010	2, 5	50	00110010	2, 5, 6
19	00010011	1, 2, 5	51	00110011	1, 2, 5, 6
20	00010100	3, 5	52	00110100	3, 5, 6
21	00010101	1, 3, 5	53	00110101	1, 3, 5, 6
22	00010110	2, 3, 5	54	00110110	2, 3, 5, 6
23	00010111	1, 2, 3, 5	55	00110111	1, 2, 3, 5, 6
24	00011000	4, 5	56	00111000	4, 5, 6
25	00011001	1, 4, 5	57	00111001	1, 4, 5, 6
26	00011010	2, 4, 5	58	00111010	2, 4, 5, 6
27	00011011	1, 2, 4, 5	59	00111011	1, 2, 4, 5, 6
28	00011100	3, 4, 5	60	00111100	3, 4, 5, 6
29	00011101	1, 3, 4, 5	61	00111101	1, 3, 4, 5, 6
30	00011110	2, 3, 4, 5	62	00111110	6, 2, 3, 4, 5, 6
31	00011111	1, 2, 3, 4, 5	63	00111111	1, 2, 3, 4, 5, 6
32	00100000	6

Say for instance, pinky selected the number 41. It's available in card 1, 4 and 6, adding the number on left-hand corner will give you 1+8+32 = 41. Thats really awesome isn't!

How it works?
Have a look at the number in top corner 1, 2, 4, 8, 16, 32. Do you see a pattern?

Thats right, they are powers of 2.
2^0, 2^1, 2^2, 2^3, 2^4, 2^5

Although the number written in those cards looks like randomly selected ones, they are specifically chosen to appear on the card between 1 to 63. All the available numbers in the card can be written using power of 2 because binary/base-2 numbers system used to make up the cards.

How to make up the cards?
Write down the binary expansion of numbers from 1 to 63 as shown in the above table. The numbers in card 1 are the numbers which has one at the 1st position (starting from right to left), numbers in card 2 are those have one at 2nd position and so on.

Since we have six cards, the number 63 is chosen as the max 2^8-1. If we want to add one more card the max number to discover would be 2^9-1 and so on.

Reference:

http://www.norsemathology.org/wiki/index.php?title=Binary_Cards#The_Prime_Card_Trick

Saturday 24 December 2016

User-Agent Header

The user-agent is a line of text for identifying the browser you are using, its version number and some system details such as operating system, its version by the hosting web server. When a browser connects to a website, it will add the user-agent field in HTTP header for all the requests by default.

You can imagine the user-agent is the way of introducing himself to the hosting web server. Even though the concept sounds very simple, the logic around has grown increasingly complicated over the years.

JS to get the user-agent details of your browser:

javascript:alert(navigator.userAgent)

History of user-agent naming convention

Mosaic is a discontinued early web browser which has been credited for popularizing the world wide web.

Then came the new browser Mozilla (Mosaic Killer)/Netscape which supports frames and it called itself as Mozilla/1.0(Win3.1).

User-Agent Sniffing: Since Mosaic doesn't have support for frames, "user agent sniffing" started to sent out frames only to Netscape browsers.

Then came Internet Explorer which has support for frames. In order to get the pages written for netscape browsers, IE called itself as Mozilla/1.22(MSIE 2.0; Windows 95) to declare that they are Mozilla compatible. This shortcut is the starting point for the chaos in user-agent string.

IE overtook Netscape for a while. Then Netscape was reborn as Mozilla with a whole bunch of new functionalities built with the rendering engine Gecko and called itself as Mozilla/5.0 (Windows NT 5.0; en-US; rv:1.1) Gecko/20020826.

Then Mozilla was renamed as FireFox and called itself as Mozilla/5.0 (Windows; U; Windows NT 5.1; sv-SE; rv:1.7.5) Gecko/20041108 Firefox/1.0.

Then a whole bunch of other browsers were invented, each pretending to be Mozilla and all of them were powered by Gecko. Since Gecko was so powerful it over took IE. So sinffing was reborn to give mordern web code only to Gecko supported browsers.

And this point, Linux has its own browser Konqueror whose search engine was KHTML which is as good as Gecko but the mordern web code was not given to it. So they pretend to call themselves as Gecko Mozilla/5.0 (compatible; Konqueror/3.2; FreeBSD) (KHTML, like Gecko) which resulted in much more confusion.

Then comes the Opera, they thought users should be give the choice to select the browser they like to impersonate. So depending on the user selection they called themselves as Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; en) Opera 9.51, or Mozilla/5.0 (Windows NT 6.0; U; en; rv:1.8.1) Gecko/20061208 Firefox/2.0.0 Opera 9.51, or Opera/9.51 (Windows NT 5.1; U; en).

Then Apple built Safari with lot more features on top of KHTML and named the rendering engine as WebKit. To get the pages written for KHTML, safari called itself as Mozilla/5.0 (Macintosh; U; PPC Mac OS X; de-de) AppleWebKit/85.7 (KHTML, like Gecko) Safari/85.5. This made the user-agent even messier.

Then IE returned with abunch of upates and called itself as Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.0) to get the features supported on other browsers.

Then comes the Chrome with Webkit rendering engine, like Safari to get the pages built for Safari it pretended to be Safari.

Since,

Chrome prentend to be Safari

Safari pretend to be Konqueror

Konqueror pretend to be Firefox

All browsers pretend to be Mozilla

Chrome called itseld as Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

Which means chrome support features implemented for Mozilla, KHTML, Gecko, Safari

user-agent: Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36

Some random examples of approximating browser from user agent

Safari:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50

Chrome:

Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36

Internet Explorer:

1) Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0)

2) Mozilla/5.0 (Windows NT 6.3; Trident/7.0; rv:11.0) like Gecko

In the above case: Option 1 is Internet Explorer 9. Option 2 is Internet Explorer 11.

Reference:

http://www.useragentstring.com/index.php