Monday 26 December 2016

Regular Expression: Part 1

Regular expressions are set of symbols to find patterns in strings, its not a programming language.

History of regular expression


Knowing the history of regular expression help us to have a better understanding of some key points in the way regular expression works. Surprisingly, the idea was not originated in the field of computer science. It finds, its way into computer science from neuroscience. 


In 1943, McCulloch and Pitts developed models describing how the human neverous system works [1] which was extended by Stephen Kleene in 1956,  he described these models with an algebra notation that he called regular sets/regular expression. 


Warren S. McCulloch                                                  Walter Pitts                                                           Stephen Kleen
       

In 1968, Ken Thompson a mathematician and one of the key developers of UNIX implemented regular expression inside the text editor called ed. That is the entry point of regX into computing world.

Ed Text Editor & History of Grep command: Ed has the functionality to search based on regular expression. For instance: The command like below was used to search for the word "pinky" in a text document.

g/pinky/p   i.e.(g/regular expression/p)

g and p are modifiers where g was telling the editor to search for the word through out the document and p was to print the results to the screen; global regular expression print, in short now we are calling it as grep. 
Alfred Aho

During 1970, Alfred Aho’s extended grep with whole bunch of new functionalities, modified syntax and named as egrep - extended grep. His program was able to translate any regular expression to a deterministic finite automaton. 

Now we have 2 flavours of regular expression which resulted in lot more incompatibilities. So it was standardised as POSIX (Portable operating system interface and X is just because it's in UNIX), in 1986. POSIX is designed to ensure compatibility between different operating systems. POSIX standard organised regular expression under two categories.


1) Basic regular expression                                                                            
2) Extended regular expressions

This division clearly defined set of rules about what symbols should means in each of these. However, its not expected that BRE's and ERE's ought to be interchangeable. BRE's was maintained for compatibility in old tools, and now it becomes mostly out of use. 
Henry Spencer

Meanwhile, Henry Spencer was writing a regex library in C programming language which stabilised the regular expressions as many programming languages used that as a base.
Larry Wall

In 1987, Larry Wall introduced Perl Programming language which uses C regex library but over time much more powerful functionalities were added to it. Many languages like C, C++, .NET, Java, JavaScript, MySql, PHP, Python, Ruby are perl compatible. However, they do have subtle difference between them.

C and C++, Java, JavaScript and ActionScript are based on standard ECMAScript whereas .NET, Perl, PHP is based on the PCRE library. Apache version 1 used POSIX EREs, version 2 uses PCRE library, MySQL uses POSIX EREs and so on.

Nowadays, You can get the same behaviour of egrep inside grep using the E option in grep.

References:


1) A LOGICAL CALCULUS OF THE IDEAS IMMANENT IN. NERVOUS ACTIVITY*. ▫ WARREN S. MCCULLOCH AND WALTER PITTS
3) Regular Expression Search Algorithm KEN THOMPSON Bell Telephone Laboratories, Inc., Murray Hill, New Jersey

No comments:

Post a Comment