March 15, 2006


An example of Regular Expression Greediness


phoe6 said...

Did not understand.
If you can explain it to me it would helpful. I dont know about awk, but believe the regexes would be same or similar.

Kannappan said...

The linked gg article tries to explain the subtle difference in how we perceive the greediness in regex.

Consider the below
I/p string : aabaaaa
Function : sub(/a*/, "")

The function sub stands for substitution. It searches for the function's first argument in the input string and substitutes it with the second argument. Here, the first argument is a* and the second argument is an empty string. Considering that regex is greedy, we may imagine that the longest a string will be replaced by empty string. In this case the longest a string is after the character b. So we may think that on executing the function on the input string the output will be aab.

But what the function actually does is, It just matches the first appearance of a followed by zero or more a's in the string and replaces it with an empty string. So the resulting output will be baaaa.

Then you may ask where is the greediness?.

The next example illustrates the greediness.

I/p string : the <>only<> way
Function : sub(/<.*>/, "")

Correct Output : the way
Wrong Output : the only<> way

Reason : The function contains the first argument as the character < followed by . metacharacter followed by the character >.

First the < character is matched in the string, then the . metacharacter keeps matching any character that comes its way, it encounters> at the first step and instead of stopping there, it keeps on greedily matching any character that comes its way, as it understands that there is one more > at the end. As soon as it encounters the last >, it stops matching and replaces all its matches with the empty string.

Kannappan said...

Additional Example
I/p String : We rrrode the starrr and followed itss trrails

Function One : sub(/r*r/, "")
O/p : We ode the starrr and followed itss trrails

Function Two : sub(/r.*r/, "")
O/p : We ails