Famous Perl One-Liners Explained, Part VII: Handy Regular Expressions

Posted: 10 Nov 2011 03:43 AM PST

This is the seventh part of a nine-part article on famous Perl one-liners. Perl is not Perl without regular expressions, therefore in this part I will come up with and explain various Perl regular expressions. Please see part one for the introduction of the series.

Famous Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.

The article on famous Perl one-liners consists of nine parts:

Part I: File spacing.
Part II: Line numbering.
Part III: Calculations.
Part IV: String creation and array creation.
Part V: Text conversion and substitution.
Part VI: Selective printing and deleting of certain lines.
Part VII: Handy regular expressions (this part).
Part VIII: Release of perl1line.txt.
Part IX: Release of Perl One-Liners e-book.

After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it!

And here are today's one-liners:

109. Check if the string looks like an email.

 /.+@.+\..+/

This regex makes sure that the string looks like an email address. Notice that I say "looks like". It doesn't guarantee that it is an email address. Here is how it works: first it matches something up to the @ symbol, then it matches as much as possible until it finds a dot, and then it matches some more. If this succeeds, then it's something that at least looks like an email address, with an @ symbol and a dot in it.

For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't match the dot \. that is necessary.
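Here is a minimal sketch of the check wrapped in a subroutine. The name looks_like_email is made up for illustration; only the regex comes from the one-liner above.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the string has the rough shape
# something@something.something. It does not validate the address.
sub looks_like_email {
    my ($str) = @_;
    return $str =~ /.+@.+\..+/ ? 1 : 0;
}

for my $candidate ('cats@catonmat.net', 'cats@catonmat') {
    printf "%-20s %s\n", $candidate,
        looks_like_email($candidate) ? 'looks like an email' : 'no match';
}
```

The test strings are single-quoted so that Perl doesn't try to interpolate @catonmat as an array.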

110. Check if the string is a number.

 /^\d+$/

This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $.

For example, 12345 matches but 123x4 doesn't because \d doesn't match character x.

How about hexadecimal numbers? Here is how:

 /^0x[0-9a-f]+$/i

This matches the hex prefix 0x followed by the hex number itself, which is one or more hex digits [0-9a-f]+. The /i flag at the end makes the match case insensitive. For example, 0x5af matches and 0X5Fa matches, but 97 doesn't, because it's just a decimal number.

Now how about octal? Here is how:

 /^0[0-7]+$/

Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because it's not a valid octal number.

Finally binary:

 /^[01]+$/

Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.
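The four patterns overlap: a string such as 010101 is made of decimal digits, octal digits and binary digits all at once. The sketch below reports every notation a string fits. The helper name matching_notations and the @checks table are assumptions made for illustration, and the hex pattern is written with a + after the character class so that multi-digit values match.

```perl
use strict;
use warnings;

# Name/pattern pairs, kept in an array so the reporting order is stable.
my @checks = (
    [ decimal => qr/^\d+$/ ],
    [ hex     => qr/^0x[0-9a-f]+$/i ],
    [ octal   => qr/^0[0-7]+$/ ],
    [ binary  => qr/^[01]+$/ ],
);

# Hypothetical helper: returns the names of all patterns that match.
sub matching_notations {
    my ($str) = @_;
    return map { $_->[0] } grep { $str =~ $_->[1] } @checks;
}

for my $s (qw(12345 123x4 0x5af 0X5Fa 013 09 010101 210101)) {
    my @names = matching_notations($s);
    printf "%-8s %s\n", $s, @names ? join(', ', @names) : 'no match';
}
```

Since the patterns only check the shape of the string, deciding which base was actually meant is left to the caller.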

111. Check if a word appears twice in the string.

 /(word).*\1/

This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/

For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
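To check for an arbitrary word held in a variable, the word can be interpolated into the pattern. This is a sketch under the assumption that the word should be treated as literal text, which is why it is wrapped in \Q...\E; the name appears_twice is made up.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word occurs at least twice in $str.
# \Q...\E quotes any regex metacharacters in $word.
sub appears_twice {
    my ($word, $str) = @_;
    return $str =~ /(\Q$word\E).*\1/ ? 1 : 0;
}

print appears_twice('silly', 'silly things are silly'),  "\n";   # prints 1
print appears_twice('silly', 'silly things are boring'), "\n";   # prints 0
```

Because . doesn't match a newline by default, both occurrences have to be on the same line unless the /s flag is added.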

112. Increase all numbers by one in the string.

 $str =~ s/(\d+)/$1+1/ge

Here we use the substitution operator s///. It matches all numbers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.

For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
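A small sketch that wraps the substitution in a subroutine, so the original string stays untouched. The name increment_numbers is an assumption for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with every run of
# digits replaced by its numeric value plus one.
sub increment_numbers {
    my ($str) = @_;
    $str =~ s/(\d+)/$1+1/ge;    # /e evaluates $1+1 as Perl code
    return $str;
}

print increment_numbers('this 1234 is awesome 444'), "\n";
# prints "this 1235 is awesome 445"
```

Since $1+1 is evaluated numerically, leading zeros are not preserved: 007 becomes 8, not 008.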

113. Extract HTTP User-Agent string from the HTTP headers.

 /^User-Agent: (.+)$/

HTTP headers are formatted as Key: Value pairs. It's very easy to parse such strings: you just instruct the regex engine to save the Value part in the $1 group variable.

For example, if the HTTP headers contain,

 Host: localhost:8000
 Connection: keep-alive
 User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
 Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
 Accept-Encoding: gzip,deflate,sdch
 Accept-Language: en-US,en;q=0.8
 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Then the regular expression, applied to each header line in turn, will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
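If the headers sit in one multi-line string rather than being processed line by line, the ^ and $ anchors need the /m flag so that they match at line boundaries. The sketch below assumes "\n" line endings and a shortened set of headers; the name user_agent is made up.

```perl
use strict;
use warnings;

# Assumption: all headers are held in one string, separated by "\n".
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# Hypothetical helper: returns the User-Agent value, or undef if the
# header is absent. /m lets ^ and $ match at the start and end of lines.
sub user_agent {
    my ($text) = @_;
    return $text =~ /^User-Agent: (.+)$/m ? $1 : undef;
}

print user_agent($headers), "\n";
```

With "\r\n" line endings, as used on the wire, the captured value would end in a carriage return that has to be stripped separately.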

114. Match something that looks like an IP address.

 /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/

This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP: four numbers of one to three digits each, separated by dots. For example, it matches a valid IP such as 81.198.240.140, and it also matches an invalid IP such as 923.844.1.999.

Here is how it works. The ^ at the beginning of regex is an anchor that matches the beginning of string. Next \d{1,3} matches one, two or three consecutive digits. The \. matches a dot. The $ at the end is an anchor that matches the end of the string. It's important to use both ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.

This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:

 /^(\d{1,3}\.){3}\d{1,3}$/
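To go one step further and reject values such as 923.844.1.999, the four numbers can be captured and compared against 255. This is only a sketch of one possible approach; the name is_valid_ipv4 is an assumption, and the regex uses four separate capture groups because a repeated group like (\d{1,3}\.){3} keeps only its last repetition.

```perl
use strict;
use warnings;

# Hypothetical helper: shape check with the regex, then a numeric
# range check on each of the four captured numbers.
sub is_valid_ipv4 {
    my ($str) = @_;
    my @octets = $str =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
    return 0 unless @octets == 4;            # the shape didn't match
    my $too_big = grep { $_ > 255 } @octets; # count of out-of-range numbers
    return $too_big ? 0 : 1;
}

for my $ip ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar') {
    printf "%-18s %s\n", $ip, is_valid_ipv4($ip) ? 'valid' : 'invalid';
}
```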

115. Match printable ASCII characters.

 /[ -~]/

This is really tricky and smart. To understand it, take a look at man ascii. You'll see that space starts at value 0x20 and the ~ character is 0x7e. All the characters between a space and ~ are printable. This regular expression matches exactly that. The [ -~] defines a range of characters from space till ~. This is my favorite regexp of all time.

116. Match unprintable ASCII characters.

 /[^ -~]/

Here we invert the previous regular expression. Placing ^ as the first character inside of [...] inverts everything that it would originally match.
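Two things these character classes are handy for are checking that a string is entirely printable and masking the characters that are not. The sketch below shows both; the names all_printable and mask_unprintable are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: true if no character falls outside space..~
sub all_printable {
    my ($str) = @_;
    return $str =~ /[^ -~]/ ? 0 : 1;
}

# Hypothetical helper: replaces every unprintable character with a dot.
sub mask_unprintable {
    my ($str) = @_;
    $str =~ s/[^ -~]/./g;
    return $str;
}

print all_printable('hello, world'), "\n";    # prints 1
print all_printable("tab\there"),    "\n";    # prints 0
print mask_unprintable("a\tb\n"),    "\n";    # prints "a.b."
```

Note that tab and newline are outside the space-to-~ range, so they count as unprintable here.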

117. Match text between two HTML tags.

 m|<strong>([^<]*)</strong>|

This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.

Alternatively you can write:

 m|<strong>(.*?)</strong>|

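But this is a little different. For example, if the HTML is <strong><em>hello</em></strong> then the first regex doesn't match at all, because <strong> is immediately followed by a <, so ([^<]*) matches nothing and the text that follows is <em>, not </strong>. The second regex matches and captures <em>hello</em>, because (.*?) matches as little as possible until it finds </strong>.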

However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
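The difference between the two patterns shows up as soon as the markup is nested. Here is a quick sketch that runs both against a plain and a nested sample; the sample strings and the names between_negated and between_lazy are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helpers: each returns the captured text, or undef
# when its pattern doesn't match.
sub between_negated {
    my ($html) = @_;
    return $html =~ m|<strong>([^<]*)</strong>| ? $1 : undef;
}

sub between_lazy {
    my ($html) = @_;
    return $html =~ m|<strong>(.*?)</strong>| ? $1 : undef;
}

for my $html ('<strong>hello</strong>', '<strong><em>hello</em></strong>') {
    my $negated = between_negated($html);
    my $lazy    = between_lazy($html);
    printf "%s\n  [^<]* : %s\n  .*?   : %s\n", $html,
        defined $negated ? $negated : '(no match)',
        defined $lazy    ? $lazy    : '(no match)';
}
```

On the plain sample both capture hello. On the nested sample the negated character class fails to match, while the lazy version captures the inner markup, tags included.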

118. Extract all matches from a regular expression.

 my @matches = $text =~ /regex/g;

Here the regular expression gets evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable.

For example, the following regex extracts all numbers from a string:

 my $text = "10 hello 25 moo 31 foo";
 my @nums = $text =~ /\d+/g;

@nums now contains (10, 25, 31).
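The same list-context trick has a couple of useful variations, sketched below: with capture groups the match returns the captured pieces instead of the whole matches, and assigning through an empty list gives just the count. The sample strings and the name all_numbers are assumptions made for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns every run of digits found in the text.
sub all_numbers {
    my ($text) = @_;
    my @nums = $text =~ /\d+/g;
    return @nums;
}

my $sample = '10 hello 25 moo 31 foo';
my @found  = all_numbers($sample);
print "@found\n";                        # prints "10 25 31"

# With capture groups, list context returns the captured pieces,
# which here fill a hash as key/value pairs.
my $config = 'a=1 b=2 c=3';
my %pairs  = $config =~ /(\w+)=(\w+)/g;
print "b is $pairs{b}\n";                # prints "b is 2"

# Assigning through an empty list yields the number of matches.
my $count = () = $sample =~ /\d+/g;
print "$count numbers\n";                # prints "3 numbers"
```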

119. Test if a number is in range 0-255.

 /^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/

Here is how it works. A number can be one, two or three digits long. If it's a one-digit number, we allow it to be anything [0-9]. If it's a two-digit number, we also allow any combination [0-9][0-9]. A three-digit number, however, has to be one-hundred-something or two-hundred-something. If it starts with 1, the remaining two digits can be anything, so 1[0-9][0-9] covers 100-199. If it starts with 2, the tens digit decides: 2[0-4][0-9] covers 200-249 and 25[0-5] covers 250-255. Note that a shorter pattern such as [12][0-5][0-5] is tempting but wrong, because it rejects valid numbers like 199 and 167.
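A range regex like this is easy to get subtly wrong, so it's worth testing it against a plain numeric comparison. The sketch below spells out one alternative per range (0-9, 10-99, 100-199, 200-249, 250-255) and checks every integer from 0 to 300; the name in_range_0_255 is an assumption made for illustration.

```perl
use strict;
use warnings;

# One alternative per range: 0-9, 00-99, 100-199, 200-249, 250-255.
my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Hypothetical helper: true if the string is a number in 0..255.
sub in_range_0_255 {
    my ($n) = @_;
    return $n =~ $in_range ? 1 : 0;
}

# Collect every integer where the regex and the numeric check disagree.
my @mismatches = grep { in_range_0_255($_) != ($_ <= 255 ? 1 : 0) } 0 .. 300;

print @mismatches
    ? "mismatches: @mismatches\n"
    : "regex agrees with the numeric check for 0..300\n";
```

The two-digit alternative also accepts strings with a leading zero such as 07, which may or may not be wanted.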

120. Replace all <b> tags with <strong>.

 $html =~ s|<(/)?b>|<$1strong>|g

Here I assume that the HTML is in variable $html. Next the <(/)?b> matches the opening and closing <b> tags, captures the optional closing tag slash in group $1 and then replaces the matched tag with either <strong> or </strong>, depending on whether it was an opening or closing tag.
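Here is a small sketch of the substitution wrapped in a subroutine. It uses a slight variation of the pattern, (/?) instead of (/)?, so that $1 is an empty string rather than undef for opening tags, which keeps it quiet under use warnings. The name b_to_strong is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the HTML with <b> and </b>
# replaced by <strong> and </strong>.
sub b_to_strong {
    my ($html) = @_;
    # (/?) always participates in the match, so $1 is '' or '/'.
    $html =~ s|<(/?)b>|<$1strong>|g;
    return $html;
}

print b_to_strong('<b>bold</b> and <b>more</b>'), "\n";
# prints "<strong>bold</strong> and <strong>more</strong>"
```

Tags with attributes or different casing, such as <B> or <b class="x">, are left alone by this pattern.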