2011/11/10

Famous Perl One-Liners Explained, Part VII: Handy Regular Expressions - good coders code, great reuse


Famous Perl One-Liners Explained, Part VII: Handy Regular Expressions

Posted: 10 Nov 2011 03:43 AM PST


This is the seventh part of a nine-part article on famous Perl one-liners. Perl is not Perl without regular expressions, so in this part I will come up with and explain various handy Perl regular expressions. Please see part one for the introduction of the series.

Famous Perl one-liners is my attempt to create "perl1line.txt" that is similar to "awk1line.txt" and "sed1line.txt" that have been so popular among Awk and Sed programmers, and Unix sysadmins. I will release the perl1line.txt in the next part of the series.

The article on famous Perl one-liners consists of nine parts:

After I am done with the next part of the article, I will release the whole article series as a pdf e-book! Please subscribe to my blog to be the first to get it!

And here are today's one-liners:

109. Check if the string looks like an email.

 /.+@.+\..+/ 

This regex makes sure that the string looks like an email address. Notice that I say "looks like": it doesn't guarantee that the string is a valid email address. Here is how it works. First .+@ matches one or more characters followed by the @ symbol, then .+\. matches as much as possible up to a dot, and then the final .+ matches at least one more character. If this succeeds, then the string at least looks like an email address, with an @ symbol and a dot somewhere after it.

For example, cats@catonmat.net matches but cats@catonmat doesn't because the regex can't match the dot \. that is necessary.
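To try the pattern on a few strings, a minimal sketch could look like the following. The helper name looks_like_email and the third test string are made up for illustration; only the regex comes from the one-liner.

```perl
use strict;
use warnings;

# Hypothetical helper: wraps the one-liner's pattern and returns 1 or 0.
sub looks_like_email {
    my ($str) = @_;
    return $str =~ /.+@.+\..+/ ? 1 : 0;
}

# Single quotes keep Perl from trying to interpolate @catonmat as an array.
for my $candidate ('cats@catonmat.net', 'cats@catonmat', 'not an email') {
    printf "%-20s %s\n", $candidate,
        looks_like_email($candidate) ? 'looks like email' : 'no match';
}
```

Because the pattern is not anchored, it also accepts strings that merely contain something email-like, which is consistent with the "looks like" caveat.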

110. Check if the string is a number.

 /^\d+$/ 

This regex matches one or more digits \d starting at the beginning of the string ^ and ending at the end of the string $.

For example, 12345 matches but 123x4 doesn't because \d doesn't match character x.

How about hexadecimal numbers? Here is how:

 /^0x[0-9a-f]+$/i 

This matches the hex prefix 0x followed by one or more hex digits. The /i flag at the end makes the match case insensitive. For example, 0x5af matches and 0X5Fa matches, but 97 doesn't, because it has no 0x prefix and is just a decimal number.

Now how about octal? Here is how:

 /^0[0-7]+$/ 

Octal numbers are prefixed by 0, which is followed by octal digits 0-7. For example, 013 matches but 09 doesn't, because 9 is not a valid octal digit.

Finally binary:

 /^[01]+$/ 

Binary base consists of just 0s and 1s. For example, 010101 matches but 210101 doesn't, because 2 is not a valid binary digit.
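The four number patterns overlap: a string made of 0s and 1s is also a string of decimal digits, for instance. The sketch below reports every pattern that a string matches. The %pattern table and the matching_bases helper are hypothetical names, and the hex pattern is written with a + after the character class so that more than one hex digit is accepted.

```perl
use strict;
use warnings;

# Hypothetical lookup table: one pattern per base discussed above.
my %pattern = (
    decimal => qr/^\d+$/,
    hex     => qr/^0x[0-9a-f]+$/i,
    octal   => qr/^0[0-7]+$/,
    binary  => qr/^[01]+$/,
);

# Returns the names of all bases whose pattern matches, in alphabetical order.
sub matching_bases {
    my ($str) = @_;
    return grep { $str =~ $pattern{$_} } sort keys %pattern;
}

for my $s (qw(12345 123x4 0x5af 0X5Fa 013 09 010101 210101)) {
    my @bases = matching_bases($s);
    printf "%-8s %s\n", $s, @bases ? join(', ', @bases) : 'no match';
}
```

Here 013 is reported as both decimal and octal, so which interpretation is meant has to be decided by the caller, not by the regexes.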

111. Check if a word appears twice in the string.

 /(word).*\1/ 

This regex matches word followed by something or nothing at all, followed by the same word. Here the (word) captures the word in group 1 and \1 refers to contents of group 1, therefore it's almost the same as writing /(word).*word/

For example, silly things are silly matches /(silly).*\1/, but silly things are boring doesn't, because silly is not repeated in the string.
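When the word comes from a variable, it can be interpolated into the pattern. The following is a sketch under a few assumptions of my own: the helper name appears_twice is made up, \Q...\E is added so that regex metacharacters in the word are taken literally, and \b word boundaries are added so that the word is not found inside a longer word. It assumes the word begins and ends with a word character.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $word occurs at least twice in $str
# as a whole word on the same line (. does not match a newline).
sub appears_twice {
    my ($word, $str) = @_;
    return $str =~ /\b(\Q$word\E)\b.*\b\1\b/ ? 1 : 0;
}

for my $str ('silly things are silly',
             'silly things are boring',
             'sillyness is silly') {
    printf "%-26s %s\n", $str,
        appears_twice('silly', $str) ? 'repeated' : 'not repeated';
}
```

The third string contains the letters "silly" twice, but only once as a whole word, so with the added \b anchors it is reported as not repeated. The plain /(silly).*\1/ would match it.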

112. Increase all numbers by one in the string.

 $str =~ s/(\d+)/$1+1/ge 

Here we use the substitution operator s///. It matches all numbers (\d+), puts them in capture group 1, then it replaces them with their value incremented by one $1+1. The g flag makes sure it finds all the numbers in the string, and the e flag evaluates $1+1 as a Perl expression.

For example, this 1234 is awesome 444 gets turned into this 1235 is awesome 445.
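Wrapped in a small function, the substitution can be tried on several strings. This is only a sketch; the name increment_numbers and the extra sample strings are invented for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the string with every run of
# digits replaced by its numeric value plus one.
sub increment_numbers {
    my ($str) = @_;
    $str =~ s/(\d+)/$1+1/ge;   # /e evaluates $1+1 as Perl code, /g repeats
    return $str;
}

print increment_numbers("this 1234 is awesome 444"), "\n";
print increment_numbers("agent 007"), "\n";
print increment_numbers("version 1.9"), "\n";
```

Since $1+1 is ordinary numeric addition, leading zeros are lost (007 becomes 8), and each run of digits is treated separately, so 1.9 becomes 2.10 rather than 2.9.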

113. Extract HTTP User-Agent string from the HTTP headers.

 /^User-Agent: (.+)$/ 

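HTTP headers are formatted as Key: Value pairs. Such strings are very easy to parse: you just instruct the regex engine to save the Value part in the $1 group variable. The ^ and $ anchors assume that you match one header line at a time; to search a whole block of headers held in a single string, add the /m flag so that they match at every line boundary.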

For example, if the HTTP headers contain

 Host: localhost:8000
 Connection: keep-alive
 User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)
 Accept: application/xml,application/xhtml+xml,text/html;q=0.9,text/plain;q=0.8,image/png,*/*;q=0.5
 Accept-Encoding: gzip,deflate,sdch
 Accept-Language: en-US,en;q=0.8
 Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.3

Then the regular expression will extract the Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US) string.
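If the headers are held in one multi-line string, the pattern needs the /m flag so that ^ and $ match at line boundaries. The sketch below assumes exactly that setup: the user_agent helper is a made-up name, the header block is shortened, and lines are assumed to be separated by a plain "\n" (with "\r\n" line endings the capture would include a trailing "\r").

```perl
use strict;
use warnings;

# Shortened header block, joined with "\n" for this example.
my $headers = join "\n",
    'Host: localhost:8000',
    'Connection: keep-alive',
    'User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_0_0; en-US)',
    'Accept-Encoding: gzip,deflate,sdch';

# Hypothetical helper: returns the User-Agent value, or undef if absent.
sub user_agent {
    my ($text) = @_;
    return $text =~ /^User-Agent: (.+)$/m ? $1 : undef;
}

my $ua = user_agent($headers);
print defined $ua ? "User-Agent is: $ua\n" : "no User-Agent header\n";
```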

114. Match something that looks like an IP address.

 /^\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}$/ 

This regex doesn't guarantee that the thing that got matched is in fact a valid IP. All it does is match something that looks like an IP: four numbers of one to three digits each, separated by dots. For example, it matches the valid IP 81.198.240.140 and it also matches an invalid IP such as 923.844.1.999.

Here is how it works. The ^ at the beginning of regex is an anchor that matches the beginning of string. Next \d{1,3} matches one, two or three consecutive digits. The \. matches a dot. The $ at the end is an anchor that matches the end of the string. It's important to use both ^ and $ anchors, otherwise strings like foo213.3.1.2bar would also match.

This regex can be simplified by grouping the first three repeated \d{1,3}\. expressions:

 /^(\d{1,3}\.){3}\d{1,3}$/ 
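One way to go from "looks like an IP" to "is a valid dotted-quad IPv4 address" is to capture the four numbers and check their values in Perl code. This is a sketch of that idea, not part of the one-liner itself; both helper names are hypothetical.

```perl
use strict;
use warnings;

# The simplified pattern from above: shape check only.
sub looks_like_ip {
    my ($str) = @_;
    return $str =~ /^(\d{1,3}\.){3}\d{1,3}$/ ? 1 : 0;
}

# Hypothetical stricter check: same shape, plus every number must be <= 255.
sub is_valid_ipv4 {
    my ($str) = @_;
    return 0 unless $str =~ /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/;
    my $too_large = grep { $_ > 255 } $1, $2, $3, $4;   # count of bad octets
    return $too_large ? 0 : 1;
}

for my $ip ('81.198.240.140', '923.844.1.999', 'foo213.3.1.2bar') {
    printf "%-16s looks like IP: %d, valid: %d\n",
        $ip, looks_like_ip($ip), is_valid_ipv4($ip);
}
```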

115. Match printable ASCII characters.

 /[ -~]/ 

This is really tricky and smart. To understand it, take a look at man ascii. You'll see that the space character has the value 0x20 and the ~ character is 0x7e. All the characters from space through ~ are printable. This regular expression matches exactly that: the [ -~] defines a range of characters from space to ~. This is my favorite regexp of all time.

116. Match unprintable ASCII characters.

 /[^ -~]/ 

Here we invert the previous regular expression. Placing ^ as the first character inside of [...] inverts everything that it would originally match.
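The inverted class is handy both for testing a whole string and for cleaning it up. The two helpers below are a sketch with invented names: one reports whether a string contains only printable ASCII, the other deletes everything else.

```perl
use strict;
use warnings;

# True if the string has no character outside the space..~ range.
# An empty string counts as printable here.
sub only_printable {
    my ($str) = @_;
    return $str !~ /[^ -~]/ ? 1 : 0;
}

# Returns a copy of the string with all unprintable characters removed.
sub strip_unprintable {
    my ($str) = @_;
    $str =~ s/[^ -~]//g;
    return $str;
}

print only_printable("hello, world"), "\n";
print only_printable("tab\there"), "\n";
print strip_unprintable("a\tb\x00c\n"), "\n";
```

Note that tab and newline fall outside the space..~ range, so they count as unprintable under this definition.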

117. Match text between two HTML tags.

 m|<strong>([^<]*)</strong>| 

This regex matches everything between <strong>...</strong> HTML tags. The trick here is the ([^<]*), which matches as much as possible until it finds a < character, which starts the next tag.

Alternatively you can write:

 m|<strong>(.*?)</strong>| 

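But this is a little different. For example, if the HTML is <strong><em>hello</em></strong> then the first regex doesn't match anything, because ([^<]*) can't match past the < of <em>, so the </strong> that must come next is never found. The second regex matches <em>hello</em> because (.*?) matches as little as possible until it finds </strong>, and the . also matches the < character.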

However, don't use regular expressions for matching and parsing HTML. Use modules like HTML::TreeBuilder to accomplish the task more cleanly.
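To see the difference between the two patterns side by side, here is a small sketch that runs both on a plain and on a nested example. The helper names between_negated and between_lazy are made up for this illustration.

```perl
use strict;
use warnings;

# First pattern: the captured text may not contain a < character.
sub between_negated {
    my ($html) = @_;
    return $html =~ m|<strong>([^<]*)</strong>| ? $1 : undef;
}

# Second pattern: lazy match of anything up to the first </strong>.
sub between_lazy {
    my ($html) = @_;
    return $html =~ m|<strong>(.*?)</strong>| ? $1 : undef;
}

for my $html ('<strong>hello</strong>', '<strong><em>hello</em></strong>') {
    my $negated = between_negated($html);
    my $lazy    = between_lazy($html);
    printf "%s\n  negated class: %s\n  lazy match:    %s\n", $html,
        defined $negated ? $negated : '(no match)',
        defined $lazy    ? $lazy    : '(no match)';
}
```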

118. Extract all matches from a regular expression.

 my @matches = $text =~ /regex/g; 

Here the regular expression is evaluated in list context, which makes it return all the matches. The matches get put in the @matches variable. If the regex contains capture groups, the list holds the captured groups instead of the whole matches.

For example, the following regex extracts all numbers from a string:

 my $text = "10 hello 25 moo 31 foo";
 my @nums = $text =~ /\d+/g; 

@nums now contains (10, 25, 31).
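A few variations on the same idiom, as a sketch: collecting whole matches, collecting capture groups into a hash, and counting matches without keeping them. The key=value sample string is invented for this example.

```perl
use strict;
use warnings;

my $text = "10 hello 25 moo 31 foo";

# No capture groups: list context with /g returns every whole match.
my @nums = $text =~ /\d+/g;

# With capture groups, /g returns all captured groups in order,
# which can be assigned straight to a hash as key/value pairs.
my %pairs = "a=1 b=2 c=3" =~ /(\w+)=(\d+)/g;

# Assigning to an empty list forces list context; the scalar on the
# left then receives the number of matches.
my $count = () = $text =~ /\d+/g;

print "numbers: @nums\n";
print "b is $pairs{b}\n";
print "found $count numbers\n";
```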

119. Test if a number is in range 0-255.

 /^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/ 

Here is how it works. A number can have one, two or three digits. If it's a one-digit number, we allow it to be anything [0-9]. If it's a two-digit number, we also allow any combination [0-9][0-9]. However, if it's a three-digit number, it has to be either one-hundred-something or two-hundred-something. If it's one-hundred-something, the remaining two digits can be anything, so 1[0-9][0-9] covers 100 to 199. If it's two-hundred-something, it is either 200 to 249, which 2[0-4][0-9] matches, or 250 to 255, which 25[0-5] matches. A simpler-looking pattern such as [12][0-5][0-5] is not enough for the three-digit case, because it rejects valid numbers such as 199 or 206.
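A range regex like this is easy to get subtly wrong, so it is worth checking it exhaustively. The sketch below uses a pattern that spells out the three-digit cases separately (1[0-9][0-9], 2[0-4][0-9] and 25[0-5]) and compares it against a plain numeric test for every integer from 0 to 999. The variable names are my own.

```perl
use strict;
use warnings;

# Pattern under test: one digit, two digits, 100-199, 200-249, 250-255.
my $in_range = qr/^([0-9]|[0-9][0-9]|1[0-9][0-9]|2[0-4][0-9]|25[0-5])$/;

# Collect every integer where the regex and the numeric test disagree.
my @wrong = grep {
    ($_ =~ $in_range ? 1 : 0) != ($_ <= 255 ? 1 : 0)
} 0 .. 999;

print @wrong
    ? "mismatches: @wrong\n"
    : "regex agrees with the numeric test for 0..999\n";
```

The check only covers integers written without leading zeros. The two-digit alternative [0-9][0-9] also accepts strings such as 07, which may or may not be what you want.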

120. Replace all <b> tags with <strong>

 $html =~ s|<(/)?b>|<$1strong>|g 

Here I assume that the HTML is in the variable $html. The <(/)?b> matches the opening and closing <b> tags and captures the optional closing-tag slash in group $1. The substitution then replaces the matched tag with either <strong> or </strong>, depending on whether it was an opening or a closing tag.
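Here is the substitution as a small function, as a sketch. It differs from the one-liner in one detail that is my own choice: the group is written (/?) instead of (/)?, so that $1 is an empty string rather than undef for opening tags, which avoids an "uninitialized value" warning under use warnings. The helper name b_to_strong is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a copy of the HTML with <b> and </b>
# replaced by <strong> and </strong>.
sub b_to_strong {
    my ($html) = @_;
    $html =~ s|<(/?)b>|<$1strong>|g;   # $1 is '/' for closing tags, '' otherwise
    return $html;
}

print b_to_strong('<b>bold</b> and <b>more</b>'), "\n";
print b_to_strong('<br> and <body> are left alone'), "\n";
```

The pattern only matches the exact lowercase tags <b> and </b>. Tags with attributes or in uppercase are not touched, which is one more reason to reach for an HTML parser for anything beyond quick fixes.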

Have Fun!

Thanks for reading the article! In the next part I am releasing the perl1line.txt that will contain all the one-liners in a single file.
