Saturday 9 May 2009

Regular Expressions in Ruby

A Regexp is a Ruby object representing a RegEx or "regular expression". So what exactly is a "regular expression"? It is a sort of string that can be used to match against another string. You could think of it as a template or a set of rules that a string can be compared to. Creating a Regexp object is much like creating a string, except that you use the forward slash to delimit it, rather than quote marks.
r = /my regular expression/

Alternatively, you can use the %r notation with a delimiter of your choice (most punctuation characters seem to work, just like %q for a string and %w for a word array):
r = %r|my regular expression|
r = %r<my regular expression>
r = %r=my regular expression=

That regular expression will just match the string "my regular expression", anywhere in a string. The power of regular expressions lies in their use of wild cards, as we will see later.

Several standard Ruby methods take Regexp objects, but the most basic use is a simple comparison. There are two ways to do that: the =~ operator or the match method. Both String and Regexp provide them, so both can be used either way around:
s = 'Here is my string'
r = /s my/
s.match r
r.match s
s =~ r
r =~ s

The difference is that the match method returns a MatchData object if a match is found, while the =~ operator gives the position of the match. Both return nil if there is no match. However, the special variable $~ holds the MatchData for the last Regexp comparison performed, so this information is still available (personally, I do not like the built-in globals; if you want the MatchData object, use the match method, and everyone else will have a better idea of what you are doing). See later for more on MatchData.
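As a minimal sketch of that difference, here is the same string and pattern again with the return values written out. The variable names md and pos are just for illustration.

```ruby
s = 'Here is my string'
r = /s my/

md  = s.match(r)     # MatchData on success, nil on failure
pos = s =~ r         # index of the start of the match, or nil

p md[0]              # => "s my"
p pos                # => 6
p $~[0]              # => "s my"  ($~ was set by the =~ above)
p s.match(/zzz/)     # => nil
p(s =~ /zzz/)        # => nil
```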

So what can we put into a regular expression? There is a variety of options allowing you to specify your template as broadly or as narrowly as you want.
.        any character except newline
[ ]      any single character of set
[^ ]     any single character NOT of set
*        0 or more of the previous regular expression
*?       0 or more of the previous regular expression (non-greedy)
+        1 or more of the previous regular expression
+?       1 or more of the previous regular expression (non-greedy)
?        0 or 1 of the previous regular expression
??       0 or 1 of the previous regular expression (non-greedy)
|        alternation
( )      grouping regular expressions
^        beginning of a line or string
$        end of a line or string
{m,n}    at least m but at most n of the previous regular expression
{m,n}?   at least m but at most n of the previous regular expression (non-greedy)
\1-\9    nth previously captured group
\A       beginning of a string
\b       backspace (0x08) (inside [] only)
\b       word boundary (outside [] only)
\B       non-word boundary
\d       digit, same as [0-9]
\D       non-digit
\S       non-whitespace character
\s       whitespace character [ \t\n\r\f]
\W       non-word character
\w       word character [0-9A-Za-z_]
\z       end of a string
\Z       end of a string, or before newline at the end
\/       forward slash
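The anchors in that list are the easiest to mix up, so here is a small sketch that tries a few of them. The test strings are made up for the purpose.

```ruby
text = "cat concatenate\ncat"

p(text =~ /\bcat\b/)    # => 0    whole word "cat" at the start
p(text =~ /\Bcat\B/)    # => 7    "cat" buried inside "concatenate"
p(text =~ /^cat$/)      # => 16   ^ and $ work line by line
p(text =~ /\Acat\z/)    # => nil  \A and \z mean the whole string

p("2009-05-09" =~ /\A\d{4}-\d{2}-\d{2}\z/)   # => 0
p("ab\n" =~ /b\Z/)      # => 1    \Z tolerates a trailing newline
p("ab\n" =~ /b\z/)      # => nil  \z does not
```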


Some simple examples
Here are some examples to get us going.
# Simple pattern matches to dog
p1 = /dog/
p (p1 =~ 'cat-dog') # => 4
p (p1 =~ 'cat-doggy') # => 4
p (p1 =~ 'cat-dig') # => nil
p (p1 =~ 'cat-fox') # => nil

# Pattern matches to d, any word character (letter, digit or underscore), then g
p1 = /d\wg/
p (p1 =~ 'cat-dog') # => 4
p (p1 =~ 'cat-doggy') # => 4
p (p1 =~ 'cat-dig') # => 4
p (p1 =~ 'cat-fox') # => nil

# Pattern matches to d, any vowel, then g
p1 = /d[aeiou]g/
p (p1 =~ 'cat-dog') # => 4
p (p1 =~ 'cat-doggy') # => 4
p (p1 =~ 'cat-dig') # => 4
p (p1 =~ 'cat-fox') # => nil

# Pattern matches to dog at end of string
p1 = /dog\Z/
p (p1 =~ 'cat-dog') # => 4
p (p1 =~ 'cat-doggy') # => nil
p (p1 =~ 'cat-dig') # => nil
p (p1 =~ 'cat-fox') # => nil

# Pattern matches to d, anything other than o or u, then g
p1 = /d[^ou]g/
p (p1 =~ 'cat-dog') # => nil
p (p1 =~ 'cat-doggy') # => nil
p (p1 =~ 'cat-dig') # => 4
p (p1 =~ 'cat-fox') # => nil


The MatchData object
If you bracket sections of your Regexp, you can then "capture" these subsections. Each subsection can be accessed as though the MatchData object were an array, with the first element being the entire matched string (though methods like each cannot be used). Use the offset method to determine the start and end position in the string for each group. Here it is in action:
s = "Here is a string with http://www.mydomain.com/path/to/mypage.html in it"
r = /http:\/\/([a-z.]*)(\/[a-z]*)*(\/[a-z]*\.html)/i
m = r.match s
p m.string
# => "Here is a string with http://www.mydomain.com/path/to/mypage.html in it"
p m.pre_match
# => "Here is a string with "
p m.post_match
# => " in it"
p m[0]
# => "http://www.mydomain.com/path/to/mypage.html"
p m.offset(0)
# => [22, 65]
p m[1]
# => "www.mydomain.com"
p m.offset(1)
# => [29, 45]
p m[2]
# => "/to"
p m.offset(2)
# => [50, 53]
p m[3]
# => "/mypage.html"
p m.offset(3)
# => [53, 65]
p m[4]
# => nil
#p m.offset(4)
# => IndexError
p m.length
# => 4
p m.size
# => 4

If you want an actual array, use to_a or captures (the latter includes only the capture groups, the former also has the entire match as the first element).
m.captures.each { |e| p e }
# => "www.mydomain.com"
# => "/to"
# => "/mypage.html"
m.to_a.each { |e| p e }
# => "http://www.mydomain.com/path/to/mypage.html"
# => "www.mydomain.com"
# => "/to"
# => "/mypage.html"

I was surprised to find that you can only capture as many subsections as you have brackets. Even though the Regexp matches one subsection to two parts of the URL ("/path" and "/to"), only the last one appears in the array.
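If you do want every repeated part, one possible workaround is String#scan, which returns all the non-overlapping matches. This is only a sketch; the url and path variables are my own names, and the pattern is a simplified version of the one above.

```ruby
url = "http://www.mydomain.com/path/to/mypage.html"

# A repeated group keeps only its final iteration
m = url.match(/http:\/\/[a-z.]*(\/[a-z]+)*\/[a-z]*\.html/)
p m[1]                       # => "/to"

# Strip the scheme and host, then collect every path segment
path = url.sub(/\Ahttp:\/\/[a-z.]*/, '')
p path                       # => "/path/to/mypage.html"
p path.scan(/\/[a-z]+/)      # => ["/path", "/to", "/mypage"]
```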

MatchData API
http://www.ruby-doc.org/core/classes/MatchData.html

Shortcut to capture groups
If you only want to pick one section out from a string, there is a quick way to do it. Both of these will pick out a number that follows a space, but the second way is much more concise.
# The usual way
md = s.match(/ ([0-9]+)/)
p md.nil? ? nil : md[1]

# The quick way
p s[/ ([0-9]+)/, 1]

Note that with the first method we have to check for nil (returned when no match is found); otherwise you get an error, as you are calling [] on nil. The quick way just returns nil if there is no match.
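Here is the same idea with two concrete strings, one that contains a number after a space and one that does not. Both strings are invented for the example.

```ruby
with_number    = "room 42 is free"
without_number = "no digits here"

# The quick way: String#[] with a pattern and a group number
p with_number[/ ([0-9]+)/, 1]      # => "42"
p without_number[/ ([0-9]+)/, 1]   # => nil

# The usual way needs a guard, because match returns nil here
md = without_number.match(/ ([0-9]+)/)
p md ? md[1] : nil                 # => nil
```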

Back-references to capture groups - or not
A captured group can be referred to later in the pattern. Here is an example:
pattern = /aa(\d+)-\1/
pattern =~ 'aa1234-1234' # => 0
pattern =~ 'aa1234-1233' # => nil

The pattern requires at least one digit inside the brackets. This is the capture group. The backslash-one refers back to this group, and requires that the exact same number is repeated.

Note that capture groups number from one, rather than zero.

You may not want every bracketed group to count as a capture group (remembering that the list above only gives \1 to \9 as back-references). In the next example, question-mark-colon is used to mark a group as non-capturing: the brackets still group part of the pattern, but the text they match is neither captured nor numbered. We are looking for three groups of digits in the pattern. The third should be identical to the second, and because the first is not counted, we can use \1 instead of \2. This trick lets you use as many groups as you like while only spending back-reference numbers on the ones you need.
pattern = /(?:\d+)-(\d+)-\1/
s = 'bird-cat-12-654-654-otter'
match = pattern.match s
match.to_a.each { |e| p e }
# => "12-654-654"
# => "654"

If you use the String#scan method, it returns an array of every part of the string that matches the given pattern. If the pattern includes capture groups, then it is the captured parts that go into the array (as one sub-array per match) rather than the whole match. However, if you use ?: you can stop that behavior, to get the whole match (or another capture).
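A short sketch of how scan behaves with and without capture groups. The sample string is made up.

```ruby
s = "x=10, y=20, z=30"

# No groups: each element is a whole match
p s.scan(/[a-z]=\d+/)        # => ["x=10", "y=20", "z=30"]

# One capture group: each element is an array of the captures
p s.scan(/[a-z]=(\d+)/)      # => [["10"], ["20"], ["30"]]

# Non-capturing group: back to whole matches
p s.scan(/(?:[a-z])=\d+/)    # => ["x=10", "y=20", "z=30"]

# Two capture groups: two entries per match
p s.scan(/([a-z])=(\d+)/)    # => [["x", "10"], ["y", "20"], ["z", "30"]]
```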

Multiple matches
Often you want to match multiple occurrences.
\d        Match exactly one digit
\d?       Match one or zero digits
\d*       Match zero or more digits
\d+       Match 1 or more digits
\d{2,5}   Match between 2 and 5 digits
aeiou*    Match "aeio" followed by any number of "u"
[aeiou]*  Match any number of vowels
(aeiou)*  Match any number of sequences of "aeiou"
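To see those in action, here is a sketch using String#[] with a pattern, which returns the first matching piece of text. The strings are invented. Note that \d* is happy to match zero digits right at the start of the string, which is an easy trap to fall into.

```ruby
s = "id 7, code 12345678"

p s[/\d/]          # => "7"
p s[/\d+/]         # => "7"
p s[/\d{2,5}/]     # => "12345"
p s[/\d*/]         # => ""  (zero digits match at position 0)

p "aeiouuu"[/aeiou*/]          # => "aeiouuu"
p "queue"[/[aeiou]+/]          # => "ueue"
p "aeiouaeiou!"[/(aeiou)*/]    # => "aeiouaeiou"
```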


Greedy vs non-greedy
A greedy match will try to match against as many characters as possible, while a non-greedy will match against as few as possible. Here is a simple example to illustrate:
s = "Here another string"
greedy = /[a-z]* [a-z]*/
non_greedy = /[a-z]*? [a-z]*?/
p greedy.match(s)[0] # => "ere another"
p non_greedy.match(s)[0] # => "ere "

The * will match against any number (including zero) of the preceding element, so both Regexp objects look for a match against a group of letters, then a space, then a group of letters. The difference is the second has the ?, which makes the * non-greedy.

In both cases they ignore "H" as it does not fit, then they find a match for "e". The match continues, as both are allowed a variable number of letters, and they then match the space. Finally each can have a variable number of lower case letters. The non-greedy version aims for the fewest - in this case zero. The greedy version grabs all it can, so gets "another".

Alternatives
For a set of alternative characters, put them inside square brackets. For sequences, use curved brackets, separated by vertical bars.
[aeiou]   Match any one vowel
(dog|cat) Match either "dog" or "cat"
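A quick sketch of alternation, including what happens if you leave the brackets out. The vertical bar then splits the whole pattern, anchors included. The pets variable and the strings are just for illustration.

```ruby
pets = /(dog|cat)/

p "hotdog stand"[pets]       # => "dog"
p "a cat and a dog"[pets]    # => "cat"  (the first match in the string wins)
p("a fox" =~ pets)           # => nil

# With brackets: the whole string must be "dog" or "cat"
p("bobcat" =~ /\A(dog|cat)\z/)   # => nil
# Without: "dog" at the start OR "cat" at the end
p("bobcat" =~ /\Adog|cat\z/)     # => 3
```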


Building Regexp objects dynamically
You can use #{} when defining a Regexp, just as you can for a double-quoted string. Here is a real example that adds two new methods to the String class (the Rails API already adds them, by the way):
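class String
  # True if the string begins with the literal text in sub
  def starts_with?(sub)
    !match(/\A#{Regexp.escape(sub)}/).nil?
  end

  # True if the string ends with the literal text in sub
  def ends_with?(sub)
    !match(/#{Regexp.escape(sub)}\z/).nil?
  end
end

The argument sent to the method gets incorporated into the Regexp. Regexp.escape stops characters such as "." in the argument from being treated as wild cards. Note how \A and \z are used to anchor the match to the start or the end of the string respectively; ^ and $ would also match at the start or end of any line inside the string.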
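One thing to watch when interpolating: any special characters in the interpolated text are treated as part of the pattern. The sketch below, with a made-up suffix, compares a raw interpolation with one passed through Regexp.escape, using \z from the list above to anchor to the end of the string.

```ruby
suffix = ".rb"

unescaped = /#{suffix}\z/                  # "." still means "any character"
escaped   = /#{Regexp.escape(suffix)}\z/   # "." must be a literal dot

p("script.rb" =~ escaped)      # => 6
p("scriptXrb" =~ unescaped)    # => 6    probably not what was intended
p("scriptXrb" =~ escaped)      # => nil
```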

Case sensitivity and other options
You can change the way the pattern matches either by appending a control code, to change the whole pattern, or using extended patterns (borrowed from Perl). These are things you can insert into a pattern inside brackets, following a question mark. For example, you can use (?i) and (?-i) to turn case-insensitive matching on and off.
# Case sensitive by default
pattern1 = /fox-cat-dog/
pattern1 =~ 'fox-cat-dog' # => 0
pattern1 =~ 'fox-CaT-dog' # => nil
pattern1 =~ 'fox-CaT-doG' # => nil

# Whole pattern modified, case insensitive
pattern2 = /fox-cat-dog/i
pattern2 =~ 'fox-cat-dog' # => 0
pattern2 =~ 'fox-CaT-dog' # => 0
pattern2 =~ 'fox-CaT-doG' # => 0

# Pattern behavior modified within the pattern
# case sensitivity turned off then back on
pattern2 = /fox-(?i)cat-(?-i)dog/
pattern2 =~ 'fox-cat-dog' # => 0
pattern2 =~ 'fox-CaT-dog' # => 0
pattern2 =~ 'fox-CaT-doG' # => nil

# Pattern behavior modified within the pattern
# case sensitivity turned off for substring
pattern2 = /fox-(?i:cat)-dog/
pattern2 =~ 'fox-cat-dog' # => 0
pattern2 =~ 'fox-CaT-dog' # => 0
pattern2 =~ 'fox-CaT-doG' # => nil


The full list of options is:
/i         case insensitive
/m         multiline mode - '.' will match newline
/x         extended mode - whitespace is ignored
/o         only interpolate #{} blocks once
/[neus]    encoding: none, EUC, UTF-8, SJIS, respectively

The last two can, I think, only be used to modify the whole pattern.
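A small sketch of the m and x options, since they are the least obvious of the list. The strings and the spaced variable are made up.

```ruby
text = "first\nsecond"

p(text =~ /first.second/)     # => nil  '.' stops at the newline
p(text =~ /first.second/m)    # => 0    with /m it matches the newline too

# With /x the spaces in the pattern are ignored, so this
# is really three digits, a hyphen and four digits
spaced = /\d{3} - \d{4}/x
p("555-1234" =~ spaced)       # => 0
p("555 - 1234" =~ spaced)     # => nil
```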

Comments
There are various other options using the brackets-question-mark notation. You can embed a comment:
pattern2 = /cat(?#comment)dog/
pattern2 =~ 'catdog' # => 0

This makes more sense with the x option just mentioned, which causes the pattern to ignore whitespace, and so allow formatting and comments like this:
pattern1 = /\d\d\d   (?# Looking for three digits  )
            -        (?# followed by a hyphen )
            \d\d\d   (?# and another three digits )
           /x
p pattern1.match('578 123-678ref 567')[0]
# => "123-678"


Looking ahead
You can also look ahead at what follows, without the next bit being included in your match. You can check that a pattern is either there or absent, as shown in this example. In the first instance, pattern1 looks for three digits followed by "ref", but the resultant match has only the three digits. Then pattern2 looks for three digits not followed by a space. Finally, pattern3 looks for one or more digits not followed by a space; it grabs "578" first, finds a space next, and so backs off to "57".
pattern1 = /\d\d\d(?=ref)/
pattern2 = /\d\d\d(?! )/
pattern3 = /\d+(?! )/
p pattern1.match('578 123 678ref 567')[0]
# => "678"
p pattern2.match('578 123 678ref 567')[0]
# => "678"
p pattern3.match('578 123 678ref 567')[0]
# => "57"


And also...
One final option:
(?>)          nested anchored sub-regexp. stops backtracking.

Means nothing to me, but I mention it for completeness.
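For what it is worth, this construct is usually called an atomic group. As far as I can tell, once the group has matched, the engine will not go back inside it to try a different alternative. A minimal sketch with made-up patterns:

```ruby
normal = /a(bc|b)c/
atomic = /a(?>bc|b)c/

# The normal group first tries "bc", fails to find the final "c",
# then backtracks and tries "b" instead
p("abc" =~ normal)    # => 0

# The atomic group commits to "bc" and never retries with "b"
p("abc" =~ atomic)    # => nil
p("abcc" =~ atomic)   # => 0
```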
