Saturday, 26 July 2008

Ruby Strings

Java and C# take the view that strings are fundamental to the language and that API users should be able to rely on them doing exactly what is expected, so the String class is set in stone, and each instance of String is immutable. Ruby takes the other road, allowing the user to do whatever he wants with a string. The Ruby way is certainly more convenient; whether there are security implications I am not sure (though unlike Java and C#, Ruby is not targeted at running within a web page).

API documentation:

There are a surprisingly large number of ways of defining a string in Ruby.

Double-quoted: Uses backslash escapes (like C, etc.), and embedding variables and code with #{some_code} (use \# for a hash).

%Q notation: %Q/My text/ is almost the same as "My text", as is %Q[My text] or %Q@My text@ or whatever delimiter you like (not letters or numbers!). You can use a backslash to include your terminating character, e.g. %Q!This is important\! Really it is.!.

Single-quoted: No escapes (except \' for a single quote and \\ for a backslash). Single-quoted strings require less processing than double-quoted ones, though I suspect the difference is insignificant.

%q notation: %q/My text/ is almost the same as 'My text'.
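To see the four forms side by side, here is a small sketch. The variable names are just for illustration; the point is that the double-quoted and %Q forms process escapes and #{} interpolation, while the single-quoted and %q forms take the text literally.

```ruby
name = "Ruby"

double  = "Hello #{name}\tdone"     # interpolation and the \t escape are processed
percent = %Q[Hello #{name}\tdone]   # %Q behaves like double quotes
single  = 'Hello #{name}\tdone'     # taken literally: no interpolation, no tab
lower_q = %q(Hello #{name}\tdone)   # %q behaves like single quotes

puts double == percent   # true
puts single == lower_q   # true
puts double == single    # false

# A handy use of %Q: text full of double quotes needs no escaping
puts %Q(She said "hello" to #{name})
```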

If you include a return in your string (i.e., it goes on to multiple lines), that gets kept in the string as a newline!
s = "First
Second"
p s # => "First\nSecond"

On Windows a return is \n (as in Java, and as opposed to C# which defaults to \r\n).

Concatenating and appending
Concatenation as in Java and C#
a = "first string" + " second string"
# => a = "first string second string"
a += " third string"
# => a = "first string second string third string"

But with an append method too.
a << " third string"
# => a = "first string second string third string"

This is the more efficient way (+= creates a new string from the two parts).
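One way to convince yourself of the difference is to watch the object identity. This is only a sketch; the dup call is there so the string is mutable even in a setup where string literals are frozen.

```ruby
a = "first".dup              # dup: a mutable copy, even if literals are frozen
original_id = a.object_id

a << " second"               # appends in place, same object
same_object = (a.object_id == original_id)

a += " third"                # builds a new string and rebinds a to it
new_object = (a.object_id != original_id)

puts a             # first second third
puts same_object   # true
puts new_object    # true
```

The flip side is that << changes the string for everyone holding a reference to it, whereas += leaves the old object alone.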

Note that using << to add an integer between 0 and 255 adds the character (this is because Ruby does not have a character type as such).
a = "hello"
a << 72 # a is "helloH"

Adding other numbers generates an error in Ruby 1.8 (from 1.9 onwards the integer is treated as a codepoint instead). This is one place where Java and C# beat Ruby; they can cope with adding numbers (and indeed any class) to a string without an explicit conversion.
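In practice the explicit conversion is short. A minimal sketch of the two usual ways round it, plus what happens if you forget (the variable names are made up for the example):

```ruby
count = 3

a = "Items: " + count.to_s   # explicit conversion with to_s
b = "Items: #{count}"        # interpolation calls to_s for you

# Without the conversion, + raises a TypeError
error_class = nil
begin
  "Items: " + count
rescue TypeError => e
  error_class = e.class
end

puts a             # Items: 3
puts b             # Items: 3
puts error_class   # TypeError
```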

Repetition (how is that useful exactly?)
a = "repeated " * 4
# => "repeated repeated repeated repeated "
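For what it is worth, the places I can imagine repetition being handy are underlines and indentation. A small sketch (the names are just for illustration):

```ruby
title = "Report"
underline = "-" * title.length   # a rule exactly as wide as the title
indent = "  " * 3                # three levels of two-space indentation

puts title
puts underline                   # ------
puts indent + "nested item"
```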

Extracting bits
Extract characters as though it is an array
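a[n]      # the nth character (starting from zero); Ruby 1.8 returns its ASCII value instead
a[-n]     # the nth character from the end (starting from 1)
a[n..m]   # a substring from n to m inclusive
a[n, len] # a substring of len characters starting at n (not the same as a[n..m])
a[n...m]  # a substring from n to m-1
a[n].chr  # Ruby 1.8: turns the ASCII value back into the actual character
a[n].ord  # Ruby 1.9+: the character code of the nth character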

Note that a[n..m] is perfectly happy with variable names rather than specific numbers.
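A worked example with real values, assuming Ruby 1.9 or later (where indexing returns a one-character string rather than a number):

```ruby
a = "Hello, world"
n = 7
m = 11

p a[0]        # "H"
p a[-1]       # "d"
p a[n..m]     # "world"  (positions 7 to 11 inclusive)
p a[n...m]    # "worl"   (positions 7 to 10)
p a[n, 3]     # "wor"    (3 characters starting at 7)
p a[0].ord    # 72
p 72.chr      # "H"
```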

You can also replace chunks in a similar manner.
s = "Here is short string"
# => "Here is short string"
s['short'] = 'long'
# => "long"
s
# => "Here is long string"

It only replaces the first occurrence, but the index can also be a regular expression.
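Here is a sketch of the regex form, with gsub alongside for the case where you do want every occurrence replaced. The strings are made up for the example.

```ruby
s = "one fish two fish".dup   # dup: a mutable copy, even if literals are frozen

s[/fish/] = "cat"             # only the first match is replaced, in place
first_only = s.dup

all = s.gsub(/fish/, "cat")   # gsub returns a new string with every match replaced

puts first_only   # one cat two fish
puts all          # one cat two cat
```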

A list of string methods from here:

To change case:
capitalize - first character to upper, rest to lower
downcase - all to lower case
swapcase - changes the case of all letters
upcase - all to upper case

To rejustify:
center - add white space padding to center string
ljust - pads string, left justified
rjust - pads string, right justified

To trim:
chop - remove last character
chomp - remove trailing line separators
squeeze - reduces successive equal characters to singles
strip - deletes leading and trailing white space

To examine:
count - return a count of matches
empty? - returns true if empty
include? - is a specified target string present in the source?
index - return the position of one string in another
length or size - return the length of a string
rindex - returns the last position of one string in another
slice - returns a partial string

To encode and alter:
crypt - password encryption
delete - delete an intersection
dump - adds extra \ characters to escape specials
hex - takes string as hex digits and returns number
next or succ - successive or next string (eg ba -> bb)
oct - take string as octal digits and returns number
replace - replace one string with another
reverse - turns the string around
slice! - DELETES a partial string and returns the part deleted
split - returns an array of partial strings exploded at separator (eg, s.split(/_/) )
sum - returns a checksum of the string
to_f and to_i - return string converted to float and integer
tr - to map all occurrences of specified char(s) to other char(s)
tr_s - as tr, then squeeze out resultant duplicates
unpack - to extract from a string into an array using a template

To iterate:
each_char - process each character in turn (in Ruby 1.8, each iterates over lines)
each_line - process each line in a string
each_byte - process each byte in turn
upto - iterate through successive strings (see "next" above)
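A quick tour of a few of these, as a sketch rather than a full reference. The inputs are invented for the example, and the iterators are called without a block so that the results can be shown as arrays.

```ruby
s = "  Hello World\n"

p s.strip                    # "Hello World"
p s.chomp                    # "  Hello World"
p "hello world".capitalize   # "Hello world"
p "Hello".swapcase           # "hELLO"
p "hello".center(11, "*")    # "***hello***"
p "aaabbbccc".squeeze        # "abc"
p "hello".count("l")         # 2
p "hello".index("l")         # 2
p "hello".rindex("l")        # 3
p "ff".hex                   # 255
p "az".succ                  # "ba"
p "hello".tr("el", "ip")     # "hippo"
p "a".upto("e").to_a         # ["a", "b", "c", "d", "e"]
p "ab".each_byte.to_a        # [97, 98]
p "a\nb\n".each_line.to_a    # ["a\n", "b\n"]
```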

One that I find particularly useful is split, which will break a string into an array of substrings, breaking at the characters you specify (either a string or a regex; defaults to whitespace):
s = "Here is a\nstring"
# => "Here is a\nstring"
s.split
# => ["Here", "is", "a", "string"]
s.split(/[aeiou]/)
# => ["H", "r", " ", "s ", "\nstr", "ng"]

Also interesting is scan, which kind of does the opposite of split. Again it returns an array, but this time of the text that matches, rather than the text between the matches. This one example will return an array of links from an HTML document:
links = content.scan(/<a .+?<\/a>/i)
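To make that concrete, here is the same regex run against a tiny made-up HTML snippet. The second call is my own addition: when the pattern contains a group, scan returns an array of arrays holding just the captured parts.

```ruby
content = '<p><a href="one.html">One</a> and <A HREF="two.html">Two</A></p>'

links = content.scan(/<a .+?<\/a>/i)
p links   # ["<a href=\"one.html\">One</a>", "<A HREF=\"two.html\">Two</A>"]

# With a capture group, each match becomes an array of its groups
hrefs = content.scan(/href="(.*?)"/i).flatten
p hrefs   # ["one.html", "two.html"]
```

This is fine for quick jobs, though a regex will trip over awkward HTML (links split across lines, for instance).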

System commands
A back-quoted string (e.g. `dir`) gets sent as a command to the OS. The system method in Kernel does something similar (e.g. system "dir"). You can also use %x[], for example, %x[dir].
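The forms are not quite interchangeable: back quotes and %x hand you the command's output as a string, while system lets the command print for itself and just tells you whether it worked. A sketch, assuming an echo command is available on your machine:

```ruby
output = `echo hello`           # returns whatever the command wrote to standard output
same   = %x[echo hello]         # %x is the same thing with delimiters of your choice

ok     = system("echo hello")   # prints hello itself; returns true if the exit status was 0
status = $?.exitstatus          # exit status of the last command run

p output.strip   # "hello"
p same.strip     # "hello"
p ok             # true
p status         # 0
```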

Here Document
A "here document" is yet another form of string, designed for large one-off chunks of text (mm, not good for internationalisation). It is denoted by <<, followed by the terminator.
a = <<END
Some text
END

A hyphen after the << (i.e., <<-END) allows the terminator to be indented.
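A sketch of the hyphen form inside a method, where indenting the terminator matters most. The method names are invented for the example. Note that the hyphen only affects the terminator; the body keeps its leading spaces. The <<~ form (Ruby 2.3 and later) strips the common indentation from the body as well.

```ruby
def dash_heredoc
  <<-END
    Indented body
  END
end

def squiggly_heredoc
  <<~END
    Indented body
  END
end

p dash_heredoc       # "    Indented body\n"
p squiggly_heredoc   # "Indented body\n"
```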

do_stuff(<<TERMI, other_parameters)
This text will all
go into the method as the
first parameter
TERMI

You can do operations directly on your here document, as shown here:
a = <<END.reverse
 This is my string
END
# => "\ngnirts ym si sihT "

Formatted String
You can also generate a formatted string with the % operator, which is more or less equivalent to the sprintf method. One difference is that the % operator requires an array for multiple substitutions.
"%d" % 12
sprintf "%d", 12
# => "12"
"x = %04d, y = %s, z = %.2f" % [12, "value", 1.234]
sprintf "x = %04d, y = %s, z = %.2f", 12, "value", 1.234
# => "x = 0012, y = value, z = 1.23"
x = 1.12345
n = 2
"%.#{n}f" % x
sprintf "%.#{n}f", x
# => "1.12"

What is useful about this is that you can pass your format string around, and apply the substitutions to it multiple times.
a = "x = %04d, y = %s, z = %.2f"
c = a % [12, "value", 1.234]
# c is "x = 0012, y = value, z = 1.23"
c = a % [42, "other value", -11.2]
# now c is "x = 0042, y = other value, z = -11.20"

From here:

This useful technique goes through the template string, replacing anything wrapped in ::: with the matching string from the hash, values, using the wrapped name as the key.
templateStr.gsub( /:::(.*?):::/ ) { values[ $1 ].to_str }
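A complete, runnable version of the idea might look like this. The hash contents and the template text are made up, and I have used to_s rather than to_str so that a missing key (nil) becomes an empty string instead of an error.

```ruby
values = { "name" => "World", "lang" => "Ruby" }
template_str = "Hello :::name:::, welcome to :::lang:::!"

# $1 holds whatever the (.*?) group captured for the current match
result = template_str.gsub(/:::(.*?):::/) { values[$1].to_s }

puts result   # Hello World, welcome to Ruby!
```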

For more complex template usage, Ruby has ERB (see here).
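As a taste of ERB, here is a minimal sketch using the erb library that ships with Ruby. The template text and variable names are invented for the example; result(binding) lets the template see the local variables in scope.

```ruby
require "erb"

template = ERB.new("Hello <%= name %>, you have <%= items.length %> items.")

name  = "World"
items = ["a", "b"]

rendered = template.result(binding)   # evaluates the <%= %> tags against these locals
puts rendered   # Hello World, you have 2 items.
```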

Other Manipulations
Rails offers a variety of new methods for changing strings, including pluralize and tableize; methods used by Rails to convert between table names, class names and filenames.

Here is a way to split camel case into title case:
"MyCamelCaseClassName".split(/(?=[A-Z])/).join(" ")
# => "My Camel Case Class Name"



Demon said...

Regular expressions are really wonderful for parsing HTML or matching patterns. I use them a lot when I code. Actually, when I learn any new language, the first thing I check is whether it supports regex. I feel at ease when I find that it does.

Here is a post about Ruby regex. I wrote it when I was first learning Ruby regex, so it will be helpful for new coders.
