Lesson 10	More powerful pattern-matching
Objective	Learn how to extract values from a regular expression.

Perl Pattern Matching and Regular Expressions

The script below puts the tag name in $a and any attributes in $b. Notice that it uses the m| form instead of the bare slashes, because the expression itself contains a slash. You can use almost any punctuation character as the match delimiter, but you must use the m operator explicitly if your match delimiters are not slashes.

Parentheses are used in Perl regular expressions to capture the characters they match. The captured substrings are often called references (or backreferences). In list context, the m// operator returns a list of the captured substrings. Both m// and s/// also assign them to $1, $2, etc., and s/// can use those variables in the right-hand side of the expression. We will cover the use of references in substitute expressions in the next lesson.

The /x option switch is a wonderful addition to Perl 5. Using /x at the end of the regular expression tells perl to ignore whitespace and comments inside the pattern, so that you can format and annotate your more complex regular expressions.

#!/usr/bin/perl
while (<>) {
  chomp;
  next unless length;   # skip empty lines
  ($a, $b) =
    m|            # using the m explicitly allows you to use
                  # other characters to delimit the regexp
      </?         # match the < with or without the /
      ([^>\s]*)   # first capture: the tag name
      \s*         # skip over whitespace, if any
      ([^>]*)>    # second capture: the attributes
    |x            # the /x switch allows you to intersperse
                  # whitespace and comments in the regexp
    or next;      # skip lines that contain no tag
  print "[$a] [$b]\n";
}

Use the /x switch whenever you have an expression that looks more like transmission line noise than anything you would want to read. Without the /x the above regular expression would look like this:

  ($a, $b) = m|</?([^>\s]*)\s*([^>]*)>|;

Perl Regular Expression Features

Regular expressions are one of Perl's most powerful features. The syntax seems cryptic at first, but the more you use them the easier they become (both to code and to read). Learn as much as you can about them, and you will become a much more productive Perl programmer. To help you grasp the concepts involved, there is an entire module on regular expressions in the Introduction to Perl course, and we have provided a short reference on the ranges and classes that Perl's regular expressions use for matching.
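
To make the capture behavior from this lesson concrete, here is a minimal sketch that reuses the lesson's tag-matching expression. The helper name parse_tag and the sample input are assumptions made for illustration, and the sketch uses lexical variables instead of $a and $b (which Perl also uses inside sort blocks).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_tag is a hypothetical helper name, not part of the lesson.
# Called in list context, it returns (tag name, attributes) for the first
# tag in $line, or an empty list when the line contains no tag.
sub parse_tag {
    my ($line) = @_;
    return $line =~ m|</?([^>\s]*)\s*([^>]*)>|;
}

my $sample = '<a href="index.html">Home</a>';   # made-up input

# List context: the match returns the captured substrings.
my ($tag, $attrs) = parse_tag($sample);
print "[$tag] [$attrs]\n";

# After a successful match the same captures are also in $1 and $2.
if ($sample =~ m|</?([^>\s]*)\s*([^>]*)>|) {
    print "[$1] [$2]\n";
}

# Scalar context: the match only reports success or failure.
print "no tag found\n" unless 'plain text' =~ m|</?([^>\s]*)\s*([^>]*)>|;
```

Both print statements for the sample line show [a] [href="index.html"]. Only the first tag on the line is reported, just as in the lesson's script.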

Regular Expressions - Ranges and Classes of Characters

Perl regular expressions can also include ranges and classes of characters.
You can specify a range of characters like this:

  /[A-Z]/     # any one uppercase letter from A to Z (but not a-z)
  /[A-Za-z]/  # any one ASCII letter
  /[0-9]/     # any one ASCII digit
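
As a rough illustration of the ranges above, the following sketch classifies single characters. The helper name classify_char is hypothetical, and the sketch assumes plain ASCII input.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# classify_char is a hypothetical helper for illustration only.
# It expects a one-character string and names the first range it falls in.
sub classify_char {
    my ($c) = @_;
    return 'upper' if $c =~ /^[A-Z]$/;   # A through Z only
    return 'lower' if $c =~ /^[a-z]$/;   # a through z only
    return 'digit' if $c =~ /^[0-9]$/;   # 0 through 9
    return 'other';                      # anything outside those ranges
}

for my $c (qw(Q q 7 _)) {
    print "$c: ", classify_char($c), "\n";
}
```

A caret placed first inside the brackets negates the class, which is how the lesson's [^>] matches any character except >.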

You can also specify characters using special escape codes.
Because Perl processes a pattern much like a double-quoted string, these escapes work inside regular expressions:

   \t          tab
   \n          newline
   \r          return
   \f          form feed
   \a          alarm (bell)
   \e          escape
   \033        octal char (think of a PDP-11)
   \x1B        hex char
   \c[         control char
   \l          lowercase next char
   \u          uppercase next char
   \L          lowercase till \E
   \U          uppercase till \E
   \E          end case modification
   \Q          quote regexp metacharacters till \E
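
The sketch below tries out a few of these escapes. It is only an illustration: count_literal is a hypothetical helper name, the sample strings are made up, and the escape-character comparison assumes an ASCII platform.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_literal is a hypothetical helper for illustration only.
# \Q ... \E quotes metacharacters, so a dot in $needle matches only a dot.
sub count_literal {
    my ($text, $needle) = @_;
    my $count = () = $text =~ /\Q$needle\E/g;   # count all matches
    return $count;
}

# Without \Q the pattern a.b would also match "axb".
print count_literal("a.b a.b axb", "a.b"), "\n";

# Case modification works in patterns and in double-quoted strings alike.
my $name = "perl";
print "\u$name \U$name\E $name\n";

# Octal, hex, named, and control-character forms of the same escape character.
print "all four spell ESC\n"
    if "\033" eq "\x1B" && "\x1B" eq "\e" && "\e" eq "\c[";
```

The first line printed is 2, and the second is Perl PERL perl.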

Perl also defines these shorthand character classes:

   \w  Match a "word" character (alphanumeric plus "_")
   \W  Match a nonword character
   \s  Match a whitespace character
   \S  Match a nonwhitespace character
   \d  Match a digit character
   \D  Match a nondigit character
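
Here is a small sketch of these classes at work on a made-up "name = number" line. The helper name parse_assignment and the input format are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# parse_assignment is a hypothetical helper for illustration only.
# Called in list context, it returns (name, value) for lines that look like
# "name = 123", or an empty list for anything else.
sub parse_assignment {
    my ($line) = @_;
    return $line =~ /^\s*(\w+)\s*=\s*(\d+)\s*$/;
}

my ($name, $value) = parse_assignment("  max_size = 1024 ");
print "[$name] [$value]\n";

# \S+ with /g in list context returns every run of non-whitespace characters.
my @words = "the quick  fox" =~ /(\S+)/g;
print scalar(@words), " words\n";
```

The first print shows [max_size] [1024], because \w includes the underscore, and the second shows 3 words.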

Perl also defines these zero-width assertions, which match a position without consuming any characters:

   \b  Match a word boundary
   \B  Match a non-word boundary
   \A  Match only at beginning of string
   \Z  Match only at end of string, or before a newline at the end
   \G  Match only where previous m//g left off
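
The following sketch exercises the assertions above. The helper names count_whole_word and tokenize, the token labels, and the sample strings are all assumptions made for illustration. The tokenizer also relies on the /c modifier, which keeps the match position when a /g match fails.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# count_whole_word is a hypothetical helper: \b on both sides keeps the
# word from matching inside a longer word.
sub count_whole_word {
    my ($text, $word) = @_;
    my $n = () = $text =~ /\b\Q$word\E\b/g;
    return $n;
}

# tokenize is a hypothetical helper: each \G match must start exactly where
# the previous m//g match on the same string left off.
sub tokenize {
    my ($s) = @_;
    my @tokens;
    pos($s) = 0;
    while (pos($s) < length($s)) {
        if    ($s =~ /\G\s+/gc)   { next }                       # skip spaces
        elsif ($s =~ /\G(\d+)/gc) { push @tokens, "NUM:$1" }
        elsif ($s =~ /\G(\w+)/gc) { push @tokens, "WORD:$1" }
        else  { $s =~ /\G(.)/gcs;   push @tokens, "OTHER:$1" }
    }
    return @tokens;
}

my $text = "cat concatenate cat\n";
print count_whole_word($text, "cat"), "\n";    # whole words only

print "starts with cat\n" if $text =~ /\Acat/;
print "ends with cat\n"   if $text =~ /cat\Z/; # \Z allows the final newline

print join(",", tokenize("x = 42")), "\n";
```

The word count printed is 2, since the "cat" inside "concatenate" has no word boundary on either side, and the last line printed is WORD:x,OTHER:=,NUM:42.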

Regular Expression Exercise

Click the Exercise link below to write a script that finds HTML tags in a file.
Regular Expression - Exercise