COM271, Week 9
Regular Expressions
A regular expression specifies a pattern of characters against which a string (e.g., the input from a form field) can be compared. Regular expressions can be created with the RegExp() constructor; the resulting object provides a variety of useful methods. The first argument to RegExp is the desired pattern; the second is optional and contains any special flags for that expression, e.g.,
var pattern = new RegExp ("wxyz", "i");
They can also be created as literals, where characters of the pattern are surrounded by slashes ( / ), e.g.,
var pattern = /wxyz/;
The closing slash of a literal can be followed by flags that alter how the pattern is interpreted, such as
- i (case-insensitive)
- g (global match, which finds all matches in the string, rather than just the first)
- m (multiline match)
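The effect of the three flags can be seen in a short sketch. The sample strings and variable names here are made up for illustration and are not part of the course material.

```javascript
// Sketch: how the i, g, and m flags change matching.
var text = "Wxyz, wxyz and WXYZ";

var plain = /wxyz/;         // case-sensitive, first match only
var ignoreCase = /wxyz/i;   // also accepts "Wxyz" and "WXYZ"
var globalAll = /wxyz/gi;   // every match, ignoring case

var firstPlain = text.match(plain)[0];           // "wxyz"
var firstIgnoreCase = text.match(ignoreCase)[0]; // "Wxyz"
var allMatches = text.match(globalAll);          // ["Wxyz", "wxyz", "WXYZ"]

// With m, ^ and $ also match at line breaks inside the string.
var lines = "first\nsecond";
var withoutM = /^second$/.test(lines);  // false
var withM = /^second$/m.test(lines);    // true

console.log(firstPlain, firstIgnoreCase, allMatches, withoutM, withM);
```

The same flags can be passed as the second argument of the constructor, e.g. new RegExp("wxyz", "gi").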
Test method: A regular expression behaves the same way whether it was created with the RegExp() constructor or as a literal. Use its test method to compare a string to the pattern. In this example, we store the regular expression in the variable "pattern"; we then call the test method (.test) on that variable to obtain a boolean value (true or false) that tells us whether the pattern matches the string we pass it, here a web address:
var pattern = /WXYZ/;
pattern.test("http://www.wxyz.org/") will return false (the pattern only matches uppercase "WXYZ"); we can use this result to check for pattern matches by placing it in an appropriate if condition.
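As a minimal sketch of using the result of test in an if condition: the helper name mentionsWxyz and its return strings are hypothetical, chosen only for this illustration.

```javascript
// Hypothetical helper: reports whether a URL contains "wxyz" in any letter case.
function mentionsWxyz(url) {
    var pattern = /wxyz/i;   // the i flag makes the match case-insensitive
    if (pattern.test(url)) {
        return "match";
    } else {
        return "no match";
    }
}

var strict = /WXYZ/.test("http://www.wxyz.org/");    // false: the case differs
var relaxed = mentionsWxyz("http://www.wxyz.org/");  // "match"
var other = mentionsWxyz("http://www.example.org/"); // "no match"
console.log(strict, relaxed, other);
```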
Creating Patterns: RegExp uses character sequences to create patterns, e.g., to indicate that a character or set of characters should be repeated or should be excluded, etc. Let's look at a set of ways that these patterns are set, and then we'll use them in several examples, below:
Regular Expression Pattern Language
- Positional Indicators— ^ and $ indicate the beginning and end of the string
Example:
var pattern = /org$/; matches only strings ending in "org" (e.g., "acme.org")
var patt_pobox = /^PO/; matches only strings beginning with "PO" (e.g., "PO Box 223")
- Escape Codes—As with strings, regular expressions use escape codes for special characters. Examples are
- \. (period)
- \n (new line)
- \t (tab)
- \\ (backslash)
- \/ (forward slash)
- \* (asterisk)
- \| (Pipe, or horizontal bar)
Example:
var w3c_patt = /http:\/\/www\.w3c\.org/; matches http://www.w3c.org
- Repetition Quantifiers—specify how many times an item in the expression can or must be repeated.
- * "repeated zero or more times"
- + "must be repeated one or more times"
- ? "may occur zero times or once, but no more (i.e., the previous item is optional)"
- {n} "must be repeated exactly n times"
- {n,m} "must be repeated between n and m times"
- {n,} "must be repeated at least n times"
- Grouping—Use parentheses ( ) to create a group of characters and curly brackets {n,m} to indicate "repeated between n and m times." Write the brackets without spaces; a space inside a pattern is matched literally.
Example:
/[0-9]{3}/ means exactly three digits (any digits).
/[a]+/ means "'a', repeated one or more times."
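The positional indicators, escape codes, and quantifiers described so far can be tried out with a sketch like the following. The test strings are invented for illustration, and the quantifier patterns are anchored with ^ and $ (an addition made here) so that the whole string has to fit the pattern.

```javascript
// Positional indicators: ^ anchors the start, $ anchors the end.
var endsInOrg = /org$/;
var startsWithPO = /^PO/;
var a1 = endsInOrg.test("acme.org");       // true
var a2 = endsInOrg.test("acme.org/home");  // false
var a3 = startsWithPO.test("PO Box 223");  // true

// Escape codes: \/ and \. stand for a literal slash and a literal period.
var w3c = /http:\/\/www\.w3c\.org/;
var e1 = w3c.test("http://www.w3c.org");   // true
var e2 = w3c.test("http://wwwXw3cXorg");   // false

// Quantifiers and grouping.
var q1 = /^ab*c$/.test("ac");          // true: * allows zero b's
var q2 = /^ab+c$/.test("ac");          // false: + needs at least one b
var q3 = /^colou?r$/.test("color");    // true: the u is optional
var q4 = /^[0-9]{3}$/.test("1234");    // false: exactly three digits required
var q5 = /^[0-9]{2,4}$/.test("1234");  // true: between two and four digits
var q6 = /^(ab){2}$/.test("abab");     // true: the group repeats as a unit

console.log(a1, a2, a3, e1, e2, q1, q2, q3, q4, q5, q6);
```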
- Character classes—Use square brackets [ ] to list the characters that may be matched at that position in the string.
Example:
var pattern = /[1234567890]+/; will match any string with one or more digits; a shorthand uses the dash (written without spaces), as in
var pattern = /[0-9]+/;
var pattern = /[a-zA-Z0-9]/; would match any alphanumeric character (upper or lower case). The following would test for a telephone number written in the form 123-456-7890:
function isPhoneNumber (phone)
{
// three digits, a dash, three digits, a dash, four digits, and nothing else
var pattern = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/;
return pattern.test (phone);
}
- Negative Character classes—Placing a caret ( ^ ) at the beginning of a bracketed list specifies characters which may NOT be present; don't confuse this usage (inside a set of square brackets) with the use of a caret to indicate "at the beginning of a string" (not inside a set of square brackets).
Example:
var pattern = /[^a-zA-Z]+/; matches any sequence of one or more non-alphabetic characters. Another example:
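var pattern = /[^,]+(,[^,]+){4}/; will check that a string contains five comma-separated strings. Because the pattern is not anchored with ^ and $, a longer list that contains five such strings also matches.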
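The character-class examples above can be exercised as follows. The phone function is the one from these notes; the sample inputs and the anchored five-field variant are assumptions added for illustration.

```javascript
// Telephone check from the notes: ddd-ddd-dddd and nothing else.
function isPhoneNumber(phone) {
    var pattern = /^[0-9]{3}-[0-9]{3}-[0-9]{4}$/;
    return pattern.test(phone);
}
var p1 = isPhoneNumber("555-123-4567");  // true
var p2 = isPhoneNumber("555-1234");      // false: too few digit groups
var p3 = isPhoneNumber("555 123 4567");  // false: spaces instead of dashes

// Negative character class: one or more non-alphabetic characters.
var nonLetters = /[^a-zA-Z]+/;
var n1 = nonLetters.test("abc");     // false: letters only
var n2 = nonLetters.test("abc123");  // true: "123" is non-alphabetic

// Five comma-separated strings, unanchored (as in the notes) and anchored.
var atLeastFive = /[^,]+(,[^,]+){4}/;
var exactlyFive = /^[^,]+(,[^,]+){4}$/;   // assumption: anchors added here
var f1 = atLeastFive.test("a,b,c,d,e");    // true
var f2 = atLeastFive.test("a,b,c");        // false
var f3 = atLeastFive.test("a,b,c,d,e,f");  // true: contains five fields
var f4 = exactlyFive.test("a,b,c,d,e,f");  // false: six fields

console.log(p1, p2, p3, n1, n2, f1, f2, f3, f4);
```

Anchoring is the usual choice when validating a whole form field, since an unanchored pattern only requires that some part of the input fits.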
Common Character Classes: Frequently used character classes have shorthand escape codes:
- . (period)—matches any character except a newline.
- \s—whitespace character. For plain ASCII text this is the same as [ \t\n\r\f\v]
- \w—any word character, same as [a-zA-Z0-9_]
- \W—any non-word character, same as [^a-zA-Z0-9_]
- \d—any digit, same as [0-9]
- \D—any non-digit, same as [^0-9]
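A brief sketch of the shorthand classes in use; the sample strings are invented for illustration.

```javascript
var d1 = /^\d+$/.test("2024");        // true: digits only
var d2 = /^\D+$/.test("2024");        // false: \D excludes digits
var w1 = /^\w+$/.test("user_name1");  // true: letters, digits, underscore
var w2 = /\W/.test("user name");      // true: the space is a non-word character
var dot1 = /^a.c$/.test("abc");       // true: . matches "b"
var dot2 = /^a.c$/.test("a\nc");      // false: . does not match a newline

// \s matches a space, a tab, or a newline, so each one separates a piece.
var pieces = "a b\tc\nd".split(/\s/); // ["a", "b", "c", "d"]

console.log(d1, d2, w1, w2, dot1, dot2, pieces);
```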
Alternatives (logical OR): The | character indicates the logical OR of several items; each side of the | is a complete pattern. For example, /cat|dog/ matches any string that contains either "cat" or "dog".
Further examples:
- /\Wten\W/ matches " ten " (with a non-word character on each side); does not match "ten" or "tents"
- /\wten\w/ matches "aten1"; does not match "ten" or "1ten"
- /\bten\b/ matches "ten" (\b marks a word boundary); does not match "attention," "tensile," or "often"
- /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/ matches "128.22.45.1"; does not match "abc.44.55.42"
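The | operator and the patterns listed above can be checked with a sketch like this one. The weekend pattern is a made-up illustration; the remaining patterns are the ones from the list.

```javascript
// Alternation: parentheses limit how far the | reaches.
var weekend = /^(Saturday|Sunday)$/;
var alt1 = weekend.test("Sunday");  // true
var alt2 = weekend.test("Monday");  // false

// \W needs a non-word character on each side of "ten".
var nw1 = /\Wten\W/.test(" ten ");  // true
var nw2 = /\Wten\W/.test("ten");    // false: nothing surrounds it

// \w needs a word character on each side.
var w1 = /\wten\w/.test("aten1");   // true
var w2 = /\wten\w/.test("1ten");    // false: nothing follows "ten"

// \b matches at a word boundary without consuming a character.
var b1 = /\bten\b/.test("ten");        // true
var b2 = /\bten\b/.test("attention");  // false
var b3 = /\bten\b/.test("often");      // false

// Four dot-separated groups of one to three digits.
var ip = /\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}/;
var ip1 = ip.test("128.22.45.1");   // true
var ip2 = ip.test("abc.44.55.42");  // false: only three digit groups

console.log(alt1, alt2, nw1, nw2, w1, w2, b1, b2, b3, ip1, ip2);
```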
RegExp Object
The RegExp and String objects can be used to test and parse strings.
test ( ): This method returns true or false indicating whether a given string argument matches a regular expression. Use as:
var pattern = new RegExp ("a*bbbc", "i");
var bol_x = pattern.test ("1a12c"); // false
var bol_y = pattern.test ("aaabBbcded"); // true
String Methods for Regular Expressions: The String object has four methods that accept regular expressions; they can be used to modify or break up strings, in addition to matching and extracting:
- search ( )—takes a regular expression argument and returns the index of the character at which the first matching substring begins (or -1 if none is found).
- split ( )—splits a string into substrings, returning them as an array, based on a delimiter (string or regular expression as an argument)
- replace ( )—returns the string that results when you replace text matching its first argument (a regular expression) with the text of the second argument (a string)
- match ( )—uses a regular expression as an argument and returns an array containing the results of the match (usually used with g flag set).
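The four String methods can be compared side by side in a short sketch; the sample text and the replacement string are assumptions chosen for illustration.

```javascript
var s = "Call 555-1234 or 555-9876 today";

// search: index of the first match, or -1 if there is none.
var pos = s.search(/\d/);  // 5
var none = s.search(/z/);  // -1

// split: break a string on a delimiter given as a regular expression.
var parts = "red, green,blue".split(/\s*,\s*/);  // ["red", "green", "blue"]

// replace: with the g flag, every match is replaced.
var masked = s.replace(/\d{3}-\d{4}/g, "XXX-XXXX");

// match: with the g flag, an array of all matching substrings.
var found = s.match(/\d{3}-\d{4}/g);  // ["555-1234", "555-9876"]

console.log(pos, none, parts, masked, found);
```

Without the g flag, replace changes only the first match and match returns details of the first match only.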