In this lesson on regular expressions, we will learn how to use character sets.
Until now, our patterns have used either literal characters (such as abc, 28, 59, etc.) or the wildcard character for a more abstract match. But to handle specific situations, such as requiring that a character be a digit, a letter, a special character, or some combination of these, we need something more: character sets.
A character set lists the characters that are allowed at one position in the match. This gives more control than a wildcard, which accepts almost anything, and more flexibility than a literal character, which accepts only itself. A character set can be used with or without a quantifier.
Some commonly used character sets are given below:
S.No. | Character set | Matches |
1. | [abc] | Matches either an a, b, or c character |
2. | [abcABC] | Matches either an a, A, b, B, c, or C character |
3. | [a-z] | Matches any single lowercase letter from a to z, inclusive |
4. | [A-Z] | Matches any single uppercase letter from A to Z, inclusive |
5. | [a-zA-Z] | Matches any single letter from a to z, in either lowercase or uppercase |
6. | [0-9] | Matches any single digit from 0 to 9 |
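For example, the pattern ‘[a-z]ing’ matches one lowercase letter followed by ‘ing’. It will find a match in strings such as ‘playing’, ‘watching’, and ‘reading’ because the character just before ‘ing’ in each string (‘y’, ‘h’, and ‘d’) falls inside the range of the character set [a-z]. The matched text is therefore ‘ying’, ‘hing’, and ‘ding’ respectively.
In this way, character sets are similar to a wildcard: on its own, a character set matches exactly one character, and it can be followed by a quantifier to match several.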
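The difference between using a character set with and without a quantifier can be seen in a short sketch. The helper name `first_match` and the word list are made up for this illustration.

```python
import re

def first_match(pattern, text):
    """Return the text of the first match, or None if there is no match."""
    m = re.search(pattern, text)
    return m.group() if m else None

words = ["playing", "watching", "reading", "Ring", "5ing"]

# Without a quantifier: exactly one lowercase letter, then 'ing'
single_hits = [first_match(r"[a-z]ing", w) for w in words]

# With the + quantifier: one or more lowercase letters, then 'ing'
many_hits = [first_match(r"[a-z]+ing", w) for w in words]

print(single_hits)  # ['ying', 'hing', 'ding', None, None]
print(many_hits)    # ['playing', 'watching', 'reading', None, None]
```

‘Ring’ and ‘5ing’ do not match because neither ‘R’ nor ‘5’ lies in the range a to z.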
Note that a quantifier character such as ‘*’, ‘+’, or ‘?’ loses its special meaning when it is written inside a character set. Inside square brackets, it is treated like any other literal character.
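A small sketch of this behaviour, using a made-up test string: the same symbols act as literals inside the brackets and as quantifiers outside them.

```python
import re

text = "a+b a*b aab a?b"

# Inside the brackets, +, * and ? are ordinary characters:
# match 'a', then one of those three symbols, then 'b'
literal_hits = re.findall(r"a[+*?]b", text)

# Outside the brackets, + is a quantifier:
# match one or more 'a' characters, then 'b'
quantified_hits = re.findall(r"a+b", text)

print(literal_hits)     # ['a+b', 'a*b', 'a?b']
print(quantified_hits)  # ['aab']
```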
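Further, we can also include a whitespace character inside a character set. On its own, such a set matches a single whitespace character at that position; to match one or more whitespace characters, follow the set with a quantifier.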
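As an illustration, the sketch below splits a made-up string on a character set that contains a space, a comma, and a semicolon, first without and then with a quantifier.

```python
import re

text = "first name,last name;  city"

# [ ,;] matches exactly one separator, so the run ';  ' produces empty strings
single_sep = re.split(r"[ ,;]", text)

# [ ,;]+ treats a run of separators as a single split point
run_sep = re.split(r"[ ,;]+", text)

print(single_sep)  # ['first', 'name', 'last', 'name', '', '', 'city']
print(run_sep)     # ['first', 'name', 'last', 'name', 'city']
```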
Complement Operator ‘^’
If we want to match any character other than the ones listed in a character set, we can use the caret operator ‘^’.
We have already seen ‘^’ used as an anchor: placed outside a character set, it marks the start of a string. When it is placed as the first character inside a character set, it acts as a complement operator, and the set matches any character other than the ones specified.
For example, the pattern ‘[a-z]’ matches any single lowercase letter. On the other hand, the pattern ‘[^a-z]’ matches any single character that is not a lowercase letter, such as a digit, an uppercase letter, or a symbol.
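The sketch below applies a set and its complement to the same made-up string. It also shows that a caret which is not the first character inside the brackets is just a literal caret.

```python
import re

text = "abc-123_XYZ"

# Every character that IS a lowercase letter
lower = re.findall(r"[a-z]", text)

# Every character that is NOT a lowercase letter
not_lower = re.findall(r"[^a-z]", text)

# A caret that is not first in the set is matched literally
literal_caret = re.findall(r"[a^]", "a^b")

print(lower)          # ['a', 'b', 'c']
print(not_lower)      # ['-', '1', '2', '3', '_', 'X', 'Y', 'Z']
print(literal_caret)  # ['a', '^']
```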
Next comes another important concept in regular expressions: meta sequences.
Meta Sequences
Meta sequences are shorthands for commonly used character sets. Some of the important meta sequences and their equivalent character sets are listed below.
S.No. | Meta Sequence | Equivalent Character set |
1. | \s | [ \t\n\r\f\v] |
2. | \S | [^ \t\n\r\f\v] |
3. | \d | [0-9] |
4. | \D | [^0-9] |
5. | \w | [a-zA-Z0-9_] |
6. | \W | [^a-zA-Z0-9_] |
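The equivalences in the table can be checked with a short sketch. One caveat worth knowing: in Python 3, with ordinary (str) patterns, `\d`, `\w`, and `\s` also match non-ASCII Unicode digits, word characters, and whitespace by default, so the table holds exactly only when the `re.ASCII` flag is used. The sample text is made up for this illustration.

```python
import re

text = "user_01 logged in\tat 09:45\n"

# Each meta sequence paired with the character set from the table
pairs = [
    (r"\s", r"[ \t\n\r\f\v]"),
    (r"\S", r"[^ \t\n\r\f\v]"),
    (r"\d", r"[0-9]"),
    (r"\D", r"[^0-9]"),
    (r"\w", r"[a-zA-Z0-9_]"),
    (r"\W", r"[^a-zA-Z0-9_]"),
]

# With re.ASCII, each shorthand finds exactly what its character set finds
agree = {
    short: re.findall(short, text, flags=re.ASCII) == re.findall(charset, text)
    for short, charset in pairs
}
print(agree)

# Without re.ASCII, \d also accepts digits from other scripts,
# for example the Arabic-Indic digit three (U+0663)
unicode_digit_default = re.search(r"\d", "\u0663") is not None
unicode_digit_ascii = re.search(r"\d", "\u0663", flags=re.ASCII) is not None
print(unicode_digit_default, unicode_digit_ascii)  # True False
```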
Exercise 1: Character sets
Description
Write a regular expression, using meta sequences, that matches the usernames of the users of a database. A username starts with one to ten letters and is followed by a four-digit number.
Sample positive matches:
sam2340
irfann2590
Sample negative matches:
8730
bobby9073834
sameer728
radhagopalaswamy7890
# import regular expression library
import re

# input strings on which to test the regex pattern
sample_sent = ['sam2340', 'irfann2590', '8730', 'bobby9073834', 'sameer728', 'radhagopalaswamy7890']

# raw string, so that the backslash in \d reaches the regex engine unchanged
pattern = r'^[a-z]{1,10}[\d]{4}$'

# iterate over the sample strings and print the match status as True or False
for sent in sample_sent:
    # check whether the pattern is present in the string or not
    result = re.search(pattern, sent)  # returns a match object, or None if there is no match
    if result is not None:
        print(True)
    else:
        print(False)
True
True
False
False
False
False
As the code above shows, the pattern is built piece by piece. The anchor ‘^’ pins the match to the start of the string, since the username must begin with letters. The character set [a-z] restricts those characters to lowercase letters, and the quantifier {1,10} allows between one and ten of them. The meta sequence [\d] (equivalent to \d on its own) then matches the digits that follow the letters, and the quantifier {4} requires exactly four of them. Finally, the anchor ‘$’ marks the end of the string, so nothing may follow the four digits.
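As an alternative sketch, not the solution given above, the same check can be written with `re.fullmatch`, which requires the whole string to match and so makes the two anchors unnecessary. The function name `is_valid_username` is made up for this illustration. One practical difference: ‘$’ also matches just before a trailing newline, whereas `re.fullmatch` does not accept that extra character.

```python
import re

# Same pattern as in the exercise, without the ^ and $ anchors
USERNAME = re.compile(r"[a-z]{1,10}\d{4}")

def is_valid_username(name):
    """Return True if the whole string is 1-10 lowercase letters plus 4 digits."""
    return USERNAME.fullmatch(name) is not None

samples = ['sam2340', 'irfann2590', '8730', 'bobby9073834',
           'sameer728', 'radhagopalaswamy7890']
results = [is_valid_username(name) for name in samples]
print(results)  # [True, True, False, False, False, False]

# '$' matches before a trailing newline, so the anchored search accepts this
anchored_accepts_newline = re.search(r"^[a-z]{1,10}\d{4}$", "sam2340\n") is not None
# fullmatch requires the entire string, newline included, to fit the pattern
fullmatch_accepts_newline = is_valid_username("sam2340\n")
print(anchored_accepts_newline, fullmatch_accepts_newline)  # True False
```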
We hope that now you have clarity in using character sets, meta sequences, quantifiers, anchors, and wildcards both individually and with combinations.
In Lesson 6, we will study the five most commonly used regular expression functions in Python.