Regular Expressions in Python
The term regular expression is commonly shortened to regex. A regex is a sequence of characters that defines a search pattern, used mainly for performing find and replace operations in search engines and text processors.
Python offers regex capabilities through the re module bundled as a part of the standard library.
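As a minimal sketch of the find and replace operations mentioned above, the snippet below finds the first run of digits with re.search() and replaces every run of digits with re.sub(). The sample text and variable names are illustrative assumptions, and re.sub() is a standard re function that this article does not otherwise cover.

```python
import re

text = "The year 2019 was followed by 2020."  # illustrative sample text

# Find: locate the first run of digits in the text.
first = re.search(r"\d+", text)
print(first.group())  # 2019

# Replace: substitute every run of digits with a placeholder.
masked = re.sub(r"\d+", "####", text)
print(masked)  # The year #### was followed by ####.
```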
Raw strings
Several functions in Python's re module are usually given a raw string as the pattern argument. A normal string literal becomes a raw string when it is prefixed with 'r' or 'R'.

    rawstr = r'Hello! How are you?'
    print(rawstr)  # Hello! How are you?
The difference between a normal string and a raw string is that Python translates escape sequences (such as \n and \t) in a normal string, while in a raw string they are left exactly as written.
    # Normal string: \n is translated into a newline
    str1 = "Hello!\nHow are you?"
    print("normal string:", str1)

    # Raw string: \n stays as a backslash followed by n
    str2 = r"Hello!\nHow are you?"
    print("raw string:", str2)

Output:

    normal string: Hello!
    How are you?
    raw string: Hello!\nHow are you?
In the above example, \n inside str1 (a normal string) is translated into a newline, so the text after it is printed on the next line. In str2, a raw string, it is printed literally as \n.
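Raw strings matter for regex patterns because the regex engine has its own backslash sequences. The following sketch, with illustrative variable names and sample text, shows that a raw pattern and a normal string with doubled backslashes describe the same pattern.

```python
import re

# "\n" is one character (a newline); r"\n" is two (a backslash and the letter n).
normal_len = len("\n")
raw_len = len(r"\n")
print(normal_len, raw_len)  # 1 2

# Without the r prefix, a backslash meant for the regex engine must be doubled.
raw_pattern = r"\d+"
escaped_pattern = "\\d+"
print(raw_pattern == escaped_pattern)  # True

# Both forms find the same runs of digits.
digits = re.findall(raw_pattern, "Room 12, floor 3")
print(digits)  # ['12', '3']
```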
meta characters
Some characters carry a special meaning when they appear as part of a pattern-matching string. The * and ? wildcards used in Windows (DOS) or Linux shell commands are similar to meta characters. Python's re module uses the following characters as meta characters:
. ^ $ * + ? [ ] \ | ( )
When a set of alphanumeric characters is placed inside square brackets [], the target string is matched against any one of these characters. Individual characters or a range of characters can be listed inside the square brackets. For example, [abc] matches a, b, or c, and [a-z] matches any lowercase letter from a to z.
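Some backslash sequences also carry a specific meaning. For example, \d matches any digit, \w matches any word character (a letter, digit, or underscore), and \s matches any whitespace character.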
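To make the bracket notation concrete, here is a small sketch that applies a few character classes with re.findall(), which is described later in this article. The sample text and variable names are illustrative assumptions.

```python
import re

sample = "Order 66 shipped to Bay 7 at 10:45"  # illustrative sample text

vowels = re.findall(r"[aeiou]", sample)    # any single lowercase vowel
capitals = re.findall(r"[A-Z]", sample)    # any single uppercase letter
numbers = re.findall(r"\d+", sample)       # one or more digits in a row
words = re.findall(r"[A-Za-z]+", sample)   # one or more letters in a row

print(vowels)    # ['e', 'i', 'e', 'o', 'a', 'a']
print(capitals)  # ['O', 'B']
print(numbers)   # ['66', '7', '10', '45']
print(words)     # ['Order', 'shipped', 'to', 'Bay', 'at']
```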
re.match() function
This function in the re module checks whether the specified pattern is present at the beginning of the given string.

    re.match(pattern, string)
The function returns None if the pattern is not found at the beginning of the string, and a match object if it is.
    from re import match

    mystr = "Welcome to TutorialsTeacher"

    # "We" is at the beginning of the string, so a match object is returned
    obj1 = match("We", mystr)
    print(obj1)

    # "teacher" is not at the beginning of the string, so None is returned
    obj2 = match("teacher", mystr)
    print(obj2)

    print("start:", obj1.start(), "end:", obj1.end())

Output:

    <re.Match object; span=(0, 2), match='We'>
    None
    start: 0 end: 2

The match object has start() and end() methods, which return the positions where the match begins and ends.
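Because re.match() may return None, calling start() or group() on the result without a check raises an AttributeError. One possible way to guard against this is sketched below. The helper name starts_with is hypothetical and not part of the re module.

```python
import re

def starts_with(pattern, text):
    """Return the matched prefix, or None if text does not begin with pattern."""
    m = re.match(pattern, text)
    if m is None:
        return None
    return m.group()

found = starts_with("We", "Welcome to TutorialsTeacher")
missing = starts_with("teacher", "Welcome to TutorialsTeacher")
print(found)    # We
print(missing)  # None
```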
The following example uses a character range to check whether a string starts with 'W' followed by a lowercase letter.
    from re import match

    strings = ["Welcome to TutorialsTeacher", "weather forecast", "Winston Churchill",
               "W.G.Grace", "Wonders of India", "Water park"]

    # Match 'W' followed by one lowercase letter at the start of each string
    for string in strings:
        obj = match("W[a-z]", string)
        print(obj)

Output:

    <re.Match object; span=(0, 2), match='We'>
    None
    <re.Match object; span=(0, 2), match='Wi'>
    None
    <re.Match object; span=(0, 2), match='Wo'>
    <re.Match object; span=(0, 2), match='Wa'>
re.search() function
The re.search() function searches for the specified pattern anywhere in the given string and stops at the first occurrence.
    from re import search

    string = "Try to earn while you learn"

    # Find the first occurrence of "earn" anywhere in the string
    obj = search("earn", string)
    print(obj)
    print(obj.start(), obj.end(), obj.group())

Output:

    <re.Match object; span=(7, 11), match='earn'>
    7 11 earn
This function also returns a Match object with start() and end() methods. Its group() method returns the part of the string that matched the pattern.
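The difference between re.match() and re.search() is easiest to see when both are applied to the same string. The following is a small sketch that reuses the sample sentence from above. The variable names are illustrative.

```python
import re

string = "Try to earn while you learn"

# match() only looks at the beginning of the string, so "earn" is not found.
at_start = re.match("earn", string)

# search() scans the whole string and returns the first occurrence.
anywhere = re.search("earn", string)

# The ^ meta character anchors a search() pattern to the start of the string.
anchored = re.search("^Try", string)

print(at_start)         # None
print(anywhere.span())  # (7, 11)
print(anchored.group()) # Try
```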
re.findall() Function
Unlike the search() function, findall() continues to search for the pattern until the target string is exhausted. The function returns a list of all occurrences.
    from re import findall

    string = "Try to earn while you learn"

    # Collect every occurrence of "earn" in the string
    obj = findall("earn", string)
    print(obj)

Output:

    ['earn', 'earn']
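This function can be used to get the list of words in a sentence. We shall use the \w* pattern for the purpose. Because \w* also matches zero characters, the resulting list contains empty strings, which the loop skips. We also check which of the words do not have any vowels in them.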
    from re import findall, search

    # \w* matches runs of word characters (including empty runs)
    words = findall(r"\w*", "Fly in the sky.")
    print(words)

    # Print the non-empty words that contain no lowercase vowel
    for word in words:
        obj = search(r"[aeiou]", word)
        if word != '' and obj is None:
            print(word)

Output:

    ['Fly', '', 'in', '', 'the', '', 'sky', '', '']
    Fly
    sky
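The empty strings in the list come from \w* matching zero characters. A possible variation, shown here as a sketch rather than as the article's original method, uses \w+ (one or more word characters) so that only actual words are returned. The variable names are illustrative.

```python
import re

sentence = "Fly in the sky."

# \w+ requires at least one word character, so no empty strings are produced.
words = re.findall(r"\w+", sentence)
print(words)  # ['Fly', 'in', 'the', 'sky']

# Keep the words that contain no vowel in either case.
no_vowels = [w for w in words if re.search(r"[aeiouAEIOU]", w) is None]
print(no_vowels)  # ['Fly', 'sky']
```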
re.finditer() function
The re.finditer() function returns an iterator over all matches in the target string. For each match, the start and end positions can be obtained with the span() method.
    from re import finditer

    string = "Try to earn while you learn"

    # Iterate over every match and print its (start, end) positions
    it = finditer("earn", string)
    for match in it:
        print(match.span())

Output:

    (7, 11)
    (23, 27)
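Since each item produced by finditer() is a full match object, the matched text is available along with its position. The sketch below uses the illustrative pattern \w*earn to capture the whole word that ends in "earn". The pattern and variable names are assumptions chosen for this example.

```python
import re

string = "Try to earn while you learn"

# Collect (matched text, start, end) for every word ending in "earn".
positions = [(m.group(), m.start(), m.end())
             for m in re.finditer(r"\w*earn", string)]

for text, start, end in positions:
    print(text, start, end)
# earn 7 11
# learn 22 27
```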
re.split() function
The re.split() function works similarly to the split() method of the str object in Python. It splits the given string wherever the pattern matches, which in the example below is at every space. In the earlier findall() example, the list of words also contained an empty string at each position where no word character was found. The split() function in the re module avoids this.
    from re import split

    string = "Flat is better than nested. Sparse is better than dense."

    # Split the sentence at every single space
    words = split(r' ', string)
    print(words)

Output:

    ['Flat', 'is', 'better', 'than', 'nested.', 'Sparse', 'is', 'better', 'than', 'dense.']
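Where re.split() differs from str.split() is that the separator is a pattern rather than a fixed string. The following sketch, using an illustrative sample text, splits on any run of whitespace or commas.

```python
import re

text = "Flat is  better,than nested"  # note the double space and the comma

# str.split(" ") splits on exactly one space, leaving an empty item
# for the double space and ignoring the comma.
by_space = text.split(" ")
print(by_space)  # ['Flat', 'is', '', 'better,than', 'nested']

# re.split() with a character class treats any run of whitespace
# or commas as a single separator.
by_pattern = re.split(r"[\s,]+", text)
print(by_pattern)  # ['Flat', 'is', 'better', 'than', 'nested']
```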
re.compile() Function
The re.compile() function returns a pattern object that can be used repeatedly with different regex methods. In the following example, the character class [aeiou] is compiled into a pattern object, and its match() method is applied to each word to check whether the word begins with a vowel.
    import re

    # Compile the vowel character class once and reuse the pattern object
    pattern = re.compile(r'[aeiou]')
    string = "Flat is better than nested. Sparse is better than dense."
    words = re.split(r' ', string)

    # match() succeeds only for words that begin with a vowel
    for word in words:
        print(word, pattern.match(word))

Output:

    Flat None
    is <re.Match object; span=(0, 1), match='i'>
    better None
    than None
    nested. None
    Sparse None
    is <re.Match object; span=(0, 1), match='i'>
    better None
    than None
    dense. None
The same pattern object can be reused in searching for words having vowels, as shown below.
    for word in words:
        print(word, pattern.search(word))

Output:

    Flat <re.Match object; span=(2, 3), match='a'>
    is <re.Match object; span=(0, 1), match='i'>
    better <re.Match object; span=(1, 2), match='e'>
    than <re.Match object; span=(2, 3), match='a'>
    nested. <re.Match object; span=(1, 2), match='e'>
    Sparse <re.Match object; span=(2, 3), match='a'>
    is <re.Match object; span=(0, 1), match='i'>
    better <re.Match object; span=(1, 2), match='e'>
    than <re.Match object; span=(2, 3), match='a'>
    dense. <re.Match object; span=(1, 2), match='e'>
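A compiled pattern object also offers findall(), so the same object can be used to count matches. The sketch below counts the lowercase vowels in each word. The variable names and the shortened sample sentence are illustrative assumptions.

```python
import re

# Compile once, then reuse the pattern object for every word.
vowel = re.compile(r"[aeiou]")
string = "Flat is better than nested."

# Map each word to the number of lowercase vowels it contains.
counts = {word: len(vowel.findall(word)) for word in string.split()}
print(counts)
# {'Flat': 1, 'is': 1, 'better': 2, 'than': 1, 'nested.': 2}
```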