REGEX is sexy
A regular expression is a tool for processing text, and most programming languages support it. Many people find it difficult, and yes, I think so too. However, once we learn the concepts and rules, it becomes familiar, and we can use it happily and comfortably in every project.
Concept of REGEX
REGEX stands for REGular EXpression. It is written as a string and is built for string or text processing.
These are the main concepts of REGEX:
- Decide what can appear at each position in a string.
- Every character belongs to a class.
- Define a quantifier for how many characters of the same class appear in a row.
- When we have choices, we use alternation.
- Add anchors for the beginnings or endings of words or strings.
- Apply escape characters if needed.
- Select parts of the text with capture groups.
Class
ASCII is the basis of characters in programming 101. REGEX builds on it to define classes:
- Digits are "\d". Non-digits are "\D".
- Word characters (English letters, digits, and the underscore) are "\w", from “word”; everything else is "\W".
- Whitespace characters are "\s"; everything else is "\S".
- For any character, we put "." (dot).
- New line symbols are "\R", from “Return”; everything else is "\N". Not every engine supports these two.
- Characters of other languages are known as Unicode characters. They are written as "\p{language}", for example "\p{Thai}". For more info, please visit regular-expressions.info/unicode
Sometimes we require a list of characters. For that, apply "[]":
- To select only A, B, C, or D, use "[ABCD]"
- To select anything but A, B, C, and D, use "[^ABCD]"
- To select a range such as the letters ‘a’ to ‘x’, use "[a-x]"
We can rewrite some classes with "[]" (for ASCII text):
- "\d" can be replaced with "[0-9]"
- "\D" can be replaced with "[^0-9]"
- "\w" can be replaced with "[a-zA-Z0-9_]"
- "\W" can be replaced with "[^a-zA-Z0-9_]"
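To see the bracket lists and the shorthand classes side by side, here is a minimal sketch using Python's standard library re. The sample strings are made up for illustration. Note that in Python's re, "\w" also covers digits and the underscore, and the re.ASCII flag is used so the shorthand classes stay within ASCII.

```python
import re

sample = "ABCDEF abcxyz"

# [ABCD] matches one character that is A, B, C, or D.
only_abcd = re.findall(r"[ABCD]", sample)
# [^ABCD] matches one character that is anything except A, B, C, or D.
not_abcd = re.findall(r"[^ABCD]", sample)
# [a-x] matches one lowercase letter in the range a to x.
a_to_x = re.findall(r"[a-x]", sample)

print(only_abcd)  # ['A', 'B', 'C', 'D']
print(not_abcd)   # ['E', 'F', ' ', 'a', 'b', 'c', 'x', 'y', 'z']
print(a_to_x)     # ['a', 'b', 'c', 'x']

text = "id_42 costs 7 baht"

# With re.ASCII, the shorthand classes line up with the bracket forms.
digits_short = re.findall(r"\d", text, flags=re.ASCII)
digits_set = re.findall(r"[0-9]", text)
words_short = re.findall(r"\w+", text, flags=re.ASCII)
words_set = re.findall(r"[a-zA-Z0-9_]+", text)

print(digits_short == digits_set)  # True
print(words_short == words_set)    # True
print(words_short)                 # ['id_42', 'costs', '7', 'baht']
```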
Quantifier
Once the classes are selected, we can define how many characters of each class to match.
Least | Most | Add this after the class |
---|---|---|
0 | Any | * |
1 | Any | + |
0 | 1 | ? |
3 | 3 | {3} |
3 | 9 | {3,9} |
3 | Any | {3,} |
For instance, a text consisting of 3 digits followed by word characters of any length can be "\d{3}\w*".
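The quantifiers in the table can be checked one by one. This is a small sketch with Python's re; matches is a hypothetical helper of mine that tests the whole string with re.fullmatch, and the inputs are made up.

```python
import re

def matches(pattern: str, text: str) -> bool:
    """Return True when the whole text fits the pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"\d*", ""))                # True: zero or more digits
print(matches(r"\d+", ""))                # False: needs at least one digit
print(matches(r"\d?", "12"))              # False: at most one digit
print(matches(r"\d{3}", "123"))           # True: exactly three digits
print(matches(r"\d{3,9}", "12"))          # False: fewer than three digits
print(matches(r"\d{3,}", "1234567890"))   # True: three or more digits
print(matches(r"\d{3}\w*", "123abc"))     # True: three digits, then word characters
```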
Alternation
Put "|" between choices.
- "a|b" means either "a" or "b"
- "cat|dog" means either "cat" or "dog"
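A short sketch of "|" in Python's re follows. The sentences and the "grey" example are my own; the second part shows that "|" splits the whole pattern unless parentheses limit its reach.

```python
import re

# Either "cat" or "dog" is accepted at each position.
pets = re.findall(r"cat|dog", "a cat, a dog, and a bird")
print(pets)  # ['cat', 'dog']

# With parentheses the choice is only between "a" and "e".
with_group = re.fullmatch(r"gr(a|e)y", "grey") is not None
# Without them the choice is between "gra" and "ey".
without_group = re.fullmatch(r"gra|ey", "grey") is not None

print(with_group)     # True
print(without_group)  # False
```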
Anchor
An anchor represents the beginning or ending of words or texts.
- The beginning of the line is represented by "^". "^a" means the line starts with "a".
- The ending of the line is "$". "z$" means "z" is the last character of that line.
- The ending of a word can be matched with "\b", from “boundary”. It matches a position with a word character on one side and a non-word character (or the start or end of the text) on the other. For example, "x\b" will target the "x" in "x.", "x;", "x!" but not an "x" followed by another word character, as in "xa".
- The opposite is "\B", a position that is not a word boundary. For example, "x\B" will target the "x" followed by a word character in "xa", "xx", "xz" but not in "x.", "x+".
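The anchors can be tried out with re.search in Python. This is only a sketch; the word lists are sample inputs chosen for illustration.

```python
import re

print(re.search(r"^a", "apple") is not None)   # True: starts with "a"
print(re.search(r"^a", "banana") is not None)  # False: "a" is not at the start
print(re.search(r"z$", "jazz") is not None)    # True: ends with "z"

# \b: the "x" must be followed by a non-word character or the end of the text.
ends_word = [s for s in ["x.", "x;", "x!", "xa"] if re.search(r"x\b", s)]
# \B: the "x" must be followed by another word character.
inside_word = [s for s in ["xa", "xz", "x.", "x+"] if re.search(r"x\B", s)]

print(ends_word)    # ['x.', 'x;', 'x!']
print(inside_word)  # ['xa', 'xz']
```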
Escape characters
Add a backslash “\” before a special character to match it literally.
- "*" will be "\*"
- "." will be "\."
- "$" will be "\$"
and so on.
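Here is a small sketch of why escaping matters, using Python's re. The price string is made up. re.escape is a standard library function that adds the backslashes for us.

```python
import re

price = "Total: $5.00 (approx.)"

# Unescaped, "." matches any character, so "5.00" also matches "5x00".
loose = re.search(r"5.00", "5x00") is not None
# Escaped, "\." matches only a literal dot.
strict = re.search(r"5\.00", "5x00") is not None
# "\$" matches a literal dollar sign instead of the end of the line.
dollar = re.search(r"\$5\.00", price) is not None

# re.escape builds the escaped pattern from a plain string.
literal = re.escape("$5.00")
escaped_hit = re.search(literal, price) is not None

print(loose, strict, dollar, escaped_hit)  # True False True True
```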
Capture group
Wrap part of the REGEX in parentheses to make a capture group. This enables the following syntaxes.
- Refer to the capture group using "\index", where index is the number of that group. Let’s say "(a|b|c)\1" means there is a capture group selecting the letter “a”, “b”, or “c” as the 1st group, plus "\1" as the reference to what the 1st group matched. The result should be one of "aa", "bb", or "cc".
- Refer to the capture group by name. We need to name the capture group first. For example, "(?'x1'(a|b|c))\k'x1'" gives the same result as the above, but now we’re using the name “x1”. The naming syntax varies between engines.
- Use capture groups for substitution and extraction. For instance, substitute every digit with an “x”, or extract every digit followed by the letter “a” from given texts.
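The three uses of capture groups can be sketched in Python's re as below. Python writes named groups as (?P<name>...) and (?P=name), which differs from the quoted-name form shown in the list; the sample strings are my own.

```python
import re

# Back-reference by index: \1 repeats whatever group 1 matched.
doubles = [s for s in ["aa", "bb", "ab", "cc"] if re.fullmatch(r"(a|b|c)\1", s)]
print(doubles)  # ['aa', 'bb', 'cc']

# Back-reference by name, in Python's own syntax.
named = [s for s in ["aa", "ab"] if re.fullmatch(r"(?P<x1>a|b|c)(?P=x1)", s)]
print(named)  # ['aa']

# Substitution: replace every digit with "x".
masked = re.sub(r"\d", "x", "tel 081-234")
print(masked)  # tel xxx-xxx

# Extraction: keep only the digit that is followed by the letter "a".
extracted = re.findall(r"(\d)a", "1a 2b 3a 44a")
print(extracted)  # ['1', '3', '4']
```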
Tools
These are the tools I use to check REGEX strings before running them in my jobs.
Real cases
SQL on Google BigQuery
Google BigQuery supports REGEX well, as in the example below.
WITH test_set AS (
SELECT ["+66876543210", "0812345678", "9876543210987",
"[email protected]", "[email protected]",
"[email protected]",
"1234567890123", "9876543210987", "#iphone12mini", "#รักเธอที่สุด"
] AS text
)
SELECT text,
regexp_contains(text, r'^0[689]\d{8}$') as is_mobile,
regexp_contains(text, r'[\d\w\-_\.]+\@[\d\w\-_\.]+\..*') as is_email,
regexp_contains(text, r'^\d{13}$') as is_thaiid,
regexp_contains(text, r'#[\p{Thai}\w\d_]+') as is_hashtag
FROM test_set, unnest(text) text
This diagram illustrates how each REGEX maps to a sample string.
This is the result of the query above. true here shows which type each text is, as indicated by each REGEX.
Python script
The second case is Python. I use the library regex, which is more flexible than the standard library re. re doesn’t support the class "\p{language}" needed for this case.
import regex  # third-party library, more flexible than the standard re

test_set = ["+66876543210", "0812345678", "9876543210987",
            "[email protected]", "[email protected]",
            "[email protected]", "1234567890123",
            "9876543210987", "#iphone12mini", "#รักเธอที่สุด"]

# Raw strings (r"...") pass the backslashes to the REGEX engine untouched.
rgx_mobile = r"^0[689]\d{8}$"
rgx_email = r"[\d\w\-_\.]+\@[\d\w\-_\.]+\..*"
rgx_thaiid = r"^\d{13}$"
rgx_hashtag = r"#[\p{Thai}\w\d_]+"

# regex.match only tries each pattern at the start of the text.
for t in test_set:
    if regex.match(rgx_mobile, t) is not None:
        print(t, "is mobile")
    elif regex.match(rgx_email, t) is not None:
        print(t, "is email")
    elif regex.match(rgx_thaiid, t) is not None:
        print(t, "is thaiid")
    elif regex.match(rgx_hashtag, t) is not None:
        print(t, "is hashtag")
    else:
        print(t, "is others")
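One detail worth noting: the script above calls regex.match, which only tries the pattern at the start of the text, while BigQuery's regexp_contains looks for the pattern anywhere in the text. The sketch below shows the difference with the standard library re, whose match, search, and fullmatch behave the same way on this point. The hashtag pattern is simplified and the sample text is made up.

```python
import re

rgx_hashtag = r"#\w+"
text = "new phone #iphone12mini"

# match: the pattern must start at position 0.
at_start = re.match(rgx_hashtag, text) is not None
# search: the pattern may start anywhere.
anywhere = re.search(rgx_hashtag, text) is not None

print(at_start)  # False
print(anywhere)  # True

# fullmatch: the whole text must fit, so "^" and "$" are not needed.
exact_13 = re.fullmatch(r"\d{13}", "1234567890123") is not None
too_long = re.fullmatch(r"\d{13}", "12345678901234") is not None

print(exact_13)  # True
print(too_long)  # False
```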
Be careful
As mentioned, REGEX has solid patterns, but we need to check which REGEX engine our work actually runs on, because different engines may not support the same classes. Python has the libraries re and regex, while Google BigQuery functions rely on Google's re2 library.
For more info about REGEX engine, please read https://en.wikipedia.org/wiki/Comparison_of_regular-expression_engines
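To make the engine difference concrete, this sketch checks whether Python's standard re accepts the "\p{Thai}" class. The fallback pattern, which spells out the Thai Unicode block U+0E00 to U+0E7F, is my own assumption about one possible workaround and is not a full equivalent of "\p{Thai}".

```python
import re

pattern = r"#[\p{Thai}\w]+"

# The standard re raises an error for the unknown escape "\p".
try:
    re.compile(pattern)
    supported = True
except re.error as err:
    supported = False
    print("re rejected the pattern:", err)

print(supported)  # False

# Assumed workaround: list the Thai block by code point range instead.
fallback = r"#[\u0E00-\u0E7F\w]+"
thai_hit = re.fullmatch(fallback, "#รักเธอที่สุด") is not None
ascii_hit = re.fullmatch(fallback, "#iphone12mini") is not None

print(thai_hit, ascii_hit)  # True True
```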
Sooner or later, I’m quite certain that those of us in the programming and data fields will spend some time working with REGEX.