Master regular expressions in one article: worth bookmarking!

Foreword:

It is better to teach people how to fish than to give them a fish. When programming, you will often run into strings that follow complicated rules. For example, on a Linux system you may need to replace a certain piece of code in multiple files. Surely you would not open each file and replace it one by one? If you have faced this kind of problem, regular expressions are a skill you must know.

1. What is a regular expression

A regular expression is a kind of logical formula for string manipulation. It uses predefined special characters, and combinations of them, to form a "rule string". This "rule string" expresses a filtering logic for strings. In other words, regular expressions are codes that record text rules.

You may have used wildcards for file search under Windows, namely * and ?. If you want to find all PDF documents in a certain directory, you can search for *.pdf.

Here, * is interpreted as an arbitrary string. Like wildcards, regular expressions are a tool for text matching, but they can describe your needs much more precisely. The cost, of course, is more complexity. For example, you can write a regular expression to find all strings that start with 0, followed by 2 or 3 digits, then a hyphen "-", and finally 7 or 8 digits (like 011-12345678 or 0856-7654321).
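As a rough sketch of that description, here is one way it could be written with Python's re module. The choice of Python and the names PHONE and found are assumptions made for illustration; the article itself does not give this pattern.

```python
import re

# 0, then 2 or 3 digits, a hyphen, then 7 or 8 digits.
PHONE = re.compile(r"0\d{2,3}-\d{7,8}")

text = "Call 011-12345678 or 0856-7654321, not 12-345."
found = PHONE.findall(text)
print(found)  # ['011-12345678', '0856-7654321']
```

The individual pieces (\d, the braces, and so on) are explained in the sections that follow.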

2. Getting started

The best way to learn regular expressions is to start with examples.

If you are looking for the word me in an English text, you can use the regular expression me.

This is almost the simplest regular expression. It matches exactly this string: two characters, the first being m and the second being e. Tools that process regular expressions usually provide an option to ignore case. If this option is selected, the expression matches any of the four forms me, ME, Me, and mE.

Unfortunately, many words contain the two consecutive characters me, such as mean, measure, and home. If you search with me, the me inside these words will also be found. To find the word me precisely, use \bme\b.

\b is a special code defined by regular expressions (such codes are called metacharacters). It represents the beginning or end of a word, that is, the boundary of a word. Although English words are usually separated by spaces, punctuation, or newlines, \b does not match any of these separator characters; it only matches a position.
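The difference between me and \bme\b can be seen in a short Python sketch. The sample sentence and variable names are invented for illustration, and re.IGNORECASE is Python's counterpart of the "ignore case" option mentioned above.

```python
import re

text = "Tell me what you mean, then measure it for me."

# Plain "me" also matches inside "mean" and "measure".
plain = re.findall(r"me", text)

# \b anchors the match to word boundaries, so only the standalone word counts.
whole_word = re.findall(r"\bme\b", text)

# Ignoring case matches all four spellings.
any_case = re.findall(r"\bme\b", "Me, ME, mE and me", flags=re.IGNORECASE)

print(len(plain), len(whole_word), any_case)
```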

If what you are looking for is me followed, not far behind, by james, you should use \bme\b.*\bjames\b.

Here, . is another metacharacter, which matches any character except the newline character. * is also a metacharacter, but it represents neither a character nor a position. It represents a quantity: it specifies that the content before * may be repeated any number of times in order to make the whole expression match.

Therefore, .* together means any number of characters that do not include a newline. Now the meaning of \bme\b.*\bjames\b is obvious: first the word me, then any number of characters (but not newlines), and finally the word james.
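A small Python sketch of this pattern follows. The two test strings are made up for illustration; the second one shows that . stops at a newline by default.

```python
import re

pattern = re.compile(r"\bme\b.*\bjames\b")

same_line = "give it to me, then ask james about it"
split_lines = "give it to me\nthen ask james"

m = pattern.search(same_line)
matched = m.group(0) if m else None

# No match here: .* cannot cross the newline between "me" and "james".
no_match = pattern.search(split_lines)

print(matched)   # me, then ask james
print(no_match)  # None
```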

3. Metacharacters

Regular expressions consist of ordinary characters and metacharacters. Ordinary characters include uppercase and lowercase letters and digits, while metacharacters have special meanings. To use regular expressions properly, the most important thing is to understand the metacharacters correctly. The following table lists commonly used metacharacters:

Metacharacter   Description
.               Matches any single character except "\n" and "\r". To match any character including "\n" and "\r", use a pattern like "[\s\S]".
\w              Matches any word character, including the underscore. Similar but not equivalent to "[A-Za-z0-9_]": the "word" characters here are drawn from the Unicode character set.
\s              Matches any whitespace character, including spaces, tabs, form feeds, etc. Equivalent to [ \f\n\r\t\v].
\d              Matches a digit character. Equivalent to [0-9] for ASCII text. With grep, add the -P option (Perl-compatible regular expressions) to use it.
\b              Matches a word boundary, that is, the position between a word character and a non-word character such as a space. (A regular expression "match" has two notions: matching a character and matching a position. \b matches a position.) For example, "er\b" can match the "er" in "never" but cannot match the "er" in "verb"; "\b1_" can match the "1_" in "1_23" but cannot match the "1_" in "21_3".
^               Matches the beginning of the input. If the Multiline property of the RegExp object is set, ^ also matches the position after "\n" or "\r".
$               Matches the end of the input. If the Multiline property of the RegExp object is set, $ also matches the position before "\n" or "\r".
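The table can be tried out with a short Python sketch. Note the assumption: the table describes a RegExp object with a Multiline property, while this sketch uses Python's re module, where the corresponding flag is re.MULTILINE and where . excludes only "\n" by default. The sample text is invented.

```python
import re

sample = "id_42 = 7\nname = Bob"

words = re.findall(r"\w+", sample)   # runs of word characters
digits = re.findall(r"\d", sample)    # individual digits
spaces = re.findall(r"\s", sample)    # spaces and the newline

# Without MULTILINE, ^ anchors only at the start of the whole string.
first_only = re.findall(r"^\w+", sample)
# With MULTILINE, ^ and $ also anchor at every line break.
line_starts = re.findall(r"^\w+", sample, flags=re.MULTILINE)
line_ends = re.findall(r"\w+$", sample, flags=re.MULTILINE)

# \b matches a position, not a character.
in_never = re.search(r"er\b", "never") is not None
in_verb = re.search(r"er\b", "verb") is not None

print(words, digits, line_starts, line_ends, in_never, in_verb)
```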

4. Character escape

If you want to find the metacharacters themselves, such as . or *, there is a problem: you cannot specify them directly, because they will be interpreted with their special meanings. In that case, use \ to cancel the special meaning of these characters. Therefore, you should write \. and \*. Of course, to find \ itself, you must write \\.

For example: mayday\.net matches mayday.net, and C:\\Windows matches C:\Windows.
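The effect of escaping can be checked with a minimal Python sketch. The test strings are invented, and re.escape is a helper from Python's standard library that the article does not mention; it is shown only as a convenience.

```python
import re

# Unescaped, the dot matches any character, so "maydayXnet" slips through.
loose = re.search(r"mayday.net", "maydayXnet") is not None

# Escaped, the dot matches only a literal ".".
strict = re.search(r"mayday\.net", "maydayXnet") is not None
literal = re.search(r"mayday\.net", "visit mayday.net today").group(0)

# A literal backslash is written as \\ in the pattern.
path = re.search(r"C:\\Windows", r"C:\Windows\System32").group(0)

# re.escape produces the escaped form of a literal string.
escaped = re.escape("mayday.net")

print(loose, strict, literal, path, escaped)
```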

5. Repetition

You have already seen * as one way to match repetition. The following are all the quantifiers in regular expressions (codes that specify a count, sometimes translated as "qualifiers"):

Metacharacter   Description
*               Matches the preceding sub-expression any number of times. For example, "zo*" can match "z" as well as "zo" and "zoo". * is equivalent to {0,}.
+               Matches the preceding sub-expression one or more times. For example, "zo+" can match "zo" and "zoo", but not "z". + is equivalent to {1,}.
?               Matches the preceding sub-expression zero or one time. For example, "do(es)?" can match "do" or "does". ? is equivalent to {0,1}.
{n}             n is a non-negative integer. Matches exactly n times. For example, "o{2}" cannot match the "o" in "Bob", but it can match the two o's in "food".
{n,}            n is a non-negative integer. Matches at least n times. For example, "o{2,}" cannot match the "o" in "Bob", but it can match all the o's in "foooood". "o{1,}" is equivalent to "o+". "o{0,}" is equivalent to "o*".
{n,m}           Both m and n are non-negative integers, where n<=m. Matches at least n times and at most m times. For example, "o{1,3}" will match the first three o's in "fooooood" as one match, and the last three o's as another. "o{0,1}" is equivalent to "o?". Note that there can be no spaces between the comma and the two numbers.
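The examples in the table can be reproduced with a short Python sketch. The helper name full is hypothetical, and the long "foo...d" strings are built with string repetition so that the number of o's is explicit.

```python
import re

def full(pattern, text):
    """Return True when the whole text matches the pattern."""
    return re.fullmatch(pattern, text) is not None

star = [full(r"zo*", s) for s in ("z", "zo", "zoo")]
plus = [full(r"zo+", s) for s in ("z", "zo", "zoo")]
optional = [full(r"do(es)?", s) for s in ("do", "does", "doe")]

exactly_two = re.findall(r"o{2}", "Bob food")

five_os = "f" + "o" * 5 + "d"
at_least_two = re.findall(r"o{2,}", five_os)

six_os = "f" + "o" * 6 + "d"
one_to_three = re.findall(r"o{1,3}", six_os)

print(star, plus, optional)
print(exactly_two, at_least_two, one_to_three)
```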

6. Character classes

It is very simple to find digits, letters, or whitespace, because there are already metacharacters corresponding to these character sets. But what should you do if you want to match a character set that has no predefined metacharacter (such as the vowels a, e, i, o, u)?

It's very simple: just list the characters in square brackets. For example, [aeiou] matches any English vowel, and [.?!] matches a punctuation mark (. or ? or !).

We can also easily specify a range of characters. [0-9] means exactly the same as \d: a digit. Similarly, [a-z0-9A-Z_] is completely equivalent to \w (if you only consider English text).

Here is a more complex expression: \(?0\d{2}[) -]?\d{8}.

This expression can match phone numbers in several formats, such as (011)22884499, 011-22884499, or 01122884499. Note that a 10-digit number such as 0845652452 does not match, because the pattern needs 11 digits in total. Let's analyze it piece by piece: first comes an escaped parenthesis \(, which can appear 0 or 1 time (?), then a 0, followed by 2 digits (\d{2}), then one of ), -, or a space, which appears once or not at all (?), and finally 8 digits (\d{8}).
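The following Python sketch checks the pattern \(?0\d{2}[) -]?\d{8} against a handful of candidates. The candidate list and variable names are chosen for illustration; fullmatch is used so that the whole string has to fit the pattern.

```python
import re

PHONE = re.compile(r"\(?0\d{2}[) -]?\d{8}")

candidates = [
    "(011)22884499",
    "011-22884499",
    "011 22884499",
    "01122884499",
    "0845652452",     # only 10 digits, too short for this pattern
    "011)22884499",   # unbalanced parenthesis, still accepted
]
results = {s: PHONE.fullmatch(s) is not None for s in candidates}

for s, ok in results.items():
    print(s, ok)
```

The last candidate shows a limitation: because the opening and closing parentheses are optional independently of each other, the pattern also accepts strings with only one of them.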

7. Negation

Sometimes you need to find characters that do not belong to a character class that is easy to define. For example, if you want to find any character other than a digit, you need the negated forms:

Metacharacter   Description
\W              Matches any character that is not a letter, digit, underscore, or Chinese character
\S              Matches any character that is not a whitespace character
\D              Matches any non-digit character
\B              Matches a position that is not the beginning or end of a word
[^x]            Matches any character except x
[^aeiou]        Matches any character except the letters a, e, i, o, u

Examples:

\S+ matches a string that does not contain whitespace characters. <a[^>]+> matches a string enclosed in angle brackets that begins with a.
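Here is a short Python sketch of the negated forms. All sample strings are invented for illustration.

```python
import re

non_digits = re.findall(r"\D+", "tel: 555-1234")
tokens = re.findall(r"\S+", "split  on\twhitespace\n")
not_vowels = re.findall(r"[^aeiou]", "education")

# \B requires that the position is NOT a word boundary, so only the
# "me" in the middle of "comet" qualifies.
inner = re.findall(r"\Bme\B", "me mean home comet")

# The opening tag is matched; the closing </a> is not, because "<" is
# followed by "/" rather than "a".
anchors = re.findall(r"<a[^>]+>", '<p><a href="x.html">link</a></p>')

print(non_digits, tokens, not_vowels, inner, anchors)
```

Keep in mind that <a[^>]+> is a loose pattern: it would also accept other tags whose names start with a.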

8. Grouping

We have already seen how to repeat a single character (just add a quantifier directly after it). But what if you want to repeat several characters? You can use parentheses to specify a sub-expression (also called a group), and then specify the number of repetitions of that sub-expression or perform other operations on it.

(\d{1,3}\.){3}\d{1,3} is a simple IP address matching expression. To understand it, analyze it in the following order: \d{1,3} matches 1 to 3 digits; (\d{1,3}\.){3} matches 1 to 3 digits plus a period (this whole unit is the group), repeated 3 times; and finally \d{1,3} adds another 1 to 3 digits.

But it will also match impossible IP addresses such as 256.300.777.888. If arithmetic comparison were available, this problem could be solved easily, but regular expressions provide no mathematical functions, so you can only use lengthy grouping, alternation, and character classes to describe a correct IP address: ((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?).

The key to understanding this expression is understanding 2[0-4]\d|25[0-5]|[01]?\d\d?. I won't go into details here. You should be able to analyze its meaning.
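For readers who want to check their analysis, the following Python sketch compares the simple and the strict expression. The names SIMPLE and STRICT and the test addresses are chosen for illustration. One reading of the alternation is that 2[0-4]\d covers 200 to 249, 25[0-5] covers 250 to 255, and [01]?\d\d? covers 0 to 199.

```python
import re

SIMPLE = re.compile(r"(\d{1,3}\.){3}\d{1,3}")
STRICT = re.compile(
    r"((2[0-4]\d|25[0-5]|[01]?\d\d?)\.){3}(2[0-4]\d|25[0-5]|[01]?\d\d?)"
)

addresses = ["192.168.0.1", "255.255.255.255", "256.300.777.888", "1.2.3"]

# fullmatch requires the entire string to fit the pattern.
simple_ok = [SIMPLE.fullmatch(a) is not None for a in addresses]
strict_ok = [STRICT.fullmatch(a) is not None for a in addresses]

print(simple_ok)  # [True, True, True, False]
print(strict_ok)  # [True, True, False, False]
```

The strict expression still accepts leading zeros (for example 01.02.03.04), which may or may not be acceptable depending on the use case.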

9. Greed and laziness

When a regular expression contains quantifiers that accept repetition, the usual behavior is to match as many characters as possible (while still letting the whole expression match). Take the expression b.*c as an example: it matches the longest string starting with b and ending with c. If you use it to search babac, it matches the entire string babac. This is called greedy matching.

Sometimes we need lazy matching, that is, matching as few characters as possible. All the quantifiers given above can be converted into lazy mode by appending a question mark ?. Thus .*? means: match any number of repetitions, but use the fewest repetitions that still let the whole match succeed. Now take a look at a lazy example:

a.*?b matches the shortest string that starts with a and ends with b. Applied to aabab, it matches aab (first to third characters) and ab (fourth to fifth characters). The first match is aab rather than ab because a match that starts earlier takes priority over a shorter one that starts later.

Quantifier      Description
*?              Repeat any number of times, but as few times as possible
+?              Repeat 1 or more times, but as few times as possible
??              Repeat 0 or 1 times, but as few times as possible
{n,m}?          Repeat n to m times, but as few times as possible
{n,}?           Repeat n or more times, but as few times as possible
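Greedy and lazy matching can be compared directly in Python. This is a minimal sketch using the same strings as the text above; the variable names are arbitrary.

```python
import re

greedy = re.findall(r"a.*b", "aabab")
lazy = re.findall(r"a.*?b", "aabab")

greedy_bc = re.search(r"b.*c", "babac").group(0)
# Lazy matching still starts at the leftmost possible position, so on
# "babac" the lazy version also returns the whole string.
lazy_bc = re.search(r"b.*?c", "babac").group(0)

print(greedy)     # ['aabab']
print(lazy)       # ['aab', 'ab']
print(greedy_bc)  # babac
print(lazy_bc)    # babac
```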

10. Processing options

Several options, such as ignoring case and handling multiple lines, were mentioned above. These options change the way regular expressions are processed. The following are the regular expression options commonly used in .NET:

Option                    Description
IgnoreCase                Matching is not case sensitive.
Multiline                 Changes the meaning of ^ and $ so that they match at the beginning and end of every line, not only at the beginning and end of the whole string. (In this mode, the exact meaning of $ is: match the position before \n and the position before the end of the string.)
Singleline                Changes the meaning of . so that it matches every character (including the newline \n).
IgnorePatternWhitespace   Ignores unescaped whitespace in the pattern and enables comments marked with #.
ExplicitCapture           Only captures groups that have been explicitly named.

A frequently asked question is: can multiline mode and single-line mode only be used one at a time?

The answer is: no. There is no relationship between these two options, except that they have similar names (which confuses people), so they can be combined freely.
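The options above are named after .NET. As an illustration only, the sketch below uses what I understand to be the closest Python counterparts: re.IGNORECASE, re.MULTILINE, re.DOTALL (for Singleline), and re.VERBOSE (for IgnorePatternWhitespace). Python's re module has no ExplicitCapture flag as far as I know, so it is left out. The sample text is invented.

```python
import re

text = "First line\nsecond LINE"

# IgnoreCase -> re.IGNORECASE
ignore_case = re.findall(r"line", text, flags=re.IGNORECASE)

# Multiline -> re.MULTILINE: ^ and $ also match at line breaks.
line_starts = re.findall(r"^\w+", text, flags=re.MULTILINE)

# Singleline -> re.DOTALL: . also matches the newline.
without_dotall = re.search(r"First.*LINE", text)
with_dotall = re.search(r"First.*LINE", text, flags=re.DOTALL)

# The two modes are independent and can be combined: ^ finds the start
# of the second line, and .* then runs across the following newline.
combined = re.search(
    r"^second.*", "First\nsecond\nthird", flags=re.MULTILINE | re.DOTALL
)

# IgnorePatternWhitespace -> re.VERBOSE: layout and comments in the pattern.
local_number = re.compile(
    r"""
    \d{3}   # first three digits
    -       # separator
    \d{4}   # last four digits
    """,
    flags=re.VERBOSE,
)

print(ignore_case, line_starts, combined.group(0))
```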

11. Tips

There is much more to regular expressions. This article only covers the common parts. Readers who want to learn more can consult Microsoft's regular expression documentation:

https://docs.microsoft.com/zh-cn/dotnet/standard/base-types/regular-expressions?redirectedfrom=MSDN

The regular expression syntax support is shown in the figure below: