This module provides regular expression matching operations similar to those found in Perl.

Both patterns and strings to be searched can be Unicode strings as well as 8-bit strings. However, Unicode strings and 8-bit strings cannot be mixed: that is, you cannot match a Unicode string with a byte pattern or vice-versa; similarly, when asking for a substitution, the replacement string must be of the same type as both the pattern and the search string.

Regular expressions use the backslash character ('\') to indicate special forms or to allow special characters to be used without invoking their special meaning. This collides with Python’s usage of the same character for the same purpose in string literals; for example, to match a literal backslash, one might have to write '\\\\' as the pattern string, because the regular expression must be \\, and each backslash must be expressed as \\ inside a regular Python string literal.

The solution is to use Python’s raw string notation for regular expression patterns; backslashes are not handled in any special way in a string literal prefixed with 'r'. So r"\n" is a two-character string containing '\' and 'n', while "\n" is a one-character string containing a newline. Usually patterns will be expressed in Python code using this raw string notation.
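
As a small illustration of the difference, the sketch below compares the two spellings and matches a literal backslash both ways. The variable names and the sample path are ours, chosen only for the example.

```python
import re

# A raw string keeps the backslash; a regular literal turns \n into a newline.
raw_len = len(r"\n")       # backslash followed by 'n'
cooked_len = len("\n")     # a single newline character

# Two equivalent spellings of a pattern that matches one literal backslash.
plain_pattern = '\\\\'     # four characters in source, two in the string
raw_pattern = r'\\'        # two characters in source, two in the string

path = 'C:\\temp'          # the seven-character string C:\temp
m_plain = re.search(plain_pattern, path)
m_raw = re.search(raw_pattern, path)
```

Both patterns are the same two-character string, so both searches find the backslash at the same position.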

It is important to note that most regular expression operations are available as module-level functions and methods on compiled regular expressions. The functions are shortcuts that don’t require you to compile a regex object first, but miss some fine-tuning parameters.

6.2.1. Regular Expression Syntax

A regular expression (or RE) specifies a set of strings that matches it; the functions in this module let you check if a particular string matches a given regular expression (or if a given regular expression matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular expressions; if A and B are both regular expressions, then AB is also a regular expression. In general, if a string p matches A and another string q matches B, the string pq will match AB. This holds unless A or B contain low precedence operations; boundary conditions between A and B; or have numbered group references. Thus, complex expressions can easily be constructed from simpler primitive expressions like the ones described here. For details of the theory and implementation of regular expressions, consult the Friedl book referenced above, or almost any textbook about compiler construction.

A brief explanation of the format of regular expressions follows. For further information and a gentler presentation, consult the Regular Expression HOWTO.

Regular expressions can contain both special and ordinary characters. Most ordinary characters, like 'A', 'a', or '0', are the simplest regular expressions; they simply match themselves. You can concatenate ordinary characters, so last matches the string 'last'. (In the rest of this section, we’ll write RE’s in this special style, usually without quotes, and strings to be matched 'in single quotes'.)

Some characters, like '|' or '(', are special. Special characters either stand for classes of ordinary characters, or affect how the regular expressions around them are interpreted. Regular expression pattern strings may not contain null bytes, but can specify the null byte using a \number notation such as '\x00'.

The special characters are:

'.'

(Dot.) In the default mode, this matches any character except a newline. If the DOTALL flag has been specified, this matches any character including a newline.

'^'

(Caret.) Matches the start of the string, and in MULTILINE mode also matches immediately after each newline.

'$'

Matches the end of the string or just before the newline at the end of the string, and in MULTILINE mode also matches before a newline. foo matches both ‘foo’ and ‘foobar’, while the regular expression foo$ matches only ‘foo’. More interestingly, searching for foo.$ in 'foo1\nfoo2\n' matches ‘foo2’ normally, but ‘foo1’ in MULTILINE mode; searching for a single $ in 'foo\n' will find two (empty) matches: one just before the newline, and one at the end of the string.

'*'

Causes the resulting RE to match 0 or more repetitions of the preceding RE, as many repetitions as are possible. ab* will match ‘a’, ‘ab’, or ‘a’ followed by any number of ‘b’s.

'+'

Causes the resulting RE to match 1 or more repetitions of the preceding RE. ab+ will match ‘a’ followed by any non-zero number of ‘b’s; it will not match just ‘a’.

'?'

Causes the resulting RE to match 0 or 1 repetitions of the preceding RE. ab? will match either ‘a’ or ‘ab’.

*?, +?, ??

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn’t desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.

{m}

Specifies that exactly m copies of the previous RE should be matched; fewer matches cause the entire RE not to match. For example, a{6} will match exactly six 'a' characters, but not five.

{m,n}

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as many repetitions as possible. For example, a{3,5} will match from 3 to 5 'a' characters. Omitting m specifies a lower bound of zero, and omitting n specifies an infinite upper bound. As an example, a{4,}b will match aaaab or a thousand 'a' characters followed by a b, but not aaab. The comma may not be omitted or the modifier would be confused with the previously described form.

{m,n}?

Causes the resulting RE to match from m to n repetitions of the preceding RE, attempting to match as few repetitions as possible. This is the non-greedy version of the previous qualifier. For example, on the 6-character string 'aaaaaa', a{3,5} will match 5 'a' characters, while a{3,5}? will only match 3 characters.

'\'

Either escapes special characters (permitting you to match characters like '*', '?', and so forth), or signals a special sequence; special sequences are discussed below.

If you’re not using a raw string to express the pattern, remember that Python also uses the backslash as an escape sequence in string literals; if the escape sequence isn’t recognized by Python’s parser, the backslash and subsequent character are included in the resulting string. However, if Python would recognize the resulting sequence, the backslash should be repeated twice. This is complicated and hard to understand, so it’s highly recommended that you use raw strings for all but the simplest expressions.

[]

Used to indicate a set of characters. In a set:

Characters can be listed individually, e.g. [amk] will match 'a', 'm', or 'k'.
Ranges of characters can be indicated by giving two characters and separating them by a '-', for example [a-z] will match any lowercase ASCII letter, [0-5][0-9] will match all the two-digit numbers from 00 to 59, and [0-9A-Fa-f] will match any hexadecimal digit. If - is escaped (e.g. [a\-z]) or if it’s placed as the first or last character (e.g. [a-]), it will match a literal '-'.
Special characters lose their special meaning inside sets. For example, [(+*)] will match any of the literal characters '(', '+', '*', or ')'.
Character classes such as \w or \S (defined below) are also accepted inside a set, although the characters they match depend on whether ASCII or LOCALE mode is in force.
Characters that are not within a range can be matched by complementing the set. If the first character of the set is '^', all the characters that are not in the set will be matched. For example, [^5] will match any character except '5', and [^^] will match any character except '^'. ^ has no special meaning if it’s not the first character in the set.
To match a literal ']' inside a set, precede it with a backslash, or place it at the beginning of the set. For example, both [()[\]{}] and []()[{}] will match a parenthesis.
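
The set rules can be tried out with a few short patterns. This is only a sketch; the compiled-pattern names and the sample strings are made up for the illustration.

```python
import re

hex_digit = re.compile(r'[0-9A-Fa-f]')   # ranges inside a set
minutes = re.compile(r'[0-5][0-9]')      # two sets side by side: 00 to 59
not_five = re.compile(r'[^5]')           # complemented set
brackets = re.compile(r'[()[\]{}]')      # ']' escaped with a backslash
dash = re.compile(r'[a\-z]')             # escaped '-' is a literal hyphen

hex_found = hex_digit.findall('x7Fg')
valid_minute = minutes.fullmatch('59')
invalid_minute = minutes.fullmatch('60')
non_fives = not_five.findall('1525')
bracket_chars = brackets.findall('a[b](c)')
dash_chars = dash.findall('a-z b')
```

Here `fullmatch()` is used so that the whole candidate string has to fit the pattern, which makes the 00 to 59 limit visible.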

'|'

A|B, where A and B can be arbitrary REs, creates a regular expression that will match either A or B. An arbitrary number of REs can be separated by the '|' in this way. This can be used inside groups (see below) as well. As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy. To match a literal '|', use \|, or enclose it inside a character class, as in [|].

(...)

Matches whatever regular expression is inside the parentheses, and indicates the start and end of a group; the contents of a group can be retrieved after a match has been performed, and can be matched later in the string with the \number special sequence, described below. To match the literals '(' or ')', use \( or \), or enclose them inside a character class: [(] [)].

(?...)

This is an extension notation (a '?' following a '(' is not meaningful otherwise). The first character after the '?' determines what the meaning and further syntax of the construct is. Extensions usually do not create a new group; (?P<name>...) is the only exception to this rule. Following are the currently supported extensions.

(?aiLmsux)

(One or more letters from the set 'a', 'i', 'L', 'm', 's', 'u', 'x'.) The group matches the empty string; the letters set the corresponding flags: re.A (ASCII-only matching), re.I (ignore case), re.L (locale dependent), re.M (multi-line), re.S (dot matches all), and re.X (verbose), for the entire regular expression. (The flags are described in Module Contents.) This is useful if you wish to include the flags as part of the regular expression, instead of passing a flag argument to the re.compile() function.

Note that the (?x) flag changes how the expression is parsed. It should be used first in the expression string, or after one or more whitespace characters. If there are non-whitespace characters before the flag, the results are undefined.

(?:...)

A non-capturing version of regular parentheses. Matches whatever regular expression is inside the parentheses, but the substring matched by the group cannot be retrieved after performing a match or referenced later in the pattern.

(?P<name>...)

Similar to regular parentheses, but the substring matched by the group is accessible via the symbolic group name name. Group names must be valid Python identifiers, and each group name must be defined only once within a regular expression. A symbolic group is also a numbered group, just as if the group were not named.

Named groups can be referenced in three contexts. If the pattern is (?P<quote>['"]).*?(?P=quote) (i.e. matching a string quoted with either single or double quotes):

Context of reference to group “quote”	Ways to reference it
in the same pattern itself	(?P=quote) (as shown), \1
when processing match object m	m.group('quote'), m.end('quote') (etc.)
in a string passed to the repl argument of re.sub()	\g<quote>, \g<1>, \1
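
The three contexts can be exercised in a few lines. The sketch below is illustrative only: the sample strings and variable names are assumptions made for the example.

```python
import re

# Context 1: referencing the group inside the pattern, by name or by number.
quoted = re.compile(r'''(?P<quote>['"]).*?(?P=quote)''')
numbered = re.compile(r'''(?P<quote>['"]).*?\1''')

# Context 2: reading the group from a match object.
m = quoted.search('say "hi" now')
whole = m.group(0)
delim = m.group('quote')
delim_end = m.end('quote')
same_by_number = numbered.search('say "hi" now').group(0)

# Context 3: referencing the group from a replacement string.
source = "a 'b' c"
by_name = quoted.sub(r'[\g<quote>]', source)
by_g_number = quoted.sub(r'[\g<1>]', source)
by_backslash = quoted.sub(r'[\1]', source)
```

All three replacement spellings produce the same result, because a named group is also a numbered group.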

(?P=name)

A backreference to a named group; it matches whatever text was matched by the earlier group named name.

(?#...)

A comment; the contents of the parentheses are simply ignored.

(?=...)

Matches if ... matches next, but doesn’t consume any of the string. This is called a lookahead assertion. For example, Isaac (?=Asimov) will match 'Isaac ' only if it’s followed by 'Asimov'.

(?!...)

Matches if ... doesn’t match next. This is a negative lookahead assertion. For example, Isaac (?!Asimov) will match 'Isaac ' only if it’s not followed by 'Asimov'.

(?<=...)

Matches if the current position in the string is preceded by a match for ... that ends at the current position. This is called a positive lookbehind assertion. (?<=abc)def will find a match in abcdef, since the lookbehind will back up 3 characters and check if the contained pattern matches. The contained pattern must only match strings of some fixed length, meaning that abc or a|b are allowed, but a* and a{3,4} are not. Note that patterns which start with positive lookbehind assertions will not match at the beginning of the string being searched; you will most likely want to use the search() function rather than the match() function:

>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'

This example looks for a word following a hyphen:

>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'

(?<!...)

Matches if the current position in the string is not preceded by a match for .... This is called a negative lookbehind assertion. Similar to positive lookbehind assertions, the contained pattern must only match strings of some fixed length. Patterns which start with negative lookbehind assertions may match at the beginning of the string being searched.

(?(id/name)yes-pattern|no-pattern)

Will try to match with yes-pattern if the group with given id or name exists, and with no-pattern if it doesn’t. no-pattern is optional and can be omitted. For example, (<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$) is a poor email matching pattern, which will match with '<user@host.com>' as well as 'user@host.com', but not with '<user@host.com' nor 'user@host.com>'.
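
The conditional pattern can be checked against balanced and unbalanced inputs. In this sketch the address `user@host.com` and the names `address`, `candidates` and `results` are placeholders chosen for the example.

```python
import re

# Group 1 is the optional '<'. The conditional (?(1)>|$) requires a closing
# '>' when group 1 took part in the match, and the end of string otherwise.
address = re.compile(r'(<)?(\w+@\w+(?:\.\w+)+)(?(1)>|$)')

candidates = [
    '<user@host.com>',   # balanced brackets
    'user@host.com',     # no brackets
    '<user@host.com',    # opening bracket only
    'user@host.com>',    # closing bracket only
]
results = {text: bool(address.match(text)) for text in candidates}
```

Only the first two candidates match; the unbalanced ones are rejected because the conditional branch cannot be satisfied.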

The special sequences consist of '\' and a character from the list below. If the ordinary character is not on the list, then the resulting RE will match the second character. For example, \$ matches the character '$'.

\number

Matches the contents of the group of the same number. Groups are numbered starting from 1. For example, (.+) \1 matches 'the the' or '55 55', but not 'thethe' (note the space after the group). This special sequence can only be used to match one of the first 99 groups. If the first digit of number is 0, or number is 3 octal digits long, it will not be interpreted as a group match, but as the character with octal value number. Inside the '[' and ']' of a character class, all numeric escapes are treated as characters.

\A

Matches only at the start of the string.

\b

Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of Unicode alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore Unicode character. Note that formally, \b is defined as the boundary between a \w and a \W character (or vice versa), or between \w and the beginning/end of the string. This means that r'\bfoo\b' matches 'foo', 'foo.', '(foo)', 'bar foo baz' but not 'foobar' or 'foo3'.

By default Unicode alphanumerics are the ones used, but this can be changed by using the ASCII flag. Inside a character range, \b represents the backspace character, for compatibility with Python’s string literals.
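
The word-boundary cases listed above can be confirmed directly. This is a minimal sketch; `samples`, `hits` and `backspace` are names introduced for the example.

```python
import re

word = re.compile(r'\bfoo\b')
samples = ['foo', 'foo.', '(foo)', 'bar foo baz', 'foobar', 'foo3']

# Keep the samples in which 'foo' appears as a whole word.
hits = [s for s in samples if word.search(s)]

# Inside a character class, \b stands for the backspace character (0x08).
backspace = re.fullmatch(r'[\b]', '\x08')
```

'foobar' and 'foo3' are excluded because letters and digits are both word characters, so there is no boundary after 'foo'.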

\B

Matches the empty string, but only when it is not at the beginning or end of a word. This means that r'py\B' matches 'python', 'py3', 'py2', but not 'py', 'py.', or 'py!'. \B is just the opposite of \b, so word characters are Unicode alphanumerics or the underscore, although this can be changed by using the ASCII flag.

\d

For Unicode (str) patterns:

Matches any Unicode decimal digit (that is, any character in Unicode character category [Nd]). This includes [0-9], and also many other digit characters. If the ASCII flag is used only [0-9] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [0-9] may be a better choice).

For 8-bit (bytes) patterns:

Matches any decimal digit; this is equivalent to [0-9].

\D

Matches any character which is not a Unicode decimal digit. This is the opposite of \d. If the ASCII flag is used this becomes the equivalent of [^0-9] (but the flag affects the entire regular expression, so in such cases using an explicit [^0-9] may be a better choice).

\s

For Unicode (str) patterns:

Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [ \t\n\r\f\v] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].

\S

Matches any character which is not a Unicode whitespace character. This is the opposite of \s. If the ASCII flag is used this becomes the equivalent of [^ \t\n\r\f\v] (but the flag affects the entire regular expression, so in such cases using an explicit [^ \t\n\r\f\v] may be a better choice).

\w

For Unicode (str) patterns:

Matches Unicode word characters; this includes most characters that can be part of a word in any language, as well as numbers and the underscore. If the ASCII flag is used, only [a-zA-Z0-9_] is matched (but the flag affects the entire regular expression, so in such cases using an explicit [a-zA-Z0-9_] may be a better choice).

For 8-bit (bytes) patterns:

Matches characters considered alphanumeric in the ASCII character set; this is equivalent to [a-zA-Z0-9_].

\W

Matches any character which is not a Unicode word character. This is the opposite of \w. If the ASCII flag is used this becomes the equivalent of [^a-zA-Z0-9_] (but the flag affects the entire regular expression, so in such cases using an explicit [^a-zA-Z0-9_] may be a better choice).

\Z

Matches only at the end of the string.
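
To make the Unicode versus ASCII distinction for \d and \w concrete, here is a small sketch. The sample text is an assumption: it uses U+0662 (ARABIC-INDIC DIGIT TWO, category Nd) and U+00EF / U+00E9 (accented Latin letters), written as escapes to keep the source ASCII-only.

```python
import re

digits_text = '1' + '\u0662' + '3'          # ASCII 1, Arabic-Indic 2, ASCII 3
words_text = 'na\u00efve caf\u00e9'         # "naive cafe" with accented letters

unicode_digits = re.findall(r'\d', digits_text)
ascii_digits = re.findall(r'\d', digits_text, re.ASCII)

unicode_words = re.findall(r'\w+', words_text)
ascii_words = re.findall(r'\w+', words_text, re.ASCII)

# Bytes patterns always use the ASCII definitions.
bytes_digits = re.findall(rb'\d', b'1a3')
```

With re.ASCII the accented letters stop counting as word characters, so the words are split around them.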

Most of the standard escapes supported by Python string literals are also accepted by the regular expression parser:

\a      \b      \f      \n
\r      \t      \u      \U
\v      \x      \\

(Note that \b is used to represent word boundaries, and means “backspace” only inside character classes.)

The '\u' and '\U' escape sequences are only recognized in Unicode patterns. In bytes patterns they are not treated specially.

Octal escapes are included in a limited form. If the first digit is a 0, or if there are three octal digits, it is considered an octal escape. Otherwise, it is a group reference. As for string literals, octal escapes are always at most three digits in length.

Changed in version 3.3: The '\u' and '\U' escape sequences have been added.

6.2.2. Module Contents

The module defines several functions, constants, and an exception. Some of the functions are simplified versions of the full featured methods for compiled regular expressions. Most non-trivial applications always use the compiled form.

re.compile(pattern, flags=0)

Compile a regular expression pattern into a regular expression object, which can be used for matching using its match() and search() methods, described below.

The expression’s behaviour can be modified by specifying a flags value. Values can be any of the following variables, combined using bitwise OR (the | operator).

The sequence

prog = re.compile(pattern)
result = prog.match(string)

is equivalent to

result = re.match(pattern, string)

but using re.compile() and saving the resulting regular expression object for reuse is more efficient when the expression will be used several times in a single program.

Note

The compiled versions of the most recent patterns passed to re.match(), re.search() or re.compile() are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.
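
A runnable version of the equivalence might look like the following sketch. The pattern, the sample strings and the names `compiled_result`, `module_result` and `counts` are assumptions made for the example.

```python
import re

pattern = r'\d+'
string = '2024 was a year'

# Compile once, then call methods on the resulting object.
prog = re.compile(pattern)
compiled_result = prog.match(string)

# The module-level shortcut gives the same match.
module_result = re.match(pattern, string)

# Reusing the compiled object across many inputs.
lines = ('a1b22', 'none', '3 4 5')
counts = [len(prog.findall(line)) for line in lines]
```

The compiled form is mostly a convenience when the same expression is applied in a loop, as in the last line.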

re.A re.ASCII

Make \w, \W, \b, \B, \d, \D, \s and \S perform ASCII-only matching instead of full Unicode matching. This is only meaningful for Unicode patterns, and is ignored for byte patterns.

Note that for backward compatibility, the re.U flag still exists (as well as its synonym re.UNICODE and its embedded counterpart (?u)), but these are redundant in Python 3 since matches are Unicode by default for strings (and Unicode matching isn’t allowed for bytes).

re.DEBUG

Display debug information about the compiled expression.

re.I re.IGNORECASE

Perform case-insensitive matching; expressions like [A-Z] will match lowercase letters, too. This is not affected by the current locale and works for Unicode characters as expected.

re.L re.LOCALE

Make \w, \W, \b, \B, \s and \S dependent on the current locale. The use of this flag is discouraged as the locale mechanism is very unreliable, and it only handles one “culture” at a time anyway; you should use Unicode matching instead, which is the default in Python 3 for Unicode (str) patterns.

re.M re.MULTILINE

When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line (immediately following each newline); and the pattern character '$' matches at the end of the string and at the end of each line (immediately preceding each newline). By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.
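
The effect of MULTILINE on '^' and '$' can be seen with a couple of short searches. The sample strings below follow the ones used earlier in the syntax description; the variable names are ours.

```python
import re

text = 'foo1\nfoo2\n'

# Without the flag, '$' only matches at the very end (or before a final newline).
default_hit = re.search(r'foo.$', text).group(0)
# With the flag, '$' also matches before every newline.
multiline_hit = re.search(r'foo.$', text, re.M).group(0)

lines = 'one\ntwo\nthree'
first_only = re.findall(r'^\w+', lines)
every_line = re.findall(r'^\w+', lines, re.M)

# A bare '$' on 'foo\n' matches twice: before the newline and at the end.
dollar_positions = [m.start() for m in re.finditer(r'$', 'foo\n')]
```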

re.S re.DOTALL

Make the '.' special character match any character at all, including a newline; without this flag, '.' will match anything except a newline.
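
A brief sketch of the difference, using a two-line sample string of our own; the inline (?s) form is the embedded counterpart of the flag.

```python
import re

text = 'first\nsecond'

# '.' stops at the newline by default.
without_flag = re.match(r'.+', text).group(0)
# With DOTALL, '.' also consumes the newline.
with_flag = re.match(r'.+', text, re.S).group(0)
# The same flag written inside the pattern.
inline_flag = re.match(r'(?s).+', text).group(0)
```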

re.X re.VERBOSE

This flag allows you to write regular expressions that look nicer. Whitespace within the pattern is ignored, except when in a character class or preceded by an unescaped backslash, and, when a line contains a '#' neither in a character class nor preceded by an unescaped backslash, all characters from the leftmost such '#' through the end of the line are ignored.

That means that the two following regular expression objects that match a decimal number are functionally equal:

a = re.compile(r"""\d +  # the integral part
                   \.    # the decimal point
                   \d *  # some fractional digits""", re.X)
b = re.compile(r"\d+\.\d*")

re.search(pattern, string, flags=0)

Scan through string looking for a location where the regular expression pattern produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

re.match(pattern, string, flags=0)

If zero or more characters at the beginning of string match the regular expression pattern, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

Note that even in MULTILINE mode, re.match() will only match at the beginning of the string and not at the beginning of each line.

If you want to locate a match anywhere in string, use search() instead (see also search() vs. match()).
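
The points above can be illustrated side by side. This is a sketch with made-up sample strings; it also shows how a zero-length match differs from None.

```python
import re

text = 'X\nabc'

# match() is anchored at the start of the string, even with MULTILINE.
match_plain = re.match(r'abc', text)
match_multiline = re.match(r'^abc', text, re.M)

# search() scans forward, so '^' can match after the newline in MULTILINE mode.
search_multiline = re.search(r'^abc', text, re.M)

# A zero-length match is still a match object, not None.
zero_length = re.match(r'x*', 'abc')
no_match = re.match(r'x+', 'abc')
```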

re.split(pattern, string, maxsplit=0, flags=0)

Split string by the occurrences of pattern. If capturing parentheses are used in pattern, then the text of all groups in the pattern are also returned as part of the resulting list. If maxsplit is nonzero, at most maxsplit splits occur, and the remainder of the string is returned as the final element of the list.

>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
>>> re.split('[a-f]+', '0a3B9', flags=re.IGNORECASE)
['0', '3', '9']

If there are capturing groups in the separator and it matches at the start of the string, the result will start with an empty string. The same holds for the end of the string:

>>> re.split(r'(\W+)', '...words, words...')
['', '...', 'words', ', ', 'words', '...', '']

That way, separator components are always found at the same relative indices within the result list.

Note that split will never split a string on an empty pattern match. For example:

>>> re.split('x*', 'foo')
['foo']
>>> re.split("(?m)^$", "foo\n\nbar\n")
['foo\n\nbar\n']

Changed in version 3.1: Added the optional flags argument.

re.findall(pattern, string, flags=0)

Return all non-overlapping matches of pattern in string, as a list of strings. The string is scanned left-to-right, and matches are returned in the order found. If one or more groups are present in the pattern, return a list of groups; this will be a list of tuples if the pattern has more than one group. Empty matches are included in the result unless they touch the beginning of another match.
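
How the shape of the findall() result depends on the number of groups can be seen in this short sketch; the sample text and variable names are assumptions made for the example.

```python
import re

text = 'width=10, height=20'

# No groups: a list of the full matches.
no_group = re.findall(r'\w+=\d+', text)
# One group: a list of that group's text.
one_group = re.findall(r'\w+=(\d+)', text)
# Several groups: a list of tuples, one tuple per match.
two_groups = re.findall(r'(\w+)=(\d+)', text)
```

When the full match is needed together with the groups, a non-capturing group (?:...) or finditer() avoids the change of shape.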

re.finditer(pattern, string, flags=0)

Return an iterator yielding match objects over all non-overlapping matches for the RE pattern in string. The string is scanned left-to-right, and matches are returned in the order found. Empty matches are included in the result unless they touch the beginning of another match.
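
Because finditer() yields match objects, positions are available as well as the matched text. A minimal sketch, using a sample sentence of the kind used for adverb-finding examples:

```python
import re

text = 'He was carefully disguised but captured quickly by police.'

# Collect (start, end, word) for every word ending in 'ly'.
adverbs = [(m.start(), m.end(), m.group(0))
           for m in re.finditer(r'\w+ly', text)]
```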

re.sub(pattern, repl, string, count=0, flags=0)

Return the string obtained by replacing the leftmost non-overlapping occurrences of pattern in string by the replacement repl. If the pattern isn’t found, string is returned unchanged. repl can be a string or a function; if it is a string, any backslash escapes in it are processed. That is, \n is converted to a single newline character, \r is converted to a carriage return, and so forth. Unknown escapes such as \j are left alone. Backreferences, such as \6, are replaced with the substring matched by group 6 in the pattern. For example:

>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'

If repl is a function, it is called for every non-overlapping occurrence of pattern. The function takes a single match object argument, and returns the replacement string. For example:

>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'

The pattern may be a string or an RE object.

The optional argument count is the maximum number of pattern occurrences to be replaced; count must be a non-negative integer. If omitted or zero, all occurrences will be replaced. Empty matches for the pattern are replaced only when not adjacent to a previous match, so sub('x*', '-', 'abc') returns '-a-b-c-'.

In string-type repl arguments, in addition to the character escapes and backreferences described above, \g<name> will use the substring matched by the group named name, as defined by the (?P<name>...) syntax. \g<number> uses the corresponding group number; \g<2> is therefore equivalent to \2, but isn’t ambiguous in a replacement such as \g<2>0. \20 would be interpreted as a reference to group 20, not a reference to group 2 followed by the literal character '0'. The backreference \g<0> substitutes in the entire substring matched by the RE.
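
The \g<...> forms and the \20 ambiguity can be demonstrated with a two-group pattern. This is a sketch; the pattern, the sample text and the variable names are assumptions made for the example.

```python
import re

pattern = re.compile(r'(?P<year>\d{4})-(\d{2})')
text = '2024-07'

by_name = pattern.sub(r'\g<year>', text)      # named group
by_number = pattern.sub(r'\g<2>0', text)      # group 2, then a literal '0'
whole = pattern.sub(r'[\g<0>]', text)         # the entire match

# r'\20' is read as a reference to group 20, which this pattern lacks,
# so the substitution is rejected with re.error.
try:
    pattern.sub(r'\20', text)
    ambiguous_failed = False
except re.error:
    ambiguous_failed = True
```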

Changed in version 3.1: Added the optional flags argument.

re.findall(pattern, string, flags=0)¶

re.finditer(pattern, string, flags=0)¶

re.sub(pattern, repl, string, count=0, flags=0)¶

The pattern may be a string or an RE object.

re.subn(pattern, repl, string, count=0, flags=0)¶

Perform the same operation as sub(), but return a tuple (new_string, number_of_subs_made).

re.escape(string)¶

Changed in version 3.3: The '_' character is no longer escaped.

regex.search(string[, pos[, endpos]])¶

Scan through string looking for a location where this regular expression produces a match, and return a corresponding match object. Return None if no position in the string matches the pattern; note that this is different from finding a zero-length match at some point in the string.

The optional second parameter pos gives an index in the string where the search is to start; it defaults to 0. This is not completely equivalent to slicing the string; the '^' pattern character matches at the real beginning of the string and at positions just after a newline, but not necessarily at the index where the search is to start.

The optional parameter endpos limits how far the string will be searched; it will be as if the string is endpos characters long, so only the characters from pos to endpos - 1 will be searched for a match. If endpos is less than pos, no match will be found; otherwise, if rx is a compiled regular expression object, rx.search(string, 0, 50) is equivalent to rx.search(string[:50], 0).

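>>> pattern = re.compile("d")
>>> pattern.search("dog").start()     # Match at index 0
0
>>> pattern.search("dog", 1) is None  # No match; search doesn't include the "d"
True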
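The difference between passing pos and slicing the string can be sketched as follows. The pattern and text are illustrative assumptions chosen to make the '^' behaviour visible.

```python
import re

rx = re.compile(r'^b')
text = 'abc'

# Slicing makes 'b' the real start of the string, so '^' matches.
sliced = rx.search(text[1:])

# With pos=1 the search starts at index 1, but '^' still refers to
# the real beginning of the string, so nothing matches.
with_pos = rx.search(text, 1)

# endpos=2 behaves as if the string were only 2 characters long,
# so the 'c' at index 2 is never examined.
limited = re.compile('c').search(text, 0, 2)

print(sliced is not None, with_pos, limited)
```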

regex.match(string[, pos[, endpos]])¶

If zero or more characters at the beginning of string match this regular expression, return a corresponding match object. Return None if the string does not match the pattern; note that this is different from a zero-length match.

The optional pos and endpos parameters have the same meaning as for the search() method.

>>> pattern = re.compile("o")
>>> pattern.match("dog") is None    # No match as "o" is not at the start of "dog".
True
>>> pattern.match("dog", 1).span()  # Match as "o" is the 2nd character of "dog".
(1, 2)

regex.split(string, maxsplit=0)¶

Identical to the split() function, using the compiled pattern.

regex.findall(string[, pos[, endpos]])¶

Similar to the findall() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().

regex.finditer(string[, pos[, endpos]])¶

Similar to the finditer() function, using the compiled pattern, but also accepts optional pos and endpos parameters that limit the search region like for match().
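As a rough illustration of how pos and endpos restrict the region scanned by the compiled-pattern methods, consider the sketch below. The text and the index values are assumptions picked for the example.

```python
import re

word = re.compile(r'\w+')
text = 'one two three four'

# Only characters from index 4 up to (but not including) index 13
# are searched, so 'one' and 'four' are never seen.
found = word.findall(text, 4, 13)
spans = [m.span() for m in word.finditer(text, 4, 13)]

print(found, spans)
```

Note that the reported spans are still indices into the full string, not into the restricted region.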

regex.sub(repl, string, count=0)¶

Identical to the sub() function, using the compiled pattern.

regex.subn(repl, string, count=0)¶

Identical to the subn() function, using the compiled pattern.
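A short sketch of the two methods side by side, assuming a compiled pattern named digits (a name chosen here for illustration):

```python
import re

digits = re.compile(r'\d+')

# sub() returns only the new string; count caps the replacements.
replaced = digits.sub('N', 'a1 b22 c333', count=2)

# subn() returns the new string together with the number of
# substitutions that were made.
result, n = digits.subn('N', 'a1 b22 c333')

print(replaced, result, n)
```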

regex.flags¶

The regex matching flags. This is a combination of the flags given to compile(), any (?...) inline flags in the pattern, and implicit flags such as UNICODE if the pattern is a Unicode string.

regex.groups¶

The number of capturing groups in the pattern.

regex.groupindex¶

A dictionary mapping any symbolic group names defined by (?P<id>) to group numbers. The dictionary is empty if no symbolic groups were used in the pattern.

regex.pattern¶

The pattern string from which the RE object was compiled.
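The attributes above can be inspected directly on a compiled pattern. This is a small sketch; the pattern and the group names year and month are hypothetical.

```python
import re

rx = re.compile(r'(?P<year>\d{4})-(?P<month>\d{2})', re.IGNORECASE)

group_count = rx.groups              # number of capturing groups
name_to_index = dict(rx.groupindex)  # group name -> group number
source = rx.pattern                  # the original pattern string

# flags combines the flags passed to compile() with implicit ones,
# such as UNICODE for a str pattern.
has_ignorecase = bool(rx.flags & re.IGNORECASE)
has_unicode = bool(rx.flags & re.UNICODE)

print(group_count, name_to_index, source, has_ignorecase, has_unicode)
```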

6.2.4. Match Objects¶

Match objects always have a boolean value of True. Since match() and search() return None when there is no match, you can test whether there was a match with a simple if statement:

match = re.search(pattern, string)
if match:
    process(match)

Match objects support the following methods and attributes:

match.expand(template)¶

Return the string obtained by doing backslash substitution on the template string template, as done by the sub() method. Escapes such as \n are converted to the appropriate characters, and numeric backreferences (\1, \2) and named backreferences (\g<1>, \g<name>) are replaced by the contents of the corresponding group.
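A minimal sketch of expand(), assuming a pattern with one named group (first) and one numbered group; both the pattern and the template are illustrative.

```python
import re

m = re.match(r'(?P<first>\w+) (\w+)', 'Isaac Newton')

# \2 is a numeric backreference, \g<first> a named one, and the
# trailing \n in the template is converted to a real newline.
expanded = m.expand(r'\2, \g<first>\n')

print(repr(expanded))
```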

match.group([group1, ...])¶

Returns one or more subgroups of the match. If there is a single argument, the result is a single string; if there are multiple arguments, the result is a tuple with one item per argument. Without arguments, group1 defaults to zero (the whole match is returned). If a groupN argument is zero, the corresponding return value is the entire matching string; if it is in the inclusive range [1..99], it is the string matching the corresponding parenthesized group. If a group number is negative or larger than the number of groups defined in the pattern, an IndexError exception is raised. If a group is contained in a part of the pattern that did not match, the corresponding result is None. If a group is contained in a part of the pattern that matched multiple times, the last match is returned.

>>> m = re.match(r"(\w+) (\w+)", "Isaac Newton, physicist")
>>> m.group(0)       # The entire match
'Isaac Newton'
>>> m.group(1, 2)    # Multiple arguments give us a tuple.
('Isaac', 'Newton')

If the regular expression uses the (?P<name>...) syntax, the groupN arguments may also be strings identifying groups by their group name. If a string argument is not used as a group name in the pattern, an IndexError exception is raised.

A moderately complicated example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.group('first_name')
'Malcolm'
>>> m.group('last_name')
'Reynolds'

Named groups can also be referred to by their index:

>>> m.group(1)
'Malcolm'
>>> m.group(2)
'Reynolds'

If a group matches multiple times, only the last match is accessible:

>>> m = re.match(r"(..)+", "a1b2c3")  # Matches 3 times.
>>> m.group(1)                        # Returns only the last match.
'c3'

match.groups(default=None)¶

Return a tuple containing all the subgroups of the match, from 1 up to however many groups are in the pattern. The default argument is used for groups that did not participate in the match; it defaults to None.

For example:

>>> m = re.match(r"(\d+)\.(\d+)", "24.1632")
>>> m.groups()
('24', '1632')

If we make the decimal place and everything after it optional, not all groups might participate in the match. These groups will default to None unless the default argument is given:

>>> m = re.match(r"(\d+)\.?(\d+)?", "24")
>>> m.groups()      # Second group defaults to None.
('24', None)
>>> m.groups('0')   # Now, the second group defaults to '0'.
('24', '0')

match.groupdict(default=None)¶

Return a dictionary containing all the named subgroups of the match, keyed by the subgroup name. The default argument is used for groups that did not participate in the match; it defaults to None. For example:

>>> m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
>>> m.groupdict()
{'first_name': 'Malcolm', 'last_name': 'Reynolds'}

match.start([group])¶
match.end([group])¶

Return the indices of the start and end of the substring matched by group; group defaults to zero (meaning the whole matched substring). Return -1 if group exists but did not contribute to the match. For a match object m, and a group g that did contribute to the match, the substring matched by group g (equivalent to m.group(g)) is

m.string[m.start(g):m.end(g)]

Note that m.start(group) will equal m.end(group) if group matched a null string. For example, after m = re.search('b(c?)', 'cba'), m.start(0) is 1, m.end(0) is 2, m.start(1) and m.end(1) are both 2, and m.start(2) raises an IndexError exception.
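The 'b(c?)' case described above can be checked with a few lines. This is a sketch that simply restates the values from the text; the variable names are assumptions.

```python
import re

m = re.search('b(c?)', 'cba')

whole = (m.start(0), m.end(0))    # the 'b' at index 1
group1 = (m.start(1), m.end(1))   # empty match, so start == end

# Slicing the original string with start/end gives the same text
# as group().
same_text = m.string[m.start(1):m.end(1)] == m.group(1)

# Asking for a group the pattern does not define raises IndexError.
try:
    m.start(2)
    raised = False
except IndexError:
    raised = True

print(whole, group1, same_text, raised)
```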

An example that will remove remove_this from email addresses:

>>> email = "tony@tiremove_thisger.net"
>>> m = re.search("remove_this", email)
>>> email[:m.start()] + email[m.end():]
'tony@tiger.net'

match.span([group])¶

For a match m, return the 2-tuple (m.start(group), m.end(group)). Note that if group did not contribute to the match, this is (-1, -1). group defaults to zero, the entire match.
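A brief sketch of span() for a group that takes part in the match and one that does not. The pattern with an optional fractional part is an illustrative assumption.

```python
import re

m = re.match(r'(\d+)(?:\.(\d+))?', '24')

whole = m.span()      # group defaults to 0, the entire match
integer = m.span(1)   # group 1 matched '24'
fraction = m.span(2)  # group 2 did not contribute to the match

print(whole, integer, fraction)
```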

match.pos¶

The value of pos which was passed to the search() or match() method of a regex object. This is the index into the string at which the RE engine started looking for a match.

match.endpos¶

The value of endpos which was passed to the search() or match() method of a regex object. This is the index into the string beyond which the RE engine will not go.

match.lastindex¶

The integer index of the last matched capturing group, or None if no group was matched at all. For example, the expressions (a)b, ((a)(b)), and ((ab)) will have lastindex == 1 if applied to the string 'ab', while the expression (a)(b) will have lastindex == 2, if applied to the same string.

match.lastgroup¶

The name of the last matched capturing group, or None if the group didn't have a name, or if no group was matched at all.
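The lastindex values listed above, together with lastgroup, can be reproduced with a short sketch. The group name word is a hypothetical example.

```python
import re

# The four expressions from the text, all applied to 'ab'.
patterns = ['(a)b', '((a)(b))', '((ab))', '(a)(b)']
last_indexes = {p: re.match(p, 'ab').lastindex for p in patterns}

# lastgroup is the name of that group, or None if it has no name.
named = re.match(r'(?P<word>\w+)', 'ab').lastgroup
unnamed = re.match(r'(\w+)', 'ab').lastgroup

# With no capturing groups at all, lastindex is None.
no_groups = re.match(r'\w+', 'ab').lastindex

print(last_indexes, named, unnamed, no_groups)
```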

match.re¶

The regular expression object whose match() or search() method produced this match instance.

match.string¶

The string passed to match() or search().
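The bookkeeping attributes pos, endpos, re and string can be read back from any match object. A small sketch, with an arbitrary pattern and search window chosen for illustration:

```python
import re

rx = re.compile(r'\d+')
text = 'abc 123 def'

# Search only the region from index 2 up to index 9.
m = rx.search(text, 2, 9)

window = (m.pos, m.endpos)    # the values passed to search()
same_regex = m.re is rx       # the pattern object that produced m
same_string = m.string is text
matched = m.group()

print(window, same_regex, same_string, matched)
```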

6.2.5. Regular Expression Examples¶

6.2.5.1. Checking for a Pair¶

In this example, we'll use the following helper function to display match objects a little more gracefully:

def displaymatch(match):
    if match is None:
        return None
    return '<Match: %r, groups=%r>' % (match.group(), match.groups())

Suppose you are writing a poker program where a player's hand is represented as a 5-character string with each character representing a card, "a" for ace, "k" for king, "q" for queen, "j" for jack, "t" for 10, and "2" through "9" representing the card with that value.

To see if a given string is a valid hand, one could do the following:

>>> valid = re.compile(r"^[a2-9tjqk]{5}$")
>>> displaymatch(valid.match("akt5q"))  # Valid.
"<Match: 'akt5q', groups=()>"
>>> displaymatch(valid.match("akt5e"))  # Invalid.
>>> displaymatch(valid.match("727ak"))  # Valid.
"<Match: '727ak', groups=()>"

That last hand, "727ak", contained a pair, or two of the same valued cards. To match this with a regular expression, one could use backreferences as such:

>>> pair = re.compile(r".*(.).*\1")
>>> displaymatch(pair.match("717ak"))     # Pair of 7s.
"<Match: '717', groups=('7',)>"
>>> displaymatch(pair.match("718ak"))     # No pairs.
>>> displaymatch(pair.match("354aa"))     # Pair of aces.
"<Match: '354aa', groups=('a',)>"

To find out what card the pair consists of, one could use the group() method of the match object in the following manner:

>>> pair.match("717ak").group(1)
'7'
>>> pair.match("354aa").group(1)
'a'

6.2.5.2. Simulating scanf()¶

Python does not currently have an equivalent to scanf(). Regular expressions are generally more powerful, though also more verbose, than scanf() format strings. The table below offers some more-or-less equivalent mappings between scanf() format tokens and regular expressions.

scanf() Token	Regular Expression
%c	.
%5c	.{5}
%d	[-+]?\d+
%e, %E, %f, %g	[-+]?(\d+(\.\d*)?|\.\d+)([eE][-+]?\d+)?
%i	[-+]?(0[xX][\dA-Fa-f]+|0[0-7]*|\d+)
%o	[-+]?[0-7]+
%s	\S+
%u	\d+
%x, %X	[-+]?(0[xX])?[\dA-Fa-f]+

To extract the filename and numbers from a string like

/usr/sbin/sendmail - 0 errors, 4 warnings

you would use a scanf() format like

%s - %d errors, %d warnings

The equivalent regular expression would be

(\S+) - (\d+) errors, (\d+) warnings
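One way such a scanf()-style pattern might be applied in practice is sketched below. The sample line and the variable names are illustrative; since regular expression groups are always strings, the numeric fields are converted with int() by hand.

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'

# %s -> (\S+) and %d -> (\d+), following the table above.
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)
errors = int(m.group(2))
warnings = int(m.group(3))

print(filename, errors, warnings)
```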

6.2.5.3. search() vs. match()¶

Python offers two different primitive operations based on regular expressions: re.match() checks for a match only at the beginning of the string, while re.search() checks for a match anywhere in the string (this is what Perl does by default).

For example:

>>> re.match("c", "abcdef")            # No match
>>> re.search("c", "abcdef").span()    # Match
(2, 3)

Regular expressions beginning with '^' can be used with search() to restrict the match at the beginning of the string:

>>> re.match("c", "abcdef")            # No match
>>> re.search("^c", "abcdef")          # No match
>>> re.search("^a", "abcdef").group()  # Match
'a'

Note however that in MULTILINE mode match() only matches at the beginning of the string, whereas using search() with a regular expression beginning with '^' will match at the beginning of each line.

>>> re.match('X', 'A\nB\nX', re.MULTILINE)           # No match
>>> re.search('^X', 'A\nB\nX', re.MULTILINE).span()  # Match
(4, 5)

6.2.5.4. Making a Phonebook¶

split() splits a string into a list delimited by the passed pattern. The method is invaluable for converting textual data into data structures that can be easily read and modified by Python as demonstrated in the following example that creates a phonebook.

First, here is the input. Normally it may come from a file, here we are using triple-quoted string syntax:

>>> text = """Ross McFluff: 834.345.1254 155 Elm Street
...
... Ronald Heathmore: 892.345.3428 436 Finley Avenue
... Frank Burger: 925.541.7625 662 South Dogwood Way"""

The entries are separated by one or more newlines. Now we convert the string into a list with each nonempty line having its own entry:

>>> entries = re.split("\n+", text)
>>> len(entries), entries[0]
(3, 'Ross McFluff: 834.345.1254 155 Elm Street')

Finally, split each entry into a list with first name, last name, telephone number, and address. We use the maxsplit parameter of split() because the address has spaces, our splitting pattern, in it:

>>> [re.split(":? ", entry, maxsplit=3) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155 Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436 Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662 South Dogwood Way']]

The :? pattern matches the colon after the last name, so that it does not occur in the result list. With a maxsplit of 4, we could separate the house number from the street name:

>>> [re.split(":? ", entry, maxsplit=4) for entry in entries]
[['Ross', 'McFluff', '834.345.1254', '155', 'Elm Street'],
['Ronald', 'Heathmore', '892.345.3428', '436', 'Finley Avenue'],
['Frank', 'Burger', '925.541.7625', '662', 'South Dogwood Way']]

6.2.5.5. Text Munging¶

sub() replaces every occurrence of a pattern with a string or the result of a function. This example demonstrates using sub() with a function to "munge" text, or randomize the order of all the characters in each word of a sentence except for the first and last characters:

>>> import random
>>> def repl(m):  # keep the first and last letters, shuffle the rest
...     return m.group(1) + "".join(random.sample(m.group(2), len(m.group(2)))) + m.group(3)
>>> munged = re.sub(r"(\w)(\w+)(\w)", repl, "Professor Abdolmalek, please report your absences promptly.")

6.2.5.6. Finding all Adverbs¶

findall() matches all occurrences of a pattern, not just the first one as search() does. For example, if one was a writer and wanted to find all of the adverbs in some text, he or she might use findall() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> re.findall(r"\w+ly", text)
['carefully', 'quickly']

6.2.5.7. Finding all Adverbs and their Positions¶

If one wants more information about all matches of a pattern than the matched text, finditer() is useful as it provides match objects instead of strings. Continuing with the previous example, if one was a writer who wanted to find all of the adverbs and their positions in some text, he or she would use finditer() in the following manner:

>>> text = "He was carefully disguised but captured quickly by police."
>>> [(m.start(), m.end(), m.group(0)) for m in re.finditer(r"\w+ly", text)]
[(7, 16, 'carefully'), (40, 47, 'quickly')]

6.2.5.8. Raw String Notation¶

Raw string notation (r"text") keeps regular expressions sane. Without it, every backslash ('\') in a regular expression would have to be prefixed with another one to escape it. For example, the two following lines of code are functionally identical:

>>> m1 = re.match(r"\W(.)\1\W", " ff ")
>>> m2 = re.match("\\W(.)\\1\\W", " ff ")

When one wants to match a literal backslash, it must be escaped in the regular expression. With raw string notation, this means r"\\". Without raw string notation, one must use "\\\\", making the following lines of code functionally identical:

>>> m1 = re.match(r"\\", r"\\")
>>> m2 = re.match("\\\\", r"\\")

6.2.5.9. Writing a Tokenizer¶

A tokenizer or scanner analyzes a string to categorize groups of characters. This is a useful first step in writing a compiler or interpreter.

The text categories are specified with regular expressions. The technique is to combine those into a single master regular expression and to loop over successive matches:

master = re.compile(r'(?P<NUMBER>\d+)|(?P<ID>[A-Za-z_]\w*)|(?P<OP>[+\-*/=])|(?P<SKIP>\s+)')
print([(m.lastgroup, m.group()) for m in master.finditer('x = 3 + 42') if m.lastgroup != 'SKIP'])

The tokenizer produces the following output:

[('ID', 'x'), ('OP', '='), ('NUMBER', '3'), ('OP', '+'), ('NUMBER', '42')]
