Hướng dẫn dùng recursive pattern trong PHP

Consider the problem of matching a string in parentheses, allowing for unlimited nested parentheses. Without the use of recursion, the best that can be done is to use a pattern that matches up to some fixed depth of nesting. It is not possible to handle an arbitrary nesting depth. Perl 5.6 has provided an experimental facility that allows regular expressions to recurse [among other things]. The special item [?R] is provided for the specific case of recursion. This PCRE pattern solves the parentheses problem [assume the PCRE_EXTENDED option is set so that white space is ignored]: \[ [ [?>[^[]]+] | [?R] ]* \]

First it matches an opening parenthesis. Then it matches any number of substrings which can either be a sequence of non-parentheses, or a recursive match of the pattern itself [i.e. a correctly parenthesized substring]. Finally there is a closing parenthesis.

This particular example pattern contains nested unlimited repeats, and so the use of a once-only subpattern for matching strings of non-parentheses is important when applying the pattern to strings that do not match. For example, when it is applied to [aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa[] it yields "no match" quickly. However, if a once-only subpattern is not used, the match runs for a very long time indeed because there are so many different ways the + and * repeats can carve up the subject, and all have to be tested before failure can be reported.

The values set for any capturing subpatterns are those from the outermost level of the recursion at which the subpattern value is set. If the pattern above is matched against [ab[cd]ef] the value for the capturing parentheses is "ef", which is the last value taken on at the top level. If additional parentheses are added, giving \[ [ [ [?>[^[]]+] | [?R] ]* ] \] then the string they capture is "ab[cd]ef", the contents of the top level parentheses. If there are more than 15 capturing parentheses in a pattern, PCRE has to obtain extra memory to store data during a recursion, which it does by using pcre_malloc, freeing it via pcre_free afterwards. If no memory can be obtained, it saves data for the first 15 capturing parentheses only, as there is no way to give an out-of-memory error from within a recursion.

[?1], [?2] and so on can be used for recursive subpatterns too. It is also possible to use named subpatterns: [?P>name] or [?&name].

If the syntax for a recursive subpattern reference [either by number or by name] is used outside the parentheses to which it refers, it operates like a subroutine in a programming language. An earlier example pointed out that the pattern [sens|respons]e and \1ibility matches "sense and sensibility" and "response and responsibility", but not "sense and responsibility". If instead the pattern [sens|respons]e and [?1]ibility is used, it does match "sense and responsibility" as well as the other two strings. Such references must, however, follow the subpattern to which they refer.

The maximum length of a subject string is the largest positive number that an integer variable can hold. However, PCRE uses recursion to handle subpatterns and indefinite repetition. This means that the available stack space may limit the size of a subject string that can be processed by certain patterns.

horvath at webarticum dot hu

9 years ago

With the [?R] item you can link only to the full pattern, because it quasi equals to [?0]. You can not use anchors, asserts etc., and you can only check that string CONTAINS a valid hierarchy or not.

This is wrong: ^\[[[?>[^[]]+]|[?R]]*\]$

However, you can bracketing the full expression, and replace [?R] to the relative link [?-2]. This make it reusable. So you can check complex expressions, for example:


You can also catch multibyte quotes with the 'u' modifier [if you use UTF-8], eg:

emanueledelgrande at email dot it

12 years ago

The recursion in regular expressions is the only way to allow the parsing of HTML code with nested tags of indefinite depth.
It seems it's not yet a spreaded practice; not so much contents are available on the web regarding regexp recursion, and until now no user contribute notes have been published on this manual page.
I made several tests with complex patterns to get tags with specific attributes or namespaces, studying the recursion of a subpattern only instead of the full pattern.
Here's an example that may power a fast LL parser with recursive descent [//en.wikipedia.org/wiki/Recursive_descent_parser]:

$pattern = "/]*?] [[[\s]*\/>]| [>[[[[^

Chủ Đề