X-Git-Url: http://ftp.carnet.hr/carnet-debian/scm?p=ossec-hids.git;a=blobdiff_plain;f=src%2Fexternal%2Fpcre2-10.32%2Fdoc%2Fpcre2pattern.3;fp=src%2Fexternal%2Fpcre2-10.32%2Fdoc%2Fpcre2pattern.3;h=0247c524a307f52fe2546429322bf5e47b0495e5;hp=0000000000000000000000000000000000000000;hb=3f728675941dc69d4e544d3a880a56240a6e394a;hpb=927951d1c1ad45ba9e7325f07d996154a91c911b diff --git a/src/external/pcre2-10.32/doc/pcre2pattern.3 b/src/external/pcre2-10.32/doc/pcre2pattern.3 new file mode 100644 index 0000000..0247c52 --- /dev/null +++ b/src/external/pcre2-10.32/doc/pcre2pattern.3 @@ -0,0 +1,3660 @@ +.TH PCRE2PATTERN 3 "04 September 2018" "PCRE2 10.32" +.SH NAME +PCRE2 - Perl-compatible regular expressions (revised API) +.SH "PCRE2 REGULAR EXPRESSION DETAILS" +.rs +.sp +The syntax and semantics of the regular expressions that are supported by PCRE2 +are described in detail below. There is a quick-reference syntax summary in the +.\" HREF +\fBpcre2syntax\fP +.\" +page. PCRE2 tries to match Perl syntax and semantics as closely as it can. +PCRE2 also supports some alternative regular expression syntax (which does not +conflict with the Perl syntax) in order to provide some compatibility with +regular expressions in Python, .NET, and Oniguruma. +.P +Perl's regular expressions are described in its own documentation, and regular +expressions in general are covered in a number of books, some of which have +copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published +by O'Reilly, covers regular expressions in great detail. This description of +PCRE2's regular expressions is intended as reference material. +.P +This document discusses the patterns that are supported by PCRE2 when its main +matching function, \fBpcre2_match()\fP, is used. PCRE2 also has an alternative +matching function, \fBpcre2_dfa_match()\fP, which matches using a different +algorithm that is not Perl-compatible. Some of the features discussed below are +not available when DFA matching is used. 
The advantages and disadvantages of +the alternative function, and how it differs from the normal function, are +discussed in the +.\" HREF +\fBpcre2matching\fP +.\" +page. +. +. +.SH "SPECIAL START-OF-PATTERN ITEMS" +.rs +.sp +A number of options that can be passed to \fBpcre2_compile()\fP can also be set +by special items at the start of a pattern. These are not Perl-compatible, but +are provided to make these options accessible to pattern writers who are not +able to change the program that processes the pattern. Any number of these +items may appear, but they must all be together right at the start of the +pattern string, and the letters must be in upper case. +. +. +.SS "UTF support" +.rs +.sp +In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as +single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be +specified for the 32-bit library, in which case it constrains the character +values to valid Unicode code points. To process UTF strings, PCRE2 must be +built to include Unicode support (which is the default). When using UTF strings +you must either call the compiling function with the PCRE2_UTF option, or the +pattern must start with the special sequence (*UTF), which is equivalent to +setting the relevant option. How setting a UTF mode affects pattern matching is +mentioned in several places below. There is also a summary of features in the +.\" HREF +\fBpcre2unicode\fP +.\" +page. +.P +Some applications that allow their users to supply patterns may wish to +restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF +option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its +appearance in a pattern causes an error. +. +. +.SS "Unicode property support" +.rs +.sp +Another special sequence that may appear at the start of a pattern is (*UCP). 
+This has the same effect as setting the PCRE2_UCP option: it causes sequences +such as \ed and \ew to use Unicode properties to determine character types, +instead of recognizing only characters with codes less than 256 via a lookup +table. +.P +Some applications that allow their users to supply patterns may wish to +restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to +\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern +causes an error. +. +. +.SS "Locking out empty string matching" +.rs +.sp +Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect +as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever +matching function is subsequently called to match the pattern. These options +lock out the matching of empty strings, either entirely, or only at the start +of the subject. +. +. +.SS "Disabling auto-possessification" +.rs +.sp +If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting +the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers +possessive when what follows cannot match the repeated item. For example, by +default a+b is treated as a++b. For more details, see the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +. +. +.SS "Disabling start-up optimizations" +.rs +.sp +If a pattern starts with (*NO_START_OPT), it has the same effect as setting the +PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly +reaching "no match" results. For more details, see the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +. +. +.SS "Disabling automatic anchoring" +.rs +.sp +If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as +setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that +apply to patterns whose top-level branches all start with .* (match any number +of arbitrary characters). For more details, see the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +. +. 
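The rule that start-of-pattern items must sit together at the very start of the pattern, in upper case, can be illustrated with a small Python sketch. This is not PCRE2 code: the function name `split_start_items` and the two name tables are hypothetical, and the tables simply collect the item names that this section lists.

```python
import re

# Item names taken from the "SPECIAL START-OF-PATTERN ITEMS" section.
# The tables and the function below are illustrative, not part of PCRE2.
FLAG_ITEMS = {
    "UTF", "UCP", "NOTEMPTY", "NOTEMPTY_ATSTART", "NO_AUTO_POSSESS",
    "NO_START_OPT", "NO_DOTSTAR_ANCHOR", "NO_JIT",
    "CR", "LF", "CRLF", "ANYCRLF", "ANY", "NUL",
    "BSR_ANYCRLF", "BSR_UNICODE",
}
LIMIT_ITEMS = {"LIMIT_HEAP", "LIMIT_MATCH", "LIMIT_DEPTH", "LIMIT_RECURSION"}

# One item: "(*NAME)" or "(*NAME=digits)", upper case only.
_ITEM = re.compile(r"\(\*([A-Z_]+)(?:=([0-9]+))?\)")


def split_start_items(pattern):
    """Split leading start-of-pattern items from the rest of the pattern.

    Returns (items, rest), where items is a list of (name, value) pairs and
    value is an int for the LIMIT_ items and None otherwise.
    """
    items = []
    pos = 0
    while True:
        match = _ITEM.match(pattern, pos)
        if match is None:
            break
        name, value = match.group(1), match.group(2)
        if name in FLAG_ITEMS and value is None:
            items.append((name, None))
        elif name in LIMIT_ITEMS and value is not None:
            items.append((name, int(value)))
        else:
            # Not a recognized start-of-pattern item: the run of items ends.
            break
        pos = match.end()
    return items, pattern[pos:]
```

Scanning stops at the first thing that is not a recognized item, so a lower-case `(*utf)` or an item that appears after other pattern text is left in the remainder untouched.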
+.SS "Disabling JIT compilation" +.rs +.sp +If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by +the application to apply the JIT optimization by calling +\fBpcre2_jit_compile()\fP is ignored. +. +. +.SS "Setting match resource limits" +.rs +.sp +The \fBpcre2_match()\fP function contains a counter that is incremented every +time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a +limit on this counter, which therefore limits the amount of computing resource +used for a match. The maximum depth of nested backtracking can also be limited; +this indirectly restricts the amount of heap memory that is used, but there is +also an explicit memory limit that can be set. +.P +These facilities are provided to catch runaway matches that are provoked by +patterns with huge matching trees (a typical example is a pattern with nested +unlimited repeats applied to a long string that does not match). When one of +these limits is reached, \fBpcre2_match()\fP gives an error return. The limits +can also be set by items at the start of the pattern of the form +.sp + (*LIMIT_HEAP=d) + (*LIMIT_MATCH=d) + (*LIMIT_DEPTH=d) +.sp +where d is any number of decimal digits. However, the value of the setting must +be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP +for it to have any effect. In other words, the pattern writer can lower the +limits set by the programmer, but not raise them. If there is more than one +setting of one of these limits, the lower value is used. The heap limit is +specified in kibibytes (units of 1024 bytes). +.P +Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is +still recognized for backwards compatibility. +.P +The heap limit applies only when the \fBpcre2_match()\fP or +\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply +to JIT. 
The match limit is used (but in a different way) when JIT is being +used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource +usage by those matching functions. The depth limit is ignored by JIT but is +relevant for DFA matching, which uses function recursion for recursions within +the pattern and for lookaround assertions and atomic groups. In this case, the +depth limit controls the depth of such recursion. +. +. +.\" HTML +.SS "Newline conventions" +.rs +.sp +PCRE2 supports six different conventions for indicating line breaks in +strings: a single CR (carriage return) character, a single LF (linefeed) +character, the two-character sequence CRLF, any of the three preceding, any +Unicode newline sequence, or the NUL character (binary zero). The +.\" HREF +\fBpcre2api\fP +.\" +page has +.\" HTML +.\" +further discussion +.\" +about newlines, and shows how to set the newline convention when calling +\fBpcre2_compile()\fP. +.P +It is also possible to specify a newline convention by starting a pattern +string with one of the following sequences: +.sp + (*CR) carriage return + (*LF) linefeed + (*CRLF) carriage return, followed by linefeed + (*ANYCRLF) any of the three above + (*ANY) all Unicode newline sequences + (*NUL) the NUL character (binary zero) +.sp +These override the default and the options given to the compiling function. For +example, on a Unix system where LF is the default newline sequence, the pattern +.sp + (*CR)a.b +.sp +changes the convention to CR. That pattern matches "a\enb" because LF is no +longer a newline. If more than one of these settings is present, the last one +is used. +.P +The newline convention affects where the circumflex and dollar assertions are +true. It also affects the interpretation of the dot metacharacter when +PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an +opening brace. However, it does not affect what the \eR escape sequence +matches. 
By default, this is any Unicode newline sequence, for Perl +compatibility. However, this can be changed; see the next section and the +description of \eR in the section entitled +.\" HTML +.\" +"Newline sequences" +.\" +below. A change of \eR setting can be combined with a change of newline +convention. +. +. +.SS "Specifying what \eR matches" +.rs +.sp +It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the +complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF +at compile time. This effect can also be achieved by starting a pattern with +(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized, +corresponding to PCRE2_BSR_UNICODE. +. +. +.SH "EBCDIC CHARACTER CODES" +.rs +.sp +PCRE2 can be compiled to run in an environment that uses EBCDIC as its +character code instead of ASCII or Unicode (typically a mainframe system). In +the sections below, character code values are ASCII or Unicode; in an EBCDIC +environment these characters may have different code values, and there are no +code points greater than 255. +. +. +.SH "CHARACTERS AND METACHARACTERS" +.rs +.sp +A regular expression is a pattern that is matched against a subject string from +left to right. Most characters stand for themselves in a pattern, and match the +corresponding characters in the subject. As a trivial example, the pattern +.sp + The quick brown fox +.sp +matches a portion of a subject string that is identical to itself. When +caseless matching is specified (the PCRE2_CASELESS option), letters are matched +independently of case. +.P +The power of regular expressions comes from the ability to include alternatives +and repetitions in the pattern. These are encoded in the pattern by the use of +\fImetacharacters\fP, which do not stand for themselves but instead are +interpreted in some special way. 
+.P +There are two different sets of metacharacters: those that are recognized +anywhere in the pattern except within square brackets, and those that are +recognized within square brackets. Outside square brackets, the metacharacters +are as follows: +.sp + \e general escape character with several uses + ^ assert start of string (or line, in multiline mode) + $ assert end of string (or line, in multiline mode) + . match any character except newline (by default) + [ start character class definition + | start of alternative branch + ( start subpattern + ) end subpattern + ? extends the meaning of ( + also 0 or 1 quantifier + also quantifier minimizer + * 0 or more quantifier + + 1 or more quantifier + also "possessive quantifier" + { start min/max quantifier +.sp +Part of a pattern that is in square brackets is called a "character class". In +a character class the only metacharacters are: +.sp + \e general escape character + ^ negate the class, but only if the first character + - indicates character range +.\" JOIN + [ POSIX character class (only if followed by POSIX + syntax) + ] terminates the character class +.sp +The following sections describe the use of each of the metacharacters. +. +. +.SH BACKSLASH +.rs +.sp +The backslash character has several uses. Firstly, if it is followed by a +character that is not a number or a letter, it takes away any special meaning +that character may have. This use of backslash as an escape character applies +both inside and outside character classes. +.P +For example, if you want to match a * character, you must write \e* in the +pattern. This escaping action applies whether or not the following character +would otherwise be interpreted as a metacharacter, so it is always safe to +precede a non-alphanumeric with backslash to specify that it stands for itself. +In particular, if you want to match a backslash, you write \e\e. +.P +In a UTF mode, only ASCII numbers and letters have any special meaning after a +backslash. 
All other characters (in particular, those whose code points are +greater than 127) are treated as literals. +.P +If a pattern is compiled with the PCRE2_EXTENDED option, most white space in +the pattern (other than in a character class), and characters between a # +outside a character class and the next newline, inclusive, are ignored. An +escaping backslash can be used to include a white space or # character as part +of the pattern. +.P +If you want to remove the special meaning from a sequence of characters, you +can do so by putting them between \eQ and \eE. This is different from Perl in +that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas +in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish +backslash interpolation" on any backslashes between \eQ and \eE which, its +documentation says, "may lead to confusing results". PCRE2 treats a backslash +between \eQ and \eE just like any other character. Note the following examples: +.sp + Pattern PCRE2 matches Perl matches +.sp +.\" JOIN + \eQabc$xyz\eE abc$xyz abc followed by the + contents of $xyz + \eQabc\e$xyz\eE abc\e$xyz abc\e$xyz + \eQabc\eE\e$\eQxyz\eE abc$xyz abc$xyz + \eQA\eB\eE A\eB A\eB + \eQ\e\eE \e \e\eE +.sp +The \eQ...\eE sequence is recognized both inside and outside character classes. +An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed +by \eE later in the pattern, the literal interpretation continues to the end of +the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside +a character class, this causes an error, because the character class is not +terminated by a closing square bracket. +. +. +.\" HTML +.SS "Non-printing characters" +.rs +.sp +A second use of backslash provides a way of encoding non-printing characters +in patterns in a visible manner. 
There is no restriction on the appearance of +non-printing characters in a pattern, but when a pattern is being prepared by +text editing, it is often easier to use one of the following escape sequences +than the binary character it represents. In an ASCII or Unicode environment, +these escapes are as follows: +.sp + \ea alarm, that is, the BEL character (hex 07) + \ecx "control-x", where x is any printable ASCII character + \ee escape (hex 1B) + \ef form feed (hex 0C) + \en linefeed (hex 0A) + \er carriage return (hex 0D) + \et tab (hex 09) + \e0dd character with octal code 0dd + \eddd character with octal code ddd, or backreference + \eo{ddd..} character with octal code ddd.. + \exhh character with hex code hh + \ex{hhh..} character with hex code hhh.. + \eN{U+hhh..} character with Unicode hex code point hhh.. + \euhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set) +.sp +The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option +is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses +\eN{name} to specify characters by Unicode name; PCRE2 does not support this. +Note that when \eN is not followed by an opening brace (curly bracket) it has +an entirely different meaning, matching any character that is not a newline. +.P +The precise effect of \ecx on ASCII characters is as follows: if x is a lower +case letter, it is converted to upper case. Then bit 6 of the character (hex +40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A), +but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the +code unit following \ec has a value less than 32 or greater than 126, a +compile-time error occurs. +.P +When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee, +\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec +escape is processed as specified for Perl in the \fBperlebcdic\fP document. 
The +only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ], +^, _, or ?. Any other character provokes a compile-time error. The sequence +\ec@ encodes character code 0; after \ec the letters (in either case) encode +characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31 +(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F). +.P +Thus, apart from \ec?, these escapes generate the same character code values as +they do in an ASCII environment, though the meanings of the values mostly +differ. For example, \ecG always generates code value 7, which is BEL in ASCII +but DEL in EBCDIC. +.P +The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but +because 127 is not a control character in EBCDIC, Perl makes it generate the +APC character. Unfortunately, there are several variants of EBCDIC. In most of +them the APC character has the value 255 (hex FF), but in the one Perl calls +POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC +values, PCRE2 makes \ec? generate 95; otherwise it generates 255. +.P +After \e0 up to two further octal digits are read. If there are fewer than two +digits, just those that are present are used. Thus the sequence \e0\ex\e015 +specifies two binary zeros followed by a CR character (code value 13). Make +sure you supply two digits after the initial zero if the pattern character that +follows is itself an octal digit. +.P +The escape \eo must be followed by a sequence of octal digits, enclosed in +braces. An error occurs if this is not the case. This escape is a recent +addition to Perl; it provides a way of specifying character code points as octal +numbers greater than 0777, and it also allows octal numbers and backreferences +to be unambiguously specified. +.P +For greater clarity and unambiguity, it is best to avoid following \e by a +digit greater than zero. 
Instead, use \eo{} or \ex{} to specify numerical +character code points, and \eg{} to specify backreferences. The following +paragraphs describe the old, ambiguous syntax. +.P +The handling of a backslash followed by a digit other than 0 is complicated, +and Perl has changed over time, causing PCRE2 also to change. +.P +Outside a character class, PCRE2 reads the digit and any following digits as a +decimal number. If the number is less than 10, begins with the digit 8 or 9, or +if there are at least that many previous capturing left parentheses in the +expression, the entire sequence is taken as a \fIbackreference\fP. A +description of how this works is given +.\" HTML +.\" +later, +.\" +following the discussion of +.\" HTML +.\" +parenthesized subpatterns. +.\" +Otherwise, up to three octal digits are read to form a character code. +.P +Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters +"8" and "9", and otherwise reads up to three octal digits following the +backslash, using them to generate a data character. Any subsequent digits stand +for themselves. For example, outside a character class: +.sp + \e040 is another way of writing an ASCII space +.\" JOIN + \e40 is the same, provided there are fewer than 40 + previous capturing subpatterns + \e7 is always a backreference +.\" JOIN + \e11 might be a backreference, or another way of + writing a tab + \e011 is always a tab + \e0113 is a tab followed by the character "3" +.\" JOIN + \e113 might be a backreference, otherwise the + character with octal code 113 +.\" JOIN + \e377 might be a backreference, otherwise + the value 255 (decimal) +.\" JOIN + \e81 is always a backreference +.sp +Note that octal values of 100 or greater that are specified using this syntax +must not be introduced by a leading zero, because no more than three octal +digits are ever read. 
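The rules above for a backslash followed by digits outside a character class can be restated as a short Python sketch. It only mirrors the prose and the example table; it is not PCRE2's implementation, and the names `interpret_backslash_digits` and `previous_groups` are made up for this illustration. It does not apply the per-mode limits on character values.

```python
DECIMAL_DIGITS = "0123456789"
OCTAL_DIGITS = "01234567"


def _leading_octal(text, limit):
    """Return the longest prefix of text, at most limit long, of octal digits."""
    count = 0
    while count < len(text) and count < limit and text[count] in OCTAL_DIGITS:
        count += 1
    return text[:count]


def interpret_backslash_digits(digits, previous_groups):
    """Classify the run of digits that follows a backslash.

    digits is the whole run of decimal digits after the backslash, and
    previous_groups is the number of capturing left parentheses seen so far.
    Returns (kind, value, consumed): kind is "backreference" or "character",
    value is the group number or character code, and consumed is how many
    digits were used. Any digits left over stand for themselves.
    """
    if not digits or any(c not in DECIMAL_DIGITS for c in digits):
        raise ValueError("expected a non-empty run of ASCII decimal digits")

    if digits[0] == "0":
        # After a leading zero, up to two further octal digits are read.
        extra = _leading_octal(digits[1:], 2)
        return ("character", int("0" + extra, 8), 1 + len(extra))

    number = int(digits)
    if number < 10 or digits[0] in "89" or number <= previous_groups:
        return ("backreference", number, len(digits))

    # Otherwise up to three octal digits form a character code.
    octal = _leading_octal(digits, 3)
    return ("character", int(octal, 8), len(octal))
```

For instance, with no previous capturing groups the digits 0113 give character code 9 with three digits consumed, which leaves the final "3" as a literal, in line with the corresponding row of the table.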
+.P +By default, after \ex that is not followed by {, from zero to two hexadecimal +digits are read (letters can be in upper or lower case). Any number of +hexadecimal digits may appear between \ex{ and }. If a character other than +a hexadecimal digit appears between \ex{ and }, or if there is no terminating +}, an error occurs. +.P +If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just +described only when it is followed by two hexadecimal digits. Otherwise, it +matches a literal "x" character. In this mode, support for code points greater +than 256 is provided by \eu, which must be followed by four hexadecimal digits; +otherwise it matches a literal "u" character. +.P +Characters whose value is less than 256 can be defined by either of the two +syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in +the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or +\eu00dc in PCRE2_ALT_BSUX mode). +. +. +.SS "Constraints on character values" +.rs +.sp +Characters that are specified using octal or hexadecimal numbers are +limited to certain values, as follows: +.sp + 8-bit non-UTF mode no greater than 0xff + 16-bit non-UTF mode no greater than 0xffff + 32-bit non-UTF mode no greater than 0xffffffff + All UTF modes no greater than 0x10ffff and a valid code point +.sp +Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the +so-called "surrogate" code points). The check for these can be disabled by the +caller of \fBpcre2_compile()\fP by setting the option +PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8 +and UTF-32 modes, because these values are not representable in UTF-16. +. +. +.SS "Escape sequences in character classes" +.rs +.sp +All the sequences that define a single character value can be used both inside +and outside character classes. In addition, inside a character class, \eb is +interpreted as the backspace character (hex 08). 
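The limits listed under "Constraints on character values" above can be written down as a small check. The following Python sketch is only an illustration of that table and of the surrogate rule; the function names and the `allow_surrogate_escapes` parameter (standing in for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES) are assumptions made for this example, not part of the PCRE2 API.

```python
def max_character_value(code_unit_width, utf):
    """Largest value an octal or hex escape may specify in the given mode."""
    if utf:
        return 0x10FFFF
    return {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}[code_unit_width]


def is_allowed_value(value, code_unit_width, utf, allow_surrogate_escapes=False):
    """Return True if an escape may specify this value in the given mode.

    code_unit_width is 8, 16, or 32. allow_surrogate_escapes models the
    PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option, which has an effect only in
    UTF-8 and UTF-32 modes.
    """
    if value < 0 or value > max_character_value(code_unit_width, utf):
        return False
    if utf and 0xD800 <= value <= 0xDFFF:
        # Surrogates are invalid code points in UTF modes unless the check
        # is disabled, which is not possible in UTF-16 mode.
        return allow_surrogate_escapes and code_unit_width != 16
    return True
```

Note that in non-UTF modes the surrogate range is not special, so a value such as 0xd800 is accepted in 16-bit non-UTF mode.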
+.P +When not followed by an opening brace, \eN is not allowed in a character class. +\eB, \eR, and \eX are not special inside a character class. Like other +unrecognized alphabetic escape sequences, they cause an error. Outside a +character class, these sequences have different meanings. +. +. +.SS "Unsupported escape sequences" +.rs +.sp +In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string +handler and used to modify the case of following characters. By default, PCRE2 +does not support these escape sequences. However, if the PCRE2_ALT_BSUX option +is set, \eU matches a "U" character, and \eu can be used to define a character +by code point, as described above. +. +. +.SS "Absolute and relative backreferences" +.rs +.sp +The sequence \eg followed by a signed or unsigned number, optionally enclosed +in braces, is an absolute or relative backreference. A named backreference +can be coded as \eg{name}. Backreferences are discussed +.\" HTML +.\" +later, +.\" +following the discussion of +.\" HTML +.\" +parenthesized subpatterns. +.\" +. +. +.SS "Absolute and relative subroutine calls" +.rs +.sp +For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or +a number enclosed either in angle brackets or single quotes, is an alternative +syntax for referencing a subpattern as a "subroutine". Details are discussed +.\" HTML +.\" +later. +.\" +Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP +synonymous. The former is a backreference; the latter is a +.\" HTML +.\" +subroutine +.\" +call. +. +. 
+.\" HTML +.SS "Generic character types" +.rs +.sp +Another use of backslash is for specifying generic character types: +.sp + \ed any decimal digit + \eD any character that is not a decimal digit + \eh any horizontal white space character + \eH any character that is not a horizontal white space character + \eN any character that is not a newline + \es any white space character + \eS any character that is not a white space character + \ev any vertical white space character + \eV any character that is not a vertical white space character + \ew any "word" character + \eW any "non-word" character +.sp +The \eN escape sequence has the same meaning as +.\" HTML +.\" +the "." metacharacter +.\" +when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the +meaning of \eN. Note that when \eN is followed by an opening brace it has a +different meaning. See the section entitled +.\" HTML +.\" +"Non-printing characters" +.\" +above for details. Perl also uses \eN{name} to specify characters by Unicode +name; PCRE2 does not support this. +.P +Each pair of lower and upper case escape sequences partitions the complete set +of characters into two disjoint sets. Any given character matches one, and only +one, of each pair. The sequences can appear both inside and outside character +classes. They each match one character of the appropriate type. If the current +matching point is at the end of the subject string, all of them fail, because +there is no character to match. +.P +The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and +space (32), which are defined as white space in the "C" locale. This list may +vary if locale-specific matching is taking place. For example, in some locales +the "non-breaking space" character (\exA0) is recognized as white space, and in +others the VT character is not. +.P +A "word" character is an underscore or any character that is a letter or digit. 
+By default, the definition of letters and digits is controlled by PCRE2's +low-valued character tables, and may vary if locale-specific matching is taking +place (see +.\" HTML +.\" +"Locale support" +.\" +in the +.\" HREF +\fBpcre2api\fP +.\" +page). For example, in a French locale such as "fr_FR" in Unix-like systems, +or "french" in Windows, some character codes greater than 127 are used for +accented letters, and these are then matched by \ew. The use of locales with +Unicode is discouraged. +.P +By default, characters whose code points are greater than 127 never match \ed, +\es, or \ew, and always match \eD, \eS, and \eW, although this may be different +for characters in the range 128-255 when locale-specific matching is happening. +These escape sequences retain their original meanings from before Unicode +support was available, mainly for efficiency reasons. If the PCRE2_UCP option +is set, the behaviour is changed so that Unicode properties are used to +determine character types, as follows: +.sp + \ed any character that matches \ep{Nd} (decimal digit) + \es any character that matches \ep{Z} or \eh or \ev + \ew any character that matches \ep{L} or \ep{N}, plus underscore +.sp +The upper case escapes match the inverse sets of characters. Note that \ed +matches only decimal digits, whereas \ew matches any Unicode digit, as well as +any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb, and +\eB because they are defined in terms of \ew and \eW. Matching these sequences +is noticeably slower when PCRE2_UCP is set. +.P +The sequences \eh, \eH, \ev, and \eV, in contrast to the other sequences, which +match only ASCII characters by default, always match a specific list of code +points, whether or not PCRE2_UCP is set. 
The horizontal space characters are: +.sp + U+0009 Horizontal tab (HT) + U+0020 Space + U+00A0 Non-break space + U+1680 Ogham space mark + U+180E Mongolian vowel separator + U+2000 En quad + U+2001 Em quad + U+2002 En space + U+2003 Em space + U+2004 Three-per-em space + U+2005 Four-per-em space + U+2006 Six-per-em space + U+2007 Figure space + U+2008 Punctuation space + U+2009 Thin space + U+200A Hair space + U+202F Narrow no-break space + U+205F Medium mathematical space + U+3000 Ideographic space +.sp +The vertical space characters are: +.sp + U+000A Linefeed (LF) + U+000B Vertical tab (VT) + U+000C Form feed (FF) + U+000D Carriage return (CR) + U+0085 Next line (NEL) + U+2028 Line separator + U+2029 Paragraph separator +.sp +In 8-bit, non-UTF-8 mode, only the characters with code points less than 256 +are relevant. +. +. +.\" HTML +.SS "Newline sequences" +.rs +.sp +Outside a character class, by default, the escape sequence \eR matches any +Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the +following: +.sp + (?>\er\en|\en|\ex0b|\ef|\er|\ex85) +.sp +This is an example of an "atomic group", details of which are given +.\" HTML +.\" +below. +.\" +This particular group matches either the two-character sequence CR followed by +LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab, +U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next +line, U+0085). Because this is an atomic group, the two-character sequence is +treated as a single unit that cannot be split. +.P +In other modes, two additional characters whose code points are greater than 255 +are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029). +Unicode support is not needed for these characters to be recognized. +.P +It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the +complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF +at compile time. 
(BSR is an abbreviation for "backslash R".) This can be made +the default when PCRE2 is built; if this is the case, the other behaviour can +be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify +these settings by starting a pattern string with one of the following +sequences: +.sp + (*BSR_ANYCRLF) CR, LF, or CRLF only + (*BSR_UNICODE) any Unicode newline sequence +.sp +These override the default and the options given to the compiling function. +Note that these special settings, which are not Perl-compatible, are recognized +only at the very start of a pattern, and that they must be in upper case. If +more than one of them is present, the last one is used. They can be combined +with a change of newline convention; for example, a pattern can start with: +.sp + (*ANY)(*BSR_ANYCRLF) +.sp +They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a +character class, \eR is treated as an unrecognized escape sequence, and causes +an error. +. +. +.\" HTML +.SS Unicode character properties +.rs +.sp +When PCRE2 is built with Unicode support (the default), three additional escape +sequences that match characters with specific properties are available. In +8-bit non-UTF-8 mode, these sequences are of course limited to testing +characters whose code points are less than 256, but they do work in this mode. +In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit) +may be encountered. These are all treated as being in the Common script and +with an unassigned type. 
The extra escape sequences are: +.sp + \ep{\fIxx\fP} a character with the \fIxx\fP property + \eP{\fIxx\fP} a character without the \fIxx\fP property + \eX a Unicode extended grapheme cluster +.sp +The property names represented by \fIxx\fP above are limited to the Unicode +script names, the general category properties, "Any", which matches any +character (including newline), and some special PCRE2 properties (described +in the +.\" HTML +.\" +next section). +.\" +Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2. +Note that \eP{Any} does not match any characters, so always causes a match +failure. +.P +Sets of Unicode characters are defined as belonging to certain scripts. A +character from one of these sets can be matched using a script name. For +example: +.sp + \ep{Greek} + \eP{Han} +.sp +Those that are not part of an identified script are lumped together as +"Common". The current list of scripts is: +.P +Adlam, +Ahom, +Anatolian_Hieroglyphs, +Arabic, +Armenian, +Avestan, +Balinese, +Bamum, +Bassa_Vah, +Batak, +Bengali, +Bhaiksuki, +Bopomofo, +Brahmi, +Braille, +Buginese, +Buhid, +Canadian_Aboriginal, +Carian, +Caucasian_Albanian, +Chakma, +Cham, +Cherokee, +Common, +Coptic, +Cuneiform, +Cypriot, +Cyrillic, +Deseret, +Devanagari, +Dogra, +Duployan, +Egyptian_Hieroglyphs, +Elbasan, +Ethiopic, +Georgian, +Glagolitic, +Gothic, +Grantha, +Greek, +Gujarati, +Gunjala_Gondi, +Gurmukhi, +Han, +Hangul, +Hanifi_Rohingya, +Hanunoo, +Hatran, +Hebrew, +Hiragana, +Imperial_Aramaic, +Inherited, +Inscriptional_Pahlavi, +Inscriptional_Parthian, +Javanese, +Kaithi, +Kannada, +Katakana, +Kayah_Li, +Kharoshthi, +Khmer, +Khojki, +Khudawadi, +Lao, +Latin, +Lepcha, +Limbu, +Linear_A, +Linear_B, +Lisu, +Lycian, +Lydian, +Mahajani, +Makasar, +Malayalam, +Mandaic, +Manichaean, +Marchen, +Masaram_Gondi, +Medefaidrin, +Meetei_Mayek, +Mende_Kikakui, +Meroitic_Cursive, +Meroitic_Hieroglyphs, +Miao, +Modi, +Mongolian, +Mro, +Multani, +Myanmar, +Nabataean, 
+New_Tai_Lue, +Newa, +Nko, +Nushu, +Ogham, +Ol_Chiki, +Old_Hungarian, +Old_Italic, +Old_North_Arabian, +Old_Permic, +Old_Persian, +Old_Sogdian, +Old_South_Arabian, +Old_Turkic, +Oriya, +Osage, +Osmanya, +Pahawh_Hmong, +Palmyrene, +Pau_Cin_Hau, +Phags_Pa, +Phoenician, +Psalter_Pahlavi, +Rejang, +Runic, +Samaritan, +Saurashtra, +Sharada, +Shavian, +Siddham, +SignWriting, +Sinhala, +Sogdian, +Sora_Sompeng, +Soyombo, +Sundanese, +Syloti_Nagri, +Syriac, +Tagalog, +Tagbanwa, +Tai_Le, +Tai_Tham, +Tai_Viet, +Takri, +Tamil, +Tangut, +Telugu, +Thaana, +Thai, +Tibetan, +Tifinagh, +Tirhuta, +Ugaritic, +Vai, +Warang_Citi, +Yi, +Zanabazar_Square. +.P +Each character has exactly one Unicode general category property, specified by +a two-letter abbreviation. For compatibility with Perl, negation can be +specified by including a circumflex between the opening brace and the property +name. For example, \ep{^Lu} is the same as \eP{Lu}. +.P +If only one letter is specified with \ep or \eP, it includes all the general +category properties that start with that letter. 
In this case, in the absence +of negation, the curly brackets in the escape sequence are optional; these two +examples have the same effect: +.sp + \ep{L} + \epL +.sp +The following general category property codes are supported: +.sp + C Other + Cc Control + Cf Format + Cn Unassigned + Co Private use + Cs Surrogate +.sp + L Letter + Ll Lower case letter + Lm Modifier letter + Lo Other letter + Lt Title case letter + Lu Upper case letter +.sp + M Mark + Mc Spacing mark + Me Enclosing mark + Mn Non-spacing mark +.sp + N Number + Nd Decimal number + Nl Letter number + No Other number +.sp + P Punctuation + Pc Connector punctuation + Pd Dash punctuation + Pe Close punctuation + Pf Final punctuation + Pi Initial punctuation + Po Other punctuation + Ps Open punctuation +.sp + S Symbol + Sc Currency symbol + Sk Modifier symbol + Sm Mathematical symbol + So Other symbol +.sp + Z Separator + Zl Line separator + Zp Paragraph separator + Zs Space separator +.sp +The special property L& is also supported: it matches a character that has +the Lu, Ll, or Lt property, in other words, a letter that is not classified as +a modifier or "other". +.P +The Cs (Surrogate) property applies only to characters in the range U+D800 to +U+DFFF. Such characters are not valid in Unicode strings and so +cannot be tested by PCRE2, unless UTF validity checking has been turned off +(see the discussion of PCRE2_NO_UTF_CHECK in the +.\" HREF +\fBpcre2api\fP +.\" +page). Perl does not support the Cs property. +.P +The long synonyms for property names that Perl supports (such as \ep{Letter}) +are not supported by PCRE2, nor is it permitted to prefix any of these +properties with "Is". +.P +No character that is in the Unicode table has the Cn (unassigned) property. +Instead, this property is assumed for any code point that is not in the +Unicode table. +.P +Specifying caseless matching does not affect these escape sequences. For +example, \ep{Lu} always matches only upper case letters. 
This is different from +the behaviour of current versions of Perl. +.P +Matching characters by Unicode property is not fast, because PCRE2 has to do a +multistage table lookup in order to find a character's property. That is why +the traditional escape sequences such as \ed and \ew do not use Unicode +properties in PCRE2 by default, though you can make them do so by setting the +PCRE2_UCP option or by starting the pattern with (*UCP). +. +. +.SS Extended grapheme clusters +.rs +.sp +The \eX escape matches any number of Unicode characters that form an "extended +grapheme cluster", and treats the sequence as an atomic group +.\" HTML +.\" +(see below). +.\" +Unicode supports various kinds of composite character by giving each character +a grapheme breaking property, and having rules that use these properties to +define the boundaries of extended grapheme clusters. The rules are defined in +Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0 +abandoned the use of some previous properties that had been used for emojis. +Instead it introduced various emoji-specific properties. PCRE2 uses only the +Extended Pictographic property. +.P +\eX always matches at least one character. Then it decides whether to add +additional characters according to the following rules for ending a cluster: +.P +1. End at the end of the subject string. +.P +2. Do not end between CR and LF; otherwise end after any control character. +.P +3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters +are of five types: L, V, T, LV, and LVT. An L character may be followed by an +L, V, LV, or LVT character; an LV or V character may be followed by a V or T +character; an LVT or T character may be followed only by a T character. +.P +4. Do not end before extending characters or spacing marks or the "zero-width +joiner" character. Characters with the "mark" property always have the +"extend" grapheme breaking property. +.P +5. Do not end after prepend characters. 
+.P +6. Do not break within emoji modifier sequences or emoji zwj sequences. That +is, do not break between characters with the Extended_Pictographic property. +Extend and ZWJ characters are allowed between the characters. +.P +7. Do not break within emoji flag sequences. That is, do not break between +regional indicator (RI) characters if there are an odd number of RI characters +before the break point. +.P +8. Otherwise, end the cluster. +. +. +.\" HTML +.SS PCRE2's additional properties +.rs +.sp +As well as the standard Unicode properties described above, PCRE2 supports four +more that make it possible to convert traditional escape sequences such as \ew +and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl +properties internally when PCRE2_UCP is set. However, they may also be used +explicitly. These properties are: +.sp + Xan Any alphanumeric character + Xps Any POSIX space character + Xsp Any Perl space character + Xwd Any Perl "word" character +.sp +Xan matches characters that have either the L (letter) or the N (number) +property. Xps matches the characters tab, linefeed, vertical tab, form feed, or +carriage return, and any other character that has the Z (separator) property. +Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl +compatibility, but Perl changed. Xwd matches the same characters as Xan, plus +underscore. +.P +There is another non-standard property, Xuc, which matches any character that +can be represented by a Universal Character Name in C++ and other programming +languages. These are the characters $, @, ` (grave accent), and all characters +with Unicode code points greater than or equal to U+00A0, except for the +surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are +excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH +where H is a hexadecimal digit. Note that the Xuc property does not match these +sequences but the characters that they represent.) +. +. 
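As an illustration only, the definitions of Xan, Xps, Xwd, and Xuc given above can be modelled outside PCRE2. The following Python sketch is not PCRE2 code: the function names are hypothetical, each function expects a single character, and the general categories come from Python's unicodedata tables, which may correspond to a different Unicode version from the one PCRE2 was built with.

```python
import unicodedata


def is_xan(ch):
    """Xan: the character has the L (letter) or N (number) property."""
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    """Xps: tab, linefeed, vertical tab, form feed, carriage return,
    or any character with the Z (separator) property."""
    return ch in "\t\n\v\f\r" or unicodedata.category(ch)[0] == "Z"


def is_xwd(ch):
    """Xwd: the same characters as Xan, plus underscore."""
    return is_xan(ch) or ch == "_"


def is_xuc(ch):
    """Xuc: $, @, grave accent, and every code point from U+00A0 upwards
    except the surrogates U+D800 to U+DFFF."""
    cp = ord(ch)
    return ch in "$@`" or (cp >= 0xA0 and not 0xD800 <= cp <= 0xDFFF)
```

Under this model most ASCII characters fail the Xuc test, which agrees with the note above that most base characters are excluded.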
+.\" HTML +.SS "Resetting the match start" +.rs +.sp +In normal use, the escape sequence \eK causes any previously matched characters +not to be included in the final matched sequence that is returned. For example, +the pattern: +.sp + foo\eKbar +.sp +matches "foobar", but reports that it has matched "bar". \eK does not interact +with anchoring in any way. The pattern: +.sp + ^foo\eKbar +.sp +matches only when the subject begins with "foobar" (in single line mode), +though it again reports the matched string as "bar". This feature is similar to +a lookbehind assertion +.\" HTML +.\" +(described below). +.\" +However, in this case, the part of the subject before the real match does not +have to be of fixed length, as lookbehind assertions do. The use of \eK does +not interfere with the setting of +.\" HTML +.\" +captured substrings. +.\" +For example, when the pattern +.sp + (foo)\eKbar +.sp +matches "foobar", the first substring is still set to "foo". +.P +Perl documents that the use of \eK within assertions is "not well defined". In +PCRE2, \eK is acted upon when it occurs inside positive assertions, but is +ignored in negative assertions. Note that when a pattern such as (?=ab\eK) +matches, the reported start of the match can be greater than the end of the +match. Using \eK in a lookbehind assertion at the start of a pattern can also +lead to odd effects. For example, consider this pattern: +.sp + (?<=\eKfoo)bar +.sp +If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting +offset of 3 succeeds and reports the matching string as "foobar", that is, the +start of the reported match is earlier than where the match started. +. +. +.\" HTML +.SS "Simple assertions" +.rs +.sp +The final use of backslash is for certain simple assertions. An assertion +specifies a condition that has to be met at a particular point in a match, +without consuming any characters from the subject string. 
The use of +subpatterns for more complicated assertions is described +.\" HTML +.\" +below. +.\" +The backslashed assertions are: +.sp + \eb matches at a word boundary + \eB matches when not at a word boundary + \eA matches at the start of the subject + \eZ matches at the end of the subject + also matches before a newline at the end of the subject + \ez matches only at the end of the subject + \eG matches at the first matching position in the subject +.sp +Inside a character class, \eb has a different meaning; it matches the backspace +character. If any other of these assertions appears in a character class, an +"invalid escape sequence" error is generated. +.P +A word boundary is a position in the subject string where the current character +and the previous character do not both match \ew or \eW (i.e. one matches +\ew and the other matches \eW), or the start or end of the string if the +first or last character matches \ew, respectively. In a UTF mode, the meanings +of \ew and \eW can be changed by setting the PCRE2_UCP option. When this is +done, it also affects \eb and \eB. Neither PCRE2 nor Perl has a separate "start +of word" or "end of word" metasequence. However, whatever follows \eb normally +determines which it is. For example, the fragment \eba matches "a" at the start +of a word. +.P +The \eA, \eZ, and \ez assertions differ from the traditional circumflex and +dollar (described in the next section) in that they only ever match at the very +start and end of the subject string, whatever options are set. Thus, they are +independent of multiline mode. These three assertions are not affected by the +PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the +circumflex and dollar metacharacters. However, if the \fIstartoffset\fP +argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to +start at a point other than the beginning of the subject, \eA can never match. 
+The difference between \eZ and \ez is that \eZ matches before a newline at the +end of the string as well as at the very end, whereas \ez matches only at the +end. +.P +The \eG assertion is true only when the current matching position is at the +start point of the matching process, as specified by the \fIstartoffset\fP +argument of \fBpcre2_match()\fP. It differs from \eA when the value of +\fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times +with appropriate arguments, you can mimic Perl's /g option, and it is in this +kind of implementation where \eG can be useful. +.P +Note, however, that PCRE2's implementation of \eG, being true at the starting +character of the matching process, is subtly different from Perl's, which +defines it as true at the end of the previous match. In Perl, these can be +different when the previously matched string was empty. Because PCRE2 does just +one match at a time, it cannot reproduce this behaviour. +.P +If all the alternatives of a pattern begin with \eG, the expression is anchored +to the starting match position, and the "anchored" flag is set in the compiled +regular expression. +. +. +.SH "CIRCUMFLEX AND DOLLAR" +.rs +.sp +The circumflex and dollar metacharacters are zero-width assertions. That is, +they test for a particular condition being true without consuming any +characters from the subject string. These two metacharacters are concerned with +matching the starts and ends of lines. If the newline convention is set so that +only the two-character sequence CRLF is recognized as a newline, isolated CR +and LF characters are treated as ordinary data characters, and are not +recognized as newlines. +.P +Outside a character class, in the default matching mode, the circumflex +character is an assertion that is true only if the current matching point is at +the start of the subject string. 
If the \fIstartoffset\fP argument of +\fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can +never match if the PCRE2_MULTILINE option is unset. Inside a character class, +circumflex has an entirely different meaning +.\" HTML +.\" +(see below). +.\" +.P +Circumflex need not be the first character of the pattern if a number of +alternatives are involved, but it should be the first thing in each alternative +in which it appears if the pattern is ever to match that branch. If all +possible alternatives start with a circumflex, that is, if the pattern is +constrained to match only at the start of the subject, it is said to be an +"anchored" pattern. (There are also other constructs that can cause a pattern +to be anchored.) +.P +The dollar character is an assertion that is true only if the current matching +point is at the end of the subject string, or immediately before a newline at +the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however, +that it does not actually match the newline. Dollar need not be the last +character of the pattern if a number of alternatives are involved, but it +should be the last item in any branch in which it appears. Dollar has no +special meaning in a character class. +.P +The meaning of dollar can be changed so that it matches only at the very end of +the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This +does not affect the \eZ assertion. +.P +The meanings of the circumflex and dollar metacharacters are changed if the +PCRE2_MULTILINE option is set. When this is the case, a dollar character +matches before any newlines in the string, as well as at the very end, and a +circumflex matches immediately after internal newlines as well as at the start +of the subject string. It does not match after a newline that ends the string, +for compatibility with Perl. However, this can be changed by setting the +PCRE2_ALT_CIRCUMFLEX option. 
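The positions at which circumflex and dollar can be true, as described above, can be modelled with a short Python sketch. It is a simplified model and not PCRE2 code: the function names and keyword arguments are hypothetical stand-ins for PCRE2_MULTILINE, PCRE2_ALT_CIRCUMFLEX, and PCRE2_DOLLAR_ENDONLY, and it assumes that LF is the only newline, that \fIstartoffset\fP is zero, and that PCRE2_NOTBOL and PCRE2_NOTEOL are unset.

```python
def circumflex_positions(subject, multiline=False, alt_circumflex=False):
    """Offsets at which circumflex is true (LF newline assumed)."""
    positions = [0]  # the start of the subject always qualifies
    if multiline:
        for i, ch in enumerate(subject):
            if ch == "\n":
                after = i + 1
                # A newline that ends the string counts only when the
                # alt_circumflex flag (modelling PCRE2_ALT_CIRCUMFLEX) is set.
                if after < len(subject) or alt_circumflex:
                    positions.append(after)
    return positions


def dollar_positions(subject, multiline=False, dollar_endonly=False):
    """Offsets at which dollar is true (LF newline assumed)."""
    end = len(subject)
    if multiline:
        # Before every newline, and at the very end. The dollar_endonly
        # flag is ignored here, as this document states for multiline mode.
        return [i for i, ch in enumerate(subject) if ch == "\n"] + [end]
    positions = []
    if subject.endswith("\n") and not dollar_endonly:
        positions.append(end - 1)  # immediately before the final newline
    positions.append(end)
    return positions
```

For the subject "abc\en", the model gives offsets 3 and 4 for dollar by default, and only offset 4 when dollar_endonly is true.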
+.P +For example, the pattern /^abc$/ matches the subject string "def\enabc" (where +\en represents a newline) in multiline mode, but not otherwise. Consequently, +patterns that are anchored in single line mode because all branches start with +^ are not anchored in multiline mode, and a match for circumflex is possible +when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The +PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set. +.P +When the newline convention (see +.\" HTML +.\" +"Newline conventions" +.\" +below) recognizes the two-character sequence CRLF as a newline, this is +preferred, even if the single characters CR and LF are also recognized as +newlines. For example, if the newline convention is "any", a multiline mode +circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after +CR, even though CR on its own is a valid newline. (It also matches at the very +start of the string, of course.) +.P +Note that the sequences \eA, \eZ, and \ez can be used to match the start and +end of the subject in both modes, and if all branches of a pattern start with +\eA it is always anchored, whether or not PCRE2_MULTILINE is set. +. +. +.\" HTML +.SH "FULL STOP (PERIOD, DOT) AND \eN" +.rs +.sp +Outside a character class, a dot in the pattern matches any one character in +the subject string except (by default) a character that signifies the end of a +line. +.P +When a line ending is defined as a single character, dot never matches that +character; when the two-character sequence CRLF is used, dot does not match CR +if it is immediately followed by LF, but otherwise it matches all characters +(including isolated CRs and LFs). When any Unicode line endings are being +recognized, dot does not match CR or LF or any of the other line ending +characters. +.P +The behaviour of dot with regard to newlines can be changed. If the +PCRE2_DOTALL option is set, a dot matches any one character, without exception. 
+If the two-character sequence CRLF is present in the subject string, it takes +two dots to match it. +.P +The handling of dot is entirely independent of the handling of circumflex and +dollar, the only relationship being that they both involve newlines. Dot has no +special meaning in a character class. +.P +The escape sequence \eN when not followed by an opening brace behaves like a +dot, except that it is not affected by the PCRE2_DOTALL option. In other words, +it matches any character except one that signifies the end of a line. +.P +When \eN is followed by an opening brace it has a different meaning. See the +section entitled +.\" HTML +.\" +"Non-printing characters" +.\" +above for details. Perl also uses \eN{name} to specify characters by Unicode +name; PCRE2 does not support this. +. +. +.SH "MATCHING A SINGLE CODE UNIT" +.rs +.sp +Outside a character class, the escape sequence \eC matches any one code unit, +whether or not a UTF mode is set. In the 8-bit library, one code unit is one +byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a +32-bit unit. Unlike a dot, \eC always matches line-ending characters. The +feature is provided in Perl in order to match individual bytes in UTF-8 mode, +but it is unclear how it can usefully be used. +.P +Because \eC breaks up characters into individual code units, matching one unit +with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start +with a malformed UTF character. This has undefined results, because PCRE2 +assumes that it is matching character by character in a valid UTF string (by +default it checks the subject string's validity at the start of processing +unless the PCRE2_NO_UTF_CHECK option is used). +.P +An application can lock out the use of \eC by setting the +PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to +build PCRE2 with the use of \eC permanently disabled. 
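To see why matching a single code unit with \eC can leave the rest of the subject starting in the middle of a character, it may help to count the code units that one character occupies in each library width. The Python sketch below is an illustration only; the helper name code_units is hypothetical, and it uses Python's own encoders rather than anything from PCRE2.

```python
def code_units(ch, width):
    """Number of code units one character occupies for a given unit width."""
    if width == 8:
        return len(ch.encode("utf-8"))  # one unit per byte
    if width == 16:
        return len(ch.encode("utf-16-le")) // 2  # two bytes per unit
    if width == 32:
        return len(ch.encode("utf-32-le")) // 4  # four bytes per unit
    raise ValueError("width must be 8, 16, or 32")


# U+1F600 needs several code units in UTF-8 and UTF-16, so consuming just
# one of them would split the character; in UTF-32 it is a single unit.
for width in (8, 16, 32):
    print(width, code_units("\U0001F600", width))
```

A character that occupies more than one unit is exactly the case in which a lone \eC stops part-way through it.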
+.P +PCRE2 does not allow \eC to appear in lookbehind assertions +.\" HTML +.\" +(described below) +.\" +in UTF-8 or UTF-16 modes, because this would make it impossible to calculate +the length of the lookbehind. Neither the alternative matching function +\fBpcre2_dfa_match()\fP nor the JIT optimizer support \eC in these UTF modes. +The former gives a match-time error; the latter fails to optimize and so the +match is always run using the interpreter. +.P +In the 32-bit library, however, \eC is always supported (when not explicitly +locked out) because it always matches a single code unit, whether or not UTF-32 +is specified. +.P +In general, the \eC escape sequence is best avoided. However, one way of using +it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a +lookahead to check the length of the next character, as in this pattern, which +could be used with a UTF-8 string (ignore white space and line breaks): +.sp + (?| (?=[\ex00-\ex7f])(\eC) | + (?=[\ex80-\ex{7ff}])(\eC)(\eC) | + (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) | + (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC)) +.sp +In this example, a group that starts with (?| resets the capturing parentheses +numbers in each alternative (see +.\" HTML +.\" +"Duplicate Subpattern Numbers" +.\" +below). The assertions at the start of each branch check the next UTF-8 +character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The +character's individual bytes are then captured by the appropriate number of +\eC groups. +. +. +.\" HTML +.SH "SQUARE BRACKETS AND CHARACTER CLASSES" +.rs +.sp +An opening square bracket introduces a character class, terminated by a closing +square bracket. A closing square bracket on its own is not special by default. +If a closing square bracket is required as a member of the class, it should be +the first data character in the class (after an initial circumflex, if present) +or escaped with a backslash. 
This means that, by default, an empty class cannot +be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing +square bracket at the start does end the (empty) class. +.P +A character class matches a single character in the subject. A matched +character must be in the set of characters defined by the class, unless the +first character in the class definition is a circumflex, in which case the +subject character must not be in the set defined by the class. If a circumflex +is actually required as a member of the class, ensure it is not the first +character, or escape it with a backslash. +.P +For example, the character class [aeiou] matches any lower case vowel, while +[^aeiou] matches any character that is not a lower case vowel. Note that a +circumflex is just a convenient notation for specifying the characters that +are in the class by enumerating those that are not. A class that starts with a +circumflex is not an assertion; it still consumes a character from the subject +string, and therefore it fails if the current pointer is at the end of the +string. +.P +Characters in a class may be specified by their code points using \eo, \ex, or +\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a +class represent both their upper case and lower case versions, so for example, +a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not +match "A", whereas a caseful version would. +.P +Characters that might indicate line breaks are never treated in any special way +when matching character classes, whatever line-ending sequence is in use, and +whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A +class such as [^a] always matches one of these characters. +.P +The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es, +\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the +characters that they match to the class. 
For example, [\edABCDEF] matches any +hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of +\ed, \es, \ew and their upper case partners, just as it does when they appear +outside a character class, as described in the section entitled +.\" HTML +.\" +"Generic character types" +.\" +above. The escape sequence \eb has a different meaning inside a character +class; it matches the backspace character. The sequences \eB, \eR, and \eX are +not special inside a character class. Like any other unrecognized escape +sequences, they cause an error. The same is true for \eN when not followed by +an opening brace. +.P +The minus (hyphen) character can be used to specify a range of characters in a +character class. For example, [d-m] matches any letter between d and m, +inclusive. If a minus character is required in a class, it must be escaped with +a backslash or appear in a position where it cannot be interpreted as +indicating a range, typically as the first or last character in the class, +or immediately after a range. For example, [b-d-z] matches letters in the range +b to d, a hyphen character, or z. +.P +Perl treats a hyphen as a literal if it appears before or after a POSIX class +(see below) or before or after a character type escape such as \ed or \eH. +However, unless the hyphen is the last character in the class, Perl outputs a +warning in its warning mode, as this is most likely a user error. As PCRE2 has +no facility for warning, an error is given in these cases. +.P +It is not possible to have the literal character "]" as the end character of a +range. A pattern such as [W-]46] is interpreted as a class of two characters +("W" and "-") followed by a literal string "46]", so it would match "W46]" or +"-46]". However, if the "]" is escaped with a backslash it is interpreted as +the end of range, so [W-\e]46] is interpreted as a class containing a range +followed by two other characters. 
The octal or hexadecimal representation of +"]" can also be used to end a range. +.P +Ranges normally include all code points between the start and end characters, +inclusive. They can also be used for code points specified numerically, for +example [\e000-\e037]. Ranges can include any characters that are valid for the +current mode. In any UTF mode, the so-called "surrogate" characters (those +whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified +explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables +this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the +surrogates, are always permitted. +.P +There is a special case in EBCDIC environments for ranges whose end points are +both specified as literal letters in the same case. For compatibility with +Perl, EBCDIC code points within the range that are not letters are omitted. For +example, [h-k] matches only four characters, even though the codes for h and k +are 0x88 and 0x92, a range of 11 code points. However, if the range is +specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points +are included. +.P +If a range that includes letters is used when caseless matching is set, it +matches the letters in either case. For example, [W-c] is equivalent to +[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character +tables for a French locale are in use, [\exc8-\excb] matches accented E +characters in both cases. +.P +A circumflex can conveniently be used with the upper case character types to +specify a more restricted set of characters than the matching lower case type. +For example, the class [^\eW_] matches any letter or digit, but not underscore, +whereas [\ew] includes underscore. A positive character class should be read as +"something OR something OR ..." and a negative class as "NOT something AND NOT +something AND NOT ...". 
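A few of the class rules above can be tried out quickly. The sketch below uses Python's standard re module, which is a separate engine and not PCRE2; it is used here only on the assumption that these particular rules (negated classes, a hyphen after a range, caseless ranges, and [^\eW_]) behave the same way in both. The helper name in_class is hypothetical.

```python
import re


def in_class(char_class, ch, caseless=False):
    """True if the single character ch is matched by the class pattern."""
    flags = re.IGNORECASE if caseless else 0
    return re.fullmatch(char_class, ch, flags) is not None


# A hyphen immediately after a range is a literal hyphen.
print([c for c in "abcd-yz" if in_class(r"[b-d-z]", c)])

# A caseless negated class excludes both cases of each listed letter.
print(in_class(r"[^aeiou]", "A"), in_class(r"[^aeiou]", "A", caseless=True))

# [^\W_] accepts letters and digits but not underscore.
print([c for c in "a7_" if in_class(r"[^\W_]", c)])
```

The last call reads as "NOT a non-word character AND NOT underscore", following the description of negative classes above.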
+.P +The only metacharacters that are recognized in character classes are backslash, +hyphen (only where it can be interpreted as specifying a range), circumflex +(only at the start), opening square bracket (only when it can be interpreted as +introducing a POSIX class name, or for a special compatibility feature - see +the next two sections), and the terminating closing square bracket. However, +escaping other non-alphanumeric characters does no harm. +. +. +.SH "POSIX CHARACTER CLASSES" +.rs +.sp +Perl supports the POSIX notation for character classes. This uses names +enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports +this notation. For example, +.sp + [01[:alpha:]%] +.sp +matches "0", "1", any alphabetic character, or "%". The supported class names +are: +.sp + alnum letters and digits + alpha letters + ascii character codes 0 - 127 + blank space or tab only + cntrl control characters + digit decimal digits (same as \ed) + graph printing characters, excluding space + lower lower case letters + print printing characters, including space + punct printing characters, excluding letters and digits and space + space white space (the same as \es from PCRE2 8.34) + upper upper case letters + word "word" characters (same as \ew) + xdigit hexadecimal digits +.sp +The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13), +and space (32). If locale-specific matching is taking place, the list of space +characters may be different; there may be fewer or more of them. "Space" and +\es match the same set of characters. +.P +The name "word" is a Perl extension, and "blank" is a GNU extension from Perl +5.8. Another Perl extension is negation, which is indicated by a ^ character +after the colon. For example, +.sp + [12[:^digit:]] +.sp +matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX +syntax [.ch.] 
and [=ch=] where "ch" is a "collating element", but these are not +supported, and an error is given if they are encountered. +.P +By default, characters with values greater than 127 do not match any of the +POSIX character classes, although this may be different for characters in the +range 128-255 when locale-specific matching is happening. However, if the +PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are +changed so that Unicode character properties are used. This is achieved by +replacing certain POSIX classes with other sequences, as follows: +.sp + [:alnum:] becomes \ep{Xan} + [:alpha:] becomes \ep{L} + [:blank:] becomes \eh + [:cntrl:] becomes \ep{Cc} + [:digit:] becomes \ep{Nd} + [:lower:] becomes \ep{Ll} + [:space:] becomes \ep{Xps} + [:upper:] becomes \ep{Lu} + [:word:] becomes \ep{Xwd} +.sp +Negated versions, such as [:^alpha:] use \eP instead of \ep. Three other POSIX +classes are handled specially in UCP mode: +.TP 10 +[:graph:] +This matches characters that have glyphs that mark the page when printed. In +Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf +properties, except for: +.sp + U+061C Arabic Letter Mark + U+180E Mongolian Vowel Separator + U+2066 - U+2069 Various "isolate"s +.sp +.TP 10 +[:print:] +This matches the same characters as [:graph:] plus space characters that are +not controls, that is, characters with the Zs property. +.TP 10 +[:punct:] +This matches all characters that have the Unicode P (punctuation) property, +plus those characters with code points less than 256 that have the S (Symbol) +property. +.P +The other POSIX classes are unchanged, and match only characters with code +points less than 256. +. +. +.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES" +.rs +.sp +In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly +syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of +word". 
PCRE2 treats these items as follows: +.sp + [[:<:]] is converted to \eb(?=\ew) + [[:>:]] is converted to \eb(?<=\ew) +.sp +Only these exact character sequences are recognized. A sequence such as +[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is +not compatible with Perl. It is provided to help migrations from other +environments, and is best not used in any new patterns. Note that \eb matches +at the start and the end of a word (see +.\" HTML +.\" +"Simple assertions" +.\" +above), and in a Perl-style pattern the preceding or following character +normally shows which is wanted, without the need for the assertions that are +used above in order to give exactly the POSIX behaviour. +. +. +.SH "VERTICAL BAR" +.rs +.sp +Vertical bar characters are used to separate alternative patterns. For example, +the pattern +.sp + gilbert|sullivan +.sp +matches either "gilbert" or "sullivan". Any number of alternatives may appear, +and an empty alternative is permitted (matching the empty string). The matching +process tries each alternative in turn, from left to right, and the first one +that succeeds is used. If the alternatives are within a subpattern +.\" HTML +.\" +(defined below), +.\" +"succeeds" means matching the rest of the main pattern as well as the +alternative in the subpattern. +. +. +.SH "INTERNAL OPTION SETTING" +.rs +.sp +The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL, +PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be +changed from within the pattern by a sequence of letters enclosed between "(?" +and ")". These options are Perl-compatible, and are described in detail in the +.\" HREF +\fBpcre2api\fP +.\" +documentation. The option letters are: +.sp + i for PCRE2_CASELESS + m for PCRE2_MULTILINE + n for PCRE2_NO_AUTO_CAPTURE + s for PCRE2_DOTALL + x for PCRE2_EXTENDED + xx for PCRE2_EXTENDED_MORE +.sp +For example, (?im) sets caseless, multiline matching. 
It is also possible to +unset these options by preceding the relevant letters with a hyphen, for +example (?-im). The two "extended" options are not independent; unsetting either +one cancels the effects of both of them. +.P +A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS +and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also +permitted. Only one hyphen may appear in the options string. If a letter +appears both before and after the hyphen, the option is unset. An empty options +setting "(?)" is allowed. Needless to say, it has no effect. +.P +If the first character following (? is a circumflex, it causes all of the above +options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow +the circumflex to cause some options to be re-instated, but a hyphen may not +appear. +.P +The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in +the same way as the Perl-compatible options by using the characters J and U +respectively. However, these are not unset by (?^). +.P +When one of these option changes occurs at top level (that is, not inside +subpattern parentheses), the change applies to the remainder of the pattern +that follows. An option change within a subpattern (see below for a description +of subpatterns) affects only that part of the subpattern that follows it, so +.sp + (a(?i)b)c +.sp +matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used). +By this means, options can be made to have different settings in different +parts of the pattern. Any changes made in one alternative do carry on +into subsequent branches within the same subpattern. For example, +.sp + (a(?i)b|c) +.sp +matches "ab", "aB", "c", and "C", even though when matching "C" the first +branch is abandoned before the option setting. This is because the effects of +option settings happen at compile time. There would be some very weird +behaviour otherwise. 
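The scoping of an option change inside a subpattern can be tried out with a small sketch. It uses Python's re module purely as a stand-in, which is an assumption of this example: Python is not PCRE2, and recent Python versions reject an unscoped (?i) in the middle of a pattern, so the sketch writes the scoped form (?i:b) to make only the "b" caseless. The helper name matches_fully is hypothetical.

```python
import re

# Python stand-in for the PCRE2 pattern (a(?i)b)c: only "b" is caseless.
# (?i:b) is used because Python does not accept a bare mid-pattern (?i).
PATTERN = re.compile(r"(a(?i:b))c")


def matches_fully(subject):
    """Return True when the whole subject matches PATTERN."""
    return PATTERN.fullmatch(subject) is not None


# "abc" and "aBc" are accepted; changing the case of "a" or "c" is not.
for subject in ("abc", "aBc", "Abc", "abC"):
    print(subject, matches_fully(subject))
```

With PCRE2 itself the pattern would be compiled with pcre2_compile() and the unscoped (?i) form could be used directly.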
+.P +As a convenient shorthand, if any option settings are required at the start of +a non-capturing subpattern (see the next section), the option letters may +appear between the "?" and the ":". Thus the two patterns +.sp + (?i:saturday|sunday) + (?:(?i)saturday|sunday) +.sp +match exactly the same set of strings. +.P +\fBNote:\fP There are other PCRE2-specific options that can be set by the +application when the compiling function is called. The pattern can contain +special leading sequences such as (*CRLF) to override what the application has +set or what has been defaulted. Details are given in the section entitled +.\" HTML +.\" +"Newline sequences" +.\" +above. There are also the (*UTF) and (*UCP) leading sequences that can be used +to set UTF and Unicode property modes; they are equivalent to setting the +PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set +the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the +(*UTF) and (*UCP) sequences. +. +. +.\" HTML +.SH SUBPATTERNS +.rs +.sp +Subpatterns are delimited by parentheses (round brackets), which can be nested. +Turning part of a pattern into a subpattern does two things: +.sp +1. It localizes a set of alternatives. For example, the pattern +.sp + cat(aract|erpillar|) +.sp +matches "cataract", "caterpillar", or "cat". Without the parentheses, it would +match "cataract", "erpillar" or an empty string. +.sp +2. It sets up the subpattern as a capturing subpattern. This means that, when +the whole pattern matches, the portion of the subject string that matched the +subpattern is passed back to the caller, separately from the portion that +matched the whole pattern. (This applies only to the traditional matching +function; the DFA matching function does not support capturing.) +.P +Opening parentheses are counted from left to right (starting from 1) to obtain +numbers for the capturing subpatterns. 
For example, if the string "the red +king" is matched against the pattern +.sp + the ((red|white) (king|queen)) +.sp +the captured substrings are "red king", "red", and "king", and are numbered 1, +2, and 3, respectively. +.P +The fact that plain parentheses fulfil two functions is not always helpful. +There are often times when a grouping subpattern is required without a +capturing requirement. If an opening parenthesis is followed by a question mark +and a colon, the subpattern does not do any capturing, and is not counted when +computing the number of any subsequent capturing subpatterns. For example, if +the string "the white queen" is matched against the pattern +.sp + the ((?:red|white) (king|queen)) +.sp +the captured substrings are "white queen" and "queen", and are numbered 1 and +2. The maximum number of capturing subpatterns is 65535. +.P +As a convenient shorthand, if any option settings are required at the start of +a non-capturing subpattern, the option letters may appear between the "?" and +the ":". Thus the two patterns +.sp + (?i:saturday|sunday) + (?:(?i)saturday|sunday) +.sp +match exactly the same set of strings. Because alternative branches are tried +from left to right, and options are not reset until the end of the subpattern +is reached, an option setting in one branch does affect subsequent branches, so +the above patterns match "SUNDAY" as well as "Saturday". +. +. +.\" HTML +.SH "DUPLICATE SUBPATTERN NUMBERS" +.rs +.sp +Perl 5.10 introduced a feature whereby each alternative in a subpattern uses +the same numbers for its capturing parentheses. Such a subpattern starts with +(?| and is itself a non-capturing subpattern. For example, consider this +pattern: +.sp + (?|(Sat)ur|(Sun))day +.sp +Because the two alternatives are inside a (?| group, both sets of capturing +parentheses are numbered one. Thus, when the pattern matches, you can look +at captured substring number one, whichever alternative matched. 
This construct +is useful when you want to capture part, but not all, of one of a number of +alternatives. Inside a (?| group, parentheses are numbered as usual, but the +number is reset at the start of each branch. The numbers of any capturing +parentheses that follow the subpattern start after the highest number used in +any branch. The following example is taken from the Perl documentation. The +numbers underneath show in which buffer the captured content will be stored. +.sp + # before ---------------branch-reset----------- after + / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x + # 1 2 2 3 2 3 4 +.sp +A backreference to a numbered subpattern uses the most recent value that is +set for that number by any subpattern. The following pattern matches "abcabc" +or "defdef": +.sp + /(?|(abc)|(def))\e1/ +.sp +In contrast, a subroutine call to a numbered subpattern always refers to the +first one in the pattern with the given number. The following pattern matches +"abcabc" or "defabc": +.sp + /(?|(abc)|(def))(?1)/ +.sp +A relative reference such as (?-1) is no different: it is just a convenient way +of computing an absolute group number. +.P +If a +.\" HTML +.\" +condition test +.\" +for a subpattern's having matched refers to a non-unique number, the test is +true if any of the subpatterns of that number have matched. +.P +An alternative approach to using this "branch reset" feature is to use +duplicate named subpatterns, as described in the next section. +. +. +.SH "NAMED SUBPATTERNS" +.rs +.sp +Identifying capturing parentheses by number is simple, but it can be very hard +to keep track of the numbers in complicated patterns. Furthermore, if an +expression is modified, the numbers may change. To help with this difficulty, +PCRE2 supports the naming of capturing subpatterns. This feature was not added +to Perl until release 5.10. Python had the feature earlier, and PCRE1 +introduced it at release 4.0, using the Python syntax. 
PCRE2 supports both the +Perl and the Python syntax. +.P +In PCRE2, a capturing subpattern can be named in one of three ways: +(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names +consist of up to 32 alphanumeric characters and underscores, but must start +with a non-digit. References to capturing parentheses from other parts of the +pattern, such as +.\" HTML +.\" +backreferences, +.\" +.\" HTML +.\" +recursion, +.\" +and +.\" HTML +.\" +conditions, +.\" +can all be made by name as well as by number. +.P +Named capturing parentheses are allocated numbers as well as names, exactly as +if the names were not present. In both PCRE2 and Perl, capturing subpatterns +are primarily identified by numbers; any names are just aliases for these +numbers. The PCRE2 API provides function calls for extracting the complete +name-to-number translation table from a compiled pattern, as well as +convenience functions for extracting captured substrings by name. +.P +\fBWarning:\fP When more than one subpattern has the same number, as described +in the previous section, a name given to one of them applies to all of them. +Perl allows identically numbered subpatterns to have different names. Consider +this pattern, where there are two capturing subpatterns, both numbered 1: +.sp + (?|(?<AA>aa)|(?<BB>bb)) +.sp +Perl allows this, with both names AA and BB as aliases of group 1. Thus, after +a successful match, both names yield the same value (either "aa" or "bb"). +.P +In an attempt to reduce confusion, PCRE2 does not allow the same group number +to be associated with more than one name. The example above provokes a +compile-time error. However, there is still scope for confusion. Consider this +pattern: +.sp + (?|(?<AA>aa)|(bb)) +.sp +Although the second subpattern number 1 is not explicitly named, the name AA is +still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a +reference by name to group AA yields the matched string. 
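The idea that names are only aliases for group numbers can be illustrated with a short sketch. It assumes Python's re module as a stand-in for PCRE2 (the Python-style named group syntax is accepted by both), and the group names colour and piece are made up for the example. Python has no branch reset groups, so the duplicate-number cases above are not covered here.

```python
import re

# Named groups still receive numbers, counted by opening parenthesis.
pattern = re.compile(r"the (?P<colour>red|white) (?P<piece>king|queen)")
match = pattern.fullmatch("the white queen")

# groupindex is Python's name-to-number table for a compiled pattern.
name_to_number = dict(pattern.groupindex)
print(name_to_number)

# A lookup by name and a lookup by number return the same substring.
print(match.group("colour") == match.group(1))
print(match.group("piece") == match.group(2))
```

In PCRE2 the equivalent table and the by-name extraction functions are part of the C API described in the pcre2api documentation.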
+.P +By default, a name must be unique within a pattern, except that duplicate names +are permitted for subpatterns with the same number, for example: +.sp + (?|(?<AA>aa)|(?<AA>bb)) +.sp +The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES +option at compile time, or by the use of (?J) within the pattern. Duplicate +names can be useful for patterns where only one instance of the named +parentheses can match. Suppose you want to match the name of a weekday, either +as a 3-letter abbreviation or as the full name, and in both cases you want to +extract the abbreviation. This pattern (ignoring the line breaks) does the job: +.sp + (?<DN>Mon|Fri|Sun)(?:day)?| + (?<DN>Tue)(?:sday)?| + (?<DN>Wed)(?:nesday)?| + (?<DN>Thu)(?:rsday)?| + (?<DN>Sat)(?:urday)? +.sp +There are five capturing substrings, but only one is ever set after a match. +The convenience functions for extracting the data by name return the substring +for the first (and in this example, the only) subpattern of that name that +matched. This saves searching to find which numbered subpattern it was. (An +alternative way of solving this problem is to use a "branch reset" subpattern, +as described in the previous section.) +.P +If you make a backreference to a non-unique named subpattern from elsewhere in +the pattern, the subpatterns to which the name refers are checked in the order +in which they appear in the overall pattern. The first one that is set is used +for the reference. For example, this pattern matches both "foofoo" and +"barbar" but not "foobar" or "barfoo": +.sp + (?:(?<n>foo)|(?<n>bar))\ek<n> +.sp +.P +If you make a subroutine call to a non-unique named subpattern, the one that +corresponds to the first occurrence of the name is used. In the absence of +duplicate numbers this is the one with the lowest number. 
+.P +If you use a named reference in a condition +test (see the +.\" +.\" HTML +.\" +section about conditions +.\" +below), either to check whether a subpattern has matched, or to check for +recursion, all subpatterns with the same name are tested. If the condition is +true for any one of them, the overall condition is true. This is the same +behaviour as testing by number. For further details of the interfaces for +handling named subpatterns, see the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +. +. +.SH REPETITION +.rs +.sp +Repetition is specified by quantifiers, which can follow any of the following +items: +.sp + a literal data character + the dot metacharacter + the \eC escape sequence + the \eX escape sequence + the \eR escape sequence + an escape such as \ed or \epL that matches a single character + a character class + a backreference + a parenthesized subpattern (including most assertions) + a subroutine call to a subpattern (recursive or otherwise) +.sp +The general repetition quantifier specifies a minimum and maximum number of +permitted matches, by giving the two numbers in curly brackets (braces), +separated by a comma. The numbers must be less than 65536, and the first must +be less than or equal to the second. For example: +.sp + z{2,4} +.sp +matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special +character. If the second number is omitted, but the comma is present, there is +no upper limit; if the second number and the comma are both omitted, the +quantifier specifies an exact number of required matches. Thus +.sp + [aeiou]{3,} +.sp +matches at least 3 successive vowels, but may match many more, whereas +.sp + \ed{8} +.sp +matches exactly 8 digits. An opening curly bracket that appears in a position +where a quantifier is not allowed, or one that does not match the syntax of a +quantifier, is taken as a literal character. For example, {,6} is not a +quantifier, but a literal string of four characters. 
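The three quantifier examples above can be checked with a brief sketch. It assumes Python's re module as a stand-in for PCRE2; the helper name full is hypothetical. The {,6} case is deliberately left out, because Python's re treats that form as a quantifier, unlike the behaviour described here.

```python
import re


def full(pattern, subject):
    """Return True when the pattern matches the entire subject."""
    return re.fullmatch(pattern, subject) is not None


# z{2,4}: between two and four repetitions of "z".
bounded = [s for s in ("z", "zz", "zzz", "zzzz", "zzzzz") if full(r"z{2,4}", s)]
print(bounded)

# [aeiou]{3,}: at least three vowels, with no upper limit.
open_ended = full(r"[aeiou]{3,}", "aeiouaeiou")
too_short = full(r"[aeiou]{3,}", "ae")
print(open_ended, too_short)

# \d{8}: exactly eight digits.
exact = full(r"\d{8}", "20180904")
seven = full(r"\d{8}", "2018090")
print(exact, seven)
```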
+.P +In UTF modes, quantifiers apply to characters rather than to individual code +units. Thus, for example, \ex{100}{2} matches two characters, each of +which is represented by a two-byte sequence in a UTF-8 string. Similarly, +\eX{3} matches three Unicode extended grapheme clusters, each of which may be +several code units long (and they may be of different lengths). +.P +The quantifier {0} is permitted, causing the expression to behave as if the +previous item and the quantifier were not present. This may be useful for +subpatterns that are referenced as +.\" HTML +.\" +subroutines +.\" +from elsewhere in the pattern (but see also the section entitled +.\" HTML +.\" +"Defining subpatterns for use by reference only" +.\" +below). Items other than subpatterns that have a {0} quantifier are omitted +from the compiled pattern. +.P +For convenience, the three most common quantifiers have single-character +abbreviations: +.sp + * is equivalent to {0,} + + is equivalent to {1,} + ? is equivalent to {0,1} +.sp +It is possible to construct infinite loops by following a subpattern that can +match no characters with a quantifier that has no upper limit, for example: +.sp + (a?)* +.sp +Earlier versions of Perl and PCRE1 used to give an error at compile time for +such patterns. However, because there are cases where this can be useful, such +patterns are now accepted, but if any repetition of the subpattern does in fact +match no characters, the loop is forcibly broken. +.P +By default, the quantifiers are "greedy", that is, they match as much as +possible (up to the maximum number of permitted times), without causing the +rest of the pattern to fail. The classic example of where this gives problems +is in trying to match comments in C programs. These appear between /* and */ +and within the comment, individual * and / characters may appear. 
An attempt to +match C comments by applying the pattern +.sp + /\e*.*\e*/ +.sp +to the string +.sp + /* first comment */ not comment /* second comment */ +.sp +fails, because it matches the entire string owing to the greediness of the .* +item. +.P +If a quantifier is followed by a question mark, it ceases to be greedy, and +instead matches the minimum number of times possible, so the pattern +.sp + /\e*.*?\e*/ +.sp +does the right thing with the C comments. The meaning of the various +quantifiers is not otherwise changed, just the preferred number of matches. +Do not confuse this use of question mark with its use as a quantifier in its +own right. Because it has two uses, it can sometimes appear doubled, as in +.sp + \ed??\ed +.sp +which matches one digit by preference, but can match two if that is the only +way the rest of the pattern matches. +.P +If the PCRE2_UNGREEDY option is set (an option that is not available in Perl), +the quantifiers are not greedy by default, but individual ones can be made +greedy by following them with a question mark. In other words, it inverts the +default behaviour. +.P +When a parenthesized subpattern is quantified with a minimum repeat count that +is greater than 1 or with a limited maximum, more memory is required for the +compiled pattern, in proportion to the size of the minimum or maximum. +.P +If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent +to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is +implicitly anchored, because whatever follows will be tried against every +character position in the subject string, so there is no point in retrying the +overall match at any position after the first. PCRE2 normally treats such a +pattern as though it were preceded by \eA. +.P +In cases where it is known that the subject string contains no newlines, it is +worth setting PCRE2_DOTALL in order to obtain this optimization, or +alternatively, using ^ to indicate anchoring explicitly. 
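Returning to the greedy and minimizing quantifiers described earlier in this section, the C comment example can be reproduced with a short sketch. It assumes Python's re module as a stand-in for PCRE2; for these particular patterns the two engines are expected to agree, but that is an assumption of the example rather than a statement about PCRE2 internals.

```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole subject is one match.
greedy = re.findall(r"/\*.*\*/", subject)

# Minimizing .*? stops at the first "*/", giving one match per comment.
lazy = re.findall(r"/\*.*?\*/", subject)

print(greedy)
print(lazy)

# \d??\d prefers one digit, but takes two when the rest requires it.
preferred = re.match(r"\d??\d", "12").group()
forced = re.fullmatch(r"\d??\d", "12").group()
print(preferred, forced)
```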
+.P +However, there are some cases where the optimization cannot be used. When .* +is inside capturing parentheses that are the subject of a backreference +elsewhere in the pattern, a match at the start may fail where a later one +succeeds. Consider, for example: +.sp + (.*)abc\e1 +.sp +If the subject is "xyz123abc123" the match point is the fourth character. For +this reason, such a pattern is not implicitly anchored. +.P +Another case where implicit anchoring is not applied is when the leading .* is +inside an atomic group. Once again, a match at the start may fail where a later +one succeeds. Consider this pattern: +.sp + (?>.*?a)b +.sp +It matches "ab" in the subject "aab". The use of the backtracking control verbs +(*PRUNE) and (*SKIP) also disables this optimization, and there is an option, +PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly. +.P +When a capturing subpattern is repeated, the value captured is the substring +that matched the final iteration. For example, after +.sp + (tweedle[dume]{3}\es*)+ +.sp +has matched "tweedledum tweedledee" the value of the captured substring is +"tweedledee". However, if there are nested capturing subpatterns, the +corresponding captured values may have been set in previous iterations. For +example, after +.sp + (a|(b))+ +.sp +matches "aba" the value of the second captured substring is "b". +. +. +.\" HTML +.SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS" +.rs +.sp +With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy") +repetition, failure of what follows normally causes the repeated item to be +re-evaluated to see if a different number of repeats allows the rest of the +pattern to match. Sometimes it is useful to prevent this, either to change the +nature of the match, or to cause it to fail earlier than it otherwise might, when +the author of the pattern knows there is no point in carrying on. 
+.P +Consider, for example, the pattern \ed+foo when applied to the subject line +.sp + 123456bar +.sp +After matching all 6 digits and then failing to match "foo", the normal +action of the matcher is to try again with only 5 digits matching the \ed+ +item, and then with 4, and so on, before ultimately failing. "Atomic grouping" +(a term taken from Jeffrey Friedl's book) provides the means for specifying +that once a subpattern has matched, it is not to be re-evaluated in this way. +.P +If we use atomic grouping for the previous example, the matcher gives up +immediately on failing to match "foo" the first time. The notation is a kind of +special parenthesis, starting with (?> as in this example: +.sp + (?>\ed+)foo +.sp +This kind of parenthesis "locks up" the part of the pattern it contains once +it has matched, and a failure further into the pattern is prevented from +backtracking into it. Backtracking past it to previous items, however, works as +normal. +.P +An alternative description is that a subpattern of this type matches exactly +the string of characters that an identical standalone pattern would match, if +anchored at the current point in the subject string. +.P +Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as +the above example can be thought of as a maximizing repeat that must swallow +everything it can. So, while both \ed+ and \ed+? are prepared to adjust the +number of digits they match in order to make the rest of the pattern match, +(?>\ed+) can only match an entire sequence of digits. +.P +Atomic groups in general can of course contain arbitrarily complicated +subpatterns, and can be nested. However, when the subpattern for an atomic +group is just a single repeated item, as in the example above, a simpler +notation, called a "possessive quantifier" can be used. This consists of an +additional + character following a quantifier. 
Using this notation, the +previous example can be rewritten as +.sp + \ed++foo +.sp +Note that a possessive quantifier can be used with an entire group, for +example: +.sp + (abc|xyz){2,3}+ +.sp +Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY +option is ignored. They are a convenient notation for the simpler forms of +atomic group. However, there is no difference in the meaning of a possessive +quantifier and the equivalent atomic group, though there may be a performance +difference; possessive quantifiers should be slightly faster. +.P +The possessive quantifier syntax is an extension to the Perl 5.8 syntax. +Jeffrey Friedl originated the idea (and the name) in the first edition of his +book. Mike McCloskey liked it, so implemented it when he built Sun's Java +package, and PCRE1 copied it from there. It ultimately found its way into Perl +at release 5.10. +.P +PCRE2 has an optimization that automatically "possessifies" certain simple +pattern constructs. For example, the sequence A+B is treated as A++B because +there is no point in backtracking into a sequence of A's when B must follow. +This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or starting +the pattern with (*NO_AUTO_POSSESS). +.P +When a pattern contains an unlimited repeat inside a subpattern that can itself +be repeated an unlimited number of times, the use of an atomic group is the +only way to avoid some failing matches taking a very long time indeed. The +pattern +.sp + (\eD+|<\ed+>)*[!?] +.sp +matches an unlimited number of substrings that either consist of non-digits, or +digits enclosed in <>, followed by either ! or ?. When it matches, it runs +quickly. However, if it is applied to +.sp + aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa +.sp +it takes a long time before reporting failure. This is because the string can +be divided between the internal \eD+ repeat and the external * repeat in a +large number of ways, and all have to be tried. 
(The example uses [!?] rather +than a single character at the end, because both PCRE2 and Perl have an +optimization that allows for fast failure when a single character is used. They +remember the last single character that is required for a match, and fail early +if it is not present in the string.) If the pattern is changed so that it uses +an atomic group, like this: +.sp + ((?>\eD+)|<\ed+>)*[!?] +.sp +sequences of non-digits cannot be broken, and failure happens quickly. +. +. +.\" HTML +.SH "BACKREFERENCES" +.rs +.sp +Outside a character class, a backslash followed by a digit greater than 0 (and +possibly further digits) is a backreference to a capturing subpattern earlier +(that is, to its left) in the pattern, provided there have been that many +previous capturing left parentheses. +.P +However, if the decimal number following the backslash is less than 8, it is +always taken as a backreference, and causes an error only if there are not +that many capturing left parentheses in the entire pattern. In other words, the +parentheses that are referenced need not be to the left of the reference for +numbers less than 8. A "forward backreference" of this type can make sense +when a repetition is involved and the subpattern to the right has participated +in an earlier iteration. +.P +It is not possible to have a numerical "forward backreference" to a subpattern +whose number is 8 or more using this syntax because a sequence such as \e50 is +interpreted as a character defined in octal. See the subsection entitled +"Non-printing characters" +.\" HTML +.\" +above +.\" +for further details of the handling of digits following a backslash. There is +no such problem when named parentheses are used. A backreference to any +subpattern is possible using named parentheses (see below). +.P +Another way of avoiding the ambiguity inherent in the use of digits following a +backslash is to use the \eg escape sequence. 
This escape must be followed by a +signed or unsigned number, optionally enclosed in braces. These examples are +all identical: +.sp + (ring), \e1 + (ring), \eg1 + (ring), \eg{1} +.sp +An unsigned number specifies an absolute reference without the ambiguity that +is present in the older syntax. It is also useful when literal digits follow +the reference. A signed number is a relative reference. Consider this example: +.sp + (abc(def)ghi)\eg{-1} +.sp +The sequence \eg{-1} is a reference to the most recently started capturing +subpattern before \eg, that is, it is equivalent to \e2 in this example. +Similarly, \eg{-2} would be equivalent to \e1. The use of relative references +can be helpful in long patterns, and also in patterns that are created by +joining together fragments that contain references within themselves. +.P +The sequence \eg{+1} is a reference to the next capturing subpattern. This kind +of forward reference can be useful in patterns that repeat. Perl does not +support the use of + in this way. +.P +A backreference matches whatever actually matched the capturing subpattern in +the current subject string, rather than anything matching the subpattern +itself (see +.\" HTML +.\" +"Subpatterns as subroutines" +.\" +below for a way of doing that). So the pattern +.sp + (sens|respons)e and \e1ibility +.sp +matches "sense and sensibility" and "response and responsibility", but not +"sense and responsibility". If caseful matching is in force at the time of the +backreference, the case of letters is relevant. For example, +.sp + ((?i)rah)\es+\e1 +.sp +matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original +capturing subpattern is matched caselessly. +.P +There are several different ways of writing backreferences to named +subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or +\ek'name' are supported, as is the Python syntax (?P=name). 
Perl 5.10's unified +backreference syntax, in which \eg can be used for both numeric and named +references, is also supported. We could rewrite the above example in any of +the following ways: +.sp + (?<p1>(?i)rah)\es+\ek<p1> + (?'p1'(?i)rah)\es+\ek{p1} + (?P<p1>(?i)rah)\es+(?P=p1) + (?<p1>(?i)rah)\es+\eg{p1} +.sp +A subpattern that is referenced by name may appear in the pattern before or +after the reference. +.P +There may be more than one backreference to the same subpattern. If a +subpattern has not actually been used in a particular match, any backreferences +to it always fail by default. For example, the pattern +.sp + (a|(bc))\e2 +.sp +always fails if it starts to match "a" rather than "bc". However, if the +PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an +unset value matches an empty string. +.P +Because there may be many capturing parentheses in a pattern, all digits +following a backslash are taken as part of a potential backreference number. +If the pattern continues with a digit character, some delimiter must be used to +terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE +option is set, this can be white space. Otherwise, the \eg{ syntax or an empty +comment (see +.\" HTML +.\" +"Comments" +.\" +below) can be used. +. +. +.SS "Recursive backreferences" +.rs +.sp +A backreference that occurs inside the parentheses to which it refers fails +when the subpattern is first used, so, for example, (a\e1) never matches. +However, such references can be useful inside repeated subpatterns. For +example, the pattern +.sp + (a|b\e1)+ +.sp +matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of +the subpattern, the backreference matches the character string corresponding +to the previous iteration. In order for this to work, the pattern must be such +that the first iteration does not need to match the backreference. 
This can be +done using alternation, as in the example above, or by a quantifier with a +minimum of zero. +.P +Backreferences of this type cause the group that they reference to be treated +as an +.\" HTML +.\" +atomic group. +.\" +Once the whole group has been matched, a subsequent matching failure cannot +cause backtracking into the middle of the group. +. +. +.\" HTML +.SH ASSERTIONS +.rs +.sp +An assertion is a test on the characters following or preceding the current +matching point that does not consume any characters. The simple assertions +coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described +.\" HTML +.\" +above. +.\" +.P +More complicated assertions are coded as subpatterns. There are two kinds: +those that look ahead of the current position in the subject string, and those +that look behind it, and in each case an assertion may be positive (must +succeed for matching to continue) or negative (must not succeed for matching to +continue). An assertion subpattern is matched in the normal way, except that, +when matching continues after a successful assertion, the matching position in +the subject string is as it was before the assertion was processed. +.P +Assertion subpatterns are not capturing subpatterns. If an assertion contains +capturing subpatterns within it, these are counted for the purposes of +numbering the capturing subpatterns in the whole pattern. Within each branch of +an assertion, locally captured substrings may be referenced in the usual way. +For example, a sequence such as (.)\eg{-1} can be used to check that two +adjacent characters are the same. +.P +When a branch within an assertion fails to match, any substrings that were +captured are discarded (as happens with any pattern branch that fails to +match). A negative assertion succeeds only when all its branches fail to match; +this means that no captured substrings are ever retained after a successful +negative assertion. 
When an assertion contains a matching branch, what happens +depends on the type of assertion. +.P +For a positive assertion, internally captured substrings in the successful +branch are retained, and matching continues with the next pattern item after +the assertion. For a negative assertion, a matching branch means that the +assertion has failed. If the assertion is being used as a condition in a +.\" HTML +.\" +conditional subpattern +.\" +(see below), captured substrings are retained, because matching continues with +the "no" branch of the condition. For other failing negative assertions, +control passes to the previous backtracking point, thus discarding any captured +strings within the assertion. +.P +For compatibility with Perl, most assertion subpatterns may be repeated; though +it makes no sense to assert the same thing several times, the side effect of +capturing parentheses may occasionally be useful. However, an assertion that +forms the condition for a conditional subpattern may not be quantified. In +practice, for other assertions, there are only three cases: +.sp +(1) If the quantifier is {0}, the assertion is never obeyed during matching. +However, it may contain internal capturing parenthesized groups that are called +from elsewhere via the +.\" HTML +.\" +subroutine mechanism. +.\" +.sp +(2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it +were {0,1}. At run time, the rest of the pattern match is tried with and +without the assertion, the order depending on the greediness of the quantifier. +.sp +(3) If the minimum repetition is greater than zero, the quantifier is ignored. +The assertion is obeyed just once when encountered during matching. +. +. +.SS "Lookahead assertions" +.rs +.sp +Lookahead assertions start with (?= for positive assertions and (?! for +negative assertions. 
For example, +.sp + \ew+(?=;) +.sp +matches a word followed by a semicolon, but does not include the semicolon in +the match, and +.sp + foo(?!bar) +.sp +matches any occurrence of "foo" that is not followed by "bar". Note that the +apparently similar pattern +.sp + (?!foo)bar +.sp +does not find an occurrence of "bar" that is preceded by something other than +"foo"; it finds any occurrence of "bar" whatsoever, because the assertion +(?!foo) is always true when the next three characters are "bar". A +lookbehind assertion is needed to achieve the other effect. +.P +If you want to force a matching failure at some point in a pattern, the most +convenient way to do it is with (?!) because an empty string always matches, so +an assertion that requires there not to be an empty string must always fail. +The backtracking control verb (*FAIL) or (*F) is a synonym for (?!). +. +. +.\" HTML +.SS "Lookbehind assertions" +.rs +.sp +Lookbehind assertions start with (?<= for positive assertions and (?<! for +negative assertions. For example, +.sp + (?<!foo)bar +.sp +does find an occurrence of "bar" that is not preceded by "foo". The contents of +a lookbehind assertion are restricted such that all the strings it matches must +have a fixed length. +.P +In some cases, the escape sequence \eK +.\" HTML +.\" +(see above) +.\" +can be used instead of a lookbehind assertion to get round the fixed-length +restriction. +.P +The implementation of lookbehind assertions is, for each alternative, to +temporarily move the current position back by the fixed length and then try to +match. If there are insufficient characters before the current position, the +assertion fails. +.P +In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a +single code unit even in a UTF mode) to appear in lookbehind assertions, +because it makes it impossible to calculate the length of the lookbehind. The +\eX and \eR escapes, which can match different numbers of code units, are never +permitted in lookbehinds. +.P +.\" HTML +.\" +"Subroutine" +.\" +calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long +as the subpattern matches a fixed-length string.
However, +.\" HTML +.\" +recursion, +.\" +that is, a "subroutine" call into a group that is already active, +is not supported. +.P +Perl does not support backreferences in lookbehinds. PCRE2 does support them, +but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option +must not be set, there must be no use of (?| in the pattern (it creates +duplicate subpattern numbers), and if the backreference is by name, the name +must be unique. Of course, the referenced subpattern must itself be of fixed +length. The following pattern matches words containing at least two characters +that begin and end with the same character: +.sp + \eb(\ew)\ew++(?<=\e1) +.P +Possessive quantifiers can be used in conjunction with lookbehind assertions to +specify efficient matching of fixed-length strings at the end of subject +strings. Consider a simple pattern such as +.sp + abcd$ +.sp +when applied to a long string that does not match. Because matching proceeds +from left to right, PCRE2 will look for each "a" in the subject and then see if +what follows matches the rest of the pattern. If the pattern is specified as +.sp + ^.*abcd$ +.sp +the initial .* matches the entire string at first, but when this fails (because +there is no following "a"), it backtracks to match all but the last character, +then all but the last two characters, and so on. Once again the search for "a" +covers the entire string, from right to left, so we are no better off. However, +if the pattern is written as +.sp + ^.*+(?<=abcd) +.sp +there can be no backtracking for the .*+ item because of the possessive +quantifier; it can match only the entire string. The subsequent lookbehind +assertion does a single test on the last four characters. If it fails, the +match fails immediately. For long strings, this approach makes a significant +difference to the processing time. +. +. +.SS "Using multiple assertions" +.rs +.sp +Several assertions (of any sort) may occur in succession. 
For example, +.sp + (?<=\ed{3})(?<!999)foo +.sp +matches "foo" preceded by three digits that are not "999". Notice that each of +the assertions is applied independently at the same point in the subject +string. First there is a check that the previous three characters are all +digits, and then there is a check that the same three characters are not "999". +. +. +.\" HTML +.SH "CONDITIONAL SUBPATTERNS" +.rs +.sp +It is possible to cause the matching process to obey a subpattern +conditionally or to choose between two alternative subpatterns, depending on +the result of an assertion, or whether a specific capturing subpattern has +already been matched. The two possible forms of conditional subpattern are: +.sp + (?(condition)yes-pattern) + (?(condition)yes-pattern|no-pattern) +.sp +If the condition is satisfied, the yes-pattern is used; otherwise the +no-pattern (if present) is used. An absent no-pattern is equivalent to an empty +string (it always matches). If there are more than two alternatives in the +subpattern, a compile-time error occurs. Each of the two alternatives may +itself contain nested subpatterns of any form, including conditional +subpatterns; the restriction to two alternatives applies only at the level of +the condition. This pattern fragment is an example where the alternatives are +complex: +.sp + (?(1) (A|B|C) | (D | (?(2)E|F) | E) ) +.sp +.P +There are five kinds of condition: references to subpatterns, references to +recursion, two pseudo-conditions called DEFINE and VERSION, and assertions. +. +. +.SS "Checking for a used subpattern by number" +.rs +.sp +If the text between the parentheses consists of a sequence of digits, the +condition is true if a capturing subpattern of that number has previously +matched. If there is more than one capturing subpattern with the same number +(see the earlier +.\" +.\" HTML +.\" +section about duplicate subpattern numbers), +.\" +the condition is true if any of them have matched. An alternative notation is +to precede the digits with a plus or minus sign. In this case, the subpattern +number is relative rather than absolute. The most recently opened parentheses +can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside +loops it can also make sense to refer to subsequent groups.
The next +parentheses to be opened can be referenced as (?(+1), and so on. (The value +zero in any of these forms is not used; it provokes a compile-time error.) +.P +Consider the following pattern, which contains non-significant white space to +make it more readable (assume the PCRE2_EXTENDED option) and to divide it into +three parts for ease of discussion: +.sp + ( \e( )? [^()]+ (?(1) \e) ) +.sp +The first part matches an optional opening parenthesis, and if that +character is present, sets it as the first captured substring. The second part +matches one or more characters that are not parentheses. The third part is a +conditional subpattern that tests whether or not the first set of parentheses +matched. If they did, that is, if the subject started with an opening parenthesis, +the condition is true, and so the yes-pattern is executed and a closing +parenthesis is required. Otherwise, since no-pattern is not present, the +subpattern matches nothing. In other words, this pattern matches a sequence of +non-parentheses, optionally enclosed in parentheses. +.P +If you were embedding this pattern in a larger one, you could use a relative +reference: +.sp + ...other stuff... ( \e( )? [^()]+ (?(-1) \e) ) ... +.sp +This makes the fragment independent of the parentheses in the larger pattern. +. +. +.SS "Checking for a used subpattern by name" +.rs +.sp +Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used +subpattern by name. For compatibility with earlier versions of PCRE1, which had +this facility before Perl, the syntax (?(name)...) is also recognized. Note, +however, that undelimited names consisting of the letter R followed by digits +are ambiguous (see the following section). +.P +Rewriting the above example to use a named subpattern gives this: +.sp + (?<OPEN> \e( )? [^()]+ (?(<OPEN>) \e) ) +.sp +If the name used in a condition of this kind is a duplicate, the test is +applied to all subpatterns of the same name, and is true if any one of them has +matched. +. +.
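The conditional forms described above can be tried out without a PCRE2 build. The following is a minimal sketch in Python rather than PCRE2: Python's standard re module is a different engine, but it accepts the same (?(1)...) condition-by-number syntax. Named groups are spelled (?P<OPEN>...) there, the name condition is written (?(OPEN)...), and there is no relative (?(-1)...) form. The helper name is_wrapped is hypothetical and exists only for this illustration.

```python
import re

# Conditional by number: group 1 is the optional opening parenthesis.
# re.VERBOSE plays the role of PCRE2_EXTENDED (white space is ignored).
BY_NUMBER = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

# Conditional by name: Python spells the named group (?P<OPEN>...).
BY_NAME = re.compile(r"(?P<OPEN> \( )? [^()]+ (?(OPEN) \) )", re.VERBOSE)


def is_wrapped(pattern, subject):
    """Hypothetical helper: True if the whole subject matches the pattern."""
    return pattern.fullmatch(subject) is not None


if __name__ == "__main__":
    # Balanced or bare text matches; a lone parenthesis on either side does not.
    for subject in ["abc", "(abc)", "(abc", "abc)"]:
        print(subject, is_wrapped(BY_NUMBER, subject), is_wrapped(BY_NAME, subject))
```

Anchoring the whole subject with fullmatch() is a choice made for this sketch; an unanchored search would still find "abc" inside "(abc".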
+.SS "Checking for pattern recursion" +.rs +.sp +"Recursion" in this sense refers to any subroutine-like call from one part of +the pattern to another, whether or not it is actually recursive. See the +sections entitled +.\" HTML +.\" +"Recursive patterns" +.\" +and +.\" HTML +.\" +"Subpatterns as subroutines" +.\" +below for details of recursion and subpattern calls. +.P +If a condition is the string (R), and there is no subpattern with the name R, +the condition is true if matching is currently in a recursion or subroutine +call to the whole pattern or any subpattern. If digits follow the letter R, and +there is no subpattern with that name, the condition is true if the most recent +call is into a subpattern with the given number, which must exist somewhere in +the overall pattern. This is a contrived example that is equivalent to a+b: +.sp + ((?(R1)a+|(?1)b)) +.sp +However, in both cases, if there is a subpattern with a matching name, the +condition tests for its being set, as described in the section above, instead +of testing for recursion. For example, creating a group with the name R1 by +adding (?<R1>) to the above pattern completely changes its meaning. +.P +If a name preceded by ampersand follows the letter R, for example: +.sp + (?(R&name)...) +.sp +the condition is true if the most recent recursion is into a subpattern of that +name (which must exist within the pattern). +.P +This condition does not check the entire recursion stack. It tests only the +current level. If the name used in a condition of this kind is a duplicate, the +test is applied to all subpatterns of the same name, and is true if any one of +them is the most recent recursion. +.P +At "top level", all these recursion test conditions are false. +. +. +.\" HTML +.SS "Defining subpatterns for use by reference only" +.rs +.sp +If the condition is the string (DEFINE), the condition is always false, even if +there is a group with the name DEFINE.
In this case, there may be only one +alternative in the subpattern. It is always skipped if control reaches this +point in the pattern; the idea of DEFINE is that it can be used to define +subroutines that can be referenced from elsewhere. (The use of +.\" HTML +.\" +subroutines +.\" +is described below.) For example, a pattern to match an IPv4 address such as +"192.168.23.245" could be written like this (ignore white space and line +breaks): +.sp + (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) ) + \eb (?&byte) (\e.(?&byte)){3} \eb +.sp +The first part of the pattern is a DEFINE group inside which another group +named "byte" is defined. This matches an individual component of an IPv4 +address (a number less than 256). When matching takes place, this part of the +pattern is skipped because DEFINE acts like a false condition. The rest of the +pattern uses references to the named group to match the four dot-separated +components of an IPv4 address, insisting on a word boundary at each end. +. +. +.SS "Checking the PCRE2 version" +.rs +.sp +Programs that link with a PCRE2 library can check the version by calling +\fBpcre2_config()\fP with appropriate arguments. Users of applications that do +not have access to the underlying code cannot do this. A special "condition" +called VERSION exists to allow such users to discover which version of PCRE2 +they are dealing with by using this condition to match a string such as +"yesno". VERSION must be followed either by "=" or ">=" and a version number. +For example: +.sp + (?(VERSION>=10.4)yes|no) +.sp +This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or +"no" otherwise. The fractional part of the version number may not contain more +than two digits. +. +. +.SS "Assertion conditions" +.rs +.sp +If the condition is not in any of the above formats, it must be an assertion. +This may be a positive or negative lookahead or lookbehind assertion.
Consider +this pattern, again containing non-significant white space, and with the two +alternatives on the second line: +.sp + (?(?=[^a-z]*[a-z]) + \ed{2}-[a-z]{3}-\ed{2} | \ed{2}-\ed{2}-\ed{2} ) +.sp +The condition is a positive lookahead assertion that matches an optional +sequence of non-letters followed by a letter. In other words, it tests for the +presence of at least one letter in the subject. If a letter is found, the +subject is matched against the first alternative; otherwise it is matched +against the second. This pattern matches strings in one of the two forms +dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits. +.P +When an assertion that is a condition contains capturing subpatterns, any +capturing that occurs in a matching branch is retained afterwards, for both +positive and negative assertions, because matching always continues after the +assertion, whether it succeeds or fails. (Compare non-conditional assertions, +when captures are retained only for positive assertions that succeed.) +. +. +.\" HTML +.SH COMMENTS +.rs +.sp +There are two ways of including comments in patterns that are processed by +PCRE2. In both cases, the start of the comment must not be in a character +class, nor in the middle of any other sequence of related characters such as +(?: or a subpattern name or number. The characters that make up a comment play +no part in the pattern matching. +.P +The sequence (?# marks the start of a comment that continues up to the next +closing parenthesis. Nested parentheses are not permitted. If the +PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character +also introduces a comment, which in this case continues to immediately after +the next newline character or character sequence in the pattern. 
Which +characters are interpreted as newlines is controlled by an option passed to the +compiling function or by a special sequence at the start of the pattern, as +described in the section entitled +.\" HTML +.\" +"Newline conventions" +.\" +above. Note that the end of this type of comment is a literal newline sequence +in the pattern; escape sequences that happen to represent a newline do not +count. For example, consider this pattern when PCRE2_EXTENDED is set, and the +default newline convention (a single linefeed character) is in force: +.sp + abc #comment \en still comment +.sp +On encountering the # character, \fBpcre2_compile()\fP skips along, looking for +a newline in the pattern. The sequence \en is still literal at this stage, so +it does not terminate the comment. Only an actual character with the code value +0x0a (the default newline) does so. +. +. +.\" HTML +.SH "RECURSIVE PATTERNS" +.rs +.sp +Consider the problem of matching a string in parentheses, allowing for +unlimited nested parentheses. Without the use of recursion, the best that can +be done is to use a pattern that matches up to some fixed depth of nesting. It +is not possible to handle an arbitrary nesting depth. +.P +For some time, Perl has provided a facility that allows regular expressions to +recurse (amongst other things). It does this by interpolating Perl code in the +expression at run time, and the code can refer to the expression itself. A Perl +pattern using code interpolation to solve the parentheses problem can be +created like this: +.sp + $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x; +.sp +The (?p{...}) item interpolates Perl code at run time, and in this case refers +recursively to the pattern in which it appears. +.P +Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it +supports special syntax for recursion of the entire pattern, and also for +individual subpattern recursion. 
After its introduction in PCRE1 and Python, +this kind of recursion was subsequently introduced into Perl at release 5.10. +.P +A special item that consists of (? followed by a number greater than zero and a +closing parenthesis is a recursive subroutine call of the subpattern of the +given number, provided that it occurs inside that subpattern. (If not, it is a +.\" HTML +.\" +non-recursive subroutine +.\" +call, which is described in the next section.) The special item (?R) or (?0) is +a recursive call of the entire regular expression. +.P +This PCRE2 pattern solves the nested parentheses problem (assume the +PCRE2_EXTENDED option is set so that white space is ignored): +.sp + \e( ( [^()]++ | (?R) )* \e) +.sp +First it matches an opening parenthesis. Then it matches any number of +substrings which can either be a sequence of non-parentheses, or a recursive +match of the pattern itself (that is, a correctly parenthesized substring). +Finally there is a closing parenthesis. Note the use of a possessive quantifier +to avoid backtracking into sequences of non-parentheses. +.P +If this were part of a larger pattern, you would not want to recurse the entire +pattern, so instead you could use this: +.sp + ( \e( ( [^()]++ | (?1) )* \e) ) +.sp +We have put the pattern into parentheses, and caused the recursion to refer to +them instead of the whole pattern. +.P +In a larger pattern, keeping track of parenthesis numbers can be tricky. This +is made easier by the use of relative references. Instead of (?1) in the +pattern above you can write (?-2) to refer to the second most recently opened +parentheses preceding the recursion. In other words, a negative number counts +capturing parentheses leftwards from the point at which it is encountered. +.P +Be aware however, that if +.\" HTML +.\" +duplicate subpattern numbers +.\" +are in use, relative references refer to the earliest subpattern with the +appropriate number. 
Consider, for example: +.sp + (?|(a)|(b)) (c) (?-2) +.sp +The first two capturing groups (a) and (b) are both numbered 1, and group (c) +is number 2. When the reference (?-2) is encountered, the second most recently +opened parentheses have the number 1, but it is the first such group (the (a) +group) to which the recursion refers. This would be the same if an absolute +reference (?1) was used. In other words, relative references are just a +shorthand for computing a group number. +.P +It is also possible to refer to subsequently opened parentheses, by writing +references such as (?+2). However, these cannot be recursive because the +reference is not inside the parentheses that are referenced. They are always +.\" HTML +.\" +non-recursive subroutine +.\" +calls, as described in the next section. +.P +An alternative approach is to use named parentheses. The Perl syntax for this +is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could +rewrite the above example as follows: +.sp + (?<pn> \e( ( [^()]++ | (?&pn) )* \e) ) +.sp +If there is more than one subpattern with the same name, the earliest one is +used. +.P +The example pattern that we have been looking at contains nested unlimited +repeats, and so the use of a possessive quantifier for matching strings of +non-parentheses is important when applying the pattern to strings that do not +match. For example, when this pattern is applied to +.sp + (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa() +.sp +it yields "no match" quickly. However, if a possessive quantifier is not used, +the match runs for a very long time indeed because there are so many different +ways the + and * repeats can carve up the subject, and all have to be tested +before failure can be reported. +.P +At the end of a match, the values of capturing parentheses are those from +the outermost level.
If you want to obtain intermediate values, a callout +function can be used (see below and the +.\" HREF +\fBpcre2callout\fP +.\" +documentation). If the pattern above is matched against +.sp + (ab(cd)ef) +.sp +the value for the inner capturing parentheses (numbered 2) is "ef", which is +the last value taken on at the top level. If a capturing subpattern is not +matched at the top level, its final captured value is unset, even if it was +(temporarily) set at a deeper level during the matching process. +.P +Do not confuse the (?R) item with the condition (R), which tests for recursion. +Consider this pattern, which matches text in angle brackets, allowing for +arbitrary nesting. Only digits are allowed in nested brackets (that is, when +recursing), whereas any characters are permitted at the outer level. +.sp + < (?: (?(R) \ed++ | [^<>]*+) | (?R)) * > +.sp +In this pattern, (?(R) is the start of a conditional subpattern, with two +different alternatives for the recursive and non-recursive cases. The (?R) item +is the actual recursive call. +. +. +.\" HTML +.SS "Differences in recursion processing between PCRE2 and Perl" +.rs +.sp +Some former differences between PCRE2 and Perl no longer exist. +.P +Before release 10.30, recursion processing in PCRE2 differed from Perl in that +a recursive subpattern call was always treated as an atomic group. That is, +once it had matched some of the subject string, it was never re-entered, even +if it contained untried alternatives and there was a subsequent matching +failure. (Historical note: PCRE implemented recursion before Perl did.) +.P +Starting with release 10.30, recursive subroutine calls are no longer treated +as atomic. That is, they can be re-entered to try unused alternatives if there +is a matching failure later in the pattern. This is now compatible with the way +Perl works. If you want a subroutine call to be atomic, you must explicitly +enclose it in an atomic group. 
+.P +Supporting backtracking into recursions simplifies certain types of recursive +pattern. For example, this pattern matches palindromic strings: +.sp + ^((.)(?1)\e2|.?)$ +.sp +The second branch in the group matches a single central character in the +palindrome when there are an odd number of characters, or nothing when there +are an even number of characters, but in order to work it has to be able to try +the second case when the rest of the pattern match fails. If you want to match +typical palindromic phrases, the pattern has to ignore all non-word characters, +which can be done like this: +.sp + ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$ +.sp +If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A +man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to +avoid backtracking into sequences of non-word characters. Without this, PCRE2 +takes a great deal longer (ten times or more) to match typical phrases, and +Perl takes so long that you think it has gone into a loop. +.P +Another way in which PCRE2 and Perl used to differ in their recursion +processing is in the handling of captured values. Formerly in Perl, when a +subpattern was called recursively or as a subroutine (see the next section), it +had no access to any values that were captured outside the recursion, whereas +in PCRE2 these values can be referenced. Consider this pattern: +.sp + ^(.)(\e1|a(?2)) +.sp +This pattern matches "bab". The first capturing parentheses match "b", then in +the second group, when the backreference \e1 fails to match "b", the second +alternative matches "a" and then recurses. In the recursion, \e1 does now match +"b" and so the whole match succeeds. This match used to fail in Perl, but in +later versions (I tried 5.024) it now works. +. +.
+.\" HTML +.SH "SUBPATTERNS AS SUBROUTINES" +.rs +.sp +If the syntax for a recursive subpattern call (either by number or by +name) is used outside the parentheses to which it refers, it operates a bit +like a subroutine in a programming language. More accurately, PCRE2 treats the +referenced subpattern as an independent subpattern which it tries to match at +the current matching position. The called subpattern may be defined before or +after the reference. A numbered reference can be absolute or relative, as in +these examples: +.sp + (...(absolute)...)...(?2)... + (...(relative)...)...(?-1)... + (...(?+1)...(relative)... +.sp +An earlier example pointed out that the pattern +.sp + (sens|respons)e and \e1ibility +.sp +matches "sense and sensibility" and "response and responsibility", but not +"sense and responsibility". If instead the pattern +.sp + (sens|respons)e and (?1)ibility +.sp +is used, it does match "sense and responsibility" as well as the other two +strings. Another example is given in the discussion of DEFINE above. +.P +Like recursions, subroutine calls used to be treated as atomic, but this +changed at PCRE2 release 10.30, so backtracking into subroutine calls can now +occur. However, any capturing parentheses that are set during the subroutine +call revert to their previous values afterwards. +.P +Processing options such as case-independence are fixed when a subpattern is +defined, so if it is used as a subroutine, such options cannot be changed for +different calls. For example, consider this pattern: +.sp + (abc)(?i:(?-1)) +.sp +It matches "abcabc". It does not match "abcABC" because the change of +processing option does not affect the called subpattern. +.P +The behaviour of +.\" HTML +.\" +backtracking control verbs +.\" +in subpatterns when called as subroutines is described in the section entitled +.\" HTML +.\" +"Backtracking verbs in subroutines" +.\" +below. +. +. 
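The difference between a backreference and a subroutine call in the "sense and sensibility" example can be illustrated outside PCRE2. The sketch below uses Python's standard re module, which has backreferences but no (?1) subroutine call, so the call is emulated by repeating the source text of the subpattern by hand. That is a swapped-in technique, not what PCRE2 does internally, and the names STEM, BACKREF and SUBROUTINE_LIKE are hypothetical.

```python
import re

# Source text of the subpattern that group 1 wraps.
STEM = r"sens|respons"

# Backreference: \1 must match the same text that group 1 captured.
BACKREF = re.compile(rf"({STEM})e and \1ibility")

# Emulated subroutine call: the subpattern itself is matched again, so the
# second stem may differ from the first. In PCRE2 this would be written (?1).
SUBROUTINE_LIKE = re.compile(rf"({STEM})e and (?:{STEM})ibility")

if __name__ == "__main__":
    for subject in ["sense and sensibility",
                    "response and responsibility",
                    "sense and responsibility"]:
        print(subject,
              BACKREF.fullmatch(subject) is not None,
              SUBROUTINE_LIKE.fullmatch(subject) is not None)
```

The emulation covers only the matching behaviour shown here; it says nothing about how captures revert after a real subroutine call or how processing options are fixed at definition time.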
+.\" HTML +.SH "ONIGURUMA SUBROUTINE SYNTAX" +.rs +.sp +For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or +a number enclosed either in angle brackets or single quotes, is an alternative +syntax for referencing a subpattern as a subroutine, possibly recursively. Here +are two of the examples used above, rewritten using this syntax: +.sp + (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) ) + (sens|respons)e and \eg'1'ibility +.sp +PCRE2 supports an extension to Oniguruma: if a number is preceded by a +plus or a minus sign it is taken as a relative reference. For example: +.sp + (abc)(?i:\eg<-1>) +.sp +Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP +synonymous. The former is a backreference; the latter is a subroutine call. +. +. +.SH CALLOUTS +.rs +.sp +Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl +code to be obeyed in the middle of matching a regular expression. This makes it +possible, amongst other things, to extract different substrings that match the +same pair of parentheses when there is a repetition. +.P +PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl +code. The feature is called "callout". The caller of PCRE2 provides an external +function by putting its entry point in a match context using the function +\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP +or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout +entry point is set to NULL, callouts are disabled. +.P +Within a regular expression, (?C) indicates a point at which the external +function is to be called. There are two kinds of callout: those with a +numerical argument and those with a string argument. (?C) on its own with no +argument is treated as (?C0). A numerical argument allows the application to +distinguish between different callouts.
String arguments were added for release +10.20 to make it possible for script languages that use PCRE2 to embed short +scripts within patterns in a similar way to Perl. +.P +During matching, when PCRE2 reaches a callout point, the external function is +called. It is provided with the number or string argument of the callout, the +position in the pattern, and one item of data that is also set in the match +block. The callout function may cause matching to proceed, to backtrack, or to +fail. +.P +By default, PCRE2 implements a number of optimizations at matching time, and +one side-effect is that sometimes callouts are skipped. If you need all +possible callouts to happen, you need to set options that disable the relevant +optimizations. More details, including a complete description of the +programming interface to the callout function, are given in the +.\" HREF +\fBpcre2callout\fP +.\" +documentation. +. +. +.SS "Callouts with numerical arguments" +.rs +.sp +If you just want to have a means of identifying different callout points, put a +number less than 256 after the letter C. For example, this pattern has two +callout points: +.sp + (?C1)abc(?C2)def +.sp +If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical +callouts are automatically installed before each item in the pattern. They are +all numbered 255. If there is a conditional group in the pattern whose +condition is an assertion, an additional callout is inserted just before the +condition. An explicit callout may also be set at this position, as in this +example: +.sp + (?(?C9)(?=a)abc|def) +.sp +Note that this applies only to assertion conditions, not to other types of +condition. +. +. +.SS "Callouts with string arguments" +.rs +.sp +A delimited string may be used instead of a number as a callout argument. The +starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is +the same as the start, except for {, where the ending delimiter is }. 
If the +ending delimiter is needed within the string, it must be doubled. For +example: +.sp + (?C'ab ''c'' d')xyz(?C{any text})pqr +.sp +The doubling is removed before the string is passed to the callout function. +. +. +.\" HTML +.SH "BACKTRACKING CONTROL" +.rs +.sp +There are a number of special "Backtracking Control Verbs" (to use Perl's +terminology) that modify the behaviour of backtracking during matching. They +are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form, +possibly behaving differently depending on whether or not a name is present. +.P +By default, for compatibility with Perl, a name is any sequence of characters +that does not include a closing parenthesis. The name is not processed in +any way, and it is not possible to include a closing parenthesis in the name. +This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result +is no longer Perl-compatible. +.P +When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names +and only an unescaped closing parenthesis terminates the name. However, the +only backslash items that are permitted are \eQ, \eE, and sequences such as +\ex{100} that define character code points. Character type escapes such as \ed +are faulted. +.P +A closing parenthesis can be included in a name either as \e) or between \eQ +and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or +PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is +skipped, and #-comments are recognized, exactly as in the rest of the pattern. +PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless +PCRE2_ALT_VERBNAMES is also set. +.P +The maximum length of a name is 255 in the 8-bit library and 65535 in the +16-bit and 32-bit libraries. If the name is empty, that is, if the closing +parenthesis immediately follows the colon, the effect is as if the colon were +not there. Any number of these verbs may occur in a pattern. 
+.P +Since these verbs are specifically related to backtracking, most of them can be +used only when the pattern is to be matched using the traditional matching +function, because that uses a backtracking algorithm. With the exception of +(*FAIL), which behaves like a failing negative assertion, the backtracking +control verbs cause an error if encountered by the DFA matching function. +.P +The behaviour of these verbs in +.\" HTML +.\" +repeated groups, +.\" +.\" HTML +.\" +assertions, +.\" +and in +.\" HTML +.\" +subpatterns called as subroutines +.\" +(whether or not recursively) is documented below. +. +. +.\" HTML +.SS "Optimizations that affect backtracking verbs" +.rs +.sp +PCRE2 contains some optimizations that are used to speed up matching by running +some checks at the start of each match attempt. For example, it may know the +minimum length of matching subject, or that a particular character must be +present. When one of these optimizations bypasses the running of a match, any +included backtracking verbs will not, of course, be processed. You can suppress +the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option +when calling \fBpcre2_compile()\fP, or by starting the pattern with +(*NO_START_OPT). There is more discussion of this option in the section +entitled +.\" HTML +.\" +"Compiling a pattern" +.\" +in the +.\" HREF +\fBpcre2api\fP +.\" +documentation. +.P +Experiments with Perl suggest that it too has similar optimizations, and like +PCRE2, turning them off can change the result of a match. +. +. +.SS "Verbs that act immediately" +.rs +.sp +The following verbs act as soon as they are encountered. +.sp + (*ACCEPT) or (*ACCEPT:NAME) +.sp +This verb causes the match to end successfully, skipping the remainder of the +pattern. However, when it is inside a subpattern that is called as a +subroutine, only that subpattern is ended successfully. Matching then continues +at the outer level. 
If (*ACCEPT) is triggered in a positive assertion, the +assertion succeeds; in a negative assertion, the assertion fails. +.P +If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For +example: +.sp + A((?:A|B(*ACCEPT)|C)D) +.sp +This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by +the outer parentheses. +.sp + (*FAIL) or (*FAIL:NAME) +.sp +This verb causes a matching failure, forcing backtracking to occur. It may be +abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl +documentation notes that it is probably useful only when combined with (?{}) or +(??{}). Those are, of course, Perl features that are not present in PCRE2. The +nearest equivalent is the callout feature, as for example in this pattern: +.sp + a+(?C)(*FAIL) +.sp +A match with the string "aaaa" always fails, but the callout is taken before +each backtrack happens (in this example, 10 times). +.P +(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as +(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively. +. +. +.SS "Recording which path was taken" +.rs +.sp +There is one verb whose main purpose is to track how a match was arrived at, +though it also has a secondary use in conjunction with advancing the match +starting point (see (*SKIP) below). +.sp + (*MARK:NAME) or (*:NAME) +.sp +A name is always required with this verb. There may be as many instances of +(*MARK) as you like in a pattern, and their names do not have to be unique. +.P +When a match succeeds, the name of the last-encountered (*MARK:NAME) on the +matching path is passed back to the caller as described in the section entitled +.\" HTML +.\" +"Other information about the match" +.\" +in the +.\" HREF +\fBpcre2api\fP +.\" +documentation. This applies to all instances of (*MARK), including those inside +assertions and atomic groups. (There are differences in those cases when +(*MARK) is used in conjunction with (*SKIP) as described below.) 
+.P +As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have +associated NAME arguments. Whichever is last on the matching path is passed +back. See below for more details of these other verbs. +.P +Here is an example of \fBpcre2test\fP output, where the "mark" modifier +requests the retrieval and outputting of (*MARK) data: +.sp + re> /X(*MARK:A)Y|X(*MARK:B)Z/mark + data> XY + 0: XY + MK: A + XZ + 0: XZ + MK: B +.sp +The (*MARK) name is tagged with "MK:" in this output, and in this example it +indicates which of the two alternatives matched. This is a more efficient way +of obtaining this information than putting each alternative in its own +capturing parentheses. +.P +If a verb with a name is encountered in a positive assertion that is true, the +name is recorded and passed back if it is the last-encountered. This does not +happen for negative assertions or failing positive assertions. +.P +After a partial match or a failed match, the last encountered name in the +entire match process is returned. For example: +.sp + re> /X(*MARK:A)Y|X(*MARK:B)Z/mark + data> XP + No match, mark = B +.sp +Note that in this unanchored example the mark is retained from the match +attempt that started at the letter "X" in the subject. Subsequent match +attempts starting at "P" and then with an empty string do not get as far as the +(*MARK) item, but nevertheless do not reset it. +.P +If you are interested in (*MARK) values after failed matches, you should +probably set the PCRE2_NO_START_OPTIMIZE option +.\" HTML +.\" +(see above) +.\" +to ensure that the match is always attempted. +. +. +.SS "Verbs that act after backtracking" +.rs +.sp +The following verbs do nothing when they are encountered. Matching continues +with what follows, but if there is a subsequent match failure, causing a +backtrack to the verb, a failure is forced. That is, backtracking cannot pass +to the left of the verb. 
However, when one of these verbs appears inside an +atomic group or in a lookaround assertion that is true, its effect is confined +to that group, because once the group has been matched, there is never any +backtracking into it. Backtracking from beyond an assertion or an atomic group +ignores the entire group, and seeks a preceding backtracking point. +.P +These verbs differ in exactly what kind of failure occurs when backtracking +reaches them. The behaviour described below is what happens when the verb is +not in a subroutine or an assertion. Subsequent sections cover these special +cases. +.sp + (*COMMIT) or (*COMMIT:NAME) +.sp +This verb causes the whole match to fail outright if there is a later matching +failure that causes backtracking to reach it. Even if the pattern is +unanchored, no further attempts to find a match by advancing the starting point +take place. If (*COMMIT) is the only backtracking verb that is encountered, +once it has been passed \fBpcre2_match()\fP is committed to finding a match at +the current starting point, or not at all. For example: +.sp + a+(*COMMIT)b +.sp +This matches "xxaab" but not "aacaab". It can be thought of as a kind of +dynamic anchor, or "I've started, so I must finish." +.P +The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is +like (*MARK:NAME) in that the name is remembered for passing back to the +caller. However, (*SKIP:NAME) searches only for names set with (*MARK), +ignoring those set by (*COMMIT), (*PRUNE) and (*THEN). +.P +If there is more than one backtracking verb in a pattern, a different one that +follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a +match does not always guarantee that a match must be at this starting point. 
+.P +Note that (*COMMIT) at the start of a pattern is not the same as an anchor, +unless PCRE2's start-of-match optimizations are turned off, as shown in this +output from \fBpcre2test\fP: +.sp + re> /(*COMMIT)abc/ + data> xyzabc + 0: abc + data> + re> /(*COMMIT)abc/no_start_optimize + data> xyzabc + No match +.sp +For the first pattern, PCRE2 knows that any match must start with "a", so the +optimization skips along the subject to "a" before applying the pattern to the +first set of data. The match attempt then succeeds. The second pattern disables +the optimization that skips along to the first character. The pattern is now +applied starting at "x", and so the (*COMMIT) causes the match to fail without +trying any other starting points. +.sp + (*PRUNE) or (*PRUNE:NAME) +.sp +This verb causes the match to fail at the current starting position in the +subject if there is a later matching failure that causes backtracking to reach +it. If the pattern is unanchored, the normal "bumpalong" advance to the next +starting character then happens. Backtracking can occur as usual to the left of +(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but +if there is no match to the right, backtracking cannot cross (*PRUNE). In +simple cases, the use of (*PRUNE) is just an alternative to an atomic group or +possessive quantifier, but there are some uses of (*PRUNE) that cannot be +expressed in any other way. In an anchored pattern (*PRUNE) has the same effect +as (*COMMIT). +.P +The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is +like (*MARK:NAME) in that the name is remembered for passing back to the +caller. However, (*SKIP:NAME) searches only for names set with (*MARK), +ignoring those set by (*COMMIT), (*PRUNE) or (*THEN). 
+.sp + (*SKIP) +.sp +This verb, when given without a name, is like (*PRUNE), except that if the +pattern is unanchored, the "bumpalong" advance is not to the next character, +but to the position in the subject where (*SKIP) was encountered. (*SKIP) +signifies that whatever text was matched leading up to it cannot be part of a +successful match if there is a later mismatch. Consider: +.sp + a+(*SKIP)b +.sp +If the subject is "aaaac...", after the first match attempt fails (starting at +the first character in the string), the starting point skips on to start the +next attempt at "c". Note that a possessive quantifier does not have the same +effect as this example; although it would suppress backtracking during the +first match attempt, the second attempt would start at the second character +instead of skipping on to "c". +.sp + (*SKIP:NAME) +.sp +When (*SKIP) has an associated name, its behaviour is modified. When such a +(*SKIP) is triggered, the previous path through the pattern is searched for the +most recent (*MARK) that has the same name. If one is found, the "bumpalong" +advance is to the subject position that corresponds to that (*MARK) instead of +to where (*SKIP) was encountered. If no (*MARK) with a matching name is found, +the (*SKIP) is ignored. +.P +The search for a (*MARK) name uses the normal backtracking mechanism, which +means that it does not see (*MARK) settings that are inside atomic groups or +assertions, because they are never re-entered by backtracking. Compare the +following \fBpcre2test\fP examples: +.sp + re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/ + data> abc + 0: a + 1: a + data> + re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/ + data> abc + 0: b + 1: b +.sp +In the first example, the (*MARK) setting is in an atomic group, so it is not +seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows +the second branch of the pattern to be tried at the first character position. 
+In the second example, the (*MARK) setting is not in an atomic group. This +allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new +matching attempt to start at the second character. This time, the (*MARK) is +never seen because "a" does not match "b", so the matcher immediately jumps to +the second branch of the pattern. +.P +Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores +names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or (*THEN:NAME). +.sp + (*THEN) or (*THEN:NAME) +.sp +This verb causes a skip to the next innermost alternative when backtracking +reaches it. That is, it cancels any further backtracking within the current +alternative. Its name comes from the observation that it can be used for a +pattern-based if-then-else block: +.sp + ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ... +.sp +If the COND1 pattern matches, FOO is tried (and possibly further items after +the end of the group if FOO succeeds); on failure, the matcher skips to the +second alternative and tries COND2, without backtracking into COND1. If that +succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no +more alternatives, so there is a backtrack to whatever came before the entire +group. If (*THEN) is not inside an alternation, it acts like (*PRUNE). +.P +The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is +like (*MARK:NAME) in that the name is remembered for passing back to the +caller. However, (*SKIP:NAME) searches only for names set with (*MARK), +ignoring those set by (*COMMIT), (*PRUNE) and (*THEN). +.P +A subpattern that does not contain a | character is just a part of the +enclosing alternative; it is not a nested alternation with only one +alternative. The effect of (*THEN) extends beyond such a subpattern to the +enclosing alternative. Consider this pattern, where A, B, etc. 
are complex +pattern fragments that do not contain any | characters at this level: +.sp + A (B(*THEN)C) | D +.sp +If A and B are matched, but there is a failure in C, matching does not +backtrack into A; instead it moves to the next alternative, that is, D. +However, if the subpattern containing (*THEN) is given an alternative, it +behaves differently: +.sp + A (B(*THEN)C | (*FAIL)) | D +.sp +The effect of (*THEN) is now confined to the inner subpattern. After a failure +in C, matching moves to (*FAIL), which causes the whole subpattern to fail +because there are no more alternatives to try. In this case, matching does now +backtrack into A. +.P +Note that a conditional subpattern is not considered as having two +alternatives, because only one is ever used. In other words, the | character in +a conditional subpattern has a different meaning. Ignoring white space, +consider: +.sp + ^.*? (?(?=a) a | b(*THEN)c ) +.sp +If the subject is "ba", this pattern does not match. Because .*? is ungreedy, +it initially matches zero characters. The condition (?=a) then fails, the +character "b" is matched, but "c" is not. At this point, matching does not +backtrack to .*? as might perhaps be expected from the presence of the | +character. The conditional subpattern is part of the single alternative that +comprises the whole pattern, and so the match fails. (If there was a backtrack +into .*?, allowing it to match "b", the match would succeed.) +.P +The verbs just described provide four different "strengths" of control when +subsequent matching fails. (*THEN) is the weakest, carrying on the match at the +next alternative. (*PRUNE) comes next, failing the match at the current +starting position, but allowing an advance to the next character (for an +unanchored pattern). (*SKIP) is similar, except that the advance may be more +than one character. (*COMMIT) is the strongest, causing the entire match to +fail. +. +. 
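The difference in "strength" can be made concrete with a toy Python model of the single pattern a+(*VERB)b, using the same shape as the a+(*COMMIT)b and a+(*SKIP)b examples above. This is a sketch under stated assumptions, not PCRE2's algorithm: the function name search is hypothetical, start-of-match optimizations are assumed to be off, and (*THEN) is modelled like (*PRUNE) because this pattern contains no alternation.

```python
def search(subject, verb=None):
    """Toy model of matching a+(*VERB)b with an unanchored bumpalong loop.

    Hypothetical helper for illustration; returns (span, starts_tried),
    where span is (start, end) or None when there is no match.
    """
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        # Greedy a+: measure the run of "a" at this starting point.
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        next_start = start + 1  # normal bumpalong
        for k in range(run, 0, -1):
            pos = start + k  # the verb is passed here, then "b" is tried
            if pos < len(subject) and subject[pos] == "b":
                return (start, pos + 1), starts
            # "b" failed, so matching backtracks onto the verb.
            if verb == "COMMIT":
                return None, starts  # whole match fails outright
            if verb == "SKIP":
                next_start = pos  # advance to where (*SKIP) was passed
                break
            if verb in ("PRUNE", "THEN"):
                break  # fail at this start, advance by one character
            # No verb: fall through and retry with a shorter run of "a".
        start = next_start
    return None, starts
```

For the subject "aaaacaab", the model tries starting offsets 0 to 5 in turn with (*PRUNE), only 0, 4 and 5 with (*SKIP), and only 0 with (*COMMIT), which then reports no match.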
+.SS "More than one backtracking verb" +.rs +.sp +If more than one backtracking verb is present in a pattern, the one that is +backtracked onto first acts. For example, consider this pattern, where A, B, +etc. are complex pattern fragments: +.sp + (A(*COMMIT)B(*THEN)C|ABD) +.sp +If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to +fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes +the next alternative (ABD) to be tried. This behaviour is consistent, but is +not always the same as Perl's. It means that if two or more backtracking verbs +appear in succession, all but the last of them have no effect. Consider this +example: +.sp + ...(*COMMIT)(*PRUNE)... +.sp +If there is a matching failure to the right, backtracking onto (*PRUNE) causes +it to be triggered, and its action is taken. There can never be a backtrack +onto (*COMMIT). +. +. +.\" HTML +.SS "Backtracking verbs in repeated groups" +.rs +.sp +PCRE2 sometimes differs from Perl in its handling of backtracking verbs in +repeated groups. For example, consider: +.sp + /(a(*COMMIT)b)+ac/ +.sp +If the subject is "abac", Perl matches unless its optimizations are disabled, +but PCRE2 always fails because the (*COMMIT) in the second repeat of the group +acts. +. +. +.\" HTML +.SS "Backtracking verbs in assertions" +.rs +.sp +(*FAIL) in any assertion has its normal effect: it forces an immediate +backtrack. The behaviour of the other backtracking verbs depends on whether the +assertion is standalone or is acting as the condition in a conditional +subpattern. +.P +(*ACCEPT) in a standalone positive assertion causes the assertion to succeed +without any further processing; captured strings and a (*MARK) name (if set) +are retained. In a standalone negative assertion, (*ACCEPT) causes the +assertion to fail without any further processing; captured substrings and any +(*MARK) name are discarded. 
+.P +If the assertion is a condition, (*ACCEPT) causes the condition to be true for +a positive assertion and false for a negative one; captured substrings are +retained in both cases. +.P +The remaining verbs act only when a later failure causes a backtrack to +reach them. This means that their effect is confined to the assertion, +because lookaround assertions are atomic. A backtrack that occurs after an +assertion is complete does not jump back into the assertion. Note in particular +that a (*MARK) name that is set in an assertion is not "seen" by an instance of +(*SKIP:NAME) later in the pattern. +.P +The effect of (*THEN) is not allowed to escape beyond an assertion. If there +are no more branches to try, (*THEN) causes a positive assertion to be false, +and a negative assertion to be true. +.P +The other backtracking verbs are not treated specially if they appear in a +standalone positive assertion. In a conditional positive assertion, +backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE) +causes the condition to be false. However, for both standalone and conditional +negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes +the assertion to be true, without considering any further alternative branches. +. +. +.\" HTML +.SS "Backtracking verbs in subroutines" +.rs +.sp +These behaviours occur whether or not the subpattern is called recursively. +.P +(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to +succeed without any further processing. Matching then continues after the +subroutine call. Perl documents this behaviour. Perl's treatment of the other +verbs in subroutines is different in some cases. +.P +(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces +an immediate backtrack. +.P +(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when +triggered by being backtracked to in a subpattern called as a subroutine. 
There +is then a backtrack at the outer level. +.P +(*THEN), when triggered, skips to the next alternative in the innermost +enclosing group within the subpattern that has alternatives (its normal +behaviour). However, if there is no such group within the subroutine +subpattern, the subroutine match fails and there is a backtrack at the outer +level. +. +. +.SH "SEE ALSO" +.rs +.sp +\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3), +\fBpcre2syntax\fP(3), \fBpcre2\fP(3). +. +. +.SH AUTHOR +.rs +.sp +.nf +Philip Hazel +University Computing Service +Cambridge, England. +.fi +. +. +.SH REVISION +.rs +.sp +.nf +Last updated: 04 September 2018 +Copyright (c) 1997-2018 University of Cambridge. +.fi