new upstream release (3.3.0); modify package compatibility for Stretch

diff --git a/src/external/pcre2-10.32/doc/pcre2pattern.3 b/src/external/pcre2-10.32/doc/pcre2pattern.3

new file mode 100644

index 0000000..0247c52
--- /dev/null
+++ b/src/external/pcre2-10.32/doc/pcre2pattern.3
@@ -0,0 +1,3660 @@
+.TH PCRE2PATTERN 3 "04 September 2018" "PCRE2 10.32"
+.SH NAME
+PCRE2 - Perl-compatible regular expressions (revised API)
+.SH "PCRE2 REGULAR EXPRESSION DETAILS"
+.rs
+.sp
+The syntax and semantics of the regular expressions that are supported by PCRE2
+are described in detail below. There is a quick-reference syntax summary in the
+.\" HREF
+\fBpcre2syntax\fP
+.\"
+page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
+PCRE2 also supports some alternative regular expression syntax (which does not
+conflict with the Perl syntax) in order to provide some compatibility with
+regular expressions in Python, .NET, and Oniguruma.
+.P
+Perl's regular expressions are described in its own documentation, and regular
+expressions in general are covered in a number of books, some of which have
+copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
+by O'Reilly, covers regular expressions in great detail. This description of
+PCRE2's regular expressions is intended as reference material.
+.P
+This document discusses the patterns that are supported by PCRE2 when its main
+matching function, \fBpcre2_match()\fP, is used. PCRE2 also has an alternative
+matching function, \fBpcre2_dfa_match()\fP, which matches using a different
+algorithm that is not Perl-compatible. Some of the features discussed below are
+not available when DFA matching is used. The advantages and disadvantages of
+the alternative function, and how it differs from the normal function, are
+discussed in the
+.\" HREF
+\fBpcre2matching\fP
+.\"
+page.
+.
+.
+.SH "SPECIAL START-OF-PATTERN ITEMS"
+.rs
+.sp
+A number of options that can be passed to \fBpcre2_compile()\fP can also be set
+by special items at the start of a pattern. These are not Perl-compatible, but
+are provided to make these options accessible to pattern writers who are not
+able to change the program that processes the pattern. Any number of these
+items may appear, but they must all be together right at the start of the
+pattern string, and the letters must be in upper case.
+.
+.
+.SS "UTF support"
+.rs
+.sp
+In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
+single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
+specified for the 32-bit library, in which case it constrains the character
+values to valid Unicode code points. To process UTF strings, PCRE2 must be
+built to include Unicode support (which is the default). When using UTF strings
+you must either call the compiling function with the PCRE2_UTF option, or the
+pattern must start with the special sequence (*UTF), which is equivalent to
+setting the relevant option. How setting a UTF mode affects pattern matching is
+mentioned in several places below. There is also a summary of features in the
+.\" HREF
+\fBpcre2unicode\fP
+.\"
+page.
+.P
+Some applications that allow their users to supply patterns may wish to
+restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
+option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
+appearance in a pattern causes an error.
+.
+.
+.SS "Unicode property support"
+.rs
+.sp
+Another special sequence that may appear at the start of a pattern is (*UCP).
+This has the same effect as setting the PCRE2_UCP option: it causes sequences
+such as \ed and \ew to use Unicode properties to determine character types,
+instead of recognizing only characters with codes less than 256 via a lookup
+table.
+.P
+Some applications that allow their users to supply patterns may wish to
+restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
+\fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
+causes an error.
+.
+.
+.SS "Locking out empty string matching"
+.rs
+.sp
+Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
+as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
+matching function is subsequently called to match the pattern. These options
+lock out the matching of empty strings, either entirely, or only at the start
+of the subject.
+.
+.
+.SS "Disabling auto-possessification"
+.rs
+.sp
+If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
+the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
+possessive when what follows cannot match the repeated item. For example, by
+default a+b is treated as a++b. For more details, see the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.
+.
+.SS "Disabling start-up optimizations"
+.rs
+.sp
+If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
+PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
+reaching "no match" results. For more details, see the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.
+.
+.SS "Disabling automatic anchoring"
+.rs
+.sp
+If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
+setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
+apply to patterns whose top-level branches all start with .* (match any number
+of arbitrary characters). For more details, see the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.
+.
+.SS "Disabling JIT compilation"
+.rs
+.sp
+If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
+the application to apply the JIT optimization by calling
+\fBpcre2_jit_compile()\fP is ignored.
+.
+.
+.SS "Setting match resource limits"
+.rs
+.sp
+The \fBpcre2_match()\fP function contains a counter that is incremented every
+time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
+limit on this counter, which therefore limits the amount of computing resource
+used for a match. The maximum depth of nested backtracking can also be limited;
+this indirectly restricts the amount of heap memory that is used, but there is
+also an explicit memory limit that can be set.
+.P
+These facilities are provided to catch runaway matches that are provoked by
+patterns with huge matching trees (a typical example is a pattern with nested
+unlimited repeats applied to a long string that does not match). When one of
+these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
+can also be set by items at the start of the pattern of the form
+.sp
+  (*LIMIT_HEAP=d)
+  (*LIMIT_MATCH=d)
+  (*LIMIT_DEPTH=d)
+.sp
+where d is any number of decimal digits. However, the value of the setting must
+be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
+for it to have any effect. In other words, the pattern writer can lower the
+limits set by the programmer, but not raise them. If there is more than one
+setting of one of these limits, the lower value is used. The heap limit is
+specified in kibibytes (units of 1024 bytes).
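+.P
+The "lower but not raise" rule above can be sketched as a small Python model.
+This is an illustration only, not PCRE2 code: the function name and the
+dictionary of caller limits are hypothetical, and the sketch handles only the
+three limit items, not other start-of-pattern items mixed in with them.

```python
import re

# Matches one leading (*LIMIT_xxx=d) item; only the three names from the text.
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_HEAP|LIMIT_MATCH|LIMIT_DEPTH)=(\d+)\)")


def effective_limits(pattern, caller_limits):
    """Model how leading (*LIMIT_...=d) items combine with caller limits.

    caller_limits maps each limit name to the value set (or defaulted) by
    the caller. Returns (limits, rest_of_pattern).
    """
    limits = dict(caller_limits)
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            break
        name, value = m.group(1), int(m.group(2))
        # A pattern item can only lower a limit; repeated items keep the lowest.
        limits[name] = min(limits[name], value)
        pos = m.end()
    return limits, pattern[pos:]
```

+In this model a pattern that starts with (*LIMIT_MATCH=500)(*LIMIT_MATCH=900)
+ends up with a match limit of 500 when the caller's limit is higher, and an
+item that asks for more than the caller allows has no effect.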
+.P
+Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
+still recognized for backwards compatibility.
+.P
+The heap limit applies only when the \fBpcre2_match()\fP or
+\fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
+to JIT. The match limit is used (but in a different way) when JIT is being
+used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
+usage by those matching functions. The depth limit is ignored by JIT but is
+relevant for DFA matching, which uses function recursion for recursions within
+the pattern and for lookaround assertions and atomic groups. In this case, the
+depth limit controls the depth of such recursion.
+.
+.
+.\" HTML <a name="newlines"></a>
+.SS "Newline conventions"
+.rs
+.sp
+PCRE2 supports six different conventions for indicating line breaks in
+strings: a single CR (carriage return) character, a single LF (linefeed)
+character, the two-character sequence CRLF, any of the three preceding, any
+Unicode newline sequence, or the NUL character (binary zero). The
+.\" HREF
+\fBpcre2api\fP
+.\"
+page has
+.\" HTML <a href="pcre2api.html#newlines">
+.\" </a>
+further discussion
+.\"
+about newlines, and shows how to set the newline convention when calling
+\fBpcre2_compile()\fP.
+.P
+It is also possible to specify a newline convention by starting a pattern
+string with one of the following sequences:
+.sp
+  (*CR)        carriage return
+  (*LF)        linefeed
+  (*CRLF)      carriage return, followed by linefeed
+  (*ANYCRLF)   any of the three above
+  (*ANY)       all Unicode newline sequences
+  (*NUL)       the NUL character (binary zero)
+.sp
+These override the default and the options given to the compiling function. For
+example, on a Unix system where LF is the default newline sequence, the pattern
+.sp
+  (*CR)a.b
+.sp
+changes the convention to CR. That pattern matches "a\enb" because LF is no
+longer a newline. If more than one of these settings is present, the last one
+is used.
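+.P
+The effect of (*CR) on the dot metacharacter can be imitated with Python's
+standard re module. This is a rough model under stated assumptions, not PCRE2
+itself: the helper names are hypothetical, only the single-character CR and LF
+conventions are modelled for the dot, and the dot is replaced by an explicit
+negated class.

```python
import re

# Leading newline items listed in the text.
_NEWLINE_ITEM = re.compile(r"\(\*(ANYCRLF|ANY|CRLF|CR|LF|NUL)\)")

# What "." (without DOTALL) excludes, for the two conventions modelled here.
_DOT = {"CR": r"[^\r]", "LF": r"[^\n]"}


def newline_convention(pattern, default="LF"):
    """Return (convention, rest); the last leading item wins."""
    convention, pos = default, 0
    while True:
        m = _NEWLINE_ITEM.match(pattern, pos)
        if m is None:
            break
        convention = m.group(1)
        pos = m.end()
    return convention, pattern[pos:]


def model_a_dot_b(pattern):
    """Compile a Python stand-in for a pattern of the form <items>a.b."""
    convention, rest = newline_convention(pattern)
    if rest != "a.b":
        raise ValueError("this sketch only models the example a.b")
    return re.compile("a" + _DOT[convention] + "b")
```

+With this model, (*CR)a.b matches "a", LF, "b" but not "a", CR, "b", while
+plain a.b under the LF default does the opposite.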
+.P
+The newline convention affects where the circumflex and dollar assertions are
+true. It also affects the interpretation of the dot metacharacter when
+PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
+opening brace. However, it does not affect what the \eR escape sequence
+matches. By default, this is any Unicode newline sequence, for Perl
+compatibility. However, this can be changed; see the next section and the
+description of \eR in the section entitled
+.\" HTML <a href="#newlineseq">
+.\" </a>
+"Newline sequences"
+.\"
+below. A change of \eR setting can be combined with a change of newline
+convention.
+.
+.
+.SS "Specifying what \eR matches"
+.rs
+.sp
+It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
+complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
+at compile time. This effect can also be achieved by starting a pattern with
+(*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
+corresponding to PCRE2_BSR_UNICODE.
+.
+.
+.SH "EBCDIC CHARACTER CODES"
+.rs
+.sp
+PCRE2 can be compiled to run in an environment that uses EBCDIC as its
+character code instead of ASCII or Unicode (typically a mainframe system). In
+the sections below, character code values are ASCII or Unicode; in an EBCDIC
+environment these characters may have different code values, and there are no
+code points greater than 255.
+.
+.
+.SH "CHARACTERS AND METACHARACTERS"
+.rs
+.sp
+A regular expression is a pattern that is matched against a subject string from
+left to right. Most characters stand for themselves in a pattern, and match the
+corresponding characters in the subject. As a trivial example, the pattern
+.sp
+  The quick brown fox
+.sp
+matches a portion of a subject string that is identical to itself. When
+caseless matching is specified (the PCRE2_CASELESS option), letters are matched
+independently of case.
+.P
+The power of regular expressions comes from the ability to include alternatives
+and repetitions in the pattern. These are encoded in the pattern by the use of
+\fImetacharacters\fP, which do not stand for themselves but instead are
+interpreted in some special way.
+.P
+There are two different sets of metacharacters: those that are recognized
+anywhere in the pattern except within square brackets, and those that are
+recognized within square brackets. Outside square brackets, the metacharacters
+are as follows:
+.sp
+  \e      general escape character with several uses
+  ^      assert start of string (or line, in multiline mode)
+  $      assert end of string (or line, in multiline mode)
+  .      match any character except newline (by default)
+  [      start character class definition
+  |      start of alternative branch
+  (      start subpattern
+  )      end subpattern
+  ?      extends the meaning of (
+         also 0 or 1 quantifier
+         also quantifier minimizer
+  *      0 or more quantifier
+  +      1 or more quantifier
+         also "possessive quantifier"
+  {      start min/max quantifier
+.sp
+Part of a pattern that is in square brackets is called a "character class". In
+a character class the only metacharacters are:
+.sp
+  \e      general escape character
+  ^      negate the class, but only if the first character
+  -      indicates character range
+.\" JOIN
+  [      POSIX character class (only if followed by POSIX
+           syntax)
+  ]      terminates the character class
+.sp
+The following sections describe the use of each of the metacharacters.
+.
+.
+.SH BACKSLASH
+.rs
+.sp
+The backslash character has several uses. Firstly, if it is followed by a
+character that is not a digit or a letter, it takes away any special meaning
+that character may have. This use of backslash as an escape character applies
+both inside and outside character classes.
+.P
+For example, if you want to match a * character, you must write \e* in the
+pattern. This escaping action applies whether or not the following character
+would otherwise be interpreted as a metacharacter, so it is always safe to
+precede a non-alphanumeric with backslash to specify that it stands for itself.
+In particular, if you want to match a backslash, you write \e\e.
+.P
+In a UTF mode, only ASCII digits and letters have any special meaning after a
+backslash. All other characters (in particular, those whose code points are
+greater than 127) are treated as literals.
+.P
+If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
+the pattern (other than in a character class), and characters between a #
+outside a character class and the next newline, inclusive, are ignored. An
+escaping backslash can be used to include a white space or # character as part
+of the pattern.
+.P
+If you want to remove the special meaning from a sequence of characters, you
+can do so by putting them between \eQ and \eE. This is different from Perl in
+that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas
+in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
+backslash interpolation" on any backslashes between \eQ and \eE which, its
+documentation says, "may lead to confusing results". PCRE2 treats a backslash
+between \eQ and \eE just like any other character. Note the following examples:
+.sp
+  Pattern            PCRE2 matches   Perl matches
+.sp
+.\" JOIN
+  \eQabc$xyz\eE        abc$xyz        abc followed by the
+                                      contents of $xyz
+  \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
+  \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
+  \eQA\eB\eE            A\eB            A\eB
+  \eQ\e\eE              \e              \e\eE
+.sp
+The \eQ...\eE sequence is recognized both inside and outside character classes.
+An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
+by \eE later in the pattern, the literal interpretation continues to the end of
+the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
+a character class, this causes an error, because the character class is not
+terminated by a closing square bracket.
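+.P
+Python's re module has no \eQ...\eE, so the PCRE2 behaviour described above can
+be sketched as a preprocessing step. This is a minimal model, not PCRE2 code:
+the function name is hypothetical, and it ignores the character-class error
+case mentioned above.

```python
import re


def expand_quoted(pattern):
    """Rewrite \\Q...\\E spans as escaped literals for Python's re module."""
    out = []
    i, n = 0, len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:          # no closing \E: literal runs to the end
                end = n
            out.append(re.escape(pattern[i + 2:end]))
            i = end + 2
        elif pattern.startswith("\\E", i):
            i += 2                 # an isolated \E is ignored
        elif pattern[i] == "\\" and i + 1 < n:
            out.append(pattern[i:i + 2])   # keep other escapes unchanged
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

+Applied to the five patterns in the table, the rewritten expressions match the
+strings in the "PCRE2 matches" column.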
+.
+.
+.\" HTML <a name="digitsafterbackslash"></a>
+.SS "Non-printing characters"
+.rs
+.sp
+A second use of backslash provides a way of encoding non-printing characters
+in patterns in a visible manner. There is no restriction on the appearance of
+non-printing characters in a pattern, but when a pattern is being prepared by
+text editing, it is often easier to use one of the following escape sequences
+than the binary character it represents. In an ASCII or Unicode environment,
+these escapes are as follows:
+.sp
+  \ea          alarm, that is, the BEL character (hex 07)
+  \ecx         "control-x", where x is any printable ASCII character
+  \ee          escape (hex 1B)
+  \ef          form feed (hex 0C)
+  \en          linefeed (hex 0A)
+  \er          carriage return (hex 0D)
+  \et          tab (hex 09)
+  \e0dd        character with octal code 0dd
+  \eddd        character with octal code ddd, or backreference
+  \eo{ddd..}   character with octal code ddd..
+  \exhh        character with hex code hh
+  \ex{hhh..}   character with hex code hhh..
+  \eN{U+hhh..} character with Unicode hex code point hhh..
+  \euhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)
+.sp
+The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
+is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
+\eN{name} to specify characters by Unicode name; PCRE2 does not support this.
+Note that when \eN is not followed by an opening brace (curly bracket) it has
+an entirely different meaning, matching any character that is not a newline.
+.P
+The precise effect of \ecx on ASCII characters is as follows: if x is a lower
+case letter, it is converted to upper case. Then bit 6 of the character (hex
+40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
+but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
+code unit following \ec has a value less than 32 or greater than 126, a
+compile-time error occurs.
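+.P
+The ASCII rule for \ecx is simple arithmetic and can be checked with a few
+lines of Python. This is a sketch of the rule as stated above, not PCRE2 code;
+the function name is hypothetical and ValueError stands in for the compile-time
+error.

```python
def control_escape(x):
    """Return the code generated by \\cx in an ASCII environment."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after \\c must be printable ASCII")
    if "a" <= x <= "z":
        code = ord(x.upper())      # lower case letters are upper-cased first
    return code ^ 0x40             # then bit 6 (hex 40) is inverted
```

+For instance, control_escape("A") is hex 01, control_escape("{") is hex 3B,
+and control_escape(";") is hex 7B, as in the text.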
+.P
+When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
+\ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
+escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
+only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
+^, _, or ?. Any other character provokes a compile-time error. The sequence
+\ec@ encodes character code 0; after \ec the letters (in either case) encode
+characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
+(hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
+.P
+Thus, apart from \ec?, these escapes generate the same character code values as
+they do in an ASCII environment, though the meanings of the values mostly
+differ. For example, \ecG always generates code value 7, which is BEL in ASCII
+but DEL in EBCDIC.
+.P
+The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
+because 127 is not a control character in EBCDIC, Perl makes it generate the
+APC character. Unfortunately, there are several variants of EBCDIC. In most of
+them the APC character has the value 255 (hex FF), but in the one Perl calls
+POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
+values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
+.P
+After \e0 up to two further octal digits are read. If there are fewer than two
+digits, just those that are present are used. Thus the sequence \e0\ex\e015
+specifies two binary zeros followed by a CR character (code value 13). Make
+sure you supply two digits after the initial zero if the pattern character that
+follows is itself an octal digit.
+.P
+The escape \eo must be followed by a sequence of octal digits, enclosed in
+braces. An error occurs if this is not the case. This escape is a recent
+addition to Perl; it provides a way of specifying character code points as octal
+numbers greater than 0777, and it also allows octal numbers and backreferences
+to be unambiguously specified.
+.P
+For greater clarity and unambiguity, it is best to avoid following \e by a
+digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
+character code points, and \eg{} to specify backreferences. The following
+paragraphs describe the old, ambiguous syntax.
+.P
+The handling of a backslash followed by a digit other than 0 is complicated,
+and Perl has changed over time, causing PCRE2 also to change.
+.P
+Outside a character class, PCRE2 reads the digit and any following digits as a
+decimal number. If the number is less than 10, begins with the digit 8 or 9, or
+if there are at least that many previous capturing left parentheses in the
+expression, the entire sequence is taken as a \fIbackreference\fP. A
+description of how this works is given
+.\" HTML <a href="#backreferences">
+.\" </a>
+later,
+.\"
+following the discussion of
+.\" HTML <a href="#subpattern">
+.\" </a>
+parenthesized subpatterns.
+.\"
+Otherwise, up to three octal digits are read to form a character code.
+.P
+Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
+"8" and "9", and otherwise reads up to three octal digits following the
+backslash, using them to generate a data character. Any subsequent digits stand
+for themselves. For example, outside a character class:
+.sp
+  \e040   is another way of writing an ASCII space
+.\" JOIN
+  \e40    is the same, provided there are fewer than 40
+            previous capturing subpatterns
+  \e7     is always a backreference
+.\" JOIN
+  \e11    might be a backreference, or another way of
+            writing a tab
+  \e011   is always a tab
+  \e0113  is a tab followed by the character "3"
+.\" JOIN
+  \e113   might be a backreference, otherwise the
+            character with octal code 113
+.\" JOIN
+  \e377   might be a backreference, otherwise
+            the value 255 (decimal)
+.\" JOIN
+  \e81    is always a backreference
+.sp
+Note that octal values of 100 or greater that are specified using this syntax
+must not be introduced by a leading zero, because no more than three octal
+digits are ever read.
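+.P
+The rules for a backslash followed by digits, outside a character class, can
+be written out as a short Python function. This is an illustrative model of
+the rules as stated above, not the PCRE2 parser: the function name and the
+tuple it returns are hypothetical, and limits on character values are not
+checked.

```python
_OCTAL = "01234567"


def interpret_backslash_digits(digits, group_count):
    """Classify the digits that follow a backslash outside a class.

    group_count is the number of capturing left parentheses seen so far.
    Returns ("backref", number, "") or ("char", code, literal_rest).
    """
    if not (digits and digits.isascii() and digits.isdigit()):
        raise ValueError("expected one or more ASCII digits")
    if digits[0] != "0":
        number = int(digits)
        # Backreference if < 10, starts with 8 or 9, or enough groups exist.
        if number < 10 or digits[0] in "89" or number <= group_count:
            return ("backref", number, "")
    # Otherwise read up to three octal digits (for \0: zero plus up to two).
    j = 0
    while j < 3 and j < len(digits) and digits[j] in _OCTAL:
        j += 1
    return ("char", int(digits[:j], 8), digits[j:])
```

+With no previous capturing groups, "0113" gives a tab followed by a literal
+"3", "377" gives the value 255, and "81" is still a backreference.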
+.P
+By default, after \ex that is not followed by {, from zero to two hexadecimal
+digits are read (letters can be in upper or lower case). Any number of
+hexadecimal digits may appear between \ex{ and }. If a character other than
+a hexadecimal digit appears between \ex{ and }, or if there is no terminating
+}, an error occurs.
+.P
+If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
+described only when it is followed by two hexadecimal digits. Otherwise, it
+matches a literal "x" character. In this mode, support for code points greater
+than 255 is provided by \eu, which must be followed by four hexadecimal digits;
+otherwise it matches a literal "u" character.
+.P
+Characters whose value is less than 256 can be defined by either of the two
+syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
+the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
+\eu00dc in PCRE2_ALT_BSUX mode).
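+.P
+The two interpretations of \ex without braces can be compared in a small Python
+model. This is a sketch of the rules above, not PCRE2 code; the function name
+and its alt_bsux parameter are hypothetical stand-ins for the PCRE2_ALT_BSUX
+option.

```python
_HEX = "0123456789abcdefABCDEF"


def interpret_x(following, alt_bsux=False):
    """Interpret \\x given the text that follows it (no opening brace).

    Returns (character, remaining_text).
    """
    n = 0
    while n < 2 and n < len(following) and following[n] in _HEX:
        n += 1
    if alt_bsux and n < 2:
        # Without exactly two hex digits, \x is just a literal "x".
        return "x", following
    # Default mode: zero to two hex digits; no digits gives code value 0.
    return chr(int(following[:n] or "0", 16)), following[n:]
```

+In the default mode interpret_x("4g") yields the character with code 4 followed
+by "g"; with alt_bsux=True the same input yields a literal "x".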
+.
+.
+.SS "Constraints on character values"
+.rs
+.sp
+Characters that are specified using octal or hexadecimal numbers are
+limited to certain values, as follows:
+.sp
+  8-bit non-UTF mode    no greater than 0xff
+  16-bit non-UTF mode   no greater than 0xffff
+  32-bit non-UTF mode   no greater than 0xffffffff
+  All UTF modes         no greater than 0x10ffff and a valid code point
+.sp
+Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
+so-called "surrogate" code points). The check for these can be disabled by the
+caller of \fBpcre2_compile()\fP by setting the option
+PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
+and UTF-32 modes, because these values are not representable in UTF-16.
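+.P
+The table and the surrogate rule can be expressed as one Python predicate.
+This is a model of the constraints as listed above, not PCRE2 code: the
+function and parameter names are hypothetical, and allow_surrogates stands in
+for PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES.

```python
# Largest value allowed in each non-UTF mode, keyed by code unit width.
_NON_UTF_MAX = {8: 0xFF, 16: 0xFFFF, 32: 0xFFFFFFFF}


def is_allowed_value(value, width, utf, allow_surrogates=False):
    """Return True if a numeric escape may specify this character value."""
    if value < 0:
        return False
    if not utf:
        return value <= _NON_UTF_MAX[width]
    if value > 0x10FFFF:
        return False
    if 0xD800 <= value <= 0xDFFF:
        # The surrogate check can be disabled only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width in (8, 32)
    return True
```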
+.
+.
+.SS "Escape sequences in character classes"
+.rs
+.sp
+All the sequences that define a single character value can be used both inside
+and outside character classes. In addition, inside a character class, \eb is
+interpreted as the backspace character (hex 08).
+.P
+When not followed by an opening brace, \eN is not allowed in a character class.
+\eB, \eR, and \eX are not special inside a character class. Like other
+unrecognized alphabetic escape sequences, they cause an error. Outside a
+character class, these sequences have different meanings.
+.
+.
+.SS "Unsupported escape sequences"
+.rs
+.sp
+In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
+handler and used to modify the case of following characters. By default, PCRE2
+does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
+is set, \eU matches a "U" character, and \eu can be used to define a character
+by code point, as described above.
+.
+.
+.SS "Absolute and relative backreferences"
+.rs
+.sp
+The sequence \eg followed by a signed or unsigned number, optionally enclosed
+in braces, is an absolute or relative backreference. A named backreference
+can be coded as \eg{name}. Backreferences are discussed
+.\" HTML <a href="#backreferences">
+.\" </a>
+later,
+.\"
+following the discussion of
+.\" HTML <a href="#subpattern">
+.\" </a>
+parenthesized subpatterns.
+.\"
+.
+.
+.SS "Absolute and relative subroutine calls"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a "subroutine". Details are discussed
+.\" HTML <a href="#onigurumasubroutines">
+.\" </a>
+later.
+.\"
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a backreference; the latter is a
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+subroutine
+.\"
+call.
+.
+.
+.\" HTML <a name="genericchartypes"></a>
+.SS "Generic character types"
+.rs
+.sp
+Another use of backslash is for specifying generic character types:
+.sp
+  \ed     any decimal digit
+  \eD     any character that is not a decimal digit
+  \eh     any horizontal white space character
+  \eH     any character that is not a horizontal white space character
+  \eN     any character that is not a newline
+  \es     any white space character
+  \eS     any character that is not a white space character
+  \ev     any vertical white space character
+  \eV     any character that is not a vertical white space character
+  \ew     any "word" character
+  \eW     any "non-word" character
+.sp
+The \eN escape sequence has the same meaning as
+.\" HTML <a href="#fullstopdot">
+.\" </a>
+the "." metacharacter
+.\"
+when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
+meaning of \eN. Note that when \eN is followed by an opening brace it has a
+different meaning. See the section entitled
+.\" HTML <a href="#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+above for details. Perl also uses \eN{name} to specify characters by Unicode
+name; PCRE2 does not support this.
+.P
+Each pair of lower and upper case escape sequences partitions the complete set
+of characters into two disjoint sets. Any given character matches one, and only
+one, of each pair. The sequences can appear both inside and outside character
+classes. They each match one character of the appropriate type. If the current
+matching point is at the end of the subject string, all of them fail, because
+there is no character to match.
+.P
+The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
+space (32), which are defined as white space in the "C" locale. This list may
+vary if locale-specific matching is taking place. For example, in some locales
+the "non-breaking space" character (\exA0) is recognized as white space, and in
+others the VT character is not.
+.P
+A "word" character is an underscore or any character that is a letter or digit.
+By default, the definition of letters and digits is controlled by PCRE2's
+low-valued character tables, and may vary if locale-specific matching is taking
+place (see
+.\" HTML <a href="pcre2api.html#localesupport">
+.\" </a>
+"Locale support"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page). For example, in a French locale such as "fr_FR" in Unix-like systems,
+or "french" in Windows, some character codes greater than 127 are used for
+accented letters, and these are then matched by \ew. The use of locales with
+Unicode is discouraged.
+.P
+By default, characters whose code points are greater than 127 never match \ed,
+\es, or \ew, and always match \eD, \eS, and \eW, although this may be different
+for characters in the range 128-255 when locale-specific matching is happening.
+These escape sequences retain their original meanings from before Unicode
+support was available, mainly for efficiency reasons. If the PCRE2_UCP option
+is set, the behaviour is changed so that Unicode properties are used to
+determine character types, as follows:
+.sp
+  \ed  any character that matches \ep{Nd} (decimal digit)
+  \es  any character that matches \ep{Z} or \eh or \ev
+  \ew  any character that matches \ep{L} or \ep{N}, plus underscore
+.sp
+The upper case escapes match the inverse sets of characters. Note that \ed
+matches only decimal digits, whereas \ew matches any Unicode digit, as well as
+any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb and
+\eB, because they are defined in terms of \ew and \eW. Matching these sequences
+is noticeably slower when PCRE2_UCP is set.
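+.P
+The difference between \ed and \ew under PCRE2_UCP can be illustrated with
+Python's unicodedata module, which exposes the same general categories. This
+is only a model: the function names are hypothetical, and Python's Unicode
+tables may be a different Unicode version from the one PCRE2 was built with.

```python
import unicodedata


def ucp_digit(ch):
    """Model of \\d under PCRE2_UCP: general category Nd only."""
    return unicodedata.category(ch) == "Nd"


def ucp_word(ch):
    """Model of \\w under PCRE2_UCP: any letter (L*) or number (N*), or "_"."""
    return ch == "_" or unicodedata.category(ch)[0] in "LN"
```

+For example, U+00B2 (superscript two) has category No, so in this model it
+counts as a word character but not as a decimal digit, whereas U+0663
+(Arabic-Indic digit three, category Nd) counts as both.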
+.P
+The sequences \eh, \eH, \ev, and \eV, in contrast to the other sequences, which
+match only ASCII characters by default, always match a specific list of code
+points, whether or not PCRE2_UCP is set. The horizontal space characters are:
+.sp
+  U+0009     Horizontal tab (HT)
+  U+0020     Space
+  U+00A0     Non-break space
+  U+1680     Ogham space mark
+  U+180E     Mongolian vowel separator
+  U+2000     En quad
+  U+2001     Em quad
+  U+2002     En space
+  U+2003     Em space
+  U+2004     Three-per-em space
+  U+2005     Four-per-em space
+  U+2006     Six-per-em space
+  U+2007     Figure space
+  U+2008     Punctuation space
+  U+2009     Thin space
+  U+200A     Hair space
+  U+202F     Narrow no-break space
+  U+205F     Medium mathematical space
+  U+3000     Ideographic space
+.sp
+The vertical space characters are:
+.sp
+  U+000A     Linefeed (LF)
+  U+000B     Vertical tab (VT)
+  U+000C     Form feed (FF)
+  U+000D     Carriage return (CR)
+  U+0085     Next line (NEL)
+  U+2028     Line separator
+  U+2029     Paragraph separator
+.sp
+In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
+are relevant.
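+.P
+The two fixed lists above translate directly into lookup sets. The following
+Python sketch is illustrative only (the names are hypothetical); it also shows
+the subset that remains relevant when code points are limited to values below
+256.

```python
# Code points matched by \h, copied from the list above.
HSPACE = frozenset(map(chr, [
    0x0009, 0x0020, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B),            # U+2000 (en quad) to U+200A (hair space)
    0x202F, 0x205F, 0x3000,
]))

# Code points matched by \v, copied from the list above.
VSPACE = frozenset(map(chr, [
    0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029,
]))


def below_256(chars):
    """Return the members that can occur in 8-bit non-UTF mode."""
    return frozenset(c for c in chars if ord(c) < 256)
```

+Here below_256(HSPACE) contains only HT, space, and the non-break space, and
+below_256(VSPACE) contains LF, VT, FF, CR, and NEL.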
+.
+.
+.\" HTML <a name="newlineseq"></a>
+.SS "Newline sequences"
+.rs
+.sp
+Outside a character class, by default, the escape sequence \eR matches any
+Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
+following:
+.sp
+  (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
+.sp
+This is an example of an "atomic group", details of which are given
+.\" HTML <a href="#atomicgroup">
+.\" </a>
+below.
+.\"
+This particular group matches either the two-character sequence CR followed by
+LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
+U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
+line, U+0085). Because this is an atomic group, the two-character sequence is
+treated as a single unit that cannot be split.
+.P
+In other modes, two additional characters whose code points are greater than 255
+are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
+Unicode support is not needed for these characters to be recognized.
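+.P
+An approximation of the default \eR can be written for Python's re module.
+This is a sketch, not an exact equivalent: the names are hypothetical, and an
+ordinary non-capturing group is used in place of the atomic group, so unlike
+\eR it can be backtracked into and split a CRLF pair when a later part of a
+pattern fails.

```python
import re

# Model of \R in 8-bit non-UTF mode (CRLF is tried first).
BSR_8BIT = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85)")

# Model of \R in the other modes, with LS and PS added.
BSR_WIDE = re.compile(r"(?:\r\n|\n|\x0b|\f|\r|\x85|\u2028|\u2029)")


def split_lines(text, wide=True):
    """Split text at every newline sequence recognized by the model."""
    return (BSR_WIDE if wide else BSR_8BIT).split(text)
```

+In this model the two-character sequence CRLF is consumed as one separator, so
+splitting "a", CR, LF, "b" gives two fields rather than three.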
+.P
+It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
+complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
+at compile time. (BSR is an abbreviation for "backslash R".) This can be made
+the default when PCRE2 is built; if this is the case, the other behaviour can
+be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
+these settings by starting a pattern string with one of the following
+sequences:
+.sp
+  (*BSR_ANYCRLF)   CR, LF, or CRLF only
+  (*BSR_UNICODE)   any Unicode newline sequence
+.sp
+These override the default and the options given to the compiling function.
+Note that these special settings, which are not Perl-compatible, are recognized
+only at the very start of a pattern, and that they must be in upper case. If
+more than one of them is present, the last one is used. They can be combined
+with a change of newline convention; for example, a pattern can start with:
+.sp
+  (*ANY)(*BSR_ANYCRLF)
+.sp
+They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
+character class, \eR is treated as an unrecognized escape sequence, and causes
+an error.
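+.P
+As a brief illustration, assuming an 8-bit, non-UTF-8 pattern, consider:
+.sp
+  (*BSR_ANYCRLF)a\eRb
+.sp
+This should match the subjects "a\enb", "a\erb", and "a\er\enb", but not an "a"
+and a "b" separated by VT or NEL, because those two characters are excluded by
+the setting. Without the leading (*BSR_ANYCRLF), and provided that
+PCRE2_BSR_ANYCRLF is neither set nor the build default, all five subjects
+should match.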
+.
+.
+.\" HTML <a name="uniextseq"></a>
+.SS Unicode character properties
+.rs
+.sp
+When PCRE2 is built with Unicode support (the default), three additional escape
+sequences that match characters with specific properties are available. In
+8-bit non-UTF-8 mode, these sequences are of course limited to testing
+characters whose code points are less than 256, but they do work in this mode.
+In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
+may be encountered. These are all treated as being in the Common script and
+with an unassigned type. The extra escape sequences are:
+.sp
+  \ep{\fIxx\fP}   a character with the \fIxx\fP property
+  \eP{\fIxx\fP}   a character without the \fIxx\fP property
+  \eX       a Unicode extended grapheme cluster
+.sp
+The property names represented by \fIxx\fP above are limited to the Unicode
+script names, the general category properties, "Any", which matches any
+character (including newline), and some special PCRE2 properties (described
+in the
+.\" HTML <a href="#extraprops">
+.\" </a>
+next section).
+.\"
+Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
+Note that \eP{Any} does not match any characters, so always causes a match
+failure.
+.P
+Sets of Unicode characters are defined as belonging to certain scripts. A
+character from one of these sets can be matched using a script name. For
+example:
+.sp
+  \ep{Greek}
+  \eP{Han}
+.sp
+Those that are not part of an identified script are lumped together as
+"Common". The current list of scripts is:
+.P
+Adlam,
+Ahom,
+Anatolian_Hieroglyphs,
+Arabic,
+Armenian,
+Avestan,
+Balinese,
+Bamum,
+Bassa_Vah,
+Batak,
+Bengali,
+Bhaiksuki,
+Bopomofo,
+Brahmi,
+Braille,
+Buginese,
+Buhid,
+Canadian_Aboriginal,
+Carian,
+Caucasian_Albanian,
+Chakma,
+Cham,
+Cherokee,
+Common,
+Coptic,
+Cuneiform,
+Cypriot,
+Cyrillic,
+Deseret,
+Devanagari,
+Dogra,
+Duployan,
+Egyptian_Hieroglyphs,
+Elbasan,
+Ethiopic,
+Georgian,
+Glagolitic,
+Gothic,
+Grantha,
+Greek,
+Gujarati,
+Gunjala_Gondi,
+Gurmukhi,
+Han,
+Hangul,
+Hanifi_Rohingya,
+Hanunoo,
+Hatran,
+Hebrew,
+Hiragana,
+Imperial_Aramaic,
+Inherited,
+Inscriptional_Pahlavi,
+Inscriptional_Parthian,
+Javanese,
+Kaithi,
+Kannada,
+Katakana,
+Kayah_Li,
+Kharoshthi,
+Khmer,
+Khojki,
+Khudawadi,
+Lao,
+Latin,
+Lepcha,
+Limbu,
+Linear_A,
+Linear_B,
+Lisu,
+Lycian,
+Lydian,
+Mahajani,
+Makasar,
+Malayalam,
+Mandaic,
+Manichaean,
+Marchen,
+Masaram_Gondi,
+Medefaidrin,
+Meetei_Mayek,
+Mende_Kikakui,
+Meroitic_Cursive,
+Meroitic_Hieroglyphs,
+Miao,
+Modi,
+Mongolian,
+Mro,
+Multani,
+Myanmar,
+Nabataean,
+New_Tai_Lue,
+Newa,
+Nko,
+Nushu,
+Ogham,
+Ol_Chiki,
+Old_Hungarian,
+Old_Italic,
+Old_North_Arabian,
+Old_Permic,
+Old_Persian,
+Old_Sogdian,
+Old_South_Arabian,
+Old_Turkic,
+Oriya,
+Osage,
+Osmanya,
+Pahawh_Hmong,
+Palmyrene,
+Pau_Cin_Hau,
+Phags_Pa,
+Phoenician,
+Psalter_Pahlavi,
+Rejang,
+Runic,
+Samaritan,
+Saurashtra,
+Sharada,
+Shavian,
+Siddham,
+SignWriting,
+Sinhala,
+Sogdian,
+Sora_Sompeng,
+Soyombo,
+Sundanese,
+Syloti_Nagri,
+Syriac,
+Tagalog,
+Tagbanwa,
+Tai_Le,
+Tai_Tham,
+Tai_Viet,
+Takri,
+Tamil,
+Tangut,
+Telugu,
+Thaana,
+Thai,
+Tibetan,
+Tifinagh,
+Tirhuta,
+Ugaritic,
+Vai,
+Warang_Citi,
+Yi,
+Zanabazar_Square.
+.P
+Each character has exactly one Unicode general category property, specified by
+a two-letter abbreviation. For compatibility with Perl, negation can be
+specified by including a circumflex between the opening brace and the property
+name. For example, \ep{^Lu} is the same as \eP{Lu}.
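+.P
+As a small sketch of how the two forms combine, the following pattern matches
+an upper case letter followed by one or more characters that are not upper
+case letters:
+.sp
+  \ep{Lu}\ep{^Lu}+
+.sp
+It should behave in the same way as \ep{Lu}\eP{Lu}+.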
+.P
+If only one letter is specified with \ep or \eP, it includes all the general
+category properties that start with that letter. In this case, in the absence
+of negation, the curly brackets in the escape sequence are optional; these two
+examples have the same effect:
+.sp
+  \ep{L}
+  \epL
+.sp
+The following general category property codes are supported:
+.sp
+  C     Other
+  Cc    Control
+  Cf    Format
+  Cn    Unassigned
+  Co    Private use
+  Cs    Surrogate
+.sp
+  L     Letter
+  Ll    Lower case letter
+  Lm    Modifier letter
+  Lo    Other letter
+  Lt    Title case letter
+  Lu    Upper case letter
+.sp
+  M     Mark
+  Mc    Spacing mark
+  Me    Enclosing mark
+  Mn    Non-spacing mark
+.sp
+  N     Number
+  Nd    Decimal number
+  Nl    Letter number
+  No    Other number
+.sp
+  P     Punctuation
+  Pc    Connector punctuation
+  Pd    Dash punctuation
+  Pe    Close punctuation
+  Pf    Final punctuation
+  Pi    Initial punctuation
+  Po    Other punctuation
+  Ps    Open punctuation
+.sp
+  S     Symbol
+  Sc    Currency symbol
+  Sk    Modifier symbol
+  Sm    Mathematical symbol
+  So    Other symbol
+.sp
+  Z     Separator
+  Zl    Line separator
+  Zp    Paragraph separator
+  Zs    Space separator
+.sp
+The special property L& is also supported: it matches a character that has
+the Lu, Ll, or Lt property, in other words, a letter that is not classified as
+a modifier or "other".
+.P
+The Cs (Surrogate) property applies only to characters in the range U+D800 to
+U+DFFF. Such characters are not valid in Unicode strings and so
+cannot be tested by PCRE2, unless UTF validity checking has been turned off
+(see the discussion of PCRE2_NO_UTF_CHECK in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+page). Perl does not support the Cs property.
+.P
+The long synonyms for property names that Perl supports (such as \ep{Letter})
+are not supported by PCRE2, nor is it permitted to prefix any of these
+properties with "Is".
+.P
+No character that is in the Unicode table has the Cn (unassigned) property.
+Instead, this property is assumed for any code point that is not in the
+Unicode table.
+.P
+Specifying caseless matching does not affect these escape sequences. For
+example, \ep{Lu} always matches only upper case letters. This is different from
+the behaviour of current versions of Perl.
+.P
+Matching characters by Unicode property is not fast, because PCRE2 has to do a
+multistage table lookup in order to find a character's property. That is why
+the traditional escape sequences such as \ed and \ew do not use Unicode
+properties in PCRE2 by default, though you can make them do so by setting the
+PCRE2_UCP option or by starting the pattern with (*UCP).
+.
+.
+.SS Extended grapheme clusters
+.rs
+.sp
+The \eX escape matches any number of Unicode characters that form an "extended
+grapheme cluster", and treats the sequence as an atomic group
+.\" HTML <a href="#atomicgroup">
+.\" </a>
+(see below).
+.\"
+Unicode supports various kinds of composite character by giving each character
+a grapheme breaking property, and having rules that use these properties to
+define the boundaries of extended grapheme clusters. The rules are defined in
+Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
+abandoned the use of some previous properties that had been used for emojis.
+Instead it introduced various emoji-specific properties. PCRE2 uses only the
+Extended Pictographic property.
+.P
+\eX always matches at least one character. Then it decides whether to add
+additional characters according to the following rules for ending a cluster:
+.P
+1. End at the end of the subject string.
+.P
+2. Do not end between CR and LF; otherwise end after any control character.
+.P
+3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
+are of five types: L, V, T, LV, and LVT. An L character may be followed by an
+L, V, LV, or LVT character; an LV or V character may be followed by a V or T
+character; an LVT or T character may be followed only by a T character.
+.P
+4. Do not end before extending characters or spacing marks or the "zero-width
+joiner" character. Characters with the "mark" property always have the
+"extend" grapheme breaking property.
+.P
+5. Do not end after prepend characters.
+.P
+6. Do not break within emoji modifier sequences or emoji zwj sequences. That
+is, do not break between characters with the Extended_Pictographic property.
+Extend and ZWJ characters are allowed between the characters.
+.P
+7. Do not break within emoji flag sequences. That is, do not break between
+regional indicator (RI) characters if there are an odd number of RI characters
+before the break point.
+.P
+8. Otherwise, end the cluster.
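+.P
+As an illustration of these rules, assuming a UTF mode, consider the pattern:
+.sp
+  ^\eX$
+.sp
+This should match a subject that consists of "e" followed by U+0301 (combining
+acute accent), because U+0301 is an extending character (rule 4), and also a
+subject that consists of CR followed by LF (rule 2). It should not match a
+subject that consists of three regional indicator characters: the first \eX
+takes the first two of them (rule 7), leaving the third to form a cluster of
+its own, so ^\eX\eX$ is needed.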
+.
+.
+.\" HTML <a name="extraprops"></a>
+.SS PCRE2's additional properties
+.rs
+.sp
+As well as the standard Unicode properties described above, PCRE2 supports four
+more that make it possible to convert traditional escape sequences such as \ew
+and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
+properties internally when PCRE2_UCP is set. However, they may also be used
+explicitly. These properties are:
+.sp
+  Xan   Any alphanumeric character
+  Xps   Any POSIX space character
+  Xsp   Any Perl space character
+  Xwd   Any Perl "word" character
+.sp
+Xan matches characters that have either the L (letter) or the N (number)
+property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
+carriage return, and any other character that has the Z (separator) property.
+Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
+compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
+underscore.
+.P
+There is another non-standard property, Xuc, which matches any character that
+can be represented by a Universal Character Name in C++ and other programming
+languages. These are the characters $, @, ` (grave accent), and all characters
+with Unicode code points greater than or equal to U+00A0, except for the
+surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
+excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
+where H is a hexadecimal digit. Note that the Xuc property does not match these
+sequences but the characters that they represent.)
+.
+.
+.\" HTML <a name="resetmatchstart"></a>
+.SS "Resetting the match start"
+.rs
+.sp
+In normal use, the escape sequence \eK causes any previously matched characters
+not to be included in the final matched sequence that is returned. For example,
+the pattern:
+.sp
+  foo\eKbar
+.sp
+matches "foobar", but reports that it has matched "bar". \eK does not interact
+with anchoring in any way. The pattern:
+.sp
+  ^foo\eKbar
+.sp
+matches only when the subject begins with "foobar" (in single line mode),
+though it again reports the matched string as "bar". This feature is similar to
+a lookbehind assertion
+.\" HTML <a href="#lookbehind">
+.\" </a>
+(described below).
+.\"
+However, in this case, the part of the subject before the real match does not
+have to be of fixed length, as lookbehind assertions do. The use of \eK does
+not interfere with the setting of
+.\" HTML <a href="#subpattern">
+.\" </a>
+captured substrings.
+.\"
+For example, when the pattern
+.sp
+  (foo)\eKbar
+.sp
+matches "foobar", the first substring is still set to "foo".
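+.P
+As a further sketch, the pattern
+.sp
+  \ew+=\eK\ed+
+.sp
+applied to the subject "price=42" should report the matched string as "42",
+even though the part before \eK ("price=") is matched by an item of variable
+length, which a lookbehind assertion would not allow.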
+.P
+Perl documents that the use of \eK within assertions is "not well defined". In
+PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
+ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
+matches, the reported start of the match can be greater than the end of the
+match. Using \eK in a lookbehind assertion at the start of a pattern can also
+lead to odd effects. For example, consider this pattern:
+.sp
+  (?<=\eKfoo)bar
+.sp
+If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
+offset of 3 succeeds and reports the matching string as "foobar", that is, the
+start of the reported match is earlier than where the match started.
+.
+.
+.\" HTML <a name="smallassertions"></a>
+.SS "Simple assertions"
+.rs
+.sp
+The final use of backslash is for certain simple assertions. An assertion
+specifies a condition that has to be met at a particular point in a match,
+without consuming any characters from the subject string. The use of
+subpatterns for more complicated assertions is described
+.\" HTML <a href="#bigassertions">
+.\" </a>
+below.
+.\"
+The backslashed assertions are:
+.sp
+  \eb     matches at a word boundary
+  \eB     matches when not at a word boundary
+  \eA     matches at the start of the subject
+  \eZ     matches at the end of the subject
+          also matches before a newline at the end of the subject
+  \ez     matches only at the end of the subject
+  \eG     matches at the first matching position in the subject
+.sp
+Inside a character class, \eb has a different meaning; it matches the backspace
+character. If any other of these assertions appears in a character class, an
+"invalid escape sequence" error is generated.
+.P
+A word boundary is a position in the subject string where the current character
+and the previous character do not both match \ew or \eW (i.e. one matches
+\ew and the other matches \eW), or the start or end of the string if the
+first or last character matches \ew, respectively. In a UTF mode, the meanings
+of \ew and \eW can be changed by setting the PCRE2_UCP option. When this is
+done, it also affects \eb and \eB. Neither PCRE2 nor Perl has a separate "start
+of word" or "end of word" metasequence. However, whatever follows \eb normally
+determines which it is. For example, the fragment \eba matches "a" at the start
+of a word.
+.P
+The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
+dollar (described in the next section) in that they only ever match at the very
+start and end of the subject string, whatever options are set. Thus, they are
+independent of multiline mode. These three assertions are not affected by the
+PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
+circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
+argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
+start at a point other than the beginning of the subject, \eA can never match.
+The difference between \eZ and \ez is that \eZ matches before a newline at the
+end of the string as well as at the very end, whereas \ez matches only at the
+end.
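+.P
+For example, assuming that LF is recognized as a newline and that the subject
+is "abc\en", compare these patterns:
+.sp
+  abc\eZ     matches: \eZ is true before the final newline
+  abc\ez     does not match: "c" is not at the very end
+  abc\en\ez   matches: the newline is consumed before \ez is tested
+.sp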
+.P
+The \eG assertion is true only when the current matching position is at the
+start point of the matching process, as specified by the \fIstartoffset\fP
+argument of \fBpcre2_match()\fP. It differs from \eA when the value of
+\fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
+with appropriate arguments, you can mimic Perl's /g option, and it is in this
+kind of implementation where \eG can be useful.
+.P
+Note, however, that PCRE2's implementation of \eG, being true at the starting
+character of the matching process, is subtly different from Perl's, which
+defines it as true at the end of the previous match. In Perl, these can be
+different when the previously matched string was empty. Because PCRE2 does just
+one match at a time, it cannot reproduce this behaviour.
+.P
+If all the alternatives of a pattern begin with \eG, the expression is anchored
+to the starting match position, and the "anchored" flag is set in the compiled
+regular expression.
+.
+.
+.SH "CIRCUMFLEX AND DOLLAR"
+.rs
+.sp
+The circumflex and dollar metacharacters are zero-width assertions. That is,
+they test for a particular condition being true without consuming any
+characters from the subject string. These two metacharacters are concerned with
+matching the starts and ends of lines. If the newline convention is set so that
+only the two-character sequence CRLF is recognized as a newline, isolated CR
+and LF characters are treated as ordinary data characters, and are not
+recognized as newlines.
+.P
+Outside a character class, in the default matching mode, the circumflex
+character is an assertion that is true only if the current matching point is at
+the start of the subject string. If the \fIstartoffset\fP argument of
+\fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
+never match if the PCRE2_MULTILINE option is unset. Inside a character class,
+circumflex has an entirely different meaning
+.\" HTML <a href="#characterclass">
+.\" </a>
+(see below).
+.\"
+.P
+Circumflex need not be the first character of the pattern if a number of
+alternatives are involved, but it should be the first thing in each alternative
+in which it appears if the pattern is ever to match that branch. If all
+possible alternatives start with a circumflex, that is, if the pattern is
+constrained to match only at the start of the subject, it is said to be an
+"anchored" pattern. (There are also other constructs that can cause a pattern
+to be anchored.)
+.P
+The dollar character is an assertion that is true only if the current matching
+point is at the end of the subject string, or immediately before a newline at
+the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
+that it does not actually match the newline. Dollar need not be the last
+character of the pattern if a number of alternatives are involved, but it
+should be the last item in any branch in which it appears. Dollar has no
+special meaning in a character class.
+.P
+The meaning of dollar can be changed so that it matches only at the very end of
+the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
+does not affect the \eZ assertion.
+.P
+The meanings of the circumflex and dollar metacharacters are changed if the
+PCRE2_MULTILINE option is set. When this is the case, a dollar character
+matches before any newlines in the string, as well as at the very end, and a
+circumflex matches immediately after internal newlines as well as at the start
+of the subject string. It does not match after a newline that ends the string,
+for compatibility with Perl. However, this can be changed by setting the
+PCRE2_ALT_CIRCUMFLEX option.
+.P
+For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
+\en represents a newline) in multiline mode, but not otherwise. Consequently,
+patterns that are anchored in single line mode because all branches start with
+^ are not anchored in multiline mode, and a match for circumflex is possible
+when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
+PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
+.P
+When the newline convention (see
+.\" HTML <a href="#newlines">
+.\" </a>
+"Newline conventions"
+.\"
+below) recognizes the two-character sequence CRLF as a newline, this is
+preferred, even if the single characters CR and LF are also recognized as
+newlines. For example, if the newline convention is "any", a multiline mode
+circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
+CR, even though CR on its own is a valid newline. (It also matches at the very
+start of the string, of course.)
+.P
+Note that the sequences \eA, \eZ, and \ez can be used to match the start and
+end of the subject in both modes, and if all branches of a pattern start with
+\eA it is always anchored, whether or not PCRE2_MULTILINE is set.
+.
+.
+.\" HTML <a name="fullstopdot"></a>
+.SH "FULL STOP (PERIOD, DOT) AND \eN"
+.rs
+.sp
+Outside a character class, a dot in the pattern matches any one character in
+the subject string except (by default) a character that signifies the end of a
+line.
+.P
+When a line ending is defined as a single character, dot never matches that
+character; when the two-character sequence CRLF is used, dot does not match CR
+if it is immediately followed by LF, but otherwise it matches all characters
+(including isolated CRs and LFs). When any Unicode line endings are being
+recognized, dot does not match CR or LF or any of the other line ending
+characters.
+.P
+The behaviour of dot with regard to newlines can be changed. If the
+PCRE2_DOTALL option is set, a dot matches any one character, without exception.
+If the two-character sequence CRLF is present in the subject string, it takes
+two dots to match it.
+.P
+The handling of dot is entirely independent of the handling of circumflex and
+dollar, the only relationship being that they both involve newlines. Dot has no
+special meaning in a character class.
+.P
+The escape sequence \eN when not followed by an opening brace behaves like a
+dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
+it matches any character except one that signifies the end of a line.
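+.P
+For example, assuming that LF is the newline convention and that the subject
+is "a\enc", compare these patterns:
+.sp
+  a.c     matches only if PCRE2_DOTALL is set
+  a\eNc    does not match, whether or not PCRE2_DOTALL is set
+.sp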
+.P
+When \eN is followed by an opening brace it has a different meaning. See the
+section entitled
+.\" HTML <a href="#digitsafterbackslash">
+.\" </a>
+"Non-printing characters"
+.\"
+above for details. Perl also uses \eN{name} to specify characters by Unicode
+name; PCRE2 does not support this.
+.
+.
+.SH "MATCHING A SINGLE CODE UNIT"
+.rs
+.sp
+Outside a character class, the escape sequence \eC matches any one code unit,
+whether or not a UTF mode is set. In the 8-bit library, one code unit is one
+byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
+32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
+feature is provided in Perl in order to match individual bytes in UTF-8 mode,
+but it is unclear how it can usefully be used.
+.P
+Because \eC breaks up characters into individual code units, matching one unit
+with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
+with a malformed UTF character. This has undefined results, because PCRE2
+assumes that it is matching character by character in a valid UTF string (by
+default it checks the subject string's validity at the start of processing
+unless the PCRE2_NO_UTF_CHECK option is used).
+.P
+An application can lock out the use of \eC by setting the
+PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
+build PCRE2 with the use of \eC permanently disabled.
+.P
+PCRE2 does not allow \eC to appear in lookbehind assertions
+.\" HTML <a href="#lookbehind">
+.\" </a>
+(described below)
+.\"
+in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
+the length of the lookbehind. Neither the alternative matching function
+\fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in these UTF modes.
+The former gives a match-time error; the latter fails to optimize and so the
+match is always run using the interpreter.
+.P
+In the 32-bit library, however, \eC is always supported (when not explicitly
+locked out) because it always matches a single code unit, whether or not UTF-32
+is specified.
+.P
+In general, the \eC escape sequence is best avoided. However, one way of using
+it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
+lookahead to check the length of the next character, as in this pattern, which
+could be used with a UTF-8 string (ignore white space and line breaks):
+.sp
+  (?| (?=[\ex00-\ex7f])(\eC) |
+      (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
+      (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
+      (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
+.sp
+In this example, a group that starts with (?| resets the capturing parentheses
+numbers in each alternative (see
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+"Duplicate Subpattern Numbers"
+.\"
+below). The assertions at the start of each branch check the next UTF-8
+character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
+character's individual bytes are then captured by the appropriate number of
+\eC groups.
+.
+.
+.\" HTML <a name="characterclass"></a>
+.SH "SQUARE BRACKETS AND CHARACTER CLASSES"
+.rs
+.sp
+An opening square bracket introduces a character class, terminated by a closing
+square bracket. A closing square bracket on its own is not special by default.
+If a closing square bracket is required as a member of the class, it should be
+the first data character in the class (after an initial circumflex, if present)
+or escaped with a backslash. This means that, by default, an empty class cannot
+be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
+square bracket at the start does end the (empty) class.
+.P
+A character class matches a single character in the subject. A matched
+character must be in the set of characters defined by the class, unless the
+first character in the class definition is a circumflex, in which case the
+subject character must not be in the set defined by the class. If a circumflex
+is actually required as a member of the class, ensure it is not the first
+character, or escape it with a backslash.
+.P
+For example, the character class [aeiou] matches any lower case vowel, while
+[^aeiou] matches any character that is not a lower case vowel. Note that a
+circumflex is just a convenient notation for specifying the characters that
+are in the class by enumerating those that are not. A class that starts with a
+circumflex is not an assertion; it still consumes a character from the subject
+string, and therefore it fails if the current pointer is at the end of the
+string.
+.P
+Characters in a class may be specified by their code points using \eo, \ex, or
+\eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
+class represent both their upper case and lower case versions, so for example,
+a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
+match "A", whereas a caseful version would.
+.P
+Characters that might indicate line breaks are never treated in any special way
+when matching character classes, whatever line-ending sequence is in use, and
+whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
+class such as [^a] always matches one of these characters.
+.P
+The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
+\eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
+characters that they match to the class. For example, [\edABCDEF] matches any
+hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
+\ed, \es, \ew and their upper case partners, just as it does when they appear
+outside a character class, as described in the section entitled
+.\" HTML <a href="#genericchartypes">
+.\" </a>
+"Generic character types"
+.\"
+above. The escape sequence \eb has a different meaning inside a character
+class; it matches the backspace character. The sequences \eB, \eR, and \eX are
+not special inside a character class. Like any other unrecognized escape
+sequences, they cause an error. The same is true for \eN when not followed by
+an opening brace.
+.P
+The minus (hyphen) character can be used to specify a range of characters in a
+character class. For example, [d-m] matches any letter between d and m,
+inclusive. If a minus character is required in a class, it must be escaped with
+a backslash or appear in a position where it cannot be interpreted as
+indicating a range, typically as the first or last character in the class,
+or immediately after a range. For example, [b-d-z] matches letters in the range
+b to d, a hyphen character, or z.
+.P
+Perl treats a hyphen as a literal if it appears before or after a POSIX class
+(see below) or before or after a character type escape such as \ed or \eH.
+However, unless the hyphen is the last character in the class, Perl outputs a
+warning in its warning mode, as this is most likely a user error. As PCRE2 has
+no facility for warning, an error is given in these cases.
+.P
+It is not possible to have the literal character "]" as the end character of a
+range. A pattern such as [W-]46] is interpreted as a class of two characters
+("W" and "-") followed by a literal string "46]", so it would match "W46]" or
+"-46]". However, if the "]" is escaped with a backslash it is interpreted as
+the end of range, so [W-\e]46] is interpreted as a class containing a range
+followed by two other characters. The octal or hexadecimal representation of
+"]" can also be used to end a range.
+.P
+Ranges normally include all code points between the start and end characters,
+inclusive. They can also be used for code points specified numerically, for
+example [\e000-\e037]. Ranges can include any characters that are valid for the
+current mode. In any UTF mode, the so-called "surrogate" characters (those
+whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
+explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
+this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
+surrogates, are always permitted.
+.P
+There is a special case in EBCDIC environments for ranges whose end points are
+both specified as literal letters in the same case. For compatibility with
+Perl, EBCDIC code points within the range that are not letters are omitted. For
+example, [h-k] matches only four characters, even though the codes for h and k
+are 0x88 and 0x92, a range of 11 code points. However, if the range is
+specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
+are included.
+.P
+If a range that includes letters is used when caseless matching is set, it
+matches the letters in either case. For example, [W-c] is equivalent to
+[][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
+tables for a French locale are in use, [\exc8-\excb] matches accented E
+characters in both cases.
+.P
+A circumflex can conveniently be used with the upper case character types to
+specify a more restricted set of characters than the matching lower case type.
+For example, the class [^\eW_] matches any letter or digit, but not underscore,
+whereas [\ew] includes underscore. A positive character class should be read as
+"something OR something OR ..." and a negative class as "NOT something AND NOT
+something AND NOT ...".
+.P
+The only metacharacters that are recognized in character classes are backslash,
+hyphen (only where it can be interpreted as specifying a range), circumflex
+(only at the start), opening square bracket (only when it can be interpreted as
+introducing a POSIX class name, or for a special compatibility feature - see
+the next two sections), and the terminating closing square bracket. However,
+escaping other non-alphanumeric characters does no harm.
+.
+.
+.SH "POSIX CHARACTER CLASSES"
+.rs
+.sp
+Perl supports the POSIX notation for character classes. This uses names
+enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
+this notation. For example,
+.sp
+  [01[:alpha:]%]
+.sp
+matches "0", "1", any alphabetic character, or "%". The supported class names
+are:
+.sp
+  alnum    letters and digits
+  alpha    letters
+  ascii    character codes 0 - 127
+  blank    space or tab only
+  cntrl    control characters
+  digit    decimal digits (same as \ed)
+  graph    printing characters, excluding space
+  lower    lower case letters
+  print    printing characters, including space
+  punct    printing characters, excluding letters and digits and space
+  space    white space (the same as \es from PCRE 8.34)
+  upper    upper case letters
+  word     "word" characters (same as \ew)
+  xdigit   hexadecimal digits
+.sp
+The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
+and space (32). If locale-specific matching is taking place, the list of space
+characters may be different; there may be fewer or more of them. "Space" and
+\es match the same set of characters.
+.P
+The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
+5.8. Another Perl extension is negation, which is indicated by a ^ character
+after the colon. For example,
+.sp
+  [12[:^digit:]]
+.sp
+matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
+syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
+supported, and an error is given if they are encountered.
+.P
+By default, characters with values greater than 127 do not match any of the
+POSIX character classes, although this may be different for characters in the
+range 128-255 when locale-specific matching is happening. However, if the
+PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
+changed so that Unicode character properties are used. This is achieved by
+replacing certain POSIX classes with other sequences, as follows:
+.sp
+  [:alnum:]  becomes  \ep{Xan}
+  [:alpha:]  becomes  \ep{L}
+  [:blank:]  becomes  \eh
+  [:cntrl:]  becomes  \ep{Cc}
+  [:digit:]  becomes  \ep{Nd}
+  [:lower:]  becomes  \ep{Ll}
+  [:space:]  becomes  \ep{Xps}
+  [:upper:]  becomes  \ep{Lu}
+  [:word:]   becomes  \ep{Xwd}
+.sp
+Negated versions, such as [:^alpha:], use \eP instead of \ep. Three other POSIX
+classes are handled specially in UCP mode:
+.TP 10
+[:graph:]
+This matches characters that have glyphs that mark the page when printed. In
+Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
+properties, except for:
+.sp
+  U+061C           Arabic Letter Mark
+  U+180E           Mongolian Vowel Separator
+  U+2066 - U+2069  Various "isolate"s
+.sp
+.TP 10
+[:print:]
+This matches the same characters as [:graph:] plus space characters that are
+not controls, that is, characters with the Zs property.
+.TP 10
+[:punct:]
+This matches all characters that have the Unicode P (punctuation) property,
+plus those characters with code points less than 256 that have the S (Symbol)
+property.
+.P
+The other POSIX classes are unchanged, and match only characters with code
+points less than 256.
+.
+.
+.SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
+.rs
+.sp
+In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
+syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
+word". PCRE2 treats these items as follows:
+.sp
+  [[:<:]]  is converted to  \eb(?=\ew)
+  [[:>:]]  is converted to  \eb(?<=\ew)
+.sp
+Only these exact character sequences are recognized. A sequence such as
+[a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
+not compatible with Perl. It is provided to help migrations from other
+environments, and is best not used in any new patterns. Note that \eb matches
+at the start and the end of a word (see
+.\" HTML <a href="#smallassertions">
+.\" </a>
+"Simple assertions"
+.\"
+above), and in a Perl-style pattern the preceding or following character
+normally shows which is wanted, without the need for the assertions that are
+used above in order to give exactly the POSIX behaviour.
+.
+.
+.SH "VERTICAL BAR"
+.rs
+.sp
+Vertical bar characters are used to separate alternative patterns. For example,
+the pattern
+.sp
+  gilbert|sullivan
+.sp
+matches either "gilbert" or "sullivan". Any number of alternatives may appear,
+and an empty alternative is permitted (matching the empty string). The matching
+process tries each alternative in turn, from left to right, and the first one
+that succeeds is used. If the alternatives are within a subpattern
+.\" HTML <a href="#subpattern">
+.\" </a>
+(defined below),
+.\"
+"succeeds" means matching the rest of the main pattern as well as the
+alternative in the subpattern.
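+.P
+As a rough illustration only, the left-to-right rule can be observed with
+Python's \fBre\fP module, which is used here as a stand-in because it follows
+the same rule for these simple patterns. It is not PCRE2, and the variable
+names are invented for the sketch.

```python
import re

# Either alternative is acceptable.
either = re.compile(r"gilbert|sullivan")

# Alternatives are tried left to right; at top level the first one that
# matches is used, even if a later one would match more text.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a group, "succeeds" includes the rest of the pattern: "a" cannot be
# followed by "c" here, so the second alternative is used instead.
needs_rest = re.match(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"cat(aract|erpillar|)", "cat").group(1)
```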
+.
+.
+.SH "INTERNAL OPTION SETTING"
+.rs
+.sp
+The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
+PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
+changed from within the pattern by a sequence of letters enclosed between "(?"
+and ")". These options are Perl-compatible, and are described in detail in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. The option letters are:
+.sp
+  i  for PCRE2_CASELESS
+  m  for PCRE2_MULTILINE
+  n  for PCRE2_NO_AUTO_CAPTURE
+  s  for PCRE2_DOTALL
+  x  for PCRE2_EXTENDED
+  xx for PCRE2_EXTENDED_MORE
+.sp
+For example, (?im) sets caseless, multiline matching. It is also possible to
+unset these options by preceding the relevant letters with a hyphen, for
+example (?-im). The two "extended" options are not independent; unsetting either
+one cancels the effects of both of them.
+.P
+A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
+and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
+permitted. Only one hyphen may appear in the options string. If a letter
+appears both before and after the hyphen, the option is unset. An empty options
+setting "(?)" is allowed. Needless to say, it has no effect.
+.P
+If the first character following (? is a circumflex, it causes all of the above
+options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
+the circumflex to cause some options to be re-instated, but a hyphen may not
+appear.
+.P
+The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
+the same way as the Perl-compatible options by using the characters J and U
+respectively. However, these are not unset by (?^).
+.P
+When one of these option changes occurs at top level (that is, not inside
+subpattern parentheses), the change applies to the remainder of the pattern
+that follows. An option change within a subpattern (see below for a description
+of subpatterns) affects only that part of the subpattern that follows it, so
+.sp
+  (a(?i)b)c
+.sp
+matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used).
+By this means, options can be made to have different settings in different
+parts of the pattern. Any changes made in one alternative do carry on
+into subsequent branches within the same subpattern. For example,
+.sp
+  (a(?i)b|c)
+.sp
+matches "ab", "aB", "c", and "C", even though when matching "C" the first
+branch is abandoned before the option setting. This is because the effects of
+option settings happen at compile time. There would be some very weird
+behaviour otherwise.
+.P
+As a convenient shorthand, if any option settings are required at the start of
+a non-capturing subpattern (see the next section), the option letters may
+appear between the "?" and the ":". Thus the two patterns
+.sp
+  (?i:saturday|sunday)
+  (?:(?i)saturday|sunday)
+.sp
+match exactly the same set of strings.
+.P
+\fBNote:\fP There are other PCRE2-specific options that can be set by the
+application when the compiling function is called. The pattern can contain
+special leading sequences such as (*CRLF) to override what the application has
+set or what has been defaulted. Details are given in the section entitled
+.\" HTML <a href="#newlineseq">
+.\" </a>
+"Newline sequences"
+.\"
+above. There are also the (*UTF) and (*UCP) leading sequences that can be used
+to set UTF and Unicode property modes; they are equivalent to setting the
+PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
+the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the
+(*UTF) and (*UCP) sequences.
+.
+.
+.\" HTML <a name="subpattern"></a>
+.SH SUBPATTERNS
+.rs
+.sp
+Subpatterns are delimited by parentheses (round brackets), which can be nested.
+Turning part of a pattern into a subpattern does two things:
+.sp
+1. It localizes a set of alternatives. For example, the pattern
+.sp
+  cat(aract|erpillar|)
+.sp
+matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
+match "cataract", "erpillar" or an empty string.
+.sp
+2. It sets up the subpattern as a capturing subpattern. This means that, when
+the whole pattern matches, the portion of the subject string that matched the
+subpattern is passed back to the caller, separately from the portion that
+matched the whole pattern. (This applies only to the traditional matching
+function; the DFA matching function does not support capturing.)
+.P
+Opening parentheses are counted from left to right (starting from 1) to obtain
+numbers for the capturing subpatterns. For example, if the string "the red
+king" is matched against the pattern
+.sp
+  the ((red|white) (king|queen))
+.sp
+the captured substrings are "red king", "red", and "king", and are numbered 1,
+2, and 3, respectively.
+.P
+The fact that plain parentheses fulfil two functions is not always helpful.
+There are often times when a grouping subpattern is required without a
+capturing requirement. If an opening parenthesis is followed by a question mark
+and a colon, the subpattern does not do any capturing, and is not counted when
+computing the number of any subsequent capturing subpatterns. For example, if
+the string "the white queen" is matched against the pattern
+.sp
+  the ((?:red|white) (king|queen))
+.sp
+the captured substrings are "white queen" and "queen", and are numbered 1 and
+2. The maximum number of capturing subpatterns is 65535.
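+.P
+The two numbering examples above can be checked with a short sketch. It
+uses Python's \fBre\fP module rather than PCRE2, on the assumption that both
+count capturing parentheses the same way for these patterns; the variable
+names are illustrative.

```python
import re

# Every opening parenthesis captures and is counted from the left.
numbered = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
captures_all = numbered.groups()

# (?: ... ) groups without capturing, so it is skipped when numbering.
grouped = re.fullmatch(r"the ((?:red|white) (king|queen))", "the white queen")
captures_some = grouped.groups()
```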
+.P
+As a convenient shorthand, if any option settings are required at the start of
+a non-capturing subpattern, the option letters may appear between the "?" and
+the ":". Thus the two patterns
+.sp
+  (?i:saturday|sunday)
+  (?:(?i)saturday|sunday)
+.sp
+match exactly the same set of strings. Because alternative branches are tried
+from left to right, and options are not reset until the end of the subpattern
+is reached, an option setting in one branch does affect subsequent branches, so
+the above patterns match "SUNDAY" as well as "Saturday".
+.
+.
+.\" HTML <a name="dupsubpatternnumber"></a>
+.SH "DUPLICATE SUBPATTERN NUMBERS"
+.rs
+.sp
+Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
+the same numbers for its capturing parentheses. Such a subpattern starts with
+(?| and is itself a non-capturing subpattern. For example, consider this
+pattern:
+.sp
+  (?|(Sat)ur|(Sun))day
+.sp
+Because the two alternatives are inside a (?| group, both sets of capturing
+parentheses are numbered one. Thus, when the pattern matches, you can look
+at captured substring number one, whichever alternative matched. This construct
+is useful when you want to capture part, but not all, of one of a number of
+alternatives. Inside a (?| group, parentheses are numbered as usual, but the
+number is reset at the start of each branch. The numbers of any capturing
+parentheses that follow the subpattern start after the highest number used in
+any branch. The following example is taken from the Perl documentation. The
+numbers underneath show in which buffer the captured content will be stored.
+.sp
+  # before  ---------------branch-reset----------- after
+  / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+  # 1            2         2  3        2     3     4
+.sp
+A backreference to a numbered subpattern uses the most recent value that is
+set for that number by any subpattern. The following pattern matches "abcabc"
+or "defdef":
+.sp
+  /(?|(abc)|(def))\e1/
+.sp
+In contrast, a subroutine call to a numbered subpattern always refers to the
+first one in the pattern with the given number. The following pattern matches
+"abcabc" or "defabc":
+.sp
+  /(?|(abc)|(def))(?1)/
+.sp
+A relative reference such as (?-1) is no different: it is just a convenient way
+of computing an absolute group number.
+.P
+If a
+.\" HTML <a href="#conditions">
+.\" </a>
+condition test
+.\"
+for a subpattern's having matched refers to a non-unique number, the test is
+true if any of the subpatterns of that number have matched.
+.P
+An alternative approach to using this "branch reset" feature is to use
+duplicate named subpatterns, as described in the next section.
+.
+.
+.SH "NAMED SUBPATTERNS"
+.rs
+.sp
+Identifying capturing parentheses by number is simple, but it can be very hard
+to keep track of the numbers in complicated patterns. Furthermore, if an
+expression is modified, the numbers may change. To help with this difficulty,
+PCRE2 supports the naming of capturing subpatterns. This feature was not added
+to Perl until release 5.10. Python had the feature earlier, and PCRE1
+introduced it at release 4.0, using the Python syntax. PCRE2 supports both the
+Perl and the Python syntax.
+.P
+In PCRE2, a capturing subpattern can be named in one of three ways:
+(?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names
+consist of up to 32 alphanumeric characters and underscores, but must start
+with a non-digit. References to capturing parentheses from other parts of the
+pattern, such as
+.\" HTML <a href="#backreferences">
+.\" </a>
+backreferences,
+.\"
+.\" HTML <a href="#recursion">
+.\" </a>
+recursion,
+.\"
+and
+.\" HTML <a href="#conditions">
+.\" </a>
+conditions,
+.\"
+can all be made by name as well as by number.
+.P
+Named capturing parentheses are allocated numbers as well as names, exactly as
+if the names were not present. In both PCRE2 and Perl, capturing subpatterns
+are primarily identified by numbers; any names are just aliases for these
+numbers. The PCRE2 API provides function calls for extracting the complete
+name-to-number translation table from a compiled pattern, as well as
+convenience functions for extracting captured substrings by name.
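+.P
+The idea that names are aliases for numbers can be sketched with Python's
+\fBre\fP module, which accepts the Python-style (?P<name>...) syntax. This is
+an analogy, not the PCRE2 API: \fBgroupindex\fP is Python's name-to-number
+table, and the pattern and variable names are made up for the example.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# Named groups still receive ordinary numbers, counted from the left.
name_table = dict(date.groupindex)

m = date.fullmatch("2018-09")
by_name = m.group("year")    # lookup through the alias
by_number = m.group(1)       # lookup through the underlying number
```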
+.P
+\fBWarning:\fP When more than one subpattern has the same number, as described
+in the previous section, a name given to one of them applies to all of them.
+Perl allows identically numbered subpatterns to have different names. Consider
+this pattern, where there are two capturing subpatterns, both numbered 1:
+.sp
+  (?|(?<AA>aa)|(?<BB>bb))
+.sp
+Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
+a successful match, both names yield the same value (either "aa" or "bb").
+.P
+In an attempt to reduce confusion, PCRE2 does not allow the same group number
+to be associated with more than one name. The example above provokes a
+compile-time error. However, there is still scope for confusion. Consider this
+pattern:
+.sp
+  (?|(?<AA>aa)|(bb))
+.sp
+Although the second subpattern number 1 is not explicitly named, the name AA is
+still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a
+reference by name to group AA yields the matched string.
+.P
+By default, a name must be unique within a pattern, except that duplicate names
+are permitted for subpatterns with the same number, for example:
+.sp
+  (?|(?<AA>aa)|(?<AA>bb))
+.sp
+The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
+option at compile time, or by the use of (?J) within the pattern. Duplicate
+names can be useful for patterns where only one instance of the named
+parentheses can match. Suppose you want to match the name of a weekday, either
+as a 3-letter abbreviation or as the full name, and in both cases you want to
+extract the abbreviation. This pattern (ignoring the line breaks) does the job:
+.sp
+  (?<DN>Mon|Fri|Sun)(?:day)?|
+  (?<DN>Tue)(?:sday)?|
+  (?<DN>Wed)(?:nesday)?|
+  (?<DN>Thu)(?:rsday)?|
+  (?<DN>Sat)(?:urday)?
+.sp
+There are five capturing substrings, but only one is ever set after a match.
+The convenience functions for extracting the data by name return the substring
+for the first (and in this example, the only) subpattern of that name that
+matched. This saves searching to find which numbered subpattern it was. (An
+alternative way of solving this problem is to use a "branch reset" subpattern,
+as described in the previous section.)
+.P
+If you make a backreference to a non-unique named subpattern from elsewhere in
+the pattern, the subpatterns to which the name refers are checked in the order
+in which they appear in the overall pattern. The first one that is set is used
+for the reference. For example, this pattern matches both "foofoo" and
+"barbar" but not "foobar" or "barfoo":
+.sp
+  (?:(?<n>foo)|(?<n>bar))\ek<n>
+.sp
+.P
+If you make a subroutine call to a non-unique named subpattern, the one that
+corresponds to the first occurrence of the name is used. In the absence of
+duplicate numbers this is the one with the lowest number.
+.P
+If you use a named reference in a condition
+test (see the
+.\"
+.\" HTML <a href="#conditions">
+.\" </a>
+section about conditions
+.\"
+below), either to check whether a subpattern has matched, or to check for
+recursion, all subpatterns with the same name are tested. If the condition is
+true for any one of them, the overall condition is true. This is the same
+behaviour as testing by number. For further details of the interfaces for
+handling named subpatterns, see the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.
+.
+.SH REPETITION
+.rs
+.sp
+Repetition is specified by quantifiers, which can follow any of the following
+items:
+.sp
+  a literal data character
+  the dot metacharacter
+  the \eC escape sequence
+  the \eX escape sequence
+  the \eR escape sequence
+  an escape such as \ed or \epL that matches a single character
+  a character class
+  a backreference
+  a parenthesized subpattern (including most assertions)
+  a subroutine call to a subpattern (recursive or otherwise)
+.sp
+The general repetition quantifier specifies a minimum and maximum number of
+permitted matches, by giving the two numbers in curly brackets (braces),
+separated by a comma. The numbers must be less than 65536, and the first must
+be less than or equal to the second. For example:
+.sp
+  z{2,4}
+.sp
+matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
+character. If the second number is omitted, but the comma is present, there is
+no upper limit; if the second number and the comma are both omitted, the
+quantifier specifies an exact number of required matches. Thus
+.sp
+  [aeiou]{3,}
+.sp
+matches at least 3 successive vowels, but may match many more, whereas
+.sp
+  \ed{8}
+.sp
+matches exactly 8 digits. An opening curly bracket that appears in a position
+where a quantifier is not allowed, or one that does not match the syntax of a
+quantifier, is taken as a literal character. For example, {,6} is not a
+quantifier, but a literal string of four characters.
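+.P
+A minimal sketch of the three brace forms, written with Python's \fBre\fP
+module as a stand-in for PCRE2. The {,6} case is deliberately left out,
+because Python treats it as a quantifier with a minimum of zero, unlike the
+behaviour described here. The subject strings are invented.

```python
import re

# {2,4}: between two and four repetitions.
candidates = ("z", "zz", "zzz", "zzzz", "zzzzz")
bounded = [s for s in candidates if re.fullmatch(r"z{2,4}", s)]

# {3,}: at least three, with no upper limit (greedy, so it takes the whole run).
open_ended = re.search(r"[aeiou]{3,}", "queueing").group()

# {8}: exactly eight.
exact = re.fullmatch(r"\d{8}", "20180904") is not None
too_short = re.fullmatch(r"\d{8}", "2018090") is not None
```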
+.P
+In UTF modes, quantifiers apply to characters rather than to individual code
+units. Thus, for example, \ex{100}{2} matches two characters, each of
+which is represented by a two-byte sequence in a UTF-8 string. Similarly,
+\eX{3} matches three Unicode extended grapheme clusters, each of which may be
+several code units long (and they may be of different lengths).
+.P
+The quantifier {0} is permitted, causing the expression to behave as if the
+previous item and the quantifier were not present. This may be useful for
+subpatterns that are referenced as
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+subroutines
+.\"
+from elsewhere in the pattern (but see also the section entitled
+.\" HTML <a href="#subdefine">
+.\" </a>
+"Defining subpatterns for use by reference only"
+.\"
+below). Items other than subpatterns that have a {0} quantifier are omitted
+from the compiled pattern.
+.P
+For convenience, the three most common quantifiers have single-character
+abbreviations:
+.sp
+  *    is equivalent to {0,}
+  +    is equivalent to {1,}
+  ?    is equivalent to {0,1}
+.sp
+It is possible to construct infinite loops by following a subpattern that can
+match no characters with a quantifier that has no upper limit, for example:
+.sp
+  (a?)*
+.sp
+Earlier versions of Perl and PCRE1 used to give an error at compile time for
+such patterns. However, because there are cases where this can be useful, such
+patterns are now accepted, but if any repetition of the subpattern does in fact
+match no characters, the loop is forcibly broken.
+.P
+By default, the quantifiers are "greedy", that is, they match as much as
+possible (up to the maximum number of permitted times), without causing the
+rest of the pattern to fail. The classic example of where this gives problems
+is in trying to match comments in C programs. These appear between /* and */
+and within the comment, individual * and / characters may appear. An attempt to
+match C comments by applying the pattern
+.sp
+  /\e*.*\e*/
+.sp
+to the string
+.sp
+  /* first comment */  not comment  /* second comment */
+.sp
+fails, because it matches the entire string owing to the greediness of the .*
+item.
+.P
+If a quantifier is followed by a question mark, it ceases to be greedy, and
+instead matches the minimum number of times possible, so the pattern
+.sp
+  /\e*.*?\e*/
+.sp
+does the right thing with the C comments. The meaning of the various
+quantifiers is not otherwise changed, just the preferred number of matches.
+Do not confuse this use of question mark with its use as a quantifier in its
+own right. Because it has two uses, it can sometimes appear doubled, as in
+.sp
+  \ed??\ed
+.sp
+which matches one digit by preference, but can match two if that is the only
+way the rest of the pattern matches.
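+.P
+The difference between the greedy and minimizing forms can be seen in a
+short sketch. It uses Python's \fBre\fP module, assumed to behave like PCRE2
+for these particular patterns; the variable names are illustrative.

```python
import re

source = "/* first comment */  not comment  /* second comment */"

# Greedy .* runs to the last "*/", swallowing everything in between.
greedy = re.search(r"/\*.*\*/", source).group()

# Minimizing .*? stops at the first "*/" it can reach.
lazy = re.findall(r"/\*.*?\*/", source)

# \d??\d prefers one digit, but takes two when the rest of the match needs it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```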
+.P
+If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
+the quantifiers are not greedy by default, but individual ones can be made
+greedy by following them with a question mark. In other words, it inverts the
+default behaviour.
+.P
+When a parenthesized subpattern is quantified with a minimum repeat count that
+is greater than 1 or with a limited maximum, more memory is required for the
+compiled pattern, in proportion to the size of the minimum or maximum.
+.P
+If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
+to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
+implicitly anchored, because whatever follows will be tried against every
+character position in the subject string, so there is no point in retrying the
+overall match at any position after the first. PCRE2 normally treats such a
+pattern as though it were preceded by \eA.
+.P
+In cases where it is known that the subject string contains no newlines, it is
+worth setting PCRE2_DOTALL in order to obtain this optimization, or
+alternatively, using ^ to indicate anchoring explicitly.
+.P
+However, there are some cases where the optimization cannot be used. When .*
+is inside capturing parentheses that are the subject of a backreference
+elsewhere in the pattern, a match at the start may fail where a later one
+succeeds. Consider, for example:
+.sp
+  (.*)abc\e1
+.sp
+If the subject is "xyz123abc123" the match point is the fourth character. For
+this reason, such a pattern is not implicitly anchored.
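+.P
+The example can be reproduced with Python's \fBre\fP module. This is only a
+sketch: Python is not PCRE2, and the sketch does not depend on whether Python
+applies any similar anchoring optimization. It shows that the only match
+starts at the fourth character, so treating the pattern as anchored would
+lose it.

```python
import re

subject = "xyz123abc123"

# Unanchored search: attempts at offsets 0, 1 and 2 fail because the text
# after "abc" does not repeat what (.*) captured.
found = re.search(r"(.*)abc\1", subject, re.DOTALL)
start = found.start()        # zero-based offset of the match
captured = found.group(1)

# Forcing the match to begin at offset 0 finds nothing.
anchored = re.match(r"(.*)abc\1", subject, re.DOTALL)
```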
+.P
+Another case where implicit anchoring is not applied is when the leading .* is
+inside an atomic group. Once again, a match at the start may fail where a later
+one succeeds. Consider this pattern:
+.sp
+  (?>.*?a)b
+.sp
+It matches "ab" in the subject "aab". The use of the backtracking control verbs
+(*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
+PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
+.P
+When a capturing subpattern is repeated, the value captured is the substring
+that matched the final iteration. For example, after
+.sp
+  (tweedle[dume]{3}\es*)+
+.sp
+has matched "tweedledum tweedledee" the value of the captured substring is
+"tweedledee". However, if there are nested capturing subpatterns, the
+corresponding captured values may have been set in previous iterations. For
+example, after
+.sp
+  (a|(b))+
+.sp
+matches "aba" the value of the second captured substring is "b".
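+.P
+Both cases can be checked with a sketch in Python's \fBre\fP module, which is
+assumed here to keep captured values across iterations in the same way; it
+is a stand-in, not PCRE2.

```python
import re

# The outer group reports only its final iteration.
last = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee").group(1)

# Group 2 took part in the second iteration only and keeps that value,
# although the final iteration matched "a" through the first alternative.
nested = re.fullmatch(r"(a|(b))+", "aba").groups()
```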
+.
+.
+.\" HTML <a name="atomicgroup"></a>
+.SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
+.rs
+.sp
+With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+repetition, failure of what follows normally causes the repeated item to be
+re-evaluated to see if a different number of repeats allows the rest of the
+pattern to match. Sometimes it is useful to prevent this, either to change the
+nature of the match, or to cause it to fail earlier than it otherwise might, when
+the author of the pattern knows there is no point in carrying on.
+.P
+Consider, for example, the pattern \ed+foo when applied to the subject line
+.sp
+  123456bar
+.sp
+After matching all 6 digits and then failing to match "foo", the normal
+action of the matcher is to try again with only 5 digits matching the \ed+
+item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
+(a term taken from Jeffrey Friedl's book) provides the means for specifying
+that once a subpattern has matched, it is not to be re-evaluated in this way.
+.P
+If we use atomic grouping for the previous example, the matcher gives up
+immediately on failing to match "foo" the first time. The notation is a kind of
+special parenthesis, starting with (?> as in this example:
+.sp
+  (?>\ed+)foo
+.sp
+This kind of parenthesis "locks up" the part of the pattern it contains once
+it has matched, and a failure further into the pattern is prevented from
+backtracking into it. Backtracking past it to previous items, however, works as
+normal.
+.P
+An alternative description is that a subpattern of this type matches exactly
+the string of characters that an identical standalone pattern would match, if
+anchored at the current point in the subject string.
+.P
+Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
+the above example can be thought of as a maximizing repeat that must swallow
+everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
+number of digits they match in order to make the rest of the pattern match,
+(?>\ed+) can only match an entire sequence of digits.
+.P
+Atomic groups in general can of course contain arbitrarily complicated
+subpatterns, and can be nested. However, when the subpattern for an atomic
+group is just a single repeated item, as in the example above, a simpler
+notation, called a "possessive quantifier" can be used. This consists of an
+additional + character following a quantifier. Using this notation, the
+previous example can be rewritten as
+.sp
+  \ed++foo
+.sp
+Note that a possessive quantifier can be used with an entire group, for
+example:
+.sp
+  (abc|xyz){2,3}+
+.sp
+Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
+option is ignored. They are a convenient notation for the simpler forms of
+atomic group. However, there is no difference in the meaning of a possessive
+quantifier and the equivalent atomic group, though there may be a performance
+difference; possessive quantifiers should be slightly faster.
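+.P
+A hedged sketch of the effect, using Python's \fBre\fP module in place of
+PCRE2. Python accepts (?>...) and possessive quantifiers only from version
+3.11, so the sketch checks the version first; \fBcompare\fP and
+\fBATOMIC_SUPPORTED\fP are names invented for the example.

```python
import re
import sys

# Atomic groups and possessive quantifiers were added to Python's re in 3.11.
ATOMIC_SUPPORTED = sys.version_info >= (3, 11)


def compare(subject):
    """Return (plain, atomic, possessive) match results for digits + one digit."""
    # Plain \d+ gives back one digit so that the final \d can match.
    plain = re.fullmatch(r"\d+\d", subject) is not None
    if not ATOMIC_SUPPORTED:
        return plain, None, None
    # The atomic group keeps every digit it took, so the final \d has nothing left.
    atomic = re.fullmatch(r"(?>\d+)\d", subject) is not None
    # The possessive quantifier is shorthand for the same thing.
    possessive = re.fullmatch(r"\d++\d", subject) is not None
    return plain, atomic, possessive


result = compare("123")
```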
+.P
+The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
+Jeffrey Friedl originated the idea (and the name) in the first edition of his
+book. Mike McCloskey liked it, so implemented it when he built Sun's Java
+package, and PCRE1 copied it from there. It ultimately found its way into Perl
+at release 5.10.
+.P
+PCRE2 has an optimization that automatically "possessifies" certain simple
+pattern constructs. For example, the sequence A+B is treated as A++B because
+there is no point in backtracking into a sequence of A's when B must follow.
+This feature can be disabled by the PCRE2_NO_AUTO_POSSESS option, or by starting
+the pattern with (*NO_AUTO_POSSESS).
+.P
+When a pattern contains an unlimited repeat inside a subpattern that can itself
+be repeated an unlimited number of times, the use of an atomic group is the
+only way to avoid some failing matches taking a very long time indeed. The
+pattern
+.sp
+  (\eD+|<\ed+>)*[!?]
+.sp
+matches an unlimited number of substrings that either consist of non-digits, or
+digits enclosed in <>, followed by either ! or ?. When it matches, it runs
+quickly. However, if it is applied to
+.sp
+  aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+.sp
+it takes a long time before reporting failure. This is because the string can
+be divided between the internal \eD+ repeat and the external * repeat in a
+large number of ways, and all have to be tried. (The example uses [!?] rather
+than a single character at the end, because both PCRE2 and Perl have an
+optimization that allows for fast failure when a single character is used. They
+remember the last single character that is required for a match, and fail early
+if it is not present in the string.) If the pattern is changed so that it uses
+an atomic group, like this:
+.sp
+  ((?>\eD+)|<\ed+>)*[!?]
+.sp
+sequences of non-digits cannot be broken, and failure happens quickly.
+.
+.
+.\" HTML <a name="backreferences"></a>
+.SH "BACKREFERENCES"
+.rs
+.sp
+Outside a character class, a backslash followed by a digit greater than 0 (and
+possibly further digits) is a backreference to a capturing subpattern earlier
+(that is, to its left) in the pattern, provided there have been that many
+previous capturing left parentheses.
+.P
+However, if the decimal number following the backslash is less than 8, it is
+always taken as a backreference, and causes an error only if there are not
+that many capturing left parentheses in the entire pattern. In other words, the
+parentheses that are referenced need not be to the left of the reference for
+numbers less than 8. A "forward backreference" of this type can make sense
+when a repetition is involved and the subpattern to the right has participated
+in an earlier iteration.
+.P
+It is not possible to have a numerical "forward backreference" to a subpattern
+whose number is 8 or more using this syntax because a sequence such as \e50 is
+interpreted as a character defined in octal. See the subsection entitled
+"Non-printing characters"
+.\" HTML <a href="#digitsafterbackslash">
+.\" </a>
+above
+.\"
+for further details of the handling of digits following a backslash. There is
+no such problem when named parentheses are used. A backreference to any
+subpattern is possible using named parentheses (see below).
+.P
+Another way of avoiding the ambiguity inherent in the use of digits following a
+backslash is to use the \eg escape sequence. This escape must be followed by a
+signed or unsigned number, optionally enclosed in braces. These examples are
+all identical:
+.sp
+  (ring), \e1
+  (ring), \eg1
+  (ring), \eg{1}
+.sp
+An unsigned number specifies an absolute reference without the ambiguity that
+is present in the older syntax. It is also useful when literal digits follow
+the reference. A signed number is a relative reference. Consider this example:
+.sp
+  (abc(def)ghi)\eg{-1}
+.sp
+The sequence \eg{-1} is a reference to the most recently started capturing
+subpattern before \eg, that is, it is equivalent to \e2 in this example.
+Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
+can be helpful in long patterns, and also in patterns that are created by
+joining together fragments that contain references within themselves.
+.P
+The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
+of forward reference can be useful in patterns that repeat. Perl does not
+support the use of + in this way.
+.P
+A backreference matches whatever actually matched the capturing subpattern in
+the current subject string, rather than anything matching the subpattern
+itself (see
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+"Subpatterns as subroutines"
+.\"
+below for a way of doing that). So the pattern
+.sp
+  (sens|respons)e and \e1ibility
+.sp
+matches "sense and sensibility" and "response and responsibility", but not
+"sense and responsibility". If caseful matching is in force at the time of the
+backreference, the case of letters is relevant. For example,
+.sp
+  ((?i)rah)\es+\e1
+.sp
+matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
+capturing subpattern is matched caselessly.
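+.P
+The first example can be tried with Python's \fBre\fP module, used as a
+stand-in on the assumption that numbered backreferences behave the same way
+for this pattern. It shows that the reference repeats the text that was
+matched, not the subpattern.

```python
import re

pattern = re.compile(r"(sens|respons)e and \1ibility")

same_sens = pattern.fullmatch("sense and sensibility") is not None
same_respons = pattern.fullmatch("response and responsibility") is not None

# "respons" satisfies the subpattern, but it is not the text that was captured.
mixed = pattern.fullmatch("sense and responsibility") is not None
```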
+.P
+There are several different ways of writing backreferences to named
+subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
+\ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
+backreference syntax, in which \eg can be used for both numeric and named
+references, is also supported. We could rewrite the above example in any of
+the following ways:
+.sp
+  (?<p1>(?i)rah)\es+\ek<p1>
+  (?'p1'(?i)rah)\es+\ek{p1}
+  (?P<p1>(?i)rah)\es+(?P=p1)
+  (?<p1>(?i)rah)\es+\eg{p1}
+.sp
+A subpattern that is referenced by name may appear in the pattern before or
+after the reference.
+.P
+There may be more than one backreference to the same subpattern. If a
+subpattern has not actually been used in a particular match, any backreferences
+to it always fail by default. For example, the pattern
+.sp
+  (a|(bc))\e2
+.sp
+always fails if it starts to match "a" rather than "bc". However, if the
+PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
+unset value matches an empty string.
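+.P
+The default behaviour for an unset group can be sketched as follows. The
+script uses Python's \fBre\fP module as a stand-in engine; it has no
+counterpart of PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.

```python
import re

unset_ref = re.compile(r"(a|(bc))\2")

# Group 2 is set when the "bc" alternative is taken, so \2 can match.
took_bc = unset_ref.fullmatch("bcbc")

# When the "a" alternative is taken, group 2 is unset and \2 fails.
took_a = unset_ref.match("aa")

print(took_bc is not None, took_a is None)
```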
+.P
+Because there may be many capturing parentheses in a pattern, all digits
+following a backslash are taken as part of a potential backreference number.
+If the pattern continues with a digit character, some delimiter must be used to
+terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
+option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
+comment (see
+.\" HTML <a href="#comments">
+.\" </a>
+"Comments"
+.\"
+below) can be used.
+.
+.
+.SS "Recursive backreferences"
+.rs
+.sp
+A backreference that occurs inside the parentheses to which it refers fails
+when the subpattern is first used, so, for example, (a\e1) never matches.
+However, such references can be useful inside repeated subpatterns. For
+example, the pattern
+.sp
+  (a|b\e1)+
+.sp
+matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
+the subpattern, the backreference matches the character string corresponding
+to the previous iteration. In order for this to work, the pattern must be such
+that the first iteration does not need to match the backreference. This can be
+done using alternation, as in the example above, or by a quantifier with a
+minimum of zero.
+.P
+Backreferences of this type cause the group that they reference to be treated
+as an
+.\" HTML <a href="#atomicgroup">
+.\" </a>
+atomic group.
+.\"
+Once the whole group has been matched, a subsequent matching failure cannot
+cause backtracking into the middle of the group.
+.
+.
+.\" HTML <a name="bigassertions"></a>
+.SH ASSERTIONS
+.rs
+.sp
+An assertion is a test on the characters following or preceding the current
+matching point that does not consume any characters. The simple assertions
+coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
+.\" HTML <a href="#smallassertions">
+.\" </a>
+above.
+.\"
+.P
+More complicated assertions are coded as subpatterns. There are two kinds:
+those that look ahead of the current position in the subject string, and those
+that look behind it, and in each case an assertion may be positive (must
+succeed for matching to continue) or negative (must not succeed for matching to
+continue). An assertion subpattern is matched in the normal way, except that,
+when matching continues after a successful assertion, the matching position in
+the subject string is as it was before the assertion was processed.
+.P
+Assertion subpatterns are not capturing subpatterns. If an assertion contains
+capturing subpatterns within it, these are counted for the purposes of
+numbering the capturing subpatterns in the whole pattern. Within each branch of
+an assertion, locally captured substrings may be referenced in the usual way.
+For example, a sequence such as (.)\eg{-1} can be used to check that two
+adjacent characters are the same.
+.P
+When a branch within an assertion fails to match, any substrings that were
+captured are discarded (as happens with any pattern branch that fails to
+match). A negative assertion succeeds only when all its branches fail to match;
+this means that no captured substrings are ever retained after a successful
+negative assertion. When an assertion contains a matching branch, what happens
+depends on the type of assertion.
+.P
+For a positive assertion, internally captured substrings in the successful
+branch are retained, and matching continues with the next pattern item after
+the assertion. For a negative assertion, a matching branch means that the
+assertion has failed. If the assertion is being used as a condition in a
+.\" HTML <a href="#conditions">
+.\" </a>
+conditional subpattern
+.\"
+(see below), captured substrings are retained, because matching continues with
+the "no" branch of the condition. For other failing negative assertions,
+control passes to the previous backtracking point, thus discarding any captured
+strings within the assertion.
+.P
+For compatibility with Perl, most assertion subpatterns may be repeated; though
+it makes no sense to assert the same thing several times, the side effect of
+capturing parentheses may occasionally be useful. However, an assertion that
+forms the condition for a conditional subpattern may not be quantified. In
+practice, for other assertions, there are only three cases:
+.sp
+(1) If the quantifier is {0}, the assertion is never obeyed during matching.
+However, it may contain internal capturing parenthesized groups that are called
+from elsewhere via the
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+subroutine mechanism.
+.\"
+.sp
+(2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
+were {0,1}. At run time, the rest of the pattern match is tried with and
+without the assertion, the order depending on the greediness of the quantifier.
+.sp
+(3) If the minimum repetition is greater than zero, the quantifier is ignored.
+The assertion is obeyed just once when encountered during matching.
+.
+.
+.SS "Lookahead assertions"
+.rs
+.sp
+Lookahead assertions start with (?= for positive assertions and (?! for
+negative assertions. For example,
+.sp
+  \ew+(?=;)
+.sp
+matches a word followed by a semicolon, but does not include the semicolon in
+the match, and
+.sp
+  foo(?!bar)
+.sp
+matches any occurrence of "foo" that is not followed by "bar". Note that the
+apparently similar pattern
+.sp
+  (?!foo)bar
+.sp
+does not find an occurrence of "bar" that is preceded by something other than
+"foo"; it finds any occurrence of "bar" whatsoever, because the assertion
+(?!foo) is always true when the next three characters are "bar". A
+lookbehind assertion is needed to achieve the other effect.
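+.P
+The three lookahead patterns above behave the same way in Python's \fBre\fP
+module, which is used here purely as a convenient stand-in for PCRE2. The
+subject strings and variable names are made up for the illustration.

```python
import re

# The semicolon is tested by the assertion but is not part of the match.
word_before_semicolon = re.search(r"\w+(?=;)", "int x;")

# Only the second "foo" is not followed by "bar".
foo_not_bar = [m.start() for m in re.finditer(r"foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested at the position where "bar" starts, so it always succeeds
# there; the preceding "foo" makes no difference.
bar_anywhere = re.search(r"(?!foo)bar", "foobar")

print(word_before_semicolon.group(), foo_not_bar, bar_anywhere.start())
```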
+.P
+If you want to force a matching failure at some point in a pattern, the most
+convenient way to do it is with (?!) because an empty string always matches, so
+an assertion that requires there not to be an empty string must always fail.
+The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
+.
+.
+.\" HTML <a name="lookbehind"></a>
+.SS "Lookbehind assertions"
+.rs
+.sp
+Lookbehind assertions start with (?<= for positive assertions and (?<! for
+negative assertions. For example,
+.sp
+  (?<!foo)bar
+.sp
+does find an occurrence of "bar" that is not preceded by "foo". The contents of
+a lookbehind assertion are restricted such that all the strings it matches must
+have a fixed length. However, if there are several top-level alternatives, they
+do not all have to have the same fixed length. Thus
+.sp
+  (?<=bullock|donkey)
+.sp
+is permitted, but
+.sp
+  (?<!dogs?|cats?)
+.sp
+causes an error at compile time. Branches that match different length strings
+are permitted only at the top level of a lookbehind assertion. This is an
+extension compared with Perl, which requires all branches to match the same
+length of string. An assertion such as
+.sp
+  (?<=ab(c|de))
+.sp
+is not permitted, because its single top-level branch can match two different
+lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
+branches:
+.sp
+  (?<=abc|abde)
+.sp
+In some cases, the escape sequence \eK
+.\" HTML <a href="#resetmatchstart">
+.\" </a>
+(see above)
+.\"
+can be used instead of a lookbehind assertion to get round the fixed-length
+restriction.
+.P
+The implementation of lookbehind assertions is, for each alternative, to
+temporarily move the current position back by the fixed length and then try to
+match. If there are insufficient characters before the current position, the
+assertion fails.
+.P
+In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
+single code unit even in a UTF mode) to appear in lookbehind assertions,
+because it makes it impossible to calculate the length of the lookbehind. The
+\eX and \eR escapes, which can match different numbers of code units, are never
+permitted in lookbehinds.
+.P
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+"Subroutine"
+.\"
+calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
+as the subpattern matches a fixed-length string. However,
+.\" HTML <a href="#recursion">
+.\" </a>
+recursion,
+.\"
+that is, a "subroutine" call into a group that is already active,
+is not supported.
+.P
+Perl does not support backreferences in lookbehinds. PCRE2 does support them,
+but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
+must not be set, there must be no use of (?| in the pattern (it creates
+duplicate subpattern numbers), and if the backreference is by name, the name
+must be unique. Of course, the referenced subpattern must itself be of fixed
+length. The following pattern matches words containing at least two characters
+that begin and end with the same character:
+.sp
+   \eb(\ew)\ew++(?<=\e1)
+.P
+Possessive quantifiers can be used in conjunction with lookbehind assertions to
+specify efficient matching of fixed-length strings at the end of subject
+strings. Consider a simple pattern such as
+.sp
+  abcd$
+.sp
+when applied to a long string that does not match. Because matching proceeds
+from left to right, PCRE2 will look for each "a" in the subject and then see if
+what follows matches the rest of the pattern. If the pattern is specified as
+.sp
+  ^.*abcd$
+.sp
+the initial .* matches the entire string at first, but when this fails (because
+there is no following "a"), it backtracks to match all but the last character,
+then all but the last two characters, and so on. Once again the search for "a"
+covers the entire string, from right to left, so we are no better off. However,
+if the pattern is written as
+.sp
+  ^.*+(?<=abcd)
+.sp
+there can be no backtracking for the .*+ item because of the possessive
+quantifier; it can match only the entire string. The subsequent lookbehind
+assertion does a single test on the last four characters. If it fails, the
+match fails immediately. For long strings, this approach makes a significant
+difference to the processing time.
+.
+.
+.SS "Using multiple assertions"
+.rs
+.sp
+Several assertions (of any sort) may occur in succession. For example,
+.sp
+  (?<=\ed{3})(?<!999)foo
+.sp
+matches "foo" preceded by three digits that are not "999". Notice that each of
+the assertions is applied independently at the same point in the subject
+string. First there is a check that the previous three characters are all
+digits, and then there is a check that the same three characters are not "999".
+This pattern does \fInot\fP match "foo" preceded by six characters, the first
+three of which are digits and the last three of which are not "999". For example, it
+doesn't match "123abcfoo". A pattern to do that is
+.sp
+  (?<=\ed{3}...)(?<!999)foo
+.sp
+This time the first assertion looks at the preceding six characters, checking
+that the first three are digits, and then the second assertion checks that the
+preceding three characters are not "999".
+.P
+Assertions can be nested in any combination. For example,
+.sp
+  (?<=(?<!foo)bar)baz
+.sp
+matches an occurrence of "baz" that is preceded by "bar" which in turn is not
+preceded by "foo", while
+.sp
+  (?<=\ed{3}(?!999)...)foo
+.sp
+is another pattern that matches "foo" preceded by three digits and any three
+characters that are not "999".
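+.P
+The difference between the first two patterns in this section can be checked
+with a small script. It is a sketch that uses Python's \fBre\fP module instead
+of PCRE2; both lookbehinds are fixed-length, so the two engines are assumed to
+agree here. The helper \fBfound()\fP is a hypothetical convenience function.

```python
import re

three_digits_not_999 = re.compile(r"(?<=\d{3})(?<!999)foo")
digits_then_any_three = re.compile(r"(?<=\d{3}...)(?<!999)foo")


def found(pattern, subject):
    """Return True if the pattern matches anywhere in the subject."""
    return pattern.search(subject) is not None


# Both assertions inspect the same three characters before "foo".
print(found(three_digits_not_999, "123foo"))      # digits, and not "999"
print(found(three_digits_not_999, "999foo"))      # rejected by (?<!999)
print(found(three_digits_not_999, "123abcfoo"))   # "abc" are not digits

# Here the first assertion inspects six characters, the second only three.
print(found(digits_then_any_three, "123abcfoo"))
print(found(digits_then_any_three, "123999foo"))
```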
+.
+.
+.\" HTML <a name="conditions"></a>
+.SH "CONDITIONAL SUBPATTERNS"
+.rs
+.sp
+It is possible to cause the matching process to obey a subpattern
+conditionally or to choose between two alternative subpatterns, depending on
+the result of an assertion, or whether a specific capturing subpattern has
+already been matched. The two possible forms of conditional subpattern are:
+.sp
+  (?(condition)yes-pattern)
+  (?(condition)yes-pattern|no-pattern)
+.sp
+If the condition is satisfied, the yes-pattern is used; otherwise the
+no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
+string (it always matches). If there are more than two alternatives in the
+subpattern, a compile-time error occurs. Each of the two alternatives may
+itself contain nested subpatterns of any form, including conditional
+subpatterns; the restriction to two alternatives applies only at the level of
+the condition. This pattern fragment is an example where the alternatives are
+complex:
+.sp
+  (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
+.sp
+.P
+There are five kinds of condition: references to subpatterns, references to
+recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
+.
+.
+.SS "Checking for a used subpattern by number"
+.rs
+.sp
+If the text between the parentheses consists of a sequence of digits, the
+condition is true if a capturing subpattern of that number has previously
+matched. If there is more than one capturing subpattern with the same number
+(see the earlier
+.\"
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+section about duplicate subpattern numbers),
+.\"
+the condition is true if any of them have matched. An alternative notation is
+to precede the digits with a plus or minus sign. In this case, the subpattern
+number is relative rather than absolute. The most recently opened parentheses
+can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
+loops it can also make sense to refer to subsequent groups. The next
+parentheses to be opened can be referenced as (?(+1), and so on. (The value
+zero in any of these forms is not used; it provokes a compile-time error.)
+.P
+Consider the following pattern, which contains non-significant white space to
+make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
+three parts for ease of discussion:
+.sp
+  ( \e( )?    [^()]+    (?(1) \e) )
+.sp
+The first part matches an optional opening parenthesis, and if that
+character is present, sets it as the first captured substring. The second part
+matches one or more characters that are not parentheses. The third part is a
+conditional subpattern that tests whether or not the first set of parentheses
+matched. If they did, that is, if the subject started with an opening parenthesis,
+the condition is true, and so the yes-pattern is executed and a closing
+parenthesis is required. Otherwise, since no-pattern is not present, the
+subpattern matches nothing. In other words, this pattern matches a sequence of
+non-parentheses, optionally enclosed in parentheses.
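+.P
+The same pattern is accepted by Python's \fBre\fP module, whose re.VERBOSE
+flag plays the role that PCRE2_EXTENDED plays here. The following sketch, with
+illustrative names, shows which subjects the pattern accepts as a whole:

```python
import re

optional_parens = re.compile(r"( \( )?    [^()]+    (?(1) \) )", re.VERBOSE)

# The closing parenthesis is required only if the opening one was matched.
results = {
    s: optional_parens.fullmatch(s) is not None
    for s in ("abc", "(abc)", "(abc", "abc)")
}

print(results)
```

+The unbalanced subjects are rejected because the whole string has to match;
+an unanchored search would still find "abc" inside "abc)".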
+.P
+If you were embedding this pattern in a larger one, you could use a relative
+reference:
+.sp
+  ...other stuff... ( \e( )?    [^()]+    (?(-1) \e) ) ...
+.sp
+This makes the fragment independent of the parentheses in the larger pattern.
+.
+.
+.SS "Checking for a used subpattern by name"
+.rs
+.sp
+Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
+subpattern by name. For compatibility with earlier versions of PCRE1, which had
+this facility before Perl, the syntax (?(name)...) is also recognized. Note,
+however, that undelimited names consisting of the letter R followed by digits
+are ambiguous (see the following section).
+.P
+Rewriting the above example to use a named subpattern gives this:
+.sp
+  (?<OPEN> \e( )?    [^()]+    (?(<OPEN>) \e) )
+.sp
+If the name used in a condition of this kind is a duplicate, the test is
+applied to all subpatterns of the same name, and is true if any one of them has
+matched.
+.
+.
+.SS "Checking for pattern recursion"
+.rs
+.sp
+"Recursion" in this sense refers to any subroutine-like call from one part of
+the pattern to another, whether or not it is actually recursive. See the
+sections entitled
+.\" HTML <a href="#recursion">
+.\" </a>
+"Recursive patterns"
+.\"
+and
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+"Subpatterns as subroutines"
+.\"
+below for details of recursion and subpattern calls.
+.P
+If a condition is the string (R), and there is no subpattern with the name R,
+the condition is true if matching is currently in a recursion or subroutine
+call to the whole pattern or any subpattern. If digits follow the letter R, and
+there is no subpattern with that name, the condition is true if the most recent
+call is into a subpattern with the given number, which must exist somewhere in
+the overall pattern. This is a contrived example that is equivalent to a+b:
+.sp
+  ((?(R1)a+|(?1)b))
+.sp
+However, in both cases, if there is a subpattern with a matching name, the
+condition tests for its being set, as described in the section above, instead
+of testing for recursion. For example, creating a group with the name R1 by
+adding (?<R1>) to the above pattern completely changes its meaning.
+.P
+If a name preceded by ampersand follows the letter R, for example:
+.sp
+  (?(R&name)...)
+.sp
+the condition is true if the most recent recursion is into a subpattern of that
+name (which must exist within the pattern).
+.P
+This condition does not check the entire recursion stack. It tests only the
+current level. If the name used in a condition of this kind is a duplicate, the
+test is applied to all subpatterns of the same name, and is true if any one of
+them is the most recent recursion.
+.P
+At "top level", all these recursion test conditions are false.
+.
+.
+.\" HTML <a name="subdefine"></a>
+.SS "Defining subpatterns for use by reference only"
+.rs
+.sp
+If the condition is the string (DEFINE), the condition is always false, even if
+there is a group with the name DEFINE. In this case, there may be only one
+alternative in the subpattern. It is always skipped if control reaches this
+point in the pattern; the idea of DEFINE is that it can be used to define
+subroutines that can be referenced from elsewhere. (The use of
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+subroutines
+.\"
+is described below.) For example, a pattern to match an IPv4 address such as
+"192.168.23.245" could be written like this (ignore white space and line
+breaks):
+.sp
+  (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
+  \eb (?&byte) (\e.(?&byte)){3} \eb
+.sp
+The first part of the pattern is a DEFINE group inside which another group
+named "byte" is defined. This matches an individual component of an IPv4
+address (a number less than 256). When matching takes place, this part of the
+pattern is skipped because DEFINE acts like a false condition. The rest of the
+pattern uses references to the named group to match the four dot-separated
+components of an IPv4 address, insisting on a word boundary at each end.
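+.P
+Engines without DEFINE can approximate this by building the pattern from a
+shared fragment. The sketch below does that with Python's \fBre\fP module; it
+is a substitute technique (string composition), not the DEFINE mechanism
+itself, and the names BYTE and ipv4 are invented for the illustration.

```python
import re

# The "byte" subpattern from the example, kept in one place and reused.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Equivalent in effect to: \b (?&byte) (\.(?&byte)){3} \b
ipv4 = re.compile(rf"\b{BYTE}(?:\.{BYTE}){{3}}\b")

for candidate in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(candidate, ipv4.fullmatch(candidate) is not None)
```

+With DEFINE the fragment lives inside the pattern itself, so no host-language
+string handling is needed.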
+.
+.
+.SS "Checking the PCRE2 version"
+.rs
+.sp
+Programs that link with a PCRE2 library can check the version by calling
+\fBpcre2_config()\fP with appropriate arguments. Users of applications that do
+not have access to the underlying code cannot do this. A special "condition"
+called VERSION exists to allow such users to discover which version of PCRE2
+they are dealing with by using this condition to match a string such as
+"yesno". VERSION must be followed either by "=" or ">=" and a version number.
+For example:
+.sp
+  (?(VERSION>=10.4)yes|no)
+.sp
+This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
+"no" otherwise. The fractional part of the version number may not contain more
+than two digits.
+.
+.
+.SS "Assertion conditions"
+.rs
+.sp
+If the condition is not in any of the above formats, it must be an assertion.
+This may be a positive or negative lookahead or lookbehind assertion. Consider
+this pattern, again containing non-significant white space, and with the two
+alternatives on the second line:
+.sp
+  (?(?=[^a-z]*[a-z])
+  \ed{2}-[a-z]{3}-\ed{2}  |  \ed{2}-\ed{2}-\ed{2} )
+.sp
+The condition is a positive lookahead assertion that matches an optional
+sequence of non-letters followed by a letter. In other words, it tests for the
+presence of at least one letter in the subject. If a letter is found, the
+subject is matched against the first alternative; otherwise it is matched
+against the second. This pattern matches strings in one of the two forms
+dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
+.P
+When an assertion that is a condition contains capturing subpatterns, any
+capturing that occurs in a matching branch is retained afterwards, for both
+positive and negative assertions, because matching always continues after the
+assertion, whether it succeeds or fails. (Compare non-conditional assertions,
+when captures are retained only for positive assertions that succeed.)
+.
+.
+.\" HTML <a name="comments"></a>
+.SH COMMENTS
+.rs
+.sp
+There are two ways of including comments in patterns that are processed by
+PCRE2. In both cases, the start of the comment must not be in a character
+class, nor in the middle of any other sequence of related characters such as
+(?: or a subpattern name or number. The characters that make up a comment play
+no part in the pattern matching.
+.P
+The sequence (?# marks the start of a comment that continues up to the next
+closing parenthesis. Nested parentheses are not permitted. If the
+PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
+also introduces a comment, which in this case continues to immediately after
+the next newline character or character sequence in the pattern. Which
+characters are interpreted as newlines is controlled by an option passed to the
+compiling function or by a special sequence at the start of the pattern, as
+described in the section entitled
+.\" HTML <a href="#newlines">
+.\" </a>
+"Newline conventions"
+.\"
+above. Note that the end of this type of comment is a literal newline sequence
+in the pattern; escape sequences that happen to represent a newline do not
+count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
+default newline convention (a single linefeed character) is in force:
+.sp
+  abc #comment \en still comment
+.sp
+On encountering the # character, \fBpcre2_compile()\fP skips along, looking for
+a newline in the pattern. The sequence \en is still literal at this stage, so
+it does not terminate the comment. Only an actual character with the code value
+0x0a (the default newline) does so.
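+.P
+Python's \fBre\fP module treats comments in the same way when its re.VERBOSE
+flag is set, so it can be used for a rough demonstration. This is a sketch
+with a different engine, not a statement about \fBpcre2_compile()\fP itself;
+the variable names are illustrative.

```python
import re

# Raw string: the pattern holds a backslash followed by "n". That escape
# sequence sits inside the comment and does not terminate it.
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: the pattern holds a real linefeed (0x0a), which ends the
# comment, so "def" is part of the pattern again.
literal = re.compile("abc #comment \n def", re.VERBOSE)

# The (?#...) form needs no option and ends at the next closing parenthesis.
inline = re.compile(r"abc(?#ignored)def")

print(escaped.fullmatch("abc") is not None)
print(literal.fullmatch("abcdef") is not None)
print(inline.fullmatch("abcdef") is not None)
```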
+.
+.
+.\" HTML <a name="recursion"></a>
+.SH "RECURSIVE PATTERNS"
+.rs
+.sp
+Consider the problem of matching a string in parentheses, allowing for
+unlimited nested parentheses. Without the use of recursion, the best that can
+be done is to use a pattern that matches up to some fixed depth of nesting. It
+is not possible to handle an arbitrary nesting depth.
+.P
+For some time, Perl has provided a facility that allows regular expressions to
+recurse (amongst other things). It does this by interpolating Perl code in the
+expression at run time, and the code can refer to the expression itself. A Perl
+pattern using code interpolation to solve the parentheses problem can be
+created like this:
+.sp
+  $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x;
+.sp
+The (?p{...}) item interpolates Perl code at run time, and in this case refers
+recursively to the pattern in which it appears.
+.P
+Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
+supports special syntax for recursion of the entire pattern, and also for
+individual subpattern recursion. After its introduction in PCRE1 and Python,
+this kind of recursion was subsequently introduced into Perl at release 5.10.
+.P
+A special item that consists of (? followed by a number greater than zero and a
+closing parenthesis is a recursive subroutine call of the subpattern of the
+given number, provided that it occurs inside that subpattern. (If not, it is a
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+non-recursive subroutine
+.\"
+call, which is described in the next section.) The special item (?R) or (?0) is
+a recursive call of the entire regular expression.
+.P
+This PCRE2 pattern solves the nested parentheses problem (assume the
+PCRE2_EXTENDED option is set so that white space is ignored):
+.sp
+  \e( ( [^()]++ | (?R) )* \e)
+.sp
+First it matches an opening parenthesis. Then it matches any number of
+substrings which can either be a sequence of non-parentheses, or a recursive
+match of the pattern itself (that is, a correctly parenthesized substring).
+Finally there is a closing parenthesis. Note the use of a possessive quantifier
+to avoid backtracking into sequences of non-parentheses.
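+.P
+To see what the recursive call does, it may help to compare the pattern with
+a hand-written recursive function. The sketch below is plain Python and does
+not use any regular expression engine; \fBmatch_parens()\fP is a hypothetical
+helper in which the recursive call stands in for (?R) and the character loop
+for the possessive run of non-parentheses.

```python
def match_parens(subject, pos=0):
    """Return the index just past a balanced group starting at pos, or None."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                      # the leading \( did not match
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            end = match_parens(subject, pos)   # plays the role of (?R)
            if end is None:
                return None
            pos = end
        else:
            pos += 1                     # one step of the run of non-parentheses
    if pos < len(subject) and subject[pos] == ")":
        return pos + 1                   # the trailing \) matched
    return None


print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaa()"))
```

+The function never revisits a character it has already consumed, which is
+the behaviour that the possessive quantifier gives the pattern.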
+.P
+If this were part of a larger pattern, you would not want to recurse the entire
+pattern, so instead you could use this:
+.sp
+  ( \e( ( [^()]++ | (?1) )* \e) )
+.sp
+We have put the pattern into parentheses, and caused the recursion to refer to
+them instead of the whole pattern.
+.P
+In a larger pattern, keeping track of parenthesis numbers can be tricky. This
+is made easier by the use of relative references. Instead of (?1) in the
+pattern above you can write (?-2) to refer to the second most recently opened
+parentheses preceding the recursion. In other words, a negative number counts
+capturing parentheses leftwards from the point at which it is encountered.
+.P
+Be aware however, that if
+.\" HTML <a href="#dupsubpatternnumber">
+.\" </a>
+duplicate subpattern numbers
+.\"
+are in use, relative references refer to the earliest subpattern with the
+appropriate number. Consider, for example:
+.sp
+  (?|(a)|(b)) (c) (?-2)
+.sp
+The first two capturing groups (a) and (b) are both numbered 1, and group (c)
+is number 2. When the reference (?-2) is encountered, the second most recently
+opened parentheses has the number 1, but it is the first such group (the (a)
+group) to which the recursion refers. This would be the same if an absolute
+reference (?1) was used. In other words, relative references are just a
+shorthand for computing a group number.
+.P
+It is also possible to refer to subsequently opened parentheses, by writing
+references such as (?+2). However, these cannot be recursive because the
+reference is not inside the parentheses that are referenced. They are always
+.\" HTML <a href="#subpatternsassubroutines">
+.\" </a>
+non-recursive subroutine
+.\"
+calls, as described in the next section.
+.P
+An alternative approach is to use named parentheses. The Perl syntax for this
+is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
+rewrite the above example as follows:
+.sp
+  (?<pn> \e( ( [^()]++ | (?&pn) )* \e) )
+.sp
+If there is more than one subpattern with the same name, the earliest one is
+used.
+.P
+The example pattern that we have been looking at contains nested unlimited
+repeats, and so the use of a possessive quantifier for matching strings of
+non-parentheses is important when applying the pattern to strings that do not
+match. For example, when this pattern is applied to
+.sp
+  (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
+.sp
+it yields "no match" quickly. However, if a possessive quantifier is not used,
+the match runs for a very long time indeed because there are so many different
+ways the + and * repeats can carve up the subject, and all have to be tested
+before failure can be reported.
+.P
+At the end of a match, the values of capturing parentheses are those from
+the outermost level. If you want to obtain intermediate values, a callout
+function can be used (see below and the
+.\" HREF
+\fBpcre2callout\fP
+.\"
+documentation). If the pattern above is matched against
+.sp
+  (ab(cd)ef)
+.sp
+the value for the inner capturing parentheses (numbered 2) is "ef", which is
+the last value taken on at the top level. If a capturing subpattern is not
+matched at the top level, its final captured value is unset, even if it was
+(temporarily) set at a deeper level during the matching process.
+.P
+Do not confuse the (?R) item with the condition (R), which tests for recursion.
+Consider this pattern, which matches text in angle brackets, allowing for
+arbitrary nesting. Only digits are allowed in nested brackets (that is, when
+recursing), whereas any characters are permitted at the outer level.
+.sp
+  < (?: (?(R) \ed++  | [^<>]*+) | (?R)) * >
+.sp
+In this pattern, (?(R) is the start of a conditional subpattern, with two
+different alternatives for the recursive and non-recursive cases. The (?R) item
+is the actual recursive call.
+.
+.
+.\" HTML <a name="recursiondifference"></a>
+.SS "Differences in recursion processing between PCRE2 and Perl"
+.rs
+.sp
+Some former differences between PCRE2 and Perl no longer exist.
+.P
+Before release 10.30, recursion processing in PCRE2 differed from Perl in that
+a recursive subpattern call was always treated as an atomic group. That is,
+once it had matched some of the subject string, it was never re-entered, even
+if it contained untried alternatives and there was a subsequent matching
+failure. (Historical note: PCRE implemented recursion before Perl did.)
+.P
+Starting with release 10.30, recursive subroutine calls are no longer treated
+as atomic. That is, they can be re-entered to try unused alternatives if there
+is a matching failure later in the pattern. This is now compatible with the way
+Perl works. If you want a subroutine call to be atomic, you must explicitly
+enclose it in an atomic group.
+.P
+Supporting backtracking into recursions simplifies certain types of recursive
+pattern. For example, this pattern matches palindromic strings:
+.sp
+  ^((.)(?1)\e2|.?)$
+.sp
+The second branch in the group matches a single central character in the
+palindrome when there are an odd number of characters, or nothing when there
+are an even number of characters, but in order to work it has to be able to try
+the second case when the rest of the pattern match fails. If you want to match
+typical palindromic phrases, the pattern has to ignore all non-word characters,
+which can be done like this:
+.sp
+  ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
+.sp
+If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
+man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
+avoid backtracking into sequences of non-word characters. Without this, PCRE2
+takes a great deal longer (ten times or more) to match typical phrases, and
+Perl takes so long that you think it has gone into a loop.
+.P
+Another way in which PCRE2 and Perl used to differ in their recursion
+processing is in the handling of captured values. Formerly in Perl, when a
+subpattern was called recursively or as a subroutine (see the next section), it
+had no access to any values that were captured outside the recursion, whereas
+in PCRE2 these values can be referenced. Consider this pattern:
+.sp
+  ^(.)(\e1|a(?2))
+.sp
+This pattern matches "bab". The first capturing parentheses match "b", then in
+the second group, when the backreference \e1 fails to match "b", the second
+alternative matches "a" and then recurses. In the recursion, \e1 does now match
+"b" and so the whole match succeeds. This match used to fail in Perl, but in
+later versions (I tried 5.024) it now works.
+.
+.
+.\" HTML <a name="subpatternsassubroutines"></a>
+.SH "SUBPATTERNS AS SUBROUTINES"
+.rs
+.sp
+If the syntax for a recursive subpattern call (either by number or by
+name) is used outside the parentheses to which it refers, it operates a bit
+like a subroutine in a programming language. More accurately, PCRE2 treats the
+referenced subpattern as an independent subpattern which it tries to match at
+the current matching position. The called subpattern may be defined before or
+after the reference. A numbered reference can be absolute or relative, as in
+these examples:
+.sp
+  (...(absolute)...)...(?2)...
+  (...(relative)...)...(?-1)...
+  (...(?+1)...(relative)...)
+.sp
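+The rule for relative numbers can be stated arithmetically: a negative number
+counts back from the most recently opened capturing parentheses, and a positive
+number counts forward to parentheses that have not yet been opened. The Python
+sketch below illustrates that rule; the function and its argument names are
+hypothetical and are not part of any PCRE2 interface:
+.sp
```python
def resolve_relative(opened_before: int, offset: int) -> int:
    """Convert a relative group reference such as (?-1) or (?+1) to a number.

    opened_before is how many capturing groups have been opened to the left
    of the reference; offset is the signed number in the reference.
    """
    if offset == 0:
        raise ValueError("a relative reference needs a signed, non-zero number")
    if offset < 0:
        absolute = opened_before + offset + 1   # (?-1) is the latest group opened
    else:
        absolute = opened_before + offset       # (?+1) is the next group to open
    if absolute < 1:
        raise ValueError("reference points before the first group")
    return absolute

# (...(relative)...)...(?-1)...  two groups are opened before the reference
print(resolve_relative(2, -1))
# (...(?+1)...(relative)...      one group is opened before the reference
print(resolve_relative(1, +1))
```
+.P
+Both calls give 2, the number of the group labelled "relative" in the examples
+above.
+.P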
+An earlier example pointed out that the pattern
+.sp
+  (sens|respons)e and \e1ibility
+.sp
+matches "sense and sensibility" and "response and responsibility", but not
+"sense and responsibility". If instead the pattern
+.sp
+  (sens|respons)e and (?1)ibility
+.sp
+is used, it does match "sense and responsibility" as well as the other two
+strings. Another example is given in the discussion of DEFINE above.
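+.P
+The difference between the two forms can be reproduced in engines that lack
+subroutine calls by writing the group's text out a second time, which is
+roughly what (?1) does. The sketch below uses Python's standard \fBre\fP
+module, which supports backreferences but not (?1); the expanded pattern is an
+illustrative stand-in, not PCRE2 syntax:
+.sp
```python
import re

GROUP = r"sens|respons"

# Backreference: \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({GROUP})e and \1ibility")
# Stand-in for (?1): the group's pattern is tried afresh at the second position.
subroutine_like = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine_like.fullmatch(subject)))
```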
+.P
+Like recursions, subroutine calls used to be treated as atomic, but this
+changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
+occur. However, any capturing parentheses that are set during the subroutine
+call revert to their previous values afterwards.
+.P
+Processing options such as case-independence are fixed when a subpattern is
+defined, so if it is used as a subroutine, such options cannot be changed for
+different calls. For example, consider this pattern:
+.sp
+  (abc)(?i:(?-1))
+.sp
+It matches "abcabc". It does not match "abcABC" because the change of
+processing option does not affect the called subpattern.
+.P
+The behaviour of
+.\" HTML <a href="#backtrackcontrol">
+.\" </a>
+backtracking control verbs
+.\"
+in subpatterns when called as subroutines is described in the section entitled
+.\" HTML <a href="#btsub">
+.\" </a>
+"Backtracking verbs in subroutines"
+.\"
+below.
+.
+.
+.\" HTML <a name="onigurumasubroutines"></a>
+.SH "ONIGURUMA SUBROUTINE SYNTAX"
+.rs
+.sp
+For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
+a number enclosed either in angle brackets or single quotes, is an alternative
+syntax for referencing a subpattern as a subroutine, possibly recursively. Here
+are two of the examples used above, rewritten using this syntax:
+.sp
+  (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
+  (sens|respons)e and \eg'1'ibility
+.sp
+PCRE2 supports an extension to Oniguruma: if a number is preceded by a
+plus or a minus sign it is taken as a relative reference. For example:
+.sp
+  (abc)(?i:\eg<-1>)
+.sp
+Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
+synonymous. The former is a backreference; the latter is a subroutine call.
+.
+.
+.SH CALLOUTS
+.rs
+.sp
+Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
+code to be obeyed in the middle of matching a regular expression. This makes it
+possible, amongst other things, to extract different substrings that match the
+same pair of parentheses when there is a repetition.
+.P
+PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
+code. The feature is called "callout". The caller of PCRE2 provides an external
+function by putting its entry point in a match context using the function
+\fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP
+or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
+entry point is set to NULL, callouts are disabled.
+.P
+Within a regular expression, (?C<arg>) indicates a point at which the external
+function is to be called. There are two kinds of callout: those with a
+numerical argument and those with a string argument. (?C) on its own with no
+argument is treated as (?C0). A numerical argument allows the application to
+distinguish between different callouts. String arguments were added for release
+10.20 to make it possible for script languages that use PCRE2 to embed short
+scripts within patterns in a similar way to Perl.
+.P
+During matching, when PCRE2 reaches a callout point, the external function is
+called. It is provided with the number or string argument of the callout, the
+position in the pattern, and one item of data that is also set in the match
+block. The callout function may cause matching to proceed, to backtrack, or to
+fail.
+.P
+By default, PCRE2 implements a number of optimizations at matching time, and
+one side-effect is that sometimes callouts are skipped. If you need all
+possible callouts to happen, you need to set options that disable the relevant
+optimizations. More details, including a complete description of the
+programming interface to the callout function, are given in the
+.\" HREF
+\fBpcre2callout\fP
+.\"
+documentation.
+.
+.
+.SS "Callouts with numerical arguments"
+.rs
+.sp
+If you just want to have a means of identifying different callout points, put a
+number less than 256 after the letter C. For example, this pattern has two
+callout points:
+.sp
+  (?C1)abc(?C2)def
+.sp
+If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
+callouts are automatically installed before each item in the pattern. They are
+all numbered 255. If there is a conditional group in the pattern whose
+condition is an assertion, an additional callout is inserted just before the
+condition. An explicit callout may also be set at this position, as in this
+example:
+.sp
+  (?(?C9)(?=a)abc|def)
+.sp
+Note that this applies only to assertion conditions, not to other types of
+condition.
+.
+.
+.SS "Callouts with string arguments"
+.rs
+.sp
+A delimited string may be used instead of a number as a callout argument. The
+starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
+the same as the start, except for {, where the ending delimiter is }. If the
+ending delimiter is needed within the string, it must be doubled. For
+example:
+.sp
+  (?C'ab ''c'' d')xyz(?C{any text})pqr
+.sp
+The doubling is removed before the string is passed to the callout function.
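+.P
+The delimiter rules can be illustrated with a small parser. This Python sketch
+is not PCRE2's own code; the function name is hypothetical, and it takes the
+text that follows "(?C" up to and including the closing parenthesis:
+.sp
```python
OPENERS = "`'\"^%#${"

def parse_callout_string(text: str) -> str:
    """Return the string argument with doubled ending delimiters reduced."""
    if not text or text[0] not in OPENERS:
        raise ValueError("not a string callout argument")
    end = "}" if text[0] == "{" else text[0]
    out = []
    i = 1
    while i < len(text):
        if text[i] != end:
            out.append(text[i])
            i += 1
        elif text[i + 1:i + 2] == end:
            out.append(end)          # a doubled delimiter stands for one
            i += 2
        elif text[i + 1:i + 2] == ")":
            return "".join(out)      # a single delimiter closes the string
        else:
            raise ValueError("closing delimiter must be followed by ')'")
    raise ValueError("unterminated callout string")

print(parse_callout_string("'ab ''c'' d')"))
print(parse_callout_string("{any text})"))
```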
+.
+.
+.\" HTML <a name="backtrackcontrol"></a>
+.SH "BACKTRACKING CONTROL"
+.rs
+.sp
+There are a number of special "Backtracking Control Verbs" (to use Perl's
+terminology) that modify the behaviour of backtracking during matching. They
+are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
+possibly behaving differently depending on whether or not a name is present.
+.P
+By default, for compatibility with Perl, a name is any sequence of characters
+that does not include a closing parenthesis. The name is not processed in
+any way, and it is not possible to include a closing parenthesis in the name.
+This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
+is no longer Perl-compatible.
+.P
+When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
+and only an unescaped closing parenthesis terminates the name. However, the
+only backslash items that are permitted are \eQ, \eE, and sequences such as
+\ex{100} that define character code points. Character type escapes such as \ed
+are faulted.
+.P
+A closing parenthesis can be included in a name either as \e) or between \eQ
+and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
+PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
+skipped, and #-comments are recognized, exactly as in the rest of the pattern.
+PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
+PCRE2_ALT_VERBNAMES is also set.
+.P
+The maximum length of a name is 255 in the 8-bit library and 65535 in the
+16-bit and 32-bit libraries. If the name is empty, that is, if the closing
+parenthesis immediately follows the colon, the effect is as if the colon were
+not there. Any number of these verbs may occur in a pattern.
+.P
+Since these verbs are specifically related to backtracking, most of them can be
+used only when the pattern is to be matched using the traditional matching
+function, because that uses a backtracking algorithm. With the exception of
+(*FAIL), which behaves like a failing negative assertion, the backtracking
+control verbs cause an error if encountered by the DFA matching function.
+.P
+The behaviour of these verbs in
+.\" HTML <a href="#btrepeat">
+.\" </a>
+repeated groups,
+.\"
+.\" HTML <a href="#btassert">
+.\" </a>
+assertions,
+.\"
+and in
+.\" HTML <a href="#btsub">
+.\" </a>
+subpatterns called as subroutines
+.\"
+(whether or not recursively) is documented below.
+.
+.
+.\" HTML <a name="nooptimize"></a>
+.SS "Optimizations that affect backtracking verbs"
+.rs
+.sp
+PCRE2 contains some optimizations that are used to speed up matching by running
+some checks at the start of each match attempt. For example, it may know the
+minimum length of matching subject, or that a particular character must be
+present. When one of these optimizations bypasses the running of a match, any
+included backtracking verbs will not, of course, be processed. You can suppress
+the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
+when calling \fBpcre2_compile()\fP, or by starting the pattern with
+(*NO_START_OPT). There is more discussion of this option in the section
+entitled
+.\" HTML <a href="pcre2api.html#compiling">
+.\" </a>
+"Compiling a pattern"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation.
+.P
+Experiments with Perl suggest that it too has similar optimizations, and that,
+as with PCRE2, turning them off can change the result of a match.
+.
+.
+.SS "Verbs that act immediately"
+.rs
+.sp
+The following verbs act as soon as they are encountered.
+.sp
+   (*ACCEPT) or (*ACCEPT:NAME)
+.sp
+This verb causes the match to end successfully, skipping the remainder of the
+pattern. However, when it is inside a subpattern that is called as a
+subroutine, only that subpattern is ended successfully. Matching then continues
+at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
+assertion succeeds; in a negative assertion, the assertion fails.
+.P
+If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
+example:
+.sp
+  A((?:A|B(*ACCEPT)|C)D)
+.sp
+This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
+the outer parentheses.
+.sp
+  (*FAIL) or (*FAIL:NAME)
+.sp
+This verb causes a matching failure, forcing backtracking to occur. It may be
+abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
+documentation notes that it is probably useful only when combined with (?{}) or
+(??{}). Those are, of course, Perl features that are not present in PCRE2. The
+nearest equivalent is the callout feature, as for example in this pattern:
+.sp
+  a+(?C)(*FAIL)
+.sp
+A match with the string "aaaa" always fails, but the callout is taken before
+each backtrack happens (in this example, 10 times).
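+.P
+The figure of 10 follows from counting every way the greedy a+ can match at
+each starting offset in "aaaa": 4 + 3 + 2 + 1. The Python sketch below
+reproduces that count by enumeration. It is a model of the backtracking
+described here, not a call into PCRE2, and it assumes that no optimization
+removes any match attempt; the function name is hypothetical:
+.sp
```python
def callout_hits(subject: str):
    """List (start, end) for each time the callout in a+(?C)(*FAIL) is reached."""
    hits = []
    for start in range(len(subject)):          # bumpalong over starting offsets
        run = 0
        while start + run < len(subject) and subject[start + run] == "a":
            run += 1
        # a+ is greedy: the longest run is tried first, and each forced
        # failure makes it give back one character until none can be given.
        for length in range(run, 0, -1):
            hits.append((start, start + length))
    return hits

print(len(callout_hits("aaaa")))
```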
+.P
+(*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
+(*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
+.
+.
+.SS "Recording which path was taken"
+.rs
+.sp
+There is one verb whose main purpose is to track how a match was arrived at,
+though it also has a secondary use in conjunction with advancing the match
+starting point (see (*SKIP) below).
+.sp
+  (*MARK:NAME) or (*:NAME)
+.sp
+A name is always required with this verb. There may be as many instances of
+(*MARK) as you like in a pattern, and their names do not have to be unique.
+.P
+When a match succeeds, the name of the last-encountered (*MARK:NAME) on the
+matching path is passed back to the caller as described in the section entitled
+.\" HTML <a href="pcre2api.html#matchotherdata">
+.\" </a>
+"Other information about the match"
+.\"
+in the
+.\" HREF
+\fBpcre2api\fP
+.\"
+documentation. This applies to all instances of (*MARK), including those inside
+assertions and atomic groups. (There are differences in those cases when
+(*MARK) is used in conjunction with (*SKIP) as described below.)
+.P
+As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
+associated NAME arguments. Whichever is last on the matching path is passed
+back. See below for more details of these other verbs.
+.P
+Here is an example of \fBpcre2test\fP output, where the "mark" modifier
+requests the retrieval and outputting of (*MARK) data:
+.sp
+    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
+  data> XY
+   0: XY
+  MK: A
+data> XZ
+   0: XZ
+  MK: B
+.sp
+The (*MARK) name is tagged with "MK:" in this output, and in this example it
+indicates which of the two alternatives matched. This is a more efficient way
+of obtaining this information than putting each alternative in its own
+capturing parentheses.
+.P
+If a verb with a name is encountered in a positive assertion that is true, the
+name is recorded and passed back if it is the last-encountered. This does not
+happen for negative assertions or failing positive assertions.
+.P
+After a partial match or a failed match, the last encountered name in the
+entire match process is returned. For example:
+.sp
+    re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
+  data> XP
+  No match, mark = B
+.sp
+Note that in this unanchored example the mark is retained from the match
+attempt that started at the letter "X" in the subject. Subsequent match
+attempts starting at "P" and then with an empty string do not get as far as the
+(*MARK) item, but nevertheless do not reset it.
+.P
+If you are interested in (*MARK) values after failed matches, you should
+probably set the PCRE2_NO_START_OPTIMIZE option
+.\" HTML <a href="#nooptimize">
+.\" </a>
+(see above)
+.\"
+to ensure that the match is always attempted.
+.
+.
+.SS "Verbs that act after backtracking"
+.rs
+.sp
+The following verbs do nothing when they are encountered. Matching continues
+with what follows, but if there is a subsequent match failure, causing a
+backtrack to the verb, a failure is forced. That is, backtracking cannot pass
+to the left of the verb. However, when one of these verbs appears inside an
+atomic group or in a lookaround assertion that is true, its effect is confined
+to that group, because once the group has been matched, there is never any
+backtracking into it. Backtracking from beyond an assertion or an atomic group
+ignores the entire group, and seeks a preceding backtracking point.
+.P
+These verbs differ in exactly what kind of failure occurs when backtracking
+reaches them. The behaviour described below is what happens when the verb is
+not in a subroutine or an assertion. Subsequent sections cover these special
+cases.
+.sp
+  (*COMMIT) or (*COMMIT:NAME)
+.sp
+This verb causes the whole match to fail outright if there is a later matching
+failure that causes backtracking to reach it. Even if the pattern is
+unanchored, no further attempts to find a match by advancing the starting point
+take place. If (*COMMIT) is the only backtracking verb that is encountered,
+once it has been passed \fBpcre2_match()\fP is committed to finding a match at
+the current starting point, or not at all. For example:
+.sp
+  a+(*COMMIT)b
+.sp
+This matches "xxaab" but not "aacaab". It can be thought of as a kind of
+dynamic anchor, or "I've started, so I must finish."
+.P
+The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
+like (*MARK:NAME) in that the name is remembered for passing back to the
+caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
+ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
+.P
+If there is more than one backtracking verb in a pattern, a different one that
+follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
+match does not always guarantee that a match must be at this starting point.
+.P
+Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
+unless PCRE2's start-of-match optimizations are turned off, as shown in this
+output from \fBpcre2test\fP:
+.sp
+    re> /(*COMMIT)abc/
+  data> xyzabc
+   0: abc
+  data>
+    re> /(*COMMIT)abc/no_start_optimize
+  data> xyzabc
+  No match
+.sp
+For the first pattern, PCRE2 knows that any match must start with "a", so the
+optimization skips along the subject to "a" before applying the pattern to the
+first set of data. The match attempt then succeeds. The second pattern disables
+the optimization that skips along to the first character. The pattern is now
+applied starting at "x", and so the (*COMMIT) causes the match to fail without
+trying any other starting points.
+.sp
+  (*PRUNE) or (*PRUNE:NAME)
+.sp
+This verb causes the match to fail at the current starting position in the
+subject if there is a later matching failure that causes backtracking to reach
+it. If the pattern is unanchored, the normal "bumpalong" advance to the next
+starting character then happens. Backtracking can occur as usual to the left of
+(*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
+if there is no match to the right, backtracking cannot cross (*PRUNE). In
+simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
+possessive quantifier, but there are some uses of (*PRUNE) that cannot be
+expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
+as (*COMMIT).
+.P
+The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
+like (*MARK:NAME) in that the name is remembered for passing back to the
+caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
+ignoring those set by (*COMMIT), (*PRUNE) or (*THEN).
+.sp
+  (*SKIP)
+.sp
+This verb, when given without a name, is like (*PRUNE), except that if the
+pattern is unanchored, the "bumpalong" advance is not to the next character,
+but to the position in the subject where (*SKIP) was encountered. (*SKIP)
+signifies that whatever text was matched leading up to it cannot be part of a
+successful match if there is a later mismatch. Consider:
+.sp
+  a+(*SKIP)b
+.sp
+If the subject is "aaaac...", after the first match attempt fails (starting at
+the first character in the string), the starting point skips on to start the
+next attempt at "c". Note that a possessive quantifier does not have the same
+effect as this example; although it would suppress backtracking during the
+first match attempt, the second attempt would start at the second character
+instead of skipping on to "c".
+.sp
+  (*SKIP:NAME)
+.sp
+When (*SKIP) has an associated name, its behaviour is modified. When such a
+(*SKIP) is triggered, the previous path through the pattern is searched for the
+most recent (*MARK) that has the same name. If one is found, the "bumpalong"
+advance is to the subject position that corresponds to that (*MARK) instead of
+to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
+the (*SKIP) is ignored.
+.P
+The search for a (*MARK) name uses the normal backtracking mechanism, which
+means that it does not see (*MARK) settings that are inside atomic groups or
+assertions, because they are never re-entered by backtracking. Compare the
+following \fBpcre2test\fP examples:
+.sp
+    re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
+  data> abc
+   0: a
+   1: a
+  data>
+    re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
+  data> abc
+   0: b
+   1: b
+.sp
+In the first example, the (*MARK) setting is in an atomic group, so it is not
+seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
+the second branch of the pattern to be tried at the first character position.
+In the second example, the (*MARK) setting is not in an atomic group. This
+allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
+matching attempt to start at the second character. This time, the (*MARK) is
+never seen because "a" does not match "b", so the matcher immediately jumps to
+the second branch of the pattern.
+.P
+Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
+names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or (*THEN:NAME).
+.sp
+  (*THEN) or (*THEN:NAME)
+.sp
+This verb causes a skip to the next innermost alternative when backtracking
+reaches it. That is, it cancels any further backtracking within the current
+alternative. Its name comes from the observation that it can be used for a
+pattern-based if-then-else block:
+.sp
+  ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
+.sp
+If the COND1 pattern matches, FOO is tried (and possibly further items after
+the end of the group if FOO succeeds); on failure, the matcher skips to the
+second alternative and tries COND2, without backtracking into COND1. If that
+succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
+more alternatives, so there is a backtrack to whatever came before the entire
+group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
+.P
+The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
+like (*MARK:NAME) in that the name is remembered for passing back to the
+caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
+ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
+.P
+A subpattern that does not contain a | character is just a part of the
+enclosing alternative; it is not a nested alternation with only one
+alternative. The effect of (*THEN) extends beyond such a subpattern to the
+enclosing alternative. Consider this pattern, where A, B, etc. are complex
+pattern fragments that do not contain any | characters at this level:
+.sp
+  A (B(*THEN)C) | D
+.sp
+If A and B are matched, but there is a failure in C, matching does not
+backtrack into A; instead it moves to the next alternative, that is, D.
+However, if the subpattern containing (*THEN) is given an alternative, it
+behaves differently:
+.sp
+  A (B(*THEN)C | (*FAIL)) | D
+.sp
+The effect of (*THEN) is now confined to the inner subpattern. After a failure
+in C, matching moves to (*FAIL), which causes the whole subpattern to fail
+because there are no more alternatives to try. In this case, matching does now
+backtrack into A.
+.P
+Note that a conditional subpattern is not considered as having two
+alternatives, because only one is ever used. In other words, the | character in
+a conditional subpattern has a different meaning. Ignoring white space,
+consider:
+.sp
+  ^.*? (?(?=a) a | b(*THEN)c )
+.sp
+If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
+it initially matches zero characters. The condition (?=a) then fails, the
+character "b" is matched, but "c" is not. At this point, matching does not
+backtrack to .*? as might perhaps be expected from the presence of the |
+character. The conditional subpattern is part of the single alternative that
+comprises the whole pattern, and so the match fails. (If there was a backtrack
+into .*?, allowing it to match "b", the match would succeed.)
+.P
+The verbs just described provide four different "strengths" of control when
+subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
+next alternative. (*PRUNE) comes next, failing the match at the current
+starting position, but allowing an advance to the next character (for an
+unanchored pattern). (*SKIP) is similar, except that the advance may be more
+than one character. (*COMMIT) is the strongest, causing the entire match to
+fail.
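+.P
+The different advances can be seen by modelling the unanchored search loop for
+the patterns a+(*PRUNE)b, a+(*SKIP)b and a+(*COMMIT)b. The following Python
+sketch is a hand-written simulation of the behaviour described above, not
+PCRE2 itself. It assumes that start-of-match optimizations are disabled, and
+the function name and return shape are hypothetical. (*THEN) is omitted
+because, with no alternation present, it acts like (*PRUNE):
+.sp
```python
import re

A_RUN = re.compile(r"a+")

def search_with_verb(subject: str, verb: str):
    """Simulate a+(*VERB)b; return (match span or None, start offsets tried)."""
    if verb not in ("PRUNE", "SKIP", "COMMIT"):
        raise ValueError("unsupported verb")
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        run = A_RUN.match(subject, start)
        if run is not None:
            # The verb sits after a+, so it is passed at offset run.end().
            if subject.startswith("b", run.end()):
                return (start, run.end() + 1), tried
            # The failure on "b" backtracks onto the verb, which now acts.
            if verb == "COMMIT":
                return None, tried          # no further starting points
            if verb == "SKIP":
                start = run.end()           # resume where (*SKIP) was passed
                continue
        start += 1                          # PRUNE, or a+ did not match here
    return None, tried

for verb in ("PRUNE", "SKIP", "COMMIT"):
    print(verb, search_with_verb("aaaac", verb))
```
+.P
+For the subject "aaaac" the model tries six starting offsets with PRUNE, three
+with SKIP, and one with COMMIT.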
+.
+.
+.SS "More than one backtracking verb"
+.rs
+.sp
+If more than one backtracking verb is present in a pattern, the one that is
+backtracked onto first acts. For example, consider this pattern, where A, B,
+etc. are complex pattern fragments:
+.sp
+  (A(*COMMIT)B(*THEN)C|ABD)
+.sp
+If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
+fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
+the next alternative (ABD) to be tried. This behaviour is consistent, but is
+not always the same as Perl's. It means that if two or more backtracking verbs
+appear in succession, all but the last of them have no effect. Consider this
+example:
+.sp
+  ...(*COMMIT)(*PRUNE)...
+.sp
+If there is a matching failure to the right, backtracking onto (*PRUNE) causes
+it to be triggered, and its action is taken. There can never be a backtrack
+onto (*COMMIT).
+.
+.
+.\" HTML <a name="btrepeat"></a>
+.SS "Backtracking verbs in repeated groups"
+.rs
+.sp
+PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
+repeated groups. For example, consider:
+.sp
+  /(a(*COMMIT)b)+ac/
+.sp
+If the subject is "abac", Perl matches unless its optimizations are disabled,
+but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
+acts.
+.
+.
+.\" HTML <a name="btassert"></a>
+.SS "Backtracking verbs in assertions"
+.rs
+.sp
+(*FAIL) in any assertion has its normal effect: it forces an immediate
+backtrack. The behaviour of the other backtracking verbs depends on whether the
+assertion is standalone or is acting as the condition in a conditional
+subpattern.
+.P
+(*ACCEPT) in a standalone positive assertion causes the assertion to succeed
+without any further processing; captured strings and a (*MARK) name (if set)
+are retained. In a standalone negative assertion, (*ACCEPT) causes the
+assertion to fail without any further processing; captured substrings and any
+(*MARK) name are discarded.
+.P
+If the assertion is a condition, (*ACCEPT) causes the condition to be true for
+a positive assertion and false for a negative one; captured substrings are
+retained in both cases.
+.P
+The remaining verbs act only when a later failure causes a backtrack to
+reach them. This means that their effect is confined to the assertion,
+because lookaround assertions are atomic. A backtrack that occurs after an
+assertion is complete does not jump back into the assertion. Note in particular
+that a (*MARK) name that is set in an assertion is not "seen" by an instance of
+(*SKIP:NAME) later in the pattern.
+.P
+The effect of (*THEN) is not allowed to escape beyond an assertion. If there
+are no more branches to try, (*THEN) causes a positive assertion to be false,
+and a negative assertion to be true.
+.P
+The other backtracking verbs are not treated specially if they appear in a
+standalone positive assertion. In a conditional positive assertion,
+backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
+causes the condition to be false. However, for both standalone and conditional
+negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
+the assertion to be true, without considering any further alternative branches.
+.
+.
+.\" HTML <a name="btsub"></a>
+.SS "Backtracking verbs in subroutines"
+.rs
+.sp
+These behaviours occur whether or not the subpattern is called recursively.
+.P
+(*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to
+succeed without any further processing. Matching then continues after the
+subroutine call. Perl documents this behaviour. Perl's treatment of the other
+verbs in subroutines is different in some cases.
+.P
+(*FAIL) in a subpattern called as a subroutine has its normal effect: it forces
+an immediate backtrack.
+.P
+(*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
+triggered by being backtracked to in a subpattern called as a subroutine. There
+is then a backtrack at the outer level.
+.P
+(*THEN), when triggered, skips to the next alternative in the innermost
+enclosing group within the subpattern that has alternatives (its normal
+behaviour). However, if there is no such group within the subroutine
+subpattern, the subroutine match fails and there is a backtrack at the outer
+level.
+.
+.
+.SH "SEE ALSO"
+.rs
+.sp
+\fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
+\fBpcre2syntax\fP(3), \fBpcre2\fP(3).
+.
+.
+.SH AUTHOR
+.rs
+.sp
+.nf
+Philip Hazel
+University Computing Service
+Cambridge, England.
+.fi
+.
+.
+.SH REVISION
+.rs
+.sp
+.nf
+Last updated: 04 September 2018
+Copyright (c) 1997-2018 University of Cambridge.
+.fi