src/external/pcre2-10.32/doc/pcre2pattern.3

   1 .TH PCRE2PATTERN 3 "04 September 2018" "PCRE2 10.32"
   2 .SH NAME
   3 PCRE2 - Perl-compatible regular expressions (revised API)
   4 .SH "PCRE2 REGULAR EXPRESSION DETAILS"
   5 .rs
   6 .sp
   7 The syntax and semantics of the regular expressions that are supported by PCRE2
   8 are described in detail below. There is a quick-reference syntax summary in the
   9 .\" HREF
  10 \fBpcre2syntax\fP
  11 .\"
  12 page. PCRE2 tries to match Perl syntax and semantics as closely as it can.
  13 PCRE2 also supports some alternative regular expression syntax (which does not
  14 conflict with the Perl syntax) in order to provide some compatibility with
  15 regular expressions in Python, .NET, and Oniguruma.
  16 .P
  17 Perl's regular expressions are described in its own documentation, and regular
  18 expressions in general are covered in a number of books, some of which have
  19 copious examples. Jeffrey Friedl's "Mastering Regular Expressions", published
  20 by O'Reilly, covers regular expressions in great detail. This description of
  21 PCRE2's regular expressions is intended as reference material.
  22 .P
  23 This document discusses the patterns that are supported by PCRE2 when its main
  24 matching function, \fBpcre2_match()\fP, is used. PCRE2 also has an alternative
  25 matching function, \fBpcre2_dfa_match()\fP, which matches using a different
  26 algorithm that is not Perl-compatible. Some of the features discussed below are
  27 not available when DFA matching is used. The advantages and disadvantages of
  28 the alternative function, and how it differs from the normal function, are
  29 discussed in the
  30 .\" HREF
  31 \fBpcre2matching\fP
  32 .\"
  33 page.
  34 .
  35 .
  36 .SH "SPECIAL START-OF-PATTERN ITEMS"
  37 .rs
  38 .sp
  39 A number of options that can be passed to \fBpcre2_compile()\fP can also be set
  40 by special items at the start of a pattern. These are not Perl-compatible, but
  41 are provided to make these options accessible to pattern writers who are not
  42 able to change the program that processes the pattern. Any number of these
  43 items may appear, but they must all be together right at the start of the
  44 pattern string, and the letters must be in upper case.
  45 .
  46 .
  47 .SS "UTF support"
  48 .rs
  49 .sp
  50 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either as
  51 single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32 can be
  52 specified for the 32-bit library, in which case it constrains the character
  53 values to valid Unicode code points. To process UTF strings, PCRE2 must be
  54 built to include Unicode support (which is the default). When using UTF strings
  55 you must either call the compiling function with the PCRE2_UTF option, or the
  56 pattern must start with the special sequence (*UTF), which is equivalent to
  57 setting the relevant option. How setting a UTF mode affects pattern matching is
  58 mentioned in several places below. There is also a summary of features in the
  59 .\" HREF
  60 \fBpcre2unicode\fP
  61 .\"
  62 page.
  63 .P
  64 Some applications that allow their users to supply patterns may wish to
  65 restrict them to non-UTF data for security reasons. If the PCRE2_NEVER_UTF
  66 option is passed to \fBpcre2_compile()\fP, (*UTF) is not allowed, and its
  67 appearance in a pattern causes an error.
  68 .
  69 .
  70 .SS "Unicode property support"
  71 .rs
  72 .sp
  73 Another special sequence that may appear at the start of a pattern is (*UCP).
  74 This has the same effect as setting the PCRE2_UCP option: it causes sequences
  75 such as \ed and \ew to use Unicode properties to determine character types,
  76 instead of recognizing only characters with codes less than 256 via a lookup
  77 table.
  78 .P
  79 Some applications that allow their users to supply patterns may wish to
  80 restrict them for security reasons. If the PCRE2_NEVER_UCP option is passed to
  81 \fBpcre2_compile()\fP, (*UCP) is not allowed, and its appearance in a pattern
  82 causes an error.
  83 .
  84 .
  85 .SS "Locking out empty string matching"
  86 .rs
  87 .sp
  88 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same effect
  89 as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option to whichever
  90 matching function is subsequently called to match the pattern. These options
  91 lock out the matching of empty strings, either entirely, or only at the start
  92 of the subject.
  93 .
  94 .
  95 .SS "Disabling auto-possessification"
  96 .rs
  97 .sp
  98 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as setting
  99 the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making quantifiers
 100 possessive when what follows cannot match the repeated item. For example, by
 101 default a+b is treated as a++b. For more details, see the
 102 .\" HREF
 103 \fBpcre2api\fP
 104 .\"
 105 documentation.
 106 .
 107 .
 108 .SS "Disabling start-up optimizations"
 109 .rs
 110 .sp
 111 If a pattern starts with (*NO_START_OPT), it has the same effect as setting the
 112 PCRE2_NO_START_OPTIMIZE option. This disables several optimizations for quickly
 113 reaching "no match" results. For more details, see the
 114 .\" HREF
 115 \fBpcre2api\fP
 116 .\"
 117 documentation.
 118 .
 119 .
 120 .SS "Disabling automatic anchoring"
 121 .rs
 122 .sp
 123 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect as
 124 setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimizations that
 125 apply to patterns whose top-level branches all start with .* (match any number
 126 of arbitrary characters). For more details, see the
 127 .\" HREF
 128 \fBpcre2api\fP
 129 .\"
 130 documentation.
 131 .
 132 .
 133 .SS "Disabling JIT compilation"
 134 .rs
 135 .sp
 136 If a pattern that starts with (*NO_JIT) is successfully compiled, an attempt by
 137 the application to apply the JIT optimization by calling
 138 \fBpcre2_jit_compile()\fP is ignored.
 139 .
 140 .
 141 .SS "Setting match resource limits"
 142 .rs
 143 .sp
 144 The \fBpcre2_match()\fP function contains a counter that is incremented every
 145 time it goes round its main loop. The caller of \fBpcre2_match()\fP can set a
 146 limit on this counter, which therefore limits the amount of computing resource
 147 used for a match. The maximum depth of nested backtracking can also be limited;
 148 this indirectly restricts the amount of heap memory that is used, but there is
 149 also an explicit memory limit that can be set.
 150 .P
 151 These facilities are provided to catch runaway matches that are provoked by
 152 patterns with huge matching trees (a typical example is a pattern with nested
 153 unlimited repeats applied to a long string that does not match). When one of
 154 these limits is reached, \fBpcre2_match()\fP gives an error return. The limits
 155 can also be set by items at the start of the pattern of the form
 156 .sp
 157   (*LIMIT_HEAP=d)
 158   (*LIMIT_MATCH=d)
 159   (*LIMIT_DEPTH=d)
 160 .sp
 161 where d is any number of decimal digits. However, the value of the setting must
 162 be less than the value set (or defaulted) by the caller of \fBpcre2_match()\fP
 163 for it to have any effect. In other words, the pattern writer can lower the
 164 limits set by the programmer, but not raise them. If there is more than one
 165 setting of one of these limits, the lower value is used. The heap limit is
 166 specified in kibibytes (units of 1024 bytes).
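.P
The lowering rule can be sketched in a few lines of Python. This is only a
model of the behaviour described above, not PCRE2 code; the function name
effective_limit and its parameters are hypothetical.

```python
def effective_limit(caller_limit, pattern_settings):
    """Model how (*LIMIT_...=d) items combine with the caller's limit.

    caller_limit     -- value set (or defaulted) by the caller of pcre2_match()
    pattern_settings -- the d values found at the start of the pattern
    """
    limit = caller_limit
    for d in pattern_settings:
        # A pattern setting has an effect only when it is lower than the
        # limit currently in force, so repeated settings keep the lowest.
        if d < limit:
            limit = d
    return limit
```

Under this model a pattern item larger than the caller's limit is silently
ineffective, which matches the statement that the pattern writer can lower the
limits but not raise them.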
 167 .P
 168 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This name is
 169 still recognized for backwards compatibility.
 170 .P
 171 The heap limit applies only when the \fBpcre2_match()\fP or
 172 \fBpcre2_dfa_match()\fP interpreters are used for matching. It does not apply
 173 to JIT. The match limit is used (but in a different way) when JIT is being
 174 used, or when \fBpcre2_dfa_match()\fP is called, to limit computing resource
 175 usage by those matching functions. The depth limit is ignored by JIT but is
 176 relevant for DFA matching, which uses function recursion for recursions within
 177 the pattern and for lookaround assertions and atomic groups. In this case, the
 178 depth limit controls the depth of such recursion.
 179 .
 180 .
 181 .\" HTML <a name="newlines"></a>
 182 .SS "Newline conventions"
 183 .rs
 184 .sp
 185 PCRE2 supports six different conventions for indicating line breaks in
 186 strings: a single CR (carriage return) character, a single LF (linefeed)
 187 character, the two-character sequence CRLF, any of the three preceding, any
 188 Unicode newline sequence, or the NUL character (binary zero). The
 189 .\" HREF
 190 \fBpcre2api\fP
 191 .\"
 192 page has
 193 .\" HTML <a href="pcre2api.html#newlines">
 194 .\" </a>
 195 further discussion
 196 .\"
 197 about newlines, and shows how to set the newline convention when calling
 198 \fBpcre2_compile()\fP.
 199 .P
 200 It is also possible to specify a newline convention by starting a pattern
 201 string with one of the following sequences:
 202 .sp
 203   (*CR)        carriage return
 204   (*LF)        linefeed
 205   (*CRLF)      carriage return, followed by linefeed
 206   (*ANYCRLF)   any of the three above
 207   (*ANY)       all Unicode newline sequences
 208   (*NUL)       the NUL character (binary zero)
 209 .sp
 210 These override the default and the options given to the compiling function. For
 211 example, on a Unix system where LF is the default newline sequence, the pattern
 212 .sp
 213   (*CR)a.b
 214 .sp
 215 changes the convention to CR. That pattern matches "a\enb" because LF is no
 216 longer a newline. If more than one of these settings is present, the last one
 217 is used.
 218 .P
 219 The newline convention affects where the circumflex and dollar assertions are
 220 true. It also affects the interpretation of the dot metacharacter when
 221 PCRE2_DOTALL is not set, and the behaviour of \eN when not followed by an
 222 opening brace. However, it does not affect what the \eR escape sequence
 223 matches. By default, this is any Unicode newline sequence, for Perl
 224 compatibility. However, this can be changed; see the next section and the
 225 description of \eR in the section entitled
 226 .\" HTML <a href="#newlineseq">
 227 .\" </a>
 228 "Newline sequences"
 229 .\"
 230 below. A change of \eR setting can be combined with a change of newline
 231 convention.
 232 .
 233 .
 234 .SS "Specifying what \eR matches"
 235 .rs
 236 .sp
 237 It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
 238 complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
 239 at compile time. This effect can also be achieved by starting a pattern with
 240 (*BSR_ANYCRLF). For completeness, (*BSR_UNICODE) is also recognized,
 241 corresponding to PCRE2_BSR_UNICODE.
 242 .
 243 .
 244 .SH "EBCDIC CHARACTER CODES"
 245 .rs
 246 .sp
 247 PCRE2 can be compiled to run in an environment that uses EBCDIC as its
 248 character code instead of ASCII or Unicode (typically a mainframe system). In
 249 the sections below, character code values are ASCII or Unicode; in an EBCDIC
 250 environment these characters may have different code values, and there are no
 251 code points greater than 255.
 252 .
 253 .
 254 .SH "CHARACTERS AND METACHARACTERS"
 255 .rs
 256 .sp
 257 A regular expression is a pattern that is matched against a subject string from
 258 left to right. Most characters stand for themselves in a pattern, and match the
 259 corresponding characters in the subject. As a trivial example, the pattern
 260 .sp
 261   The quick brown fox
 262 .sp
 263 matches a portion of a subject string that is identical to itself. When
 264 caseless matching is specified (the PCRE2_CASELESS option), letters are matched
 265 independently of case.
 266 .P
 267 The power of regular expressions comes from the ability to include alternatives
 268 and repetitions in the pattern. These are encoded in the pattern by the use of
 269 \fImetacharacters\fP, which do not stand for themselves but instead are
 270 interpreted in some special way.
 271 .P
 272 There are two different sets of metacharacters: those that are recognized
 273 anywhere in the pattern except within square brackets, and those that are
 274 recognized within square brackets. Outside square brackets, the metacharacters
 275 are as follows:
 276 .sp
 277   \e      general escape character with several uses
 278   ^      assert start of string (or line, in multiline mode)
 279   $      assert end of string (or line, in multiline mode)
 280   .      match any character except newline (by default)
 281   [      start character class definition
 282   |      start of alternative branch
 283   (      start subpattern
 284   )      end subpattern
 285   ?      extends the meaning of (
 286          also 0 or 1 quantifier
 287          also quantifier minimizer
 288   *      0 or more quantifier
 289   +      1 or more quantifier
 290          also "possessive quantifier"
 291   {      start min/max quantifier
 292 .sp
 293 Part of a pattern that is in square brackets is called a "character class". In
 294 a character class the only metacharacters are:
 295 .sp
 296   \e      general escape character
 297   ^      negate the class, but only if the first character
 298   -      indicates character range
 299 .\" JOIN
 300   [      POSIX character class (only if followed by POSIX
 301            syntax)
 302   ]      terminates the character class
 303 .sp
 304 The following sections describe the use of each of the metacharacters.
 305 .
 306 .
 307 .SH BACKSLASH
 308 .rs
 309 .sp
 310 The backslash character has several uses. Firstly, if it is followed by a
 311 character that is not a number or a letter, it takes away any special meaning
 312 that character may have. This use of backslash as an escape character applies
 313 both inside and outside character classes.
 314 .P
 315 For example, if you want to match a * character, you must write \e* in the
 316 pattern. This escaping action applies whether or not the following character
 317 would otherwise be interpreted as a metacharacter, so it is always safe to
 318 precede a non-alphanumeric with backslash to specify that it stands for itself.
 319 In particular, if you want to match a backslash, you write \e\e.
 320 .P
 321 In a UTF mode, only ASCII numbers and letters have any special meaning after a
 322 backslash. All other characters (in particular, those whose code points are
 323 greater than 127) are treated as literals.
 324 .P
 325 If a pattern is compiled with the PCRE2_EXTENDED option, most white space in
 326 the pattern (other than in a character class), and characters between a #
 327 outside a character class and the next newline, inclusive, are ignored. An
 328 escaping backslash can be used to include a white space or # character as part
 329 of the pattern.
 330 .P
 331 If you want to remove the special meaning from a sequence of characters, you
 332 can do so by putting them between \eQ and \eE. This is different from Perl in
 333 that $ and @ are handled as literals in \eQ...\eE sequences in PCRE2, whereas
 334 in Perl, $ and @ cause variable interpolation. Also, Perl does "double-quotish
 335 backslash interpolation" on any backslashes between \eQ and \eE which, its
 336 documentation says, "may lead to confusing results". PCRE2 treats a backslash
 337 between \eQ and \eE just like any other character. Note the following examples:
 338 .sp
 339   Pattern            PCRE2 matches   Perl matches
 340 .sp
 341 .\" JOIN
 342   \eQabc$xyz\eE        abc$xyz        abc followed by the
 343                                       contents of $xyz
 344   \eQabc\e$xyz\eE       abc\e$xyz       abc\e$xyz
 345   \eQabc\eE\e$\eQxyz\eE   abc$xyz        abc$xyz
 346   \eQA\eB\eE            A\eB            A\eB
 347   \eQ\e\eE              \e              \e\eE
 348 .sp
 349 The \eQ...\eE sequence is recognized both inside and outside character classes.
 350 An isolated \eE that is not preceded by \eQ is ignored. If \eQ is not followed
 351 by \eE later in the pattern, the literal interpretation continues to the end of
 352 the pattern (that is, \eE is assumed at the end). If the isolated \eQ is inside
 353 a character class, this causes an error, because the character class is not
 354 terminated by a closing square bracket.
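.P
The quoting rules can be modelled by a small translator that rewrites \eQ...\eE
spans as escaped literals for Python's re module. This is a sketch under stated
assumptions: Python's re is not PCRE2, the name expand_quoting is hypothetical,
and the unterminated-class error case is not modelled.

```python
import re

def expand_quoting(pattern):
    """Rewrite \\Q...\\E spans in pattern as escaped literal text."""
    out = []
    i = 0
    n = len(pattern)
    while i < n:
        if pattern.startswith("\\Q", i):
            end = pattern.find("\\E", i + 2)
            if end == -1:
                # No closing \E: the literal runs to the end of the pattern.
                literal, i = pattern[i + 2:], n
            else:
                literal, i = pattern[i + 2:end], end + 2
            out.append(re.escape(literal))
        elif pattern.startswith("\\E", i):
            # An isolated \E that is not preceded by \Q is ignored.
            i += 2
        elif pattern[i] == "\\" and i + 1 < n:
            # Keep any other escape pair intact, so that an escaped
            # backslash followed by Q is not mistaken for \Q.
            out.append(pattern[i:i + 2])
            i += 2
        else:
            out.append(pattern[i])
            i += 1
    return "".join(out)
```

Applied to the patterns in the table above, the translated expressions match
the strings listed in the "PCRE2 matches" column.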
 355 .
 356 .
 357 .\" HTML <a name="digitsafterbackslash"></a>
 358 .SS "Non-printing characters"
 359 .rs
 360 .sp
 361 A second use of backslash provides a way of encoding non-printing characters
 362 in patterns in a visible manner. There is no restriction on the appearance of
 363 non-printing characters in a pattern, but when a pattern is being prepared by
 364 text editing, it is often easier to use one of the following escape sequences
 365 than the binary character it represents. In an ASCII or Unicode environment,
 366 these escapes are as follows:
 367 .sp
 368   \ea          alarm, that is, the BEL character (hex 07)
 369   \ecx         "control-x", where x is any printable ASCII character
 370   \ee          escape (hex 1B)
 371   \ef          form feed (hex 0C)
 372   \en          linefeed (hex 0A)
 373   \er          carriage return (hex 0D)
 374   \et          tab (hex 09)
 375   \e0dd        character with octal code 0dd
 376   \eddd        character with octal code ddd, or backreference
 377   \eo{ddd..}   character with octal code ddd..
 378   \exhh        character with hex code hh
 379   \ex{hhh..}   character with hex code hhh..
 380   \eN{U+hhh..} character with Unicode hex code point hhh..
 381   \euhhhh      character with hex code hhhh (when PCRE2_ALT_BSUX is set)
 382 .sp
 383 The \eN{U+hhh..} escape sequence is recognized only when the PCRE2_UTF option
 384 is set, that is, when PCRE2 is operating in a Unicode mode. Perl also uses
 385 \eN{name} to specify characters by Unicode name; PCRE2 does not support this.
 386 Note that when \eN is not followed by an opening brace (curly bracket) it has
 387 an entirely different meaning, matching any character that is not a newline.
 388 .P
 389 The precise effect of \ecx on ASCII characters is as follows: if x is a lower
 390 case letter, it is converted to upper case. Then bit 6 of the character (hex
 391 40) is inverted. Thus \ecA to \ecZ become hex 01 to hex 1A (A is 41, Z is 5A),
 392 but \ec{ becomes hex 3B ({ is 7B), and \ec; becomes hex 7B (; is 3B). If the
 393 code unit following \ec has a value less than 32 or greater than 126, a
 394 compile-time error occurs.
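.P
The arithmetic for the ASCII case can be checked with a short Python model.
The function name control_escape is hypothetical, and the ValueError stands in
for the compile-time error.

```python
def control_escape(x):
    """Return the code value that the control escape yields for character x."""
    code = ord(x)
    if code < 32 or code > 126:
        raise ValueError("character after the control escape must be printable ASCII")
    if "a" <= x <= "z":
        # Lower case letters are first converted to upper case.
        code = ord(x.upper())
    # Invert bit 6 (hex 40) of the character.
    return code ^ 0x40
```

For instance, control_escape("A") gives hex 01 and control_escape("{") gives
hex 3B, as stated above.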
 395 .P
 396 When PCRE2 is compiled in EBCDIC mode, \eN{U+hhh..} is not supported. \ea, \ee,
 397 \ef, \en, \er, and \et generate the appropriate EBCDIC code values. The \ec
 398 escape is processed as specified for Perl in the \fBperlebcdic\fP document. The
 399 only characters that are allowed after \ec are A-Z, a-z, or one of @, [, \e, ],
 400 ^, _, or ?. Any other character provokes a compile-time error. The sequence
 401 \ec@ encodes character code 0; after \ec the letters (in either case) encode
 402 characters 1-26 (hex 01 to hex 1A); [, \e, ], ^, and _ encode characters 27-31
 403 (hex 1B to hex 1F), and \ec? becomes either 255 (hex FF) or 95 (hex 5F).
 404 .P
 405 Thus, apart from \ec?, these escapes generate the same character code values as
 406 they do in an ASCII environment, though the meanings of the values mostly
 407 differ. For example, \ecG always generates code value 7, which is BEL in ASCII
 408 but DEL in EBCDIC.
 409 .P
 410 The sequence \ec? generates DEL (127, hex 7F) in an ASCII environment, but
 411 because 127 is not a control character in EBCDIC, Perl makes it generate the
 412 APC character. Unfortunately, there are several variants of EBCDIC. In most of
 413 them the APC character has the value 255 (hex FF), but in the one Perl calls
 414 POSIX-BC its value is 95 (hex 5F). If certain other characters have POSIX-BC
 415 values, PCRE2 makes \ec? generate 95; otherwise it generates 255.
 416 .P
 417 After \e0 up to two further octal digits are read. If there are fewer than two
 418 digits, just those that are present are used. Thus the sequence \e0\ex\e015
 419 specifies two binary zeros followed by a CR character (code value 13). Make
 420 sure you supply two digits after the initial zero if the pattern character that
 421 follows is itself an octal digit.
 422 .P
 423 The escape \eo must be followed by a sequence of octal digits, enclosed in
 424 braces. An error occurs if this is not the case. This escape is a recent
  425 addition to Perl; it provides a way of specifying character code points as octal
 426 numbers greater than 0777, and it also allows octal numbers and backreferences
 427 to be unambiguously specified.
 428 .P
 429 For greater clarity and unambiguity, it is best to avoid following \e by a
 430 digit greater than zero. Instead, use \eo{} or \ex{} to specify numerical
 431 character code points, and \eg{} to specify backreferences. The following
 432 paragraphs describe the old, ambiguous syntax.
 433 .P
 434 The handling of a backslash followed by a digit other than 0 is complicated,
 435 and Perl has changed over time, causing PCRE2 also to change.
 436 .P
 437 Outside a character class, PCRE2 reads the digit and any following digits as a
 438 decimal number. If the number is less than 10, begins with the digit 8 or 9, or
 439 if there are at least that many previous capturing left parentheses in the
 440 expression, the entire sequence is taken as a \fIbackreference\fP. A
 441 description of how this works is given
 442 .\" HTML <a href="#backreferences">
 443 .\" </a>
 444 later,
 445 .\"
 446 following the discussion of
 447 .\" HTML <a href="#subpattern">
 448 .\" </a>
 449 parenthesized subpatterns.
 450 .\"
 451 Otherwise, up to three octal digits are read to form a character code.
 452 .P
 453 Inside a character class, PCRE2 handles \e8 and \e9 as the literal characters
 454 "8" and "9", and otherwise reads up to three octal digits following the
 455 backslash, using them to generate a data character. Any subsequent digits stand
 456 for themselves. For example, outside a character class:
 457 .sp
 458   \e040   is another way of writing an ASCII space
 459 .\" JOIN
 460   \e40    is the same, provided there are fewer than 40
 461             previous capturing subpatterns
 462   \e7     is always a backreference
 463 .\" JOIN
 464   \e11    might be a backreference, or another way of
 465             writing a tab
 466   \e011   is always a tab
 467   \e0113  is a tab followed by the character "3"
 468 .\" JOIN
 469   \e113   might be a backreference, otherwise the
 470             character with octal code 113
 471 .\" JOIN
 472   \e377   might be a backreference, otherwise
 473             the value 255 (decimal)
 474 .\" JOIN
 475   \e81    is always a backreference
 476 .sp
 477 Note that octal values of 100 or greater that are specified using this syntax
 478 must not be introduced by a leading zero, because no more than three octal
 479 digits are ever read.
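.P
The decision made outside a character class can be summarized by the following
Python sketch. It models only the rule as described in this section; the
function name, its arguments, and the returned tuples are hypothetical, and
mode-dependent limits on the resulting character value are not checked.

```python
def classify_backslash_digits(digits, groups_so_far):
    """Classify a backslash followed by digits whose first digit is 1-9.

    digits        -- the run of decimal digits after the backslash
    groups_so_far -- number of previous capturing left parentheses

    Returns ("backreference", number, "") or
            ("character", code, trailing_literal_digits).
    """
    if not digits or digits[0] == "0" or any(c not in "0123456789" for c in digits):
        raise ValueError("expected decimal digits starting with 1-9")
    number = int(digits)
    if number < 10 or digits[0] in "89" or groups_so_far >= number:
        return ("backreference", number, "")
    # Otherwise read up to three octal digits as a character code;
    # any digits that are left over stand for themselves.
    i = 0
    while i < len(digits) and i < 3 and digits[i] in "01234567":
        i += 1
    return ("character", int(digits[:i], 8), digits[i:])
```

With no previous capturing groups, "40" is classified as the character with
code 32 (a space), whereas with 40 previous groups it is a backreference.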
 480 .P
 481 By default, after \ex that is not followed by {, from zero to two hexadecimal
 482 digits are read (letters can be in upper or lower case). Any number of
 483 hexadecimal digits may appear between \ex{ and }. If a character other than
 484 a hexadecimal digit appears between \ex{ and }, or if there is no terminating
 485 }, an error occurs.
 486 .P
 487 If the PCRE2_ALT_BSUX option is set, the interpretation of \ex is as just
 488 described only when it is followed by two hexadecimal digits. Otherwise, it
 489 matches a literal "x" character. In this mode, support for code points greater
  490 than 255 is provided by \eu, which must be followed by four hexadecimal digits;
 491 otherwise it matches a literal "u" character.
 492 .P
 493 Characters whose value is less than 256 can be defined by either of the two
 494 syntaxes for \ex (or by \eu in PCRE2_ALT_BSUX mode). There is no difference in
 495 the way they are handled. For example, \exdc is exactly the same as \ex{dc} (or
 496 \eu00dc in PCRE2_ALT_BSUX mode).
 497 .
 498 .
 499 .SS "Constraints on character values"
 500 .rs
 501 .sp
 502 Characters that are specified using octal or hexadecimal numbers are
 503 limited to certain values, as follows:
 504 .sp
 505   8-bit non-UTF mode    no greater than 0xff
 506   16-bit non-UTF mode   no greater than 0xffff
 507   32-bit non-UTF mode   no greater than 0xffffffff
 508   All UTF modes         no greater than 0x10ffff and a valid code point
 509 .sp
 510 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff (the
 511 so-called "surrogate" code points). The check for these can be disabled by the
 512 caller of \fBpcre2_compile()\fP by setting the option
 513 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in UTF-8
 514 and UTF-32 modes, because these values are not representable in UTF-16.
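.P
The table and the surrogate rule can be expressed as a single check. This
Python sketch is a model only; the function name and parameters are
hypothetical, and allow_surrogates stands in for the
PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option.

```python
def code_point_allowed(value, width, utf, allow_surrogates=False):
    """Model the limits on numerically specified character values.

    width -- code unit width of the library: 8, 16 or 32
    utf   -- True when a UTF mode is in force
    """
    if value < 0:
        return False
    if not utf:
        return value <= {8: 0xff, 16: 0xffff, 32: 0xffffffff}[width]
    if value > 0x10ffff:
        return False
    if 0xd800 <= value <= 0xdfff:
        # Surrogates can be permitted only in UTF-8 and UTF-32 modes.
        return allow_surrogates and width in (8, 32)
    return True
```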
 515 .
 516 .
 517 .SS "Escape sequences in character classes"
 518 .rs
 519 .sp
 520 All the sequences that define a single character value can be used both inside
 521 and outside character classes. In addition, inside a character class, \eb is
 522 interpreted as the backspace character (hex 08).
 523 .P
 524 When not followed by an opening brace, \eN is not allowed in a character class.
 525 \eB, \eR, and \eX are not special inside a character class. Like other
 526 unrecognized alphabetic escape sequences, they cause an error. Outside a
 527 character class, these sequences have different meanings.
 528 .
 529 .
 530 .SS "Unsupported escape sequences"
 531 .rs
 532 .sp
 533 In Perl, the sequences \eF, \el, \eL, \eu, and \eU are recognized by its string
 534 handler and used to modify the case of following characters. By default, PCRE2
 535 does not support these escape sequences. However, if the PCRE2_ALT_BSUX option
 536 is set, \eU matches a "U" character, and \eu can be used to define a character
 537 by code point, as described above.
 538 .
 539 .
 540 .SS "Absolute and relative backreferences"
 541 .rs
 542 .sp
 543 The sequence \eg followed by a signed or unsigned number, optionally enclosed
 544 in braces, is an absolute or relative backreference. A named backreference
 545 can be coded as \eg{name}. Backreferences are discussed
 546 .\" HTML <a href="#backreferences">
 547 .\" </a>
 548 later,
 549 .\"
 550 following the discussion of
 551 .\" HTML <a href="#subpattern">
 552 .\" </a>
 553 parenthesized subpatterns.
 554 .\"
 555 .
 556 .
 557 .SS "Absolute and relative subroutine calls"
 558 .rs
 559 .sp
 560 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
  561 a number enclosed either in angle brackets or single quotes is an alternative
 562 syntax for referencing a subpattern as a "subroutine". Details are discussed
 563 .\" HTML <a href="#onigurumasubroutines">
 564 .\" </a>
 565 later.
 566 .\"
 567 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
 568 synonymous. The former is a backreference; the latter is a
 569 .\" HTML <a href="#subpatternsassubroutines">
 570 .\" </a>
 571 subroutine
 572 .\"
 573 call.
 574 .
 575 .
 576 .\" HTML <a name="genericchartypes"></a>
 577 .SS "Generic character types"
 578 .rs
 579 .sp
 580 Another use of backslash is for specifying generic character types:
 581 .sp
 582   \ed     any decimal digit
 583   \eD     any character that is not a decimal digit
 584   \eh     any horizontal white space character
 585   \eH     any character that is not a horizontal white space character
 586   \eN     any character that is not a newline
 587   \es     any white space character
 588   \eS     any character that is not a white space character
 589   \ev     any vertical white space character
 590   \eV     any character that is not a vertical white space character
 591   \ew     any "word" character
 592   \eW     any "non-word" character
 593 .sp
 594 The \eN escape sequence has the same meaning as
 595 .\" HTML <a href="#fullstopdot">
 596 .\" </a>
 597 the "." metacharacter
 598 .\"
 599 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change the
 600 meaning of \eN. Note that when \eN is followed by an opening brace it has a
 601 different meaning. See the section entitled
 602 .\" HTML <a href="#digitsafterbackslash">
 603 .\" </a>
 604 "Non-printing characters"
 605 .\"
 606 above for details. Perl also uses \eN{name} to specify characters by Unicode
 607 name; PCRE2 does not support this.
 608 .P
 609 Each pair of lower and upper case escape sequences partitions the complete set
 610 of characters into two disjoint sets. Any given character matches one, and only
 611 one, of each pair. The sequences can appear both inside and outside character
 612 classes. They each match one character of the appropriate type. If the current
 613 matching point is at the end of the subject string, all of them fail, because
 614 there is no character to match.
 615 .P
 616 The default \es characters are HT (9), LF (10), VT (11), FF (12), CR (13), and
 617 space (32), which are defined as white space in the "C" locale. This list may
 618 vary if locale-specific matching is taking place. For example, in some locales
 619 the "non-breaking space" character (\exA0) is recognized as white space, and in
 620 others the VT character is not.
 621 .P
 622 A "word" character is an underscore or any character that is a letter or digit.
 623 By default, the definition of letters and digits is controlled by PCRE2's
 624 low-valued character tables, and may vary if locale-specific matching is taking
 625 place (see
 626 .\" HTML <a href="pcre2api.html#localesupport">
 627 .\" </a>
 628 "Locale support"
 629 .\"
 630 in the
 631 .\" HREF
 632 \fBpcre2api\fP
 633 .\"
 634 page). For example, in a French locale such as "fr_FR" in Unix-like systems,
 635 or "french" in Windows, some character codes greater than 127 are used for
 636 accented letters, and these are then matched by \ew. The use of locales with
 637 Unicode is discouraged.
 638 .P
 639 By default, characters whose code points are greater than 127 never match \ed,
 640 \es, or \ew, and always match \eD, \eS, and \eW, although this may be different
 641 for characters in the range 128-255 when locale-specific matching is happening.
 642 These escape sequences retain their original meanings from before Unicode
 643 support was available, mainly for efficiency reasons. If the PCRE2_UCP option
 644 is set, the behaviour is changed so that Unicode properties are used to
 645 determine character types, as follows:
 646 .sp
 647   \ed  any character that matches \ep{Nd} (decimal digit)
 648   \es  any character that matches \ep{Z} or \eh or \ev
 649   \ew  any character that matches \ep{L} or \ep{N}, plus underscore
 650 .sp
 651 The upper case escapes match the inverse sets of characters. Note that \ed
 652 matches only decimal digits, whereas \ew matches any Unicode digit, as well as
  653 any Unicode letter, and underscore. Note also that PCRE2_UCP affects \eb and
  654 \eB, because they are defined in terms of \ew and \eW. Matching these sequences
 655 is noticeably slower when PCRE2_UCP is set.
 656 .P
 657 The sequences \eh, \eH, \ev, and \eV, in contrast to the other sequences, which
 658 match only ASCII characters by default, always match a specific list of code
 659 points, whether or not PCRE2_UCP is set. The horizontal space characters are:
 660 .sp
 661   U+0009     Horizontal tab (HT)
 662   U+0020     Space
 663   U+00A0     Non-break space
 664   U+1680     Ogham space mark
 665   U+180E     Mongolian vowel separator
 666   U+2000     En quad
 667   U+2001     Em quad
 668   U+2002     En space
 669   U+2003     Em space
 670   U+2004     Three-per-em space
 671   U+2005     Four-per-em space
 672   U+2006     Six-per-em space
 673   U+2007     Figure space
 674   U+2008     Punctuation space
 675   U+2009     Thin space
 676   U+200A     Hair space
 677   U+202F     Narrow no-break space
 678   U+205F     Medium mathematical space
 679   U+3000     Ideographic space
 680 .sp
 681 The vertical space characters are:
 682 .sp
 683   U+000A     Linefeed (LF)
 684   U+000B     Vertical tab (VT)
 685   U+000C     Form feed (FF)
 686   U+000D     Carriage return (CR)
 687   U+0085     Next line (NEL)
 688   U+2028     Line separator
 689   U+2029     Paragraph separator
 690 .sp
 691 In 8-bit, non-UTF-8 mode, only the characters with code points less than 256
 692 are relevant.
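.P
As an illustration only, the two fixed lists above can be written as a pair of
C predicates. The function names are hypothetical and are not part of the PCRE2
API; the code points are copied from the tables above.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp is one of the horizontal space code points listed above. */
bool is_hspace(uint32_t cp)
{
    switch (cp) {
    case 0x0009: case 0x0020: case 0x00A0: case 0x1680: case 0x180E:
    case 0x202F: case 0x205F: case 0x3000:
        return true;
    default:
        /* U+2000 (En quad) to U+200A (Hair space) are contiguous. */
        return cp >= 0x2000 && cp <= 0x200A;
    }
}

/* True if cp is one of the vertical space code points listed above. */
bool is_vspace(uint32_t cp)
{
    return (cp >= 0x000A && cp <= 0x000D) ||
           cp == 0x0085 || cp == 0x2028 || cp == 0x2029;
}
```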
 693 .
 694 .
 695 .\" HTML <a name="newlineseq"></a>
 696 .SS "Newline sequences"
 697 .rs
 698 .sp
 699 Outside a character class, by default, the escape sequence \eR matches any
 700 Unicode newline sequence. In 8-bit non-UTF-8 mode \eR is equivalent to the
 701 following:
 702 .sp
 703   (?>\er\en|\en|\ex0b|\ef|\er|\ex85)
 704 .sp
 705 This is an example of an "atomic group", details of which are given
 706 .\" HTML <a href="#atomicgroup">
 707 .\" </a>
 708 below.
 709 .\"
 710 This particular group matches either the two-character sequence CR followed by
 711 LF, or one of the single characters LF (linefeed, U+000A), VT (vertical tab,
 712 U+000B), FF (form feed, U+000C), CR (carriage return, U+000D), or NEL (next
 713 line, U+0085). Because this is an atomic group, the two-character sequence is
 714 treated as a single unit that cannot be split.
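.P
The behaviour of this group can be sketched in C as follows. This is a model of
the 8-bit non-UTF-8 case only, under the assumption that the subject is a plain
byte buffer; the function name is hypothetical and is not part of PCRE2.

```c
#include <stddef.h>

/* Number of code units that the newline group would match at s[pos],
   or 0 if there is no newline sequence there (8-bit non-UTF model). */
size_t bsr_match_len(const unsigned char *s, size_t len, size_t pos)
{
    if (pos >= len) return 0;
    unsigned char c = s[pos];
    if (c == 0x0D) {
        /* CR LF is tried first; being atomic, it is never split later. */
        return (pos + 1 < len && s[pos + 1] == 0x0A) ? 2 : 1;
    }
    /* LF, VT and FF are 0x0A to 0x0C; NEL is 0x85. */
    if ((c >= 0x0A && c <= 0x0C) || c == 0x85) return 1;
    return 0;
}
```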
 715 .P
 716 In other modes, two additional characters whose code points are greater than 255
 717 are added: LS (line separator, U+2028) and PS (paragraph separator, U+2029).
 718 Unicode support is not needed for these characters to be recognized.
 719 .P
 720 It is possible to restrict \eR to match only CR, LF, or CRLF (instead of the
 721 complete set of Unicode line endings) by setting the option PCRE2_BSR_ANYCRLF
 722 at compile time. (BSR is an abbreviation for "backslash R".) This can be made
 723 the default when PCRE2 is built; if this is the case, the other behaviour can
 724 be requested via the PCRE2_BSR_UNICODE option. It is also possible to specify
 725 these settings by starting a pattern string with one of the following
 726 sequences:
 727 .sp
 728   (*BSR_ANYCRLF)   CR, LF, or CRLF only
 729   (*BSR_UNICODE)   any Unicode newline sequence
 730 .sp
 731 These override the default and the options given to the compiling function.
 732 Note that these special settings, which are not Perl-compatible, are recognized
 733 only at the very start of a pattern, and that they must be in upper case. If
 734 more than one of them is present, the last one is used. They can be combined
 735 with a change of newline convention; for example, a pattern can start with:
 736 .sp
 737   (*ANY)(*BSR_ANYCRLF)
 738 .sp
 739 They can also be combined with the (*UTF) or (*UCP) special sequences. Inside a
 740 character class, \eR is treated as an unrecognized escape sequence, and causes
 741 an error.
 742 .
 743 .
 744 .\" HTML <a name="uniextseq"></a>
 745 .SS Unicode character properties
 746 .rs
 747 .sp
 748 When PCRE2 is built with Unicode support (the default), three additional escape
 749 sequences that match characters with specific properties are available. In
 750 8-bit non-UTF-8 mode, these sequences are of course limited to testing
 751 characters whose code points are less than 256, but they do work in this mode.
 752 In 32-bit non-UTF mode, code points greater than 0x10ffff (the Unicode limit)
 753 may be encountered. These are all treated as being in the Common script and
 754 with an unassigned type. The extra escape sequences are:
 755 .sp
 756   \ep{\fIxx\fP}   a character with the \fIxx\fP property
 757   \eP{\fIxx\fP}   a character without the \fIxx\fP property
 758   \eX       a Unicode extended grapheme cluster
 759 .sp
 760 The property names represented by \fIxx\fP above are limited to the Unicode
 761 script names, the general category properties, "Any", which matches any
 762 character (including newline), and some special PCRE2 properties (described
 763 in the
 764 .\" HTML <a href="#extraprops">
 765 .\" </a>
 766 next section).
 767 .\"
 768 Other Perl properties such as "InMusicalSymbols" are not supported by PCRE2.
 769 Note that \eP{Any} does not match any characters, so always causes a match
 770 failure.
 771 .P
 772 Sets of Unicode characters are defined as belonging to certain scripts. A
 773 character from one of these sets can be matched using a script name. For
 774 example:
 775 .sp
 776   \ep{Greek}
 777   \eP{Han}
 778 .sp
 779 Those that are not part of an identified script are lumped together as
 780 "Common". The current list of scripts is:
 781 .P
 782 Adlam,
 783 Ahom,
 784 Anatolian_Hieroglyphs,
 785 Arabic,
 786 Armenian,
 787 Avestan,
 788 Balinese,
 789 Bamum,
 790 Bassa_Vah,
 791 Batak,
 792 Bengali,
 793 Bhaiksuki,
 794 Bopomofo,
 795 Brahmi,
 796 Braille,
 797 Buginese,
 798 Buhid,
 799 Canadian_Aboriginal,
 800 Carian,
 801 Caucasian_Albanian,
 802 Chakma,
 803 Cham,
 804 Cherokee,
 805 Common,
 806 Coptic,
 807 Cuneiform,
 808 Cypriot,
 809 Cyrillic,
 810 Deseret,
 811 Devanagari,
 812 Dogra,
 813 Duployan,
 814 Egyptian_Hieroglyphs,
 815 Elbasan,
 816 Ethiopic,
 817 Georgian,
 818 Glagolitic,
 819 Gothic,
 820 Grantha,
 821 Greek,
 822 Gujarati,
 823 Gunjala_Gondi,
 824 Gurmukhi,
 825 Han,
 826 Hangul,
 827 Hanifi_Rohingya,
 828 Hanunoo,
 829 Hatran,
 830 Hebrew,
 831 Hiragana,
 832 Imperial_Aramaic,
 833 Inherited,
 834 Inscriptional_Pahlavi,
 835 Inscriptional_Parthian,
 836 Javanese,
 837 Kaithi,
 838 Kannada,
 839 Katakana,
 840 Kayah_Li,
 841 Kharoshthi,
 842 Khmer,
 843 Khojki,
 844 Khudawadi,
 845 Lao,
 846 Latin,
 847 Lepcha,
 848 Limbu,
 849 Linear_A,
 850 Linear_B,
 851 Lisu,
 852 Lycian,
 853 Lydian,
 854 Mahajani,
 855 Makasar,
 856 Malayalam,
 857 Mandaic,
 858 Manichaean,
 859 Marchen,
 860 Masaram_Gondi,
 861 Medefaidrin,
 862 Meetei_Mayek,
 863 Mende_Kikakui,
 864 Meroitic_Cursive,
 865 Meroitic_Hieroglyphs,
 866 Miao,
 867 Modi,
 868 Mongolian,
 869 Mro,
 870 Multani,
 871 Myanmar,
 872 Nabataean,
 873 New_Tai_Lue,
 874 Newa,
 875 Nko,
 876 Nushu,
 877 Ogham,
 878 Ol_Chiki,
 879 Old_Hungarian,
 880 Old_Italic,
 881 Old_North_Arabian,
 882 Old_Permic,
 883 Old_Persian,
 884 Old_Sogdian,
 885 Old_South_Arabian,
 886 Old_Turkic,
 887 Oriya,
 888 Osage,
 889 Osmanya,
 890 Pahawh_Hmong,
 891 Palmyrene,
 892 Pau_Cin_Hau,
 893 Phags_Pa,
 894 Phoenician,
 895 Psalter_Pahlavi,
 896 Rejang,
 897 Runic,
 898 Samaritan,
 899 Saurashtra,
 900 Sharada,
 901 Shavian,
 902 Siddham,
 903 SignWriting,
 904 Sinhala,
 905 Sogdian,
 906 Sora_Sompeng,
 907 Soyombo,
 908 Sundanese,
 909 Syloti_Nagri,
 910 Syriac,
 911 Tagalog,
 912 Tagbanwa,
 913 Tai_Le,
 914 Tai_Tham,
 915 Tai_Viet,
 916 Takri,
 917 Tamil,
 918 Tangut,
 919 Telugu,
 920 Thaana,
 921 Thai,
 922 Tibetan,
 923 Tifinagh,
 924 Tirhuta,
 925 Ugaritic,
 926 Vai,
 927 Warang_Citi,
 928 Yi,
 929 Zanabazar_Square.
 930 .P
 931 Each character has exactly one Unicode general category property, specified by
 932 a two-letter abbreviation. For compatibility with Perl, negation can be
 933 specified by including a circumflex between the opening brace and the property
 934 name. For example, \ep{^Lu} is the same as \eP{Lu}.
 935 .P
 936 If only one letter is specified with \ep or \eP, it includes all the general
 937 category properties that start with that letter. In this case, in the absence
 938 of negation, the curly brackets in the escape sequence are optional; these two
 939 examples have the same effect:
 940 .sp
 941   \ep{L}
 942   \epL
 943 .sp
 944 The following general category property codes are supported:
 945 .sp
 946   C     Other
 947   Cc    Control
 948   Cf    Format
 949   Cn    Unassigned
 950   Co    Private use
 951   Cs    Surrogate
 952 .sp
 953   L     Letter
 954   Ll    Lower case letter
 955   Lm    Modifier letter
 956   Lo    Other letter
 957   Lt    Title case letter
 958   Lu    Upper case letter
 959 .sp
 960   M     Mark
 961   Mc    Spacing mark
 962   Me    Enclosing mark
 963   Mn    Non-spacing mark
 964 .sp
 965   N     Number
 966   Nd    Decimal number
 967   Nl    Letter number
 968   No    Other number
 969 .sp
 970   P     Punctuation
 971   Pc    Connector punctuation
 972   Pd    Dash punctuation
 973   Pe    Close punctuation
 974   Pf    Final punctuation
 975   Pi    Initial punctuation
 976   Po    Other punctuation
 977   Ps    Open punctuation
 978 .sp
 979   S     Symbol
 980   Sc    Currency symbol
 981   Sk    Modifier symbol
 982   Sm    Mathematical symbol
 983   So    Other symbol
 984 .sp
 985   Z     Separator
 986   Zl    Line separator
 987   Zp    Paragraph separator
 988   Zs    Space separator
 989 .sp
 990 The special property L& is also supported: it matches a character that has
 991 the Lu, Ll, or Lt property, in other words, a letter that is not classified as
 992 a modifier or "other".
 993 .P
 994 The Cs (Surrogate) property applies only to characters in the range U+D800 to
 995 U+DFFF. Such characters are not valid in Unicode strings and so
 996 cannot be tested by PCRE2, unless UTF validity checking has been turned off
 997 (see the discussion of PCRE2_NO_UTF_CHECK in the
 998 .\" HREF
 999 \fBpcre2api\fP
1000 .\"
1001 page). Perl does not support the Cs property.
1002 .P
1003 The long synonyms for property names that Perl supports (such as \ep{Letter})
1004 are not supported by PCRE2, nor is it permitted to prefix any of these
1005 properties with "Is".
1006 .P
1007 No character that is in the Unicode table has the Cn (unassigned) property.
1008 Instead, this property is assumed for any code point that is not in the
1009 Unicode table.
1010 .P
1011 Specifying caseless matching does not affect these escape sequences. For
1012 example, \ep{Lu} always matches only upper case letters. This is different from
1013 the behaviour of current versions of Perl.
1014 .P
1015 Matching characters by Unicode property is not fast, because PCRE2 has to do a
1016 multistage table lookup in order to find a character's property. That is why
1017 the traditional escape sequences such as \ed and \ew do not use Unicode
1018 properties in PCRE2 by default, though you can make them do so by setting the
1019 PCRE2_UCP option or by starting the pattern with (*UCP).
1020 .
1021 .
1022 .SS Extended grapheme clusters
1023 .rs
1024 .sp
1025 The \eX escape matches any number of Unicode characters that form an "extended
1026 grapheme cluster", and treats the sequence as an atomic group
1027 .\" HTML <a href="#atomicgroup">
1028 .\" </a>
1029 (see below).
1030 .\"
1031 Unicode supports various kinds of composite character by giving each character
1032 a grapheme breaking property, and having rules that use these properties to
1033 define the boundaries of extended grapheme clusters. The rules are defined in
1034 Unicode Standard Annex 29, "Unicode Text Segmentation". Unicode 11.0.0
1035 abandoned the use of some previous properties that had been used for emojis.
1036 Instead it introduced various emoji-specific properties. PCRE2 uses only the
1037 Extended Pictographic property.
1038 .P
1039 \eX always matches at least one character. Then it decides whether to add
1040 additional characters according to the following rules for ending a cluster:
1041 .P
1042 1. End at the end of the subject string.
1043 .P
1044 2. Do not end between CR and LF; otherwise end after any control character.
1045 .P
1046 3. Do not break Hangul (a Korean script) syllable sequences. Hangul characters
1047 are of five types: L, V, T, LV, and LVT. An L character may be followed by an
1048 L, V, LV, or LVT character; an LV or V character may be followed by a V or T
1049 character; an LVT or T character may be followed only by a T character.
1050 .P
1051 4. Do not end before extending characters or spacing marks or the "zero-width
1052 joiner" character. Characters with the "mark" property always have the
1053 "extend" grapheme breaking property.
1054 .P
1055 5. Do not end after prepend characters.
1056 .P
1057 6. Do not break within emoji modifier sequences or emoji zwj sequences. That
1058 is, do not break between characters with the Extended_Pictographic property.
1059 Extend and ZWJ characters are allowed between the characters.
1060 .P
1061 7. Do not break within emoji flag sequences. That is, do not break between
1062 regional indicator (RI) characters if there is an odd number of RI characters
1063 before the break point.
1064 .P
1065 8. Otherwise, end the cluster.
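.P
Rules 3 and 7 are simple enough to be expressed directly. The following C
sketch is illustrative only: the type and function names are hypothetical, and
it assumes the caller has already looked up each character's Hangul type or
counted the preceding RI characters. It is not how PCRE2 itself is implemented.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { HANGUL_L, HANGUL_V, HANGUL_T, HANGUL_LV, HANGUL_LVT } hangul_type;

/* Rule 3: true if the cluster must not end between prev and next. */
bool hangul_no_break(hangul_type prev, hangul_type next)
{
    switch (prev) {
    case HANGUL_L:
        return next == HANGUL_L || next == HANGUL_V ||
               next == HANGUL_LV || next == HANGUL_LVT;
    case HANGUL_LV: case HANGUL_V:
        return next == HANGUL_V || next == HANGUL_T;
    case HANGUL_LVT: case HANGUL_T:
        return next == HANGUL_T;
    }
    return false;
}

/* Rule 7: between two RI characters, do not break when an odd number
   of RI characters precedes the candidate break point. */
bool ri_no_break(size_t ri_before)
{
    return ri_before % 2 == 1;
}
```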
1066 .
1067 .
1068 .\" HTML <a name="extraprops"></a>
1069 .SS PCRE2's additional properties
1070 .rs
1071 .sp
1072 As well as the standard Unicode properties described above, PCRE2 supports four
1073 more that make it possible to convert traditional escape sequences such as \ew
1074 and \es to use Unicode properties. PCRE2 uses these non-standard, non-Perl
1075 properties internally when PCRE2_UCP is set. However, they may also be used
1076 explicitly. These properties are:
1077 .sp
1078   Xan   Any alphanumeric character
1079   Xps   Any POSIX space character
1080   Xsp   Any Perl space character
1081   Xwd   Any Perl "word" character
1082 .sp
1083 Xan matches characters that have either the L (letter) or the N (number)
1084 property. Xps matches the characters tab, linefeed, vertical tab, form feed, or
1085 carriage return, and any other character that has the Z (separator) property.
1086 Xsp is the same as Xps; in PCRE1 it used to exclude vertical tab, for Perl
1087 compatibility, but Perl changed. Xwd matches the same characters as Xan, plus
1088 underscore.
1089 .P
1090 There is another non-standard property, Xuc, which matches any character that
1091 can be represented by a Universal Character Name in C++ and other programming
1092 languages. These are the characters $, @, ` (grave accent), and all characters
1093 with Unicode code points greater than or equal to U+00A0, except for the
1094 surrogates U+D800 to U+DFFF. Note that most base (ASCII) characters are
1095 excluded. (Universal Character Names are of the form \euHHHH or \eUHHHHHHHH
1096 where H is a hexadecimal digit. Note that the Xuc property does not match these
1097 sequences but the characters that they represent.)
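.P
The Xuc definition above amounts to a short range test. A minimal sketch in C,
with a hypothetical function name and no upper bound on the code point (the
description above does not state one):

```c
#include <stdbool.h>
#include <stdint.h>

/* True if cp satisfies the Xuc description given above. */
bool is_xuc(uint32_t cp)
{
    /* The three ASCII exceptions: dollar, commercial at, grave accent. */
    if (cp == 0x24 || cp == 0x40 || cp == 0x60) return true;
    /* Surrogates are excluded. */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;
    return cp >= 0xA0;
}
```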
1098 .
1099 .
1100 .\" HTML <a name="resetmatchstart"></a>
1101 .SS "Resetting the match start"
1102 .rs
1103 .sp
1104 In normal use, the escape sequence \eK causes any previously matched characters
1105 not to be included in the final matched sequence that is returned. For example,
1106 the pattern:
1107 .sp
1108   foo\eKbar
1109 .sp
1110 matches "foobar", but reports that it has matched "bar". \eK does not interact
1111 with anchoring in any way. The pattern:
1112 .sp
1113   ^foo\eKbar
1114 .sp
1115 matches only when the subject begins with "foobar" (in single line mode),
1116 though it again reports the matched string as "bar". This feature is similar to
1117 a lookbehind assertion
1118 .\" HTML <a href="#lookbehind">
1119 .\" </a>
1120 (described below).
1121 .\"
1122 However, in this case, the part of the subject before the real match does not
1123 have to be of fixed length, as lookbehind assertions do. The use of \eK does
1124 not interfere with the setting of
1125 .\" HTML <a href="#subpattern">
1126 .\" </a>
1127 captured substrings.
1128 .\"
1129 For example, when the pattern
1130 .sp
1131   (foo)\eKbar
1132 .sp
1133 matches "foobar", the first substring is still set to "foo".
1134 .P
1135 Perl documents that the use of \eK within assertions is "not well defined". In
1136 PCRE2, \eK is acted upon when it occurs inside positive assertions, but is
1137 ignored in negative assertions. Note that when a pattern such as (?=ab\eK)
1138 matches, the reported start of the match can be greater than the end of the
1139 match. Using \eK in a lookbehind assertion at the start of a pattern can also
1140 lead to odd effects. For example, consider this pattern:
1141 .sp
1142   (?<=\eKfoo)bar
1143 .sp
1144 If the subject is "foobar", a call to \fBpcre2_match()\fP with a starting
1145 offset of 3 succeeds and reports the matching string as "foobar", that is, the
1146 start of the reported match is earlier than where the match started.
1147 .
1148 .
1149 .\" HTML <a name="smallassertions"></a>
1150 .SS "Simple assertions"
1151 .rs
1152 .sp
1153 The final use of backslash is for certain simple assertions. An assertion
1154 specifies a condition that has to be met at a particular point in a match,
1155 without consuming any characters from the subject string. The use of
1156 subpatterns for more complicated assertions is described
1157 .\" HTML <a href="#bigassertions">
1158 .\" </a>
1159 below.
1160 .\"
1161 The backslashed assertions are:
1162 .sp
1163   \eb     matches at a word boundary
1164   \eB     matches when not at a word boundary
1165   \eA     matches at the start of the subject
1166   \eZ     matches at the end of the subject
1167           also matches before a newline at the end of the subject
1168   \ez     matches only at the end of the subject
1169   \eG     matches at the first matching position in the subject
1170 .sp
1171 Inside a character class, \eb has a different meaning; it matches the backspace
1172 character. If any other of these assertions appears in a character class, an
1173 "invalid escape sequence" error is generated.
1174 .P
1175 A word boundary is a position in the subject string where the current character
1176 and the previous character do not both match \ew or \eW (i.e. one matches
1177 \ew and the other matches \eW), or the start or end of the string if the
1178 first or last character matches \ew, respectively. In a UTF mode, the meanings
1179 of \ew and \eW can be changed by setting the PCRE2_UCP option. When this is
1180 done, it also affects \eb and \eB. Neither PCRE2 nor Perl has a separate "start
1181 of word" or "end of word" metasequence. However, whatever follows \eb normally
1182 determines which it is. For example, the fragment \eba matches "a" at the start
1183 of a word.
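.P
The word boundary definition can be modelled by a small C function. This
sketch assumes the default ASCII meaning of \ew (letters, digits, and
underscore) with no locale tables and without PCRE2_UCP; the function names
are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Default ASCII word character test (an assumption of this sketch). */
static bool is_word(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') ||
           (c >= 'a' && c <= 'z') || c == '_';
}

/* True if position pos (0 to len inclusive) is a word boundary. Positions
   outside the string count as non-word characters. */
bool at_word_boundary(const char *s, size_t len, size_t pos)
{
    bool prev = pos > 0 && is_word((unsigned char)s[pos - 1]);
    bool cur = pos < len && is_word((unsigned char)s[pos]);
    return prev != cur;
}
```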
1184 .P
1185 The \eA, \eZ, and \ez assertions differ from the traditional circumflex and
1186 dollar (described in the next section) in that they only ever match at the very
1187 start and end of the subject string, whatever options are set. Thus, they are
1188 independent of multiline mode. These three assertions are not affected by the
1189 PCRE2_NOTBOL or PCRE2_NOTEOL options, which affect only the behaviour of the
1190 circumflex and dollar metacharacters. However, if the \fIstartoffset\fP
1191 argument of \fBpcre2_match()\fP is non-zero, indicating that matching is to
1192 start at a point other than the beginning of the subject, \eA can never match.
1193 The difference between \eZ and \ez is that \eZ matches before a newline at the
1194 end of the string as well as at the very end, whereas \ez matches only at the
1195 end.
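.P
The difference between \eZ and \ez can be shown with two position tests. This
is a sketch only: it assumes that the newline convention is a single LF
character, and the function names are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Lower case z: true only at the very end of the subject. */
bool matches_small_z(size_t len, size_t pos)
{
    return pos == len;
}

/* Upper case Z: true at the very end, or just before a final LF. */
bool matches_big_z(const char *s, size_t len, size_t pos)
{
    return pos == len || (pos + 1 == len && s[pos] == 0x0A);
}
```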
1196 .P
1197 The \eG assertion is true only when the current matching position is at the
1198 start point of the matching process, as specified by the \fIstartoffset\fP
1199 argument of \fBpcre2_match()\fP. It differs from \eA when the value of
1200 \fIstartoffset\fP is non-zero. By calling \fBpcre2_match()\fP multiple times
1201 with appropriate arguments, you can mimic Perl's /g option, and it is in this
1202 kind of implementation that \eG can be useful.
1203 .P
1204 Note, however, that PCRE2's implementation of \eG, being true at the starting
1205 character of the matching process, is subtly different from Perl's, which
1206 defines it as true at the end of the previous match. In Perl, these can be
1207 different when the previously matched string was empty. Because PCRE2 does just
1208 one match at a time, it cannot reproduce this behaviour.
1209 .P
1210 If all the alternatives of a pattern begin with \eG, the expression is anchored
1211 to the starting match position, and the "anchored" flag is set in the compiled
1212 regular expression.
1213 .
1214 .
1215 .SH "CIRCUMFLEX AND DOLLAR"
1216 .rs
1217 .sp
1218 The circumflex and dollar metacharacters are zero-width assertions. That is,
1219 they test for a particular condition being true without consuming any
1220 characters from the subject string. These two metacharacters are concerned with
1221 matching the starts and ends of lines. If the newline convention is set so that
1222 only the two-character sequence CRLF is recognized as a newline, isolated CR
1223 and LF characters are treated as ordinary data characters, and are not
1224 recognized as newlines.
1225 .P
1226 Outside a character class, in the default matching mode, the circumflex
1227 character is an assertion that is true only if the current matching point is at
1228 the start of the subject string. If the \fIstartoffset\fP argument of
1229 \fBpcre2_match()\fP is non-zero, or if PCRE2_NOTBOL is set, circumflex can
1230 never match if the PCRE2_MULTILINE option is unset. Inside a character class,
1231 circumflex has an entirely different meaning
1232 .\" HTML <a href="#characterclass">
1233 .\" </a>
1234 (see below).
1235 .\"
1236 .P
1237 Circumflex need not be the first character of the pattern if a number of
1238 alternatives are involved, but it should be the first thing in each alternative
1239 in which it appears if the pattern is ever to match that branch. If all
1240 possible alternatives start with a circumflex, that is, if the pattern is
1241 constrained to match only at the start of the subject, it is said to be an
1242 "anchored" pattern. (There are also other constructs that can cause a pattern
1243 to be anchored.)
1244 .P
1245 The dollar character is an assertion that is true only if the current matching
1246 point is at the end of the subject string, or immediately before a newline at
1247 the end of the string (by default), unless PCRE2_NOTEOL is set. Note, however,
1248 that it does not actually match the newline. Dollar need not be the last
1249 character of the pattern if a number of alternatives are involved, but it
1250 should be the last item in any branch in which it appears. Dollar has no
1251 special meaning in a character class.
1252 .P
1253 The meaning of dollar can be changed so that it matches only at the very end of
1254 the string, by setting the PCRE2_DOLLAR_ENDONLY option at compile time. This
1255 does not affect the \eZ assertion.
1256 .P
1257 The meanings of the circumflex and dollar metacharacters are changed if the
1258 PCRE2_MULTILINE option is set. When this is the case, a dollar character
1259 matches before any newlines in the string, as well as at the very end, and a
1260 circumflex matches immediately after internal newlines as well as at the start
1261 of the subject string. It does not match after a newline that ends the string,
1262 for compatibility with Perl. However, this can be changed by setting the
1263 PCRE2_ALT_CIRCUMFLEX option.
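.P
A minimal C model of the multiline circumflex rule just described, assuming an
LF-only newline convention and that PCRE2_ALT_CIRCUMFLEX and PCRE2_NOTBOL are
not set. The function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if a multiline circumflex would match at position pos. */
bool circumflex_multiline(const char *s, size_t len, size_t pos)
{
    if (pos == 0) return true;            /* start of the subject */
    /* After an internal newline, but not after one that ends the string. */
    return pos < len && s[pos - 1] == 0x0A;
}
```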
1264 .P
1265 For example, the pattern /^abc$/ matches the subject string "def\enabc" (where
1266 \en represents a newline) in multiline mode, but not otherwise. Consequently,
1267 patterns that are anchored in single line mode because all branches start with
1268 ^ are not anchored in multiline mode, and a match for circumflex is possible
1269 when the \fIstartoffset\fP argument of \fBpcre2_match()\fP is non-zero. The
1270 PCRE2_DOLLAR_ENDONLY option is ignored if PCRE2_MULTILINE is set.
1271 .P
1272 When the newline convention (see
1273 .\" HTML <a href="#newlines">
1274 .\" </a>
1275 "Newline conventions"
1276 .\"
1277 below) recognizes the two-character sequence CRLF as a newline, this is
1278 preferred, even if the single characters CR and LF are also recognized as
1279 newlines. For example, if the newline convention is "any", a multiline mode
1280 circumflex matches before "xyz" in the string "abc\er\enxyz" rather than after
1281 CR, even though CR on its own is a valid newline. (It also matches at the very
1282 start of the string, of course.)
1283 .P
1284 Note that the sequences \eA, \eZ, and \ez can be used to match the start and
1285 end of the subject in both modes, and if all branches of a pattern start with
1286 \eA it is always anchored, whether or not PCRE2_MULTILINE is set.
1287 .
1288 .
1289 .\" HTML <a name="fullstopdot"></a>
1290 .SH "FULL STOP (PERIOD, DOT) AND \eN"
1291 .rs
1292 .sp
1293 Outside a character class, a dot in the pattern matches any one character in
1294 the subject string except (by default) a character that signifies the end of a
1295 line.
1296 .P
1297 When a line ending is defined as a single character, dot never matches that
1298 character; when the two-character sequence CRLF is used, dot does not match CR
1299 if it is immediately followed by LF, but otherwise it matches all characters
1300 (including isolated CRs and LFs). When any Unicode line endings are being
1301 recognized, dot does not match CR or LF or any of the other line ending
1302 characters.
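.P
For the CRLF case, the rule can be sketched as a single position test in C.
This models only the two-character CRLF convention without PCRE2_DOTALL; the
function name is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* True if dot would match the character at s[pos] when only CRLF is
   recognized as a newline. */
bool dot_matches_crlf_mode(const char *s, size_t len, size_t pos)
{
    if (pos >= len) return false;         /* nothing left to consume */
    /* Only a CR that is immediately followed by LF is refused. */
    return !(s[pos] == 0x0D && pos + 1 < len && s[pos + 1] == 0x0A);
}
```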
1303 .P
1304 The behaviour of dot with regard to newlines can be changed. If the
1305 PCRE2_DOTALL option is set, a dot matches any one character, without exception.
1306 If the two-character sequence CRLF is present in the subject string, it takes
1307 two dots to match it.
1308 .P
1309 The handling of dot is entirely independent of the handling of circumflex and
1310 dollar, the only relationship being that they both involve newlines. Dot has no
1311 special meaning in a character class.
1312 .P
1313 The escape sequence \eN when not followed by an opening brace behaves like a
1314 dot, except that it is not affected by the PCRE2_DOTALL option. In other words,
1315 it matches any character except one that signifies the end of a line.
1316 .P
1317 When \eN is followed by an opening brace it has a different meaning. See the
1318 section entitled
1319 .\" HTML <a href="#digitsafterbackslash">
1320 .\" </a>
1321 "Non-printing characters"
1322 .\"
1323 above for details. Perl also uses \eN{name} to specify characters by Unicode
1324 name; PCRE2 does not support this.
1325 .
1326 .
1327 .SH "MATCHING A SINGLE CODE UNIT"
1328 .rs
1329 .sp
1330 Outside a character class, the escape sequence \eC matches any one code unit,
1331 whether or not a UTF mode is set. In the 8-bit library, one code unit is one
1332 byte; in the 16-bit library it is a 16-bit unit; in the 32-bit library it is a
1333 32-bit unit. Unlike a dot, \eC always matches line-ending characters. The
1334 feature is provided in Perl in order to match individual bytes in UTF-8 mode,
1335 but it is unclear how it can usefully be used.
1336 .P
1337 Because \eC breaks up characters into individual code units, matching one unit
1338 with \eC in UTF-8 or UTF-16 mode means that the rest of the string may start
1339 with a malformed UTF character. This has undefined results, because PCRE2
1340 assumes that it is matching character by character in a valid UTF string (by
1341 default it checks the subject string's validity at the start of processing
1342 unless the PCRE2_NO_UTF_CHECK option is used).
1343 .P
1344 An application can lock out the use of \eC by setting the
1345 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also possible to
1346 build PCRE2 with the use of \eC permanently disabled.
1347 .P
1348 PCRE2 does not allow \eC to appear in lookbehind assertions
1349 .\" HTML <a href="#lookbehind">
1350 .\" </a>
1351 (described below)
1352 .\"
1353 in UTF-8 or UTF-16 modes, because this would make it impossible to calculate
1354 the length of the lookbehind. Neither the alternative matching function
1355 \fBpcre2_dfa_match()\fP nor the JIT optimizer supports \eC in these UTF modes.
1356 The former gives a match-time error; the latter fails to optimize and so the
1357 match is always run using the interpreter.
1358 .P
1359 In the 32-bit library, however, \eC is always supported (when not explicitly
1360 locked out) because it always matches a single code unit, whether or not UTF-32
1361 is specified.
1362 .P
1363 In general, the \eC escape sequence is best avoided. However, one way of using
1364 it that avoids the problem of malformed UTF-8 or UTF-16 characters is to use a
1365 lookahead to check the length of the next character, as in this pattern, which
1366 could be used with a UTF-8 string (ignore white space and line breaks):
1367 .sp
1368   (?| (?=[\ex00-\ex7f])(\eC) |
1369       (?=[\ex80-\ex{7ff}])(\eC)(\eC) |
1370       (?=[\ex{800}-\ex{ffff}])(\eC)(\eC)(\eC) |
1371       (?=[\ex{10000}-\ex{1fffff}])(\eC)(\eC)(\eC)(\eC))
1372 .sp
1373 In this example, a group that starts with (?| resets the capturing parentheses
1374 numbers in each alternative (see
1375 .\" HTML <a href="#dupsubpatternnumber">
1376 .\" </a>
1377 "Duplicate Subpattern Numbers"
1378 .\"
1379 below). The assertions at the start of each branch check the next UTF-8
1380 character for values whose encoding uses 1, 2, 3, or 4 bytes, respectively. The
1381 character's individual bytes are then captured by the appropriate number of
1382 \eC groups.
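.P
The four code point ranges tested by the lookaheads correspond to the
following C function, which returns the number of code units that the matching
branch captures. It is an illustration with a hypothetical name, not PCRE2
code.

```c
#include <stdint.h>

/* Number of UTF-8 code units for cp, following the four ranges used in
   the pattern above; 0 if cp is beyond the last range. */
int utf8_units(uint32_t cp)
{
    if (cp <= 0x7F) return 1;
    if (cp <= 0x7FF) return 2;
    if (cp <= 0xFFFF) return 3;
    if (cp <= 0x1FFFFF) return 4;
    return 0;
}
```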
1383 .
1384 .
1385 .\" HTML <a name="characterclass"></a>
1386 .SH "SQUARE BRACKETS AND CHARACTER CLASSES"
1387 .rs
1388 .sp
1389 An opening square bracket introduces a character class, terminated by a closing
1390 square bracket. A closing square bracket on its own is not special by default.
1391 If a closing square bracket is required as a member of the class, it should be
1392 the first data character in the class (after an initial circumflex, if present)
1393 or escaped with a backslash. This means that, by default, an empty class cannot
1394 be defined. However, if the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing
1395 square bracket at the start does end the (empty) class.
1396 .P
1397 A character class matches a single character in the subject. A matched
1398 character must be in the set of characters defined by the class, unless the
1399 first character in the class definition is a circumflex, in which case the
1400 subject character must not be in the set defined by the class. If a circumflex
1401 is actually required as a member of the class, ensure it is not the first
1402 character, or escape it with a backslash.
1403 .P
1404 For example, the character class [aeiou] matches any lower case vowel, while
1405 [^aeiou] matches any character that is not a lower case vowel. Note that a
1406 circumflex is just a convenient notation for specifying the characters that
1407 are in the class by enumerating those that are not. A class that starts with a
1408 circumflex is not an assertion; it still consumes a character from the subject
1409 string, and therefore it fails if the current pointer is at the end of the
1410 string.
1411 .P
1412 Characters in a class may be specified by their code points using \eo, \ex, or
1413 \eN{U+hh..} in the usual way. When caseless matching is set, any letters in a
1414 class represent both their upper case and lower case versions, so for example,
1415 a caseless [aeiou] matches "A" as well as "a", and a caseless [^aeiou] does not
1416 match "A", whereas a caseful version would.
1417 .P
1418 Characters that might indicate line breaks are never treated in any special way
1419 when matching character classes, whatever line-ending sequence is in use, and
1420 whatever setting of the PCRE2_DOTALL and PCRE2_MULTILINE options is used. A
1421 class such as [^a] always matches one of these characters.
1422 .P
1423 The generic character type escape sequences \ed, \eD, \eh, \eH, \ep, \eP, \es,
1424 \eS, \ev, \eV, \ew, and \eW may appear in a character class, and add the
1425 characters that they match to the class. For example, [\edABCDEF] matches any
1426 hexadecimal digit. In UTF modes, the PCRE2_UCP option affects the meanings of
1427 \ed, \es, \ew and their upper case partners, just as it does when they appear
1428 outside a character class, as described in the section entitled
1429 .\" HTML <a href="#genericchartypes">
1430 .\" </a>
1431 "Generic character types"
1432 .\"
1433 above. The escape sequence \eb has a different meaning inside a character
1434 class; it matches the backspace character. The sequences \eB, \eR, and \eX are
1435 not special inside a character class. Like any other unrecognized escape
1436 sequences, they cause an error. The same is true for \eN when not followed by
1437 an opening brace.
1438 .P
1439 The minus (hyphen) character can be used to specify a range of characters in a
1440 character class. For example, [d-m] matches any letter between d and m,
1441 inclusive. If a minus character is required in a class, it must be escaped with
1442 a backslash or appear in a position where it cannot be interpreted as
1443 indicating a range, typically as the first or last character in the class,
1444 or immediately after a range. For example, [b-d-z] matches letters in the range
1445 b to d, a hyphen character, or z.
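.P
The [b-d-z] example can be checked with a small sketch. Python's \fBre\fP
module is used as a stand-in for PCRE2 (an assumption; it happens to parse this
class in the same way), and the test string is arbitrary.

```python
import re

# A hyphen immediately after a range cannot start another range,
# so it is taken as a literal character.
pattern = re.compile(r"[b-d-z]")

# Keep only those characters of the test string that the class accepts.
matched = [ch for ch in "abcd-eyz" if pattern.fullmatch(ch)]
```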
1446 .P
1447 Perl treats a hyphen as a literal if it appears before or after a POSIX class
1448 (see below) or before or after a character type escape such as \ed or \eH.
1449 However, unless the hyphen is the last character in the class, Perl outputs a
1450 warning in its warning mode, as this is most likely a user error. As PCRE2 has
1451 no facility for warning, an error is given in these cases.
1452 .P
1453 It is not possible to have the literal character "]" as the end character of a
1454 range. A pattern such as [W-]46] is interpreted as a class of two characters
1455 ("W" and "-") followed by a literal string "46]", so it would match "W46]" or
1456 "-46]". However, if the "]" is escaped with a backslash it is interpreted as
1457 the end of range, so [W-\e]46] is interpreted as a class containing a range
1458 followed by two other characters. The octal or hexadecimal representation of
1459 "]" can also be used to end a range.
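.P
The two readings can be compared in a minimal sketch. Python's \fBre\fP module
is assumed here as a stand-in for PCRE2; it treats these two classes in the
same way, but it is a different engine.

```python
import re

# Unescaped "]" ends the class: class {W, -} followed by the literal "46]".
class_then_literal = re.compile(r"[W-]46]")

# Escaped "]" ends a range: W..] (0x57-0x5d) plus the characters 4 and 6.
range_plus_two = re.compile(r"[W-\]46]")

literal_hits = [s for s in ("W46]", "-46]", "X46]")
                if class_then_literal.fullmatch(s)]
range_hits = [ch for ch in "WX[]46-" if range_plus_two.fullmatch(ch)]
```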
1460 .P
1461 Ranges normally include all code points between the start and end characters,
1462 inclusive. They can also be used for code points specified numerically, for
1463 example [\e000-\e037]. Ranges can include any characters that are valid for the
1464 current mode. In any UTF mode, the so-called "surrogate" characters (those
1465 whose code points lie between 0xd800 and 0xdfff inclusive) may not be specified
1466 explicitly by default (the PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables
1467 this check). However, ranges such as [\ex{d7ff}-\ex{e000}], which include the
1468 surrogates, are always permitted.
1469 .P
1470 There is a special case in EBCDIC environments for ranges whose end points are
1471 both specified as literal letters in the same case. For compatibility with
1472 Perl, EBCDIC code points within the range that are not letters are omitted. For
1473 example, [h-k] matches only four characters, even though the codes for h and k
1474 are 0x88 and 0x92, a range of 11 code points. However, if the range is
1475 specified numerically, for example, [\ex88-\ex92] or [h-\ex92], all code points
1476 are included.
1477 .P
1478 If a range that includes letters is used when caseless matching is set, it
1479 matches the letters in either case. For example, [W-c] is equivalent to
1480 [][\e\e^_`wxyzabc], matched caselessly, and in a non-UTF mode, if character
1481 tables for a French locale are in use, [\exc8-\excb] matches accented E
1482 characters in both cases.
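.P
The [W-c] case can be sketched as follows, with Python's \fBre\fP module
standing in for PCRE2 (an assumption of this example). The locale-dependent
part of the paragraph above is not covered.

```python
import re

# W..c spans W X Y Z [ \ ] ^ _ ` a b c; caselessly it also takes in
# w x y z and A B C.
caseless_range = re.compile(r"[W-c]", re.IGNORECASE)

accepted = [ch for ch in "wZ_aCdV" if caseless_range.fullmatch(ch)]
```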
1483 .P
1484 A circumflex can conveniently be used with the upper case character types to
1485 specify a more restricted set of characters than the matching lower case type.
1486 For example, the class [^\eW_] matches any letter or digit, but not underscore,
1487 whereas [\ew] includes underscore. A positive character class should be read as
1488 "something OR something OR ..." and a negative class as "NOT something AND NOT
1489 something AND NOT ...".
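.P
A minimal sketch of the "NOT ... AND NOT ..." reading, using Python's \fBre\fP
module as an assumed stand-in for PCRE2:

```python
import re

# [^\W_] reads as "NOT a non-word character AND NOT an underscore",
# that is, a letter or a digit.
letter_or_digit = re.compile(r"[^\W_]")
word_char = re.compile(r"[\w]")

strict = [ch for ch in "a5_-" if letter_or_digit.fullmatch(ch)]
loose = [ch for ch in "a5_-" if word_char.fullmatch(ch)]
```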
1490 .P
1491 The only metacharacters that are recognized in character classes are backslash,
1492 hyphen (only where it can be interpreted as specifying a range), circumflex
1493 (only at the start), opening square bracket (only when it can be interpreted as
1494 introducing a POSIX class name, or for a special compatibility feature - see
1495 the next two sections), and the terminating closing square bracket. However,
1496 escaping other non-alphanumeric characters does no harm.
1497 .
1498 .
1499 .SH "POSIX CHARACTER CLASSES"
1500 .rs
1501 .sp
1502 Perl supports the POSIX notation for character classes. This uses names
1503 enclosed by [: and :] within the enclosing square brackets. PCRE2 also supports
1504 this notation. For example,
1505 .sp
1506   [01[:alpha:]%]
1507 .sp
1508 matches "0", "1", any alphabetic character, or "%". The supported class names
1509 are:
1510 .sp
1511   alnum    letters and digits
1512   alpha    letters
1513   ascii    character codes 0 - 127
1514   blank    space or tab only
1515   cntrl    control characters
1516   digit    decimal digits (same as \ed)
1517   graph    printing characters, excluding space
1518   lower    lower case letters
1519   print    printing characters, including space
1520   punct    printing characters, excluding letters and digits and space
1521   space    white space (the same as \es from PCRE1 8.34)
1522   upper    upper case letters
1523   word     "word" characters (same as \ew)
1524   xdigit   hexadecimal digits
1525 .sp
1526 The default "space" characters are HT (9), LF (10), VT (11), FF (12), CR (13),
1527 and space (32). If locale-specific matching is taking place, the list of space
1528 characters may be different; there may be fewer or more of them. "Space" and
1529 \es match the same set of characters.
1530 .P
1531 The name "word" is a Perl extension, and "blank" is a GNU extension from Perl
1532 5.8. Another Perl extension is negation, which is indicated by a ^ character
1533 after the colon. For example,
1534 .sp
1535   [12[:^digit:]]
1536 .sp
1537 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the POSIX
1538 syntax [.ch.] and [=ch=] where "ch" is a "collating element", but these are not
1539 supported, and an error is given if they are encountered.
1540 .P
1541 By default, characters with values greater than 127 do not match any of the
1542 POSIX character classes, although this may be different for characters in the
1543 range 128-255 when locale-specific matching is happening. However, if the
1544 PCRE2_UCP option is passed to \fBpcre2_compile()\fP, some of the classes are
1545 changed so that Unicode character properties are used. This is achieved by
1546 replacing certain POSIX classes with other sequences, as follows:
1547 .sp
1548   [:alnum:]  becomes  \ep{Xan}
1549   [:alpha:]  becomes  \ep{L}
1550   [:blank:]  becomes  \eh
1551   [:cntrl:]  becomes  \ep{Cc}
1552   [:digit:]  becomes  \ep{Nd}
1553   [:lower:]  becomes  \ep{Ll}
1554   [:space:]  becomes  \ep{Xps}
1555   [:upper:]  becomes  \ep{Lu}
1556   [:word:]   becomes  \ep{Xwd}
1557 .sp
1558 Negated versions, such as [:^alpha:], use \eP instead of \ep. Three other POSIX
1559 classes are handled specially in UCP mode:
1560 .TP 10
1561 [:graph:]
1562 This matches characters that have glyphs that mark the page when printed. In
1563 Unicode property terms, it matches all characters with the L, M, N, P, S, or Cf
1564 properties, except for:
1565 .sp
1566   U+061C           Arabic Letter Mark
1567   U+180E           Mongolian Vowel Separator
1568   U+2066 - U+2069  Various "isolate"s
1569 .sp
1570 .TP 10
1571 [:print:]
1572 This matches the same characters as [:graph:] plus space characters that are
1573 not controls, that is, characters with the Zs property.
1574 .TP 10
1575 [:punct:]
1576 This matches all characters that have the Unicode P (punctuation) property,
1577 plus those characters with code points less than 256 that have the S (Symbol)
1578 property.
1579 .P
1580 The other POSIX classes are unchanged, and match only characters with code
1581 points less than 256.
1582 .
1583 .
1584 .SH "COMPATIBILITY FEATURE FOR WORD BOUNDARIES"
1585 .rs
1586 .sp
1587 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the ugly
1588 syntax [[:<:]] and [[:>:]] is used for matching "start of word" and "end of
1589 word". PCRE2 treats these items as follows:
1590 .sp
1591   [[:<:]]  is converted to  \eb(?=\ew)
1592   [[:>:]]  is converted to  \eb(?<=\ew)
1593 .sp
1594 Only these exact character sequences are recognized. A sequence such as
1595 [a[:<:]b] provokes an error for an unrecognized POSIX class name. This support is
1596 not compatible with Perl. It is provided to help migrations from other
1597 environments, and is best not used in any new patterns. Note that \eb matches
1598 at the start and the end of a word (see
1599 .\" HTML <a href="#smallassertions">
1600 .\" </a>
1601 "Simple assertions"
1602 .\"
1603 above), and in a Perl-style pattern the preceding or following character
1604 normally shows which is wanted, without the need for the assertions that are
1605 used above in order to give exactly the POSIX behaviour.
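.P
The two converted forms can be tried directly. The sketch below assumes
Python's \fBre\fP module as a stand-in for PCRE2; Python does not recognize the
[[:<:]] and [[:>:]] syntax itself, so only the replacement patterns are used.

```python
import re

subject = "hi there"

# \b(?=\w): a word boundary with a word character after it (start of word).
starts = [m.start() for m in re.finditer(r"\b(?=\w)", subject)]

# \b(?<=\w): a word boundary with a word character before it (end of word).
ends = [m.start() for m in re.finditer(r"\b(?<=\w)", subject)]
```

Here the start-of-word offsets are 0 and 3, and the end-of-word offsets are 2
and 8.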
1606 .
1607 .
1608 .SH "VERTICAL BAR"
1609 .rs
1610 .sp
1611 Vertical bar characters are used to separate alternative patterns. For example,
1612 the pattern
1613 .sp
1614   gilbert|sullivan
1615 .sp
1616 matches either "gilbert" or "sullivan". Any number of alternatives may appear,
1617 and an empty alternative is permitted (matching the empty string). The matching
1618 process tries each alternative in turn, from left to right, and the first one
1619 that succeeds is used. If the alternatives are within a subpattern
1620 .\" HTML <a href="#subpattern">
1621 .\" </a>
1622 (defined below),
1623 .\"
1624 "succeeds" means matching the rest of the main pattern as well as the
1625 alternative in the subpattern.
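.P
The left-to-right rule, and the meaning of "succeeds" inside a subpattern, can
be sketched as follows. Python's \fBre\fP module is an assumed stand-in for
PCRE2 here, and the short patterns are illustrative.

```python
import re

# At top level the first alternative that matches is used, even though
# a later one would match more of the subject.
first_wins = re.match(r"a|ab", "ab").group()

# Inside a subpattern, "succeeds" includes the rest of the pattern:
# "a" alone cannot be followed by "c", so "ab" is chosen instead.
in_group = re.fullmatch(r"(a|ab)c", "abc").group(1)

# An empty alternative matches the empty string.
empty_alt = re.fullmatch(r"gilbert|sullivan|", "")
```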
1626 .
1627 .
1628 .SH "INTERNAL OPTION SETTING"
1629 .rs
1630 .sp
1631 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
1632 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options can be
1633 changed from within the pattern by a sequence of letters enclosed between "(?"
1634 and ")". These options are Perl-compatible, and are described in detail in the
1635 .\" HREF
1636 \fBpcre2api\fP
1637 .\"
1638 documentation. The option letters are:
1639 .sp
1640   i  for PCRE2_CASELESS
1641   m  for PCRE2_MULTILINE
1642   n  for PCRE2_NO_AUTO_CAPTURE
1643   s  for PCRE2_DOTALL
1644   x  for PCRE2_EXTENDED
1645   xx for PCRE2_EXTENDED_MORE
1646 .sp
1647 For example, (?im) sets caseless, multiline matching. It is also possible to
1648 unset these options by preceding the relevant letters with a hyphen, for
1649 example (?-im). The two "extended" options are not independent; unsetting either
1650 one cancels the effects of both of them.
1651 .P
1652 A combined setting and unsetting such as (?im-sx), which sets PCRE2_CASELESS
1653 and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and PCRE2_EXTENDED, is also
1654 permitted. Only one hyphen may appear in the options string. If a letter
1655 appears both before and after the hyphen, the option is unset. An empty options
1656 setting "(?)" is allowed. Needless to say, it has no effect.
1657 .P
1658 If the first character following (? is a circumflex, it causes all of the above
1659 options to be unset. Thus, (?^) is equivalent to (?-imnsx). Letters may follow
1660 the circumflex to cause some options to be re-instated, but a hyphen may not
1661 appear.
1662 .P
1663 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be changed in
1664 the same way as the Perl-compatible options by using the characters J and U
1665 respectively. However, these are not unset by (?^).
1666 .P
1667 When one of these option changes occurs at top level (that is, not inside
1668 subpattern parentheses), the change applies to the remainder of the pattern
1669 that follows. An option change within a subpattern (see below for a description
1670 of subpatterns) affects only that part of the subpattern that follows it, so
1671 .sp
1672   (a(?i)b)c
1673 .sp
1674 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is not used).
1675 By this means, options can be made to have different settings in different
1676 parts of the pattern. Any changes made in one alternative do carry on
1677 into subsequent branches within the same subpattern. For example,
1678 .sp
1679   (a(?i)b|c)
1680 .sp
1681 matches "ab", "aB", "c", and "C", even though when matching "C" the first
1682 branch is abandoned before the option setting. This is because the effects of
1683 option settings happen at compile time. There would be some very weird
1684 behaviour otherwise.
1685 .P
1686 As a convenient shorthand, if any option settings are required at the start of
1687 a non-capturing subpattern (see the next section), the option letters may
1688 appear between the "?" and the ":". Thus the two patterns
1689 .sp
1690   (?i:saturday|sunday)
1691   (?:(?i)saturday|sunday)
1692 .sp
1693 match exactly the same set of strings.
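.P
The shorthand form can be exercised in a small sketch. Python's \fBre\fP module
is assumed as a stand-in for PCRE2. Only the (?i:...) form is shown, because
recent Python versions reject an unscoped (?i) that is not at the very start of
a pattern.

```python
import re

# The caseless option applies to every branch inside the group.
weekend = re.compile(r"(?i:saturday|sunday)")
hits = [s for s in ("Saturday", "SUNDAY", "monday") if weekend.fullmatch(s)]

# The option stops at the closing parenthesis of the group.
scoped = re.compile(r"(?i:a)b")
scoped_hits = [s for s in ("ab", "Ab", "AB") if scoped.fullmatch(s)]
```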
1694 .P
1695 \fBNote:\fP There are other PCRE2-specific options that can be set by the
1696 application when the compiling function is called. The pattern can contain
1697 special leading sequences such as (*CRLF) to override what the application has
1698 set or what has been defaulted. Details are given in the section entitled
1699 .\" HTML <a href="#newlineseq">
1700 .\" </a>
1701 "Newline sequences"
1702 .\"
1703 above. There are also the (*UTF) and (*UCP) leading sequences that can be used
1704 to set UTF and Unicode property modes; they are equivalent to setting the
1705 PCRE2_UTF and PCRE2_UCP options, respectively. However, the application can set
1706 the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use of the
1707 (*UTF) and (*UCP) sequences.
1708 .
1709 .
1710 .\" HTML <a name="subpattern"></a>
1711 .SH SUBPATTERNS
1712 .rs
1713 .sp
1714 Subpatterns are delimited by parentheses (round brackets), which can be nested.
1715 Turning part of a pattern into a subpattern does two things:
1716 .sp
1717 1. It localizes a set of alternatives. For example, the pattern
1718 .sp
1719   cat(aract|erpillar|)
1720 .sp
1721 matches "cataract", "caterpillar", or "cat". Without the parentheses, it would
1722 match "cataract", "erpillar" or an empty string.
1723 .sp
1724 2. It sets up the subpattern as a capturing subpattern. This means that, when
1725 the whole pattern matches, the portion of the subject string that matched the
1726 subpattern is passed back to the caller, separately from the portion that
1727 matched the whole pattern. (This applies only to the traditional matching
1728 function; the DFA matching function does not support capturing.)
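.P
The effect of localizing the alternatives can be sketched as follows, with
Python's \fBre\fP module assumed as a stand-in for PCRE2:

```python
import re

words = ["cataract", "caterpillar", "cat", "erpillar", ""]

# With parentheses, the alternatives apply only to what follows "cat".
grouped = [w for w in words if re.fullmatch(r"cat(aract|erpillar|)", w)]

# Without them, the vertical bars split the whole pattern.
ungrouped = [w for w in words if re.fullmatch(r"cataract|erpillar|", w)]
```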
1729 .P
1730 Opening parentheses are counted from left to right (starting from 1) to obtain
1731 numbers for the capturing subpatterns. For example, if the string "the red
1732 king" is matched against the pattern
1733 .sp
1734   the ((red|white) (king|queen))
1735 .sp
1736 the captured substrings are "red king", "red", and "king", and are numbered 1,
1737 2, and 3, respectively.
1738 .P
1739 The fact that plain parentheses fulfil two functions is not always helpful.
1740 There are often times when a grouping subpattern is required without a
1741 capturing requirement. If an opening parenthesis is followed by a question mark
1742 and a colon, the subpattern does not do any capturing, and is not counted when
1743 computing the number of any subsequent capturing subpatterns. For example, if
1744 the string "the white queen" is matched against the pattern
1745 .sp
1746   the ((?:red|white) (king|queen))
1747 .sp
1748 the captured substrings are "white queen" and "queen", and are numbered 1 and
1749 2. The maximum number of capturing subpatterns is 65535.
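.P
Both numbering examples can be reproduced in a minimal sketch. Python's
\fBre\fP module is an assumed stand-in for PCRE2; it numbers capturing
parentheses in the same left-to-right way.

```python
import re

# Three capturing groups, numbered by their opening parentheses.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(
    r"the ((?:red|white) (king|queen))", "the white queen"
)

all_groups = capturing.groups()
fewer_groups = non_capturing.groups()
```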
1750 .P
1751 As a convenient shorthand, if any option settings are required at the start of
1752 a non-capturing subpattern, the option letters may appear between the "?" and
1753 the ":". Thus the two patterns
1754 .sp
1755   (?i:saturday|sunday)
1756   (?:(?i)saturday|sunday)
1757 .sp
1758 match exactly the same set of strings. Because alternative branches are tried
1759 from left to right, and options are not reset until the end of the subpattern
1760 is reached, an option setting in one branch does affect subsequent branches, so
1761 the above patterns match "SUNDAY" as well as "Saturday".
1762 .
1763 .
1764 .\" HTML <a name="dupsubpatternnumber"></a>
1765 .SH "DUPLICATE SUBPATTERN NUMBERS"
1766 .rs
1767 .sp
1768 Perl 5.10 introduced a feature whereby each alternative in a subpattern uses
1769 the same numbers for its capturing parentheses. Such a subpattern starts with
1770 (?| and is itself a non-capturing subpattern. For example, consider this
1771 pattern:
1772 .sp
1773   (?|(Sat)ur|(Sun))day
1774 .sp
1775 Because the two alternatives are inside a (?| group, both sets of capturing
1776 parentheses are numbered one. Thus, when the pattern matches, you can look
1777 at captured substring number one, whichever alternative matched. This construct
1778 is useful when you want to capture part, but not all, of one of a number of
1779 alternatives. Inside a (?| group, parentheses are numbered as usual, but the
1780 number is reset at the start of each branch. The numbers of any capturing
1781 parentheses that follow the subpattern start after the highest number used in
1782 any branch. The following example is taken from the Perl documentation. The
1783 numbers underneath show in which buffer the captured content will be stored.
1784 .sp
1785   # before  ---------------branch-reset----------- after
1786   / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
1787   # 1            2         2  3        2     3     4
1788 .sp
1789 A backreference to a numbered subpattern uses the most recent value that is
1790 set for that number by any subpattern. The following pattern matches "abcabc"
1791 or "defdef":
1792 .sp
1793   /(?|(abc)|(def))\e1/
1794 .sp
1795 In contrast, a subroutine call to a numbered subpattern always refers to the
1796 first one in the pattern with the given number. The following pattern matches
1797 "abcabc" or "defabc":
1798 .sp
1799   /(?|(abc)|(def))(?1)/
1800 .sp
1801 A relative reference such as (?-1) is no different: it is just a convenient way
1802 of computing an absolute group number.
1803 .P
1804 If a
1805 .\" HTML <a href="#conditions">
1806 .\" </a>
1807 condition test
1808 .\"
1809 for a subpattern's having matched refers to a non-unique number, the test is
1810 true if any of the subpatterns of that number have matched.
1811 .P
1812 An alternative approach to using this "branch reset" feature is to use
1813 duplicate named subpatterns, as described in the next section.
1814 .
1815 .
1816 .SH "NAMED SUBPATTERNS"
1817 .rs
1818 .sp
1819 Identifying capturing parentheses by number is simple, but it can be very hard
1820 to keep track of the numbers in complicated patterns. Furthermore, if an
1821 expression is modified, the numbers may change. To help with this difficulty,
1822 PCRE2 supports the naming of capturing subpatterns. This feature was not added
1823 to Perl until release 5.10. Python had the feature earlier, and PCRE1
1824 introduced it at release 4.0, using the Python syntax. PCRE2 supports both the
1825 Perl and the Python syntax.
1826 .P
1827 In PCRE2, a capturing subpattern can be named in one of three ways:
1828 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python. Names
1829 consist of up to 32 alphanumeric characters and underscores, but must start
1830 with a non-digit. References to capturing parentheses from other parts of the
1831 pattern, such as
1832 .\" HTML <a href="#backreferences">
1833 .\" </a>
1834 backreferences,
1835 .\"
1836 .\" HTML <a href="#recursion">
1837 .\" </a>
1838 recursion,
1839 .\"
1840 and
1841 .\" HTML <a href="#conditions">
1842 .\" </a>
1843 conditions,
1844 .\"
1845 can all be made by name as well as by number.
1846 .P
1847 Named capturing parentheses are allocated numbers as well as names, exactly as
1848 if the names were not present. In both PCRE2 and Perl, capturing subpatterns
1849 are primarily identified by numbers; any names are just aliases for these
1850 numbers. The PCRE2 API provides function calls for extracting the complete
1851 name-to-number translation table from a compiled pattern, as well as
1852 convenience functions for extracting captured substrings by name.
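.P
The idea that names are aliases for numbers can be sketched with Python's
\fBre\fP module, assumed here as a stand-in for PCRE2. Python accepts only the
(?P<name>...) spelling, and its groupindex attribute is merely analogous to the
PCRE2 name table; it is not the PCRE2 API.

```python
import re

date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")
m = date.fullmatch("2018-09")

# Name-to-number table held by the compiled pattern.
name_table = dict(date.groupindex)

# Extraction by name and by number yields the same substring.
by_name = m.group("year")
by_number = m.group(1)
```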
1853 .P
1854 \fBWarning:\fP When more than one subpattern has the same number, as described
1855 in the previous section, a name given to one of them applies to all of them.
1856 Perl allows identically numbered subpatterns to have different names. Consider
1857 this pattern, where there are two capturing subpatterns, both numbered 1:
1858 .sp
1859   (?|(?<AA>aa)|(?<BB>bb))
1860 .sp
1861 Perl allows this, with both names AA and BB as aliases of group 1. Thus, after
1862 a successful match, both names yield the same value (either "aa" or "bb").
1863 .P
1864 In an attempt to reduce confusion, PCRE2 does not allow the same group number
1865 to be associated with more than one name. The example above provokes a
1866 compile-time error. However, there is still scope for confusion. Consider this
1867 pattern:
1868 .sp
1869   (?|(?<AA>aa)|(bb))
1870 .sp
1871 Although the second subpattern number 1 is not explicitly named, the name AA is
1872 still an alias for subpattern 1. Whether the pattern matches "aa" or "bb", a
1873 reference by name to group AA yields the matched string.
1874 .P
1875 By default, a name must be unique within a pattern, except that duplicate names
1876 are permitted for subpatterns with the same number, for example:
1877 .sp
1878   (?|(?<AA>aa)|(?<AA>bb))
1879 .sp
1880 The duplicate name constraint can be disabled by setting the PCRE2_DUPNAMES
1881 option at compile time, or by the use of (?J) within the pattern. Duplicate
1882 names can be useful for patterns where only one instance of the named
1883 parentheses can match. Suppose you want to match the name of a weekday, either
1884 as a 3-letter abbreviation or as the full name, and in both cases you want to
1885 extract the abbreviation. This pattern (ignoring the line breaks) does the job:
1886 .sp
1887   (?<DN>Mon|Fri|Sun)(?:day)?|
1888   (?<DN>Tue)(?:sday)?|
1889   (?<DN>Wed)(?:nesday)?|
1890   (?<DN>Thu)(?:rsday)?|
1891   (?<DN>Sat)(?:urday)?
1892 .sp
1893 There are five capturing subpatterns, but only one is ever set after a match.
1894 The convenience functions for extracting the data by name return the substring
1895 for the first (and in this example, the only) subpattern of that name that
1896 matched. This saves searching to find which numbered subpattern it was. (An
1897 alternative way of solving this problem is to use a "branch reset" subpattern,
1898 as described in the previous section.)
1899 .P
1900 If you make a backreference to a non-unique named subpattern from elsewhere in
1901 the pattern, the subpatterns to which the name refers are checked in the order
1902 in which they appear in the overall pattern. The first one that is set is used
1903 for the reference. For example, this pattern matches both "foofoo" and
1904 "barbar" but not "foobar" or "barfoo":
1905 .sp
1906   (?:(?<n>foo)|(?<n>bar))\ek<n>
1907 .sp
1908 .P
1909 If you make a subroutine call to a non-unique named subpattern, the one that
1910 corresponds to the first occurrence of the name is used. In the absence of
1911 duplicate numbers this is the one with the lowest number.
1912 .P
1913 If you use a named reference in a condition
1914 test (see the
1915 .\"
1916 .\" HTML <a href="#conditions">
1917 .\" </a>
1918 section about conditions
1919 .\"
1920 below), either to check whether a subpattern has matched, or to check for
1921 recursion, all subpatterns with the same name are tested. If the condition is
1922 true for any one of them, the overall condition is true. This is the same
1923 behaviour as testing by number. For further details of the interfaces for
1924 handling named subpatterns, see the
1925 .\" HREF
1926 \fBpcre2api\fP
1927 .\"
1928 documentation.
1929 .
1930 .
1931 .SH REPETITION
1932 .rs
1933 .sp
1934 Repetition is specified by quantifiers, which can follow any of the following
1935 items:
1936 .sp
1937   a literal data character
1938   the dot metacharacter
1939   the \eC escape sequence
1940   the \eX escape sequence
1941   the \eR escape sequence
1942   an escape such as \ed or \epL that matches a single character
1943   a character class
1944   a backreference
1945   a parenthesized subpattern (including most assertions)
1946   a subroutine call to a subpattern (recursive or otherwise)
1947 .sp
1948 The general repetition quantifier specifies a minimum and maximum number of
1949 permitted matches, by giving the two numbers in curly brackets (braces),
1950 separated by a comma. The numbers must be less than 65536, and the first must
1951 be less than or equal to the second. For example:
1952 .sp
1953   z{2,4}
1954 .sp
1955 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a special
1956 character. If the second number is omitted, but the comma is present, there is
1957 no upper limit; if the second number and the comma are both omitted, the
1958 quantifier specifies an exact number of required matches. Thus
1959 .sp
1960   [aeiou]{3,}
1961 .sp
1962 matches at least 3 successive vowels, but may match many more, whereas
1963 .sp
1964   \ed{8}
1965 .sp
1966 matches exactly 8 digits. An opening curly bracket that appears in a position
1967 where a quantifier is not allowed, or one that does not match the syntax of a
1968 quantifier, is taken as a literal character. For example, {,6} is not a
1969 quantifier, but a literal string of four characters.
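.P
The three quantifier examples can be tried in a short sketch, with Python's
\fBre\fP module assumed as a stand-in for PCRE2. The {,6} case is deliberately
left out: Python treats it as a quantifier with a minimum of zero, which
differs from the behaviour described above.

```python
import re

# z{2,4}: between two and four repetitions.
z_counts = [n for n in range(1, 6) if re.fullmatch(r"z{2,4}", "z" * n)]

# [aeiou]{3,}: at least three, with no upper limit.
many_vowels = re.fullmatch(r"[aeiou]{3,}", "aeiouaeiou")
two_vowels = re.fullmatch(r"[aeiou]{3,}", "ae")

# \d{8}: exactly eight.
eight_digits = re.fullmatch(r"\d{8}", "20180904")
seven_digits = re.fullmatch(r"\d{8}", "2018090")

# A closing brace on its own is an ordinary character.
lone_brace = re.fullmatch(r"a}", "a}")
```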
1970 .P
1971 In UTF modes, quantifiers apply to characters rather than to individual code
1972 units. Thus, for example, \ex{100}{2} matches two characters, each of
1973 which is represented by a two-byte sequence in a UTF-8 string. Similarly,
1974 \eX{3} matches three Unicode extended grapheme clusters, each of which may be
1975 several code units long (and they may be of different lengths).
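.P
A rough analogue of the first example, assuming Python's \fBre\fP module as a
stand-in for PCRE2 in UTF-8 mode. Python strings are sequences of characters
rather than code units, so the byte count is obtained by encoding separately.

```python
import re

ch = "\u0100"   # LATIN CAPITAL LETTER A WITH MACRON

# The quantifier counts characters, not bytes.
two_chars = re.fullmatch("\u0100{2}", ch * 2)

# Each such character occupies two bytes in UTF-8.
bytes_per_char = len(ch.encode("utf-8"))
```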
1976 .P
1977 The quantifier {0} is permitted, causing the expression to behave as if the
1978 previous item and the quantifier were not present. This may be useful for
1979 subpatterns that are referenced as
1980 .\" HTML <a href="#subpatternsassubroutines">
1981 .\" </a>
1982 subroutines
1983 .\"
1984 from elsewhere in the pattern (but see also the section entitled
1985 .\" HTML <a href="#subdefine">
1986 .\" </a>
1987 "Defining subpatterns for use by reference only"
1988 .\"
1989 below). Items other than subpatterns that have a {0} quantifier are omitted
1990 from the compiled pattern.
1991 .P
1992 For convenience, the three most common quantifiers have single-character
1993 abbreviations:
1994 .sp
1995   *    is equivalent to {0,}
1996   +    is equivalent to {1,}
1997   ?    is equivalent to {0,1}
1998 .sp
1999 It is possible to construct infinite loops by following a subpattern that can
2000 match no characters with a quantifier that has no upper limit, for example:
2001 .sp
2002   (a?)*
2003 .sp
2004 Earlier versions of Perl and PCRE1 used to give an error at compile time for
2005 such patterns. However, because there are cases where this can be useful, such
2006 patterns are now accepted, but if any repetition of the subpattern does in fact
2007 match no characters, the loop is forcibly broken.
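.P
The following sketch shows only that such a pattern terminates. It assumes
Python's \fBre\fP module as a stand-in for PCRE2; Python also guards against
empty iterations, but how it does so (and what the group captures afterwards)
is not claimed to be identical.

```python
import re

# An unlimited repeat of a group that can match the empty string.
whole = re.fullmatch(r"(a?)*", "aaa")
empty = re.fullmatch(r"(a?)*", "")

matched_text = whole.group(0)
```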
2008 .P
2009 By default, the quantifiers are "greedy", that is, they match as much as
2010 possible (up to the maximum number of permitted times), without causing the
2011 rest of the pattern to fail. The classic example of where this gives problems
2012 is in trying to match comments in C programs. These appear between /* and */,
2013 and within a comment individual * and / characters may appear. An attempt to
2014 match C comments by applying the pattern
2015 .sp
2016   /\e*.*\e*/
2017 .sp
2018 to the string
2019 .sp
2020   /* first comment */  not comment  /* second comment */
2021 .sp
2022 fails, because it matches the entire string owing to the greediness of the .*
2023 item.
2024 .P
2025 If a quantifier is followed by a question mark, it ceases to be greedy, and
2026 instead matches the minimum number of times possible, so the pattern
2027 .sp
2028   /\e*.*?\e*/
2029 .sp
2030 does the right thing with the C comments. The meaning of the various
2031 quantifiers is not otherwise changed, just the preferred number of matches.
2032 Do not confuse this use of question mark with its use as a quantifier in its
2033 own right. Because it has two uses, it can sometimes appear doubled, as in
2034 .sp
2035   \ed??\ed
2036 .sp
2037 which matches one digit by preference, but can match two if that is the only
2038 way the rest of the pattern matches.
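.P
The greedy and minimizing forms can be compared in a minimal sketch. Python's
\fBre\fP module is assumed as a stand-in for PCRE2; the two agree on these
patterns.

```python
import re

subject = "/* first comment */  not comment  /* second comment */"

# Greedy: .* runs to the last */ in the subject.
greedy = re.search(r"/\*.*\*/", subject).group()

# Minimizing: .*? stops at the first */ it can.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit, but takes two if the rest requires it.
one_digit = re.match(r"\d??\d", "12").group()
two_digits = re.fullmatch(r"\d??\d", "12").group()
```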
2039 .P
2040 If the PCRE2_UNGREEDY option is set (an option that is not available in Perl),
2041 the quantifiers are not greedy by default, but individual ones can be made
2042 greedy by following them with a question mark. In other words, it inverts the
2043 default behaviour.
2044 .P
2045 When a parenthesized subpattern is quantified with a minimum repeat count that
2046 is greater than 1 or with a limited maximum, more memory is required for the
2047 compiled pattern, in proportion to the size of the minimum or maximum.
2048 .P
2049 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option (equivalent
2050 to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is
2051 implicitly anchored, because whatever follows will be tried against every
2052 character position in the subject string, so there is no point in retrying the
2053 overall match at any position after the first. PCRE2 normally treats such a
2054 pattern as though it were preceded by \eA.
2055 .P
2056 In cases where it is known that the subject string contains no newlines, it is
2057 worth setting PCRE2_DOTALL in order to obtain this optimization, or
2058 alternatively, using ^ to indicate anchoring explicitly.
2059 .P
2060 However, there are some cases where the optimization cannot be used. When .*
2061 is inside capturing parentheses that are the subject of a backreference
2062 elsewhere in the pattern, a match at the start may fail where a later one
2063 succeeds. Consider, for example:
2064 .sp
2065   (.*)abc\e1
2066 .sp
2067 If the subject is "xyz123abc123" the match point is the fourth character. For
2068 this reason, such a pattern is not implicitly anchored.
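.P
The match point in this example can be confirmed with a short sketch. Python's
\fBre\fP module is assumed as a stand-in for PCRE2, with re.DOTALL in the role
of PCRE2_DOTALL. Python performs no implicit anchoring, so this shows only
where the match is found.

```python
import re

m = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

# Attempts at offsets 0, 1 and 2 fail because the text captured there
# is not repeated after "abc"; the match starts at the fourth character.
match_offset = m.start()
captured = m.group(1)
```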
2069 .P
2070 Another case where implicit anchoring is not applied is when the leading .* is
2071 inside an atomic group. Once again, a match at the start may fail where a later
2072 one succeeds. Consider this pattern:
2073 .sp
2074   (?>.*?a)b
2075 .sp
2076 It matches "ab" in the subject "aab". The use of the backtracking control verbs
2077 (*PRUNE) and (*SKIP) also disables this optimization, and there is an option,
2078 PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
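.P
The atomic group example can be approximated in a sketch, assuming Python's
\fBre\fP module as a stand-in for PCRE2. So that it runs on Python versions
that lack (?>...), the atomic group is emulated by a lookahead followed by a
backreference. This is a substitute technique, not the PCRE2 syntax, and it
relies on lookaheads not being re-entered once they have succeeded.

```python
import re

# Emulation of (?>.*?a)b: capture inside a lookahead, then consume the
# captured text with a backreference.
atomic_like = re.search(r"(?=(.*?a))\1b", "aab", re.DOTALL)

# The ordinary group can backtrack and therefore matches from offset 0.
ordinary = re.search(r"(?:.*?a)b", "aab", re.DOTALL)

atomic_start, atomic_text = atomic_like.start(), atomic_like.group(0)
ordinary_start, ordinary_text = ordinary.start(), ordinary.group(0)
```

The atomic-style match is found at offset 1, which is why anchoring such a
pattern at the start of the subject would be wrong.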
2079 .P
2080 When a capturing subpattern is repeated, the value captured is the substring
2081 that matched the final iteration. For example, after
2082 .sp
2083   (tweedle[dume]{3}\es*)+
2084 .sp
2085 has matched "tweedledum tweedledee" the value of the captured substring is
2086 "tweedledee". However, if there are nested capturing subpatterns, the
2087 corresponding captured values may have been set in previous iterations. For
2088 example, after
2089 .sp
2090   (a|(b))+
2091 .sp
2092 matches "aba" the value of the second captured substring is "b".
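.P
Both examples can be reproduced in a minimal sketch, with Python's \fBre\fP
module assumed as a stand-in for PCRE2 (the engines agree on these two cases).

```python
import re

# The outer group keeps only what its final iteration matched.
outer = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = outer.group(1)

# The nested group was last set in the second iteration ("b") and
# keeps that value although the final iteration matched "a".
nested = re.fullmatch(r"(a|(b))+", "aba")
nested_groups = nested.groups()
```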
2093 .
2094 .
2095 .\" HTML <a name="atomicgroup"></a>
2096 .SH "ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS"
2097 .rs
2098 .sp
2099 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
2100 repetition, failure of what follows normally causes the repeated item to be
2101 re-evaluated to see if a different number of repeats allows the rest of the
2102 pattern to match. Sometimes it is useful to prevent this, either to change the
2103 nature of the match, or to cause it to fail earlier than it otherwise might, when
2104 the author of the pattern knows there is no point in carrying on.
2105 .P
2106 Consider, for example, the pattern \ed+foo when applied to the subject line
2107 .sp
2108   123456bar
2109 .sp
2110 After matching all 6 digits and then failing to match "foo", the normal
2111 action of the matcher is to try again with only 5 digits matching the \ed+
2112 item, and then with 4, and so on, before ultimately failing. "Atomic grouping"
2113 (a term taken from Jeffrey Friedl's book) provides the means for specifying
2114 that once a subpattern has matched, it is not to be re-evaluated in this way.
2115 .P
2116 If we use atomic grouping for the previous example, the matcher gives up
2117 immediately on failing to match "foo" the first time. The notation is a kind of
2118 special parenthesis, starting with (?> as in this example:
2119 .sp
2120   (?>\ed+)foo
2121 .sp
2122 This kind of parenthesis "locks up" the part of the pattern it contains once
2123 it has matched, and a failure further into the pattern is prevented from
2124 backtracking into it. Backtracking past it to previous items, however, works as
2125 normal.
2126 .P
2127 An alternative description is that a subpattern of this type matches exactly
2128 the string of characters that an identical standalone pattern would match, if
2129 anchored at the current point in the subject string.
2130 .P
2131 Atomic grouping subpatterns are not capturing subpatterns. Simple cases such as
2132 the above example can be thought of as a maximizing repeat that must swallow
2133 everything it can. So, while both \ed+ and \ed+? are prepared to adjust the
2134 number of digits they match in order to make the rest of the pattern match,
2135 (?>\ed+) can only match an entire sequence of digits.
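.P
The difference between \ed+foo and (?>\ed+)foo on a failing subject can be
pictured with a toy model. The function below is hypothetical and written only
for this illustration; it is not how PCRE2 is implemented, and it looks at a
single starting position (offset 0) only.
.sp
.nf
```python
def try_digits_then_foo(subject, atomic):
    """Toy model of matching <digits>foo at offset 0 of subject.

    Returns (matched, attempts), where attempts counts how many
    different digit-run lengths were tried before giving an answer.
    """
    run = 0
    while run < len(subject) and subject[run] in "0123456789":
        run += 1                          # greedy: take every digit first
    attempts = 0
    for length in range(run, 0, -1):      # then give digits back one by one
        attempts += 1
        if subject.startswith("foo", length):
            return True, attempts
        if atomic:                        # atomic group: nothing is given back
            break
    return False, attempts


print(try_digits_then_foo("123456bar", atomic=False))
print(try_digits_then_foo("123456bar", atomic=True))
```
.fi
.P
In this model the ordinary repeat tries six run lengths before failing, whereas
the atomic version gives up after the first.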
2136 .P
2137 Atomic groups in general can of course contain arbitrarily complicated
2138 subpatterns, and can be nested. However, when the subpattern for an atomic
2139 group is just a single repeated item, as in the example above, a simpler
2140 notation, called a "possessive quantifier", can be used. This consists of an
2141 additional + character following a quantifier. Using this notation, the
2142 previous example can be rewritten as
2143 .sp
2144   \ed++foo
2145 .sp
2146 Note that a possessive quantifier can be used with an entire group, for
2147 example:
2148 .sp
2149   (abc|xyz){2,3}+
2150 .sp
2151 Possessive quantifiers are always greedy; the setting of the PCRE2_UNGREEDY
2152 option is ignored. They are a convenient notation for the simpler forms of
2153 atomic group. However, there is no difference in the meaning of a possessive
2154 quantifier and the equivalent atomic group, though there may be a performance
2155 difference; possessive quantifiers should be slightly faster.
2156 .P
2157 The possessive quantifier syntax is an extension to the Perl 5.8 syntax.
2158 Jeffrey Friedl originated the idea (and the name) in the first edition of his
2159 book. Mike McCloskey liked it, so implemented it when he built Sun's Java
2160 package, and PCRE1 copied it from there. It ultimately found its way into Perl
2161 at release 5.10.
2162 .P
2163 PCRE2 has an optimization that automatically "possessifies" certain simple
2164 pattern constructs. For example, the sequence A+B is treated as A++B because
2165 there is no point in backtracking into a sequence of A's when B must follow.
2166 This feature can be disabled by the PCRE2_NO_AUTOPOSSESS option, or by starting
2167 the pattern with (*NO_AUTO_POSSESS).
2168 .P
2169 When a pattern contains an unlimited repeat inside a subpattern that can itself
2170 be repeated an unlimited number of times, the use of an atomic group is the
2171 only way to avoid some failing matches taking a very long time indeed. The
2172 pattern
2173 .sp
2174   (\eD+|<\ed+>)*[!?]
2175 .sp
2176 matches an unlimited number of substrings that either consist of non-digits, or
2177 digits enclosed in <>, followed by either ! or ?. When it matches, it runs
2178 quickly. However, if it is applied to
2179 .sp
2180   aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
2181 .sp
2182 it takes a long time before reporting failure. This is because the string can
2183 be divided between the internal \eD+ repeat and the external * repeat in a
2184 large number of ways, and all have to be tried. (The example uses [!?] rather
2185 than a single character at the end, because both PCRE2 and Perl have an
2186 optimization that allows for fast failure when a single character is used. They
2187 remember the last single character that is required for a match, and fail early
2188 if it is not present in the string.) If the pattern is changed so that it uses
2189 an atomic group, like this:
2190 .sp
2191   ((?>\eD+)|<\ed+>)*[!?]
2192 .sp
2193 sequences of non-digits cannot be broken, and failure happens quickly.
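.P
To give a feel for "a large number of ways", the sketch below counts how many
ways a run of n non-digits can be divided into consecutive non-empty pieces,
each piece being one iteration of the outer repeat. The function name is
hypothetical, and the count is a model of the search space from one starting
point, assuming every division is explored; it is not a measurement of PCRE2.
.sp
.nf
```python
from functools import lru_cache


@lru_cache(maxsize=None)
def ways_to_split(n):
    """Count the ways a run of n characters can be cut into consecutive
    non-empty pieces, one piece per iteration of the outer repeat."""
    if n == 0:
        return 1                      # nothing left to divide
    # Choose the length of the first piece, then split the remainder.
    return sum(ways_to_split(n - first) for first in range(1, n + 1))


for n in (1, 2, 3, 10, 20):
    print(n, ways_to_split(n))
```
.fi
.P
The count doubles with every additional character. With the atomic group, only
one division remains for a given starting point: the one in which a single
iteration takes the whole run.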
2194 .
2195 .
2196 .\" HTML <a name="backreferences"></a>
2197 .SH "BACKREFERENCES"
2198 .rs
2199 .sp
2200 Outside a character class, a backslash followed by a digit greater than 0 (and
2201 possibly further digits) is a backreference to a capturing subpattern earlier
2202 (that is, to its left) in the pattern, provided there have been that many
2203 previous capturing left parentheses.
2204 .P
2205 However, if the decimal number following the backslash is less than 8, it is
2206 always taken as a backreference, and causes an error only if there are not
2207 that many capturing left parentheses in the entire pattern. In other words, the
2208 parentheses that are referenced need not be to the left of the reference for
2209 numbers less than 8. A "forward backreference" of this type can make sense
2210 when a repetition is involved and the subpattern to the right has participated
2211 in an earlier iteration.
2212 .P
2213 It is not possible to have a numerical "forward backreference" to a subpattern
2214 whose number is 8 or more using this syntax because a sequence such as \e50 is
2215 interpreted as a character defined in octal. See the subsection entitled
2216 "Non-printing characters"
2217 .\" HTML <a href="#digitsafterbackslash">
2218 .\" </a>
2219 above
2220 .\"
2221 for further details of the handling of digits following a backslash. There is
2222 no such problem when named parentheses are used. A backreference to any
2223 subpattern is possible using named parentheses (see below).
2224 .P
2225 Another way of avoiding the ambiguity inherent in the use of digits following a
2226 backslash is to use the \eg escape sequence. This escape must be followed by a
2227 signed or unsigned number, optionally enclosed in braces. These examples are
2228 all identical:
2229 .sp
2230   (ring), \e1
2231   (ring), \eg1
2232   (ring), \eg{1}
2233 .sp
2234 An unsigned number specifies an absolute reference without the ambiguity that
2235 is present in the older syntax. It is also useful when literal digits follow
2236 the reference. A signed number is a relative reference. Consider this example:
2237 .sp
2238   (abc(def)ghi)\eg{-1}
2239 .sp
2240 The sequence \eg{-1} is a reference to the most recently started capturing
2241 subpattern before \eg, that is, it is equivalent to \e2 in this example.
2242 Similarly, \eg{-2} would be equivalent to \e1. The use of relative references
2243 can be helpful in long patterns, and also in patterns that are created by
2244 joining together fragments that contain references within themselves.
2245 .P
2246 The sequence \eg{+1} is a reference to the next capturing subpattern. This kind
2247 of forward reference can be useful in patterns that repeat. Perl does not
2248 support the use of + in this way.
2249 .P
2250 A backreference matches whatever actually matched the capturing subpattern in
2251 the current subject string, rather than anything matching the subpattern
2252 itself (see
2253 .\" HTML <a href="#subpatternsassubroutines">
2254 .\" </a>
2255 "Subpatterns as subroutines"
2256 .\"
2257 below for a way of doing that). So the pattern
2258 .sp
2259   (sens|respons)e and \e1ibility
2260 .sp
2261 matches "sense and sensibility" and "response and responsibility", but not
2262 "sense and responsibility". If caseful matching is in force at the time of the
2263 backreference, the case of letters is relevant. For example,
2264 .sp
2265   ((?i)rah)\es+\e1
2266 .sp
2267 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the original
2268 capturing subpattern is matched caselessly.
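.P
Both points can be tried out in a short sketch. It uses Python's \fBre\fP
module as a stand-in, assuming that backreferences behave the same way there
for these cases. Named groups replace the numbered ones, and (?i:...) is
Python's scoped spelling of a caseless group, so the second pattern is an
adaptation of the one above rather than a copy.
.sp
.nf
```python
import re

# (?P<w>...) and (?P=w) stand in for group 1 and its backreference.
pair = re.compile("(?P<w>sens|respons)e and (?P=w)ibility")
same_1 = pair.fullmatch("sense and sensibility") is not None
same_2 = pair.fullmatch("response and responsibility") is not None
mixed = pair.fullmatch("sense and responsibility") is not None

# The group is caseless, but the backreference lies outside it and is
# therefore compared casefully against the text actually captured.
rah = re.compile("(?P<r>(?i:rah)) +(?P=r)")
upper = rah.fullmatch("RAH RAH") is not None
swapped = rah.fullmatch("RAH rah") is not None
print(same_1, same_2, mixed, upper, swapped)
```
.fi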
2269 .P
2270 There are several different ways of writing backreferences to named
2271 subpatterns. The .NET syntax \ek{name} and the Perl syntax \ek<name> or
2272 \ek'name' are supported, as is the Python syntax (?P=name). Perl 5.10's unified
2273 backreference syntax, in which \eg can be used for both numeric and named
2274 references, is also supported. We could rewrite the above example in any of
2275 the following ways:
2276 .sp
2277   (?<p1>(?i)rah)\es+\ek<p1>
2278   (?'p1'(?i)rah)\es+\ek{p1}
2279   (?P<p1>(?i)rah)\es+(?P=p1)
2280   (?<p1>(?i)rah)\es+\eg{p1}
2281 .sp
2282 A subpattern that is referenced by name may appear in the pattern before or
2283 after the reference.
2284 .P
2285 There may be more than one backreference to the same subpattern. If a
2286 subpattern has not actually been used in a particular match, any backreferences
2287 to it always fail by default. For example, the pattern
2288 .sp
2289   (a|(bc))\e2
2290 .sp
2291 always fails if it starts to match "a" rather than "bc". However, if the
2292 PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backreference to an
2293 unset value matches an empty string.
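.P
The default behaviour can be sketched in Python, whose \fBre\fP module also
makes a reference to an unset group fail. That module is not PCRE2 and has no
counterpart of PCRE2_MATCH_UNSET_BACKREF, so only the default case is shown.
The group name "g" is illustrative and plays the part of group 2.
.sp
.nf
```python
import re

# Group "g" plays the part of group 2 in the pattern above.
pattern = re.compile("(a|(?P<g>bc))(?P=g)")

via_a = pattern.fullmatch("a")        # first branch taken, "g" never set
via_bc = pattern.fullmatch("bcbc")    # second branch taken, "g" is "bc"
print(via_a, via_bc.group("g"))
```
.fi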
2294 .P
2295 Because there may be many capturing parentheses in a pattern, all digits
2296 following a backslash are taken as part of a potential backreference number.
2297 If the pattern continues with a digit character, some delimiter must be used to
2298 terminate the backreference. If the PCRE2_EXTENDED or PCRE2_EXTENDED_MORE
2299 option is set, this can be white space. Otherwise, the \eg{ syntax or an empty
2300 comment (see
2301 .\" HTML <a href="#comments">
2302 .\" </a>
2303 "Comments"
2304 .\"
2305 below) can be used.
2306 .
2307 .
2308 .SS "Recursive backreferences"
2309 .rs
2310 .sp
2311 A backreference that occurs inside the parentheses to which it refers fails
2312 when the subpattern is first used, so, for example, (a\e1) never matches.
2313 However, such references can be useful inside repeated subpatterns. For
2314 example, the pattern
2315 .sp
2316   (a|b\e1)+
2317 .sp
2318 matches any number of "a"s and also "aba", "ababbaa" etc. At each iteration of
2319 the subpattern, the backreference matches the character string corresponding
2320 to the previous iteration. In order for this to work, the pattern must be such
2321 that the first iteration does not need to match the backreference. This can be
2322 done using alternation, as in the example above, or by a quantifier with a
2323 minimum of zero.
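.P
The iteration-by-iteration behaviour of (a|b\e1)+ can be modelled by a small
hand-written matcher. The function below is hypothetical and is only a sketch
of the idea for whole-string matching; it does not use or reproduce PCRE2's
code.
.sp
.nf
```python
def matches(subject, pos=0, previous=None):
    """Toy whole-string matcher for the repeated group (a|b<ref>)+ where
    <ref> is the text matched by the previous iteration of the group.

    previous is None before the first iteration, so the branch that
    contains the backreference cannot be used there.
    """
    if previous is not None and pos == len(subject):
        return True                       # at least one iteration, all consumed
    # First alternative: a literal "a".
    if subject.startswith("a", pos) and matches(subject, pos + 1, "a"):
        return True
    # Second alternative: "b" followed by the previous iteration's text.
    if previous is not None:
        piece = "b" + previous
        if subject.startswith(piece, pos):
            return matches(subject, pos + len(piece), piece)
    return False


for subject in ["aaa", "aba", "ababbaa", "b", "abb"]:
    print(subject, matches(subject))
```
.fi
.P
For "ababbaa" the successive iterations in this model match "a", "ba", "bba"
and "a".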
2324 .P
2325 Backreferences of this type cause the group that they reference to be treated
2326 as an
2327 .\" HTML <a href="#atomicgroup">
2328 .\" </a>
2329 atomic group.
2330 .\"
2331 Once the whole group has been matched, a subsequent matching failure cannot
2332 cause backtracking into the middle of the group.
2333 .
2334 .
2335 .\" HTML <a name="bigassertions"></a>
2336 .SH ASSERTIONS
2337 .rs
2338 .sp
2339 An assertion is a test on the characters following or preceding the current
2340 matching point that does not consume any characters. The simple assertions
2341 coded as \eb, \eB, \eA, \eG, \eZ, \ez, ^ and $ are described
2342 .\" HTML <a href="#smallassertions">
2343 .\" </a>
2344 above.
2345 .\"
2346 .P
2347 More complicated assertions are coded as subpatterns. There are two kinds:
2348 those that look ahead of the current position in the subject string, and those
2349 that look behind it, and in each case an assertion may be positive (must
2350 succeed for matching to continue) or negative (must not succeed for matching to
2351 continue). An assertion subpattern is matched in the normal way, except that,
2352 when matching continues after a successful assertion, the matching position in
2353 the subject string is as it was before the assertion was processed.
2354 .P
2355 Assertion subpatterns are not capturing subpatterns. If an assertion contains
2356 capturing subpatterns within it, these are counted for the purposes of
2357 numbering the capturing subpatterns in the whole pattern. Within each branch of
2358 an assertion, locally captured substrings may be referenced in the usual way.
2359 For example, a sequence such as (.)\eg{-1} can be used to check that two
2360 adjacent characters are the same.
2361 .P
2362 When a branch within an assertion fails to match, any substrings that were
2363 captured are discarded (as happens with any pattern branch that fails to
2364 match). A negative assertion succeeds only when all its branches fail to match;
2365 this means that no captured substrings are ever retained after a successful
2366 negative assertion. When an assertion contains a matching branch, what happens
2367 depends on the type of assertion.
2368 .P
2369 For a positive assertion, internally captured substrings in the successful
2370 branch are retained, and matching continues with the next pattern item after
2371 the assertion. For a negative assertion, a matching branch means that the
2372 assertion has failed. If the assertion is being used as a condition in a
2373 .\" HTML <a href="#conditions">
2374 .\" </a>
2375 conditional subpattern
2376 .\"
2377 (see below), captured substrings are retained, because matching continues with
2378 the "no" branch of the condition. For other failing negative assertions,
2379 control passes to the previous backtracking point, thus discarding any captured
2380 strings within the assertion.
2381 .P
2382 For compatibility with Perl, most assertion subpatterns may be repeated; though
2383 it makes no sense to assert the same thing several times, the side effect of
2384 capturing parentheses may occasionally be useful. However, an assertion that
2385 forms the condition for a conditional subpattern may not be quantified. In
2386 practice, for other assertions, there are only three cases:
2387 .sp
2388 (1) If the quantifier is {0}, the assertion is never obeyed during matching.
2389 However, it may contain internal capturing parenthesized groups that are called
2390 from elsewhere via the
2391 .\" HTML <a href="#subpatternsassubroutines">
2392 .\" </a>
2393 subroutine mechanism.
2394 .\"
2395 .sp
2396 (2) If the quantifier is {0,n} where n is greater than zero, it is treated as if it
2397 were {0,1}. At run time, the rest of the pattern match is tried with and
2398 without the assertion, the order depending on the greediness of the quantifier.
2399 .sp
2400 (3) If the minimum repetition is greater than zero, the quantifier is ignored.
2401 The assertion is obeyed just once when encountered during matching.
2402 .
2403 .
2404 .SS "Lookahead assertions"
2405 .rs
2406 .sp
2407 Lookahead assertions start with (?= for positive assertions and (?! for
2408 negative assertions. For example,
2409 .sp
2410   \ew+(?=;)
2411 .sp
2412 matches a word followed by a semicolon, but does not include the semicolon in
2413 the match, and
2414 .sp
2415   foo(?!bar)
2416 .sp
2417 matches any occurrence of "foo" that is not followed by "bar". Note that the
2418 apparently similar pattern
2419 .sp
2420   (?!foo)bar
2421 .sp
2422 does not find an occurrence of "bar" that is preceded by something other than
2423 "foo"; it finds any occurrence of "bar" whatsoever, because the assertion
2424 (?!foo) is always true when the next three characters are "bar". A
2425 lookbehind assertion is needed to achieve the other effect.
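.P
The three lookahead patterns above can be exercised in a short sketch. Python's
\fBre\fP module is used as a stand-in, on the assumption that lookahead
assertions behave identically for these cases; [a-z]+ replaces the
word-matching part of the first pattern, and the subject strings are made up
for the illustration.
.sp
.nf
```python
import re

# [a-z]+ stands in for the word-matching part of the first pattern.
word = re.search("[a-z]+(?=;)", "total;").group(0)

# Only the "foo" that is not followed by "bar" is reported.
foo_starts = [m.start() for m in re.finditer("foo(?!bar)", "foobar foobaz")]

# (?!foo) is tested where "bar" begins, so it always succeeds there.
bar = re.search("(?!foo)bar", "foobar")
print(word, foo_starts, bar.start())
```
.fi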
2426 .P
2427 If you want to force a matching failure at some point in a pattern, the most
2428 convenient way to do it is with (?!) because an empty string always matches, so
2429 an assertion that requires there not to be an empty string must always fail.
2430 The backtracking control verb (*FAIL) or (*F) is a synonym for (?!).
2431 .
2432 .
2433 .\" HTML <a name="lookbehind"></a>
2434 .SS "Lookbehind assertions"
2435 .rs
2436 .sp
2437 Lookbehind assertions start with (?<= for positive assertions and (?<! for
2438 negative assertions. For example,
2439 .sp
2440   (?<!foo)bar
2441 .sp
2442 does find an occurrence of "bar" that is not preceded by "foo". The contents of
2443 a lookbehind assertion are restricted such that all the strings it matches must
2444 have a fixed length. However, if there are several top-level alternatives, they
2445 do not all have to have the same fixed length. Thus
2446 .sp
2447   (?<=bullock|donkey)
2448 .sp
2449 is permitted, but
2450 .sp
2451   (?<!dogs?|cats?)
2452 .sp
2453 causes an error at compile time. Branches that match different length strings
2454 are permitted only at the top level of a lookbehind assertion. This is an
2455 extension compared with Perl, which requires all branches to match the same
2456 length of string. An assertion such as
2457 .sp
2458   (?<=ab(c|de))
2459 .sp
2460 is not permitted, because its single top-level branch can match two different
2461 lengths, but it is acceptable to PCRE2 if rewritten to use two top-level
2462 branches:
2463 .sp
2464   (?<=abc|abde)
2465 .sp
2466 In some cases, the escape sequence \eK
2467 .\" HTML <a href="#resetmatchstart">
2468 .\" </a>
2469 (see above)
2470 .\"
2471 can be used instead of a lookbehind assertion to get round the fixed-length
2472 restriction.
2473 .P
2474 The implementation of lookbehind assertions is, for each alternative, to
2475 temporarily move the current position back by the fixed length and then try to
2476 match. If there are insufficient characters before the current position, the
2477 assertion fails.
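.P
The procedure just described can be sketched for the simple case in which every
alternative is a literal string, as in (?<=bullock|donkey). The function below
is hypothetical and illustrative only; PCRE2 handles arbitrary fixed-length
subpatterns, not just literals.
.sp
.nf
```python
def lookbehind(subject, pos, alternatives):
    """Toy positive lookbehind for fixed-length literal alternatives.

    For each alternative, step back by its length from pos and compare.
    """
    for literal in alternatives:
        start = pos - len(literal)
        if start < 0:
            continue                      # too few characters before pos
        if subject.startswith(literal, start):
            return True
    return False


print(lookbehind("a donkey cart", 8, ["bullock", "donkey"]))
print(lookbehind("key cart", 3, ["bullock", "donkey"]))
```
.fi
.P
Each alternative needs exactly one comparison at a known offset, which is why
the length of every top-level branch has to be fixed.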
2478 .P
2479 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \eC escape (which matches a
2480 single code unit even in a UTF mode) to appear in lookbehind assertions,
2481 because it makes it impossible to calculate the length of the lookbehind. The
2482 \eX and \eR escapes, which can match different numbers of code units, are never
2483 permitted in lookbehinds.
2484 .P
2485 .\" HTML <a href="#subpatternsassubroutines">
2486 .\" </a>
2487 "Subroutine"
2488 .\"
2489 calls (see below) such as (?2) or (?&X) are permitted in lookbehinds, as long
2490 as the subpattern matches a fixed-length string. However,
2491 .\" HTML <a href="#recursion">
2492 .\" </a>
2493 recursion,
2494 .\"
2495 that is, a "subroutine" call into a group that is already active,
2496 is not supported.
2497 .P
2498 Perl does not support backreferences in lookbehinds. PCRE2 does support them,
2499 but only if certain conditions are met. The PCRE2_MATCH_UNSET_BACKREF option
2500 must not be set, there must be no use of (?| in the pattern (it creates
2501 duplicate subpattern numbers), and if the backreference is by name, the name
2502 must be unique. Of course, the referenced subpattern must itself be of fixed
2503 length. The following pattern matches words containing at least two characters
2504 that begin and end with the same character:
2505 .sp
2506   \eb(\ew)\ew++(?<=\e1)
2507 .P
2508 Possessive quantifiers can be used in conjunction with lookbehind assertions to
2509 specify efficient matching of fixed-length strings at the end of subject
2510 strings. Consider a simple pattern such as
2511 .sp
2512   abcd$
2513 .sp
2514 when applied to a long string that does not match. Because matching proceeds
2515 from left to right, PCRE2 will look for each "a" in the subject and then see if
2516 what follows matches the rest of the pattern. If the pattern is specified as
2517 .sp
2518   ^.*abcd$
2519 .sp
2520 the initial .* matches the entire string at first, but when this fails (because
2521 there is no following "a"), it backtracks to match all but the last character,
2522 then all but the last two characters, and so on. Once again the search for "a"
2523 covers the entire string, from right to left, so we are no better off. However,
2524 if the pattern is written as
2525 .sp
2526   ^.*+(?<=abcd)
2527 .sp
2528 there can be no backtracking for the .*+ item because of the possessive
2529 quantifier; it can match only the entire string. The subsequent lookbehind
2530 assertion does a single test on the last four characters. If it fails, the
2531 match fails immediately. For long strings, this approach makes a significant
2532 difference to the processing time.
2533 .
2534 .
2535 .SS "Using multiple assertions"
2536 .rs
2537 .sp
2538 Several assertions (of any sort) may occur in succession. For example,
2539 .sp
2540   (?<=\ed{3})(?<!999)foo
2541 .sp
2542 matches "foo" preceded by three digits that are not "999". Notice that each of
2543 the assertions is applied independently at the same point in the subject
2544 string. First there is a check that the previous three characters are all
2545 digits, and then there is a check that the same three characters are not "999".
2546 This pattern does \fInot\fP match "foo" preceded by six characters, the first
2547 three of which are digits and the last three of which are not "999". For example, it
2548 doesn't match "123abcfoo". A pattern to do that is
2549 .sp
2550   (?<=\ed{3}...)(?<!999)foo
2551 .sp
2552 This time the first assertion looks at the preceding six characters, checking
2553 that the first three are digits, and then the second assertion checks that the
2554 preceding three characters are not "999".
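.P
The difference between the two patterns shows up clearly when both are run
against the same subjects. The sketch uses Python's \fBre\fP module as a
stand-in, assuming fixed-length lookbehinds behave the same way there; [0-9]
replaces the digit escape, and the subject strings are illustrative.
.sp
.nf
```python
import re

# [0-9] stands in for the digit escape used in the patterns above.
same_three = re.compile("(?<=[0-9]{3})(?<!999)foo")
six_back = re.compile("(?<=[0-9]{3}...)(?<!999)foo")

subjects = ["123foo", "999foo", "123abcfoo", "123999foo"]
first = [same_three.search(s) is not None for s in subjects]
second = [six_back.search(s) is not None for s in subjects]
print(first)
print(second)
```
.fi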
2555 .P
2556 Assertions can be nested in any combination. For example,
2557 .sp
2558   (?<=(?<!foo)bar)baz
2559 .sp
2560 matches an occurrence of "baz" that is preceded by "bar" which in turn is not
2561 preceded by "foo", while
2562 .sp
2563   (?<=\ed{3}(?!999)...)foo
2564 .sp
2565 is another pattern that matches "foo" preceded by three digits and any three
2566 characters that are not "999".
2567 .
2568 .
2569 .\" HTML <a name="conditions"></a>
2570 .SH "CONDITIONAL SUBPATTERNS"
2571 .rs
2572 .sp
2573 It is possible to cause the matching process to obey a subpattern
2574 conditionally or to choose between two alternative subpatterns, depending on
2575 the result of an assertion, or whether a specific capturing subpattern has
2576 already been matched. The two possible forms of conditional subpattern are:
2577 .sp
2578   (?(condition)yes-pattern)
2579   (?(condition)yes-pattern|no-pattern)
2580 .sp
2581 If the condition is satisfied, the yes-pattern is used; otherwise the
2582 no-pattern (if present) is used. An absent no-pattern is equivalent to an empty
2583 string (it always matches). If there are more than two alternatives in the
2584 subpattern, a compile-time error occurs. Each of the two alternatives may
2585 itself contain nested subpatterns of any form, including conditional
2586 subpatterns; the restriction to two alternatives applies only at the level of
2587 the condition. This pattern fragment is an example where the alternatives are
2588 complex:
2589 .sp
2590   (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
2591 .sp
2592 .P
2593 There are five kinds of condition: references to subpatterns, references to
2594 recursion, two pseudo-conditions called DEFINE and VERSION, and assertions.
2595 .
2596 .
2597 .SS "Checking for a used subpattern by number"
2598 .rs
2599 .sp
2600 If the text between the parentheses consists of a sequence of digits, the
2601 condition is true if a capturing subpattern of that number has previously
2602 matched. If there is more than one capturing subpattern with the same number
2603 (see the earlier
2604 .\"
2605 .\" HTML <a href="#recursion">
2606 .\" </a>
2607 section about duplicate subpattern numbers),
2608 .\"
2609 the condition is true if any of them have matched. An alternative notation is
2610 to precede the digits with a plus or minus sign. In this case, the subpattern
2611 number is relative rather than absolute. The most recently opened parentheses
2612 can be referenced by (?(-1), the next most recent by (?(-2), and so on. Inside
2613 loops it can also make sense to refer to subsequent groups. The next
2614 parentheses to be opened can be referenced as (?(+1), and so on. (The value
2615 zero in any of these forms is not used; it provokes a compile-time error.)
2616 .P
2617 Consider the following pattern, which contains non-significant white space to
2618 make it more readable (assume the PCRE2_EXTENDED option) and to divide it into
2619 three parts for ease of discussion:
2620 .sp
2621   ( \e( )?    [^()]+    (?(1) \e) )
2622 .sp
2623 The first part matches an optional opening parenthesis, and if that
2624 character is present, sets it as the first captured substring. The second part
2625 matches one or more characters that are not parentheses. The third part is a
2626 conditional subpattern that tests whether or not the first set of parentheses
2627 matched. If they did, that is, if the subject started with an opening parenthesis,
2628 the condition is true, and so the yes-pattern is executed and a closing
2629 parenthesis is required. Otherwise, since no-pattern is not present, the
2630 subpattern matches nothing. In other words, this pattern matches a sequence of
2631 non-parentheses, optionally enclosed in parentheses.
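.P
A short sketch of this pattern follows. Python's \fBre\fP module, used here as
a stand-in, also supports a condition that tests a group number; the assumption
is that it behaves as PCRE2 does for this case. The literal parentheses are
written as one-character classes, and the whole subject is required to match.
.sp
.nf
```python
import re

# [(] and [)] are literal parentheses written as one-character classes.
pattern = re.compile("([(])?[^()]+(?(1)[)])")

for subject in ["(abc)", "abc", "(abc", "abc)"]:
    print(subject, pattern.fullmatch(subject) is not None)
```
.fi
.P
Only the first two subjects are accepted: the closing parenthesis is demanded
exactly when the opening one was captured.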
2632 .P
2633 If you were embedding this pattern in a larger one, you could use a relative
2634 reference:
2635 .sp
2636   ...other stuff... ( \e( )?    [^()]+    (?(-1) \e) ) ...
2637 .sp
2638 This makes the fragment independent of the parentheses in the larger pattern.
2639 .
2640 .
2641 .SS "Checking for a used subpattern by name"
2642 .rs
2643 .sp
2644 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a used
2645 subpattern by name. For compatibility with earlier versions of PCRE1, which had
2646 this facility before Perl, the syntax (?(name)...) is also recognized. Note,
2647 however, that undelimited names consisting of the letter R followed by digits
2648 are ambiguous (see the following section).
2649 .P
2650 Rewriting the above example to use a named subpattern gives this:
2651 .sp
2652   (?<OPEN> \e( )?    [^()]+    (?(<OPEN>) \e) )
2653 .sp
2654 If the name used in a condition of this kind is a duplicate, the test is
2655 applied to all subpatterns of the same name, and is true if any one of them has
2656 matched.
2657 .
2658 .
2659 .SS "Checking for pattern recursion"
2660 .rs
2661 .sp
2662 "Recursion" in this sense refers to any subroutine-like call from one part of
2663 the pattern to another, whether or not it is actually recursive. See the
2664 sections entitled
2665 .\" HTML <a href="#recursion">
2666 .\" </a>
2667 "Recursive patterns"
2668 .\"
2669 and
2670 .\" HTML <a href="#subpatternsassubroutines">
2671 .\" </a>
2672 "Subpatterns as subroutines"
2673 .\"
2674 below for details of recursion and subpattern calls.
2675 .P
2676 If a condition is the string (R), and there is no subpattern with the name R,
2677 the condition is true if matching is currently in a recursion or subroutine
2678 call to the whole pattern or any subpattern. If digits follow the letter R, and
2679 there is no subpattern with that name, the condition is true if the most recent
2680 call is into a subpattern with the given number, which must exist somewhere in
2681 the overall pattern. This is a contrived example that is equivalent to a+b:
2682 .sp
2683   ((?(R1)a+|(?1)b))
2684 .sp
2685 However, in both cases, if there is a subpattern with a matching name, the
2686 condition tests for its being set, as described in the section above, instead
2687 of testing for recursion. For example, creating a group with the name R1 by
2688 adding (?<R1>) to the above pattern completely changes its meaning.
2689 .P
2690 If a name preceded by ampersand follows the letter R, for example:
2691 .sp
2692   (?(R&name)...)
2693 .sp
2694 the condition is true if the most recent recursion is into a subpattern of that
2695 name (which must exist within the pattern).
2696 .P
2697 This condition does not check the entire recursion stack. It tests only the
2698 current level. If the name used in a condition of this kind is a duplicate, the
2699 test is applied to all subpatterns of the same name, and is true if any one of
2700 them is the most recent recursion.
2701 .P
2702 At "top level", all these recursion test conditions are false.
2703 .
2704 .
2705 .\" HTML <a name="subdefine"></a>
2706 .SS "Defining subpatterns for use by reference only"
2707 .rs
2708 .sp
2709 If the condition is the string (DEFINE), the condition is always false, even if
2710 there is a group with the name DEFINE. In this case, there may be only one
2711 alternative in the subpattern. It is always skipped if control reaches this
2712 point in the pattern; the idea of DEFINE is that it can be used to define
2713 subroutines that can be referenced from elsewhere. (The use of
2714 .\" HTML <a href="#subpatternsassubroutines">
2715 .\" </a>
2716 subroutines
2717 .\"
2718 is described below.) For example, a pattern to match an IPv4 address such as
2719 "192.168.23.245" could be written like this (ignore white space and line
2720 breaks):
2721 .sp
2722   (?(DEFINE) (?<byte> 2[0-4]\ed | 25[0-5] | 1\ed\ed | [1-9]?\ed) )
2723   \eb (?&byte) (\e.(?&byte)){3} \eb
2724 .sp
2725 The first part of the pattern is a DEFINE group inside which another group
2726 named "byte" is defined. This matches an individual component of an IPv4
2727 address (a number less than 256). When matching takes place, this part of the
2728 pattern is skipped because DEFINE acts like a false condition. The rest of the
2729 pattern uses references to the named group to match the four dot-separated
2730 components of an IPv4 address, insisting on a word boundary at each end.
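.P
Python's \fBre\fP module has no DEFINE groups or subroutine calls, so the
sketch below swaps in a different technique: the definition of a byte is kept
in one string and pasted in textually wherever it is needed. The names BYTE and
IPV4 are illustrative, and whole-subject matching replaces the word boundary
tests of the pattern above.
.sp
.nf
```python
import re

# One place to define what a byte looks like (a value from 0 to 255).
BYTE = "(?:2[0-4][0-9]|25[0-5]|1[0-9][0-9]|[1-9]?[0-9])"

# Textual reuse: the definition is pasted in wherever it is needed.
IPV4 = re.compile(BYTE + "(?:[.]" + BYTE + "){3}")

for subject in ["192.168.23.245", "256.1.1.1", "1.2.3"]:
    print(subject, IPV4.fullmatch(subject) is not None)
```
.fi
.P
The effect on what is matched is the same as referencing a named group; the
difference is that the definition is repeated in the compiled pattern rather
than being called.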
2731 .
2732 .
2733 .SS "Checking the PCRE2 version"
2734 .rs
2735 .sp
2736 Programs that link with a PCRE2 library can check the version by calling
2737 \fBpcre2_config()\fP with appropriate arguments. Users of applications that do
2738 not have access to the underlying code cannot do this. A special "condition"
2739 called VERSION exists to allow such users to discover which version of PCRE2
2740 they are dealing with by using this condition to match a string such as
2741 "yesno". VERSION must be followed either by "=" or ">=" and a version number.
2742 For example:
2743 .sp
2744   (?(VERSION>=10.4)yes|no)
2745 .sp
2746 This pattern matches "yes" if the PCRE2 version is greater than or equal to 10.4, or
2747 "no" otherwise. The fractional part of the version number may not contain more
2748 than two digits.
2749 .
2750 .
2751 .SS "Assertion conditions"
2752 .rs
2753 .sp
2754 If the condition is not in any of the above formats, it must be an assertion.
2755 This may be a positive or negative lookahead or lookbehind assertion. Consider
2756 this pattern, again containing non-significant white space, and with the two
2757 alternatives on the second line:
2758 .sp
2759   (?(?=[^a-z]*[a-z])
2760   \ed{2}-[a-z]{3}-\ed{2}  |  \ed{2}-\ed{2}-\ed{2} )
2761 .sp
2762 The condition is a positive lookahead assertion that matches an optional
2763 sequence of non-letters followed by a letter. In other words, it tests for the
2764 presence of at least one letter in the subject. If a letter is found, the
2765 subject is matched against the first alternative; otherwise it is matched
2766 against the second. This pattern matches strings in one of the two forms
2767 dd-aaa-dd or dd-dd-dd, where aaa are letters and dd are digits.
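.P
Python's \fBre\fP module does not accept an assertion as a condition, so the
sketch below uses a rewrite rather than the construct itself: the lookahead
guards the first alternative and its negation guards the second. The rewrite is
assumed to be equivalent here because the condition contains no capturing
groups; [0-9] replaces the digit escape.
.sp
.nf
```python
import re

# (?(?=A)yes|no) rewritten as (?:(?=A)yes|(?!A)no), with A = [^a-z]*[a-z].
pattern = re.compile(
    "(?:(?=[^a-z]*[a-z])[0-9]{2}-[a-z]{3}-[0-9]{2}"
    "|(?![^a-z]*[a-z])[0-9]{2}-[0-9]{2}-[0-9]{2})"
)

for subject in ["12-abc-34", "12-34-56", "12-34-5a"]:
    print(subject, pattern.fullmatch(subject) is not None)
```
.fi
.P
The last subject is rejected because it contains a letter, which selects the
first alternative, and that alternative does not fit.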
2768 .P
2769 When an assertion that is a condition contains capturing subpatterns, any
2770 capturing that occurs in a matching branch is retained afterwards, for both
2771 positive and negative assertions, because matching always continues after the
2772 assertion, whether it succeeds or fails. (Compare non-conditional assertions,
2773 when captures are retained only for positive assertions that succeed.)
2774 .
2775 .
2776 .\" HTML <a name="comments"></a>
2777 .SH COMMENTS
2778 .rs
2779 .sp
2780 There are two ways of including comments in patterns that are processed by
2781 PCRE2. In both cases, the start of the comment must not be in a character
2782 class, nor in the middle of any other sequence of related characters such as
2783 (?: or a subpattern name or number. The characters that make up a comment play
2784 no part in the pattern matching.
2785 .P
2786 The sequence (?# marks the start of a comment that continues up to the next
2787 closing parenthesis. Nested parentheses are not permitted. If the
2788 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped # character
2789 also introduces a comment, which in this case continues to immediately after
2790 the next newline character or character sequence in the pattern. Which
2791 characters are interpreted as newlines is controlled by an option passed to the
2792 compiling function or by a special sequence at the start of the pattern, as
2793 described in the section entitled
2794 .\" HTML <a href="#newlines">
2795 .\" </a>
2796 "Newline conventions"
2797 .\"
2798 above. Note that the end of this type of comment is a literal newline sequence
2799 in the pattern; escape sequences that happen to represent a newline do not
2800 count. For example, consider this pattern when PCRE2_EXTENDED is set, and the
2801 default newline convention (a single linefeed character) is in force:
2802 .sp
2803   abc #comment \en still comment
2804 .sp
2805 On encountering the # character, \fBpcre2_compile()\fP skips along, looking for
2806 a newline in the pattern. The sequence \en is still literal at this stage, so
2807 it does not terminate the comment. Only an actual character with the code value
2808 0x0a (the default newline) does so.
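.P
The same rule can be observed, as an analogy only, in the verbose mode of
Python's standard re module, which is a separate engine from PCRE2 but treats
#-comments in this way. In the sketch below the names ESCAPE, NEWLINE, escaped
and literal are hypothetical; the backslash is built with chr(92) so that the
two-character escape sequence is unambiguous.
.sp
.nf
```python
import re

ESCAPE = chr(92) + "n"   # two characters: a backslash followed by n
NEWLINE = chr(10)        # one character: an actual linefeed (0x0a)

# The escape sequence does not end the comment, so the pattern is just "abc".
escaped = re.compile("abc #comment " + ESCAPE + " still comment", re.VERBOSE)

# A literal linefeed does end the comment. Verbose mode also ignores the
# remaining white space, so the pattern becomes "abcstillcomment".
literal = re.compile("abc #comment " + NEWLINE + " still comment", re.VERBOSE)

print(escaped.fullmatch("abc") is not None)
print(literal.fullmatch("abc") is not None)
print(literal.fullmatch("abcstillcomment") is not None)
```
.fi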
2809 .
2810 .
2811 .\" HTML <a name="recursion"></a>
2812 .SH "RECURSIVE PATTERNS"
2813 .rs
2814 .sp
2815 Consider the problem of matching a string in parentheses, allowing for
2816 unlimited nested parentheses. Without the use of recursion, the best that can
2817 be done is to use a pattern that matches up to some fixed depth of nesting. It
2818 is not possible to handle an arbitrary nesting depth.
2819 .P
2820 For some time, Perl has provided a facility that allows regular expressions to
2821 recurse (amongst other things). It does this by interpolating Perl code in the
2822 expression at run time, and the code can refer to the expression itself. A Perl
2823 pattern using code interpolation to solve the parentheses problem can be
2824 created like this:
2825 .sp
2826   $re = qr{\e( (?: (?>[^()]+) | (?p{$re}) )* \e)}x;
2827 .sp
2828 The (?p{...}) item interpolates Perl code at run time, and in this case refers
2829 recursively to the pattern in which it appears.
2830 .P
2831 Obviously, PCRE2 cannot support the interpolation of Perl code. Instead, it
2832 supports special syntax for recursion of the entire pattern, and also for
2833 individual subpattern recursion. After its introduction in PCRE1 and Python,
2834 this kind of recursion was subsequently introduced into Perl at release 5.10.
2835 .P
2836 A special item that consists of (? followed by a number greater than zero and a
2837 closing parenthesis is a recursive subroutine call of the subpattern of the
2838 given number, provided that it occurs inside that subpattern. (If not, it is a
2839 .\" HTML <a href="#subpatternsassubroutines">
2840 .\" </a>
2841 non-recursive subroutine
2842 .\"
2843 call, which is described in the next section.) The special item (?R) or (?0) is
2844 a recursive call of the entire regular expression.
2845 .P
2846 This PCRE2 pattern solves the nested parentheses problem (assume the
2847 PCRE2_EXTENDED option is set so that white space is ignored):
2848 .sp
2849   \e( ( [^()]++ | (?R) )* \e)
2850 .sp
2851 First it matches an opening parenthesis. Then it matches any number of
2852 substrings which can either be a sequence of non-parentheses, or a recursive
2853 match of the pattern itself (that is, a correctly parenthesized substring).
2854 Finally there is a closing parenthesis. Note the use of a possessive quantifier
2855 to avoid backtracking into sequences of non-parentheses.
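.P
The structure of this pattern can be mirrored by a small hand-written recursive
function. The following Python sketch is not how PCRE2 implements recursion; it
is a hypothetical helper (match_parens) that follows the same three steps, and
unlike an unanchored pattern it tests only the starting offset it is given.
.sp
.nf
```python
def match_parens(subject, pos=0):
    """Return the offset just past a balanced parenthesized substring
    that starts at pos, or None if there is none."""
    if pos >= len(subject) or subject[pos] != "(":
        return None                          # the opening parenthesis
    pos += 1
    while pos < len(subject) and subject[pos] != ")":
        if subject[pos] == "(":
            pos = match_parens(subject, pos)  # the recursive branch
            if pos is None:
                return None
        else:
            # The non-parentheses branch: take the whole run and never
            # give any of it back, like a possessive quantifier.
            while pos < len(subject) and subject[pos] not in "()":
                pos += 1
    if pos < len(subject):                   # the closing parenthesis
        return pos + 1
    return None


print(match_parens("(ab(cd)ef)"))
print(match_parens("(aaa()"))
```
.fi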
2856 .P
2857 If this were part of a larger pattern, you would not want to recurse the entire
2858 pattern, so instead you could use this:
2859 .sp
2860   ( \e( ( [^()]++ | (?1) )* \e) )
2861 .sp
2862 We have put the pattern into parentheses, and caused the recursion to refer to
2863 them instead of the whole pattern.
2864 .P
2865 In a larger pattern, keeping track of parenthesis numbers can be tricky. This
2866 is made easier by the use of relative references. Instead of (?1) in the
2867 pattern above you can write (?-2) to refer to the second most recently opened
2868 parentheses preceding the recursion. In other words, a negative number counts
2869 capturing parentheses leftwards from the point at which it is encountered.
2870 .P
2871 Be aware, however, that if
2872 .\" HTML <a href="#dupsubpatternnumber">
2873 .\" </a>
2874 duplicate subpattern numbers
2875 .\"
2876 are in use, relative references refer to the earliest subpattern with the
2877 appropriate number. Consider, for example:
2878 .sp
2879   (?|(a)|(b)) (c) (?-2)
2880 .sp
2881 The first two capturing groups (a) and (b) are both numbered 1, and group (c)
2882 is number 2. When the reference (?-2) is encountered, the second most recently
2883 opened parentheses has the number 1, but it is the first such group (the (a)
2884 group) to which the recursion refers. This would be the same if an absolute
2885 reference (?1) was used. In other words, relative references are just a
2886 shorthand for computing a group number.
2887 .P
2888 It is also possible to refer to subsequently opened parentheses, by writing
2889 references such as (?+2). However, these cannot be recursive because the
2890 reference is not inside the parentheses that are referenced. They are always
2891 .\" HTML <a href="#subpatternsassubroutines">
2892 .\" </a>
2893 non-recursive subroutine
2894 .\"
2895 calls, as described in the next section.
2896 .P
2897 An alternative approach is to use named parentheses. The Perl syntax for this
2898 is (?&name); PCRE1's earlier syntax (?P>name) is also supported. We could
2899 rewrite the above example as follows:
2900 .sp
2901   (?<pn> \e( ( [^()]++ | (?&pn) )* \e) )
2902 .sp
2903 If there is more than one subpattern with the same name, the earliest one is
2904 used.
2905 .P
2906 The example pattern that we have been looking at contains nested unlimited
2907 repeats, and so the use of a possessive quantifier for matching strings of
2908 non-parentheses is important when applying the pattern to strings that do not
2909 match. For example, when this pattern is applied to
2910 .sp
2911   (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
2912 .sp
2913 it yields "no match" quickly. However, if a possessive quantifier is not used,
2914 the match runs for a very long time indeed because there are so many different
2915 ways the + and * repeats can carve up the subject, and all have to be tested
2916 before failure can be reported.
2917 .P
2918 At the end of a match, the values of capturing parentheses are those from
2919 the outermost level. If you want to obtain intermediate values, a callout
2920 function can be used (see below and the
2921 .\" HREF
2922 \fBpcre2callout\fP
2923 .\"
2924 documentation). If the pattern above is matched against
2925 .sp
2926   (ab(cd)ef)
2927 .sp
2928 the value for the inner capturing parentheses (numbered 2) is "ef", which is
2929 the last value taken on at the top level. If a capturing subpattern is not
2930 matched at the top level, its final captured value is unset, even if it was
2931 (temporarily) set at a deeper level during the matching process.
2932 .P
2933 Do not confuse the (?R) item with the condition (R), which tests for recursion.
2934 Consider this pattern, which matches text in angle brackets, allowing for
2935 arbitrary nesting. Only digits are allowed in nested brackets (that is, when
2936 recursing), whereas any characters are permitted at the outer level.
2937 .sp
2938   < (?: (?(R) \ed++  | [^<>]*+) | (?R)) * >
2939 .sp
2940 In this pattern, (?(R) is the start of a conditional subpattern, with two
2941 different alternatives for the recursive and non-recursive cases. The (?R) item
2942 is the actual recursive call.
2943 .
2944 .
2945 .\" HTML <a name="recursiondifference"></a>
2946 .SS "Differences in recursion processing between PCRE2 and Perl"
2947 .rs
2948 .sp
2949 Some former differences between PCRE2 and Perl no longer exist.
2950 .P
2951 Before release 10.30, recursion processing in PCRE2 differed from Perl in that
2952 a recursive subpattern call was always treated as an atomic group. That is,
2953 once it had matched some of the subject string, it was never re-entered, even
2954 if it contained untried alternatives and there was a subsequent matching
2955 failure. (Historical note: PCRE implemented recursion before Perl did.)
2956 .P
2957 Starting with release 10.30, recursive subroutine calls are no longer treated
2958 as atomic. That is, they can be re-entered to try unused alternatives if there
2959 is a matching failure later in the pattern. This is now compatible with the way
2960 Perl works. If you want a subroutine call to be atomic, you must explicitly
2961 enclose it in an atomic group.
2962 .P
2963 Supporting backtracking into recursions simplifies certain types of recursive
2964 pattern. For example, this pattern matches palindromic strings:
2965 .sp
2966   ^((.)(?1)\e2|.?)$
2967 .sp
2968 The second branch in the group matches a single central character in the
2969 palindrome when there are an odd number of characters, or nothing when there
2970 are an even number of characters, but in order to work it has to be able to try
2971 the second case when the rest of the pattern match fails. If you want to match
2972 typical palindromic phrases, the pattern has to ignore all non-word characters,
2973 which can be done like this:
2974 .sp
2975   ^\eW*+((.)\eW*+(?1)\eW*+\e2|\eW*+.?)\eW*+$
2976 .sp
2977 If run with the PCRE2_CASELESS option, this pattern matches phrases such as "A
2978 man, a plan, a canal: Panama!". Note the use of the possessive quantifier *+ to
2979 avoid backtracking into sequences of non-word characters. Without this, PCRE2
2980 takes a great deal longer (ten times or more) to match typical phrases, and
2981 Perl takes so long that you think it has gone into a loop.
2982 .P
2983 Another way in which PCRE2 and Perl used to differ in their recursion
2984 processing is in the handling of captured values. Formerly in Perl, when a
2985 subpattern was called recursively or as a subroutine (see the next section), it
2986 had no access to any values that were captured outside the recursion, whereas
2987 in PCRE2 these values can be referenced. Consider this pattern:
2988 .sp
2989   ^(.)(\e1|a(?2))
2990 .sp
2991 This pattern matches "bab". The first capturing parentheses match "b", then in
2992 the second group, when the backreference \e1 fails to match "b", the second
2993 alternative matches "a" and then recurses. In the recursion, \e1 does now match
2994 "b" and so the whole match succeeds. This match used to fail in Perl, but it
2995 works in later versions (tested with 5.024).
2996 .
2997 .
2998 .\" HTML <a name="subpatternsassubroutines"></a>
2999 .SH "SUBPATTERNS AS SUBROUTINES"
3000 .rs
3001 .sp
3002 If the syntax for a recursive subpattern call (either by number or by
3003 name) is used outside the parentheses to which it refers, it operates a bit
3004 like a subroutine in a programming language. More accurately, PCRE2 treats the
3005 referenced subpattern as an independent subpattern which it tries to match at
3006 the current matching position. The called subpattern may be defined before or
3007 after the reference. A numbered reference can be absolute or relative, as in
3008 these examples:
3009 .sp
3010   (...(absolute)...)...(?2)...
3011   (...(relative)...)...(?-1)...
3012   (...(?+1)...(relative)...
3013 .sp
3014 An earlier example pointed out that the pattern
3015 .sp
3016   (sens|respons)e and \e1ibility
3017 .sp
3018 matches "sense and sensibility" and "response and responsibility", but not
3019 "sense and responsibility". If instead the pattern
3020 .sp
3021   (sens|respons)e and (?1)ibility
3022 .sp
3023 is used, it does match "sense and responsibility" as well as the other two
3024 strings. Another example is given in the discussion of DEFINE above.
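.P
The difference between the two patterns can be sketched with Python's standard
re module, under the assumption that a subroutine call behaves as if the
group's pattern were written out again at the point of the call. Python's re
has no (?1) item, so the second pattern below repeats the alternatives
textually; the names STEM, backref and subroutine are hypothetical.
.sp
.nf
```python
import re

STEM = "sens|respons"

# Backreference: the second occurrence must repeat the text that matched.
backref = re.compile("(?P<stem>" + STEM + ")e and (?P=stem)ibility")

# Stand-in for the subroutine call: the group's pattern is matched afresh,
# so it may take a different alternative from the first occurrence.
subroutine = re.compile("(" + STEM + ")e and (?:" + STEM + ")ibility")

for phrase in ("sense and sensibility",
               "response and responsibility",
               "sense and responsibility"):
    print(phrase,
          backref.fullmatch(phrase) is not None,
          subroutine.fullmatch(phrase) is not None)
```
.fi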
3025 .P
3026 Like recursions, subroutine calls used to be treated as atomic, but this
3027 changed at PCRE2 release 10.30, so backtracking into subroutine calls can now
3028 occur. However, any capturing parentheses that are set during the subroutine
3029 call revert to their previous values afterwards.
3030 .P
3031 Processing options such as case-independence are fixed when a subpattern is
3032 defined, so if it is used as a subroutine, such options cannot be changed for
3033 different calls. For example, consider this pattern:
3034 .sp
3035   (abc)(?i:(?-1))
3036 .sp
3037 It matches "abcabc". It does not match "abcABC" because the change of
3038 processing option does not affect the called subpattern.
3039 .P
3040 The behaviour of
3041 .\" HTML <a href="#backtrackcontrol">
3042 .\" </a>
3043 backtracking control verbs
3044 .\"
3045 in subpatterns when called as subroutines is described in the section entitled
3046 .\" HTML <a href="#btsub">
3047 .\" </a>
3048 "Backtracking verbs in subroutines"
3049 .\"
3050 below.
3051 .
3052 .
3053 .\" HTML <a name="onigurumasubroutines"></a>
3054 .SH "ONIGURUMA SUBROUTINE SYNTAX"
3055 .rs
3056 .sp
3057 For compatibility with Oniguruma, the non-Perl syntax \eg followed by a name or
3058 a number enclosed either in angle brackets or single quotes, is an alternative
3059 syntax for referencing a subpattern as a subroutine, possibly recursively. Here
3060 are two of the examples used above, rewritten using this syntax:
3061 .sp
3062   (?<pn> \e( ( (?>[^()]+) | \eg<pn> )* \e) )
3063   (sens|respons)e and \eg'1'ibility
3064 .sp
3065 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
3066 plus or a minus sign it is taken as a relative reference. For example:
3067 .sp
3068   (abc)(?i:\eg<-1>)
3069 .sp
3070 Note that \eg{...} (Perl syntax) and \eg<...> (Oniguruma syntax) are \fInot\fP
3071 synonymous. The former is a backreference; the latter is a subroutine call.
3072 .
3073 .
3074 .SH CALLOUTS
3075 .rs
3076 .sp
3077 Perl has a feature whereby using the sequence (?{...}) causes arbitrary Perl
3078 code to be obeyed in the middle of matching a regular expression. This makes it
3079 possible, amongst other things, to extract different substrings that match the
3080 same pair of parentheses when there is a repetition.
3081 .P
3082 PCRE2 provides a similar feature, but of course it cannot obey arbitrary Perl
3083 code. The feature is called "callout". The caller of PCRE2 provides an external
3084 function by putting its entry point in a match context using the function
3085 \fBpcre2_set_callout()\fP, and then passing that context to \fBpcre2_match()\fP
3086 or \fBpcre2_dfa_match()\fP. If no match context is passed, or if the callout
3087 entry point is set to NULL, callouts are disabled.
3088 .P
3089 Within a regular expression, (?C<arg>) indicates a point at which the external
3090 function is to be called. There are two kinds of callout: those with a
3091 numerical argument and those with a string argument. (?C) on its own with no
3092 argument is treated as (?C0). A numerical argument allows the application to
3093 distinguish between different callouts. String arguments were added for release
3094 10.20 to make it possible for script languages that use PCRE2 to embed short
3095 scripts within patterns in a similar way to Perl.
3096 .P
3097 During matching, when PCRE2 reaches a callout point, the external function is
3098 called. It is provided with the number or string argument of the callout, the
3099 position in the pattern, and one item of data that is also set in the match
3100 block. The callout function may cause matching to proceed, to backtrack, or to
3101 fail.
3102 .P
3103 By default, PCRE2 implements a number of optimizations at matching time, and
3104 one side-effect is that sometimes callouts are skipped. If you need all
3105 possible callouts to happen, you need to set options that disable the relevant
3106 optimizations. More details, including a complete description of the
3107 programming interface to the callout function, are given in the
3108 .\" HREF
3109 \fBpcre2callout\fP
3110 .\"
3111 documentation.
3112 .
3113 .
3114 .SS "Callouts with numerical arguments"
3115 .rs
3116 .sp
3117 If you just want to have a means of identifying different callout points, put a
3118 number less than 256 after the letter C. For example, this pattern has two
3119 callout points:
3120 .sp
3121   (?C1)abc(?C2)def
3122 .sp
3123 If the PCRE2_AUTO_CALLOUT flag is passed to \fBpcre2_compile()\fP, numerical
3124 callouts are automatically installed before each item in the pattern. They are
3125 all numbered 255. If there is a conditional group in the pattern whose
3126 condition is an assertion, an additional callout is inserted just before the
3127 condition. An explicit callout may also be set at this position, as in this
3128 example:
3129 .sp
3130   (?(?C9)(?=a)abc|def)
3131 .sp
3132 Note that this applies only to assertion conditions, not to other types of
3133 condition.
3134 .
3135 .
3136 .SS "Callouts with string arguments"
3137 .rs
3138 .sp
3139 A delimited string may be used instead of a number as a callout argument. The
3140 starting delimiter must be one of ` ' " ^ % # $ { and the ending delimiter is
3141 the same as the start, except for {, where the ending delimiter is }. If the
3142 ending delimiter is needed within the string, it must be doubled. For
3143 example:
3144 .sp
3145   (?C'ab ''c'' d')xyz(?C{any text})pqr
3146 .sp
3147 The doubling is removed before the string is passed to the callout function.
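.P
The delimiter rules can be modelled with a short Python function. This is a
sketch of the rules as stated above, not PCRE2's own parser; the names
DELIMITERS, CLOSERS and callout_string are hypothetical. The function is given
the text that follows (?C and returns the string that would be passed to the
callout function, together with the number of characters consumed.
.sp
.nf
```python
DELIMITERS = "`'^%#${" + '"'   # the permitted starting delimiters
CLOSERS = {"{": "}"}           # every other delimiter closes with itself


def callout_string(text):
    """Parse a delimited callout argument at the start of text."""
    opener = text[0]
    if opener not in DELIMITERS:
        raise ValueError("not a callout string delimiter")
    closer = CLOSERS.get(opener, opener)
    out = []
    i = 1
    while i < len(text):
        if text[i] == closer:
            if text[i + 1:i + 2] == closer:   # doubled: one literal closer
                out.append(closer)
                i += 2
                continue
            return "".join(out), i + 1        # a single closer ends it
        out.append(text[i])
        i += 1
    raise ValueError("missing closing delimiter")


print(callout_string("'ab ''c'' d')xyz"))
print(callout_string("{any text})pqr"))
```
.fi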
3148 .
3149 .
3150 .\" HTML <a name="backtrackcontrol"></a>
3151 .SH "BACKTRACKING CONTROL"
3152 .rs
3153 .sp
3154 There are a number of special "Backtracking Control Verbs" (to use Perl's
3155 terminology) that modify the behaviour of backtracking during matching. They
3156 are generally of the form (*VERB) or (*VERB:NAME). Some verbs take either form,
3157 possibly behaving differently depending on whether or not a name is present.
3158 .P
3159 By default, for compatibility with Perl, a name is any sequence of characters
3160 that does not include a closing parenthesis. The name is not processed in
3161 any way, and it is not possible to include a closing parenthesis in the name.
3162 This can be changed by setting the PCRE2_ALT_VERBNAMES option, but the result
3163 is no longer Perl-compatible.
3164 .P
3165 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to verb names
3166 and only an unescaped closing parenthesis terminates the name. However, the
3167 only backslash items that are permitted are \eQ, \eE, and sequences such as
3168 \ex{100} that define character code points. Character type escapes such as \ed
3169 are faulted.
3170 .P
3171 A closing parenthesis can be included in a name either as \e) or between \eQ
3172 and \eE. In addition to backslash processing, if the PCRE2_EXTENDED or
3173 PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb names is
3174 skipped, and #-comments are recognized, exactly as in the rest of the pattern.
3175 PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect verb names unless
3176 PCRE2_ALT_VERBNAMES is also set.
3177 .P
3178 The maximum length of a name is 255 in the 8-bit library and 65535 in the
3179 16-bit and 32-bit libraries. If the name is empty, that is, if the closing
3180 parenthesis immediately follows the colon, the effect is as if the colon were
3181 not there. Any number of these verbs may occur in a pattern.
3182 .P
3183 Since these verbs are specifically related to backtracking, most of them can be
3184 used only when the pattern is to be matched using the traditional matching
3185 function, because that uses a backtracking algorithm. With the exception of
3186 (*FAIL), which behaves like a failing negative assertion, the backtracking
3187 control verbs cause an error if encountered by the DFA matching function.
3188 .P
3189 The behaviour of these verbs in
3190 .\" HTML <a href="#btrepeat">
3191 .\" </a>
3192 repeated groups,
3193 .\"
3194 .\" HTML <a href="#btassert">
3195 .\" </a>
3196 assertions,
3197 .\"
3198 and in
3199 .\" HTML <a href="#btsub">
3200 .\" </a>
3201 subpatterns called as subroutines
3202 .\"
3203 (whether or not recursively) is documented below.
3204 .
3205 .
3206 .\" HTML <a name="nooptimize"></a>
3207 .SS "Optimizations that affect backtracking verbs"
3208 .rs
3209 .sp
3210 PCRE2 contains some optimizations that are used to speed up matching by running
3211 some checks at the start of each match attempt. For example, it may know the
3212 minimum length of matching subject, or that a particular character must be
3213 present. When one of these optimizations bypasses the running of a match, any
3214 included backtracking verbs will not, of course, be processed. You can suppress
3215 the start-of-match optimizations by setting the PCRE2_NO_START_OPTIMIZE option
3216 when calling \fBpcre2_compile()\fP, or by starting the pattern with
3217 (*NO_START_OPT). There is more discussion of this option in the section
3218 entitled
3219 .\" HTML <a href="pcre2api.html#compiling">
3220 .\" </a>
3221 "Compiling a pattern"
3222 .\"
3223 in the
3224 .\" HREF
3225 \fBpcre2api\fP
3226 .\"
3227 documentation.
3228 .P
3229 Experiments with Perl suggest that it too has similar optimizations, and like
3230 PCRE2, turning them off can change the result of a match.
3231 .
3232 .
3233 .SS "Verbs that act immediately"
3234 .rs
3235 .sp
3236 The following verbs act as soon as they are encountered.
3237 .sp
3238    (*ACCEPT) or (*ACCEPT:NAME)
3239 .sp
3240 This verb causes the match to end successfully, skipping the remainder of the
3241 pattern. However, when it is inside a subpattern that is called as a
3242 subroutine, only that subpattern is ended successfully. Matching then continues
3243 at the outer level. If (*ACCEPT) is triggered in a positive assertion, the
3244 assertion succeeds; in a negative assertion, the assertion fails.
3245 .P
3246 If (*ACCEPT) is inside capturing parentheses, the data so far is captured. For
3247 example:
3248 .sp
3249   A((?:A|B(*ACCEPT)|C)D)
3250 .sp
3251 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is captured by
3252 the outer parentheses.
3253 .sp
3254   (*FAIL) or (*FAIL:NAME)
3255 .sp
3256 This verb causes a matching failure, forcing backtracking to occur. It may be
3257 abbreviated to (*F). It is equivalent to (?!) but easier to read. The Perl
3258 documentation notes that it is probably useful only when combined with (?{}) or
3259 (??{}). Those are, of course, Perl features that are not present in PCRE2. The
3260 nearest equivalent is the callout feature, as for example in this pattern:
3261 .sp
3262   a+(?C)(*FAIL)
3263 .sp
3264 A match with the string "aaaa" always fails, but the callout is taken before
3265 each backtrack happens (in this example, 10 times).
3266 .P
3267 (*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
3268 (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
3269 .
3270 .
3271 .SS "Recording which path was taken"
3272 .rs
3273 .sp
3274 There is one verb whose main purpose is to track how a match was arrived at,
3275 though it also has a secondary use in conjunction with advancing the match
3276 starting point (see (*SKIP) below).
3277 .sp
3278   (*MARK:NAME) or (*:NAME)
3279 .sp
3280 A name is always required with this verb. There may be as many instances of
3281 (*MARK) as you like in a pattern, and their names do not have to be unique.
3282 .P
3283 When a match succeeds, the name of the last-encountered (*MARK:NAME) on the
3284 matching path is passed back to the caller as described in the section entitled
3285 .\" HTML <a href="pcre2api.html#matchotherdata">
3286 .\" </a>
3287 "Other information about the match"
3288 .\"
3289 in the
3290 .\" HREF
3291 \fBpcre2api\fP
3292 .\"
3293 documentation. This applies to all instances of (*MARK), including those inside
3294 assertions and atomic groups. (There are differences in those cases when
3295 (*MARK) is used in conjunction with (*SKIP) as described below.)
3296 .P
3297 As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
3298 associated NAME arguments. Whichever is last on the matching path is passed
3299 back. See below for more details of these other verbs.
3300 .P
3301 Here is an example of \fBpcre2test\fP output, where the "mark" modifier
3302 requests the retrieval and outputting of (*MARK) data:
3303 .sp
3304     re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3305   data> XY
3306    0: XY
3307   MK: A
3308   XZ
3309    0: XZ
3310   MK: B
3311 .sp
3312 The (*MARK) name is tagged with "MK:" in this output, and in this example it
3313 indicates which of the two alternatives matched. This is a more efficient way
3314 of obtaining this information than putting each alternative in its own
3315 capturing parentheses.
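.P
For comparison, the capturing-parentheses approach mentioned above might look
like the following Python sketch. Python's standard re module has no (*MARK)
verb, so this shows only the alternative that (*MARK) replaces; the names
PATTERN and which_alternative are hypothetical.
.sp
.nf
```python
import re

# Each alternative has its own named capturing group; the group that took
# part in the match identifies the alternative.
PATTERN = re.compile("(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject):
    """Return the name of the alternative that matched, or None."""
    m = PATTERN.fullmatch(subject)
    return m.lastgroup if m else None


print(which_alternative("XY"))
print(which_alternative("XZ"))
print(which_alternative("XP"))
```
.fi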
3316 .P
3317 If a verb with a name is encountered in a positive assertion that is true, the
3318 name is recorded and passed back if it is the last-encountered. This does not
3319 happen for negative assertions or failing positive assertions.
3320 .P
3321 After a partial match or a failed match, the last encountered name in the
3322 entire match process is returned. For example:
3323 .sp
3324     re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
3325   data> XP
3326   No match, mark = B
3327 .sp
3328 Note that in this unanchored example the mark is retained from the match
3329 attempt that started at the letter "X" in the subject. Subsequent match
3330 attempts starting at "P" and then with an empty string do not get as far as the
3331 (*MARK) item, but nevertheless do not reset it.
3332 .P
3333 If you are interested in (*MARK) values after failed matches, you should
3334 probably set the PCRE2_NO_START_OPTIMIZE option
3335 .\" HTML <a href="#nooptimize">
3336 .\" </a>
3337 (see above)
3338 .\"
3339 to ensure that the match is always attempted.
3340 .
3341 .
3342 .SS "Verbs that act after backtracking"
3343 .rs
3344 .sp
3345 The following verbs do nothing when they are encountered. Matching continues
3346 with what follows, but if there is a subsequent match failure, causing a
3347 backtrack to the verb, a failure is forced. That is, backtracking cannot pass
3348 to the left of the verb. However, when one of these verbs appears inside an
3349 atomic group or in a lookaround assertion that is true, its effect is confined
3350 to that group, because once the group has been matched, there is never any
3351 backtracking into it. Backtracking from beyond an assertion or an atomic group
3352 ignores the entire group, and seeks a preceding backtracking point.
3353 .P
3354 These verbs differ in exactly what kind of failure occurs when backtracking
3355 reaches them. The behaviour described below is what happens when the verb is
3356 not in a subroutine or an assertion. Subsequent sections cover these special
3357 cases.
3358 .sp
3359   (*COMMIT) or (*COMMIT:NAME)
3360 .sp
3361 This verb causes the whole match to fail outright if there is a later matching
3362 failure that causes backtracking to reach it. Even if the pattern is
3363 unanchored, no further attempts to find a match by advancing the starting point
3364 take place. If (*COMMIT) is the only backtracking verb that is encountered,
3365 once it has been passed \fBpcre2_match()\fP is committed to finding a match at
3366 the current starting point, or not at all. For example:
3367 .sp
3368   a+(*COMMIT)b
3369 .sp
3370 This matches "xxaab" but not "aacaab". It can be thought of as a kind of
3371 dynamic anchor, or "I've started, so I must finish."
3372 .P
3373 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COMMIT). It is
3374 like (*MARK:NAME) in that the name is remembered for passing back to the
3375 caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
3376 ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
3377 .P
3378 If there is more than one backtracking verb in a pattern, a different one that
3379 follows (*COMMIT) may be triggered first, so merely passing (*COMMIT) during a
3380 match does not always guarantee that a match must be at this starting point.
3381 .P
3382 Note that (*COMMIT) at the start of a pattern is not the same as an anchor,
3383 unless PCRE2's start-of-match optimizations are turned off, as shown in this
3384 output from \fBpcre2test\fP:
3385 .sp
3386     re> /(*COMMIT)abc/
3387   data> xyzabc
3388    0: abc
3389   data>
3390     re> /(*COMMIT)abc/no_start_optimize
3391   data> xyzabc
3392   No match
3393 .sp
3394 For the first pattern, PCRE2 knows that any match must start with "a", so the
3395 optimization skips along the subject to "a" before applying the pattern to the
3396 first set of data. The match attempt then succeeds. The second pattern disables
3397 the optimization that skips along to the first character. The pattern is now
3398 applied starting at "x", and so the (*COMMIT) causes the match to fail without
3399 trying any other starting points.
3400 .sp
3401   (*PRUNE) or (*PRUNE:NAME)
3402 .sp
3403 This verb causes the match to fail at the current starting position in the
3404 subject if there is a later matching failure that causes backtracking to reach
3405 it. If the pattern is unanchored, the normal "bumpalong" advance to the next
3406 starting character then happens. Backtracking can occur as usual to the left of
3407 (*PRUNE), before it is reached, or when matching to the right of (*PRUNE), but
3408 if there is no match to the right, backtracking cannot cross (*PRUNE). In
3409 simple cases, the use of (*PRUNE) is just an alternative to an atomic group or
3410 possessive quantifier, but there are some uses of (*PRUNE) that cannot be
3411 expressed in any other way. In an anchored pattern (*PRUNE) has the same effect
3412 as (*COMMIT).
3413 .P
3414 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE). It is
3415 like (*MARK:NAME) in that the name is remembered for passing back to the
3416 caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
3417 ignoring those set by (*COMMIT), (*PRUNE) or (*THEN).
3418 .sp
3419   (*SKIP)
3420 .sp
3421 This verb, when given without a name, is like (*PRUNE), except that if the
3422 pattern is unanchored, the "bumpalong" advance is not to the next character,
3423 but to the position in the subject where (*SKIP) was encountered. (*SKIP)
3424 signifies that whatever text was matched leading up to it cannot be part of a
3425 successful match if there is a later mismatch. Consider:
3426 .sp
3427   a+(*SKIP)b
3428 .sp
3429 If the subject is "aaaac...", after the first match attempt fails (starting at
3430 the first character in the string), the starting point skips on to start the
3431 next attempt at "c". Note that a possessive quantifier does not have the same
3432 effect as this example; although it would suppress backtracking during the
3433 first match attempt, the second attempt would start at the second character
3434 instead of skipping on to "c".
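.P
The difference in starting points can be made concrete with a simplified model
of the "bumpalong" loop, written here in Python. It is a sketch under stated
assumptions, not PCRE2's matcher: it handles only this one pattern, ignores
start-of-match optimizations, and the name start_points is hypothetical.
.sp
.nf
```python
def start_points(subject, skip):
    """List the start offsets tried when matching a run of "a" followed
    by "b". With skip=True the model follows the (*SKIP) example; with
    skip=False it follows the possessive-quantifier variant.

    Returns (offsets_tried, (start, end) of the match or None)."""
    starts = []
    start = 0
    while start <= len(subject):
        starts.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1                           # greedy run of "a"
        if end > start and subject[end:end + 1] == "b":
            return starts, (start, end + 1)    # matched
        if skip and end > start:
            start = end          # resume where (*SKIP) was encountered
        else:
            start += 1           # normal advance to the next character
    return starts, None


print(start_points("aaaac", skip=True))
print(start_points("aaaac", skip=False))
```
.fi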
3435 .sp
3436   (*SKIP:NAME)
3437 .sp
3438 When (*SKIP) has an associated name, its behaviour is modified. When such a
3439 (*SKIP) is triggered, the previous path through the pattern is searched for the
3440 most recent (*MARK) that has the same name. If one is found, the "bumpalong"
3441 advance is to the subject position that corresponds to that (*MARK) instead of
3442 to where (*SKIP) was encountered. If no (*MARK) with a matching name is found,
3443 the (*SKIP) is ignored.
3444 .P
3445 The search for a (*MARK) name uses the normal backtracking mechanism, which
3446 means that it does not see (*MARK) settings that are inside atomic groups or
3447 assertions, because they are never re-entered by backtracking. Compare the
3448 following \fBpcre2test\fP examples:
3449 .sp
3450     re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
3451   data: abc
3452    0: a
3453    1: a
3454   data:
3455     re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
3456   data: abc
3457    0: b
3458    1: b
3459 .sp
3460 In the first example, the (*MARK) setting is in an atomic group, so it is not
3461 seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored. This allows
3462 the second branch of the pattern to be tried at the first character position.
3463 In the second example, the (*MARK) setting is not in an atomic group. This
3464 allows (*SKIP:X) to find the (*MARK) when it backtracks, and this causes a new
3465 matching attempt to start at the second character. This time, the (*MARK) is
3466 never seen because "a" does not match "b", so the matcher immediately jumps to
3467 the second branch of the pattern.
3468 .P
3469 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It ignores
3470 names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or (*THEN:NAME).
3471 .sp
3472   (*THEN) or (*THEN:NAME)
3473 .sp
3474 This verb causes a skip to the next innermost alternative when backtracking
3475 reaches it. That is, it cancels any further backtracking within the current
3476 alternative. Its name comes from the observation that it can be used for a
3477 pattern-based if-then-else block:
3478 .sp
3479   ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
3480 .sp
3481 If the COND1 pattern matches, FOO is tried (and possibly further items after
3482 the end of the group if FOO succeeds); on failure, the matcher skips to the
3483 second alternative and tries COND2, without backtracking into COND1. If that
3484 succeeds and BAR fails, COND3 is tried. If subsequently BAZ fails, there are no
3485 more alternatives, so there is a backtrack to whatever came before the entire
3486 group. If (*THEN) is not inside an alternation, it acts like (*PRUNE).
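.P
The if-then-else reading can be sketched as a small Python model. This is an
illustration only, not PCRE2 code: the names match_alternation, greedy_a,
literal and lit_cond are hypothetical, each condition is modelled as a function
that yields candidate end offsets in backtracking order, and items after the
end of the group are ignored.

```python
def literal(text):
    """Body matcher: return the end offset if text occurs at pos, else None."""
    def body(subject, pos):
        return pos + len(text) if subject.startswith(text, pos) else None
    return body


def lit_cond(text):
    """Condition with a single way to match: the literal text at pos."""
    def cond(subject, pos):
        if subject.startswith(text, pos):
            yield pos + len(text)
    return cond


def greedy_a(subject, pos):
    """Condition modelling a+: yield end offsets from longest run to shortest."""
    end = pos
    while end < len(subject) and subject[end] == "a":
        end += 1
    for stop in range(end, pos, -1):
        yield stop


def match_alternation(alternatives, subject, pos, use_then=True):
    """Try each (cond, body) pair in turn; return the end offset or None."""
    for cond, body in alternatives:
        for mid in cond(subject, pos):
            end = body(subject, mid)
            if end is not None:
                return end
            if use_then:
                # (*THEN): give up on this alternative without
                # backtracking into cond.
                break
    return None


# Models (a+(*THEN)ab|aa(*THEN)a) against "aaab", and the same group
# with the (*THEN) verbs removed.
alts = [(greedy_a, literal("ab")), (lit_cond("aa"), literal("a"))]
print(match_alternation(alts, "aaab", 0, use_then=True))   # 3
print(match_alternation(alts, "aaab", 0, use_then=False))  # 4
```

With (*THEN) modelled, the failure of "ab" after the greedy run of "a" moves
straight to the second alternative. Without it, the model backtracks into the
run of "a" and the first alternative succeeds.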
3487 .P
3488 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN). It is
3489 like (*MARK:NAME) in that the name is remembered for passing back to the
3490 caller. However, (*SKIP:NAME) searches only for names set with (*MARK),
3491 ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
3492 .P
3493 A subpattern that does not contain a | character is just a part of the
3494 enclosing alternative; it is not a nested alternation with only one
3495 alternative. The effect of (*THEN) extends beyond such a subpattern to the
3496 enclosing alternative. Consider this pattern, where A, B, etc. are complex
3497 pattern fragments that do not contain any | characters at this level:
3498 .sp
3499   A (B(*THEN)C) | D
3500 .sp
3501 If A and B are matched, but there is a failure in C, matching does not
3502 backtrack into A; instead it moves to the next alternative, that is, D.
3503 However, if the subpattern containing (*THEN) is given an alternative, it
3504 behaves differently:
3505 .sp
3506   A (B(*THEN)C | (*FAIL)) | D
3507 .sp
3508 The effect of (*THEN) is now confined to the inner subpattern. After a failure
3509 in C, matching moves to (*FAIL), which causes the whole subpattern to fail
3510 because there are no more alternatives to try. In this case, matching does now
3511 backtrack into A.
3512 .P
3513 Note that a conditional subpattern is not considered as having two
3514 alternatives, because only one is ever used. In other words, the | character in
3515 a conditional subpattern has a different meaning. Ignoring white space,
3516 consider:
3517 .sp
3518   ^.*? (?(?=a) a | b(*THEN)c )
3519 .sp
3520 If the subject is "ba", this pattern does not match. Because .*? is ungreedy,
3521 it initially matches zero characters. The condition (?=a) then fails, the
3522 character "b" is matched, but "c" is not. At this point, matching does not
3523 backtrack to .*? as might perhaps be expected from the presence of the |
3524 character. The conditional subpattern is part of the single alternative that
3525 comprises the whole pattern, and so the match fails. (If there was a backtrack
3526 into .*?, allowing it to match "b", the match would succeed.)
3527 .P
3528 The verbs just described provide four different "strengths" of control when
3529 subsequent matching fails. (*THEN) is the weakest, carrying on the match at the
3530 next alternative. (*PRUNE) comes next, failing the match at the current
3531 starting position, but allowing an advance to the next character (for an
3532 unanchored pattern). (*SKIP) is similar, except that the advance may be more
3533 than one character. (*COMMIT) is the strongest, causing the entire match to
3534 fail.
3535 .
3536 .
3537 .SS "More than one backtracking verb"
3538 .rs
3539 .sp
3540 If more than one backtracking verb is present in a pattern, the one that is
3541 backtracked onto first acts. For example, consider this pattern, where A, B,
3542 etc. are complex pattern fragments:
3543 .sp
3544   (A(*COMMIT)B(*THEN)C|ABD)
3545 .sp
3546 If A matches but B fails, the backtrack to (*COMMIT) causes the entire match to
3547 fail. However, if A and B match, but C fails, the backtrack to (*THEN) causes
3548 the next alternative (ABD) to be tried. This behaviour is consistent, but is
3549 not always the same as Perl's. It means that if two or more backtracking verbs
3550 appear in succession, all but the last of them have no effect. Consider this
3551 example:
3552 .sp
3553   ...(*COMMIT)(*PRUNE)...
3554 .sp
3555 If there is a matching failure to the right, backtracking onto (*PRUNE) causes
3556 it to be triggered, and its action is taken. There can never be a backtrack
3557 onto (*COMMIT).
3558 .
3559 .
3560 .\" HTML <a name="btrepeat"></a>
3561 .SS "Backtracking verbs in repeated groups"
3562 .rs
3563 .sp
3564 PCRE2 sometimes differs from Perl in its handling of backtracking verbs in
3565 repeated groups. For example, consider:
3566 .sp
3567   /(a(*COMMIT)b)+ac/
3568 .sp
3569 If the subject is "abac", Perl matches unless its optimizations are disabled,
3570 but PCRE2 always fails because the (*COMMIT) in the second repeat of the group
3571 acts.
3572 .
3573 .
3574 .\" HTML <a name="btassert"></a>
3575 .SS "Backtracking verbs in assertions"
3576 .rs
3577 .sp
3578 (*FAIL) in any assertion has its normal effect: it forces an immediate
3579 backtrack. The behaviour of the other backtracking verbs depends on whether the
3580 assertion is standalone or acting as the condition in a conditional
3581 subpattern.
3582 .P
3583 (*ACCEPT) in a standalone positive assertion causes the assertion to succeed
3584 without any further processing; captured substrings and a (*MARK) name (if set)
3585 are retained. In a standalone negative assertion, (*ACCEPT) causes the
3586 assertion to fail without any further processing; captured substrings and any
3587 (*MARK) name are discarded.
3588 .P
3589 If the assertion is a condition, (*ACCEPT) causes the condition to be true for
3590 a positive assertion and false for a negative one; captured substrings are
3591 retained in both cases.
3592 .P
3593 The remaining verbs act only when a later failure causes a backtrack to
3594 reach them. This means that their effect is confined to the assertion,
3595 because lookaround assertions are atomic. A backtrack that occurs after an
3596 assertion is complete does not jump back into the assertion. Note in particular
3597 that a (*MARK) name that is set in an assertion is not "seen" by an instance of
3598 (*SKIP:NAME) later in the pattern.
3599 .P
3600 The effect of (*THEN) is not allowed to escape beyond an assertion. If there
3601 are no more branches to try, (*THEN) causes a positive assertion to be false,
3602 and a negative assertion to be true.
3603 .P
3604 The other backtracking verbs are not treated specially if they appear in a
3605 standalone positive assertion. In a conditional positive assertion,
3606 backtracking (from within the assertion) into (*COMMIT), (*SKIP), or (*PRUNE)
3607 causes the condition to be false. However, for both standalone and conditional
3608 negative assertions, backtracking into (*COMMIT), (*SKIP), or (*PRUNE) causes
3609 the assertion to be true, without considering any further alternative branches.
3610 .
3611 .
3612 .\" HTML <a name="btsub"></a>
3613 .SS "Backtracking verbs in subroutines"
3614 .rs
3615 .sp
3616 These behaviours occur whether or not the subpattern is called recursively.
3617 .P
3618 (*ACCEPT) in a subpattern called as a subroutine causes the subroutine match to
3619 succeed without any further processing. Matching then continues after the
3620 subroutine call. Perl documents this behaviour. Perl's treatment of the other
3621 verbs in subroutines is different in some cases.
3622 .P
3623 (*FAIL) in a subpattern called as a subroutine has its normal effect: it forces
3624 an immediate backtrack.
3625 .P
3626 (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail when
3627 triggered by being backtracked to in a subpattern called as a subroutine. There
3628 is then a backtrack at the outer level.
3629 .P
3630 (*THEN), when triggered, skips to the next alternative in the innermost
3631 enclosing group within the subpattern that has alternatives (its normal
3632 behaviour). However, if there is no such group within the subroutine
3633 subpattern, the subroutine match fails and there is a backtrack at the outer
3634 level.
3635 .
3636 .
3637 .SH "SEE ALSO"
3638 .rs
3639 .sp
3640 \fBpcre2api\fP(3), \fBpcre2callout\fP(3), \fBpcre2matching\fP(3),
3641 \fBpcre2syntax\fP(3), \fBpcre2\fP(3).
3642 .
3643 .
3644 .SH AUTHOR
3645 .rs
3646 .sp
3647 .nf
3648 Philip Hazel
3649 University Computing Service
3650 Cambridge, England.
3651 .fi
3652 .
3653 .
3654 .SH REVISION
3655 .rs
3656 .sp
3657 .nf
3658 Last updated: 04 September 2018
3659 Copyright (c) 1997-2018 University of Cambridge.
3660 .fi