src/external/pcre2-10.32/doc/pcre2syntax.3

   1 .TH PCRE2SYNTAX 3 "02 September 2018" "PCRE2 10.32"
   2 .SH NAME
   3 PCRE2 - Perl-compatible regular expressions (revised API)
   4 .SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
   5 .rs
   6 .sp
   7 The full syntax and semantics of the regular expressions that are supported by
   8 PCRE2 are described in the
   9 .\" HREF
  10 \fBpcre2pattern\fP
  11 .\"
  12 documentation. This document contains a quick-reference summary of the syntax.
  13 .
  14 .
  15 .SH "QUOTING"
  16 .rs
  17 .sp
  18   \ex         where x is non-alphanumeric is a literal x
  19   \eQ...\eE    treat enclosed characters as literal
  20 .
  21 .
  22 .SH "ESCAPED CHARACTERS"
  23 .rs
  24 .sp
  25 This table applies to ASCII and Unicode environments.
  26 .sp
  27   \ea         alarm, that is, the BEL character (hex 07)
  28   \ecx        "control-x", where x is any ASCII printing character
  29   \ee         escape (hex 1B)
  30   \ef         form feed (hex 0C)
  31   \en         newline (hex 0A)
  32   \er         carriage return (hex 0D)
  33   \et         tab (hex 09)
  34   \e0dd       character with octal code 0dd
  35   \eddd       character with octal code ddd, or backreference
  36   \eo{ddd..}  character with octal code ddd..
  37   \eU         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  38   \eN{U+hh..} character with Unicode code point hh.. (Unicode mode only)
  39   \euhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  40   \exhh       character with hex code hh
  41   \ex{hh..}   character with hex code hh..
  42 .sp
  43 Note that \e0dd is always an octal code. The treatment of backslash followed by
  44 a non-zero digit is complicated; for details see the section
  45 .\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
  46 .\" </a>
  47 "Non-printing characters"
  48 .\"
  49 in the
  50 .\" HREF
  51 \fBpcre2pattern\fP
  52 .\"
  53 documentation, where details of escape processing in EBCDIC environments are
  54 also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
  55 supported in EBCDIC environments. Note that \eN not followed by an opening
  56 curly bracket has a different meaning (see below).
  57 .P
  58 When \ex is not followed by {, from zero to two hexadecimal digits are read,
  59 but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
  60 be recognized as a hexadecimal escape; otherwise it matches a literal "x".
  61 Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
  62 it matches a literal "u".
  63 .
  64 .
  65 .SH "CHARACTER TYPES"
  66 .rs
  67 .sp
  68   .          any character except newline;
  69                in dotall mode, any character whatsoever
  70   \eC         one code unit, even in UTF mode (best avoided)
  71   \ed         a decimal digit
  72   \eD         a character that is not a decimal digit
  73   \eh         a horizontal white space character
  74   \eH         a character that is not a horizontal white space character
  75   \eN         a character that is not a newline
  76   \ep{\fIxx\fP}     a character with the \fIxx\fP property
  77   \eP{\fIxx\fP}     a character without the \fIxx\fP property
  78   \eR         a newline sequence
  79   \es         a white space character
  80   \eS         a character that is not a white space character
  81   \ev         a vertical white space character
  82   \eV         a character that is not a vertical white space character
  83   \ew         a "word" character
  84   \eW         a "non-word" character
  85   \eX         a Unicode extended grapheme cluster
  86 .sp
  87 \eC is dangerous because it may leave the current matching point in the middle
  88 of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by
  89 setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
  90 with the use of \eC permanently disabled.
  91 .P
  92 By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
  93 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
  94 happening, \es and \ew may also match characters with code points in the range
  95 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
  96 sequences is changed to use Unicode properties and they match many more
  97 characters.
  98 .
  99 .
 100 .SH "GENERAL CATEGORY PROPERTIES FOR \ep AND \eP"
 101 .rs
 102 .sp
 103   C          Other
 104   Cc         Control
 105   Cf         Format
 106   Cn         Unassigned
 107   Co         Private use
 108   Cs         Surrogate
 109 .sp
 110   L          Letter
 111   Ll         Lower case letter
 112   Lm         Modifier letter
 113   Lo         Other letter
 114   Lt         Title case letter
 115   Lu         Upper case letter
 116   L&         Ll, Lu, or Lt
 117 .sp
 118   M          Mark
 119   Mc         Spacing mark
 120   Me         Enclosing mark
 121   Mn         Non-spacing mark
 122 .sp
 123   N          Number
 124   Nd         Decimal number
 125   Nl         Letter number
 126   No         Other number
 127 .sp
 128   P          Punctuation
 129   Pc         Connector punctuation
 130   Pd         Dash punctuation
 131   Pe         Close punctuation
 132   Pf         Final punctuation
 133   Pi         Initial punctuation
 134   Po         Other punctuation
 135   Ps         Open punctuation
 136 .sp
 137   S          Symbol
 138   Sc         Currency symbol
 139   Sk         Modifier symbol
 140   Sm         Mathematical symbol
 141   So         Other symbol
 142 .sp
 143   Z          Separator
 144   Zl         Line separator
 145   Zp         Paragraph separator
 146   Zs         Space separator
 147 .
 148 .
 149 .SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep AND \eP"
 150 .rs
 151 .sp
 152   Xan        Alphanumeric: union of properties L and N
 153   Xps        POSIX space: property Z or tab, NL, VT, FF, CR
 154   Xsp        Perl space: property Z or tab, NL, VT, FF, CR
 155   Xuc        Universally-named character: one that can be
 156                represented by a Universal Character Name
 157   Xwd        Perl word: property Xan or underscore
 158 .sp
 159 Perl and POSIX space are now the same. Perl added VT to its space character set
 160 at release 5.18.
 161 .
 162 .
 163 .SH "SCRIPT NAMES FOR \ep AND \eP"
 164 .rs
 165 .sp
 166 Adlam,
 167 Ahom,
 168 Anatolian_Hieroglyphs,
 169 Arabic,
 170 Armenian,
 171 Avestan,
 172 Balinese,
 173 Bamum,
 174 Bassa_Vah,
 175 Batak,
 176 Bengali,
 177 Bhaiksuki,
 178 Bopomofo,
 179 Brahmi,
 180 Braille,
 181 Buginese,
 182 Buhid,
 183 Canadian_Aboriginal,
 184 Carian,
 185 Caucasian_Albanian,
 186 Chakma,
 187 Cham,
 188 Cherokee,
 189 Common,
 190 Coptic,
 191 Cuneiform,
 192 Cypriot,
 193 Cyrillic,
 194 Deseret,
 195 Devanagari,
 196 Dogra,
 197 Duployan,
 198 Egyptian_Hieroglyphs,
 199 Elbasan,
 200 Ethiopic,
 201 Georgian,
 202 Glagolitic,
 203 Gothic,
 204 Grantha,
 205 Greek,
 206 Gujarati,
 207 Gunjala_Gondi,
 208 Gurmukhi,
 209 Han,
 210 Hangul,
 211 Hanifi_Rohingya,
 212 Hanunoo,
 213 Hatran,
 214 Hebrew,
 215 Hiragana,
 216 Imperial_Aramaic,
 217 Inherited,
 218 Inscriptional_Pahlavi,
 219 Inscriptional_Parthian,
 220 Javanese,
 221 Kaithi,
 222 Kannada,
 223 Katakana,
 224 Kayah_Li,
 225 Kharoshthi,
 226 Khmer,
 227 Khojki,
 228 Khudawadi,
 229 Lao,
 230 Latin,
 231 Lepcha,
 232 Limbu,
 233 Linear_A,
 234 Linear_B,
 235 Lisu,
 236 Lycian,
 237 Lydian,
 238 Mahajani,
 239 Makasar,
 240 Malayalam,
 241 Mandaic,
 242 Manichaean,
 243 Marchen,
 244 Masaram_Gondi,
 245 Medefaidrin,
 246 Meetei_Mayek,
 247 Mende_Kikakui,
 248 Meroitic_Cursive,
 249 Meroitic_Hieroglyphs,
 250 Miao,
 251 Modi,
 252 Mongolian,
 253 Mro,
 254 Multani,
 255 Myanmar,
 256 Nabataean,
 257 New_Tai_Lue,
 258 Newa,
 259 Nko,
 260 Nushu,
 261 Ogham,
 262 Ol_Chiki,
 263 Old_Hungarian,
 264 Old_Italic,
 265 Old_North_Arabian,
 266 Old_Permic,
 267 Old_Persian,
 268 Old_Sogdian,
 269 Old_South_Arabian,
 270 Old_Turkic,
 271 Oriya,
 272 Osage,
 273 Osmanya,
 274 Pahawh_Hmong,
 275 Palmyrene,
 276 Pau_Cin_Hau,
 277 Phags_Pa,
 278 Phoenician,
 279 Psalter_Pahlavi,
 280 Rejang,
 281 Runic,
 282 Samaritan,
 283 Saurashtra,
 284 Sharada,
 285 Shavian,
 286 Siddham,
 287 SignWriting,
 288 Sinhala,
 289 Sogdian,
 290 Sora_Sompeng,
 291 Soyombo,
 292 Sundanese,
 293 Syloti_Nagri,
 294 Syriac,
 295 Tagalog,
 296 Tagbanwa,
 297 Tai_Le,
 298 Tai_Tham,
 299 Tai_Viet,
 300 Takri,
 301 Tamil,
 302 Tangut,
 303 Telugu,
 304 Thaana,
 305 Thai,
 306 Tibetan,
 307 Tifinagh,
 308 Tirhuta,
 309 Ugaritic,
 310 Vai,
 311 Warang_Citi,
 312 Yi,
 313 Zanabazar_Square.
 314 .
 315 .
 316 .SH "CHARACTER CLASSES"
 317 .rs
 318 .sp
 319   [...]       positive character class
 320   [^...]      negative character class
 321   [x-y]       range (can be used for hex characters)
 322   [[:xxx:]]   positive POSIX named set
 323   [[:^xxx:]]  negative POSIX named set
 324 .sp
 325   alnum       alphanumeric
 326   alpha       alphabetic
 327   ascii       0-127
 328   blank       space or tab
 329   cntrl       control character
 330   digit       decimal digit
 331   graph       printing, excluding space
 332   lower       lower case letter
 333   print       printing, including space
 334   punct       printing, excluding alphanumeric
 335   space       white space
 336   upper       upper case letter
 337   word        same as \ew
 338   xdigit      hexadecimal digit
 339 .sp
 340 In PCRE2, POSIX character set names recognize only ASCII characters by default,
 341 but some of them use Unicode properties if PCRE2_UCP is set. You can use
 342 \eQ...\eE inside a character class.
 343 .
 344 .
 345 .SH "QUANTIFIERS"
 346 .rs
 347 .sp
 348   ?           0 or 1, greedy
 349   ?+          0 or 1, possessive
 350   ??          0 or 1, lazy
 351   *           0 or more, greedy
 352   *+          0 or more, possessive
 353   *?          0 or more, lazy
 354   +           1 or more, greedy
 355   ++          1 or more, possessive
 356   +?          1 or more, lazy
 357   {n}         exactly n
 358   {n,m}       at least n, no more than m, greedy
 359   {n,m}+      at least n, no more than m, possessive
 360   {n,m}?      at least n, no more than m, lazy
 361   {n,}        n or more, greedy
 362   {n,}+       n or more, possessive
 363   {n,}?       n or more, lazy
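.P
As a rough illustration of the greedy and lazy rows above, the sketch below
uses Python's re module rather than PCRE2 itself. That is an assumption made
for convenience: re is a different engine, but it shares the syntax of these
particular quantifiers. Possessive forms are left out because older Python
versions do not accept them.

```python
import re

subject = "<a><b>"

# Greedy: .+ takes as much as it can, then gives back just enough
# for the final '>' to match.
greedy = re.search(r"<.+>", subject).group(0)

# Lazy: .+? takes as little as it can, so the match stops at the first '>'.
lazy = re.search(r"<.+?>", subject).group(0)

# Bounded repeats follow the same rule: {2,3} prefers three, {2,3}? prefers two.
bounded_greedy = re.search(r"a{2,3}", "aaaa").group(0)
bounded_lazy = re.search(r"a{2,3}?", "aaaa").group(0)

print(greedy, lazy, bounded_greedy, bounded_lazy)
```

The greedy pattern returns the whole subject, while the lazy one returns only
the first bracketed item.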
 364 .
 365 .
 366 .SH "ANCHORS AND SIMPLE ASSERTIONS"
 367 .rs
 368 .sp
 369   \eb          word boundary
 370   \eB          not a word boundary
 371   ^           start of subject
 372                 also after an internal newline in multiline mode
 373                 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
 374   \eA          start of subject
 375   $           end of subject
 376                 also before newline at end of subject
 377                 also before internal newline in multiline mode
 378   \eZ          end of subject
 379                 also before newline at end of subject
 380   \ez          end of subject
 381   \eG          first matching position in subject
 382 .
 383 .
 384 .SH "REPORTED MATCH POINT SETTING"
 385 .rs
 386 .sp
 387   \eK          set reported start of match
 388 .sp
 389 \eK is honoured in positive assertions, but ignored in negative ones.
 390 .
 391 .
 392 .SH "ALTERNATION"
 393 .rs
 394 .sp
 395   expr|expr|expr...
 396 .
 397 .
 398 .SH "CAPTURING"
 399 .rs
 400 .sp
 401   (...)           capturing group
 402   (?<name>...)    named capturing group (Perl)
 403   (?'name'...)    named capturing group (Perl)
 404   (?P<name>...)   named capturing group (Python)
 405   (?:...)         non-capturing group
 406   (?|...)         non-capturing group; reset group numbers for
 407                    capturing groups in each alternative
 408 .
 409 .
 410 .SH "ATOMIC GROUPS"
 411 .rs
 412 .sp
 413   (?>...)         atomic, non-capturing group
 414 .
 415 .
 416 .SH "COMMENT"
 417 .rs
 418 .sp
 419   (?#....)        comment (not nestable)
 420 .
 421 .
 422 .SH "OPTION SETTING"
 423 .rs
 424 Changes of these options within a group are automatically cancelled at the end
 425 of the group.
 426 .sp
 427   (?i)            caseless
 428   (?J)            allow duplicate names
 429   (?m)            multiline
 430   (?n)            no auto capture
 431   (?s)            single line (dotall)
 432   (?U)            default ungreedy (lazy)
 433   (?x)            extended: ignore white space except in classes
 434   (?xx)           as (?x) but also ignore space and tab in classes
 435   (?-...)         unset option(s)
 436   (?^)            unset imnsx options
 437 .sp
 438 Unsetting x or xx unsets both. Several options may be set at once, and a
 439 mixture of setting and unsetting such as (?i-x) is allowed, but there may be
 440 only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
 441 (?^in). An option setting may appear at the start of a non-capturing group, for
 442 example (?i:...).
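.P
The scoping of an option set at the start of a non-capturing group can be
sketched as follows. The example uses Python's re module, not PCRE2; it is
assumed here only because it accepts the same (?i:...) form. Unlike PCRE2,
Python requires a bare (?i) to appear at the very start of the pattern, so
that is the only position shown.

```python
import re

# (?i:...) makes only the enclosed part caseless; "def" stays case-sensitive.
scoped = re.compile(r"(?i:abc)def")

scoped_hit = scoped.fullmatch("ABCdef") is not None
scoped_miss = scoped.fullmatch("ABCDEF") is not None

# A bare (?i) at the start of the pattern applies to everything after it.
whole_hit = re.fullmatch(r"(?i)abcdef", "ABCDEF") is not None

print(scoped_hit, scoped_miss, whole_hit)
```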
 443 .P
 444 The following are recognized only at the very start of a pattern or after one
 445 of the newline or \eR options with similar syntax. More than one of them may
 446 appear. For the first three, d is a decimal number.
 447 .sp
 448   (*LIMIT_DEPTH=d) set the backtracking limit to d
 449   (*LIMIT_HEAP=d)  set the heap size limit to d * 1024 bytes
 450   (*LIMIT_MATCH=d) set the match limit to d
 451   (*NOTEMPTY)      set PCRE2_NOTEMPTY when matching
 452   (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
 453   (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
 454   (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
 455   (*NO_JIT)       disable JIT optimization
 456   (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
 457   (*UTF)          set appropriate UTF mode for the library in use
 458   (*UCP)          set PCRE2_UCP (use Unicode properties for \ed etc)
 459 .sp
 460 Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
 461 the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP,
 462 not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
 463 application can lock out the use of (*UTF) and (*UCP) by setting the
 464 PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
 465 .
 466 .
 467 .SH "NEWLINE CONVENTION"
 468 .rs
 469 .sp
 470 These are recognized only at the very start of the pattern or after option
 471 settings with a similar syntax.
 472 .sp
 473   (*CR)           carriage return only
 474   (*LF)           linefeed only
 475   (*CRLF)         carriage return followed by linefeed
 476   (*ANYCRLF)      all three of the above
 477   (*ANY)          any Unicode newline sequence
 478   (*NUL)          the NUL character (binary zero)
 479 .
 480 .
 481 .SH "WHAT \eR MATCHES"
 482 .rs
 483 .sp
 484 These are recognized only at the very start of the pattern or after option
 485 settings with a similar syntax.
 486 .sp
 487   (*BSR_ANYCRLF)  CR, LF, or CRLF
 488   (*BSR_UNICODE)  any Unicode newline sequence
 489 .
 490 .
 491 .SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
 492 .rs
 493 .sp
 494   (?=...)         positive look ahead
 495   (?!...)         negative look ahead
 496   (?<=...)        positive look behind
 497   (?<!...)        negative look behind
 498 .sp
 499 Each top-level branch of a look behind must be of a fixed length.
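.P
A minimal sketch of the assertions above, again using Python's re module as a
stand-in for PCRE2 (an assumption; the two engines agree on this syntax but
are not the same library). The last step shows a look behind with no fixed
length being rejected when the pattern is compiled. Python is stricter than
the rule stated above, because it requires the whole look behind, not each
top-level branch, to have one fixed length.

```python
import re

text = "cost: $42, discount: 15%, qty: 3"

# Positive look behind: digits that directly follow a dollar sign.
dollars = re.findall(r"(?<=\$)\d+", text)

# Negative look behind: whole numbers that do not follow a dollar sign.
not_dollars = re.findall(r"(?<!\$)\b\d+", text)

# Negative look ahead: whole numbers that are not followed by a percent sign.
not_percent = re.findall(r"\b\d+\b(?!%)", text)

# A look behind whose length is not fixed is rejected at compile time.
try:
    re.compile(r"(?<=\$+)\d+")
    variable_length_ok = True
except re.error:
    variable_length_ok = False

print(dollars, not_dollars, not_percent, variable_length_ok)
```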
 500 .
 501 .
 502 .SH "BACKREFERENCES"
 503 .rs
 504 .sp
 505   \en              reference by number (can be ambiguous)
 506   \egn             reference by number
 507   \eg{n}           reference by number
 508   \eg+n            relative reference by number (PCRE2 extension)
 509   \eg-n            relative reference by number
 510   \eg{+n}          relative reference by number (PCRE2 extension)
 511   \eg{-n}          relative reference by number
 512   \ek<name>        reference by name (Perl)
 513   \ek'name'        reference by name (Perl)
 514   \eg{name}        reference by name (Perl)
 515   \ek{name}        reference by name (.NET)
 516   (?P=name)       reference by name (Python)
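.P
The Python-style forms in the table can be tried directly in Python's re
module, which is what this sketch assumes; it does not call PCRE2. The first
pattern uses a numbered reference to find a doubled word, and the second uses
(?P=name) to require the same quote character at both ends. The group names
"quote" and "body" are chosen for illustration only.

```python
import re

# Reference by number: group 1 must be repeated after a single space.
doubled = re.search(r"\b(\w+) \1\b", "it was the the end")
doubled_word = doubled.group(1)

# Reference by name: the closing quote must equal the opening quote,
# so the apostrophe inside the string does not end the match.
quoted = re.compile(r"""(?P<quote>['"])(?P<body>.*?)(?P=quote)""")
m = quoted.search('x = "it\'s fine"')
quote_char = m.group("quote")
body = m.group("body")

print(doubled_word, quote_char, body)
```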
 517 .
 518 .
 519 .SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
 520 .rs
 521 .sp
 522   (?R)            recurse whole pattern
 523   (?n)            call subpattern by absolute number
 524   (?+n)           call subpattern by relative number
 525   (?-n)           call subpattern by relative number
 526   (?&name)        call subpattern by name (Perl)
 527   (?P>name)       call subpattern by name (Python)
 528   \eg<name>        call subpattern by name (Oniguruma)
 529   \eg'name'        call subpattern by name (Oniguruma)
 530   \eg<n>           call subpattern by absolute number (Oniguruma)
 531   \eg'n'           call subpattern by absolute number (Oniguruma)
 532   \eg<+n>          call subpattern by relative number (PCRE2 extension)
 533   \eg'+n'          call subpattern by relative number (PCRE2 extension)
 534   \eg<-n>          call subpattern by relative number (PCRE2 extension)
 535   \eg'-n'          call subpattern by relative number (PCRE2 extension)
 536 .
 537 .
 538 .SH "CONDITIONAL PATTERNS"
 539 .rs
 540 .sp
 541   (?(condition)yes-pattern)
 542   (?(condition)yes-pattern|no-pattern)
 543 .sp
 544   (?(n)               absolute reference condition
 545   (?(+n)              relative reference condition
 546   (?(-n)              relative reference condition
 547   (?(<name>)          named reference condition (Perl)
 548   (?('name')          named reference condition (Perl)
 549   (?(name)            named reference condition (PCRE2, deprecated)
 550   (?(R)               overall recursion condition
 551   (?(Rn)              specific numbered group recursion condition
 552   (?(R&name)          specific named group recursion condition
 553   (?(DEFINE)          define subpattern for reference
 554   (?(VERSION[>]=n.m)  test PCRE2 version
 555   (?(assert)          assertion condition
 556 .sp
 557 Note the ambiguity of (?(R) and (?(Rn) which might be named reference
 558 conditions or recursion tests. Such a condition is interpreted as a reference
 559 condition if the relevant named group exists.
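.P
An absolute reference condition can be sketched like this, using Python's re
module on the assumption that its (?(n)yes-pattern|no-pattern) form behaves
like the PCRE2 one for this simple case. The pattern accepts a word that is
either wrapped in angle brackets or not bracketed at all.

```python
import re

# Group 1 optionally matches '<'. The condition (?(1)...) then demands a
# closing '>' if group 1 took part in the match, and end of subject otherwise.
pattern = re.compile(r"^(<)?(\w+)(?(1)>|$)")

results = {s: pattern.match(s) is not None
           for s in ["<tag>", "tag", "<tag", "tag>"]}

print(results)
```

Only the balanced forms match; a subject with just one of the two brackets
is rejected.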
 560 .
 561 .
 562 .SH "BACKTRACKING CONTROL"
 563 .rs
 564 .sp
 565 All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
 566 name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
 567 if :NAME is present. The others just set a name for passing back to the caller,
 568 but this is not a name that (*SKIP) can see. The following act as soon as they
 569 are reached:
 570 .sp
 571   (*ACCEPT)       force successful match
 572   (*FAIL)         force backtrack; synonym (*F)
 573   (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
 574 .sp
 575 The following act only when a subsequent match failure causes a backtrack to
 576 reach them. They all force a match failure, but they differ in what happens
 577 afterwards. Those that advance the start-of-match point do so only if the
 578 pattern is not anchored.
 579 .sp
 580   (*COMMIT)       overall failure, no advance of starting point
 581   (*PRUNE)        advance to next starting character
 582   (*SKIP)         advance to current matching position
 583   (*SKIP:NAME)    advance to position corresponding to an earlier
 584                   (*MARK:NAME); if not found, the (*SKIP) is ignored
 585   (*THEN)         local failure, backtrack to next alternation
 586 .sp
 587 The effect of one of these verbs in a group called as a subroutine is confined
 588 to the subroutine call.
 589 .
 590 .
 591 .SH "CALLOUTS"
 592 .rs
 593 .sp
 594   (?C)            callout (assumed number 0)
 595   (?Cn)           callout with numerical data n
 596   (?C"text")      callout with string data
 597 .sp
 598 The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
 599 start and the end), and the starting delimiter { matched with the ending
 600 delimiter }. To encode the ending delimiter within the string, double it.
 601 .
 602 .
 603 .SH "SEE ALSO"
 604 .rs
 605 .sp
 606 \fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
 607 \fBpcre2matching\fP(3), \fBpcre2\fP(3).
 608 .
 609 .
 610 .SH AUTHOR
 611 .rs
 612 .sp
 613 .nf
 614 Philip Hazel
 615 University Computing Service
 616 Cambridge, England.
 617 .fi
 618 .
 619 .
 620 .SH REVISION
 621 .rs
 622 .sp
 623 .nf
 624 Last updated: 02 September 2018
 625 Copyright (c) 1997-2018 University of Cambridge.
 626 .fi