.TH PCRE2SYNTAX 3 "02 September 2018" "PCRE2 10.32"
PCRE2 - Perl-compatible regular expressions (revised API)
.SH "PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY"
The full syntax and semantics of the regular expressions that are supported by
PCRE2 are described in the
\fBpcre2pattern\fP
documentation. This document contains a quick-reference summary of the syntax.
.sp
  \ex         where x is non-alphanumeric is a literal x
  \eQ...\eE    treat enclosed characters as literal
.SH "ESCAPED CHARACTERS"
This table applies to ASCII and Unicode environments.
.sp
  \ea         alarm, that is, the BEL character (hex 07)
  \ecx        "control-x", where x is any ASCII printing character
  \ef         form feed (hex 0C)
  \er         carriage return (hex 0D)
  \e0dd       character with octal code 0dd
  \eddd       character with octal code ddd, or backreference
  \eo{ddd..}  character with octal code ddd..
  \eU         "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
  \eN{U+hh..} character with Unicode code point hh.. (Unicode mode only)
  \euhhhh     character with hex code hhhh (if PCRE2_ALT_BSUX is set)
  \exhh       character with hex code hh
  \ex{hh..}   character with hex code hh..
.P
Note that \e0dd is always an octal code. The treatment of backslash followed by
a non-zero digit is complicated; for details see the section
.\" HTML <a href="pcre2pattern.html#digitsafterbackslash">
.\" </a>
"Non-printing characters"
in the
\fBpcre2pattern\fP
documentation, where details of escape processing in EBCDIC environments are
also given. \eN{U+hh..} is synonymous with \ex{hh..} in PCRE2 but is not
supported in EBCDIC environments. Note that \eN not followed by an opening
curly bracket has a different meaning (see below).
.P
When \ex is not followed by {, from zero to two hexadecimal digits are read,
but if PCRE2_ALT_BSUX is set, \ex must be followed by two hexadecimal digits to
be recognized as a hexadecimal escape; otherwise it matches a literal "x".
Likewise, if \eu (in ALT_BSUX mode) is not followed by four hexadecimal digits,
it matches a literal "u".
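.P
As a rough illustration of the hexadecimal, octal, and control-character
escapes, the sketch below uses Python's standard re module. That module is not
PCRE2; it is assumed here only because it accepts these particular escapes with
the same meaning. The variable names are illustrative.

```python
import re

# Hex escape: "\x41" names the character with hex code 41, the letter A.
hex_match = re.fullmatch(r"\x41", "A")

# Octal escape with a leading zero: "\060" is octal 60 (decimal 48), the digit 0.
octal_match = re.fullmatch(r"\060", "0")

# Control-character escapes: BEL (hex 07), form feed (hex 0C), carriage return (hex 0D).
control_match = re.fullmatch(r"\a\f\r", "\x07\x0c\x0d")

print(bool(hex_match), bool(octal_match), bool(control_match))
```

.P
The brace forms \ex{hh..}, \eo{ddd..}, and \eN{U+hh..} are not accepted by
Python's re module, so they are not shown.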
  .          any character except newline;
               in dotall mode, any character whatsoever
  \eC         one code unit, even in UTF mode (best avoided)
  \eD         a character that is not a decimal digit
  \eh         a horizontal white space character
  \eH         a character that is not a horizontal white space character
  \eN         a character that is not a newline
  \ep{\fIxx\fP}     a character with the \fIxx\fP property
  \eP{\fIxx\fP}     a character without the \fIxx\fP property
  \eR         a newline sequence
  \es         a white space character
  \eS         a character that is not a white space character
  \ev         a vertical white space character
  \eV         a character that is not a vertical white space character
  \ew         a "word" character
  \eW         a "non-word" character
  \eX         a Unicode extended grapheme cluster
.P
\eC is dangerous because it may leave the current matching point in the middle
of a UTF-8 or UTF-16 character. The application can lock out the use of \eC by
setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
with the use of \eC permanently disabled.
.P
By default, \ed, \es, and \ew match only ASCII characters, even in UTF-8 mode
or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
happening, \es and \ew may also match characters with code points in the range
128-255. If the PCRE2_UCP option is set, the behaviour of these escape
sequences is changed to use Unicode properties and they match many more
characters.
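.P
The difference between ASCII-only and Unicode-property matching can be sketched
by analogy in Python's standard re module. This is only an analogy, not PCRE2
itself: Python's re.ASCII flag is assumed to stand in for the PCRE2 default,
and Python's default Unicode matching for PCRE2_UCP. The sample text is made
up.

```python
import re

# "naive cafe 42" written with two accented letters (i-diaeresis, e-acute).
text = "na\u00efve caf\u00e9 42"

# ASCII-only \w: the accented letters are not word characters,
# so they split the words around them.
ascii_words = re.findall(r"\w+", text, re.ASCII)

# Unicode-aware \w: the accented letters count as word characters.
unicode_words = re.findall(r"\w+", text)

print(ascii_words)
print(unicode_words)
```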
.SH "GENERAL CATEGORY PROPERTIES FOR \ep and \eP"
  Pc         Connector punctuation
  Pi         Initial punctuation
  Sm         Mathematical symbol
  Zp         Paragraph separator
.SH "PCRE2 SPECIAL CATEGORY PROPERTIES FOR \ep and \eP"
  Xan        Alphanumeric: union of properties L and N
  Xps        POSIX space: property Z or tab, NL, VT, FF, CR
  Xsp        Perl space: property Z or tab, NL, VT, FF, CR
  Xuc        Universally-named character: one that can be
               represented by a Universal Character Name
  Xwd        Perl word: property Xan or underscore
.P
Perl and POSIX space are now the same. Perl added VT to its space character set.
.SH "SCRIPT NAMES FOR \ep AND \eP"
Anatolian_Hieroglyphs,
Egyptian_Hieroglyphs,
Inscriptional_Pahlavi,
Inscriptional_Parthian,
Meroitic_Hieroglyphs,
.SH "CHARACTER CLASSES"
  [...]       positive character class
  [^...]      negative character class
  [x-y]       range (can be used for hex characters)
  [[:xxx:]]   positive POSIX named set
  [[:^xxx:]]  negative POSIX named set
.sp
  cntrl       control character
  graph       printing, excluding space
  lower       lower case letter
  print       printing, including space
  punct       printing, excluding alphanumeric
  upper       upper case letter
  xdigit      hexadecimal digit
.P
In PCRE2, POSIX character set names recognize only ASCII characters by default,
but some of them use Unicode properties if PCRE2_UCP is set. You can use
\eQ...\eE inside a character class.
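.P
A minimal sketch of positive and negative classes and of a range written with
hex escapes, using Python's standard re module as a stand-in (it is not PCRE2,
but it is assumed to treat these class forms identically). POSIX named sets
such as [[:upper:]] are not available in Python's re module and are left out.

```python
import re

sample = "ABCXYZ abc 019"

# Positive class with a range given as hex escapes: A (hex 41) through F (hex 46).
hex_range = re.findall(r"[\x41-\x46]", sample)

# Negative class: any character that is neither an ASCII letter nor a space.
non_letters = re.findall(r"[^A-Za-z ]", sample)

print(hex_range)
print(non_letters)
```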
  ?+          0 or 1, possessive
  *+          0 or more, possessive
  ++          1 or more, possessive
  {n,m}       at least n, no more than m, greedy
  {n,m}+      at least n, no more than m, possessive
  {n,m}?      at least n, no more than m, lazy
  {n,}        n or more, greedy
  {n,}+       n or more, possessive
  {n,}?       n or more, lazy
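.P
The greedy and lazy forms can be compared with a short sketch in Python's
standard re module, used here as an assumed stand-in because it gives these
quantifiers the same meaning as PCRE2. The subject string is made up.

```python
import re

subject = "aaaa"

# Greedy {n,m}: takes as many repetitions as allowed (three).
greedy_bounded = re.match(r"a{2,3}", subject).group()

# Lazy {n,m}?: takes as few repetitions as allowed (two).
lazy_bounded = re.match(r"a{2,3}?", subject).group()

# Greedy {n,}: no upper bound, so it consumes the whole subject.
greedy_open = re.match(r"a{2,}", subject).group()

# Lazy {n,}?: stops at the minimum count.
lazy_open = re.match(r"a{2,}?", subject).group()

print(greedy_bounded, lazy_bounded, greedy_open, lazy_open)
```

.P
The possessive forms are omitted because Python's re module accepts them only
from version 3.11 onward.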
.SH "ANCHORS AND SIMPLE ASSERTIONS"
  \eB          not a word boundary
  ^           start of subject
                also after an internal newline in multiline mode
                (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
  $           end of subject
                also before newline at end of subject
                also before internal newline in multiline mode
  \eZ          end of subject
                also before newline at end of subject
  \eG          first matching position in subject
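.P
The circumflex, dollar, and word-boundary assertions can be sketched with
Python's standard re module, which is not PCRE2 but is assumed to agree with it
for the cases shown. The subject strings are made up.

```python
import re

subject = "one two\nthree four\n"

# In multiline mode, ^ also matches after an internal newline.
line_starts = re.findall(r"(?m)^\w+", subject)

# Without multiline mode, $ still matches before a newline at the end of the subject.
last_word = re.search(r"\w+$", subject).group()

# \b requires a word boundary on both sides, so "concat" does not count.
whole_words = re.findall(r"\bcat\b", "cat concat cat.")

# \B requires that there is no boundary, so only the "cat" inside "concat" matches.
inner = re.findall(r"\Bcat\b", "cat concat cat.")

print(line_starts, last_word, whole_words, inner)
```

.P
\eZ and \eG are left out because Python's \eZ matches only at the very end of
the subject and its re module has no \eG.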
.SH "REPORTED MATCH POINT SETTING"
  \eK          set reported start of match
.P
\eK is honoured in positive assertions, but ignored in negative ones.
  (...)           capturing group
  (?<name>...)    named capturing group (Perl)
  (?'name'...)    named capturing group (Perl)
  (?P<name>...)   named capturing group (Python)
  (?:...)         non-capturing group
  (?|...)         non-capturing group; reset group numbers for
                   capturing groups in each alternative
  (?>...)         atomic, non-capturing group
  (?#....)        comment (not nestable)
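.P
A minimal sketch of Python-style named groups, a non-capturing group, and an
inline comment, written with Python's standard re module (an assumed stand-in;
it is not PCRE2, and it accepts only the (?P<name>...) spelling of named
groups). The date string and group names are illustrative.

```python
import re

# Two named capturing groups, one non-capturing group, and a leading comment.
pattern = r"(?#ISO date)(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})"

m = re.fullmatch(pattern, "2018-09-02")

year = m.group("year")
month = m.group("month")

# Only the two named groups capture; (?:...) and (?#...) do not add groups.
group_count = m.re.groups

print(year, month, group_count)
```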
Changes of these options within a group are automatically cancelled at the end
of the group.
.sp
  (?J)            allow duplicate names
  (?s)            single line (dotall)
  (?U)            default ungreedy (lazy)
  (?x)            extended: ignore white space except in classes
  (?xx)           as (?x) but also ignore space and tab in classes
  (?-...)         unset option(s)
  (?^)            unset imnsx options
.P
Unsetting x or xx unsets both. Several options may be set at once, and a
mixture of setting and unsetting such as (?i-x) is allowed, but there may be
only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
(?^in). An option setting may appear at the start of a non-capturing group, for
example (?i:...).
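.P
The effect of a few inline options can be sketched with Python's standard re
module. It is not PCRE2, and it supports only some of these options (i, s, and
x among them), but for those it is assumed to behave the same way. Python
requires pattern-wide options to be at the very start of the pattern, which is
why each example begins with one.

```python
import re

# (?i): caseless matching.
caseless = re.fullmatch(r"(?i)pcre2", "PCRE2") is not None

# (?s): dotall, so "." also matches a newline.
dotall = re.fullmatch(r"(?s)a.b", "a\nb") is not None
no_dotall = re.fullmatch(r"a.b", "a\nb") is not None

# (?x): extended, so white space in the pattern is ignored.
extended = re.fullmatch(r"(?x) \d{4} - \d{2} ", "2018-09") is not None

# An option at the start of a non-capturing group applies only inside that group.
scoped = re.fullmatch(r"(?i:ab)c", "ABc") is not None
scoped_outside = re.fullmatch(r"(?i:ab)c", "ABC") is not None

print(caseless, dotall, no_dotall, extended, scoped, scoped_outside)
```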
The following are recognized only at the very start of a pattern or after one
of the newline or \eR options with similar syntax. More than one of them may
appear. For the first three, d is a decimal number.
.sp
  (*LIMIT_DEPTH=d)      set the backtracking limit to d
  (*LIMIT_HEAP=d)       set the heap size limit to d * 1024 bytes
  (*LIMIT_MATCH=d)      set the match limit to d
  (*NOTEMPTY)           set PCRE2_NOTEMPTY when matching
  (*NOTEMPTY_ATSTART)   set PCRE2_NOTEMPTY_ATSTART when matching
  (*NO_AUTO_POSSESS)    no auto-possessification (PCRE2_NO_AUTO_POSSESS)
  (*NO_DOTSTAR_ANCHOR)  no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
  (*NO_JIT)             disable JIT optimization
  (*NO_START_OPT)       no start-match optimization (PCRE2_NO_START_OPTIMIZE)
  (*UTF)                set appropriate UTF mode for the library in use
  (*UCP)                set PCRE2_UCP (use Unicode properties for \ed etc)
.P
Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
the limits set by the caller of \fBpcre2_match()\fP or \fBpcre2_dfa_match()\fP,
not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
application can lock out the use of (*UTF) and (*UCP) by setting the
PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
.SH "NEWLINE CONVENTION"
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
  (*CR)           carriage return only
  (*LF)           linefeed only
  (*CRLF)         carriage return followed by linefeed
  (*ANYCRLF)      all three of the above
  (*ANY)          any Unicode newline sequence
  (*NUL)          the NUL character (binary zero)
.SH "WHAT \eR MATCHES"
These are recognized only at the very start of the pattern or after option
settings with a similar syntax.
.sp
  (*BSR_ANYCRLF)  CR, LF, or CRLF
  (*BSR_UNICODE)  any Unicode newline sequence
.SH "LOOKAHEAD AND LOOKBEHIND ASSERTIONS"
  (?=...)         positive look ahead
  (?!...)         negative look ahead
  (?<=...)        positive look behind
  (?<!...)        negative look behind
.P
Each top-level branch of a look behind must be of a fixed length.
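.P
The four assertions can be sketched with Python's standard re module, an
assumed stand-in that is not PCRE2 but gives these constructs the same meaning
when, as here, every look behind has a single fixed length. The subject string
is made up.

```python
import re

subject = "cost: $42, weight: 42kg, id: 7"

# Positive look ahead: digits that are followed by "kg" (the unit is not consumed).
before_kg = re.findall(r"\d+(?=kg)", subject)

# Negative look ahead: whole numbers that are not followed by a comma.
not_before_comma = re.findall(r"\b\d+\b(?!,)", subject)

# Positive look behind: digits preceded by a dollar sign.
after_dollar = re.findall(r"(?<=\$)\d+", subject)

# Negative look behind: numbers that start at a word boundary without a preceding "$".
not_after_dollar = re.findall(r"(?<!\$)\b\d+", subject)

print(before_kg, not_before_comma, after_dollar, not_after_dollar)
```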
  \en              reference by number (can be ambiguous)
  \egn             reference by number
  \eg{n}           reference by number
  \eg+n            relative reference by number (PCRE2 extension)
  \eg-n            relative reference by number
  \eg{+n}          relative reference by number (PCRE2 extension)
  \eg{-n}          relative reference by number
  \ek<name>        reference by name (Perl)
  \ek'name'        reference by name (Perl)
  \eg{name}        reference by name (Perl)
  \ek{name}        reference by name (.NET)
  (?P=name)       reference by name (Python)
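.P
A short sketch of a numbered and a named backreference, using Python's standard
re module as an assumed stand-in. It is not PCRE2 and supports only the \en and
(?P=name) forms from the table. The subject strings and the group name are
illustrative.

```python
import re

# \1 must repeat exactly what group 1 captured: this finds a doubled word.
doubled = re.search(r"\b(\w+) \1\b", "this is is a test").group(1)

# (?P=quote) must match the same quote character that opened the string.
quoted = re.search(r"""(?P<quote>['"]).*?(?P=quote)""", "say \"it's\" now").group()

print(doubled)
print(quoted)
```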
.SH "SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)"
  (?R)            recurse whole pattern
  (?n)            call subpattern by absolute number
  (?+n)           call subpattern by relative number
  (?-n)           call subpattern by relative number
  (?&name)        call subpattern by name (Perl)
  (?P>name)       call subpattern by name (Python)
  \eg<name>        call subpattern by name (Oniguruma)
  \eg'name'        call subpattern by name (Oniguruma)
  \eg<n>           call subpattern by absolute number (Oniguruma)
  \eg'n'           call subpattern by absolute number (Oniguruma)
  \eg<+n>          call subpattern by relative number (PCRE2 extension)
  \eg'+n'          call subpattern by relative number (PCRE2 extension)
  \eg<-n>          call subpattern by relative number (PCRE2 extension)
  \eg'-n'          call subpattern by relative number (PCRE2 extension)
.SH "CONDITIONAL PATTERNS"
  (?(condition)yes-pattern)
  (?(condition)yes-pattern|no-pattern)
.sp
  (?(n)               absolute reference condition
  (?(+n)              relative reference condition
  (?(-n)              relative reference condition
  (?(<name>)          named reference condition (Perl)
  (?('name')          named reference condition (Perl)
  (?(name)            named reference condition (PCRE2, deprecated)
  (?(R)               overall recursion condition
  (?(Rn)              specific numbered group recursion condition
  (?(R&name)          specific named group recursion condition
  (?(DEFINE)          define subpattern for reference
  (?(VERSION[>]=n.m)  test PCRE2 version
  (?(assert)          assertion condition
.P
Note the ambiguity of (?(R) and (?(Rn) which might be named reference
conditions or recursion tests. Such a condition is interpreted as a reference
condition if the relevant named group exists.
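.P
The absolute reference condition (?(n) can be sketched with Python's standard
re module, an assumed stand-in that is not PCRE2 but handles this one form the
same way. The recursion, DEFINE, VERSION, and assertion conditions have no
equivalent there. The patterns and subjects are made up.

```python
import re

# Yes-pattern only: a closing ">" is required exactly when group 1 matched "<".
bracketed = re.compile(r"(<)?\w+(?(1)>)")
bracket_results = [
    bracketed.fullmatch(s) is not None
    for s in ["<abc>", "abc", "<abc", "abc>"]
]

# Yes-pattern and no-pattern: digits after a minus sign, letters otherwise.
signed = re.compile(r"(-)?(?(1)\d+|[a-z]+)")
signed_results = [
    signed.fullmatch(s) is not None
    for s in ["-42", "abc", "-abc", "42"]
]

print(bracket_results)
print(signed_results)
```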
.SH "BACKTRACKING CONTROL"
All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
if :NAME is present. The others just set a name for passing back to the caller,
but this is not a name that (*SKIP) can see. The following act as soon as they
are reached:
.sp
  (*ACCEPT)       force successful match
  (*FAIL)         force backtrack; synonym (*F)
  (*MARK:NAME)    set name to be passed back; synonym (*:NAME)
.P
The following act only when a subsequent match failure causes a backtrack to
reach them. They all force a match failure, but they differ in what happens
afterwards. Those that advance the start-of-match point do so only if the
pattern is not anchored.
.sp
  (*COMMIT)       overall failure, no advance of starting point
  (*PRUNE)        advance to next starting character
  (*SKIP)         advance to current matching position
  (*SKIP:NAME)    advance to position corresponding to an earlier
                  (*MARK:NAME); if not found, the (*SKIP) is ignored
  (*THEN)         local failure, backtrack to next alternation
.P
The effect of one of these verbs in a group called as a subroutine is confined
to the subroutine call.
  (?C)            callout (assumed number 0)
  (?Cn)           callout with numerical data n
  (?C"text")      callout with string data
.P
The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
start and the end), and the starting delimiter { matched with the ending
delimiter }. To encode the ending delimiter within the string, double it.
.P
\fBpcre2pattern\fP(3), \fBpcre2api\fP(3), \fBpcre2callout\fP(3),
\fBpcre2matching\fP(3), \fBpcre2\fP(3).
.sp
.nf
University Computing Service
.fi
.sp
.nf
Last updated: 02 September 2018
Copyright (c) 1997-2018 University of Cambridge.
.fi