3 <title>pcre2syntax specification</title>
5 <body bgcolor="#FFFFFF" text="#00005A" link="#0066FF" alink="#3399FF" vlink="#2222BB">
6 <h1>pcre2syntax man page</h1>
8 Return to the <a href="index.html">PCRE2 index page</a>.
11 This page is part of the PCRE2 HTML documentation. It was generated
12 automatically from the original man page. If there is any nonsense in it,
13 please consult the man page, in case the conversion went wrong.
16 <li><a name="TOC1" href="#SEC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a>
17 <li><a name="TOC2" href="#SEC2">QUOTING</a>
18 <li><a name="TOC3" href="#SEC3">ESCAPED CHARACTERS</a>
19 <li><a name="TOC4" href="#SEC4">CHARACTER TYPES</a>
20 <li><a name="TOC5" href="#SEC5">GENERAL CATEGORY PROPERTIES FOR \p and \P</a>
21 <li><a name="TOC6" href="#SEC6">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a>
22 <li><a name="TOC7" href="#SEC7">SCRIPT NAMES FOR \p AND \P</a>
23 <li><a name="TOC8" href="#SEC8">CHARACTER CLASSES</a>
24 <li><a name="TOC9" href="#SEC9">QUANTIFIERS</a>
25 <li><a name="TOC10" href="#SEC10">ANCHORS AND SIMPLE ASSERTIONS</a>
26 <li><a name="TOC11" href="#SEC11">REPORTED MATCH POINT SETTING</a>
27 <li><a name="TOC12" href="#SEC12">ALTERNATION</a>
28 <li><a name="TOC13" href="#SEC13">CAPTURING</a>
29 <li><a name="TOC14" href="#SEC14">ATOMIC GROUPS</a>
30 <li><a name="TOC15" href="#SEC15">COMMENT</a>
31 <li><a name="TOC16" href="#SEC16">OPTION SETTING</a>
32 <li><a name="TOC17" href="#SEC17">NEWLINE CONVENTION</a>
33 <li><a name="TOC18" href="#SEC18">WHAT \R MATCHES</a>
34 <li><a name="TOC19" href="#SEC19">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a>
35 <li><a name="TOC20" href="#SEC20">BACKREFERENCES</a>
36 <li><a name="TOC21" href="#SEC21">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a>
37 <li><a name="TOC22" href="#SEC22">CONDITIONAL PATTERNS</a>
38 <li><a name="TOC23" href="#SEC23">BACKTRACKING CONTROL</a>
39 <li><a name="TOC24" href="#SEC24">CALLOUTS</a>
40 <li><a name="TOC25" href="#SEC25">SEE ALSO</a>
41 <li><a name="TOC26" href="#SEC26">AUTHOR</a>
42 <li><a name="TOC27" href="#SEC27">REVISION</a>
44 <br><a name="SEC1" href="#TOC1">PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY</a><br>
46 The full syntax and semantics of the regular expressions that are supported by
47 PCRE2 are described in the
48 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
49 documentation. This document contains a quick-reference summary of the syntax.
51 <br><a name="SEC2" href="#TOC1">QUOTING</a><br>
54 \x where x is non-alphanumeric is a literal x
55 \Q...\E treat enclosed characters as literal
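The first rule can be tried out with a small sketch. It uses Python's re module, which is a different engine from PCRE2 but treats a backslash before a non-alphanumeric character in the same way. Python's re has no \Q...\E, so re.escape() is used as the nearest stand-in; that substitution is an assumption of this example and not part of PCRE2.

```python
import re

# Backslash before a non-alphanumeric character makes it a literal.
literal_dot = re.compile(r"a\.b")
matches_literal = literal_dot.fullmatch("a.b") is not None   # dot matched literally
matches_other = literal_dot.fullmatch("axb") is not None     # "x" is not a dot

# Stand-in for \Q...\E (not available in Python's re): escape a whole string.
quoted = re.compile(re.escape("1+1=2?"))
matches_quoted = quoted.fullmatch("1+1=2?") is not None
```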
58 <br><a name="SEC3" href="#TOC1">ESCAPED CHARACTERS</a><br>
60 This table applies to ASCII and Unicode environments.
62 \a alarm, that is, the BEL character (hex 07)
63 \cx "control-x", where x is any ASCII printing character
67 \r carriage return (hex 0D)
69 \0dd character with octal code 0dd
70 \ddd character with octal code ddd, or backreference
71 \o{ddd..} character with octal code ddd..
72 \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
73 \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
74 \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
75 \xhh character with hex code hh
76 \x{hh..} character with hex code hh..
78 Note that \0dd is always an octal code. The treatment of backslash followed by
79 a non-zero digit is complicated; for details see the section
80 <a href="pcre2pattern.html#digitsafterbackslash">"Non-printing characters"</a> in the
82 <a href="pcre2pattern.html"><b>pcre2pattern</b></a>
83 documentation, where details of escape processing in EBCDIC environments are
84 also given. \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not
85 supported in EBCDIC environments. Note that \N not followed by an opening
86 curly bracket has a different meaning (see below).
89 When \x is not followed by {, from zero to two hexadecimal digits are read,
90 but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadecimal digits to
91 be recognized as a hexadecimal escape; otherwise it matches a literal "x".
92 Likewise, if \u (in ALT_BSUX mode) is not followed by four hexadecimal digits,
93 it matches a literal "u".
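Several of the escapes in the table above name the same character in different ways. The sketch below checks a few of them with Python's re module, on the assumption that these particular escapes (\a, \r, \xhh, \0dd, and three-digit \ddd) behave as they do in PCRE2. The braced forms \x{hh..}, \o{ddd..}, and \N{U+hh..} are not accepted by Python's re and are left out.

```python
import re

# Each pattern spells one character with a different escape.
bel_by_name = re.fullmatch(r"\a", "\x07") is not None      # \a is BEL (hex 07)
cr_by_name = re.fullmatch(r"\r", "\x0d") is not None       # \r is CR (hex 0D)
a_by_hex = re.fullmatch(r"\x41", "A") is not None          # \xhh
a_by_octal = re.fullmatch(r"\101", "A") is not None        # \ddd read as an octal code
bang_by_octal0 = re.fullmatch(r"\041", "!") is not None    # \0dd is always an octal code
```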
95 <br><a name="SEC4" href="#TOC1">CHARACTER TYPES</a><br>
98 . any character except newline;
99 in dotall mode, any character whatsoever
100 \C one code unit, even in UTF mode (best avoided)
102 \D a character that is not a decimal digit
103 \h a horizontal white space character
104 \H a character that is not a horizontal white space character
105 \N a character that is not a newline
106 \p{<i>xx</i>} a character with the <i>xx</i> property
107 \P{<i>xx</i>} a character without the <i>xx</i> property
108 \R a newline sequence
109 \s a white space character
110 \S a character that is not a white space character
111 \v a vertical white space character
112 \V a character that is not a vertical white space character
113 \w a "word" character
114 \W a "non-word" character
115 \X a Unicode extended grapheme cluster
117 \C is dangerous because it may leave the current matching point in the middle
118 of a UTF-8 or UTF-16 character. The application can lock out the use of \C by
119 setting the PCRE2_NEVER_BACKSLASH_C option. It is also possible to build PCRE2
120 with the use of \C permanently disabled.
123 By default, \d, \s, and \w match only ASCII characters, even in UTF-8 mode
124 or in the 16-bit and 32-bit libraries. However, if locale-specific matching is
125 happening, \s and \w may also match characters with code points in the range
126 128-255. If the PCRE2_UCP option is set, the behaviour of these escape
127 sequences is changed to use Unicode properties and they match many more characters.
130 <br><a name="SEC5" href="#TOC1">GENERAL CATEGORY PROPERTIES FOR \p and \P</a><br>
159 Pc Connector punctuation
163 Pi Initial punctuation
170 Sm Mathematical symbol
175 Zp Paragraph separator
179 <br><a name="SEC6" href="#TOC1">PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P</a><br>
182 Xan Alphanumeric: union of properties L and N
183 Xps POSIX space: property Z or tab, NL, VT, FF, CR
184 Xsp Perl space: property Z or tab, NL, VT, FF, CR
185 Xuc Universally-named character: one that can be
186 represented by a Universal Character Name
187 Xwd Perl word: property Xan or underscore
189 Perl and POSIX space are now the same. Perl added VT to its space character set
192 <br><a name="SEC7" href="#TOC1">SCRIPT NAMES FOR \p AND \P</a><br>
196 Anatolian_Hieroglyphs,
226 Egyptian_Hieroglyphs,
246 Inscriptional_Pahlavi,
247 Inscriptional_Parthian,
277 Meroitic_Hieroglyphs,
343 <br><a name="SEC8" href="#TOC1">CHARACTER CLASSES</a><br>
346 [...] positive character class
347 [^...] negative character class
348 [x-y] range (can be used for hex characters)
349 [[:xxx:]] positive POSIX named set
350 [[:^xxx:]] negative POSIX named set
356 cntrl control character
358 graph printing, excluding space
359 lower lower case letter
360 print printing, including space
361 punct printing, excluding alphanumeric
363 upper upper case letter
365 xdigit hexadecimal digit
367 In PCRE2, POSIX character set names recognize only ASCII characters by default,
368 but some of them use Unicode properties if PCRE2_UCP is set. You can use
369 \Q...\E inside a character class.
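A short sketch of positive and negative classes built from ranges follows. It uses Python's re module as a stand-in engine. Python's re does not support POSIX named sets such as [[:xdigit:]], so the hexadecimal class is spelled out with explicit ASCII ranges, which is an assumption of this example.

```python
import re

hex_digits = re.compile(r"[0-9A-Fa-f]+")     # positive class built from ranges
non_digits = re.compile(r"[^0-9]+")          # negative class

is_hex = hex_digits.fullmatch("c0ffee") is not None
not_hex = hex_digits.fullmatch("coffee") is not None   # "o" is outside the ranges
stripped = non_digits.sub("", "tel: 555-0100")         # delete every run of non-digits
```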
371 <br><a name="SEC9" href="#TOC1">QUANTIFIERS</a><br>
375 ?+ 0 or 1, possessive
378 *+ 0 or more, possessive
381 ++ 1 or more, possessive
384 {n,m} at least n, no more than m, greedy
385 {n,m}+ at least n, no more than m, possessive
386 {n,m}? at least n, no more than m, lazy
387 {n,} n or more, greedy
388 {n,}+ n or more, possessive
389 {n,}? n or more, lazy
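The difference between the greedy and lazy forms can be seen in a small sketch. It uses Python's re module, which is assumed here to agree with PCRE2 on these two forms. The possessive forms are left out because older Python versions do not accept them.

```python
import re

subject = "aaaaa"

greedy = re.match(r"a{2,4}", subject).group()       # as many as allowed
lazy = re.match(r"a{2,4}?", subject).group()        # as few as allowed
open_greedy = re.match(r"a{2,}", subject).group()   # no upper bound, greedy
open_lazy = re.match(r"a{2,}?", subject).group()    # no upper bound, lazy
```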
392 <br><a name="SEC10" href="#TOC1">ANCHORS AND SIMPLE ASSERTIONS</a><br>
396 \B not a word boundary
398 also after an internal newline in multiline mode
399 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
402 also before newline at end of subject
403 also before internal newline in multiline mode
405 also before newline at end of subject
407 \G first matching position in subject
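The effect of multiline mode on the start and end anchors can be sketched as follows. Python's re module is used as a stand-in engine, on the assumption that its handling of these anchors and of \B matches PCRE2 for this subject. \G and PCRE2_ALT_CIRCUMFLEX have no counterpart in Python's re and are not shown.

```python
import re

text = "one two\nthree\n"

first_only = re.findall(r"^\w+", text)                 # start of subject only
each_line = re.findall(r"^\w+", text, re.MULTILINE)    # also after internal newlines
last_word = re.search(r"\w+$", text).group()           # before newline at end of subject
line_ends = re.findall(r"\w+$", text, re.MULTILINE)    # also before internal newlines

# \B: "cat" must not start at a word boundary, so only the one inside "concat" fits.
inner_start = re.search(r"\Bcat", "cat concat").start()
```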
410 <br><a name="SEC11" href="#TOC1">REPORTED MATCH POINT SETTING</a><br>
413 \K set reported start of match
415 \K is honoured in positive assertions, but ignored in negative ones.
417 <br><a name="SEC12" href="#TOC1">ALTERNATION</a><br>
423 <br><a name="SEC13" href="#TOC1">CAPTURING</a><br>
426 (...) capturing group
427 (?<name>...) named capturing group (Perl)
428 (?'name'...) named capturing group (Perl)
429 (?P<name>...) named capturing group (Python)
430 (?:...) non-capturing group
431 (?|...) non-capturing group; reset group numbers for
432 capturing groups in each alternative
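A minimal sketch of how capturing and non-capturing groups are numbered follows. It uses Python's re module, which accepts only the (?P&lt;name&gt;...) spelling shown in the code below; the Perl spellings and (?|...) are not available there, so they are not demonstrated.

```python
import re

# One named group, one non-capturing group, one numbered group.
m = re.fullmatch(r"(?P<key>\w+)(?:=|:)(\w+)", "mode=fast")

named = m.group("key")       # the named group is also group 1
numbered = m.groups()        # the non-capturing group is not counted
group_count = m.re.groups    # number of capturing groups in the pattern
```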
435 <br><a name="SEC14" href="#TOC1">ATOMIC GROUPS</a><br>
438 (?>...) atomic, non-capturing group
441 <br><a name="SEC15" href="#TOC1">COMMENT</a><br>
444 (?#....) comment (not nestable)
447 <br><a name="SEC16" href="#TOC1">OPTION SETTING</a><br>
449 Changes of these options within a group are automatically cancelled at the end of the group.
453 (?J) allow duplicate names
456 (?s) single line (dotall)
457 (?U) default ungreedy (lazy)
458 (?x) extended: ignore white space except in classes
459 (?xx) as (?x) but also ignore space and tab in classes
460 (?-...) unset option(s)
461 (?^) unset imnsx options
463 Unsetting x or xx unsets both. Several options may be set at once, and a
464 mixture of setting and unsetting such as (?i-x) is allowed, but there may be
465 only one hyphen. Setting (but not unsetting) is allowed after (?^, for example
466 (?^in). An option setting may appear at the start of a non-capturing group, for
470 The following are recognized only at the very start of a pattern or after one
471 of the newline or \R options with similar syntax. More than one of them may
472 appear. For the first three, d is a decimal number.
474 (*LIMIT_DEPTH=d) set the backtracking limit to d
475 (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
476 (*LIMIT_MATCH=d) set the match limit to d
477 (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
478 (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
479 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
480 (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
481 (*NO_JIT) disable JIT optimization
482 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
483 (*UTF) set appropriate UTF mode for the library in use
484 (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
486 Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the value of
487 the limits set by the caller of <b>pcre2_match()</b> or <b>pcre2_dfa_match()</b>,
488 not increase them. LIMIT_RECURSION is an obsolete synonym for LIMIT_DEPTH. The
489 application can lock out the use of (*UTF) and (*UCP) by setting the
490 PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options, respectively, at compile time.
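The rule that the LIMIT items can only reduce the caller's limits can be sketched in a few lines of Python. This illustrates the rule as stated above and is not PCRE2's code: the function names and the dictionary layout are hypothetical, only consecutive LIMIT items at the very start of the pattern are read, and the caller's heap limit is assumed to be in the same unit as the pattern item (multiples of 1024 bytes).

```python
import re

# Hypothetical helper: recognise one leading (*LIMIT_xxx=d) item.
_LIMIT_ITEM = re.compile(r"\(\*(LIMIT_(?:DEPTH|HEAP|MATCH))=(\d+)\)")

def leading_limits(pattern):
    """Return {name: d} for LIMIT items at the very start of the pattern."""
    limits = {}
    pos = 0
    while True:
        m = _LIMIT_ITEM.match(pattern, pos)
        if m is None:
            return limits
        limits[m.group(1)] = int(m.group(2))
        pos = m.end()

def effective_limits(caller, pattern):
    """A pattern item can lower the caller's limit but never raise it."""
    result = dict(caller)
    for name, value in leading_limits(pattern).items():
        if name in result:
            result[name] = min(result[name], value)
    return result

caller = {"LIMIT_DEPTH": 1000, "LIMIT_HEAP": 20000, "LIMIT_MATCH": 500000}
limits = effective_limits(caller, "(*LIMIT_MATCH=1000)(*LIMIT_HEAP=50000)abc")
heap_bytes = limits["LIMIT_HEAP"] * 1024   # heap limit is d * 1024 bytes
```

Here the match limit drops to 1000, while the attempt to raise the heap limit to 50000 is ignored and the caller's value of 20000 stays in force.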
492 <br><a name="SEC17" href="#TOC1">NEWLINE CONVENTION</a><br>
494 These are recognized only at the very start of the pattern or after option
495 settings with a similar syntax.
497 (*CR) carriage return only
499 (*CRLF) carriage return followed by linefeed
500 (*ANYCRLF) all three of the above
501 (*ANY) any Unicode newline sequence
502 (*NUL) the NUL character (binary zero)
505 <br><a name="SEC18" href="#TOC1">WHAT \R MATCHES</a><br>
507 These are recognized only at the very start of the pattern or after option
508 settings with a similar syntax.
510 (*BSR_ANYCRLF) CR, LF, or CRLF
511 (*BSR_UNICODE) any Unicode newline sequence
514 <br><a name="SEC19" href="#TOC1">LOOKAHEAD AND LOOKBEHIND ASSERTIONS</a><br>
517 (?=...) positive look ahead
518 (?!...) negative look ahead
519 (?<=...) positive look behind
520 (?<!...) negative look behind
522 Each top-level branch of a look behind must be of a fixed length.
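The four assertions can be compared side by side in a short sketch. It uses Python's re module, which is assumed to agree with PCRE2 for these simple, fixed-length cases.

```python
import re

text = "cost: $40, weight: 12kg, count: 7"

after_dollar = re.findall(r"(?<=\$)\d+", text)          # positive lookbehind
before_kg = re.findall(r"\d+(?=kg)", text)              # positive lookahead
not_after_dollar = re.findall(r"(?<![$\d])\d+", text)   # negative lookbehind
not_before_kg = re.findall(r"\d+(?![\dk])", text)       # negative lookahead
```

The digit in each negative class stops the engine from matching a shorter run of digits inside a number that should have been rejected as a whole.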
524 <br><a name="SEC20" href="#TOC1">BACKREFERENCES</a><br>
527 \n reference by number (can be ambiguous)
528 \gn reference by number
529 \g{n} reference by number
530 \g+n relative reference by number (PCRE2 extension)
531 \g-n relative reference by number
532 \g{+n} relative reference by number (PCRE2 extension)
533 \g{-n} relative reference by number
534 \k<name> reference by name (Perl)
535 \k'name' reference by name (Perl)
536 \g{name} reference by name (Perl)
537 \k{name} reference by name (.NET)
538 (?P=name) reference by name (Python)
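Two of the forms above, reference by number and the Python-style reference by name, can be sketched with Python's re module. The \g and \k forms are not accepted in Python patterns, so they are not shown; agreement with PCRE2 on the two forms used is an assumption of the example.

```python
import re

# \1 must repeat whatever group 1 matched: finds a doubled word.
by_number = re.search(r"\b(\w+) \1\b", "it was was fine").group(1)

# (?P=q) must repeat the opening quote character captured by the group named q.
by_name = re.search(r"(?P<q>['\"]).*?(?P=q)", "say 'hi' now").group()
```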
541 <br><a name="SEC21" href="#TOC1">SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)</a><br>
544 (?R) recurse whole pattern
545 (?n) call subpattern by absolute number
546 (?+n) call subpattern by relative number
547 (?-n) call subpattern by relative number
548 (?&name) call subpattern by name (Perl)
549 (?P>name) call subpattern by name (Python)
550 \g<name> call subpattern by name (Oniguruma)
551 \g'name' call subpattern by name (Oniguruma)
552 \g<n> call subpattern by absolute number (Oniguruma)
553 \g'n' call subpattern by absolute number (Oniguruma)
554 \g<+n> call subpattern by relative number (PCRE2 extension)
555 \g'+n' call subpattern by relative number (PCRE2 extension)
556 \g<-n> call subpattern by relative number (PCRE2 extension)
557 \g'-n' call subpattern by relative number (PCRE2 extension)
560 <br><a name="SEC22" href="#TOC1">CONDITIONAL PATTERNS</a><br>
563 (?(condition)yes-pattern)
564 (?(condition)yes-pattern|no-pattern)
566 (?(n) absolute reference condition
567 (?(+n) relative reference condition
568 (?(-n) relative reference condition
569 (?(<name>) named reference condition (Perl)
570 (?('name') named reference condition (Perl)
571 (?(name) named reference condition (PCRE2, deprecated)
572 (?(R) overall recursion condition
573 (?(Rn) specific numbered group recursion condition
574 (?(R&name) specific named group recursion condition
575 (?(DEFINE) define subpattern for reference
576 (?(VERSION[>]=n.m) test PCRE2 version
577 (?(assert) assertion condition
579 Note the ambiguity of (?(R) and (?(Rn), which might be named reference
580 conditions or recursion tests. Such a condition is interpreted as a reference
581 condition if the relevant named group exists.
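An absolute reference condition can be sketched as follows, with and without a no-pattern. Python's re module is used as a stand-in; it supports only the (?(n)...) and (?(name)...) conditions, so the recursion, DEFINE, VERSION, and assertion conditions are not shown.

```python
import re

# Group 1 is an optional "("; the condition demands ")" only if group 1 matched.
balanced = re.compile(r"(\()?\d+(?(1)\))")
results = {s: balanced.fullmatch(s) is not None
           for s in ["12", "(12)", "(12", "12)"]}

# With a no-pattern: ">" is required after "<", otherwise "!" is required.
tagged = re.compile(r"(<)?\w+(?(1)>|!)")
tag_results = {s: tagged.fullmatch(s) is not None
               for s in ["<a>", "a!", "a>", "<a!"]}
```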
583 <br><a name="SEC23" href="#TOC1">BACKTRACKING CONTROL</a><br>
585 All backtracking control verbs may be in the form (*VERB:NAME). For (*MARK) the
586 name is mandatory, for the others it is optional. (*SKIP) changes its behaviour
587 if :NAME is present. The others just set a name for passing back to the caller,
588 but this is not a name that (*SKIP) can see. The following act immediately they
591 (*ACCEPT) force successful match
592 (*FAIL) force backtrack; synonym (*F)
593 (*MARK:NAME) set name to be passed back; synonym (*:NAME)
595 The following act only when a subsequent match failure causes a backtrack to
596 reach them. They all force a match failure, but they differ in what happens
597 afterwards. Those that advance the start-of-match point do so only if the
598 pattern is not anchored.
600 (*COMMIT) overall failure, no advance of starting point
601 (*PRUNE) advance to next starting character
602 (*SKIP) advance to current matching position
603 (*SKIP:NAME) advance to position corresponding to an earlier
604 (*MARK:NAME); if not found, the (*SKIP) is ignored
605 (*THEN) local failure, backtrack to next alternation
607 The effect of one of these verbs in a group called as a subroutine is confined
608 to the subroutine call.
610 <br><a name="SEC24" href="#TOC1">CALLOUTS</a><br>
613 (?C) callout (assumed number 0)
614 (?Cn) callout with numerical data n
615 (?C"text") callout with string data
617 The allowed string delimiters are ` ' " ^ % # $ (which are the same for the
618 start and the end), and the starting delimiter { matched with the ending
619 delimiter }. To encode the ending delimiter within the string, double it.
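The delimiter rule can be sketched as a small string-building helper. The function below is hypothetical and not part of PCRE2; it only assembles the text of a callout item according to the rule above and does not compile or run a pattern.

```python
# Delimiters that are the same at the start and the end of the string.
_SAME_DELIMS = "`'\"^%#$"

def callout_item(text, start='"'):
    """Wrap text in the given delimiter, doubling the ending delimiter inside it."""
    if start == "{":
        end = "}"
    elif len(start) == 1 and start in _SAME_DELIMS:
        end = start
    else:
        raise ValueError("unsupported callout delimiter: %r" % (start,))
    return "(?C" + start + text.replace(end, end + end) + end + ")"

quoted = callout_item('say "hi"')          # each inner " is doubled
braced = callout_item("a}b", start="{")    # only the ending } is doubled
```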
621 <br><a name="SEC25" href="#TOC1">SEE ALSO</a><br>
623 <b>pcre2pattern</b>(3), <b>pcre2api</b>(3), <b>pcre2callout</b>(3),
624 <b>pcre2matching</b>(3), <b>pcre2</b>(3).
626 <br><a name="SEC26" href="#TOC1">AUTHOR</a><br>
630 University Computing Service
635 <br><a name="SEC27" href="#TOC1">REVISION</a><br>
637 Last updated: 02 September 2018
639 Copyright © 1997-2018 University of Cambridge.