1 -----------------------------------------------------------------------------
2 This file contains a concatenation of the PCRE2 man pages, converted to plain
3 text format for ease of searching with a text editor, or for use on systems
4 that do not have a man page processor. The small individual files that give
5 synopses of each function in the library have not been included. Neither has
6 the pcre2demo program. There are separate text files for the pcre2grep and
8 -----------------------------------------------------------------------------
11 PCRE2(3) Library Functions Manual PCRE2(3)
16 PCRE2 - Perl-compatible regular expressions (revised API)
20 PCRE2 is the name used for a revised API for the PCRE library, which is
21 a set of functions, written in C, that implement regular expression
22 pattern matching using the same syntax and semantics as Perl, with just
23 a few differences. After nearly two decades, the limitations of the
24 original API were making development increasingly difficult. The new
25 API is more extensible, and it was simplified by abolishing the sepa-
26 rate "study" optimizing function; in PCRE2, patterns are automatically
27 optimized where possible. Since forking from PCRE1, the code has been
28 extensively refactored and new features introduced.
30 As well as Perl-style regular expression patterns, some features that
31 appeared in Python and the original PCRE before they appeared in Perl
32 are available using the Python syntax. There is also some support for
33 one or two .NET and Oniguruma syntax items, and there are options for
34 requesting some minor changes that give better ECMAScript (aka
35 JavaScript) compatibility.
37 The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
38 32-bit code units, which means that up to three separate libraries may
39 be installed. The original work to extend PCRE to 16-bit and 32-bit
40 code units was done by Zoltan Herczeg and Christian Persch, respec-
41 tively. In all three cases, strings can be interpreted either as one
42 character per code unit, or as UTF-encoded Unicode, with support for
43 Unicode general category properties. Unicode support is optional at
44 build time (but is the default). However, processing strings as UTF
45 code units must be enabled explicitly at run time. The version of Uni-
46 code in use can be discovered by running
50 The three libraries contain identical sets of functions, with names
51 ending in _8, _16, or _32, respectively (for example, pcre2_com-
52 pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
53 32, a program that uses just one code unit width can be written using
54 generic names such as pcre2_compile(), and the documentation is written
55 assuming that this is the case.
57 In addition to the Perl-compatible matching function, PCRE2 contains an
58 alternative function that matches the same compiled patterns in a dif-
59 ferent way. In certain circumstances, the alternative function has some
60 advantages. For a discussion of the two matching algorithms, see the
63 Details of exactly which Perl regular expression features are and are
64 not supported by PCRE2 are given in separate documents. See the
65 pcre2pattern and pcre2compat pages. There is a syntax summary in the
68 Some features of PCRE2 can be included, excluded, or changed when the
69 library is built. The pcre2_config() function makes it possible for a
70 client to discover which features are available. The features them-
71 selves are described in the pcre2build page. Documentation about build-
72 ing PCRE2 for various operating systems can be found in the README and
73 NON-AUTOTOOLS_BUILD files in the source distribution.
75 The libraries contains a number of undocumented internal functions and
76 data tables that are used by more than one of the exported external
77 functions, but which are not intended for use by external callers.
78 Their names all begin with "_pcre2", which hopefully will not provoke
79 any name clashes. In some environments, it is possible to control which
80 external symbols are exported when a shared library is built, and in
81 these cases the undocumented symbols are not exported.
84 SECURITY CONSIDERATIONS
86 If you are using PCRE2 in a non-UTF application that permits users to
87 supply arbitrary patterns for compilation, you should be aware of a
88 feature that allows users to turn on UTF support from within a pattern.
89 For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
90 mode, which interprets patterns and subjects as strings of UTF-8 code
91 units instead of individual 8-bit characters. This causes both the pat-
92 tern and any data against which it is matched to be checked for UTF-8
93 validity. If the data string is very long, such a check might use suf-
94 ficiently many resources as to cause your application to lose perfor-
97 One way of guarding against this possibility is to use the pcre2_pat-
98 tern_info() function to check the compiled pattern's options for
99 PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
100 calling pcre2_compile(). This causes a compile time error if the pat-
101 tern contains a UTF-setting sequence.
103 The use of Unicode properties for character types such as \d can also
104 be enabled from within the pattern, by specifying "(*UCP)". This fea-
105 ture can be disallowed by setting the PCRE2_NEVER_UCP option.
107 If your application is one that supports UTF, be aware that validity
108 checking can take time. If the same data string is to be matched many
109 times, you can use the PCRE2_NO_UTF_CHECK option for the second and
110 subsequent matches to avoid running redundant checks.
112 The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
113 to problems, because it may leave the current matching point in the
114 middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
115 option can be used by an application to lock out the use of \C, causing
116 a compile-time error if it is encountered. It is also possible to build
117 PCRE2 with the use of \C permanently disabled.
119 Another way that performance can be hit is by running a pattern that
120 has a very large search tree against a string that will never match.
121 Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
122 vides some protection against this: see the pcre2_set_match_limit()
123 function in the pcre2api page. There is a similar function called
124 pcre2_set_depth_limit() that can be used to restrict the amount of mem-
130 The user documentation for PCRE2 comprises a number of different sec-
131 tions. In the "man" format, each of these is a separate "man page". In
132 the HTML format, each is a separate page, linked from the index page.
133 In the plain text format, the descriptions of the pcre2grep and
134 pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
135 respectively. The remaining sections, except for the pcre2demo section
136 (which is a program listing), and the short pages for individual func-
137 tions, are concatenated in pcre2.txt, for ease of searching. The sec-
138 tions are as follows:
141 pcre2-config show PCRE2 installation configuration information
142 pcre2api details of PCRE2's native C API
143 pcre2build building PCRE2
144 pcre2callout details of the callout feature
145 pcre2compat discussion of Perl compatibility
146 pcre2convert details of pattern conversion functions
147 pcre2demo a demonstration C program that uses PCRE2
148 pcre2grep description of the pcre2grep command (8-bit only)
149 pcre2jit discussion of just-in-time optimization support
150 pcre2limits details of size and other limits
151 pcre2matching discussion of the two matching algorithms
152 pcre2partial details of the partial matching facility
153 pcre2pattern syntax and semantics of supported regular
155 pcre2perform discussion of performance issues
156 pcre2posix the POSIX-compatible C API for the 8-bit library
157 pcre2sample discussion of the pcre2demo program
158 pcre2serialize details of pattern serialization
159 pcre2syntax quick syntax reference
160 pcre2test description of the pcre2test command
161 pcre2unicode discussion of Unicode and UTF support
163 In the "man" and HTML formats, there is also a short page for each C
164 library function, listing its arguments and results.
170 University Computing Service
173 Putting an actual email address here is a spam magnet. If you want to
174 email me, use my two initials, followed by the two digits 10, at the
180 Last updated: 11 July 2018
181 Copyright (c) 1997-2018 University of Cambridge.
182 ------------------------------------------------------------------------------
185 PCRE2API(3) Library Functions Manual PCRE2API(3)
190 PCRE2 - Perl-compatible regular expressions (revised API)
194 PCRE2 is a new API for PCRE, starting at release 10.0. This document
195 contains a description of all its native functions. See the pcre2 docu-
196 ment for an overview of all the PCRE2 documentation.
199 PCRE2 NATIVE API BASIC FUNCTIONS
201 pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
202 uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
203 pcre2_compile_context *ccontext);
205 void pcre2_code_free(pcre2_code *code);
207 pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
208 pcre2_general_context *gcontext);
210 pcre2_match_data *pcre2_match_data_create_from_pattern(
211 const pcre2_code *code, pcre2_general_context *gcontext);
213 int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
214 PCRE2_SIZE length, PCRE2_SIZE startoffset,
215 uint32_t options, pcre2_match_data *match_data,
216 pcre2_match_context *mcontext);
218 int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
219 PCRE2_SIZE length, PCRE2_SIZE startoffset,
220 uint32_t options, pcre2_match_data *match_data,
221 pcre2_match_context *mcontext,
222 int *workspace, PCRE2_SIZE wscount);
224 void pcre2_match_data_free(pcre2_match_data *match_data);
227 PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
229 PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
231 uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
233 PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
235 PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
238 PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
240 pcre2_general_context *pcre2_general_context_create(
241 void *(*private_malloc)(PCRE2_SIZE, void *),
242 void (*private_free)(void *, void *), void *memory_data);
244 pcre2_general_context *pcre2_general_context_copy(
245 pcre2_general_context *gcontext);
247 void pcre2_general_context_free(pcre2_general_context *gcontext);
250 PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
252 pcre2_compile_context *pcre2_compile_context_create(
253 pcre2_general_context *gcontext);
255 pcre2_compile_context *pcre2_compile_context_copy(
256 pcre2_compile_context *ccontext);
258 void pcre2_compile_context_free(pcre2_compile_context *ccontext);
260 int pcre2_set_bsr(pcre2_compile_context *ccontext,
263 int pcre2_set_character_tables(pcre2_compile_context *ccontext,
264 const unsigned char *tables);
266 int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
267 uint32_t extra_options);
269 int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
272 int pcre2_set_newline(pcre2_compile_context *ccontext,
275 int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
278 int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
279 int (*guard_function)(uint32_t, void *), void *user_data);
282 PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
284 pcre2_match_context *pcre2_match_context_create(
285 pcre2_general_context *gcontext);
287 pcre2_match_context *pcre2_match_context_copy(
288 pcre2_match_context *mcontext);
290 void pcre2_match_context_free(pcre2_match_context *mcontext);
292 int pcre2_set_callout(pcre2_match_context *mcontext,
293 int (*callout_function)(pcre2_callout_block *, void *),
296 int pcre2_set_offset_limit(pcre2_match_context *mcontext,
299 int pcre2_set_heap_limit(pcre2_match_context *mcontext,
302 int pcre2_set_match_limit(pcre2_match_context *mcontext,
305 int pcre2_set_depth_limit(pcre2_match_context *mcontext,
309 PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
311 int pcre2_substring_copy_byname(pcre2_match_data *match_data,
312 PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
314 int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
315 uint32_t number, PCRE2_UCHAR *buffer,
316 PCRE2_SIZE *bufflen);
318 void pcre2_substring_free(PCRE2_UCHAR *buffer);
320 int pcre2_substring_get_byname(pcre2_match_data *match_data,
321 PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
323 int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
324 uint32_t number, PCRE2_UCHAR **bufferptr,
325 PCRE2_SIZE *bufflen);
327 int pcre2_substring_length_byname(pcre2_match_data *match_data,
328 PCRE2_SPTR name, PCRE2_SIZE *length);
330 int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
331 uint32_t number, PCRE2_SIZE *length);
333 int pcre2_substring_nametable_scan(const pcre2_code *code,
334 PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
336 int pcre2_substring_number_from_name(const pcre2_code *code,
339 void pcre2_substring_list_free(PCRE2_SPTR *list);
341 int pcre2_substring_list_get(pcre2_match_data *match_data,
342 PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
345 PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
347 int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
348 PCRE2_SIZE length, PCRE2_SIZE startoffset,
349 uint32_t options, pcre2_match_data *match_data,
350 pcre2_match_context *mcontext, PCRE2_SPTR replacementzfP,
351 PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
352 PCRE2_SIZE *outlengthptr);
355 PCRE2 NATIVE API JIT FUNCTIONS
357 int pcre2_jit_compile(pcre2_code *code, uint32_t options);
359 int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
360 PCRE2_SIZE length, PCRE2_SIZE startoffset,
361 uint32_t options, pcre2_match_data *match_data,
362 pcre2_match_context *mcontext);
364 void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
366 pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
367 PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
369 void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
370 pcre2_jit_callback callback_function, void *callback_data);
372 void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
375 PCRE2 NATIVE API SERIALIZATION FUNCTIONS
377 int32_t pcre2_serialize_decode(pcre2_code **codes,
378 int32_t number_of_codes, const uint8_t *bytes,
379 pcre2_general_context *gcontext);
381 int32_t pcre2_serialize_encode(const pcre2_code **codes,
382 int32_t number_of_codes, uint8_t **serialized_bytes,
383 PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
385 void pcre2_serialize_free(uint8_t *bytes);
387 int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
390 PCRE2 NATIVE API AUXILIARY FUNCTIONS
392 pcre2_code *pcre2_code_copy(const pcre2_code *code);
394 pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
396 int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
399 const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);
401 int pcre2_pattern_info(const pcre2 *code, uint32_t what, void *where);
403 int pcre2_callout_enumerate(const pcre2_code *code,
404 int (*callback)(pcre2_callout_enumerate_block *, void *),
407 int pcre2_config(uint32_t what, void *where);
410 PCRE2 NATIVE API OBSOLETE FUNCTIONS
412 int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
415 int pcre2_set_recursion_memory_management(
416 pcre2_match_context *mcontext,
417 void *(*private_malloc)(PCRE2_SIZE, void *),
418 void (*private_free)(void *, void *), void *memory_data);
420 These functions became obsolete at release 10.30 and are retained only
421 for backward compatibility. They should not be used in new code. The
422 first is replaced by pcre2_set_depth_limit(); the second is no longer
423 needed and has no effect (it always returns zero).
426 PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
428 pcre2_convert_context *pcre2_convert_context_create(
429 pcre2_general_context *gcontext);
431 pcre2_convert_context *pcre2_convert_context_copy(
432 pcre2_convert_context *cvcontext);
434 void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
436 int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
437 uint32_t escape_char);
439 int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
440 uint32_t separator_char);
442 int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
443 uint32_t options, PCRE2_UCHAR **buffer,
444 PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
446 void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
448 These functions provide a way of converting non-PCRE2 patterns into
449 patterns that can be processed by pcre2_compile(). This facility is
450 experimental and may be changed in future releases. At present, "globs"
451 and POSIX basic and extended patterns can be converted. Details are
452 given in the pcre2convert documentation.
455 PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
457 There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
458 code units, respectively. However, there is just one header file,
459 pcre2.h. This contains the function prototypes and other definitions
460 for all three libraries. One, two, or all three can be installed simul-
461 taneously. On Unix-like systems the libraries are called libpcre2-8,
462 libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
465 Character strings are passed to and from a PCRE2 library as a sequence
466 of unsigned integers in code units of the appropriate width. Every
467 PCRE2 function comes in three different forms, one for each library,
474 There are also three different sets of data types:
476 PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
477 PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32
479 The UCHAR types define unsigned code units of the appropriate widths.
480 For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
481 types are constant pointers to the equivalent UCHAR types, that is,
482 they are pointers to vectors of unsigned code units.
484 Many applications use only one code unit width. For their convenience,
485 macros are defined whose names are the generic forms such as pcre2_com-
486 pile() and PCRE2_SPTR. These macros use the value of the macro
487 PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
488 tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
489 An application must define it to be 8, 16, or 32 before including
490 pcre2.h in order to make use of the generic names.
492 Applications that use more than one code unit width can be linked with
493 more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
494 be 0 before including pcre2.h, and then use the real function names.
495 Any code that is to be included in an environment where the value of
496 PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
497 names. (Unfortunately, it is not possible in C code to save and restore
498 the value of a macro.)
500 If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
501 compiler error occurs.
503 When using multiple libraries in an application, you must take care
504 when processing any particular pattern to use only functions from a
505 single library. For example, if you want to run a match using a pat-
506 tern that was compiled with pcre2_compile_16(), you must do so with
507 pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
509 In the function summaries above, and in the rest of this document and
510 other PCRE2 documents, functions and data types are described using
511 their generic names, without the _8, _16, or _32 suffix.
516 PCRE2 has its own native API, which is described in this document.
517 There are also some wrapper functions for the 8-bit library that corre-
518 spond to the POSIX regular expression API, but they do not give access
519 to all the functionality of PCRE2. They are described in the pcre2posix
520 documentation. Both these APIs define a set of C function calls.
522 The native API C data types, function prototypes, option values, and
523 error codes are defined in the header file pcre2.h, which also contains
524 definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
525 numbers for the library. Applications can use these to include support
526 for different releases of PCRE2.
528 In a Windows environment, if you want to statically link an application
529 program against a non-dll PCRE2 library, you must define PCRE2_STATIC
530 before including pcre2.h.
532 The functions pcre2_compile() and pcre2_match() are used for compiling
533 and matching regular expressions in a Perl-compatible manner. A sample
534 program that demonstrates the simplest way of using them is provided in
535 the file called pcre2demo.c in the PCRE2 source distribution. A listing
536 of this program is given in the pcre2demo documentation, and the
537 pcre2sample documentation describes how to compile and run it.
539 The compiling and matching functions recognize various options that are
540 passed as bits in an options argument. There are also some more compli-
541 cated parameters such as custom memory management functions and
542 resource limits that are passed in "contexts" (which are just memory
543 blocks, described below). Simple applications do not need to make use
546 Just-in-time (JIT) compiler support is an optional feature of PCRE2
547 that can be built in appropriate hardware environments. It greatly
548 speeds up the matching performance of many patterns. Programs can
549 request that it be used if available by calling pcre2_jit_compile()
550 after a pattern has been successfully compiled by pcre2_compile(). This
551 does nothing if JIT support is not available.
553 More complicated programs might need to make use of the specialist
554 functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
555 pcre2_jit_stack_assign() in order to control the JIT code's memory
558 JIT matching is automatically used by pcre2_match() if it is available,
559 unless the PCRE2_NO_JIT option is set. There is also a direct interface
560 for JIT matching, which gives improved performance at the expense of
561 less sanity checking. The JIT-specific functions are discussed in the
562 pcre2jit documentation.
564 A second matching function, pcre2_dfa_match(), which is not Perl-com-
565 patible, is also provided. This uses a different algorithm for the
566 matching. The alternative algorithm finds all possible matches (at a
567 given point in the subject), and scans the subject just once (unless
568 there are lookaround assertions). However, this algorithm does not
569 return captured substrings. A description of the two matching algo-
570 rithms and their advantages and disadvantages is given in the
571 pcre2matching documentation. There is no JIT support for
574 In addition to the main compiling and matching functions, there are
575 convenience functions for extracting captured substrings from a subject
576 string that has been matched by pcre2_match(). They are:
578 pcre2_substring_copy_byname()
579 pcre2_substring_copy_bynumber()
580 pcre2_substring_get_byname()
581 pcre2_substring_get_bynumber()
582 pcre2_substring_list_get()
583 pcre2_substring_length_byname()
584 pcre2_substring_length_bynumber()
585 pcre2_substring_nametable_scan()
586 pcre2_substring_number_from_name()
588 pcre2_substring_free() and pcre2_substring_list_free() are also pro-
589 vided, to free memory used for extracted strings. If either of these
590 functions is called with a NULL argument, the function returns immedi-
591 ately without doing anything.
593 The function pcre2_substitute() can be called to match a pattern and
594 return a copy of the subject string with substitutions for parts that
597 Functions whose names begin with pcre2_serialize_ are used for saving
598 compiled patterns on disc or elsewhere, and reloading them later.
600 Finally, there are functions for finding out information about a com-
601 piled pattern (pcre2_pattern_info()) and about the configuration with
602 which PCRE2 was built (pcre2_config()).
604 Functions with names ending with _free() are used for freeing memory
605 blocks of various sorts. In all cases, if one of these functions is
606 called with a NULL argument, it does nothing.
609 STRING LENGTHS AND OFFSETS
611 The PCRE2 API uses string lengths and offsets into strings of code
612 units in several places. These values are always of type PCRE2_SIZE,
613 which is an unsigned integer type, currently always defined as size_t.
614 The largest value that can be stored in such a type (that is
615 ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
616 strings and unset offsets. Therefore, the longest string that can be
617 handled is one less than this maximum.
622 PCRE2 supports five different conventions for indicating line breaks in
623 strings: a single CR (carriage return) character, a single LF (line-
624 feed) character, the two-character sequence CRLF, any of the three pre-
625 ceding, or any Unicode newline sequence. The Unicode newline sequences
626 are the three just mentioned, plus the single characters VT (vertical
627 tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
628 separator, U+2028), and PS (paragraph separator, U+2029).
630 Each of the first three conventions is used by at least one operating
631 system as its standard newline sequence. When PCRE2 is built, a default
632 can be specified. If it is not, the default is set to LF, which is the
633 Unix standard. However, the newline convention can be changed by an
634 application when calling pcre2_compile(), or it can be specified by
635 special text at the start of the pattern itself; this overrides any
636 other settings. See the pcre2pattern page for details of the special
639 In the PCRE2 documentation the word "newline" is used to mean "the
640 character or pair of characters that indicate a line break". The choice
641 of newline convention affects the handling of the dot, circumflex, and
642 dollar metacharacters, the handling of #-comments in /x mode, and, when
643 CRLF is a recognized line ending sequence, the match position advance-
644 ment for a non-anchored pattern. There is more detail about this in the
645 section on pcre2_match() options below.
647 The choice of newline convention does not affect the interpretation of
648 the \n or \r escape sequences, nor does it affect what \R matches; this
649 has its own separate convention.
654 In a multithreaded application it is important to keep thread-specific
655 data separate from data that can be shared between threads. The PCRE2
656 library code itself is thread-safe: it contains no static or global
657 variables. The API is designed to be fairly simple for non-threaded
658 applications while at the same time ensuring that multithreaded appli-
661 There are several different blocks of data that are used to pass infor-
662 mation between the application and the PCRE2 libraries.
666 A pointer to the compiled form of a pattern is returned to the user
667 when pcre2_compile() is successful. The data in the compiled pattern is
668 fixed, and does not change when the pattern is matched. Therefore, it
669 is thread-safe, that is, the same compiled pattern can be used by more
670 than one thread simultaneously. For example, an application can compile
671 all its patterns at the start, before forking off multiple threads that
672 use them. However, if the just-in-time (JIT) optimization feature is
673 being used, it needs separate memory stack areas for each thread. See
674 the pcre2jit documentation for more details.
676 In a more complicated situation, where patterns are compiled only when
677 they are first needed, but are still shared between threads, pointers
678 to compiled patterns must be protected from simultaneous writing by
679 multiple threads, at least until a pattern has been compiled. The logic
680 can be something like this:
682 Get a read-only (shared) lock (mutex) for pointer
685 Get a write (unique) lock for pointer
686 pointer = pcre2_compile(...
689 Use pointer in pcre2_match()
691 Of course, testing for compilation errors should also be included in
694 If JIT is being used, but the JIT compilation is not being done immedi-
695 ately, (perhaps waiting to see if the pattern is used often enough)
696 similar logic is required. JIT compilation updates a pointer within the
697 compiled code block, so a thread must gain unique write access to the
698 pointer before calling pcre2_jit_compile(). Alternatively,
699 pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
700 obtain a private copy of the compiled code before calling the JIT com-
705 The next main section below introduces the idea of "contexts" in which
706 PCRE2 functions are called. A context is nothing more than a collection
707 of parameters that control the way PCRE2 operates. Grouping a number of
708 parameters together in a context is a convenient way of passing them to
709 a PCRE2 function without using lots of arguments. The parameters that
710 are stored in contexts are in some sense "advanced features" of the
711 API. Many straightforward applications will not need to use contexts.
713 In a multithreaded application, if the parameters in a context are val-
714 ues that are never changed, the same context can be used by all the
715 threads. However, if any thread needs to change any value in a context,
716 it must make its own thread-specific copy.
720 The matching functions need a block of memory for storing the results
721 of a match. This includes details of what was matched, as well as addi-
722 tional information such as the name of a (*MARK) setting. Each thread
723 must provide its own copy of this memory.
728 Some PCRE2 functions have a lot of parameters, many of which are used
729 only by specialist applications, for example, those that use custom
730 memory management or non-standard character tables. To keep function
731 argument lists at a reasonable size, and at the same time to keep the
732 API extensible, "uncommon" parameters are passed to certain functions
733 in a context instead of directly. A context is just a block of memory
734 that holds the parameter values. Applications that do not need to
735 adjust any of the context parameters can pass NULL when a context
738 There are three different types of context: a general context that is
739 relevant for several PCRE2 operations, a compile-time context, and a
744 At present, this context just contains pointers to (and data for)
745 external memory management functions that are called from several
746 places in the PCRE2 library. The context is named `general' rather than
747 specifically `memory' because in future other fields may be added. If
748 you do not want to supply your own custom memory management functions,
749 you do not need to bother with a general context. A general context is
752 pcre2_general_context *pcre2_general_context_create(
753 void *(*private_malloc)(PCRE2_SIZE, void *),
754 void (*private_free)(void *, void *), void *memory_data);
756 The two function pointers specify custom memory management functions,
757 whose prototypes are:
759 void *private_malloc(PCRE2_SIZE, void *);
760 void private_free(void *, void *);
762 Whenever code in PCRE2 calls these functions, the final argument is the
763 value of memory_data. Either of the first two arguments of the creation
764 function may be NULL, in which case the system memory management func-
765 tions malloc() and free() are used. (This is not currently useful, as
766 there are no other fields in a general context, but in future there
767 might be.) The private_malloc() function is used (if supplied) to
768 obtain memory for storing the context, and all three values are saved
769 as part of the context.
771 Whenever PCRE2 creates a data block of any kind, the block contains a
772 pointer to the free() function that matches the malloc() function that
773 was used. When the time comes to free the block, this function is
776 A general context can be copied by calling:
778 pcre2_general_context *pcre2_general_context_copy(
779 pcre2_general_context *gcontext);
781 The memory used for a general context should be freed by calling:
783 void pcre2_general_context_free(pcre2_general_context *gcontext);
785 If this function is passed a NULL argument, it returns immediately
786 without doing anything.
790 A compile context is required if you want to provide an external func-
791 tion for stack checking during compilation or to change the default
792 values of any of the following compile-time parameters:
794 What \R matches (Unicode newlines or CR, LF, CRLF only)
795 PCRE2's character tables
796 The newline character sequence
797 The compile time nested parentheses limit
798 The maximum length of the pattern string
799 The extra options bits (none set by default)
801 A compile context is also required if you are using custom memory man-
802 agement. If none of these apply, just pass NULL as the context argu-
803 ment of pcre2_compile().
805 A compile context is created, copied, and freed by the following func-
808 pcre2_compile_context *pcre2_compile_context_create(
809 pcre2_general_context *gcontext);
811 pcre2_compile_context *pcre2_compile_context_copy(
812 pcre2_compile_context *ccontext);
814 void pcre2_compile_context_free(pcre2_compile_context *ccontext);
816 A compile context is created with default values for its parameters.
817 These can be changed by calling the following functions, which return 0
818 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
820 int pcre2_set_bsr(pcre2_compile_context *ccontext,
823 The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
824 CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
825 Unicode line ending sequence. The value is used by the JIT compiler and
826 by the two interpreted matching functions, pcre2_match() and
829 int pcre2_set_character_tables(pcre2_compile_context *ccontext,
830 const unsigned char *tables);
832 The value must be the result of a call to pcre2_maketables(), whose
833 only argument is a general context. This function builds a set of char-
834 acter tables in the current locale.
836 int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
837 uint32_t extra_options);
839 As PCRE2 has developed, almost all the 32 option bits that are avail-
840 able in the options argument of pcre2_compile() have been used up. To
841 avoid running out, the compile context contains a set of extra option
842 bits which are used for some newer, assumed rarer, options. This func-
843 tion sets those bits. It always sets all the bits (either on or off).
844 It does not modify any existing setting. The available options are
845 defined in the section entitled "Extra compile options" below.
847 int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
850 This sets a maximum length, in code units, for any pattern string that
851 is compiled with this context. If the pattern is longer, an error is
852 generated. This facility is provided so that applications that accept
853 patterns from external sources can limit their size. The default is the
854 largest number that a PCRE2_SIZE variable can hold, which is effec-
857 int pcre2_set_newline(pcre2_compile_context *ccontext,
860 This specifies which characters or character sequences are to be recog-
861 nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
862 return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
863 two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
864 of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
865 PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
867 A pattern can override the value set in the compile context by starting
868 with a sequence such as (*CRLF). See the pcre2pattern page for details.
870 When a pattern is compiled with the PCRE2_EXTENDED or
871 PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
872 tion of the end of internal comments starting with #. The value is
873 saved with the compiled pattern for subsequent use by the JIT compiler
874 and by the two interpreted matching functions, pcre2_match() and
877 int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
880 This parameter ajusts the limit, set when PCRE2 is built (default 250),
881 on the depth of parenthesis nesting in a pattern. This limit stops
882 rogue patterns using up too much system stack when being compiled. The
883 limit applies to parentheses of all kinds, not just capturing parenthe-
886 int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
887 int (*guard_function)(uint32_t, void *), void *user_data);
889 There is at least one application that runs PCRE2 in threads with very
890 limited system stack, where running out of stack is to be avoided at
891 all costs. The parenthesis limit above cannot take account of how much
892 stack is actually available during compilation. For a finer control,
893 you can supply a function that is called whenever pcre2_compile()
894 starts to compile a parenthesized part of a pattern. This function can
895 check the actual stack size (or anything else that it wants to, of
898 The first argument to the callout function gives the current depth of
899 nesting, and the second is user data that is set up by the last argu-
900 ment of pcre2_set_compile_recursion_guard(). The callout function
901 should return zero if all is well, or non-zero to force an error.
905 A match context is required if you want to:
907 Set up a callout function
908 Set an offset limit for matching an unanchored pattern
909 Change the limit on the amount of heap used when matching
910 Change the backtracking match limit
911 Change the backtracking depth limit
912 Set custom memory management specifically for the match
914 If none of these apply, just pass NULL as the context argument of
915 pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
917 A match context is created, copied, and freed by the following func-
920 pcre2_match_context *pcre2_match_context_create(
921 pcre2_general_context *gcontext);
923 pcre2_match_context *pcre2_match_context_copy(
924 pcre2_match_context *mcontext);
926 void pcre2_match_context_free(pcre2_match_context *mcontext);
928 A match context is created with default values for its parameters.
929 These can be changed by calling the following functions, which return 0
930 on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
932 int pcre2_set_callout(pcre2_match_context *mcontext,
933 int (*callout_function)(pcre2_callout_block *, void *),
936 This sets up a "callout" function for PCRE2 to call at specified points
937 during a matching operation. Details are given in the pcre2callout doc-
940 int pcre2_set_offset_limit(pcre2_match_context *mcontext,
943 The offset_limit parameter limits how far an unanchored search can
944 advance in the subject string. The default value is PCRE2_UNSET. The
945 pcre2_match() and pcre2_dfa_match() functions return
946 PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
947 given offset is not found. The pcre2_substitute() function makes no
950 For example, if the pattern /abc/ is matched against "123abc" with an
951 offset limit less than 3, the result is PCRE2_ERROR_NO_MATCH. A match
952 can never be found if the startoffset argument of pcre2_match(),
953 pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
954 limit set in the match context.
956 When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
957 option when calling pcre2_compile() so that when JIT is in use, differ-
958 ent code can be compiled. If a match is started with a non-default
959 match limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
962 The offset limit facility can be used to track progress when searching
963 large subject strings or to limit the extent of global substitutions.
964 See also the PCRE2_FIRSTLINE option, which requires a match to start
965 before or at the first newline that follows the start of matching in
966 the subject. If this is set with an offset limit, a match must occur in
967 the first line and also within the offset limit. In other words, which-
968 ever limit comes first is used.
970 int pcre2_set_heap_limit(pcre2_match_context *mcontext,
973 The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
974 the maximum amount of heap memory that pcre2_match() may use to hold
975 backtracking information when running an interpretive match. This limit
976 also applies to pcre2_dfa_match(), which may use the heap when process-
977 ing patterns with a lot of nested pattern recursion or lookarounds or
978 atomic groups. This limit does not apply to matching with the JIT opti-
979 mization, which has its own memory control arrangements (see the
980 pcre2jit documentation for more details). If the limit is reached, the
981 negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
982 limit can be set when PCRE2 is built; if it is not, the default is set
983 very large and is essentially "unlimited".
985 A value for the heap limit may also be supplied by an item at the start
986 of a pattern of the form
990 where ddd is a decimal number. However, such a setting is ignored
991 unless ddd is less than the limit set by the caller of pcre2_match()
992 or, if no such limit is set, less than the default.
994 The pcre2_match() function starts out using a 20KiB vector on the sys-
995 tem stack for recording backtracking points. The more nested backtrack-
996 ing points there are (that is, the deeper the search tree), the more
997 memory is needed. Heap memory is used only if the initial vector is
998 too small. If the heap limit is set to a value less than 21 (in partic-
999 ular, zero) no heap memory will be used. In this case, only patterns
1000 that do not have a lot of nested backtracking can be successfully pro-
1003 Similarly, for pcre2_dfa_match(), a vector on the system stack is used
1004 when processing pattern recursions, lookarounds, or atomic groups, and
1005 only if this is not big enough is heap memory used. In this case, too,
1006 setting a value of zero disables the use of the heap.
1008 int pcre2_set_match_limit(pcre2_match_context *mcontext,
1011 The match_limit parameter provides a means of preventing PCRE2 from
1012 using up too many computing resources when processing patterns that are
1013 not going to match, but which have a very large number of possibilities
1014 in their search trees. The classic example is a pattern that uses
1015 nested unlimited repeats.
1017 There is an internal counter in pcre2_match() that is incremented each
1018 time round its main matching loop. If this value reaches the match
1019 limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
1020 This has the effect of limiting the amount of backtracking that can
1021 take place. For patterns that are not anchored, the count restarts from
1022 zero for each position in the subject string. This limit also applies
1023 to pcre2_dfa_match(), though the counting is done in a different way.
1025 When pcre2_match() is called with a pattern that was successfully pro-
1026 cessed by pcre2_jit_compile(), the way in which matching is executed is
1027 entirely different. However, there is still the possibility of runaway
1028 matching that goes on for a very long time, and so the match_limit
1029 value is also used in this case (but in a different way) to limit how
1030 long the matching can continue.
1032 The default value for the limit can be set when PCRE2 is built; the
1033 default default is 10 million, which handles all but the most extreme
1034 cases. A value for the match limit may also be supplied by an item at
1035 the start of a pattern of the form
1039 where ddd is a decimal number. However, such a setting is ignored
1040 unless ddd is less than the limit set by the caller of pcre2_match() or
1041 pcre2_dfa_match() or, if no such limit is set, less than the default.
1043 int pcre2_set_depth_limit(pcre2_match_context *mcontext,
1046 This parameter limits the depth of nested backtracking in
1047 pcre2_match(). Each time a nested backtracking point is passed, a new
1048 memory "frame" is used to remember the state of matching at that point.
1049 Thus, this parameter indirectly limits the amount of memory that is
1050 used in a match. However, because the size of each memory "frame"
1051 depends on the number of capturing parentheses, the actual memory limit
1052 varies from pattern to pattern. This limit was more useful in versions
1053 before 10.30, where function recursion was used for backtracking.
1055 The depth limit is not relevant, and is ignored, when matching is done
1056 using JIT compiled code. However, it is supported by pcre2_dfa_match(),
1057 which uses it to limit the depth of nested internal recursive function
1058 calls that implement atomic groups, lookaround assertions, and pattern
1059 recursions. This limits, indirectly, the amount of system stack that is
1060 used. It was more useful in versions before 10.32, when stack memory
1061 was used for local workspace vectors for recursive function calls. From
1062 version 10.32, only local variables are allocated on the stack and as
1063 each call uses only a few hundred bytes, even a small stack can support
1064 quite a lot of recursion.
1066 If the depth of internal recursive function calls is great enough,
1067 local workspace vectors are allocated on the heap from version 10.32
1068 onwards, so the depth limit also indirectly limits the amount of heap
1069 memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
1070 matched to a very long string using pcre2_dfa_match(), can use a great
1071 deal of memory. However, it is probably better to limit heap usage
1072 directly by calling pcre2_set_heap_limit().
1074 The default value for the depth limit can be set when PCRE2 is built;
1075 if it is not, the default is set to the same value as the default for
1076 the match limit. If the limit is exceeded, pcre2_match() or
1077 pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
1078 limit may also be supplied by an item at the start of a pattern of the
1083 where ddd is a decimal number. However, such a setting is ignored
1084 unless ddd is less than the limit set by the caller of pcre2_match() or
1085 pcre2_dfa_match() or, if no such limit is set, less than the default.
1088 CHECKING BUILD-TIME OPTIONS
1090 int pcre2_config(uint32_t what, void *where);
1092 The function pcre2_config() makes it possible for a PCRE2 client to
1093 discover which optional features have been compiled into the PCRE2
1094 library. The pcre2build documentation has more details about these
1097 The first argument for pcre2_config() specifies which information is
1098 required. The second argument is a pointer to memory into which the
1099 information is placed. If NULL is passed, the function returns the
1100 amount of memory that is needed for the requested information. For
1101 calls that return numerical values, the value is in bytes; when
1102 requesting these values, where should point to appropriately aligned
1103 memory. For calls that return strings, the required length is given in
1104 code units, not counting the terminating zero.
1106 When requesting information, the returned value from pcre2_config() is
1107 non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
1108 TION if the value in the first argument is not recognized. The follow-
1109 ing information is available:
1113 The output is a uint32_t integer whose value indicates what character
1114 sequences the \R escape sequence matches by default. A value of
1115 PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
1116 sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
1117 LF, or CRLF. The default can be overridden when a pattern is compiled.
1119 PCRE2_CONFIG_COMPILED_WIDTHS
1121 The output is a uint32_t integer whose lower bits indicate which code
1122 unit widths were selected when PCRE2 was built. The 1-bit indicates
1123 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
1126 PCRE2_CONFIG_DEPTHLIMIT
1128 The output is a uint32_t integer that gives the default limit for the
1129 depth of nested backtracking in pcre2_match() or the depth of nested
1130 recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
1131 ther details are given with pcre2_set_depth_limit() above.
1133 PCRE2_CONFIG_HEAPLIMIT
1135 The output is a uint32_t integer that gives, in kibibytes, the default
1136 limit for the amount of heap memory used by pcre2_match() or
1137 pcre2_dfa_match(). Further details are given with
1138 pcre2_set_heap_limit() above.
1142 The output is a uint32_t integer that is set to one if support for
1143 just-in-time compiling is available; otherwise it is set to zero.
1145 PCRE2_CONFIG_JITTARGET
1147 The where argument should point to a buffer that is at least 48 code
1148 units long. (The exact length required can be found by calling
1149 pcre2_config() with where set to NULL.) The buffer is filled with a
1150 string that contains the name of the architecture for which the JIT
1151 compiler is configured, for example "x86 32bit (little endian +
1152 unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
1153 returned, otherwise the number of code units used is returned. This is
1154 the length of the string, plus one unit for the terminating zero.
1156 PCRE2_CONFIG_LINKSIZE
1158 The output is a uint32_t integer that contains the number of bytes used
1159 for internal linkage in compiled regular expressions. When PCRE2 is
1160 configured, the value can be set to 2, 3, or 4, with the default being
1161 2. This is the value that is returned by pcre2_config(). However, when
1162 the 16-bit library is compiled, a value of 3 is rounded up to 4, and
1163 when the 32-bit library is compiled, internal linkages always use 4
1164 bytes, so the configured value is not relevant.
1166 The default value of 2 for the 8-bit and 16-bit libraries is sufficient
1167 for all but the most massive patterns, since it allows the size of the
1168 compiled pattern to be up to 65535 code units. Larger values allow
1169 larger regular expressions to be compiled by those two libraries, but
1170 at the expense of slower matching.
1172 PCRE2_CONFIG_MATCHLIMIT
1174 The output is a uint32_t integer that gives the default match limit for
1175 pcre2_match(). Further details are given with pcre2_set_match_limit()
1178 PCRE2_CONFIG_NEWLINE
1180 The output is a uint32_t integer whose value specifies the default
1181 character sequence that is recognized as meaning "newline". The values
1184 PCRE2_NEWLINE_CR Carriage return (CR)
1185 PCRE2_NEWLINE_LF Linefeed (LF)
1186 PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
1187 PCRE2_NEWLINE_ANY Any Unicode line ending
1188 PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
1189 PCRE2_NEWLINE_NUL The NUL character (binary zero)
1191 The default should normally correspond to the standard sequence for
1192 your operating system.
1194 PCRE2_CONFIG_NEVER_BACKSLASH_C
1196 The output is a uint32_t integer that is set to one if the use of \C
1197 was permanently disabled when PCRE2 was built; otherwise it is set to
1200 PCRE2_CONFIG_PARENSLIMIT
1202 The output is a uint32_t integer that gives the maximum depth of nest-
1203 ing of parentheses (of any kind) in a pattern. This limit is imposed to
1204 cap the amount of system stack used when a pattern is compiled. It is
1205 specified when PCRE2 is built; the default is 250. This limit does not
1206 take into account the stack that may already be used by the calling
1207 application. For finer control over compilation stack usage, see
1208 pcre2_set_compile_recursion_guard().
1210 PCRE2_CONFIG_STACKRECURSE
1212 This parameter is obsolete and should not be used in new code. The out-
1213 put is a uint32_t integer that is always set to zero.
1215 PCRE2_CONFIG_UNICODE_VERSION
1217 The where argument should point to a buffer that is at least 24 code
1218 units long. (The exact length required can be found by calling
1219 pcre2_config() with where set to NULL.) If PCRE2 has been compiled
1220 without Unicode support, the buffer is filled with the text "Unicode
1221 not supported". Otherwise, the Unicode version string (for example,
1222 "8.0.0") is inserted. The number of code units used is returned. This
1223 is the length of the string plus one unit for the terminating zero.
1225 PCRE2_CONFIG_UNICODE
1227 The output is a uint32_t integer that is set to one if Unicode support
1228 is available; otherwise it is set to zero. Unicode support implies UTF
1231 PCRE2_CONFIG_VERSION
1233 The where argument should point to a buffer that is at least 24 code
1234 units long. (The exact length required can be found by calling
1235 pcre2_config() with where set to NULL.) The buffer is filled with the
1236 PCRE2 version string, zero-terminated. The number of code units used is
1237 returned. This is the length of the string plus one unit for the termi-
1243 pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
1244 uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
1245 pcre2_compile_context *ccontext);
1247 void pcre2_code_free(pcre2_code *code);
1249 pcre2_code *pcre2_code_copy(const pcre2_code *code);
1251 pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
1253 The pcre2_compile() function compiles a pattern into an internal form.
1254 The pattern is defined by a pointer to a string of code units and a
1255 length (in code units). If the pattern is zero-terminated, the length
1256 can be specified as PCRE2_ZERO_TERMINATED. The function returns a
1257 pointer to a block of memory that contains the compiled pattern and
1258 related data, or NULL if an error occurred.
1260 If the compile context argument ccontext is NULL, memory for the com-
1261 piled pattern is obtained by calling malloc(). Otherwise, it is
1262 obtained from the same memory function that was used for the compile
1263 context. The caller must free the memory by calling pcre2_code_free()
1264 when it is no longer needed. If pcre2_code_free() is called with a
1265 NULL argument, it returns immediately, without doing anything.
1267 The function pcre2_code_copy() makes a copy of the compiled code in new
1268 memory, using the same memory allocator as was used for the original.
1269 However, if the code has been processed by the JIT compiler (see
1270 below), the JIT information cannot be copied (because it is position-
1271 dependent). The new copy can initially be used only for non-JIT match-
1272 ing, though it can be passed to pcre2_jit_compile() if required. If
1273 pcre2_code_copy() is called with a NULL argument, it returns NULL.
1275 The pcre2_code_copy() function provides a way for individual threads in
1276 a multithreaded application to acquire a private copy of shared com-
1277 piled code. However, it does not make a copy of the character tables
1278 used by the compiled pattern; the new pattern code points to the same
1279 tables as the original code. (See "Locale Support" below for details
1280 of these character tables.) In many applications the same tables are
1281 used throughout, so this behaviour is appropriate. Nevertheless, there
1282 are occasions when a copy of a compiled pattern and the relevant tables
1283 are needed. The pcre2_code_copy_with_tables() provides this facility.
1284 Copies of both the code and the tables are made, with the new code
1285 pointing to the new tables. The memory for the new tables is automati-
1286 cally freed when pcre2_code_free() is called for the new copy of the
1287 compiled code. If pcre2_code_copy_withy_tables() is called with a NULL
1288 argument, it returns NULL.
1290 NOTE: When one of the matching functions is called, pointers to the
1291 compiled pattern and the subject string are set in the match data block
1292 so that they can be referenced by the substring extraction functions.
1293 After running a match, you must not free a compiled pattern (or a sub-
1294 ject string) until after all operations on the match data block have
1297 The options argument for pcre2_compile() contains various bit settings
1298 that affect the compilation. It should be zero if no options are
1299 required. The available options are described below. Some of them (in
1300 particular, those that are compatible with Perl, but some others as
1301 well) can also be set and unset from within the pattern (see the
1302 detailed description in the pcre2pattern documentation).
1304 For those options that can be different in different parts of the pat-
1305 tern, the contents of the options argument specifies their settings at
1306 the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
1307 PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
1310 Other, less frequently required compile-time parameters (for example,
1311 the newline setting) can be provided in a compile context (as described
1314 If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
1315 diately. Otherwise, the variables to which these point are set to an
1316 error code and an offset (number of code units) within the pattern,
1317 respectively, when pcre2_compile() returns NULL because a compilation
1318 error has occurred. The values are not defined when compilation is suc-
1319 cessful and pcre2_compile() returns a non-NULL value.
1321 There are nearly 100 positive error codes that pcre2_compile() may
1322 return if it finds an error in the pattern. There are also some nega-
1323 tive error codes that are used for invalid UTF strings. These are the
1324 same as given by pcre2_match() and pcre2_dfa_match(), and are described
1325 in the pcre2unicode page. There is no separate documentation for the
1326 positive error codes, because the textual error messages that are
1327 obtained by calling the pcre2_get_error_message() function (see
1328 "Obtaining a textual error message" below) should be self-explanatory.
1329 Macro names starting with PCRE2_ERROR_ are defined for both positive
1330 and negative error codes in pcre2.h.
1332 The value returned in erroroffset is an indication of where in the pat-
1333 tern the error occurred. It is not necessarily the furthest point in
1334 the pattern that was read. For example, after the error "lookbehind
1335 assertion is not fixed length", the error offset points to the start of
1336 the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
1337 set is that of the first code unit of the failing character.
1339 Some errors are not detected until the whole pattern has been scanned;
1340 in these cases, the offset passed back is the length of the pattern.
1341 Note that the offset is in code units, not characters, even in a UTF
1342 mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
1345 This code fragment shows a typical straightforward call to pcre2_com-
1349 PCRE2_SIZE erroffset;
1352 "^A.*Z", /* the pattern */
1353 PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
1354 0, /* default options */
1355 &errorcode, /* for error code */
1356 &erroffset, /* for error offset */
1357 NULL); /* no compile context */
1359 The following names for option bits are defined in the pcre2.h header
1364 If this bit is set, the pattern is forced to be "anchored", that is, it
1365 is constrained to match only at the first matching point in the string
1366 that is being searched (the "subject string"). This effect can also be
1367 achieved by appropriate constructs in the pattern itself, which is the
1368 only way to do it in Perl.
1370 PCRE2_ALLOW_EMPTY_CLASS
1372 By default, for compatibility with Perl, a closing square bracket that
1373 immediately follows an opening one is treated as a data character for
1374 the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
1375 class, which therefore contains no characters and so can never match.
1379 This option request alternative handling of three escape sequences,
1380 which makes PCRE2's behaviour more like ECMAscript (aka JavaScript).
1383 (1) \U matches an upper case "U" character; by default \U causes a com-
1384 pile time error (Perl uses \U to upper case subsequent characters).
1386 (2) \u matches a lower case "u" character unless it is followed by four
1387 hexadecimal digits, in which case the hexadecimal number defines the
1388 code point to match. By default, \u causes a compile time error (Perl
1389 uses it to upper case the following character).
1391 (3) \x matches a lower case "x" character unless it is followed by two
1392 hexadecimal digits, in which case the hexadecimal number defines the
1393 code point to match. By default, as in Perl, a hexadecimal number is
1394 always expected after \x, but it may have zero, one, or two digits (so,
1395 for example, \xz matches a binary zero character followed by z).
1397 PCRE2_ALT_CIRCUMFLEX
1399 In multiline mode (when PCRE2_MULTILINE is set), the circumflex
1400 metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
1401 is set), and also after any internal newline. However, it does not
1402 match after a newline at the end of the subject, for compatibility with
1403 Perl. If you want a multiline circumflex also to match after a termi-
1404 nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
1408 By default, for compatibility with Perl, the name in any verb sequence
1409 such as (*MARK:NAME) is any sequence of characters that does not
1410 include a closing parenthesis. The name is not processed in any way,
1411 and it is not possible to include a closing parenthesis in the name.
1412 However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
1413 processing is applied to verb names and only an unescaped closing
1414 parenthesis terminates the name. A closing parenthesis can be included
1415 in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
1416 PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
1417 whitespace in verb names is skipped and #-comments are recognized,
1418 exactly as in the rest of the pattern.
1422 If this bit is set, pcre2_compile() automatically inserts callout
1423 items, all with number 255, before each pattern item, except immedi-
1424 ately before or after an explicit callout in the pattern. For discus-
1425 sion of the callout facility, see the pcre2callout documentation.
1429 If this bit is set, letters in the pattern match both upper and lower
1430 case letters in the subject. It is equivalent to Perl's /i option, and
1431 it can be changed within a pattern by a (?i) option setting. If
1432 PCRE2_UTF is set, Unicode properties are used for all characters with
1433 more than one other case, and for all characters whose code points are
1434 greater than U+007F. For lower valued characters with only one other
1435 case, a lookup table is used for speed. When PCRE2_UTF is not set, a
1436 lookup table is used for all code points less than 256, and higher code
1437 points (available only in 16-bit or 32-bit mode) are treated as not
1438 having another case.
1440 PCRE2_DOLLAR_ENDONLY
1442 If this bit is set, a dollar metacharacter in the pattern matches only
1443 at the end of the subject string. Without this option, a dollar also
1444 matches immediately before a newline at the end of the string (but not
1445 before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
1446 if PCRE2_MULTILINE is set. There is no equivalent to this option in
1447 Perl, and no way to set it within a pattern.
1451 If this bit is set, a dot metacharacter in the pattern matches any
1452 character, including one that indicates a newline. However, it only
1453 ever matches one character, even if newlines are coded as CRLF. Without
1454 this option, a dot does not match when the current position in the sub-
1455 ject is at a newline. This option is equivalent to Perl's /s option,
1456 and it can be changed within a pattern by a (?s) option setting. A neg-
1457 ative class such as [^a] always matches newline characters, and the \N
1458 escape sequence always matches a non-newline character, independent of
1459 the setting of PCRE2_DOTALL.
1463 If this bit is set, names used to identify capturing subpatterns need
1464 not be unique. This can be helpful for certain types of pattern when it
1465 is known that only one instance of the named subpattern can ever be
1466 matched. There are more details of named subpatterns below; see also
1467 the pcre2pattern documentation.
1471 If this bit is set, the end of any pattern match must be right at the
1472 end of the string being searched (the "subject string"). If the pattern
1473 match succeeds by reaching (*ACCEPT), but does not reach the end of the
1474 subject, the match fails at the current starting point. For unanchored
1475 patterns, a new match is then tried at the next starting point. How-
1476 ever, if the match succeeds by reaching the end of the pattern, but not
1477 the end of the subject, backtracking occurs and an alternative match
1478 may be found. Consider these two patterns:
1483 If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
1484 "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
1485 can also be achieved by appropriate constructs in the pattern itself,
1486 which is the only way to do it in Perl.
1488 For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
1489 to the first (that is, the longest) matched string. Other parallel
1490 matches, which are necessarily substrings of the first one, must obvi-
1491 ously end before the end of the subject.
1495 If this bit is set, most white space characters in the pattern are
1496 totally ignored except when escaped or inside a character class. How-
1497 ever, white space is not allowed within sequences such as (?> that
1498 introduce various parenthesized subpatterns, nor within numerical quan-
1499 tifiers such as {1,3}. Ignorable white space is permitted between an
1500 item and a following quantifier and between a quantifier and a follow-
1501 ing + that indicates possessiveness. PCRE2_EXTENDED is equivalent to
1502 Perl's /x option, and it can be changed within a pattern by a (?x)
1505 When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
1506 nizes as white space only those characters with code points less than
1507 256 that are flagged as white space in its low-character table. The ta-
1508 ble is normally created by pcre2_maketables(), which uses the isspace()
1509 function to identify space characters. In most ASCII environments, the
1510 relevant characters are those with code points 0x0009 (tab), 0x000A
1511 (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
1512 return), and 0x0020 (space).
1514 When PCRE2 is compiled with Unicode support, in addition to these char-
1515 acters, five more Unicode "Pattern White Space" characters are recog-
1516 nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
1517 right mark), U+200F (right-to-left mark), U+2028 (line separator), and
1518 U+2029 (paragraph separator). This set of characters is the same as
1519 recognized by Perl's /x option. Note that the horizontal and vertical
1520 space characters that are matched by the \h and \v escapes in patterns
1521 are a much bigger set.
1523 As well as ignoring most white space, PCRE2_EXTENDED also causes char-
1524 acters between an unescaped # outside a character class and the next
1525 newline, inclusive, to be ignored, which makes it possible to include
1526 comments inside complicated patterns. Note that the end of this type of
1527 comment is a literal newline sequence in the pattern; escape sequences
1528 that happen to represent a newline do not count.
1530 Which characters are interpreted as newlines can be specified by a set-
1531 ting in the compile context that is passed to pcre2_compile() or by a
1532 special sequence at the start of the pattern, as described in the sec-
1533 tion entitled "Newline conventions" in the pcre2pattern documentation.
1534 A default is defined when PCRE2 is built.
1538 This option has the effect of PCRE2_EXTENDED, but, in addition,
1539 unescaped space and horizontal tab characters are ignored inside a
1540 character class. Note: only these two characters are ignored, not the
1541 full set of pattern white space characters that are ignored outside a
1542 character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
1543 option, and it can be changed within a pattern by a (?xx) option set-
1548 If this option is set, the start of an unanchored pattern match must be
1549 before or at the first newline in the subject string following the
1550 start of matching, though the matched text may continue over the new-
1551 line. If startoffset is non-zero, the limiting newline is not necessar-
1552 ily the first newline in the subject. For example, if the subject
1553 string is "abc\nxyz" (where \n represents a single-character newline) a
1554 pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
1555 greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
1556 general limiting facility. If PCRE2_FIRSTLINE is set with an offset
1557 limit, a match must occur in the first line and also within the offset
1558 limit. In other words, whichever limit comes first is used.
1562 If this option is set, all meta-characters in the pattern are disabled,
1563 and it is treated as a literal string. Matching literal strings with a
1564 regular expression engine is not the most efficient way of doing it. If
1565 you are doing a lot of literal matching and are worried about effi-
1566 ciency, you should consider using other approaches. The only other main
1567 options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
1568 PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
1569 PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
1570 PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
1571 PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
1574 PCRE2_MATCH_UNSET_BACKREF
1576 If this option is set, a backreference to an unset subpattern group
1577 matches an empty string (by default this causes the current matching
1578 alternative to fail). A pattern such as (\1)(a) succeeds when this
1579 option is set (assuming it can find an "a" in the subject), whereas it
1580 fails by default, for Perl compatibility. Setting this option makes
1581 PCRE2 behave more like ECMAscript (aka JavaScript).
1585 By default, for the purposes of matching "start of line" and "end of
1586 line", PCRE2 treats the subject string as consisting of a single line
1587 of characters, even if it actually contains newlines. The "start of
1588 line" metacharacter (^) matches only at the start of the string, and
1589 the "end of line" metacharacter ($) matches only at the end of the
1590 string, or before a terminating newline (except when PCRE2_DOL-
1591 LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
1592 the "any character" metacharacter (.) does not match at a newline. This
1593 behaviour (for ^, $, and dot) is the same as Perl.
1595 When PCRE2_MULTILINE it is set, the "start of line" and "end of line"
1596 constructs match immediately following or immediately before internal
1597 newlines in the subject string, respectively, as well as at the very
1598 start and end. This is equivalent to Perl's /m option, and it can be
1599 changed within a pattern by a (?m) option setting. Note that the "start
1600 of line" metacharacter does not match after a newline at the end of the
1601 subject, for compatibility with Perl. However, you can change this by
1602 setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
1603 subject string, or no occurrences of ^ or $ in a pattern, setting
1604 PCRE2_MULTILINE has no effect.
1606 PCRE2_NEVER_BACKSLASH_C
1608 This option locks out the use of \C in the pattern that is being com-
1609 piled. This escape can cause unpredictable behaviour in UTF-8 or
1610 UTF-16 modes, because it may leave the current matching point in the
1611 middle of a multi-code-unit character. This option may be useful in
1612 applications that process patterns from external sources. Note that
1613 there is also a build-time option that permanently locks out the use of
1618 This option locks out the use of Unicode properties for handling \B,
1619 \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
1620 described for the PCRE2_UCP option below. In particular, it prevents
1621 the creator of the pattern from enabling this facility by starting the
1622 pattern with (*UCP). This option may be useful in applications that
1623 process patterns from external sources. The option combination PCRE_UCP
1624 and PCRE_NEVER_UCP causes an error.
1628 This option locks out interpretation of the pattern as UTF-8, UTF-16,
1629 or UTF-32, depending on which library is in use. In particular, it pre-
1630 vents the creator of the pattern from switching to UTF interpretation
1631 by starting the pattern with (*UTF). This option may be useful in
1632 applications that process patterns from external sources. The combina-
1633 tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
1635 PCRE2_NO_AUTO_CAPTURE
1637 If this option is set, it disables the use of numbered capturing paren-
1638 theses in the pattern. Any opening parenthesis that is not followed by
1639 ? behaves as if it were followed by ?: but named parentheses can still
1640 be used for capturing (and they acquire numbers in the usual way). This
1641 is the same as Perl's /n option. Note that, when this option is set,
1642 references to capturing groups (backreferences or recursion/subroutine
1643 calls) may only refer to named groups, though the reference can be by
1646 PCRE2_NO_AUTO_POSSESS
1648 If this option is set, it disables "auto-possessification", which is an
1649 optimization that, for example, turns a+b into a++b in order to avoid
1650 backtracks into a+ that can never be successful. However, if callouts
1651 are in use, auto-possessification means that some callouts are never
1652 taken. You can set this option if you want the matching functions to do
1653 a full unoptimized search and run all the callouts, but it is mainly
1654 provided for testing purposes.
1656 PCRE2_NO_DOTSTAR_ANCHOR
1658 If this option is set, it disables an optimization that is applied when
1659 .* is the first significant item in a top-level branch of a pattern,
1660 and all the other branches also start with .* or with \A or \G or ^.
1661 The optimization is automatically disabled for .* if it is inside an
1662 atomic group or a capturing group that is the subject of a backrefer-
1663 ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
1664 mization is not disabled, such a pattern is automatically anchored if
1665 PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
1666 for any ^ items. Otherwise, the fact that any match must start either
1667 at the start of the subject or following a newline is remembered. Like
1668 other optimizations, this can cause callouts to be skipped.
1670 PCRE2_NO_START_OPTIMIZE
1672 This is an option whose main effect is at matching time. It does not
1673 change what pcre2_compile() generates, but it does affect the output of
1676 There are a number of optimizations that may occur at the start of a
1677 match, in order to speed up the process. For example, if it is known
1678 that an unanchored match must start with a specific code unit value,
1679 the matching code searches the subject for that value, and fails imme-
1680 diately if it cannot find it, without actually running the main match-
1681 ing function. This means that a special item such as (*COMMIT) at the
1682 start of a pattern is not considered until after a suitable starting
1683 point for the match has been found. Also, when callouts or (*MARK)
1684 items are in use, these "start-up" optimizations can cause them to be
1685 skipped if the pattern is never actually used. The start-up optimiza-
1686 tions are in effect a pre-scan of the subject that takes place before
1689 The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
1690 possibly causing performance to suffer, but ensuring that in cases
1691 where the result is "no match", the callouts do occur, and that items
1692 such as (*COMMIT) and (*MARK) are considered at every possible starting
1693 position in the subject string.
1695 Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
1696 operation. Consider the pattern
1700 When this is compiled, PCRE2 records the fact that a match must start
1701 with the character "A". Suppose the subject string is "DEFABC". The
1702 start-up optimization scans along the subject, finds "A" and runs the
1703 first match attempt from there. The (*COMMIT) item means that the pat-
1704 tern must match the current starting position, which in this case, it
1705 does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
1706 set, the initial scan along the subject string does not happen. The
1707 first match attempt is run starting from "D" and when this fails,
1708 (*COMMIT) prevents any further matches being tried, so the overall
1709 result is "no match".
1711 There are also other start-up optimizations. For example, a minimum
1712 length for the subject may be recorded. Consider the pattern
1716 The minimum length for a match is one character. If the subject is
1717 "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
1718 to match an empty string at the end of the subject does not take place,
1719 because PCRE2 knows that the subject is now too short, and so the
1720 (*MARK) is never encountered. In this case, the optimization does not
1721 affect the overall match result, which is still "no match", but it does
1722 affect the auxiliary information that is returned.
1726 When PCRE2_UTF is set, the validity of the pattern as a UTF string is
1727 automatically checked. There are discussions about the validity of
1728 UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
1729 document. If an invalid UTF sequence is found, pcre2_compile() returns
1730 a negative error code.
1732 If you know that your pattern is a valid UTF string, and you want to
1733 skip this check for performance reasons, you can set the
1734 PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
1735 invalid UTF string as a pattern is undefined. It may cause your program
1738 Note that this option can also be passed to pcre2_match() and
1739 pcre_dfa_match(), to suppress UTF validity checking of the subject
1742 Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
1743 able the error that is given if an escape sequence for an invalid Uni-
1744 code code point is encountered in the pattern. In particular, the so-
1745 called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
1746 want to allow escape sequences such as \x{d800} you can set the
1747 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
1748 section entitled "Extra compile options" below. However, this is pos-
1749 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
1750 resentable in UTF-16.
1754 This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
1755 \w, and some of the POSIX character classes. By default, only ASCII
1756 characters are recognized, but if PCRE2_UCP is set, Unicode properties
1757 are used instead to classify characters. More details are given in the
1758 section on generic character types in the pcre2pattern page. If you set
1759 PCRE2_UCP, matching one of the items it affects takes much longer. The
1760 option is available only if PCRE2 has been compiled with Unicode sup-
1761 port (which is the default).
1765 This option inverts the "greediness" of the quantifiers so that they
1766 are not greedy by default, but become greedy if followed by "?". It is
1767 not compatible with Perl. It can also be set by a (?U) option setting
1770 PCRE2_USE_OFFSET_LIMIT
1772 This option must be set for pcre2_compile() if pcre2_set_offset_limit()
1773 is going to be used to set a non-default offset limit in a match con-
1774 text for matches that use this pattern. An error is generated if an
1775 offset limit is set without this option. For more details, see the
1776 description of pcre2_set_offset_limit() in the section that describes
1777 match contexts. See also the PCRE2_FIRSTLINE option above.
1781 This option causes PCRE2 to regard both the pattern and the subject
1782 strings that are subsequently processed as strings of UTF characters
1783 instead of single-code-unit strings. It is available when PCRE2 is
1784 built to include Unicode support (which is the default). If Unicode
1785 support is not available, the use of this option provokes an error.
1786 Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
1787 the pcre2unicode page. In particular, note that it changes the way
1788 PCRE2_CASELESS handles characters with code points greater than 127.
1790 Extra compile options
1792 Unlike the main compile-time options, the extra options are not saved
1793 with the compiled pattern. The option bits that can be set in a compile
1794 context by calling the pcre2_set_compile_extra_options() function are
1797 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
1799 This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
1800 It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
1801 "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
1802 in UTF-16 to encode code points with values in the range 0x10000 to
1803 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
1804 They can be represented in UTF-8 and UTF-32, but are defined as invalid
1805 code points, and cause errors if encountered in a UTF-8 or UTF-32
1806 string that is being checked for validity by PCRE2.
1808 These values also cause errors if encountered in escape sequences such
1809 as \x{d912} within a pattern. However, it seems that some applications,
1810 when using PCRE2 to check for unwanted characters in UTF-8 strings,
1811 explicitly test for the surrogates using escape sequences. The
1812 PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
1813 because it applies only to the testing of input strings for UTF valid-
1816 If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
1817 gate code point values in UTF-8 and UTF-32 patterns no longer provoke
1818 errors and are incorporated in the compiled pattern. However, they can
1819 only match subject characters if the matching function is called with
1820 PCRE2_NO_UTF_CHECK set.
1822 PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
1824 This is a dangerous option. Use with care. By default, an unrecognized
1825 escape such as \j or a malformed one such as \x{2z} causes a compile-
1826 time error when detected by pcre2_compile(). Perl is somewhat inconsis-
1827 tent in handling such items: for example, \j is treated as a literal
1828 "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
1829 ings are given in both cases if Perl's warning switch is enabled. How-
1830 ever, a malformed octal number after \o{ always causes an error in
1833 If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
1834 pcre2_compile(), all unrecognized or erroneous escape sequences are
1835 treated as single-character escapes. For example, \j is a literal "j"
1836 and \x{2z} is treated as the literal string "x{2z}". Setting this
1837 option means that typos in patterns may go undetected and have unex-
1838 pected results. This is a dangerous option. Use with care.
1840 PCRE2_EXTRA_MATCH_LINE
1842 This option is provided for use by the -x option of pcre2grep. It
1843 causes the pattern only to match complete lines. This is achieved by
1844 automatically inserting the code for "^(?:" at the start of the com-
1845 piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
1846 the matched line may be in the middle of the subject string. This
1847 option can be used with PCRE2_LITERAL.
1849 PCRE2_EXTRA_MATCH_WORD
1851 This option is provided for use by the -w option of pcre2grep. It
1852 causes the pattern only to match strings that have a word boundary at
1853 the start and the end. This is achieved by automatically inserting the
1854 code for "\b(?:" at the start of the compiled pattern and ")\b" at the
1855 end. The option may be used with PCRE2_LITERAL. However, it is ignored
1856 if PCRE2_EXTRA_MATCH_LINE is also set.
1859 JUST-IN-TIME (JIT) COMPILATION
1861 int pcre2_jit_compile(pcre2_code *code, uint32_t options);
1863 int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
1864 PCRE2_SIZE length, PCRE2_SIZE startoffset,
1865 uint32_t options, pcre2_match_data *match_data,
1866 pcre2_match_context *mcontext);
1868 void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
1870 pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
1871 PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
1873 void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
1874 pcre2_jit_callback callback_function, void *callback_data);
1876 void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
1878 These functions provide support for JIT compilation, which, if the
1879 just-in-time compiler is available, further processes a compiled pat-
1880 tern into machine code that executes much faster than the pcre2_match()
1881 interpretive matching function. Full details are given in the pcre2jit
1884 JIT compilation is a heavyweight optimization. It can take some time
1885 for patterns to be analyzed, and for one-off matches and simple pat-
1886 terns the benefit of faster execution might be offset by a much slower
1887 compilation time. Most (but not all) patterns can be optimized by the
1893 PCRE2 handles caseless matching, and determines whether characters are
1894 letters, digits, or whatever, by reference to a set of tables, indexed
1895 by character code point. This applies only to characters whose code
1896 points are less than 256. By default, higher-valued code points never
1897 match escapes such as \w or \d. However, if PCRE2 is built with Uni-
1898 code support, all characters can be tested with \p and \P, or, alterna-
1899 tively, the PCRE2_UCP option can be set when a pattern is compiled;
1900 this causes \w and friends to use Unicode property support instead of
1901 the built-in tables.
1903 The use of locales with Unicode is discouraged. If you are handling
1904 characters with code points greater than 128, you should either use
1905 Unicode support, or use locales, but not try to mix the two.
1907 PCRE2 contains an internal set of character tables that are used by
1908 default. These are sufficient for many applications. Normally, the
1909 internal tables recognize only ASCII characters. However, when PCRE2 is
1910 built, it is possible to cause the internal tables to be rebuilt in the
1911 default "C" locale of the local system, which may cause them to be dif-
1914 The internal tables can be overridden by tables supplied by the appli-
1915 cation that calls PCRE2. These may be created in a different locale
1916 from the default. As more and more applications change to using Uni-
1917 code, the need for this locale support is expected to die away.
1919 External tables are built by calling the pcre2_maketables() function,
1920 in the relevant locale. The result can be passed to pcre2_compile() as
1921 often as necessary, by creating a compile context and calling
1922 pcre2_set_character_tables() to set the tables pointer therein. For
1923 example, to build and use tables that are appropriate for the French
1924 locale (where accented characters with values greater than 128 are
1925 treated as letters), the following code could be used:
1927 setlocale(LC_CTYPE, "fr_FR");
1928 tables = pcre2_maketables(NULL);
1929 ccontext = pcre2_compile_context_create(NULL);
1930 pcre2_set_character_tables(ccontext, tables);
1931 re = pcre2_compile(..., ccontext);
1933 The locale name "fr_FR" is used on Linux and other Unix-like systems;
1934 if you are using Windows, the name for the French locale is "french".
1935 It is the caller's responsibility to ensure that the memory containing
1936 the tables remains available for as long as it is needed.
1938 The pointer that is passed (via the compile context) to pcre2_compile()
1939 is saved with the compiled pattern, and the same tables are used by
1940 pcre2_match() and pcre_dfa_match(). Thus, for any single pattern, com-
1941 pilation and matching both happen in the same locale, but different
1942 patterns can be processed in different locales.
1945 INFORMATION ABOUT A COMPILED PATTERN
1947 int pcre2_pattern_info(const pcre2 *code, uint32_t what, void *where);
1949 The pcre2_pattern_info() function returns general information about a
1950 compiled pattern. For information about callouts, see the next section.
1951 The first argument for pcre2_pattern_info() is a pointer to the com-
1952 piled pattern. The second argument specifies which piece of information
1953 is required, and the third argument is a pointer to a variable to
1954 receive the data. If the third argument is NULL, the first argument is
1955 ignored, and the function returns the size in bytes of the variable
1956 that is required for the information requested. Otherwise, the yield of
1957 the function is zero for success, or one of the following negative num-
1960 PCRE2_ERROR_NULL the argument code was NULL
1961 PCRE2_ERROR_BADMAGIC the "magic number" was not found
1962 PCRE2_ERROR_BADOPTION the value of what was invalid
1963 PCRE2_ERROR_UNSET the requested field is not set
1965 The "magic number" is placed at the start of each compiled pattern as
1966 an simple check against passing an arbitrary memory pointer. Here is a
1967 typical call of pcre2_pattern_info(), to obtain the length of the com-
1972 rc = pcre2_pattern_info(
1973 re, /* result of pcre2_compile() */
1974 PCRE2_INFO_SIZE, /* what is required */
1975 &length); /* where to put the data */
1977 The possible values for the second argument are defined in pcre2.h, and
1980 PCRE2_INFO_ALLOPTIONS
1981 PCRE2_INFO_ARGOPTIONS
1982 PCRE2_INFO_EXTRAOPTIONS
1984 Return copies of the pattern's options. The third argument should point
1985 to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
1986 options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
1987 TIONS returns the compile options as modified by any top-level (*XXX)
1988 option settings such as (*UTF) at the start of the pattern itself.
1989 PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
1990 compile context by calling the pcre2_set_compile_extra_options() func-
1993 For example, if the pattern /(*UTF)abc/ is compiled with the
1994 PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
1995 PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
1996 change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
1997 TIONS, even if they appear right at the start of the pattern. (This was
1998 different in some earlier releases.)
2000 A pattern compiled without PCRE2_ANCHORED is automatically anchored by
2001 PCRE2 if the first significant item in every top-level branch is one of
2004 ^ unless PCRE2_MULTILINE is set
2007 .* sometimes - see below
2009 When .* is the first significant item, anchoring is possible only when
2010 all the following are true:
2012 .* is not in an atomic group
2013 .* is not in a capturing group that is the subject
2015 PCRE2_DOTALL is in force for .*
2016 Neither (*PRUNE) nor (*SKIP) appears in the pattern
2017 PCRE2_NO_DOTSTAR_ANCHOR is not set
2019 For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
2020 the options returned for PCRE2_INFO_ALLOPTIONS.
2022 PCRE2_INFO_BACKREFMAX
2024 Return the number of the highest backreference in the pattern. The
2025 third argument should point to an uint32_t variable. Named subpatterns
2026 acquire numbers as well as names, and these count towards the highest
2027 backreference. Backreferences such as \4 or \g{12} match the captured
2028 characters of the given group, but in addition, the check that a cap-
2029 turing group is set in a conditional subpattern such as (?(3)a|b) is
2030 also a backreference. Zero is returned if there are no backreferences.
2034 The output is a uint32_t integer whose value indicates what character
2035 sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
2036 means that \R matches any Unicode line ending sequence; a value of
2037 PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
2039 PCRE2_INFO_CAPTURECOUNT
2041 Return the highest capturing subpattern number in the pattern. In pat-
2042 terns where (?| is not used, this is also the total number of capturing
2043 subpatterns. The third argument should point to an uint32_t variable.
2045 PCRE2_INFO_DEPTHLIMIT
2047 If the pattern set a backtracking depth limit by including an item of
2048 the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
2049 third argument should point to a uint32_t integer. If no such value has
2050 been set, the call to pcre2_pattern_info() returns the error
2051 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2052 ing if it is less than the limit set or defaulted by the caller of the
2055 PCRE2_INFO_FIRSTBITMAP
2057 In the absence of a single first code unit for a non-anchored pattern,
2058 pcre2_compile() may construct a 256-bit table that defines a fixed set
2059 of values for the first code unit in any match. For example, a pattern
2060 that starts with [abc] results in a table with three bits set. When
2061 code unit values greater than 255 are supported, the flag bit for 255
2062 means "any code unit of value 255 or above". If such a table was con-
2063 structed, a pointer to it is returned. Otherwise NULL is returned. The
2064 third argument should point to a const uint8_t * variable.
2066 PCRE2_INFO_FIRSTCODETYPE
2068 Return information about the first code unit of any matched string, for
2069 a non-anchored pattern. The third argument should point to an uint32_t
2070 variable. If there is a fixed first value, for example, the letter "c"
2071 from a pattern such as (cat|cow|coyote), 1 is returned, and the value
2072 can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
2073 first value, but it is known that a match can occur only at the start
2074 of the subject or following a newline in the subject, 2 is returned.
2075 Otherwise, and for anchored patterns, 0 is returned.
2077 PCRE2_INFO_FIRSTCODEUNIT
2079 Return the value of the first code unit of any matched string for a
2080 pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
2081 The third argument should point to an uint32_t variable. In the 8-bit
2082 library, the value is always less than 256. In the 16-bit library the
2083 value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
2084 value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
2087 PCRE2_INFO_FRAMESIZE
2089 Return the size (in bytes) of the data frames that are used to remember
2090 backtracking positions when the pattern is processed by pcre2_match()
2091 without the use of JIT. The third argument should point to a size_t
2092 variable. The frame size depends on the number of capturing parentheses
2093 in the pattern. Each additional capturing group adds two PCRE2_SIZE
2096 PCRE2_INFO_HASBACKSLASHC
2098 Return 1 if the pattern contains any instances of \C, otherwise 0. The
2099 third argument should point to an uint32_t variable.
2101 PCRE2_INFO_HASCRORLF
2103 Return 1 if the pattern contains any explicit matches for CR or LF
2104 characters, otherwise 0. The third argument should point to an uint32_t
2105 variable. An explicit match is either a literal CR or LF character, or
2106 \r or \n or one of the equivalent hexadecimal or octal escape
2109 PCRE2_INFO_HEAPLIMIT
2111 If the pattern set a heap memory limit by including an item of the form
2112 (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
2113 ment should point to a uint32_t integer. If no such value has been set,
2114 the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
2115 Note that this limit will only be used during matching if it is less
2116 than the limit set or defaulted by the caller of the match function.
2120 Return 1 if the (?J) or (?-J) option setting is used in the pattern,
2121 otherwise 0. The third argument should point to an uint32_t variable.
2122 (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
2127 If the compiled pattern was successfully processed by pcre2_jit_com-
2128 pile(), return the size of the JIT compiled code, otherwise return
2129 zero. The third argument should point to a size_t variable.
2131 PCRE2_INFO_LASTCODETYPE
2133 Returns 1 if there is a rightmost literal code unit that must exist in
2134 any matched string, other than at its start. The third argument should
2135 point to an uint32_t variable. If there is no such value, 0 is
2136 returned. When 1 is returned, the code unit value itself can be
2137 retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
2138 literal value is recorded only if it follows something of variable
2139 length. For example, for the pattern /^a\d+z\d+/ the returned value is
2140 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
2141 the returned value is 0.
2143 PCRE2_INFO_LASTCODEUNIT
2145 Return the value of the rightmost literal code unit that must exist in
2146 any matched string, other than at its start, for a pattern where
2147 PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
2148 ment should point to an uint32_t variable.
2150 PCRE2_INFO_MATCHEMPTY
2152 Return 1 if the pattern might match an empty string, otherwise 0. The
2153 third argument should point to an uint32_t variable. When a pattern
2154 contains recursive subroutine calls it is not always possible to deter-
2155 mine whether or not it can match an empty string. PCRE2 takes a cau-
2156 tious approach and returns 1 in such cases.
2158 PCRE2_INFO_MATCHLIMIT
2160 If the pattern set a match limit by including an item of the form
2161 (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
2162 argument should point to a uint32_t integer. If no such value has been
2163 set, the call to pcre2_pattern_info() returns the error
2164 PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
2165 ing if it is less than the limit set or defaulted by the caller of the
2168 PCRE2_INFO_MAXLOOKBEHIND
2170 Return the number of characters (not code units) in the longest lookbe-
2171 hind assertion in the pattern. The third argument should point to a
2172 uint32_t integer. This information is useful when doing multi-segment
2173 matching using the partial matching facilities. Note that the simple
2174 assertions \b and \B require a one-character lookbehind. \A also regis-
2175 ters a one-character lookbehind, though it does not actually inspect
2176 the previous character. This is to ensure that at least one character
2177 from the old segment is retained when a new segment is processed. Oth-
2178 erwise, if there are no lookbehinds in the pattern, \A might match
2179 incorrectly at the start of a second or subsequent segment.
2181 PCRE2_INFO_MINLENGTH
2183 If a minimum length for matching subject strings was computed, its
2184 value is returned. Otherwise the returned value is 0. The value is a
2185 number of characters, which in UTF mode may be different from the num-
2186 ber of code units. The third argument should point to an uint32_t
2187 variable. The value is a lower bound to the length of any matching
2188 string. There may not be any strings of that length that do actually
2189 match, but every string that does match is at least that long.
2191 PCRE2_INFO_NAMECOUNT
2192 PCRE2_INFO_NAMEENTRYSIZE
2193 PCRE2_INFO_NAMETABLE
2195 PCRE2 supports the use of named as well as numbered capturing parenthe-
2196 ses. The names are just an additional way of identifying the parenthe-
2197 ses, which still acquire numbers. Several convenience functions such as
2198 pcre2_substring_get_byname() are provided for extracting captured sub-
2199 strings by name. It is also possible to extract the data directly, by
2200 first converting the name to a number in order to access the correct
2201 pointers in the output vector (described with pcre2_match() below). To
2202 do the conversion, you need to use the name-to-number map, which is
2203 described by these three values.
2205 The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
2206 COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
2207 the size of each entry in code units; both of these return a uint32_t
2208 value. The entry size depends on the length of the longest name.
2210 PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
2211 This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
2212 library, the first two bytes of each entry are the number of the cap-
2213 turing parenthesis, most significant byte first. In the 16-bit library,
2214 the pointer points to 16-bit code units, the first of which contains
2215 the parenthesis number. In the 32-bit library, the pointer points to
2216 32-bit code units, the first of which contains the parenthesis number.
2217 The rest of the entry is the corresponding name, zero terminated.
2219 The names are in alphabetical order. If (?| is used to create multiple
2220 groups with the same number, as described in the section on duplicate
2221 subpattern numbers in the pcre2pattern page, the groups may be given
2222 the same name, but there is only one entry in the table. Different
2223 names for groups of the same number are not permitted.
2225 Duplicate names for subpatterns with different numbers are permitted,
2226 but only if PCRE2_DUPNAMES is set. They appear in the table in the
2227 order in which they were found in the pattern. In the absence of (?|
2228 this is the order of increasing number; when (?| is used this is not
2229 necessarily the case because later subpatterns may have lower numbers.
2231 As a simple example of the name/number table, consider the following
2232 pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
2233 is set, so white space - including newlines - is ignored):
2235 (?<date> (?<year>(\d\d)?\d\d) -
2236 (?<month>\d\d) - (?<day>\d\d) )
2238 There are four named subpatterns, so the table has four entries, and
2239 each entry in the table is eight bytes long. The table is as follows,
2240 with non-printing bytes shows in hexadecimal, and undefined bytes shown
2244 00 05 d a y 00 ?? ??
2248 When writing code to extract data from named subpatterns using the
2249 name-to-number map, remember that the length of the entries is likely
2250 to be different for each compiled pattern.
2254 The output is one of the following uint32_t values:
2256 PCRE2_NEWLINE_CR Carriage return (CR)
2257 PCRE2_NEWLINE_LF Linefeed (LF)
2258 PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
2259 PCRE2_NEWLINE_ANY Any Unicode line ending
2260 PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
2261 PCRE2_NEWLINE_NUL The NUL character (binary zero)
2263 This identifies the character sequence that will be recognized as mean-
2264 ing "newline" while matching.
2268 Return the size of the compiled pattern in bytes (for all three
2269 libraries). The third argument should point to a size_t variable. This
2270 value includes the size of the general data block that precedes the
2271 code units of the compiled pattern itself. The value that is used when
2272 pcre2_compile() is getting memory in which to place the compiled pat-
2273 tern may be slightly larger than the value returned by this option,
2274 because there are cases where the code that calculates the size has to
2275 over-estimate. Processing a pattern with the JIT compiler does not
2276 alter the value returned by this option.
2279 INFORMATION ABOUT A PATTERN'S CALLOUTS
2281 int pcre2_callout_enumerate(const pcre2_code *code,
2282 int (*callback)(pcre2_callout_enumerate_block *, void *),
2285 A script language that supports the use of string arguments in callouts
2286 might like to scan all the callouts in a pattern before running the
2287 match. This can be done by calling pcre2_callout_enumerate(). The first
2288 argument is a pointer to a compiled pattern, the second points to a
2289 callback function, and the third is arbitrary user data. The callback
2290 function is called for every callout in the pattern in the order in
2291 which they appear. Its first argument is a pointer to a callout enumer-
2292 ation block, and its second argument is the user_data value that was
2293 passed to pcre2_callout_enumerate(). The contents of the callout enu-
2294 meration block are described in the pcre2callout documentation, which
2295 also gives further details about callouts.
2298 SERIALIZATION AND PRECOMPILING
2300 It is possible to save compiled patterns on disc or elsewhere, and
2301 reload them later, subject to a number of restrictions. The host on
2302 which the patterns are reloaded must be running the same version of
2303 PCRE2, with the same code unit width, and must also have the same endi-
2304 anness, pointer width, and PCRE2_SIZE type. Before compiled patterns
2305 can be saved, they must be converted to a "serialized" form, which in
2306 the case of PCRE2 is really just a bytecode dump. The functions whose
2307 names begin with pcre2_serialize_ are used for converting to and from
2308 the serialized form. They are described in the pcre2serialize documen-
2309 tation. Note that PCRE2 serialization does not convert compiled pat-
2310 terns to an abstract format like Java or .NET serialization.
2313 THE MATCH DATA BLOCK
2315 pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
2316 pcre2_general_context *gcontext);
2318 pcre2_match_data *pcre2_match_data_create_from_pattern(
2319 const pcre2_code *code, pcre2_general_context *gcontext);
2321 void pcre2_match_data_free(pcre2_match_data *match_data);
2323 Information about a successful or unsuccessful match is placed in a
2324 match data block, which is an opaque structure that is accessed by
2325 function calls. In particular, the match data block contains a vector
2326 of offsets into the subject string that define the matched part of the
2327 subject and any substrings that were captured. This is known as the
2330 Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
2331 you must create a match data block by calling one of the creation func-
2332 tions above. For pcre2_match_data_create(), the first argument is the
2333 number of pairs of offsets in the ovector. One pair of offsets is
2334 required to identify the string that matched the whole pattern, with an
2335 additional pair for each captured substring. For example, a value of 4
2336 creates enough space to record the matched portion of the subject plus
2337 three captured substrings. A minimum of at least 1 pair is imposed by
2338 pcre2_match_data_create(), so it is always possible to return the over-
2341 The second argument of pcre2_match_data_create() is a pointer to a gen-
2342 eral context, which can specify custom memory management for obtaining
2343 the memory for the match data block. If you are not using custom memory
2344 management, pass NULL, which causes malloc() to be used.
2346 For pcre2_match_data_create_from_pattern(), the first argument is a
2347 pointer to a compiled pattern. The ovector is created to be exactly the
2348 right size to hold all the substrings a pattern might capture. The sec-
2349 ond argument is again a pointer to a general context, but in this case
2350 if NULL is passed, the memory is obtained using the same allocator that
2351 was used for the compiled pattern (custom or default).
2353 A match data block can be used many times, with the same or different
2354 compiled patterns. You can extract information from a match data block
2355 after a match operation has finished, using functions that are
2356 described in the sections on matched strings and other match data
2359 When a call of pcre2_match() fails, valid data is available in the
2360 match block only when the error is PCRE2_ERROR_NOMATCH,
2361 PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
2362 string. Exactly what is available depends on the error, and is detailed
2365 When one of the matching functions is called, pointers to the compiled
2366 pattern and the subject string are set in the match data block so that
2367 they can be referenced by the extraction functions. After running a
2368 match, you must not free a compiled pattern or a subject string until
2369 after all operations on the match data block (for that match) have
2372 When a match data block itself is no longer needed, it should be freed
2373 by calling pcre2_match_data_free(). If this function is called with a
2374 NULL argument, it returns immediately, without doing anything.
2377 MATCHING A PATTERN: THE TRADITIONAL FUNCTION
2379 int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
2380 PCRE2_SIZE length, PCRE2_SIZE startoffset,
2381 uint32_t options, pcre2_match_data *match_data,
2382 pcre2_match_context *mcontext);
2384 The function pcre2_match() is called to match a subject string against
2385 a compiled pattern, which is passed in the code argument. You can call
2386 pcre2_match() with the same code argument as many times as you like, in
2387 order to find multiple matches in the subject string or to match dif-
2388 ferent subject strings with the same pattern.
2390 This function is the main matching facility of the library, and it
2391 operates in a Perl-like manner. For specialist use there is also an
2392 alternative matching function, which is described below in the section
2393 about the pcre2_dfa_match() function.
2395 Here is an example of a simple call to pcre2_match():
2397 pcre2_match_data *md = pcre2_match_data_create(4, NULL);
2398 int rc = pcre2_match(
2399 re, /* result of pcre2_compile() */
2400 "some string", /* the subject string */
2401 11, /* the length of the subject string */
2402 0, /* start at offset 0 in the subject */
2403 0, /* default options */
2404 md, /* the match data block */
2405 NULL); /* a match context; NULL means use defaults */
2407 If the subject string is zero-terminated, the length can be given as
2408 PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
2409 common matching parameters are to be changed. For details, see the sec-
2410 tion on the match context above.
2412 The string to be matched by pcre2_match()
2414 The subject string is passed to pcre2_match() as a pointer in subject,
2415 a length in length, and a starting offset in startoffset. The length
2416 and offset are in code units, not characters. That is, they are in
2417 bytes for the 8-bit library, 16-bit code units for the 16-bit library,
2418 and 32-bit code units for the 32-bit library, whether or not UTF pro-
2421 If startoffset is greater than the length of the subject, pcre2_match()
2422 returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
2423 search for a match starts at the beginning of the subject, and this is
2424 by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
2425 set must point to the start of a character, or to the end of the sub-
2426 ject (in UTF-32 mode, one code unit equals one character, so all off-
2427 sets are valid). Like the pattern string, the subject may contain
2430 A non-zero starting offset is useful when searching for another match
2431 in the same subject by calling pcre2_match() again after a previous
2432 success. Setting startoffset differs from passing over a shortened
2433 string and setting PCRE2_NOTBOL in the case of a pattern that begins
2434 with any kind of lookbehind. For example, consider the pattern
2438 which finds occurrences of "iss" in the middle of words. (\B matches
2439 only if the current position in the subject is not a word boundary.)
2440 When applied to the string "Mississipi" the first call to pcre2_match()
2441 finds the first occurrence. If pcre2_match() is called again with just
2442 the remainder of the subject, namely "issipi", it does not match,
2443 because \B is always false at the start of the subject, which is deemed
2444 to be a word boundary. However, if pcre2_match() is passed the entire
2445 string again, but with startoffset set to 4, it finds the second occur-
2446 rence of "iss" because it is able to look behind the starting point to
2447 discover that it is preceded by a letter.
2449 Finding all the matches in a subject is tricky when the pattern can
2450 match an empty string. It is possible to emulate Perl's /g behaviour by
2451 first trying the match again at the same offset, with the
2452 PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
2453 fails, advancing the starting offset and trying an ordinary match
2454 again. There is some code that demonstrates how to do this in the
2455 pcre2demo sample program. In the most general case, you have to check
2456 to see if the newline convention recognizes CRLF as a newline, and if
2457 so, and the current character is CR followed by LF, advance the start-
2458 ing offset by two characters instead of one.
2460 If a non-zero starting offset is passed when the pattern is anchored, a
2461 single attempt to match at the given offset is made. This can only suc-
2462 ceed if the pattern does not require the match to be at the start of
2463 the subject. In other words, the anchoring must be the result of set-
2464 ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
2465 by starting the pattern with ^ or \A.
2467 Option bits for pcre2_match()
2469 The unused bits of the options argument for pcre2_match() must be zero.
2470 The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
2471 PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
2472 PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
2473 TIAL_SOFT. Their action is described below.
2475 Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
2476 ported by the just-in-time (JIT) compiler. If it is set, JIT matching
2477 is disabled and the interpretive code in pcre2_match() is run. Apart
2478 from PCRE2_NO_JIT (obviously), the remaining options are supported for
2483 The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
2484 matching position. If a pattern was compiled with PCRE2_ANCHORED, or
2485 turned out to be anchored by virtue of its contents, it cannot be made
2486 unachored at matching time. Note that setting the option at match time
2487 disables JIT matching.
2491 If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
2492 matches must be right at the end of the subject string. Note that set-
2493 ting the option at match time disables JIT matching.
2497 This option specifies that first character of the subject string is not
2498 the beginning of a line, so the circumflex metacharacter should not
2499 match before it. Setting this without having set PCRE2_MULTILINE at
2500 compile time causes circumflex never to match. This option affects only
2501 the behaviour of the circumflex metacharacter. It does not affect \A.
2505 This option specifies that the end of the subject string is not the end
2506 of a line, so the dollar metacharacter should not match it nor (except
2507 in multiline mode) a newline immediately before it. Setting this with-
2508 out having set PCRE2_MULTILINE at compile time causes dollar never to
2509 match. This option affects only the behaviour of the dollar metacharac-
2510 ter. It does not affect \Z or \z.
2514 An empty string is not considered to be a valid match if this option is
2515 set. If there are alternatives in the pattern, they are tried. If all
2516 the alternatives match the empty string, the entire match fails. For
2517 example, if the pattern
2521 is applied to a string not beginning with "a" or "b", it matches an
2522 empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
2523 match is not valid, so pcre2_match() searches further into the string
2524 for occurrences of "a" or "b".
2526 PCRE2_NOTEMPTY_ATSTART
2528 This is like PCRE2_NOTEMPTY, except that it locks out an empty string
2529 match only at the first matching position, that is, at the start of the
2530 subject plus the starting offset. An empty string match later in the
2531 subject is permitted. If the pattern is anchored, such a match can
2532 occur only if the pattern contains \K.
2536 By default, if a pattern has been successfully processed by
2537 pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
2538 called with options that JIT supports. Setting PCRE2_NO_JIT disables
2539 the use of JIT; it forces matching to be done by the interpreter.
2543 When PCRE2_UTF is set at compile time, the validity of the subject as a
2544 UTF string is checked by default when pcre2_match() is subsequently
2545 called. If a non-zero starting offset is given, the check is applied
2546 only to that part of the subject that could be inspected during match-
2547 ing, and there is a check that the starting offset points to the first
2548 code unit of a character or to the end of the subject. If there are no
2549 lookbehind assertions in the pattern, the check starts at the starting
2550 offset. Otherwise, it starts at the length of the longest lookbehind
2551 before the starting offset, or at the start of the subject if there are
2552 not that many characters before the starting offset. Note that the
2553 sequences \b and \B are one-character lookbehinds.
2555 The check is carried out before any other processing takes place, and a
2556 negative error code is returned if the check fails. There are several
2557 UTF error codes for each code unit width, corresponding to different
2558 problems with the code unit sequence. There are discussions about the
2559 validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
2562 If you know that your subject is valid, and you want to skip these
2563 checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK
2564 option when calling pcre2_match(). You might want to do this for the
2565 second and subsequent calls to pcre2_match() if you are making repeated
2566 calls to find other matches in the same subject string.
2568 Warning: When PCRE2_NO_UTF_CHECK is set, the effect of passing an
2569 invalid string as a subject, or an invalid value of startoffset, is
2570 undefined. Your program may crash or loop indefinitely.
2575 These options turn on the partial matching feature. A partial match
2576 occurs if the end of the subject string is reached successfully, but
2577 there are not enough subject characters to complete the match. If this
2578 happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
2579 matching continues by testing any remaining alternatives. Only if no
2580 complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
2581 PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
2582 the caller is prepared to handle a partial match, but only if no com-
2583 plete match can be found.
2585 If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
2586 case, if a partial match is found, pcre2_match() immediately returns
2587 PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
2588 other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
2589 ered to be more important that an alternative complete match.
2591 There is a more detailed discussion of partial and multi-segment match-
2592 ing, with examples, in the pcre2partial documentation.
2595 NEWLINE HANDLING WHEN MATCHING
2597 When PCRE2 is built, a default newline convention is set; this is usu-
2598 ally the standard convention for the operating system. The default can
2599 be overridden in a compile context by calling pcre2_set_newline(). It
2600 can also be overridden by starting a pattern string with, for example,
2601 (*CRLF), as described in the section on newline conventions in the
2602 pcre2pattern page. During matching, the newline choice affects the be-
2603 haviour of the dot, circumflex, and dollar metacharacters. It may also
2604 alter the way the match starting position is advanced after a match
2605 failure for an unanchored pattern.
2607 When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
2608 set as the newline convention, and a match attempt for an unanchored
2609 pattern fails when the current starting position is at a CRLF sequence,
2610 and the pattern contains no explicit matches for CR or LF characters,
2611 the match position is advanced by two characters instead of one, in
2612 other words, to after the CRLF.
2614 The above rule is a compromise that makes the most common cases work as
2615 expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
2616 option is not set), it does not match the string "\r\nA" because, after
2617 failing at the start, it skips both the CR and the LF before retrying.
2618 However, the pattern [\r\n]A does match that string, because it con-
2619 tains an explicit CR or LF reference, and so advances only by one char-
2620 acter after the first failure.
2622 An explicit match for CR of LF is either a literal appearance of one of
2623 those characters in the pattern, or one of the \r or \n or equivalent
2624 octal or hexadecimal escape sequences. Implicit matches such as [^X] do
2625 not count, nor does \s, even though it includes CR and LF in the char-
2626 acters that it matches.
2628 Notwithstanding the above, anomalous effects may still occur when CRLF
2629 is a valid newline sequence and explicit \r or \n escapes appear in the
2633 HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
2635 uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
2637 PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
2639 In general, a pattern matches a certain portion of the subject, and in
2640 addition, further substrings from the subject may be picked out by
2641 parenthesized parts of the pattern. Following the usage in Jeffrey
2642 Friedl's book, this is called "capturing" in what follows, and the
2643 phrase "capturing subpattern" or "capturing group" is used for a frag-
2644 ment of a pattern that picks out a substring. PCRE2 supports several
2645 other kinds of parenthesized subpattern that do not cause substrings to
2646 be captured. The pcre2_pattern_info() function can be used to find out
2647 how many capturing subpatterns there are in a compiled pattern.
2649 You can use auxiliary functions for accessing captured substrings by
2650 number or by name, as described in sections below.
2652 Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
2653 ues, called the ovector, which contains the offsets of captured
2654 strings. It is part of the match data block. The function
2655 pcre2_get_ovector_pointer() returns the address of the ovector, and
2656 pcre2_get_ovector_count() returns the number of pairs of values it con-
2659 Within the ovector, the first in each pair of values is set to the off-
2660 set of the first code unit of a substring, and the second is set to the
2661 offset of the first code unit after the end of a substring. These val-
2662 ues are always code unit offsets, not character offsets. That is, they
2663 are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
2664 library, and 32-bit offsets in the 32-bit library.
2666 After a partial match (error return PCRE2_ERROR_PARTIAL), only the
2667 first pair of offsets (that is, ovector[0] and ovector[1]) are set.
2668 They identify the part of the subject that was partially matched. See
2669 the pcre2partial documentation for details of partial matching.
2671 After a fully successful match, the first pair of offsets identifies
2672 the portion of the subject string that was matched by the entire pat-
2673 tern. The next pair is used for the first captured substring, and so
2674 on. The value returned by pcre2_match() is one more than the highest
2675 numbered pair that has been set. For example, if two substrings have
2676 been captured, the returned value is 3. If there are no captured sub-
2677 strings, the return value from a successful match is 1, indicating that
2678 just the first pair of offsets has been set.
2680 If a pattern uses the \K escape sequence within a positive assertion,
2681 the reported start of a successful match can be greater than the end of
2682 the match. For example, if the pattern (?=ab\K) is matched against
2683 "ab", the start and end offset values for the match are 2 and 0.
2685 If a capturing subpattern group is matched repeatedly within a single
2686 match operation, it is the last portion of the subject that it matched
2689 If the ovector is too small to hold all the captured substring offsets,
2690 as much as possible is filled in, and the function returns a value of
2691 zero. If captured substrings are not of interest, pcre2_match() may be
2692 called with a match data block whose ovector is of minimum length (that
2695 It is possible for capturing subpattern number n+1 to match some part
2696 of the subject when subpattern n has not been used at all. For example,
2697 if the string "abc" is matched against the pattern (a|(z))(bc) the
2698 return from the function is 4, and subpatterns 1 and 3 are matched, but
2699 2 is not. When this happens, both values in the offset pairs corre-
2700 sponding to unused subpatterns are set to PCRE2_UNSET.
2702 Offset values that correspond to unused subpatterns at the end of the
2703 expression are also set to PCRE2_UNSET. For example, if the string
2704 "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
2705 are not matched. The return from the function is 2, because the high-
2706 est used capturing subpattern number is 1. The offsets for for the sec-
2707 ond and third capturing subpatterns (assuming the vector is large
2708 enough, of course) are set to PCRE2_UNSET.
2710 Elements in the ovector that do not correspond to capturing parentheses
2711 in the pattern are never changed. That is, if a pattern contains n cap-
2712 turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
2713 pcre2_match(). The other elements retain whatever values they previ-
2714 ously had. After a failed match attempt, the contents of the ovector
2718 OTHER INFORMATION ABOUT A MATCH
2720 PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
2722 PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
2724 As well as the offsets in the ovector, other information about a match
2725 is retained in the match data block and can be retrieved by the above
2726 functions in appropriate circumstances. If they are called at other
2727 times, the result is undefined.
2729 After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
2730 failure to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN)
2731 name may be available. The function pcre2_get_mark() can be called to
2732 access this name. The same function applies to all three verbs. It
2733 returns a pointer to the zero-terminated name, which is within the com-
2734 piled pattern. If no name is available, NULL is returned. The length of
2735 the name (excluding the terminating zero) is stored in the code unit
2736 that precedes the name. You should use this length instead of relying
2737 on the terminating zero if the name might contain a binary zero.
2739 After a successful match, the name that is returned is the last
2740 (*MARK), (*PRUNE), or (*THEN) name encountered on the matching path
2741 through the pattern. Instances of (*PRUNE) and (*THEN) without names
2742 are ignored. Thus, for example, if the matching path contains
2743 (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
2744 partial match, the last encountered name is returned. For example,
2745 consider this pattern:
2747 ^(*MARK:A)((*MARK:B)a|b)c
2749 When it matches "bc", the returned name is A. The B mark is "seen" in
2750 the first branch of the group, but it is not on the matching path. On
2751 the other hand, when this pattern fails to match "bx", the returned
2754 Warning: By default, certain start-of-match optimizations are used to
2755 give a fast "no match" result in some situations. For example, if the
2756 anchoring is removed from the pattern above, there is an initial check
2757 for the presence of "c" in the subject before running the matching
2758 engine. This check fails for "bx", causing a match failure without see-
2759 ing any marks. You can disable the start-of-match optimizations by set-
2760 ting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or starting
2761 the pattern with (*NO_START_OPT).
2763 After a successful match, a partial match, or one of the invalid UTF
2764 errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
2765 be called. After a successful or partial match it returns the code unit
2766 offset of the character at which the match started. For a non-partial
2767 match, this can be different to the value of ovector[0] if the pattern
2768 contains the \K escape sequence. After a partial match, however, this
2769 value is always the same as ovector[0] because \K does not affect the
2770 result of a partial match.
2772 After a UTF check failure, pcre2_get_startchar() can be used to obtain
2773 the code unit offset of the invalid UTF character. Details are given in
2774 the pcre2unicode page.
2777 ERROR RETURNS FROM pcre2_match()
2779 If pcre2_match() fails, it returns a negative number. This can be con-
2780 verted to a text string by calling the pcre2_get_error_message() func-
2781 tion (see "Obtaining a textual error message" below). Negative error
2782 codes are also returned by other functions, and are documented with
2783 them. The codes are given names in the header file. If UTF checking is
2784 in force and an invalid UTF subject string is detected, one of a number
2785 of UTF-specific negative error codes is returned. Details are given in
2786 the pcre2unicode page. The following are the other errors that may be
2787 returned by pcre2_match():
2791 The subject string did not match the pattern.
2795 The subject string did not match, but it did match partially. See the
2796 pcre2partial documentation for details of partial matching.
2798 PCRE2_ERROR_BADMAGIC
2800 PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
2801 to catch the case when it is passed a junk pointer. This is the error
2802 that is returned when the magic number is not present.
2806 This error is given when a compiled pattern is passed to a function in
2807 a library of a different code unit width, for example, a pattern com-
2808 piled by the 8-bit library is passed to a 16-bit or 32-bit library
2811 PCRE2_ERROR_BADOFFSET
2813 The value of startoffset was greater than the length of the subject.
2815 PCRE2_ERROR_BADOPTION
2817 An unrecognized bit was set in the options argument.
2819 PCRE2_ERROR_BADUTFOFFSET
2821 The UTF code unit sequence that was passed as a subject was checked and
2822 found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
2823 value of startoffset did not point to the beginning of a UTF character
2824 or the end of the subject.
2828 This error is never generated by pcre2_match() itself. It is provided
2829 for use by callout functions that want to cause pcre2_match() or
2830 pcre2_callout_enumerate() to return a distinctive error code. See the
2831 pcre2callout documentation for details.
2833 PCRE2_ERROR_DEPTHLIMIT
2835 The nested backtracking depth limit was reached.
2837 PCRE2_ERROR_HEAPLIMIT
2839 The heap limit was reached.
2841 PCRE2_ERROR_INTERNAL
2843 An unexpected internal error has occurred. This error could be caused
2844 by a bug in PCRE2 or by overwriting of the compiled pattern.
2846 PCRE2_ERROR_JIT_STACKLIMIT
2848 This error is returned when a pattern that was successfully studied
2849 using JIT is being matched, but the memory available for the just-in-
2850 time processing stack is not large enough. See the pcre2jit documenta-
2851 tion for more details.
2853 PCRE2_ERROR_MATCHLIMIT
2855 The backtracking match limit was reached.
2857 PCRE2_ERROR_NOMEMORY
2859 If a pattern contains many nested backtracking points, heap memory is
2860 used to remember them. This error is given when the memory allocation
2861 function (default or custom) fails. Note that a different error,
2862 PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
2867 Either the code, subject, or match_data argument was passed as NULL.
2869 PCRE2_ERROR_RECURSELOOP
2871 This error is returned when pcre2_match() detects a recursion loop
2872 within the pattern. Specifically, it means that either the whole pat-
2873 tern or a subpattern has been called recursively for the second time at
2874 the same position in the subject string. Some simple patterns that
2875 might do this are detected and faulted at compile time, but more com-
2876 plicated cases, in particular mutual recursions between two different
2877 subpatterns, cannot be detected until matching is attempted.
2880 OBTAINING A TEXTUAL ERROR MESSAGE
2882 int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
2883 PCRE2_SIZE bufflen);
2885 A text message for an error code from any PCRE2 function (compile,
2886 match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
2887 sage(). The code is passed as the first argument, with the remaining
2888 two arguments specifying a code unit buffer and its length in code
2889 units, into which the text message is placed. The message is returned
2890 in code units of the appropriate width for the library that is being
2893 The returned message is terminated with a trailing zero, and the func-
2894 tion returns the number of code units used, excluding the trailing
2895 zero. If the error number is unknown, the negative error code
2896 PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
2897 sage is truncated (but still with a trailing zero), and the negative
2898 error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
2899 very long; a buffer size of 120 code units is ample.
2902 EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
2904 int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
2905 uint32_t number, PCRE2_SIZE *length);
2907 int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
2908 uint32_t number, PCRE2_UCHAR *buffer,
2909 PCRE2_SIZE *bufflen);
2911 int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
2912 uint32_t number, PCRE2_UCHAR **bufferptr,
2913 PCRE2_SIZE *bufflen);
2915 void pcre2_substring_free(PCRE2_UCHAR *buffer);
2917 Captured substrings can be accessed directly by using the ovector as
2918 described above. For convenience, auxiliary functions are provided for
2919 extracting captured substrings as new, separate, zero-terminated
2920 strings. A substring that contains a binary zero is correctly extracted
2921 and has a further zero added on the end, but the result is not, of
2924 The functions in this section identify substrings by number. The number
2925 zero refers to the entire matched substring, with higher numbers refer-
2926 ring to substrings captured by parenthesized groups. After a partial
2927 match, only substring zero is available. An attempt to extract any
2928 other substring gives the error PCRE2_ERROR_PARTIAL. The next section
2929 describes similar functions for extracting captured substrings by name.
2931 If a pattern uses the \K escape sequence within a positive assertion,
2932 the reported start of a successful match can be greater than the end of
2933 the match. For example, if the pattern (?=ab\K) is matched against
2934 "ab", the start and end offset values for the match are 2 and 0. In
2935 this situation, calling these functions with a zero substring number
2936 extracts a zero-length empty string.
2938 You can find the length in code units of a captured substring without
2939 extracting it by calling pcre2_substring_length_bynumber(). The first
2940 argument is a pointer to the match data block, the second is the group
2941 number, and the third is a pointer to a variable into which the length
2942 is placed. If you just want to know whether or not the substring has
2943 been captured, you can pass the third argument as NULL.
2945 The pcre2_substring_copy_bynumber() function copies a captured sub-
2946 string into a supplied buffer, whereas pcre2_substring_get_bynumber()
2947 copies it into new memory, obtained using the same memory allocation
2948 function that was used for the match data block. The first two argu-
2949 ments of these functions are a pointer to the match data block and a
2950 capturing group number.
2952 The final arguments of pcre2_substring_copy_bynumber() are a pointer to
2953 the buffer and a pointer to a variable that contains its length in code
2954 units. This is updated to contain the actual number of code units used
2955 for the extracted substring, excluding the terminating zero.
2957 For pcre2_substring_get_bynumber() the third and fourth arguments point
2958 to variables that are updated with a pointer to the new memory and the
2959 number of code units that comprise the substring, again excluding the
2960 terminating zero. When the substring is no longer needed, the memory
2961 should be freed by calling pcre2_substring_free().
2963 The return value from all these functions is zero for success, or a
2964 negative error code. If the pattern match failed, the match failure
2965 code is returned. If a substring number greater than zero is used
2966 after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
2969 PCRE2_ERROR_NOMEMORY
2971 The buffer was too small for pcre2_substring_copy_bynumber(), or the
2972 attempt to get memory failed for pcre2_substring_get_bynumber().
2974 PCRE2_ERROR_NOSUBSTRING
2976 There is no substring with that number in the pattern, that is, the
2977 number is greater than the number of capturing parentheses.
2979 PCRE2_ERROR_UNAVAILABLE
2981 The substring number, though not greater than the number of captures in
2982 the pattern, is greater than the number of slots in the ovector, so the
2983 substring could not be captured.
2987 The substring did not participate in the match. For example, if the
2988 pattern is (abc)|(def) and the subject is "def", and the ovector con-
2989 tains at least two capturing slots, substring number 1 is unset.
2992 EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
2994 int pcre2_substring_list_get(pcre2_match_data *match_data,
2995 PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
2997 void pcre2_substring_list_free(PCRE2_SPTR *list);
2999 The pcre2_substring_list_get() function extracts all available sub-
3000 strings and builds a list of pointers to them. It also (optionally)
3001 builds a second list that contains their lengths (in code units),
3002 excluding a terminating zero that is added to each of them. All this is
3003 done in a single block of memory that is obtained using the same memory
3004 allocation function that was used to get the match data block.
3006 This function must be called only after a successful match. If called
3007 after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
3009 The address of the memory block is returned via listptr, which is also
3010 the start of the list of string pointers. The end of the list is marked
3011 by a NULL pointer. The address of the list of lengths is returned via
3012 lengthsptr. If your strings do not contain binary zeros and you do not
3013 therefore need the lengths, you may supply NULL as the lengthsptr argu-
3014 ment to disable the creation of a list of lengths. The yield of the
3015 function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
3016 ory block could not be obtained. When the list is no longer needed, it
3017 should be freed by calling pcre2_substring_list_free().
3019 If this function encounters a substring that is unset, which can happen
3020 when capturing subpattern number n+1 matches some part of the subject,
3021 but subpattern n has not been used at all, it returns an empty string.
3022 This can be distinguished from a genuine zero-length substring by
3023 inspecting the appropriate offset in the ovector, which contain
3024 PCRE2_UNSET for unset substrings, or by calling pcre2_sub-
3025 string_length_bynumber().
3028 EXTRACTING CAPTURED SUBSTRINGS BY NAME
3030 int pcre2_substring_number_from_name(const pcre2_code *code,
3033 int pcre2_substring_length_byname(pcre2_match_data *match_data,
3034 PCRE2_SPTR name, PCRE2_SIZE *length);
3036 int pcre2_substring_copy_byname(pcre2_match_data *match_data,
3037 PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
3039 int pcre2_substring_get_byname(pcre2_match_data *match_data,
3040 PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
3042 void pcre2_substring_free(PCRE2_UCHAR *buffer);
3044 To extract a substring by name, you first have to find associated num-
3045 ber. For example, for this pattern:
3049 the number of the subpattern called "xxx" is 2. If the name is known to
3050 be unique (PCRE2_DUPNAMES was not set), you can find the number from
3051 the name by calling pcre2_substring_number_from_name(). The first argu-
3052 ment is the compiled pattern, and the second is the name. The yield of
3053 the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
3054 is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
3055 there is more than one subpattern of that name. Given the number, you
3056 can extract the substring directly from the ovector, or use one of the
3057 "bynumber" functions described above.
3059 For convenience, there are also "byname" functions that correspond to
3060 the "bynumber" functions, the only difference being that the second
3061 argument is a name instead of a number. If PCRE2_DUPNAMES is set and
3062 there are duplicate names, these functions scan all the groups with the
3063 given name, and return the first named string that is set.
3065 If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
3066 returned. If all groups with the name have numbers that are greater
3067 than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
3068 returned. If there is at least one group with a slot in the ovector,
3069 but no group is found to be set, PCRE2_ERROR_UNSET is returned.
3071 Warning: If the pattern uses the (?| feature to set up multiple subpat-
3072 terns with the same number, as described in the section on duplicate
3073 subpattern numbers in the pcre2pattern page, you cannot use names to
3074 distinguish the different subpatterns, because names are not included
3075 in the compiled code. The matching process uses only numbers. For this
3076 reason, the use of different names for subpatterns of the same number
3077 causes an error at compile time.
3080 CREATING A NEW STRING WITH SUBSTITUTIONS
3082 int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
3083 PCRE2_SIZE length, PCRE2_SIZE startoffset,
3084 uint32_t options, pcre2_match_data *match_data,
3085 pcre2_match_context *mcontext, PCRE2_SPTR replacement,
3086 PCRE2_SIZE rlength, PCRE2_UCHAR *outputbufferP,
3087 PCRE2_SIZE *outlengthptr);
3089 This function calls pcre2_match() and then makes a copy of the subject
3090 string in outputbuffer, replacing the part that was matched with the
3091 replacement string, whose length is supplied in rlength. This can be
3092 given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
3093 which a \K item in a lookahead in the pattern causes the match to end
3094 before it starts are not supported, and give rise to an error return.
3095 For global replacements, matches in which \K in a lookbehind causes the
3096 match to start earlier than the point that was reached in the previous
3097 iteration are also not supported.
3099 The first seven arguments of pcre2_substitute() are the same as for
3100 pcre2_match(), except that the partial matching options are not permit-
3101 ted, and match_data may be passed as NULL, in which case a match data
3102 block is obtained and freed within this function, using memory manage-
3103 ment functions from the match context, if provided, or else those that
3104 were used to allocate memory for the compiled code.
3106 If an external match_data block is provided, its contents afterwards
3107 are those set by the final call to pcre2_match(), which will have ended
3108 in a matching error. The contents of the ovector within the match data
3109 block may or may not have been changed.
3111 The outlengthptr argument must point to a variable that contains the
3112 length, in code units, of the output buffer. If the function is suc-
3113 cessful, the value is updated to contain the length of the new string,
3114 excluding the trailing zero that is automatically added.
3116 If the function is not successful, the value set via outlengthptr
3117 depends on the type of error. For syntax errors in the replacement
3118 string, the value is the offset in the replacement string where the
3119 error was detected. For other errors, the value is PCRE2_UNSET by
3120 default. This includes the case of the output buffer being too small,
3121 unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
3122 case the value is the minimum length needed, including space for the
3123 trailing zero. Note that in order to compute the required length,
3124 pcre2_substitute() has to simulate all the matching and copying,
3125 instead of giving an error return as soon as the buffer overflows. Note
3126 also that the length is in code units, not bytes.
3128 In the replacement string, which is interpreted as a UTF string in UTF
3129 mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
3130 option is set, a dollar character is an escape character that can spec-
3131 ify the insertion of characters from capturing groups or (*MARK),
3132 (*PRUNE), or (*THEN) items in the pattern. The following forms are
3135 $$ insert a dollar character
3136 $<n> or ${<n>} insert the contents of group <n>
3137 $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
3139 Either a group number or a group name can be given for <n>. Curly
3140 brackets are required only if the following character would be inter-
3141 preted as part of the number or name. The number may be zero to include
3142 the entire matched string. For example, if the pattern a(b)c is
3143 matched with "=abc=" and the replacement string "+$1$0$1+", the result
3146 $*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or
3147 (*THEN) on the matching path that has a name. (*MARK) must always
3148 include a name, but (*PRUNE) and (*THEN) need not. For example, in the
3149 case of (*MARK:A)(*PRUNE) the name inserted is "A", but for
3150 (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
3151 used to perform simple simultaneous substitutions, as this pcre2test
3154 /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
3158 As well as the usual options for pcre2_match(), a number of additional
3159 options can be set in the options argument of pcre2_substitute().
3161 PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
3162 string, replacing every matching substring. If this option is not set,
3163 only the first matching substring is replaced. The search for matches
3164 takes place in the original subject string (that is, previous replace-
3165 ments do not affect it). Iteration is implemented by advancing the
3166 startoffset value for each search, which is always passed the entire
3167 subject string. If an offset limit is set in the match context, search-
3168 ing stops when that limit is reached.
3170 You can restrict the effect of a global substitution to a portion of
3171 the subject string by setting either or both of startoffset and an off-
3172 set limit. Here is a pcre2test example:
3174 /B/g,replace=!,use_offset_limit
3175 ABC ABC ABC ABC\=offset=3,offset_limit=12
3178 When continuing with global substitutions after matching a substring
3179 with zero length, an attempt to find a non-empty match at the same off-
3180 set is performed. If this is not successful, the offset is advanced by
3181 one character except when CRLF is a valid newline sequence and the next
3182 two characters are CR, LF. In this case, the offset is advanced by two
3185 PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
3186 buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
3187 ORY immediately. If this option is set, however, pcre2_substitute()
3188 continues to go through the motions of matching and substituting (with-
3189 out, of course, writing anything) in order to compute the size of buf-
3190 fer that is needed. This value is passed back via the outlengthptr
3191 variable, with the result of the function still being
3192 PCRE2_ERROR_NOMEMORY.
3194 Passing a buffer size of zero is a permitted way of finding out how
3195 much memory is needed for given substitution. However, this does mean
3196 that the entire operation is carried out twice. Depending on the appli-
3197 cation, it may be more efficient to allocate a large buffer and free
3198 the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
3201 PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
3202 that do not appear in the pattern to be treated as unset groups. This
3203 option should be used with care, because it means that a typo in a
3204 group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
3207 PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
3208 unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
3209 treated as empty strings when inserted as described above. If this
3210 option is not set, an attempt to insert an unset group causes the
3211 PCRE2_ERROR_UNSET error. This option does not influence the extended
3212 substitution syntax described below.
3214 PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
3215 replacement string. Without this option, only the dollar character is
3216 special, and only the group insertion forms listed above are valid.
3217 When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
3219 Firstly, backslash in a replacement string is interpreted as an escape
3220 character. The usual forms such as \n or \x{ddd} can be used to specify
3221 particular character codes, and backslash followed by any non-alphanu-
3222 meric character quotes that character. Extended quoting can be coded
3223 using \Q...\E, exactly as in pattern strings.
3225 There are also four escape sequences for forcing the case of inserted
3226 letters. The insertion mechanism has three states: no case forcing,
3227 force upper case, and force lower case. The escape sequences change the
3228 current state: \U and \L change to upper or lower case forcing, respec-
3229 tively, and \E (when not terminating a \Q quoted sequence) reverts to
3230 no case forcing. The sequences \u and \l force the next character (if
3231 it is a letter) to upper or lower case, respectively, and then the
3232 state automatically reverts to no case forcing. Case forcing applies to
3233 all inserted characters, including those from captured groups and let-
3234 ters within \Q...\E quoted sequences.
3236 Note that case forcing sequences such as \U...\E do not nest. For exam-
3237 ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
3240 The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
3241 flexibility to group substitution. The syntax is similar to that used
3245 ${<n>:+<string1>:<string2>}
3247 As before, <n> may be a group number or a name. The first form speci-
3248 fies a default value. If group <n> is set, its value is inserted; if
3249 not, <string> is expanded and the result inserted. The second form
3250 specifies strings that are expanded and inserted when group <n> is set
3251 or unset, respectively. The first form is just a convenient shorthand
3254 ${<n>:+${<n>}:<string>}
3256 Backslash can be used to escape colons and closing curly brackets in
3257 the replacement strings. A change of the case forcing state within a
3258 replacement string remains in force afterwards, as shown in this
3261 /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
3267 The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
3268 substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
3269 unknown groups in the extended syntax forms to be treated as unset.
3271 If successful, pcre2_substitute() returns the number of replacements
3272 that were made. This may be zero if no matches were found, and is never
3273 greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
3275 In the event of an error, a negative error code is returned. Except for
3276 PCRE2_ERROR_NOMATCH (which is never returned), errors from
3277 pcre2_match() are passed straight back.
3279 PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
3280 tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
3282 PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
3283 ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
3284 when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
3285 TUTE_UNSET_EMPTY is not set.
3287 PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
3288 enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
3289 of buffer that is needed is returned via outlengthptr. Note that this
3290 does not happen by default.
3292 PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
3293 the replacement string, with more particular errors being
3294 PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
3295 MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
3296 TUTION (syntax error in extended group substitution), and
3297 PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started
3298 or the match started earlier than the current position in the subject,
3299 which can happen if \K is used in an assertion).
3301 As for all PCRE2 errors, a text message that describes the error can be
3302 obtained by calling the pcre2_get_error_message() function (see
3303 "Obtaining a textual error message" above).
3306 DUPLICATE SUBPATTERN NAMES
3308 int pcre2_substring_nametable_scan(const pcre2_code *code,
3309 PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
3311 When a pattern is compiled with the PCRE2_DUPNAMES option, names for
3312 subpatterns are not required to be unique. Duplicate names are always
3313 allowed for subpatterns with the same number, created by using the (?|
3314 feature. Indeed, if such subpatterns are named, they are required to
3317 Normally, patterns with duplicate names are such that in any one match,
3318 only one of the named subpatterns participates. An example is shown in
3319 the pcre2pattern documentation.
3321 When duplicates are present, pcre2_substring_copy_byname() and
3322 pcre2_substring_get_byname() return the first substring corresponding
3323 to the given name that is set. Only if none are set is
3324 PCRE2_ERROR_UNSET is returned. The pcre2_substring_number_from_name()
3325 function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
3328 If you want to get full details of all captured substrings for a given
3329 name, you must use the pcre2_substring_nametable_scan() function. The
3330 first argument is the compiled pattern, and the second is the name. If
3331 the third and fourth arguments are NULL, the function returns a group
3332 number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
3334 When the third and fourth arguments are not NULL, they must be pointers
3335 to variables that are updated by the function. After it has run, they
3336 point to the first and last entries in the name-to-number table for the
3337 given name, and the function returns the length of each entry in code
3338 units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
3339 no entries for the given name.
3341 The format of the name table is described above in the section entitled
3342 Information about a pattern. Given all the relevant entries for the
3343 name, you can extract each of their numbers, and hence the captured
3347 FINDING ALL POSSIBLE MATCHES AT ONE POSITION
3349 The traditional matching function uses a similar algorithm to Perl,
3350 which stops when it finds the first match at a given point in the sub-
3351 ject. If you want to find all possible matches, or the longest possible
3352 match at a given position, consider using the alternative matching
3353 function (see below) instead. If you cannot use the alternative func-
3354 tion, you can kludge it up by making use of the callout facility, which
3355 is described in the pcre2callout documentation.
3357 What you have to do is to insert a callout right at the end of the pat-
3358 tern. When your callout function is called, extract and save the cur-
3359 rent matched substring. Then return 1, which forces pcre2_match() to
3360 backtrack and try other alternatives. Ultimately, when it runs out of
3361 matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
3364 MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
3366 int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
3367 PCRE2_SIZE length, PCRE2_SIZE startoffset,
3368 uint32_t options, pcre2_match_data *match_data,
3369 pcre2_match_context *mcontext,
3370 int *workspace, PCRE2_SIZE wscount);
3372 The function pcre2_dfa_match() is called to match a subject string
3373 against a compiled pattern, using a matching algorithm that scans the
3374 subject string just once (not counting lookaround assertions), and does
3375 not backtrack. This has different characteristics to the normal algo-
3376 rithm, and is not compatible with Perl. Some of the features of PCRE2
3377 patterns are not supported. Nevertheless, there are times when this
3378 kind of matching can be useful. For a discussion of the two matching
3379 algorithms, and a list of features that pcre2_dfa_match() does not sup-
3380 port, see the pcre2matching documentation.
3382 The arguments for the pcre2_dfa_match() function are the same as for
3383 pcre2_match(), plus two extras. The ovector within the match data block
3384 is used in a different way, and this is described below. The other com-
3385 mon arguments are used in the same way as for pcre2_match(), so their
3386 description is not repeated here.
3388 The two additional arguments provide workspace for the function. The
3389 workspace vector should contain at least 20 elements. It is used for
3390 keeping track of multiple paths through the pattern tree. More
3391 workspace is needed for patterns and subjects where there are a lot of
3394 Here is an example of a simple call to pcre2_dfa_match():
3397 pcre2_match_data *md = pcre2_match_data_create(4, NULL);
3398 int rc = pcre2_dfa_match(
3399 re, /* result of pcre2_compile() */
3400 "some string", /* the subject string */
3401 11, /* the length of the subject string */
3402 0, /* start at offset 0 in the subject */
3403 0, /* default options */
3404 md, /* the match data block */
3405 NULL, /* a match context; NULL means use defaults */
3406 wspace, /* working space vector */
3407 20); /* number of elements (NOT size in bytes) */
3409 Option bits for pcre_dfa_match()
3411 The unused bits of the options argument for pcre2_dfa_match() must be
3412 zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
3413 CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
3414 PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
3415 PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
3416 the last four of these are exactly the same as for pcre2_match(), so
3417 their description is not repeated here.
3422 These have the same general effect as they do for pcre2_match(), but
3423 the details are slightly different. When PCRE2_PARTIAL_HARD is set for
3424 pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
3425 subject is reached and there is still at least one matching possibility
3426 that requires additional characters. This happens even if some complete
3427 matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
3428 return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
3429 if the end of the subject is reached, there have been no complete
3430 matches, but there is still at least one matching possibility. The por-
3431 tion of the string that was inspected when the longest partial match
3432 was found is set as the first matching string in both cases. There is a
3433 more detailed discussion of partial and multi-segment matching, with
3434 examples, in the pcre2partial documentation.
3438 Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
3439 stop as soon as it has found one match. Because of the way the alterna-
3440 tive algorithm works, this is necessarily the shortest possible match
3441 at the first possible matching point in the subject string.
3445 When pcre2_dfa_match() returns a partial match, it is possible to call
3446 it again, with additional subject characters, and have it continue with
3447 the same match. The PCRE2_DFA_RESTART option requests this action; when
3448 it is set, the workspace and wscount options must reference the same
3449 vector as before because data about the match so far is left in them
3450 after a partial match. There is more discussion of this facility in the
3451 pcre2partial documentation.
3453 Successful returns from pcre2_dfa_match()
3455 When pcre2_dfa_match() succeeds, it may have matched more than one sub-
3456 string in the subject. Note, however, that all the matches from one run
3457 of the function start at the same point in the subject. The shorter
3458 matches are all initial substrings of the longer matches. For example,
3463 is matched against the string
3465 This is <something> <something else> <something further> no more
3467 the three matched strings are
3469 <something> <something else> <something further>
3470 <something> <something else>
3473 On success, the yield of the function is a number greater than zero,
3474 which is the number of matched substrings. The offsets of the sub-
3475 strings are returned in the ovector, and can be extracted by number in
3476 the same way as for pcre2_match(), but the numbers bear no relation to
3477 any capturing groups that may exist in the pattern, because DFA match-
3478 ing does not support group capture.
3480 Calls to the convenience functions that extract substrings by name
3481 return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
3482 after a DFA match. The convenience functions that extract substrings by
3483 number never return PCRE2_ERROR_NOSUBSTRING.
3485 The matched strings are stored in the ovector in reverse order of
3486 length; that is, the longest matching string is first. If there were
3487 too many matches to fit into the ovector, the yield of the function is
3488 zero, and the vector is filled with the longest matches.
3490 NOTE: PCRE2's "auto-possessification" optimization usually applies to
3491 character repeats at the end of a pattern (as well as internally). For
3492 example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
3493 matching, this means that only one possible match is found. If you
3494 really do want multiple matches in such cases, either use an ungreedy
3495 repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
3498 Error returns from pcre2_dfa_match()
3500 The pcre2_dfa_match() function returns a negative number when it fails.
3501 Many of the errors are the same as for pcre2_match(), as described
3502 above. There are in addition the following errors that are specific to
3505 PCRE2_ERROR_DFA_UITEM
3507 This return is given if pcre2_dfa_match() encounters an item in the
3508 pattern that it does not support, for instance, the use of \C in a UTF
3509 mode or a backreference.
3511 PCRE2_ERROR_DFA_UCOND
3513 This return is given if pcre2_dfa_match() encounters a condition item
3514 that uses a backreference for the condition, or a test for recursion in
3515 a specific group. These are not supported.
3517 PCRE2_ERROR_DFA_WSSIZE
3519 This return is given if pcre2_dfa_match() runs out of space in the
3522 PCRE2_ERROR_DFA_RECURSE
3524 When a recursive subpattern is processed, the matching function calls
3525 itself recursively, using private memory for the ovector and workspace.
3526 This error is given if the internal ovector is not large enough. This
3527 should be extremely rare, as a vector of size 1000 is used.
3529 PCRE2_ERROR_DFA_BADRESTART
3531 When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
3532 some plausibility checks are made on the contents of the workspace,
3533 which should contain data about the previous partial match. If any of
3534 these checks fail, this error is given.
3539 pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
3540 pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
3546 University Computing Service
3552 Last updated: 07 September 2018
3553 Copyright (c) 1997-2018 University of Cambridge.
3554 ------------------------------------------------------------------------------
3557 PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
3562 PCRE2 - Perl-compatible regular expressions (revised API)
3566 PCRE2 is distributed with a configure script that can be used to build
3567 the library in Unix-like environments using the applications known as
3568 Autotools. Also in the distribution are files to support building using
3569 CMake instead of configure. The text file README contains general
3570 information about building with Autotools (some of which is repeated
3571 below), and also has some comments about building on various operating
3572 systems. There is a lot more information about building PCRE2 without
3573 using Autotools (including information about using CMake and building
3574 "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
3575 consult this file as well as the README file if you are building in a
3576 non-Unix-like environment.
3579 PCRE2 BUILD-TIME OPTIONS
3581 The rest of this document describes the optional features of PCRE2 that
3582 can be selected when the library is compiled. It assumes use of the
3583 configure script, where the optional features are selected or dese-
3584 lected by providing options to configure before running the make com-
3585 mand. However, the same options can be selected in both Unix-like and
3586 non-Unix-like environments if you are using CMake instead of configure
3589 If you are not using Autotools or CMake, option selection can be done
3590 by editing the config.h file, or by passing parameter settings to the
3591 compiler, as described in NON-AUTOTOOLS-BUILD.
3593 The complete list of options for configure (which includes the standard
3594 ones such as the selection of the installation directory) can be
3599 The following sections include descriptions of "on/off" options whose
3600 names begin with --enable or --disable. Because of the way that config-
3601 ure works, --enable and --disable always come in pairs, so the comple-
3602 mentary option always exists as well, but as it specifies the default,
3603 it is not described. Options that specify values have names that start
3604 with --with. At the end of a configure run, a summary of the configura-
3608 BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
3610 By default, a library called libpcre2-8 is built, containing functions
3611 that take string arguments contained in arrays of bytes, interpreted
3612 either as single-byte characters, or UTF-8 strings. You can also build
3613 two other libraries, called libpcre2-16 and libpcre2-32, which process
3614 strings that are contained in arrays of 16-bit and 32-bit code units,
3615 respectively. These can be interpreted either as single-unit characters
3616 or UTF-16/UTF-32 strings. To build these additional libraries, add one
3617 or both of the following to the configure command:
3622 If you do not want the 8-bit library, add
3626 as well. At least one of the three libraries must be built. Note that
3627 the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
3628 an 8-bit program. Neither of these are built if you select only the
3629 16-bit or 32-bit libraries.
3632 BUILDING SHARED AND STATIC LIBRARIES
3634 The Autotools PCRE2 building process uses libtool to build both shared
3635 and static libraries by default. You can suppress an unwanted library
3641 to the configure command.
3644 UNICODE AND UTF SUPPORT
3646 By default, PCRE2 is built with support for Unicode and UTF character
3647 strings. To build it without Unicode support, add
3651 to the configure command. This setting applies to all three libraries.
3652 It is not possible to build one library with Unicode support, and
3653 another without, in the same configuration.
3655 Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
3656 UTF-16 or UTF-32. To do that, applications that use the library can set
3657 the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
3658 tern. Alternatively, patterns may be started with (*UTF) unless the
3659 application has locked this out by setting PCRE2_NEVER_UTF.
3661 UTF support allows the libraries to process character code points up to
3662 0x10ffff in the strings that they handle. Unicode support also gives
3663 access to the Unicode properties of characters, using pattern escapes
3664 such as \P, \p, and \X. Only the general category properties such as Lu
3665 and Nd are supported. Details are given in the pcre2pattern documenta-
3668 Pattern escapes such as \d and \w do not by default make use of Unicode
3669 properties. The application can request that they do by setting the
3670 PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
3671 pattern may also request this by starting with (*UCP).
3674 DISABLING THE USE OF \C
3676 The \C escape sequence, which matches a single code unit, even in a UTF
3677 mode, can cause unpredictable behaviour because it may leave the cur-
3678 rent matching point in the middle of a multi-code-unit character. The
3679 application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C
3680 option when calling pcre2_compile(). There is also a build-time option
3682 --enable-never-backslash-C
3684 (note the upper case C) which locks out the use of \C entirely.
3687 JUST-IN-TIME COMPILER SUPPORT
3689 Just-in-time (JIT) compiler support is included in the build by speci-
3694 This support is available only for certain hardware architectures. If
3695 this option is set for an unsupported architecture, a building error
3696 occurs. If in doubt, use
3700 which enables JIT only if the current hardware is supported. You can
3701 check if JIT is enabled in the configuration summary that is output at
3702 the end of a configure run. If you are enabling JIT under SELinux you
3703 may also want to add
3705 --enable-jit-sealloc
3707 which enables the use of an execmem allocator in JIT that is compatible
3708 with SELinux. This has no effect if JIT is not enabled. See the
3709 pcre2jit documentation for a discussion of JIT usage. When JIT support
3710 is enabled, pcre2grep automatically makes use of it, unless you add
3712 --disable-pcre2grep-jit
3714 to the "configure" command.
3719 By default, PCRE2 interprets the linefeed (LF) character as indicating
3720 the end of a line. This is the normal newline character on Unix-like
3721 systems. You can compile PCRE2 to use carriage return (CR) instead, by
3724 --enable-newline-is-cr
3726 to the configure command. There is also an --enable-newline-is-lf
3727 option, which explicitly specifies linefeed as the newline character.
3729 Alternatively, you can specify that line endings are to be indicated by
3730 the two-character sequence CRLF (CR immediately followed by LF). If you
3733 --enable-newline-is-crlf
3735 to the configure command. There is a fourth option, specified by
3737 --enable-newline-is-anycrlf
3739 which causes PCRE2 to recognize any of the three sequences CR, LF, or
3740 CRLF as indicating a line ending. A fifth option, specified by
3742 --enable-newline-is-any
3744 causes PCRE2 to recognize any Unicode newline sequence. The Unicode
3745 newline sequences are the three just mentioned, plus the single charac-
3746 ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
3747 U+0085), LS (line separator, U+2028), and PS (paragraph separator,
3748 U+2029). The final option is
3750 --enable-newline-is-nul
3752 which causes NUL (binary zero) to be set as the default line-ending
3755 Whatever default line ending convention is selected when PCRE2 is built
3756 can be overridden by applications that use the library. At build time
3757 it is recommended to use the standard for your operating system.
3762 By default, the sequence \R in a pattern matches any Unicode newline
3763 sequence, independently of what has been selected as the line ending
3764 sequence. If you specify
3766 --enable-bsr-anycrlf
3768 the default is changed so that \R matches only CR, LF, or CRLF. What-
3769 ever is selected when PCRE2 is built can be overridden by applications
3770 that use the library.
3773 HANDLING VERY LARGE PATTERNS
3775 Within a compiled pattern, offset values are used to point from one
3776 part to another (for example, from an opening parenthesis to an alter-
3777 nation metacharacter). By default, in the 8-bit and 16-bit libraries,
3778 two-byte values are used for these offsets, leading to a maximum size
3779 for a compiled pattern of around 64 thousand code units. This is suffi-
3780 cient to handle all but the most gigantic patterns. Nevertheless, some
3781 people do want to process truly enormous patterns, so it is possible to
3782 compile PCRE2 to use three-byte or four-byte offsets by adding a set-
3787 to the configure command. The value given must be 2, 3, or 4. For the
3788 16-bit library, a value of 3 is rounded up to 4. In these libraries,
3789 using longer offsets slows down the operation of PCRE2 because it has
3790 to load additional data when handling them. For the 32-bit library the
3791 value is always 4 and cannot be overridden; the value of --with-link-
3795 LIMITING PCRE2 RESOURCE USAGE
3797 The pcre2_match() function increments a counter each time it goes round
3798 its main loop. Putting a limit on this counter controls the amount of
3799 computing resource used by a single call to pcre2_match(). The limit
3800 can be changed at run time, as described in the pcre2api documentation.
3801 The default is 10 million, but this can be changed by adding a setting
3804 --with-match-limit=500000
3806 to the configure command. This setting also applies to the
3807 pcre2_dfa_match() matching function, and to JIT matching (though the
3808 counting is done differently).
3810 The pcre2_match() function starts out using a 20KiB vector on the sys-
3811 tem stack to record backtracking points. The more nested backtracking
3812 points there are (that is, the deeper the search tree), the more memory
3813 is needed. If the initial vector is not large enough, heap memory is
3814 used, up to a certain limit, which is specified in kibibytes (units of
3815 1024 bytes). The limit can be changed at run time, as described in the
3816 pcre2api documentation. The default limit (in effect unlimited) is 20
3817 million. You can change this by a setting such as
3819 --with-heap-limit=500
3821 which limits the amount of heap to 500 KiB. This limit applies only to
3822 interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
3823 also use the heap for internal workspace when processing complicated
3824 patterns. This limit does not apply when JIT (which has its own memory
3825 arrangements) is used.
3827 You can also explicitly limit the depth of nested backtracking in the
3828 pcre2_match() interpreter. This limit defaults to the value that is set
3829 for --with-match-limit. You can set a lower default limit by adding,
3832 --with-match-limit_depth=10000
3834 to the configure command. This value can be overridden at run time.
3835 This depth limit indirectly limits the amount of heap memory that is
3836 used, but because the size of each backtracking "frame" depends on the
3837 number of capturing parentheses in a pattern, the amount of heap that
3838 is used before the limit is reached varies from pattern to pattern.
3839 This limit was more useful in versions before 10.30, where function
3840 recursion was used for backtracking.
3842 As well as applying to pcre2_match(), the depth limit also controls the
3843 depth of recursive function calls in pcre2_dfa_match(). These are used
3844 for lookaround assertions, atomic groups, and recursion within pat-
3845 terns. The limit does not apply to JIT matching.
3848 CREATING CHARACTER TABLES AT BUILD TIME
3850 PCRE2 uses fixed tables for processing characters whose code points are
3851 less than 256. By default, PCRE2 is built with a set of tables that are
3852 distributed in the file src/pcre2_chartables.c.dist. These tables are
3853 for ASCII codes only. If you add
3855 --enable-rebuild-chartables
3857 to the configure command, the distributed tables are no longer used.
3858 Instead, a program called dftables is compiled and run. This outputs
3859 the source for new set of tables, created in the default locale of your
3860 C run-time system. This method of replacing the tables does not work if
3861 you are cross compiling, because dftables is run on the local host. If
3862 you need to create alternative tables when cross compiling, you will
3863 have to do so "by hand".
3868 PCRE2 assumes by default that it will run in an environment where the
3869 character code is ASCII or Unicode, which is a superset of ASCII. This
3870 is the case for most computer operating systems. PCRE2 can, however, be
3871 compiled to run in an 8-bit EBCDIC environment by adding
3873 --enable-ebcdic --disable-unicode
3875 to the configure command. This setting implies --enable-rebuild-charta-
3876 bles. You should only use it if you know that you are in an EBCDIC
3877 environment (for example, an IBM mainframe operating system).
3879 It is not possible to support both EBCDIC and UTF-8 codes in the same
3880 version of the library. Consequently, --enable-unicode and --enable-
3881 ebcdic are mutually exclusive.
3883 The EBCDIC character that corresponds to an ASCII LF is assumed to have
3884 the value 0x15 by default. However, in some EBCDIC environments, 0x25
3885 is used. In such an environment you should use
3887 --enable-ebcdic-nl25
3889 as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
3890 has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
3891 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
3892 acter (which, in Unicode, is 0x85).
3894 The options that select newline behaviour, such as --enable-newline-is-
3895 cr, and equivalent run-time options, refer to these character values in
3896 an EBCDIC environment.
3899 PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
3901 By default, on non-Windows systems, pcre2grep supports the use of call-
3902 outs with string arguments within the patterns it is matching, in order
3903 to run external scripts. For details, see the pcre2grep documentation.
3904 This support can be disabled by adding --disable-pcre2grep-callout to
3905 the configure command.
3908 PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
3910 By default, pcre2grep reads all files as plain text. You can build it
3911 so that it recognizes files whose names end in .gz or .bz2, and reads
3912 them with libz or libbz2, respectively, by adding one or both of
3914 --enable-pcre2grep-libz
3915 --enable-pcre2grep-libbz2
3917 to the configure command. These options naturally require that the rel-
3918 evant libraries are installed on your system. Configuration will fail
3922 PCRE2GREP BUFFER SIZE
3924 pcre2grep uses an internal buffer to hold a "window" on the file it is
3925 scanning, in order to be able to output "before" and "after" lines when
3926 it finds a match. The default starting size of the buffer is 20KiB. The
3927 buffer itself is three times this size, but because of the way it is
3928 used for holding "before" lines, the longest line that is guaranteed to
3929 be processable is the notional buffer size. If a longer line is encoun-
3930 tered, pcre2grep automatically expands the buffer, up to a specified
3931 maximum size, whose default is 1MiB or the starting size, whichever is
3932 the larger. You can change the default parameter values by adding, for
3935 --with-pcre2grep-bufsize=51200
3936 --with-pcre2grep-max-bufsize=2097152
3938 to the configure command. The caller of pcre2grep can override these
3939 values by using --buffer-size and --max-buffer-size on the command
3943 PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
3947 --enable-pcre2test-libreadline
3948 --enable-pcre2test-libedit
3950 to the configure command, pcre2test is linked with the libreadline
3951 orlibedit library, respectively, and when its input is from a terminal,
3952 it reads it using the readline() function. This provides line-editing
3953 and history facilities. Note that libreadline is GPL-licensed, so if
3954 you distribute a binary of pcre2test linked in this way, there may be
3955 licensing issues. These can be avoided by linking instead with libedit,
3956 which has a BSD licence.
3958 Setting --enable-pcre2test-libreadline causes the -lreadline option to
3959 be added to the pcre2test build. In many operating environments with a
3960 sytem-installed readline library this is sufficient. However, in some
3961 environments (e.g. if an unmodified distribution version of readline is
3962 in use), some extra configuration may be necessary. The INSTALL file
3963 for libreadline says this:
3965 "Readline uses the termcap functions, but does not link with
3966 the termcap or curses library itself, allowing applications
3967 which link with readline the to choose an appropriate library."
3969 If your environment has not been set up so that an appropriate library
3970 is automatically included, you may need to add something like
3974 immediately before the configure command.
3977 INCLUDING DEBUGGING CODE
3983 to the configure command, additional debugging code is included in the
3984 build. This feature is intended for use by the PCRE2 maintainers.
3987 DEBUGGING WITH VALGRIND SUPPORT
3993 to the configure command, PCRE2 will use valgrind annotations to mark
3994 certain memory regions as unaddressable. This allows it to detect
3995 invalid memory accesses, and is mostly useful for debugging PCRE2
3999 CODE COVERAGE REPORTING
4001 If your C compiler is gcc, you can build a version of PCRE2 that can
4002 generate a code coverage report for its test suite. To enable this, you
4003 must install lcov version 1.6 or above. Then specify
4007 to the configure command and build PCRE2 in the usual way.
4009 Note that using ccache (a caching C compiler) is incompatible with code
4010 coverage reporting. If you have configured ccache to run automatically
4011 on your system, you must set the environment variable
4015 before running make to build PCRE2, so that ccache is not used.
4017 When --enable-coverage is used, the following addition targets are
4018 added to the Makefile:
4022 This creates a fresh coverage report for the PCRE2 test suite. It is
4023 equivalent to running "make coverage-reset", "make coverage-baseline",
4024 "make check", and then "make coverage-report".
4028 This zeroes the coverage counters, but does nothing else.
4030 make coverage-baseline
4032 This captures baseline coverage information.
4034 make coverage-report
4036 This creates the coverage report.
4038 make coverage-clean-report
4040 This removes the generated coverage report without cleaning the cover-
4043 make coverage-clean-data
4045 This removes the captured coverage data without removing the coverage
4046 files created at compile time (*.gcno).
4050 This cleans all coverage data including the generated coverage report.
4051 For more information about code coverage, see the gcov and lcov docu-
4057 There is a special option for use by people who want to run fuzzing
4060 --enable-fuzz-support
4062 At present this applies only to the 8-bit library. If set, it causes an
4063 extra library called libpcre2-fuzzsupport.a to be built, but not
4064 installed. This contains a single function called LLVMFuzzerTestOneIn-
4065 put() whose arguments are a pointer to a string and the length of the
4066 string. When called, this function tries to compile the string as a
4067 pattern, and if that succeeds, to match it. This is done both with no
4068 options and with some random options bits that are generated from the
4071 Setting --enable-fuzz-support also causes a binary called pcre2fuz-
4072 zcheck to be created. This is normally run under valgrind or used when
4073 PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
4074 function and outputs information about what it is doing. The input
4075 strings are specified by arguments: if an argument starts with "=" the
4076 rest of it is a literal input string. Otherwise, it is assumed to be a
4077 file name, and the contents of the file are the test string.
4082 In versions of PCRE2 prior to 10.30, there were two ways of handling
4083 backtracking in the pcre2_match() function. The default was to use the
4084 system stack, but if
4086 --disable-stack-for-recursion
4088 was set, memory on the heap was used. From release 10.30 onwards this
4089 has changed (the stack is no longer used) and this option now does
4090 nothing except give a warning.
4095 pcre2api(3), pcre2-config(3).
4101 University Computing Service
4107 Last updated: 26 April 2018
4108 Copyright (c) 1997-2018 University of Cambridge.
4109 ------------------------------------------------------------------------------
4112 PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
4117 PCRE2 - Perl-compatible regular expressions (revised API)
4123 int (*pcre2_callout)(pcre2_callout_block *, void *);
4125 int pcre2_callout_enumerate(const pcre2_code *code,
4126 int (*callback)(pcre2_callout_enumerate_block *, void *),
4132 PCRE2 provides a feature called "callout", which is a means of tempo-
4133 rarily passing control to the caller of PCRE2 in the middle of pattern
4134 matching. The caller of PCRE2 provides an external function by putting
4135 its entry point in a match context (see pcre2_set_callout() in the
4136 pcre2api documentation).
4138 Within a regular expression, (?C<arg>) indicates a point at which the
4139 external function is to be called. Different callout points can be
4140 identified by putting a number less than 256 after the letter C. The
4141 default value is zero. Alternatively, the argument may be a delimited
4142 string. The starting delimiter must be one of ` ' " ^ % # $ { and the
4143 ending delimiter is the same as the start, except for {, where the end-
4144 ing delimiter is }. If the ending delimiter is needed within the
4145 string, it must be doubled. For example, this pattern has two callout
4148 (?C1)abc(?C"some ""arbitrary"" text")def
4150 If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
4151 PCRE2 automatically inserts callouts, all with number 255, before each
4152 item in the pattern except for immediately before or after an explicit
4153 callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
4157 it is processed as if it were
4159 (?C255)A(?C3)B(?C255)
4161 Here is a more complicated example:
4165 With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
4167 (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
4169 Notice that there is a callout before and after each parenthesis and
4170 alternation bar. If the pattern contains a conditional group whose con-
4171 dition is an assertion, an automatic callout is inserted immediately
4172 before the condition. Such a callout may also be inserted explicitly,
4175 (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
4177 This applies only to assertion conditions (because they are themselves
4178 independent groups).
4180 Callouts can be useful for tracking the progress of pattern matching.
4181 The pcre2test program has a pattern qualifier (/auto_callout) that sets
4182 automatic callouts. When any callouts are present, the output from
4183 pcre2test indicates how the pattern is being matched. This is useful
4184 information when you are trying to optimize the performance of a par-
4190 You should be aware that, because of optimizations in the way PCRE2
4191 compiles and matches patterns, callouts sometimes do not happen exactly
4192 as you might expect.
4194 Auto-possessification
4196 At compile time, PCRE2 "auto-possessifies" repeated items when it knows
4197 that what follows cannot be part of the repeat. For example, a+[bc] is
4198 compiled as if it were a++[bc]. The pcre2test output when this pattern
4199 is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
4200 to the string "aaaa" is:
4207 This indicates that when matching [bc] fails, there is no backtracking
4208 into a+ (because it is being treated as a++) and therefore the callouts
4209 that would be taken for the backtracks do not occur. You can disable
4210 the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
4211 pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
4212 this case, the output changes to this:
4222 This time, when matching [bc] fails, the matcher backtracks into a+ and
4223 tries again, repeatedly, until a+ itself fails.
4225 Automatic .* anchoring
4227 By default, an optimization is applied when .* is the first significant
4228 item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
4229 any character, the pattern is automatically anchored. If PCRE2_DOTALL
4230 is not set, a match can start only after an internal newline or at the
4231 beginning of the subject, and pcre2_compile() remembers this. If a pat-
4232 tern has more than one top-level branch, automatic anchoring occurs if
4233 all branches are anchorable.
4235 This optimization is disabled, however, if .* is in an atomic group or
4236 if there is a backreference to the capturing group in which it appears.
4237 It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
4238 ever, the presence of callouts does not affect it.
4240 For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
4241 and applied to the string "aa", the pcre2test output is:
4250 This shows that all match attempts start at the beginning of the sub-
4251 ject. In other words, the pattern is anchored. You can disable this
4252 optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
4253 starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
4266 This shows more match attempts, starting at the second subject charac-
4267 ter. Another optimization, described in the next section, means that
4268 there is no subsequent attempt to match with an empty subject.
4272 Other optimizations that provide fast "no match" results also affect
4273 callouts. For example, if the pattern is
4277 PCRE2 knows that any matching string must contain the letter "d". If
4278 the subject string is "abyz", the lack of "d" means that matching
4279 doesn't ever start, and the callout is never reached. However, with
4280 "abyd", though the result is still no match, the callout is obeyed.
4282 For most patterns PCRE2 also knows the minimum length of a matching
4283 string, and will immediately give a "no match" return without actually
4284 running a match if the subject is not long enough, or, for unanchored
4285 patterns, if it has been scanned far enough.
4287 You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
4288 MIZE option to pcre2_compile(), or by starting the pattern with
4289 (*NO_START_OPT). This slows down the matching process, but does ensure
4290 that callouts such as the example above are obeyed.
4293 THE CALLOUT INTERFACE
4295 During matching, when PCRE2 reaches a callout point, if an external
4296 function is provided in the match context, it is called. This applies
4297 to both normal, DFA, and JIT matching. The first argument to the call-
4298 out function is a pointer to a pcre2_callout block. The second argument
4299 is the void * callout data that was supplied when the callout was set
4300 up by calling pcre2_set_callout() (see the pcre2api documentation). The
4301 callout block structure contains the following fields, not necessarily
4305 uint32_t callout_number;
4306 uint32_t capture_top;
4307 uint32_t capture_last;
4308 uint32_t callout_flags;
4309 PCRE2_SIZE *offset_vector;
4312 PCRE2_SIZE subject_length;
4313 PCRE2_SIZE start_match;
4314 PCRE2_SIZE current_position;
4315 PCRE2_SIZE pattern_position;
4316 PCRE2_SIZE next_item_length;
4317 PCRE2_SIZE callout_string_offset;
4318 PCRE2_SIZE callout_string_length;
4319 PCRE2_SPTR callout_string;
4321 The version field contains the version number of the block format. The
4322 current version is 2; the three callout string fields were added for
4323 version 1, and the callout_flags field for version 2. If you are writ-
4324 ing an application that might use an earlier release of PCRE2, you
4325 should check the version number before accessing any of these fields.
4326 The version number will increase in future if more fields are added,
4327 but the intention is never to remove any of the existing fields.
4329 Fields for numerical callouts
4331 For a numerical callout, callout_string is NULL, and callout_number
4332 contains the number of the callout, in the range 0-255. This is the
4333 number that follows (?C for callouts that part of the pattern; it is
4334 255 for automatically generated callouts.
4336 Fields for string callouts
4338 For callouts with string arguments, callout_number is always zero, and
4339 callout_string points to the string that is contained within the com-
4340 piled pattern. Its length is given by callout_string_length. Duplicated
4341 ending delimiters that were present in the original pattern string have
4342 been turned into single characters, but there is no other processing of
4343 the callout string argument. An additional code unit containing binary
4344 zero is present after the string, but is not included in the length.
4345 The delimiter that was used to start the string is also stored within
4346 the pattern, immediately before the string itself. You can access this
4347 delimiter as callout_string[-1] if you need it.
4349 The callout_string_offset field is the code unit offset to the start of
4350 the callout argument string within the original pattern string. This is
4351 provided for the benefit of applications such as script languages that
4352 might need to report errors in the callout string within the pattern.
4354 Fields for all callouts
4356 The remaining fields in the callout block are the same for both kinds
4359 The offset_vector field is a pointer to a vector of capturing offsets
4360 (the "ovector"). You may read the elements in this vector, but you must
4361 not change any of them.
4363 For calls to pcre2_match(), the offset_vector field is not (since
4364 release 10.30) a pointer to the actual ovector that was passed to the
4365 matching function in the match data block. Instead it points to an
4366 internal ovector of a size large enough to hold all possible captured
4367 substrings in the pattern. Note that whenever a recursion or subroutine
4368 call within a pattern completes, the capturing state is reset to what
4371 The capture_last field contains the number of the most recently cap-
4372 tured substring, and the capture_top field contains one more than the
4373 number of the highest numbered captured substring so far. If no sub-
4374 strings have yet been captured, the value of capture_last is 0 and the
4375 value of capture_top is 1. The values of these fields do not always
4376 differ by one; for example, when the callout in the pattern
4377 ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
4379 The contents of ovector[2] to ovector[<capture_top>*2-1] can be
4380 inspected in order to extract substrings that have been matched so far,
4381 in the same way as extracting substrings after a match has completed.
4382 The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
4383 the match is by definition not complete. Substrings that have not been
4384 captured but whose numbers are less than capture_top also have both of
4385 their ovector slots set to PCRE2_UNSET.
4387 For DFA matching, the offset_vector field points to the ovector that
4388 was passed to the matching function in the match data block for call-
4389 outs at the top level, but to an internal ovector during the processing
4390 of pattern recursions, lookarounds, and atomic groups. However, these
4391 ovectors hold no useful information because pcre2_dfa_match() does not
4392 support substring capturing. The value of capture_top is always 1 and
4393 the value of capture_last is always 0 for DFA matching.
4395 The subject and subject_length fields contain copies of the values that
4396 were passed to the matching function.
4398 The start_match field normally contains the offset within the subject
4399 at which the current match attempt started. However, if the escape
4400 sequence \K has been encountered, this value is changed to reflect the
4401 modified starting point. If the pattern is not anchored, the callout
4402 function may be called several times from the same point in the pattern
4403 for different starting points in the subject.
4405 The current_position field contains the offset within the subject of
4406 the current match pointer.
4408 The pattern_position field contains the offset in the pattern string to
4409 the next item to be matched.
4411 The next_item_length field contains the length of the next item to be
4412 processed in the pattern string. When the callout is at the end of the
4413 pattern, the length is zero. When the callout precedes an opening
4414 parenthesis, the length includes meta characters that follow the paren-
4415 thesis. For example, in a callout before an assertion such as (?=ab)
4416 the length is 3. For an an alternation bar or a closing parenthesis,
4417 the length is one, unless a closing parenthesis is followed by a quan-
4418 tifier, in which case its length is included. (This changed in release
4419 10.23. In earlier releases, before an opening parenthesis the length
4420 was that of the entire subpattern, and before an alternation bar or a
4421 closing parenthesis the length was zero.)
4423 The pattern_position and next_item_length fields are intended to help
4424 in distinguishing between different automatic callouts, which all have
4425 the same callout number. However, they are set for all callouts, and
4426 are used by pcre2test to show the next item to be matched when display-
4427 ing callout information.
4429 In callouts from pcre2_match() the mark field contains a pointer to the
4430 zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
4431 (*THEN) item in the match, or NULL if no such items have been passed.
4432 Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
4433 previous (*MARK). In callouts from the DFA matching function this field
4434 always contains NULL.
4436 The callout_flags field is always zero in callouts from
4437 pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
4438 JIT is used, the following bits may be set:
4440 PCRE2_CALLOUT_STARTMATCH
4442 This is set for the first callout after the start of matching for each
4443 new starting position in the subject.
4445 PCRE2_CALLOUT_BACKTRACK
4447 This is set if there has been a matching backtrack since the previous
4448 callout, or since the start of matching if this is the first callout
4449 from a pcre2_match() run.
4451 Both bits are set when a backtrack has caused a "bumpalong" to a new
4452 starting position in the subject. Output from pcre2test does not indi-
4453 cate the presence of these bits unless the callout_extra modifier is
4456 The information in the callout_flags field is provided so that applica-
4457 tions can track and tell their users how matching with backtracking is
4458 done. This can be useful when trying to optimize patterns, or just to
4459 understand how PCRE2 works. There is no support in pcre2_dfa_match()
4460 because there is no backtracking in DFA matching, and there is no sup-
4461 port in JIT because JIT is all about maximimizing matching performance.
4462 In both these cases the callout_flags field is always zero.
4465 RETURN VALUES FROM CALLOUTS
4467 The external callout function returns an integer to PCRE2. If the value
4468 is zero, matching proceeds as normal. If the value is greater than
4469 zero, matching fails at the current point, but the testing of other
4470 matching possibilities goes ahead, just as if a lookahead assertion had
4471 failed. If the value is less than zero, the match is abandoned, and the
4472 matching function returns the negative value.
4474 Negative values should normally be chosen from the set of
4475 PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
4476 standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
4477 reserved for use by callout functions; it will never be used by PCRE2
4483 int pcre2_callout_enumerate(const pcre2_code *code,
4484 int (*callback)(pcre2_callout_enumerate_block *, void *),
4487 A script language that supports the use of string arguments in callouts
4488 might like to scan all the callouts in a pattern before running the
4489 match. This can be done by calling pcre2_callout_enumerate(). The first
4490 argument is a pointer to a compiled pattern, the second points to a
4491 callback function, and the third is arbitrary user data. The callback
4492 function is called for every callout in the pattern in the order in
4493 which they appear. Its first argument is a pointer to a callout enumer-
4494 ation block, and its second argument is the user_data value that was
4495 passed to pcre2_callout_enumerate(). The data block contains the fol-
4498 version Block version number
4499 pattern_position Offset to next item in pattern
4500 next_item_length Length of next item in pattern
4501 callout_number Number for numbered callouts
4502 callout_string_offset Offset to string within pattern
4503 callout_string_length Length of callout string
4504 callout_string Points to callout string or is NULL
4506 The version number is currently 0. It will increase if new fields are
4507 ever added to the block. The remaining fields are the same as their
4508 namesakes in the pcre2_callout block that is used for callouts during
4509 matching, as described above.
4511 Note that the value of pattern_position is unique for each callout.
4512 However, if a callout occurs inside a group that is quantified with a
4513 non-zero minimum or a fixed maximum, the group is replicated inside the
4514 compiled pattern. For example, a pattern such as /(a){2}/ is compiled
4515 as if it were /(a)(a)/. This means that the callout will be enumerated
4516 more than once, but with the same value for pattern_position in each
4519 The callback function should normally return zero. If it returns a non-
4520 zero value, scanning the pattern stops, and that value is returned from
4521 pcre2_callout_enumerate().
4527 University Computing Service
4533 Last updated: 26 April 2018
4534 Copyright (c) 1997-2018 University of Cambridge.
4535 ------------------------------------------------------------------------------
4538 PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
4543 PCRE2 - Perl-compatible regular expressions (revised API)
4545 DIFFERENCES BETWEEN PCRE2 AND PERL
4547 This document describes the differences in the ways that PCRE2 and Perl
4548 handle regular expressions. The differences described here are with
4549 respect to Perl versions 5.26, but as both Perl and PCRE2 are continu-
4550 ally changing, the information may sometimes be out of date.
4552 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
4553 it does have are given in the pcre2unicode page.
4555 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
4556 tions, but they do not mean what you might think. For example, (?!a){3}
4557 does not assert that the next three characters are not "a". It just
4558 asserts that the next character is not "a" three times (in principle;
4559 PCRE2 optimizes this to run the assertion just once). Perl allows some
4560 repeat quantifiers on other assertions, for example, \b* (but not
4561 \b{3}), but these do not seem to have any use.
4563 3. Capturing subpatterns that occur inside negative lookaround asser-
4564 tions are counted, but their entries in the offsets vector are set only
4565 when a negative assertion is a condition that has a matching branch
4566 (that is, the condition is false).
4568 4. The following Perl escape sequences are not supported: \F, \l, \L,
4569 \u, \U, and \N when followed by a character name. \N on its own, match-
4570 ing a non-newline character, and \N{U+dd..}, matching a Unicode code
4571 point, are supported. The escapes that modify the case of following
4572 letters are implemented by Perl's general string-handling and are not
4573 part of its pattern matching engine. If any of these are encountered by
4574 PCRE2, an error is generated by default. However, if the PCRE2_ALT_BSUX
4575 option is set, \U and \u are interpreted as ECMAScript interprets them.
4577 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
4578 is built with Unicode support (the default). The properties that can be
4579 tested with \p and \P are limited to the general category properties
4580 such as Lu and Nd, script names such as Greek or Han, and the derived
4581 properties Any and L&. PCRE2 does support the Cs (surrogate) property,
4582 which Perl does not; the Perl documentation says "Because Perl hides
4583 the need for the user to understand the internal representation of Uni-
4584 code characters, there is no need to implement the somewhat messy con-
4585 cept of surrogates."
4587 6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
4588 in between are treated as literals. However, this is slightly different
4589 from Perl in that $ and @ are also handled as literals inside the
4590 quotes. In Perl, they cause variable interpolation (but of course PCRE2
4591 does not have variables). Also, Perl does "double-quotish backslash
4592 interpolation" on any backslashes between \Q and \E which, its documen-
4593 tation says, "may lead to confusing results". PCRE2 treats a backslash
4594 between \Q and \E just like any other character. Note the following
4597 Pattern PCRE2 matches Perl matches
4599 \Qabc$xyz\E abc$xyz abc followed by the
4601 \Qabc\$xyz\E abc\$xyz abc\$xyz
4602 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
4606 The \Q...\E sequence is recognized both inside and outside character
4609 7. Fairly obviously, PCRE2 does not support the (?{code}) and
4610 (??{code}) constructions. However, PCRE2 does have a "callout" feature,
4611 which allows an external function to be called during pattern matching.
4612 See the pcre2callout documentation for details.
4614 8. Subroutine calls (whether recursive or not) were treated as atomic
4615 groups up to PCRE2 release 10.23, but from release 10.30 this changed,
4616 and backtracking into subroutine calls is now supported, as in Perl.
4618 9. If any of the backtracking control verbs are used in a subpattern
4619 that is called as a subroutine (whether or not recursively), their
4620 effect is confined to that subpattern; it does not extend to the sur-
4621 rounding pattern. This is not always the case in Perl. In particular,
4622 if (*THEN) is present in a group that is called as a subroutine, its
4623 action is limited to that group, even if the group does not contain any
4624 | characters. Note that such subpatterns are processed as anchored at
4625 the point where they are tested.
4627 10. If a pattern contains more than one backtracking control verb, the
4628 first one that is backtracked onto acts. For example, in the pattern
4629 A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
4630 in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
4631 it is the same as PCRE2, but there are cases where it differs.
4633 11. Most backtracking verbs in assertions have their normal actions.
4634 They are not confined to the assertion.
4636 12. There are some differences that are concerned with the settings of
4637 captured strings when part of a pattern is repeated. For example,
4638 matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
4639 unset, but in PCRE2 it is set to "b".
4641 13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
4642 pattern names is not as general as Perl's. This is a consequence of the
4643 fact the PCRE2 works internally just with numbers, using an external
4644 table to translate between numbers and names. In particular, a pattern
4645 such as (?|(?<a>A)|(?<b>B), where the two capturing parentheses have
4646 the same number but different names, is not supported, and causes an
4647 error at compile time. If it were allowed, it would not be possible to
4648 distinguish which parentheses matched, because both names map to cap-
4649 turing subpattern number 1. To avoid this confusing situation, an error
4650 is given at compile time.
4652 14. Perl used to recognize comments in some places that PCRE2 does not,
4653 for example, between the ( and ? at the start of a subpattern. If the
4654 /x modifier is set, Perl allowed white space between ( and ? though the
4655 latest Perls give an error (for a while it was just deprecated). There
4656 may still be some cases where Perl behaves differently.
4658 15. Perl, when in warning mode, gives warnings for character classes
4659 such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
4660 als. PCRE2 has no warning features, so it gives an error in these cases
4661 because they are almost certainly user mistakes.
4663 16. In PCRE2, the upper/lower case character properties Lu and Ll are
4664 not affected when case-independent matching is specified. For example,
4665 \p{Lu} always matches an upper case letter. I think Perl has changed in
4666 this respect; in the release at the time of writing (5.24), \p{Lu} and
4667 \p{Ll} match all letters, regardless of case, when case independence is
4670 17. PCRE2 provides some extensions to the Perl regular expression
4671 facilities. Perl 5.10 includes new features that are not in earlier
4672 versions of Perl, some of which (such as named parentheses) were in
4673 PCRE2 for some time before. This list is with respect to Perl 5.26:
4675 (a) Although lookbehind assertions in PCRE2 must match fixed length
4676 strings, each alternative branch of a lookbehind assertion can match a
4677 different length of string. Perl requires them all to have the same
4680 (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
4681 ported in lookbehinds, provided that there is no possibility of refer-
4682 encing a non-unique number or name. Perl does not support backrefer-
4683 ences in lookbehinds.
4685 (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
4686 $ meta-character matches only at the very end of the string.
4688 (d) A backslash followed by a letter with no special meaning is
4689 faulted. (Perl can be made to issue a warning.)
4691 (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
4692 fiers is inverted, that is, by default they are not greedy, but if fol-
4693 lowed by a question mark they are.
4695 (f) PCRE2_ANCHORED can be used at matching time to force a pattern to
4696 be tried only at the first matching position in the subject string.
4698 (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
4699 PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
4701 (h) The \R escape sequence can be restricted to match only CR, LF, or
4702 CRLF by the PCRE2_BSR_ANYCRLF option.
4704 (i) The callout facility is PCRE2-specific. Perl supports codeblocks
4705 and variable interpolation, but not general hooks on every match.
4707 (j) The partial matching facility is PCRE2-specific.
4709 (k) The alternative matching function (pcre2_dfa_match() matches in a
4710 different way and is not Perl-compatible.
4712 (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
4713 at the start of a pattern that set overall options that cannot be
4714 changed within the pattern.
4716 18. The Perl /a modifier restricts /d numbers to pure ascii, and the
4717 /aa modifier restricts /i case-insensitive matching to pure ascii,
4718 ignoring Unicode rules. This separation cannot be represented with
4721 19. Perl has different limits than PCRE2. See the pcre2limit documenta-
4722 tion for details. Perl went with 5.10 from recursion to iteration keep-
4723 ing the intermediate matches on the heap, which is ~10% slower but does
4724 not fall into any stack-overflow limit. PCRE2 made a similar change at
4725 release 10.30, and also has many build-time and run-time customizable
4732 University Computing Service
4738 Last updated: 28 July 2018
4739 Copyright (c) 1997-2018 University of Cambridge.
4740 ------------------------------------------------------------------------------
4743 PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
4748 PCRE2 - Perl-compatible regular expressions (revised API)
4750 PCRE2 JUST-IN-TIME COMPILER SUPPORT
4752 Just-in-time compiling is a heavyweight optimization that can greatly
4753 speed up pattern matching. However, it comes at the cost of extra pro-
4754 cessing before the match is performed, so it is of most benefit when
4755 the same pattern is going to be matched many times. This does not nec-
4756 essarily mean many calls of a matching function; if the pattern is not
4757 anchored, matching attempts may take place many times at various posi-
4758 tions in the subject, even for a single call. Therefore, if the subject
4759 string is very long, it may still pay to use JIT even for one-off
4760 matches. JIT support is available for all of the 8-bit, 16-bit and
4761 32-bit PCRE2 libraries.
4763 JIT support applies only to the traditional Perl-compatible matching
4764 function. It does not apply when the DFA matching function is being
4765 used. The code for this support was written by Zoltan Herczeg.
4768 AVAILABILITY OF JIT SUPPORT
4770 JIT support is an optional feature of PCRE2. The "configure" option
4771 --enable-jit (or equivalent CMake option) must be set when PCRE2 is
4772 built if you want to use JIT. The support is limited to the following
4775 ARM 32-bit (v5, v7, and Thumb2)
4777 Intel x86 32-bit and 64-bit
4778 MIPS 32-bit and 64-bit
4779 Power PC 32-bit and 64-bit
4782 If --enable-jit is set on an unsupported platform, compilation fails.
4784 A program can tell if JIT support is available by calling pcre2_con-
4785 fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is
4786 available, and 0 otherwise. However, a simple program does not need to
4787 check this in order to use JIT. The API is implemented in a way that
4788 falls back to the interpretive code if JIT is not available. For pro-
4789 grams that need the best possible performance, there is also a "fast
4790 path" API that is JIT-specific.
4795 To make use of the JIT support in the simplest way, all you have to do
4796 is to call pcre2_jit_compile() after successfully compiling a pattern
4797 with pcre2_compile(). This function has two arguments: the first is the
4798 compiled pattern pointer that was returned by pcre2_compile(), and the
4799 second is zero or more of the following option bits: PCRE2_JIT_COM-
4800 PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
4802 If JIT support is not available, a call to pcre2_jit_compile() does
4803 nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
4804 pattern is passed to the JIT compiler, which turns it into machine code
4805 that executes much faster than the normal interpretive code, but yields
4806 exactly the same results. The returned value from pcre2_jit_compile()
4807 is zero on success, or a negative error code.
4809 There is a limit to the size of pattern that JIT supports, imposed by
4810 the size of machine stack that it uses. The exact rules are not docu-
4811 mented because they may change at any time, in particular, when new
4812 optimizations are introduced. If a pattern is too big, a call to
4813 pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
4815 PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
4816 plete matches. If you want to run partial matches using the PCRE2_PAR-
4817 TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
4818 set one or both of the other options as well as, or instead of
4819 PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
4820 for each of the three modes (normal, soft partial, hard partial). When
4821 pcre2_match() is called, the appropriate code is run if it is avail-
4822 able. Otherwise, the pattern is matched using interpretive code.
4824 You can call pcre2_jit_compile() multiple times for the same compiled
4825 pattern. It does nothing if it has previously compiled code for any of
4826 the option bits. For example, you can call it once with PCRE2_JIT_COM-
4827 PLETE and (perhaps later, when you find you need partial matching)
4828 again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
4829 will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
4830 ing. If pcre2_jit_compile() is called with no option bits set, it imme-
4831 diately returns zero. This is an alternative way of testing whether JIT
4834 At present, it is not possible to free JIT compiled code except when
4835 the entire compiled pattern is freed by calling pcre2_code_free().
4837 In some circumstances you may need to call additional functions. These
4838 are described in the section entitled "Controlling the JIT stack"
4841 There are some pcre2_match() options that are not supported by JIT, and
4842 there are also some pattern items that JIT cannot handle. Details are
4843 given below. In both cases, matching automatically falls back to the
4844 interpretive code. If you want to know whether JIT was actually used
4845 for a particular match, you should arrange for a JIT callback function
4846 to be set up as described in the section entitled "Controlling the JIT
4847 stack" below, even if you do not need to supply a non-default JIT
4848 stack. Such a callback function is called whenever JIT code is about to
4849 be obeyed. If the match-time options are not right for JIT execution,
4850 the callback function is not obeyed.
4852 If the JIT compiler finds an unsupported item, no JIT data is gener-
4853 ated. You can find out if JIT matching is available after compiling a
4854 pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE
4855 option. A non-zero result means that JIT compilation was successful. A
4856 result of 0 means that JIT support is not available, or the pattern was
4857 not processed by pcre2_jit_compile(), or the JIT compiler was not able
4858 to handle the pattern.
4861 UNSUPPORTED OPTIONS AND PATTERN ITEMS
4863 The pcre2_match() options that are supported for JIT matching are
4864 PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
4865 PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
4866 PCRE2_ANCHORED option is not supported at match time.
4868 If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
4869 use of JIT, forcing matching by the interpreter code.
4871 The only unsupported pattern items are \C (match a single data unit)
4872 when running in a UTF mode, and a callout immediately before an asser-
4873 tion condition in a conditional group.
4876 RETURN VALUES FROM JIT MATCHING
4878 When a pattern is matched using JIT matching, the return values are the
4879 same as those given by the interpretive pcre2_match() code, with the
4880 addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
4881 that the memory used for the JIT stack was insufficient. See "Control-
4882 ling the JIT stack" below for a discussion of JIT stack usage.
4884 The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
4885 searching a very large pattern tree goes on for too long, as it is in
4886 the same circumstance when JIT is not used, but the details of exactly
4887 what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
4888 is never returned when JIT matching is used.
4891 CONTROLLING THE JIT STACK
4893 When the compiled JIT code runs, it needs a block of memory to use as a
4894 stack. By default, it uses 32KiB on the machine stack. However, some
4895 large or complicated patterns need more than this. The error
4896 PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
4897 Three functions are provided for managing blocks of memory for use as
4898 JIT stacks. There is further discussion about the use of JIT stacks in
4899 the section entitled "JIT stack FAQ" below.
4901 The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
4902 ments are a starting size, a maximum size, and a general context (for
4903 memory allocation functions, or NULL for standard memory allocation).
4904 It returns a pointer to an opaque structure of type pcre2_jit_stack, or
4905 NULL if there is an error. The pcre2_jit_stack_free() function is used
4906 to free a stack that is no longer needed. If its argument is NULL, this
4907 function returns immediately, without doing anything. (For the techni-
4908 cally minded: the address space is allocated by mmap or VirtualAlloc.)
4909 A maximum stack size of 512KiB to 1MiB should be more than enough for
4912 The pcre2_jit_stack_assign() function specifies which stack JIT code
4913 should use. Its arguments are as follows:
4915 pcre2_match_context *mcontext
4916 pcre2_jit_callback callback
4919 The first argument is a pointer to a match context. When this is subse-
4920 quently passed to a matching function, its information determines which
4921 JIT stack is used. If this argument is NULL, the function returns imme-
4922 diately, without doing anything. There are three cases for the values
4923 of the other two options:
4925 (1) If callback is NULL and data is NULL, an internal 32KiB block
4926 on the machine stack is used. This is the default when a match
4929 (2) If callback is NULL and data is not NULL, data must be
4930 a pointer to a valid JIT stack, the result of calling
4931 pcre2_jit_stack_create().
4933 (3) If callback is not NULL, it must point to a function that is
4934 called with data as an argument at the start of matching, in
4935 order to set up a JIT stack. If the return from the callback
4936 function is NULL, the internal 32KiB stack is used; otherwise the
4937 return value must be a valid JIT stack, the result of calling
4938 pcre2_jit_stack_create().
4940 A callback function is obeyed whenever JIT code is about to be run; it
4941 is not obeyed when pcre2_match() is called with options that are incom-
4942 patible for JIT matching. A callback function can therefore be used to
4943 determine whether a match operation was executed by JIT or by the
4946 You may safely use the same JIT stack for more than one pattern (either
4947 by assigning directly or by callback), as long as the patterns are
4948 matched sequentially in the same thread. Currently, the only way to set
4949 up non-sequential matches in one thread is to use callouts: if a call-
4950 out function starts another match, that match must use a different JIT
4951 stack to the one used for currently suspended match(es).
4953 In a multithread application, if you do not specify a JIT stack, or if
4954 you assign or pass back NULL from a callback, that is thread-safe,
4955 because each thread has its own machine stack. However, if you assign
4956 or pass back a non-NULL JIT stack, this must be a different stack for
4957 each thread so that the application is thread-safe.
4959 Strictly speaking, even more is allowed. You can assign the same non-
4960 NULL stack to a match context that is used by any number of patterns,
4961 as long as they are not used for matching by multiple threads at the
4962 same time. For example, you could use the same stack in all compiled
4963 patterns, with a global mutex in the callback to wait until the stack
4964 is available for use. However, this is an inefficient solution, and not
4967 This is a suggestion for how a multithreaded program that needs to set
4968 up non-default JIT stacks might operate:
4970 During thread initalization
4971 thread_local_var = pcre2_jit_stack_create(...)
4974 pcre2_jit_stack_free(thread_local_var)
4976 Use a one-line callback function
4977 return thread_local_var
4979 All the functions described in this section do nothing if JIT is not
4985 (1) Why do we need JIT stacks?
4987 PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
4988 where the local data of the current node is pushed before checking its
4989 child nodes. Allocating real machine stack on some platforms is diffi-
4990 cult. For example, the stack chain needs to be updated every time if we
4991 extend the stack on PowerPC. Although it is possible, its updating
4992 time overhead decreases performance. So we do the recursion in memory.
4994 (2) Why don't we simply allocate blocks of memory with malloc()?
4996 Modern operating systems have a nice feature: they can reserve an
4997 address space instead of allocating memory. We can safely allocate mem-
4998 ory pages inside this address space, so the stack could grow without
4999 moving memory data (this is important because of pointers). Thus we can
5000 allocate 1MiB address space, and use only a single memory page (usually
5001 4KiB) if that is enough. However, we can still grow up to 1MiB anytime
5004 (3) Who "owns" a JIT stack?
5006 The owner of the stack is the user program, not the JIT studied pattern
5007 or anything else. The user program must ensure that if a stack is being
5008 used by pcre2_match(), (that is, it is assigned to a match context that
5009 is passed to the pattern currently running), that stack must not be
5010 used by any other threads (to avoid overwriting the same memory area).
5011 The best practice for multithreaded programs is to allocate a stack for
5012 each thread, and return this stack through the JIT callback function.
5014 (4) When should a JIT stack be freed?
5016 You can free a JIT stack at any time, as long as it will not be used by
5017 pcre2_match() again. When you assign the stack to a match context, only
5018 a pointer is set. There is no reference counting or any other magic.
5019 You can free compiled patterns, contexts, and stacks in any order, any-
5020 time. Just do not call pcre2_match() with a match context pointing to
5021 an already freed stack, as that will cause SEGFAULT. (Also, do not free
5022 a stack currently used by pcre2_match() in another thread). You can
5023 also replace the stack in a context at any time when it is not in use.
5024 You should free the previous stack before assigning a replacement.
5026 (5) Should I allocate/free a stack every time before/after calling
5029 No, because this is too costly in terms of resources. However, you
5030 could implement some clever idea which release the stack if it is not
5031 used in let's say two minutes. The JIT callback can help to achieve
5032 this without keeping a list of patterns.
5034 (6) OK, the stack is for long term memory allocation. But what happens
5035 if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
5036 kept until the stack is freed?
5038 Especially on embedded sytems, it might be a good idea to release mem-
5039 ory sometimes without freeing the stack. There is no API for this at
5040 the moment. Probably a function call which returns with the currently
5041 allocated memory for any stack and another which allows releasing mem-
5042 ory (shrinking the stack) would be a good idea if someone needs this.
5044 (7) This is too much of a headache. Isn't there any better solution for
5047 No, thanks to Windows. If POSIX threads were used everywhere, we could
5048 throw out this complicated API.
5051 FREEING JIT SPECULATIVE MEMORY
5053 void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
5055 The JIT executable allocator does not free all memory when it is possi-
5056 ble. It expects new allocations, and keeps some free memory around to
5057 improve allocation speed. However, in low memory conditions, it might
5058 be better to free all possible memory. You can cause this to happen by
5059 calling pcre2_jit_free_unused_memory(). Its argument is a general con-
5060 text, for custom memory management, or NULL for standard memory manage-
5066 This is a single-threaded example that specifies a JIT stack without
5067 using a callback. A real program should include error checking after
5068 all the function calls.
5072 pcre2_match_data *match_data;
5073 pcre2_match_context *mcontext;
5074 pcre2_jit_stack *jit_stack;
5076 re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
5077 &errornumber, &erroffset, NULL);
5078 rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
5079 mcontext = pcre2_match_context_create(NULL);
5080 jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
5081 pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
5082 match_data = pcre2_match_data_create(re, 10);
5083 rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
5084 /* Process result */
5086 pcre2_code_free(re);
5087 pcre2_match_data_free(match_data);
5088 pcre2_match_context_free(mcontext);
5089 pcre2_jit_stack_free(jit_stack);
5094 Because the API described above falls back to interpreted matching when
5095 JIT is not available, it is convenient for programs that are written
5096 for general use in many environments. However, calling JIT via
5097 pcre2_match() does have a performance impact. Programs that are written
5098 for use where JIT is known to be available, and which need the best
5099 possible performance, can instead use a "fast path" API to call JIT
5100 matching directly instead of calling pcre2_match() (obviously only for
5101 patterns that have been successfully processed by pcre2_jit_compile()).
5103 The fast path function is called pcre2_jit_match(), and it takes
5104 exactly the same arguments as pcre2_match(). The return values are also
5105 the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
5106 complete) is requested that was not compiled. Unsupported option bits
5107 (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
5110 When you call pcre2_match(), as well as testing for invalid options, a
5111 number of other sanity checks are performed on the arguments. For exam-
5112 ple, if the subject pointer is NULL, an immediate error is given. Also,
5113 unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
5114 validity. In the interests of speed, these checks do not happen on the
5115 JIT fast path, and if invalid data is passed, the result is undefined.
5117 Bypassing the sanity checks and the pcre2_match() wrapping can give
5118 speedups of more than 10%.
5128 Philip Hazel (FAQ by Zoltan Herczeg)
5129 University Computing Service
5135 Last updated: 28 June 2018
5136 Copyright (c) 1997-2018 University of Cambridge.
5137 ------------------------------------------------------------------------------
5140 PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
5145 PCRE2 - Perl-compatible regular expressions (revised API)
5147 SIZE AND OTHER LIMITATIONS
5149 There are some size limitations in PCRE2 but it is hoped that they will
5150 never in practice be relevant.
5152 The maximum size of a compiled pattern is approximately 64 thousand
5153 code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
5154 the default internal linkage size, which is 2 bytes for these
5155 libraries. If you want to process regular expressions that are truly
5156 enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
5157 (when building the 16-bit library, 3 is rounded up to 4). See the
5158 README file in the source distribution and the pcre2build documentation
5159 for details. In these cases the limit is substantially larger. How-
5160 ever, the speed of execution is slower. In the 32-bit library, the
5161 internal linkage size is always 4.
5163 The maximum length of a source pattern string is essentially unlimited;
5164 it is the largest number a PCRE2_SIZE variable can hold. However, the
5165 program that calls pcre2_compile() can specify a smaller limit.
5167 The maximum length (in code units) of a subject string is one less than
5168 the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an
5169 unsigned integer type, usually defined as size_t. Its maximum value
5170 (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
5171 terminated strings and unset offsets.
5173 All values in repeating quantifiers must be less than 65536.
5175 The maximum length of a lookbehind assertion is 65535 characters.
5177 There is no limit to the number of parenthesized subpatterns, but there
5178 can be no more than 65535 capturing subpatterns. There is, however, a
5179 limit to the depth of nesting of parenthesized subpatterns of all
5180 kinds. This is imposed in order to limit the amount of system stack
5181 used at compile time. The default limit can be specified when PCRE2 is
5182 built; if not, the default is set to 250. An application can change
5183 this limit by calling pcre2_set_parens_nest_limit() to set the limit in
5186 The maximum length of name for a named subpattern is 32 code units, and
5187 the maximum number of named subpatterns is 10000.
5189 The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
5190 (*THEN) verb is 255 code units for the 8-bit library and 65535 code
5191 units for the 16-bit and 32-bit libraries.
5193 The maximum length of a string argument to a callout is the largest
5194 number a 32-bit unsigned integer can hold.
5200 University Computing Service
5206 Last updated: 30 March 2017
5207 Copyright (c) 1997-2017 University of Cambridge.
5208 ------------------------------------------------------------------------------
5211 PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
5216 PCRE2 - Perl-compatible regular expressions (revised API)
5218 PCRE2 MATCHING ALGORITHMS
5220 This document describes the two different algorithms that are available
5221 in PCRE2 for matching a compiled regular expression against a given
5222 subject string. The "standard" algorithm is the one provided by the
5223 pcre2_match() function. This works in the same as as Perl's matching
5224 function, and provide a Perl-compatible matching operation. The just-
5225 in-time (JIT) optimization that is described in the pcre2jit documenta-
5226 tion is compatible with this function.
5228 An alternative algorithm is provided by the pcre2_dfa_match() function;
5229 it operates in a different way, and is not Perl-compatible. This alter-
5230 native has advantages and disadvantages compared with the standard
5231 algorithm, and these are described below.
5233 When there is only one possible way in which a given subject string can
5234 match a pattern, the two algorithms give the same answer. A difference
5235 arises, however, when there are multiple possibilities. For example, if
5240 is matched against the string
5242 <something> <something else> <something further>
5244 there are three possible answers. The standard algorithm finds only one
5245 of them, whereas the alternative algorithm finds all three.
5248 REGULAR EXPRESSIONS AS TREES
5250 The set of strings that are matched by a regular expression can be rep-
5251 resented as a tree structure. An unlimited repetition in the pattern
5252 makes the tree of infinite size, but it is still a tree. Matching the
5253 pattern to a given subject string (from a given starting point) can be
5254 thought of as a search of the tree. There are two ways to search a
5255 tree: depth-first and breadth-first, and these correspond to the two
5256 matching algorithms provided by PCRE2.
5259 THE STANDARD MATCHING ALGORITHM
5261 In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
5262 sions", the standard algorithm is an "NFA algorithm". It conducts a
5263 depth-first search of the pattern tree. That is, it proceeds along a
5264 single path through the tree, checking that the subject matches what is
5265 required. When there is a mismatch, the algorithm tries any alterna-
5266 tives at the current point, and if they all fail, it backs up to the
5267 previous branch point in the tree, and tries the next alternative
5268 branch at that level. This often involves backing up (moving to the
5269 left) in the subject string as well. The order in which repetition
5270 branches are tried is controlled by the greedy or ungreedy nature of
5273 If a leaf node is reached, a matching string has been found, and at
5274 that point the algorithm stops. Thus, if there is more than one possi-
5275 ble match, this algorithm returns the first one that it finds. Whether
5276 this is the shortest, the longest, or some intermediate length depends
5277 on the way the greedy and ungreedy repetition quantifiers are specified
5280 Because it ends up with a single path through the tree, it is rela-
5281 tively straightforward for this algorithm to keep track of the sub-
5282 strings that are matched by portions of the pattern in parentheses.
5283 This provides support for capturing parentheses and backreferences.
5286 THE ALTERNATIVE MATCHING ALGORITHM
5288 This algorithm conducts a breadth-first search of the tree. Starting
5289 from the first matching point in the subject, it scans the subject
5290 string from left to right, once, character by character, and as it does
5291 this, it remembers all the paths through the tree that represent valid
5292 matches. In Friedl's terminology, this is a kind of "DFA algorithm",
5293 though it is not implemented as a traditional finite state machine (it
5294 keeps multiple states active simultaneously).
5296 Although the general principle of this matching algorithm is that it
5297 scans the subject string only once, without backtracking, there is one
5298 exception: when a lookaround assertion is encountered, the characters
5299 following or preceding the current point have to be independently
5302 The scan continues until either the end of the subject is reached, or
5303 there are no more unterminated paths. At this point, terminated paths
5304 represent the different matching possibilities (if there are none, the
5305 match has failed). Thus, if there is more than one possible match,
5306 this algorithm finds all of them, and in particular, it finds the long-
5307 est. The matches are returned in decreasing order of length. There is
5308 an option to stop the algorithm after the first match (which is neces-
5309 sarily the shortest) is found.
5311 Note that all the matches that are found start at the same point in the
5312 subject. If the pattern
5316 is matched against the string "the caterpillar catchment", the result
5317 is the three strings "caterpillar", "cater", and "cat" that start at
5318 the fifth character of the subject. The algorithm does not automati-
5319 cally move on to find matches that start at later positions.
5321 PCRE2's "auto-possessification" optimization usually applies to charac-
5322 ter repeats at the end of a pattern (as well as internally). For exam-
5323 ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
5324 is no point even considering the possibility of backtracking into the
5325 repeated digits. For DFA matching, this means that only one possible
5326 match is found. If you really do want multiple matches in such cases,
5327 either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
5328 SESS option when compiling.
5330 There are a number of features of PCRE2 regular expressions that are
5331 not supported by the alternative matching algorithm. They are as fol-
5334 1. Because the algorithm finds all possible matches, the greedy or
5335 ungreedy nature of repetition quantifiers is not relevant (though it
5336 may affect auto-possessification, as just described). During matching,
5337 greedy and ungreedy quantifiers are treated in exactly the same way.
5338 However, possessive quantifiers can make a difference when what follows
5339 could also match what is quantified, for example in a pattern like
5344 This pattern matches "aaab!" but not "aaa!", which would be matched by
5345 a non-possessive quantifier. Similarly, if an atomic group is present,
5346 it is matched as if it were a standalone pattern at the current point,
5347 and the longest match is then "locked in" for the rest of the overall
5350 2. When dealing with multiple paths through the tree simultaneously, it
5351 is not straightforward to keep track of captured substrings for the
5352 different matching possibilities, and PCRE2's implementation of this
5353 algorithm does not attempt to do this. This means that no captured sub-
5354 strings are available.
5356 3. Because no substrings are captured, backreferences within the pat-
5357 tern are not supported, and cause errors if encountered.
5359 4. For the same reason, conditional expressions that use a backrefer-
5360 ence as the condition or test for a specific group recursion are not
5363 5. Because many paths through the tree may be active, the \K escape
5364 sequence, which resets the start of the match when encountered (but may
5365 be on some paths and not on others), is not supported. It causes an
5366 error if encountered.
5368 6. Callouts are supported, but the value of the capture_top field is
5369 always 1, and the value of the capture_last field is always 0.
5371 7. The \C escape sequence, which (in the standard algorithm) always
5372 matches a single code unit, even in a UTF mode, is not supported in
5373 these modes, because the alternative algorithm moves through the sub-
5374 ject string one character (not code unit) at a time, for all active
5375 paths through the tree.
5377 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
5378 are not supported. (*FAIL) is supported, and behaves like a failing
5382 ADVANTAGES OF THE ALTERNATIVE ALGORITHM
5384 Using the alternative matching algorithm provides the following advan-
5387 1. All possible matches (at a single point in the subject) are automat-
5388 ically found, and in particular, the longest match is found. To find
5389 more than one match using the standard algorithm, you have to do kludgy
5390 things with callouts.
5392 2. Because the alternative algorithm scans the subject string just
5393 once, and never needs to backtrack (except for lookbehinds), it is pos-
5394 sible to pass very long subject strings to the matching function in
5395 several pieces, checking for partial matching each time. Although it is
5396 also possible to do multi-segment matching using the standard algo-
5397 rithm, by retaining partially matched substrings, it is more compli-
5398 cated. The pcre2partial documentation gives details of partial matching
5399 and discusses multi-segment matching.
5402 DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
5404 The alternative algorithm suffers from a number of disadvantages:
5406 1. It is substantially slower than the standard algorithm. This is
5407 partly because it has to search for all possible matches, but is also
5408 because it is less susceptible to optimization.
5410 2. Capturing parentheses and backreferences are not supported.
5412 3. Although atomic groups are supported, their use does not provide the
5413 performance advantage that it does for the standard algorithm.
5419 University Computing Service
5425 Last updated: 29 September 2014
5426 Copyright (c) 1997-2014 University of Cambridge.
5427 ------------------------------------------------------------------------------
5430 PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
5435 PCRE2 - Perl-compatible regular expressions
5437 PARTIAL MATCHING IN PCRE2
5439 In normal use of PCRE2, if the subject string that is passed to a
5440 matching function matches as far as it goes, but is too short to match
5441 the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
5442 stances where it might be helpful to distinguish this case from other
5443 cases in which there is no match.
5445 Consider, for example, an application where a human is required to type
5446 in data for a field with specific formatting requirements. An example
5447 might be a date in the form ddmmmyy, defined by this pattern:
5449 ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
5451 If the application sees the user's keystrokes one by one, and can check
5452 that what has been typed so far is potentially valid, it is able to
5453 raise an error as soon as a mistake is made, by beeping and not
5454 reflecting the character that has been typed, for example. This immedi-
5455 ate feedback is likely to be a better user interface than a check that
5456 is delayed until the entire string has been entered. Partial matching
5457 can also be useful when the subject string is very long and is not all
5460 PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
5461 PCRE2_PARTIAL_HARD options, which can be set when calling a matching
5462 function. The difference between the two options is whether or not a
5463 partial match is preferred to an alternative complete match, though the
5464 details differ between the two types of matching function. If both
5465 options are set, PCRE2_PARTIAL_HARD takes precedence.
5467 If you want to use partial matching with just-in-time optimized code,
5468 you must call pcre2_jit_compile() with one or both of these options:
5470 PCRE2_JIT_PARTIAL_SOFT
5471 PCRE2_JIT_PARTIAL_HARD
5473 PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
5474 tial matches on the same pattern. If the appropriate JIT mode has not
5475 been compiled, interpretive matching code is used.
5477 Setting a partial matching option disables two of PCRE2's standard
5478 optimizations. PCRE2 remembers the last literal code unit in a pattern,
5479 and abandons matching immediately if it is not present in the subject
5480 string. This optimization cannot be used for a subject string that
5481 might match only partially. PCRE2 also knows the minimum length of a
5482 matching string, and does not bother to run the matching function on
5483 shorter strings. This optimization is also disabled for partial match-
5487 PARTIAL MATCHING USING pcre2_match()
5489 A partial match occurs during a call to pcre2_match() when the end of
5490 the subject string is reached successfully, but matching cannot con-
5491 tinue because more characters are needed. However, at least one charac-
5492 ter in the subject must have been inspected. This character need not
5493 form part of the final matched string; lookbehind assertions and the \K
5494 escape sequence provide ways of inspecting characters before the start
5495 of a matched string. The requirement for inspecting at least one char-
5496 acter exists because an empty string can always be matched; without
5497 such a restriction there would always be a partial match of an empty
5498 string at the end of the subject.
5500 When a partial match is returned, the first two elements in the ovector
5501 point to the portion of the subject that was matched, but the values in
5502 the rest of the ovector are undefined. The appearance of \K in the pat-
5503 tern has no effect for a partial match. Consider this pattern:
5507 If it is matched against "456abc123xyz" the result is a complete match,
5508 and the ovector defines the matched string as "123", because \K resets
5509 the "start of match" point. However, if a partial match is requested
5510 and the subject string is "456abc12", a partial match is found for the
5511 string "abc12", because all these characters are needed for a subse-
5512 quent re-match with additional characters.
5514 What happens when a partial match is identified depends on which of the
5515 two partial matching options are set.
5517 PCRE2_PARTIAL_SOFT WITH pcre2_match()
5519 If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
5520 match, the partial match is remembered, but matching continues as nor-
5521 mal, and other alternatives in the pattern are tried. If no complete
5522 match can be found, PCRE2_ERROR_PARTIAL is returned instead of
5523 PCRE2_ERROR_NOMATCH.
5525 This option is "soft" because it prefers a complete match over a par-
5526 tial match. All the various matching items in a pattern behave as if
5527 the subject string is potentially complete. For example, \z, \Z, and $
5528 match at the end of the subject, as normal, and for \b and \B the end
5529 of the subject is treated as a non-alphanumeric.
5531 If there is more than one partial match, the first one that was found
5532 provides the data that is returned. Consider this pattern:
5536 If this is matched against the subject string "abc123dog", both alter-
5537 natives fail to match, but the end of the subject is reached during
5538 matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
5539 and 9, identifying "123dog" as the first partial match that was found.
5540 (In this example, there are two partial matches, because "dog" on its
5541 own partially matches the second alternative.)
5543 PCRE2_PARTIAL_HARD WITH pcre2_match()
5545 If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
5546 returned as soon as a partial match is found, without continuing to
5547 search for possible complete matches. This option is "hard" because it
5548 prefers an earlier partial match over a later complete match. For this
5549 reason, the assumption is made that the end of the supplied subject
5550 string may not be the true end of the available data, and so, if \z,
5551 \Z, \b, \B, or $ are encountered at the end of the subject, the result
5552 is PCRE2_ERROR_PARTIAL, provided that at least one character in the
5553 subject has been inspected.
5555 Comparing hard and soft partial matching
5557 The difference between the two partial matching options can be illus-
5558 trated by a pattern such as:
5562 This matches either "dog" or "dogsbody", greedily (that is, it prefers
5563 the longer string if possible). If it is matched against the string
5564 "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
5565 However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
5566 TIAL. On the other hand, if the pattern is made ungreedy the result is
5571 In this case the result is always a complete match because that is
5572 found first, and matching never continues after finding a complete
5573 match. It might be easier to follow this explanation by thinking of the
5574 two patterns like this:
5576 /dog(sbody)?/ is the same as /dogsbody|dog/
5577 /dog(sbody)??/ is the same as /dog|dogsbody/
5579 The second pattern will never match "dogsbody", because it will always
5580 find the shorter match first.
5583 PARTIAL MATCHING USING pcre2_dfa_match()
5585 The DFA functions move along the subject string character by character,
5586 without backtracking, searching for all possible matches simultane-
5587 ously. If the end of the subject is reached before the end of the pat-
5588 tern, there is the possibility of a partial match, again provided that
5589 at least one character has been inspected.
5591 When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
5592 there have been no complete matches. Otherwise, the complete matches
5593 are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
5594 takes precedence over any complete matches. The portion of the string
5595 that was matched when the longest partial match was found is set as the
5596 first matching string.
5598 Because the DFA functions always search for all possible matches, and
5599 there is no difference between greedy and ungreedy repetition, their
5600 behaviour is different from the standard functions when PCRE2_PAR-
5601 TIAL_HARD is set. Consider the string "dog" matched against the
5602 ungreedy pattern shown above:
5606 Whereas the standard function stops as soon as it finds the complete
5607 match for "dog", the DFA function also finds the partial match for
5608 "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
5611 PARTIAL MATCHING AND WORD BOUNDARIES
5613 If a pattern ends with one of sequences \b or \B, which test for word
5614 boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
5615 intuitive results. Consider this pattern:
5619 This matches "cat", provided there is a word boundary at either end. If
5620 the subject string is "the cat", the comparison of the final "t" with a
5621 following character cannot take place, so a partial match is found.
5622 However, normal matching carries on, and \b matches at the end of the
5623 subject when the last character is a letter, so a complete match is
5624 found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
5625 PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
5626 then the partial match takes precedence.
5629 EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
5631 If the partial_soft (or ps) modifier is present on a pcre2test data
5632 line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
5633 run of pcre2test that uses the date example quoted above:
5635 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5640 Partial match: 23dec3
5648 The first data string is matched completely, so pcre2test shows the
5649 matched substrings. The remaining four strings do not match the com-
5650 plete pattern, but the first two are partial matches. Similar output is
5651 obtained if DFA matching is used.
5653 If the partial_hard (or ph) modifier is present on a pcre2test data
5654 line, the PCRE2_PARTIAL_HARD option is set for the match.
5657 MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
5659 When a partial match has been found using a DFA matching function, it
5660 is possible to continue the match by providing additional subject data
5661 and calling the function again with the same compiled regular expres-
5662 sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
5663 same working space as before, because this is where details of the pre-
5664 vious partial match are stored. Here is an example using pcre2test:
5666 re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
5669 data> n05\=dfa,dfa_restart
5672 The first call has "23ja" as the subject, and requests partial match-
5673 ing; the second call has "n05" as the subject for the continued
5674 (restarted) match. Notice that when the match is complete, only the
5675 last part is shown; PCRE2 does not retain the previously partially-
5676 matched string. It is up to the calling program to do that if it needs
5679 That means that, for an unanchored pattern, if a continued match fails,
5680 it is not possible to try again at a new starting point. All this
5681 facility is capable of doing is continuing with the previous match
5682 attempt. In the previous example, if the second set of data is "ug23"
5683 the result is no match, even though there would be a match for "aug23"
5684 if the entire string were given at once. Depending on the application,
5685 this may or may not be what you want. The only way to allow for start-
5686 ing again at the next character is to retain the matched part of the
5687 subject and try a new complete match.
5689 You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
5690 PCRE2_DFA_RESTART to continue partial matching over multiple segments.
5691 This facility can be used to pass very long subject strings to the DFA
5695 MULTI-SEGMENT MATCHING WITH pcre2_match()
5697 Unlike the DFA function, it is not possible to restart the previous
5698 match with a new segment of data when using pcre2_match(). Instead, new
5699 data must be added to the previous subject string, and the entire match
5700 re-run, starting from the point where the partial match occurred. Ear-
5701 lier data can be discarded.
5703 It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
5704 not treat the end of a segment as the end of the subject when matching
5705 \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
5708 re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
5709 data> The date is 23ja\=ph
5712 At this stage, an application could discard the text preceding "23ja",
5713 add on text from the next segment, and call the matching function
5714 again. Unlike the DFA matching function, the entire matching string
5715 must always be available, and the complete matching process occurs for
5716 each call, so more memory and more processing time is needed.
5719 ISSUES WITH MULTI-SEGMENT MATCHING
5721 Certain types of pattern may give problems with multi-segment matching,
5722 whichever matching function is used.
5724 1. If the pattern contains a test for the beginning of a line, you need
5725 to pass the PCRE2_NOTBOL option when the subject string for any call
5726 does start at the beginning of a line. There is also a PCRE2_NOTEOL
5727 option, but in practice when doing multi-segment matching you should be
5728 using PCRE2_PARTIAL_HARD, which includes the effect of PCRE2_NOTEOL.
5730 2. If a pattern contains a lookbehind assertion, characters that pre-
5731 cede the start of the partial match may have been inspected during the
5732 matching process. When using pcre2_match(), sufficient characters must
5733 be retained for the next match attempt. You can ensure that enough
5734 characters are retained by doing the following:
5736 Before doing any matching, find the length of the longest lookbehind in
5737 the pattern by calling pcre2_pattern_info() with the
5738 PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
5739 characters, not code units. After a partial match, moving back from the
5740 ovector[0] offset in the subject by the number of characters given for
5741 the maximum lookbehind gets you to the earliest character that must be
5742 retained. In a non-UTF or a 32-bit situation, moving back is just a
5743 subtraction, but in UTF-8 or UTF-16 you have to count characters while
5744 moving back through the code units.
5746 Characters before the point you have now reached can be discarded, and
5747 after the next segment has been added to what is retained, you should
5748 run the next match with the startoffset argument set so that the match
5749 begins at the same point as before.
5751 For example, if the pattern "(?<=123)abc" is partially matched against
5752 the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
5753 mum lookbehind count is 3, so all characters before offset 2 can be
5754 discarded. The value of startoffset for the next match should be 3.
5755 When pcre2test displays a partial match, it indicates the lookbehind
5756 characters with '<' characters:
5760 Partial match: 123ab
5763 3. Because a partial match must always contain at least one character,
5764 what might be considered a partial match of an empty string actually
5765 gives a "no match" result. For example:
5771 If the next segment begins "cx", a match should be found, but this will
5772 only happen if characters from the previous segment are retained. For
5773 this reason, a "no match" result should be interpreted as "partial
5774 match of an empty string" when the pattern contains lookbehinds.
5776 4. Matching a subject string that is split into multiple segments may
5777 not always produce exactly the same result as matching over one single
5778 long string, especially when PCRE2_PARTIAL_SOFT is used. The section
5779 "Partial Matching and Word Boundaries" above describes an issue that
5780 arises if the pattern ends with \b or \B. Another kind of difference
5781 may occur when there are multiple matching possibilities, because (for
5782 PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
5783 no completed matches. This means that as soon as the shortest match has
5784 been found, continuation to a new subject segment is no longer possi-
5785 ble. Consider this pcre2test example:
5792 data> gsb\=ps,dfa,dfa_restart
5798 The first data line passes the string "dogsb" to a standard matching
5799 function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
5800 a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
5801 because the shorter string "dog" is a complete match. Similarly, when
5802 the subject is presented to a DFA matching function in several parts
5803 ("do" and "gsb" being the first two) the match stops when "dog" has
5804 been found, and it is not possible to continue. On the other hand, if
5805 "dogsbody" is presented as a single string, a DFA matching function
5808 Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
5809 matching multi-segment data. The example above then behaves differ-
5814 Partial match: dogsb
5817 data> gsb\=ph,dfa,dfa_restart
5820 5. Patterns that contain alternatives at the top level which do not all
5821 start with the same pattern item may not work as expected when
5822 PCRE2_DFA_RESTART is used. For example, consider this pattern:
5826 If the first part of the subject is "ABC123", a partial match of the
5827 first alternative is found at offset 3. There is no partial match for
5828 the second alternative, because such a match does not start at the same
5829 point in the subject string. Attempting to continue with the string
5830 "7890" does not yield a match because only those alternatives that
5831 match at one point in the subject are remembered. The problem arises
5832 because the start of the second alternative matches within the first
5833 alternative. There is no problem with anchored patterns or patterns
5838 where no string can be a partial match for both alternatives. This is
5839 not a problem if a standard matching function is used, because the
5840 entire match has to be rerun each time:
5848 Of course, instead of using PCRE2_DFA_RESTART, the same technique of
5849 re-running the entire match can also be used with the DFA matching
5850 function. Another possibility is to work with two buffers. If a partial
5851 match at offset n in the first buffer is followed by "no match" when
5852 PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
5853 match starting at offset n+1 in the first buffer.
5859 University Computing Service
5865 Last updated: 22 December 2014
5866 Copyright (c) 1997-2014 University of Cambridge.
5867 ------------------------------------------------------------------------------
5870 PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
5875 PCRE2 - Perl-compatible regular expressions (revised API)
5877 PCRE2 REGULAR EXPRESSION DETAILS
5879 The syntax and semantics of the regular expressions that are supported
5880 by PCRE2 are described in detail below. There is a quick-reference syn-
5881 tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
5882 and semantics as closely as it can. PCRE2 also supports some alterna-
5883 tive regular expression syntax (which does not conflict with the Perl
5884 syntax) in order to provide some compatibility with regular expressions
5885 in Python, .NET, and Oniguruma.
5887 Perl's regular expressions are described in its own documentation, and
5888 regular expressions in general are covered in a number of books, some
5889 of which have copious examples. Jeffrey Friedl's "Mastering Regular
5890 Expressions", published by O'Reilly, covers regular expressions in
5891 great detail. This description of PCRE2's regular expressions is
5892 intended as reference material.
5894 This document discusses the patterns that are supported by PCRE2 when
5895 its main matching function, pcre2_match(), is used. PCRE2 also has an
5896 alternative matching function, pcre2_dfa_match(), which matches using a
5897 different algorithm that is not Perl-compatible. Some of the features
5898 discussed below are not available when DFA matching is used. The advan-
5899 tages and disadvantages of the alternative function, and how it differs
5900 from the normal function, are discussed in the pcre2matching page.
5903 SPECIAL START-OF-PATTERN ITEMS
5905 A number of options that can be passed to pcre2_compile() can also be
5906 set by special items at the start of a pattern. These are not Perl-com-
5907 patible, but are provided to make these options accessible to pattern
5908 writers who are not able to change the program that processes the pat-
5909 tern. Any number of these items may appear, but they must all be
5910 together right at the start of the pattern string, and the letters must
5915 In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
5916 as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
5917 can be specified for the 32-bit library, in which case it constrains
5918 the character values to valid Unicode code points. To process UTF
5919 strings, PCRE2 must be built to include Unicode support (which is the
5920 default). When using UTF strings you must either call the compiling
5921 function with the PCRE2_UTF option, or the pattern must start with the
5922 special sequence (*UTF), which is equivalent to setting the relevant
5923 option. How setting a UTF mode affects pattern matching is mentioned in
5924 several places below. There is also a summary of features in the
5927 Some applications that allow their users to supply patterns may wish to
5928 restrict them to non-UTF data for security reasons. If the
5929 PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
5930 allowed, and its appearance in a pattern causes an error.
5932 Unicode property support
5934 Another special sequence that may appear at the start of a pattern is
5935 (*UCP). This has the same effect as setting the PCRE2_UCP option: it
5936 causes sequences such as \d and \w to use Unicode properties to deter-
5937 mine character types, instead of recognizing only characters with codes
5938 less than 256 via a lookup table.
5940 Some applications that allow their users to supply patterns may wish to
5941 restrict them for security reasons. If the PCRE2_NEVER_UCP option is
5942 passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
5943 a pattern causes an error.
5945 Locking out empty string matching
5947 Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
5948 effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
5949 to whichever matching function is subsequently called to match the pat-
5950 tern. These options lock out the matching of empty strings, either
5951 entirely, or only at the start of the subject.
5953 Disabling auto-possessification
5955 If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
5956 setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
5957 quantifiers possessive when what follows cannot match the repeated
5958 item. For example, by default a+b is treated as a++b. For more details,
5959 see the pcre2api documentation.
5961 Disabling start-up optimizations
5963 If a pattern starts with (*NO_START_OPT), it has the same effect as
5964 setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
5965 mizations for quickly reaching "no match" results. For more details,
5966 see the pcre2api documentation.
5968 Disabling automatic anchoring
5970 If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
5971 as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
5972 tions that apply to patterns whose top-level branches all start with .*
5973 (match any number of arbitrary characters). For more details, see the
5974 pcre2api documentation.
5976 Disabling JIT compilation
5978 If a pattern that starts with (*NO_JIT) is successfully compiled, an
5979 attempt by the application to apply the JIT optimization by calling
5980 pcre2_jit_compile() is ignored.
5982 Setting match resource limits
5984 The pcre2_match() function contains a counter that is incremented every
5985 time it goes round its main loop. The caller of pcre2_match() can set a
5986 limit on this counter, which therefore limits the amount of computing
5987 resource used for a match. The maximum depth of nested backtracking can
5988 also be limited; this indirectly restricts the amount of heap memory
5989 that is used, but there is also an explicit memory limit that can be
5992 These facilities are provided to catch runaway matches that are pro-
5993 voked by patterns with huge matching trees (a typical example is a pat-
5994 tern with nested unlimited repeats applied to a long string that does
5995 not match). When one of these limits is reached, pcre2_match() gives an
5996 error return. The limits can also be set by items at the start of the
6003 where d is any number of decimal digits. However, the value of the set-
6004 ting must be less than the value set (or defaulted) by the caller of
6005 pcre2_match() for it to have any effect. In other words, the pattern
6006 writer can lower the limits set by the programmer, but not raise them.
6007 If there is more than one setting of one of these limits, the lower
6008 value is used. The heap limit is specified in kibibytes (units of 1024
6011 Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
6012 name is still recognized for backwards compatibility.
6014 The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
6015 interpreters are used for matching. It does not apply to JIT. The match
6016 limit is used (but in a different way) when JIT is being used, or when
6017 pcre2_dfa_match() is called, to limit computing resource usage by those
6018 matching functions. The depth limit is ignored by JIT but is relevant
6019 for DFA matching, which uses function recursion for recursions within
6020 the pattern and for lookaround assertions and atomic groups. In this
6021 case, the depth limit controls the depth of such recursion.
6025 PCRE2 supports six different conventions for indicating line breaks in
6026 strings: a single CR (carriage return) character, a single LF (line-
6027 feed) character, the two-character sequence CRLF, any of the three pre-
6028 ceding, any Unicode newline sequence, or the NUL character (binary
6029 zero). The pcre2api page has further discussion about newlines, and
6030 shows how to set the newline convention when calling pcre2_compile().
6032 It is also possible to specify a newline convention by starting a pat-
6033 tern string with one of the following sequences:
6035 (*CR) carriage return
6037 (*CRLF) carriage return, followed by linefeed
6038 (*ANYCRLF) any of the three above
6039 (*ANY) all Unicode newline sequences
6040 (*NUL) the NUL character (binary zero)
6042 These override the default and the options given to the compiling func-
6043 tion. For example, on a Unix system where LF is the default newline
6044 sequence, the pattern
6048 changes the convention to CR. That pattern matches "a\nb" because LF is
6049 no longer a newline. If more than one of these settings is present, the
6052 The newline convention affects where the circumflex and dollar asser-
6053 tions are true. It also affects the interpretation of the dot metachar-
6054 acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
6055 followed by an opening brace. However, it does not affect what the \R
6056 escape sequence matches. By default, this is any Unicode newline
6057 sequence, for Perl compatibility. However, this can be changed; see the
6058 next section and the description of \R in the section entitled "Newline
6059 sequences" below. A change of \R setting can be combined with a change
6060 of newline convention.
6062 Specifying what \R matches
6064 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6065 the complete set of Unicode line endings) by setting the option
6066 PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
6067 starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
6068 CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
6071 EBCDIC CHARACTER CODES
6073 PCRE2 can be compiled to run in an environment that uses EBCDIC as its
6074 character code instead of ASCII or Unicode (typically a mainframe sys-
6075 tem). In the sections below, character code values are ASCII or Uni-
6076 code; in an EBCDIC environment these characters may have different code
6077 values, and there are no code points greater than 255.
6080 CHARACTERS AND METACHARACTERS
6082 A regular expression is a pattern that is matched against a subject
6083 string from left to right. Most characters stand for themselves in a
6084 pattern, and match the corresponding characters in the subject. As a
6085 trivial example, the pattern
6089 matches a portion of a subject string that is identical to itself. When
6090 caseless matching is specified (the PCRE2_CASELESS option), letters are
6091 matched independently of case.
6093 The power of regular expressions comes from the ability to include
6094 alternatives and repetitions in the pattern. These are encoded in the
6095 pattern by the use of metacharacters, which do not stand for themselves
6096 but instead are interpreted in some special way.
6098 There are two different sets of metacharacters: those that are recog-
6099 nized anywhere in the pattern except within square brackets, and those
6100 that are recognized within square brackets. Outside square brackets,
6101 the metacharacters are as follows:
6103 \ general escape character with several uses
6104 ^ assert start of string (or line, in multiline mode)
6105 $ assert end of string (or line, in multiline mode)
6106 . match any character except newline (by default)
6107 [ start character class definition
6108 | start of alternative branch
6111 ? extends the meaning of (
6112 also 0 or 1 quantifier
6113 also quantifier minimizer
6114 * 0 or more quantifier
6115 + 1 or more quantifier
6116 also "possessive quantifier"
6117 { start min/max quantifier
6119 Part of a pattern that is in square brackets is called a "character
6120 class". In a character class the only metacharacters are:
6122 \ general escape character
6123 ^ negate the class, but only if the first character
6124 - indicates character range
6125 [ POSIX character class (only if followed by POSIX
6127 ] terminates the character class
6129 The following sections describe the use of each of the metacharacters.
6134 The backslash character has several uses. Firstly, if it is followed by
6135 a character that is not a number or a letter, it takes away any special
6136 meaning that character may have. This use of backslash as an escape
6137 character applies both inside and outside character classes.
6139 For example, if you want to match a * character, you must write \* in
6140 the pattern. This escaping action applies whether or not the following
6141 character would otherwise be interpreted as a metacharacter, so it is
6142 always safe to precede a non-alphanumeric with backslash to specify
6143 that it stands for itself. In particular, if you want to match a back-
6144 slash, you write \\.
6146 In a UTF mode, only ASCII numbers and letters have any special meaning
6147 after a backslash. All other characters (in particular, those whose
6148 code points are greater than 127) are treated as literals.
6150 If a pattern is compiled with the PCRE2_EXTENDED option, most white
6151 space in the pattern (other than in a character class), and characters
6152 between a # outside a character class and the next newline, inclusive,
6153 are ignored. An escaping backslash can be used to include a white space
6154 or # character as part of the pattern.
6156 If you want to remove the special meaning from a sequence of charac-
6157 ters, you can do so by putting them between \Q and \E. This is differ-
6158 ent from Perl in that $ and @ are handled as literals in \Q...\E
6159 sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
6160 tion. Also, Perl does "double-quotish backslash interpolation" on any
6161 backslashes between \Q and \E which, its documentation says, "may lead
6162 to confusing results". PCRE2 treats a backslash between \Q and \E just
6163 like any other character. Note the following examples:
6165 Pattern PCRE2 matches Perl matches
6167 \Qabc$xyz\E abc$xyz abc followed by the
6169 \Qabc\$xyz\E abc\$xyz abc\$xyz
6170 \Qabc\E\$\Qxyz\E abc$xyz abc$xyz
6174 The \Q...\E sequence is recognized both inside and outside character
6175 classes. An isolated \E that is not preceded by \Q is ignored. If \Q
6176 is not followed by \E later in the pattern, the literal interpretation
6177 continues to the end of the pattern (that is, \E is assumed at the
6178 end). If the isolated \Q is inside a character class, this causes an
6179 error, because the character class is not terminated by a closing
6182 Non-printing characters
6184 A second use of backslash provides a way of encoding non-printing char-
6185 acters in patterns in a visible manner. There is no restriction on the
6186 appearance of non-printing characters in a pattern, but when a pattern
6187 is being prepared by text editing, it is often easier to use one of the
6188 following escape sequences than the binary character it represents. In
6189 an ASCII or Unicode environment, these escapes are as follows:
6191 \a alarm, that is, the BEL character (hex 07)
6192 \cx "control-x", where x is any printable ASCII character
6194 \f form feed (hex 0C)
6195 \n linefeed (hex 0A)
6196 \r carriage return (hex 0D)
6198 \0dd character with octal code 0dd
6199 \ddd character with octal code ddd, or backreference
6200 \o{ddd..} character with octal code ddd..
6201 \xhh character with hex code hh
6202 \x{hhh..} character with hex code hhh..
6203 \N{U+hhh..} character with Unicode hex code point hhh..
6204 \uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
6206 The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
6207 option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
6208 also uses \N{name} to specify characters by Unicode name; PCRE2 does
6209 not support this. Note that when \N is not followed by an opening
6210 brace (curly bracket) it has an entirely different meaning, matching
6211 any character that is not a newline.
6213 The precise effect of \cx on ASCII characters is as follows: if x is a
6214 lower case letter, it is converted to upper case. Then bit 6 of the
6215 character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
6216 (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
6217 hex 7B (; is 3B). If the code unit following \c has a value less than
6218 32 or greater than 126, a compile-time error occurs.
6220 When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
6221 \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
6222 The \c escape is processed as specified for Perl in the perlebcdic doc-
6223 ument. The only characters that are allowed after \c are A-Z, a-z, or
6224 one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
6225 time error. The sequence \c@ encodes character code 0; after \c the
6226 letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
6227 \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
6228 becomes either 255 (hex FF) or 95 (hex 5F).
6230 Thus, apart from \c?, these escapes generate the same character code
6231 values as they do in an ASCII environment, though the meanings of the
6232 values mostly differ. For example, \cG always generates code value 7,
6233 which is BEL in ASCII but DEL in EBCDIC.
6235 The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
6236 but because 127 is not a control character in EBCDIC, Perl makes it
6237 generate the APC character. Unfortunately, there are several variants
6238 of EBCDIC. In most of them the APC character has the value 255 (hex
6239 FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
6240 certain other characters have POSIX-BC values, PCRE2 makes \c? generate
6241 95; otherwise it generates 255.
6243 After \0 up to two further octal digits are read. If there are fewer
6244 than two digits, just those that are present are used. Thus the
6245 sequence \0\x\015 specifies two binary zeros followed by a CR character
6246 (code value 13). Make sure you supply two digits after the initial zero
6247 if the pattern character that follows is itself an octal digit.
6249 The escape \o must be followed by a sequence of octal digits, enclosed
6250 in braces. An error occurs if this is not the case. This escape is a
6251 recent addition to Perl; it provides way of specifying character code
6252 points as octal numbers greater than 0777, and it also allows octal
6253 numbers and backreferences to be unambiguously specified.
6255 For greater clarity and unambiguity, it is best to avoid following \ by
6256 a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
6257 cal character code points, and \g{} to specify backreferences. The fol-
6258 lowing paragraphs describe the old, ambiguous syntax.
6260 The handling of a backslash followed by a digit other than 0 is compli-
6261 cated, and Perl has changed over time, causing PCRE2 also to change.
6263 Outside a character class, PCRE2 reads the digit and any following dig-
6264 its as a decimal number. If the number is less than 10, begins with the
6265 digit 8 or 9, or if there are at least that many previous capturing
6266 left parentheses in the expression, the entire sequence is taken as a
6267 backreference. A description of how this works is given later, follow-
6268 ing the discussion of parenthesized subpatterns. Otherwise, up to
6269 three octal digits are read to form a character code.
6271 Inside a character class, PCRE2 handles \8 and \9 as the literal char-
6272 acters "8" and "9", and otherwise reads up to three octal digits fol-
6273 lowing the backslash, using them to generate a data character. Any sub-
6274 sequent digits stand for themselves. For example, outside a character
6277 \040 is another way of writing an ASCII space
6278 \40 is the same, provided there are fewer than 40
6279 previous capturing subpatterns
6280 \7 is always a backreference
6281 \11 might be a backreference, or another way of
6283 \011 is always a tab
6284 \0113 is a tab followed by the character "3"
6285 \113 might be a backreference, otherwise the
6286 character with octal code 113
6287 \377 might be a backreference, otherwise
6288 the value 255 (decimal)
6289 \81 is always a backreference
6291 Note that octal values of 100 or greater that are specified using this
6292 syntax must not be introduced by a leading zero, because no more than
6293 three octal digits are ever read.
6295 By default, after \x that is not followed by {, from zero to two hexa-
6296 decimal digits are read (letters can be in upper or lower case). Any
6297 number of hexadecimal digits may appear between \x{ and }. If a charac-
6298 ter other than a hexadecimal digit appears between \x{ and }, or if
6299 there is no terminating }, an error occurs.
6301 If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
6302 just described only when it is followed by two hexadecimal digits. Oth-
6303 erwise, it matches a literal "x" character. In this mode, support for
6304 code points greater than 256 is provided by \u, which must be followed
6305 by four hexadecimal digits; otherwise it matches a literal "u" charac-
6308 Characters whose value is less than 256 can be defined by either of the
6309 two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
6310 ference in the way they are handled. For example, \xdc is exactly the
6311 same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
6313 Constraints on character values
6315 Characters that are specified using octal or hexadecimal numbers are
6316 limited to certain values, as follows:
6318 8-bit non-UTF mode no greater than 0xff
6319 16-bit non-UTF mode no greater than 0xffff
6320 32-bit non-UTF mode no greater than 0xffffffff
6321 All UTF modes no greater than 0x10ffff and a valid code point
6323 Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
6324 (the so-called "surrogate" code points). The check for these can be
6325 disabled by the caller of pcre2_compile() by setting the option
6326 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
6327 UTF-8 and UTF-32 modes, because these values are not representable in
6330 Escape sequences in character classes
6332 All the sequences that define a single character value can be used both
6333 inside and outside character classes. In addition, inside a character
6334 class, \b is interpreted as the backspace character (hex 08).
6336 When not followed by an opening brace, \N is not allowed in a character
6337 class. \B, \R, and \X are not special inside a character class. Like
6338 other unrecognized alphabetic escape sequences, they cause an error.
6339 Outside a character class, these sequences have different meanings.
6341 Unsupported escape sequences
6343 In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
6344 string handler and used to modify the case of following characters. By
6345 default, PCRE2 does not support these escape sequences. However, if the
6346 PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
6347 used to define a character by code point, as described above.
6349 Absolute and relative backreferences
6351 The sequence \g followed by a signed or unsigned number, optionally
6352 enclosed in braces, is an absolute or relative backreference. A named
6353 backreference can be coded as \g{name}. Backreferences are discussed
6354 later, following the discussion of parenthesized subpatterns.
6356 Absolute and relative subroutine calls
6358 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
6359 name or a number enclosed either in angle brackets or single quotes, is
6360 an alternative syntax for referencing a subpattern as a "subroutine".
6361 Details are discussed later. Note that \g{...} (Perl syntax) and
6362 \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
6363 erence; the latter is a subroutine call.
6365 Generic character types
6367 Another use of backslash is for specifying generic character types:
6369 \d any decimal digit
6370 \D any character that is not a decimal digit
6371 \h any horizontal white space character
6372 \H any character that is not a horizontal white space character
6373 \N any character that is not a newline
6374 \s any white space character
6375 \S any character that is not a white space character
6376 \v any vertical white space character
6377 \V any character that is not a vertical white space character
6378 \w any "word" character
6379 \W any "non-word" character
6381 The \N escape sequence has the same meaning as the "." metacharacter
6382 when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
6383 the meaning of \N. Note that when \N is followed by an opening brace it
6384 has a different meaning. See the section entitled "Non-printing charac-
6385 ters" above for details. Perl also uses \N{name} to specify characters
6386 by Unicode name; PCRE2 does not support this.
6388 Each pair of lower and upper case escape sequences partitions the com-
6389 plete set of characters into two disjoint sets. Any given character
6390 matches one, and only one, of each pair. The sequences can appear both
6391 inside and outside character classes. They each match one character of
6392 the appropriate type. If the current matching point is at the end of
6393 the subject string, all of them fail, because there is no character to
6396 The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
6397 (13), and space (32), which are defined as white space in the "C"
6398 locale. This list may vary if locale-specific matching is taking place.
6399 For example, in some locales the "non-breaking space" character (\xA0)
6400 is recognized as white space, and in others the VT character is not.
6402 A "word" character is an underscore or any character that is a letter
6403 or digit. By default, the definition of letters and digits is con-
6404 trolled by PCRE2's low-valued character tables, and may vary if locale-
6405 specific matching is taking place (see "Locale support" in the pcre2api
6406 page). For example, in a French locale such as "fr_FR" in Unix-like
6407 systems, or "french" in Windows, some character codes greater than 127
6408 are used for accented letters, and these are then matched by \w. The
6409 use of locales with Unicode is discouraged.
6411 By default, characters whose code points are greater than 127 never
6412 match \d, \s, or \w, and always match \D, \S, and \W, although this may
6413 be different for characters in the range 128-255 when locale-specific
6414 matching is happening. These escape sequences retain their original
6415 meanings from before Unicode support was available, mainly for effi-
6416 ciency reasons. If the PCRE2_UCP option is set, the behaviour is
6417 changed so that Unicode properties are used to determine character
6420 \d any character that matches \p{Nd} (decimal digit)
6421 \s any character that matches \p{Z} or \h or \v
6422 \w any character that matches \p{L} or \p{N}, plus underscore
6424 The upper case escapes match the inverse sets of characters. Note that
6425 \d matches only decimal digits, whereas \w matches any Unicode digit,
6426 as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
6427 affects \b, and \B because they are defined in terms of \w and \W.
6428 Matching these sequences is noticeably slower when PCRE2_UCP is set.
6430 The sequences \h, \H, \v, and \V, in contrast to the other sequences,
6431 which match only ASCII characters by default, always match a specific
6432 list of code points, whether or not PCRE2_UCP is set. The horizontal
6433 space characters are:
6435 U+0009 Horizontal tab (HT)
6437 U+00A0 Non-break space
6438 U+1680 Ogham space mark
6439 U+180E Mongolian vowel separator
6444 U+2004 Three-per-em space
6445 U+2005 Four-per-em space
6446 U+2006 Six-per-em space
6448 U+2008 Punctuation space
6451 U+202F Narrow no-break space
6452 U+205F Medium mathematical space
6453 U+3000 Ideographic space
6455 The vertical space characters are:
6457 U+000A Linefeed (LF)
6458 U+000B Vertical tab (VT)
6459 U+000C Form feed (FF)
6460 U+000D Carriage return (CR)
6461 U+0085 Next line (NEL)
6462 U+2028 Line separator
6463 U+2029 Paragraph separator
6465 In 8-bit, non-UTF-8 mode, only the characters with code points less
6466 than 256 are relevant.
6470 Outside a character class, by default, the escape sequence \R matches
6471 any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
6474 (?>\r\n|\n|\x0b|\f|\r|\x85)
6476 This is an example of an "atomic group", details of which are given
6477 below. This particular group matches either the two-character sequence
6478 CR followed by LF, or one of the single characters LF (linefeed,
6479 U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
6480 riage return, U+000D), or NEL (next line, U+0085). Because this is an
6481 atomic group, the two-character sequence is treated as a single unit
6482 that cannot be split.
6484 In other modes, two additional characters whose code points are greater
6485 than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
6486 rator, U+2029). Unicode support is not needed for these characters to
6489 It is possible to restrict \R to match only CR, LF, or CRLF (instead of
6490 the complete set of Unicode line endings) by setting the option
6491 PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbrevation for "back-
6492 slash R".) This can be made the default when PCRE2 is built; if this is
6493 the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
6494 CODE option. It is also possible to specify these settings by starting
6495 a pattern string with one of the following sequences:
6497 (*BSR_ANYCRLF) CR, LF, or CRLF only
6498 (*BSR_UNICODE) any Unicode newline sequence
6500 These override the default and the options given to the compiling func-
6501 tion. Note that these special settings, which are not Perl-compatible,
6502 are recognized only at the very start of a pattern, and that they must
6503 be in upper case. If more than one of them is present, the last one is
6504 used. They can be combined with a change of newline convention; for
6505 example, a pattern can start with:
6507 (*ANY)(*BSR_ANYCRLF)
6509 They can also be combined with the (*UTF) or (*UCP) special sequences.
6510 Inside a character class, \R is treated as an unrecognized escape
6511 sequence, and causes an error.
6513 Unicode character properties
6515 When PCRE2 is built with Unicode support (the default), three addi-
6516 tional escape sequences that match characters with specific properties
6517 are available. In 8-bit non-UTF-8 mode, these sequences are of course
6518 limited to testing characters whose code points are less than 256, but
6519 they do work in this mode. In 32-bit non-UTF mode, code points greater
6520 than 0x10ffff (the Unicode limit) may be encountered. These are all
6521 treated as being in the Common script and with an unassigned type. The
6522 extra escape sequences are:
6524 \p{xx} a character with the xx property
6525 \P{xx} a character without the xx property
6526 \X a Unicode extended grapheme cluster
6528 The property names represented by xx above are limited to the Unicode
6529 script names, the general category properties, "Any", which matches any
6530 character (including newline), and some special PCRE2 properties
6531 (described in the next section). Other Perl properties such as "InMu-
6532 sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not
6533 match any characters, so always causes a match failure.
6535 Sets of Unicode characters are defined as belonging to certain scripts.
6536 A character from one of these sets can be matched using a script name.
6542 Those that are not part of an identified script are lumped together as
6543 "Common". The current list of scripts is:
6545 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
6546 nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
6547 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
6548 nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
6549 Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
6550 Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
6551 Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
6552 Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
6553 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
6554 nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
6555 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
6556 jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
6557 Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
6558 Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
6559 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
6560 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
6561 dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
6562 Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
6563 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
6564 vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
6565 Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
6566 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
6567 nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
6569 Each character has exactly one Unicode general category property, spec-
6570 ified by a two-letter abbreviation. For compatibility with Perl, nega-
6571 tion can be specified by including a circumflex between the opening
6572 brace and the property name. For example, \p{^Lu} is the same as
6575 If only one letter is specified with \p or \P, it includes all the gen-
6576 eral category properties that start with that letter. In this case, in
6577 the absence of negation, the curly brackets in the escape sequence are
6578 optional; these two examples have the same effect:
6583 The following general category property codes are supported:
6593 Ll Lower case letter
6596 Lt Title case letter
6597 Lu Upper case letter
6610 Pc Connector punctuation
6612 Pe Close punctuation
6613 Pf Final punctuation
6614 Pi Initial punctuation
6615 Po Other punctuation
6621 Sm Mathematical symbol
6626 Zp Paragraph separator
6629 The special property L& is also supported: it matches a character that
6630 has the Lu, Ll, or Lt property, in other words, a letter that is not
6631 classified as a modifier or "other".
6633 The Cs (Surrogate) property applies only to characters in the range
6634 U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
6635 so cannot be tested by PCRE2, unless UTF validity checking has been
6636 turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
6637 page). Perl does not support the Cs property.
6639 The long synonyms for property names that Perl supports (such as
6640 \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
6641 any of these properties with "Is".
6643 No character that is in the Unicode table has the Cn (unassigned) prop-
6644 erty. Instead, this property is assumed for any code point that is not
6645 in the Unicode table.
6647 Specifying caseless matching does not affect these escape sequences.
6648 For example, \p{Lu} always matches only upper case letters. This is
6649 different from the behaviour of current versions of Perl.
6651 Matching characters by Unicode property is not fast, because PCRE2 has
6652 to do a multistage table lookup in order to find a character's prop-
6653 erty. That is why the traditional escape sequences such as \d and \w do
6654 not use Unicode properties in PCRE2 by default, though you can make
6655 them do so by setting the PCRE2_UCP option or by starting the pattern
6658 Extended grapheme clusters
6660 The \X escape matches any number of Unicode characters that form an
6661 "extended grapheme cluster", and treats the sequence as an atomic group
6662 (see below). Unicode supports various kinds of composite character by
6663 giving each character a grapheme breaking property, and having rules
6664 that use these properties to define the boundaries of extended grapheme
6665 clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
6666 Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
6667 properties that had been used for emojis. Instead it introduced vari-
6668 ous emoji-specific properties. PCRE2 uses only the Extended Picto-
6671 \X always matches at least one character. Then it decides whether to
6672 add additional characters according to the following rules for ending a
6675 1. End at the end of the subject string.
6677 2. Do not end between CR and LF; otherwise end after any control char-
6680 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
6681 characters are of five types: L, V, T, LV, and LVT. An L character may
6682 be followed by an L, V, LV, or LVT character; an LV or V character may
6683 be followed by a V or T character; an LVT or T character may be follwed
6684 only by a T character.
6686 4. Do not end before extending characters or spacing marks or the
6687 "zero-width joiner" character. Characters with the "mark" property
6688 always have the "extend" grapheme breaking property.
6690 5. Do not end after prepend characters.
6692 6. Do not break within emoji modifier sequences or emoji zwj sequences.
6693 That is, do not break between characters with the Extended_Pictographic
6694 property. Extend and ZWJ characters are allowed between the charac-
6697 7. Do not break within emoji flag sequences. That is, do not break
6698 between regional indicator (RI) characters if there are an odd number
6699 of RI characters before the break point.
6701 8. Otherwise, end the cluster.
6703 PCRE2's additional properties
6705 As well as the standard Unicode properties described above, PCRE2 sup-
6706 ports four more that make it possible to convert traditional escape
6707 sequences such as \w and \s to use Unicode properties. PCRE2 uses these
6708 non-standard, non-Perl properties internally when PCRE2_UCP is set.
6709 However, they may also be used explicitly. These properties are:
6711 Xan Any alphanumeric character
6712 Xps Any POSIX space character
6713 Xsp Any Perl space character
6714 Xwd Any Perl "word" character
6716 Xan matches characters that have either the L (letter) or the N (num-
6717 ber) property. Xps matches the characters tab, linefeed, vertical tab,
6718 form feed, or carriage return, and any other character that has the Z
6719 (separator) property. Xsp is the same as Xps; in PCRE1 it used to
6720 exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
6721 matches the same characters as Xan, plus underscore.
6723 There is another non-standard property, Xuc, which matches any charac-
6724 ter that can be represented by a Universal Character Name in C++ and
6725 other programming languages. These are the characters $, @, ` (grave
6726 accent), and all characters with Unicode code points greater than or
6727 equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
6728 most base (ASCII) characters are excluded. (Universal Character Names
6729 are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
6730 Note that the Xuc property does not match these sequences but the char-
6731 acters that they represent.)
6733 Resetting the match start
6735 In normal use, the escape sequence \K causes any previously matched
6736 characters not to be included in the final matched sequence that is
6737 returned. For example, the pattern:
6741 matches "foobar", but reports that it has matched "bar". \K does not
6742 interact with anchoring in any way. The pattern:
6746 matches only when the subject begins with "foobar" (in single line
6747 mode), though it again reports the matched string as "bar". This fea-
6748 ture is similar to a lookbehind assertion (described below). However,
6749 in this case, the part of the subject before the real match does not
6750 have to be of fixed length, as lookbehind assertions do. The use of \K
6751 does not interfere with the setting of captured substrings. For exam-
6752 ple, when the pattern
6756 matches "foobar", the first substring is still set to "foo".
6758 Perl documents that the use of \K within assertions is "not well
6759 defined". In PCRE2, \K is acted upon when it occurs inside positive
6760 assertions, but is ignored in negative assertions. Note that when a
6761 pattern such as (?=ab\K) matches, the reported start of the match can
6762 be greater than the end of the match. Using \K in a lookbehind asser-
6763 tion at the start of a pattern can also lead to odd effects. For exam-
6764 ple, consider this pattern:
6768 If the subject is "foobar", a call to pcre2_match() with a starting
6769 offset of 3 succeeds and reports the matching string as "foobar", that
6770 is, the start of the reported match is earlier than where the match
6775 The final use of backslash is for certain simple assertions. An asser-
6776 tion specifies a condition that has to be met at a particular point in
6777 a match, without consuming any characters from the subject string. The
6778 use of subpatterns for more complicated assertions is described below.
6779 The backslashed assertions are:
6781 \b matches at a word boundary
6782 \B matches when not at a word boundary
6783 \A matches at the start of the subject
6784 \Z matches at the end of the subject
6785 also matches before a newline at the end of the subject
6786 \z matches only at the end of the subject
6787 \G matches at the first matching position in the subject
6789 Inside a character class, \b has a different meaning; it matches the
6790 backspace character. If any other of these assertions appears in a
6791 character class, an "invalid escape sequence" error is generated.
6793 A word boundary is a position in the subject string where the current
6794 character and the previous character do not both match \w or \W (i.e.
6795 one matches \w and the other matches \W), or the start or end of the
6796 string if the first or last character matches \w, respectively. In a
6797 UTF mode, the meanings of \w and \W can be changed by setting the
6798 PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
6799 PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
6800 quence. However, whatever follows \b normally determines which it is.
6801 For example, the fragment \ba matches "a" at the start of a word.
6803 The \A, \Z, and \z assertions differ from the traditional circumflex
6804 and dollar (described in the next section) in that they only ever match
6805 at the very start and end of the subject string, whatever options are
6806 set. Thus, they are independent of multiline mode. These three asser-
6807 tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
6808 which affect only the behaviour of the circumflex and dollar metachar-
6809 acters. However, if the startoffset argument of pcre2_match() is non-
6810 zero, indicating that matching is to start at a point other than the
6811 beginning of the subject, \A can never match. The difference between
6812 \Z and \z is that \Z matches before a newline at the end of the string
6813 as well as at the very end, whereas \z matches only at the end.
6815 The \G assertion is true only when the current matching position is at
6816 the start point of the matching process, as specified by the startoff-
6817 set argument of pcre2_match(). It differs from \A when the value of
6818 startoffset is non-zero. By calling pcre2_match() multiple times with
6819 appropriate arguments, you can mimic Perl's /g option, and it is in
6820 this kind of implementation where \G can be useful.
6822 Note, however, that PCRE2's implementation of \G, being true at the
6823 starting character of the matching process, is subtly different from
6824 Perl's, which defines it as true at the end of the previous match. In
6825 Perl, these can be different when the previously matched string was
6826 empty. Because PCRE2 does just one match at a time, it cannot reproduce
6829 If all the alternatives of a pattern begin with \G, the expression is
6830 anchored to the starting match position, and the "anchored" flag is set
6831 in the compiled regular expression.
6834 CIRCUMFLEX AND DOLLAR
6836 The circumflex and dollar metacharacters are zero-width assertions.
6837 That is, they test for a particular condition being true without con-
6838 suming any characters from the subject string. These two metacharacters
6839 are concerned with matching the starts and ends of lines. If the new-
6840 line convention is set so that only the two-character sequence CRLF is
6841 recognized as a newline, isolated CR and LF characters are treated as
6842 ordinary data characters, and are not recognized as newlines.
6844 Outside a character class, in the default matching mode, the circumflex
6845 character is an assertion that is true only if the current matching
6846 point is at the start of the subject string. If the startoffset argu-
6847 ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
6848 flex can never match if the PCRE2_MULTILINE option is unset. Inside a
6849 character class, circumflex has an entirely different meaning (see
6852 Circumflex need not be the first character of the pattern if a number
6853 of alternatives are involved, but it should be the first thing in each
6854 alternative in which it appears if the pattern is ever to match that
6855 branch. If all possible alternatives start with a circumflex, that is,
6856 if the pattern is constrained to match only at the start of the sub-
6857 ject, it is said to be an "anchored" pattern. (There are also other
6858 constructs that can cause a pattern to be anchored.)
6860 The dollar character is an assertion that is true only if the current
6861 matching point is at the end of the subject string, or immediately
6862 before a newline at the end of the string (by default), unless
6863 PCRE2_NOTEOL is set. Note, however, that it does not actually match the
6864 newline. Dollar need not be the last character of the pattern if a num-
6865 ber of alternatives are involved, but it should be the last item in any
6866 branch in which it appears. Dollar has no special meaning in a charac-
6869 The meaning of dollar can be changed so that it matches only at the
6870 very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
6871 compile time. This does not affect the \Z assertion.
6873 The meanings of the circumflex and dollar metacharacters are changed if
6874 the PCRE2_MULTILINE option is set. When this is the case, a dollar
6875 character matches before any newlines in the string, as well as at the
6876 very end, and a circumflex matches immediately after internal newlines
6877 as well as at the start of the subject string. It does not match after
6878 a newline that ends the string, for compatibility with Perl. However,
6879 this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
6881 For example, the pattern /^abc$/ matches the subject string "def\nabc"
6882 (where \n represents a newline) in multiline mode, but not otherwise.
6883 Consequently, patterns that are anchored in single line mode because
6884 all branches start with ^ are not anchored in multiline mode, and a
6885 match for circumflex is possible when the startoffset argument of
6886 pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
6887 if PCRE2_MULTILINE is set.
6889 When the newline convention (see "Newline conventions" below) recog-
6890 nizes the two-character sequence CRLF as a newline, this is preferred,
6891 even if the single characters CR and LF are also recognized as new-
6892 lines. For example, if the newline convention is "any", a multiline
6893 mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
6894 than after CR, even though CR on its own is a valid newline. (It also
6895 matches at the very start of the string, of course.)
6897 Note that the sequences \A, \Z, and \z can be used to match the start
6898 and end of the subject in both modes, and if all branches of a pattern
6899 start with \A it is always anchored, whether or not PCRE2_MULTILINE is
6903 FULL STOP (PERIOD, DOT) AND \N
6905 Outside a character class, a dot in the pattern matches any one charac-
6906 ter in the subject string except (by default) a character that signi-
6907 fies the end of a line.
6909 When a line ending is defined as a single character, dot never matches
6910 that character; when the two-character sequence CRLF is used, dot does
6911 not match CR if it is immediately followed by LF, but otherwise it
6912 matches all characters (including isolated CRs and LFs). When any Uni-
6913 code line endings are being recognized, dot does not match CR or LF or
6914 any of the other line ending characters.
6916 The behaviour of dot with regard to newlines can be changed. If the
6917 PCRE2_DOTALL option is set, a dot matches any one character, without
6918 exception. If the two-character sequence CRLF is present in the sub-
6919 ject string, it takes two dots to match it.
6921 The handling of dot is entirely independent of the handling of circum-
6922 flex and dollar, the only relationship being that they both involve
6923 newlines. Dot has no special meaning in a character class.
6925 The escape sequence \N when not followed by an opening brace behaves
6926 like a dot, except that it is not affected by the PCRE2_DOTALL option.
6927 In other words, it matches any character except one that signifies the
6930 When \N is followed by an opening brace it has a different meaning. See
6931 the section entitled "Non-printing characters" above for details. Perl
6932 also uses \N{name} to specify characters by Unicode name; PCRE2 does
6936 MATCHING A SINGLE CODE UNIT
6938 Outside a character class, the escape sequence \C matches any one code
6939 unit, whether or not a UTF mode is set. In the 8-bit library, one code
6940 unit is one byte; in the 16-bit library it is a 16-bit unit; in the
6941 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
6942 line-ending characters. The feature is provided in Perl in order to
6943 match individual bytes in UTF-8 mode, but it is unclear how it can use-
6946 Because \C breaks up characters into individual code units, matching
6947 one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
6948 string may start with a malformed UTF character. This has undefined
6949 results, because PCRE2 assumes that it is matching character by charac-
6950 ter in a valid UTF string (by default it checks the subject string's
6951 validity at the start of processing unless the PCRE2_NO_UTF_CHECK
6954 An application can lock out the use of \C by setting the
6955 PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
6956 possible to build PCRE2 with the use of \C permanently disabled.
6958 PCRE2 does not allow \C to appear in lookbehind assertions (described
6959 below) in UTF-8 or UTF-16 modes, because this would make it impossible
6960 to calculate the length of the lookbehind. Neither the alternative
6961 matching function pcre2_dfa_match() nor the JIT optimizer support \C in
6962 these UTF modes. The former gives a match-time error; the latter fails
6963 to optimize and so the match is always run using the interpreter.
6965 In the 32-bit library, however, \C is always supported (when not
6966 explicitly locked out) because it always matches a single code unit,
6967 whether or not UTF-32 is specified.
6969 In general, the \C escape sequence is best avoided. However, one way of
6970 using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
6971 ters is to use a lookahead to check the length of the next character,
6972 as in this pattern, which could be used with a UTF-8 string (ignore
6973 white space and line breaks):
6975 (?| (?=[\x00-\x7f])(\C) |
6976 (?=[\x80-\x{7ff}])(\C)(\C) |
6977 (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
6978 (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
6980 In this example, a group that starts with (?| resets the capturing
6981 parentheses numbers in each alternative (see "Duplicate Subpattern Num-
6982 bers" below). The assertions at the start of each branch check the next
6983 UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
6984 respectively. The character's individual bytes are then captured by the
6985 appropriate number of \C groups.
6988 SQUARE BRACKETS AND CHARACTER CLASSES
6990 An opening square bracket introduces a character class, terminated by a
6991 closing square bracket. A closing square bracket on its own is not spe-
6992 cial by default. If a closing square bracket is required as a member
6993 of the class, it should be the first data character in the class (after
6994 an initial circumflex, if present) or escaped with a backslash. This
6995 means that, by default, an empty class cannot be defined. However, if
6996 the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
6997 the start does end the (empty) class.
6999 A character class matches a single character in the subject. A matched
7000 character must be in the set of characters defined by the class, unless
7001 the first character in the class definition is a circumflex, in which
7002 case the subject character must not be in the set defined by the class.
7003 If a circumflex is actually required as a member of the class, ensure
7004 it is not the first character, or escape it with a backslash.
7006 For example, the character class [aeiou] matches any lower case vowel,
7007 while [^aeiou] matches any character that is not a lower case vowel.
7008 Note that a circumflex is just a convenient notation for specifying the
7009 characters that are in the class by enumerating those that are not. A
7010 class that starts with a circumflex is not an assertion; it still con-
7011 sumes a character from the subject string, and therefore it fails if
7012 the current pointer is at the end of the string.
7014 Characters in a class may be specified by their code points using \o,
7015 \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
7016 letters in a class represent both their upper case and lower case ver-
7017 sions, so for example, a caseless [aeiou] matches "A" as well as "a",
7018 and a caseless [^aeiou] does not match "A", whereas a caseful version
7021 Characters that might indicate line breaks are never treated in any
7022 special way when matching character classes, whatever line-ending
7023 sequence is in use, and whatever setting of the PCRE2_DOTALL and
7024 PCRE2_MULTILINE options is used. A class such as [^a] always matches
7025 one of these characters.
7027 The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
7028 \S, \v, \V, \w, and \W may appear in a character class, and add the
7029 characters that they match to the class. For example, [\dABCDEF]
7030 matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
7031 affects the meanings of \d, \s, \w and their upper case partners, just
7032 as it does when they appear outside a character class, as described in
7033 the section entitled "Generic character types" above. The escape
7034 sequence \b has a different meaning inside a character class; it
7035 matches the backspace character. The sequences \B, \R, and \X are not
7036 special inside a character class. Like any other unrecognized escape
7037 sequences, they cause an error. The same is true for \N when not fol-
7038 lowed by an opening brace.
7040 The minus (hyphen) character can be used to specify a range of charac-
7041 ters in a character class. For example, [d-m] matches any letter
7042 between d and m, inclusive. If a minus character is required in a
7043 class, it must be escaped with a backslash or appear in a position
7044 where it cannot be interpreted as indicating a range, typically as the
7045 first or last character in the class, or immediately after a range. For
7046 example, [b-d-z] matches letters in the range b to d, a hyphen charac-
7049 Perl treats a hyphen as a literal if it appears before or after a POSIX
7050 class (see below) or before or after a character type escape such as as
7051 \d or \H. However, unless the hyphen is the last character in the
7052 class, Perl outputs a warning in its warning mode, as this is most
7053 likely a user error. As PCRE2 has no facility for warning, an error is
7054 given in these cases.
7056 It is not possible to have the literal character "]" as the end charac-
7057 ter of a range. A pattern such as [W-]46] is interpreted as a class of
7058 two characters ("W" and "-") followed by a literal string "46]", so it
7059 would match "W46]" or "-46]". However, if the "]" is escaped with a
7060 backslash it is interpreted as the end of range, so [W-\]46] is inter-
7061 preted as a class containing a range followed by two other characters.
7062 The octal or hexadecimal representation of "]" can also be used to end
7065 Ranges normally include all code points between the start and end char-
7066 acters, inclusive. They can also be used for code points specified
7067 numerically, for example [\000-\037]. Ranges can include any characters
7068 that are valid for the current mode. In any UTF mode, the so-called
7069 "surrogate" characters (those whose code points lie between 0xd800 and
7070 0xdfff inclusive) may not be specified explicitly by default (the
7071 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
7072 ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
7073 are always permitted.
7075 There is a special case in EBCDIC environments for ranges whose end
7076 points are both specified as literal letters in the same case. For com-
7077 patibility with Perl, EBCDIC code points within the range that are not
7078 letters are omitted. For example, [h-k] matches only four characters,
7079 even though the codes for h and k are 0x88 and 0x92, a range of 11 code
7080 points. However, if the range is specified numerically, for example,
7081 [\x88-\x92] or [h-\x92], all code points are included.
7083 If a range that includes letters is used when caseless matching is set,
7084 it matches the letters in either case. For example, [W-c] is equivalent
7085 to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
7086 character tables for a French locale are in use, [\xc8-\xcb] matches
7087 accented E characters in both cases.
7089 A circumflex can conveniently be used with the upper case character
7090 types to specify a more restricted set of characters than the matching
7091 lower case type. For example, the class [^\W_] matches any letter or
7092 digit, but not underscore, whereas [\w] includes underscore. A positive
7093 character class should be read as "something OR something OR ..." and a
7094 negative class as "NOT something AND NOT something AND NOT ...".
7096 The only metacharacters that are recognized in character classes are
7097 backslash, hyphen (only where it can be interpreted as specifying a
7098 range), circumflex (only at the start), opening square bracket (only
7099 when it can be interpreted as introducing a POSIX class name, or for a
7100 special compatibility feature - see the next two sections), and the
7101 terminating closing square bracket. However, escaping other non-
7102 alphanumeric characters does no harm.
7105 POSIX CHARACTER CLASSES
7107 Perl supports the POSIX notation for character classes. This uses names
7108 enclosed by [: and :] within the enclosing square brackets. PCRE2 also
7109 supports this notation. For example,
7113 matches "0", "1", any alphabetic character, or "%". The supported class
7116 alnum letters and digits
7118 ascii character codes 0 - 127
7119 blank space or tab only
7120 cntrl control characters
7121 digit decimal digits (same as \d)
7122 graph printing characters, excluding space
7123 lower lower case letters
7124 print printing characters, including space
7125 punct printing characters, excluding letters and digits and space
7126 space white space (the same as \s from PCRE2 8.34)
7127 upper upper case letters
7128 word "word" characters (same as \w)
7129 xdigit hexadecimal digits
7131 The default "space" characters are HT (9), LF (10), VT (11), FF (12),
7132 CR (13), and space (32). If locale-specific matching is taking place,
7133 the list of space characters may be different; there may be fewer or
7134 more of them. "Space" and \s match the same set of characters.
7136 The name "word" is a Perl extension, and "blank" is a GNU extension
7137 from Perl 5.8. Another Perl extension is negation, which is indicated
7138 by a ^ character after the colon. For example,
7142 matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
7143 POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
7144 these are not supported, and an error is given if they are encountered.
7146 By default, characters with values greater than 127 do not match any of
7147 the POSIX character classes, although this may be different for charac-
7148 ters in the range 128-255 when locale-specific matching is happening.
7149 However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
7150 the classes are changed so that Unicode character properties are used.
7151 This is achieved by replacing certain POSIX classes with other
7152 sequences, as follows:
7154 [:alnum:] becomes \p{Xan}
7155 [:alpha:] becomes \p{L}
7156 [:blank:] becomes \h
7157 [:cntrl:] becomes \p{Cc}
7158 [:digit:] becomes \p{Nd}
7159 [:lower:] becomes \p{Ll}
7160 [:space:] becomes \p{Xps}
7161 [:upper:] becomes \p{Lu}
7162 [:word:] becomes \p{Xwd}
7164 Negated versions, such as [:^alpha:] use \P instead of \p. Three other
7165 POSIX classes are handled specially in UCP mode:
7167 [:graph:] This matches characters that have glyphs that mark the page
7168 when printed. In Unicode property terms, it matches all char-
7169 acters with the L, M, N, P, S, or Cf properties, except for:
7171 U+061C Arabic Letter Mark
7172 U+180E Mongolian Vowel Separator
7173 U+2066 - U+2069 Various "isolate"s
7176 [:print:] This matches the same characters as [:graph:] plus space
7177 characters that are not controls, that is, characters with
7180 [:punct:] This matches all characters that have the Unicode P (punctua-
7181 tion) property, plus those characters with code points less
7182 than 256 that have the S (Symbol) property.
7184 The other POSIX classes are unchanged, and match only characters with
7185 code points less than 256.
7188 COMPATIBILITY FEATURE FOR WORD BOUNDARIES
7190 In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
7191 ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
7192 and "end of word". PCRE2 treats these items as follows:
7194 [[:<:]] is converted to \b(?=\w)
7195 [[:>:]] is converted to \b(?<=\w)
7197 Only these exact character sequences are recognized. A sequence such as
7198 [a[:<:]b] provokes error for an unrecognized POSIX class name. This
7199 support is not compatible with Perl. It is provided to help migrations
7200 from other environments, and is best not used in any new patterns. Note
7201 that \b matches at the start and the end of a word (see "Simple asser-
7202 tions" above), and in a Perl-style pattern the preceding or following
7203 character normally shows which is wanted, without the need for the
7204 assertions that are used above in order to give exactly the POSIX be-
7210 Vertical bar characters are used to separate alternative patterns. For
7211 example, the pattern
7215 matches either "gilbert" or "sullivan". Any number of alternatives may
7216 appear, and an empty alternative is permitted (matching the empty
7217 string). The matching process tries each alternative in turn, from left
7218 to right, and the first one that succeeds is used. If the alternatives
7219 are within a subpattern (defined below), "succeeds" means matching the
7220 rest of the main pattern as well as the alternative in the subpattern.
7223 INTERNAL OPTION SETTING
7225 The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
7226 PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
7227 can be changed from within the pattern by a sequence of letters
7228 enclosed between "(?" and ")". These options are Perl-compatible, and
7229 are described in detail in the pcre2api documentation. The option let-
7232 i for PCRE2_CASELESS
7233 m for PCRE2_MULTILINE
7234 n for PCRE2_NO_AUTO_CAPTURE
7236 x for PCRE2_EXTENDED
7237 xx for PCRE2_EXTENDED_MORE
7239 For example, (?im) sets caseless, multiline matching. It is also possi-
7240 ble to unset these options by preceding the relevant letters with a
7241 hyphen, for example (?-im). The two "extended" options are not indepen-
7242 dent; unsetting either one cancels the effects of both of them.
7244 A combined setting and unsetting such as (?im-sx), which sets
7245 PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
7246 PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
7247 options string. If a letter appears both before and after the hyphen,
7248 the option is unset. An empty options setting "(?)" is allowed. Need-
7249 less to say, it has no effect.
7251 If the first character following (? is a circumflex, it causes all of
7252 the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
7253 Letters may follow the circumflex to cause some options to be re-
7254 instated, but a hyphen may not appear.
7256 The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
7257 changed in the same way as the Perl-compatible options by using the
7258 characters J and U respectively. However, these are not unset by (?^).
7260 When one of these option changes occurs at top level (that is, not
7261 inside subpattern parentheses), the change applies to the remainder of
7262 the pattern that follows. An option change within a subpattern (see
7263 below for a description of subpatterns) affects only that part of the
7264 subpattern that follows it, so
7268 matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
7269 not used). By this means, options can be made to have different set-
7270 tings in different parts of the pattern. Any changes made in one alter-
7271 native do carry on into subsequent branches within the same subpattern.
7276 matches "ab", "aB", "c", and "C", even though when matching "C" the
7277 first branch is abandoned before the option setting. This is because
7278 the effects of option settings happen at compile time. There would be
7279 some very weird behaviour otherwise.
7281 As a convenient shorthand, if any option settings are required at the
7282 start of a non-capturing subpattern (see the next section), the option
7283 letters may appear between the "?" and the ":". Thus the two patterns
7285 (?i:saturday|sunday)
7286 (?:(?i)saturday|sunday)
7288 match exactly the same set of strings.
7290 Note: There are other PCRE2-specific options that can be set by the
7291 application when the compiling function is called. The pattern can con-
7292 tain special leading sequences such as (*CRLF) to override what the
7293 application has set or what has been defaulted. Details are given in
7294 the section entitled "Newline sequences" above. There are also the
7295 (*UTF) and (*UCP) leading sequences that can be used to set UTF and
7296 Unicode property modes; they are equivalent to setting the PCRE2_UTF
7297 and PCRE2_UCP options, respectively. However, the application can set
7298 the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
7299 of the (*UTF) and (*UCP) sequences.
7304 Subpatterns are delimited by parentheses (round brackets), which can be
7305 nested. Turning part of a pattern into a subpattern does two things:
7307 1. It localizes a set of alternatives. For example, the pattern
7309 cat(aract|erpillar|)
7311 matches "cataract", "caterpillar", or "cat". Without the parentheses,
7312 it would match "cataract", "erpillar" or an empty string.
7314 2. It sets up the subpattern as a capturing subpattern. This means
7315 that, when the whole pattern matches, the portion of the subject string
7316 that matched the subpattern is passed back to the caller, separately
7317 from the portion that matched the whole pattern. (This applies only to
7318 the traditional matching function; the DFA matching function does not
7321 Opening parentheses are counted from left to right (starting from 1) to
7322 obtain numbers for the capturing subpatterns. For example, if the
7323 string "the red king" is matched against the pattern
7325 the ((red|white) (king|queen))
7327 the captured substrings are "red king", "red", and "king", and are num-
7328 bered 1, 2, and 3, respectively.
7330 The fact that plain parentheses fulfil two functions is not always
7331 helpful. There are often times when a grouping subpattern is required
7332 without a capturing requirement. If an opening parenthesis is followed
7333 by a question mark and a colon, the subpattern does not do any captur-
7334 ing, and is not counted when computing the number of any subsequent
7335 capturing subpatterns. For example, if the string "the white queen" is
7336 matched against the pattern
7338 the ((?:red|white) (king|queen))
7340 the captured substrings are "white queen" and "queen", and are numbered
7341 1 and 2. The maximum number of capturing subpatterns is 65535.
7343 As a convenient shorthand, if any option settings are required at the
7344 start of a non-capturing subpattern, the option letters may appear
7345 between the "?" and the ":". Thus the two patterns
7347 (?i:saturday|sunday)
7348 (?:(?i)saturday|sunday)
7350 match exactly the same set of strings. Because alternative branches are
7351 tried from left to right, and options are not reset until the end of
7352 the subpattern is reached, an option setting in one branch does affect
7353 subsequent branches, so the above patterns match "SUNDAY" as well as
7357 DUPLICATE SUBPATTERN NUMBERS
7359 Perl 5.10 introduced a feature whereby each alternative in a subpattern
7360 uses the same numbers for its capturing parentheses. Such a subpattern
7361 starts with (?| and is itself a non-capturing subpattern. For example,
7362 consider this pattern:
7364 (?|(Sat)ur|(Sun))day
7366 Because the two alternatives are inside a (?| group, both sets of cap-
7367 turing parentheses are numbered one. Thus, when the pattern matches,
7368 you can look at captured substring number one, whichever alternative
7369 matched. This construct is useful when you want to capture part, but
7370 not all, of one of a number of alternatives. Inside a (?| group, paren-
7371 theses are numbered as usual, but the number is reset at the start of
7372 each branch. The numbers of any capturing parentheses that follow the
7373 subpattern start after the highest number used in any branch. The fol-
7374 lowing example is taken from the Perl documentation. The numbers under-
7375 neath show in which buffer the captured content will be stored.
7377 # before ---------------branch-reset----------- after
7378 / ( a ) (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
7381 A backreference to a numbered subpattern uses the most recent value
7382 that is set for that number by any subpattern. The following pattern
7383 matches "abcabc" or "defdef":
7387 In contrast, a subroutine call to a numbered subpattern always refers
7388 to the first one in the pattern with the given number. The following
7389 pattern matches "abcabc" or "defabc":
7391 /(?|(abc)|(def))(?1)/
7393 A relative reference such as (?-1) is no different: it is just a conve-
7394 nient way of computing an absolute group number.
7396 If a condition test for a subpattern's having matched refers to a non-
7397 unique number, the test is true if any of the subpatterns of that num-
7400 An alternative approach to using this "branch reset" feature is to use
7401 duplicate named subpatterns, as described in the next section.
7406 Identifying capturing parentheses by number is simple, but it can be
7407 very hard to keep track of the numbers in complicated patterns. Fur-
7408 thermore, if an expression is modified, the numbers may change. To help
7409 with this difficulty, PCRE2 supports the naming of capturing subpat-
7410 terns. This feature was not added to Perl until release 5.10. Python
7411 had the feature earlier, and PCRE1 introduced it at release 4.0, using
7412 the Python syntax. PCRE2 supports both the Perl and the Python syntax.
7414 In PCRE2, a capturing subpattern can be named in one of three ways:
7415 (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
7416 Names consist of up to 32 alphanumeric characters and underscores, but
7417 must start with a non-digit. References to capturing parentheses from
7418 other parts of the pattern, such as backreferences, recursion, and con-
7419 ditions, can all be made by name as well as by number.
7421 Named capturing parentheses are allocated numbers as well as names,
7422 exactly as if the names were not present. In both PCRE2 and Perl, cap-
7423 turing subpatterns are primarily identified by numbers; any names are
7424 just aliases for these numbers. The PCRE2 API provides function calls
7425 for extracting the complete name-to-number translation table from a
7426 compiled pattern, as well as convenience functions for extracting cap-
7427 tured substrings by name.
7429 Warning: When more than one subpattern has the same number, as
7430 described in the previous section, a name given to one of them applies
7431 to all of them. Perl allows identically numbered subpatterns to have
7432 different names. Consider this pattern, where there are two capturing
7433 subpatterns, both numbered 1:
7435 (?|(?<AA>aa)|(?<BB>bb))
7437 Perl allows this, with both names AA and BB as aliases of group 1.
7438 Thus, after a successful match, both names yield the same value (either
7441 In an attempt to reduce confusion, PCRE2 does not allow the same group
7442 number to be associated with more than one name. The example above pro-
7443 vokes a compile-time error. However, there is still scope for confu-
7444 sion. Consider this pattern:
7448 Although the second subpattern number 1 is not explicitly named, the
7449 name AA is still an alias for subpattern 1. Whether the pattern matches
7450 "aa" or "bb", a reference by name to group AA yields the matched
7453 By default, a name must be unique within a pattern, except that dupli-
7454 cate names are permitted for subpatterns with the same number, for
7457 (?|(?<AA>aa)|(?<AA>bb))
7459 The duplicate name constraint can be disabled by setting the PCRE2_DUP-
7460 NAMES option at compile time, or by the use of (?J) within the pattern.
7461 Duplicate names can be useful for patterns where only one instance of
7462 the named parentheses can match. Suppose you want to match the name of
7463 a weekday, either as a 3-letter abbreviation or as the full name, and
7464 in both cases you want to extract the abbreviation. This pattern
7465 (ignoring the line breaks) does the job:
7467 (?<DN>Mon|Fri|Sun)(?:day)?|
7468 (?<DN>Tue)(?:sday)?|
7469 (?<DN>Wed)(?:nesday)?|
7470 (?<DN>Thu)(?:rsday)?|
7471 (?<DN>Sat)(?:urday)?
7473 There are five capturing substrings, but only one is ever set after a
7474 match. The convenience functions for extracting the data by name
7475 returns the substring for the first (and in this example, the only)
7476 subpattern of that name that matched. This saves searching to find
7477 which numbered subpattern it was. (An alternative way of solving this
7478 problem is to use a "branch reset" subpattern, as described in the pre-
7481 If you make a backreference to a non-unique named subpattern from else-
7482 where in the pattern, the subpatterns to which the name refers are
7483 checked in the order in which they appear in the overall pattern. The
7484 first one that is set is used for the reference. For example, this pat-
7485 tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
7487 (?:(?<n>foo)|(?<n>bar))\k<n>
7490 If you make a subroutine call to a non-unique named subpattern, the one
7491 that corresponds to the first occurrence of the name is used. In the
7492 absence of duplicate numbers this is the one with the lowest number.
7494 If you use a named reference in a condition test (see the section about
7495 conditions below), either to check whether a subpattern has matched, or
7496 to check for recursion, all subpatterns with the same name are tested.
7497 If the condition is true for any one of them, the overall condition is
7498 true. This is the same behaviour as testing by number. For further
7499 details of the interfaces for handling named subpatterns, see the
7500 pcre2api documentation.
7505 Repetition is specified by quantifiers, which can follow any of the
7508 a literal data character
7509 the dot metacharacter
7510 the \C escape sequence
7511 the \X escape sequence
7512 the \R escape sequence
7513 an escape such as \d or \pL that matches a single character
7516 a parenthesized subpattern (including most assertions)
7517 a subroutine call to a subpattern (recursive or otherwise)
7519 The general repetition quantifier specifies a minimum and maximum num-
7520 ber of permitted matches, by giving the two numbers in curly brackets
7521 (braces), separated by a comma. The numbers must be less than 65536,
7522 and the first must be less than or equal to the second. For example:
7526 matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
7527 special character. If the second number is omitted, but the comma is
7528 present, there is no upper limit; if the second number and the comma
7529 are both omitted, the quantifier specifies an exact number of required
7534 matches at least 3 successive vowels, but may match many more, whereas
7538 matches exactly 8 digits. An opening curly bracket that appears in a
7539 position where a quantifier is not allowed, or one that does not match
7540 the syntax of a quantifier, is taken as a literal character. For exam-
7541 ple, {,6} is not a quantifier, but a literal string of four characters.
7543 In UTF modes, quantifiers apply to characters rather than to individual
7544 code units. Thus, for example, \x{100}{2} matches two characters, each
7545 of which is represented by a two-byte sequence in a UTF-8 string. Simi-
7546 larly, \X{3} matches three Unicode extended grapheme clusters, each of
7547 which may be several code units long (and they may be of different
7550 The quantifier {0} is permitted, causing the expression to behave as if
7551 the previous item and the quantifier were not present. This may be use-
7552 ful for subpatterns that are referenced as subroutines from elsewhere
7553 in the pattern (but see also the section entitled "Defining subpatterns
7554 for use by reference only" below). Items other than subpatterns that
7555 have a {0} quantifier are omitted from the compiled pattern.
7557 For convenience, the three most common quantifiers have single-charac-
7560 * is equivalent to {0,}
7561 + is equivalent to {1,}
7562 ? is equivalent to {0,1}
7564 It is possible to construct infinite loops by following a subpattern
7565 that can match no characters with a quantifier that has no upper limit,
7570 Earlier versions of Perl and PCRE1 used to give an error at compile
7571 time for such patterns. However, because there are cases where this can
7572 be useful, such patterns are now accepted, but if any repetition of the
7573 subpattern does in fact match no characters, the loop is forcibly bro-
7576 By default, the quantifiers are "greedy", that is, they match as much
7577 as possible (up to the maximum number of permitted times), without
7578 causing the rest of the pattern to fail. The classic example of where
7579 this gives problems is in trying to match comments in C programs. These
7580 appear between /* and */ and within the comment, individual * and /
7581 characters may appear. An attempt to match C comments by applying the
7588 /* first comment */ not comment /* second comment */
7590 fails, because it matches the entire string owing to the greediness of
7593 If a quantifier is followed by a question mark, it ceases to be greedy,
7594 and instead matches the minimum number of times possible, so the pat-
7599 does the right thing with the C comments. The meaning of the various
7600 quantifiers is not otherwise changed, just the preferred number of
7601 matches. Do not confuse this use of question mark with its use as a
7602 quantifier in its own right. Because it has two uses, it can sometimes
7603 appear doubled, as in
7607 which matches one digit by preference, but can match two if that is the
7608 only way the rest of the pattern matches.
7610 If the PCRE2_UNGREEDY option is set (an option that is not available in
7611 Perl), the quantifiers are not greedy by default, but individual ones
7612 can be made greedy by following them with a question mark. In other
7613 words, it inverts the default behaviour.
7615 When a parenthesized subpattern is quantified with a minimum repeat
7616 count that is greater than 1 or with a limited maximum, more memory is
7617 required for the compiled pattern, in proportion to the size of the
7620 If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
7621 (equivalent to Perl's /s) is set, thus allowing the dot to match new-
7622 lines, the pattern is implicitly anchored, because whatever follows
7623 will be tried against every character position in the subject string,
7624 so there is no point in retrying the overall match at any position
7625 after the first. PCRE2 normally treats such a pattern as though it were
7628 In cases where it is known that the subject string contains no new-
7629 lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
7630 mization, or alternatively, using ^ to indicate anchoring explicitly.
7632 However, there are some cases where the optimization cannot be used.
7633 When .* is inside capturing parentheses that are the subject of a
7634 backreference elsewhere in the pattern, a match at the start may fail
7635 where a later one succeeds. Consider, for example:
7639 If the subject is "xyz123abc123" the match point is the fourth charac-
7640 ter. For this reason, such a pattern is not implicitly anchored.
7642 Another case where implicit anchoring is not applied is when the lead-
7643 ing .* is inside an atomic group. Once again, a match at the start may
7644 fail where a later one succeeds. Consider this pattern:
7648 It matches "ab" in the subject "aab". The use of the backtracking con-
7649 trol verbs (*PRUNE) and (*SKIP) also disable this optimization, and
7650 there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
7652 When a capturing subpattern is repeated, the value captured is the sub-
7653 string that matched the final iteration. For example, after
7655 (tweedle[dume]{3}\s*)+
7657 has matched "tweedledum tweedledee" the value of the captured substring
7658 is "tweedledee". However, if there are nested capturing subpatterns,
7659 the corresponding captured values may have been set in previous itera-
7660 tions. For example, after
7664 matches "aba" the value of the second captured substring is "b".
7667 ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
7669 With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
7670 repetition, failure of what follows normally causes the repeated item
7671 to be re-evaluated to see if a different number of repeats allows the
7672 rest of the pattern to match. Sometimes it is useful to prevent this,
7673 either to change the nature of the match, or to cause it fail earlier
7674 than it otherwise might, when the author of the pattern knows there is
7675 no point in carrying on.
7677 Consider, for example, the pattern \d+foo when applied to the subject
7682 After matching all 6 digits and then failing to match "foo", the normal
7683 action of the matcher is to try again with only 5 digits matching the
7684 \d+ item, and then with 4, and so on, before ultimately failing.
7685 "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
7686 the means for specifying that once a subpattern has matched, it is not
7687 to be re-evaluated in this way.
7689 If we use atomic grouping for the previous example, the matcher gives
7690 up immediately on failing to match "foo" the first time. The notation
7691 is a kind of special parenthesis, starting with (?> as in this example:
7695 This kind of parenthesis "locks up" the part of the pattern it con-
7696 tains once it has matched, and a failure further into the pattern is
7697 prevented from backtracking into it. Backtracking past it to previous
7698 items, however, works as normal.
7700 An alternative description is that a subpattern of this type matches
7701 exactly the string of characters that an identical standalone pattern
7702 would match, if anchored at the current point in the subject string.
7704 Atomic grouping subpatterns are not capturing subpatterns. Simple cases
7705 such as the above example can be thought of as a maximizing repeat that
7706 must swallow everything it can. So, while both \d+ and \d+? are pre-
7707 pared to adjust the number of digits they match in order to make the
7708 rest of the pattern match, (?>\d+) can only match an entire sequence of
7711 Atomic groups in general can of course contain arbitrarily complicated
7712 subpatterns, and can be nested. However, when the subpattern for an
7713 atomic group is just a single repeated item, as in the example above, a
7714 simpler notation, called a "possessive quantifier" can be used. This
7715 consists of an additional + character following a quantifier. Using
7716 this notation, the previous example can be rewritten as
7720 Note that a possessive quantifier can be used with an entire group, for
7725 Possessive quantifiers are always greedy; the setting of the
7726 PCRE2_UNGREEDY option is ignored. They are a convenient notation for
7727 the simpler forms of atomic group. However, there is no difference in
7728 the meaning of a possessive quantifier and the equivalent atomic group,
7729 though there may be a performance difference; possessive quantifiers
7730 should be slightly faster.
7732 The possessive quantifier syntax is an extension to the Perl 5.8 syn-
7733 tax. Jeffrey Friedl originated the idea (and the name) in the first
7734 edition of his book. Mike McCloskey liked it, so implemented it when he
7735 built Sun's Java package, and PCRE1 copied it from there. It ultimately
7736 found its way into Perl at release 5.10.
7738 PCRE2 has an optimization that automatically "possessifies" certain
7739 simple pattern constructs. For example, the sequence A+B is treated as
7740 A++B because there is no point in backtracking into a sequence of A's
7741 when B must follow. This feature can be disabled by the PCRE2_NO_AUTO-
7742 POSSESS option, or starting the pattern with (*NO_AUTO_POSSESS).
7744 When a pattern contains an unlimited repeat inside a subpattern that
7745 can itself be repeated an unlimited number of times, the use of an
7746 atomic group is the only way to avoid some failing matches taking a
7747 very long time indeed. The pattern
7751 matches an unlimited number of substrings that either consist of non-
7752 digits, or digits enclosed in <>, followed by either ! or ?. When it
7753 matches, it runs quickly. However, if it is applied to
7755 aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
7757 it takes a long time before reporting failure. This is because the
7758 string can be divided between the internal \D+ repeat and the external
7759 * repeat in a large number of ways, and all have to be tried. (The
7760 example uses [!?] rather than a single character at the end, because
7761 both PCRE2 and Perl have an optimization that allows for fast failure
7762 when a single character is used. They remember the last single charac-
7763 ter that is required for a match, and fail early if it is not present
7764 in the string.) If the pattern is changed so that it uses an atomic
7767 ((?>\D+)|<\d+>)*[!?]
7769 sequences of non-digits cannot be broken, and failure happens quickly.
7774 Outside a character class, a backslash followed by a digit greater than
7775 0 (and possibly further digits) is a backreference to a capturing sub-
7776 pattern earlier (that is, to its left) in the pattern, provided there
7777 have been that many previous capturing left parentheses.
7779 However, if the decimal number following the backslash is less than 8,
7780 it is always taken as a backreference, and causes an error only if
7781 there are not that many capturing left parentheses in the entire pat-
7782 tern. In other words, the parentheses that are referenced need not be
7783 to the left of the reference for numbers less than 8. A "forward back-
7784 reference" of this type can make sense when a repetition is involved
7785 and the subpattern to the right has participated in an earlier itera-
7788 It is not possible to have a numerical "forward backreference" to a
7789 subpattern whose number is 8 or more using this syntax because a
7790 sequence such as \50 is interpreted as a character defined in octal.
7791 See the subsection entitled "Non-printing characters" above for further
7792 details of the handling of digits following a backslash. There is no
7793 such problem when named parentheses are used. A backreference to any
7794 subpattern is possible using named parentheses (see below).
7796 Another way of avoiding the ambiguity inherent in the use of digits
7797 following a backslash is to use the \g escape sequence. This escape
7798 must be followed by a signed or unsigned number, optionally enclosed in
7799 braces. These examples are all identical:
7805 An unsigned number specifies an absolute reference without the ambigu-
7806 ity that is present in the older syntax. It is also useful when literal
7807 digits follow the reference. A signed number is a relative reference.
7808 Consider this example:
7812 The sequence \g{-1} is a reference to the most recently started captur-
7813 ing subpattern before \g, that is, is it equivalent to \2 in this exam-
7814 ple. Similarly, \g{-2} would be equivalent to \1. The use of relative
7815 references can be helpful in long patterns, and also in patterns that
7816 are created by joining together fragments that contain references
7819 The sequence \g{+1} is a reference to the next capturing subpattern.
7820 This kind of forward reference can be useful it patterns that repeat.
7821 Perl does not support the use of + in this way.
7823 A backreference matches whatever actually matched the capturing subpat-
7824 tern in the current subject string, rather than anything matching the
7825 subpattern itself (see "Subpatterns as subroutines" below for a way of
7826 doing that). So the pattern
7828 (sens|respons)e and \1ibility
7830 matches "sense and sensibility" and "response and responsibility", but
7831 not "sense and responsibility". If caseful matching is in force at the
7832 time of the backreference, the case of letters is relevant. For exam-
7837 matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
7838 original capturing subpattern is matched caselessly.
7840 There are several different ways of writing backreferences to named
7841 subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
7842 \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
7843 unified backreference syntax, in which \g can be used for both numeric
7844 and named references, is also supported. We could rewrite the above
7845 example in any of the following ways:
7847 (?<p1>(?i)rah)\s+\k<p1>
7848 (?'p1'(?i)rah)\s+\k{p1}
7849 (?P<p1>(?i)rah)\s+(?P=p1)
7850 (?<p1>(?i)rah)\s+\g{p1}
7852 A subpattern that is referenced by name may appear in the pattern
7853 before or after the reference.
7855 There may be more than one backreference to the same subpattern. If a
7856 subpattern has not actually been used in a particular match, any back-
7857 references to it always fail by default. For example, the pattern
7861 always fails if it starts to match "a" rather than "bc". However, if
7862 the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
7863 erence to an unset value matches an empty string.
7865 Because there may be many capturing parentheses in a pattern, all dig-
7866 its following a backslash are taken as part of a potential backrefer-
7867 ence number. If the pattern continues with a digit character, some
7868 delimiter must be used to terminate the backreference. If the
7869 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this can be white
7870 space. Otherwise, the \g{ syntax or an empty comment (see "Comments"
7873 Recursive backreferences
7875 A backreference that occurs inside the parentheses to which it refers
7876 fails when the subpattern is first used, so, for example, (a\1) never
7877 matches. However, such references can be useful inside repeated sub-
7878 patterns. For example, the pattern
7882 matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
7883 ation of the subpattern, the backreference matches the character string
7884 corresponding to the previous iteration. In order for this to work, the
7885 pattern must be such that the first iteration does not need to match
7886 the backreference. This can be done using alternation, as in the exam-
7887 ple above, or by a quantifier with a minimum of zero.
7889 Backreferences of this type cause the group that they reference to be
7890 treated as an atomic group. Once the whole group has been matched, a
7891 subsequent matching failure cannot cause backtracking into the middle
7897 An assertion is a test on the characters following or preceding the
7898 current matching point that does not consume any characters. The simple
7899 assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
7902 More complicated assertions are coded as subpatterns. There are two
7903 kinds: those that look ahead of the current position in the subject
7904 string, and those that look behind it, and in each case an assertion
7905 may be positive (must succeed for matching to continue) or negative
7906 (must not succeed for matching to continue). An assertion subpattern is
7907 matched in the normal way, except that, when matching continues after a
7908 successful assertion, the matching position in the subject string is as
7909 it was before the assertion was processed.
7911 Assertion subpatterns are not capturing subpatterns. If an assertion
7912 contains capturing subpatterns within it, these are counted for the
7913 purposes of numbering the capturing subpatterns in the whole pattern.
7914 Within each branch of an assertion, locally captured substrings may be
7915 referenced in the usual way. For example, a sequence such as (.)\g{-1}
7916 can be used to check that two adjacent characters are the same.
7918 When a branch within an assertion fails to match, any substrings that
7919 were captured are discarded (as happens with any pattern branch that
7920 fails to match). A negative assertion succeeds only when all its
7921 branches fail to match; this means that no captured substrings are ever
7922 retained after a successful negative assertion. When an assertion con-
7923 tains a matching branch, what happens depends on the type of assertion.
7925 For a positive assertion, internally captured substrings in the suc-
7926 cessful branch are retained, and matching continues with the next pat-
7927 tern item after the assertion. For a negative assertion, a matching
7928 branch means that the assertion has failed. If the assertion is being
7929 used as a condition in a conditional subpattern (see below), captured
7930 substrings are retained, because matching continues with the "no"
7931 branch of the condition. For other failing negative assertions, control
7932 passes to the previous backtracking point, thus discarding any captured
7933 strings within the assertion.
7935 For compatibility with Perl, most assertion subpatterns may be
7936 repeated; though it makes no sense to assert the same thing several
7937 times, the side effect of capturing parentheses may occasionally be
7938 useful. However, an assertion that forms the condition for a condi-
7939 tional subpattern may not be quantified. In practice, for other asser-
7940 tions, there only three cases:
7942 (1) If the quantifier is {0}, the assertion is never obeyed during
7943 matching. However, it may contain internal capturing parenthesized
7944 groups that are called from elsewhere via the subroutine mechanism.
7946 (2) If quantifier is {0,n} where n is greater than zero, it is treated
7947 as if it were {0,1}. At run time, the rest of the pattern match is
7948 tried with and without the assertion, the order depending on the greed-
7949 iness of the quantifier.
7951 (3) If the minimum repetition is greater than zero, the quantifier is
7952 ignored. The assertion is obeyed just once when encountered during
7955 Lookahead assertions
7957 Lookahead assertions start with (?= for positive assertions and (?! for
7958 negative assertions. For example,
7962 matches a word followed by a semicolon, but does not include the semi-
7963 colon in the match, and
7967 matches any occurrence of "foo" that is not followed by "bar". Note
7968 that the apparently similar pattern
7972 does not find an occurrence of "bar" that is preceded by something
7973 other than "foo"; it finds any occurrence of "bar" whatsoever, because
7974 the assertion (?!foo) is always true when the next three characters are
7975 "bar". A lookbehind assertion is needed to achieve the other effect.
7977 If you want to force a matching failure at some point in a pattern, the
7978 most convenient way to do it is with (?!) because an empty string
7979 always matches, so an assertion that requires there not to be an empty
7980 string must always fail. The backtracking control verb (*FAIL) or (*F)
7981 is a synonym for (?!).
7983 Lookbehind assertions
7985 Lookbehind assertions start with (?<= for positive assertions and (?<!
7986 for negative assertions. For example,
7990 does find an occurrence of "bar" that is not preceded by "foo". The
7991 contents of a lookbehind assertion are restricted such that all the
7992 strings it matches must have a fixed length. However, if there are sev-
7993 eral top-level alternatives, they do not all have to have the same
8002 causes an error at compile time. Branches that match different length
8003 strings are permitted only at the top level of a lookbehind assertion.
8004 This is an extension compared with Perl, which requires all branches to
8005 match the same length of string. An assertion such as
8009 is not permitted, because its single top-level branch can match two
8010 different lengths, but it is acceptable to PCRE2 if rewritten to use
8011 two top-level branches:
8015 In some cases, the escape sequence \K (see above) can be used instead
8016 of a lookbehind assertion to get round the fixed-length restriction.
8018 The implementation of lookbehind assertions is, for each alternative,
8019 to temporarily move the current position back by the fixed length and
8020 then try to match. If there are insufficient characters before the cur-
8021 rent position, the assertion fails.
8023 In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
8024 matches a single code unit even in a UTF mode) to appear in lookbehind
8025 assertions, because it makes it impossible to calculate the length of
8026 the lookbehind. The \X and \R escapes, which can match different num-
8027 bers of code units, are never permitted in lookbehinds.
8029 "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
8030 lookbehinds, as long as the subpattern matches a fixed-length string.
8031 However, recursion, that is, a "subroutine" call into a group that is
8032 already active, is not supported.
8034 Perl does not support backreferences in lookbehinds. PCRE2 does support
8035 them, but only if certain conditions are met. The
8036 PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
8037 of (?| in the pattern (it creates duplicate subpattern numbers), and if
8038 the backreference is by name, the name must be unique. Of course, the
8039 referenced subpattern must itself be of fixed length. The following
8040 pattern matches words containing at least two characters that begin and
8041 end with the same character:
8045 Possessive quantifiers can be used in conjunction with lookbehind
8046 assertions to specify efficient matching of fixed-length strings at the
8047 end of subject strings. Consider a simple pattern such as
8051 when applied to a long string that does not match. Because matching
8052 proceeds from left to right, PCRE2 will look for each "a" in the sub-
8053 ject and then see if what follows matches the rest of the pattern. If
8054 the pattern is specified as
8058 the initial .* matches the entire string at first, but when this fails
8059 (because there is no following "a"), it backtracks to match all but the
8060 last character, then all but the last two characters, and so on. Once
8061 again the search for "a" covers the entire string, from right to left,
8062 so we are no better off. However, if the pattern is written as
8066 there can be no backtracking for the .*+ item because of the possessive
8067 quantifier; it can match only the entire string. The subsequent lookbe-
8068 hind assertion does a single test on the last four characters. If it
8069 fails, the match fails immediately. For long strings, this approach
8070 makes a significant difference to the processing time.
8072 Using multiple assertions
8074 Several assertions (of any sort) may occur in succession. For example,
8076 (?<=\d{3})(?<!999)foo
8078 matches "foo" preceded by three digits that are not "999". Notice that
8079 each of the assertions is applied independently at the same point in
8080 the subject string. First there is a check that the previous three
8081 characters are all digits, and then there is a check that the same
8082 three characters are not "999". This pattern does not match "foo" pre-
8083 ceded by six characters, the first of which are digits and the last
8084 three of which are not "999". For example, it doesn't match "123abc-
8085 foo". A pattern to do that is
8087 (?<=\d{3}...)(?<!999)foo
8089 This time the first assertion looks at the preceding six characters,
8090 checking that the first three are digits, and then the second assertion
8091 checks that the preceding three characters are not "999".
8093 Assertions can be nested in any combination. For example,
8097 matches an occurrence of "baz" that is preceded by "bar" which in turn
8098 is not preceded by "foo", while
8100 (?<=\d{3}(?!999)...)foo
8102 is another pattern that matches "foo" preceded by three digits and any
8103 three characters that are not "999".
8106 CONDITIONAL SUBPATTERNS
8108 It is possible to cause the matching process to obey a subpattern con-
8109 ditionally or to choose between two alternative subpatterns, depending
8110 on the result of an assertion, or whether a specific capturing subpat-
8111 tern has already been matched. The two possible forms of conditional
8114 (?(condition)yes-pattern)
8115 (?(condition)yes-pattern|no-pattern)
8117 If the condition is satisfied, the yes-pattern is used; otherwise the
8118 no-pattern (if present) is used. An absent no-pattern is equivalent to
8119 an empty string (it always matches). If there are more than two alter-
8120 natives in the subpattern, a compile-time error occurs. Each of the two
8121 alternatives may itself contain nested subpatterns of any form, includ-
8122 ing conditional subpatterns; the restriction to two alternatives
8123 applies only at the level of the condition. This pattern fragment is an
8124 example where the alternatives are complex:
8126 (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
8129 There are five kinds of condition: references to subpatterns, refer-
8130 ences to recursion, two pseudo-conditions called DEFINE and VERSION,
8133 Checking for a used subpattern by number
8135 If the text between the parentheses consists of a sequence of digits,
8136 the condition is true if a capturing subpattern of that number has pre-
8137 viously matched. If there is more than one capturing subpattern with
8138 the same number (see the earlier section about duplicate subpattern
8139 numbers), the condition is true if any of them have matched. An alter-
8140 native notation is to precede the digits with a plus or minus sign. In
8141 this case, the subpattern number is relative rather than absolute. The
8142 most recently opened parentheses can be referenced by (?(-1), the next
8143 most recent by (?(-2), and so on. Inside loops it can also make sense
8144 to refer to subsequent groups. The next parentheses to be opened can be
8145 referenced as (?(+1), and so on. (The value zero in any of these forms
8146 is not used; it provokes a compile-time error.)
8148 Consider the following pattern, which contains non-significant white
8149 space to make it more readable (assume the PCRE2_EXTENDED option) and
8150 to divide it into three parts for ease of discussion:
8152 ( \( )? [^()]+ (?(1) \) )
8154 The first part matches an optional opening parenthesis, and if that
8155 character is present, sets it as the first captured substring. The sec-
8156 ond part matches one or more characters that are not parentheses. The
8157 third part is a conditional subpattern that tests whether or not the
8158 first set of parentheses matched. If they did, that is, if subject
8159 started with an opening parenthesis, the condition is true, and so the
8160 yes-pattern is executed and a closing parenthesis is required. Other-
8161 wise, since no-pattern is not present, the subpattern matches nothing.
8162 In other words, this pattern matches a sequence of non-parentheses,
8163 optionally enclosed in parentheses.
8165 If you were embedding this pattern in a larger one, you could use a
8168 ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
8170 This makes the fragment independent of the parentheses in the larger
8173 Checking for a used subpattern by name
8175 Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
8176 used subpattern by name. For compatibility with earlier versions of
8177 PCRE1, which had this facility before Perl, the syntax (?(name)...) is
8178 also recognized. Note, however, that undelimited names consisting of
8179 the letter R followed by digits are ambiguous (see the following sec-
8182 Rewriting the above example to use a named subpattern gives this:
8184 (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
8186 If the name used in a condition of this kind is a duplicate, the test
8187 is applied to all subpatterns of the same name, and is true if any one
8188 of them has matched.
8190 Checking for pattern recursion
8192 "Recursion" in this sense refers to any subroutine-like call from one
8193 part of the pattern to another, whether or not it is actually recur-
8194 sive. See the sections entitled "Recursive patterns" and "Subpatterns
8195 as subroutines" below for details of recursion and subpattern calls.
8197 If a condition is the string (R), and there is no subpattern with the
8198 name R, the condition is true if matching is currently in a recursion
8199 or subroutine call to the whole pattern or any subpattern. If digits
8200 follow the letter R, and there is no subpattern with that name, the
8201 condition is true if the most recent call is into a subpattern with the
8202 given number, which must exist somewhere in the overall pattern. This
8203 is a contrived example that is equivalent to a+b:
8207 However, in both cases, if there is a subpattern with a matching name,
8208 the condition tests for its being set, as described in the section
8209 above, instead of testing for recursion. For example, creating a group
8210 with the name R1 by adding (?<R1>) to the above pattern completely
8211 changes its meaning.
8213 If a name preceded by ampersand follows the letter R, for example:
8217 the condition is true if the most recent recursion is into a subpattern
8218 of that name (which must exist within the pattern).
8220 This condition does not check the entire recursion stack. It tests only
8221 the current level. If the name used in a condition of this kind is a
8222 duplicate, the test is applied to all subpatterns of the same name, and
8223 is true if any one of them is the most recent recursion.
8225 At "top level", all these recursion test conditions are false.
8227 Defining subpatterns for use by reference only
8229 If the condition is the string (DEFINE), the condition is always false,
8230 even if there is a group with the name DEFINE. In this case, there may
8231 be only one alternative in the subpattern. It is always skipped if con-
8232 trol reaches this point in the pattern; the idea of DEFINE is that it
8233 can be used to define subroutines that can be referenced from else-
8234 where. (The use of subroutines is described below.) For example, a pat-
8235 tern to match an IPv4 address such as "192.168.23.245" could be written
8236 like this (ignore white space and line breaks):
8238 (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
8239 \b (?&byte) (\.(?&byte)){3} \b
8241 The first part of the pattern is a DEFINE group inside which a another
8242 group named "byte" is defined. This matches an individual component of
8243 an IPv4 address (a number less than 256). When matching takes place,
8244 this part of the pattern is skipped because DEFINE acts like a false
8245 condition. The rest of the pattern uses references to the named group
8246 to match the four dot-separated components of an IPv4 address, insist-
8247 ing on a word boundary at each end.
8249 Checking the PCRE2 version
8251 Programs that link with a PCRE2 library can check the version by call-
8252 ing pcre2_config() with appropriate arguments. Users of applications
8253 that do not have access to the underlying code cannot do this. A spe-
8254 cial "condition" called VERSION exists to allow such users to discover
8255 which version of PCRE2 they are dealing with by using this condition to
8256 match a string such as "yesno". VERSION must be followed either by "="
8257 or ">=" and a version number. For example:
8259 (?(VERSION>=10.4)yes|no)
8261 This pattern matches "yes" if the PCRE2 version is greater or equal to
8262 10.4, or "no" otherwise. The fractional part of the version number may
8263 not contain more than two digits.
8265 Assertion conditions
8267 If the condition is not in any of the above formats, it must be an
8268 assertion. This may be a positive or negative lookahead or lookbehind
8269 assertion. Consider this pattern, again containing non-significant
8270 white space, and with the two alternatives on the second line:
8273 \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
8275 The condition is a positive lookahead assertion that matches an
8276 optional sequence of non-letters followed by a letter. In other words,
8277 it tests for the presence of at least one letter in the subject. If a
8278 letter is found, the subject is matched against the first alternative;
8279 otherwise it is matched against the second. This pattern matches
8280 strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
8281 letters and dd are digits.
8283 When an assertion that is a condition contains capturing subpatterns,
8284 any capturing that occurs in a matching branch is retained afterwards,
8285 for both positive and negative assertions, because matching always con-
8286 tinues after the assertion, whether it succeeds or fails. (Compare non-
8287 conditional assertions, when captures are retained only for positive
8288 assertions that succeed.)
8293 There are two ways of including comments in patterns that are processed
8294 by PCRE2. In both cases, the start of the comment must not be in a
8295 character class, nor in the middle of any other sequence of related
8296 characters such as (?: or a subpattern name or number. The characters
8297 that make up a comment play no part in the pattern matching.
8299 The sequence (?# marks the start of a comment that continues up to the
8300 next closing parenthesis. Nested parentheses are not permitted. If the
8301 PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
8302 character also introduces a comment, which in this case continues to
8303 immediately after the next newline character or character sequence in
8304 the pattern. Which characters are interpreted as newlines is controlled
8305 by an option passed to the compiling function or by a special sequence
8306 at the start of the pattern, as described in the section entitled "New-
8307 line conventions" above. Note that the end of this type of comment is a
8308 literal newline sequence in the pattern; escape sequences that happen
8309 to represent a newline do not count. For example, consider this pattern
8310 when PCRE2_EXTENDED is set, and the default newline convention (a sin-
8311 gle linefeed character) is in force:
8313 abc #comment \n still comment
8315 On encountering the # character, pcre2_compile() skips along, looking
8316 for a newline in the pattern. The sequence \n is still literal at this
8317 stage, so it does not terminate the comment. Only an actual character
8318 with the code value 0x0a (the default newline) does so.
8323 Consider the problem of matching a string in parentheses, allowing for
8324 unlimited nested parentheses. Without the use of recursion, the best
8325 that can be done is to use a pattern that matches up to some fixed
8326 depth of nesting. It is not possible to handle an arbitrary nesting
8329 For some time, Perl has provided a facility that allows regular expres-
8330 sions to recurse (amongst other things). It does this by interpolating
8331 Perl code in the expression at run time, and the code can refer to the
8332 expression itself. A Perl pattern using code interpolation to solve the
8333 parentheses problem can be created like this:
8335 $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
8337 The (?p{...}) item interpolates Perl code at run time, and in this case
8338 refers recursively to the pattern in which it appears.
8340 Obviously, PCRE2 cannot support the interpolation of Perl code.
8341 Instead, it supports special syntax for recursion of the entire pat-
8342 tern, and also for individual subpattern recursion. After its introduc-
8343 tion in PCRE1 and Python, this kind of recursion was subsequently
8344 introduced into Perl at release 5.10.
8346 A special item that consists of (? followed by a number greater than
8347 zero and a closing parenthesis is a recursive subroutine call of the
8348 subpattern of the given number, provided that it occurs inside that
8349 subpattern. (If not, it is a non-recursive subroutine call, which is
8350 described in the next section.) The special item (?R) or (?0) is a
8351 recursive call of the entire regular expression.
8353 This PCRE2 pattern solves the nested parentheses problem (assume the
8354 PCRE2_EXTENDED option is set so that white space is ignored):
8356 \( ( [^()]++ | (?R) )* \)
8358 First it matches an opening parenthesis. Then it matches any number of
8359 substrings which can either be a sequence of non-parentheses, or a
8360 recursive match of the pattern itself (that is, a correctly parenthe-
8361 sized substring). Finally there is a closing parenthesis. Note the use
8362 of a possessive quantifier to avoid backtracking into sequences of non-
8365 If this were part of a larger pattern, you would not want to recurse
8366 the entire pattern, so instead you could use this:
8368 ( \( ( [^()]++ | (?1) )* \) )
8370 We have put the pattern into parentheses, and caused the recursion to
8371 refer to them instead of the whole pattern.
8373 In a larger pattern, keeping track of parenthesis numbers can be
8374 tricky. This is made easier by the use of relative references. Instead
8375 of (?1) in the pattern above you can write (?-2) to refer to the second
8376 most recently opened parentheses preceding the recursion. In other
8377 words, a negative number counts capturing parentheses leftwards from
8378 the point at which it is encountered.
8380 Be aware however, that if duplicate subpattern numbers are in use, rel-
8381 ative references refer to the earliest subpattern with the appropriate
8382 number. Consider, for example:
8384 (?|(a)|(b)) (c) (?-2)
8386 The first two capturing groups (a) and (b) are both numbered 1, and
8387 group (c) is number 2. When the reference (?-2) is encountered, the
8388 second most recently opened parentheses has the number 1, but it is the
8389 first such group (the (a) group) to which the recursion refers. This
8390 would be the same if an absolute reference (?1) was used. In other
8391 words, relative references are just a shorthand for computing a group
8394 It is also possible to refer to subsequently opened parentheses, by
8395 writing references such as (?+2). However, these cannot be recursive
8396 because the reference is not inside the parentheses that are refer-
8397 enced. They are always non-recursive subroutine calls, as described in
8400 An alternative approach is to use named parentheses. The Perl syntax
8401 for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
8402 ported. We could rewrite the above example as follows:
8404 (?<pn> \( ( [^()]++ | (?&pn) )* \) )
8406 If there is more than one subpattern with the same name, the earliest
8409 The example pattern that we have been looking at contains nested unlim-
8410 ited repeats, and so the use of a possessive quantifier for matching
8411 strings of non-parentheses is important when applying the pattern to
8412 strings that do not match. For example, when this pattern is applied to
8414 (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
8416 it yields "no match" quickly. However, if a possessive quantifier is
8417 not used, the match runs for a very long time indeed because there are
8418 so many different ways the + and * repeats can carve up the subject,
8419 and all have to be tested before failure can be reported.
8421 At the end of a match, the values of capturing parentheses are those
8422 from the outermost level. If you want to obtain intermediate values, a
8423 callout function can be used (see below and the pcre2callout documenta-
8424 tion). If the pattern above is matched against
8428 the value for the inner capturing parentheses (numbered 2) is "ef",
8429 which is the last value taken on at the top level. If a capturing sub-
8430 pattern is not matched at the top level, its final captured value is
8431 unset, even if it was (temporarily) set at a deeper level during the
8434 Do not confuse the (?R) item with the condition (R), which tests for
8435 recursion. Consider this pattern, which matches text in angle brack-
8436 ets, allowing for arbitrary nesting. Only digits are allowed in nested
8437 brackets (that is, when recursing), whereas any characters are permit-
8438 ted at the outer level.
8440 < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
8442 In this pattern, (?(R) is the start of a conditional subpattern, with
8443 two different alternatives for the recursive and non-recursive cases.
8444 The (?R) item is the actual recursive call.
8446 Differences in recursion processing between PCRE2 and Perl
8448 Some former differences between PCRE2 and Perl no longer exist.
8450 Before release 10.30, recursion processing in PCRE2 differed from Perl
8451 in that a recursive subpattern call was always treated as an atomic
8452 group. That is, once it had matched some of the subject string, it was
8453 never re-entered, even if it contained untried alternatives and there
8454 was a subsequent matching failure. (Historical note: PCRE implemented
8455 recursion before Perl did.)
8457 Starting with release 10.30, recursive subroutine calls are no longer
8458 treated as atomic. That is, they can be re-entered to try unused alter-
8459 natives if there is a matching failure later in the pattern. This is
8460 now compatible with the way Perl works. If you want a subroutine call
8461 to be atomic, you must explicitly enclose it in an atomic group.
8463 Supporting backtracking into recursions simplifies certain types of
8464 recursive pattern. For example, this pattern matches palindromic
8469 The second branch in the group matches a single central character in
8470 the palindrome when there are an odd number of characters, or nothing
8471 when there are an even number of characters, but in order to work it
8472 has to be able to try the second case when the rest of the pattern
8473 match fails. If you want to match typical palindromic phrases, the pat-
8474 tern has to ignore all non-word characters, which can be done like
8477 ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
8479 If run with the PCRE2_CASELESS option, this pattern matches phrases
8480 such as "A man, a plan, a canal: Panama!". Note the use of the posses-
8481 sive quantifier *+ to avoid backtracking into sequences of non-word
8482 characters. Without this, PCRE2 takes a great deal longer (ten times or
8483 more) to match typical phrases, and Perl takes so long that you think
8484 it has gone into a loop.
8486 Another way in which PCRE2 and Perl used to differ in their recursion
8487 processing is in the handling of captured values. Formerly in Perl,
8488 when a subpattern was called recursively or as a subpattern (see the
8489 next section), it had no access to any values that were captured out-
8490 side the recursion, whereas in PCRE2 these values can be referenced.
8491 Consider this pattern:
8495 This pattern matches "bab". The first capturing parentheses match "b",
8496 then in the second group, when the backreference \1 fails to match "b",
8497 the second alternative matches "a" and then recurses. In the recursion,
8498 \1 does now match "b" and so the whole match succeeds. This match used
8499 to fail in Perl, but in later versions (I tried 5.024) it now works.
8502 SUBPATTERNS AS SUBROUTINES
8504 If the syntax for a recursive subpattern call (either by number or by
8505 name) is used outside the parentheses to which it refers, it operates a
8506 bit like a subroutine in a programming language. More accurately, PCRE2
8507 treats the referenced subpattern as an independent subpattern which it
8508 tries to match at the current matching position. The called subpattern
8509 may be defined before or after the reference. A numbered reference can
8510 be absolute or relative, as in these examples:
8512 (...(absolute)...)...(?2)...
8513 (...(relative)...)...(?-1)...
8514 (...(?+1)...(relative)...
8516 An earlier example pointed out that the pattern
8518 (sens|respons)e and \1ibility
8520 matches "sense and sensibility" and "response and responsibility", but
8521 not "sense and responsibility". If instead the pattern
8523 (sens|respons)e and (?1)ibility
8525 is used, it does match "sense and responsibility" as well as the other
8526 two strings. Another example is given in the discussion of DEFINE
8529 Like recursions, subroutine calls used to be treated as atomic, but
8530 this changed at PCRE2 release 10.30, so backtracking into subroutine
8531 calls can now occur. However, any capturing parentheses that are set
8532 during the subroutine call revert to their previous values afterwards.
8534 Processing options such as case-independence are fixed when a subpat-
8535 tern is defined, so if it is used as a subroutine, such options cannot
8536 be changed for different calls. For example, consider this pattern:
8540 It matches "abcabc". It does not match "abcABC" because the change of
8541 processing option does not affect the called subpattern.
8543 The behaviour of backtracking control verbs in subpatterns when called
8544 as subroutines is described in the section entitled "Backtracking verbs
8545 in subroutines" below.
8548 ONIGURUMA SUBROUTINE SYNTAX
8550 For compatibility with Oniguruma, the non-Perl syntax \g followed by a
8551 name or a number enclosed either in angle brackets or single quotes, is
8552 an alternative syntax for referencing a subpattern as a subroutine,
8553 possibly recursively. Here are two of the examples used above, rewrit-
8554 ten using this syntax:
8556 (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
8557 (sens|respons)e and \g'1'ibility
8559 PCRE2 supports an extension to Oniguruma: if a number is preceded by a
8560 plus or a minus sign it is taken as a relative reference. For example:
8564 Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
8565 synonymous. The former is a backreference; the latter is a subroutine
8571 Perl has a feature whereby using the sequence (?{...}) causes arbitrary
8572 Perl code to be obeyed in the middle of matching a regular expression.
8573 This makes it possible, amongst other things, to extract different sub-
8574 strings that match the same pair of parentheses when there is a repeti-
8577 PCRE2 provides a similar feature, but of course it cannot obey arbi-
8578 trary Perl code. The feature is called "callout". The caller of PCRE2
8579 provides an external function by putting its entry point in a match
8580 context using the function pcre2_set_callout(), and then passing that
8581 context to pcre2_match() or pcre2_dfa_match(). If no match context is
8582 passed, or if the callout entry point is set to NULL, callouts are dis-
8585 Within a regular expression, (?C<arg>) indicates a point at which the
8586 external function is to be called. There are two kinds of callout:
8587 those with a numerical argument and those with a string argument. (?C)
8588 on its own with no argument is treated as (?C0). A numerical argument
8589 allows the application to distinguish between different callouts.
8590 String arguments were added for release 10.20 to make it possible for
8591 script languages that use PCRE2 to embed short scripts within patterns
8592 in a similar way to Perl.
8594 During matching, when PCRE2 reaches a callout point, the external func-
8595 tion is called. It is provided with the number or string argument of
8596 the callout, the position in the pattern, and one item of data that is
8597 also set in the match block. The callout function may cause matching to
8598 proceed, to backtrack, or to fail.
8600 By default, PCRE2 implements a number of optimizations at matching
8601 time, and one side-effect is that sometimes callouts are skipped. If
8602 you need all possible callouts to happen, you need to set options that
8603 disable the relevant optimizations. More details, including a complete
8604 description of the programming interface to the callout function, are
8605 given in the pcre2callout documentation.
8607 Callouts with numerical arguments
8609 If you just want to have a means of identifying different callout
8610 points, put a number less than 256 after the letter C. For example,
8611 this pattern has two callout points:
8615 If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
8616 callouts are automatically installed before each item in the pattern.
8617 They are all numbered 255. If there is a conditional group in the pat-
8618 tern whose condition is an assertion, an additional callout is inserted
8619 just before the condition. An explicit callout may also be set at this
8620 position, as in this example:
8622 (?(?C9)(?=a)abc|def)
8624 Note that this applies only to assertion conditions, not to other types
8627 Callouts with string arguments
8629 A delimited string may be used instead of a number as a callout argu-
8630 ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
8631 ending delimiter is the same as the start, except for {, where the end-
8632 ing delimiter is }. If the ending delimiter is needed within the
8633 string, it must be doubled. For example:
8635 (?C'ab ''c'' d')xyz(?C{any text})pqr
8637 The doubling is removed before the string is passed to the callout
8641 BACKTRACKING CONTROL
8643 There are a number of special "Backtracking Control Verbs" (to use
8644 Perl's terminology) that modify the behaviour of backtracking during
8645 matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
8646 verbs take either form, possibly behaving differently depending on
8647 whether or not a name is present.
8649 By default, for compatibility with Perl, a name is any sequence of
8650 characters that does not include a closing parenthesis. The name is not
8651 processed in any way, and it is not possible to include a closing
8652 parenthesis in the name. This can be changed by setting the
8653 PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
8656 When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
8657 verb names and only an unescaped closing parenthesis terminates the
8658 name. However, the only backslash items that are permitted are \Q, \E,
8659 and sequences such as \x{100} that define character code points. Char-
8660 acter type escapes such as \d are faulted.
8662 A closing parenthesis can be included in a name either as \) or between
8663 \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
8664 or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
8665 names is skipped, and #-comments are recognized, exactly as in the rest
8666 of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
8667 verb names unless PCRE2_ALT_VERBNAMES is also set.
8669 The maximum length of a name is 255 in the 8-bit library and 65535 in
8670 the 16-bit and 32-bit libraries. If the name is empty, that is, if the
8671 closing parenthesis immediately follows the colon, the effect is as if
8672 the colon were not there. Any number of these verbs may occur in a pat-
8675 Since these verbs are specifically related to backtracking, most of
8676 them can be used only when the pattern is to be matched using the tra-
8677 ditional matching function, because that uses a backtracking algorithm.
8678 With the exception of (*FAIL), which behaves like a failing negative
8679 assertion, the backtracking control verbs cause an error if encountered
8680 by the DFA matching function.
8682 The behaviour of these verbs in repeated groups, assertions, and in
8683 subpatterns called as subroutines (whether or not recursively) is docu-
8686 Optimizations that affect backtracking verbs
8688 PCRE2 contains some optimizations that are used to speed up matching by
8689 running some checks at the start of each match attempt. For example, it
8690 may know the minimum length of matching subject, or that a particular
8691 character must be present. When one of these optimizations bypasses the
8692 running of a match, any included backtracking verbs will not, of
8693 course, be processed. You can suppress the start-of-match optimizations
8694 by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
8695 pile(), or by starting the pattern with (*NO_START_OPT). There is more
8696 discussion of this option in the section entitled "Compiling a pattern"
8697 in the pcre2api documentation.
8699 Experiments with Perl suggest that it too has similar optimizations,
8700 and like PCRE2, turning them off can change the result of a match.
8702 Verbs that act immediately
8704 The following verbs act as soon as they are encountered.
8706 (*ACCEPT) or (*ACCEPT:NAME)
8708 This verb causes the match to end successfully, skipping the remainder
8709 of the pattern. However, when it is inside a subpattern that is called
8710 as a subroutine, only that subpattern is ended successfully. Matching
8711 then continues at the outer level. If (*ACCEPT) in triggered in a posi-
8712 tive assertion, the assertion succeeds; in a negative assertion, the
8715 If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
8718 A((?:A|B(*ACCEPT)|C)D)
8720 This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
8721 tured by the outer parentheses.
8723 (*FAIL) or (*FAIL:NAME)
8725 This verb causes a matching failure, forcing backtracking to occur. It
8726 may be abbreviated to (*F). It is equivalent to (?!) but easier to
8727 read. The Perl documentation notes that it is probably useful only when
8728 combined with (?{}) or (??{}). Those are, of course, Perl features that
8729 are not present in PCRE2. The nearest equivalent is the callout fea-
8730 ture, as for example in this pattern:
8734 A match with the string "aaaa" always fails, but the callout is taken
8735 before each backtrack happens (in this example, 10 times).
8737 (*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
8738 (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
8740 Recording which path was taken
8742 There is one verb whose main purpose is to track how a match was
8743 arrived at, though it also has a secondary use in conjunction with
8744 advancing the match starting point (see (*SKIP) below).
8746 (*MARK:NAME) or (*:NAME)
8748 A name is always required with this verb. There may be as many
8749 instances of (*MARK) as you like in a pattern, and their names do not
8752 When a match succeeds, the name of the last-encountered (*MARK:NAME) on
8753 the matching path is passed back to the caller as described in the sec-
8754 tion entitled "Other information about the match" in the pcre2api docu-
8755 mentation. This applies to all instances of (*MARK), including those
8756 inside assertions and atomic groups. (There are differences in those
8757 cases when (*MARK) is used in conjunction with (*SKIP) as described
8760 As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
8761 associated NAME arguments. Whichever is last on the matching path is
8762 passed back. See below for more details of these other verbs.
8764 Here is an example of pcre2test output, where the "mark" modifier
8765 requests the retrieval and outputting of (*MARK) data:
8767 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
8775 The (*MARK) name is tagged with "MK:" in this output, and in this exam-
8776 ple it indicates which of the two alternatives matched. This is a more
8777 efficient way of obtaining this information than putting each alterna-
8778 tive in its own capturing parentheses.
8780 If a verb with a name is encountered in a positive assertion that is
8781 true, the name is recorded and passed back if it is the last-encoun-
8782 tered. This does not happen for negative assertions or failing positive
8785 After a partial match or a failed match, the last encountered name in
8786 the entire match process is returned. For example:
8788 re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
8792 Note that in this unanchored example the mark is retained from the
8793 match attempt that started at the letter "X" in the subject. Subsequent
8794 match attempts starting at "P" and then with an empty string do not get
8795 as far as the (*MARK) item, but nevertheless do not reset it.
8797 If you are interested in (*MARK) values after failed matches, you
8798 should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
8799 ensure that the match is always attempted.
8801 Verbs that act after backtracking
8803 The following verbs do nothing when they are encountered. Matching con-
8804 tinues with what follows, but if there is a subsequent match failure,
8805 causing a backtrack to the verb, a failure is forced. That is, back-
8806 tracking cannot pass to the left of the verb. However, when one of
8807 these verbs appears inside an atomic group or in a lookaround assertion
8808 that is true, its effect is confined to that group, because once the
8809 group has been matched, there is never any backtracking into it. Back-
8810 tracking from beyond an assertion or an atomic group ignores the entire
8811 group, and seeks a preceeding backtracking point.
8813 These verbs differ in exactly what kind of failure occurs when back-
8814 tracking reaches them. The behaviour described below is what happens
8815 when the verb is not in a subroutine or an assertion. Subsequent sec-
8816 tions cover these special cases.
8818 (*COMMIT) or (*COMMIT:NAME)
8820 This verb causes the whole match to fail outright if there is a later
8821 matching failure that causes backtracking to reach it. Even if the pat-
8822 tern is unanchored, no further attempts to find a match by advancing
8823 the starting point take place. If (*COMMIT) is the only backtracking
8824 verb that is encountered, once it has been passed pcre2_match() is com-
8825 mitted to finding a match at the current starting point, or not at all.
8830 This matches "xxaab" but not "aacaab". It can be thought of as a kind
8831 of dynamic anchor, or "I've started, so I must finish."
8833 The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
8834 MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
8835 ing back to the caller. However, (*SKIP:NAME) searches only for names
8836 set with (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and
8839 If there is more than one backtracking verb in a pattern, a different
8840 one that follows (*COMMIT) may be triggered first, so merely passing
8841 (*COMMIT) during a match does not always guarantee that a match must be
8842 at this starting point.
8844 Note that (*COMMIT) at the start of a pattern is not the same as an
8845 anchor, unless PCRE2's start-of-match optimizations are turned off, as
8846 shown in this output from pcre2test:
8852 re> /(*COMMIT)abc/no_start_optimize
8856 For the first pattern, PCRE2 knows that any match must start with "a",
8857 so the optimization skips along the subject to "a" before applying the
8858 pattern to the first set of data. The match attempt then succeeds. The
8859 second pattern disables the optimization that skips along to the first
8860 character. The pattern is now applied starting at "x", and so the
8861 (*COMMIT) causes the match to fail without trying any other starting
8864 (*PRUNE) or (*PRUNE:NAME)
8866 This verb causes the match to fail at the current starting position in
8867 the subject if there is a later matching failure that causes backtrack-
8868 ing to reach it. If the pattern is unanchored, the normal "bumpalong"
8869 advance to the next starting character then happens. Backtracking can
8870 occur as usual to the left of (*PRUNE), before it is reached, or when
8871 matching to the right of (*PRUNE), but if there is no match to the
8872 right, backtracking cannot cross (*PRUNE). In simple cases, the use of
8873 (*PRUNE) is just an alternative to an atomic group or possessive quan-
8874 tifier, but there are some uses of (*PRUNE) that cannot be expressed in
8875 any other way. In an anchored pattern (*PRUNE) has the same effect as
8878 The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
8879 It is like (*MARK:NAME) in that the name is remembered for passing back
8880 to the caller. However, (*SKIP:NAME) searches only for names set with
8881 (*MARK), ignoring those set by (*COMMIT), (*PRUNE) or (*THEN).
8885 This verb, when given without a name, is like (*PRUNE), except that if
8886 the pattern is unanchored, the "bumpalong" advance is not to the next
8887 character, but to the position in the subject where (*SKIP) was encoun-
8888 tered. (*SKIP) signifies that whatever text was matched leading up to
8889 it cannot be part of a successful match if there is a later mismatch.
8894 If the subject is "aaaac...", after the first match attempt fails
8895 (starting at the first character in the string), the starting point
8896 skips on to start the next attempt at "c". Note that a possessive quan-
8897 tifer does not have the same effect as this example; although it would
8898 suppress backtracking during the first match attempt, the second
8899 attempt would start at the second character instead of skipping on to
8904 When (*SKIP) has an associated name, its behaviour is modified. When
8905 such a (*SKIP) is triggered, the previous path through the pattern is
8906 searched for the most recent (*MARK) that has the same name. If one is
8907 found, the "bumpalong" advance is to the subject position that corre-
8908 sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
8909 no (*MARK) with a matching name is found, the (*SKIP) is ignored.
8911 The search for a (*MARK) name uses the normal backtracking mechanism,
8912 which means that it does not see (*MARK) settings that are inside
8913 atomic groups or assertions, because they are never re-entered by back-
8914 tracking. Compare the following pcre2test examples:
8916 re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
8921 re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
8926 In the first example, the (*MARK) setting is in an atomic group, so it
8927 is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
8928 This allows the second branch of the pattern to be tried at the first
8929 character position. In the second example, the (*MARK) setting is not
8930 in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
8931 backtracks, and this causes a new matching attempt to start at the sec-
8932 ond character. This time, the (*MARK) is never seen because "a" does
8933 not match "b", so the matcher immediately jumps to the second branch of
8936 Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
8937 ignores names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or
8940 (*THEN) or (*THEN:NAME)
8942 This verb causes a skip to the next innermost alternative when back-
8943 tracking reaches it. That is, it cancels any further backtracking
8944 within the current alternative. Its name comes from the observation
8945 that it can be used for a pattern-based if-then-else block:
8947 ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
8949 If the COND1 pattern matches, FOO is tried (and possibly further items
8950 after the end of the group if FOO succeeds); on failure, the matcher
8951 skips to the second alternative and tries COND2, without backtracking
8952 into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
8953 quently BAZ fails, there are no more alternatives, so there is a back-
8954 track to whatever came before the entire group. If (*THEN) is not
8955 inside an alternation, it acts like (*PRUNE).
8957 The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
8958 It is like (*MARK:NAME) in that the name is remembered for passing back
8959 to the caller. However, (*SKIP:NAME) searches only for names set with
8960 (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
8962 A subpattern that does not contain a | character is just a part of the
8963 enclosing alternative; it is not a nested alternation with only one
8964 alternative. The effect of (*THEN) extends beyond such a subpattern to
8965 the enclosing alternative. Consider this pattern, where A, B, etc. are
8966 complex pattern fragments that do not contain any | characters at this
8971 If A and B are matched, but there is a failure in C, matching does not
8972 backtrack into A; instead it moves to the next alternative, that is, D.
8973 However, if the subpattern containing (*THEN) is given an alternative,
8974 it behaves differently:
8976 A (B(*THEN)C | (*FAIL)) | D
8978 The effect of (*THEN) is now confined to the inner subpattern. After a
8979 failure in C, matching moves to (*FAIL), which causes the whole subpat-
8980 tern to fail because there are no more alternatives to try. In this
8981 case, matching does now backtrack into A.
8983 Note that a conditional subpattern is not considered as having two
8984 alternatives, because only one is ever used. In other words, the |
8985 character in a conditional subpattern has a different meaning. Ignoring
8986 white space, consider:
8988 ^.*? (?(?=a) a | b(*THEN)c )
8990 If the subject is "ba", this pattern does not match. Because .*? is
8991 ungreedy, it initially matches zero characters. The condition (?=a)
8992 then fails, the character "b" is matched, but "c" is not. At this
8993 point, matching does not backtrack to .*? as might perhaps be expected
8994 from the presence of the | character. The conditional subpattern is
8995 part of the single alternative that comprises the whole pattern, and so
8996 the match fails. (If there was a backtrack into .*?, allowing it to
8997 match "b", the match would succeed.)
8999 The verbs just described provide four different "strengths" of control
9000 when subsequent matching fails. (*THEN) is the weakest, carrying on the
9001 match at the next alternative. (*PRUNE) comes next, failing the match
9002 at the current starting position, but allowing an advance to the next
9003 character (for an unanchored pattern). (*SKIP) is similar, except that
9004 the advance may be more than one character. (*COMMIT) is the strongest,
9005 causing the entire match to fail.
9007 More than one backtracking verb
9009 If more than one backtracking verb is present in a pattern, the one
9010 that is backtracked onto first acts. For example, consider this pat-
9011 tern, where A, B, etc. are complex pattern fragments:
9013 (A(*COMMIT)B(*THEN)C|ABD)
9015 If A matches but B fails, the backtrack to (*COMMIT) causes the entire
9016 match to fail. However, if A and B match, but C fails, the backtrack to
9017 (*THEN) causes the next alternative (ABD) to be tried. This behaviour
9018 is consistent, but is not always the same as Perl's. It means that if
9019 two or more backtracking verbs appear in succession, all the the last
9020 of them has no effect. Consider this example:
9022 ...(*COMMIT)(*PRUNE)...
9024 If there is a matching failure to the right, backtracking onto (*PRUNE)
9025 causes it to be triggered, and its action is taken. There can never be
9026 a backtrack onto (*COMMIT).
9028 Backtracking verbs in repeated groups
9030 PCRE2 sometimes differs from Perl in its handling of backtracking verbs
9031 in repeated groups. For example, consider:
9035 If the subject is "abac", Perl matches unless its optimizations are
9036 disabled, but PCRE2 always fails because the (*COMMIT) in the second
9037 repeat of the group acts.
9039 Backtracking verbs in assertions
9041 (*FAIL) in any assertion has its normal effect: it forces an immediate
9042 backtrack. The behaviour of the other backtracking verbs depends on
9043 whether or not the assertion is standalone or acting as the condition
9044 in a conditional subpattern.
9046 (*ACCEPT) in a standalone positive assertion causes the assertion to
9047 succeed without any further processing; captured strings and a (*MARK)
9048 name (if set) are retained. In a standalone negative assertion,
9049 (*ACCEPT) causes the assertion to fail without any further processing;
9050 captured substrings and any (*MARK) name are discarded.
9052 If the assertion is a condition, (*ACCEPT) causes the condition to be
9053 true for a positive assertion and false for a negative one; captured
9054 substrings are retained in both cases.
9056 The remaining verbs act only when a later failure causes a backtrack to
9057 reach them. This means that their effect is confined to the assertion,
9058 because lookaround assertions are atomic. A backtrack that occurs after
9059 an assertion is complete does not jump back into the assertion. Note in
9060 particular that a (*MARK) name that is set in an assertion is not
9061 "seen" by an instance of (*SKIP:NAME) latter in the pattern.
9063 The effect of (*THEN) is not allowed to escape beyond an assertion. If
9064 there are no more branches to try, (*THEN) causes a positive assertion
9065 to be false, and a negative assertion to be true.
9067 The other backtracking verbs are not treated specially if they appear
9068 in a standalone positive assertion. In a conditional positive asser-
9069 tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
9070 or (*PRUNE) causes the condition to be false. However, for both stand-
9071 alone and conditional negative assertions, backtracking into (*COMMIT),
9072 (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
9073 ing any further alternative branches.
9075 Backtracking verbs in subroutines
9077 These behaviours occur whether or not the subpattern is called recur-
9080 (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
9081 match to succeed without any further processing. Matching then contin-
9082 ues after the subroutine call. Perl documents this behaviour. Perl's
9083 treatment of the other verbs in subroutines is different in some cases.
9085 (*FAIL) in a subpattern called as a subroutine has its normal effect:
9086 it forces an immediate backtrack.
9088 (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
9089 when triggered by being backtracked to in a subpattern called as a sub-
9090 routine. There is then a backtrack at the outer level.
9092 (*THEN), when triggered, skips to the next alternative in the innermost
9093 enclosing group within the subpattern that has alternatives (its normal
9094 behaviour). However, if there is no such group within the subroutine
9095 subpattern, the subroutine match fails and there is a backtrack at the
9101 pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
9108 University Computing Service
9114 Last updated: 04 September 2018
9115 Copyright (c) 1997-2018 University of Cambridge.
9116 ------------------------------------------------------------------------------
9119 PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
9124 PCRE2 - Perl-compatible regular expressions (revised API)
9128 Two aspects of performance are discussed below: memory usage and pro-
9129 cessing time. The way you express your pattern as a regular expression
9130 can affect both of them.
9133 COMPILED PATTERN MEMORY USAGE
9135 Patterns are compiled by PCRE2 into a reasonably efficient interpretive
9136 code, so that most simple patterns do not use much memory for storing
9137 the compiled version. However, there is one case where the memory usage
9138 of a compiled pattern can be unexpectedly large. If a parenthesized
9139 subpattern has a quantifier with a minimum greater than 1 and/or a lim-
9140 ited maximum, the whole subpattern is repeated in the compiled code.
9141 For example, the pattern
9145 is compiled as if it were
9147 (abc|def)(abc|def)((abc|def)(abc|def)?)?
9149 (Technical aside: It is done this way so that backtrack points within
9150 each of the repetitions can be independently maintained.)
9152 For regular expressions whose quantifiers use only small numbers, this
9153 is not usually a problem. However, if the numbers are large, and par-
9154 ticularly if such repetitions are nested, the memory usage can become
9155 an embarrassment. For example, the very simple pattern
9157 ((ab){1,1000}c){1,3}
9159 uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
9160 compiled with its default internal pointer size of two bytes, the size
9161 limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
9162 libraries, and this is reached with the above pattern if the outer rep-
9163 etition is increased from 3 to 4. PCRE2 can be compiled to use larger
9164 internal pointers and thus handle larger compiled patterns, but it is
9165 better to try to rewrite your pattern to use less memory if you can.
9167 One way of reducing the memory usage for such patterns is to make use
9168 of PCRE2's "subroutine" facility. Re-writing the above pattern as
9170 ((ab)(?2){0,999}c)(?1){0,2}
9172 reduces the memory requirements to around 16KiB, and indeed it remains
9173 under 20KiB even with the outer repetition increased to 100. However,
9174 this kind of pattern is not always exactly equivalent, because any cap-
9175 tures within subroutine calls are lost when the subroutine completes.
9176 If this is not a problem, this kind of rewriting will allow you to
9177 process patterns that PCRE2 cannot otherwise handle. The matching per-
9178 formance of the two different versions of the pattern are roughly the
9179 same. (This applies from release 10.30 - things were different in ear-
9183 STACK AND HEAP USAGE AT RUN TIME
9185 From release 10.30, the interpretive (non-JIT) version of pcre2_match()
9186 uses very little system stack at run time. In earlier releases recur-
9187 sive function calls could use a great deal of stack, and this could
9188 cause problems, but this usage has been eliminated. Backtracking posi-
9189 tions are now explicitly remembered in memory frames controlled by the
9190 code. An initial 20KiB vector of frames is allocated on the system
9191 stack (enough for about 100 frames for small patterns), but if this is
9192 insufficient, heap memory is used. The amount of heap memory can be
9193 limited; if the limit is set to zero, only the initial stack vector is
9194 used. Rewriting patterns to be time-efficient, as described below, may
9195 also reduce the memory requirements.
9197 In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
9198 function calls, but only for processing atomic groups, lookaround
9199 assertions, and recursion within the pattern. The original version of
9200 the code used to allocate quite large internal workspace vectors on the
9201 stack, which caused some problems for some patterns in environments
9202 with small stacks. From release 10.32 the code for pcre2_dfa_match()
9203 has been re-factored to use heap memory when necessary for internal
9204 workspace when recursing, though recursive function calls are still
9207 The "match depth" parameter can be used to limit the depth of function
9208 recursion, and the "match heap" parameter to limit heap memory in
9214 Certain items in regular expression patterns are processed more effi-
9215 ciently than others. It is more efficient to use a character class like
9216 [aeiou] than a set of single-character alternatives such as
9217 (a|e|i|o|u). In general, the simplest construction that provides the
9218 required behaviour is usually the most efficient. Jeffrey Friedl's book
9219 contains a lot of useful general discussion about optimizing regular
9220 expressions for efficient performance. This document contains a few
9221 observations about PCRE2.
9223 Using Unicode character properties (the \p, \P, and \X escapes) is
9224 slow, because PCRE2 has to use a multi-stage table lookup whenever it
9225 needs a character's property. If you can find an alternative pattern
9226 that does not use character properties, it will probably be faster.
9228 By default, the escape sequences \b, \d, \s, and \w, and the POSIX
9229 character classes such as [:alpha:] do not use Unicode properties,
9230 partly for backwards compatibility, and partly for performance reasons.
9231 However, you can set the PCRE2_UCP option or start the pattern with
9232 (*UCP) if you want Unicode character properties to be used. This can
9233 double the matching time for items such as \d, when matched with
9234 pcre2_match(); the performance loss is less with a DFA matching func-
9235 tion, and in both cases there is not much difference for \b.
9237 When a pattern begins with .* not in atomic parentheses, nor in paren-
9238 theses that are the subject of a backreference, and the PCRE2_DOTALL
9239 option is set, the pattern is implicitly anchored by PCRE2, since it
9240 can match only at the start of a subject string. If the pattern has
9241 multiple top-level branches, they must all be anchorable. The optimiza-
9242 tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
9243 automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
9245 If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
9246 because the dot metacharacter does not then match a newline, and if the
9247 subject string contains newlines, the pattern may match from the char-
9248 acter immediately following one of them instead of from the very start.
9249 For example, the pattern
9253 matches the subject "first\nand second" (where \n stands for a newline
9254 character), with the match starting at the seventh character. In order
9255 to do this, PCRE2 has to retry the match starting after every newline
9258 If you are using such a pattern with subject strings that do not con-
9259 tain newlines, the best performance is obtained by setting
9260 PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
9261 explicit anchoring. That saves PCRE2 from having to scan along the sub-
9262 ject looking for a newline to restart at.
9264 Beware of patterns that contain nested indefinite repeats. These can
9265 take a long time to run when applied to a string that does not match.
9266 Consider the pattern fragment
9270 This can match "aaaa" in 16 different ways, and this number increases
9271 very rapidly as the string gets longer. (The * repeat can match 0, 1,
9272 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
9273 repeats can match different numbers of times.) When the remainder of
9274 the pattern is such that the entire match is going to fail, PCRE2 has
9275 in principle to try every possible variation, and this can take an
9276 extremely long time, even for relatively short strings.
9278 An optimization catches some of the more simple cases such as
9282 where a literal character follows. Before embarking on the standard
9283 matching procedure, PCRE2 checks that there is a "b" later in the sub-
9284 ject string, and if there is not, it fails the match immediately. How-
9285 ever, when there is no following literal this optimization cannot be
9286 used. You can see the difference by comparing the behaviour of
9290 with the pattern above. The former gives a failure almost instantly
9291 when applied to a whole line of "a" characters, whereas the latter
9292 takes an appreciable time with strings longer than about 20 characters.
9294 In many cases, the solution to this kind of performance issue is to use
9295 an atomic group or a possessive quantifier. This can often reduce mem-
9296 ory requirements as well. As another example, consider this pattern:
9300 It matches from wherever it starts until it encounters "<inet" or the
9301 end of the data, and is the kind of pattern that might be used when
9302 processing an XML file. Each iteration of the outer parentheses matches
9303 either one character that is not "<" or a "<" that is not followed by
9304 "inet". However, each time a parenthesis is processed, a backtracking
9305 position is passed, so this formulation uses a memory frame for each
9306 matched character. For a long string, a lot of memory is required. Con-
9307 sider now this rewritten pattern, which matches exactly the same
9312 This runs much faster, because sequences of characters that do not con-
9313 tain "<" are "swallowed" in one item inside the parentheses, and a pos-
9314 sessive quantifier is used to stop any backtracking into the runs of
9315 non-"<" characters. This version also uses a lot less memory because
9316 entry to a new set of parentheses happens only when a "<" character
9317 that is not followed by "inet" is encountered (and we assume this is
9320 This example shows that one way of optimizing performance when matching
9321 long subject strings is to write repeated parenthesized subpatterns to
9322 match more than one character whenever possible.
9324 SETTING RESOURCE LIMITS
9326 You can set limits on the amount of processing that takes place when
9327 matching, and on the amount of heap memory that is used. The default
9328 values of the limits are very large, and unlikely ever to operate. They
9329 can be changed when PCRE2 is built, and they can also be set when
9330 pcre2_match() or pcre2_dfa_match() is called. For details of these
9331 interfaces, see the pcre2build documentation and the section entitled
9332 "The match context" in the pcre2api documentation.
9334 The pcre2test test program has a modifier called "find_limits" which,
9335 if applied to a subject line, causes it to find the smallest limits
9336 that allow a pattern to match. This is done by repeatedly matching with
9343 University Computing Service
9349 Last updated: 25 April 2018
9350 Copyright (c) 1997-2018 University of Cambridge.
9351 ------------------------------------------------------------------------------
9354 PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
9359 PCRE2 - Perl-compatible regular expressions (revised API)
9363 #include <pcre2posix.h>
9365 int regcomp(regex_t *preg, const char *pattern,
9368 int regexec(const regex_t *preg, const char *string,
9369 size_t nmatch, regmatch_t pmatch[], int eflags);
9371 size_t regerror(int errcode, const regex_t *preg,
9372 char *errbuf, size_t errbuf_size);
9374 void regfree(regex_t *preg);
9379 This set of functions provides a POSIX-style API for the PCRE2 regular
9380 expression 8-bit library. See the pcre2api documentation for a descrip-
9381 tion of PCRE2's native API, which contains much additional functional-
9382 ity. There are no POSIX-style wrappers for PCRE2's 16-bit and 32-bit
9385 The functions described here are just wrapper functions that ultimately
9386 call the PCRE2 native API. Their prototypes are defined in the
9387 pcre2posix.h header file, and on Unix systems the library itself is
9388 called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to
9389 the command for linking an application that uses them. Because the
9390 POSIX functions call the native ones, it is also necessary to add
9393 Those POSIX option bits that can reasonably be mapped to PCRE2 native
9394 options have been implemented. In addition, the option REG_EXTENDED is
9395 defined with the value zero. This has no effect, but since programs
9396 that are written to the POSIX interface often use it, this makes it
9397 easier to slot in PCRE2 as a replacement library. Other POSIX options
9398 are not even defined.
9400 There are also some options that are not defined by POSIX. These have
9401 been added at the request of users who want to make use of certain
9402 PCRE2-specific features via the POSIX calling interface or to add BSD
9403 or GNU functionality.
9405 When PCRE2 is called via these functions, it is only the API that is
9406 POSIX-like in style. The syntax and semantics of the regular expres-
9407 sions themselves are still those of Perl, subject to the setting of
9408 various PCRE2 options, as described below. "POSIX-like in style" means
9409 that the API approximates to the POSIX definition; it is not fully
9410 POSIX-compatible, and in multi-unit encoding domains it is probably
9411 even less compatible.
9413 The header for these functions is supplied as pcre2posix.h to avoid any
9414 potential clash with other POSIX libraries. It can, of course, be
9415 renamed or aliased as regex.h, which is the "correct" name. It provides
9416 two structure types, regex_t for compiled internal forms, and reg-
9417 match_t for returning captured substrings. It also defines some con-
9418 stants whose names start with "REG_"; these are used for setting
9419 options and identifying error codes.
9424 The function regcomp() is called to compile a pattern into an internal
9425 form. By default, the pattern is a C string terminated by a binary zero
9426 (but see REG_PEND below). The preg argument is a pointer to a regex_t
9427 structure that is used as a base for storing information about the com-
9428 piled regular expression. (It is also used for input when REG_PEND is
9431 The argument cflags is either zero, or contains one or more of the bits
9432 defined by the following macros:
9436 The PCRE2_DOTALL option is set when the regular expression is passed
9437 for compilation to the native function. Note that REG_DOTALL is not
9438 part of the POSIX standard.
9442 The PCRE2_CASELESS option is set when the regular expression is passed
9443 for compilation to the native function.
9447 The PCRE2_MULTILINE option is set when the regular expression is passed
9448 for compilation to the native function. Note that this does not mimic
9449 the defined POSIX behaviour for REG_NEWLINE (see the following sec-
9454 The PCRE2_LITERAL option is set when the regular expression is passed
9455 for compilation to the native function. This disables all meta charac-
9456 ters in the pattern, causing it to be treated as a literal string. The
9457 only other options that are allowed with REG_NOSPEC are REG_ICASE,
9458 REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
9463 When a pattern that is compiled with this flag is passed to regexec()
9464 for matching, the nmatch and pmatch arguments are ignored, and no cap-
9465 tured strings are returned. Versions of the PCRE library prior to 10.22
9466 used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
9467 longer happens because it disables the use of backreferences.
9471 If this option is set, the reg_endp field in the preg structure (which
9472 has the type const char *) must be set to point to the character beyond
9473 the end of the pattern before calling regcomp(). The pattern itself may
9474 now contain binary zeros, which are treated as data characters. Without
9475 REG_PEND, a binary zero terminates the pattern and the re_endp field is
9476 ignored. This is a GNU extension to the POSIX standard and should be
9477 used with caution in software intended to be portable to other systems.
9481 The PCRE2_UCP option is set when the regular expression is passed for
9482 compilation to the native function. This causes PCRE2 to use Unicode
9483 properties when matchine \d, \w, etc., instead of just recognizing
9484 ASCII values. Note that REG_UCP is not part of the POSIX standard.
9488 The PCRE2_UNGREEDY option is set when the regular expression is passed
9489 for compilation to the native function. Note that REG_UNGREEDY is not
9490 part of the POSIX standard.
9494 The PCRE2_UTF option is set when the regular expression is passed for
9495 compilation to the native function. This causes the pattern itself and
9496 all data strings used for matching it to be treated as UTF-8 strings.
9497 Note that REG_UTF is not part of the POSIX standard.
9499 In the absence of these flags, no options are passed to the native
9500 function. This means the the regex is compiled with PCRE2 default
9501 semantics. In particular, the way it handles newline characters in the
9502 subject string is the Perl way, not the POSIX way. Note that setting
9503 PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
9504 It does not affect the way newlines are matched by the dot metacharac-
9505 ter (they are not) or by a negative class such as [^a] (they are).
9507 The yield of regcomp() is zero on success, and non-zero otherwise. The
9508 preg structure is filled in on success, and one other member of the
9509 structure (as well as re_endp) is public: re_nsub contains the number
9510 of capturing subpatterns in the regular expression. Various error codes
9511 are defined in the header file.
9513 NOTE: If the yield of regcomp() is non-zero, you must not attempt to
9514 use the contents of the preg structure. If, for example, you pass it to
9515 regexec(), the result is undefined and your program is likely to crash.
9518 MATCHING NEWLINE CHARACTERS
9520 This area is not simple, because POSIX and Perl take different views of
9521 things. It is not possible to get PCRE2 to obey POSIX semantics, but
9522 then PCRE2 was never intended to be a POSIX engine. The following table
9523 lists the different possibilities for matching newline characters in
9528 . matches newline no PCRE2_DOTALL
9529 newline matches [^a] yes not changeable
9530 $ matches \n at end yes PCRE2_DOLLAR_ENDONLY
9531 $ matches \n in middle no PCRE2_MULTILINE
9532 ^ matches \n in middle no PCRE2_MULTILINE
9534 This is the equivalent table for a POSIX-compatible pattern matcher:
9538 . matches newline yes REG_NEWLINE
9539 newline matches [^a] yes REG_NEWLINE
9540 $ matches \n at end no REG_NEWLINE
9541 $ matches \n in middle no REG_NEWLINE
9542 ^ matches \n in middle no REG_NEWLINE
9544 This behaviour is not what happens when PCRE2 is called via its POSIX
9545 API. By default, PCRE2's behaviour is the same as Perl's, except that
9546 there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
9547 and Perl, there is no way to stop newline from matching [^a].
9549 Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
9550 and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
9551 there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
9552 action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
9553 comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
9554 and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
9560 The function regexec() is called to match a compiled pattern preg
9561 against a given string, which is by default terminated by a zero byte
9562 (but see REG_STARTEND below), subject to the options in eflags. These
9567 The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
9572 The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
9573 matching function. Note that REG_NOTEMPTY is not part of the POSIX
9574 standard. However, setting this option can give more POSIX-like behav-
9575 iour in some situations.
9579 The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
9584 When this option is set, the subject string starts at string +
9585 pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
9586 point to the first character beyond the string. There may be binary
9587 zeros within the subject string, and indeed, using REG_STARTEND is the
9588 only way to pass a subject string that contains a binary zero.
9590 Whatever the value of pmatch[0].rm_so, the offsets of the matched
9591 string and any captured substrings are still given relative to the
9592 start of string itself. (Before PCRE2 release 10.30 these were given
9593 relative to string + pmatch[0].rm_so, but this differs from other
9596 This is a BSD extension, compatible with but not specified by IEEE
9597 Standard 1003.2 (POSIX.2), and should be used with caution in software
9598 intended to be portable to other systems. Note that a non-zero rm_so
9599 does not imply REG_NOTBOL; REG_STARTEND affects only the location and
9600 length of the string, not how it is matched. Setting REG_STARTEND and
9601 passing pmatch as NULL are mutually exclusive; the error REG_INVARG is
9604 If the pattern was compiled with the REG_NOSUB flag, no data about any
9605 matched strings is returned. The nmatch and pmatch arguments of
9606 regexec() are ignored (except possibly as input for REG_STARTEND).
9608 The value of nmatch may be zero, and the value pmatch may be NULL
9609 (unless REG_STARTEND is set); in both these cases no data about any
9610 matched strings is returned.
9612 Otherwise, the portion of the string that was matched, and also any
9613 captured substrings, are returned via the pmatch argument, which points
9614 to an array of nmatch structures of type regmatch_t, containing the
9615 members rm_so and rm_eo. These contain the byte offset to the first
9616 character of each substring and the offset to the first character after
9617 the end of each substring, respectively. The 0th element of the vector
9618 relates to the entire portion of string that was matched; subsequent
9619 elements relate to the capturing subpatterns of the regular expression.
9620 Unused entries in the array have both structure members set to -1.
9622 A successful match yields a zero return; various error codes are
9623 defined in the header file, of which REG_NOMATCH is the "expected"
9629 The regerror() function maps a non-zero errorcode from either regcomp()
9630 or regexec() to a printable message. If preg is not NULL, the error
9631 should have arisen from the use of that structure. A message terminated
9632 by a binary zero is placed in errbuf. If the buffer is too short, only
9633 the first errbuf_size - 1 characters of the error message are used. The
9634 yield of the function is the size of buffer needed to hold the whole
9635 message, including the terminating zero. This value is greater than
9636 errbuf_size if the message was truncated.
9641 Compiling a regular expression causes memory to be allocated and asso-
9642 ciated with the preg structure. The function regfree() frees all such
9643 memory, after which preg may no longer be used as a compiled expres-
9650 University Computing Service
9656 Last updated: 15 June 2017
9657 Copyright (c) 1997-2017 University of Cambridge.
9658 ------------------------------------------------------------------------------
9661 PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
9666 PCRE2 - Perl-compatible regular expressions (revised API)
9668 PCRE2 SAMPLE PROGRAM
9670 A simple, complete demonstration program to get you started with using
9671 PCRE2 is supplied in the file pcre2demo.c in the src directory in the
9672 PCRE2 distribution. A listing of this program is given in the pcre2demo
9673 documentation. If you do not have a copy of the PCRE2 distribution, you
9674 can save this listing to re-create the contents of pcre2demo.c.
9676 The demonstration program compiles the regular expression that is its
9677 first argument, and matches it against the subject string in its second
9678 argument. No PCRE2 options are set, and default character tables are
9679 used. If matching succeeds, the program outputs the portion of the sub-
9680 ject that matched, together with the contents of any captured sub-
9683 If the -g option is given on the command line, the program then goes on
9684 to check for further matches of the same regular expression in the same
9685 subject string. The logic is a little bit tricky because of the possi-
9686 bility of matching an empty string. Comments in the code explain what
9689 The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
9690 library. It handles strings and characters that are stored in 8-bit
9691 code units. By default, one character corresponds to one code unit,
9692 but if the pattern starts with "(*UTF)", both it and the subject are
9693 treated as UTF-8 strings, where characters may occupy multiple code
9696 If PCRE2 is installed in the standard include and library directories
9697 for your operating system, you should be able to compile the demonstra-
9698 tion program using a command like this:
9700 cc -o pcre2demo pcre2demo.c -lpcre2-8
9702 If PCRE2 is installed elsewhere, you may need to add additional options
9703 to the command line. For example, on a Unix-like system that has PCRE2
9704 installed in /usr/local, you can compile the demonstration program
9705 using a command like this:
9707 cc -o pcre2demo -I/usr/local/include pcre2demo.c \
9708 -L/usr/local/lib -lpcre2-8
9710 Once you have built the demonstration program, you can run simple tests
9713 ./pcre2demo 'cat|dog' 'the cat sat on the mat'
9714 ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
9716 Note that there is a much more comprehensive test program, called
9717 pcre2test, which supports many more facilities for testing regular
9718 expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
9719 though not all three need be installed). The pcre2demo program is pro-
9720 vided as a relatively simple coding example.
9722 If you try to run pcre2demo when PCRE2 is not installed in the standard
9723 library directory, you may get an error like this on some operating
9724 systems (e.g. Solaris):
9726 ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
9729 This is caused by the way shared library support works on those sys-
9730 tems. You need to add
9734 (for example) to the compile command to get round this problem.
9740 University Computing Service
9746 Last updated: 02 February 2016
9747 Copyright (c) 1997-2016 University of Cambridge.
9748 ------------------------------------------------------------------------------
9749 PCRE2SERIALIZE(3) Library Functions Manual PCRE2SERIALIZE(3)
9754 PCRE2 - Perl-compatible regular expressions (revised API)
9756 SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
9758 int32_t pcre2_serialize_decode(pcre2_code **codes,
9759 int32_t number_of_codes, const uint32_t *bytes,
9760 pcre2_general_context *gcontext);
9762 int32_t pcre2_serialize_encode(pcre2_code **codes,
9763 int32_t number_of_codes, uint32_t **serialized_bytes,
9764 PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
9766 void pcre2_serialize_free(uint8_t *bytes);
9768 int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
9770 If you are running an application that uses a large number of regular
9771 expression patterns, it may be useful to store them in a precompiled
9772 form instead of having to compile them every time the application is
9773 run. However, if you are using the just-in-time optimization feature,
9774 it is not possible to save and reload the JIT data, because it is posi-
9775 tion-dependent. The host on which the patterns are reloaded must be
9776 running the same version of PCRE2, with the same code unit width, and
9777 must also have the same endianness, pointer width and PCRE2_SIZE type.
9778 For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
9779 library cannot be reloaded on a 64-bit system, nor can they be reloaded
9780 using the 8-bit library.
9782 Note that "serialization" in PCRE2 does not convert compiled patterns
9783 to an abstract format like Java or .NET serialization. The serialized
9784 output is really just a bytecode dump, which is why it can only be
9785 reloaded in the same environment as the one that created it. Hence the
9786 restrictions mentioned above. Applications that are not statically
9787 linked with a fixed version of PCRE2 must be prepared to recompile pat-
9788 terns from their sources, in order to be immune to PCRE2 upgrades.
9793 The facility for saving and restoring compiled patterns is intended for
9794 use within individual applications. As such, the data supplied to
9795 pcre2_serialize_decode() is expected to be trusted data, not data from
9796 arbitrary external sources. There is only some simple consistency
9797 checking, not complete validation of what is being re-loaded. Corrupted
9798 data may cause undefined results. For example, if the length field of a
9799 pattern in the serialized data is corrupted, the deserializing code may
9800 read beyond the end of the byte stream that is passed to it.
9803 SAVING COMPILED PATTERNS
9805 Before compiled patterns can be saved they must be serialized, which in
9806 PCRE2 means converting the pattern to a stream of bytes. A single byte
9807 stream may contain any number of compiled patterns, but they must all
9808 use the same character tables. A single copy of the tables is included
9809 in the byte stream (its size is 1088 bytes). For more details of char-
9810 acter tables, see the section on locale support in the pcre2api docu-
9813 The function pcre2_serialize_encode() creates a serialized byte stream
9814 from a list of compiled patterns. Its first two arguments specify the
9815 list, being a pointer to a vector of pointers to compiled patterns, and
9816 the length of the vector. The third and fourth arguments point to vari-
9817 ables which are set to point to the created byte stream and its length,
9818 respectively. The final argument is a pointer to a general context,
9819 which can be used to specify custom memory mangagement functions. If
9820 this argument is NULL, malloc() is used to obtain memory for the byte
9821 stream. The yield of the function is the number of serialized patterns,
9822 or one of the following negative error codes:
9824 PCRE2_ERROR_BADDATA the number of patterns is zero or less
9825 PCRE2_ERROR_BADMAGIC mismatch of id bytes in one of the patterns
9826 PCRE2_ERROR_MEMORY memory allocation failed
9827 PCRE2_ERROR_MIXEDTABLES the patterns do not all use the same tables
9828 PCRE2_ERROR_NULL the 1st, 3rd, or 4th argument is NULL
9830 PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
9831 rupted, or that a slot in the vector does not point to a compiled pat-
9834 Once a set of patterns has been serialized you can save the data in any
9835 appropriate manner. Here is sample code that compiles two patterns and
9836 writes them to a file. It assumes that the variable fd refers to a file
9837 that is open for output. The error checking that should be present in a
9838 real application has been omitted for simplicity.
9842 PCRE2_SIZE erroroffset;
9843 PCRE2_SIZE bytescount;
9844 pcre2_code *list_of_codes[2];
9845 list_of_codes[0] = pcre2_compile("first pattern",
9846 PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
9847 list_of_codes[1] = pcre2_compile("second pattern",
9848 PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
9849 errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
9851 errorcode = fwrite(bytes, 1, bytescount, fd);
9853 Note that the serialized data is binary data that may contain any of
9854 the 256 possible byte values. On systems that make a distinction
9855 between binary and non-binary data, be sure that the file is opened for
9858 Serializing a set of patterns leaves the original data untouched, so
9859 they can still be used for matching. Their memory must eventually be
9860 freed in the usual way by calling pcre2_code_free(). When you have fin-
9861 ished with the byte stream, it too must be freed by calling pcre2_seri-
9862 alize_free(). If this function is called with a NULL argument, it
9863 returns immediately without doing anything.
9866 RE-USING PRECOMPILED PATTERNS
9868 In order to re-use a set of saved patterns you must first make the
9869 serialized byte stream available in main memory (for example, by read-
9870 ing from a file). The management of this memory block is up to the
9871 application. You can use the pcre2_serialize_get_number_of_codes()
9872 function to find out how many compiled patterns are in the serialized
9873 data without actually decoding the patterns:
9875 uint8_t *bytes = <serialized data>;
9876 int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
9878 The pcre2_serialize_decode() function reads a byte stream and recreates
9879 the compiled patterns in new memory blocks, setting pointers to them in
9880 a vector. The first two arguments are a pointer to a suitable vector
9881 and its length, and the third argument points to a byte stream. The
9882 final argument is a pointer to a general context, which can be used to
9883 specify custom memory mangagement functions for the decoded patterns.
9884 If this argument is NULL, malloc() and free() are used. After deserial-
9885 ization, the byte stream is no longer needed and can be discarded.
9887 int32_t number_of_codes;
9888 pcre2_code *list_of_codes[2];
9889 uint8_t *bytes = <serialized data>;
9890 int32_t number_of_codes =
9891 pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
9893 If the vector is not large enough for all the patterns in the byte
9894 stream, it is filled with those that fit, and the remainder are
9895 ignored. The yield of the function is the number of decoded patterns,
9896 or one of the following negative error codes:
9898 PCRE2_ERROR_BADDATA second argument is zero or less
9899 PCRE2_ERROR_BADMAGIC mismatch of id bytes in the data
9900 PCRE2_ERROR_BADMODE mismatch of code unit size or PCRE2 version
9901 PCRE2_ERROR_BADSERIALIZEDDATA other sanity check failure
9902 PCRE2_ERROR_MEMORY memory allocation failed
9903 PCRE2_ERROR_NULL first or third argument is NULL
9905 PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
9906 compiled on a system with different endianness.
9908 Decoded patterns can be used for matching in the usual way, and must be
9909 freed by calling pcre2_code_free(). However, be aware that there is a
9910 potential race issue if you are using multiple patterns that were
9911 decoded from a single byte stream in a multithreaded application. A
9912 single copy of the character tables is used by all the decoded patterns
9913 and a reference count is used to arrange for its memory to be automati-
9914 cally freed when the last pattern is freed, but there is no locking on
9915 this reference count. Therefore, if you want to call pcre2_code_free()
9916 for these patterns in different threads, you must arrange your own
9917 locking, and ensure that pcre2_code_free() cannot be called by two
9918 threads at the same time.
9920 If a pattern was processed by pcre2_jit_compile() before being serial-
9921 ized, the JIT data is discarded and so is no longer available after a
9922 save/restore cycle. You can, however, process a restored pattern with
9923 pcre2_jit_compile() if you wish.
9929 University Computing Service
9935 Last updated: 27 June 2018
9936 Copyright (c) 1997-2018 University of Cambridge.
9937 ------------------------------------------------------------------------------
9940 PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
9945 PCRE2 - Perl-compatible regular expressions (revised API)
9947 PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
9949 The full syntax and semantics of the regular expressions that are sup-
9950 ported by PCRE2 are described in the pcre2pattern documentation. This
9951 document contains a quick-reference summary of the syntax.
9956 \x where x is non-alphanumeric is a literal x
9957 \Q...\E treat enclosed characters as literal
9962 This table applies to ASCII and Unicode environments.
9964 \a alarm, that is, the BEL character (hex 07)
9965 \cx "control-x", where x is any ASCII printing character
9967 \f form feed (hex 0C)
9969 \r carriage return (hex 0D)
9971 \0dd character with octal code 0dd
9972 \ddd character with octal code ddd, or backreference
9973 \o{ddd..} character with octal code ddd..
9974 \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
9975 \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
9976 \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
9977 \xhh character with hex code hh
9978 \x{hh..} character with hex code hh..
9980 Note that \0dd is always an octal code. The treatment of backslash fol-
9981 lowed by a non-zero digit is complicated; for details see the section
9982 "Non-printing characters" in the pcre2pattern documentation, where
9983 details of escape processing in EBCDIC environments are also given.
9984 \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
9985 EBCDIC environments. Note that \N not followed by an opening curly
9986 bracket has a different meaning (see below).
9988 When \x is not followed by {, from zero to two hexadecimal digits are
9989 read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
9990 imal digits to be recognized as a hexadecimal escape; otherwise it
9991 matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
9992 lowed by four hexadecimal digits, it matches a literal "u".
9997 . any character except newline;
9998 in dotall mode, any character whatsoever
9999 \C one code unit, even in UTF mode (best avoided)
10001 \D a character that is not a decimal digit
10002 \h a horizontal white space character
10003 \H a character that is not a horizontal white space character
10004 \N a character that is not a newline
10005 \p{xx} a character with the xx property
10006 \P{xx} a character without the xx property
10007 \R a newline sequence
10008 \s a white space character
10009 \S a character that is not a white space character
10010 \v a vertical white space character
10011 \V a character that is not a vertical white space character
10012 \w a "word" character
10013 \W a "non-word" character
10014 \X a Unicode extended grapheme cluster
10016 \C is dangerous because it may leave the current matching point in the
10017 middle of a UTF-8 or UTF-16 character. The application can lock out the
10018 use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
10019 possible to build PCRE2 with the use of \C permanently disabled.
10021 By default, \d, \s, and \w match only ASCII characters, even in UTF-8
10022 mode or in the 16-bit and 32-bit libraries. However, if locale-specific
10023 matching is happening, \s and \w may also match characters with code
10024 points in the range 128-255. If the PCRE2_UCP option is set, the behav-
10025 iour of these escape sequences is changed to use Unicode properties and
10026 they match many more characters.
10029 GENERAL CATEGORY PROPERTIES FOR \p and \P
10039 Ll Lower case letter
10042 Lt Title case letter
10043 Lu Upper case letter
10049 Mn Non-spacing mark
10057 Pc Connector punctuation
10058 Pd Dash punctuation
10059 Pe Close punctuation
10060 Pf Final punctuation
10061 Pi Initial punctuation
10062 Po Other punctuation
10063 Ps Open punctuation
10068 Sm Mathematical symbol
10073 Zp Paragraph separator
10077 PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
10079 Xan Alphanumeric: union of properties L and N
10080 Xps POSIX space: property Z or tab, NL, VT, FF, CR
10081 Xsp Perl space: property Z or tab, NL, VT, FF, CR
10082 Xuc Univerally-named character: one that can be
10083 represented by a Universal Character Name
10084 Xwd Perl word: property Xan or underscore
10086 Perl and POSIX space are now the same. Perl added VT to its space char-
10087 acter set at release 5.18.
10090 SCRIPT NAMES FOR \p AND \P
10092 Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
10093 nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
10094 Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
10095 nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
10096 Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
10097 Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
10098 Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
10099 Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
10100 Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
10101 nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
10102 Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
10103 jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
10104 Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
10105 Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
10106 Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
10107 ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
10108 dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
10109 Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
10110 Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
10111 vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
10112 Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
10113 Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
10114 nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
10119 [...] positive character class
10120 [^...] negative character class
10121 [x-y] range (can be used for hex characters)
10122 [[:xxx:]] positive POSIX named set
10123 [[:^xxx:]] negative POSIX named set
10129 cntrl control character
10130 digit decimal digit
10131 graph printing, excluding space
10132 lower lower case letter
10133 print printing, including space
10134 punct printing, excluding alphanumeric
10136 upper upper case letter
10138 xdigit hexadecimal digit
10140 In PCRE2, POSIX character set names recognize only ASCII characters by
10141 default, but some of them use Unicode properties if PCRE2_UCP is set.
10142 You can use \Q...\E inside a character class.
10148 ?+ 0 or 1, possessive
10150 * 0 or more, greedy
10151 *+ 0 or more, possessive
10153 + 1 or more, greedy
10154 ++ 1 or more, possessive
10157 {n,m} at least n, no more than m, greedy
10158 {n,m}+ at least n, no more than m, possessive
10159 {n,m}? at least n, no more than m, lazy
10160 {n,} n or more, greedy
10161 {n,}+ n or more, possessive
10162 {n,}? n or more, lazy
10165 ANCHORS AND SIMPLE ASSERTIONS
10168 \B not a word boundary
10170 also after an internal newline in multiline mode
10171 (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
10172 \A start of subject
10174 also before newline at end of subject
10175 also before internal newline in multiline mode
10177 also before newline at end of subject
10179 \G first matching position in subject
10182 REPORTED MATCH POINT SETTING
10184 \K set reported start of match
10186 \K is honoured in positive assertions, but ignored in negative ones.
10196 (...) capturing group
10197 (?<name>...) named capturing group (Perl)
10198 (?'name'...) named capturing group (Perl)
10199 (?P<name>...) named capturing group (Python)
10200 (?:...) non-capturing group
10201 (?|...) non-capturing group; reset group numbers for
10202 capturing groups in each alternative
10207 (?>...) atomic, non-capturing group
10212 (?#....) comment (not nestable)
10216 Changes of these options within a group are automatically cancelled at
10217 the end of the group.
10220 (?J) allow duplicate names
10222 (?n) no auto capture
10223 (?s) single line (dotall)
10224 (?U) default ungreedy (lazy)
10225 (?x) extended: ignore white space except in classes
10226 (?xx) as (?x) but also ignore space and tab in classes
10227 (?-...) unset option(s)
10228 (?^) unset imnsx options
10230 Unsetting x or xx unsets both. Several options may be set at once, and
10231 a mixture of setting and unsetting such as (?i-x) is allowed, but there
10232 may be only one hyphen. Setting (but no unsetting) is allowed after (?^
10233 for example (?^in). An option setting may appear at the start of a non-
10234 capturing group, for example (?i:...).
10236 The following are recognized only at the very start of a pattern or
10237 after one of the newline or \R options with similar syntax. More than
10238 one of them may appear. For the first three, d is a decimal number.
10240 (*LIMIT_DEPTH=d) set the backtracking limit to d
10241 (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
10242 (*LIMIT_MATCH=d) set the match limit to d
10243 (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
10244 (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
10245 (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
10246 (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
10247 (*NO_JIT) disable JIT optimization
10248 (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
10249 (*UTF) set appropriate UTF mode for the library in use
10250 (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
10252 Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
10253 value of the limits set by the caller of pcre2_match() or
10254 pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
10255 synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
10256 and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
10257 respectively, at compile time.
10262 These are recognized only at the very start of the pattern or after
10263 option settings with a similar syntax.
10265 (*CR) carriage return only
10266 (*LF) linefeed only
10267 (*CRLF) carriage return followed by linefeed
10268 (*ANYCRLF) all three of the above
10269 (*ANY) any Unicode newline sequence
10270 (*NUL) the NUL character (binary zero)
10275 These are recognized only at the very start of the pattern or after
10276 option setting with a similar syntax.
10278 (*BSR_ANYCRLF) CR, LF, or CRLF
10279 (*BSR_UNICODE) any Unicode newline sequence
10282 LOOKAHEAD AND LOOKBEHIND ASSERTIONS
10284 (?=...) positive look ahead
10285 (?!...) negative look ahead
10286 (?<=...) positive look behind
10287 (?<!...) negative look behind
10289 Each top-level branch of a look behind must be of a fixed length.
10294 \n reference by number (can be ambiguous)
10295 \gn reference by number
10296 \g{n} reference by number
10297 \g+n relative reference by number (PCRE2 extension)
10298 \g-n relative reference by number
10299 \g{+n} relative reference by number (PCRE2 extension)
10300 \g{-n} relative reference by number
10301 \k<name> reference by name (Perl)
10302 \k'name' reference by name (Perl)
10303 \g{name} reference by name (Perl)
10304 \k{name} reference by name (.NET)
10305 (?P=name) reference by name (Python)
10308 SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
10310 (?R) recurse whole pattern
10311 (?n) call subpattern by absolute number
10312 (?+n) call subpattern by relative number
10313 (?-n) call subpattern by relative number
10314 (?&name) call subpattern by name (Perl)
10315 (?P>name) call subpattern by name (Python)
10316 \g<name> call subpattern by name (Oniguruma)
10317 \g'name' call subpattern by name (Oniguruma)
10318 \g<n> call subpattern by absolute number (Oniguruma)
10319 \g'n' call subpattern by absolute number (Oniguruma)
10320 \g<+n> call subpattern by relative number (PCRE2 extension)
10321 \g'+n' call subpattern by relative number (PCRE2 extension)
10322 \g<-n> call subpattern by relative number (PCRE2 extension)
10323 \g'-n' call subpattern by relative number (PCRE2 extension)
10326 CONDITIONAL PATTERNS
10328 (?(condition)yes-pattern)
10329 (?(condition)yes-pattern|no-pattern)
10331 (?(n) absolute reference condition
10332 (?(+n) relative reference condition
10333 (?(-n) relative reference condition
10334 (?(<name>) named reference condition (Perl)
10335 (?('name') named reference condition (Perl)
10336 (?(name) named reference condition (PCRE2, deprecated)
10337 (?(R) overall recursion condition
10338 (?(Rn) specific numbered group recursion condition
10339 (?(R&name) specific named group recursion condition
10340 (?(DEFINE) define subpattern for reference
10341 (?(VERSION[>]=n.m) test PCRE2 version
10342 (?(assert) assertion condition
10344 Note the ambiguity of (?(R) and (?(Rn) which might be named reference
10345 conditions or recursion tests. Such a condition is interpreted as a
10346 reference condition if the relevant named group exists.
10349 BACKTRACKING CONTROL
10351 All backtracking control verbs may be in the form (*VERB:NAME). For
10352 (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
10353 changes its behaviour if :NAME is present. The others just set a name
10354 for passing back to the caller, but this is not a name that (*SKIP) can
10355 see. The following act immediately they are reached:
10357 (*ACCEPT) force successful match
10358 (*FAIL) force backtrack; synonym (*F)
10359 (*MARK:NAME) set name to be passed back; synonym (*:NAME)
10361 The following act only when a subsequent match failure causes a back-
10362 track to reach them. They all force a match failure, but they differ in
10363 what happens afterwards. Those that advance the start-of-match point do
10364 so only if the pattern is not anchored.
10366 (*COMMIT) overall failure, no advance of starting point
10367 (*PRUNE) advance to next starting character
10368 (*SKIP) advance to current matching position
10369 (*SKIP:NAME) advance to position corresponding to an earlier
10370 (*MARK:NAME); if not found, the (*SKIP) is ignored
10371 (*THEN) local failure, backtrack to next alternation
10373 The effect of one of these verbs in a group called as a subroutine is
10374 confined to the subroutine call.
10379 (?C) callout (assumed number 0)
10380 (?Cn) callout with numerical data n
10381 (?C"text") callout with string data
10383 The allowed string delimiters are ` ' " ^ % # $ (which are the same for
10384 the start and the end), and the starting delimiter { matched with the
10385 ending delimiter }. To encode the ending delimiter within the string,
10391 pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
10398 University Computing Service
10399 Cambridge, England.
10404 Last updated: 02 September 2018
10405 Copyright (c) 1997-2018 University of Cambridge.
10406 ------------------------------------------------------------------------------
10409 PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
10414 PCRE - Perl-compatible regular expressions (revised API)
10416 UNICODE AND UTF SUPPORT
10418 When PCRE2 is built with Unicode support (which is the default), it has
10419 knowledge of Unicode character properties and can process text strings
10420 in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
10421 However, by default, PCRE2 assumes that one code unit is one character.
10422 To process a pattern as a UTF string, where a character may require
10423 more than one code unit, you must call pcre2_compile() with the
10424 PCRE2_UTF option flag, or the pattern must start with the sequence
10425 (*UTF). When either of these is the case, both the pattern and any sub-
10426 ject strings that are matched against it are treated as UTF strings
10427 instead of strings of individual one-code-unit characters. There are
10428 also some other changes to the way characters are handled, as docu-
10431 If you do not need Unicode support you can build PCRE2 without it, in
10432 which case the library will be smaller.
10435 UNICODE PROPERTY SUPPORT
10437 When PCRE2 is built with Unicode support, the escape sequences \p{..},
10438 \P{..}, and \X can be used. The Unicode properties that can be tested
10439 are limited to the general category properties such as Lu for an upper
10440 case letter or Nd for a decimal number, the Unicode script names such
10441 as Arabic or Han, and the derived properties Any and L&. Full lists are
10442 given in the pcre2pattern and pcre2syntax documentation. Only the short
10443 names for properties are supported. For example, \p{L} matches a let-
10444 ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
10445 Perl, many properties may optionally be prefixed by "Is", for compati-
10446 bility with Perl 5.6. PCRE2 does not support this.
10449 WIDE CHARACTERS AND UTF MODES
10451 Code points less than 256 can be specified in patterns by either braced
10452 or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
10453 Larger values have to use braced sequences. Unbraced octal code points
10454 up to \777 are also recognized; larger ones can be coded using \o{...}.
10456 The escape sequence \N{U+<hex digits>} is recognized as another way of
10457 specifying a Unicode character by code point in a UTF mode. It is not
10458 allowed in non-UTF modes.
10460 In UTF modes, repeat quantifiers apply to complete UTF characters, not
10461 to individual code units.
10463 In UTF modes, the dot metacharacter matches one UTF character instead
10464 of a single code unit.
10466 The escape sequence \C can be used to match a single code unit in a UTF
10467 mode, but its use can lead to some strange effects because it breaks up
10468 multi-unit characters (see the description of \C in the pcre2pattern
10471 The use of \C is not supported by the alternative matching function
10472 pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
10473 ter may consist of more than one code unit. The use of \C in these
10474 modes provokes a match-time error. Also, the JIT optimization does not
10475 support \C in these modes. If JIT optimization is requested for a UTF-8
10476 or UTF-16 pattern that contains \C, it will not succeed, and so when
10477 pcre2_match() is called, the matching will be carried out by the normal
10478 interpretive function.
10480 The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
10481 characters of any code value, but, by default, the characters that
10482 PCRE2 recognizes as digits, spaces, or word characters remain the same
10483 set as in non-UTF mode, all with code points less than 256. This
10484 remains true even when PCRE2 is built to include Unicode support,
10485 because to do otherwise would slow down matching in many common cases.
10486 Note that this also applies to \b and \B, because they are defined in
10487 terms of \w and \W. If you want to test for a wider sense of, say,
10488 "digit", you can use explicit Unicode property tests such as \p{Nd}.
10489 Alternatively, if you set the PCRE2_UCP option, the way that the char-
10490 acter escapes work is changed so that Unicode properties are used to
10491 determine which characters match. There are more details in the section
10492 on generic character types in the pcre2pattern documentation.
10494 Similarly, characters that match the POSIX named character classes are
10495 all low-valued characters, unless the PCRE2_UCP option is set.
10497 However, the special horizontal and vertical white space matching
10498 escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
10499 acters, whether or not PCRE2_UCP is set.
10502 CASE-EQUIVALENCE IN UTF MODES
10504 Case-insensitive matching in a UTF mode makes use of Unicode properties
10505 except for characters whose code points are less than 128 and that have
10506 at most two case-equivalent values. For these, a direct table lookup is
10507 used for speed. A few Unicode characters such as Greek sigma have more
10508 than two code points that are case-equivalent, and these are treated as
10512 VALIDITY OF UTF STRINGS
10514 When the PCRE2_UTF option is set, the strings passed as patterns and
10515 subjects are (by default) checked for validity on entry to the relevant
10516 functions. If an invalid UTF string is passed, an negative error code
10517 is returned. The code unit offset to the offending character can be
10518 extracted from the match data block by calling pcre2_get_startchar(),
10519 which is used for this purpose after a UTF error.
10521 UTF-16 and UTF-32 strings can indicate their endianness by special code
10522 knows as a byte-order mark (BOM). The PCRE2 functions do not handle
10523 this, expecting strings to be in host byte order.
10525 A UTF string is checked before any other processing takes place. In the
10526 case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
10527 starting offset, the check is applied only to that part of the subject
10528 that could be inspected during matching, and there is a check that the
10529 starting offset points to the first code unit of a character or to the
10530 end of the subject. If there are no lookbehind assertions in the pat-
10531 tern, the check starts at the starting offset. Otherwise, it starts at
10532 the length of the longest lookbehind before the starting offset, or at
10533 the start of the subject if there are not that many characters before
10534 the starting offset. Note that the sequences \b and \B are one-charac-
10537 In addition to checking the format of the string, there is a check to
10538 ensure that all code points lie in the range U+0 to U+10FFFF, excluding
10539 the surrogate area. The so-called "non-character" code points are not
10540 excluded because Unicode corrigendum #9 makes it clear that they should
10543 Characters in the "Surrogate Area" of Unicode are reserved for use by
10544 UTF-16, where they are used in pairs to encode code points with values
10545 greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
10546 are available independently in the UTF-8 and UTF-32 encodings. (In
10547 other words, the whole surrogate thing is a fudge for UTF-16 which
10548 unfortunately messes up UTF-8 and UTF-32.)
10550 In some situations, you may already know that your strings are valid,
10551 and therefore want to skip these checks in order to improve perfor-
10552 mance, for example in the case of a long subject string that is being
10553 scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
10554 pile time or at match time, PCRE2 assumes that the pattern or subject
10555 it is given (respectively) contains only valid UTF code unit sequences.
10557 Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
10558 for the pattern; it does not also apply to subject strings. If you want
10559 to disable the check for a subject string you must pass this option to
10560 pcre2_match() or pcre2_dfa_match().
10562 If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
10563 result is undefined and your program may crash or loop indefinitely.
10565 Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
10566 the error that is given if an escape sequence for an invalid Unicode
10567 code point is encountered in the pattern. If you want to allow escape
10568 sequences such as \x{d800} (a surrogate code point) you can set the
10569 PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
10570 sible only in UTF-8 and UTF-32 modes, because these values are not rep-
10571 resentable in UTF-16.
10573 Errors in UTF-8 strings
10575 The following negative error codes are given for invalid UTF-8 strings:
10577 PCRE2_ERROR_UTF8_ERR1
10578 PCRE2_ERROR_UTF8_ERR2
10579 PCRE2_ERROR_UTF8_ERR3
10580 PCRE2_ERROR_UTF8_ERR4
10581 PCRE2_ERROR_UTF8_ERR5
10583 The string ends with a truncated UTF-8 character; the code specifies
10584 how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
10585 characters to be no longer than 4 bytes, the encoding scheme (origi-
10586 nally defined by RFC 2279) allows for up to 6 bytes, and this is
10587 checked first; hence the possibility of 4 or 5 missing bytes.
10589 PCRE2_ERROR_UTF8_ERR6
10590 PCRE2_ERROR_UTF8_ERR7
10591 PCRE2_ERROR_UTF8_ERR8
10592 PCRE2_ERROR_UTF8_ERR9
10593 PCRE2_ERROR_UTF8_ERR10
10595 The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
10596 the character do not have the binary value 0b10 (that is, either the
10597 most significant bit is 0, or the next bit is 1).
10599 PCRE2_ERROR_UTF8_ERR11
10600 PCRE2_ERROR_UTF8_ERR12
10602 A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
10603 long; these code points are excluded by RFC 3629.
10605 PCRE2_ERROR_UTF8_ERR13
10607 A 4-byte character has a value greater than 0x10fff; these code points
10608 are excluded by RFC 3629.
10610 PCRE2_ERROR_UTF8_ERR14
10612 A 3-byte character has a value in the range 0xd800 to 0xdfff; this
10613 range of code points are reserved by RFC 3629 for use with UTF-16, and
10614 so are excluded from UTF-8.
10616 PCRE2_ERROR_UTF8_ERR15
10617 PCRE2_ERROR_UTF8_ERR16
10618 PCRE2_ERROR_UTF8_ERR17
10619 PCRE2_ERROR_UTF8_ERR18
10620 PCRE2_ERROR_UTF8_ERR19
10622 A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
10623 for a value that can be represented by fewer bytes, which is invalid.
10624 For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
10625 rect coding uses just one byte.
10627 PCRE2_ERROR_UTF8_ERR20
10629 The two most significant bits of the first byte of a character have the
10630 binary value 0b10 (that is, the most significant bit is 1 and the sec-
10631 ond is 0). Such a byte can only validly occur as the second or subse-
10632 quent byte of a multi-byte character.
10634 PCRE2_ERROR_UTF8_ERR21
10636 The first byte of a character has the value 0xfe or 0xff. These values
10637 can never occur in a valid UTF-8 string.
10639 Errors in UTF-16 strings
10641 The following negative error codes are given for invalid UTF-16
10644 PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
10645 PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
10646 PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
10649 Errors in UTF-32 strings
10651 The following negative error codes are given for invalid UTF-32
10654 PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
10655 PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
10661 University Computing Service
10662 Cambridge, England.
10667 Last updated: 02 September 2018
10668 Copyright (c) 1997-2018 University of Cambridge.
10669 ------------------------------------------------------------------------------