--- /dev/null
+-----------------------------------------------------------------------------
+This file contains a concatenation of the PCRE2 man pages, converted to plain
+text format for ease of searching with a text editor, or for use on systems
+that do not have a man page processor. The small individual files that give
+synopses of each function in the library have not been included. Neither has
+the pcre2demo program. There are separate text files for the pcre2grep and
+pcre2test commands.
+-----------------------------------------------------------------------------
+
+
+PCRE2(3) Library Functions Manual PCRE2(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+INTRODUCTION
+
+ PCRE2 is the name used for a revised API for the PCRE library, which is
+ a set of functions, written in C, that implement regular expression
+ pattern matching using the same syntax and semantics as Perl, with just
+ a few differences. After nearly two decades, the limitations of the
+ original API were making development increasingly difficult. The new
+ API is more extensible, and it was simplified by abolishing the sepa-
+ rate "study" optimizing function; in PCRE2, patterns are automatically
+ optimized where possible. Since forking from PCRE1, the code has been
+ extensively refactored and new features introduced.
+
+ As well as Perl-style regular expression patterns, some features that
+ appeared in Python and the original PCRE before they appeared in Perl
+ are available using the Python syntax. There is also some support for
+ one or two .NET and Oniguruma syntax items, and there are options for
+ requesting some minor changes that give better ECMAScript (aka
+ JavaScript) compatibility.
+
+ The source code for PCRE2 can be compiled to support 8-bit, 16-bit, or
+ 32-bit code units, which means that up to three separate libraries may
+ be installed. The original work to extend PCRE to 16-bit and 32-bit
+ code units was done by Zoltan Herczeg and Christian Persch, respec-
+ tively. In all three cases, strings can be interpreted either as one
+ character per code unit, or as UTF-encoded Unicode, with support for
+ Unicode general category properties. Unicode support is optional at
+ build time (but is the default). However, processing strings as UTF
+ code units must be enabled explicitly at run time. The version of Uni-
+ code in use can be discovered by running
+
+ pcre2test -C
+
+ The three libraries contain identical sets of functions, with names
+ ending in _8, _16, or _32, respectively (for example, pcre2_com-
+ pile_8()). However, by defining PCRE2_CODE_UNIT_WIDTH to be 8, 16, or
+ 32, a program that uses just one code unit width can be written using
+ generic names such as pcre2_compile(), and the documentation is written
+ assuming that this is the case.
+
+ In addition to the Perl-compatible matching function, PCRE2 contains an
+ alternative function that matches the same compiled patterns in a dif-
+ ferent way. In certain circumstances, the alternative function has some
+ advantages. For a discussion of the two matching algorithms, see the
+ pcre2matching page.
+
+ Details of exactly which Perl regular expression features are and are
+ not supported by PCRE2 are given in separate documents. See the
+ pcre2pattern and pcre2compat pages. There is a syntax summary in the
+ pcre2syntax page.
+
+ Some features of PCRE2 can be included, excluded, or changed when the
+ library is built. The pcre2_config() function makes it possible for a
+ client to discover which features are available. The features them-
+ selves are described in the pcre2build page. Documentation about build-
+ ing PCRE2 for various operating systems can be found in the README and
+ NON-AUTOTOOLS_BUILD files in the source distribution.
+
+ The libraries contain a number of undocumented internal functions and
+ data tables that are used by more than one of the exported external
+ functions, but which are not intended for use by external callers.
+ Their names all begin with "_pcre2", which hopefully will not provoke
+ any name clashes. In some environments, it is possible to control which
+ external symbols are exported when a shared library is built, and in
+ these cases the undocumented symbols are not exported.
+
+
+SECURITY CONSIDERATIONS
+
+ If you are using PCRE2 in a non-UTF application that permits users to
+ supply arbitrary patterns for compilation, you should be aware of a
+ feature that allows users to turn on UTF support from within a pattern.
+ For example, an 8-bit pattern that begins with "(*UTF)" turns on UTF-8
+ mode, which interprets patterns and subjects as strings of UTF-8 code
+ units instead of individual 8-bit characters. This causes both the pat-
+ tern and any data against which it is matched to be checked for UTF-8
+ validity. If the data string is very long, such a check might consume
+ enough resources to slow your application noticeably.
+
+ One way of guarding against this possibility is to use the pcre2_pat-
+ tern_info() function to check the compiled pattern's options for
+ PCRE2_UTF. Alternatively, you can set the PCRE2_NEVER_UTF option when
+ calling pcre2_compile(). This causes a compile time error if the pat-
+ tern contains a UTF-setting sequence.
+
+ The use of Unicode properties for character types such as \d can also
+ be enabled from within the pattern, by specifying "(*UCP)". This fea-
+ ture can be disallowed by setting the PCRE2_NEVER_UCP option.
+
+ If your application is one that supports UTF, be aware that validity
+ checking can take time. If the same data string is to be matched many
+ times, you can use the PCRE2_NO_UTF_CHECK option for the second and
+ subsequent matches to avoid running redundant checks.
+
+ The use of the \C escape sequence in a UTF-8 or UTF-16 pattern can lead
+ to problems, because it may leave the current matching point in the
+ middle of a multi-code-unit character. The PCRE2_NEVER_BACKSLASH_C
+ option can be used by an application to lock out the use of \C, causing
+ a compile-time error if it is encountered. It is also possible to build
+ PCRE2 with the use of \C permanently disabled.
+
+ Performance can also suffer when a pattern that has a very large search
+ tree is run against a string that will never match.
+ Nested unlimited repeats in a pattern are a common example. PCRE2 pro-
+ vides some protection against this: see the pcre2_set_match_limit()
+ function in the pcre2api page. There is a similar function called
+ pcre2_set_depth_limit() that can be used to restrict the amount of mem-
+ ory that is used.
+
+
+USER DOCUMENTATION
+
+ The user documentation for PCRE2 comprises a number of different sec-
+ tions. In the "man" format, each of these is a separate "man page". In
+ the HTML format, each is a separate page, linked from the index page.
+ In the plain text format, the descriptions of the pcre2grep and
+ pcre2test programs are in files called pcre2grep.txt and pcre2test.txt,
+ respectively. The remaining sections, except for the pcre2demo section
+ (which is a program listing), and the short pages for individual func-
+ tions, are concatenated in pcre2.txt, for ease of searching. The sec-
+ tions are as follows:
+
+ pcre2 this document
+ pcre2-config show PCRE2 installation configuration information
+ pcre2api details of PCRE2's native C API
+ pcre2build building PCRE2
+ pcre2callout details of the callout feature
+ pcre2compat discussion of Perl compatibility
+ pcre2convert details of pattern conversion functions
+ pcre2demo a demonstration C program that uses PCRE2
+ pcre2grep description of the pcre2grep command (8-bit only)
+ pcre2jit discussion of just-in-time optimization support
+ pcre2limits details of size and other limits
+ pcre2matching discussion of the two matching algorithms
+ pcre2partial details of the partial matching facility
+ pcre2pattern syntax and semantics of supported regular
+ expression patterns
+ pcre2perform discussion of performance issues
+ pcre2posix the POSIX-compatible C API for the 8-bit library
+ pcre2sample discussion of the pcre2demo program
+ pcre2serialize details of pattern serialization
+ pcre2syntax quick syntax reference
+ pcre2test description of the pcre2test command
+ pcre2unicode discussion of Unicode and UTF support
+
+ In the "man" and HTML formats, there is also a short page for each C
+ library function, listing its arguments and results.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+ Putting an actual email address here is a spam magnet. If you want to
+ email me, use my two initials, followed by the two digits 10, at the
+ domain cam.ac.uk.
+
+
+REVISION
+
+ Last updated: 11 July 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2API(3) Library Functions Manual PCRE2API(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+ #include <pcre2.h>
+
+ PCRE2 is a new API for PCRE, starting at release 10.0. This document
+ contains a description of all its native functions. See the pcre2 docu-
+ ment for an overview of all the PCRE2 documentation.
+
+
+PCRE2 NATIVE API BASIC FUNCTIONS
+
+ pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
+ uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
+ pcre2_compile_context *ccontext);
+
+ void pcre2_code_free(pcre2_code *code);
+
+ pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
+ pcre2_general_context *gcontext);
+
+ pcre2_match_data *pcre2_match_data_create_from_pattern(
+ const pcre2_code *code, pcre2_general_context *gcontext);
+
+ int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext);
+
+ int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext,
+ int *workspace, PCRE2_SIZE wscount);
+
+ void pcre2_match_data_free(pcre2_match_data *match_data);
+
+
+PCRE2 NATIVE API AUXILIARY MATCH FUNCTIONS
+
+ PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
+
+ uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
+
+ PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
+
+ PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
+
+
+PCRE2 NATIVE API GENERAL CONTEXT FUNCTIONS
+
+ pcre2_general_context *pcre2_general_context_create(
+ void *(*private_malloc)(PCRE2_SIZE, void *),
+ void (*private_free)(void *, void *), void *memory_data);
+
+ pcre2_general_context *pcre2_general_context_copy(
+ pcre2_general_context *gcontext);
+
+ void pcre2_general_context_free(pcre2_general_context *gcontext);
+
+
+PCRE2 NATIVE API COMPILE CONTEXT FUNCTIONS
+
+ pcre2_compile_context *pcre2_compile_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_compile_context *pcre2_compile_context_copy(
+ pcre2_compile_context *ccontext);
+
+ void pcre2_compile_context_free(pcre2_compile_context *ccontext);
+
+ int pcre2_set_bsr(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ int pcre2_set_character_tables(pcre2_compile_context *ccontext,
+ const unsigned char *tables);
+
+ int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
+ uint32_t extra_options);
+
+ int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
+ PCRE2_SIZE value);
+
+ int pcre2_set_newline(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
+ int (*guard_function)(uint32_t, void *), void *user_data);
+
+
+PCRE2 NATIVE API MATCH CONTEXT FUNCTIONS
+
+ pcre2_match_context *pcre2_match_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_match_context *pcre2_match_context_copy(
+ pcre2_match_context *mcontext);
+
+ void pcre2_match_context_free(pcre2_match_context *mcontext);
+
+ int pcre2_set_callout(pcre2_match_context *mcontext,
+ int (*callout_function)(pcre2_callout_block *, void *),
+ void *callout_data);
+
+ int pcre2_set_offset_limit(pcre2_match_context *mcontext,
+ PCRE2_SIZE value);
+
+ int pcre2_set_heap_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ int pcre2_set_match_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ int pcre2_set_depth_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+
+PCRE2 NATIVE API STRING EXTRACTION FUNCTIONS
+
+ int pcre2_substring_copy_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
+
+ int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_UCHAR *buffer,
+ PCRE2_SIZE *bufflen);
+
+ void pcre2_substring_free(PCRE2_UCHAR *buffer);
+
+ int pcre2_substring_get_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
+
+ int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_UCHAR **bufferptr,
+ PCRE2_SIZE *bufflen);
+
+ int pcre2_substring_length_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_SIZE *length);
+
+ int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_SIZE *length);
+
+ int pcre2_substring_nametable_scan(const pcre2_code *code,
+ PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
+
+ int pcre2_substring_number_from_name(const pcre2_code *code,
+ PCRE2_SPTR name);
+
+ void pcre2_substring_list_free(PCRE2_SPTR *list);
+
+ int pcre2_substring_list_get(pcre2_match_data *match_data,
+ PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
+
+
+PCRE2 NATIVE API STRING SUBSTITUTION FUNCTION
+
+ int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext, PCRE2_SPTR replacement,
+ PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
+ PCRE2_SIZE *outlengthptr);
+
+
+PCRE2 NATIVE API JIT FUNCTIONS
+
+ int pcre2_jit_compile(pcre2_code *code, uint32_t options);
+
+ int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext);
+
+ void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
+
+ pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
+ PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
+
+ void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
+ pcre2_jit_callback callback_function, void *callback_data);
+
+ void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
+
+
+PCRE2 NATIVE API SERIALIZATION FUNCTIONS
+
+ int32_t pcre2_serialize_decode(pcre2_code **codes,
+ int32_t number_of_codes, const uint8_t *bytes,
+ pcre2_general_context *gcontext);
+
+ int32_t pcre2_serialize_encode(const pcre2_code **codes,
+ int32_t number_of_codes, uint8_t **serialized_bytes,
+ PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
+
+ void pcre2_serialize_free(uint8_t *bytes);
+
+ int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
+
+
+PCRE2 NATIVE API AUXILIARY FUNCTIONS
+
+ pcre2_code *pcre2_code_copy(const pcre2_code *code);
+
+ pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
+
+ int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
+ PCRE2_SIZE bufflen);
+
+ const unsigned char *pcre2_maketables(pcre2_general_context *gcontext);
+
+ int pcre2_pattern_info(const pcre2_code *code, uint32_t what, void *where);
+
+ int pcre2_callout_enumerate(const pcre2_code *code,
+ int (*callback)(pcre2_callout_enumerate_block *, void *),
+ void *user_data);
+
+ int pcre2_config(uint32_t what, void *where);
+
+
+PCRE2 NATIVE API OBSOLETE FUNCTIONS
+
+ int pcre2_set_recursion_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ int pcre2_set_recursion_memory_management(
+ pcre2_match_context *mcontext,
+ void *(*private_malloc)(PCRE2_SIZE, void *),
+ void (*private_free)(void *, void *), void *memory_data);
+
+ These functions became obsolete at release 10.30 and are retained only
+ for backward compatibility. They should not be used in new code. The
+ first is replaced by pcre2_set_depth_limit(); the second is no longer
+ needed and has no effect (it always returns zero).
+
+
+PCRE2 EXPERIMENTAL PATTERN CONVERSION FUNCTIONS
+
+ pcre2_convert_context *pcre2_convert_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_convert_context *pcre2_convert_context_copy(
+ pcre2_convert_context *cvcontext);
+
+ void pcre2_convert_context_free(pcre2_convert_context *cvcontext);
+
+ int pcre2_set_glob_escape(pcre2_convert_context *cvcontext,
+ uint32_t escape_char);
+
+ int pcre2_set_glob_separator(pcre2_convert_context *cvcontext,
+ uint32_t separator_char);
+
+ int pcre2_pattern_convert(PCRE2_SPTR pattern, PCRE2_SIZE length,
+ uint32_t options, PCRE2_UCHAR **buffer,
+ PCRE2_SIZE *blength, pcre2_convert_context *cvcontext);
+
+ void pcre2_converted_pattern_free(PCRE2_UCHAR *converted_pattern);
+
+ These functions provide a way of converting non-PCRE2 patterns into
+ patterns that can be processed by pcre2_compile(). This facility is
+ experimental and may be changed in future releases. At present, "globs"
+ and POSIX basic and extended patterns can be converted. Details are
+ given in the pcre2convert documentation.
+
+
+PCRE2 8-BIT, 16-BIT, AND 32-BIT LIBRARIES
+
+ There are three PCRE2 libraries, supporting 8-bit, 16-bit, and 32-bit
+ code units, respectively. However, there is just one header file,
+ pcre2.h. This contains the function prototypes and other definitions
+ for all three libraries. One, two, or all three can be installed simul-
+ taneously. On Unix-like systems the libraries are called libpcre2-8,
+ libpcre2-16, and libpcre2-32, and they can also co-exist with the orig-
+ inal PCRE libraries.
+
+ Character strings are passed to and from a PCRE2 library as a sequence
+ of unsigned integers in code units of the appropriate width. Every
+ PCRE2 function comes in three different forms, one for each library,
+ for example:
+
+ pcre2_compile_8()
+ pcre2_compile_16()
+ pcre2_compile_32()
+
+ There are also three different sets of data types:
+
+ PCRE2_UCHAR8, PCRE2_UCHAR16, PCRE2_UCHAR32
+ PCRE2_SPTR8, PCRE2_SPTR16, PCRE2_SPTR32
+
+ The UCHAR types define unsigned code units of the appropriate widths.
+ For example, PCRE2_UCHAR16 is usually defined as `uint16_t'. The SPTR
+ types are constant pointers to the equivalent UCHAR types, that is,
+ they are pointers to vectors of unsigned code units.
+
+ Many applications use only one code unit width. For their convenience,
+ macros are defined whose names are the generic forms such as pcre2_com-
+ pile() and PCRE2_SPTR. These macros use the value of the macro
+ PCRE2_CODE_UNIT_WIDTH to generate the appropriate width-specific func-
+ tion and macro names. PCRE2_CODE_UNIT_WIDTH is not defined by default.
+ An application must define it to be 8, 16, or 32 before including
+ pcre2.h in order to make use of the generic names.
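The way a width setting can turn a generic name into a width-specific one can be sketched with ordinary preprocessor token pasting. This is a minimal illustration only: every DEMO_ and demo_ name below is hypothetical, and the macros in the real pcre2.h may be written differently. An application that uses the library simply defines PCRE2_CODE_UNIT_WIDTH before including pcre2.h.

```c
#include <assert.h>

/* Hypothetical stand-ins for two width-specific library functions. */
int demo_compile_8(void)  { return 8; }
int demo_compile_16(void) { return 16; }

/* Plays the role of PCRE2_CODE_UNIT_WIDTH in this sketch. */
#define DEMO_CODE_UNIT_WIDTH 16

/* Two macro levels so that DEMO_CODE_UNIT_WIDTH is expanded to 16
   before the ## operator pastes the tokens together. */
#define DEMO_PASTE(a, b) a##b
#define DEMO_JOIN(a, b)  DEMO_PASTE(a, b)

/* The generic name: expands to demo_compile_16 with the setting above. */
#define demo_compile DEMO_JOIN(demo_compile_, DEMO_CODE_UNIT_WIDTH)

/* Calls whichever width-specific function the generic name selects. */
int demo_selected_width(void)
{
  int width = demo_compile();
  assert(width == 8 || width == 16);
  return width;
}
```

Changing DEMO_CODE_UNIT_WIDTH to 8 would make the same generic call resolve to demo_compile_8() without touching the calling code.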
+
+ Applications that use more than one code unit width can be linked with
+ more than one PCRE2 library, but must define PCRE2_CODE_UNIT_WIDTH to
+ be 0 before including pcre2.h, and then use the real function names.
+ Any code that is to be included in an environment where the value of
+ PCRE2_CODE_UNIT_WIDTH is unknown should also use the real function
+ names. (Unfortunately, it is not possible in C code to save and restore
+ the value of a macro.)
+
+ If PCRE2_CODE_UNIT_WIDTH is not defined before including pcre2.h, a
+ compiler error occurs.
+
+ When using multiple libraries in an application, you must take care
+ when processing any particular pattern to use only functions from a
+ single library. For example, if you want to run a match using a pat-
+ tern that was compiled with pcre2_compile_16(), you must do so with
+ pcre2_match_16(), not pcre2_match_8() or pcre2_match_32().
+
+ In the function summaries above, and in the rest of this document and
+ other PCRE2 documents, functions and data types are described using
+ their generic names, without the _8, _16, or _32 suffix.
+
+
+PCRE2 API OVERVIEW
+
+ PCRE2 has its own native API, which is described in this document.
+ There are also some wrapper functions for the 8-bit library that corre-
+ spond to the POSIX regular expression API, but they do not give access
+ to all the functionality of PCRE2. They are described in the pcre2posix
+ documentation. Both these APIs define a set of C function calls.
+
+ The native API C data types, function prototypes, option values, and
+ error codes are defined in the header file pcre2.h, which also contains
+ definitions of PCRE2_MAJOR and PCRE2_MINOR, the major and minor release
+ numbers for the library. Applications can use these to include support
+ for different releases of PCRE2.
+
+ In a Windows environment, if you want to statically link an application
+ program against a non-dll PCRE2 library, you must define PCRE2_STATIC
+ before including pcre2.h.
+
+ The functions pcre2_compile() and pcre2_match() are used for compiling
+ and matching regular expressions in a Perl-compatible manner. A sample
+ program that demonstrates the simplest way of using them is provided in
+ the file called pcre2demo.c in the PCRE2 source distribution. A listing
+ of this program is given in the pcre2demo documentation, and the
+ pcre2sample documentation describes how to compile and run it.
+
+ The compiling and matching functions recognize various options that are
+ passed as bits in an options argument. There are also some more compli-
+ cated parameters such as custom memory management functions and
+ resource limits that are passed in "contexts" (which are just memory
+ blocks, described below). Simple applications do not need to make use
+ of contexts.
+
+ Just-in-time (JIT) compiler support is an optional feature of PCRE2
+ that can be built in appropriate hardware environments. It greatly
+ speeds up the matching performance of many patterns. Programs can
+ request that it be used if available by calling pcre2_jit_compile()
+ after a pattern has been successfully compiled by pcre2_compile(). This
+ does nothing if JIT support is not available.
+
+ More complicated programs might need to make use of the specialist
+ functions pcre2_jit_stack_create(), pcre2_jit_stack_free(), and
+ pcre2_jit_stack_assign() in order to control the JIT code's memory
+ usage.
+
+ JIT matching is automatically used by pcre2_match() if it is available,
+ unless the PCRE2_NO_JIT option is set. There is also a direct interface
+ for JIT matching, which gives improved performance at the expense of
+ less sanity checking. The JIT-specific functions are discussed in the
+ pcre2jit documentation.
+
+ A second matching function, pcre2_dfa_match(), which is not Perl-com-
+ patible, is also provided. This uses a different algorithm for the
+ matching. The alternative algorithm finds all possible matches (at a
+ given point in the subject), and scans the subject just once (unless
+ there are lookaround assertions). However, this algorithm does not
+ return captured substrings. A description of the two matching algo-
+ rithms and their advantages and disadvantages is given in the
+ pcre2matching documentation. There is no JIT support for
+ pcre2_dfa_match().
+
+ In addition to the main compiling and matching functions, there are
+ convenience functions for extracting captured substrings from a subject
+ string that has been matched by pcre2_match(). They are:
+
+ pcre2_substring_copy_byname()
+ pcre2_substring_copy_bynumber()
+ pcre2_substring_get_byname()
+ pcre2_substring_get_bynumber()
+ pcre2_substring_list_get()
+ pcre2_substring_length_byname()
+ pcre2_substring_length_bynumber()
+ pcre2_substring_nametable_scan()
+ pcre2_substring_number_from_name()
+
+ pcre2_substring_free() and pcre2_substring_list_free() are also pro-
+ vided, to free memory used for extracted strings. If either of these
+ functions is called with a NULL argument, the function returns immedi-
+ ately without doing anything.
+
+ The function pcre2_substitute() can be called to match a pattern and
+ return a copy of the subject string with substitutions for parts that
+ were matched.
+
+ Functions whose names begin with pcre2_serialize_ are used for saving
+ compiled patterns on disc or elsewhere, and reloading them later.
+
+ Finally, there are functions for finding out information about a com-
+ piled pattern (pcre2_pattern_info()) and about the configuration with
+ which PCRE2 was built (pcre2_config()).
+
+ Functions with names ending with _free() are used for freeing memory
+ blocks of various sorts. In all cases, if one of these functions is
+ called with a NULL argument, it does nothing.
+
+
+STRING LENGTHS AND OFFSETS
+
+ The PCRE2 API uses string lengths and offsets into strings of code
+ units in several places. These values are always of type PCRE2_SIZE,
+ which is an unsigned integer type, currently always defined as size_t.
+ The largest value that can be stored in such a type (that is,
+ ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-terminated
+ strings and unset offsets. Therefore, the longest string that can be
+ handled is one less than this maximum.
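The reserved maximum value can be pictured as a sentinel that a function checks before trusting a length argument. The sketch below uses plain size_t and a hypothetical DEMO_ZERO_TERMINATED macro and helper function; in pcre2.h the corresponding constants are PCRE2_ZERO_TERMINATED and PCRE2_UNSET. It is an illustration of the idea, not the library's code.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sentinel with the value the text describes:
   all bits set, the largest value a size_t can hold. */
#define DEMO_ZERO_TERMINATED (~(size_t)0)

/* Turn a caller-supplied length into an actual length. */
size_t demo_resolve_length(const char *subject, size_t length)
{
  assert(subject != NULL);
  if (length == DEMO_ZERO_TERMINATED)
    return strlen(subject);   /* sentinel: measure up to the terminating zero */
  return length;              /* explicit length: use it unchanged */
}
```

Because the sentinel occupies the largest representable value, no real string can have that length, which is why the longest usable string is one code unit shorter.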
+
+
+NEWLINES
+
+ PCRE2 supports five different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
+ feed) character, the two-character sequence CRLF, any of the three pre-
+ ceding, or any Unicode newline sequence. The Unicode newline sequences
+ are the three just mentioned, plus the single characters VT (vertical
+ tab, U+000B), FF (form feed, U+000C), NEL (next line, U+0085), LS (line
+ separator, U+2028), and PS (paragraph separator, U+2029).
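The first four conventions can be illustrated with a small byte-oriented helper that reports how long the newline at a given position is. All names here are hypothetical and this is not how the library is implemented; the "any Unicode newline" convention is left out because it needs UTF decoding.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical labels for four of the five conventions. */
typedef enum { DEMO_NL_CR, DEMO_NL_LF, DEMO_NL_CRLF, DEMO_NL_ANYCRLF } demo_newline;

/* Return the length in bytes of the newline that starts at s[pos],
   or 0 if no newline (under the chosen convention) starts there. */
size_t demo_newline_length(const char *s, size_t len, size_t pos, demo_newline nl)
{
  assert(s != NULL && pos <= len);
  if (pos == len) return 0;

  int cr = (s[pos] == '\r');
  int lf = (s[pos] == '\n');
  int crlf = cr && pos + 1 < len && s[pos + 1] == '\n';

  switch (nl)
    {
    case DEMO_NL_CR:      return cr ? 1 : 0;
    case DEMO_NL_LF:      return lf ? 1 : 0;
    case DEMO_NL_CRLF:    return crlf ? 2 : 0;
    case DEMO_NL_ANYCRLF: return crlf ? 2 : ((cr || lf) ? 1 : 0);
    }
  return 0;
}
```

Note how the same bytes "\r\n" count as one two-byte newline under CRLF or ANYCRLF, but as a one-byte newline followed by an ordinary character under the CR convention.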
+
+ Each of the first three conventions is used by at least one operating
+ system as its standard newline sequence. When PCRE2 is built, a default
+ can be specified. If it is not, the default is set to LF, which is the
+ Unix standard. However, the newline convention can be changed by an
+ application when calling pcre2_compile(), or it can be specified by
+ special text at the start of the pattern itself; this overrides any
+ other settings. See the pcre2pattern page for details of the special
+ character sequences.
+
+ In the PCRE2 documentation the word "newline" is used to mean "the
+ character or pair of characters that indicate a line break". The choice
+ of newline convention affects the handling of the dot, circumflex, and
+ dollar metacharacters, the handling of #-comments in /x mode, and, when
+ CRLF is a recognized line ending sequence, the match position advance-
+ ment for a non-anchored pattern. There is more detail about this in the
+ section on pcre2_match() options below.
+
+ The choice of newline convention does not affect the interpretation of
+ the \n or \r escape sequences, nor does it affect what \R matches; this
+ has its own separate convention.
+
+
+MULTITHREADING
+
+ In a multithreaded application it is important to keep thread-specific
+ data separate from data that can be shared between threads. The PCRE2
+ library code itself is thread-safe: it contains no static or global
+ variables. The API is designed to be fairly simple for non-threaded
+ applications while at the same time ensuring that multithreaded appli-
+ cations can use it.
+
+ There are several different blocks of data that are used to pass infor-
+ mation between the application and the PCRE2 libraries.
+
+ The compiled pattern
+
+ A pointer to the compiled form of a pattern is returned to the user
+ when pcre2_compile() is successful. The data in the compiled pattern is
+ fixed, and does not change when the pattern is matched. Therefore, it
+ is thread-safe, that is, the same compiled pattern can be used by more
+ than one thread simultaneously. For example, an application can compile
+ all its patterns at the start, before forking off multiple threads that
+ use them. However, if the just-in-time (JIT) optimization feature is
+ being used, it needs separate memory stack areas for each thread. See
+ the pcre2jit documentation for more details.
+
+ In a more complicated situation, where patterns are compiled only when
+ they are first needed, but are still shared between threads, pointers
+ to compiled patterns must be protected from simultaneous writing by
+ multiple threads, at least until a pattern has been compiled. The logic
+ can be something like this:
+
+ Get a read-only (shared) lock (mutex) for pointer
+ if (pointer == NULL)
+ {
+ Release the read lock and get a write (unique) lock for pointer
+ if (pointer == NULL) /* recheck: another thread may have compiled it */
+ pointer = pcre2_compile(...
+ }
+ Release the lock
+ Use pointer in pcre2_match()
+
+ Of course, testing for compilation errors should also be included in
+ the code.
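As an alternative to locks, the same "compile on first use, then share" goal can be sketched with a C11 atomic compare-and-swap: each thread that finds the pointer unset compiles privately and tries to publish its result, and a thread that loses the race discards its own copy. This is a different technique from the locking scheme described above, and demo_code and demo_compile() are hypothetical stand-ins for pcre2_code and pcre2_compile().

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical stand-in for a compiled pattern. */
typedef struct { int id; } demo_code;

/* Hypothetical stand-in for the compile step; returns NULL on failure. */
static demo_code *demo_compile(void)
{
  demo_code *c = malloc(sizeof *c);
  if (c != NULL) c->id = 42;
  return c;
}

/* The shared pointer, initially unset. */
static _Atomic(demo_code *) shared_pattern = NULL;

/* Return the shared pattern, compiling it on first use. */
demo_code *demo_get_pattern(void)
{
  demo_code *p = atomic_load(&shared_pattern);
  if (p != NULL) return p;            /* already published */

  demo_code *fresh = demo_compile();
  if (fresh == NULL) return NULL;     /* compilation error */

  demo_code *expected = NULL;
  if (atomic_compare_exchange_strong(&shared_pattern, &expected, fresh))
    return fresh;                     /* this thread published its copy */

  free(fresh);                        /* another thread won the race */
  assert(expected != NULL);
  return expected;                    /* use the winner's pattern */
}
```

The cost of this design is that two threads may occasionally compile the same pattern at once; the benefit is that readers never block once the pointer is set.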
+
+ If JIT is being used, but the JIT compilation is not being done
+ immediately (perhaps waiting to see if the pattern is used often enough),
+ similar logic is required. JIT compilation updates a pointer within the
+ compiled code block, so a thread must gain unique write access to the
+ pointer before calling pcre2_jit_compile(). Alternatively,
+ pcre2_code_copy() or pcre2_code_copy_with_tables() can be used to
+ obtain a private copy of the compiled code before calling the JIT com-
+ piler.
+
+ Context blocks
+
+ The next main section below introduces the idea of "contexts" in which
+ PCRE2 functions are called. A context is nothing more than a collection
+ of parameters that control the way PCRE2 operates. Grouping a number of
+ parameters together in a context is a convenient way of passing them to
+ a PCRE2 function without using lots of arguments. The parameters that
+ are stored in contexts are in some sense "advanced features" of the
+ API. Many straightforward applications will not need to use contexts.
+
+ In a multithreaded application, if the parameters in a context are val-
+ ues that are never changed, the same context can be used by all the
+ threads. However, if any thread needs to change any value in a context,
+ it must make its own thread-specific copy.
+
+ Match blocks
+
+ The matching functions need a block of memory for storing the results
+ of a match. This includes details of what was matched, as well as addi-
+ tional information such as the name of a (*MARK) setting. Each thread
+ must provide its own copy of this memory.
+
+
+PCRE2 CONTEXTS
+
+ Some PCRE2 functions have a lot of parameters, many of which are used
+ only by specialist applications, for example, those that use custom
+ memory management or non-standard character tables. To keep function
+ argument lists at a reasonable size, and at the same time to keep the
+ API extensible, "uncommon" parameters are passed to certain functions
+ in a context instead of directly. A context is just a block of memory
+ that holds the parameter values. Applications that do not need to
+ adjust any of the context parameters can pass NULL when a context
+ pointer is required.
+
+ There are three different types of context: a general context that is
+ relevant for several PCRE2 operations, a compile-time context, and a
+ match-time context.
+
+ The general context
+
+ At present, this context just contains pointers to (and data for)
+ external memory management functions that are called from several
+ places in the PCRE2 library. The context is named `general' rather than
+ specifically `memory' because in future other fields may be added. If
+ you do not want to supply your own custom memory management functions,
+ you do not need to bother with a general context. A general context is
+ created by:
+
+ pcre2_general_context *pcre2_general_context_create(
+ void *(*private_malloc)(PCRE2_SIZE, void *),
+ void (*private_free)(void *, void *), void *memory_data);
+
+ The two function pointers specify custom memory management functions,
+ whose prototypes are:
+
+ void *private_malloc(PCRE2_SIZE, void *);
+ void private_free(void *, void *);
+
+ Whenever code in PCRE2 calls these functions, the final argument is the
+ value of memory_data. Either of the first two arguments of the creation
+ function may be NULL, in which case the system memory management func-
+ tions malloc() and free() are used. (This is not currently useful, as
+ there are no other fields in a general context, but in future there
+ might be.) The private_malloc() function is used (if supplied) to
+ obtain memory for storing the context, and all three values are saved
+ as part of the context.
+
+ Whenever PCRE2 creates a data block of any kind, the block contains a
+ pointer to the free() function that matches the malloc() function that
+ was used. When the time comes to free the block, this function is
+ called.
+
+ A general context can be copied by calling:
+
+ pcre2_general_context *pcre2_general_context_copy(
+ pcre2_general_context *gcontext);
+
+ The memory used for a general context should be freed by calling:
+
+ void pcre2_general_context_free(pcre2_general_context *gcontext);
+
+ If this function is passed a NULL argument, it returns immediately
+ without doing anything.
+
+ The compile context
+
+ A compile context is required if you want to provide an external func-
+ tion for stack checking during compilation or to change the default
+ values of any of the following compile-time parameters:
+
+ What \R matches (Unicode newlines or CR, LF, CRLF only)
+ PCRE2's character tables
+ The newline character sequence
+ The compile time nested parentheses limit
+ The maximum length of the pattern string
+ The extra options bits (none set by default)
+
+ A compile context is also required if you are using custom memory man-
+ agement. If none of these apply, just pass NULL as the context argu-
+ ment of pcre2_compile().
+
+ A compile context is created, copied, and freed by the following func-
+ tions:
+
+ pcre2_compile_context *pcre2_compile_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_compile_context *pcre2_compile_context_copy(
+ pcre2_compile_context *ccontext);
+
+ void pcre2_compile_context_free(pcre2_compile_context *ccontext);
+
+ A compile context is created with default values for its parameters.
+ These can be changed by calling the following functions, which return 0
+ on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
+
+ int pcre2_set_bsr(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ The value must be PCRE2_BSR_ANYCRLF, to specify that \R matches only
+ CR, LF, or CRLF, or PCRE2_BSR_UNICODE, to specify that \R matches any
+ Unicode line ending sequence. The value is used by the JIT compiler and
+ by the two interpreted matching functions, pcre2_match() and
+ pcre2_dfa_match().
+
+ int pcre2_set_character_tables(pcre2_compile_context *ccontext,
+ const unsigned char *tables);
+
+ The value must be the result of a call to pcre2_maketables(), whose
+ only argument is a general context. This function builds a set of char-
+ acter tables in the current locale.
+
+ int pcre2_set_compile_extra_options(pcre2_compile_context *ccontext,
+ uint32_t extra_options);
+
+ As PCRE2 has developed, almost all the 32 option bits that are avail-
+ able in the options argument of pcre2_compile() have been used up. To
+ avoid running out, the compile context contains a set of extra option
+ bits which are used for some newer, assumed rarer, options. This func-
+ tion sets those bits. It always sets all the bits (either on or off):
+ the supplied value replaces the existing setting as a whole rather
+ than being merged into it. The available options are defined in the
+ section entitled "Extra compile options" below.
+
+ int pcre2_set_max_pattern_length(pcre2_compile_context *ccontext,
+ PCRE2_SIZE value);
+
+ This sets a maximum length, in code units, for any pattern string that
+ is compiled with this context. If the pattern is longer, an error is
+ generated. This facility is provided so that applications that accept
+ patterns from external sources can limit their size. The default is the
+ largest number that a PCRE2_SIZE variable can hold, which is effec-
+ tively unlimited.
+
+ int pcre2_set_newline(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ This specifies which characters or character sequences are to be recog-
+ nized as newlines. The value must be one of PCRE2_NEWLINE_CR (carriage
+ return only), PCRE2_NEWLINE_LF (linefeed only), PCRE2_NEWLINE_CRLF (the
+ two-character sequence CR followed by LF), PCRE2_NEWLINE_ANYCRLF (any
+ of the above), PCRE2_NEWLINE_ANY (any Unicode newline sequence), or
+ PCRE2_NEWLINE_NUL (the NUL character, that is a binary zero).
+
+ A pattern can override the value set in the compile context by starting
+ with a sequence such as (*CRLF). See the pcre2pattern page for details.
+
+ When a pattern is compiled with the PCRE2_EXTENDED or
+ PCRE2_EXTENDED_MORE option, the newline convention affects the recogni-
+ tion of the end of internal comments starting with #. The value is
+ saved with the compiled pattern for subsequent use by the JIT compiler
+ and by the two interpreted matching functions, pcre2_match() and
+ pcre2_dfa_match().
+
+ int pcre2_set_parens_nest_limit(pcre2_compile_context *ccontext,
+ uint32_t value);
+
+ This parameter adjusts the limit, set when PCRE2 is built (default 250),
+ on the depth of parenthesis nesting in a pattern. This limit stops
+ rogue patterns using up too much system stack when being compiled. The
+ limit applies to parentheses of all kinds, not just capturing parenthe-
+ ses.
+
+ int pcre2_set_compile_recursion_guard(pcre2_compile_context *ccontext,
+ int (*guard_function)(uint32_t, void *), void *user_data);
+
+ There is at least one application that runs PCRE2 in threads with very
+ limited system stack, where running out of stack is to be avoided at
+ all costs. The parenthesis limit above cannot take account of how much
+ stack is actually available during compilation. For a finer control,
+ you can supply a function that is called whenever pcre2_compile()
+ starts to compile a parenthesized part of a pattern. This function can
+ check the actual stack size (or anything else that it wants to, of
+ course).
+
+ The first argument to the guard function gives the current depth of
+ nesting, and the second is user data that is set up by the last argu-
+ ment of pcre2_set_compile_recursion_guard(). The guard function
+ should return zero if all is well, or non-zero to force an error.
+
+ The match context
+
+ A match context is required if you want to:
+
+ Set up a callout function
+ Set an offset limit for matching an unanchored pattern
+ Change the limit on the amount of heap used when matching
+ Change the backtracking match limit
+ Change the backtracking depth limit
+ Set custom memory management specifically for the match
+
+ If none of these apply, just pass NULL as the context argument of
+ pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match().
+
+ A match context is created, copied, and freed by the following func-
+ tions:
+
+ pcre2_match_context *pcre2_match_context_create(
+ pcre2_general_context *gcontext);
+
+ pcre2_match_context *pcre2_match_context_copy(
+ pcre2_match_context *mcontext);
+
+ void pcre2_match_context_free(pcre2_match_context *mcontext);
+
+ A match context is created with default values for its parameters.
+ These can be changed by calling the following functions, which return 0
+ on success, or PCRE2_ERROR_BADDATA if invalid data is detected.
+
+ int pcre2_set_callout(pcre2_match_context *mcontext,
+ int (*callout_function)(pcre2_callout_block *, void *),
+ void *callout_data);
+
+ This sets up a "callout" function for PCRE2 to call at specified points
+ during a matching operation. Details are given in the pcre2callout doc-
+ umentation.
+
+ int pcre2_set_offset_limit(pcre2_match_context *mcontext,
+ PCRE2_SIZE value);
+
+ The offset_limit parameter limits how far an unanchored search can
+ advance in the subject string. The default value is PCRE2_UNSET. The
+ pcre2_match() and pcre2_dfa_match() functions return
+ PCRE2_ERROR_NOMATCH if a match with a starting point before or at the
+ given offset is not found. The pcre2_substitute() function makes no
+ more substitutions.
+
+ For example, if the pattern /abc/ is matched against "123abc" with an
+ offset limit less than 3, the result is PCRE2_ERROR_NOMATCH. A match
+ can never be found if the startoffset argument of pcre2_match(),
+ pcre2_dfa_match(), or pcre2_substitute() is greater than the offset
+ limit set in the match context.
+
+ When using this facility, you must set the PCRE2_USE_OFFSET_LIMIT
+ option when calling pcre2_compile() so that when JIT is in use, differ-
+ ent code can be compiled. If a match is started with a non-default
+ offset limit when PCRE2_USE_OFFSET_LIMIT is not set, an error is gener-
+ ated.
+
+ The offset limit facility can be used to track progress when searching
+ large subject strings or to limit the extent of global substitutions.
+ See also the PCRE2_FIRSTLINE option, which requires a match to start
+ before or at the first newline that follows the start of matching in
+ the subject. If this is set with an offset limit, a match must occur in
+ the first line and also within the offset limit. In other words, which-
+ ever limit comes first is used.
+
+ int pcre2_set_heap_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ The heap_limit parameter specifies, in units of kibibytes (1024 bytes),
+ the maximum amount of heap memory that pcre2_match() may use to hold
+ backtracking information when running an interpretive match. This limit
+ also applies to pcre2_dfa_match(), which may use the heap when process-
+ ing patterns with a lot of nested pattern recursion or lookarounds or
+ atomic groups. This limit does not apply to matching with the JIT opti-
+ mization, which has its own memory control arrangements (see the
+ pcre2jit documentation for more details). If the limit is reached, the
+ negative error code PCRE2_ERROR_HEAPLIMIT is returned. The default
+ limit can be set when PCRE2 is built; if it is not, the default is set
+ very large and is essentially "unlimited".
+
+ A value for the heap limit may also be supplied by an item at the start
+ of a pattern of the form
+
+ (*LIMIT_HEAP=ddd)
+
+ where ddd is a decimal number. However, such a setting is ignored
+ unless ddd is less than the limit set by the caller of pcre2_match()
+ or, if no such limit is set, less than the default.
+
+ The pcre2_match() function starts out using a 20KiB vector on the sys-
+ tem stack for recording backtracking points. The more nested backtrack-
+ ing points there are (that is, the deeper the search tree), the more
+ memory is needed. Heap memory is used only if the initial vector is
+ too small. If the heap limit is set to a value less than 21 (in partic-
+ ular, zero) no heap memory will be used. In this case, only patterns
+ that do not have a lot of nested backtracking can be successfully pro-
+ cessed.
+
+ Similarly, for pcre2_dfa_match(), a vector on the system stack is used
+ when processing pattern recursions, lookarounds, or atomic groups, and
+ only if this is not big enough is heap memory used. In this case, too,
+ setting a value of zero disables the use of the heap.
+
+ int pcre2_set_match_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ The match_limit parameter provides a means of preventing PCRE2 from
+ using up too many computing resources when processing patterns that are
+ not going to match, but which have a very large number of possibilities
+ in their search trees. The classic example is a pattern that uses
+ nested unlimited repeats.
+
+ There is an internal counter in pcre2_match() that is incremented each
+ time round its main matching loop. If this value reaches the match
+ limit, pcre2_match() returns the negative value PCRE2_ERROR_MATCHLIMIT.
+ This has the effect of limiting the amount of backtracking that can
+ take place. For patterns that are not anchored, the count restarts from
+ zero for each position in the subject string. This limit also applies
+ to pcre2_dfa_match(), though the counting is done in a different way.
+
+ When pcre2_match() is called with a pattern that was successfully pro-
+ cessed by pcre2_jit_compile(), the way in which matching is executed is
+ entirely different. However, there is still the possibility of runaway
+ matching that goes on for a very long time, and so the match_limit
+ value is also used in this case (but in a different way) to limit how
+ long the matching can continue.
+
+ The default value for the limit can be set when PCRE2 is built; the
+ built-in default is 10 million, which handles all but the most extreme
+ cases. A value for the match limit may also be supplied by an item at
+ the start of a pattern of the form
+
+ (*LIMIT_MATCH=ddd)
+
+ where ddd is a decimal number. However, such a setting is ignored
+ unless ddd is less than the limit set by the caller of pcre2_match() or
+ pcre2_dfa_match() or, if no such limit is set, less than the default.
+
+ int pcre2_set_depth_limit(pcre2_match_context *mcontext,
+ uint32_t value);
+
+ This parameter limits the depth of nested backtracking in
+ pcre2_match(). Each time a nested backtracking point is passed, a new
+ memory "frame" is used to remember the state of matching at that point.
+ Thus, this parameter indirectly limits the amount of memory that is
+ used in a match. However, because the size of each memory "frame"
+ depends on the number of capturing parentheses, the actual memory limit
+ varies from pattern to pattern. This limit was more useful in versions
+ before 10.30, where function recursion was used for backtracking.
+
+ The depth limit is not relevant, and is ignored, when matching is done
+ using JIT compiled code. However, it is supported by pcre2_dfa_match(),
+ which uses it to limit the depth of nested internal recursive function
+ calls that implement atomic groups, lookaround assertions, and pattern
+ recursions. This limits, indirectly, the amount of system stack that is
+ used. It was more useful in versions before 10.32, when stack memory
+ was used for local workspace vectors for recursive function calls. From
+ version 10.32, only local variables are allocated on the stack and as
+ each call uses only a few hundred bytes, even a small stack can support
+ quite a lot of recursion.
+
+ If the depth of internal recursive function calls is great enough,
+ local workspace vectors are allocated on the heap from version 10.32
+ onwards, so the depth limit also indirectly limits the amount of heap
+ memory that is used. A recursive pattern such as /(.(?2))((?1)|)/, when
+ matched to a very long string using pcre2_dfa_match(), can use a great
+ deal of memory. However, it is probably better to limit heap usage
+ directly by calling pcre2_set_heap_limit().
+
+ The default value for the depth limit can be set when PCRE2 is built;
+ if it is not, the default is set to the same value as the default for
+ the match limit. If the limit is exceeded, pcre2_match() or
+ pcre2_dfa_match() returns PCRE2_ERROR_DEPTHLIMIT. A value for the depth
+ limit may also be supplied by an item at the start of a pattern of the
+ form
+
+ (*LIMIT_DEPTH=ddd)
+
+ where ddd is a decimal number. However, such a setting is ignored
+ unless ddd is less than the limit set by the caller of pcre2_match() or
+ pcre2_dfa_match() or, if no such limit is set, less than the default.
+
+
+CHECKING BUILD-TIME OPTIONS
+
+ int pcre2_config(uint32_t what, void *where);
+
+ The function pcre2_config() makes it possible for a PCRE2 client to
+ discover which optional features have been compiled into the PCRE2
+ library. The pcre2build documentation has more details about these
+ optional features.
+
+ The first argument for pcre2_config() specifies which information is
+ required. The second argument is a pointer to memory into which the
+ information is placed. If NULL is passed, the function returns the
+ amount of memory that is needed for the requested information. For
+ calls that return numerical values, the value is in bytes; when
+ requesting these values, where should point to appropriately aligned
+ memory. For calls that return strings, the required length is given in
+ code units, not counting the terminating zero.
+
+ When requesting information, the returned value from pcre2_config() is
+ non-negative on success, or the negative error code PCRE2_ERROR_BADOP-
+ TION if the value in the first argument is not recognized. The follow-
+ ing information is available:
+
+ PCRE2_CONFIG_BSR
+
+ The output is a uint32_t integer whose value indicates what character
+ sequences the \R escape sequence matches by default. A value of
+ PCRE2_BSR_UNICODE means that \R matches any Unicode line ending
+ sequence; a value of PCRE2_BSR_ANYCRLF means that \R matches only CR,
+ LF, or CRLF. The default can be overridden when a pattern is compiled.
+
+ PCRE2_CONFIG_COMPILED_WIDTHS
+
+ The output is a uint32_t integer whose lower bits indicate which code
+ unit widths were selected when PCRE2 was built. The 1-bit indicates
+ 8-bit support, and the 2-bit and 4-bit indicate 16-bit and 32-bit sup-
+ port, respectively.
+
+ PCRE2_CONFIG_DEPTHLIMIT
+
+ The output is a uint32_t integer that gives the default limit for the
+ depth of nested backtracking in pcre2_match() or the depth of nested
+ recursions, lookarounds, and atomic groups in pcre2_dfa_match(). Fur-
+ ther details are given with pcre2_set_depth_limit() above.
+
+ PCRE2_CONFIG_HEAPLIMIT
+
+ The output is a uint32_t integer that gives, in kibibytes, the default
+ limit for the amount of heap memory used by pcre2_match() or
+ pcre2_dfa_match(). Further details are given with
+ pcre2_set_heap_limit() above.
+
+ PCRE2_CONFIG_JIT
+
+ The output is a uint32_t integer that is set to one if support for
+ just-in-time compiling is available; otherwise it is set to zero.
+
+ PCRE2_CONFIG_JITTARGET
+
+ The where argument should point to a buffer that is at least 48 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) The buffer is filled with a
+ string that contains the name of the architecture for which the JIT
+ compiler is configured, for example "x86 32bit (little endian +
+ unaligned)". If JIT support is not available, PCRE2_ERROR_BADOPTION is
+ returned, otherwise the number of code units used is returned. This is
+ the length of the string, plus one unit for the terminating zero.
+
+ PCRE2_CONFIG_LINKSIZE
+
+ The output is a uint32_t integer that contains the number of bytes used
+ for internal linkage in compiled regular expressions. When PCRE2 is
+ configured, the value can be set to 2, 3, or 4, with the default being
+ 2. This is the value that is returned by pcre2_config(). However, when
+ the 16-bit library is compiled, a value of 3 is rounded up to 4, and
+ when the 32-bit library is compiled, internal linkages always use 4
+ bytes, so the configured value is not relevant.
+
+ The default value of 2 for the 8-bit and 16-bit libraries is sufficient
+ for all but the most massive patterns, since it allows the size of the
+ compiled pattern to be up to 65535 code units. Larger values allow
+ larger regular expressions to be compiled by those two libraries, but
+ at the expense of slower matching.
+
+ PCRE2_CONFIG_MATCHLIMIT
+
+ The output is a uint32_t integer that gives the default match limit for
+ pcre2_match(). Further details are given with pcre2_set_match_limit()
+ above.
+
+ PCRE2_CONFIG_NEWLINE
+
+ The output is a uint32_t integer whose value specifies the default
+ character sequence that is recognized as meaning "newline". The values
+ are:
+
+ PCRE2_NEWLINE_CR Carriage return (CR)
+ PCRE2_NEWLINE_LF Linefeed (LF)
+ PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
+ PCRE2_NEWLINE_ANY Any Unicode line ending
+ PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
+
+ The default should normally correspond to the standard sequence for
+ your operating system.
+
+ PCRE2_CONFIG_NEVER_BACKSLASH_C
+
+ The output is a uint32_t integer that is set to one if the use of \C
+ was permanently disabled when PCRE2 was built; otherwise it is set to
+ zero.
+
+ PCRE2_CONFIG_PARENSLIMIT
+
+ The output is a uint32_t integer that gives the maximum depth of nest-
+ ing of parentheses (of any kind) in a pattern. This limit is imposed to
+ cap the amount of system stack used when a pattern is compiled. It is
+ specified when PCRE2 is built; the default is 250. This limit does not
+ take into account the stack that may already be used by the calling
+ application. For finer control over compilation stack usage, see
+ pcre2_set_compile_recursion_guard().
+
+ PCRE2_CONFIG_STACKRECURSE
+
+ This parameter is obsolete and should not be used in new code. The out-
+ put is a uint32_t integer that is always set to zero.
+
+ PCRE2_CONFIG_UNICODE_VERSION
+
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) If PCRE2 has been compiled
+ without Unicode support, the buffer is filled with the text "Unicode
+ not supported". Otherwise, the Unicode version string (for example,
+ "8.0.0") is inserted. The number of code units used is returned. This
+ is the length of the string plus one unit for the terminating zero.
+
+ PCRE2_CONFIG_UNICODE
+
+ The output is a uint32_t integer that is set to one if Unicode support
+ is available; otherwise it is set to zero. Unicode support implies UTF
+ support.
+
+ PCRE2_CONFIG_VERSION
+
+ The where argument should point to a buffer that is at least 24 code
+ units long. (The exact length required can be found by calling
+ pcre2_config() with where set to NULL.) The buffer is filled with the
+ PCRE2 version string, zero-terminated. The number of code units used is
+ returned. This is the length of the string plus one unit for the termi-
+ nating zero.
+
+
+COMPILING A PATTERN
+
+ pcre2_code *pcre2_compile(PCRE2_SPTR pattern, PCRE2_SIZE length,
+ uint32_t options, int *errorcode, PCRE2_SIZE *erroroffset,
+ pcre2_compile_context *ccontext);
+
+ void pcre2_code_free(pcre2_code *code);
+
+ pcre2_code *pcre2_code_copy(const pcre2_code *code);
+
+ pcre2_code *pcre2_code_copy_with_tables(const pcre2_code *code);
+
+ The pcre2_compile() function compiles a pattern into an internal form.
+ The pattern is defined by a pointer to a string of code units and a
+ length (in code units). If the pattern is zero-terminated, the length
+ can be specified as PCRE2_ZERO_TERMINATED. The function returns a
+ pointer to a block of memory that contains the compiled pattern and
+ related data, or NULL if an error occurred.
+
+ If the compile context argument ccontext is NULL, memory for the com-
+ piled pattern is obtained by calling malloc(). Otherwise, it is
+ obtained from the same memory function that was used for the compile
+ context. The caller must free the memory by calling pcre2_code_free()
+ when it is no longer needed. If pcre2_code_free() is called with a
+ NULL argument, it returns immediately, without doing anything.
+
+ The function pcre2_code_copy() makes a copy of the compiled code in new
+ memory, using the same memory allocator as was used for the original.
+ However, if the code has been processed by the JIT compiler (see
+ below), the JIT information cannot be copied (because it is position-
+ dependent). The new copy can initially be used only for non-JIT match-
+ ing, though it can be passed to pcre2_jit_compile() if required. If
+ pcre2_code_copy() is called with a NULL argument, it returns NULL.
+
+ The pcre2_code_copy() function provides a way for individual threads in
+ a multithreaded application to acquire a private copy of shared com-
+ piled code. However, it does not make a copy of the character tables
+ used by the compiled pattern; the new pattern code points to the same
+ tables as the original code. (See "Locale Support" below for details
+ of these character tables.) In many applications the same tables are
+ used throughout, so this behaviour is appropriate. Nevertheless, there
+ are occasions when a copy of a compiled pattern and the relevant tables
+ are needed. The pcre2_code_copy_with_tables() function provides this facility.
+ Copies of both the code and the tables are made, with the new code
+ pointing to the new tables. The memory for the new tables is automati-
+ cally freed when pcre2_code_free() is called for the new copy of the
+ compiled code. If pcre2_code_copy_with_tables() is called with a NULL
+ argument, it returns NULL.
+
+ NOTE: When one of the matching functions is called, pointers to the
+ compiled pattern and the subject string are set in the match data block
+ so that they can be referenced by the substring extraction functions.
+ After running a match, you must not free a compiled pattern (or a sub-
+ ject string) until after all operations on the match data block have
+ taken place.
+
+ The options argument for pcre2_compile() contains various bit settings
+ that affect the compilation. It should be zero if no options are
+ required. The available options are described below. Some of them (in
+ particular, those that are compatible with Perl, but some others as
+ well) can also be set and unset from within the pattern (see the
+ detailed description in the pcre2pattern documentation).
+
+ For those options that can be different in different parts of the pat-
+ tern, the contents of the options argument specify their settings at
+ the start of compilation. The PCRE2_ANCHORED, PCRE2_ENDANCHORED, and
+ PCRE2_NO_UTF_CHECK options can be set at the time of matching as well
+ as at compile time.
+
+ Other, less frequently required compile-time parameters (for example,
+ the newline setting) can be provided in a compile context (as described
+ above).
+
+ If errorcode or erroroffset is NULL, pcre2_compile() returns NULL imme-
+ diately. Otherwise, the variables to which these point are set to an
+ error code and an offset (number of code units) within the pattern,
+ respectively, when pcre2_compile() returns NULL because a compilation
+ error has occurred. The values are not defined when compilation is suc-
+ cessful and pcre2_compile() returns a non-NULL value.
+
+ There are nearly 100 positive error codes that pcre2_compile() may
+ return if it finds an error in the pattern. There are also some nega-
+ tive error codes that are used for invalid UTF strings. These are the
+ same as given by pcre2_match() and pcre2_dfa_match(), and are described
+ in the pcre2unicode page. There is no separate documentation for the
+ positive error codes, because the textual error messages that are
+ obtained by calling the pcre2_get_error_message() function (see
+ "Obtaining a textual error message" below) should be self-explanatory.
+ Macro names starting with PCRE2_ERROR_ are defined for both positive
+ and negative error codes in pcre2.h.
+
+ The value returned in erroroffset is an indication of where in the pat-
+ tern the error occurred. It is not necessarily the furthest point in
+ the pattern that was read. For example, after the error "lookbehind
+ assertion is not fixed length", the error offset points to the start of
+ the failing assertion. For an invalid UTF-8 or UTF-16 string, the off-
+ set is that of the first code unit of the failing character.
+
+ Some errors are not detected until the whole pattern has been scanned;
+ in these cases, the offset passed back is the length of the pattern.
+ Note that the offset is in code units, not characters, even in a UTF
+ mode. It may sometimes point into the middle of a UTF-8 or UTF-16 char-
+ acter.
+
+ This code fragment shows a typical straightforward call to pcre2_com-
+ pile():
+
+ pcre2_code *re;
+ PCRE2_SIZE erroffset;
+ int errorcode;
+ re = pcre2_compile(
+ (PCRE2_SPTR)"^A.*Z", /* the pattern */
+ PCRE2_ZERO_TERMINATED, /* the pattern is zero-terminated */
+ 0, /* default options */
+ &errorcode, /* for error code */
+ &erroffset, /* for error offset */
+ NULL); /* no compile context */
+
+ The following names for option bits are defined in the pcre2.h header
+ file:
+
+ PCRE2_ANCHORED
+
+ If this bit is set, the pattern is forced to be "anchored", that is, it
+ is constrained to match only at the first matching point in the string
+ that is being searched (the "subject string"). This effect can also be
+ achieved by appropriate constructs in the pattern itself, which is the
+ only way to do it in Perl.
+
+ PCRE2_ALLOW_EMPTY_CLASS
+
+ By default, for compatibility with Perl, a closing square bracket that
+ immediately follows an opening one is treated as a data character for
+ the class. When PCRE2_ALLOW_EMPTY_CLASS is set, it terminates the
+ class, which therefore contains no characters and so can never match.
+
+ PCRE2_ALT_BSUX
+
+ This option requests alternative handling of three escape sequences,
+ which makes PCRE2's behaviour more like ECMAScript (aka JavaScript).
+ When it is set:
+
+ (1) \U matches an upper case "U" character; by default \U causes a com-
+ pile time error (Perl uses \U to upper case subsequent characters).
+
+ (2) \u matches a lower case "u" character unless it is followed by four
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, \u causes a compile time error (Perl
+ uses it to upper case the following character).
+
+ (3) \x matches a lower case "x" character unless it is followed by two
+ hexadecimal digits, in which case the hexadecimal number defines the
+ code point to match. By default, as in Perl, a hexadecimal number is
+ always expected after \x, but it may have zero, one, or two digits (so,
+ for example, \xz matches a binary zero character followed by z).
+
+ PCRE2_ALT_CIRCUMFLEX
+
+ In multiline mode (when PCRE2_MULTILINE is set), the circumflex
+ metacharacter matches at the start of the subject (unless PCRE2_NOTBOL
+ is set), and also after any internal newline. However, it does not
+ match after a newline at the end of the subject, for compatibility with
+ Perl. If you want a multiline circumflex also to match after a termi-
+ nating newline, you must set PCRE2_ALT_CIRCUMFLEX.
+
+ PCRE2_ALT_VERBNAMES
+
+ By default, for compatibility with Perl, the name in any verb sequence
+ such as (*MARK:NAME) is any sequence of characters that does not
+ include a closing parenthesis. The name is not processed in any way,
+ and it is not possible to include a closing parenthesis in the name.
+ However, if the PCRE2_ALT_VERBNAMES option is set, normal backslash
+ processing is applied to verb names and only an unescaped closing
+ parenthesis terminates the name. A closing parenthesis can be included
+ in a name either as \) or between \Q and \E. If the PCRE2_EXTENDED or
+ PCRE2_EXTENDED_MORE option is set with PCRE2_ALT_VERBNAMES, unescaped
+ whitespace in verb names is skipped and #-comments are recognized,
+ exactly as in the rest of the pattern.
+
+ PCRE2_AUTO_CALLOUT
+
+ If this bit is set, pcre2_compile() automatically inserts callout
+ items, all with number 255, before each pattern item, except immedi-
+ ately before or after an explicit callout in the pattern. For discus-
+ sion of the callout facility, see the pcre2callout documentation.
+
+ PCRE2_CASELESS
+
+ If this bit is set, letters in the pattern match both upper and lower
+ case letters in the subject. It is equivalent to Perl's /i option, and
+ it can be changed within a pattern by a (?i) option setting. If
+ PCRE2_UTF is set, Unicode properties are used for all characters with
+ more than one other case, and for all characters whose code points are
+ greater than U+007F. For lower valued characters with only one other
+ case, a lookup table is used for speed. When PCRE2_UTF is not set, a
+ lookup table is used for all code points less than 256, and higher code
+ points (available only in 16-bit or 32-bit mode) are treated as not
+ having another case.
+
+ PCRE2_DOLLAR_ENDONLY
+
+ If this bit is set, a dollar metacharacter in the pattern matches only
+ at the end of the subject string. Without this option, a dollar also
+ matches immediately before a newline at the end of the string (but not
+ before any other newlines). The PCRE2_DOLLAR_ENDONLY option is ignored
+ if PCRE2_MULTILINE is set. There is no equivalent to this option in
+ Perl, and no way to set it within a pattern.
+
+ PCRE2_DOTALL
+
+ If this bit is set, a dot metacharacter in the pattern matches any
+ character, including one that indicates a newline. However, it only
+ ever matches one character, even if newlines are coded as CRLF. Without
+ this option, a dot does not match when the current position in the sub-
+ ject is at a newline. This option is equivalent to Perl's /s option,
+ and it can be changed within a pattern by a (?s) option setting. A neg-
+ ative class such as [^a] always matches newline characters, and the \N
+ escape sequence always matches a non-newline character, independent of
+ the setting of PCRE2_DOTALL.
+
+ PCRE2_DUPNAMES
+
+ If this bit is set, names used to identify capturing subpatterns need
+ not be unique. This can be helpful for certain types of pattern when it
+ is known that only one instance of the named subpattern can ever be
+ matched. There are more details of named subpatterns below; see also
+ the pcre2pattern documentation.
+
+ PCRE2_ENDANCHORED
+
+ If this bit is set, the end of any pattern match must be right at the
+ end of the string being searched (the "subject string"). If the pattern
+ match succeeds by reaching (*ACCEPT), but does not reach the end of the
+ subject, the match fails at the current starting point. For unanchored
+ patterns, a new match is then tried at the next starting point. How-
+ ever, if the match succeeds by reaching the end of the pattern, but not
+ the end of the subject, backtracking occurs and an alternative match
+ may be found. Consider these two patterns:
+
+ .(*ACCEPT)|..
+ .|..
+
+ If matched against "abc" with PCRE2_ENDANCHORED set, the first matches
+ "c" whereas the second matches "bc". The effect of PCRE2_ENDANCHORED
+ can also be achieved by appropriate constructs in the pattern itself,
+ which is the only way to do it in Perl.
+
+ For DFA matching with pcre2_dfa_match(), PCRE2_ENDANCHORED applies only
+ to the first (that is, the longest) matched string. Other parallel
+ matches, which are necessarily substrings of the first one, must obvi-
+ ously end before the end of the subject.
+
+ PCRE2_EXTENDED
+
+ If this bit is set, most white space characters in the pattern are
+ totally ignored except when escaped or inside a character class. How-
+ ever, white space is not allowed within sequences such as (?> that
+ introduce various parenthesized subpatterns, nor within numerical quan-
+ tifiers such as {1,3}. Ignorable white space is permitted between an
+ item and a following quantifier and between a quantifier and a follow-
+ ing + that indicates possessiveness. PCRE2_EXTENDED is equivalent to
+ Perl's /x option, and it can be changed within a pattern by a (?x)
+ option setting.
+
+ When PCRE2 is compiled without Unicode support, PCRE2_EXTENDED recog-
+ nizes as white space only those characters with code points less than
+ 256 that are flagged as white space in its low-character table. The ta-
+ ble is normally created by pcre2_maketables(), which uses the isspace()
+ function to identify space characters. In most ASCII environments, the
+ relevant characters are those with code points 0x0009 (tab), 0x000A
+ (linefeed), 0x000B (vertical tab), 0x000C (formfeed), 0x000D (carriage
+ return), and 0x0020 (space).
+
+ When PCRE2 is compiled with Unicode support, in addition to these char-
+ acters, five more Unicode "Pattern White Space" characters are recog-
+ nized by PCRE2_EXTENDED. These are U+0085 (next line), U+200E (left-to-
+ right mark), U+200F (right-to-left mark), U+2028 (line separator), and
+ U+2029 (paragraph separator). This set of characters is the same as
+ recognized by Perl's /x option. Note that the horizontal and vertical
+ space characters that are matched by the \h and \v escapes in patterns
+ are a much bigger set.
+
+ As well as ignoring most white space, PCRE2_EXTENDED also causes char-
+ acters between an unescaped # outside a character class and the next
+ newline, inclusive, to be ignored, which makes it possible to include
+ comments inside complicated patterns. Note that the end of this type of
+ comment is a literal newline sequence in the pattern; escape sequences
+ that happen to represent a newline do not count.
+
+ Which characters are interpreted as newlines can be specified by a set-
+ ting in the compile context that is passed to pcre2_compile() or by a
+ special sequence at the start of the pattern, as described in the sec-
+ tion entitled "Newline conventions" in the pcre2pattern documentation.
+ A default is defined when PCRE2 is built.
+
+ PCRE2_EXTENDED_MORE
+
+ This option has the effect of PCRE2_EXTENDED, but, in addition,
+ unescaped space and horizontal tab characters are ignored inside a
+ character class. Note: only these two characters are ignored, not the
+ full set of pattern white space characters that are ignored outside a
+ character class. PCRE2_EXTENDED_MORE is equivalent to Perl's /xx
+ option, and it can be changed within a pattern by a (?xx) option set-
+ ting.
+
+ PCRE2_FIRSTLINE
+
+ If this option is set, the start of an unanchored pattern match must be
+ before or at the first newline in the subject string following the
+ start of matching, though the matched text may continue over the new-
+ line. If startoffset is non-zero, the limiting newline is not necessar-
+ ily the first newline in the subject. For example, if the subject
+ string is "abc\nxyz" (where \n represents a single-character newline) a
+ pattern match for "yz" succeeds with PCRE2_FIRSTLINE if startoffset is
+ greater than 3. See also PCRE2_USE_OFFSET_LIMIT, which provides a more
+ general limiting facility. If PCRE2_FIRSTLINE is set with an offset
+ limit, a match must occur in the first line and also within the offset
+ limit. In other words, whichever limit comes first is used.
+
+ PCRE2_LITERAL
+
+ If this option is set, all meta-characters in the pattern are disabled,
+ and it is treated as a literal string. Matching literal strings with a
+ regular expression engine is not the most efficient way of doing it. If
+ you are doing a lot of literal matching and are worried about effi-
+ ciency, you should consider using other approaches. The only other main
+ options that are allowed with PCRE2_LITERAL are: PCRE2_ANCHORED,
+ PCRE2_ENDANCHORED, PCRE2_AUTO_CALLOUT, PCRE2_CASELESS, PCRE2_FIRSTLINE,
+ PCRE2_NO_START_OPTIMIZE, PCRE2_NO_UTF_CHECK, PCRE2_UTF, and
+ PCRE2_USE_OFFSET_LIMIT. The extra options PCRE2_EXTRA_MATCH_LINE and
+ PCRE2_EXTRA_MATCH_WORD are also supported. Any other options cause an
+ error.
+
+ PCRE2_MATCH_UNSET_BACKREF
+
+ If this option is set, a backreference to an unset subpattern group
+ matches an empty string (by default this causes the current matching
+ alternative to fail). A pattern such as (\1)(a) succeeds when this
+ option is set (assuming it can find an "a" in the subject), whereas it
+ fails by default, for Perl compatibility. Setting this option makes
+ PCRE2 behave more like ECMAScript (aka JavaScript).
+
+ PCRE2_MULTILINE
+
+ By default, for the purposes of matching "start of line" and "end of
+ line", PCRE2 treats the subject string as consisting of a single line
+ of characters, even if it actually contains newlines. The "start of
+ line" metacharacter (^) matches only at the start of the string, and
+ the "end of line" metacharacter ($) matches only at the end of the
+ string, or before a terminating newline (except when PCRE2_DOL-
+ LAR_ENDONLY is set). Note, however, that unless PCRE2_DOTALL is set,
+ the "any character" metacharacter (.) does not match at a newline. This
+ behaviour (for ^, $, and dot) is the same as Perl.
+
+ When PCRE2_MULTILINE is set, the "start of line" and "end of line"
+ constructs match immediately following or immediately before internal
+ newlines in the subject string, respectively, as well as at the very
+ start and end. This is equivalent to Perl's /m option, and it can be
+ changed within a pattern by a (?m) option setting. Note that the "start
+ of line" metacharacter does not match after a newline at the end of the
+ subject, for compatibility with Perl. However, you can change this by
+ setting the PCRE2_ALT_CIRCUMFLEX option. If there are no newlines in a
+ subject string, or no occurrences of ^ or $ in a pattern, setting
+ PCRE2_MULTILINE has no effect.
+
+ PCRE2_NEVER_BACKSLASH_C
+
+ This option locks out the use of \C in the pattern that is being com-
+ piled. This escape can cause unpredictable behaviour in UTF-8 or
+ UTF-16 modes, because it may leave the current matching point in the
+ middle of a multi-code-unit character. This option may be useful in
+ applications that process patterns from external sources. Note that
+ there is also a build-time option that permanently locks out the use of
+ \C.
+
+ PCRE2_NEVER_UCP
+
+ This option locks out the use of Unicode properties for handling \B,
+ \b, \D, \d, \S, \s, \W, \w, and some of the POSIX character classes, as
+ described for the PCRE2_UCP option below. In particular, it prevents
+ the creator of the pattern from enabling this facility by starting the
+ pattern with (*UCP). This option may be useful in applications that
+ process patterns from external sources. The option combination
+ PCRE2_UCP and PCRE2_NEVER_UCP causes an error.
+
+ PCRE2_NEVER_UTF
+
+ This option locks out interpretation of the pattern as UTF-8, UTF-16,
+ or UTF-32, depending on which library is in use. In particular, it pre-
+ vents the creator of the pattern from switching to UTF interpretation
+ by starting the pattern with (*UTF). This option may be useful in
+ applications that process patterns from external sources. The combina-
+ tion of PCRE2_UTF and PCRE2_NEVER_UTF causes an error.
+
+ PCRE2_NO_AUTO_CAPTURE
+
+ If this option is set, it disables the use of numbered capturing paren-
+ theses in the pattern. Any opening parenthesis that is not followed by
+ ? behaves as if it were followed by ?: but named parentheses can still
+ be used for capturing (and they acquire numbers in the usual way). This
+ is the same as Perl's /n option. Note that, when this option is set,
+ references to capturing groups (backreferences or recursion/subroutine
+ calls) may only refer to named groups, though the reference can be by
+ name or by number.
+
+ PCRE2_NO_AUTO_POSSESS
+
+ If this option is set, it disables "auto-possessification", which is an
+ optimization that, for example, turns a+b into a++b in order to avoid
+ backtracks into a+ that can never be successful. However, if callouts
+ are in use, auto-possessification means that some callouts are never
+ taken. You can set this option if you want the matching functions to do
+ a full unoptimized search and run all the callouts, but it is mainly
+ provided for testing purposes.
+
+ PCRE2_NO_DOTSTAR_ANCHOR
+
+ If this option is set, it disables an optimization that is applied when
+ .* is the first significant item in a top-level branch of a pattern,
+ and all the other branches also start with .* or with \A or \G or ^.
+ The optimization is automatically disabled for .* if it is inside an
+ atomic group or a capturing group that is the subject of a backrefer-
+ ence, or if the pattern contains (*PRUNE) or (*SKIP). When the opti-
+ mization is not disabled, such a pattern is automatically anchored if
+ PCRE2_DOTALL is set for all the .* items and PCRE2_MULTILINE is not set
+ for any ^ items. Otherwise, the fact that any match must start either
+ at the start of the subject or following a newline is remembered. Like
+ other optimizations, this can cause callouts to be skipped.
+
+ PCRE2_NO_START_OPTIMIZE
+
+ This is an option whose main effect is at matching time. It does not
+ change what pcre2_compile() generates, but it does affect the output of
+ the JIT compiler.
+
+ There are a number of optimizations that may occur at the start of a
+ match, in order to speed up the process. For example, if it is known
+ that an unanchored match must start with a specific code unit value,
+ the matching code searches the subject for that value, and fails imme-
+ diately if it cannot find it, without actually running the main match-
+ ing function. This means that a special item such as (*COMMIT) at the
+ start of a pattern is not considered until after a suitable starting
+ point for the match has been found. Also, when callouts or (*MARK)
+ items are in use, these "start-up" optimizations can cause them to be
+ skipped if the pattern is never actually used. The start-up optimiza-
+ tions are in effect a pre-scan of the subject that takes place before
+ the pattern is run.
+
+ The PCRE2_NO_START_OPTIMIZE option disables the start-up optimizations,
+ possibly causing performance to suffer, but ensuring that in cases
+ where the result is "no match", the callouts do occur, and that items
+ such as (*COMMIT) and (*MARK) are considered at every possible starting
+ position in the subject string.
+
+ Setting PCRE2_NO_START_OPTIMIZE may change the outcome of a matching
+ operation. Consider the pattern
+
+ (*COMMIT)ABC
+
+ When this is compiled, PCRE2 records the fact that a match must start
+ with the character "A". Suppose the subject string is "DEFABC". The
+ start-up optimization scans along the subject, finds "A" and runs the
+ first match attempt from there. The (*COMMIT) item means that the pat-
+ tern must match the current starting position, which in this case, it
+ does. However, if the same match is run with PCRE2_NO_START_OPTIMIZE
+ set, the initial scan along the subject string does not happen. The
+ first match attempt is run starting from "D" and when this fails,
+ (*COMMIT) prevents any further matches being tried, so the overall
+ result is "no match".
+
+ There are also other start-up optimizations. For example, a minimum
+ length for the subject may be recorded. Consider the pattern
+
+ (*MARK:A)(X|Y)
+
+ The minimum length for a match is one character. If the subject is
+ "ABC", there will be attempts to match "ABC", "BC", and "C". An attempt
+ to match an empty string at the end of the subject does not take place,
+ because PCRE2 knows that the subject is now too short, and so the
+ (*MARK) is never encountered. In this case, the optimization does not
+ affect the overall match result, which is still "no match", but it does
+ affect the auxiliary information that is returned.
+
+ PCRE2_NO_UTF_CHECK
+
+ When PCRE2_UTF is set, the validity of the pattern as a UTF string is
+ automatically checked. There are discussions about the validity of
+ UTF-8 strings, UTF-16 strings, and UTF-32 strings in the pcre2unicode
+ document. If an invalid UTF sequence is found, pcre2_compile() returns
+ a negative error code.
+
+ If you know that your pattern is a valid UTF string, and you want to
+ skip this check for performance reasons, you can set the
+ PCRE2_NO_UTF_CHECK option. When it is set, the effect of passing an
+ invalid UTF string as a pattern is undefined. It may cause your program
+ to crash or loop.
+
+ Note that this option can also be passed to pcre2_match() and
+ pcre2_dfa_match(), to suppress UTF validity checking of the subject
+ string.
+
+ Note also that setting PCRE2_NO_UTF_CHECK at compile time does not dis-
+ able the error that is given if an escape sequence for an invalid Uni-
+ code code point is encountered in the pattern. In particular, the so-
+ called "surrogate" code points (0xd800 to 0xdfff) are invalid. If you
+ want to allow escape sequences such as \x{d800} you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option, as described in the
+ section entitled "Extra compile options" below. However, this is pos-
+ sible only in UTF-8 and UTF-32 modes, because these values are not rep-
+ resentable in UTF-16.
+
+ PCRE2_UCP
+
+ This option changes the way PCRE2 processes \B, \b, \D, \d, \S, \s, \W,
+ \w, and some of the POSIX character classes. By default, only ASCII
+ characters are recognized, but if PCRE2_UCP is set, Unicode properties
+ are used instead to classify characters. More details are given in the
+ section on generic character types in the pcre2pattern page. If you set
+ PCRE2_UCP, matching one of the items it affects takes much longer. The
+ option is available only if PCRE2 has been compiled with Unicode sup-
+ port (which is the default).
+
+ PCRE2_UNGREEDY
+
+ This option inverts the "greediness" of the quantifiers so that they
+ are not greedy by default, but become greedy if followed by "?". It is
+ not compatible with Perl. It can also be set by a (?U) option setting
+ within the pattern.
+
+ PCRE2_USE_OFFSET_LIMIT
+
+ This option must be set for pcre2_compile() if pcre2_set_offset_limit()
+ is going to be used to set a non-default offset limit in a match con-
+ text for matches that use this pattern. An error is generated if an
+ offset limit is set without this option. For more details, see the
+ description of pcre2_set_offset_limit() in the section that describes
+ match contexts. See also the PCRE2_FIRSTLINE option above.
+
+ PCRE2_UTF
+
+ This option causes PCRE2 to regard both the pattern and the subject
+ strings that are subsequently processed as strings of UTF characters
+ instead of single-code-unit strings. It is available when PCRE2 is
+ built to include Unicode support (which is the default). If Unicode
+ support is not available, the use of this option provokes an error.
+ Details of how PCRE2_UTF changes the behaviour of PCRE2 are given in
+ the pcre2unicode page. In particular, note that it changes the way
+ PCRE2_CASELESS handles characters with code points greater than 127.
+
+ Extra compile options
+
+ Unlike the main compile-time options, the extra options are not saved
+ with the compiled pattern. The option bits that can be set in a compile
+ context by calling the pcre2_set_compile_extra_options() function are
+ as follows:
+
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES
+
+ This option applies when compiling a pattern in UTF-8 or UTF-32 mode.
+ It is forbidden in UTF-16 mode, and ignored in non-UTF modes. Unicode
+ "surrogate" code points in the range 0xd800 to 0xdfff are used in pairs
+ in UTF-16 to encode code points with values in the range 0x10000 to
+ 0x10ffff. The surrogates cannot therefore be represented in UTF-16.
+ They can be represented in UTF-8 and UTF-32, but are defined as invalid
+ code points, and cause errors if encountered in a UTF-8 or UTF-32
+ string that is being checked for validity by PCRE2.
+
+ These values also cause errors if encountered in escape sequences such
+ as \x{d912} within a pattern. However, it seems that some applications,
+ when using PCRE2 to check for unwanted characters in UTF-8 strings,
+ explicitly test for the surrogates using escape sequences. The
+ PCRE2_NO_UTF_CHECK option does not disable the error that occurs,
+ because it applies only to the testing of input strings for UTF valid-
+ ity.
+
+ If the extra option PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES is set, surro-
+ gate code point values in UTF-8 and UTF-32 patterns no longer provoke
+ errors and are incorporated in the compiled pattern. However, they can
+ only match subject characters if the matching function is called with
+ PCRE2_NO_UTF_CHECK set.
+
+ PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL
+
+ This is a dangerous option. Use with care. By default, an unrecognized
+ escape such as \j or a malformed one such as \x{2z} causes a compile-
+ time error when detected by pcre2_compile(). Perl is somewhat inconsis-
+ tent in handling such items: for example, \j is treated as a literal
+ "j", and non-hexadecimal digits in \x{} are just ignored, though warn-
+ ings are given in both cases if Perl's warning switch is enabled. How-
+ ever, a malformed octal number after \o{ always causes an error in
+ Perl.
+
+ If the PCRE2_EXTRA_BAD_ESCAPE_IS_LITERAL extra option is passed to
+ pcre2_compile(), all unrecognized or erroneous escape sequences are
+ treated as single-character escapes. For example, \j is a literal "j"
+ and \x{2z} is treated as the literal string "x{2z}". Setting this
+ option means that typos in patterns may go undetected and have unex-
+ pected results. This is a dangerous option. Use with care.
+
+ PCRE2_EXTRA_MATCH_LINE
+
+ This option is provided for use by the -x option of pcre2grep. It
+ causes the pattern only to match complete lines. This is achieved by
+ automatically inserting the code for "^(?:" at the start of the com-
+ piled pattern and ")$" at the end. Thus, when PCRE2_MULTILINE is set,
+ the matched line may be in the middle of the subject string. This
+ option can be used with PCRE2_LITERAL.
+
+ PCRE2_EXTRA_MATCH_WORD
+
+ This option is provided for use by the -w option of pcre2grep. It
+ causes the pattern only to match strings that have a word boundary at
+ the start and the end. This is achieved by automatically inserting the
+ code for "\b(?:" at the start of the compiled pattern and ")\b" at the
+ end. The option may be used with PCRE2_LITERAL. However, it is ignored
+ if PCRE2_EXTRA_MATCH_LINE is also set.
+
+
+JUST-IN-TIME (JIT) COMPILATION
+
+ int pcre2_jit_compile(pcre2_code *code, uint32_t options);
+
+ int pcre2_jit_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext);
+
+ void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
+
+ pcre2_jit_stack *pcre2_jit_stack_create(PCRE2_SIZE startsize,
+ PCRE2_SIZE maxsize, pcre2_general_context *gcontext);
+
+ void pcre2_jit_stack_assign(pcre2_match_context *mcontext,
+ pcre2_jit_callback callback_function, void *callback_data);
+
+ void pcre2_jit_stack_free(pcre2_jit_stack *jit_stack);
+
+ These functions provide support for JIT compilation, which, if the
+ just-in-time compiler is available, further processes a compiled pat-
+ tern into machine code that executes much faster than the pcre2_match()
+ interpretive matching function. Full details are given in the pcre2jit
+ documentation.
+
+ JIT compilation is a heavyweight optimization. It can take some time
+ for patterns to be analyzed, and for one-off matches and simple pat-
+ terns the benefit of faster execution might be offset by a much slower
+ compilation time. Most (but not all) patterns can be optimized by the
+ JIT compiler.
+
+
+LOCALE SUPPORT
+
+ PCRE2 handles caseless matching, and determines whether characters are
+ letters, digits, or whatever, by reference to a set of tables, indexed
+ by character code point. This applies only to characters whose code
+ points are less than 256. By default, higher-valued code points never
+ match escapes such as \w or \d. However, if PCRE2 is built with Uni-
+ code support, all characters can be tested with \p and \P, or, alterna-
+ tively, the PCRE2_UCP option can be set when a pattern is compiled;
+ this causes \w and friends to use Unicode property support instead of
+ the built-in tables.
+
+ The use of locales with Unicode is discouraged. If you are handling
+ characters with code points greater than 127, you should either use
+ Unicode support, or use locales, but not try to mix the two.
+
+ PCRE2 contains an internal set of character tables that are used by
+ default. These are sufficient for many applications. Normally, the
+ internal tables recognize only ASCII characters. However, when PCRE2 is
+ built, it is possible to cause the internal tables to be rebuilt in the
+ default "C" locale of the local system, which may cause them to be dif-
+ ferent.
+
+ The internal tables can be overridden by tables supplied by the appli-
+ cation that calls PCRE2. These may be created in a different locale
+ from the default. As more and more applications change to using Uni-
+ code, the need for this locale support is expected to die away.
+
+ External tables are built by calling the pcre2_maketables() function,
+ in the relevant locale. The result can be passed to pcre2_compile() as
+ often as necessary, by creating a compile context and calling
+ pcre2_set_character_tables() to set the tables pointer therein. For
+ example, to build and use tables that are appropriate for the French
+ locale (where accented characters with values greater than 127 are
+ treated as letters), the following code could be used:
+
+ setlocale(LC_CTYPE, "fr_FR");
+ tables = pcre2_maketables(NULL);
+ ccontext = pcre2_compile_context_create(NULL);
+ pcre2_set_character_tables(ccontext, tables);
+ re = pcre2_compile(..., ccontext);
+
+ The locale name "fr_FR" is used on Linux and other Unix-like systems;
+ if you are using Windows, the name for the French locale is "french".
+ It is the caller's responsibility to ensure that the memory containing
+ the tables remains available for as long as it is needed.
+
+ The pointer that is passed (via the compile context) to pcre2_compile()
+ is saved with the compiled pattern, and the same tables are used by
+ pcre2_match() and pcre2_dfa_match(). Thus, for any single pattern,
+ compilation and matching both happen in the same locale, but different
+ patterns can be processed in different locales.
+
+
+INFORMATION ABOUT A COMPILED PATTERN
+
+ int pcre2_pattern_info(const pcre2_code *code, uint32_t what,
+ void *where);
+
+ The pcre2_pattern_info() function returns general information about a
+ compiled pattern. For information about callouts, see the next section.
+ The first argument for pcre2_pattern_info() is a pointer to the com-
+ piled pattern. The second argument specifies which piece of information
+ is required, and the third argument is a pointer to a variable to
+ receive the data. If the third argument is NULL, the first argument is
+ ignored, and the function returns the size in bytes of the variable
+ that is required for the information requested. Otherwise, the yield of
+ the function is zero for success, or one of the following negative num-
+ bers:
+
+ PCRE2_ERROR_NULL the argument code was NULL
+ PCRE2_ERROR_BADMAGIC the "magic number" was not found
+ PCRE2_ERROR_BADOPTION the value of what was invalid
+ PCRE2_ERROR_UNSET the requested field is not set
+
+ The "magic number" is placed at the start of each compiled pattern as
+ a simple check against passing an arbitrary memory pointer. Here is a
+ typical call of pcre2_pattern_info(), to obtain the length of the com-
+ piled pattern:
+
+ int rc;
+ size_t length;
+ rc = pcre2_pattern_info(
+ re, /* result of pcre2_compile() */
+ PCRE2_INFO_SIZE, /* what is required */
+ &length); /* where to put the data */
+
+ The possible values for the second argument are defined in pcre2.h, and
+ are as follows:
+
+ PCRE2_INFO_ALLOPTIONS
+ PCRE2_INFO_ARGOPTIONS
+ PCRE2_INFO_EXTRAOPTIONS
+
+ Return copies of the pattern's options. The third argument should point
+ to a uint32_t variable. PCRE2_INFO_ARGOPTIONS returns exactly the
+ options that were passed to pcre2_compile(), whereas PCRE2_INFO_ALLOP-
+ TIONS returns the compile options as modified by any top-level (*XXX)
+ option settings such as (*UTF) at the start of the pattern itself.
+ PCRE2_INFO_EXTRAOPTIONS returns the extra options that were set in the
+ compile context by calling the pcre2_set_compile_extra_options() func-
+ tion.
+
+ For example, if the pattern /(*UTF)abc/ is compiled with the
+ PCRE2_EXTENDED option, the result for PCRE2_INFO_ALLOPTIONS is
+ PCRE2_EXTENDED and PCRE2_UTF. Option settings such as (?i) that can
+ change within a pattern do not affect the result of PCRE2_INFO_ALLOP-
+ TIONS, even if they appear right at the start of the pattern. (This was
+ different in some earlier releases.)
+
+ A pattern compiled without PCRE2_ANCHORED is automatically anchored by
+ PCRE2 if the first significant item in every top-level branch is one of
+ the following:
+
+ ^ unless PCRE2_MULTILINE is set
+ \A always
+ \G always
+ .* sometimes - see below
+
+ When .* is the first significant item, anchoring is possible only when
+ all the following are true:
+
+ .* is not in an atomic group
+ .* is not in a capturing group that is the subject
+ of a backreference
+ PCRE2_DOTALL is in force for .*
+ Neither (*PRUNE) nor (*SKIP) appears in the pattern
+ PCRE2_NO_DOTSTAR_ANCHOR is not set
+
+ For patterns that are auto-anchored, the PCRE2_ANCHORED bit is set in
+ the options returned for PCRE2_INFO_ALLOPTIONS.
+
+ PCRE2_INFO_BACKREFMAX
+
+ Return the number of the highest backreference in the pattern. The
+ third argument should point to a uint32_t variable. Named subpatterns
+ acquire numbers as well as names, and these count towards the highest
+ backreference. Backreferences such as \4 or \g{12} match the captured
+ characters of the given group, but in addition, the check that a cap-
+ turing group is set in a conditional subpattern such as (?(3)a|b) is
+ also a backreference. Zero is returned if there are no backreferences.
+
+ PCRE2_INFO_BSR
+
+ The output is a uint32_t integer whose value indicates what character
+ sequences the \R escape sequence matches. A value of PCRE2_BSR_UNICODE
+ means that \R matches any Unicode line ending sequence; a value of
+ PCRE2_BSR_ANYCRLF means that \R matches only CR, LF, or CRLF.
+
+ PCRE2_INFO_CAPTURECOUNT
+
+ Return the highest capturing subpattern number in the pattern. In pat-
+ terns where (?| is not used, this is also the total number of capturing
+ subpatterns. The third argument should point to a uint32_t variable.
+
+ PCRE2_INFO_DEPTHLIMIT
+
+ If the pattern set a backtracking depth limit by including an item of
+ the form (*LIMIT_DEPTH=nnnn) at the start, the value is returned. The
+ third argument should point to a uint32_t integer. If no such value has
+ been set, the call to pcre2_pattern_info() returns the error
+ PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
+ ing if it is less than the limit set or defaulted by the caller of the
+ match function.
+
+ PCRE2_INFO_FIRSTBITMAP
+
+ In the absence of a single first code unit for a non-anchored pattern,
+ pcre2_compile() may construct a 256-bit table that defines a fixed set
+ of values for the first code unit in any match. For example, a pattern
+ that starts with [abc] results in a table with three bits set. When
+ code unit values greater than 255 are supported, the flag bit for 255
+ means "any code unit of value 255 or above". If such a table was con-
+ structed, a pointer to it is returned. Otherwise NULL is returned. The
+ third argument should point to a const uint8_t * variable.
+
+ PCRE2_INFO_FIRSTCODETYPE
+
+ Return information about the first code unit of any matched string, for
+ a non-anchored pattern. The third argument should point to a uint32_t
+ variable. If there is a fixed first value, for example, the letter "c"
+ from a pattern such as (cat|cow|coyote), 1 is returned, and the value
+ can be retrieved using PCRE2_INFO_FIRSTCODEUNIT. If there is no fixed
+ first value, but it is known that a match can occur only at the start
+ of the subject or following a newline in the subject, 2 is returned.
+ Otherwise, and for anchored patterns, 0 is returned.
+
+ PCRE2_INFO_FIRSTCODEUNIT
+
+ Return the value of the first code unit of any matched string for a
+ pattern where PCRE2_INFO_FIRSTCODETYPE returns 1; otherwise return 0.
+ The third argument should point to a uint32_t variable. In the 8-bit
+ library, the value is always less than 256. In the 16-bit library the
+ value can be up to 0xffff. In the 32-bit library in UTF-32 mode the
+ value can be up to 0x10ffff, and up to 0xffffffff when not using UTF-32
+ mode.
+
+ PCRE2_INFO_FRAMESIZE
+
+ Return the size (in bytes) of the data frames that are used to remember
+ backtracking positions when the pattern is processed by pcre2_match()
+ without the use of JIT. The third argument should point to a size_t
+ variable. The frame size depends on the number of capturing parentheses
+ in the pattern. Each additional capturing group adds two PCRE2_SIZE
+ variables.
+
+ PCRE2_INFO_HASBACKSLASHC
+
+ Return 1 if the pattern contains any instances of \C, otherwise 0. The
+ third argument should point to a uint32_t variable.
+
+ PCRE2_INFO_HASCRORLF
+
+ Return 1 if the pattern contains any explicit matches for CR or LF
+ characters, otherwise 0. The third argument should point to a uint32_t
+ variable. An explicit match is either a literal CR or LF character, or
+ \r or \n or one of the equivalent hexadecimal or octal escape
+ sequences.
+
+ PCRE2_INFO_HEAPLIMIT
+
+ If the pattern set a heap memory limit by including an item of the form
+ (*LIMIT_HEAP=nnnn) at the start, the value is returned. The third argu-
+ ment should point to a uint32_t integer. If no such value has been set,
+ the call to pcre2_pattern_info() returns the error PCRE2_ERROR_UNSET.
+ Note that this limit will only be used during matching if it is less
+ than the limit set or defaulted by the caller of the match function.
+
+ PCRE2_INFO_JCHANGED
+
+ Return 1 if the (?J) or (?-J) option setting is used in the pattern,
+ otherwise 0. The third argument should point to a uint32_t variable.
+ (?J) and (?-J) set and unset the local PCRE2_DUPNAMES option, respec-
+ tively.
+
+ PCRE2_INFO_JITSIZE
+
+ If the compiled pattern was successfully processed by pcre2_jit_com-
+ pile(), return the size of the JIT compiled code, otherwise return
+ zero. The third argument should point to a size_t variable.
+
+ PCRE2_INFO_LASTCODETYPE
+
+ Return 1 if there is a rightmost literal code unit that must exist in
+ any matched string, other than at its start. The third argument should
+ point to a uint32_t variable. If there is no such value, 0 is
+ returned. When 1 is returned, the code unit value itself can be
+ retrieved using PCRE2_INFO_LASTCODEUNIT. For anchored patterns, a last
+ literal value is recorded only if it follows something of variable
+ length. For example, for the pattern /^a\d+z\d+/ the returned value is
+ 1 (with "z" returned from PCRE2_INFO_LASTCODEUNIT), but for /^a\dz\d/
+ the returned value is 0.
+
+ PCRE2_INFO_LASTCODEUNIT
+
+ Return the value of the rightmost literal code unit that must exist in
+ any matched string, other than at its start, for a pattern where
+ PCRE2_INFO_LASTCODETYPE returns 1. Otherwise, return 0. The third argu-
+ ment should point to a uint32_t variable.
+
+ PCRE2_INFO_MATCHEMPTY
+
+ Return 1 if the pattern might match an empty string, otherwise 0. The
+ third argument should point to a uint32_t variable. When a pattern
+ contains recursive subroutine calls it is not always possible to deter-
+ mine whether or not it can match an empty string. PCRE2 takes a cau-
+ tious approach and returns 1 in such cases.
+
+ PCRE2_INFO_MATCHLIMIT
+
+ If the pattern set a match limit by including an item of the form
+ (*LIMIT_MATCH=nnnn) at the start, the value is returned. The third
+ argument should point to a uint32_t integer. If no such value has been
+ set, the call to pcre2_pattern_info() returns the error
+ PCRE2_ERROR_UNSET. Note that this limit will only be used during match-
+ ing if it is less than the limit set or defaulted by the caller of the
+ match function.
+
+ PCRE2_INFO_MAXLOOKBEHIND
+
+ Return the number of characters (not code units) in the longest lookbe-
+ hind assertion in the pattern. The third argument should point to a
+ uint32_t integer. This information is useful when doing multi-segment
+ matching using the partial matching facilities. Note that the simple
+ assertions \b and \B require a one-character lookbehind. \A also regis-
+ ters a one-character lookbehind, though it does not actually inspect
+ the previous character. This is to ensure that at least one character
+ from the old segment is retained when a new segment is processed. Oth-
+ erwise, if there are no lookbehinds in the pattern, \A might match
+ incorrectly at the start of a second or subsequent segment.
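+
+ The following sketch shows one way to decide how much of an old seg-
+ ment to keep, given the value returned for PCRE2_INFO_MAXLOOKBEHIND.
+ It assumes the 8-bit library and a valid UTF-8 segment. The function
+ name is hypothetical and not part of PCRE2.
+

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the offset at which the retained tail of
   a UTF-8 segment begins, keeping the last max_lookbehind characters
   (or the whole segment if it is shorter than that). */
size_t retained_tail_offset(const uint8_t *segment, size_t length,
                            uint32_t max_lookbehind)
{
    size_t offset = length;

    while (max_lookbehind > 0 && offset > 0) {
        offset--;
        /* Step back over continuation bytes (10xxxxxx) so that the
           offset lands on the first code unit of a character. */
        while (offset > 0 && (segment[offset] & 0xC0) == 0x80)
            offset--;
        max_lookbehind--;
    }
    return offset;
}
```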
+
+ PCRE2_INFO_MINLENGTH
+
+ If a minimum length for matching subject strings was computed, its
+ value is returned. Otherwise the returned value is 0. The value is a
+ number of characters, which in UTF mode may be different from the num-
+ ber of code units. The third argument should point to a uint32_t
+ variable. The value is a lower bound to the length of any matching
+ string. There may not be any strings of that length that do actually
+ match, but every string that does match is at least that long.
+
+ PCRE2_INFO_NAMECOUNT
+ PCRE2_INFO_NAMEENTRYSIZE
+ PCRE2_INFO_NAMETABLE
+
+ PCRE2 supports the use of named as well as numbered capturing parenthe-
+ ses. The names are just an additional way of identifying the parenthe-
+ ses, which still acquire numbers. Several convenience functions such as
+ pcre2_substring_get_byname() are provided for extracting captured sub-
+ strings by name. It is also possible to extract the data directly, by
+ first converting the name to a number in order to access the correct
+ pointers in the output vector (described with pcre2_match() below). To
+ do the conversion, you need to use the name-to-number map, which is
+ described by these three values.
+
+ The map consists of a number of fixed-size entries. PCRE2_INFO_NAME-
+ COUNT gives the number of entries, and PCRE2_INFO_NAMEENTRYSIZE gives
+ the size of each entry in code units; both of these return a uint32_t
+ value. The entry size depends on the length of the longest name.
+
+ PCRE2_INFO_NAMETABLE returns a pointer to the first entry of the table.
+ This is a PCRE2_SPTR pointer to a block of code units. In the 8-bit
+ library, the first two bytes of each entry are the number of the cap-
+ turing parenthesis, most significant byte first. In the 16-bit library,
+ the pointer points to 16-bit code units, the first of which contains
+ the parenthesis number. In the 32-bit library, the pointer points to
+ 32-bit code units, the first of which contains the parenthesis number.
+ The rest of the entry is the corresponding name, zero terminated.
+
+ The names are in alphabetical order. If (?| is used to create multiple
+ groups with the same number, as described in the section on duplicate
+ subpattern numbers in the pcre2pattern page, the groups may be given
+ the same name, but there is only one entry in the table. Different
+ names for groups of the same number are not permitted.
+
+ Duplicate names for subpatterns with different numbers are permitted,
+ but only if PCRE2_DUPNAMES is set. They appear in the table in the
+ order in which they were found in the pattern. In the absence of (?|
+ this is the order of increasing number; when (?| is used this is not
+ necessarily the case because later subpatterns may have lower numbers.
+
+ As a simple example of the name/number table, consider the following
+ pattern after compilation by the 8-bit library (assume PCRE2_EXTENDED
+ is set, so white space - including newlines - is ignored):
+
+ (?<date> (?<year>(\d\d)?\d\d) -
+ (?<month>\d\d) - (?<day>\d\d) )
+
+ There are four named subpatterns, so the table has four entries, and
+ each entry in the table is eight bytes long. The table is as follows,
+ with non-printing bytes shown in hexadecimal, and undefined bytes shown
+ as ??:
+
+ 00 01 d a t e 00 ??
+ 00 05 d a y 00 ?? ??
+ 00 04 m o n t h 00
+ 00 02 y e a r 00 ??
+
+ When writing code to extract data from named subpatterns using the
+ name-to-number map, remember that the length of the entries is likely
+ to be different for each compiled pattern.
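+
+ A minimal sketch of a lookup in the 8-bit form of the table follows.
+ The function name is hypothetical. The table pointer, entry count, and
+ entry size are assumed to have been obtained from the three values
+ described above. If duplicate names are present, only the first entry
+ found is reported.
+

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper for the 8-bit library: each entry holds a
   two-byte group number (most significant byte first) followed by the
   zero-terminated name. Returns the group number, or -1 if the name is
   not in the table. */
int group_number_for_name(const uint8_t *table, uint32_t name_count,
                          uint32_t entry_size, const char *name)
{
    for (uint32_t i = 0; i < name_count; i++) {
        /* Entries are fixed-size, so step by entry_size, not by the
           length of each name. */
        const uint8_t *entry = table + (size_t)i * entry_size;

        if (strcmp((const char *)(entry + 2), name) == 0)
            return (entry[0] << 8) | entry[1];
    }
    return -1;
}
```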
+
+ PCRE2_INFO_NEWLINE
+
+ The output is one of the following uint32_t values:
+
+ PCRE2_NEWLINE_CR Carriage return (CR)
+ PCRE2_NEWLINE_LF Linefeed (LF)
+ PCRE2_NEWLINE_CRLF Carriage return, linefeed (CRLF)
+ PCRE2_NEWLINE_ANY Any Unicode line ending
+ PCRE2_NEWLINE_ANYCRLF Any of CR, LF, or CRLF
+ PCRE2_NEWLINE_NUL The NUL character (binary zero)
+
+ This identifies the character sequence that will be recognized as mean-
+ ing "newline" while matching.
+
+ PCRE2_INFO_SIZE
+
+ Return the size of the compiled pattern in bytes (for all three
+ libraries). The third argument should point to a size_t variable. This
+ value includes the size of the general data block that precedes the
+ code units of the compiled pattern itself. The value that is used when
+ pcre2_compile() is getting memory in which to place the compiled pat-
+ tern may be slightly larger than the value returned by this option,
+ because there are cases where the code that calculates the size has to
+ over-estimate. Processing a pattern with the JIT compiler does not
+ alter the value returned by this option.
+
+
+INFORMATION ABOUT A PATTERN'S CALLOUTS
+
+ int pcre2_callout_enumerate(const pcre2_code *code,
+ int (*callback)(pcre2_callout_enumerate_block *, void *),
+ void *user_data);
+
+ A script language that supports the use of string arguments in callouts
+ might like to scan all the callouts in a pattern before running the
+ match. This can be done by calling pcre2_callout_enumerate(). The first
+ argument is a pointer to a compiled pattern, the second points to a
+ callback function, and the third is arbitrary user data. The callback
+ function is called for every callout in the pattern in the order in
+ which they appear. Its first argument is a pointer to a callout enumer-
+ ation block, and its second argument is the user_data value that was
+ passed to pcre2_callout_enumerate(). The contents of the callout enu-
+ meration block are described in the pcre2callout documentation, which
+ also gives further details about callouts.
+
+
+SERIALIZATION AND PRECOMPILING
+
+ It is possible to save compiled patterns on disc or elsewhere, and
+ reload them later, subject to a number of restrictions. The host on
+ which the patterns are reloaded must be running the same version of
+ PCRE2, with the same code unit width, and must also have the same endi-
+ anness, pointer width, and PCRE2_SIZE type. Before compiled patterns
+ can be saved, they must be converted to a "serialized" form, which in
+ the case of PCRE2 is really just a bytecode dump. The functions whose
+ names begin with pcre2_serialize_ are used for converting to and from
+ the serialized form. They are described in the pcre2serialize documen-
+ tation. Note that PCRE2 serialization does not convert compiled pat-
+ terns to an abstract format like Java or .NET serialization.
+
+
+THE MATCH DATA BLOCK
+
+ pcre2_match_data *pcre2_match_data_create(uint32_t ovecsize,
+ pcre2_general_context *gcontext);
+
+ pcre2_match_data *pcre2_match_data_create_from_pattern(
+ const pcre2_code *code, pcre2_general_context *gcontext);
+
+ void pcre2_match_data_free(pcre2_match_data *match_data);
+
+ Information about a successful or unsuccessful match is placed in a
+ match data block, which is an opaque structure that is accessed by
+ function calls. In particular, the match data block contains a vector
+ of offsets into the subject string that define the matched part of the
+ subject and any substrings that were captured. This is known as the
+ ovector.
+
+ Before calling pcre2_match(), pcre2_dfa_match(), or pcre2_jit_match()
+ you must create a match data block by calling one of the creation func-
+ tions above. For pcre2_match_data_create(), the first argument is the
+ number of pairs of offsets in the ovector. One pair of offsets is
+ required to identify the string that matched the whole pattern, with an
+ additional pair for each captured substring. For example, a value of 4
+ creates enough space to record the matched portion of the subject plus
+ three captured substrings. A minimum of 1 pair is imposed by
+ pcre2_match_data_create(), so it is always possible to return the over-
+ all matched string.
+
+ The second argument of pcre2_match_data_create() is a pointer to a gen-
+ eral context, which can specify custom memory management for obtaining
+ the memory for the match data block. If you are not using custom memory
+ management, pass NULL, which causes malloc() to be used.
+
+ For pcre2_match_data_create_from_pattern(), the first argument is a
+ pointer to a compiled pattern. The ovector is created to be exactly the
+ right size to hold all the substrings a pattern might capture. The sec-
+ ond argument is again a pointer to a general context, but in this case
+ if NULL is passed, the memory is obtained using the same allocator that
+ was used for the compiled pattern (custom or default).
+
+ A match data block can be used many times, with the same or different
+ compiled patterns. You can extract information from a match data block
+ after a match operation has finished, using functions that are
+ described in the sections on matched strings and other match data
+ below.
+
+ When a call of pcre2_match() fails, valid data is available in the
+ match block only when the error is PCRE2_ERROR_NOMATCH,
+ PCRE2_ERROR_PARTIAL, or one of the error codes for an invalid UTF
+ string. Exactly what is available depends on the error, and is detailed
+ below.
+
+ When one of the matching functions is called, pointers to the compiled
+ pattern and the subject string are set in the match data block so that
+ they can be referenced by the extraction functions. After running a
+ match, you must not free a compiled pattern or a subject string until
+ after all operations on the match data block (for that match) have
+ taken place.
+
+ When a match data block itself is no longer needed, it should be freed
+ by calling pcre2_match_data_free(). If this function is called with a
+ NULL argument, it returns immediately, without doing anything.
+
+
+MATCHING A PATTERN: THE TRADITIONAL FUNCTION
+
+ int pcre2_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext);
+
+ The function pcre2_match() is called to match a subject string against
+ a compiled pattern, which is passed in the code argument. You can call
+ pcre2_match() with the same code argument as many times as you like, in
+ order to find multiple matches in the subject string or to match dif-
+ ferent subject strings with the same pattern.
+
+ This function is the main matching facility of the library, and it
+ operates in a Perl-like manner. For specialist use there is also an
+ alternative matching function, which is described below in the section
+ about the pcre2_dfa_match() function.
+
+ Here is an example of a simple call to pcre2_match():
+
+ pcre2_match_data *md = pcre2_match_data_create(4, NULL);
+ int rc = pcre2_match(
+ re, /* result of pcre2_compile() */
+ (PCRE2_SPTR)"some string", /* the subject string */
+ 11, /* the length of the subject string */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ md, /* the match data block */
+ NULL); /* a match context; NULL means use defaults */
+
+ If the subject string is zero-terminated, the length can be given as
+ PCRE2_ZERO_TERMINATED. A match context must be provided if certain less
+ common matching parameters are to be changed. For details, see the sec-
+ tion on the match context above.
+
+ The string to be matched by pcre2_match()
+
+ The subject string is passed to pcre2_match() as a pointer in subject,
+ a length in length, and a starting offset in startoffset. The length
+ and offset are in code units, not characters. That is, they are in
+ bytes for the 8-bit library, 16-bit code units for the 16-bit library,
+ and 32-bit code units for the 32-bit library, whether or not UTF pro-
+ cessing is enabled.
+
+ If startoffset is greater than the length of the subject, pcre2_match()
+ returns PCRE2_ERROR_BADOFFSET. When the starting offset is zero, the
+ search for a match starts at the beginning of the subject, and this is
+ by far the most common case. In UTF-8 or UTF-16 mode, the starting off-
+ set must point to the start of a character, or to the end of the sub-
+ ject (in UTF-32 mode, one code unit equals one character, so all off-
+ sets are valid). Like the pattern string, the subject may contain
+ binary zeros.
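+
+ The offset rules above can be sketched for the 8-bit library in UTF-8
+ mode as follows. The helper name is hypothetical, and the subject is
+ assumed to be valid UTF-8.
+

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pre-check of a starting offset for a UTF-8 subject.
   The offset and length are counted in code units (bytes). */
bool is_valid_utf8_start_offset(const uint8_t *subject, size_t length,
                                size_t startoffset)
{
    if (startoffset > length)
        return false;              /* past the end of the subject */
    if (startoffset == length)
        return true;               /* the end of the subject is allowed */

    /* A continuation byte (10xxxxxx) is not the start of a character. */
    return (subject[startoffset] & 0xC0) != 0x80;
}
```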
+
+ A non-zero starting offset is useful when searching for another match
+ in the same subject by calling pcre2_match() again after a previous
+ success. Setting startoffset differs from passing over a shortened
+ string and setting PCRE2_NOTBOL in the case of a pattern that begins
+ with any kind of lookbehind. For example, consider the pattern
+
+ \Biss\B
+
+ which finds occurrences of "iss" in the middle of words. (\B matches
+ only if the current position in the subject is not a word boundary.)
+ When applied to the string "Mississippi" the first call to pcre2_match()
+ finds the first occurrence. If pcre2_match() is called again with just
+ the remainder of the subject, namely "issippi", it does not match,
+ because \B is always false at the start of the subject, which is deemed
+ to be a word boundary. However, if pcre2_match() is passed the entire
+ string again, but with startoffset set to 4, it finds the second occur-
+ rence of "iss" because it is able to look behind the starting point to
+ discover that it is preceded by a letter.
+
+ Finding all the matches in a subject is tricky when the pattern can
+ match an empty string. It is possible to emulate Perl's /g behaviour by
+ first trying the match again at the same offset, with the
+ PCRE2_NOTEMPTY_ATSTART and PCRE2_ANCHORED options, and then if that
+ fails, advancing the starting offset and trying an ordinary match
+ again. There is some code that demonstrates how to do this in the
+ pcre2demo sample program. In the most general case, you have to check
+ to see if the newline convention recognizes CRLF as a newline, and if
+ so, and the current character is CR followed by LF, advance the start-
+ ing offset by two characters instead of one.
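+
+ The advancing step can be sketched as below for the 8-bit library.
+ This is an illustration only. The function name and both flags are
+ hypothetical, and the caller is assumed to have already worked out
+ whether CRLF is a recognized newline and whether UTF-8 is in use.
+

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: compute the next starting offset after the
   anchored, non-empty retry at `offset` has failed.
   Precondition: offset < length. */
size_t advance_after_failed_retry(const uint8_t *subject, size_t length,
                                  size_t offset, bool crlf_is_newline,
                                  bool utf8)
{
    size_t next = offset + 1;

    if (crlf_is_newline && offset + 1 < length &&
        subject[offset] == '\r' && subject[offset + 1] == '\n') {
        next += 1;                 /* step over both CR and LF */
    } else if (utf8) {
        /* Skip continuation bytes so that the new offset points to
           the start of a character. */
        while (next < length && (subject[next] & 0xC0) == 0x80)
            next++;
    }
    return next;
}
```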
+
+ If a non-zero starting offset is passed when the pattern is anchored, a
+ single attempt to match at the given offset is made. This can only suc-
+ ceed if the pattern does not require the match to be at the start of
+ the subject. In other words, the anchoring must be the result of set-
+ ting the PCRE2_ANCHORED option or the use of .* with PCRE2_DOTALL, not
+ by starting the pattern with ^ or \A.
+
+ Option bits for pcre2_match()
+
+ The unused bits of the options argument for pcre2_match() must be zero.
+ The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDANCHORED,
+ PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+ PCRE2_NO_JIT, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PAR-
+ TIAL_SOFT. Their action is described below.
+
+ Setting PCRE2_ANCHORED or PCRE2_ENDANCHORED at match time is not sup-
+ ported by the just-in-time (JIT) compiler. If it is set, JIT matching
+ is disabled and the interpretive code in pcre2_match() is run. Apart
+ from PCRE2_NO_JIT (obviously), the remaining options are supported for
+ JIT matching.
+
+ PCRE2_ANCHORED
+
+ The PCRE2_ANCHORED option limits pcre2_match() to matching at the first
+ matching position. If a pattern was compiled with PCRE2_ANCHORED, or
+ turned out to be anchored by virtue of its contents, it cannot be made
+ unanchored at matching time. Note that setting the option at match time
+ disables JIT matching.
+
+ PCRE2_ENDANCHORED
+
+ If the PCRE2_ENDANCHORED option is set, any string that pcre2_match()
+ matches must be right at the end of the subject string. Note that set-
+ ting the option at match time disables JIT matching.
+
+ PCRE2_NOTBOL
+
+ This option specifies that the first character of the subject string is not
+ the beginning of a line, so the circumflex metacharacter should not
+ match before it. Setting this without having set PCRE2_MULTILINE at
+ compile time causes circumflex never to match. This option affects only
+ the behaviour of the circumflex metacharacter. It does not affect \A.
+
+ PCRE2_NOTEOL
+
+ This option specifies that the end of the subject string is not the end
+ of a line, so the dollar metacharacter should not match it nor (except
+ in multiline mode) a newline immediately before it. Setting this with-
+ out having set PCRE2_MULTILINE at compile time causes dollar never to
+ match. This option affects only the behaviour of the dollar metacharac-
+ ter. It does not affect \Z or \z.
+
+ PCRE2_NOTEMPTY
+
+ An empty string is not considered to be a valid match if this option is
+ set. If there are alternatives in the pattern, they are tried. If all
+ the alternatives match the empty string, the entire match fails. For
+ example, if the pattern
+
+ a?b?
+
+ is applied to a string not beginning with "a" or "b", it matches an
+ empty string at the start of the subject. With PCRE2_NOTEMPTY set, this
+ match is not valid, so pcre2_match() searches further into the string
+ for occurrences of "a" or "b".
+
+ PCRE2_NOTEMPTY_ATSTART
+
+ This is like PCRE2_NOTEMPTY, except that it locks out an empty string
+ match only at the first matching position, that is, at the start of the
+ subject plus the starting offset. An empty string match later in the
+ subject is permitted. If the pattern is anchored, such a match can
+ occur only if the pattern contains \K.
+
+ PCRE2_NO_JIT
+
+ By default, if a pattern has been successfully processed by
+ pcre2_jit_compile(), JIT is automatically used when pcre2_match() is
+ called with options that JIT supports. Setting PCRE2_NO_JIT disables
+ the use of JIT; it forces matching to be done by the interpreter.
+
+ PCRE2_NO_UTF_CHECK
+
+ When PCRE2_UTF is set at compile time, the validity of the subject as a
+ UTF string is checked by default when pcre2_match() is subsequently
+ called. If a non-zero starting offset is given, the check is applied
+ only to that part of the subject that could be inspected during match-
+ ing, and there is a check that the starting offset points to the first
+ code unit of a character or to the end of the subject. If there are no
+ lookbehind assertions in the pattern, the check starts at the starting
+ offset. Otherwise, it starts at the length of the longest lookbehind
+ before the starting offset, or at the start of the subject if there are
+ not that many characters before the starting offset. Note that the
+ sequences \b and \B are one-character lookbehinds.
+
+ The check is carried out before any other processing takes place, and a
+ negative error code is returned if the check fails. There are several
+ UTF error codes for each code unit width, corresponding to different
+ problems with the code unit sequence. There are discussions about the
+ validity of UTF-8 strings, UTF-16 strings, and UTF-32 strings in the
+ pcre2unicode page.
+
+ If you know that your subject is valid, and you want to skip these
+ checks for performance reasons, you can set the PCRE2_NO_UTF_CHECK
+ option when calling pcre2_match(). You might want to do this for the
+ second and subsequent calls to pcre2_match() if you are making repeated
+ calls to find other matches in the same subject string.
+
+ Warning: When PCRE2_NO_UTF_CHECK is set, the effect of passing an
+ invalid string as a subject, or an invalid value of startoffset, is
+ undefined. Your program may crash or loop indefinitely.
+
+ PCRE2_PARTIAL_HARD
+ PCRE2_PARTIAL_SOFT
+
+ These options turn on the partial matching feature. A partial match
+ occurs if the end of the subject string is reached successfully, but
+ there are not enough subject characters to complete the match. If this
+ happens when PCRE2_PARTIAL_SOFT (but not PCRE2_PARTIAL_HARD) is set,
+ matching continues by testing any remaining alternatives. Only if no
+ complete match can be found is PCRE2_ERROR_PARTIAL returned instead of
+ PCRE2_ERROR_NOMATCH. In other words, PCRE2_PARTIAL_SOFT specifies that
+ the caller is prepared to handle a partial match, but only if no com-
+ plete match can be found.
+
+ If PCRE2_PARTIAL_HARD is set, it overrides PCRE2_PARTIAL_SOFT. In this
+ case, if a partial match is found, pcre2_match() immediately returns
+ PCRE2_ERROR_PARTIAL, without considering any other alternatives. In
+ other words, when PCRE2_PARTIAL_HARD is set, a partial match is consid-
+ ered to be more important than an alternative complete match.
+
+ There is a more detailed discussion of partial and multi-segment match-
+ ing, with examples, in the pcre2partial documentation.
+
+
+NEWLINE HANDLING WHEN MATCHING
+
+ When PCRE2 is built, a default newline convention is set; this is usu-
+ ally the standard convention for the operating system. The default can
+ be overridden in a compile context by calling pcre2_set_newline(). It
+ can also be overridden by starting a pattern string with, for example,
+ (*CRLF), as described in the section on newline conventions in the
+ pcre2pattern page. During matching, the newline choice affects the be-
+ haviour of the dot, circumflex, and dollar metacharacters. It may also
+ alter the way the match starting position is advanced after a match
+ failure for an unanchored pattern.
+
+ When PCRE2_NEWLINE_CRLF, PCRE2_NEWLINE_ANYCRLF, or PCRE2_NEWLINE_ANY is
+ set as the newline convention, and a match attempt for an unanchored
+ pattern fails when the current starting position is at a CRLF sequence,
+ and the pattern contains no explicit matches for CR or LF characters,
+ the match position is advanced by two characters instead of one, in
+ other words, to after the CRLF.
+
+ The above rule is a compromise that makes the most common cases work as
+ expected. For example, if the pattern is .+A (and the PCRE2_DOTALL
+ option is not set), it does not match the string "\r\nA" because, after
+ failing at the start, it skips both the CR and the LF before retrying.
+ However, the pattern [\r\n]A does match that string, because it con-
+ tains an explicit CR or LF reference, and so advances only by one char-
+ acter after the first failure.
+
+ An explicit match for CR or LF is either a literal appearance of one of
+ those characters in the pattern, or one of the \r or \n or equivalent
+ octal or hexadecimal escape sequences. Implicit matches such as [^X] do
+ not count, nor does \s, even though it includes CR and LF in the char-
+ acters that it matches.
+
+ Notwithstanding the above, anomalous effects may still occur when CRLF
+ is a valid newline sequence and explicit \r or \n escapes appear in the
+ pattern.
+
+
+HOW PCRE2_MATCH() RETURNS A STRING AND CAPTURED SUBSTRINGS
+
+ uint32_t pcre2_get_ovector_count(pcre2_match_data *match_data);
+
+ PCRE2_SIZE *pcre2_get_ovector_pointer(pcre2_match_data *match_data);
+
+ In general, a pattern matches a certain portion of the subject, and in
+ addition, further substrings from the subject may be picked out by
+ parenthesized parts of the pattern. Following the usage in Jeffrey
+ Friedl's book, this is called "capturing" in what follows, and the
+ phrase "capturing subpattern" or "capturing group" is used for a frag-
+ ment of a pattern that picks out a substring. PCRE2 supports several
+ other kinds of parenthesized subpattern that do not cause substrings to
+ be captured. The pcre2_pattern_info() function can be used to find out
+ how many capturing subpatterns there are in a compiled pattern.
+
+ You can use auxiliary functions for accessing captured substrings by
+ number or by name, as described in sections below.
+
+ Alternatively, you can make direct use of the vector of PCRE2_SIZE val-
+ ues, called the ovector, which contains the offsets of captured
+ strings. It is part of the match data block. The function
+ pcre2_get_ovector_pointer() returns the address of the ovector, and
+ pcre2_get_ovector_count() returns the number of pairs of values it con-
+ tains.
+
+ Within the ovector, the first in each pair of values is set to the off-
+ set of the first code unit of a substring, and the second is set to the
+ offset of the first code unit after the end of a substring. These val-
+ ues are always code unit offsets, not character offsets. That is, they
+ are byte offsets in the 8-bit library, 16-bit offsets in the 16-bit
+ library, and 32-bit offsets in the 32-bit library.
+
+ After a partial match (error return PCRE2_ERROR_PARTIAL), only the
+ first pair of offsets (that is, ovector[0] and ovector[1]) are set.
+ They identify the part of the subject that was partially matched. See
+ the pcre2partial documentation for details of partial matching.
+
+ After a fully successful match, the first pair of offsets identifies
+ the portion of the subject string that was matched by the entire pat-
+ tern. The next pair is used for the first captured substring, and so
+ on. The value returned by pcre2_match() is one more than the highest
+ numbered pair that has been set. For example, if two substrings have
+ been captured, the returned value is 3. If there are no captured sub-
+ strings, the return value from a successful match is 1, indicating that
+ just the first pair of offsets has been set.
+
+ If a pattern uses the \K escape sequence within a positive assertion,
+ the reported start of a successful match can be greater than the end of
+ the match. For example, if the pattern (?=ab\K) is matched against
+ "ab", the start and end offset values for the match are 2 and 0.
+
+ If a capturing subpattern group is matched repeatedly within a single
+ match operation, it is the last portion of the subject that it matched
+ that is returned.
+
+ If the ovector is too small to hold all the captured substring offsets,
+ as much as possible is filled in, and the function returns a value of
+ zero. If captured substrings are not of interest, pcre2_match() may be
+ called with a match data block whose ovector is of minimum length (that
+ is, one pair).
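+
+ A caller that walks the ovector needs to know how many pairs to look
+ at. The sketch below derives that count from the return value and the
+ number of pairs in the match data block. The function name is hypo-
+ thetical. Negative return values, including partial matches, are
+ deliberately left to separate error handling.
+

```c
#include <stdint.h>

/* Hypothetical helper: number of ovector pairs worth inspecting.
   rc is the value returned by the match call; ovector_count is the
   number of pairs in the match data block. */
uint32_t usable_pair_count(int rc, uint32_t ovector_count)
{
    if (rc < 0)
        return 0;                  /* error or no match: handle elsewhere */
    if (rc == 0)
        return ovector_count;      /* ovector too small: it was filled */
    return (uint32_t)rc;           /* one more than highest pair set */
}
```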
+
+ It is possible for capturing subpattern number n+1 to match some part
+ of the subject when subpattern n has not been used at all. For example,
+ if the string "abc" is matched against the pattern (a|(z))(bc) the
+ return from the function is 4, and subpatterns 1 and 3 are matched, but
+ 2 is not. When this happens, both values in the offset pairs corre-
+ sponding to unused subpatterns are set to PCRE2_UNSET.
+
+ Offset values that correspond to unused subpatterns at the end of the
+ expression are also set to PCRE2_UNSET. For example, if the string
+ "abc" is matched against the pattern (abc)(x(yz)?)? subpatterns 2 and 3
+ are not matched. The return from the function is 2, because the high-
+ est used capturing subpattern number is 1. The offsets for the sec-
+ ond and third capturing subpatterns (assuming the vector is large
+ enough, of course) are set to PCRE2_UNSET.
+
+ Elements in the ovector that do not correspond to capturing parentheses
+ in the pattern are never changed. That is, if a pattern contains n cap-
+ turing parentheses, no more than ovector[0] to ovector[2n+1] are set by
+ pcre2_match(). The other elements retain whatever values they previ-
+ ously had. After a failed match attempt, the contents of the ovector
+ are unchanged.
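+
+ The rules for reading offset pairs can be sketched as follows. The
+ helper and the OFFSET_UNSET macro are hypothetical stand-ins: the
+ example assumes that PCRE2_SIZE is size_t and that PCRE2_UNSET is the
+ all-ones value of that type. It also assumes a positive return value
+ from the match, so that the ovector was large enough.
+

```c
#include <stddef.h>

/* Stand-in for PCRE2_UNSET (assumed to be the all-ones size_t). */
#define OFFSET_UNSET (~(size_t)0)

/* Hypothetical helper: fetch the span of group n from an ovector.
   rc is the positive value returned by a successful match. Returns 1
   and fills *start and *length if the group was set, otherwise 0. */
int group_span(const size_t *ovector, int rc, int n,
               size_t *start, size_t *length)
{
    if (n < 0 || n >= rc)
        return 0;                  /* beyond the highest pair that was set */

    size_t s = ovector[2 * n];
    size_t e = ovector[2 * n + 1];

    if (s == OFFSET_UNSET || e == OFFSET_UNSET)
        return 0;                  /* group did not take part in the match */
    if (s > e)
        return 0;                  /* \K inside an assertion: no plain span */

    *start = s;
    *length = e - s;
    return 1;
}
```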
+
+
+OTHER INFORMATION ABOUT A MATCH
+
+ PCRE2_SPTR pcre2_get_mark(pcre2_match_data *match_data);
+
+ PCRE2_SIZE pcre2_get_startchar(pcre2_match_data *match_data);
+
+ As well as the offsets in the ovector, other information about a match
+ is retained in the match data block and can be retrieved by the above
+ functions in appropriate circumstances. If they are called at other
+ times, the result is undefined.
+
+ After a successful match, a partial match (PCRE2_ERROR_PARTIAL), or a
+ failure to match (PCRE2_ERROR_NOMATCH), a (*MARK), (*PRUNE), or (*THEN)
+ name may be available. The function pcre2_get_mark() can be called to
+ access this name. The same function applies to all three verbs. It
+ returns a pointer to the zero-terminated name, which is within the com-
+ piled pattern. If no name is available, NULL is returned. The length of
+ the name (excluding the terminating zero) is stored in the code unit
+ that precedes the name. You should use this length instead of relying
+ on the terminating zero if the name might contain a binary zero.
+
+ After a successful match, the name that is returned is the last
+ (*MARK), (*PRUNE), or (*THEN) name encountered on the matching path
+ through the pattern. Instances of (*PRUNE) and (*THEN) without names
+ are ignored. Thus, for example, if the matching path contains
+ (*MARK:A)(*PRUNE), the name "A" is returned. After a "no match" or a
+ partial match, the last encountered name is returned. For example,
+ consider this pattern:
+
+ ^(*MARK:A)((*MARK:B)a|b)c
+
+ When it matches "bc", the returned name is A. The B mark is "seen" in
+ the first branch of the group, but it is not on the matching path. On
+ the other hand, when this pattern fails to match "bx", the returned
+ name is B.
+
+ Warning: By default, certain start-of-match optimizations are used to
+ give a fast "no match" result in some situations. For example, if the
+ anchoring is removed from the pattern above, there is an initial check
+ for the presence of "c" in the subject before running the matching
+ engine. This check fails for "bx", causing a match failure without see-
+ ing any marks. You can disable the start-of-match optimizations by set-
+ ting the PCRE2_NO_START_OPTIMIZE option for pcre2_compile() or starting
+ the pattern with (*NO_START_OPT).
+
+ After a successful match, a partial match, or one of the invalid UTF
+ errors (for example, PCRE2_ERROR_UTF8_ERR5), pcre2_get_startchar() can
+ be called. After a successful or partial match it returns the code unit
+ offset of the character at which the match started. For a non-partial
+ match, this can be different to the value of ovector[0] if the pattern
+ contains the \K escape sequence. After a partial match, however, this
+ value is always the same as ovector[0] because \K does not affect the
+ result of a partial match.
+
+ After a UTF check failure, pcre2_get_startchar() can be used to obtain
+ the code unit offset of the invalid UTF character. Details are given in
+ the pcre2unicode page.
+
+
+ERROR RETURNS FROM pcre2_match()
+
+ If pcre2_match() fails, it returns a negative number. This can be con-
+ verted to a text string by calling the pcre2_get_error_message() func-
+ tion (see "Obtaining a textual error message" below). Negative error
+ codes are also returned by other functions, and are documented with
+ them. The codes are given names in the header file. If UTF checking is
+ in force and an invalid UTF subject string is detected, one of a number
+ of UTF-specific negative error codes is returned. Details are given in
+ the pcre2unicode page. The following are the other errors that may be
+ returned by pcre2_match():
+
+ PCRE2_ERROR_NOMATCH
+
+ The subject string did not match the pattern.
+
+ PCRE2_ERROR_PARTIAL
+
+ The subject string did not match, but it did match partially. See the
+ pcre2partial documentation for details of partial matching.
+
+ PCRE2_ERROR_BADMAGIC
+
+ PCRE2 stores a 4-byte "magic number" at the start of the compiled code,
+ to catch the case when it is passed a junk pointer. This is the error
+ that is returned when the magic number is not present.
+
+ PCRE2_ERROR_BADMODE
+
+ This error is given when a compiled pattern is passed to a function in
+ a library of a different code unit width, for example, a pattern com-
+ piled by the 8-bit library is passed to a 16-bit or 32-bit library
+ function.
+
+ PCRE2_ERROR_BADOFFSET
+
+ The value of startoffset was greater than the length of the subject.
+
+ PCRE2_ERROR_BADOPTION
+
+ An unrecognized bit was set in the options argument.
+
+ PCRE2_ERROR_BADUTFOFFSET
+
+ The UTF code unit sequence that was passed as a subject was checked and
+ found to be valid (the PCRE2_NO_UTF_CHECK option was not set), but the
+ value of startoffset did not point to the beginning of a UTF character
+ or the end of the subject.
+
+ PCRE2_ERROR_CALLOUT
+
+ This error is never generated by pcre2_match() itself. It is provided
+ for use by callout functions that want to cause pcre2_match() or
+ pcre2_callout_enumerate() to return a distinctive error code. See the
+ pcre2callout documentation for details.
+
+ PCRE2_ERROR_DEPTHLIMIT
+
+ The nested backtracking depth limit was reached.
+
+ PCRE2_ERROR_HEAPLIMIT
+
+ The heap limit was reached.
+
+ PCRE2_ERROR_INTERNAL
+
+ An unexpected internal error has occurred. This error could be caused
+ by a bug in PCRE2 or by overwriting of the compiled pattern.
+
+ PCRE2_ERROR_JIT_STACKLIMIT
+
+ This error is returned when a pattern that was successfully compiled
+ by pcre2_jit_compile() is being matched, but the memory available for
+ the just-in-time processing stack is not large enough. See the
+ pcre2jit documentation for more details.
+
+ PCRE2_ERROR_MATCHLIMIT
+
+ The backtracking match limit was reached.
+
+ PCRE2_ERROR_NOMEMORY
+
+ If a pattern contains many nested backtracking points, heap memory is
+ used to remember them. This error is given when the memory allocation
+ function (default or custom) fails. Note that a different error,
+ PCRE2_ERROR_HEAPLIMIT, is given if the amount of memory needed exceeds
+ the heap limit.
+
+ PCRE2_ERROR_NULL
+
+ Either the code, subject, or match_data argument was passed as NULL.
+
+ PCRE2_ERROR_RECURSELOOP
+
+ This error is returned when pcre2_match() detects a recursion loop
+ within the pattern. Specifically, it means that either the whole pat-
+ tern or a subpattern has been called recursively for the second time at
+ the same position in the subject string. Some simple patterns that
+ might do this are detected and faulted at compile time, but more com-
+ plicated cases, in particular mutual recursions between two different
+ subpatterns, cannot be detected until matching is attempted.
+
+
+OBTAINING A TEXTUAL ERROR MESSAGE
+
+ int pcre2_get_error_message(int errorcode, PCRE2_UCHAR *buffer,
+ PCRE2_SIZE bufflen);
+
+ A text message for an error code from any PCRE2 function (compile,
+ match, or auxiliary) can be obtained by calling pcre2_get_error_mes-
+ sage(). The code is passed as the first argument, with the remaining
+ two arguments specifying a code unit buffer and its length in code
+ units, into which the text message is placed. The message is returned
+ in code units of the appropriate width for the library that is being
+ used.
+
+ The returned message is terminated with a trailing zero, and the func-
+ tion returns the number of code units used, excluding the trailing
+ zero. If the error number is unknown, the negative error code
+ PCRE2_ERROR_BADDATA is returned. If the buffer is too small, the mes-
+ sage is truncated (but still with a trailing zero), and the negative
+ error code PCRE2_ERROR_NOMEMORY is returned. None of the messages are
+ very long; a buffer size of 120 code units is ample.
+
+
+EXTRACTING CAPTURED SUBSTRINGS BY NUMBER
+
+ int pcre2_substring_length_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_SIZE *length);
+
+ int pcre2_substring_copy_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_UCHAR *buffer,
+ PCRE2_SIZE *bufflen);
+
+ int pcre2_substring_get_bynumber(pcre2_match_data *match_data,
+ uint32_t number, PCRE2_UCHAR **bufferptr,
+ PCRE2_SIZE *bufflen);
+
+ void pcre2_substring_free(PCRE2_UCHAR *buffer);
+
+ Captured substrings can be accessed directly by using the ovector as
+ described above. For convenience, auxiliary functions are provided for
+ extracting captured substrings as new, separate, zero-terminated
+ strings. A substring that contains a binary zero is correctly extracted
+ and has a further zero added on the end, but the result is not, of
+ course, a C string.
+
+ The functions in this section identify substrings by number. The number
+ zero refers to the entire matched substring, with higher numbers refer-
+ ring to substrings captured by parenthesized groups. After a partial
+ match, only substring zero is available. An attempt to extract any
+ other substring gives the error PCRE2_ERROR_PARTIAL. The next section
+ describes similar functions for extracting captured substrings by name.
+
+ If a pattern uses the \K escape sequence within a positive assertion,
+ the reported start of a successful match can be greater than the end of
+ the match. For example, if the pattern (?=ab\K) is matched against
+ "ab", the start and end offset values for the match are 2 and 0. In
+ this situation, calling these functions with a zero substring number
+ extracts a zero-length empty string.
+
+ You can find the length in code units of a captured substring without
+ extracting it by calling pcre2_substring_length_bynumber(). The first
+ argument is a pointer to the match data block, the second is the group
+ number, and the third is a pointer to a variable into which the length
+ is placed. If you just want to know whether or not the substring has
+ been captured, you can pass the third argument as NULL.
+
+ The pcre2_substring_copy_bynumber() function copies a captured sub-
+ string into a supplied buffer, whereas pcre2_substring_get_bynumber()
+ copies it into new memory, obtained using the same memory allocation
+ function that was used for the match data block. The first two argu-
+ ments of these functions are a pointer to the match data block and a
+ capturing group number.
+
+ The final arguments of pcre2_substring_copy_bynumber() are a pointer to
+ the buffer and a pointer to a variable that contains its length in code
+ units. This is updated to contain the actual number of code units used
+ for the extracted substring, excluding the terminating zero.
+
+ For pcre2_substring_get_bynumber() the third and fourth arguments point
+ to variables that are updated with a pointer to the new memory and the
+ number of code units that comprise the substring, again excluding the
+ terminating zero. When the substring is no longer needed, the memory
+ should be freed by calling pcre2_substring_free().
+
+ The return value from all these functions is zero for success, or a
+ negative error code. If the pattern match failed, the match failure
+ code is returned. If a substring number greater than zero is used
+ after a partial match, PCRE2_ERROR_PARTIAL is returned. Other possible
+ error codes are:
+
+ PCRE2_ERROR_NOMEMORY
+
+ The buffer was too small for pcre2_substring_copy_bynumber(), or the
+ attempt to get memory failed for pcre2_substring_get_bynumber().
+
+ PCRE2_ERROR_NOSUBSTRING
+
+ There is no substring with that number in the pattern, that is, the
+ number is greater than the number of capturing parentheses.
+
+ PCRE2_ERROR_UNAVAILABLE
+
+ The substring number, though not greater than the number of captures in
+ the pattern, is greater than the number of slots in the ovector, so the
+ substring could not be captured.
+
+ PCRE2_ERROR_UNSET
+
+ The substring did not participate in the match. For example, if the
+ pattern is (abc)|(def) and the subject is "def", and the ovector con-
+ tains at least two capturing slots, substring number 1 is unset.
+
+
+EXTRACTING A LIST OF ALL CAPTURED SUBSTRINGS
+
+ int pcre2_substring_list_get(pcre2_match_data *match_data,
+ PCRE2_UCHAR ***listptr, PCRE2_SIZE **lengthsptr);
+
+ void pcre2_substring_list_free(PCRE2_SPTR *list);
+
+ The pcre2_substring_list_get() function extracts all available sub-
+ strings and builds a list of pointers to them. It also (optionally)
+ builds a second list that contains their lengths (in code units),
+ excluding a terminating zero that is added to each of them. All this is
+ done in a single block of memory that is obtained using the same memory
+ allocation function that was used to get the match data block.
+
+ This function must be called only after a successful match. If called
+ after a partial match, the error code PCRE2_ERROR_PARTIAL is returned.
+
+ The address of the memory block is returned via listptr, which is also
+ the start of the list of string pointers. The end of the list is marked
+ by a NULL pointer. The address of the list of lengths is returned via
+ lengthsptr. If your strings do not contain binary zeros and you do not
+ therefore need the lengths, you may supply NULL as the lengthsptr argu-
+ ment to disable the creation of a list of lengths. The yield of the
+ function is zero if all went well, or PCRE2_ERROR_NOMEMORY if the mem-
+ ory block could not be obtained. When the list is no longer needed, it
+ should be freed by calling pcre2_substring_list_free().
+
+ If this function encounters a substring that is unset, which can happen
+ when capturing subpattern number n+1 matches some part of the subject,
+ but subpattern n has not been used at all, it returns an empty string.
+ This can be distinguished from a genuine zero-length substring by
+ inspecting the appropriate offset in the ovector, which contains
+ PCRE2_UNSET for unset substrings, or by calling
+ pcre2_substring_length_bynumber().
+
+
+EXTRACTING CAPTURED SUBSTRINGS BY NAME
+
+ int pcre2_substring_number_from_name(const pcre2_code *code,
+ PCRE2_SPTR name);
+
+ int pcre2_substring_length_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_SIZE *length);
+
+ int pcre2_substring_copy_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_UCHAR *buffer, PCRE2_SIZE *bufflen);
+
+ int pcre2_substring_get_byname(pcre2_match_data *match_data,
+ PCRE2_SPTR name, PCRE2_UCHAR **bufferptr, PCRE2_SIZE *bufflen);
+
+ void pcre2_substring_free(PCRE2_UCHAR *buffer);
+
+ To extract a substring by name, you first have to find the associated
+ number. For example, for this pattern:
+
+ (a+)b(?<xxx>\d+)...
+
+ the number of the subpattern called "xxx" is 2. If the name is known to
+ be unique (PCRE2_DUPNAMES was not set), you can find the number from
+ the name by calling pcre2_substring_number_from_name(). The first argu-
+ ment is the compiled pattern, and the second is the name. The yield of
+ the function is the subpattern number, PCRE2_ERROR_NOSUBSTRING if there
+ is no subpattern of that name, or PCRE2_ERROR_NOUNIQUESUBSTRING if
+ there is more than one subpattern of that name. Given the number, you
+ can extract the substring directly from the ovector, or use one of the
+ "bynumber" functions described above.
+
+ For convenience, there are also "byname" functions that correspond to
+ the "bynumber" functions, the only difference being that the second
+ argument is a name instead of a number. If PCRE2_DUPNAMES is set and
+ there are duplicate names, these functions scan all the groups with the
+ given name, and return the first named string that is set.
+
+ If there are no groups with the given name, PCRE2_ERROR_NOSUBSTRING is
+ returned. If all groups with the name have numbers that are greater
+ than the number of slots in the ovector, PCRE2_ERROR_UNAVAILABLE is
+ returned. If there is at least one group with a slot in the ovector,
+ but no group is found to be set, PCRE2_ERROR_UNSET is returned.
+
+ Warning: If the pattern uses the (?| feature to set up multiple subpat-
+ terns with the same number, as described in the section on duplicate
+ subpattern numbers in the pcre2pattern page, you cannot use names to
+ distinguish the different subpatterns, because names are not included
+ in the compiled code. The matching process uses only numbers. For this
+ reason, the use of different names for subpatterns of the same number
+ causes an error at compile time.
+
+
+CREATING A NEW STRING WITH SUBSTITUTIONS
+
+ int pcre2_substitute(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext, PCRE2_SPTR replacement,
+ PCRE2_SIZE rlength, PCRE2_UCHAR *outputbuffer,
+ PCRE2_SIZE *outlengthptr);
+
+ This function calls pcre2_match() and then makes a copy of the subject
+ string in outputbuffer, replacing the part that was matched with the
+ replacement string, whose length is supplied in rlength. This can be
+ given as PCRE2_ZERO_TERMINATED for a zero-terminated string. Matches in
+ which a \K item in a lookahead in the pattern causes the match to end
+ before it starts are not supported, and give rise to an error return.
+ For global replacements, matches in which \K in a lookbehind causes the
+ match to start earlier than the point that was reached in the previous
+ iteration are also not supported.
+
+ The first seven arguments of pcre2_substitute() are the same as for
+ pcre2_match(), except that the partial matching options are not permit-
+ ted, and match_data may be passed as NULL, in which case a match data
+ block is obtained and freed within this function, using memory manage-
+ ment functions from the match context, if provided, or else those that
+ were used to allocate memory for the compiled code.
+
+ If an external match_data block is provided, its contents afterwards
+ are those set by the final call to pcre2_match(), which will have ended
+ in a matching error. The contents of the ovector within the match data
+ block may or may not have been changed.
+
+ The outlengthptr argument must point to a variable that contains the
+ length, in code units, of the output buffer. If the function is suc-
+ cessful, the value is updated to contain the length of the new string,
+ excluding the trailing zero that is automatically added.
+
+ If the function is not successful, the value set via outlengthptr
+ depends on the type of error. For syntax errors in the replacement
+ string, the value is the offset in the replacement string where the
+ error was detected. For other errors, the value is PCRE2_UNSET by
+ default. This includes the case of the output buffer being too small,
+ unless PCRE2_SUBSTITUTE_OVERFLOW_LENGTH is set (see below), in which
+ case the value is the minimum length needed, including space for the
+ trailing zero. Note that in order to compute the required length,
+ pcre2_substitute() has to simulate all the matching and copying,
+ instead of giving an error return as soon as the buffer overflows. Note
+ also that the length is in code units, not bytes.
+
+ In the replacement string, which is interpreted as a UTF string in UTF
+ mode, and is checked for UTF validity unless the PCRE2_NO_UTF_CHECK
+ option is set, a dollar character is an escape character that can spec-
+ ify the insertion of characters from capturing groups or (*MARK),
+ (*PRUNE), or (*THEN) items in the pattern. The following forms are
+ always recognized:
+
+ $$ insert a dollar character
+ $<n> or ${<n>} insert the contents of group <n>
+ $*MARK or ${*MARK} insert a (*MARK), (*PRUNE), or (*THEN) name
+
+ Either a group number or a group name can be given for <n>. Curly
+ brackets are required only if the following character would be inter-
+ preted as part of the number or name. The number may be zero to include
+ the entire matched string. For example, if the pattern a(b)c is
+ matched with "=abc=" and the replacement string "+$1$0$1+", the result
+ is "=+babcb+=".
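+
+ The simple numeric forms can be modelled with a few lines of C. This
+ is only a toy sketch, not the library's implementation: the function
+ name sketch_expand() and its arguments are hypothetical, it handles
+ just $$, $n and ${n} with a numeric n, and it assumes that every
+ referenced group is set.
+
```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Toy model of simple "$" expansion; NOT the library's implementation.
   Expands rep into out using offset pairs in ovector for groups 0..ngroups.
   No names, no $*MARK; all referenced groups are assumed to be set.
   Returns 0 on success, -1 on a syntax error, an unknown group number,
   or if out is too small. */
static int sketch_expand(const char *subject, const size_t *ovector,
  size_t ngroups, const char *rep, char *out, size_t outsize)
{
  size_t used = 0;
  assert(subject != NULL && ovector != NULL && rep != NULL && out != NULL);
  if (outsize == 0) return -1;
  for (const char *p = rep; *p != '\0'; )
    {
    const char *src = p;      /* default: copy one literal character */
    size_t len = 1;
    if (*p == '$' && p[1] == '$')
      { src = "$"; p += 2; }  /* $$ inserts a single dollar */
    else if (*p == '$')
      {
      int braced = (p[1] == '{');
      const char *d = p + 1 + braced;
      size_t n = 0;
      if (!isdigit((unsigned char)*d)) return -1;
      while (isdigit((unsigned char)*d)) n = n * 10 + (size_t)(*d++ - '0');
      if (braced && *d++ != '}') return -1;
      if (n > ngroups) return -1;
      src = subject + ovector[2*n];
      len = ovector[2*n + 1] - ovector[2*n];
      p = d;
      }
    else p++;
    if (used + len >= outsize) return -1;   /* keep room for the zero */
    memcpy(out + used, src, len);
    used += len;
    }
  out[used] = '\0';
  return 0;
}
```
+
+ With the subject "=abc=", group 0 at offsets 1 to 4 and group 1 at
+ offsets 2 to 3, the replacement "+$1$0$1+" expands to "+babcb+", which
+ is the text that takes the place of "abc" in the example above.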
+
+ $*MARK inserts the name from the last encountered (*MARK), (*PRUNE), or
+ (*THEN) on the matching path that has a name. (*MARK) must always
+ include a name, but (*PRUNE) and (*THEN) need not. For example, in the
+ case of (*MARK:A)(*PRUNE) the name inserted is "A", but for
+ (*MARK:A)(*PRUNE:B) the relevant name is "B". This facility can be
+ used to perform simple simultaneous substitutions, as this pcre2test
+ example shows:
+
+ /(*MARK:pear)apple|(*MARK:orange)lemon/g,replace=${*MARK}
+ apple lemon
+ 2: pear orange
+
+ As well as the usual options for pcre2_match(), a number of additional
+ options can be set in the options argument of pcre2_substitute().
+
+ PCRE2_SUBSTITUTE_GLOBAL causes the function to iterate over the subject
+ string, replacing every matching substring. If this option is not set,
+ only the first matching substring is replaced. The search for matches
+ takes place in the original subject string (that is, previous replace-
+ ments do not affect it). Iteration is implemented by advancing the
+ startoffset value for each search, which is always passed the entire
+ subject string. If an offset limit is set in the match context, search-
+ ing stops when that limit is reached.
+
+ You can restrict the effect of a global substitution to a portion of
+ the subject string by setting either or both of startoffset and an off-
+ set limit. Here is a pcre2test example:
+
+ /B/g,replace=!,use_offset_limit
+ ABC ABC ABC ABC\=offset=3,offset_limit=12
+ 2: ABC A!C A!C ABC
+
+ When continuing with global substitutions after matching a substring
+ with zero length, an attempt to find a non-empty match at the same off-
+ set is performed. If this is not successful, the offset is advanced by
+ one character except when CRLF is a valid newline sequence and the next
+ two characters are CR, LF. In this case, the offset is advanced by two
+ characters.
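+
+ The advance rule can be pictured with the small C function below. It
+ is a sketch under stated assumptions, not library code: the name
+ sketch_advance() is hypothetical, the subject is treated as 8-bit
+ non-UTF text (so one character is one code unit), and the caller says
+ whether CRLF is a valid newline sequence.
+
```c
#include <assert.h>
#include <stddef.h>

/* Toy model: returns the offset at which the next search would start after
   an empty match at offset where no non-empty match was found. Assumes an
   8-bit, non-UTF subject, so one character is one code unit. */
static size_t sketch_advance(const char *subject, size_t length,
  size_t offset, int crlf_is_newline)
{
  assert(subject != NULL && offset < length);
  if (crlf_is_newline && offset + 1 < length &&
      subject[offset] == '\r' && subject[offset + 1] == '\n')
    return offset + 2;        /* step over CR and LF together */
  return offset + 1;          /* otherwise step over one character */
}
```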
+
+ PCRE2_SUBSTITUTE_OVERFLOW_LENGTH changes what happens when the output
+ buffer is too small. The default action is to return PCRE2_ERROR_NOMEM-
+ ORY immediately. If this option is set, however, pcre2_substitute()
+ continues to go through the motions of matching and substituting (with-
+ out, of course, writing anything) in order to compute the size of buf-
+ fer that is needed. This value is passed back via the outlengthptr
+ variable, with the result of the function still being
+ PCRE2_ERROR_NOMEMORY.
+
+ Passing a buffer size of zero is a permitted way of finding out how
+ much memory is needed for a given substitution. However, this does mean
+ that the entire operation is carried out twice. Depending on the appli-
+ cation, it may be more efficient to allocate a large buffer and free
+ the excess afterwards, instead of using PCRE2_SUBSTITUTE_OVER-
+ FLOW_LENGTH.
+
+ PCRE2_SUBSTITUTE_UNKNOWN_UNSET causes references to capturing groups
+ that do not appear in the pattern to be treated as unset groups. This
+ option should be used with care, because it means that a typo in a
+ group name or number no longer causes the PCRE2_ERROR_NOSUBSTRING
+ error.
+
+ PCRE2_SUBSTITUTE_UNSET_EMPTY causes unset capturing groups (including
+ unknown groups when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set) to be
+ treated as empty strings when inserted as described above. If this
+ option is not set, an attempt to insert an unset group causes the
+ PCRE2_ERROR_UNSET error. This option does not influence the extended
+ substitution syntax described below.
+
+ PCRE2_SUBSTITUTE_EXTENDED causes extra processing to be applied to the
+ replacement string. Without this option, only the dollar character is
+ special, and only the group insertion forms listed above are valid.
+ When PCRE2_SUBSTITUTE_EXTENDED is set, two things change:
+
+ Firstly, backslash in a replacement string is interpreted as an escape
+ character. The usual forms such as \n or \x{ddd} can be used to specify
+ particular character codes, and backslash followed by any non-alphanu-
+ meric character quotes that character. Extended quoting can be coded
+ using \Q...\E, exactly as in pattern strings.
+
+ There are also four escape sequences for forcing the case of inserted
+ letters. The insertion mechanism has three states: no case forcing,
+ force upper case, and force lower case. The escape sequences change the
+ current state: \U and \L change to upper or lower case forcing, respec-
+ tively, and \E (when not terminating a \Q quoted sequence) reverts to
+ no case forcing. The sequences \u and \l force the next character (if
+ it is a letter) to upper or lower case, respectively, and then the
+ state automatically reverts to no case forcing. Case forcing applies to
+ all inserted characters, including those from captured groups and let-
+ ters within \Q...\E quoted sequences.
+
+ Note that case forcing sequences such as \U...\E do not nest. For exam-
+ ple, the result of processing "\Uaa\LBB\Ecc\E" is "AAbbcc"; the final
+ \E has no effect.
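+
+ The state changes can be modelled by the C sketch below. It is not
+ the library's code: the names are hypothetical, only ASCII text and
+ the escapes \U, \L, \E, \u, and \l are handled, and the model reverts
+ a \u or \l state after the next character whether or not that
+ character is a letter, which is an assumption.
+
```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

enum sketch_force { FORCE_NONE, FORCE_UPPER, FORCE_LOWER };

/* Toy model of the three case-forcing states; NOT the library's code.
   Copies in to out, applying \U \L \E \u \l. Group insertion and \Q...\E
   quoting are not modelled. out needs room for strlen(in) + 1 characters. */
static void sketch_case_force(const char *in, char *out)
{
  enum sketch_force state = FORCE_NONE;
  int one_shot = 0;    /* set by \u and \l: revert after one character */
  assert(in != NULL && out != NULL);
  for (; *in != '\0'; in++)
    {
    unsigned char c = (unsigned char)*in;
    if (c == '\\' && in[1] != '\0')
      {
      c = (unsigned char)*++in;
      switch (c)
        {
        case 'U': state = FORCE_UPPER; one_shot = 0; continue;
        case 'L': state = FORCE_LOWER; one_shot = 0; continue;
        case 'E': state = FORCE_NONE;  one_shot = 0; continue;
        case 'u': state = FORCE_UPPER; one_shot = 1; continue;
        case 'l': state = FORCE_LOWER; one_shot = 1; continue;
        default: break;   /* other escaped characters are copied as text */
        }
      }
    if (state == FORCE_UPPER) c = (unsigned char)toupper(c);
    else if (state == FORCE_LOWER) c = (unsigned char)tolower(c);
    if (one_shot) { state = FORCE_NONE; one_shot = 0; }
    *out++ = (char)c;
    }
  *out = '\0';
}
```
+
+ Because each escape simply overwrites the single current state, there
+ is nothing to nest: in this model the input "\Uaa\LBB\Ecc\E" yields
+ "AAbbcc", and the final \E finds the state already reset.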
+
+ The second effect of setting PCRE2_SUBSTITUTE_EXTENDED is to add more
+ flexibility to group substitution. The syntax is similar to that used
+ by Bash:
+
+ ${<n>:-<string>}
+ ${<n>:+<string1>:<string2>}
+
+ As before, <n> may be a group number or a name. The first form speci-
+ fies a default value. If group <n> is set, its value is inserted; if
+ not, <string> is expanded and the result inserted. The second form
+ specifies strings that are expanded and inserted when group <n> is set
+ or unset, respectively. The first form is just a convenient shorthand
+ for
+
+ ${<n>:+${<n>}:<string>}
+
+ Backslash can be used to escape colons and closing curly brackets in
+ the replacement strings. A change of the case forcing state within a
+ replacement string remains in force afterwards, as shown in this
+ pcre2test example:
+
+ /(some)?(body)/substitute_extended,replace=${1:+\U:\L}HeLLo
+ body
+ 1: hello
+ somebody
+ 1: HELLO
+
+ The PCRE2_SUBSTITUTE_UNSET_EMPTY option does not affect these extended
+ substitutions. However, PCRE2_SUBSTITUTE_UNKNOWN_UNSET does cause
+ unknown groups in the extended syntax forms to be treated as unset.
+
+ If successful, pcre2_substitute() returns the number of replacements
+ that were made. This may be zero if no matches were found, and is never
+ greater than 1 unless PCRE2_SUBSTITUTE_GLOBAL is set.
+
+ In the event of an error, a negative error code is returned. Except for
+ PCRE2_ERROR_NOMATCH (which is never returned), errors from
+ pcre2_match() are passed straight back.
+
+ PCRE2_ERROR_NOSUBSTRING is returned for a non-existent substring inser-
+ tion, unless PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set.
+
+ PCRE2_ERROR_UNSET is returned for an unset substring insertion (includ-
+ ing an unknown substring when PCRE2_SUBSTITUTE_UNKNOWN_UNSET is set)
+ when the simple (non-extended) syntax is used and PCRE2_SUBSTI-
+ TUTE_UNSET_EMPTY is not set.
+
+ PCRE2_ERROR_NOMEMORY is returned if the output buffer is not big
+ enough. If the PCRE2_SUBSTITUTE_OVERFLOW_LENGTH option is set, the size
+ of buffer that is needed is returned via outlengthptr. Note that this
+ does not happen by default.
+
+ PCRE2_ERROR_BADREPLACEMENT is used for miscellaneous syntax errors in
+ the replacement string, with more particular errors being
+ PCRE2_ERROR_BADREPESCAPE (invalid escape sequence), PCRE2_ERROR_REP-
+ MISSINGBRACE (closing curly bracket not found), PCRE2_ERROR_BADSUBSTI-
+ TUTION (syntax error in extended group substitution), and
+ PCRE2_ERROR_BADSUBSPATTERN (the pattern match ended before it started
+ or the match started earlier than the current position in the subject,
+ which can happen if \K is used in an assertion).
+
+ As for all PCRE2 errors, a text message that describes the error can be
+ obtained by calling the pcre2_get_error_message() function (see
+ "Obtaining a textual error message" above).
+
+
+DUPLICATE SUBPATTERN NAMES
+
+ int pcre2_substring_nametable_scan(const pcre2_code *code,
+ PCRE2_SPTR name, PCRE2_SPTR *first, PCRE2_SPTR *last);
+
+ When a pattern is compiled with the PCRE2_DUPNAMES option, names for
+ subpatterns are not required to be unique. Duplicate names are always
+ allowed for subpatterns with the same number, created by using the (?|
+ feature. Indeed, if such subpatterns are named, they are required to
+ use the same names.
+
+ Normally, patterns with duplicate names are such that in any one match,
+ only one of the named subpatterns participates. An example is shown in
+ the pcre2pattern documentation.
+
+ When duplicates are present, pcre2_substring_copy_byname() and
+ pcre2_substring_get_byname() return the first substring corresponding
+ to the given name that is set. Only if none are set is
+ PCRE2_ERROR_UNSET returned. The pcre2_substring_number_from_name()
+ function returns the error PCRE2_ERROR_NOUNIQUESUBSTRING when there are
+ duplicate names.
+
+ If you want to get full details of all captured substrings for a given
+ name, you must use the pcre2_substring_nametable_scan() function. The
+ first argument is the compiled pattern, and the second is the name. If
+ the third and fourth arguments are NULL, the function returns a group
+ number for a unique name, or PCRE2_ERROR_NOUNIQUESUBSTRING otherwise.
+
+ When the third and fourth arguments are not NULL, they must be pointers
+ to variables that are updated by the function. After it has run, they
+ point to the first and last entries in the name-to-number table for the
+ given name, and the function returns the length of each entry in code
+ units. In both cases, PCRE2_ERROR_NOSUBSTRING is returned if there are
+ no entries for the given name.
+
+ The format of the name table is described above in the section entitled
+ "Information about a pattern". Given all the relevant entries for the
+ name, you can extract each of their numbers, and hence the captured
+ data.
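+
+ As a sketch of that last step, the C function below walks the entries
+ between the two returned pointers. It assumes the 8-bit library, where
+ each entry begins with the group number in two bytes, most significant
+ first, followed by the zero-terminated name. The function name
+ sketch_entry_numbers() is hypothetical, and the pointers and entry
+ size would in practice come from pcre2_substring_nametable_scan().
+
```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for the 8-bit name table layout: collects the group
   numbers of all entries from first to last inclusive, which lie
   entry_size code units apart. Stores at most max numbers and returns the
   number of entries seen. */
static size_t sketch_entry_numbers(const unsigned char *first,
  const unsigned char *last, size_t entry_size, unsigned int *numbers,
  size_t max)
{
  size_t count = 0;
  assert(first != NULL && last != NULL && numbers != NULL);
  assert(entry_size >= 3);    /* two number bytes plus at least the zero */
  for (const unsigned char *p = first; p <= last; p += entry_size)
    {
    /* The group number is stored most significant byte first. */
    if (count < max) numbers[count] = ((unsigned int)p[0] << 8) | p[1];
    count++;
    }
  return count;
}
```
+
+ Each number obtained this way can then be passed to one of the
+ "bynumber" functions, or used to index the ovector directly.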
+
+
+FINDING ALL POSSIBLE MATCHES AT ONE POSITION
+
+ The traditional matching function uses a similar algorithm to Perl,
+ which stops when it finds the first match at a given point in the sub-
+ ject. If you want to find all possible matches, or the longest possible
+ match at a given position, consider using the alternative matching
+ function (see below) instead. If you cannot use the alternative func-
+ tion, you can kludge it up by making use of the callout facility, which
+ is described in the pcre2callout documentation.
+
+ What you have to do is to insert a callout right at the end of the pat-
+ tern. When your callout function is called, extract and save the cur-
+ rent matched substring. Then return 1, which forces pcre2_match() to
+ backtrack and try other alternatives. Ultimately, when it runs out of
+ matches, pcre2_match() will yield PCRE2_ERROR_NOMATCH.
+
+
+MATCHING A PATTERN: THE ALTERNATIVE FUNCTION
+
+ int pcre2_dfa_match(const pcre2_code *code, PCRE2_SPTR subject,
+ PCRE2_SIZE length, PCRE2_SIZE startoffset,
+ uint32_t options, pcre2_match_data *match_data,
+ pcre2_match_context *mcontext,
+ int *workspace, PCRE2_SIZE wscount);
+
+ The function pcre2_dfa_match() is called to match a subject string
+ against a compiled pattern, using a matching algorithm that scans the
+ subject string just once (not counting lookaround assertions), and does
+ not backtrack. This has different characteristics to the normal algo-
+ rithm, and is not compatible with Perl. Some of the features of PCRE2
+ patterns are not supported. Nevertheless, there are times when this
+ kind of matching can be useful. For a discussion of the two matching
+ algorithms, and a list of features that pcre2_dfa_match() does not sup-
+ port, see the pcre2matching documentation.
+
+ The arguments for the pcre2_dfa_match() function are the same as for
+ pcre2_match(), plus two extras. The ovector within the match data block
+ is used in a different way, and this is described below. The other com-
+ mon arguments are used in the same way as for pcre2_match(), so their
+ description is not repeated here.
+
+ The two additional arguments provide workspace for the function. The
+ workspace vector should contain at least 20 elements. It is used for
+ keeping track of multiple paths through the pattern tree. More
+ workspace is needed for patterns and subjects where there are a lot of
+ potential matches.
+
+ Here is an example of a simple call to pcre2_dfa_match():
+
+ int wspace[20];
+ pcre2_match_data *md = pcre2_match_data_create(4, NULL);
+ int rc = pcre2_dfa_match(
+ re, /* result of pcre2_compile() */
+ (PCRE2_SPTR)"some string", /* the subject string */
+ 11, /* the length of the subject string */
+ 0, /* start at offset 0 in the subject */
+ 0, /* default options */
+ md, /* the match data block */
+ NULL, /* a match context; NULL means use defaults */
+ wspace, /* working space vector */
+ 20); /* number of elements (NOT size in bytes) */
+
+ Option bits for pcre2_dfa_match()
+
+ The unused bits of the options argument for pcre2_dfa_match() must be
+ zero. The only bits that may be set are PCRE2_ANCHORED, PCRE2_ENDAN-
+ CHORED, PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY,
+ PCRE2_NOTEMPTY_ATSTART, PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD,
+ PCRE2_PARTIAL_SOFT, PCRE2_DFA_SHORTEST, and PCRE2_DFA_RESTART. All but
+ the last four of these are exactly the same as for pcre2_match(), so
+ their description is not repeated here.
+
+ PCRE2_PARTIAL_HARD
+ PCRE2_PARTIAL_SOFT
+
+ These have the same general effect as they do for pcre2_match(), but
+ the details are slightly different. When PCRE2_PARTIAL_HARD is set for
+ pcre2_dfa_match(), it returns PCRE2_ERROR_PARTIAL if the end of the
+ subject is reached and there is still at least one matching possibility
+ that requires additional characters. This happens even if some complete
+ matches have already been found. When PCRE2_PARTIAL_SOFT is set, the
+ return code PCRE2_ERROR_NOMATCH is converted into PCRE2_ERROR_PARTIAL
+ if the end of the subject is reached, there have been no complete
+ matches, but there is still at least one matching possibility. The por-
+ tion of the string that was inspected when the longest partial match
+ was found is set as the first matching string in both cases. There is a
+ more detailed discussion of partial and multi-segment matching, with
+ examples, in the pcre2partial documentation.
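+
+ The difference between the two partial options can be summarized as a
+ small decision function. This is only a sketch of the rules described
+ above, not code from the library: dfa_outcome(), MODEL_NOMATCH and
+ MODEL_PARTIAL are hypothetical stand-ins, and the two inputs are assumed
+ to be known once the end of the subject has been reached.
+

```c
/* Hypothetical stand-ins for the library's return codes. */
enum { MODEL_NOMATCH = -1, MODEL_PARTIAL = -2 };

enum partial_mode { PARTIAL_NONE, PARTIAL_SOFT, PARTIAL_HARD };

/* complete: number of complete matches found so far (0 if none).
   pending:  nonzero if, at the end of the subject, at least one matching
             possibility still requires additional characters.
   Returns the match count, or one of the MODEL_ codes. */
static int dfa_outcome(enum partial_mode mode, int complete, int pending)
{
    /* Hard partial wins even when complete matches already exist. */
    if (mode == PARTIAL_HARD && pending) return MODEL_PARTIAL;

    /* Otherwise any complete match is reported as a success. */
    if (complete > 0) return complete;

    /* Soft partial only converts what would have been "no match". */
    if (mode == PARTIAL_SOFT && pending) return MODEL_PARTIAL;

    return MODEL_NOMATCH;
}
```

+
+ With two complete matches and a path that still wants more characters,
+ dfa_outcome(PARTIAL_HARD, 2, 1) gives MODEL_PARTIAL, whereas
+ dfa_outcome(PARTIAL_SOFT, 2, 1) gives 2.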
+
+ PCRE2_DFA_SHORTEST
+
+ Setting the PCRE2_DFA_SHORTEST option causes the matching algorithm to
+ stop as soon as it has found one match. Because of the way the alterna-
+ tive algorithm works, this is necessarily the shortest possible match
+ at the first possible matching point in the subject string.
+
+ PCRE2_DFA_RESTART
+
+ When pcre2_dfa_match() returns a partial match, it is possible to call
+ it again, with additional subject characters, and have it continue with
+ the same match. The PCRE2_DFA_RESTART option requests this action; when
+ it is set, the workspace and wscount arguments must reference the same
+ vector as before because data about the match so far is left in them
+ after a partial match. There is more discussion of this facility in the
+ pcre2partial documentation.
+
+ Successful returns from pcre2_dfa_match()
+
+ When pcre2_dfa_match() succeeds, it may have matched more than one sub-
+ string in the subject. Note, however, that all the matches from one run
+ of the function start at the same point in the subject. The shorter
+ matches are all initial substrings of the longer matches. For example,
+ if the pattern
+
+ <.*>
+
+ is matched against the string
+
+ This is <something> <something else> <something further> no more
+
+ the three matched strings are
+
+ <something> <something else> <something further>
+ <something> <something else>
+ <something>
+
+ On success, the yield of the function is a number greater than zero,
+ which is the number of matched substrings. The offsets of the sub-
+ strings are returned in the ovector, and can be extracted by number in
+ the same way as for pcre2_match(), but the numbers bear no relation to
+ any capturing groups that may exist in the pattern, because DFA match-
+ ing does not support group capture.
+
+ Calls to the convenience functions that extract substrings by name
+ return the error PCRE2_ERROR_DFA_UFUNC (unsupported function) if used
+ after a DFA match. The convenience functions that extract substrings by
+ number never return PCRE2_ERROR_NOSUBSTRING.
+
+ The matched strings are stored in the ovector in reverse order of
+ length; that is, the longest matching string is first. If there were
+ too many matches to fit into the ovector, the yield of the function is
+ zero, and the vector is filled with the longest matches.
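+
+ The following sketch shows one way of walking the ovector after a DFA
+ match. It is library-free so that it stands alone: the helper names
+ dfa_match_count() and dfa_match_copy() are hypothetical, and the offsets
+ are assumed to be (start, end) pairs held in an array of size_t. In a
+ real program the pointer would come from pcre2_get_ovector_pointer().
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Number of usable pairs in the ovector. A positive rc is the number of
   matches; rc == 0 means there were too many matches, so every one of
   the ovector_pairs slots holds one of the longest matches. */
static int dfa_match_count(int rc, int ovector_pairs)
{
    if (rc > 0) return rc;
    if (rc == 0) return ovector_pairs;
    return 0;                       /* negative rc: an error return */
}

/* Copy match number n (0 is the longest) into buf as a C string.
   Returns the match length, or 0 if buf is too small. */
static size_t dfa_match_copy(const char *subject, const size_t *ovector,
                             int n, char *buf, size_t bufsize)
{
    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];
    size_t len;

    assert(end >= start);
    len = end - start;
    if (len + 1 > bufsize) return 0;
    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return len;
}
```

+
+ For the example subject above, working the offsets out by hand gives the
+ pairs (8,56), (8,36) and (8,19): all three start at offset 8, and pair 0
+ is the longest string.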
+
+ NOTE: PCRE2's "auto-possessification" optimization usually applies to
+ character repeats at the end of a pattern (as well as internally). For
+ example, the pattern "a\d+" is compiled as if it were "a\d++". For DFA
+ matching, this means that only one possible match is found. If you
+ really do want multiple matches in such cases, either use an ungreedy
+ repeat such as "a\d+?" or set the PCRE2_NO_AUTO_POSSESS option when
+ compiling.
+
+ Error returns from pcre2_dfa_match()
+
+ The pcre2_dfa_match() function returns a negative number when it fails.
+ Many of the errors are the same as for pcre2_match(), as described
+ above. There are in addition the following errors that are specific to
+ pcre2_dfa_match():
+
+ PCRE2_ERROR_DFA_UITEM
+
+ This return is given if pcre2_dfa_match() encounters an item in the
+ pattern that it does not support, for instance, the use of \C in a UTF
+ mode or a backreference.
+
+ PCRE2_ERROR_DFA_UCOND
+
+ This return is given if pcre2_dfa_match() encounters a condition item
+ that uses a backreference for the condition, or a test for recursion in
+ a specific group. These are not supported.
+
+ PCRE2_ERROR_DFA_WSSIZE
+
+ This return is given if pcre2_dfa_match() runs out of space in the
+ workspace vector.
+
+ PCRE2_ERROR_DFA_RECURSE
+
+ When a recursive subpattern is processed, the matching function calls
+ itself recursively, using private memory for the ovector and workspace.
+ This error is given if the internal ovector is not large enough. This
+ should be extremely rare, as a vector of size 1000 is used.
+
+ PCRE2_ERROR_DFA_BADRESTART
+
+ When pcre2_dfa_match() is called with the PCRE2_DFA_RESTART option,
+ some plausibility checks are made on the contents of the workspace,
+ which should contain data about the previous partial match. If any of
+ these checks fail, this error is given.
+
+
+SEE ALSO
+
+ pcre2build(3), pcre2callout(3), pcre2demo(3), pcre2matching(3),
+ pcre2partial(3), pcre2posix(3), pcre2sample(3), pcre2unicode(3).
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 07 September 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2BUILD(3) Library Functions Manual PCRE2BUILD(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+BUILDING PCRE2
+
+ PCRE2 is distributed with a configure script that can be used to build
+ the library in Unix-like environments using the applications known as
+ Autotools. Also in the distribution are files to support building using
+ CMake instead of configure. The text file README contains general
+ information about building with Autotools (some of which is repeated
+ below), and also has some comments about building on various operating
+ systems. There is a lot more information about building PCRE2 without
+ using Autotools (including information about using CMake and building
+ "by hand") in the text file called NON-AUTOTOOLS-BUILD. You should
+ consult this file as well as the README file if you are building in a
+ non-Unix-like environment.
+
+
+PCRE2 BUILD-TIME OPTIONS
+
+ The rest of this document describes the optional features of PCRE2 that
+ can be selected when the library is compiled. It assumes use of the
+ configure script, where the optional features are selected or dese-
+ lected by providing options to configure before running the make com-
+ mand. However, the same options can be selected in both Unix-like and
+ non-Unix-like environments if you are using CMake instead of configure
+ to build PCRE2.
+
+ If you are not using Autotools or CMake, option selection can be done
+ by editing the config.h file, or by passing parameter settings to the
+ compiler, as described in NON-AUTOTOOLS-BUILD.
+
+ The complete list of options for configure (which includes the standard
+ ones such as the selection of the installation directory) can be
+ obtained by running
+
+ ./configure --help
+
+ The following sections include descriptions of "on/off" options whose
+ names begin with --enable or --disable. Because of the way that config-
+ ure works, --enable and --disable always come in pairs, so the comple-
+ mentary option always exists as well, but as it specifies the default,
+ it is not described. Options that specify values have names that start
+ with --with. At the end of a configure run, a summary of the configura-
+ tion is output.
+
+
+BUILDING 8-BIT, 16-BIT AND 32-BIT LIBRARIES
+
+ By default, a library called libpcre2-8 is built, containing functions
+ that take string arguments contained in arrays of bytes, interpreted
+ either as single-byte characters, or UTF-8 strings. You can also build
+ two other libraries, called libpcre2-16 and libpcre2-32, which process
+ strings that are contained in arrays of 16-bit and 32-bit code units,
+ respectively. These can be interpreted either as single-unit characters
+ or UTF-16/UTF-32 strings. To build these additional libraries, add one
+ or both of the following to the configure command:
+
+ --enable-pcre2-16
+ --enable-pcre2-32
+
+ If you do not want the 8-bit library, add
+
+ --disable-pcre2-8
+
+ as well. At least one of the three libraries must be built. Note that
+ the POSIX wrapper is for the 8-bit library only, and that pcre2grep is
+ an 8-bit program. Neither of these is built if you select only the
+ 16-bit or 32-bit libraries.
+
+
+BUILDING SHARED AND STATIC LIBRARIES
+
+ The Autotools PCRE2 building process uses libtool to build both shared
+ and static libraries by default. You can suppress an unwanted library
+ by adding one of
+
+ --disable-shared
+ --disable-static
+
+ to the configure command.
+
+
+UNICODE AND UTF SUPPORT
+
+ By default, PCRE2 is built with support for Unicode and UTF character
+ strings. To build it without Unicode support, add
+
+ --disable-unicode
+
+ to the configure command. This setting applies to all three libraries.
+ It is not possible to build one library with Unicode support, and
+ another without, in the same configuration.
+
+ Of itself, Unicode support does not make PCRE2 treat strings as UTF-8,
+ UTF-16 or UTF-32. To do that, applications that use the library can set
+ the PCRE2_UTF option when they call pcre2_compile() to compile a pat-
+ tern. Alternatively, patterns may be started with (*UTF) unless the
+ application has locked this out by setting PCRE2_NEVER_UTF.
+
+ UTF support allows the libraries to process character code points up to
+ 0x10ffff in the strings that they handle. Unicode support also gives
+ access to the Unicode properties of characters, using pattern escapes
+ such as \P, \p, and \X. Only the general category properties such as Lu
+ and Nd are supported. Details are given in the pcre2pattern documenta-
+ tion.
+
+ Pattern escapes such as \d and \w do not by default make use of Unicode
+ properties. The application can request that they do by setting the
+ PCRE2_UCP option. Unless the application has set PCRE2_NEVER_UCP, a
+ pattern may also request this by starting with (*UCP).
+
+
+DISABLING THE USE OF \C
+
+ The \C escape sequence, which matches a single code unit, even in a UTF
+ mode, can cause unpredictable behaviour because it may leave the cur-
+ rent matching point in the middle of a multi-code-unit character. The
+ application can lock it out by setting the PCRE2_NEVER_BACKSLASH_C
+ option when calling pcre2_compile(). There is also a build-time option
+
+ --enable-never-backslash-C
+
+ (note the upper case C) which locks out the use of \C entirely.
+
+
+JUST-IN-TIME COMPILER SUPPORT
+
+ Just-in-time (JIT) compiler support is included in the build by speci-
+ fying
+
+ --enable-jit
+
+ This support is available only for certain hardware architectures. If
+ this option is set for an unsupported architecture, a building error
+ occurs. If in doubt, use
+
+ --enable-jit=auto
+
+ which enables JIT only if the current hardware is supported. You can
+ check if JIT is enabled in the configuration summary that is output at
+ the end of a configure run. If you are enabling JIT under SELinux you
+ may also want to add
+
+ --enable-jit-sealloc
+
+ which enables the use of an execmem allocator in JIT that is compatible
+ with SELinux. This has no effect if JIT is not enabled. See the
+ pcre2jit documentation for a discussion of JIT usage. When JIT support
+ is enabled, pcre2grep automatically makes use of it, unless you add
+
+ --disable-pcre2grep-jit
+
+ to the "configure" command.
+
+
+NEWLINE RECOGNITION
+
+ By default, PCRE2 interprets the linefeed (LF) character as indicating
+ the end of a line. This is the normal newline character on Unix-like
+ systems. You can compile PCRE2 to use carriage return (CR) instead, by
+ adding
+
+ --enable-newline-is-cr
+
+ to the configure command. There is also an --enable-newline-is-lf
+ option, which explicitly specifies linefeed as the newline character.
+
+ Alternatively, you can specify that line endings are to be indicated by
+ the two-character sequence CRLF (CR immediately followed by LF). If you
+ want this, add
+
+ --enable-newline-is-crlf
+
+ to the configure command. There is a fourth option, specified by
+
+ --enable-newline-is-anycrlf
+
+ which causes PCRE2 to recognize any of the three sequences CR, LF, or
+ CRLF as indicating a line ending. A fifth option, specified by
+
+ --enable-newline-is-any
+
+ causes PCRE2 to recognize any Unicode newline sequence. The Unicode
+ newline sequences are the three just mentioned, plus the single charac-
+ ters VT (vertical tab, U+000B), FF (form feed, U+000C), NEL (next line,
+ U+0085), LS (line separator, U+2028), and PS (paragraph separator,
+ U+2029). The final option is
+
+ --enable-newline-is-nul
+
+ which causes NUL (binary zero) to be set as the default line-ending
+ character.
+
+ Whatever default line ending convention is selected when PCRE2 is built
+ can be overridden by applications that use the library. At build time
+ it is recommended to use the standard for your operating system.
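+
+ As an illustration of how the conventions differ, here is a minimal
+ model of newline recognition for a subject made of single-byte
+ characters. It is not the library's code: the enum and the function
+ newline_length() are hypothetical names, and the "any Unicode newline"
+ convention is left out to keep the sketch short.
+

```c
#include <stddef.h>

enum nl_convention { NL_CR, NL_LF, NL_CRLF, NL_ANYCRLF, NL_NUL };

/* Length in bytes of the newline that starts at s[pos], or 0 if the
   bytes at that position are not a newline under the given convention. */
static size_t newline_length(const char *s, size_t len, size_t pos,
                             enum nl_convention conv)
{
    char c, next;

    if (pos >= len) return 0;
    c = s[pos];
    next = (pos + 1 < len) ? s[pos + 1] : '\0';

    switch (conv) {
    case NL_CR:
        return (c == '\r') ? 1 : 0;
    case NL_LF:
        return (c == '\n') ? 1 : 0;
    case NL_CRLF:
        /* Only the two-character sequence counts. */
        return (c == '\r' && next == '\n') ? 2 : 0;
    case NL_ANYCRLF:
        /* CR, LF, or CRLF; CRLF is taken as one newline. */
        if (c == '\r') return (next == '\n') ? 2 : 1;
        return (c == '\n') ? 1 : 0;
    case NL_NUL:
        return (c == '\0') ? 1 : 0;
    }
    return 0;
}
```

+
+ Given the bytes "a", CR, LF, "b", a call for position 1 returns 2 under
+ NL_CRLF and NL_ANYCRLF, 1 under NL_CR, and 0 under NL_LF.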
+
+
+WHAT \R MATCHES
+
+ By default, the sequence \R in a pattern matches any Unicode newline
+ sequence, independently of what has been selected as the line ending
+ sequence. If you specify
+
+ --enable-bsr-anycrlf
+
+ the default is changed so that \R matches only CR, LF, or CRLF. What-
+ ever is selected when PCRE2 is built can be overridden by applications
+ that use the library.
+
+
+HANDLING VERY LARGE PATTERNS
+
+ Within a compiled pattern, offset values are used to point from one
+ part to another (for example, from an opening parenthesis to an alter-
+ nation metacharacter). By default, in the 8-bit and 16-bit libraries,
+ two-byte values are used for these offsets, leading to a maximum size
+ for a compiled pattern of around 64 thousand code units. This is suffi-
+ cient to handle all but the most gigantic patterns. Nevertheless, some
+ people do want to process truly enormous patterns, so it is possible to
+ compile PCRE2 to use three-byte or four-byte offsets by adding a set-
+ ting such as
+
+ --with-link-size=3
+
+ to the configure command. The value given must be 2, 3, or 4. For the
+ 16-bit library, a value of 3 is rounded up to 4. In these libraries,
+ using longer offsets slows down the operation of PCRE2 because it has
+ to load additional data when handling them. For the 32-bit library the
+ value is always 4 and cannot be overridden; the value of --with-link-
+ size is ignored.
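+
+ The rules for the three libraries can be written down as a small
+ function. This is a sketch of the paragraph above rather than anything
+ taken from the build system; effective_link_size() is a hypothetical
+ name, and a return value of 0 is used here to flag a request outside the
+ permitted range.
+

```c
/* Link size actually used for a requested --with-link-size value.
   code_unit_bits is 8, 16 or 32, selecting the library. */
static int effective_link_size(int code_unit_bits, int requested)
{
    if (requested < 2 || requested > 4) return 0;   /* must be 2, 3 or 4 */

    switch (code_unit_bits) {
    case 8:
        return requested;
    case 16:
        return (requested == 3) ? 4 : requested;    /* 3 is rounded up */
    case 32:
        return 4;                                   /* request is ignored */
    default:
        return 0;
    }
}
```

+
+ So a request for 3 yields 3 in the 8-bit library but 4 in the other two.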
+
+
+LIMITING PCRE2 RESOURCE USAGE
+
+ The pcre2_match() function increments a counter each time it goes round
+ its main loop. Putting a limit on this counter controls the amount of
+ computing resource used by a single call to pcre2_match(). The limit
+ can be changed at run time, as described in the pcre2api documentation.
+ The default is 10 million, but this can be changed by adding a setting
+ such as
+
+ --with-match-limit=500000
+
+ to the configure command. This setting also applies to the
+ pcre2_dfa_match() matching function, and to JIT matching (though the
+ counting is done differently).
+
+ The pcre2_match() function starts out using a 20KiB vector on the sys-
+ tem stack to record backtracking points. The more nested backtracking
+ points there are (that is, the deeper the search tree), the more memory
+ is needed. If the initial vector is not large enough, heap memory is
+ used, up to a certain limit, which is specified in kibibytes (units of
+ 1024 bytes). The limit can be changed at run time, as described in the
+ pcre2api documentation. The default limit (in effect unlimited) is 20
+ million. You can change this by a setting such as
+
+ --with-heap-limit=500
+
+ which limits the amount of heap to 500 KiB. This limit applies only to
+ interpretive matching in pcre2_match() and pcre2_dfa_match(), which may
+ also use the heap for internal workspace when processing complicated
+ patterns. This limit does not apply when JIT (which has its own memory
+ arrangements) is used.
+
+ You can also explicitly limit the depth of nested backtracking in the
+ pcre2_match() interpreter. This limit defaults to the value that is set
+ for --with-match-limit. You can set a lower default limit by adding,
+ for example,
+
+ --with-match-limit-depth=10000
+
+ to the configure command. This value can be overridden at run time.
+ This depth limit indirectly limits the amount of heap memory that is
+ used, but because the size of each backtracking "frame" depends on the
+ number of capturing parentheses in a pattern, the amount of heap that
+ is used before the limit is reached varies from pattern to pattern.
+ This limit was more useful in versions before 10.30, where function
+ recursion was used for backtracking.
+
+ As well as applying to pcre2_match(), the depth limit also controls the
+ depth of recursive function calls in pcre2_dfa_match(). These are used
+ for lookaround assertions, atomic groups, and recursion within pat-
+ terns. The limit does not apply to JIT matching.
+
+
+CREATING CHARACTER TABLES AT BUILD TIME
+
+ PCRE2 uses fixed tables for processing characters whose code points are
+ less than 256. By default, PCRE2 is built with a set of tables that are
+ distributed in the file src/pcre2_chartables.c.dist. These tables are
+ for ASCII codes only. If you add
+
+ --enable-rebuild-chartables
+
+ to the configure command, the distributed tables are no longer used.
+ Instead, a program called dftables is compiled and run. This outputs
+ the source for a new set of tables, created in the default locale of your
+ C run-time system. This method of replacing the tables does not work if
+ you are cross compiling, because dftables is run on the local host. If
+ you need to create alternative tables when cross compiling, you will
+ have to do so "by hand".
+
+
+USING EBCDIC CODE
+
+ PCRE2 assumes by default that it will run in an environment where the
+ character code is ASCII or Unicode, which is a superset of ASCII. This
+ is the case for most computer operating systems. PCRE2 can, however, be
+ compiled to run in an 8-bit EBCDIC environment by adding
+
+ --enable-ebcdic --disable-unicode
+
+ to the configure command. This setting implies --enable-rebuild-charta-
+ bles. You should only use it if you know that you are in an EBCDIC
+ environment (for example, an IBM mainframe operating system).
+
+ It is not possible to support both EBCDIC and UTF-8 codes in the same
+ version of the library. Consequently, --enable-unicode and --enable-
+ ebcdic are mutually exclusive.
+
+ The EBCDIC character that corresponds to an ASCII LF is assumed to have
+ the value 0x15 by default. However, in some EBCDIC environments, 0x25
+ is used. In such an environment you should use
+
+ --enable-ebcdic-nl25
+
+ as well as, or instead of, --enable-ebcdic. The EBCDIC character for CR
+ has the same value as in ASCII, namely, 0x0d. Whichever of 0x15 and
+ 0x25 is not chosen as LF is made to correspond to the Unicode NEL char-
+ acter (which, in Unicode, is 0x85).
+
+ The options that select newline behaviour, such as --enable-newline-is-
+ cr, and equivalent run-time options, refer to these character values in
+ an EBCDIC environment.
+
+
+PCRE2GREP SUPPORT FOR EXTERNAL SCRIPTS
+
+ By default, on non-Windows systems, pcre2grep supports the use of call-
+ outs with string arguments within the patterns it is matching, in order
+ to run external scripts. For details, see the pcre2grep documentation.
+ This support can be disabled by adding --disable-pcre2grep-callout to
+ the configure command.
+
+
+PCRE2GREP OPTIONS FOR COMPRESSED FILE SUPPORT
+
+ By default, pcre2grep reads all files as plain text. You can build it
+ so that it recognizes files whose names end in .gz or .bz2, and reads
+ them with libz or libbz2, respectively, by adding one or both of
+
+ --enable-pcre2grep-libz
+ --enable-pcre2grep-libbz2
+
+ to the configure command. These options naturally require that the rel-
+ evant libraries are installed on your system. Configuration will fail
+ if they are not.
+
+
+PCRE2GREP BUFFER SIZE
+
+ pcre2grep uses an internal buffer to hold a "window" on the file it is
+ scanning, in order to be able to output "before" and "after" lines when
+ it finds a match. The default starting size of the buffer is 20KiB. The
+ buffer itself is three times this size, but because of the way it is
+ used for holding "before" lines, the longest line that is guaranteed to
+ be processable is the notional buffer size. If a longer line is encoun-
+ tered, pcre2grep automatically expands the buffer, up to a specified
+ maximum size, whose default is 1MiB or the starting size, whichever is
+ the larger. You can change the default parameter values by adding, for
+ example,
+
+ --with-pcre2grep-bufsize=51200
+ --with-pcre2grep-max-bufsize=2097152
+
+ to the configure command. The caller of pcre2grep can override these
+ values by using --buffer-size and --max-buffer-size on the command
+ line.
+
+
+PCRE2TEST OPTION FOR LIBREADLINE SUPPORT
+
+ If you add one of
+
+ --enable-pcre2test-libreadline
+ --enable-pcre2test-libedit
+
+ to the configure command, pcre2test is linked with the libreadline
+ or libedit library, respectively, and when its input is from a terminal,
+ it reads it using the readline() function. This provides line-editing
+ and history facilities. Note that libreadline is GPL-licensed, so if
+ you distribute a binary of pcre2test linked in this way, there may be
+ licensing issues. These can be avoided by linking instead with libedit,
+ which has a BSD licence.
+
+ Setting --enable-pcre2test-libreadline causes the -lreadline option to
+ be added to the pcre2test build. In many operating environments with a
+ system-installed readline library this is sufficient. However, in some
+ environments (e.g. if an unmodified distribution version of readline is
+ in use), some extra configuration may be necessary. The INSTALL file
+ for libreadline says this:
+
+ "Readline uses the termcap functions, but does not link with
+ the termcap or curses library itself, allowing applications
+ which link with readline the to choose an appropriate library."
+
+ If your environment has not been set up so that an appropriate library
+ is automatically included, you may need to add something like
+
+ LIBS="-lncurses"
+
+ immediately before the configure command.
+
+
+INCLUDING DEBUGGING CODE
+
+ If you add
+
+ --enable-debug
+
+ to the configure command, additional debugging code is included in the
+ build. This feature is intended for use by the PCRE2 maintainers.
+
+
+DEBUGGING WITH VALGRIND SUPPORT
+
+ If you add
+
+ --enable-valgrind
+
+ to the configure command, PCRE2 will use valgrind annotations to mark
+ certain memory regions as unaddressable. This allows it to detect
+ invalid memory accesses, and is mostly useful for debugging PCRE2
+ itself.
+
+
+CODE COVERAGE REPORTING
+
+ If your C compiler is gcc, you can build a version of PCRE2 that can
+ generate a code coverage report for its test suite. To enable this, you
+ must install lcov version 1.6 or above. Then specify
+
+ --enable-coverage
+
+ to the configure command and build PCRE2 in the usual way.
+
+ Note that using ccache (a caching C compiler) is incompatible with code
+ coverage reporting. If you have configured ccache to run automatically
+ on your system, you must set the environment variable
+
+ CCACHE_DISABLE=1
+
+ before running make to build PCRE2, so that ccache is not used.
+
+ When --enable-coverage is used, the following additional targets are
+ added to the Makefile:
+
+ make coverage
+
+ This creates a fresh coverage report for the PCRE2 test suite. It is
+ equivalent to running "make coverage-reset", "make coverage-baseline",
+ "make check", and then "make coverage-report".
+
+ make coverage-reset
+
+ This zeroes the coverage counters, but does nothing else.
+
+ make coverage-baseline
+
+ This captures baseline coverage information.
+
+ make coverage-report
+
+ This creates the coverage report.
+
+ make coverage-clean-report
+
+ This removes the generated coverage report without cleaning the cover-
+ age data itself.
+
+ make coverage-clean-data
+
+ This removes the captured coverage data without removing the coverage
+ files created at compile time (*.gcno).
+
+ make coverage-clean
+
+ This cleans all coverage data including the generated coverage report.
+ For more information about code coverage, see the gcov and lcov docu-
+ mentation.
+
+
+SUPPORT FOR FUZZERS
+
+ There is a special option for use by people who want to run fuzzing
+ tests on PCRE2:
+
+ --enable-fuzz-support
+
+ At present this applies only to the 8-bit library. If set, it causes an
+ extra library called libpcre2-fuzzsupport.a to be built, but not
+ installed. This contains a single function called LLVMFuzzerTestOneIn-
+ put() whose arguments are a pointer to a string and the length of the
+ string. When called, this function tries to compile the string as a
+ pattern, and if that succeeds, to match it. This is done both with no
+ options and with some random option bits that are generated from the
+ string.
+
+ Setting --enable-fuzz-support also causes a binary called pcre2fuz-
+ zcheck to be created. This is normally run under valgrind or used when
+ PCRE2 is compiled with address sanitizing enabled. It calls the fuzzing
+ function and outputs information about what it is doing. The input
+ strings are specified by arguments: if an argument starts with "=" the
+ rest of it is a literal input string. Otherwise, it is assumed to be a
+ file name, and the contents of the file are the test string.
+
+
+OBSOLETE OPTION
+
+ In versions of PCRE2 prior to 10.30, there were two ways of handling
+ backtracking in the pcre2_match() function. The default was to use the
+ system stack, but if
+
+ --disable-stack-for-recursion
+
+ was set, memory on the heap was used. From release 10.30 onwards this
+ has changed (the stack is no longer used) and this option now does
+ nothing except give a warning.
+
+
+SEE ALSO
+
+ pcre2api(3), pcre2-config(3).
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 26 April 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2CALLOUT(3) Library Functions Manual PCRE2CALLOUT(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+SYNOPSIS
+
+ #include <pcre2.h>
+
+ int (*pcre2_callout)(pcre2_callout_block *, void *);
+
+ int pcre2_callout_enumerate(const pcre2_code *code,
+ int (*callback)(pcre2_callout_enumerate_block *, void *),
+ void *user_data);
+
+
+DESCRIPTION
+
+ PCRE2 provides a feature called "callout", which is a means of tempo-
+ rarily passing control to the caller of PCRE2 in the middle of pattern
+ matching. The caller of PCRE2 provides an external function by putting
+ its entry point in a match context (see pcre2_set_callout() in the
+ pcre2api documentation).
+
+ Within a regular expression, (?C<arg>) indicates a point at which the
+ external function is to be called. Different callout points can be
+ identified by putting a number less than 256 after the letter C. The
+ default value is zero. Alternatively, the argument may be a delimited
+ string. The starting delimiter must be one of ` ' " ^ % # $ { and the
+ ending delimiter is the same as the start, except for {, where the end-
+ ing delimiter is }. If the ending delimiter is needed within the
+ string, it must be doubled. For example, this pattern has two callout
+ points:
+
+ (?C1)abc(?C"some ""arbitrary"" text")def
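+
+ The delimiter rules can be illustrated with a short parser for the text
+ that follows "(?C". This is a sketch of the syntax as described above,
+ not the parser that PCRE2 itself uses; parse_callout_arg() and its
+ conventions (a number of -1 meaning "string argument") are hypothetical.
+

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* p points just past "(?C". On success returns 1 and sets either
   *number (0..255) with out empty, or *number = -1 with the string
   argument copied into out, doubled ending delimiters collapsed.
   Returns 0 if the callout is malformed or out is too small. */
static int parse_callout_arg(const char *p, int *number,
                             char *out, size_t outsize)
{
    size_t n = 0;
    int value = 0;

    *number = -1;
    if (outsize == 0) return 0;
    out[0] = '\0';

    if (*p != '\0' && strchr("`'\"^%#${", *p) != NULL) {
        char end = (*p == '{') ? '}' : *p;

        for (p++; *p != '\0'; p++) {
            if (*p == end) {
                if (p[1] != end) break;  /* single delimiter: end of string */
                p++;                     /* doubled delimiter: one literal */
            }
            if (n + 1 >= outsize) return 0;
            out[n++] = *p;
        }
        if (*p != end) return 0;         /* unterminated string */
        out[n] = '\0';
        return p[1] == ')';
    }

    /* Numeric argument; an empty number means the default of zero. */
    while (isdigit((unsigned char)*p)) {
        value = value * 10 + (*p++ - '0');
        if (value > 255) return 0;
    }
    if (*p != ')') return 0;
    *number = value;
    return 1;
}
```

+
+ Applied to the two callouts in the pattern above, the first yields the
+ number 1 and the second yields the string: some "arbitrary" text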
+
+ If the PCRE2_AUTO_CALLOUT option bit is set when a pattern is compiled,
+ PCRE2 automatically inserts callouts, all with number 255, before each
+ item in the pattern except for immediately before or after an explicit
+ callout. For example, if PCRE2_AUTO_CALLOUT is used with the pattern
+
+ A(?C3)B
+
+ it is processed as if it were
+
+ (?C255)A(?C3)B(?C255)
+
+ Here is a more complicated example:
+
+ A(\d{2}|--)
+
+ With PCRE2_AUTO_CALLOUT, this pattern is processed as if it were
+
+ (?C255)A(?C255)((?C255)\d{2}(?C255)|(?C255)-(?C255)-(?C255))(?C255)
+
+ Notice that there is a callout before and after each parenthesis and
+ alternation bar. If the pattern contains a conditional group whose con-
+ dition is an assertion, an automatic callout is inserted immediately
+ before the condition. Such a callout may also be inserted explicitly,
+ for example:
+
+ (?(?C9)(?=a)ab|de) (?(?C%text%)(?!=d)ab|de)
+
+ This applies only to assertion conditions (because they are themselves
+ independent groups).
+
+ Callouts can be useful for tracking the progress of pattern matching.
+ The pcre2test program has a pattern qualifier (/auto_callout) that sets
+ automatic callouts. When any callouts are present, the output from
+ pcre2test indicates how the pattern is being matched. This is useful
+ information when you are trying to optimize the performance of a par-
+ ticular pattern.
+
+
+MISSING CALLOUTS
+
+ You should be aware that, because of optimizations in the way PCRE2
+ compiles and matches patterns, callouts sometimes do not happen exactly
+ as you might expect.
+
+ Auto-possessification
+
+ At compile time, PCRE2 "auto-possessifies" repeated items when it knows
+ that what follows cannot be part of the repeat. For example, a+[bc] is
+ compiled as if it were a++[bc]. The pcre2test output when this pattern
+ is compiled with PCRE2_ANCHORED and PCRE2_AUTO_CALLOUT and then applied
+ to the string "aaaa" is:
+
+ --->aaaa
+ +0 ^ a+
+ +2 ^ ^ [bc]
+ No match
+
+ This indicates that when matching [bc] fails, there is no backtracking
+ into a+ (because it is being treated as a++) and therefore the callouts
+ that would be taken for the backtracks do not occur. You can disable
+ the auto-possessify feature by passing PCRE2_NO_AUTO_POSSESS to
+ pcre2_compile(), or starting the pattern with (*NO_AUTO_POSSESS). In
+ this case, the output changes to this:
+
+ --->aaaa
+ +0 ^ a+
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^ ^ [bc]
+ +2 ^^ [bc]
+ No match
+
+ This time, when matching [bc] fails, the matcher backtracks into a+ and
+ tries again, repeatedly, until a+ itself fails.
+
+ Automatic .* anchoring
+
+ By default, an optimization is applied when .* is the first significant
+ item in a pattern. If PCRE2_DOTALL is set, so that the dot can match
+ any character, the pattern is automatically anchored. If PCRE2_DOTALL
+ is not set, a match can start only after an internal newline or at the
+ beginning of the subject, and pcre2_compile() remembers this. If a pat-
+ tern has more than one top-level branch, automatic anchoring occurs if
+ all branches are anchorable.
+
+ This optimization is disabled, however, if .* is in an atomic group or
+ if there is a backreference to the capturing group in which it appears.
+ It is also disabled if the pattern contains (*PRUNE) or (*SKIP). How-
+ ever, the presence of callouts does not affect it.
+
+ For example, if the pattern .*\d is compiled with PCRE2_AUTO_CALLOUT
+ and applied to the string "aa", the pcre2test output is:
+
+ --->aa
+ +0 ^ .*
+ +2 ^ ^ \d
+ +2 ^^ \d
+ +2 ^ \d
+ No match
+
+ This shows that all match attempts start at the beginning of the sub-
+ ject. In other words, the pattern is anchored. You can disable this
+ optimization by passing PCRE2_NO_DOTSTAR_ANCHOR to pcre2_compile(), or
+ starting the pattern with (*NO_DOTSTAR_ANCHOR). In this case, the out-
+ put changes to:
+
+ --->aa
+ +0 ^ .*
+ +2 ^ ^ \d
+ +2 ^^ \d
+ +2 ^ \d
+ +0 ^ .*
+ +2 ^^ \d
+ +2 ^ \d
+ No match
+
+ This shows more match attempts, starting at the second subject charac-
+ ter. Another optimization, described in the next section, means that
+ there is no subsequent attempt to match with an empty subject.
+
+ Other optimizations
+
+ Other optimizations that provide fast "no match" results also affect
+ callouts. For example, if the pattern is
+
+ ab(?C4)cd
+
+ PCRE2 knows that any matching string must contain the letter "d". If
+ the subject string is "abyz", the lack of "d" means that matching
+ doesn't ever start, and the callout is never reached. However, with
+ "abyd", though the result is still no match, the callout is obeyed.
+
+ For most patterns PCRE2 also knows the minimum length of a matching
+ string, and will immediately give a "no match" return without actually
+ running a match if the subject is not long enough, or, for unanchored
+ patterns, if it has been scanned far enough.
+
+ You can disable these optimizations by passing the PCRE2_NO_START_OPTI-
+ MIZE option to pcre2_compile(), or by starting the pattern with
+ (*NO_START_OPT). This slows down the matching process, but does ensure
+ that callouts such as the example above are obeyed.
+
+
+THE CALLOUT INTERFACE
+
+ During matching, when PCRE2 reaches a callout point, if an external
+ function is provided in the match context, it is called. This applies
+ to normal, DFA, and JIT matching. The first argument to the callout
+ function is a pointer to a pcre2_callout block. The second argument
+ is the void * callout data that was supplied when the callout was set
+ up by calling pcre2_set_callout() (see the pcre2api documentation). The
+ callout block structure contains the following fields, not necessarily
+ in this order:
+
+ uint32_t version;
+ uint32_t callout_number;
+ uint32_t capture_top;
+ uint32_t capture_last;
+ uint32_t callout_flags;
+ PCRE2_SIZE *offset_vector;
+ PCRE2_SPTR mark;
+ PCRE2_SPTR subject;
+ PCRE2_SIZE subject_length;
+ PCRE2_SIZE start_match;
+ PCRE2_SIZE current_position;
+ PCRE2_SIZE pattern_position;
+ PCRE2_SIZE next_item_length;
+ PCRE2_SIZE callout_string_offset;
+ PCRE2_SIZE callout_string_length;
+ PCRE2_SPTR callout_string;
+
+ The version field contains the version number of the block format. The
+ current version is 2; the three callout string fields were added for
+ version 1, and the callout_flags field for version 2. If you are writ-
+ ing an application that might use an earlier release of PCRE2, you
+ should check the version number before accessing any of these fields.
+ The version number will increase in future if more fields are added,
+ but the intention is never to remove any of the existing fields.
+
+ Fields for numerical callouts
+
+ For a numerical callout, callout_string is NULL, and callout_number
+ contains the number of the callout, in the range 0-255. This is the
+ number that follows (?C for callouts that are part of the pattern; it is
+ 255 for automatically generated callouts.
+
+ Fields for string callouts
+
+ For callouts with string arguments, callout_number is always zero, and
+ callout_string points to the string that is contained within the com-
+ piled pattern. Its length is given by callout_string_length. Duplicated
+ ending delimiters that were present in the original pattern string have
+ been turned into single characters, but there is no other processing of
+ the callout string argument. An additional code unit containing binary
+ zero is present after the string, but is not included in the length.
+ The delimiter that was used to start the string is also stored within
+ the pattern, immediately before the string itself. You can access this
+ delimiter as callout_string[-1] if you need it.
+
+ The callout_string_offset field is the code unit offset to the start of
+ the callout argument string within the original pattern string. This is
+ provided for the benefit of applications such as script languages that
+ might need to report errors in the callout string within the pattern.
+
+ Fields for all callouts
+
+ The remaining fields in the callout block are the same for both kinds
+ of callout.
+
+ The offset_vector field is a pointer to a vector of capturing offsets
+ (the "ovector"). You may read the elements in this vector, but you must
+ not change any of them.
+
+ For calls to pcre2_match(), the offset_vector field is not (since
+ release 10.30) a pointer to the actual ovector that was passed to the
+ matching function in the match data block. Instead it points to an
+ internal ovector of a size large enough to hold all possible captured
+ substrings in the pattern. Note that whenever a recursion or subroutine
+ call within a pattern completes, the capturing state is reset to what
+ it was before.
+
+ The capture_last field contains the number of the most recently cap-
+ tured substring, and the capture_top field contains one more than the
+ number of the highest numbered captured substring so far. If no sub-
+ strings have yet been captured, the value of capture_last is 0 and the
+ value of capture_top is 1. The values of these fields do not always
+ differ by one; for example, when the callout in the pattern
+ ((a)(b))(?C2) is taken, capture_last is 1 but capture_top is 4.
+
+ The contents of ovector[2] to ovector[<capture_top>*2-1] can be
+ inspected in order to extract substrings that have been matched so far,
+ in the same way as extracting substrings after a match has completed.
+ The values in ovector[0] and ovector[1] are always PCRE2_UNSET because
+ the match is by definition not complete. Substrings that have not been
+ captured but whose numbers are less than capture_top also have both of
+ their ovector slots set to PCRE2_UNSET.
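+
+ As an illustration of the indexing rule above, here is a minimal sketch
+ of a helper that a callout function might use to copy one captured
+ substring. So that it compiles without pcre2.h, it takes the relevant
+ values as plain parameters. The names copy_capture() and UNSET_OFFSET
+ are hypothetical; UNSET_OFFSET stands in for PCRE2_UNSET, and 8-bit
+ code units are assumed.
+
```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for PCRE2_UNSET, which pcre2.h defines as ~(PCRE2_SIZE)0. */
#define UNSET_OFFSET (~(size_t)0)

/* Copy capture group n into buf as a zero-terminated string, using the
   values a callout function reads from offset_vector, capture_top, and
   subject. Returns the substring length, or -1 if the group is out of
   range, is unset, or does not fit in buf. */
static int copy_capture(const size_t *ovector, uint32_t capture_top,
                        const char *subject, uint32_t n,
                        char *buf, size_t bufsize)
{
    assert(ovector != NULL && subject != NULL && buf != NULL);

    /* Group 0 is never set during a callout; valid groups are
       1 .. capture_top-1, i.e. ovector[2] .. ovector[capture_top*2-1]. */
    if (n == 0 || n >= capture_top) return -1;

    size_t start = ovector[2 * n];
    size_t end = ovector[2 * n + 1];

    /* A group numbered below capture_top may still be unset. */
    if (start == UNSET_OFFSET || end == UNSET_OFFSET) return -1;
    if (end < start) return -1;              /* defensive check only */

    size_t len = end - start;
    if (len + 1 > bufsize) return -1;        /* need room for the zero */

    memcpy(buf, subject + start, len);
    buf[len] = '\0';
    return (int)len;
}
```
+
+ Inside a real callout function taking a pcre2_callout block pointer cb,
+ the first three arguments would be cb->offset_vector, cb->capture_top,
+ and cb->subject (cast to const char * for the 8-bit library).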
+
+ For DFA matching, the offset_vector field points to the ovector that
+ was passed to the matching function in the match data block for call-
+ outs at the top level, but to an internal ovector during the processing
+ of pattern recursions, lookarounds, and atomic groups. However, these
+ ovectors hold no useful information because pcre2_dfa_match() does not
+ support substring capturing. The value of capture_top is always 1 and
+ the value of capture_last is always 0 for DFA matching.
+
+ The subject and subject_length fields contain copies of the values that
+ were passed to the matching function.
+
+ The start_match field normally contains the offset within the subject
+ at which the current match attempt started. However, if the escape
+ sequence \K has been encountered, this value is changed to reflect the
+ modified starting point. If the pattern is not anchored, the callout
+ function may be called several times from the same point in the pattern
+ for different starting points in the subject.
+
+ The current_position field contains the offset within the subject of
+ the current match pointer.
+
+ The pattern_position field contains the offset in the pattern string to
+ the next item to be matched.
+
+ The next_item_length field contains the length of the next item to be
+ processed in the pattern string. When the callout is at the end of the
+ pattern, the length is zero. When the callout precedes an opening
+ parenthesis, the length includes meta characters that follow the paren-
+ thesis. For example, in a callout before an assertion such as (?=ab)
+ the length is 3. For an alternation bar or a closing parenthesis,
+ the length is one, unless a closing parenthesis is followed by a quan-
+ tifier, in which case its length is included. (This changed in release
+ 10.23. In earlier releases, before an opening parenthesis the length
+ was that of the entire subpattern, and before an alternation bar or a
+ closing parenthesis the length was zero.)
+
+ The pattern_position and next_item_length fields are intended to help
+ in distinguishing between different automatic callouts, which all have
+ the same callout number. However, they are set for all callouts, and
+ are used by pcre2test to show the next item to be matched when display-
+ ing callout information.
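+
+ One way an application could use these two fields is to recover the
+ text of the next pattern item, much as pcre2test does in its callout
+ display. The following is a sketch only: next_item_text() is a hypo-
+ thetical helper, it assumes the application has kept a zero-terminated
+ copy of the original pattern, and it assumes 8-bit code units.
+
```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy the pattern item that follows a callout point into buf as a
   zero-terminated string. pattern_position and next_item_length are the
   values taken from the callout block. Returns the number of bytes
   copied, which is 0 for a callout at the end of the pattern; the text
   is truncated if buf is too small. */
static size_t next_item_text(const char *pattern, size_t pattern_position,
                             size_t next_item_length,
                             char *buf, size_t bufsize)
{
    assert(pattern != NULL && buf != NULL && bufsize > 0);

    size_t pattern_length = strlen(pattern);
    size_t len = 0;

    if (pattern_position < pattern_length) {
        len = next_item_length;
        /* Never read past the end of the pattern string. */
        if (len > pattern_length - pattern_position)
            len = pattern_length - pattern_position;
        /* Leave room for the terminating zero. */
        if (len > bufsize - 1)
            len = bufsize - 1;
        memcpy(buf, pattern + pattern_position, len);
    }
    buf[len] = '\0';
    return len;
}
```
+
+ For the pattern a+[bc] with automatic callouts, position 0 with length
+ 2 yields "a+" and position 2 with length 4 yields "[bc]", which are the
+ items shown in the pcre2test output earlier in this document.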
+
+ In callouts from pcre2_match() the mark field contains a pointer to the
+ zero-terminated name of the most recently passed (*MARK), (*PRUNE), or
+ (*THEN) item in the match, or NULL if no such items have been passed.
+ Instances of (*PRUNE) or (*THEN) without a name do not obliterate a
+ previous (*MARK). In callouts from the DFA matching function this field
+ always contains NULL.
+
+ The callout_flags field is always zero in callouts from
+ pcre2_dfa_match() or when JIT is being used. When pcre2_match() without
+ JIT is used, the following bits may be set:
+
+ PCRE2_CALLOUT_STARTMATCH
+
+ This is set for the first callout after the start of matching for each
+ new starting position in the subject.
+
+ PCRE2_CALLOUT_BACKTRACK
+
+ This is set if there has been a matching backtrack since the previous
+ callout, or since the start of matching if this is the first callout
+ from a pcre2_match() run.
+
+ Both bits are set when a backtrack has caused a "bumpalong" to a new
+ starting position in the subject. Output from pcre2test does not indi-
+ cate the presence of these bits unless the callout_extra modifier is
+ set.
+
+ The information in the callout_flags field is provided so that applica-
+ tions can track and tell their users how matching with backtracking is
+ done. This can be useful when trying to optimize patterns, or just to
+ understand how PCRE2 works. There is no support in pcre2_dfa_match()
+ because there is no backtracking in DFA matching, and there is no sup-
+ port in JIT because JIT is all about maximizing matching performance.
+ In both these cases the callout_flags field is always zero.
+
+
+RETURN VALUES FROM CALLOUTS
+
+ The external callout function returns an integer to PCRE2. If the value
+ is zero, matching proceeds as normal. If the value is greater than
+ zero, matching fails at the current point, but the testing of other
+ matching possibilities goes ahead, just as if a lookahead assertion had
+ failed. If the value is less than zero, the match is abandoned, and the
+ matching function returns the negative value.
+
+ Negative values should normally be chosen from the set of
+ PCRE2_ERROR_xxx values. In particular, PCRE2_ERROR_NOMATCH forces a
+ standard "no match" failure. The error number PCRE2_ERROR_CALLOUT is
+ reserved for use by callout functions; it will never be used by PCRE2
+ itself.
+
+
+CALLOUT ENUMERATION
+
+ int pcre2_callout_enumerate(const pcre2_code *code,
+ int (*callback)(pcre2_callout_enumerate_block *, void *),
+ void *user_data);
+
+ A script language that supports the use of string arguments in callouts
+ might like to scan all the callouts in a pattern before running the
+ match. This can be done by calling pcre2_callout_enumerate(). The first
+ argument is a pointer to a compiled pattern, the second points to a
+ callback function, and the third is arbitrary user data. The callback
+ function is called for every callout in the pattern in the order in
+ which they appear. Its first argument is a pointer to a callout enumer-
+ ation block, and its second argument is the user_data value that was
+ passed to pcre2_callout_enumerate(). The data block contains the fol-
+ lowing fields:
+
+ version                 Block version number
+ pattern_position        Offset to next item in pattern
+ next_item_length        Length of next item in pattern
+ callout_number          Number for numbered callouts
+ callout_string_offset   Offset to string within pattern
+ callout_string_length   Length of callout string
+ callout_string          Points to callout string or is NULL
+
+ The version number is currently 0. It will increase if new fields are
+ ever added to the block. The remaining fields are the same as their
+ namesakes in the pcre2_callout block that is used for callouts during
+ matching, as described above.
+
+ Note that the value of pattern_position is unique for each callout.
+ However, if a callout occurs inside a group that is quantified with a
+ non-zero minimum or a fixed maximum, the group is replicated inside the
+ compiled pattern. For example, a pattern such as /(a){2}/ is compiled
+ as if it were /(a)(a)/. This means that the callout will be enumerated
+ more than once, but with the same value for pattern_position in each
+ case.
+
+ The callback function should normally return zero. If it returns a non-
+ zero value, scanning the pattern stops, and that value is returned from
+ pcre2_callout_enumerate().
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 26 April 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2COMPAT(3) Library Functions Manual PCRE2COMPAT(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+DIFFERENCES BETWEEN PCRE2 AND PERL
+
+ This document describes the differences in the ways that PCRE2 and Perl
+ handle regular expressions. The differences described here are with
+ respect to Perl version 5.26, but as both Perl and PCRE2 are continu-
+ ally changing, the information may sometimes be out of date.
+
+ 1. PCRE2 has only a subset of Perl's Unicode support. Details of what
+ it does have are given in the pcre2unicode page.
+
+ 2. Like Perl, PCRE2 allows repeat quantifiers on parenthesized asser-
+ tions, but they do not mean what you might think. For example, (?!a){3}
+ does not assert that the next three characters are not "a". It just
+ asserts that the next character is not "a" three times (in principle;
+ PCRE2 optimizes this to run the assertion just once). Perl allows some
+ repeat quantifiers on other assertions, for example, \b* (but not
+ \b{3}), but these do not seem to have any use.
+
+ 3. Capturing subpatterns that occur inside negative lookaround asser-
+ tions are counted, but their entries in the offsets vector are set only
+ when a negative assertion is a condition that has a matching branch
+ (that is, the condition is false).
+
+ 4. The following Perl escape sequences are not supported: \F, \l, \L,
+ \u, \U, and \N when followed by a character name. \N on its own, match-
+ ing a non-newline character, and \N{U+dd..}, matching a Unicode code
+ point, are supported. The escapes that modify the case of following
+ letters are implemented by Perl's general string-handling and are not
+ part of its pattern matching engine. If any of these are encountered by
+ PCRE2, an error is generated by default. However, if the PCRE2_ALT_BSUX
+ option is set, \U and \u are interpreted as ECMAScript interprets them.
+
+ 5. The Perl escape sequences \p, \P, and \X are supported only if PCRE2
+ is built with Unicode support (the default). The properties that can be
+ tested with \p and \P are limited to the general category properties
+ such as Lu and Nd, script names such as Greek or Han, and the derived
+ properties Any and L&. PCRE2 does support the Cs (surrogate) property,
+ which Perl does not; the Perl documentation says "Because Perl hides
+ the need for the user to understand the internal representation of Uni-
+ code characters, there is no need to implement the somewhat messy con-
+ cept of surrogates."
+
+ 6. PCRE2 supports the \Q...\E escape for quoting substrings. Characters
+ in between are treated as literals. However, this is slightly different
+ from Perl in that $ and @ are also handled as literals inside the
+ quotes. In Perl, they cause variable interpolation (but of course PCRE2
+ does not have variables). Also, Perl does "double-quotish backslash
+ interpolation" on any backslashes between \Q and \E which, its documen-
+ tation says, "may lead to confusing results". PCRE2 treats a backslash
+ between \Q and \E just like any other character. Note the following
+ examples:
+
+ Pattern            PCRE2 matches    Perl matches
+
+ \Qabc$xyz\E        abc$xyz          abc followed by the
+                                     contents of $xyz
+ \Qabc\$xyz\E       abc\$xyz         abc\$xyz
+ \Qabc\E\$\Qxyz\E   abc$xyz          abc$xyz
+ \QA\B\E            A\B              A\B
+ \Q\\E              \                \\E
+
+ The \Q...\E sequence is recognized both inside and outside character
+ classes.
+
+ 7. Fairly obviously, PCRE2 does not support the (?{code}) and
+ (??{code}) constructions. However, PCRE2 does have a "callout" feature,
+ which allows an external function to be called during pattern matching.
+ See the pcre2callout documentation for details.
+
+ 8. Subroutine calls (whether recursive or not) were treated as atomic
+ groups up to PCRE2 release 10.23, but from release 10.30 this changed,
+ and backtracking into subroutine calls is now supported, as in Perl.
+
+ 9. If any of the backtracking control verbs are used in a subpattern
+ that is called as a subroutine (whether or not recursively), their
+ effect is confined to that subpattern; it does not extend to the sur-
+ rounding pattern. This is not always the case in Perl. In particular,
+ if (*THEN) is present in a group that is called as a subroutine, its
+ action is limited to that group, even if the group does not contain any
+ | characters. Note that such subpatterns are processed as anchored at
+ the point where they are tested.
+
+ 10. If a pattern contains more than one backtracking control verb, the
+ first one that is backtracked onto acts. For example, in the pattern
+ A(*COMMIT)B(*PRUNE)C a failure in B triggers (*COMMIT), but a failure
+ in C triggers (*PRUNE). Perl's behaviour is more complex; in many cases
+ it is the same as PCRE2, but there are cases where it differs.
+
+ 11. Most backtracking verbs in assertions have their normal actions.
+ They are not confined to the assertion.
+
+ 12. There are some differences that are concerned with the settings of
+ captured strings when part of a pattern is repeated. For example,
+ matching "aba" against the pattern /^(a(b)?)+$/ in Perl leaves $2
+ unset, but in PCRE2 it is set to "b".
+
+ 13. PCRE2's handling of duplicate subpattern numbers and duplicate sub-
+ pattern names is not as general as Perl's. This is a consequence of the
+ fact that PCRE2 works internally just with numbers, using an external
+ table to translate between numbers and names. In particular, a pattern
+ such as (?|(?<a>A)|(?<b>B)), where the two capturing parentheses have
+ the same number but different names, is not supported, and causes an
+ error at compile time. If it were allowed, it would not be possible to
+ distinguish which parentheses matched, because both names map to cap-
+ turing subpattern number 1. To avoid this confusing situation, an error
+ is given at compile time.
+
+ 14. Perl used to recognize comments in some places that PCRE2 does not,
+ for example, between the ( and ? at the start of a subpattern. If the
+ /x modifier is set, Perl allowed white space between ( and ? though the
+ latest Perls give an error (for a while it was just deprecated). There
+ may still be some cases where Perl behaves differently.
+
+ 15. Perl, when in warning mode, gives warnings for character classes
+ such as [A-\d] or [a-[:digit:]]. It then treats the hyphens as liter-
+ als. PCRE2 has no warning features, so it gives an error in these cases
+ because they are almost certainly user mistakes.
+
+ 16. In PCRE2, the upper/lower case character properties Lu and Ll are
+ not affected when case-independent matching is specified. For example,
+ \p{Lu} always matches an upper case letter. I think Perl has changed in
+ this respect; in the release at the time of writing (5.24), \p{Lu} and
+ \p{Ll} match all letters, regardless of case, when case independence is
+ specified.
+
+ 17. PCRE2 provides some extensions to the Perl regular expression
+ facilities. Perl 5.10 includes new features that are not in earlier
+ versions of Perl, some of which (such as named parentheses) were in
+ PCRE2 for some time before. This list is with respect to Perl 5.26:
+
+ (a) Although lookbehind assertions in PCRE2 must match fixed length
+ strings, each alternative branch of a lookbehind assertion can match a
+ different length of string. Perl requires them all to have the same
+ length.
+
+ (b) From PCRE2 10.23, backreferences to groups of fixed length are sup-
+ ported in lookbehinds, provided that there is no possibility of refer-
+ encing a non-unique number or name. Perl does not support backrefer-
+ ences in lookbehinds.
+
+ (c) If PCRE2_DOLLAR_ENDONLY is set and PCRE2_MULTILINE is not set, the
+ $ meta-character matches only at the very end of the string.
+
+ (d) A backslash followed by a letter with no special meaning is
+ faulted. (Perl can be made to issue a warning.)
+
+ (e) If PCRE2_UNGREEDY is set, the greediness of the repetition quanti-
+ fiers is inverted, that is, by default they are not greedy, but if fol-
+ lowed by a question mark they are.
+
+ (f) PCRE2_ANCHORED can be used at matching time to force a pattern to
+ be tried only at the first matching position in the subject string.
+
+ (g) The PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY and
+ PCRE2_NOTEMPTY_ATSTART options have no Perl equivalents.
+
+ (h) The \R escape sequence can be restricted to match only CR, LF, or
+ CRLF by the PCRE2_BSR_ANYCRLF option.
+
+ (i) The callout facility is PCRE2-specific. Perl supports codeblocks
+ and variable interpolation, but not general hooks on every match.
+
+ (j) The partial matching facility is PCRE2-specific.
+
+ (k) The alternative matching function (pcre2_dfa_match()) matches in a
+ different way and is not Perl-compatible.
+
+ (l) PCRE2 recognizes some special sequences such as (*CR) or (*NO_JIT)
+ at the start of a pattern that set overall options that cannot be
+ changed within the pattern.
+
+ 18. The Perl /a modifier restricts \d numbers to pure ASCII, and the
+ /aa modifier restricts /i case-insensitive matching to pure ASCII,
+ ignoring Unicode rules. This separation cannot be represented with
+ PCRE2_UCP.
+
+ 19. Perl has different limits than PCRE2. See the pcre2limits documen-
+ tation for details. In release 5.10, Perl changed from recursion to
+ iteration, keeping the intermediate matches on the heap, which is ~10%
+ slower but is not subject to any stack-overflow limit. PCRE2 made a
+ similar change at release 10.30, and also has many build-time and
+ run-time customizable limits.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 28 July 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2JIT(3) Library Functions Manual PCRE2JIT(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 JUST-IN-TIME COMPILER SUPPORT
+
+ Just-in-time compiling is a heavyweight optimization that can greatly
+ speed up pattern matching. However, it comes at the cost of extra pro-
+ cessing before the match is performed, so it is of most benefit when
+ the same pattern is going to be matched many times. This does not nec-
+ essarily mean many calls of a matching function; if the pattern is not
+ anchored, matching attempts may take place many times at various posi-
+ tions in the subject, even for a single call. Therefore, if the subject
+ string is very long, it may still pay to use JIT even for one-off
+ matches. JIT support is available for all of the 8-bit, 16-bit and
+ 32-bit PCRE2 libraries.
+
+ JIT support applies only to the traditional Perl-compatible matching
+ function. It does not apply when the DFA matching function is being
+ used. The code for this support was written by Zoltan Herczeg.
+
+
+AVAILABILITY OF JIT SUPPORT
+
+ JIT support is an optional feature of PCRE2. The "configure" option
+ --enable-jit (or equivalent CMake option) must be set when PCRE2 is
+ built if you want to use JIT. The support is limited to the following
+ hardware platforms:
+
+ ARM 32-bit (v5, v7, and Thumb2)
+ ARM 64-bit
+ Intel x86 32-bit and 64-bit
+ MIPS 32-bit and 64-bit
+ Power PC 32-bit and 64-bit
+ SPARC 32-bit
+
+ If --enable-jit is set on an unsupported platform, compilation fails.
+
+ A program can tell if JIT support is available by calling pcre2_con-
+ fig() with the PCRE2_CONFIG_JIT option. The result is 1 when JIT is
+ available, and 0 otherwise. However, a simple program does not need to
+ check this in order to use JIT. The API is implemented in a way that
+ falls back to the interpretive code if JIT is not available. For pro-
+ grams that need the best possible performance, there is also a "fast
+ path" API that is JIT-specific.
+
+
+SIMPLE USE OF JIT
+
+ To make use of the JIT support in the simplest way, all you have to do
+ is to call pcre2_jit_compile() after successfully compiling a pattern
+ with pcre2_compile(). This function has two arguments: the first is the
+ compiled pattern pointer that was returned by pcre2_compile(), and the
+ second is zero or more of the following option bits: PCRE2_JIT_COM-
+ PLETE, PCRE2_JIT_PARTIAL_HARD, or PCRE2_JIT_PARTIAL_SOFT.
+
+ If JIT support is not available, a call to pcre2_jit_compile() does
+ nothing and returns PCRE2_ERROR_JIT_BADOPTION. Otherwise, the compiled
+ pattern is passed to the JIT compiler, which turns it into machine code
+ that executes much faster than the normal interpretive code, but yields
+ exactly the same results. The returned value from pcre2_jit_compile()
+ is zero on success, or a negative error code.
+
+ There is a limit to the size of pattern that JIT supports, imposed by
+ the size of machine stack that it uses. The exact rules are not docu-
+ mented because they may change at any time, in particular, when new
+ optimizations are introduced. If a pattern is too big, a call to
+ pcre2_jit_compile() returns PCRE2_ERROR_NOMEMORY.
+
+ PCRE2_JIT_COMPLETE requests the JIT compiler to generate code for com-
+ plete matches. If you want to run partial matches using the PCRE2_PAR-
+ TIAL_HARD or PCRE2_PARTIAL_SOFT options of pcre2_match(), you should
+ set one or both of the other options as well as, or instead of,
+ PCRE2_JIT_COMPLETE. The JIT compiler generates different optimized code
+ for each of the three modes (normal, soft partial, hard partial). When
+ pcre2_match() is called, the appropriate code is run if it is avail-
+ able. Otherwise, the pattern is matched using interpretive code.
+
+ You can call pcre2_jit_compile() multiple times for the same compiled
+ pattern. It does nothing if it has previously compiled code for any of
+ the option bits. For example, you can call it once with PCRE2_JIT_COM-
+ PLETE and (perhaps later, when you find you need partial matching)
+ again with PCRE2_JIT_COMPLETE and PCRE2_JIT_PARTIAL_HARD. This time it
+ will ignore PCRE2_JIT_COMPLETE and just compile code for partial match-
+ ing. If pcre2_jit_compile() is called with no option bits set, it imme-
+ diately returns zero. This is an alternative way of testing whether JIT
+ is available.
+
+ At present, it is not possible to free JIT compiled code except when
+ the entire compiled pattern is freed by calling pcre2_code_free().
+
+ In some circumstances you may need to call additional functions. These
+ are described in the section entitled "Controlling the JIT stack"
+ below.
+
+ There are some pcre2_match() options that are not supported by JIT, and
+ there are also some pattern items that JIT cannot handle. Details are
+ given below. In both cases, matching automatically falls back to the
+ interpretive code. If you want to know whether JIT was actually used
+ for a particular match, you should arrange for a JIT callback function
+ to be set up as described in the section entitled "Controlling the JIT
+ stack" below, even if you do not need to supply a non-default JIT
+ stack. Such a callback function is called whenever JIT code is about to
+ be obeyed. If the match-time options are not right for JIT execution,
+ the callback function is not obeyed.
+
+ If the JIT compiler finds an unsupported item, no JIT data is gener-
+ ated. You can find out if JIT matching is available after compiling a
+ pattern by calling pcre2_pattern_info() with the PCRE2_INFO_JITSIZE
+ option. A non-zero result means that JIT compilation was successful. A
+ result of 0 means that JIT support is not available, or the pattern was
+ not processed by pcre2_jit_compile(), or the JIT compiler was not able
+ to handle the pattern.
+
+
+UNSUPPORTED OPTIONS AND PATTERN ITEMS
+
+ The pcre2_match() options that are supported for JIT matching are
+ PCRE2_NOTBOL, PCRE2_NOTEOL, PCRE2_NOTEMPTY, PCRE2_NOTEMPTY_ATSTART,
+ PCRE2_NO_UTF_CHECK, PCRE2_PARTIAL_HARD, and PCRE2_PARTIAL_SOFT. The
+ PCRE2_ANCHORED option is not supported at match time.
+
+ If the PCRE2_NO_JIT option is passed to pcre2_match() it disables the
+ use of JIT, forcing matching by the interpreter code.
+
+ The only unsupported pattern items are \C (match a single data unit)
+ when running in a UTF mode, and a callout immediately before an asser-
+ tion condition in a conditional group.
+
+
+RETURN VALUES FROM JIT MATCHING
+
+ When a pattern is matched using JIT matching, the return values are the
+ same as those given by the interpretive pcre2_match() code, with the
+ addition of one new error code: PCRE2_ERROR_JIT_STACKLIMIT. This means
+ that the memory used for the JIT stack was insufficient. See "Control-
+ ling the JIT stack" below for a discussion of JIT stack usage.
+
+ The error code PCRE2_ERROR_MATCHLIMIT is returned by the JIT code if
+ searching a very large pattern tree goes on for too long, as it is in
+ the same circumstance when JIT is not used, but the details of exactly
+ what is counted are not the same. The PCRE2_ERROR_DEPTHLIMIT error code
+ is never returned when JIT matching is used.
+
+
+CONTROLLING THE JIT STACK
+
+ When the compiled JIT code runs, it needs a block of memory to use as a
+ stack. By default, it uses 32KiB on the machine stack. However, some
+ large or complicated patterns need more than this. The error
+ PCRE2_ERROR_JIT_STACKLIMIT is given when there is not enough stack.
+ Three functions are provided for managing blocks of memory for use as
+ JIT stacks. There is further discussion about the use of JIT stacks in
+ the section entitled "JIT stack FAQ" below.
+
+ The pcre2_jit_stack_create() function creates a JIT stack. Its argu-
+ ments are a starting size, a maximum size, and a general context (for
+ memory allocation functions, or NULL for standard memory allocation).
+ It returns a pointer to an opaque structure of type pcre2_jit_stack, or
+ NULL if there is an error. The pcre2_jit_stack_free() function is used
+ to free a stack that is no longer needed. If its argument is NULL, this
+ function returns immediately, without doing anything. (For the techni-
+ cally minded: the address space is allocated by mmap or VirtualAlloc.)
+ A maximum stack size of 512KiB to 1MiB should be more than enough for
+ any pattern.
+
+ The pcre2_jit_stack_assign() function specifies which stack JIT code
+ should use. Its arguments are as follows:
+
+ pcre2_match_context *mcontext
+ pcre2_jit_callback callback
+ void *data
+
+ The first argument is a pointer to a match context. When this is subse-
+ quently passed to a matching function, its information determines which
+ JIT stack is used. If this argument is NULL, the function returns imme-
+ diately, without doing anything. There are three cases for the values
+ of the other two arguments:
+
+ (1) If callback is NULL and data is NULL, an internal 32KiB block
+ on the machine stack is used. This is the default when a match
+ context is created.
+
+ (2) If callback is NULL and data is not NULL, data must be
+ a pointer to a valid JIT stack, the result of calling
+ pcre2_jit_stack_create().
+
+ (3) If callback is not NULL, it must point to a function that is
+ called with data as an argument at the start of matching, in
+ order to set up a JIT stack. If the return from the callback
+ function is NULL, the internal 32KiB stack is used; otherwise the
+ return value must be a valid JIT stack, the result of calling
+ pcre2_jit_stack_create().
+
+ A callback function is obeyed whenever JIT code is about to be run; it
+ is not obeyed when pcre2_match() is called with options that are incom-
+ patible with JIT matching. A callback function can therefore be used to
+ determine whether a match operation was executed by JIT or by the
+ interpreter.
+
+ You may safely use the same JIT stack for more than one pattern (either
+ by assigning directly or by callback), as long as the patterns are
+ matched sequentially in the same thread. Currently, the only way to set
+ up non-sequential matches in one thread is to use callouts: if a call-
+ out function starts another match, that match must use a different JIT
+ stack to the one used for currently suspended match(es).
+
+ In a multithread application, if you do not specify a JIT stack, or if
+ you assign or pass back NULL from a callback, that is thread-safe,
+ because each thread has its own machine stack. However, if you assign
+ or pass back a non-NULL JIT stack, this must be a different stack for
+ each thread so that the application is thread-safe.
+
+ Strictly speaking, even more is allowed. You can assign the same non-
+ NULL stack to a match context that is used by any number of patterns,
+ as long as they are not used for matching by multiple threads at the
+ same time. For example, you could use the same stack in all compiled
+ patterns, with a global mutex in the callback to wait until the stack
+ is available for use. However, this is an inefficient solution, and not
+ recommended.
+
+ This is a suggestion for how a multithreaded program that needs to set
+ up non-default JIT stacks might operate:
+
+ During thread initialization
+ thread_local_var = pcre2_jit_stack_create(...)
+
+ During thread exit
+ pcre2_jit_stack_free(thread_local_var)
+
+ Use a one-line callback function
+ return thread_local_var
+
+ All the functions described in this section do nothing if JIT is not
+ available.
+
+
+JIT STACK FAQ
+
+ (1) Why do we need JIT stacks?
+
+ PCRE2 (and JIT) is a recursive, depth-first engine, so it needs a stack
+ where the local data of the current node is pushed before checking its
+ child nodes. Allocating real machine stack on some platforms is diffi-
+ cult. For example, on PowerPC the stack chain needs to be updated every
+ time the stack is extended. Although this is possible, the overhead of
+ updating it decreases performance. So we do the recursion in memory.
+
+ (2) Why don't we simply allocate blocks of memory with malloc()?
+
+ Modern operating systems have a nice feature: they can reserve an
+ address space instead of allocating memory. We can safely allocate mem-
+ ory pages inside this address space, so the stack could grow without
+ moving memory data (this is important because of pointers). Thus we can
+ allocate 1MiB address space, and use only a single memory page (usually
+ 4KiB) if that is enough. However, we can still grow up to 1MiB anytime
+ if needed.
+
+ (3) Who "owns" a JIT stack?
+
+ The owner of the stack is the user program, not the JIT-compiled pat-
+ tern or anything else. The user program must ensure that, if a stack is
+ being used by pcre2_match() (that is, it is assigned to a match context
+ that is passed to the pattern currently running), that stack is not
+ used by any other threads (to avoid overwriting the same memory area).
+ The best practice for multithreaded programs is to allocate a stack for
+ each thread, and return this stack through the JIT callback function.
+
+ (4) When should a JIT stack be freed?
+
+ You can free a JIT stack at any time, as long as it will not be used by
+ pcre2_match() again. When you assign the stack to a match context, only
+ a pointer is set. There is no reference counting or any other magic.
+ You can free compiled patterns, contexts, and stacks in any order, at
+ any time. Just do not call pcre2_match() with a match context pointing
+ to an already freed stack, as that will cause a segmentation fault.
+ (Also, do not free a stack currently used by pcre2_match() in another
+ thread.) You can
+ also replace the stack in a context at any time when it is not in use.
+ You should free the previous stack before assigning a replacement.
+
+ (5) Should I allocate/free a stack every time before/after calling
+ pcre2_match()?
+
+ No, because this is too costly in terms of resources. However, you
+ could implement a scheme that releases the stack if it has not been
+ used for, say, two minutes. The JIT callback can help to achieve
+ this without keeping a list of patterns.
+
+ (6) OK, the stack is for long term memory allocation. But what happens
+ if a pattern causes stack overflow with a stack of 1MiB? Is that 1MiB
+ kept until the stack is freed?
+
+ Especially on embedded systems, it might be a good idea to release mem-
+ ory sometimes without freeing the stack. There is no API for this at
+ the moment. A function that returns the amount of memory currently
+ allocated for a stack, and another that releases memory (shrinking the
+ stack), would probably be a good idea if someone needs this.
+
+ (7) This is too much of a headache. Isn't there any better solution for
+ JIT stack handling?
+
+ No, thanks to Windows. If POSIX threads were used everywhere, we could
+ throw out this complicated API.
+
+
+FREEING JIT SPECULATIVE MEMORY
+
+ void pcre2_jit_free_unused_memory(pcre2_general_context *gcontext);
+
+ The JIT executable allocator does not free memory as soon as it becomes
+ possible to do so. It expects new allocations, and keeps some free
+ memory around to
+ improve allocation speed. However, in low memory conditions, it might
+ be better to free all possible memory. You can cause this to happen by
+ calling pcre2_jit_free_unused_memory(). Its argument is a general con-
+ text, for custom memory management, or NULL for standard memory manage-
+ ment.
+
+
+EXAMPLE CODE
+
+ This is a single-threaded example that specifies a JIT stack without
+ using a callback. A real program should include error checking after
+ all the function calls.
+
+ int rc;
+ int errornumber;
+ PCRE2_SIZE erroffset;
+ pcre2_code *re;
+ pcre2_match_data *match_data;
+ pcre2_match_context *mcontext;
+ pcre2_jit_stack *jit_stack;
+
+ /* pattern, subject, and length are assumed to be set up elsewhere */
+ re = pcre2_compile(pattern, PCRE2_ZERO_TERMINATED, 0,
+ &errornumber, &erroffset, NULL);
+ rc = pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
+ mcontext = pcre2_match_context_create(NULL);
+ jit_stack = pcre2_jit_stack_create(32*1024, 512*1024, NULL);
+ pcre2_jit_stack_assign(mcontext, NULL, jit_stack);
+ /* Arguments are the number of ovector pairs and a general context */
+ match_data = pcre2_match_data_create(10, NULL);
+ rc = pcre2_match(re, subject, length, 0, 0, match_data, mcontext);
+ /* Process result */
+
+ pcre2_code_free(re);
+ pcre2_match_data_free(match_data);
+ pcre2_match_context_free(mcontext);
+ pcre2_jit_stack_free(jit_stack);
+
+
+JIT FAST PATH API
+
+ Because the API described above falls back to interpreted matching when
+ JIT is not available, it is convenient for programs that are written
+ for general use in many environments. However, calling JIT via
+ pcre2_match() does have a performance impact. Programs that are written
+ for use where JIT is known to be available, and which need the best
+ possible performance, can instead use a "fast path" API to call JIT
+ matching directly instead of calling pcre2_match() (obviously only for
+ patterns that have been successfully processed by pcre2_jit_compile()).
+
+ The fast path function is called pcre2_jit_match(), and it takes
+ exactly the same arguments as pcre2_match(). The return values are also
+ the same, plus PCRE2_ERROR_JIT_BADOPTION if a matching mode (partial or
+ complete) is requested that was not compiled. Unsupported option bits
+ (for example, PCRE2_ANCHORED) are ignored, as is the PCRE2_NO_JIT
+ option.
+
+ When you call pcre2_match(), as well as testing for invalid options, a
+ number of other sanity checks are performed on the arguments. For exam-
+ ple, if the subject pointer is NULL, an immediate error is given. Also,
+ unless PCRE2_NO_UTF_CHECK is set, a UTF subject string is tested for
+ validity. In the interests of speed, these checks do not happen on the
+ JIT fast path, and if invalid data is passed, the result is undefined.
+
+ Bypassing the sanity checks and the pcre2_match() wrapping can give
+ speedups of more than 10%.
+
+
+SEE ALSO
+
+ pcre2api(3)
+
+
+AUTHOR
+
+ Philip Hazel (FAQ by Zoltan Herczeg)
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 28 June 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2LIMITS(3) Library Functions Manual PCRE2LIMITS(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+SIZE AND OTHER LIMITATIONS
+
+ There are some size limitations in PCRE2 but it is hoped that they will
+ never in practice be relevant.
+
+ The maximum size of a compiled pattern is approximately 64 thousand
+ code units for the 8-bit and 16-bit libraries if PCRE2 is compiled with
+ the default internal linkage size, which is 2 bytes for these
+ libraries. If you want to process regular expressions that are truly
+ enormous, you can compile PCRE2 with an internal linkage size of 3 or 4
+ (when building the 16-bit library, 3 is rounded up to 4). See the
+ README file in the source distribution and the pcre2build documentation
+ for details. In these cases the limit is substantially larger. How-
+ ever, the speed of execution is slower. In the 32-bit library, the
+ internal linkage size is always 4.
+
+ The maximum length of a source pattern string is essentially unlimited;
+ it is the largest number a PCRE2_SIZE variable can hold. However, the
+ program that calls pcre2_compile() can specify a smaller limit.
+
+ The maximum length (in code units) of a subject string is one less than
+ the largest number a PCRE2_SIZE variable can hold. PCRE2_SIZE is an
+ unsigned integer type, usually defined as size_t. Its maximum value
+ (that is ~(PCRE2_SIZE)0) is reserved as a special indicator for zero-
+ terminated strings and unset offsets.
+
+ All values in repeating quantifiers must be less than 65536.
+
+ The maximum length of a lookbehind assertion is 65535 characters.
+
+ There is no limit to the number of parenthesized subpatterns, but there
+ can be no more than 65535 capturing subpatterns. There is, however, a
+ limit to the depth of nesting of parenthesized subpatterns of all
+ kinds. This is imposed in order to limit the amount of system stack
+ used at compile time. The default limit can be specified when PCRE2 is
+ built; if not, the default is set to 250. An application can change
+ this limit by calling pcre2_set_parens_nest_limit() to set the limit in
+ a compile context.
+
+ The maximum length of a name for a named subpattern is 32 code units, and
+ the maximum number of named subpatterns is 10000.
+
+ The maximum length of a name in a (*MARK), (*PRUNE), (*SKIP), or
+ (*THEN) verb is 255 code units for the 8-bit library and 65535 code
+ units for the 16-bit and 32-bit libraries.
+
+ The maximum length of a string argument to a callout is the largest
+ number a 32-bit unsigned integer can hold.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 30 March 2017
+ Copyright (c) 1997-2017 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2MATCHING(3) Library Functions Manual PCRE2MATCHING(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 MATCHING ALGORITHMS
+
+ This document describes the two different algorithms that are available
+ in PCRE2 for matching a compiled regular expression against a given
+ subject string. The "standard" algorithm is the one provided by the
+ pcre2_match() function. This works in the same way as Perl's matching
+ function, and provides a Perl-compatible matching operation. The just-
+ in-time (JIT) optimization that is described in the pcre2jit documenta-
+ tion is compatible with this function.
+
+ An alternative algorithm is provided by the pcre2_dfa_match() function;
+ it operates in a different way, and is not Perl-compatible. This alter-
+ native has advantages and disadvantages compared with the standard
+ algorithm, and these are described below.
+
+ When there is only one possible way in which a given subject string can
+ match a pattern, the two algorithms give the same answer. A difference
+ arises, however, when there are multiple possibilities. For example, if
+ the pattern
+
+ ^<.*>
+
+ is matched against the string
+
+ <something> <something else> <something further>
+
+ there are three possible answers. The standard algorithm finds only one
+ of them, whereas the alternative algorithm finds all three.
+
+
+REGULAR EXPRESSIONS AS TREES
+
+ The set of strings that are matched by a regular expression can be rep-
+ resented as a tree structure. An unlimited repetition in the pattern
+ makes the tree of infinite size, but it is still a tree. Matching the
+ pattern to a given subject string (from a given starting point) can be
+ thought of as a search of the tree. There are two ways to search a
+ tree: depth-first and breadth-first, and these correspond to the two
+ matching algorithms provided by PCRE2.
+
+
+THE STANDARD MATCHING ALGORITHM
+
+ In the terminology of Jeffrey Friedl's book "Mastering Regular Expres-
+ sions", the standard algorithm is an "NFA algorithm". It conducts a
+ depth-first search of the pattern tree. That is, it proceeds along a
+ single path through the tree, checking that the subject matches what is
+ required. When there is a mismatch, the algorithm tries any alterna-
+ tives at the current point, and if they all fail, it backs up to the
+ previous branch point in the tree, and tries the next alternative
+ branch at that level. This often involves backing up (moving to the
+ left) in the subject string as well. The order in which repetition
+ branches are tried is controlled by the greedy or ungreedy nature of
+ the quantifier.
+
+ If a leaf node is reached, a matching string has been found, and at
+ that point the algorithm stops. Thus, if there is more than one possi-
+ ble match, this algorithm returns the first one that it finds. Whether
+ this is the shortest, the longest, or some intermediate length depends
+ on the way the greedy and ungreedy repetition quantifiers are specified
+ in the pattern.
+
+ Because it ends up with a single path through the tree, it is rela-
+ tively straightforward for this algorithm to keep track of the sub-
+ strings that are matched by portions of the pattern in parentheses.
+ This provides support for capturing parentheses and backreferences.
+
+
+THE ALTERNATIVE MATCHING ALGORITHM
+
+ This algorithm conducts a breadth-first search of the tree. Starting
+ from the first matching point in the subject, it scans the subject
+ string from left to right, once, character by character, and as it does
+ this, it remembers all the paths through the tree that represent valid
+ matches. In Friedl's terminology, this is a kind of "DFA algorithm",
+ though it is not implemented as a traditional finite state machine (it
+ keeps multiple states active simultaneously).
+
+ Although the general principle of this matching algorithm is that it
+ scans the subject string only once, without backtracking, there is one
+ exception: when a lookaround assertion is encountered, the characters
+ following or preceding the current point have to be independently
+ inspected.
+
+ The scan continues until either the end of the subject is reached, or
+ there are no more unterminated paths. At this point, terminated paths
+ represent the different matching possibilities (if there are none, the
+ match has failed). Thus, if there is more than one possible match,
+ this algorithm finds all of them, and in particular, it finds the long-
+ est. The matches are returned in decreasing order of length. There is
+ an option to stop the algorithm after the first match (which is neces-
+ sarily the shortest) is found.
+
+ Note that all the matches that are found start at the same point in the
+ subject. If the pattern
+
+ cat(er(pillar)?)?
+
+ is matched against the string "the caterpillar catchment", the result
+ is the three strings "caterpillar", "cater", and "cat" that start at
+ the fifth character of the subject. The algorithm does not automati-
+ cally move on to find matches that start at later positions.
+
+ PCRE2's "auto-possessification" optimization usually applies to charac-
+ ter repeats at the end of a pattern (as well as internally). For exam-
+ ple, the pattern "a\d+" is compiled as if it were "a\d++" because there
+ is no point even considering the possibility of backtracking into the
+ repeated digits. For DFA matching, this means that only one possible
+ match is found. If you really do want multiple matches in such cases,
+ either use an ungreedy repeat ("a\d+?") or set the PCRE2_NO_AUTO_POS-
+ SESS option when compiling.
+
+ There are a number of features of PCRE2 regular expressions that are
+ not supported by the alternative matching algorithm, or that behave
+ differently when it is used. They are as follows:
+
+ 1. Because the algorithm finds all possible matches, the greedy or
+ ungreedy nature of repetition quantifiers is not relevant (though it
+ may affect auto-possessification, as just described). During matching,
+ greedy and ungreedy quantifiers are treated in exactly the same way.
+ However, possessive quantifiers can make a difference when what follows
+ could also match what is quantified, for example in a pattern like
+ this:
+
+ ^a++\w!
+
+ This pattern matches "aaab!" but not "aaa!", which would be matched by
+ a non-possessive quantifier. Similarly, if an atomic group is present,
+ it is matched as if it were a standalone pattern at the current point,
+ and the longest match is then "locked in" for the rest of the overall
+ pattern.
+
+ 2. When dealing with multiple paths through the tree simultaneously, it
+ is not straightforward to keep track of captured substrings for the
+ different matching possibilities, and PCRE2's implementation of this
+ algorithm does not attempt to do this. This means that no captured sub-
+ strings are available.
+
+ 3. Because no substrings are captured, backreferences within the pat-
+ tern are not supported, and cause errors if encountered.
+
+ 4. For the same reason, conditional expressions that use a backrefer-
+ ence as the condition or test for a specific group recursion are not
+ supported.
+
+ 5. Because many paths through the tree may be active, the \K escape
+ sequence, which resets the start of the match when encountered (but may
+ be on some paths and not on others), is not supported. It causes an
+ error if encountered.
+
+ 6. Callouts are supported, but the value of the capture_top field is
+ always 1, and the value of the capture_last field is always 0.
+
+ 7. The \C escape sequence, which (in the standard algorithm) always
+ matches a single code unit, even in a UTF mode, is not supported in
+ these modes, because the alternative algorithm moves through the sub-
+ ject string one character (not code unit) at a time, for all active
+ paths through the tree.
+
+ 8. Except for (*FAIL), the backtracking control verbs such as (*PRUNE)
+ are not supported. (*FAIL) is supported, and behaves like a failing
+ negative assertion.
+
+
+ADVANTAGES OF THE ALTERNATIVE ALGORITHM
+
+ Using the alternative matching algorithm provides the following advan-
+ tages:
+
+ 1. All possible matches (at a single point in the subject) are automat-
+ ically found, and in particular, the longest match is found. To find
+ more than one match using the standard algorithm, you have to do kludgy
+ things with callouts.
+
+ 2. Because the alternative algorithm scans the subject string just
+ once, and never needs to backtrack (except for lookbehinds), it is pos-
+ sible to pass very long subject strings to the matching function in
+ several pieces, checking for partial matching each time. Although it is
+ also possible to do multi-segment matching using the standard algo-
+ rithm, by retaining partially matched substrings, it is more compli-
+ cated. The pcre2partial documentation gives details of partial matching
+ and discusses multi-segment matching.
+
+
+DISADVANTAGES OF THE ALTERNATIVE ALGORITHM
+
+ The alternative algorithm suffers from a number of disadvantages:
+
+ 1. It is substantially slower than the standard algorithm. This is
+ partly because it has to search for all possible matches, but is also
+ because it is less susceptible to optimization.
+
+ 2. Capturing parentheses and backreferences are not supported.
+
+ 3. Although atomic groups are supported, their use does not provide the
+ performance advantage that it does for the standard algorithm.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 29 September 2014
+ Copyright (c) 1997-2014 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2PARTIAL(3) Library Functions Manual PCRE2PARTIAL(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions
+
+PARTIAL MATCHING IN PCRE2
+
+ In normal use of PCRE2, if the subject string that is passed to a
+ matching function matches as far as it goes, but is too short to match
+ the entire pattern, PCRE2_ERROR_NOMATCH is returned. There are circum-
+ stances where it might be helpful to distinguish this case from other
+ cases in which there is no match.
+
+ Consider, for example, an application where a human is required to type
+ in data for a field with specific formatting requirements. An example
+ might be a date in the form ddmmmyy, defined by this pattern:
+
+ ^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$
+
+ If the application sees the user's keystrokes one by one, and can check
+ that what has been typed so far is potentially valid, it is able to
+ raise an error as soon as a mistake is made, by beeping and not
+ reflecting the character that has been typed, for example. This immedi-
+ ate feedback is likely to be a better user interface than a check that
+ is delayed until the entire string has been entered. Partial matching
+ can also be useful when the subject string is very long and is not all
+ available at once.
+
+ PCRE2 supports partial matching by means of the PCRE2_PARTIAL_SOFT and
+ PCRE2_PARTIAL_HARD options, which can be set when calling a matching
+ function. The difference between the two options is whether or not a
+ partial match is preferred to an alternative complete match, though the
+ details differ between the two types of matching function. If both
+ options are set, PCRE2_PARTIAL_HARD takes precedence.
+
+ If you want to use partial matching with just-in-time optimized code,
+ you must call pcre2_jit_compile() with one or both of these options:
+
+ PCRE2_JIT_PARTIAL_SOFT
+ PCRE2_JIT_PARTIAL_HARD
+
+ PCRE2_JIT_COMPLETE should also be set if you are going to run non-par-
+ tial matches on the same pattern. If the appropriate JIT mode has not
+ been compiled, interpretive matching code is used.
+
+ Setting a partial matching option disables two of PCRE2's standard
+ optimizations. PCRE2 remembers the last literal code unit in a pattern,
+ and abandons matching immediately if it is not present in the subject
+ string. This optimization cannot be used for a subject string that
+ might match only partially. PCRE2 also knows the minimum length of a
+ matching string, and does not bother to run the matching function on
+ shorter strings. This optimization is also disabled for partial match-
+ ing.
+
+
+PARTIAL MATCHING USING pcre2_match()
+
+ A partial match occurs during a call to pcre2_match() when the end of
+ the subject string is reached successfully, but matching cannot con-
+ tinue because more characters are needed. However, at least one charac-
+ ter in the subject must have been inspected. This character need not
+ form part of the final matched string; lookbehind assertions and the \K
+ escape sequence provide ways of inspecting characters before the start
+ of a matched string. The requirement for inspecting at least one char-
+ acter exists because an empty string can always be matched; without
+ such a restriction there would always be a partial match of an empty
+ string at the end of the subject.
+
+ When a partial match is returned, the first two elements in the ovector
+ point to the portion of the subject that was matched, but the values in
+ the rest of the ovector are undefined. The appearance of \K in the pat-
+ tern has no effect for a partial match. Consider this pattern:
+
+ /abc\K123/
+
+ If it is matched against "456abc123xyz" the result is a complete match,
+ and the ovector defines the matched string as "123", because \K resets
+ the "start of match" point. However, if a partial match is requested
+ and the subject string is "456abc12", a partial match is found for the
+ string "abc12", because all these characters are needed for a subse-
+ quent re-match with additional characters.
+
+ What happens when a partial match is identified depends on which of the
+ two partial matching options is set.
+
+ PCRE2_PARTIAL_SOFT WITH pcre2_match()
+
+ If PCRE2_PARTIAL_SOFT is set when pcre2_match() identifies a partial
+ match, the partial match is remembered, but matching continues as nor-
+ mal, and other alternatives in the pattern are tried. If no complete
+ match can be found, PCRE2_ERROR_PARTIAL is returned instead of
+ PCRE2_ERROR_NOMATCH.
+
+ This option is "soft" because it prefers a complete match over a par-
+ tial match. All the various matching items in a pattern behave as if
+ the subject string is potentially complete. For example, \z, \Z, and $
+ match at the end of the subject, as normal, and for \b and \B the end
+ of the subject is treated as a non-alphanumeric.
+
+ If there is more than one partial match, the first one that was found
+ provides the data that is returned. Consider this pattern:
+
+ /123\w+X|dogY/
+
+ If this is matched against the subject string "abc123dog", both alter-
+ natives fail to match, but the end of the subject is reached during
+ matching, so PCRE2_ERROR_PARTIAL is returned. The offsets are set to 3
+ and 9, identifying "123dog" as the first partial match that was found.
+ (In this example, there are two partial matches, because "dog" on its
+ own partially matches the second alternative.)
+
+ PCRE2_PARTIAL_HARD WITH pcre2_match()
+
+ If PCRE2_PARTIAL_HARD is set for pcre2_match(), PCRE2_ERROR_PARTIAL is
+ returned as soon as a partial match is found, without continuing to
+ search for possible complete matches. This option is "hard" because it
+ prefers an earlier partial match over a later complete match. For this
+ reason, the assumption is made that the end of the supplied subject
+ string may not be the true end of the available data, and so, if \z,
+ \Z, \b, \B, or $ are encountered at the end of the subject, the result
+ is PCRE2_ERROR_PARTIAL, provided that at least one character in the
+ subject has been inspected.
+
+ Comparing hard and soft partial matching
+
+ The difference between the two partial matching options can be illus-
+ trated by a pattern such as:
+
+ /dog(sbody)?/
+
+ This matches either "dog" or "dogsbody", greedily (that is, it prefers
+ the longer string if possible). If it is matched against the string
+ "dog" with PCRE2_PARTIAL_SOFT, it yields a complete match for "dog".
+ However, if PCRE2_PARTIAL_HARD is set, the result is PCRE2_ERROR_PAR-
+ TIAL. On the other hand, if the pattern is made ungreedy the result is
+ different:
+
+ /dog(sbody)??/
+
+ In this case the result is always a complete match because that is
+ found first, and matching never continues after finding a complete
+ match. It might be easier to follow this explanation by thinking of the
+ two patterns like this:
+
+ /dog(sbody)?/ is the same as /dogsbody|dog/
+ /dog(sbody)??/ is the same as /dog|dogsbody/
+
+ The second pattern will never match "dogsbody", because it will always
+ find the shorter match first.
+
+
+PARTIAL MATCHING USING pcre2_dfa_match()
+
+ The DFA functions move along the subject string character by character,
+ without backtracking, searching for all possible matches simultane-
+ ously. If the end of the subject is reached before the end of the pat-
+ tern, there is the possibility of a partial match, again provided that
+ at least one character has been inspected.
+
+ When PCRE2_PARTIAL_SOFT is set, PCRE2_ERROR_PARTIAL is returned only if
+ there have been no complete matches. Otherwise, the complete matches
+ are returned. However, if PCRE2_PARTIAL_HARD is set, a partial match
+ takes precedence over any complete matches. The portion of the string
+ that was matched when the longest partial match was found is set as the
+ first matching string.
+
+ Because the DFA functions always search for all possible matches, and
+ there is no difference between greedy and ungreedy repetition, their
+ behaviour is different from the standard functions when PCRE2_PAR-
+ TIAL_HARD is set. Consider the string "dog" matched against the
+ ungreedy pattern shown above:
+
+ /dog(sbody)??/
+
+ Whereas the standard function stops as soon as it finds the complete
+ match for "dog", the DFA function also finds the partial match for
+ "dogsbody", and so returns that when PCRE2_PARTIAL_HARD is set.
+
+
+PARTIAL MATCHING AND WORD BOUNDARIES
+
+ If a pattern ends with one of the sequences \b or \B, which test for word
+ boundaries, partial matching with PCRE2_PARTIAL_SOFT can give counter-
+ intuitive results. Consider this pattern:
+
+ /\bcat\b/
+
+ This matches "cat", provided there is a word boundary at either end. If
+ the subject string is "the cat", the comparison of the final "t" with a
+ following character cannot take place, so a partial match is found.
+ However, normal matching carries on, and \b matches at the end of the
+ subject when the last character is a letter, so a complete match is
+ found. The result, therefore, is not PCRE2_ERROR_PARTIAL. Using
+ PCRE2_PARTIAL_HARD in this case does yield PCRE2_ERROR_PARTIAL, because
+ then the partial match takes precedence.
+
+
+EXAMPLE OF PARTIAL MATCHING USING PCRE2TEST
+
+ If the partial_soft (or ps) modifier is present on a pcre2test data
+ line, the PCRE2_PARTIAL_SOFT option is used for the match. Here is a
+ run of pcre2test that uses the date example quoted above:
+
+ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data> 25jun04\=ps
+ 0: 25jun04
+ 1: jun
+ data> 25dec3\=ps
+ Partial match: 25dec3
+ data> 3ju\=ps
+ Partial match: 3ju
+ data> 3juj\=ps
+ No match
+ data> j\=ps
+ No match
+
+ The first data string is matched completely, so pcre2test shows the
+ matched substrings. The remaining four strings do not match the com-
+ plete pattern, but the first two are partial matches. Similar output is
+ obtained if DFA matching is used.
+
+ If the partial_hard (or ph) modifier is present on a pcre2test data
+ line, the PCRE2_PARTIAL_HARD option is set for the match.
+
+
+MULTI-SEGMENT MATCHING WITH pcre2_dfa_match()
+
+ When a partial match has been found using a DFA matching function, it
+ is possible to continue the match by providing additional subject data
+ and calling the function again with the same compiled regular expres-
+ sion, this time setting the PCRE2_DFA_RESTART option. You must pass the
+ same working space as before, because this is where details of the pre-
+ vious partial match are stored. Here is an example using pcre2test:
+
+ re> /^\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d$/
+ data> 23ja\=dfa,ps
+ Partial match: 23ja
+ data> n05\=dfa,dfa_restart
+ 0: n05
+
+ The first call has "23ja" as the subject, and requests partial match-
+ ing; the second call has "n05" as the subject for the continued
+ (restarted) match. Notice that when the match is complete, only the
+ last part is shown; PCRE2 does not retain the previously partially-
+ matched string. It is up to the calling program to do that if it needs
+ to.
+
+ That means that, for an unanchored pattern, if a continued match fails,
+ it is not possible to try again at a new starting point. All this
+ facility is capable of doing is continuing with the previous match
+ attempt. In the previous example, if the second set of data is "ug23"
+ the result is no match, even though there would be a match for "aug23"
+ if the entire string were given at once. Depending on the application,
+ this may or may not be what you want. The only way to allow for start-
+ ing again at the next character is to retain the matched part of the
+ subject and try a new complete match.
+
+ You can set the PCRE2_PARTIAL_SOFT or PCRE2_PARTIAL_HARD options with
+ PCRE2_DFA_RESTART to continue partial matching over multiple segments.
+ This facility can be used to pass very long subject strings to the DFA
+ matching functions.
+
+
+MULTI-SEGMENT MATCHING WITH pcre2_match()
+
+ Unlike the DFA function, it is not possible to restart the previous
+ match with a new segment of data when using pcre2_match(). Instead, new
+ data must be added to the previous subject string, and the entire match
+ re-run, starting from the point where the partial match occurred. Ear-
+ lier data can be discarded.
+
+ It is best to use PCRE2_PARTIAL_HARD in this situation, because it does
+ not treat the end of a segment as the end of the subject when matching
+ \z, \Z, \b, \B, and $. Consider an unanchored pattern that matches
+ dates:
+
+ re> /\d?\d(jan|feb|mar|apr|may|jun|jul|aug|sep|oct|nov|dec)\d\d/
+ data> The date is 23ja\=ph
+ Partial match: 23ja
+
+ At this stage, an application could discard the text preceding "23ja",
+ add on text from the next segment, and call the matching function
+ again. Unlike the DFA matching function, the entire matching string
+ must always be available, and the complete matching process occurs for
+ each call, so more memory and more processing time are needed.
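+
+ The buffer handling described above can be sketched in C as follows.
+ This is only an illustration under stated assumptions: carry_over() is
+ a hypothetical helper, not part of PCRE2, and partial_start stands for
+ the ovector[0] value reported after a partial match. The sketch does
+ not allow for lookbehinds, which need extra retained characters.
+

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not a PCRE2 function. Discards the text before
   partial_start (the ovector[0] value after a partial match), moves the
   partially matched tail to the front of buf, and appends the next
   segment. On success, updates *len and returns 0; returns -1 if the
   arguments are inconsistent or the result would not fit in cap code
   units. The segment must not overlap buf. */
int carry_over(char *buf, size_t *len, size_t cap, size_t partial_start,
               const char *segment, size_t seg_len)
{
    assert(buf != NULL && len != NULL && segment != NULL);
    if (*len > cap || partial_start > *len) return -1;

    size_t kept = *len - partial_start;
    if (seg_len > cap - kept) return -1;

    memmove(buf, buf + partial_start, kept);  /* keep, e.g., "23ja" */
    memcpy(buf + kept, segment, seg_len);     /* add the new segment */
    *len = kept + seg_len;
    return 0;
}
```

+
+ After such a call the application would run pcre2_match() again on the
+ rebuilt buffer, preferably still with PCRE2_PARTIAL_HARD set.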
+
+
+ISSUES WITH MULTI-SEGMENT MATCHING
+
+ Certain types of pattern may give problems with multi-segment matching,
+ whichever matching function is used.
+
+ 1. If the pattern contains a test for the beginning of a line, you need
+ to pass the PCRE2_NOTBOL option when the subject string for any call
+ does not start at the beginning of a line. There is also a
+ PCRE2_NOTEOL option, but in practice when doing multi-segment matching
+ you should be using PCRE2_PARTIAL_HARD, which includes the effect of
+ PCRE2_NOTEOL.
+
+ 2. If a pattern contains a lookbehind assertion, characters that pre-
+ cede the start of the partial match may have been inspected during the
+ matching process. When using pcre2_match(), sufficient characters must
+ be retained for the next match attempt. You can ensure that enough
+ characters are retained by doing the following:
+
+ Before doing any matching, find the length of the longest lookbehind in
+ the pattern by calling pcre2_pattern_info() with the
+ PCRE2_INFO_MAXLOOKBEHIND option. Note that the resulting count is in
+ characters, not code units. After a partial match, moving back from the
+ ovector[0] offset in the subject by the number of characters given for
+ the maximum lookbehind gets you to the earliest character that must be
+ retained. In a non-UTF or a 32-bit situation, moving back is just a
+ subtraction, but in UTF-8 or UTF-16 you have to count characters while
+ moving back through the code units.
+
+ Characters before the point you have now reached can be discarded, and
+ after the next segment has been added to what is retained, you should
+ run the next match with the startoffset argument set so that the match
+ begins at the same point as before.
+
+ For example, if the pattern "(?<=123)abc" is partially matched against
+ the string "xx123ab", the ovector offsets are 5 and 7 ("ab"). The maxi-
+ mum lookbehind count is 3, so all characters before offset 2 can be
+ discarded. The value of startoffset for the next match should be 3.
+ When pcre2test displays a partial match, it indicates the lookbehind
+ characters with '<' characters:
+
+ re> "(?<=123)abc"
+ data> xx123ab\=ph
+ Partial match: 123ab
+ <<<
+
+ 3. Because a partial match must always contain at least one character,
+ what might be considered a partial match of an empty string actually
+ gives a "no match" result. For example:
+
+ re> /c(?<=abc)x/
+ data> ab\=ps
+ No match
+
+ If the next segment begins "cx", a match should be found, but this will
+ only happen if characters from the previous segment are retained. For
+ this reason, a "no match" result should be interpreted as "partial
+ match of an empty string" when the pattern contains lookbehinds.
+
+ 4. Matching a subject string that is split into multiple segments may
+ not always produce exactly the same result as matching over one single
+ long string, especially when PCRE2_PARTIAL_SOFT is used. The section
+ "Partial Matching and Word Boundaries" above describes an issue that
+ arises if the pattern ends with \b or \B. Another kind of difference
+ may occur when there are multiple matching possibilities, because (for
+ PCRE2_PARTIAL_SOFT) a partial match result is given only when there are
+ no completed matches. This means that as soon as the shortest match has
+ been found, continuation to a new subject segment is no longer possi-
+ ble. Consider this pcre2test example:
+
+ re> /dog(sbody)?/
+ data> dogsb\=ps
+ 0: dog
+ data> do\=ps,dfa
+ Partial match: do
+ data> gsb\=ps,dfa,dfa_restart
+ 0: g
+ data> dogsbody\=dfa
+ 0: dogsbody
+ 1: dog
+
+ The first data line passes the string "dogsb" to a standard matching
+ function, setting the PCRE2_PARTIAL_SOFT option. Although the string is
+ a partial match for "dogsbody", the result is not PCRE2_ERROR_PARTIAL,
+ because the shorter string "dog" is a complete match. Similarly, when
+ the subject is presented to a DFA matching function in several parts
+ ("do" and "gsb" being the first two) the match stops when "dog" has
+ been found, and it is not possible to continue. On the other hand, if
+ "dogsbody" is presented as a single string, a DFA matching function
+ finds both matches.
+
+ Because of these problems, it is best to use PCRE2_PARTIAL_HARD when
+ matching multi-segment data. The example above then behaves differ-
+ ently:
+
+ re> /dog(sbody)?/
+ data> dogsb\=ph
+ Partial match: dogsb
+ data> do\=ps,dfa
+ Partial match: do
+ data> gsb\=ph,dfa,dfa_restart
+ Partial match: gsb
+
+ 5. Patterns that contain alternatives at the top level which do not all
+ start with the same pattern item may not work as expected when
+ PCRE2_DFA_RESTART is used. For example, consider this pattern:
+
+ 1234|3789
+
+ If the first part of the subject is "ABC123", a partial match of the
+ first alternative is found at offset 3. There is no partial match for
+ the second alternative, because such a match does not start at the same
+ point in the subject string. Attempting to continue with the string
+ "7890" does not yield a match because only those alternatives that
+ match at one point in the subject are remembered. The problem arises
+ because the start of the second alternative matches within the first
+ alternative. There is no problem with anchored patterns or patterns
+ such as:
+
+ 1234|ABCD
+
+ where no string can be a partial match for both alternatives. This is
+ not a problem if a standard matching function is used, because the
+ entire match has to be rerun each time:
+
+ re> /1234|3789/
+ data> ABC123\=ph
+ Partial match: 123
+ data> 1237890
+ 0: 3789
+
+ Of course, instead of using PCRE2_DFA_RESTART, the same technique of
+ re-running the entire match can also be used with the DFA matching
+ function. Another possibility is to work with two buffers. If a partial
+ match at offset n in the first buffer is followed by "no match" when
+ PCRE2_DFA_RESTART is used on the second buffer, you can then try a new
+ match starting at offset n+1 in the first buffer.
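+
+ The retention rule in item 2 above (moving back from ovector[0] by the
+ maximum lookbehind, counted in characters) can be sketched for UTF-8
+ subjects as follows. The function name is hypothetical and the subject
+ is assumed to be valid UTF-8; in a non-UTF or 32-bit situation a plain
+ subtraction does the same job.
+

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not part of PCRE2. Given a valid UTF-8 subject,
   the ovector[0] offset of a partial match (in code units), and the
   PCRE2_INFO_MAXLOOKBEHIND value (in characters), returns the offset of
   the earliest code unit that must be retained. */
size_t earliest_retained_utf8(const unsigned char *subject,
                              size_t match_start,
                              size_t max_lookbehind)
{
    assert(subject != NULL);
    size_t pos = match_start;
    while (max_lookbehind > 0 && pos > 0) {
        pos--;                               /* step back one code unit */
        while (pos > 0 && (subject[pos] & 0xC0) == 0x80)
            pos--;                           /* skip continuation bytes */
        max_lookbehind--;
    }
    return pos;
}
```

+
+ For the "xx123ab" example, a match start of 5 and a lookbehind of 3
+ give 2, so the startoffset for the next match is 5 - 2 = 3, in
+ agreement with the text above.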
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 22 December 2014
+ Copyright (c) 1997-2014 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2PATTERN(3) Library Functions Manual PCRE2PATTERN(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 REGULAR EXPRESSION DETAILS
+
+ The syntax and semantics of the regular expressions that are supported
+ by PCRE2 are described in detail below. There is a quick-reference syn-
+ tax summary in the pcre2syntax page. PCRE2 tries to match Perl syntax
+ and semantics as closely as it can. PCRE2 also supports some alterna-
+ tive regular expression syntax (which does not conflict with the Perl
+ syntax) in order to provide some compatibility with regular expressions
+ in Python, .NET, and Oniguruma.
+
+ Perl's regular expressions are described in its own documentation, and
+ regular expressions in general are covered in a number of books, some
+ of which have copious examples. Jeffrey Friedl's "Mastering Regular
+ Expressions", published by O'Reilly, covers regular expressions in
+ great detail. This description of PCRE2's regular expressions is
+ intended as reference material.
+
+ This document discusses the patterns that are supported by PCRE2 when
+ its main matching function, pcre2_match(), is used. PCRE2 also has an
+ alternative matching function, pcre2_dfa_match(), which matches using a
+ different algorithm that is not Perl-compatible. Some of the features
+ discussed below are not available when DFA matching is used. The advan-
+ tages and disadvantages of the alternative function, and how it differs
+ from the normal function, are discussed in the pcre2matching page.
+
+
+SPECIAL START-OF-PATTERN ITEMS
+
+ A number of options that can be passed to pcre2_compile() can also be
+ set by special items at the start of a pattern. These are not Perl-com-
+ patible, but are provided to make these options accessible to pattern
+ writers who are not able to change the program that processes the pat-
+ tern. Any number of these items may appear, but they must all be
+ together right at the start of the pattern string, and the letters must
+ be in upper case.
+
+ UTF support
+
+ In the 8-bit and 16-bit PCRE2 libraries, characters may be coded either
+ as single code units, or as multiple UTF-8 or UTF-16 code units. UTF-32
+ can be specified for the 32-bit library, in which case it constrains
+ the character values to valid Unicode code points. To process UTF
+ strings, PCRE2 must be built to include Unicode support (which is the
+ default). When using UTF strings you must either call the compiling
+ function with the PCRE2_UTF option, or the pattern must start with the
+ special sequence (*UTF), which is equivalent to setting the relevant
+ option. How setting a UTF mode affects pattern matching is mentioned in
+ several places below. There is also a summary of features in the
+ pcre2unicode page.
+
+ Some applications that allow their users to supply patterns may wish to
+ restrict them to non-UTF data for security reasons. If the
+ PCRE2_NEVER_UTF option is passed to pcre2_compile(), (*UTF) is not
+ allowed, and its appearance in a pattern causes an error.
+
+ Unicode property support
+
+ Another special sequence that may appear at the start of a pattern is
+ (*UCP). This has the same effect as setting the PCRE2_UCP option: it
+ causes sequences such as \d and \w to use Unicode properties to deter-
+ mine character types, instead of recognizing only characters with codes
+ less than 256 via a lookup table.
+
+ Some applications that allow their users to supply patterns may wish to
+ restrict them for security reasons. If the PCRE2_NEVER_UCP option is
+ passed to pcre2_compile(), (*UCP) is not allowed, and its appearance in
+ a pattern causes an error.
+
+ Locking out empty string matching
+
+ Starting a pattern with (*NOTEMPTY) or (*NOTEMPTY_ATSTART) has the same
+ effect as passing the PCRE2_NOTEMPTY or PCRE2_NOTEMPTY_ATSTART option
+ to whichever matching function is subsequently called to match the pat-
+ tern. These options lock out the matching of empty strings, either
+ entirely, or only at the start of the subject.
+
+ Disabling auto-possessification
+
+ If a pattern starts with (*NO_AUTO_POSSESS), it has the same effect as
+ setting the PCRE2_NO_AUTO_POSSESS option. This stops PCRE2 from making
+ quantifiers possessive when what follows cannot match the repeated
+ item. For example, by default a+b is treated as a++b. For more details,
+ see the pcre2api documentation.
+
+ Disabling start-up optimizations
+
+ If a pattern starts with (*NO_START_OPT), it has the same effect as
+ setting the PCRE2_NO_START_OPTIMIZE option. This disables several opti-
+ mizations for quickly reaching "no match" results. For more details,
+ see the pcre2api documentation.
+
+ Disabling automatic anchoring
+
+ If a pattern starts with (*NO_DOTSTAR_ANCHOR), it has the same effect
+ as setting the PCRE2_NO_DOTSTAR_ANCHOR option. This disables optimiza-
+ tions that apply to patterns whose top-level branches all start with .*
+ (match any number of arbitrary characters). For more details, see the
+ pcre2api documentation.
+
+ Disabling JIT compilation
+
+ If a pattern that starts with (*NO_JIT) is successfully compiled, an
+ attempt by the application to apply the JIT optimization by calling
+ pcre2_jit_compile() is ignored.
+
+ Setting match resource limits
+
+ The pcre2_match() function contains a counter that is incremented every
+ time it goes round its main loop. The caller of pcre2_match() can set a
+ limit on this counter, which therefore limits the amount of computing
+ resource used for a match. The maximum depth of nested backtracking can
+ also be limited; this indirectly restricts the amount of heap memory
+ that is used, but there is also an explicit memory limit that can be
+ set.
+
+ These facilities are provided to catch runaway matches that are pro-
+ voked by patterns with huge matching trees (a typical example is a pat-
+ tern with nested unlimited repeats applied to a long string that does
+ not match). When one of these limits is reached, pcre2_match() gives an
+ error return. The limits can also be set by items at the start of the
+ pattern of the form
+
+ (*LIMIT_HEAP=d)
+ (*LIMIT_MATCH=d)
+ (*LIMIT_DEPTH=d)
+
+ where d is any number of decimal digits. However, the value of the set-
+ ting must be less than the value set (or defaulted) by the caller of
+ pcre2_match() for it to have any effect. In other words, the pattern
+ writer can lower the limits set by the programmer, but not raise them.
+ If there is more than one setting of one of these limits, the lower
+ value is used. The heap limit is specified in kibibytes (units of 1024
+ bytes).
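+
+ The rule that a pattern can lower but never raise a limit amounts to
+ taking a minimum. The following sketch only models that rule as stated
+ above; the function names are hypothetical and it is not how PCRE2
+ itself is implemented.
+

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* caller_limit is the value set (or defaulted) by the program;
   pattern_values holds the d values of any (*LIMIT_...=d) items for the
   same limit. The result is the limit that takes effect. */
uint32_t effective_limit(uint32_t caller_limit,
                         const uint32_t *pattern_values, size_t count)
{
    assert(count == 0 || pattern_values != NULL);
    uint32_t limit = caller_limit;
    for (size_t i = 0; i < count; i++) {
        if (pattern_values[i] < limit) limit = pattern_values[i];
    }
    return limit;
}

/* The heap limit is given in kibibytes (units of 1024 bytes). */
uint64_t heap_limit_bytes(uint32_t kibibytes)
{
    return (uint64_t)kibibytes * 1024u;
}
```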
+
+ Prior to release 10.30, LIMIT_DEPTH was called LIMIT_RECURSION. This
+ name is still recognized for backwards compatibility.
+
+ The heap limit applies only when the pcre2_match() or pcre2_dfa_match()
+ interpreters are used for matching. It does not apply to JIT. The match
+ limit is used (but in a different way) when JIT is being used, or when
+ pcre2_dfa_match() is called, to limit computing resource usage by those
+ matching functions. The depth limit is ignored by JIT but is relevant
+ for DFA matching, which uses function recursion for recursions within
+ the pattern and for lookaround assertions and atomic groups. In this
+ case, the depth limit controls the depth of such recursion.
+
+ Newline conventions
+
+ PCRE2 supports six different conventions for indicating line breaks in
+ strings: a single CR (carriage return) character, a single LF (line-
+ feed) character, the two-character sequence CRLF, any of the three pre-
+ ceding, any Unicode newline sequence, or the NUL character (binary
+ zero). The pcre2api page has further discussion about newlines, and
+ shows how to set the newline convention when calling pcre2_compile().
+
+ It is also possible to specify a newline convention by starting a pat-
+ tern string with one of the following sequences:
+
+ (*CR) carriage return
+ (*LF) linefeed
+ (*CRLF) carriage return, followed by linefeed
+ (*ANYCRLF) any of the three above
+ (*ANY) all Unicode newline sequences
+ (*NUL) the NUL character (binary zero)
+
+ These override the default and the options given to the compiling func-
+ tion. For example, on a Unix system where LF is the default newline
+ sequence, the pattern
+
+ (*CR)a.b
+
+ changes the convention to CR. That pattern matches "a\nb" because LF is
+ no longer a newline. If more than one of these settings is present, the
+ last one is used.
+
+ The newline convention affects where the circumflex and dollar asser-
+ tions are true. It also affects the interpretation of the dot metachar-
+ acter when PCRE2_DOTALL is not set, and the behaviour of \N when not
+ followed by an opening brace. However, it does not affect what the \R
+ escape sequence matches. By default, this is any Unicode newline
+ sequence, for Perl compatibility. However, this can be changed; see the
+ next section and the description of \R in the section entitled "Newline
+ sequences" below. A change of \R setting can be combined with a change
+ of newline convention.
+
+ Specifying what \R matches
+
+ It is possible to restrict \R to match only CR, LF, or CRLF (instead of
+ the complete set of Unicode line endings) by setting the option
+ PCRE2_BSR_ANYCRLF at compile time. This effect can also be achieved by
+ starting a pattern with (*BSR_ANYCRLF). For completeness, (*BSR_UNI-
+ CODE) is also recognized, corresponding to PCRE2_BSR_UNICODE.
+
+
+EBCDIC CHARACTER CODES
+
+ PCRE2 can be compiled to run in an environment that uses EBCDIC as its
+ character code instead of ASCII or Unicode (typically a mainframe sys-
+ tem). In the sections below, character code values are ASCII or Uni-
+ code; in an EBCDIC environment these characters may have different code
+ values, and there are no code points greater than 255.
+
+
+CHARACTERS AND METACHARACTERS
+
+ A regular expression is a pattern that is matched against a subject
+ string from left to right. Most characters stand for themselves in a
+ pattern, and match the corresponding characters in the subject. As a
+ trivial example, the pattern
+
+ The quick brown fox
+
+ matches a portion of a subject string that is identical to itself. When
+ caseless matching is specified (the PCRE2_CASELESS option), letters are
+ matched independently of case.
+
+ The power of regular expressions comes from the ability to include
+ alternatives and repetitions in the pattern. These are encoded in the
+ pattern by the use of metacharacters, which do not stand for themselves
+ but instead are interpreted in some special way.
+
+ There are two different sets of metacharacters: those that are recog-
+ nized anywhere in the pattern except within square brackets, and those
+ that are recognized within square brackets. Outside square brackets,
+ the metacharacters are as follows:
+
+ \      general escape character with several uses
+ ^      assert start of string (or line, in multiline mode)
+ $      assert end of string (or line, in multiline mode)
+ .      match any character except newline (by default)
+ [      start character class definition
+ |      start of alternative branch
+ (      start subpattern
+ )      end subpattern
+ ?      extends the meaning of (
+        also 0 or 1 quantifier
+        also quantifier minimizer
+ *      0 or more quantifier
+ +      1 or more quantifier
+        also "possessive quantifier"
+ {      start min/max quantifier
+
+ Part of a pattern that is in square brackets is called a "character
+ class". In a character class the only metacharacters are:
+
+ \      general escape character
+ ^      negate the class, but only if the first character
+ -      indicates character range
+ [      POSIX character class (only if followed by POSIX
+        syntax)
+ ]      terminates the character class
+
+ The following sections describe the use of each of the metacharacters.
+
+
+BACKSLASH
+
+ The backslash character has several uses. Firstly, if it is followed by
+ a character that is not a digit or a letter, it takes away any special
+ meaning that character may have. This use of backslash as an escape
+ character applies both inside and outside character classes.
+
+ For example, if you want to match a * character, you must write \* in
+ the pattern. This escaping action applies whether or not the following
+ character would otherwise be interpreted as a metacharacter, so it is
+ always safe to precede a non-alphanumeric with backslash to specify
+ that it stands for itself. In particular, if you want to match a back-
+ slash, you write \\.
+
+ In a UTF mode, only ASCII digits and letters have any special meaning
+ after a backslash. All other characters (in particular, those whose
+ code points are greater than 127) are treated as literals.
+
+ If a pattern is compiled with the PCRE2_EXTENDED option, most white
+ space in the pattern (other than in a character class), and characters
+ between a # outside a character class and the next newline, inclusive,
+ are ignored. An escaping backslash can be used to include a white space
+ or # character as part of the pattern.
+
+ If you want to remove the special meaning from a sequence of charac-
+ ters, you can do so by putting them between \Q and \E. This is differ-
+ ent from Perl in that $ and @ are handled as literals in \Q...\E
+ sequences in PCRE2, whereas in Perl, $ and @ cause variable interpola-
+ tion. Also, Perl does "double-quotish backslash interpolation" on any
+ backslashes between \Q and \E which, its documentation says, "may lead
+ to confusing results". PCRE2 treats a backslash between \Q and \E just
+ like any other character. Note the following examples:
+
+ Pattern            PCRE2 matches   Perl matches
+
+ \Qabc$xyz\E        abc$xyz         abc followed by the
+                                    contents of $xyz
+ \Qabc\$xyz\E       abc\$xyz        abc\$xyz
+ \Qabc\E\$\Qxyz\E   abc$xyz         abc$xyz
+ \QA\B\E            A\B             A\B
+ \Q\\E              \               \\E
+
+ The \Q...\E sequence is recognized both inside and outside character
+ classes. An isolated \E that is not preceded by \Q is ignored. If \Q
+ is not followed by \E later in the pattern, the literal interpretation
+ continues to the end of the pattern (that is, \E is assumed at the
+ end). If the isolated \Q is inside a character class, this causes an
+ error, because the character class is not terminated by a closing
+ square bracket.
+
+ Non-printing characters
+
+ A second use of backslash provides a way of encoding non-printing char-
+ acters in patterns in a visible manner. There is no restriction on the
+ appearance of non-printing characters in a pattern, but when a pattern
+ is being prepared by text editing, it is often easier to use one of the
+ following escape sequences than the binary character it represents. In
+ an ASCII or Unicode environment, these escapes are as follows:
+
+ \a alarm, that is, the BEL character (hex 07)
+ \cx "control-x", where x is any printable ASCII character
+ \e escape (hex 1B)
+ \f form feed (hex 0C)
+ \n linefeed (hex 0A)
+ \r carriage return (hex 0D)
+ \t tab (hex 09)
+ \0dd character with octal code 0dd
+ \ddd character with octal code ddd, or backreference
+ \o{ddd..} character with octal code ddd..
+ \xhh character with hex code hh
+ \x{hhh..} character with hex code hhh..
+ \N{U+hhh..} character with Unicode hex code point hhh..
+ \uhhhh character with hex code hhhh (when PCRE2_ALT_BSUX is set)
+
+ The \N{U+hhh..} escape sequence is recognized only when the PCRE2_UTF
+ option is set, that is, when PCRE2 is operating in a Unicode mode. Perl
+ also uses \N{name} to specify characters by Unicode name; PCRE2 does
+ not support this. Note that when \N is not followed by an opening
+ brace (curly bracket) it has an entirely different meaning, matching
+ any character that is not a newline.
+
+ The precise effect of \cx on ASCII characters is as follows: if x is a
+ lower case letter, it is converted to upper case. Then bit 6 of the
+ character (hex 40) is inverted. Thus \cA to \cZ become hex 01 to hex 1A
+ (A is 41, Z is 5A), but \c{ becomes hex 3B ({ is 7B), and \c; becomes
+ hex 7B (; is 3B). If the code unit following \c has a value less than
+ 32 or greater than 126, a compile-time error occurs.
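+
+ As a small illustration of the rule just described, the following C
+ function computes the value that \cx generates. It is a sketch for an
+ ASCII environment only; the function name is hypothetical, and -1
+ stands in for the compile-time error.
+

```c
/* Returns the character code generated by \cx in an ASCII environment,
   or -1 where a compile-time error would occur. Assumes the execution
   character set is ASCII. */
int control_escape_ascii(int x)
{
    if (x < 32 || x > 126) return -1;
    if (x >= 'a' && x <= 'z') x -= 'a' - 'A';  /* lower case to upper */
    return x ^ 0x40;                           /* invert bit 6 */
}
```

+
+ With this function, 'A' gives hex 01, 'z' gives hex 1A, '{' gives hex
+ 3B, and ';' gives hex 7B, as in the examples above.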
+
+ When PCRE2 is compiled in EBCDIC mode, \N{U+hhh..} is not supported.
+ \a, \e, \f, \n, \r, and \t generate the appropriate EBCDIC code values.
+ The \c escape is processed as specified for Perl in the perlebcdic doc-
+ ument. The only characters that are allowed after \c are A-Z, a-z, or
+ one of @, [, \, ], ^, _, or ?. Any other character provokes a compile-
+ time error. The sequence \c@ encodes character code 0; after \c the
+ letters (in either case) encode characters 1-26 (hex 01 to hex 1A); [,
+ \, ], ^, and _ encode characters 27-31 (hex 1B to hex 1F), and \c?
+ becomes either 255 (hex FF) or 95 (hex 5F).
+
+ Thus, apart from \c?, these escapes generate the same character code
+ values as they do in an ASCII environment, though the meanings of the
+ values mostly differ. For example, \cG always generates code value 7,
+ which is BEL in ASCII but DEL in EBCDIC.
+
+ The sequence \c? generates DEL (127, hex 7F) in an ASCII environment,
+ but because 127 is not a control character in EBCDIC, Perl makes it
+ generate the APC character. Unfortunately, there are several variants
+ of EBCDIC. In most of them the APC character has the value 255 (hex
+ FF), but in the one Perl calls POSIX-BC its value is 95 (hex 5F). If
+ certain other characters have POSIX-BC values, PCRE2 makes \c? generate
+ 95; otherwise it generates 255.
+
+ After \0 up to two further octal digits are read. If there are fewer
+ than two digits, just those that are present are used. Thus the
+ sequence \0\x\015 specifies two binary zeros followed by a CR character
+ (code value 13). Make sure you supply two digits after the initial zero
+ if the pattern character that follows is itself an octal digit.
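+
+ The reading of digits after \0 can be sketched like this. The function
+ is hypothetical and covers only the rule stated above: at most two
+ further octal digits are consumed.
+

```c
#include <assert.h>
#include <stddef.h>

/* after_zero points at the text that follows "\0" in the pattern.
   Returns the character code and stores the number of digits used. */
unsigned octal_after_zero(const char *after_zero, size_t *consumed)
{
    assert(after_zero != NULL && consumed != NULL);
    unsigned value = 0;
    size_t n = 0;
    while (n < 2 && after_zero[n] >= '0' && after_zero[n] <= '7') {
        value = value * 8 + (unsigned)(after_zero[n] - '0');
        n++;
    }
    *consumed = n;
    return value;
}
```

+
+ Given "15" it returns 13 (CR) and consumes two digits; given "113" it
+ returns 9 (tab) and leaves the "3" to stand for itself.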
+
+ The escape \o must be followed by a sequence of octal digits, enclosed
+ in braces. An error occurs if this is not the case. This escape is a
+ recent addition to Perl; it provides a way of specifying character code
+ points as octal numbers greater than 0777, and it also allows octal
+ numbers and backreferences to be unambiguously specified.
+
+ For greater clarity and unambiguity, it is best to avoid following \ by
+ a digit greater than zero. Instead, use \o{} or \x{} to specify numeri-
+ cal character code points, and \g{} to specify backreferences. The fol-
+ lowing paragraphs describe the old, ambiguous syntax.
+
+ The handling of a backslash followed by a digit other than 0 is compli-
+ cated, and Perl has changed over time, causing PCRE2 also to change.
+
+ Outside a character class, PCRE2 reads the digit and any following dig-
+ its as a decimal number. If the number is less than 10, begins with the
+ digit 8 or 9, or if there are at least that many previous capturing
+ left parentheses in the expression, the entire sequence is taken as a
+ backreference. A description of how this works is given later, follow-
+ ing the discussion of parenthesized subpatterns. Otherwise, up to
+ three octal digits are read to form a character code.
+
+ Inside a character class, PCRE2 handles \8 and \9 as the literal char-
+ acters "8" and "9", and otherwise reads up to three octal digits fol-
+ lowing the backslash, using them to generate a data character. Any sub-
+ sequent digits stand for themselves. For example, outside a character
+ class:
+
+ \040   is another way of writing an ASCII space
+ \40    is the same, provided there are fewer than 40
+        previous capturing subpatterns
+ \7     is always a backreference
+ \11    might be a backreference, or another way of
+        writing a tab
+ \011   is always a tab
+ \0113  is a tab followed by the character "3"
+ \113   might be a backreference, otherwise the
+        character with octal code 113
+ \377   might be a backreference, otherwise
+        the value 255 (decimal)
+ \81    is always a backreference
+
+ Note that octal values of 100 or greater that are specified using this
+ syntax must not be introduced by a leading zero, because no more than
+ three octal digits are ever read.
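+
+ The decision procedure for a backslash followed by a non-zero digit,
+ outside a character class, can be modelled as below. This is a sketch
+ of the rules as written in this section, with hypothetical names and
+ no overflow checking; it is not taken from the PCRE2 source.
+

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    int is_backref;    /* 1: backreference, 0: character code */
    unsigned value;    /* group number or character code */
    size_t consumed;   /* digits used after the backslash */
} digit_escape;

/* digits points just after the backslash; its first character must be
   a digit from 1 to 9. group_count is the number of previous capturing
   left parentheses. */
digit_escape classify_digit_escape(const char *digits, unsigned group_count)
{
    assert(digits != NULL && digits[0] >= '1' && digits[0] <= '9');
    digit_escape r = {0, 0, 0};
    unsigned number = 0;
    size_t n = 0;

    /* Read the digit and any following digits as a decimal number. */
    while (digits[n] >= '0' && digits[n] <= '9') {
        number = number * 10 + (unsigned)(digits[n] - '0');
        n++;
    }
    if (number < 10 || digits[0] == '8' || digits[0] == '9' ||
        number <= group_count) {
        r.is_backref = 1;
        r.value = number;
        r.consumed = n;
        return r;
    }
    /* Otherwise read up to three octal digits as a character code. */
    while (r.consumed < 3 &&
           digits[r.consumed] >= '0' && digits[r.consumed] <= '7') {
        r.value = r.value * 8 + (unsigned)(digits[r.consumed] - '0');
        r.consumed++;
    }
    return r;
}
```

+
+ Any digits beyond those consumed stand for themselves, so "1134" with
+ no capturing groups yields octal 113 followed by a literal "4".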
+
+ By default, after \x that is not followed by {, from zero to two hexa-
+ decimal digits are read (letters can be in upper or lower case). Any
+ number of hexadecimal digits may appear between \x{ and }. If a charac-
+ ter other than a hexadecimal digit appears between \x{ and }, or if
+ there is no terminating }, an error occurs.
+
+ If the PCRE2_ALT_BSUX option is set, the interpretation of \x is as
+ just described only when it is followed by two hexadecimal digits. Oth-
+ erwise, it matches a literal "x" character. In this mode, support for
+ code points greater than 255 is provided by \u, which must be followed
+ by four hexadecimal digits; otherwise it matches a literal "u" charac-
+ ter.
+
+ Characters whose value is less than 256 can be defined by either of the
+ two syntaxes for \x (or by \u in PCRE2_ALT_BSUX mode). There is no dif-
+ ference in the way they are handled. For example, \xdc is exactly the
+ same as \x{dc} (or \u00dc in PCRE2_ALT_BSUX mode).
+
+ Constraints on character values
+
+ Characters that are specified using octal or hexadecimal numbers are
+ limited to certain values, as follows:
+
+ 8-bit non-UTF mode no greater than 0xff
+ 16-bit non-UTF mode no greater than 0xffff
+ 32-bit non-UTF mode no greater than 0xffffffff
+ All UTF modes no greater than 0x10ffff and a valid code point
+
+ Invalid Unicode code points are all those in the range 0xd800 to 0xdfff
+ (the so-called "surrogate" code points). The check for these can be
+ disabled by the caller of pcre2_compile() by setting the option
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES. However, this is possible only in
+ UTF-8 and UTF-32 modes, because these values are not representable in
+ UTF-16.
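+
+ These constraints can be summarized in a short C predicate. It is a
+ sketch with hypothetical names: allow_surrogates stands for the effect
+ of PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES, and the function is not part
+ of the PCRE2 API.
+

```c
#include <assert.h>
#include <stdint.h>

/* width is the code unit width (8, 16 or 32); utf is non-zero in a UTF
   mode. Returns 1 if a character specified by an octal or hexadecimal
   number may have this value, 0 otherwise. */
int escape_value_allowed(uint32_t value, int width, int utf,
                         int allow_surrogates)
{
    assert(width == 8 || width == 16 || width == 32);
    if (!utf) {
        if (width == 8)  return value <= 0xff;
        if (width == 16) return value <= 0xffff;
        return 1;                 /* 32-bit: anything up to 0xffffffff */
    }
    if (value > 0x10ffff) return 0;
    if (value >= 0xd800 && value <= 0xdfff) {
        /* Surrogates cannot be represented in UTF-16 at all. */
        return allow_surrogates && width != 16;
    }
    return 1;
}
```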
+
+ Escape sequences in character classes
+
+ All the sequences that define a single character value can be used both
+ inside and outside character classes. In addition, inside a character
+ class, \b is interpreted as the backspace character (hex 08).
+
+ When not followed by an opening brace, \N is not allowed in a character
+ class. \B, \R, and \X are not special inside a character class. Like
+ other unrecognized alphabetic escape sequences, they cause an error.
+ Outside a character class, these sequences have different meanings.
+
+ Unsupported escape sequences
+
+ In Perl, the sequences \F, \l, \L, \u, and \U are recognized by its
+ string handler and used to modify the case of following characters. By
+ default, PCRE2 does not support these escape sequences. However, if the
+ PCRE2_ALT_BSUX option is set, \U matches a "U" character, and \u can be
+ used to define a character by code point, as described above.
+
+ Absolute and relative backreferences
+
+ The sequence \g followed by a signed or unsigned number, optionally
+ enclosed in braces, is an absolute or relative backreference. A named
+ backreference can be coded as \g{name}. Backreferences are discussed
+ later, following the discussion of parenthesized subpatterns.
+
+ Absolute and relative subroutine calls
+
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ name or a number enclosed either in angle brackets or single quotes, is
+ an alternative syntax for referencing a subpattern as a "subroutine".
+ Details are discussed later. Note that \g{...} (Perl syntax) and
+ \g<...> (Oniguruma syntax) are not synonymous. The former is a backref-
+ erence; the latter is a subroutine call.
+
+ Generic character types
+
+ Another use of backslash is for specifying generic character types:
+
+ \d any decimal digit
+ \D any character that is not a decimal digit
+ \h any horizontal white space character
+ \H any character that is not a horizontal white space character
+ \N any character that is not a newline
+ \s any white space character
+ \S any character that is not a white space character
+ \v any vertical white space character
+ \V any character that is not a vertical white space character
+ \w any "word" character
+ \W any "non-word" character
+
+ The \N escape sequence has the same meaning as the "." metacharacter
+ when PCRE2_DOTALL is not set, but setting PCRE2_DOTALL does not change
+ the meaning of \N. Note that when \N is followed by an opening brace it
+ has a different meaning. See the section entitled "Non-printing charac-
+ ters" above for details. Perl also uses \N{name} to specify characters
+ by Unicode name; PCRE2 does not support this.
+
+ Each pair of lower and upper case escape sequences partitions the com-
+ plete set of characters into two disjoint sets. Any given character
+ matches one, and only one, of each pair. The sequences can appear both
+ inside and outside character classes. They each match one character of
+ the appropriate type. If the current matching point is at the end of
+ the subject string, all of them fail, because there is no character to
+ match.
+
+ The default \s characters are HT (9), LF (10), VT (11), FF (12), CR
+ (13), and space (32), which are defined as white space in the "C"
+ locale. This list may vary if locale-specific matching is taking place.
+ For example, in some locales the "non-breaking space" character (\xA0)
+ is recognized as white space, and in others the VT character is not.
+
+ A "word" character is an underscore or any character that is a letter
+ or digit. By default, the definition of letters and digits is con-
+ trolled by PCRE2's low-valued character tables, and may vary if locale-
+ specific matching is taking place (see "Locale support" in the pcre2api
+ page). For example, in a French locale such as "fr_FR" in Unix-like
+ systems, or "french" in Windows, some character codes greater than 127
+ are used for accented letters, and these are then matched by \w. The
+ use of locales with Unicode is discouraged.
+
+ By default, characters whose code points are greater than 127 never
+ match \d, \s, or \w, and always match \D, \S, and \W, although this may
+ be different for characters in the range 128-255 when locale-specific
+ matching is happening. These escape sequences retain their original
+ meanings from before Unicode support was available, mainly for effi-
+ ciency reasons. If the PCRE2_UCP option is set, the behaviour is
+ changed so that Unicode properties are used to determine character
+ types, as follows:
+
+ \d any character that matches \p{Nd} (decimal digit)
+ \s any character that matches \p{Z} or \h or \v
+ \w any character that matches \p{L} or \p{N}, plus underscore
+
+ The upper case escapes match the inverse sets of characters. Note that
+ \d matches only decimal digits, whereas \w matches any Unicode digit,
+ as well as any Unicode letter, and underscore. Note also that PCRE2_UCP
+ affects \b and \B, because they are defined in terms of \w and \W.
+ Matching these sequences is noticeably slower when PCRE2_UCP is set.
+
+ The sequences \h, \H, \v, and \V, in contrast to the other sequences,
+ which match only ASCII characters by default, always match a specific
+ list of code points, whether or not PCRE2_UCP is set. The horizontal
+ space characters are:
+
+ U+0009 Horizontal tab (HT)
+ U+0020 Space
+ U+00A0 Non-break space
+ U+1680 Ogham space mark
+ U+180E Mongolian vowel separator
+ U+2000 En quad
+ U+2001 Em quad
+ U+2002 En space
+ U+2003 Em space
+ U+2004 Three-per-em space
+ U+2005 Four-per-em space
+ U+2006 Six-per-em space
+ U+2007 Figure space
+ U+2008 Punctuation space
+ U+2009 Thin space
+ U+200A Hair space
+ U+202F Narrow no-break space
+ U+205F Medium mathematical space
+ U+3000 Ideographic space
+
+ The vertical space characters are:
+
+ U+000A Linefeed (LF)
+ U+000B Vertical tab (VT)
+ U+000C Form feed (FF)
+ U+000D Carriage return (CR)
+ U+0085 Next line (NEL)
+ U+2028 Line separator
+ U+2029 Paragraph separator
+
+ In 8-bit, non-UTF-8 mode, only the characters with code points less
+ than 256 are relevant.
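+
+ As an illustration only, the two lists above can be written out as
+ lookup sets. The sketch below is Python rather than PCRE2's own C
+ tables, and the names HSPACE, VSPACE, is_hspace, and is_vspace are
+ hypothetical; they are not part of any PCRE2 API.
+

```python
# Code points matched by \h and \v, copied from the lists above.
HSPACE = frozenset(
    [0x0009, 0x0020, 0x00A0, 0x1680, 0x180E]
    + list(range(0x2000, 0x200B))       # U+2000 En quad .. U+200A Hair space
    + [0x202F, 0x205F, 0x3000]
)
VSPACE = frozenset([0x000A, 0x000B, 0x000C, 0x000D, 0x0085, 0x2028, 0x2029])


def is_hspace(cp, eight_bit_non_utf=False):
    # Model of \h for a single code point.
    if eight_bit_non_utf and cp > 0xFF:
        return False                    # such values cannot occur in this mode
    return cp in HSPACE


def is_vspace(cp, eight_bit_non_utf=False):
    # Model of \v for a single code point.
    if eight_bit_non_utf and cp > 0xFF:
        return False
    return cp in VSPACE
```

+
+ Because the sets are fixed, the result does not depend on PCRE2_UCP,
+ which is the point made in the text above.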
+
+ Newline sequences
+
+ Outside a character class, by default, the escape sequence \R matches
+ any Unicode newline sequence. In 8-bit non-UTF-8 mode \R is equivalent
+ to the following:
+
+ (?>\r\n|\n|\x0b|\f|\r|\x85)
+
+ This is an example of an "atomic group", details of which are given
+ below. This particular group matches either the two-character sequence
+ CR followed by LF, or one of the single characters LF (linefeed,
+ U+000A), VT (vertical tab, U+000B), FF (form feed, U+000C), CR (car-
+ riage return, U+000D), or NEL (next line, U+0085). Because this is an
+ atomic group, the two-character sequence is treated as a single unit
+ that cannot be split.
+
+ In other modes, two additional characters whose code points are greater
+ than 255 are added: LS (line separator, U+2028) and PS (paragraph sepa-
+ rator, U+2029). Unicode support is not needed for these characters to
+ be recognized.
+
+ It is possible to restrict \R to match only CR, LF, or CRLF (instead of
+ the complete set of Unicode line endings) by setting the option
+ PCRE2_BSR_ANYCRLF at compile time. (BSR is an abbreviation for "back-
+ slash R".) This can be made the default when PCRE2 is built; if this is
+ the case, the other behaviour can be requested via the PCRE2_BSR_UNI-
+ CODE option. It is also possible to specify these settings by starting
+ a pattern string with one of the following sequences:
+
+ (*BSR_ANYCRLF) CR, LF, or CRLF only
+ (*BSR_UNICODE) any Unicode newline sequence
+
+ These override the default and the options given to the compiling func-
+ tion. Note that these special settings, which are not Perl-compatible,
+ are recognized only at the very start of a pattern, and that they must
+ be in upper case. If more than one of them is present, the last one is
+ used. They can be combined with a change of newline convention; for
+ example, a pattern can start with:
+
+ (*ANY)(*BSR_ANYCRLF)
+
+ They can also be combined with the (*UTF) or (*UCP) special sequences.
+ Inside a character class, \R is treated as an unrecognized escape
+ sequence, and causes an error.
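+
+ The behaviour of \R described in this section can be modelled roughly
+ as follows. This is a Python sketch under stated assumptions, not
+ PCRE2's implementation: the function name match_bsr and its two flags
+ are hypothetical stand-ins for the PCRE2_BSR_ANYCRLF setting and for
+ 8-bit non-UTF-8 mode.
+

```python
CRLF_ONLY = ("\r", "\n")                      # single characters under BSR_ANYCRLF
BASE = ("\n", "\x0b", "\x0c", "\r", "\x85")   # singles in 8-bit non-UTF-8 mode
WIDE = ("\u2028", "\u2029")                   # LS and PS, added in other modes


def match_bsr(subject, pos, anycrlf=False, eight_bit_non_utf=False):
    # Return how many characters \R would consume at pos (0 means no match).
    if subject.startswith("\r\n", pos):
        return 2                              # CRLF is taken as one unit
    if pos >= len(subject):
        return 0
    if anycrlf:
        singles = CRLF_ONLY
    elif eight_bit_non_utf:
        singles = BASE
    else:
        singles = BASE + WIDE
    return 1 if subject[pos] in singles else 0
```

+
+ Returning a single length for CRLF mirrors the atomic group: the model
+ never offers a one-character match when CR is followed by LF.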
+
+ Unicode character properties
+
+ When PCRE2 is built with Unicode support (the default), three addi-
+ tional escape sequences that match characters with specific properties
+ are available. In 8-bit non-UTF-8 mode, these sequences are of course
+ limited to testing characters whose code points are less than 256, but
+ they do work in this mode. In 32-bit non-UTF mode, code points greater
+ than 0x10ffff (the Unicode limit) may be encountered. These are all
+ treated as being in the Common script and with an unassigned type. The
+ extra escape sequences are:
+
+ \p{xx} a character with the xx property
+ \P{xx} a character without the xx property
+ \X a Unicode extended grapheme cluster
+
+ The property names represented by xx above are limited to the Unicode
+ script names, the general category properties, "Any", which matches any
+ character (including newline), and some special PCRE2 properties
+ (described in the next section). Other Perl properties such as "InMu-
+ sicalSymbols" are not supported by PCRE2. Note that \P{Any} does not
+ match any characters, so always causes a match failure.
+
+ Sets of Unicode characters are defined as belonging to certain scripts.
+ A character from one of these sets can be matched using a script name.
+ For example:
+
+ \p{Greek}
+ \P{Han}
+
+ Those that are not part of an identified script are lumped together as
+ "Common". The current list of scripts is:
+
+ Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
+ nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
+ Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
+ nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
+ Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
+ Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
+ Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
+ Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
+ Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
+ nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
+ Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
+ jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
+ Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
+ Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
+ ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
+ dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
+ Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
+ Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
+ vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
+ Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
+ Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
+ nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
+
+ Each character has exactly one Unicode general category property, spec-
+ ified by a two-letter abbreviation. For compatibility with Perl, nega-
+ tion can be specified by including a circumflex between the opening
+ brace and the property name. For example, \p{^Lu} is the same as
+ \P{Lu}.
+
+ If only one letter is specified with \p or \P, it includes all the gen-
+ eral category properties that start with that letter. In this case, in
+ the absence of negation, the curly brackets in the escape sequence are
+ optional; these two examples have the same effect:
+
+ \p{L}
+ \pL
+
+ The following general category property codes are supported:
+
+ C Other
+ Cc Control
+ Cf Format
+ Cn Unassigned
+ Co Private use
+ Cs Surrogate
+
+ L Letter
+ Ll Lower case letter
+ Lm Modifier letter
+ Lo Other letter
+ Lt Title case letter
+ Lu Upper case letter
+
+ M Mark
+ Mc Spacing mark
+ Me Enclosing mark
+ Mn Non-spacing mark
+
+ N Number
+ Nd Decimal number
+ Nl Letter number
+ No Other number
+
+ P Punctuation
+ Pc Connector punctuation
+ Pd Dash punctuation
+ Pe Close punctuation
+ Pf Final punctuation
+ Pi Initial punctuation
+ Po Other punctuation
+ Ps Open punctuation
+
+ S Symbol
+ Sc Currency symbol
+ Sk Modifier symbol
+ Sm Mathematical symbol
+ So Other symbol
+
+ Z Separator
+ Zl Line separator
+ Zp Paragraph separator
+ Zs Space separator
+
+ The special property L& is also supported: it matches a character that
+ has the Lu, Ll, or Lt property, in other words, a letter that is not
+ classified as a modifier or "other".
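+
+ The rules for one-letter codes, circumflex negation, and L& can be
+ sketched with Python's unicodedata module. The helper has_property is
+ hypothetical, and Python's Unicode tables may be a different version
+ from those built into PCRE2, so treat this as an approximation of the
+ lookup rather than a reproduction of it.
+

```python
import unicodedata


def has_property(ch, prop):
    # Model of \p{prop} for general category codes only.
    if prop.startswith("^"):
        return not has_property(ch, prop[1:])   # \p{^Lu} is the same as \P{Lu}
    category = unicodedata.category(ch)         # a two-letter code such as "Lu"
    if prop == "L&":
        return category in ("Lu", "Ll", "Lt")
    if len(prop) == 1:
        return category.startswith(prop)        # one letter covers the whole group
    return category == prop
```
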
+
+ The Cs (Surrogate) property applies only to characters in the range
+ U+D800 to U+DFFF. Such characters are not valid in Unicode strings and
+ so cannot be tested by PCRE2, unless UTF validity checking has been
+ turned off (see the discussion of PCRE2_NO_UTF_CHECK in the pcre2api
+ page). Perl does not support the Cs property.
+
+ The long synonyms for property names that Perl supports (such as
+ \p{Letter}) are not supported by PCRE2, nor is it permitted to prefix
+ any of these properties with "Is".
+
+ No character that is in the Unicode table has the Cn (unassigned) prop-
+ erty. Instead, this property is assumed for any code point that is not
+ in the Unicode table.
+
+ Specifying caseless matching does not affect these escape sequences.
+ For example, \p{Lu} always matches only upper case letters. This is
+ different from the behaviour of current versions of Perl.
+
+ Matching characters by Unicode property is not fast, because PCRE2 has
+ to do a multistage table lookup in order to find a character's prop-
+ erty. That is why the traditional escape sequences such as \d and \w do
+ not use Unicode properties in PCRE2 by default, though you can make
+ them do so by setting the PCRE2_UCP option or by starting the pattern
+ with (*UCP).
+
+ Extended grapheme clusters
+
+ The \X escape matches any number of Unicode characters that form an
+ "extended grapheme cluster", and treats the sequence as an atomic group
+ (see below). Unicode supports various kinds of composite character by
+ giving each character a grapheme breaking property, and having rules
+ that use these properties to define the boundaries of extended grapheme
+ clusters. The rules are defined in Unicode Standard Annex 29, "Unicode
+ Text Segmentation". Unicode 11.0.0 abandoned the use of some previous
+ properties that had been used for emojis. Instead it introduced vari-
+ ous emoji-specific properties. PCRE2 uses only the Extended Picto-
+ graphic property.
+
+ \X always matches at least one character. Then it decides whether to
+ add additional characters according to the following rules for ending a
+ cluster:
+
+ 1. End at the end of the subject string.
+
+ 2. Do not end between CR and LF; otherwise end after any control char-
+ acter.
+
+ 3. Do not break Hangul (a Korean script) syllable sequences. Hangul
+ characters are of five types: L, V, T, LV, and LVT. An L character may
+ be followed by an L, V, LV, or LVT character; an LV or V character may
+ be followed by a V or T character; an LVT or T character may be followed
+ only by a T character.
+
+ 4. Do not end before extending characters or spacing marks or the
+ "zero-width joiner" character. Characters with the "mark" property
+ always have the "extend" grapheme breaking property.
+
+ 5. Do not end after prepend characters.
+
+ 6. Do not break within emoji modifier sequences or emoji zwj sequences.
+ That is, do not break between characters with the Extended_Pictographic
+ property. Extend and ZWJ characters are allowed between the charac-
+ ters.
+
+ 7. Do not break within emoji flag sequences. That is, do not break
+ between regional indicator (RI) characters if there is an odd number
+ of RI characters before the break point.
+
+ 8. Otherwise, end the cluster.
+
+ PCRE2's additional properties
+
+ As well as the standard Unicode properties described above, PCRE2 sup-
+ ports four more that make it possible to convert traditional escape
+ sequences such as \w and \s to use Unicode properties. PCRE2 uses these
+ non-standard, non-Perl properties internally when PCRE2_UCP is set.
+ However, they may also be used explicitly. These properties are:
+
+ Xan Any alphanumeric character
+ Xps Any POSIX space character
+ Xsp Any Perl space character
+ Xwd Any Perl "word" character
+
+ Xan matches characters that have either the L (letter) or the N (num-
+ ber) property. Xps matches the characters tab, linefeed, vertical tab,
+ form feed, or carriage return, and any other character that has the Z
+ (separator) property. Xsp is the same as Xps; in PCRE1 it used to
+ exclude vertical tab, for Perl compatibility, but Perl changed. Xwd
+ matches the same characters as Xan, plus underscore.
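+
+ The four definitions above translate directly into predicates on the
+ general category. The following Python sketch uses hypothetical
+ function names and Python's own Unicode tables, which may not be the
+ same version as PCRE2's.
+

```python
import unicodedata


def is_xan(ch):
    # Xan: any character with the L (letter) or N (number) property.
    return unicodedata.category(ch)[0] in "LN"


def is_xps(ch):
    # Xps: tab, linefeed, vertical tab, form feed, carriage return,
    # or any character with the Z (separator) property.
    return ch in "\t\n\x0b\x0c\r" or unicodedata.category(ch)[0] == "Z"


is_xsp = is_xps          # Xsp is the same as Xps in PCRE2


def is_xwd(ch):
    # Xwd: the Xan characters plus underscore.
    return ch == "_" or is_xan(ch)
```

+
+ Each function expects a single-character string.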
+
+ There is another non-standard property, Xuc, which matches any charac-
+ ter that can be represented by a Universal Character Name in C++ and
+ other programming languages. These are the characters $, @, ` (grave
+ accent), and all characters with Unicode code points greater than or
+ equal to U+00A0, except for the surrogates U+D800 to U+DFFF. Note that
+ most base (ASCII) characters are excluded. (Universal Character Names
+ are of the form \uHHHH or \UHHHHHHHH where H is a hexadecimal digit.
+ Note that the Xuc property does not match these sequences but the char-
+ acters that they represent.)
+
+ Resetting the match start
+
+ In normal use, the escape sequence \K causes any previously matched
+ characters not to be included in the final matched sequence that is
+ returned. For example, the pattern:
+
+ foo\Kbar
+
+ matches "foobar", but reports that it has matched "bar". \K does not
+ interact with anchoring in any way. The pattern:
+
+ ^foo\Kbar
+
+ matches only when the subject begins with "foobar" (in single line
+ mode), though it again reports the matched string as "bar". This fea-
+ ture is similar to a lookbehind assertion (described below). However,
+ in this case, the part of the subject before the real match does not
+ have to be of fixed length, as lookbehind assertions do. The use of \K
+ does not interfere with the setting of captured substrings. For exam-
+ ple, when the pattern
+
+ (foo)\Kbar
+
+ matches "foobar", the first substring is still set to "foo".
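+
+ Python's standard re module does not provide \K, so the sketch below
+ only imitates the reported results of the examples above by capturing
+ the part that follows the \K position. This is a different technique,
+ shown purely for illustration; the variable names are arbitrary.
+

```python
import re

# foo\Kbar is imitated as foo(bar): the reported match is group 1.
m = re.search(r"foo(bar)", "xxfoobar")
reported = m.group(1)
reported_span = m.span(1)

# (foo)\Kbar is imitated as (foo)(bar): the first substring is still "foo".
m2 = re.search(r"(foo)(bar)", "foobar")
first_substring = m2.group(1)
reported2 = m2.group(2)
```
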
+
+ Perl documents that the use of \K within assertions is "not well
+ defined". In PCRE2, \K is acted upon when it occurs inside positive
+ assertions, but is ignored in negative assertions. Note that when a
+ pattern such as (?=ab\K) matches, the reported start of the match can
+ be greater than the end of the match. Using \K in a lookbehind asser-
+ tion at the start of a pattern can also lead to odd effects. For exam-
+ ple, consider this pattern:
+
+ (?<=\Kfoo)bar
+
+ If the subject is "foobar", a call to pcre2_match() with a starting
+ offset of 3 succeeds and reports the matching string as "foobar", that
+ is, the start of the reported match is earlier than where the match
+ started.
+
+ Simple assertions
+
+ The final use of backslash is for certain simple assertions. An asser-
+ tion specifies a condition that has to be met at a particular point in
+ a match, without consuming any characters from the subject string. The
+ use of subpatterns for more complicated assertions is described below.
+ The backslashed assertions are:
+
+ \b matches at a word boundary
+ \B matches when not at a word boundary
+ \A matches at the start of the subject
+ \Z matches at the end of the subject
+ also matches before a newline at the end of the subject
+ \z matches only at the end of the subject
+ \G matches at the first matching position in the subject
+
+ Inside a character class, \b has a different meaning; it matches the
+ backspace character. If any other of these assertions appears in a
+ character class, an "invalid escape sequence" error is generated.
+
+ A word boundary is a position in the subject string where the current
+ character and the previous character do not both match \w or \W (i.e.
+ one matches \w and the other matches \W), or the start or end of the
+ string if the first or last character matches \w, respectively. In a
+ UTF mode, the meanings of \w and \W can be changed by setting the
+ PCRE2_UCP option. When this is done, it also affects \b and \B. Neither
+ PCRE2 nor Perl has a separate "start of word" or "end of word" metase-
+ quence. However, whatever follows \b normally determines which it is.
+ For example, the fragment \ba matches "a" at the start of a word.
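+
+ The word boundary definition can be expressed as a comparison of the
+ characters on either side of the current position. The Python sketch
+ below assumes the default (non-UCP, "C" locale) meaning of \w; the
+ names WORD_CHARS and is_word_boundary are hypothetical.
+

```python
import string

WORD_CHARS = set(string.ascii_letters + string.digits + "_")   # default \w


def is_word_boundary(subject, pos):
    # Model of \b at offset pos, where 0 <= pos <= len(subject).
    before = pos > 0 and subject[pos - 1] in WORD_CHARS
    after = pos < len(subject) and subject[pos] in WORD_CHARS
    # A boundary exists when exactly one side is a word character; the
    # start and end of the subject count as non-word on the outer side.
    return before != after
```

+
+ \B is simply the negation of this test.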
+
+ The \A, \Z, and \z assertions differ from the traditional circumflex
+ and dollar (described in the next section) in that they only ever match
+ at the very start and end of the subject string, whatever options are
+ set. Thus, they are independent of multiline mode. These three asser-
+ tions are not affected by the PCRE2_NOTBOL or PCRE2_NOTEOL options,
+ which affect only the behaviour of the circumflex and dollar metachar-
+ acters. However, if the startoffset argument of pcre2_match() is non-
+ zero, indicating that matching is to start at a point other than the
+ beginning of the subject, \A can never match. The difference between
+ \Z and \z is that \Z matches before a newline at the end of the string
+ as well as at the very end, whereas \z matches only at the end.
+
+ The \G assertion is true only when the current matching position is at
+ the start point of the matching process, as specified by the startoff-
+ set argument of pcre2_match(). It differs from \A when the value of
+ startoffset is non-zero. By calling pcre2_match() multiple times with
+ appropriate arguments, you can mimic Perl's /g option, and it is in
+ this kind of implementation that \G can be useful.
+
+ Note, however, that PCRE2's implementation of \G, being true at the
+ starting character of the matching process, is subtly different from
+ Perl's, which defines it as true at the end of the previous match. In
+ Perl, these can be different when the previously matched string was
+ empty. Because PCRE2 does just one match at a time, it cannot reproduce
+ this behaviour.
+
+ If all the alternatives of a pattern begin with \G, the expression is
+ anchored to the starting match position, and the "anchored" flag is set
+ in the compiled regular expression.
+
+
+CIRCUMFLEX AND DOLLAR
+
+ The circumflex and dollar metacharacters are zero-width assertions.
+ That is, they test for a particular condition being true without con-
+ suming any characters from the subject string. These two metacharacters
+ are concerned with matching the starts and ends of lines. If the new-
+ line convention is set so that only the two-character sequence CRLF is
+ recognized as a newline, isolated CR and LF characters are treated as
+ ordinary data characters, and are not recognized as newlines.
+
+ Outside a character class, in the default matching mode, the circumflex
+ character is an assertion that is true only if the current matching
+ point is at the start of the subject string. If the startoffset argu-
+ ment of pcre2_match() is non-zero, or if PCRE2_NOTBOL is set, circum-
+ flex can never match if the PCRE2_MULTILINE option is unset. Inside a
+ character class, circumflex has an entirely different meaning (see
+ below).
+
+ Circumflex need not be the first character of the pattern if a number
+ of alternatives are involved, but it should be the first thing in each
+ alternative in which it appears if the pattern is ever to match that
+ branch. If all possible alternatives start with a circumflex, that is,
+ if the pattern is constrained to match only at the start of the sub-
+ ject, it is said to be an "anchored" pattern. (There are also other
+ constructs that can cause a pattern to be anchored.)
+
+ The dollar character is an assertion that is true only if the current
+ matching point is at the end of the subject string, or immediately
+ before a newline at the end of the string (by default), unless
+ PCRE2_NOTEOL is set. Note, however, that it does not actually match the
+ newline. Dollar need not be the last character of the pattern if a num-
+ ber of alternatives are involved, but it should be the last item in any
+ branch in which it appears. Dollar has no special meaning in a charac-
+ ter class.
+
+ The meaning of dollar can be changed so that it matches only at the
+ very end of the string, by setting the PCRE2_DOLLAR_ENDONLY option at
+ compile time. This does not affect the \Z assertion.
+
+ The meanings of the circumflex and dollar metacharacters are changed if
+ the PCRE2_MULTILINE option is set. When this is the case, a dollar
+ character matches before any newlines in the string, as well as at the
+ very end, and a circumflex matches immediately after internal newlines
+ as well as at the start of the subject string. It does not match after
+ a newline that ends the string, for compatibility with Perl. However,
+ this can be changed by setting the PCRE2_ALT_CIRCUMFLEX option.
+
+ For example, the pattern /^abc$/ matches the subject string "def\nabc"
+ (where \n represents a newline) in multiline mode, but not otherwise.
+ Consequently, patterns that are anchored in single line mode because
+ all branches start with ^ are not anchored in multiline mode, and a
+ match for circumflex is possible when the startoffset argument of
+ pcre2_match() is non-zero. The PCRE2_DOLLAR_ENDONLY option is ignored
+ if PCRE2_MULTILINE is set.
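+
+ The rules above can be summarised as two position tests. This Python
+ sketch is a simplified model, not PCRE2 code: it assumes that LF is
+ the only newline, a startoffset of zero, and that PCRE2_NOTBOL,
+ PCRE2_NOTEOL, and PCRE2_ALT_CIRCUMFLEX are all unset. The function and
+ parameter names are hypothetical.
+

```python
def circumflex_matches(subject, pos, multiline=False):
    # Model of ^ at offset pos.
    if pos == 0:
        return True
    if not multiline:
        return False
    # After an internal newline, but not after a newline that ends the string.
    return subject[pos - 1] == "\n" and pos < len(subject)


def dollar_matches(subject, pos, multiline=False, dollar_endonly=False):
    # Model of $ at offset pos.
    end = len(subject)
    if pos == end:
        return True
    if multiline:
        return subject[pos] == "\n"      # before any newline; endonly is ignored
    if dollar_endonly:
        return False                     # only the very end qualifies
    return pos == end - 1 and subject[pos] == "\n"
```
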
+
+ When the newline convention (see "Newline conventions" below) recog-
+ nizes the two-character sequence CRLF as a newline, this is preferred,
+ even if the single characters CR and LF are also recognized as new-
+ lines. For example, if the newline convention is "any", a multiline
+ mode circumflex matches before "xyz" in the string "abc\r\nxyz" rather
+ than after CR, even though CR on its own is a valid newline. (It also
+ matches at the very start of the string, of course.)
+
+ Note that the sequences \A, \Z, and \z can be used to match the start
+ and end of the subject in both modes, and if all branches of a pattern
+ start with \A it is always anchored, whether or not PCRE2_MULTILINE is
+ set.
+
+
+FULL STOP (PERIOD, DOT) AND \N
+
+ Outside a character class, a dot in the pattern matches any one charac-
+ ter in the subject string except (by default) a character that signi-
+ fies the end of a line.
+
+ When a line ending is defined as a single character, dot never matches
+ that character; when the two-character sequence CRLF is used, dot does
+ not match CR if it is immediately followed by LF, but otherwise it
+ matches all characters (including isolated CRs and LFs). When any Uni-
+ code line endings are being recognized, dot does not match CR or LF or
+ any of the other line ending characters.
+
+ The behaviour of dot with regard to newlines can be changed. If the
+ PCRE2_DOTALL option is set, a dot matches any one character, without
+ exception. If the two-character sequence CRLF is present in the sub-
+ ject string, it takes two dots to match it.
+
+ The handling of dot is entirely independent of the handling of circum-
+ flex and dollar, the only relationship being that they both involve
+ newlines. Dot has no special meaning in a character class.
+
+ The escape sequence \N when not followed by an opening brace behaves
+ like a dot, except that it is not affected by the PCRE2_DOTALL option.
+ In other words, it matches any character except one that signifies the
+ end of a line.
+
+ When \N is followed by an opening brace it has a different meaning. See
+ the section entitled "Non-printing characters" above for details. Perl
+ also uses \N{name} to specify characters by Unicode name; PCRE2 does
+ not support this.
+
+
+MATCHING A SINGLE CODE UNIT
+
+ Outside a character class, the escape sequence \C matches any one code
+ unit, whether or not a UTF mode is set. In the 8-bit library, one code
+ unit is one byte; in the 16-bit library it is a 16-bit unit; in the
+ 32-bit library it is a 32-bit unit. Unlike a dot, \C always matches
+ line-ending characters. The feature is provided in Perl in order to
+ match individual bytes in UTF-8 mode, but it is unclear how it can use-
+ fully be used.
+
+ Because \C breaks up characters into individual code units, matching
+ one unit with \C in UTF-8 or UTF-16 mode means that the rest of the
+ string may start with a malformed UTF character. This has undefined
+ results, because PCRE2 assumes that it is matching character by charac-
+ ter in a valid UTF string (by default it checks the subject string's
+ validity at the start of processing unless the PCRE2_NO_UTF_CHECK
+ option is used).
+
+ An application can lock out the use of \C by setting the
+ PCRE2_NEVER_BACKSLASH_C option when compiling a pattern. It is also
+ possible to build PCRE2 with the use of \C permanently disabled.
+
+ PCRE2 does not allow \C to appear in lookbehind assertions (described
+ below) in UTF-8 or UTF-16 modes, because this would make it impossible
+ to calculate the length of the lookbehind. Neither the alternative
+ matching function pcre2_dfa_match() nor the JIT optimizer support \C in
+ these UTF modes. The former gives a match-time error; the latter fails
+ to optimize and so the match is always run using the interpreter.
+
+ In the 32-bit library, however, \C is always supported (when not
+ explicitly locked out) because it always matches a single code unit,
+ whether or not UTF-32 is specified.
+
+ In general, the \C escape sequence is best avoided. However, one way of
+ using it that avoids the problem of malformed UTF-8 or UTF-16 charac-
+ ters is to use a lookahead to check the length of the next character,
+ as in this pattern, which could be used with a UTF-8 string (ignore
+ white space and line breaks):
+
+ (?| (?=[\x00-\x7f])(\C) |
+ (?=[\x80-\x{7ff}])(\C)(\C) |
+ (?=[\x{800}-\x{ffff}])(\C)(\C)(\C) |
+ (?=[\x{10000}-\x{1fffff}])(\C)(\C)(\C)(\C))
+
+ In this example, a group that starts with (?| resets the capturing
+ parentheses numbers in each alternative (see "Duplicate Subpattern Num-
+ bers" below). The assertions at the start of each branch check the next
+ UTF-8 character for values whose encoding uses 1, 2, 3, or 4 bytes,
+ respectively. The character's individual bytes are then captured by the
+ appropriate number of \C groups.
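+
+ The code point ranges tested by the four lookaheads correspond to the
+ UTF-8 encoding lengths. The Python sketch below restates that mapping
+ and shows the bytes that the \C groups would capture; utf8_length and
+ code_units are hypothetical helper names.
+

```python
def utf8_length(cp):
    # Number of bytes in the UTF-8 encoding of code point cp,
    # using the same range limits as the pattern above.
    if cp <= 0x7F:
        return 1
    if cp <= 0x7FF:
        return 2
    if cp <= 0xFFFF:
        return 3
    return 4


def code_units(ch):
    # The individual bytes of one character, in order.
    return list(ch.encode("utf-8"))
```
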
+
+
+SQUARE BRACKETS AND CHARACTER CLASSES
+
+ An opening square bracket introduces a character class, terminated by a
+ closing square bracket. A closing square bracket on its own is not spe-
+ cial by default. If a closing square bracket is required as a member
+ of the class, it should be the first data character in the class (after
+ an initial circumflex, if present) or escaped with a backslash. This
+ means that, by default, an empty class cannot be defined. However, if
+ the PCRE2_ALLOW_EMPTY_CLASS option is set, a closing square bracket at
+ the start does end the (empty) class.
+
+ A character class matches a single character in the subject. A matched
+ character must be in the set of characters defined by the class, unless
+ the first character in the class definition is a circumflex, in which
+ case the subject character must not be in the set defined by the class.
+ If a circumflex is actually required as a member of the class, ensure
+ it is not the first character, or escape it with a backslash.
+
+ For example, the character class [aeiou] matches any lower case vowel,
+ while [^aeiou] matches any character that is not a lower case vowel.
+ Note that a circumflex is just a convenient notation for specifying the
+ characters that are in the class by enumerating those that are not. A
+ class that starts with a circumflex is not an assertion; it still con-
+ sumes a character from the subject string, and therefore it fails if
+ the current pointer is at the end of the string.
+
+ Characters in a class may be specified by their code points using \o,
+ \x, or \N{U+hh..} in the usual way. When caseless matching is set, any
+ letters in a class represent both their upper case and lower case ver-
+ sions, so for example, a caseless [aeiou] matches "A" as well as "a",
+ and a caseless [^aeiou] does not match "A", whereas a caseful version
+ would.
+
+ Characters that might indicate line breaks are never treated in any
+ special way when matching character classes, whatever line-ending
+ sequence is in use, and whatever setting of the PCRE2_DOTALL and
+ PCRE2_MULTILINE options is used. A class such as [^a] always matches
+ one of these characters.
+
+ The generic character type escape sequences \d, \D, \h, \H, \p, \P, \s,
+ \S, \v, \V, \w, and \W may appear in a character class, and add the
+ characters that they match to the class. For example, [\dABCDEF]
+ matches any hexadecimal digit. In UTF modes, the PCRE2_UCP option
+ affects the meanings of \d, \s, \w and their upper case partners, just
+ as it does when they appear outside a character class, as described in
+ the section entitled "Generic character types" above. The escape
+ sequence \b has a different meaning inside a character class; it
+ matches the backspace character. The sequences \B, \R, and \X are not
+ special inside a character class. Like other unrecognized escape
+ sequences, they cause an error. The same is true for \N when not fol-
+ lowed by an opening brace.
+
+ The minus (hyphen) character can be used to specify a range of charac-
+ ters in a character class. For example, [d-m] matches any letter
+ between d and m, inclusive. If a minus character is required in a
+ class, it must be escaped with a backslash or appear in a position
+ where it cannot be interpreted as indicating a range, typically as the
+ first or last character in the class, or immediately after a range. For
+ example, [b-d-z] matches letters in the range b to d, a hyphen charac-
+ ter, or z.
+
+ Perl treats a hyphen as a literal if it appears before or after a POSIX
+ class (see below) or before or after a character type escape such as
+ \d or \H. However, unless the hyphen is the last character in the
+ class, Perl outputs a warning in its warning mode, as this is most
+ likely a user error. As PCRE2 has no facility for warning, an error is
+ given in these cases.
+
+ It is not possible to have the literal character "]" as the end charac-
+ ter of a range. A pattern such as [W-]46] is interpreted as a class of
+ two characters ("W" and "-") followed by a literal string "46]", so it
+ would match "W46]" or "-46]". However, if the "]" is escaped with a
+ backslash it is interpreted as the end of range, so [W-\]46] is inter-
+ preted as a class containing a range followed by two other characters.
+ The octal or hexadecimal representation of "]" can also be used to end
+ a range.
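+
+ The difference between the escaped and unescaped "]" can be seen in a
+ small experiment. This is only a sketch using Python's re module, which
+ is assumed (not guaranteed) to parse these two classes as PCRE2 does;
+ the variable names are illustrative.
+
```python
import re

# Unescaped "]" closes the class early: [W-] is a two-character class
# and "46]" is a literal string that must follow it.
unescaped = re.compile(r"[W-]46]")

# Escaped "]" is the end of the range W..], with 4 and 6 as extra members.
escaped = re.compile(r"[W-\]46]")

unescaped_hits = [s for s in ("W46]", "-46]", "X46]") if unescaped.fullmatch(s)]
escaped_hits = [ch for ch in "VWXYZ[]46-" if escaped.fullmatch(ch)]

print(unescaped_hits)
print(escaped_hits)
```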
+
+ Ranges normally include all code points between the start and end char-
+ acters, inclusive. They can also be used for code points specified
+ numerically, for example [\000-\037]. Ranges can include any characters
+ that are valid for the current mode. In any UTF mode, the so-called
+ "surrogate" characters (those whose code points lie between 0xd800 and
+ 0xdfff inclusive) may not be specified explicitly by default (the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES option disables this check). How-
+ ever, ranges such as [\x{d7ff}-\x{e000}], which include the surrogates,
+ are always permitted.
+
+ There is a special case in EBCDIC environments for ranges whose end
+ points are both specified as literal letters in the same case. For com-
+ patibility with Perl, EBCDIC code points within the range that are not
+ letters are omitted. For example, [h-k] matches only four characters,
+ even though the codes for h and k are 0x88 and 0x92, a range of 11 code
+ points. However, if the range is specified numerically, for example,
+ [\x88-\x92] or [h-\x92], all code points are included.
+
+ If a range that includes letters is used when caseless matching is set,
+ it matches the letters in either case. For example, [W-c] is equivalent
+ to [][\\^_`wxyzabc], matched caselessly, and in a non-UTF mode, if
+ character tables for a French locale are in use, [\xc8-\xcb] matches
+ accented E characters in both cases.
+
+ A circumflex can conveniently be used with the upper case character
+ types to specify a more restricted set of characters than the matching
+ lower case type. For example, the class [^\W_] matches any letter or
+ digit, but not underscore, whereas [\w] includes underscore. A positive
+ character class should be read as "something OR something OR ..." and a
+ negative class as "NOT something AND NOT something AND NOT ...".
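+
+ The "NOT something AND NOT something" reading of [^\W_] can be checked
+ directly. The following is a minimal sketch in Python's re module,
+ which is assumed to agree with PCRE2 for these ASCII samples; the
+ sample list is arbitrary.
+
```python
import re

# [^\W_] reads as "NOT a non-word character AND NOT underscore",
# which leaves letters and digits only.
alnum_only = re.compile(r"[^\W_]")
# [\w] keeps the underscore.
word = re.compile(r"[\w]")

samples = ["a", "Z", "7", "_", "-", " "]
alnum_hits = [s for s in samples if alnum_only.fullmatch(s)]
word_hits = [s for s in samples if word.fullmatch(s)]

print(alnum_hits)
print(word_hits)
```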
+
+ The only metacharacters that are recognized in character classes are
+ backslash, hyphen (only where it can be interpreted as specifying a
+ range), circumflex (only at the start), opening square bracket (only
+ when it can be interpreted as introducing a POSIX class name, or for a
+ special compatibility feature - see the next two sections), and the
+ terminating closing square bracket. However, escaping other non-
+ alphanumeric characters does no harm.
+
+
+POSIX CHARACTER CLASSES
+
+ Perl supports the POSIX notation for character classes. This uses names
+ enclosed by [: and :] within the enclosing square brackets. PCRE2 also
+ supports this notation. For example,
+
+ [01[:alpha:]%]
+
+ matches "0", "1", any alphabetic character, or "%". The supported class
+ names are:
+
+ alnum letters and digits
+ alpha letters
+ ascii character codes 0 - 127
+ blank space or tab only
+ cntrl control characters
+ digit decimal digits (same as \d)
+ graph printing characters, excluding space
+ lower lower case letters
+ print printing characters, including space
+ punct printing characters, excluding letters and digits and space
+ space white space (the same as \s from PCRE 8.34)
+ upper upper case letters
+ word "word" characters (same as \w)
+ xdigit hexadecimal digits
+
+ The default "space" characters are HT (9), LF (10), VT (11), FF (12),
+ CR (13), and space (32). If locale-specific matching is taking place,
+ the list of space characters may be different; there may be fewer or
+ more of them. "Space" and \s match the same set of characters.
+
+ The name "word" is a Perl extension, and "blank" is a GNU extension
+ from Perl 5.8. Another Perl extension is negation, which is indicated
+ by a ^ character after the colon. For example,
+
+ [12[:^digit:]]
+
+ matches "1", "2", or any non-digit. PCRE2 (and Perl) also recognize the
+ POSIX syntax [.ch.] and [=ch=] where "ch" is a "collating element", but
+ these are not supported, and an error is given if they are encountered.
+
+ By default, characters with values greater than 127 do not match any of
+ the POSIX character classes, although this may be different for charac-
+ ters in the range 128-255 when locale-specific matching is happening.
+ However, if the PCRE2_UCP option is passed to pcre2_compile(), some of
+ the classes are changed so that Unicode character properties are used.
+ This is achieved by replacing certain POSIX classes with other
+ sequences, as follows:
+
+ [:alnum:] becomes \p{Xan}
+ [:alpha:] becomes \p{L}
+ [:blank:] becomes \h
+ [:cntrl:] becomes \p{Cc}
+ [:digit:] becomes \p{Nd}
+ [:lower:] becomes \p{Ll}
+ [:space:] becomes \p{Xps}
+ [:upper:] becomes \p{Lu}
+ [:word:] becomes \p{Xwd}
+
+ Negated versions, such as [:^alpha:], use \P instead of \p. Three other
+ POSIX classes are handled specially in UCP mode:
+
+ [:graph:] This matches characters that have glyphs that mark the page
+ when printed. In Unicode property terms, it matches all char-
+ acters with the L, M, N, P, S, or Cf properties, except for:
+
+ U+061C Arabic Letter Mark
+ U+180E Mongolian Vowel Separator
+ U+2066 - U+2069 Various "isolate"s
+
+
+ [:print:] This matches the same characters as [:graph:] plus space
+ characters that are not controls, that is, characters with
+ the Zs property.
+
+ [:punct:] This matches all characters that have the Unicode P (punctua-
+ tion) property, plus those characters with code points less
+ than 256 that have the S (Symbol) property.
+
+ The other POSIX classes are unchanged, and match only characters with
+ code points less than 256.
+
+
+COMPATIBILITY FEATURE FOR WORD BOUNDARIES
+
+ In the POSIX.2 compliant library that was included in 4.4BSD Unix, the
+ ugly syntax [[:<:]] and [[:>:]] is used for matching "start of word"
+ and "end of word". PCRE2 treats these items as follows:
+
+ [[:<:]] is converted to \b(?=\w)
+ [[:>:]] is converted to \b(?<=\w)
+
+ Only these exact character sequences are recognized. A sequence such as
+ [a[:<:]b] provokes an error for an unrecognized POSIX class name. This
+ support is not compatible with Perl. It is provided to help migrations
+ from other environments, and is best not used in any new patterns. Note
+ that \b matches at the start and the end of a word (see "Simple asser-
+ tions" above), and in a Perl-style pattern the preceding or following
+ character normally shows which is wanted, without the need for the
+ assertions that are used above in order to give exactly the POSIX be-
+ haviour.
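+
+ The two replacement sequences can be tried on their own to see which
+ positions they select. The sketch below uses Python's re module, which
+ does not recognize [[:<:]] or [[:>:]] but is assumed to handle \b and
+ the two assertions as PCRE2 does; the subject string is arbitrary.
+
```python
import re

# The sequences that the text says [[:<:]] and [[:>:]] are converted to.
START_OF_WORD = r"\b(?=\w)"   # boundary with a word character after it
END_OF_WORD = r"\b(?<=\w)"    # boundary with a word character before it

subject = "hi there"

# Both patterns match the empty string, so only the positions matter.
starts = [m.start() for m in re.finditer(START_OF_WORD, subject)]
ends = [m.start() for m in re.finditer(END_OF_WORD, subject)]

print(starts)
print(ends)
```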
+
+
+VERTICAL BAR
+
+ Vertical bar characters are used to separate alternative patterns. For
+ example, the pattern
+
+ gilbert|sullivan
+
+ matches either "gilbert" or "sullivan". Any number of alternatives may
+ appear, and an empty alternative is permitted (matching the empty
+ string). The matching process tries each alternative in turn, from left
+ to right, and the first one that succeeds is used. If the alternatives
+ are within a subpattern (defined below), "succeeds" means matching the
+ rest of the main pattern as well as the alternative in the subpattern.
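+
+ The left-to-right rule, and the meaning of "succeeds" inside a
+ subpattern, can be sketched as follows. Python's re module is used as a
+ stand-in and is assumed to order alternatives as PCRE2 does; the
+ patterns are variations on the example above, chosen for illustration.
+
```python
import re

# At top level the first alternative that matches wins, even though a
# later alternative would match more of the subject.
first = re.match(r"gil|gilbert", "gilbert").group()

# Inside a subpattern, "gil" is tried first, but the rest of the pattern
# (" and") then fails, so the second alternative is used instead.
inside = re.match(r"(gil|gilbert) and", "gilbert and sullivan").group(1)

# An empty alternative is permitted and matches the empty string.
empty_alt = re.fullmatch(r"gilbert|", "") is not None

print(first, inside, empty_alt)
```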
+
+
+INTERNAL OPTION SETTING
+
+ The settings of the PCRE2_CASELESS, PCRE2_MULTILINE, PCRE2_DOTALL,
+ PCRE2_EXTENDED, PCRE2_EXTENDED_MORE, and PCRE2_NO_AUTO_CAPTURE options
+ can be changed from within the pattern by a sequence of letters
+ enclosed between "(?" and ")". These options are Perl-compatible, and
+ are described in detail in the pcre2api documentation. The option let-
+ ters are:
+
+ i for PCRE2_CASELESS
+ m for PCRE2_MULTILINE
+ n for PCRE2_NO_AUTO_CAPTURE
+ s for PCRE2_DOTALL
+ x for PCRE2_EXTENDED
+ xx for PCRE2_EXTENDED_MORE
+
+ For example, (?im) sets caseless, multiline matching. It is also possi-
+ ble to unset these options by preceding the relevant letters with a
+ hyphen, for example (?-im). The two "extended" options are not indepen-
+ dent; unsetting either one cancels the effects of both of them.
+
+ A combined setting and unsetting such as (?im-sx), which sets
+ PCRE2_CASELESS and PCRE2_MULTILINE while unsetting PCRE2_DOTALL and
+ PCRE2_EXTENDED, is also permitted. Only one hyphen may appear in the
+ options string. If a letter appears both before and after the hyphen,
+ the option is unset. An empty options setting "(?)" is allowed. Need-
+ less to say, it has no effect.
+
+ If the first character following (? is a circumflex, it causes all of
+ the above options to be unset. Thus, (?^) is equivalent to (?-imnsx).
+ Letters may follow the circumflex to cause some options to be re-
+ instated, but a hyphen may not appear.
+
+ The PCRE2-specific options PCRE2_DUPNAMES and PCRE2_UNGREEDY can be
+ changed in the same way as the Perl-compatible options by using the
+ characters J and U respectively. However, these are not unset by (?^).
+
+ When one of these option changes occurs at top level (that is, not
+ inside subpattern parentheses), the change applies to the remainder of
+ the pattern that follows. An option change within a subpattern (see
+ below for a description of subpatterns) affects only that part of the
+ subpattern that follows it, so
+
+ (a(?i)b)c
+
+ matches abc and aBc and no other strings (assuming PCRE2_CASELESS is
+ not used). By this means, options can be made to have different set-
+ tings in different parts of the pattern. Any changes made in one alter-
+ native do carry on into subsequent branches within the same subpattern.
+ For example,
+
+ (a(?i)b|c)
+
+ matches "ab", "aB", "c", and "C", even though when matching "C" the
+ first branch is abandoned before the option setting. This is because
+ the effects of option settings happen at compile time. There would be
+ some very weird behaviour otherwise.
+
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern (see the next section), the option
+ letters may appear between the "?" and the ":". Thus the two patterns
+
+ (?i:saturday|sunday)
+ (?:(?i)saturday|sunday)
+
+ match exactly the same set of strings.
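+
+ The scoped form can be exercised as in the sketch below, which uses
+ Python's re module as a stand-in for PCRE2. Recent versions of that
+ module do not accept an unscoped (?i) in the middle of a pattern, so
+ a(?i:b)c is used here as an assumed equivalent of the earlier
+ (a(?i)b)c example, without the capturing parentheses.
+
```python
import re

# Option letters between "?" and ":" apply to the whole group,
# including every alternative inside it.
scoped = re.compile(r"(?i:saturday|sunday)")
results = {s: scoped.fullmatch(s) is not None
           for s in ("Saturday", "SUNDAY", "monday")}

# Only the "b" is matched caselessly; "a" and "c" stay case-sensitive.
partial = re.compile(r"a(?i:b)c")
partial_hits = [s for s in ("abc", "aBc", "Abc", "abC") if partial.fullmatch(s)]

print(results)
print(partial_hits)
```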
+
+ Note: There are other PCRE2-specific options that can be set by the
+ application when the compiling function is called. The pattern can con-
+ tain special leading sequences such as (*CRLF) to override what the
+ application has set or what has been defaulted. Details are given in
+ the section entitled "Newline sequences" above. There are also the
+ (*UTF) and (*UCP) leading sequences that can be used to set UTF and
+ Unicode property modes; they are equivalent to setting the PCRE2_UTF
+ and PCRE2_UCP options, respectively. However, the application can set
+ the PCRE2_NEVER_UTF and PCRE2_NEVER_UCP options, which lock out the use
+ of the (*UTF) and (*UCP) sequences.
+
+
+SUBPATTERNS
+
+ Subpatterns are delimited by parentheses (round brackets), which can be
+ nested. Turning part of a pattern into a subpattern does two things:
+
+ 1. It localizes a set of alternatives. For example, the pattern
+
+ cat(aract|erpillar|)
+
+ matches "cataract", "caterpillar", or "cat". Without the parentheses,
+ it would match "cataract", "erpillar" or an empty string.
+
+ 2. It sets up the subpattern as a capturing subpattern. This means
+ that, when the whole pattern matches, the portion of the subject string
+ that matched the subpattern is passed back to the caller, separately
+ from the portion that matched the whole pattern. (This applies only to
+ the traditional matching function; the DFA matching function does not
+ support capturing.)
+
+ Opening parentheses are counted from left to right (starting from 1) to
+ obtain numbers for the capturing subpatterns. For example, if the
+ string "the red king" is matched against the pattern
+
+ the ((red|white) (king|queen))
+
+ the captured substrings are "red king", "red", and "king", and are num-
+ bered 1, 2, and 3, respectively.
+
+ The fact that plain parentheses fulfil two functions is not always
+ helpful. There are often times when a grouping subpattern is required
+ without a capturing requirement. If an opening parenthesis is followed
+ by a question mark and a colon, the subpattern does not do any captur-
+ ing, and is not counted when computing the number of any subsequent
+ capturing subpatterns. For example, if the string "the white queen" is
+ matched against the pattern
+
+ the ((?:red|white) (king|queen))
+
+ the captured substrings are "white queen" and "queen", and are numbered
+ 1 and 2. The maximum number of capturing subpatterns is 65535.
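+
+ The numbering in the two examples above can be reproduced with a short
+ sketch. It uses Python's re module, which is assumed to number
+ capturing parentheses in the same way as PCRE2; the variable names are
+ illustrative.
+
```python
import re

# Every plain opening parenthesis gets the next number, left to right.
capturing = re.fullmatch(r"the ((red|white) (king|queen))", "the red king")
capturing_groups = capturing.groups()

# (?: ... ) groups without capturing and is skipped when numbering.
non_capturing = re.fullmatch(r"the ((?:red|white) (king|queen))",
                             "the white queen")
non_capturing_groups = non_capturing.groups()

print(capturing_groups)
print(non_capturing_groups)
```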
+
+ As a convenient shorthand, if any option settings are required at the
+ start of a non-capturing subpattern, the option letters may appear
+ between the "?" and the ":". Thus the two patterns
+
+ (?i:saturday|sunday)
+ (?:(?i)saturday|sunday)
+
+ match exactly the same set of strings. Because alternative branches are
+ tried from left to right, and options are not reset until the end of
+ the subpattern is reached, an option setting in one branch does affect
+ subsequent branches, so the above patterns match "SUNDAY" as well as
+ "Saturday".
+
+
+DUPLICATE SUBPATTERN NUMBERS
+
+ Perl 5.10 introduced a feature whereby each alternative in a subpattern
+ uses the same numbers for its capturing parentheses. Such a subpattern
+ starts with (?| and is itself a non-capturing subpattern. For example,
+ consider this pattern:
+
+ (?|(Sat)ur|(Sun))day
+
+ Because the two alternatives are inside a (?| group, both sets of cap-
+ turing parentheses are numbered one. Thus, when the pattern matches,
+ you can look at captured substring number one, whichever alternative
+ matched. This construct is useful when you want to capture part, but
+ not all, of one of a number of alternatives. Inside a (?| group, paren-
+ theses are numbered as usual, but the number is reset at the start of
+ each branch. The numbers of any capturing parentheses that follow the
+ subpattern start after the highest number used in any branch. The fol-
+ lowing example is taken from the Perl documentation. The numbers under-
+ neath show in which buffer the captured content will be stored.
+
+ # before  ---------------branch-reset----------- after
+ / ( a )  (?| x ( y ) z | (p (q) r) | (t) u (v) ) ( z ) /x
+ # 1            2         2  3        2     3     4
+
+ A backreference to a numbered subpattern uses the most recent value
+ that is set for that number by any subpattern. The following pattern
+ matches "abcabc" or "defdef":
+
+ /(?|(abc)|(def))\1/
+
+ In contrast, a subroutine call to a numbered subpattern always refers
+ to the first one in the pattern with the given number. The following
+ pattern matches "abcabc" or "defabc":
+
+ /(?|(abc)|(def))(?1)/
+
+ A relative reference such as (?-1) is no different: it is just a conve-
+ nient way of computing an absolute group number.
+
+ If a condition test for a subpattern's having matched refers to a non-
+ unique number, the test is true if any of the subpatterns of that num-
+ ber have matched.
+
+ An alternative approach to using this "branch reset" feature is to use
+ duplicate named subpatterns, as described in the next section.
+
+
+NAMED SUBPATTERNS
+
+ Identifying capturing parentheses by number is simple, but it can be
+ very hard to keep track of the numbers in complicated patterns. Fur-
+ thermore, if an expression is modified, the numbers may change. To help
+ with this difficulty, PCRE2 supports the naming of capturing subpat-
+ terns. This feature was not added to Perl until release 5.10. Python
+ had the feature earlier, and PCRE1 introduced it at release 4.0, using
+ the Python syntax. PCRE2 supports both the Perl and the Python syntax.
+
+ In PCRE2, a capturing subpattern can be named in one of three ways:
+ (?<name>...) or (?'name'...) as in Perl, or (?P<name>...) as in Python.
+ Names consist of up to 32 alphanumeric characters and underscores, but
+ must start with a non-digit. References to capturing parentheses from
+ other parts of the pattern, such as backreferences, recursion, and con-
+ ditions, can all be made by name as well as by number.
+
+ Named capturing parentheses are allocated numbers as well as names,
+ exactly as if the names were not present. In both PCRE2 and Perl, cap-
+ turing subpatterns are primarily identified by numbers; any names are
+ just aliases for these numbers. The PCRE2 API provides function calls
+ for extracting the complete name-to-number translation table from a
+ compiled pattern, as well as convenience functions for extracting cap-
+ tured substrings by name.
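+
+ The idea that names are aliases for numbers can be illustrated with the
+ Python syntax, which both PCRE2 and Python's re module accept. The
+ sketch below uses Python's re module, not the PCRE2 API; its groupindex
+ attribute plays a role loosely comparable to the name-to-number table,
+ and the date pattern is a made-up example.
+
```python
import re

# Two named groups; they are still numbered 1 and 2 as if unnamed.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})")

# The name-to-number mapping held by the compiled pattern.
name_table = dict(date.groupindex)

m = date.fullmatch("2024-05")
by_name = m.group("year")
by_number = m.group(1)

print(name_table, by_name, by_number)
```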
+
+ Warning: When more than one subpattern has the same number, as
+ described in the previous section, a name given to one of them applies
+ to all of them. Perl allows identically numbered subpatterns to have
+ different names. Consider this pattern, where there are two capturing
+ subpatterns, both numbered 1:
+
+ (?|(?<AA>aa)|(?<BB>bb))
+
+ Perl allows this, with both names AA and BB as aliases of group 1.
+ Thus, after a successful match, both names yield the same value (either
+ "aa" or "bb").
+
+ In an attempt to reduce confusion, PCRE2 does not allow the same group
+ number to be associated with more than one name. The example above pro-
+ vokes a compile-time error. However, there is still scope for confu-
+ sion. Consider this pattern:
+
+ (?|(?<AA>aa)|(bb))
+
+ Although the second subpattern number 1 is not explicitly named, the
+ name AA is still an alias for subpattern 1. Whether the pattern matches
+ "aa" or "bb", a reference by name to group AA yields the matched
+ string.
+
+ By default, a name must be unique within a pattern, except that dupli-
+ cate names are permitted for subpatterns with the same number, for
+ example:
+
+ (?|(?<AA>aa)|(?<AA>bb))
+
+ The duplicate name constraint can be disabled by setting the
+ PCRE2_DUPNAMES option at compile time, or by the use of (?J) within the
+ pattern.
+ Duplicate names can be useful for patterns where only one instance of
+ the named parentheses can match. Suppose you want to match the name of
+ a weekday, either as a 3-letter abbreviation or as the full name, and
+ in both cases you want to extract the abbreviation. This pattern
+ (ignoring the line breaks) does the job:
+
+ (?<DN>Mon|Fri|Sun)(?:day)?|
+ (?<DN>Tue)(?:sday)?|
+ (?<DN>Wed)(?:nesday)?|
+ (?<DN>Thu)(?:rsday)?|
+ (?<DN>Sat)(?:urday)?
+
+ There are five capturing substrings, but only one is ever set after a
+ match. The convenience functions for extracting the data by name
+ return the substring for the first (and in this example, the only)
+ subpattern of that name that matched. This saves searching to find
+ which numbered subpattern it was. (An alternative way of solving this
+ problem is to use a "branch reset" subpattern, as described in the pre-
+ vious section.)
+
+ If you make a backreference to a non-unique named subpattern from else-
+ where in the pattern, the subpatterns to which the name refers are
+ checked in the order in which they appear in the overall pattern. The
+ first one that is set is used for the reference. For example, this pat-
+ tern matches both "foofoo" and "barbar" but not "foobar" or "barfoo":
+
+ (?:(?<n>foo)|(?<n>bar))\k<n>
+
+
+ If you make a subroutine call to a non-unique named subpattern, the one
+ that corresponds to the first occurrence of the name is used. In the
+ absence of duplicate numbers this is the one with the lowest number.
+
+ If you use a named reference in a condition test (see the section about
+ conditions below), either to check whether a subpattern has matched, or
+ to check for recursion, all subpatterns with the same name are tested.
+ If the condition is true for any one of them, the overall condition is
+ true. This is the same behaviour as testing by number. For further
+ details of the interfaces for handling named subpatterns, see the
+ pcre2api documentation.
+
+
+REPETITION
+
+ Repetition is specified by quantifiers, which can follow any of the
+ following items:
+
+ a literal data character
+ the dot metacharacter
+ the \C escape sequence
+ the \X escape sequence
+ the \R escape sequence
+ an escape such as \d or \pL that matches a single character
+ a character class
+ a backreference
+ a parenthesized subpattern (including most assertions)
+ a subroutine call to a subpattern (recursive or otherwise)
+
+ The general repetition quantifier specifies a minimum and maximum num-
+ ber of permitted matches, by giving the two numbers in curly brackets
+ (braces), separated by a comma. The numbers must be less than 65536,
+ and the first must be less than or equal to the second. For example:
+
+ z{2,4}
+
+ matches "zz", "zzz", or "zzzz". A closing brace on its own is not a
+ special character. If the second number is omitted, but the comma is
+ present, there is no upper limit; if the second number and the comma
+ are both omitted, the quantifier specifies an exact number of required
+ matches. Thus
+
+ [aeiou]{3,}
+
+ matches at least 3 successive vowels, but may match many more, whereas
+
+ \d{8}
+
+ matches exactly 8 digits. An opening curly bracket that appears in a
+ position where a quantifier is not allowed, or one that does not match
+ the syntax of a quantifier, is taken as a literal character. For exam-
+ ple, {,6} is not a quantifier, but a literal string of four characters.
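+
+ The three quantifier forms above can be checked against a few subjects.
+ This is a sketch in Python's re module, assumed to agree with PCRE2 for
+ these forms; the helper name matches is made up. Python treats {,6} as
+ a quantifier, so that point is deliberately not shown here.
+
```python
import re


def matches(pattern, subjects):
    """Return the subjects that the pattern matches in their entirety."""
    return [s for s in subjects if re.fullmatch(pattern, s)]


# {2,4}: at least two and at most four repetitions.
z_hits = matches(r"z{2,4}", ["z", "zz", "zzz", "zzzz", "zzzzz"])
# {3,}: three or more, with no upper limit.
vowel_hits = matches(r"[aeiou]{3,}", ["ae", "aei", "aeiou"])
# {8}: exactly eight.
digit_hits = matches(r"\d{8}", ["1234567", "12345678", "123456789"])

print(z_hits, vowel_hits, digit_hits)
```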
+
+ In UTF modes, quantifiers apply to characters rather than to individual
+ code units. Thus, for example, \x{100}{2} matches two characters, each
+ of which is represented by a two-byte sequence in a UTF-8 string. Simi-
+ larly, \X{3} matches three Unicode extended grapheme clusters, each of
+ which may be several code units long (and they may be of different
+ lengths).
+
+ The quantifier {0} is permitted, causing the expression to behave as if
+ the previous item and the quantifier were not present. This may be use-
+ ful for subpatterns that are referenced as subroutines from elsewhere
+ in the pattern (but see also the section entitled "Defining subpatterns
+ for use by reference only" below). Items other than subpatterns that
+ have a {0} quantifier are omitted from the compiled pattern.
+
+ For convenience, the three most common quantifiers have single-charac-
+ ter abbreviations:
+
+ * is equivalent to {0,}
+ + is equivalent to {1,}
+ ? is equivalent to {0,1}
+
+ It is possible to construct infinite loops by following a subpattern
+ that can match no characters with a quantifier that has no upper limit,
+ for example:
+
+ (a?)*
+
+ Earlier versions of Perl and PCRE1 used to give an error at compile
+ time for such patterns. However, because there are cases where this can
+ be useful, such patterns are now accepted, but if any repetition of the
+ subpattern does in fact match no characters, the loop is forcibly bro-
+ ken.
+
+ By default, the quantifiers are "greedy", that is, they match as much
+ as possible (up to the maximum number of permitted times), without
+ causing the rest of the pattern to fail. The classic example of where
+ this gives problems is in trying to match comments in C programs. These
+ appear between /* and */ and within the comment, individual * and /
+ characters may appear. An attempt to match C comments by applying the
+ pattern
+
+ /\*.*\*/
+
+ to the string
+
+ /* first comment */ not comment /* second comment */
+
+ fails, because it matches the entire string owing to the greediness of
+ the .* item.
+
+ If a quantifier is followed by a question mark, it ceases to be greedy,
+ and instead matches the minimum number of times possible, so the pat-
+ tern
+
+ /\*.*?\*/
+
+ does the right thing with the C comments. The meaning of the various
+ quantifiers is not otherwise changed, just the preferred number of
+ matches. Do not confuse this use of question mark with its use as a
+ quantifier in its own right. Because it has two uses, it can sometimes
+ appear doubled, as in
+
+ \d??\d
+
+ which matches one digit by preference, but can match two if that is the
+ only way the rest of the pattern matches.
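+
+ The C comment example and the doubled question mark can be tried out as
+ below. The sketch uses Python's re module as a stand-in, assumed to
+ share PCRE2's greedy and lazy semantics for these patterns.
+
```python
import re

subject = "/* first comment */ not comment /* second comment */"

# Greedy .* runs to the last "*/", so the whole string is one match.
greedy = re.findall(r"/\*.*\*/", subject)
# Lazy .*? stops at the first "*/" it can, giving each comment separately.
lazy = re.findall(r"/\*.*?\*/", subject)

# \d??\d prefers one digit ...
preferred = re.match(r"\d??\d", "12").group()
# ... but takes two when the rest of the match (here, reaching the end
# of the subject) requires it.
forced = re.fullmatch(r"\d??\d", "12").group()

print(greedy)
print(lazy)
print(preferred, forced)
```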
+
+ If the PCRE2_UNGREEDY option is set (an option that is not available in
+ Perl), the quantifiers are not greedy by default, but individual ones
+ can be made greedy by following them with a question mark. In other
+ words, it inverts the default behaviour.
+
+ When a parenthesized subpattern is quantified with a minimum repeat
+ count that is greater than 1 or with a limited maximum, more memory is
+ required for the compiled pattern, in proportion to the size of the
+ minimum or maximum.
+
+ If a pattern starts with .* or .{0,} and the PCRE2_DOTALL option
+ (equivalent to Perl's /s) is set, thus allowing the dot to match new-
+ lines, the pattern is implicitly anchored, because whatever follows
+ will be tried against every character position in the subject string,
+ so there is no point in retrying the overall match at any position
+ after the first. PCRE2 normally treats such a pattern as though it were
+ preceded by \A.
+
+ In cases where it is known that the subject string contains no new-
+ lines, it is worth setting PCRE2_DOTALL in order to obtain this opti-
+ mization, or alternatively, using ^ to indicate anchoring explicitly.
+
+ However, there are some cases where the optimization cannot be used.
+ When .* is inside capturing parentheses that are the subject of a
+ backreference elsewhere in the pattern, a match at the start may fail
+ where a later one succeeds. Consider, for example:
+
+ (.*)abc\1
+
+ If the subject is "xyz123abc123" the match point is the fourth charac-
+ ter. For this reason, such a pattern is not implicitly anchored.
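+
+ The match point in this example can be confirmed with a short sketch.
+ It uses Python's re module with its DOTALL flag as a stand-in for PCRE2
+ with PCRE2_DOTALL; that the two agree on this pattern is an assumption.
+
```python
import re

# No match is possible at offsets 0, 1, or 2, because whatever (.*)
# captures there cannot be repeated after "abc". The match starts at
# offset 3, that is, at the fourth character.
m = re.search(r"(.*)abc\1", "xyz123abc123", re.DOTALL)

start = m.start()
whole = m.group(0)
captured = m.group(1)

print(start, whole, captured)
```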
+
+ Another case where implicit anchoring is not applied is when the lead-
+ ing .* is inside an atomic group. Once again, a match at the start may
+ fail where a later one succeeds. Consider this pattern:
+
+ (?>.*?a)b
+
+ It matches "ab" in the subject "aab". The use of the backtracking
+ control verbs (*PRUNE) and (*SKIP) also disables this optimization, and
+ there is an option, PCRE2_NO_DOTSTAR_ANCHOR, to do so explicitly.
+
+ When a capturing subpattern is repeated, the value captured is the sub-
+ string that matched the final iteration. For example, after
+
+ (tweedle[dume]{3}\s*)+
+
+ has matched "tweedledum tweedledee" the value of the captured substring
+ is "tweedledee". However, if there are nested capturing subpatterns,
+ the corresponding captured values may have been set in previous itera-
+ tions. For example, after
+
+ (a|(b))+
+
+ matches "aba" the value of the second captured substring is "b".
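+
+ Both capture examples can be reproduced in a few lines. The sketch uses
+ Python's re module, which is assumed to keep captured values across
+ iterations in the same way as PCRE2 for these two patterns.
+
```python
import re

# The outer group is repeated twice; only the final iteration is kept.
outer = re.fullmatch(r"(tweedle[dume]{3}\s*)+", "tweedledum tweedledee")
last_iteration = outer.group(1)

# Group 2 is set in the second iteration ("b") and is not cleared by the
# third iteration, in which only the "a" branch takes part.
nested = re.fullmatch(r"(a|(b))+", "aba")
nested_groups = nested.groups()

print(last_iteration, nested_groups)
```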
+
+
+ATOMIC GROUPING AND POSSESSIVE QUANTIFIERS
+
+ With both maximizing ("greedy") and minimizing ("ungreedy" or "lazy")
+ repetition, failure of what follows normally causes the repeated item
+ to be re-evaluated to see if a different number of repeats allows the
+ rest of the pattern to match. Sometimes it is useful to prevent this,
+ either to change the nature of the match, or to cause it to fail earlier
+ than it otherwise might, when the author of the pattern knows there is
+ no point in carrying on.
+
+ Consider, for example, the pattern \d+foo when applied to the subject
+ line
+
+ 123456bar
+
+ After matching all 6 digits and then failing to match "foo", the normal
+ action of the matcher is to try again with only 5 digits matching the
+ \d+ item, and then with 4, and so on, before ultimately failing.
+ "Atomic grouping" (a term taken from Jeffrey Friedl's book) provides
+ the means for specifying that once a subpattern has matched, it is not
+ to be re-evaluated in this way.
+
+ If we use atomic grouping for the previous example, the matcher gives
+ up immediately on failing to match "foo" the first time. The notation
+ is a kind of special parenthesis, starting with (?> as in this example:
+
+ (?>\d+)foo
+
+ This kind of parenthesis "locks up" the part of the pattern it con-
+ tains once it has matched, and a failure further into the pattern is
+ prevented from backtracking into it. Backtracking past it to previous
+ items, however, works as normal.
+
+ An alternative description is that a subpattern of this type matches
+ exactly the string of characters that an identical standalone pattern
+ would match, if anchored at the current point in the subject string.
+
+ Atomic grouping subpatterns are not capturing subpatterns. Simple cases
+ such as the above example can be thought of as a maximizing repeat that
+ must swallow everything it can. So, while both \d+ and \d+? are pre-
+ pared to adjust the number of digits they match in order to make the
+ rest of the pattern match, (?>\d+) can only match an entire sequence of
+ digits.
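+
+ The "must swallow everything" behaviour can be imitated in engines that
+ lack atomic groups. The sketch below is not the (?>...) syntax itself:
+ it uses Python's re module and a well-known substitute, a lookahead
+ that captures the digits followed by a backreference that consumes
+ them. Recent Python versions also accept (?>...) directly; the
+ substitute is used here only so that the example runs on older ones.
+
```python
import re

subject = "123456"

# Ordinary \d+ gives back the final "6" so that the literal 6 can match.
backtracking = re.search(r"\d+6", subject)

# Substitute for (?>\d+)6: the lookahead captures the longest run of
# digits, \1 consumes exactly that run, and the matcher cannot go back
# into the lookahead to shorten it. No digit is left for the literal 6.
atomic_like = re.search(r"(?:(?=(\d+))\1)6", subject)

print(backtracking.group(), atomic_like)
```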
+
+ Atomic groups in general can of course contain arbitrarily complicated
+ subpatterns, and can be nested. However, when the subpattern for an
+ atomic group is just a single repeated item, as in the example above, a
+ simpler notation, called a "possessive quantifier" can be used. This
+ consists of an additional + character following a quantifier. Using
+ this notation, the previous example can be rewritten as
+
+ \d++foo
+
+ Note that a possessive quantifier can be used with an entire group, for
+ example:
+
+ (abc|xyz){2,3}+
+
+ Possessive quantifiers are always greedy; the setting of the
+ PCRE2_UNGREEDY option is ignored. They are a convenient notation for
+ the simpler forms of atomic group. However, there is no difference in
+ the meaning of a possessive quantifier and the equivalent atomic group,
+ though there may be a performance difference; possessive quantifiers
+ should be slightly faster.
+
+ The possessive quantifier syntax is an extension to the Perl 5.8 syn-
+ tax. Jeffrey Friedl originated the idea (and the name) in the first
+ edition of his book. Mike McCloskey liked it, so implemented it when he
+ built Sun's Java package, and PCRE1 copied it from there. It ultimately
+ found its way into Perl at release 5.10.
+
+ PCRE2 has an optimization that automatically "possessifies" certain
+ simple pattern constructs. For example, the sequence A+B is treated as
+ A++B because there is no point in backtracking into a sequence of A's
+ when B must follow. This feature can be disabled by the
+ PCRE2_NO_AUTO_POSSESS option, or by starting the pattern with
+ (*NO_AUTO_POSSESS).
+
+ When a pattern contains an unlimited repeat inside a subpattern that
+ can itself be repeated an unlimited number of times, the use of an
+ atomic group is the only way to avoid some failing matches taking a
+ very long time indeed. The pattern
+
+ (\D+|<\d+>)*[!?]
+
+ matches an unlimited number of substrings that either consist of non-
+ digits, or digits enclosed in <>, followed by either ! or ?. When it
+ matches, it runs quickly. However, if it is applied to
+
+ aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa
+
+ it takes a long time before reporting failure. This is because the
+ string can be divided between the internal \D+ repeat and the external
+ * repeat in a large number of ways, and all have to be tried. (The
+ example uses [!?] rather than a single character at the end, because
+ both PCRE2 and Perl have an optimization that allows for fast failure
+ when a single character is used. They remember the last single charac-
+ ter that is required for a match, and fail early if it is not present
+ in the string.) If the pattern is changed so that it uses an atomic
+ group, like this:
+
+ ((?>\D+)|<\d+>)*[!?]
+
+ sequences of non-digits cannot be broken, and failure happens quickly.
+
+
+BACKREFERENCES
+
+ Outside a character class, a backslash followed by a digit greater than
+ 0 (and possibly further digits) is a backreference to a capturing sub-
+ pattern earlier (that is, to its left) in the pattern, provided there
+ have been that many previous capturing left parentheses.
+
+ However, if the decimal number following the backslash is less than 8,
+ it is always taken as a backreference, and causes an error only if
+ there are not that many capturing left parentheses in the entire pat-
+ tern. In other words, the parentheses that are referenced need not be
+ to the left of the reference for numbers less than 8. A "forward back-
+ reference" of this type can make sense when a repetition is involved
+ and the subpattern to the right has participated in an earlier itera-
+ tion.
+
+ It is not possible to have a numerical "forward backreference" to a
+ subpattern whose number is 8 or more using this syntax because a
+ sequence such as \50 is interpreted as a character defined in octal.
+ See the subsection entitled "Non-printing characters" above for further
+ details of the handling of digits following a backslash. There is no
+ such problem when named parentheses are used. A backreference to any
+ subpattern is possible using named parentheses (see below).
+
+ Another way of avoiding the ambiguity inherent in the use of digits
+ following a backslash is to use the \g escape sequence. This escape
+ must be followed by a signed or unsigned number, optionally enclosed in
+ braces. These examples are all identical:
+
+ (ring), \1
+ (ring), \g1
+ (ring), \g{1}
+
+ An unsigned number specifies an absolute reference without the ambigu-
+ ity that is present in the older syntax. It is also useful when literal
+ digits follow the reference. A signed number is a relative reference.
+ Consider this example:
+
+ (abc(def)ghi)\g{-1}
+
+ The sequence \g{-1} is a reference to the most recently started
+ capturing subpattern before \g, that is, it is equivalent to \2 in this
+ example. Similarly, \g{-2} would be equivalent to \1. The use of relative
+ references can be helpful in long patterns, and also in patterns that
+ are created by joining together fragments that contain references
+ within themselves.
+
+ The sequence \g{+1} is a reference to the next capturing subpattern.
+ This kind of forward reference can be useful in patterns that repeat.
+ Perl does not support the use of + in this way.
+
+ A backreference matches whatever actually matched the capturing subpat-
+ tern in the current subject string, rather than anything matching the
+ subpattern itself (see "Subpatterns as subroutines" below for a way of
+ doing that). So the pattern
+
+ (sens|respons)e and \1ibility
+
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If caseful matching is in force at the
+ time of the backreference, the case of letters is relevant. For exam-
+ ple,
+
+ ((?i)rah)\s+\1
+
+ matches "rah rah" and "RAH RAH", but not "RAH rah", even though the
+ original capturing subpattern is matched caselessly.
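+
+ As a rough illustration, both patterns can be tried in Python's re
+ module. That is a different engine from PCRE2 and is used here only as
+ a stand-in, on the assumption that it treats these particular
+ backreferences the same way. One substitution is made: the scoped form
+ (?i:rah) replaces (?i)rah, because Python does not scope an inline (?i)
+ to the enclosing group and recent versions reject it in mid-pattern.
+
```python
import re

# Python's re is a stand-in for PCRE2 here (an assumption of this sketch).
same_word = re.compile(r"(sens|respons)e and \1ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    # \1 must repeat the text that group 1 actually matched.
    print(subject, "->", bool(same_word.fullmatch(subject)))

# (?i:rah) stands in for the document's (?i)rah: the group is matched
# caselessly, but \1 lies outside the caseless scope and compares exactly.
cheer = re.compile(r"((?i:rah))\s+\1")

for subject in ("rah rah", "RAH RAH", "RAH rah"):
    print(subject, "->", bool(cheer.fullmatch(subject)))
```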
+
+ There are several different ways of writing backreferences to named
+ subpatterns. The .NET syntax \k{name} and the Perl syntax \k<name> or
+ \k'name' are supported, as is the Python syntax (?P=name). Perl 5.10's
+ unified backreference syntax, in which \g can be used for both numeric
+ and named references, is also supported. We could rewrite the above
+ example in any of the following ways:
+
+ (?<p1>(?i)rah)\s+\k<p1>
+ (?'p1'(?i)rah)\s+\k{p1}
+ (?P<p1>(?i)rah)\s+(?P=p1)
+ (?<p1>(?i)rah)\s+\g{p1}
+
+ A subpattern that is referenced by name may appear in the pattern
+ before or after the reference.
+
+ There may be more than one backreference to the same subpattern. If a
+ subpattern has not actually been used in a particular match, any back-
+ references to it always fail by default. For example, the pattern
+
+ (a|(bc))\2
+
+ always fails if it starts to match "a" rather than "bc". However, if
+ the PCRE2_MATCH_UNSET_BACKREF option is set at compile time, a backref-
+ erence to an unset value matches an empty string.
+
+ Because there may be many capturing parentheses in a pattern, all dig-
+ its following a backslash are taken as part of a potential backrefer-
+ ence number. If the pattern continues with a digit character, some
+ delimiter must be used to terminate the backreference. If the
+ PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, this can be white
+ space. Otherwise, the \g{ syntax or an empty comment (see "Comments"
+ below) can be used.
+
+ Recursive backreferences
+
+ A backreference that occurs inside the parentheses to which it refers
+ fails when the subpattern is first used, so, for example, (a\1) never
+ matches. However, such references can be useful inside repeated sub-
+ patterns. For example, the pattern
+
+ (a|b\1)+
+
+ matches any number of "a"s and also "aba", "ababbaa" etc. At each iter-
+ ation of the subpattern, the backreference matches the character string
+ corresponding to the previous iteration. In order for this to work, the
+ pattern must be such that the first iteration does not need to match
+ the backreference. This can be done using alternation, as in the exam-
+ ple above, or by a quantifier with a minimum of zero.
+
+ Backreferences of this type cause the group that they reference to be
+ treated as an atomic group. Once the whole group has been matched, a
+ subsequent matching failure cannot cause backtracking into the middle
+ of the group.
+
+
+ASSERTIONS
+
+ An assertion is a test on the characters following or preceding the
+ current matching point that does not consume any characters. The simple
+ assertions coded as \b, \B, \A, \G, \Z, \z, ^ and $ are described
+ above.
+
+ More complicated assertions are coded as subpatterns. There are two
+ kinds: those that look ahead of the current position in the subject
+ string, and those that look behind it, and in each case an assertion
+ may be positive (must succeed for matching to continue) or negative
+ (must not succeed for matching to continue). An assertion subpattern is
+ matched in the normal way, except that, when matching continues after a
+ successful assertion, the matching position in the subject string is as
+ it was before the assertion was processed.
+
+ Assertion subpatterns are not capturing subpatterns. If an assertion
+ contains capturing subpatterns within it, these are counted for the
+ purposes of numbering the capturing subpatterns in the whole pattern.
+ Within each branch of an assertion, locally captured substrings may be
+ referenced in the usual way. For example, a sequence such as (.)\g{-1}
+ can be used to check that two adjacent characters are the same.
+
+ When a branch within an assertion fails to match, any substrings that
+ were captured are discarded (as happens with any pattern branch that
+ fails to match). A negative assertion succeeds only when all its
+ branches fail to match; this means that no captured substrings are ever
+ retained after a successful negative assertion. When an assertion con-
+ tains a matching branch, what happens depends on the type of assertion.
+
+ For a positive assertion, internally captured substrings in the suc-
+ cessful branch are retained, and matching continues with the next pat-
+ tern item after the assertion. For a negative assertion, a matching
+ branch means that the assertion has failed. If the assertion is being
+ used as a condition in a conditional subpattern (see below), captured
+ substrings are retained, because matching continues with the "no"
+ branch of the condition. For other failing negative assertions, control
+ passes to the previous backtracking point, thus discarding any captured
+ strings within the assertion.
+
+ For compatibility with Perl, most assertion subpatterns may be
+ repeated; though it makes no sense to assert the same thing several
+ times, the side effect of capturing parentheses may occasionally be
+ useful. However, an assertion that forms the condition for a
+ conditional subpattern may not be quantified. In practice, for other
+ assertions, there are only three cases:
+
+ (1) If the quantifier is {0}, the assertion is never obeyed during
+ matching. However, it may contain internal capturing parenthesized
+ groups that are called from elsewhere via the subroutine mechanism.
+
+ (2) If the quantifier is {0,n} where n is greater than zero, it is treated
+ as if it were {0,1}. At run time, the rest of the pattern match is
+ tried with and without the assertion, the order depending on the greed-
+ iness of the quantifier.
+
+ (3) If the minimum repetition is greater than zero, the quantifier is
+ ignored. The assertion is obeyed just once when encountered during
+ matching.
+
+ Lookahead assertions
+
+ Lookahead assertions start with (?= for positive assertions and (?! for
+ negative assertions. For example,
+
+ \w+(?=;)
+
+ matches a word followed by a semicolon, but does not include the semi-
+ colon in the match, and
+
+ foo(?!bar)
+
+ matches any occurrence of "foo" that is not followed by "bar". Note
+ that the apparently similar pattern
+
+ (?!foo)bar
+
+ does not find an occurrence of "bar" that is preceded by something
+ other than "foo"; it finds any occurrence of "bar" whatsoever, because
+ the assertion (?!foo) is always true when the next three characters are
+ "bar". A lookbehind assertion is needed to achieve the other effect.
+
+ If you want to force a matching failure at some point in a pattern, the
+ most convenient way to do it is with (?!) because an empty string
+ always matches, so an assertion that requires there not to be an empty
+ string must always fail. The backtracking control verb (*FAIL) or (*F)
+ is a synonym for (?!).
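+
+ The three lookahead patterns above can be tried with Python's re
+ module, used here as a stand-in engine on the assumption that it
+ handles these simple lookaheads as PCRE2 does. The helper name span_of
+ is invented for this sketch; it reports where the first match lies so
+ that the position of each assertion is visible.
+
```python
import re


def span_of(pattern, subject):
    """Return (start, end) of the first match, or None if there is none."""
    m = re.search(pattern, subject)
    return m.span() if m else None


# The semicolon must be present but is not part of the match.
print(span_of(r"\w+(?=;)", "x = total;"))

# Only the second "foo" qualifies, because the first is followed by "bar".
print(span_of(r"foo(?!bar)", "foobar foobaz"))

# The assertion is tested where "bar" begins, so "foo" before it is
# irrelevant and "bar" is still found.
print(span_of(r"(?!foo)bar", "foobar"))
```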
+
+ Lookbehind assertions
+
+ Lookbehind assertions start with (?<= for positive assertions and (?<!
+ for negative assertions. For example,
+
+ (?<!foo)bar
+
+ does find an occurrence of "bar" that is not preceded by "foo". The
+ contents of a lookbehind assertion are restricted such that all the
+ strings it matches must have a fixed length. However, if there are sev-
+ eral top-level alternatives, they do not all have to have the same
+ fixed length. Thus
+
+ (?<=bullock|donkey)
+
+ is permitted, but
+
+ (?<!dogs?|cats?)
+
+ causes an error at compile time. Branches that match different length
+ strings are permitted only at the top level of a lookbehind assertion.
+ This is an extension compared with Perl, which requires all branches to
+ match the same length of string. An assertion such as
+
+ (?<=ab(c|de))
+
+ is not permitted, because its single top-level branch can match two
+ different lengths, but it is acceptable to PCRE2 if rewritten to use
+ two top-level branches:
+
+ (?<=abc|abde)
+
+ In some cases, the escape sequence \K (see above) can be used instead
+ of a lookbehind assertion to get round the fixed-length restriction.
+
+ The implementation of lookbehind assertions is, for each alternative,
+ to temporarily move the current position back by the fixed length and
+ then try to match. If there are insufficient characters before the cur-
+ rent position, the assertion fails.
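+
+ The procedure just described can be sketched in Python for the simple
+ case where every top-level alternative is a literal string, as in
+ (?<=bullock|donkey). The function name and its arguments are invented
+ for this illustration; PCRE2 itself handles arbitrary fixed-length
+ subpatterns, not just literals, and this is not its actual code.
+
```python
def lookbehind_matches(subject, pos, alternatives):
    """Emulate a positive lookbehind whose alternatives are literals."""
    for alt in alternatives:
        start = pos - len(alt)  # move back by this alternative's length
        if start < 0:
            continue  # too few characters before pos: this branch fails
        if subject.startswith(alt, start):
            return True
    return False


subject = "a donkey cart"
# Position 8 is just after "donkey"; position 3 is too close to the start.
print(lookbehind_matches(subject, 8, ["bullock", "donkey"]))
print(lookbehind_matches(subject, 3, ["bullock", "donkey"]))
```
+
+ Each alternative is tried with its own step back, which is why the
+ top-level branches may differ in length while each one must be fixed.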
+
+ In UTF-8 and UTF-16 modes, PCRE2 does not allow the \C escape (which
+ matches a single code unit even in a UTF mode) to appear in lookbehind
+ assertions, because it makes it impossible to calculate the length of
+ the lookbehind. The \X and \R escapes, which can match different num-
+ bers of code units, are never permitted in lookbehinds.
+
+ "Subroutine" calls (see below) such as (?2) or (?&X) are permitted in
+ lookbehinds, as long as the subpattern matches a fixed-length string.
+ However, recursion, that is, a "subroutine" call into a group that is
+ already active, is not supported.
+
+ Perl does not support backreferences in lookbehinds. PCRE2 does support
+ them, but only if certain conditions are met. The
+ PCRE2_MATCH_UNSET_BACKREF option must not be set, there must be no use
+ of (?| in the pattern (it creates duplicate subpattern numbers), and if
+ the backreference is by name, the name must be unique. Of course, the
+ referenced subpattern must itself be of fixed length. The following
+ pattern matches words containing at least two characters that begin and
+ end with the same character:
+
+ \b(\w)\w++(?<=\1)
+
+ Possessive quantifiers can be used in conjunction with lookbehind
+ assertions to specify efficient matching of fixed-length strings at the
+ end of subject strings. Consider a simple pattern such as
+
+ abcd$
+
+ when applied to a long string that does not match. Because matching
+ proceeds from left to right, PCRE2 will look for each "a" in the sub-
+ ject and then see if what follows matches the rest of the pattern. If
+ the pattern is specified as
+
+ ^.*abcd$
+
+ the initial .* matches the entire string at first, but when this fails
+ (because there is no following "a"), it backtracks to match all but the
+ last character, then all but the last two characters, and so on. Once
+ again the search for "a" covers the entire string, from right to left,
+ so we are no better off. However, if the pattern is written as
+
+ ^.*+(?<=abcd)
+
+ there can be no backtracking for the .*+ item because of the possessive
+ quantifier; it can match only the entire string. The subsequent lookbe-
+ hind assertion does a single test on the last four characters. If it
+ fails, the match fails immediately. For long strings, this approach
+ makes a significant difference to the processing time.
+
+ Using multiple assertions
+
+ Several assertions (of any sort) may occur in succession. For example,
+
+ (?<=\d{3})(?<!999)foo
+
+ matches "foo" preceded by three digits that are not "999". Notice that
+ each of the assertions is applied independently at the same point in
+ the subject string. First there is a check that the previous three
+ characters are all digits, and then there is a check that the same
+ three characters are not "999". This pattern does not match "foo"
+ preceded by six characters, the first three of which are digits and the
+ last three of which are not "999". For example, it doesn't match
+ "123abcfoo". A pattern to do that is
+
+ (?<=\d{3}...)(?<!999)foo
+
+ This time the first assertion looks at the preceding six characters,
+ checking that the first three are digits, and then the second assertion
+ checks that the preceding three characters are not "999".
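+
+ The difference between the two patterns is easy to check by running
+ them side by side. The sketch below uses Python's re module as a
+ stand-in for PCRE2, on the assumption that fixed-length lookbehinds of
+ this kind behave the same in both; the variable names are invented.
+
```python
import re

# Python's re stands in for PCRE2 here (an assumption of this sketch).
three_before = re.compile(r"(?<=\d{3})(?<!999)foo")
six_before = re.compile(r"(?<=\d{3}...)(?<!999)foo")

for subject in ("123foo", "999foo", "123abcfoo", "123999foo"):
    # Both assertions in each pattern are tested at the start of "foo".
    print(subject,
          bool(three_before.search(subject)),
          bool(six_before.search(subject)))
```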
+
+ Assertions can be nested in any combination. For example,
+
+ (?<=(?<!foo)bar)baz
+
+ matches an occurrence of "baz" that is preceded by "bar" which in turn
+ is not preceded by "foo", while
+
+ (?<=\d{3}(?!999)...)foo
+
+ is another pattern that matches "foo" preceded by three digits and any
+ three characters that are not "999".
+
+
+CONDITIONAL SUBPATTERNS
+
+ It is possible to cause the matching process to obey a subpattern con-
+ ditionally or to choose between two alternative subpatterns, depending
+ on the result of an assertion, or whether a specific capturing subpat-
+ tern has already been matched. The two possible forms of conditional
+ subpattern are:
+
+ (?(condition)yes-pattern)
+ (?(condition)yes-pattern|no-pattern)
+
+ If the condition is satisfied, the yes-pattern is used; otherwise the
+ no-pattern (if present) is used. An absent no-pattern is equivalent to
+ an empty string (it always matches). If there are more than two alter-
+ natives in the subpattern, a compile-time error occurs. Each of the two
+ alternatives may itself contain nested subpatterns of any form, includ-
+ ing conditional subpatterns; the restriction to two alternatives
+ applies only at the level of the condition. This pattern fragment is an
+ example where the alternatives are complex:
+
+ (?(1) (A|B|C) | (D | (?(2)E|F) | E) )
+
+
+ There are five kinds of condition: references to subpatterns, refer-
+ ences to recursion, two pseudo-conditions called DEFINE and VERSION,
+ and assertions.
+
+ Checking for a used subpattern by number
+
+ If the text between the parentheses consists of a sequence of digits,
+ the condition is true if a capturing subpattern of that number has pre-
+ viously matched. If there is more than one capturing subpattern with
+ the same number (see the earlier section about duplicate subpattern
+ numbers), the condition is true if any of them have matched. An alter-
+ native notation is to precede the digits with a plus or minus sign. In
+ this case, the subpattern number is relative rather than absolute. The
+ most recently opened parentheses can be referenced by (?(-1), the next
+ most recent by (?(-2), and so on. Inside loops it can also make sense
+ to refer to subsequent groups. The next parentheses to be opened can be
+ referenced as (?(+1), and so on. (The value zero in any of these forms
+ is not used; it provokes a compile-time error.)
+
+ Consider the following pattern, which contains non-significant white
+ space to make it more readable (assume the PCRE2_EXTENDED option) and
+ to divide it into three parts for ease of discussion:
+
+ ( \( )? [^()]+ (?(1) \) )
+
+ The first part matches an optional opening parenthesis, and if that
+ character is present, sets it as the first captured substring. The sec-
+ ond part matches one or more characters that are not parentheses. The
+ third part is a conditional subpattern that tests whether or not the
+ first set of parentheses matched. If they did, that is, if the subject
+ started with an opening parenthesis, the condition is true, and so the
+ yes-pattern is executed and a closing parenthesis is required. Other-
+ wise, since no-pattern is not present, the subpattern matches nothing.
+ In other words, this pattern matches a sequence of non-parentheses,
+ optionally enclosed in parentheses.
+
+ If you were embedding this pattern in a larger one, you could use a
+ relative reference:
+
+ ...other stuff... ( \( )? [^()]+ (?(-1) \) ) ...
+
+ This makes the fragment independent of the parentheses in the larger
+ pattern.
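+
+ The absolute-reference form of this pattern can be exercised with
+ Python's re module, which accepts the same (?(1)...) syntax; re.VERBOSE
+ plays the role that PCRE2_EXTENDED plays here. Python is only a
+ stand-in engine in this sketch, and it does not accept the relative
+ form (?(-1)...), so only the numbered form is shown.
+
```python
import re

# re.VERBOSE lets the pattern keep the document's spacing.
optional_parens = re.compile(r"( \( )? [^()]+ (?(1) \) )", re.VERBOSE)

for subject in ("abc", "(abc)", "(abc", "abc)"):
    # fullmatch insists that the whole subject fits the pattern.
    print(subject, "->", bool(optional_parens.fullmatch(subject)))
```
+
+ Whole-string matching is used so that an unbalanced parenthesis shows up
+ as a failure rather than as a shorter match.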
+
+ Checking for a used subpattern by name
+
+ Perl uses the syntax (?(<name>)...) or (?('name')...) to test for a
+ used subpattern by name. For compatibility with earlier versions of
+ PCRE1, which had this facility before Perl, the syntax (?(name)...) is
+ also recognized. Note, however, that undelimited names consisting of
+ the letter R followed by digits are ambiguous (see the following sec-
+ tion).
+
+ Rewriting the above example to use a named subpattern gives this:
+
+ (?<OPEN> \( )? [^()]+ (?(<OPEN>) \) )
+
+ If the name used in a condition of this kind is a duplicate, the test
+ is applied to all subpatterns of the same name, and is true if any one
+ of them has matched.
+
+ Checking for pattern recursion
+
+ "Recursion" in this sense refers to any subroutine-like call from one
+ part of the pattern to another, whether or not it is actually recur-
+ sive. See the sections entitled "Recursive patterns" and "Subpatterns
+ as subroutines" below for details of recursion and subpattern calls.
+
+ If a condition is the string (R), and there is no subpattern with the
+ name R, the condition is true if matching is currently in a recursion
+ or subroutine call to the whole pattern or any subpattern. If digits
+ follow the letter R, and there is no subpattern with that name, the
+ condition is true if the most recent call is into a subpattern with the
+ given number, which must exist somewhere in the overall pattern. This
+ is a contrived example that is equivalent to a+b:
+
+ ((?(R1)a+|(?1)b))
+
+ However, in both cases, if there is a subpattern with a matching name,
+ the condition tests for its being set, as described in the section
+ above, instead of testing for recursion. For example, creating a group
+ with the name R1 by adding (?<R1>) to the above pattern completely
+ changes its meaning.
+
+ If a name preceded by ampersand follows the letter R, for example:
+
+ (?(R&name)...)
+
+ the condition is true if the most recent recursion is into a subpattern
+ of that name (which must exist within the pattern).
+
+ This condition does not check the entire recursion stack. It tests only
+ the current level. If the name used in a condition of this kind is a
+ duplicate, the test is applied to all subpatterns of the same name, and
+ is true if any one of them is the most recent recursion.
+
+ At "top level", all these recursion test conditions are false.
+
+ Defining subpatterns for use by reference only
+
+ If the condition is the string (DEFINE), the condition is always false,
+ even if there is a group with the name DEFINE. In this case, there may
+ be only one alternative in the subpattern. It is always skipped if con-
+ trol reaches this point in the pattern; the idea of DEFINE is that it
+ can be used to define subroutines that can be referenced from else-
+ where. (The use of subroutines is described below.) For example, a pat-
+ tern to match an IPv4 address such as "192.168.23.245" could be written
+ like this (ignore white space and line breaks):
+
+ (?(DEFINE) (?<byte> 2[0-4]\d | 25[0-5] | 1\d\d | [1-9]?\d) )
+ \b (?&byte) (\.(?&byte)){3} \b
+
+ The first part of the pattern is a DEFINE group inside which another
+ group named "byte" is defined. This matches an individual component of
+ an IPv4 address (a number less than 256). When matching takes place,
+ this part of the pattern is skipped because DEFINE acts like a false
+ condition. The rest of the pattern uses references to the named group
+ to match the four dot-separated components of an IPv4 address, insist-
+ ing on a word boundary at each end.
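+
+ Engines without DEFINE can approximate the same reuse by building the
+ pattern from a shared fragment. The Python sketch below does this with
+ plain string concatenation; Python's re has neither (?(DEFINE)...) nor
+ (?&name), so this is a substitute technique rather than the PCRE2
+ mechanism, and the names BYTE and IPV4 are invented for the example.
+
```python
import re

# One component of an IPv4 address (a number less than 256), written once.
BYTE = r"(?:2[0-4]\d|25[0-5]|1\d\d|[1-9]?\d)"

# Reuse the fragment textually where the PCRE2 pattern uses (?&byte).
IPV4 = re.compile(r"\b" + BYTE + r"(?:\." + BYTE + r"){3}\b")

for subject in ("192.168.23.245", "256.1.1.1", "1.2.3"):
    print(subject, "->", bool(IPV4.fullmatch(subject)))
```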
+
+ Checking the PCRE2 version
+
+ Programs that link with a PCRE2 library can check the version by call-
+ ing pcre2_config() with appropriate arguments. Users of applications
+ that do not have access to the underlying code cannot do this. A spe-
+ cial "condition" called VERSION exists to allow such users to discover
+ which version of PCRE2 they are dealing with by using this condition to
+ match a string such as "yesno". VERSION must be followed either by "="
+ or ">=" and a version number. For example:
+
+ (?(VERSION>=10.4)yes|no)
+
+ This pattern matches "yes" if the PCRE2 version is greater than or equal
+ to 10.4, or "no" otherwise. The fractional part of the version number may
+ not contain more than two digits.
+
+ Assertion conditions
+
+ If the condition is not in any of the above formats, it must be an
+ assertion. This may be a positive or negative lookahead or lookbehind
+ assertion. Consider this pattern, again containing non-significant
+ white space, and with the two alternatives on the second line:
+
+ (?(?=[^a-z]*[a-z])
+ \d{2}-[a-z]{3}-\d{2} | \d{2}-\d{2}-\d{2} )
+
+ The condition is a positive lookahead assertion that matches an
+ optional sequence of non-letters followed by a letter. In other words,
+ it tests for the presence of at least one letter in the subject. If a
+ letter is found, the subject is matched against the first alternative;
+ otherwise it is matched against the second. This pattern matches
+ strings in one of the two forms dd-aaa-dd or dd-dd-dd, where aaa are
+ letters and dd are digits.
+
+ When an assertion that is a condition contains capturing subpatterns,
+ any capturing that occurs in a matching branch is retained afterwards,
+ for both positive and negative assertions, because matching always con-
+ tinues after the assertion, whether it succeeds or fails. (Compare non-
+ conditional assertions, when captures are retained only for positive
+ assertions that succeed.)
+
+
+COMMENTS
+
+ There are two ways of including comments in patterns that are processed
+ by PCRE2. In both cases, the start of the comment must not be in a
+ character class, nor in the middle of any other sequence of related
+ characters such as (?: or a subpattern name or number. The characters
+ that make up a comment play no part in the pattern matching.
+
+ The sequence (?# marks the start of a comment that continues up to the
+ next closing parenthesis. Nested parentheses are not permitted. If the
+ PCRE2_EXTENDED or PCRE2_EXTENDED_MORE option is set, an unescaped #
+ character also introduces a comment, which in this case continues to
+ immediately after the next newline character or character sequence in
+ the pattern. Which characters are interpreted as newlines is controlled
+ by an option passed to the compiling function or by a special sequence
+ at the start of the pattern, as described in the section entitled "New-
+ line conventions" above. Note that the end of this type of comment is a
+ literal newline sequence in the pattern; escape sequences that happen
+ to represent a newline do not count. For example, consider this pattern
+ when PCRE2_EXTENDED is set, and the default newline convention (a sin-
+ gle linefeed character) is in force:
+
+ abc #comment \n still comment
+
+ On encountering the # character, pcre2_compile() skips along, looking
+ for a newline in the pattern. The sequence \n is still literal at this
+ stage, so it does not terminate the comment. Only an actual character
+ with the code value 0x0a (the default newline) does so.
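+
+ Python's re.VERBOSE mode treats # comments in the same way, so it can
+ be used to see the distinction, with the caveat that it is a different
+ engine standing in for PCRE2_EXTENDED. In the sketch below the first
+ pattern contains the two characters backslash and n, while the second
+ contains a real linefeed; the variable names are invented.
+
```python
import re

# Raw string: a backslash followed by "n", no real newline, so everything
# after "#" is comment and the pattern is just "abc".
escaped = re.compile(r"abc #comment \n still comment", re.VERBOSE)

# Ordinary string: "\n" is a real linefeed, which ends the comment, so
# "def" is pattern text again and the pattern is "abcdef".
literal = re.compile("abc #comment \n def", re.VERBOSE)

print(bool(escaped.fullmatch("abc")), bool(literal.fullmatch("abcdef")))
```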
+
+
+RECURSIVE PATTERNS
+
+ Consider the problem of matching a string in parentheses, allowing for
+ unlimited nested parentheses. Without the use of recursion, the best
+ that can be done is to use a pattern that matches up to some fixed
+ depth of nesting. It is not possible to handle an arbitrary nesting
+ depth.
+
+ For some time, Perl has provided a facility that allows regular expres-
+ sions to recurse (amongst other things). It does this by interpolating
+ Perl code in the expression at run time, and the code can refer to the
+ expression itself. A Perl pattern using code interpolation to solve the
+ parentheses problem can be created like this:
+
+ $re = qr{\( (?: (?>[^()]+) | (?p{$re}) )* \)}x;
+
+ The (?p{...}) item interpolates Perl code at run time, and in this case
+ refers recursively to the pattern in which it appears.
+
+ Obviously, PCRE2 cannot support the interpolation of Perl code.
+ Instead, it supports special syntax for recursion of the entire pat-
+ tern, and also for individual subpattern recursion. After its introduc-
+ tion in PCRE1 and Python, this kind of recursion was subsequently
+ introduced into Perl at release 5.10.
+
+ A special item that consists of (? followed by a number greater than
+ zero and a closing parenthesis is a recursive subroutine call of the
+ subpattern of the given number, provided that it occurs inside that
+ subpattern. (If not, it is a non-recursive subroutine call, which is
+ described in the next section.) The special item (?R) or (?0) is a
+ recursive call of the entire regular expression.
+
+ This PCRE2 pattern solves the nested parentheses problem (assume the
+ PCRE2_EXTENDED option is set so that white space is ignored):
+
+ \( ( [^()]++ | (?R) )* \)
+
+ First it matches an opening parenthesis. Then it matches any number of
+ substrings which can either be a sequence of non-parentheses, or a
+ recursive match of the pattern itself (that is, a correctly parenthe-
+ sized substring). Finally there is a closing parenthesis. Note the use
+ of a possessive quantifier to avoid backtracking into sequences of non-
+ parentheses.
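+
+ For readers who find the recursion easier to follow as ordinary code,
+ here is a hand-written Python function that accepts the same strings as
+ the pattern above when matching starts at a given position. It is a
+ sketch of the pattern's logic, not of how PCRE2 executes it, and the
+ function name is invented for this illustration.
+
```python
def match_parens(s, pos=0):
    """Return the index just past a balanced group at pos, or -1."""
    if pos >= len(s) or s[pos] != "(":
        return -1  # \( : an opening parenthesis is required
    pos += 1
    while pos < len(s):
        if s[pos] == ")":
            return pos + 1  # \) : the group is complete
        if s[pos] == "(":
            end = match_parens(s, pos)  # (?R) : recursive match
            if end < 0:
                return -1
            pos = end
        else:
            # [^()]++ : take the whole run of non-parentheses at once
            while pos < len(s) and s[pos] not in "()":
                pos += 1
    return -1  # ran out of subject before the closing parenthesis


print(match_parens("(ab(cd)ef)"))
print(match_parens("(ab(cd"))
```
+
+ Because the run of non-parentheses is consumed in one step and never
+ given back, a failing subject is rejected in a single pass.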
+
+ If this were part of a larger pattern, you would not want to recurse
+ the entire pattern, so instead you could use this:
+
+ ( \( ( [^()]++ | (?1) )* \) )
+
+ We have put the pattern into parentheses, and caused the recursion to
+ refer to them instead of the whole pattern.
+
+ In a larger pattern, keeping track of parenthesis numbers can be
+ tricky. This is made easier by the use of relative references. Instead
+ of (?1) in the pattern above you can write (?-2) to refer to the second
+ most recently opened parentheses preceding the recursion. In other
+ words, a negative number counts capturing parentheses leftwards from
+ the point at which it is encountered.
+
+ Be aware, however, that if duplicate subpattern numbers are in use,
+ relative references refer to the earliest subpattern with the appropriate
+ number. Consider, for example:
+
+ (?|(a)|(b)) (c) (?-2)
+
+ The first two capturing groups (a) and (b) are both numbered 1, and
+ group (c) is number 2. When the reference (?-2) is encountered, the
+ second most recently opened parentheses has the number 1, but it is the
+ first such group (the (a) group) to which the recursion refers. This
+ would be the same if an absolute reference (?1) was used. In other
+ words, relative references are just a shorthand for computing a group
+ number.
+
+ It is also possible to refer to subsequently opened parentheses, by
+ writing references such as (?+2). However, these cannot be recursive
+ because the reference is not inside the parentheses that are refer-
+ enced. They are always non-recursive subroutine calls, as described in
+ the next section.
+
+ An alternative approach is to use named parentheses. The Perl syntax
+ for this is (?&name); PCRE1's earlier syntax (?P>name) is also sup-
+ ported. We could rewrite the above example as follows:
+
+ (?<pn> \( ( [^()]++ | (?&pn) )* \) )
+
+ If there is more than one subpattern with the same name, the earliest
+ one is used.
+
+ The example pattern that we have been looking at contains nested unlim-
+ ited repeats, and so the use of a possessive quantifier for matching
+ strings of non-parentheses is important when applying the pattern to
+ strings that do not match. For example, when this pattern is applied to
+
+ (aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa()
+
+ it yields "no match" quickly. However, if a possessive quantifier is
+ not used, the match runs for a very long time indeed because there are
+ so many different ways the + and * repeats can carve up the subject,
+ and all have to be tested before failure can be reported.
+
+ At the end of a match, the values of capturing parentheses are those
+ from the outermost level. If you want to obtain intermediate values, a
+ callout function can be used (see below and the pcre2callout documenta-
+ tion). If the pattern above is matched against
+
+ (ab(cd)ef)
+
+ the value for the inner capturing parentheses (numbered 2) is "ef",
+ which is the last value taken on at the top level. If a capturing sub-
+ pattern is not matched at the top level, its final captured value is
+ unset, even if it was (temporarily) set at a deeper level during the
+ matching process.
+
+ Do not confuse the (?R) item with the condition (R), which tests for
+ recursion. Consider this pattern, which matches text in angle brack-
+ ets, allowing for arbitrary nesting. Only digits are allowed in nested
+ brackets (that is, when recursing), whereas any characters are permit-
+ ted at the outer level.
+
+ < (?: (?(R) \d++ | [^<>]*+) | (?R)) * >
+
+ In this pattern, (?(R) is the start of a conditional subpattern, with
+ two different alternatives for the recursive and non-recursive cases.
+ The (?R) item is the actual recursive call.
+
+ Differences in recursion processing between PCRE2 and Perl
+
+ Some former differences between PCRE2 and Perl no longer exist.
+
+ Before release 10.30, recursion processing in PCRE2 differed from Perl
+ in that a recursive subpattern call was always treated as an atomic
+ group. That is, once it had matched some of the subject string, it was
+ never re-entered, even if it contained untried alternatives and there
+ was a subsequent matching failure. (Historical note: PCRE implemented
+ recursion before Perl did.)
+
+ Starting with release 10.30, recursive subroutine calls are no longer
+ treated as atomic. That is, they can be re-entered to try unused alter-
+ natives if there is a matching failure later in the pattern. This is
+ now compatible with the way Perl works. If you want a subroutine call
+ to be atomic, you must explicitly enclose it in an atomic group.
+
+ Supporting backtracking into recursions simplifies certain types of
+ recursive pattern. For example, this pattern matches palindromic
+ strings:
+
+ ^((.)(?1)\2|.?)$
+
+ The second branch in the group matches a single central character in
+ the palindrome when there are an odd number of characters, or nothing
+ when there are an even number of characters, but in order to work it
+ has to be able to try the second case when the rest of the pattern
+ match fails. If you want to match typical palindromic phrases, the pat-
+ tern has to ignore all non-word characters, which can be done like
+ this:
+
+ ^\W*+((.)\W*+(?1)\W*+\2|\W*+.?)\W*+$
+
+ If run with the PCRE2_CASELESS option, this pattern matches phrases
+ such as "A man, a plan, a canal: Panama!". Note the use of the posses-
+ sive quantifier *+ to avoid backtracking into sequences of non-word
+ characters. Without this, PCRE2 takes a great deal longer (ten times or
+ more) to match typical phrases, and Perl takes so long that you think
+ it has gone into a loop.
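+
+ The need for backtracking into the recursion can be illustrated with a
+ small model of the simpler pattern ^((.)(?1)\2|.?)$ written in plain
+ Python (the standard re module has no recursion). This is a sketch
+ only: the names group1, is_palindrome and atomic_calls are
+ hypothetical, newlines in the subject are not considered, and the
+ atomic_calls flag merely imitates the behaviour described above for
+ releases before 10.30.
+
```python
from itertools import islice
from typing import Iterator


def group1(s: str, i: int, atomic_calls: bool) -> Iterator[int]:
    """Yield, in backtracking order, each position where group 1 can end."""
    if i < len(s):
        # First branch: (.)(?1)\2
        inner = group1(s, i + 1, atomic_calls)
        if atomic_calls:
            # An atomic call keeps only the first way the call can match.
            inner = islice(inner, 1)
        for j in inner:
            # \2 must repeat the character captured by (.) at this level.
            if j < len(s) and s[j] == s[i]:
                yield j + 1
        # Second branch: .? taking one character.
        yield i + 1
    # Second branch: .? matching nothing.
    yield i


def is_palindrome(s: str, atomic_calls: bool = False) -> bool:
    r"""Model ^((.)(?1)\2|.?)$ for a subject that contains no newlines."""
    return any(end == len(s) for end in group1(s, 0, atomic_calls))
```
+
+ With calls that can be re-entered, is_palindrome("abba") is true. With
+ atomic_calls=True the same subject fails, because each inner call is
+ stuck with its first result and can never fall back to the empty
+ alternative.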
+
+ Another way in which PCRE2 and Perl used to differ in their recursion
+ processing is in the handling of captured values. Formerly in Perl,
+ when a subpattern was called recursively or as a subroutine (see the
+ next section), it had no access to any values that were captured out-
+ side the recursion, whereas in PCRE2 these values can be referenced.
+ Consider this pattern:
+
+ ^(.)(\1|a(?2))
+
+ This pattern matches "bab". The first capturing parentheses match "b",
+ then in the second group, when the backreference \1 fails to match "b",
+ the second alternative matches "a" and then recurses. In the recursion,
+ \1 does now match "b" and so the whole match succeeds. This match used
+ to fail in Perl, but in later versions (I tried 5.024) it now works.
+
+
+SUBPATTERNS AS SUBROUTINES
+
+ If the syntax for a recursive subpattern call (either by number or by
+ name) is used outside the parentheses to which it refers, it operates a
+ bit like a subroutine in a programming language. More accurately, PCRE2
+ treats the referenced subpattern as an independent subpattern which it
+ tries to match at the current matching position. The called subpattern
+ may be defined before or after the reference. A numbered reference can
+ be absolute or relative, as in these examples:
+
+ (...(absolute)...)...(?2)...
+ (...(relative)...)...(?-1)...
+ (...(?+1)...(relative)...
+
+ An earlier example pointed out that the pattern
+
+ (sens|respons)e and \1ibility
+
+ matches "sense and sensibility" and "response and responsibility", but
+ not "sense and responsibility". If instead the pattern
+
+ (sens|respons)e and (?1)ibility
+
+ is used, it does match "sense and responsibility" as well as the other
+ two strings. Another example is given in the discussion of DEFINE
+ above.
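+
+ The difference between the backreference and the subroutine call can
+ be sketched with Python's standard re module. That module does not
+ support the (?1) syntax, so the call is emulated here by repeating the
+ source text of the group in a non-capturing group; this emulation is an
+ assumption of the sketch, not the way PCRE2 implements the call.
+
```python
import re

GROUP = r"sens|respons"

# \1 must repeat the text that group 1 actually matched.
backref = re.compile(rf"({GROUP})e and \1ibility")

# Emulation of (?1): the group's pattern is run again, so either
# alternative may match the second time.
subroutine = re.compile(rf"({GROUP})e and (?:{GROUP})ibility")

for subject in ("sense and sensibility",
                "response and responsibility",
                "sense and responsibility"):
    print(subject,
          bool(backref.fullmatch(subject)),
          bool(subroutine.fullmatch(subject)))
```
+
+ Only the emulated subroutine form accepts the third subject.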
+
+ Like recursions, subroutine calls used to be treated as atomic, but
+ this changed at PCRE2 release 10.30, so backtracking into subroutine
+ calls can now occur. However, any capturing parentheses that are set
+ during the subroutine call revert to their previous values afterwards.
+
+ Processing options such as case-independence are fixed when a subpat-
+ tern is defined, so if it is used as a subroutine, such options cannot
+ be changed for different calls. For example, consider this pattern:
+
+ (abc)(?i:(?-1))
+
+ It matches "abcabc". It does not match "abcABC" because the change of
+ processing option does not affect the called subpattern.
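+
+ A rough analogue in Python's standard re module, which has no
+ subroutine calls, is to spell out the called group with its original
+ option setting. In this sketch, (?-i:abc) stands in for the call (?-1);
+ that substitution is an assumption made for illustration.
+
```python
import re

# (?-i:abc) stands in for the call (?-1): group 1 was defined without
# case-independence, so the called copy stays case-sensitive even though
# the call site is inside (?i:...).
emulated = re.compile(r"(abc)(?i:(?-i:abc))")

# Contrast: repeating the text "abc" inside (?i:...) picks up the option.
inlined = re.compile(r"(abc)(?i:abc)")

print(bool(emulated.fullmatch("abcabc")))
print(bool(emulated.fullmatch("abcABC")))
print(bool(inlined.fullmatch("abcABC")))
```
+
+ The contrast shows that a subroutine call is not the same thing as
+ pasting the text of the group at the call site.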
+
+ The behaviour of backtracking control verbs in subpatterns when called
+ as subroutines is described in the section entitled "Backtracking verbs
+ in subroutines" below.
+
+
+ONIGURUMA SUBROUTINE SYNTAX
+
+ For compatibility with Oniguruma, the non-Perl syntax \g followed by a
+ name or a number enclosed either in angle brackets or single quotes, is
+ an alternative syntax for referencing a subpattern as a subroutine,
+ possibly recursively. Here are two of the examples used above, rewrit-
+ ten using this syntax:
+
+ (?<pn> \( ( (?>[^()]+) | \g<pn> )* \) )
+ (sens|respons)e and \g'1'ibility
+
+ PCRE2 supports an extension to Oniguruma: if a number is preceded by a
+ plus or a minus sign it is taken as a relative reference. For example:
+
+ (abc)(?i:\g<-1>)
+
+ Note that \g{...} (Perl syntax) and \g<...> (Oniguruma syntax) are not
+ synonymous. The former is a backreference; the latter is a subroutine
+ call.
+
+
+CALLOUTS
+
+ Perl has a feature whereby using the sequence (?{...}) causes arbitrary
+ Perl code to be obeyed in the middle of matching a regular expression.
+ This makes it possible, amongst other things, to extract different sub-
+ strings that match the same pair of parentheses when there is a repeti-
+ tion.
+
+ PCRE2 provides a similar feature, but of course it cannot obey arbi-
+ trary Perl code. The feature is called "callout". The caller of PCRE2
+ provides an external function by putting its entry point in a match
+ context using the function pcre2_set_callout(), and then passing that
+ context to pcre2_match() or pcre2_dfa_match(). If no match context is
+ passed, or if the callout entry point is set to NULL, callouts are dis-
+ abled.
+
+ Within a regular expression, (?C<arg>) indicates a point at which the
+ external function is to be called. There are two kinds of callout:
+ those with a numerical argument and those with a string argument. (?C)
+ on its own with no argument is treated as (?C0). A numerical argument
+ allows the application to distinguish between different callouts.
+ String arguments were added for release 10.20 to make it possible for
+ script languages that use PCRE2 to embed short scripts within patterns
+ in a similar way to Perl.
+
+ During matching, when PCRE2 reaches a callout point, the external func-
+ tion is called. It is provided with the number or string argument of
+ the callout, the position in the pattern, and one item of data that is
+ also set in the match block. The callout function may cause matching to
+ proceed, to backtrack, or to fail.
+
+ By default, PCRE2 implements a number of optimizations at matching
+ time, and one side-effect is that sometimes callouts are skipped. If
+ you need all possible callouts to happen, you need to set options that
+ disable the relevant optimizations. More details, including a complete
+ description of the programming interface to the callout function, are
+ given in the pcre2callout documentation.
+
+ Callouts with numerical arguments
+
+ If you just want to have a means of identifying different callout
+ points, put a number less than 256 after the letter C. For example,
+ this pattern has two callout points:
+
+ (?C1)abc(?C2)def
+
+ If the PCRE2_AUTO_CALLOUT flag is passed to pcre2_compile(), numerical
+ callouts are automatically installed before each item in the pattern.
+ They are all numbered 255. If there is a conditional group in the pat-
+ tern whose condition is an assertion, an additional callout is inserted
+ just before the condition. An explicit callout may also be set at this
+ position, as in this example:
+
+ (?(?C9)(?=a)abc|def)
+
+ Note that this applies only to assertion conditions, not to other types
+ of condition.
+
+ Callouts with string arguments
+
+ A delimited string may be used instead of a number as a callout argu-
+ ment. The starting delimiter must be one of ` ' " ^ % # $ { and the
+ ending delimiter is the same as the start, except for {, where the end-
+ ing delimiter is }. If the ending delimiter is needed within the
+ string, it must be doubled. For example:
+
+ (?C'ab ''c'' d')xyz(?C{any text})pqr
+
+ The doubling is removed before the string is passed to the callout
+ function.
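+
+ The delimiter rules can be sketched as a small Python helper that
+ takes one delimited string and returns the text a callout function
+ would receive. The name parse_callout_string and its error messages are
+ hypothetical; PCRE2 performs this processing internally when the
+ pattern is compiled.
+
```python
# Each permitted starting delimiter and the ending delimiter it implies.
CLOSERS = {"`": "`", "'": "'", '"': '"', "^": "^",
           "%": "%", "#": "#", "$": "$", "{": "}"}


def parse_callout_string(text: str) -> str:
    """Return the argument of a delimited string with doubling removed."""
    if not text or text[0] not in CLOSERS:
        raise ValueError("unrecognized starting delimiter")
    closer = CLOSERS[text[0]]
    out = []
    i = 1
    while i < len(text):
        if text[i] == closer:
            if i + 1 < len(text) and text[i + 1] == closer:
                out.append(closer)  # doubled ending delimiter: one literal
                i += 2
                continue
            if i + 1 != len(text):
                raise ValueError("text after the ending delimiter")
            return "".join(out)
        out.append(text[i])
        i += 1
    raise ValueError("missing ending delimiter")
```
+
+ Applied to the two strings in the example above, the helper returns
+ ab 'c' d for the first and any text for the second.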
+
+
+BACKTRACKING CONTROL
+
+ There are a number of special "Backtracking Control Verbs" (to use
+ Perl's terminology) that modify the behaviour of backtracking during
+ matching. They are generally of the form (*VERB) or (*VERB:NAME). Some
+ verbs take either form, possibly behaving differently depending on
+ whether or not a name is present.
+
+ By default, for compatibility with Perl, a name is any sequence of
+ characters that does not include a closing parenthesis. The name is not
+ processed in any way, and it is not possible to include a closing
+ parenthesis in the name. This can be changed by setting the
+ PCRE2_ALT_VERBNAMES option, but the result is no longer Perl-compati-
+ ble.
+
+ When PCRE2_ALT_VERBNAMES is set, backslash processing is applied to
+ verb names and only an unescaped closing parenthesis terminates the
+ name. However, the only backslash items that are permitted are \Q, \E,
+ and sequences such as \x{100} that define character code points. Char-
+ acter type escapes such as \d are faulted.
+
+ A closing parenthesis can be included in a name either as \) or between
+ \Q and \E. In addition to backslash processing, if the PCRE2_EXTENDED
+ or PCRE2_EXTENDED_MORE option is also set, unescaped whitespace in verb
+ names is skipped, and #-comments are recognized, exactly as in the rest
+ of the pattern. PCRE2_EXTENDED and PCRE2_EXTENDED_MORE do not affect
+ verb names unless PCRE2_ALT_VERBNAMES is also set.
+
+ The maximum length of a name is 255 in the 8-bit library and 65535 in
+ the 16-bit and 32-bit libraries. If the name is empty, that is, if the
+ closing parenthesis immediately follows the colon, the effect is as if
+ the colon were not there. Any number of these verbs may occur in a pat-
+ tern.
+
+ Since these verbs are specifically related to backtracking, most of
+ them can be used only when the pattern is to be matched using the tra-
+ ditional matching function, because that uses a backtracking algorithm.
+ With the exception of (*FAIL), which behaves like a failing negative
+ assertion, the backtracking control verbs cause an error if encountered
+ by the DFA matching function.
+
+ The behaviour of these verbs in repeated groups, assertions, and in
+ subpatterns called as subroutines (whether or not recursively) is docu-
+ mented below.
+
+ Optimizations that affect backtracking verbs
+
+ PCRE2 contains some optimizations that are used to speed up matching by
+ running some checks at the start of each match attempt. For example, it
+ may know the minimum length of matching subject, or that a particular
+ character must be present. When one of these optimizations bypasses the
+ running of a match, any included backtracking verbs will not, of
+ course, be processed. You can suppress the start-of-match optimizations
+ by setting the PCRE2_NO_START_OPTIMIZE option when calling pcre2_com-
+ pile(), or by starting the pattern with (*NO_START_OPT). There is more
+ discussion of this option in the section entitled "Compiling a pattern"
+ in the pcre2api documentation.
+
+ Experiments with Perl suggest that it too has similar optimizations,
+ and like PCRE2, turning them off can change the result of a match.
+
+ Verbs that act immediately
+
+ The following verbs act as soon as they are encountered.
+
+ (*ACCEPT) or (*ACCEPT:NAME)
+
+ This verb causes the match to end successfully, skipping the remainder
+ of the pattern. However, when it is inside a subpattern that is called
+ as a subroutine, only that subpattern is ended successfully. Matching
+ then continues at the outer level. If (*ACCEPT) is triggered in a posi-
+ tive assertion, the assertion succeeds; in a negative assertion, the
+ assertion fails.
+
+ If (*ACCEPT) is inside capturing parentheses, the data so far is cap-
+ tured. For example:
+
+ A((?:A|B(*ACCEPT)|C)D)
+
+ This matches "AB", "AAD", or "ACD"; when it matches "AB", "B" is cap-
+ tured by the outer parentheses.
+
+ (*FAIL) or (*FAIL:NAME)
+
+ This verb causes a matching failure, forcing backtracking to occur. It
+ may be abbreviated to (*F). It is equivalent to (?!) but easier to
+ read. The Perl documentation notes that it is probably useful only when
+ combined with (?{}) or (??{}). Those are, of course, Perl features that
+ are not present in PCRE2. The nearest equivalent is the callout fea-
+ ture, as for example in this pattern:
+
+ a+(?C)(*FAIL)
+
+ A match with the string "aaaa" always fails, but the callout is taken
+ before each backtrack happens (in this example, 10 times).
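+
+ The figure of 10 can be checked by counting the ways in which a+ can
+ match before the callout is reached: four starting positions, with 4,
+ 3, 2 and 1 possible lengths respectively. The Python sketch below
+ enumerates them; it assumes that no optimization removes a starting
+ position or a backtracking point, and the function name callout_points
+ is hypothetical.
+
```python
from typing import List, Tuple


def callout_points(subject: str) -> List[Tuple[int, int]]:
    """List (start, end) for every match of a+ that reaches the callout."""
    points = []
    for start in range(len(subject)):
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1
        # Greedy a+ takes the longest run first; each forced failure then
        # makes it give back one character until only one "a" is left.
        for stop in range(end, start, -1):
            points.append((start, stop))
    return points


print(len(callout_points("aaaa")))
```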
+
+ (*ACCEPT:NAME) and (*FAIL:NAME) behave exactly the same as
+ (*MARK:NAME)(*ACCEPT) and (*MARK:NAME)(*FAIL), respectively.
+
+ Recording which path was taken
+
+ There is one verb whose main purpose is to track how a match was
+ arrived at, though it also has a secondary use in conjunction with
+ advancing the match starting point (see (*SKIP) below).
+
+ (*MARK:NAME) or (*:NAME)
+
+ A name is always required with this verb. There may be as many
+ instances of (*MARK) as you like in a pattern, and their names do not
+ have to be unique.
+
+ When a match succeeds, the name of the last-encountered (*MARK:NAME) on
+ the matching path is passed back to the caller as described in the sec-
+ tion entitled "Other information about the match" in the pcre2api docu-
+ mentation. This applies to all instances of (*MARK), including those
+ inside assertions and atomic groups. (There are differences in those
+ cases when (*MARK) is used in conjunction with (*SKIP) as described
+ below.)
+
+ As well as (*MARK), the (*COMMIT), (*PRUNE) and (*THEN) verbs may have
+ associated NAME arguments. Whichever is last on the matching path is
+ passed back. See below for more details of these other verbs.
+
+ Here is an example of pcre2test output, where the "mark" modifier
+ requests the retrieval and outputting of (*MARK) data:
+
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
+ data> XY
+ 0: XY
+ MK: A
+ XZ
+ 0: XZ
+ MK: B
+
+ The (*MARK) name is tagged with "MK:" in this output, and in this exam-
+ ple it indicates which of the two alternatives matched. This is a more
+ efficient way of obtaining this information than putting each alterna-
+ tive in its own capturing parentheses.
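+
+ For comparison, the capturing-parentheses approach can be sketched
+ with Python's standard re module, which has no (*MARK) verb. Here each
+ alternative gets its own named group and the lastgroup attribute of the
+ match object reports which one took part; the helper name
+ which_alternative is hypothetical.
+
```python
import re
from typing import Optional

# One named capturing group per alternative, instead of (*MARK:A) and
# (*MARK:B).
pattern = re.compile(r"(?P<A>XY)|(?P<B>XZ)")


def which_alternative(subject: str) -> Optional[str]:
    """Return the name of the alternative that matched, or None."""
    m = pattern.search(subject)
    return m.lastgroup if m else None


print(which_alternative("XY"), which_alternative("XZ"))
```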
+
+ If a verb with a name is encountered in a positive assertion that is
+ true, the name is recorded and passed back if it is the last-encoun-
+ tered. This does not happen for negative assertions or failing positive
+ assertions.
+
+ After a partial match or a failed match, the last encountered name in
+ the entire match process is returned. For example:
+
+ re> /X(*MARK:A)Y|X(*MARK:B)Z/mark
+ data> XP
+ No match, mark = B
+
+ Note that in this unanchored example the mark is retained from the
+ match attempt that started at the letter "X" in the subject. Subsequent
+ match attempts starting at "P" and then with an empty string do not get
+ as far as the (*MARK) item, but nevertheless do not reset it.
+
+ If you are interested in (*MARK) values after failed matches, you
+ should probably set the PCRE2_NO_START_OPTIMIZE option (see above) to
+ ensure that the match is always attempted.
+
+ Verbs that act after backtracking
+
+ The following verbs do nothing when they are encountered. Matching con-
+ tinues with what follows, but if there is a subsequent match failure,
+ causing a backtrack to the verb, a failure is forced. That is, back-
+ tracking cannot pass to the left of the verb. However, when one of
+ these verbs appears inside an atomic group or in a lookaround assertion
+ that is true, its effect is confined to that group, because once the
+ group has been matched, there is never any backtracking into it. Back-
+ tracking from beyond an assertion or an atomic group ignores the entire
+ group, and seeks a preceding backtracking point.
+
+ These verbs differ in exactly what kind of failure occurs when back-
+ tracking reaches them. The behaviour described below is what happens
+ when the verb is not in a subroutine or an assertion. Subsequent sec-
+ tions cover these special cases.
+
+ (*COMMIT) or (*COMMIT:NAME)
+
+ This verb causes the whole match to fail outright if there is a later
+ matching failure that causes backtracking to reach it. Even if the pat-
+ tern is unanchored, no further attempts to find a match by advancing
+ the starting point take place. If (*COMMIT) is the only backtracking
+ verb that is encountered, once it has been passed pcre2_match() is com-
+ mitted to finding a match at the current starting point, or not at all.
+ For example:
+
+ a+(*COMMIT)b
+
+ This matches "xxaab" but not "aacaab". It can be thought of as a kind
+ of dynamic anchor, or "I've started, so I must finish."
+
+ The behaviour of (*COMMIT:NAME) is not the same as (*MARK:NAME)(*COM-
+ MIT). It is like (*MARK:NAME) in that the name is remembered for pass-
+ ing back to the caller. However, (*SKIP:NAME) searches only for names
+ set with (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and
+ (*THEN).
+
+ If there is more than one backtracking verb in a pattern, a different
+ one that follows (*COMMIT) may be triggered first, so merely passing
+ (*COMMIT) during a match does not always guarantee that a match must be
+ at this starting point.
+
+ Note that (*COMMIT) at the start of a pattern is not the same as an
+ anchor, unless PCRE2's start-of-match optimizations are turned off, as
+ shown in this output from pcre2test:
+
+ re> /(*COMMIT)abc/
+ data> xyzabc
+ 0: abc
+ data>
+ re> /(*COMMIT)abc/no_start_optimize
+ data> xyzabc
+ No match
+
+ For the first pattern, PCRE2 knows that any match must start with "a",
+ so the optimization skips along the subject to "a" before applying the
+ pattern to the first set of data. The match attempt then succeeds. The
+ second pattern disables the optimization that skips along to the first
+ character. The pattern is now applied starting at "x", and so the
+ (*COMMIT) causes the match to fail without trying any other starting
+ points.
+
+ (*PRUNE) or (*PRUNE:NAME)
+
+ This verb causes the match to fail at the current starting position in
+ the subject if there is a later matching failure that causes backtrack-
+ ing to reach it. If the pattern is unanchored, the normal "bumpalong"
+ advance to the next starting character then happens. Backtracking can
+ occur as usual to the left of (*PRUNE), before it is reached, or when
+ matching to the right of (*PRUNE), but if there is no match to the
+ right, backtracking cannot cross (*PRUNE). In simple cases, the use of
+ (*PRUNE) is just an alternative to an atomic group or possessive quan-
+ tifier, but there are some uses of (*PRUNE) that cannot be expressed in
+ any other way. In an anchored pattern (*PRUNE) has the same effect as
+ (*COMMIT).
+
+ The behaviour of (*PRUNE:NAME) is not the same as (*MARK:NAME)(*PRUNE).
+ It is like (*MARK:NAME) in that the name is remembered for passing back
+ to the caller. However, (*SKIP:NAME) searches only for names set with
+ (*MARK), ignoring those set by (*COMMIT), (*PRUNE) or (*THEN).
+
+ (*SKIP)
+
+ This verb, when given without a name, is like (*PRUNE), except that if
+ the pattern is unanchored, the "bumpalong" advance is not to the next
+ character, but to the position in the subject where (*SKIP) was encoun-
+ tered. (*SKIP) signifies that whatever text was matched leading up to
+ it cannot be part of a successful match if there is a later mismatch.
+ Consider:
+
+ a+(*SKIP)b
+
+ If the subject is "aaaac...", after the first match attempt fails
+ (starting at the first character in the string), the starting point
+ skips on to start the next attempt at "c". Note that a possessive quan-
+ tifier does not have the same effect as this example; although it would
+ suppress backtracking during the first match attempt, the second
+ attempt would start at the second character instead of skipping on to
+ "c".
+
+ (*SKIP:NAME)
+
+ When (*SKIP) has an associated name, its behaviour is modified. When
+ such a (*SKIP) is triggered, the previous path through the pattern is
+ searched for the most recent (*MARK) that has the same name. If one is
+ found, the "bumpalong" advance is to the subject position that corre-
+ sponds to that (*MARK) instead of to where (*SKIP) was encountered. If
+ no (*MARK) with a matching name is found, the (*SKIP) is ignored.
+
+ The search for a (*MARK) name uses the normal backtracking mechanism,
+ which means that it does not see (*MARK) settings that are inside
+ atomic groups or assertions, because they are never re-entered by back-
+ tracking. Compare the following pcre2test examples:
+
+ re> /a(?>(*MARK:X))(*SKIP:X)(*F)|(.)/
+ data> abc
+ 0: a
+ 1: a
+ data>
+ re> /a(?:(*MARK:X))(*SKIP:X)(*F)|(.)/
+ data> abc
+ 0: b
+ 1: b
+
+ In the first example, the (*MARK) setting is in an atomic group, so it
+ is not seen when (*SKIP:X) triggers, causing the (*SKIP) to be ignored.
+ This allows the second branch of the pattern to be tried at the first
+ character position. In the second example, the (*MARK) setting is not
+ in an atomic group. This allows (*SKIP:X) to find the (*MARK) when it
+ backtracks, and this causes a new matching attempt to start at the sec-
+ ond character. This time, the (*MARK) is never seen because "a" does
+ not match "b", so the matcher immediately jumps to the second branch of
+ the pattern.
+
+ Note that (*SKIP:NAME) searches only for names set by (*MARK:NAME). It
+ ignores names that are set by (*COMMIT:NAME), (*PRUNE:NAME) or
+ (*THEN:NAME).
+
+ (*THEN) or (*THEN:NAME)
+
+ This verb causes a skip to the next innermost alternative when back-
+ tracking reaches it. That is, it cancels any further backtracking
+ within the current alternative. Its name comes from the observation
+ that it can be used for a pattern-based if-then-else block:
+
+ ( COND1 (*THEN) FOO | COND2 (*THEN) BAR | COND3 (*THEN) BAZ ) ...
+
+ If the COND1 pattern matches, FOO is tried (and possibly further items
+ after the end of the group if FOO succeeds); on failure, the matcher
+ skips to the second alternative and tries COND2, without backtracking
+ into COND1. If that succeeds and BAR fails, COND3 is tried. If subse-
+ quently BAZ fails, there are no more alternatives, so there is a back-
+ track to whatever came before the entire group. If (*THEN) is not
+ inside an alternation, it acts like (*PRUNE).
+
+ The behaviour of (*THEN:NAME) is not the same as (*MARK:NAME)(*THEN).
+ It is like (*MARK:NAME) in that the name is remembered for passing back
+ to the caller. However, (*SKIP:NAME) searches only for names set with
+ (*MARK), ignoring those set by (*COMMIT), (*PRUNE) and (*THEN).
+
+ A subpattern that does not contain a | character is just a part of the
+ enclosing alternative; it is not a nested alternation with only one
+ alternative. The effect of (*THEN) extends beyond such a subpattern to
+ the enclosing alternative. Consider this pattern, where A, B, etc. are
+ complex pattern fragments that do not contain any | characters at this
+ level:
+
+ A (B(*THEN)C) | D
+
+ If A and B are matched, but there is a failure in C, matching does not
+ backtrack into A; instead it moves to the next alternative, that is, D.
+ However, if the subpattern containing (*THEN) is given an alternative,
+ it behaves differently:
+
+ A (B(*THEN)C | (*FAIL)) | D
+
+ The effect of (*THEN) is now confined to the inner subpattern. After a
+ failure in C, matching moves to (*FAIL), which causes the whole subpat-
+ tern to fail because there are no more alternatives to try. In this
+ case, matching does now backtrack into A.
+
+ Note that a conditional subpattern is not considered as having two
+ alternatives, because only one is ever used. In other words, the |
+ character in a conditional subpattern has a different meaning. Ignoring
+ white space, consider:
+
+ ^.*? (?(?=a) a | b(*THEN)c )
+
+ If the subject is "ba", this pattern does not match. Because .*? is
+ ungreedy, it initially matches zero characters. The condition (?=a)
+ then fails, the character "b" is matched, but "c" is not. At this
+ point, matching does not backtrack to .*? as might perhaps be expected
+ from the presence of the | character. The conditional subpattern is
+ part of the single alternative that comprises the whole pattern, and so
+ the match fails. (If there was a backtrack into .*?, allowing it to
+ match "b", the match would succeed.)
+
+ The verbs just described provide four different "strengths" of control
+ when subsequent matching fails. (*THEN) is the weakest, carrying on the
+ match at the next alternative. (*PRUNE) comes next, failing the match
+ at the current starting position, but allowing an advance to the next
+ character (for an unanchored pattern). (*SKIP) is similar, except that
+ the advance may be more than one character. (*COMMIT) is the strongest,
+ causing the entire match to fail.
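+
+ The difference between three of these strengths can be sketched with
+ a hand-written Python model of the patterns a+(*PRUNE)b, a+(*SKIP)b and
+ a+(*COMMIT)b used in the examples above. The model records which
+ starting positions are tried. It is an illustration only: the function
+ name simulate is hypothetical, start-of-match optimizations are
+ ignored, and (*THEN) is left out because these patterns contain no
+ alternation.
+
```python
from typing import List, Optional, Tuple


def simulate(subject: str, verb: str) -> Tuple[List[int], Optional[Tuple[int, int]]]:
    """Model a+(*VERB)b; return (starting positions tried, match span)."""
    if verb not in ("PRUNE", "SKIP", "COMMIT"):
        raise ValueError("verb must be PRUNE, SKIP or COMMIT")
    tried = []
    start = 0
    while start <= len(subject):
        tried.append(start)
        end = start
        while end < len(subject) and subject[end] == "a":
            end += 1  # greedy a+
        if end > start:  # a+ matched, so the verb has been passed
            if end < len(subject) and subject[end] == "b":
                return tried, (start, end + 1)
            if verb == "COMMIT":
                return tried, None  # no further starting points
            if verb == "SKIP":
                start = end  # bump along to where (*SKIP) was met
                continue
            # PRUNE: fail this attempt, then the normal one-step bumpalong
        start += 1
    return tried, None


for verb in ("PRUNE", "SKIP", "COMMIT"):
    print(verb, simulate("aaaac", verb))
```
+
+ For the subject "aaaac" the model tries every position with PRUNE,
+ jumps from the first position straight to "c" with SKIP, and gives up
+ after the first position with COMMIT.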
+
+ More than one backtracking verb
+
+ If more than one backtracking verb is present in a pattern, the one
+ that is backtracked onto first acts. For example, consider this pat-
+ tern, where A, B, etc. are complex pattern fragments:
+
+ (A(*COMMIT)B(*THEN)C|ABD)
+
+ If A matches but B fails, the backtrack to (*COMMIT) causes the entire
+ match to fail. However, if A and B match, but C fails, the backtrack to
+ (*THEN) causes the next alternative (ABD) to be tried. This behaviour
+ is consistent, but is not always the same as Perl's. It means that if
+ two or more backtracking verbs appear in succession, all but the last
+ of them have no effect. Consider this example:
+
+ ...(*COMMIT)(*PRUNE)...
+
+ If there is a matching failure to the right, backtracking onto (*PRUNE)
+ causes it to be triggered, and its action is taken. There can never be
+ a backtrack onto (*COMMIT).
+
+ Backtracking verbs in repeated groups
+
+ PCRE2 sometimes differs from Perl in its handling of backtracking verbs
+ in repeated groups. For example, consider:
+
+ /(a(*COMMIT)b)+ac/
+
+ If the subject is "abac", Perl matches unless its optimizations are
+ disabled, but PCRE2 always fails because the (*COMMIT) in the second
+ repeat of the group acts.
+
+ Backtracking verbs in assertions
+
+ (*FAIL) in any assertion has its normal effect: it forces an immediate
+ backtrack. The behaviour of the other backtracking verbs depends on
+ whether the assertion is standalone or is acting as the condition in
+ a conditional subpattern.
+
+ (*ACCEPT) in a standalone positive assertion causes the assertion to
+ succeed without any further processing; captured strings and a (*MARK)
+ name (if set) are retained. In a standalone negative assertion,
+ (*ACCEPT) causes the assertion to fail without any further processing;
+ captured substrings and any (*MARK) name are discarded.
+
+ If the assertion is a condition, (*ACCEPT) causes the condition to be
+ true for a positive assertion and false for a negative one; captured
+ substrings are retained in both cases.
+
+ The remaining verbs act only when a later failure causes a backtrack to
+ reach them. This means that their effect is confined to the assertion,
+ because lookaround assertions are atomic. A backtrack that occurs after
+ an assertion is complete does not jump back into the assertion. Note in
+ particular that a (*MARK) name that is set in an assertion is not
+ "seen" by an instance of (*SKIP:NAME) later in the pattern.
+
+ The effect of (*THEN) is not allowed to escape beyond an assertion. If
+ there are no more branches to try, (*THEN) causes a positive assertion
+ to be false, and a negative assertion to be true.
+
+ The other backtracking verbs are not treated specially if they appear
+ in a standalone positive assertion. In a conditional positive asser-
+ tion, backtracking (from within the assertion) into (*COMMIT), (*SKIP),
+ or (*PRUNE) causes the condition to be false. However, for both stand-
+ alone and conditional negative assertions, backtracking into (*COMMIT),
+ (*SKIP), or (*PRUNE) causes the assertion to be true, without consider-
+ ing any further alternative branches.
+
+ Backtracking verbs in subroutines
+
+ These behaviours occur whether or not the subpattern is called recur-
+ sively.
+
+ (*ACCEPT) in a subpattern called as a subroutine causes the subroutine
+ match to succeed without any further processing. Matching then contin-
+ ues after the subroutine call. Perl documents this behaviour. Perl's
+ treatment of the other verbs in subroutines is different in some cases.
+
+ (*FAIL) in a subpattern called as a subroutine has its normal effect:
+ it forces an immediate backtrack.
+
+ (*COMMIT), (*SKIP), and (*PRUNE) cause the subroutine match to fail
+ when triggered by being backtracked to in a subpattern called as a sub-
+ routine. There is then a backtrack at the outer level.
+
+ (*THEN), when triggered, skips to the next alternative in the innermost
+ enclosing group within the subpattern that has alternatives (its normal
+ behaviour). However, if there is no such group within the subroutine
+ subpattern, the subroutine match fails and there is a backtrack at the
+ outer level.
+
+
+SEE ALSO
+
+ pcre2api(3), pcre2callout(3), pcre2matching(3), pcre2syntax(3),
+ pcre2(3).
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 04 September 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2PERFORM(3) Library Functions Manual PCRE2PERFORM(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 PERFORMANCE
+
+ Two aspects of performance are discussed below: memory usage and pro-
+ cessing time. The way you express your pattern as a regular expression
+ can affect both of them.
+
+
+COMPILED PATTERN MEMORY USAGE
+
+ Patterns are compiled by PCRE2 into a reasonably efficient interpretive
+ code, so that most simple patterns do not use much memory for storing
+ the compiled version. However, there is one case where the memory usage
+ of a compiled pattern can be unexpectedly large. If a parenthesized
+ subpattern has a quantifier with a minimum greater than 1 and/or a lim-
+ ited maximum, the whole subpattern is repeated in the compiled code.
+ For example, the pattern
+
+ (abc|def){2,4}
+
+ is compiled as if it were
+
+ (abc|def)(abc|def)((abc|def)(abc|def)?)?
+
+ (Technical aside: It is done this way so that backtrack points within
+ each of the repetitions can be independently maintained.)
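+
+ That the quantified and expanded forms accept the same strings can be
+ checked with a short Python sketch. It uses Python's standard re module
+ as a stand-in engine, so it says nothing about the size of PCRE2's
+ compiled code, and it compares only which subjects match (the capture
+ group numbers differ between the two forms). The function name
+ same_language is hypothetical.
+
```python
import re
from itertools import product

quantified = re.compile(r"(abc|def){2,4}")
expanded = re.compile(r"(abc|def)(abc|def)((abc|def)(abc|def)?)?")


def same_language(max_repeats: int = 5) -> bool:
    """Compare both forms on every sequence of 0..max_repeats items."""
    for n in range(max_repeats + 1):
        for parts in product(("abc", "def"), repeat=n):
            subject = "".join(parts)
            if bool(quantified.fullmatch(subject)) != bool(expanded.fullmatch(subject)):
                return False
    return True


print(same_language())
```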
+
+ For regular expressions whose quantifiers use only small numbers, this
+ is not usually a problem. However, if the numbers are large, and par-
+ ticularly if such repetitions are nested, the memory usage can become
+ an embarrassment. For example, the very simple pattern
+
+ ((ab){1,1000}c){1,3}
+
+ uses over 50KiB when compiled using the 8-bit library. When PCRE2 is
+ compiled with its default internal pointer size of two bytes, the size
+ limit on a compiled pattern is 65535 code units in the 8-bit and 16-bit
+ libraries, and this is reached with the above pattern if the outer rep-
+ etition is increased from 3 to 4. PCRE2 can be compiled to use larger
+ internal pointers and thus handle larger compiled patterns, but it is
+ better to try to rewrite your pattern to use less memory if you can.
+
+ One way of reducing the memory usage for such patterns is to make use
+ of PCRE2's "subroutine" facility. Re-writing the above pattern as
+
+ ((ab)(?2){0,999}c)(?1){0,2}
+
+ reduces the memory requirements to around 16KiB, and indeed it remains
+ under 20KiB even with the outer repetition increased to 100. However,
+ this kind of pattern is not always exactly equivalent, because any cap-
+ tures within subroutine calls are lost when the subroutine completes.
+ If this is not a problem, this kind of rewriting will allow you to
+ process patterns that PCRE2 cannot otherwise handle. The matching per-
+ formance of the two different versions of the pattern is roughly the
+ same. (This applies from release 10.30 - things were different in ear-
+ lier releases.)
+
+
+STACK AND HEAP USAGE AT RUN TIME
+
+ From release 10.30, the interpretive (non-JIT) version of pcre2_match()
+ uses very little system stack at run time. In earlier releases recur-
+ sive function calls could use a great deal of stack, and this could
+ cause problems, but this usage has been eliminated. Backtracking posi-
+ tions are now explicitly remembered in memory frames controlled by the
+ code. An initial 20KiB vector of frames is allocated on the system
+ stack (enough for about 100 frames for small patterns), but if this is
+ insufficient, heap memory is used. The amount of heap memory can be
+ limited; if the limit is set to zero, only the initial stack vector is
+ used. Rewriting patterns to be time-efficient, as described below, may
+ also reduce the memory requirements.
+
+ In contrast to pcre2_match(), pcre2_dfa_match() does use recursive
+ function calls, but only for processing atomic groups, lookaround
+ assertions, and recursion within the pattern. The original version of
+ the code used to allocate quite large internal workspace vectors on the
+ stack, which caused some problems for some patterns in environments
+ with small stacks. From release 10.32 the code for pcre2_dfa_match()
+ has been re-factored to use heap memory when necessary for internal
+ workspace when recursing, though recursive function calls are still
+ used.
+
+ The "match depth" parameter can be used to limit the depth of function
+ recursion, and the "match heap" parameter to limit heap memory in
+ pcre2_dfa_match().
+
+
+PROCESSING TIME
+
+ Certain items in regular expression patterns are processed more effi-
+ ciently than others. It is more efficient to use a character class like
+ [aeiou] than a set of single-character alternatives such as
+ (a|e|i|o|u). In general, the simplest construction that provides the
+ required behaviour is usually the most efficient. Jeffrey Friedl's book
+ "Mastering Regular Expressions" contains a lot of useful general
+ discussion about optimizing regular expressions for efficient
+ performance. This document contains a few observations about PCRE2.
+
+ Using Unicode character properties (the \p, \P, and \X escapes) is
+ slow, because PCRE2 has to use a multi-stage table lookup whenever it
+ needs a character's property. If you can find an alternative pattern
+ that does not use character properties, it will probably be faster.
+
+ By default, the escape sequences \b, \d, \s, and \w, and the POSIX
+ character classes such as [:alpha:] do not use Unicode properties,
+ partly for backwards compatibility, and partly for performance reasons.
+ However, you can set the PCRE2_UCP option or start the pattern with
+ (*UCP) if you want Unicode character properties to be used. This can
+ double the matching time for items such as \d, when matched with
+ pcre2_match(); the performance loss is less with a DFA matching func-
+ tion, and in both cases there is not much difference for \b.
+
+ When a pattern begins with .* not in atomic parentheses, nor in paren-
+ theses that are the subject of a backreference, and the PCRE2_DOTALL
+ option is set, the pattern is implicitly anchored by PCRE2, since it
+ can match only at the start of a subject string. If the pattern has
+ multiple top-level branches, they must all be anchorable. The optimiza-
+ tion can be disabled by the PCRE2_NO_DOTSTAR_ANCHOR option, and is
+ automatically disabled if the pattern contains (*PRUNE) or (*SKIP).
+
+ If PCRE2_DOTALL is not set, PCRE2 cannot make this optimization,
+ because the dot metacharacter does not then match a newline, and if the
+ subject string contains newlines, the pattern may match from the char-
+ acter immediately following one of them instead of from the very start.
+ For example, the pattern
+
+ .*second
+
+ matches the subject "first\nand second" (where \n stands for a newline
+ character), with the match starting at the seventh character. In order
+ to do this, PCRE2 has to retry the match starting after every newline
+ in the subject.
+
+ If you are using such a pattern with subject strings that do not con-
+ tain newlines, the best performance is obtained by setting
+ PCRE2_DOTALL, or starting the pattern with ^.* or ^.*? to indicate
+ explicit anchoring. That saves PCRE2 from having to scan along the sub-
+ ject looking for a newline to restart at.
+
+ Beware of patterns that contain nested indefinite repeats. These can
+ take a long time to run when applied to a string that does not match.
+ Consider the pattern fragment
+
+ ^(a+)*
+
+ This can match "aaaa" in 16 different ways, and this number increases
+ very rapidly as the string gets longer. (The * repeat can match 0, 1,
+ 2, 3, or 4 times, and for each of those cases other than 0 or 4, the +
+ repeats can match different numbers of times.) When the remainder of
+ the pattern is such that the entire match is going to fail, PCRE2 has
+ in principle to try every possible variation, and this can take an
+ extremely long time, even for relatively short strings.
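+
+ The figure of 16 can be checked by brute force. The following C sketch
+ is an illustration only and is not part of PCRE2: splits() and
+ count_ways() are hypothetical helpers that count the distinct ways in
+ which ^(a+)* can consume a prefix of a run of n "a" characters, given
+ that each iteration of the group takes one or more characters.
+

```c
/* Number of ways to split exactly k characters into an ordered sequence
   of non-empty runs, one run per iteration of the group. Zero characters
   can be split in exactly one way: no iterations at all. */
static unsigned long splits(unsigned int k)
{
    unsigned long total = 0;
    unsigned int first;

    if (k == 0)
        return 1;
    /* The first iteration takes 1..k characters; recurse on the rest. */
    for (first = 1; first <= k; first++)
        total += splits(k - first);
    return total;
}

/* Total number of ways ^(a+)* can match at the start of n "a" characters:
   the match may stop after any prefix of length 0..n. */
unsigned long count_ways(unsigned int n)
{
    unsigned long total = 0;
    unsigned int len;

    for (len = 0; len <= n; len++)
        total += splits(len);
    return total;
}
```

+
+ Under this model count_ways(4) is 16, and each additional character
+ doubles the total, which is why a failing match can become so expensive
+ as the subject grows.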
+
+ An optimization catches some of the simpler cases, such as
+
+ (a+)*b
+
+ where a literal character follows. Before embarking on the standard
+ matching procedure, PCRE2 checks that there is a "b" later in the sub-
+ ject string, and if there is not, it fails the match immediately. How-
+ ever, when there is no following literal this optimization cannot be
+ used. You can see the difference by comparing the behaviour of
+
+ (a+)*\d
+
+ with the pattern above. The pattern that ends with the literal "b" gives
+ a failure almost instantly when applied to a whole line of "a"
+ characters, whereas the one that ends with \d takes an appreciable time
+ with strings longer than about 20 characters.
+
+ In many cases, the solution to this kind of performance issue is to use
+ an atomic group or a possessive quantifier. This can often reduce mem-
+ ory requirements as well. As another example, consider this pattern:
+
+ ([^<]|<(?!inet))+
+
+ It matches from wherever it starts until it encounters "<inet" or the
+ end of the data, and is the kind of pattern that might be used when
+ processing an XML file. Each iteration of the outer parentheses matches
+ either one character that is not "<" or a "<" that is not followed by
+ "inet". However, each time a parenthesis is processed, a backtracking
+ position is passed, so this formulation uses a memory frame for each
+ matched character. For a long string, a lot of memory is required. Con-
+ sider now this rewritten pattern, which matches exactly the same
+ strings:
+
+ ([^<]++|<(?!inet))+
+
+ This runs much faster, because sequences of characters that do not con-
+ tain "<" are "swallowed" in one item inside the parentheses, and a pos-
+ sessive quantifier is used to stop any backtracking into the runs of
+ non-"<" characters. This version also uses a lot less memory because
+ entry to a new set of parentheses happens only when a "<" character
+ that is not followed by "inet" is encountered (and we assume this is
+ relatively rare).
+
+ This example shows that one way of optimizing performance when matching
+ long subject strings is to write repeated parenthesized subpatterns to
+ match more than one character whenever possible.
+
+ SETTING RESOURCE LIMITS
+
+ You can set limits on the amount of processing that takes place when
+ matching, and on the amount of heap memory that is used. The default
+ values of the limits are very large, and unlikely ever to be reached. They
+ can be changed when PCRE2 is built, and they can also be set when
+ pcre2_match() or pcre2_dfa_match() is called. For details of these
+ interfaces, see the pcre2build documentation and the section entitled
+ "The match context" in the pcre2api documentation.
+
+ The pcre2test test program has a modifier called "find_limits" which,
+ if applied to a subject line, causes it to find the smallest limits
+ that allow a pattern to match. This is done by repeatedly matching with
+ different limits.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 25 April 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2POSIX(3) Library Functions Manual PCRE2POSIX(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+SYNOPSIS
+
+ #include <pcre2posix.h>
+
+ int regcomp(regex_t *preg, const char *pattern,
+ int cflags);
+
+ int regexec(const regex_t *preg, const char *string,
+ size_t nmatch, regmatch_t pmatch[], int eflags);
+
+ size_t regerror(int errcode, const regex_t *preg,
+ char *errbuf, size_t errbuf_size);
+
+ void regfree(regex_t *preg);
+
+
+DESCRIPTION
+
+ This set of functions provides a POSIX-style API for the PCRE2 regular
+ expression 8-bit library. See the pcre2api documentation for a descrip-
+ tion of PCRE2's native API, which contains much additional functional-
+ ity. There are no POSIX-style wrappers for PCRE2's 16-bit and 32-bit
+ libraries.
+
+ The functions described here are just wrapper functions that ultimately
+ call the PCRE2 native API. Their prototypes are defined in the
+ pcre2posix.h header file, and on Unix systems the library itself is
+ called libpcre2-posix.a, so can be accessed by adding -lpcre2-posix to
+ the command for linking an application that uses them. Because the
+ POSIX functions call the native ones, it is also necessary to add
+ -lpcre2-8.
+
+ Those POSIX option bits that can reasonably be mapped to PCRE2 native
+ options have been implemented. In addition, the option REG_EXTENDED is
+ defined with the value zero. This has no effect, but since programs
+ that are written to the POSIX interface often use it, this makes it
+ easier to slot in PCRE2 as a replacement library. Other POSIX options
+ are not even defined.
+
+ There are also some options that are not defined by POSIX. These have
+ been added at the request of users who want to make use of certain
+ PCRE2-specific features via the POSIX calling interface or to add BSD
+ or GNU functionality.
+
+ When PCRE2 is called via these functions, it is only the API that is
+ POSIX-like in style. The syntax and semantics of the regular expres-
+ sions themselves are still those of Perl, subject to the setting of
+ various PCRE2 options, as described below. "POSIX-like in style" means
+ that the API approximates to the POSIX definition; it is not fully
+ POSIX-compatible, and in multi-unit encoding domains it is probably
+ even less compatible.
+
+ The header for these functions is supplied as pcre2posix.h to avoid any
+ potential clash with other POSIX libraries. It can, of course, be
+ renamed or aliased as regex.h, which is the "correct" name. It provides
+ two structure types, regex_t for compiled internal forms, and reg-
+ match_t for returning captured substrings. It also defines some con-
+ stants whose names start with "REG_"; these are used for setting
+ options and identifying error codes.
+
+
+COMPILING A PATTERN
+
+ The function regcomp() is called to compile a pattern into an internal
+ form. By default, the pattern is a C string terminated by a binary zero
+ (but see REG_PEND below). The preg argument is a pointer to a regex_t
+ structure that is used as a base for storing information about the com-
+ piled regular expression. (It is also used for input when REG_PEND is
+ set.)
+
+ The argument cflags is either zero, or contains one or more of the bits
+ defined by the following macros:
+
+ REG_DOTALL
+
+ The PCRE2_DOTALL option is set when the regular expression is passed
+ for compilation to the native function. Note that REG_DOTALL is not
+ part of the POSIX standard.
+
+ REG_ICASE
+
+ The PCRE2_CASELESS option is set when the regular expression is passed
+ for compilation to the native function.
+
+ REG_NEWLINE
+
+ The PCRE2_MULTILINE option is set when the regular expression is passed
+ for compilation to the native function. Note that this does not mimic
+ the defined POSIX behaviour for REG_NEWLINE (see the following sec-
+ tion).
+
+ REG_NOSPEC
+
+ The PCRE2_LITERAL option is set when the regular expression is passed
+ for compilation to the native function. This disables all meta charac-
+ ters in the pattern, causing it to be treated as a literal string. The
+ only other options that are allowed with REG_NOSPEC are REG_ICASE,
+ REG_NOSUB, REG_PEND, and REG_UTF. Note that REG_NOSPEC is not part of
+ the POSIX standard.
+
+ REG_NOSUB
+
+ When a pattern that is compiled with this flag is passed to regexec()
+ for matching, the nmatch and pmatch arguments are ignored, and no
+ captured strings are returned. Versions of the PCRE2 library prior to 10.22
+ used to set the PCRE2_NO_AUTO_CAPTURE compile option, but this no
+ longer happens because it disables the use of backreferences.
+
+ REG_PEND
+
+ If this option is set, the re_endp field in the preg structure (which
+ has the type const char *) must be set to point to the character beyond
+ the end of the pattern before calling regcomp(). The pattern itself may
+ now contain binary zeros, which are treated as data characters. Without
+ REG_PEND, a binary zero terminates the pattern and the re_endp field is
+ ignored. This is a GNU extension to the POSIX standard and should be
+ used with caution in software intended to be portable to other systems.
+
+ REG_UCP
+
+ The PCRE2_UCP option is set when the regular expression is passed for
+ compilation to the native function. This causes PCRE2 to use Unicode
+ properties when matching \d, \w, etc., instead of just recognizing
+ ASCII values. Note that REG_UCP is not part of the POSIX standard.
+
+ REG_UNGREEDY
+
+ The PCRE2_UNGREEDY option is set when the regular expression is passed
+ for compilation to the native function. Note that REG_UNGREEDY is not
+ part of the POSIX standard.
+
+ REG_UTF
+
+ The PCRE2_UTF option is set when the regular expression is passed for
+ compilation to the native function. This causes the pattern itself and
+ all data strings used for matching it to be treated as UTF-8 strings.
+ Note that REG_UTF is not part of the POSIX standard.
+
+ In the absence of these flags, no options are passed to the native
+ function. This means that the regex is compiled with PCRE2 default
+ semantics. In particular, the way it handles newline characters in the
+ subject string is the Perl way, not the POSIX way. Note that setting
+ PCRE2_MULTILINE has only some of the effects specified for REG_NEWLINE.
+ It does not affect the way newlines are matched by the dot metacharac-
+ ter (they are not) or by a negative class such as [^a] (they are).
+
+ The yield of regcomp() is zero on success, and non-zero otherwise. The
+ preg structure is filled in on success, and one other member of the
+ structure (as well as re_endp) is public: re_nsub contains the number
+ of capturing subpatterns in the regular expression. Various error codes
+ are defined in the header file.
+
+ NOTE: If the yield of regcomp() is non-zero, you must not attempt to
+ use the contents of the preg structure. If, for example, you pass it to
+ regexec(), the result is undefined and your program is likely to crash.
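+
+ The rule in the NOTE can be sketched as follows. This is a minimal
+ illustration, not code from the PCRE2 distribution: count_captures() is
+ a hypothetical helper, and the sketch includes regex.h (the name under
+ which pcre2posix.h may be aliased) so that it builds against any POSIX
+ regex header. That it behaves identically when built with pcre2posix.h
+ and linked with -lpcre2-posix -lpcre2-8 is an assumption.
+

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: compile pattern with cflags and report the number
   of capturing subpatterns through *nsub. Returns zero on success. On
   failure it returns the non-zero code from regcomp() and never looks at
   the contents of preg. */
int count_captures(const char *pattern, int cflags, size_t *nsub)
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, cflags);

    if (rc != 0)
        return rc;            /* preg is unusable: no regexec() call */

    *nsub = preg.re_nsub;     /* public member, filled in on success */
    regfree(&preg);           /* release the memory tied to preg */
    return 0;
}
```

+
+ A caller would test the return value before doing anything else, for
+ example by passing REG_EXTENDED (defined as zero by pcre2posix.h) and
+ treating any non-zero result as a compile failure.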
+
+
+MATCHING NEWLINE CHARACTERS
+
+ This area is not simple, because POSIX and Perl take different views of
+ things. It is not possible to get PCRE2 to obey POSIX semantics, but
+ then PCRE2 was never intended to be a POSIX engine. The following table
+ lists the different possibilities for matching newline characters in
+ Perl and PCRE2:
+
+                            Default   Change with
+
+ . matches newline            no      PCRE2_DOTALL
+ newline matches [^a]         yes     not changeable
+ $ matches \n at end          yes     PCRE2_DOLLAR_ENDONLY
+ $ matches \n in middle       no      PCRE2_MULTILINE
+ ^ matches \n in middle       no      PCRE2_MULTILINE
+
+ This is the equivalent table for a POSIX-compatible pattern matcher:
+
+                            Default   Change with
+
+ . matches newline            yes     REG_NEWLINE
+ newline matches [^a]         yes     REG_NEWLINE
+ $ matches \n at end          no      REG_NEWLINE
+ $ matches \n in middle       no      REG_NEWLINE
+ ^ matches \n in middle       no      REG_NEWLINE
+
+ This behaviour is not what happens when PCRE2 is called via its POSIX
+ API. By default, PCRE2's behaviour is the same as Perl's, except that
+ there is no equivalent for PCRE2_DOLLAR_ENDONLY in Perl. In both PCRE2
+ and Perl, there is no way to stop newline from matching [^a].
+
+ Default POSIX newline handling can be obtained by setting PCRE2_DOTALL
+ and PCRE2_DOLLAR_ENDONLY when calling pcre2_compile() directly, but
+ there is no way to make PCRE2 behave exactly as for the REG_NEWLINE
+ action. When using the POSIX API, passing REG_NEWLINE to PCRE2's reg-
+ comp() function causes PCRE2_MULTILINE to be passed to pcre2_compile(),
+ and REG_DOTALL passes PCRE2_DOTALL. There is no way to pass PCRE2_DOL-
+ LAR_ENDONLY.
+
+
+MATCHING A PATTERN
+
+ The function regexec() is called to match a compiled pattern preg
+ against a given string, which is by default terminated by a zero byte
+ (but see REG_STARTEND below), subject to the options in eflags. These
+ can be:
+
+ REG_NOTBOL
+
+ The PCRE2_NOTBOL option is set when calling the underlying PCRE2 match-
+ ing function.
+
+ REG_NOTEMPTY
+
+ The PCRE2_NOTEMPTY option is set when calling the underlying PCRE2
+ matching function. Note that REG_NOTEMPTY is not part of the POSIX
+ standard. However, setting this option can give more POSIX-like behav-
+ iour in some situations.
+
+ REG_NOTEOL
+
+ The PCRE2_NOTEOL option is set when calling the underlying PCRE2 match-
+ ing function.
+
+ REG_STARTEND
+
+ When this option is set, the subject string starts at string +
+ pmatch[0].rm_so and ends at string + pmatch[0].rm_eo, which should
+ point to the first character beyond the string. There may be binary
+ zeros within the subject string, and indeed, using REG_STARTEND is the
+ only way to pass a subject string that contains a binary zero.
+
+ Whatever the value of pmatch[0].rm_so, the offsets of the matched
+ string and any captured substrings are still given relative to the
+ start of string itself. (Before PCRE2 release 10.30 these were given
+ relative to string + pmatch[0].rm_so, but this differs from other
+ implementations.)
+
+ This is a BSD extension, compatible with but not specified by IEEE
+ Standard 1003.2 (POSIX.2), and should be used with caution in software
+ intended to be portable to other systems. Note that a non-zero rm_so
+ does not imply REG_NOTBOL; REG_STARTEND affects only the location and
+ length of the string, not how it is matched. Setting REG_STARTEND and
+ passing pmatch as NULL are mutually exclusive; if both are done, the
+ error REG_INVARG is returned.
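+
+ A possible use of REG_STARTEND is sketched below. find_in_range() is a
+ hypothetical helper, and the sketch assumes that the header in use
+ (regex.h here, standing in for pcre2posix.h) defines REG_STARTEND. It
+ searches only the bytes between the given offsets and reports the match
+ position relative to the start of the whole string.
+

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: search for pattern in the len bytes that start at
   string + start, using REG_STARTEND. On a match, returns the offset of
   the match measured from the start of string; returns -1 if there is no
   match or if an error occurs. */
long find_in_range(const char *pattern, const char *string,
                   size_t start, size_t len)
{
    regex_t preg;
    regmatch_t pmatch[1];
    long result = -1;

    if (regcomp(&preg, pattern, REG_EXTENDED) != 0)
        return -1;

    pmatch[0].rm_so = (regoff_t)start;          /* first byte of subject */
    pmatch[0].rm_eo = (regoff_t)(start + len);  /* one past the last byte */

    if (regexec(&preg, string, 1, pmatch, REG_STARTEND) == 0)
        result = (long)pmatch[0].rm_so;         /* relative to string */

    regfree(&preg);
    return result;
}
```

+
+ Because the end of the subject is given by rm_eo rather than by a
+ terminating zero, the same call shape can be used for data that
+ contains binary zeros.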
+
+ If the pattern was compiled with the REG_NOSUB flag, no data about any
+ matched strings is returned. The nmatch and pmatch arguments of
+ regexec() are ignored (except possibly as input for REG_STARTEND).
+
+ The value of nmatch may be zero, and the value of pmatch may be NULL
+ (unless REG_STARTEND is set); in both these cases no data about any
+ matched strings is returned.
+
+ Otherwise, the portion of the string that was matched, and also any
+ captured substrings, are returned via the pmatch argument, which points
+ to an array of nmatch structures of type regmatch_t, containing the
+ members rm_so and rm_eo. These contain the byte offset to the first
+ character of each substring and the offset to the first character after
+ the end of each substring, respectively. The 0th element of the vector
+ relates to the entire portion of string that was matched; subsequent
+ elements relate to the capturing subpatterns of the regular expression.
+ Unused entries in the array have both structure members set to -1.
+
+ A successful match yields a zero return; various error codes are
+ defined in the header file, of which REG_NOMATCH is the "expected"
+ failure code.
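+
+ The way the pmatch vector is filled in can be sketched as follows.
+ show_captures() and MAX_GROUPS are hypothetical names chosen for this
+ illustration, and regex.h stands in for pcre2posix.h; the sketch relies
+ only on the behaviour described above (offsets in rm_so and rm_eo, and
+ -1 in both members of unused entries).
+

```c
#include <regex.h>
#include <stdio.h>

#define MAX_GROUPS 8   /* illustrative size for the pmatch vector */

/* Hypothetical helper: match pattern against subject and print the span
   of the whole match and of each capturing subpattern that was set.
   Returns the value from regcomp() or regexec(): zero on a match,
   REG_NOMATCH when there is none. */
int show_captures(const char *pattern, const char *subject,
                  regmatch_t pmatch[MAX_GROUPS])
{
    regex_t preg;
    int rc = regcomp(&preg, pattern, REG_EXTENDED);
    size_t i;

    if (rc != 0)
        return rc;

    rc = regexec(&preg, subject, MAX_GROUPS, pmatch, 0);
    if (rc == 0) {
        for (i = 0; i < MAX_GROUPS; i++) {
            if (pmatch[i].rm_so == -1)
                continue;             /* unused entry: both members -1 */
            printf("%zu: [%d,%d) \"%.*s\"\n", i,
                   (int)pmatch[i].rm_so, (int)pmatch[i].rm_eo,
                   (int)(pmatch[i].rm_eo - pmatch[i].rm_so),
                   subject + pmatch[i].rm_so);
        }
    }
    regfree(&preg);
    return rc;
}
```

+
+ Element 0 describes the whole match, and a group that did not take part
+ in the match is skipped because both of its offsets are -1.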
+
+
+ERROR MESSAGES
+
+ The regerror() function maps a non-zero error code from either regcomp()
+ or regexec() to a printable message. If preg is not NULL, the error
+ should have arisen from the use of that structure. A message terminated
+ by a binary zero is placed in errbuf. If the buffer is too short, only
+ the first errbuf_size - 1 characters of the error message are used. The
+ yield of the function is the size of buffer needed to hold the whole
+ message, including the terminating zero. This value is greater than
+ errbuf_size if the message was truncated.
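+
+ Because the yield is the size needed for the whole message, a caller
+ can ask once with a short buffer and then allocate exactly enough
+ memory. The sketch below shows one way of doing this;
+ full_error_message() is a hypothetical helper, and regex.h stands in
+ for pcre2posix.h.
+

```c
#include <regex.h>
#include <stdlib.h>

/* Hypothetical helper: return a malloc()ed copy of the complete message
   for errcode, or NULL if memory cannot be obtained. The caller is
   responsible for freeing the result. */
char *full_error_message(int errcode, const regex_t *preg)
{
    char small[8];                    /* deliberately short buffer */
    size_t needed = regerror(errcode, preg, small, sizeof(small));
    char *msg = malloc(needed);       /* needed includes the final zero */

    if (msg != NULL)
        (void)regerror(errcode, preg, msg, needed);
    return msg;
}
```

+
+ The first call may truncate the text held in the short buffer, but its
+ return value is still the full size, so the second call cannot
+ truncate.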
+
+
+MEMORY USAGE
+
+ Compiling a regular expression causes memory to be allocated and asso-
+ ciated with the preg structure. The function regfree() frees all such
+ memory, after which preg may no longer be used as a compiled expres-
+ sion.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 15 June 2017
+ Copyright (c) 1997-2017 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2SAMPLE(3) Library Functions Manual PCRE2SAMPLE(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 SAMPLE PROGRAM
+
+ A simple, complete demonstration program to get you started with using
+ PCRE2 is supplied in the file pcre2demo.c in the src directory in the
+ PCRE2 distribution. A listing of this program is given in the pcre2demo
+ documentation. If you do not have a copy of the PCRE2 distribution, you
+ can save this listing to re-create the contents of pcre2demo.c.
+
+ The demonstration program compiles the regular expression that is its
+ first argument, and matches it against the subject string in its second
+ argument. No PCRE2 options are set, and default character tables are
+ used. If matching succeeds, the program outputs the portion of the sub-
+ ject that matched, together with the contents of any captured sub-
+ strings.
+
+ If the -g option is given on the command line, the program then goes on
+ to check for further matches of the same regular expression in the same
+ subject string. The logic is a little bit tricky because of the possi-
+ bility of matching an empty string. Comments in the code explain what
+ is going on.
+
+ The code in pcre2demo.c is an 8-bit program that uses the PCRE2 8-bit
+ library. It handles strings and characters that are stored in 8-bit
+ code units. By default, one character corresponds to one code unit,
+ but if the pattern starts with "(*UTF)", both it and the subject are
+ treated as UTF-8 strings, where characters may occupy multiple code
+ units.
+
+ If PCRE2 is installed in the standard include and library directories
+ for your operating system, you should be able to compile the demonstra-
+ tion program using a command like this:
+
+ cc -o pcre2demo pcre2demo.c -lpcre2-8
+
+ If PCRE2 is installed elsewhere, you may need to add additional options
+ to the command line. For example, on a Unix-like system that has PCRE2
+ installed in /usr/local, you can compile the demonstration program
+ using a command like this:
+
+ cc -o pcre2demo -I/usr/local/include pcre2demo.c \
+ -L/usr/local/lib -lpcre2-8
+
+ Once you have built the demonstration program, you can run simple tests
+ like this:
+
+ ./pcre2demo 'cat|dog' 'the cat sat on the mat'
+ ./pcre2demo -g 'cat|dog' 'the dog sat on the cat'
+
+ Note that there is a much more comprehensive test program, called
+ pcre2test, which supports many more facilities for testing regular
+ expressions using all three PCRE2 libraries (8-bit, 16-bit, and 32-bit,
+ though not all three need be installed). The pcre2demo program is pro-
+ vided as a relatively simple coding example.
+
+ If you try to run pcre2demo when PCRE2 is not installed in the standard
+ library directory, you may get an error like this on some operating
+ systems (e.g. Solaris):
+
+ ld.so.1: pcre2demo: fatal: libpcre2-8.so.0: open failed: No such file
+ or directory
+
+ This is caused by the way shared library support works on those sys-
+ tems. You need to add
+
+ -R/usr/local/lib
+
+ (for example) to the compile command to get round this problem.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 02 February 2016
+ Copyright (c) 1997-2016 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2SERIALIZE(3) Library Functions Manual PCRE2SERIALIZE(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+SAVING AND RE-USING PRECOMPILED PCRE2 PATTERNS
+
+ int32_t pcre2_serialize_decode(pcre2_code **codes,
+ int32_t number_of_codes, const uint8_t *bytes,
+ pcre2_general_context *gcontext);
+
+ int32_t pcre2_serialize_encode(pcre2_code **codes,
+ int32_t number_of_codes, uint8_t **serialized_bytes,
+ PCRE2_SIZE *serialized_size, pcre2_general_context *gcontext);
+
+ void pcre2_serialize_free(uint8_t *bytes);
+
+ int32_t pcre2_serialize_get_number_of_codes(const uint8_t *bytes);
+
+ If you are running an application that uses a large number of regular
+ expression patterns, it may be useful to store them in a precompiled
+ form instead of having to compile them every time the application is
+ run. However, if you are using the just-in-time optimization feature,
+ it is not possible to save and reload the JIT data, because it is posi-
+ tion-dependent. The host on which the patterns are reloaded must be
+ running the same version of PCRE2, with the same code unit width, and
+ must also have the same endianness, pointer width and PCRE2_SIZE type.
+ For example, patterns compiled on a 32-bit system using PCRE2's 16-bit
+ library cannot be reloaded on a 64-bit system, nor can they be reloaded
+ using the 8-bit library.
+
+ Note that "serialization" in PCRE2 does not convert compiled patterns
+ to an abstract format like Java or .NET serialization. The serialized
+ output is really just a bytecode dump, which is why it can only be
+ reloaded in the same environment as the one that created it. Hence the
+ restrictions mentioned above. Applications that are not statically
+ linked with a fixed version of PCRE2 must be prepared to recompile pat-
+ terns from their sources, in order to be immune to PCRE2 upgrades.
+
+
+SECURITY CONCERNS
+
+ The facility for saving and restoring compiled patterns is intended for
+ use within individual applications. As such, the data supplied to
+ pcre2_serialize_decode() is expected to be trusted data, not data from
+ arbitrary external sources. There is only some simple consistency
+ checking, not complete validation of what is being re-loaded. Corrupted
+ data may cause undefined results. For example, if the length field of a
+ pattern in the serialized data is corrupted, the deserializing code may
+ read beyond the end of the byte stream that is passed to it.
+
+
+SAVING COMPILED PATTERNS
+
+ Before compiled patterns can be saved they must be serialized, which in
+ PCRE2 means converting the pattern to a stream of bytes. A single byte
+ stream may contain any number of compiled patterns, but they must all
+ use the same character tables. A single copy of the tables is included
+ in the byte stream (its size is 1088 bytes). For more details of char-
+ acter tables, see the section on locale support in the pcre2api docu-
+ mentation.
+
+ The function pcre2_serialize_encode() creates a serialized byte stream
+ from a list of compiled patterns. Its first two arguments specify the
+ list, being a pointer to a vector of pointers to compiled patterns, and
+ the length of the vector. The third and fourth arguments point to vari-
+ ables which are set to point to the created byte stream and its length,
+ respectively. The final argument is a pointer to a general context,
+ which can be used to specify custom memory management functions. If
+ this argument is NULL, malloc() is used to obtain memory for the byte
+ stream. The yield of the function is the number of serialized patterns,
+ or one of the following negative error codes:
+
+ PCRE2_ERROR_BADDATA       the number of patterns is zero or less
+ PCRE2_ERROR_BADMAGIC      mismatch of id bytes in one of the patterns
+ PCRE2_ERROR_MEMORY        memory allocation failed
+ PCRE2_ERROR_MIXEDTABLES   the patterns do not all use the same tables
+ PCRE2_ERROR_NULL          the 1st, 3rd, or 4th argument is NULL
+
+ PCRE2_ERROR_BADMAGIC means either that a pattern's code has been cor-
+ rupted, or that a slot in the vector does not point to a compiled pat-
+ tern.
+
+ Once a set of patterns has been serialized you can save the data in any
+ appropriate manner. Here is sample code that compiles two patterns and
+ writes them to a file. It assumes that the variable fd refers to a file
+ that is open for output. The error checking that should be present in a
+ real application has been omitted for simplicity.
+
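+ int errorcode;
+ uint8_t *bytes;
+ PCRE2_SIZE erroroffset;
+ PCRE2_SIZE bytescount;
+ size_t written;
+ pcre2_code *list_of_codes[2];
+
+ /* Compile the two patterns that are to be saved. */
+ list_of_codes[0] = pcre2_compile((PCRE2_SPTR)"first pattern",
+ PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
+ list_of_codes[1] = pcre2_compile((PCRE2_SPTR)"second pattern",
+ PCRE2_ZERO_TERMINATED, 0, &errorcode, &erroroffset, NULL);
+
+ /* Serialize both patterns into one stream of bytescount bytes. */
+ errorcode = pcre2_serialize_encode(list_of_codes, 2, &bytes,
+ &bytescount, NULL);
+
+ /* Write the stream; fwrite() yields the number of bytes written. */
+ written = fwrite(bytes, 1, bytescount, fd);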
+
+ Note that the serialized data is binary data that may contain any of
+ the 256 possible byte values. On systems that make a distinction
+ between binary and non-binary data, be sure that the file is opened for
+ binary output.
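+
+ A minimal sketch of writing such a byte stream in binary mode, with the
+ checks that a real application would want, is given below. save_blob()
+ and its path argument are hypothetical; only standard C library calls
+ are used, and the bytes and count arguments are assumed to be the
+ stream and length produced by pcre2_serialize_encode().
+

```c
#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: write count bytes to the named file, which is
   opened in binary mode ("wb"). Returns 0 on success and -1 on any
   failure. */
int save_blob(const char *path, const uint8_t *bytes, size_t count)
{
    FILE *f = fopen(path, "wb");
    size_t written;

    if (f == NULL)
        return -1;
    written = fwrite(bytes, 1, count, f);   /* number of bytes written */
    if (fclose(f) != 0 || written != count)
        return -1;
    return 0;
}
```

+
+ Opening with "wb" makes no difference on systems that do not
+ distinguish binary files, so it is harmless to use it everywhere.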
+
+ Serializing a set of patterns leaves the original data untouched, so
+ they can still be used for matching. Their memory must eventually be
+ freed in the usual way by calling pcre2_code_free(). When you have fin-
+ ished with the byte stream, it too must be freed by calling pcre2_seri-
+ alize_free(). If this function is called with a NULL argument, it
+ returns immediately without doing anything.
+
+
+RE-USING PRECOMPILED PATTERNS
+
+ In order to re-use a set of saved patterns you must first make the
+ serialized byte stream available in main memory (for example, by read-
+ ing from a file). The management of this memory block is up to the
+ application. You can use the pcre2_serialize_get_number_of_codes()
+ function to find out how many compiled patterns are in the serialized
+ data without actually decoding the patterns:
+
+ uint8_t *bytes = <serialized data>;
+ int32_t number_of_codes = pcre2_serialize_get_number_of_codes(bytes);
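+
+ One way of making the serialized data available in memory is to read
+ the whole file into an allocated block, as in the following sketch.
+ load_blob() is a hypothetical helper built only on the standard C
+ library; it treats an empty or unreadable file as a failure.
+

```c
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

/* Hypothetical helper: read a whole file, opened in binary mode, into a
   malloc()ed block. On success returns the block and stores its length
   in *count; on failure returns NULL. The caller frees the block. */
uint8_t *load_blob(const char *path, size_t *count)
{
    FILE *f = fopen(path, "rb");
    uint8_t *bytes = NULL;
    long size;

    if (f == NULL)
        return NULL;
    /* Find the file size, rewind, then read everything in one call. */
    if (fseek(f, 0, SEEK_END) == 0 && (size = ftell(f)) > 0 &&
        fseek(f, 0, SEEK_SET) == 0 &&
        (bytes = malloc((size_t)size)) != NULL) {
        if (fread(bytes, 1, (size_t)size, f) == (size_t)size) {
            *count = (size_t)size;
        } else {
            free(bytes);
            bytes = NULL;
        }
    }
    fclose(f);
    return bytes;
}
```

+
+ The returned block plays the role of the <serialized data> placeholder
+ used in this section, and remains the application's to free.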
+
+ The pcre2_serialize_decode() function reads a byte stream and recreates
+ the compiled patterns in new memory blocks, setting pointers to them in
+ a vector. The first two arguments are a pointer to a suitable vector
+ and its length, and the third argument points to a byte stream. The
+ final argument is a pointer to a general context, which can be used to
+ specify custom memory management functions for the decoded patterns.
+ If this argument is NULL, malloc() and free() are used. After deserial-
+ ization, the byte stream is no longer needed and can be discarded.
+
+ pcre2_code *list_of_codes[2];
+ uint8_t *bytes = <serialized data>;
+ /* The yield is the number of patterns decoded, or a negative error. */
+ int32_t number_of_codes =
+ pcre2_serialize_decode(list_of_codes, 2, bytes, NULL);
+
+ If the vector is not large enough for all the patterns in the byte
+ stream, it is filled with those that fit, and the remainder are
+ ignored. The yield of the function is the number of decoded patterns,
+ or one of the following negative error codes:
+
+ PCRE2_ERROR_BADDATA             second argument is zero or less
+ PCRE2_ERROR_BADMAGIC            mismatch of id bytes in the data
+ PCRE2_ERROR_BADMODE             mismatch of code unit size or PCRE2 version
+ PCRE2_ERROR_BADSERIALIZEDDATA   other sanity check failure
+ PCRE2_ERROR_MEMORY              memory allocation failed
+ PCRE2_ERROR_NULL                first or third argument is NULL
+
+ PCRE2_ERROR_BADMAGIC may mean that the data is corrupt, or that it was
+ compiled on a system with different endianness.
+
+ Decoded patterns can be used for matching in the usual way, and must be
+ freed by calling pcre2_code_free(). However, be aware that there is a
+ potential race issue if you are using multiple patterns that were
+ decoded from a single byte stream in a multithreaded application. A
+ single copy of the character tables is used by all the decoded patterns
+ and a reference count is used to arrange for its memory to be automati-
+ cally freed when the last pattern is freed, but there is no locking on
+ this reference count. Therefore, if you want to call pcre2_code_free()
+ for these patterns in different threads, you must arrange your own
+ locking, and ensure that pcre2_code_free() cannot be called by two
+ threads at the same time.
+
+ If a pattern was processed by pcre2_jit_compile() before being serial-
+ ized, the JIT data is discarded and so is no longer available after a
+ save/restore cycle. You can, however, process a restored pattern with
+ pcre2_jit_compile() if you wish.
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 27 June 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2SYNTAX(3) Library Functions Manual PCRE2SYNTAX(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+PCRE2 REGULAR EXPRESSION SYNTAX SUMMARY
+
+ The full syntax and semantics of the regular expressions that are sup-
+ ported by PCRE2 are described in the pcre2pattern documentation. This
+ document contains a quick-reference summary of the syntax.
+
+
+QUOTING
+
+ \x where x is non-alphanumeric is a literal x
+ \Q...\E treat enclosed characters as literal
+
+
+ESCAPED CHARACTERS
+
+ This table applies to ASCII and Unicode environments.
+
+ \a alarm, that is, the BEL character (hex 07)
+ \cx "control-x", where x is any ASCII printing character
+ \e escape (hex 1B)
+ \f form feed (hex 0C)
+ \n newline (hex 0A)
+ \r carriage return (hex 0D)
+ \t tab (hex 09)
+ \0dd character with octal code 0dd
+ \ddd character with octal code ddd, or backreference
+ \o{ddd..} character with octal code ddd..
+ \U "U" if PCRE2_ALT_BSUX is set (otherwise is an error)
+ \N{U+hh..} character with Unicode code point hh.. (Unicode mode only)
+ \uhhhh character with hex code hhhh (if PCRE2_ALT_BSUX is set)
+ \xhh character with hex code hh
+ \x{hh..} character with hex code hh..
+
+ Note that \0dd is always an octal code. The treatment of backslash fol-
+ lowed by a non-zero digit is complicated; for details see the section
+ "Non-printing characters" in the pcre2pattern documentation, where
+ details of escape processing in EBCDIC environments are also given.
+ \N{U+hh..} is synonymous with \x{hh..} in PCRE2 but is not supported in
+ EBCDIC environments. Note that \N not followed by an opening curly
+ bracket has a different meaning (see below).
+
+ When \x is not followed by {, from zero to two hexadecimal digits are
+ read, but if PCRE2_ALT_BSUX is set, \x must be followed by two hexadec-
+ imal digits to be recognized as a hexadecimal escape; otherwise it
+ matches a literal "x". Likewise, if \u (in ALT_BSUX mode) is not fol-
+ lowed by four hexadecimal digits, it matches a literal "u".
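+
+ The rule for unbraced \x can be sketched as a small function. This is
+ an illustration in Python, not PCRE2 code; parse_backslash_x and its
+ return format are hypothetical, and treating "no digits" in the default
+ mode as code point 0 is an assumption of the sketch.
+

```python
HEX_DIGITS = "0123456789abcdefABCDEF"

def parse_backslash_x(text, alt_bsux=False):
    """Interpret the characters that follow an unbraced backslash-x.

    Returns (kind, value, consumed), where kind is "codepoint" or
    "literal" and consumed counts the characters used after the escape.
    """
    digits = ""
    for ch in text[:2]:            # at most two hexadecimal digits are read
        if ch not in HEX_DIGITS:
            break
        digits += ch
    if alt_bsux:
        # ALT_BSUX mode: exactly two digits, otherwise a literal "x".
        if len(digits) == 2:
            return ("codepoint", int(digits, 16), 2)
        return ("literal", "x", 0)
    # Default mode: zero to two digits. Code point 0 for "no digits" is
    # an assumption made for this sketch.
    return ("codepoint", int(digits, 16) if digits else 0, len(digits))
```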
+
+
+CHARACTER TYPES
+
+ . any character except newline;
+ in dotall mode, any character whatsoever
+ \C one code unit, even in UTF mode (best avoided)
+ \d a decimal digit
+ \D a character that is not a decimal digit
+ \h a horizontal white space character
+ \H a character that is not a horizontal white space character
+ \N a character that is not a newline
+ \p{xx} a character with the xx property
+ \P{xx} a character without the xx property
+ \R a newline sequence
+ \s a white space character
+ \S a character that is not a white space character
+ \v a vertical white space character
+ \V a character that is not a vertical white space character
+ \w a "word" character
+ \W a "non-word" character
+ \X a Unicode extended grapheme cluster
+
+ \C is dangerous because it may leave the current matching point in the
+ middle of a UTF-8 or UTF-16 character. The application can lock out the
+ use of \C by setting the PCRE2_NEVER_BACKSLASH_C option. It is also
+ possible to build PCRE2 with the use of \C permanently disabled.
+
+ By default, \d, \s, and \w match only ASCII characters, even in UTF-8
+ mode or in the 16-bit and 32-bit libraries. However, if locale-specific
+ matching is happening, \s and \w may also match characters with code
+ points in the range 128-255. If the PCRE2_UCP option is set, the behav-
+ iour of these escape sequences is changed to use Unicode properties and
+ they match many more characters.
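+
+ The difference between ASCII-only and property-based matching of \d
+ can be seen by analogy in Python's re module, which is used here only
+ as a stand-in and is not PCRE2. Its re.ASCII flag behaves roughly like
+ the PCRE2 default, and its default Unicode matching roughly like
+ PCRE2_UCP. The helper digit_matches is hypothetical.
+

```python
import re

ARABIC_INDIC_THREE = "\u0663"   # a decimal digit (category Nd) outside ASCII

def digit_matches(ch, ascii_only):
    """Report whether the digit escape matches ch under the chosen rules."""
    flags = re.ASCII if ascii_only else 0
    return re.fullmatch(r"\d", ch, flags) is not None
```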
+
+
+GENERAL CATEGORY PROPERTIES FOR \p and \P
+
+ C Other
+ Cc Control
+ Cf Format
+ Cn Unassigned
+ Co Private use
+ Cs Surrogate
+
+ L Letter
+ Ll Lower case letter
+ Lm Modifier letter
+ Lo Other letter
+ Lt Title case letter
+ Lu Upper case letter
+ L& Ll, Lu, or Lt
+
+ M Mark
+ Mc Spacing mark
+ Me Enclosing mark
+ Mn Non-spacing mark
+
+ N Number
+ Nd Decimal number
+ Nl Letter number
+ No Other number
+
+ P Punctuation
+ Pc Connector punctuation
+ Pd Dash punctuation
+ Pe Close punctuation
+ Pf Final punctuation
+ Pi Initial punctuation
+ Po Other punctuation
+ Ps Open punctuation
+
+ S Symbol
+ Sc Currency symbol
+ Sk Modifier symbol
+ Sm Mathematical symbol
+ So Other symbol
+
+ Z Separator
+ Zl Line separator
+ Zp Paragraph separator
+ Zs Space separator
+
+
+PCRE2 SPECIAL CATEGORY PROPERTIES FOR \p and \P
+
+ Xan Alphanumeric: union of properties L and N
+ Xps POSIX space: property Z or tab, NL, VT, FF, CR
+ Xsp Perl space: property Z or tab, NL, VT, FF, CR
+ Xuc Universally-named character: one that can be
+ represented by a Universal Character Name
+ Xwd Perl word: property Xan or underscore
+
+ Perl and POSIX space are now the same. Perl added VT to its space char-
+ acter set at release 5.18.
+
+
+SCRIPT NAMES FOR \p AND \P
+
+ Adlam, Ahom, Anatolian_Hieroglyphs, Arabic, Armenian, Avestan, Bali-
+ nese, Bamum, Bassa_Vah, Batak, Bengali, Bhaiksuki, Bopomofo, Brahmi,
+ Braille, Buginese, Buhid, Canadian_Aboriginal, Carian, Caucasian_Alba-
+ nian, Chakma, Cham, Cherokee, Common, Coptic, Cuneiform, Cypriot,
+ Cyrillic, Deseret, Devanagari, Dogra, Duployan, Egyptian_Hieroglyphs,
+ Elbasan, Ethiopic, Georgian, Glagolitic, Gothic, Grantha, Greek,
+ Gujarati, Gunjala_Gondi, Gurmukhi, Han, Hangul, Hanifi_Rohingya,
+ Hanunoo, Hatran, Hebrew, Hiragana, Imperial_Aramaic, Inherited,
+ Inscriptional_Pahlavi, Inscriptional_Parthian, Javanese, Kaithi, Kan-
+ nada, Katakana, Kayah_Li, Kharoshthi, Khmer, Khojki, Khudawadi, Lao,
+ Latin, Lepcha, Limbu, Linear_A, Linear_B, Lisu, Lycian, Lydian, Maha-
+ jani, Makasar, Malayalam, Mandaic, Manichaean, Marchen, Masaram_Gondi,
+ Medefaidrin, Meetei_Mayek, Mende_Kikakui, Meroitic_Cursive,
+ Meroitic_Hieroglyphs, Miao, Modi, Mongolian, Mro, Multani, Myanmar,
+ Nabataean, New_Tai_Lue, Newa, Nko, Nushu, Ogham, Ol_Chiki, Old_Hungar-
+ ian, Old_Italic, Old_North_Arabian, Old_Permic, Old_Persian, Old_Sog-
+ dian, Old_South_Arabian, Old_Turkic, Oriya, Osage, Osmanya,
+ Pahawh_Hmong, Palmyrene, Pau_Cin_Hau, Phags_Pa, Phoenician,
+ Psalter_Pahlavi, Rejang, Runic, Samaritan, Saurashtra, Sharada, Sha-
+ vian, Siddham, SignWriting, Sinhala, Sogdian, Sora_Sompeng, Soyombo,
+ Sundanese, Syloti_Nagri, Syriac, Tagalog, Tagbanwa, Tai_Le, Tai_Tham,
+ Tai_Viet, Takri, Tamil, Tangut, Telugu, Thaana, Thai, Tibetan, Tifi-
+ nagh, Tirhuta, Ugaritic, Vai, Warang_Citi, Yi, Zanabazar_Square.
+
+
+CHARACTER CLASSES
+
+ [...] positive character class
+ [^...] negative character class
+ [x-y] range (can be used for hex characters)
+ [[:xxx:]] positive POSIX named set
+ [[:^xxx:]] negative POSIX named set
+
+ alnum alphanumeric
+ alpha alphabetic
+ ascii 0-127
+ blank space or tab
+ cntrl control character
+ digit decimal digit
+ graph printing, excluding space
+ lower lower case letter
+ print printing, including space
+ punct printing, excluding alphanumeric
+ space white space
+ upper upper case letter
+ word same as \w
+ xdigit hexadecimal digit
+
+ In PCRE2, POSIX character set names recognize only ASCII characters by
+ default, but some of them use Unicode properties if PCRE2_UCP is set.
+ You can use \Q...\E inside a character class.
+
+
+QUANTIFIERS
+
+ ? 0 or 1, greedy
+ ?+ 0 or 1, possessive
+ ?? 0 or 1, lazy
+ * 0 or more, greedy
+ *+ 0 or more, possessive
+ *? 0 or more, lazy
+ + 1 or more, greedy
+ ++ 1 or more, possessive
+ +? 1 or more, lazy
+ {n} exactly n
+ {n,m} at least n, no more than m, greedy
+ {n,m}+ at least n, no more than m, possessive
+ {n,m}? at least n, no more than m, lazy
+ {n,} n or more, greedy
+ {n,}+ n or more, possessive
+ {n,}? n or more, lazy
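+
+ The greedy and lazy forms differ in how much text they take. A minimal
+ illustration follows, using Python's re module as a stand-in for the
+ syntax the two engines share; it is not PCRE2, and first_match is a
+ hypothetical helper. Possessive quantifiers are not exercised because
+ older Python versions do not accept them.
+

```python
import re

SUBJECT = "<a><b>"

def first_match(pattern):
    """Return the text that pattern matches at the start of SUBJECT."""
    m = re.match(pattern, SUBJECT)
    return m.group(0) if m else None
```
+
+ With this subject, the greedy pattern <.+> takes the whole string,
+ whereas the lazy pattern <.+?> stops at the first closing bracket.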
+
+
+ANCHORS AND SIMPLE ASSERTIONS
+
+ \b word boundary
+ \B not a word boundary
+ ^ start of subject
+ also after an internal newline in multiline mode
+ (after any newline if PCRE2_ALT_CIRCUMFLEX is set)
+ \A start of subject
+ $ end of subject
+ also before newline at end of subject
+ also before internal newline in multiline mode
+ \Z end of subject
+ also before newline at end of subject
+ \z end of subject
+ \G first matching position in subject
+
+
+REPORTED MATCH POINT SETTING
+
+ \K set reported start of match
+
+ \K is honoured in positive assertions, but ignored in negative ones.
+
+
+ALTERNATION
+
+ expr|expr|expr...
+
+
+CAPTURING
+
+ (...) capturing group
+ (?<name>...) named capturing group (Perl)
+ (?'name'...) named capturing group (Perl)
+ (?P<name>...) named capturing group (Python)
+ (?:...) non-capturing group
+ (?|...) non-capturing group; reset group numbers for
+ capturing groups in each alternative
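+
+ The Python form of a named capturing group, together with a
+ non-capturing group, can be tried directly in Python's re module. The
+ sketch below uses that module, not PCRE2; the pattern DATE and the
+ helper year_and_month are hypothetical.
+

```python
import re

# Two named capturing groups followed by one non-capturing group.
DATE = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})-(?:\d{2})")

def year_and_month(text):
    """Return the named captures of the first date found, or None."""
    m = DATE.search(text)
    return m.groupdict() if m else None
```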
+
+
+ATOMIC GROUPS
+
+ (?>...) atomic, non-capturing group
+
+
+COMMENT
+
+ (?#....) comment (not nestable)
+
+
+OPTION SETTING
+
+ Changes of these options within a group are automatically cancelled at
+ the end of the group.
+
+ (?i) caseless
+ (?J) allow duplicate names
+ (?m) multiline
+ (?n) no auto capture
+ (?s) single line (dotall)
+ (?U) default ungreedy (lazy)
+ (?x) extended: ignore white space except in classes
+ (?xx) as (?x) but also ignore space and tab in classes
+ (?-...) unset option(s)
+ (?^) unset imnsx options
+
+ Unsetting x or xx unsets both. Several options may be set at once, and
+ a mixture of setting and unsetting such as (?i-x) is allowed, but there
+ may be only one hyphen. Setting (but not unsetting) is allowed after (?^,
+ for example (?^in). An option setting may appear at the start of a non-
+ capturing group, for example (?i:...).
+
+ The following are recognized only at the very start of a pattern or
+ after one of the newline or \R options with similar syntax. More than
+ one of them may appear. For the first three, d is a decimal number.
+
+ (*LIMIT_DEPTH=d) set the backtracking limit to d
+ (*LIMIT_HEAP=d) set the heap size limit to d * 1024 bytes
+ (*LIMIT_MATCH=d) set the match limit to d
+ (*NOTEMPTY) set PCRE2_NOTEMPTY when matching
+ (*NOTEMPTY_ATSTART) set PCRE2_NOTEMPTY_ATSTART when matching
+ (*NO_AUTO_POSSESS) no auto-possessification (PCRE2_NO_AUTO_POSSESS)
+ (*NO_DOTSTAR_ANCHOR) no .* anchoring (PCRE2_NO_DOTSTAR_ANCHOR)
+ (*NO_JIT) disable JIT optimization
+ (*NO_START_OPT) no start-match optimization (PCRE2_NO_START_OPTIMIZE)
+ (*UTF) set appropriate UTF mode for the library in use
+ (*UCP) set PCRE2_UCP (use Unicode properties for \d etc)
+
+ Note that LIMIT_DEPTH, LIMIT_HEAP, and LIMIT_MATCH can only reduce the
+ value of the limits set by the caller of pcre2_match() or
+ pcre2_dfa_match(), not increase them. LIMIT_RECURSION is an obsolete
+ synonym for LIMIT_DEPTH. The application can lock out the use of (*UTF)
+ and (*UCP) by setting the PCRE2_NEVER_UTF or PCRE2_NEVER_UCP options,
+ respectively, at compile time.
+
+
+NEWLINE CONVENTION
+
+ These are recognized only at the very start of the pattern or after
+ option settings with a similar syntax.
+
+ (*CR) carriage return only
+ (*LF) linefeed only
+ (*CRLF) carriage return followed by linefeed
+ (*ANYCRLF) all three of the above
+ (*ANY) any Unicode newline sequence
+ (*NUL) the NUL character (binary zero)
+
+
+WHAT \R MATCHES
+
+ These are recognized only at the very start of the pattern or after
+ option settings with a similar syntax.
+
+ (*BSR_ANYCRLF) CR, LF, or CRLF
+ (*BSR_UNICODE) any Unicode newline sequence
+
+
+LOOKAHEAD AND LOOKBEHIND ASSERTIONS
+
+ (?=...) positive look ahead
+ (?!...) negative look ahead
+ (?<=...) positive look behind
+ (?<!...) negative look behind
+
+ Each top-level branch of a look behind must be of a fixed length.
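+
+ A fixed-length lookbehind can be illustrated with Python's re module,
+ used here as a stand-in rather than PCRE2. Python applies its own
+ fixed-width rule, which may be stricter than the per-branch rule stated
+ above, so only simple cases are shown. The helpers price_digits and
+ compiles are hypothetical.
+

```python
import re

def price_digits(text):
    """Return digit runs that are immediately preceded by a dollar sign."""
    return re.findall(r"(?<=\$)\d+", text)

def compiles(pattern):
    """Report whether Python's engine accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False
```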
+
+
+BACKREFERENCES
+
+ \n reference by number (can be ambiguous)
+ \gn reference by number
+ \g{n} reference by number
+ \g+n relative reference by number (PCRE2 extension)
+ \g-n relative reference by number
+ \g{+n} relative reference by number (PCRE2 extension)
+ \g{-n} relative reference by number
+ \k<name> reference by name (Perl)
+ \k'name' reference by name (Perl)
+ \g{name} reference by name (Perl)
+ \k{name} reference by name (.NET)
+ (?P=name) reference by name (Python)
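+
+ Two of the forms above, \n and (?P=name), are also accepted by Python's
+ re module, which makes a quick experiment possible. The sketch uses
+ that module as a stand-in, not PCRE2; QUOTED, DOUBLED and the two
+ helper functions are hypothetical.
+

```python
import re

# (?P=q) must repeat whichever quote character group q captured.
QUOTED = re.compile(r"""(?P<q>['"])(.*?)(?P=q)""")
# \1 must repeat the word captured by group 1.
DOUBLED = re.compile(r"\b(\w+) \1\b")

def quoted_strings(text):
    """Return the contents of each quoted string in text."""
    return [m.group(2) for m in QUOTED.finditer(text)]

def doubled_word(text):
    """Return the first word that is immediately repeated, or None."""
    m = DOUBLED.search(text)
    return m.group(1) if m else None
```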
+
+
+SUBROUTINE REFERENCES (POSSIBLY RECURSIVE)
+
+ (?R) recurse whole pattern
+ (?n) call subpattern by absolute number
+ (?+n) call subpattern by relative number
+ (?-n) call subpattern by relative number
+ (?&name) call subpattern by name (Perl)
+ (?P>name) call subpattern by name (Python)
+ \g<name> call subpattern by name (Oniguruma)
+ \g'name' call subpattern by name (Oniguruma)
+ \g<n> call subpattern by absolute number (Oniguruma)
+ \g'n' call subpattern by absolute number (Oniguruma)
+ \g<+n> call subpattern by relative number (PCRE2 extension)
+ \g'+n' call subpattern by relative number (PCRE2 extension)
+ \g<-n> call subpattern by relative number (PCRE2 extension)
+ \g'-n' call subpattern by relative number (PCRE2 extension)
+
+
+CONDITIONAL PATTERNS
+
+ (?(condition)yes-pattern)
+ (?(condition)yes-pattern|no-pattern)
+
+ (?(n) absolute reference condition
+ (?(+n) relative reference condition
+ (?(-n) relative reference condition
+ (?(<name>) named reference condition (Perl)
+ (?('name') named reference condition (Perl)
+ (?(name) named reference condition (PCRE2, deprecated)
+ (?(R) overall recursion condition
+ (?(Rn) specific numbered group recursion condition
+ (?(R&name) specific named group recursion condition
+ (?(DEFINE) define subpattern for reference
+ (?(VERSION[>]=n.m) test PCRE2 version
+ (?(assert) assertion condition
+
+ Note the ambiguity of (?(R) and (?(Rn), which might be named reference
+ conditions or recursion tests. Such a condition is interpreted as a
+ reference condition if the relevant named group exists.
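+
+ The absolute reference condition (?(n) is also accepted by Python's re
+ module, which is used below as a stand-in and is not PCRE2. The pattern
+ requires a closing parenthesis only if group 1 matched an opening one.
+ BALANCED and is_balanced_number are hypothetical names.
+

```python
import re

# Group 1 optionally matches "("; the condition demands ")" only then.
BALANCED = re.compile(r"(\()?\d+(?(1)\))")

def is_balanced_number(text):
    """Accept a number that is either bare or fully parenthesised."""
    return BALANCED.fullmatch(text) is not None
```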
+
+
+BACKTRACKING CONTROL
+
+ All backtracking control verbs may be in the form (*VERB:NAME). For
+ (*MARK) the name is mandatory, for the others it is optional. (*SKIP)
+ changes its behaviour if :NAME is present. The others just set a name
+ for passing back to the caller, but this is not a name that (*SKIP) can
+ see. The following act as soon as they are reached:
+
+ (*ACCEPT) force successful match
+ (*FAIL) force backtrack; synonym (*F)
+ (*MARK:NAME) set name to be passed back; synonym (*:NAME)
+
+ The following act only when a subsequent match failure causes a back-
+ track to reach them. They all force a match failure, but they differ in
+ what happens afterwards. Those that advance the start-of-match point do
+ so only if the pattern is not anchored.
+
+ (*COMMIT) overall failure, no advance of starting point
+ (*PRUNE) advance to next starting character
+ (*SKIP) advance to current matching position
+ (*SKIP:NAME) advance to position corresponding to an earlier
+ (*MARK:NAME); if not found, the (*SKIP) is ignored
+ (*THEN) local failure, backtrack to next alternation
+
+ The effect of one of these verbs in a group called as a subroutine is
+ confined to the subroutine call.
+
+
+CALLOUTS
+
+ (?C) callout (assumed number 0)
+ (?Cn) callout with numerical data n
+ (?C"text") callout with string data
+
+ The allowed string delimiters are ` ' " ^ % # $ (which are the same for
+ the start and the end), and the starting delimiter { matched with the
+ ending delimiter }. To encode the ending delimiter within the string,
+ double it.
+
+
+SEE ALSO
+
+ pcre2pattern(3), pcre2api(3), pcre2callout(3), pcre2matching(3),
+ pcre2(3).
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 02 September 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+
+PCRE2UNICODE(3) Library Functions Manual PCRE2UNICODE(3)
+
+
+
+NAME
+ PCRE2 - Perl-compatible regular expressions (revised API)
+
+UNICODE AND UTF SUPPORT
+
+ When PCRE2 is built with Unicode support (which is the default), it has
+ knowledge of Unicode character properties and can process text strings
+ in UTF-8, UTF-16, or UTF-32 format (depending on the code unit width).
+ However, by default, PCRE2 assumes that one code unit is one character.
+ To process a pattern as a UTF string, where a character may require
+ more than one code unit, you must call pcre2_compile() with the
+ PCRE2_UTF option flag, or the pattern must start with the sequence
+ (*UTF). When either of these is the case, both the pattern and any sub-
+ ject strings that are matched against it are treated as UTF strings
+ instead of strings of individual one-code-unit characters. There are
+ also some other changes to the way characters are handled, as docu-
+ mented below.
+
+ If you do not need Unicode support you can build PCRE2 without it, in
+ which case the library will be smaller.
+
+
+UNICODE PROPERTY SUPPORT
+
+ When PCRE2 is built with Unicode support, the escape sequences \p{..},
+ \P{..}, and \X can be used. The Unicode properties that can be tested
+ are limited to the general category properties such as Lu for an upper
+ case letter or Nd for a decimal number, the Unicode script names such
+ as Arabic or Han, and the derived properties Any and L&. Full lists are
+ given in the pcre2pattern and pcre2syntax documentation. Only the short
+ names for properties are supported. For example, \p{L} matches a let-
+ ter. Its Perl synonym, \p{Letter}, is not supported. Furthermore, in
+ Perl, many properties may optionally be prefixed by "Is", for compati-
+ bility with Perl 5.6. PCRE2 does not support this.
+
+
+WIDE CHARACTERS AND UTF MODES
+
+ Code points less than 256 can be specified in patterns by either braced
+ or unbraced hexadecimal escape sequences (for example, \x{b3} or \xb3).
+ Larger values have to use braced sequences. Unbraced octal code points
+ up to \777 are also recognized; larger ones can be coded using \o{...}.
+
+ The escape sequence \N{U+<hex digits>} is recognized as another way of
+ specifying a Unicode character by code point in a UTF mode. It is not
+ allowed in non-UTF modes.
+
+ In UTF modes, repeat quantifiers apply to complete UTF characters, not
+ to individual code units.
+
+ In UTF modes, the dot metacharacter matches one UTF character instead
+ of a single code unit.
+
+ The escape sequence \C can be used to match a single code unit in a UTF
+ mode, but its use can lead to some strange effects because it breaks up
+ multi-unit characters (see the description of \C in the pcre2pattern
+ documentation).
+
+ The use of \C is not supported by the alternative matching function
+ pcre2_dfa_match() when in UTF-8 or UTF-16 mode, that is, when a charac-
+ ter may consist of more than one code unit. The use of \C in these
+ modes provokes a match-time error. Also, the JIT optimization does not
+ support \C in these modes. If JIT optimization is requested for a UTF-8
+ or UTF-16 pattern that contains \C, it will not succeed, and so when
+ pcre2_match() is called, the matching will be carried out by the normal
+ interpretive function.
+
+ The character escapes \b, \B, \d, \D, \s, \S, \w, and \W correctly test
+ characters of any code value, but, by default, the characters that
+ PCRE2 recognizes as digits, spaces, or word characters remain the same
+ set as in non-UTF mode, all with code points less than 256. This
+ remains true even when PCRE2 is built to include Unicode support,
+ because to do otherwise would slow down matching in many common cases.
+ Note that this also applies to \b and \B, because they are defined in
+ terms of \w and \W. If you want to test for a wider sense of, say,
+ "digit", you can use explicit Unicode property tests such as \p{Nd}.
+ Alternatively, if you set the PCRE2_UCP option, the way that the char-
+ acter escapes work is changed so that Unicode properties are used to
+ determine which characters match. There are more details in the section
+ on generic character types in the pcre2pattern documentation.
+
+ Similarly, characters that match the POSIX named character classes are
+ all low-valued characters, unless the PCRE2_UCP option is set.
+
+ However, the special horizontal and vertical white space matching
+ escapes (\h, \H, \v, and \V) do match all the appropriate Unicode char-
+ acters, whether or not PCRE2_UCP is set.
+
+
+CASE-EQUIVALENCE IN UTF MODES
+
+ Case-insensitive matching in a UTF mode makes use of Unicode properties
+ except for characters whose code points are less than 128 and that have
+ at most two case-equivalent values. For these, a direct table lookup is
+ used for speed. A few Unicode characters such as Greek sigma have more
+ than two code points that are case-equivalent, and these are treated as
+ such.
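+
+ The sigma case can be checked against Python's own Unicode database,
+ which is independent of the tables PCRE2 uses. In this sketch,
+ SIGMA_FORMS and uppercase_forms are hypothetical names.
+

```python
# Capital sigma, small sigma, and final small sigma.
SIGMA_FORMS = ["\u03a3", "\u03c3", "\u03c2"]

def uppercase_forms(chars):
    """Return the set of upper-case forms of the given characters."""
    return {c.upper() for c in chars}
```
+
+ All three code points share the single upper-case form U+03A3, so
+ there are more than two case-equivalent code points.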
+
+
+VALIDITY OF UTF STRINGS
+
+ When the PCRE2_UTF option is set, the strings passed as patterns and
+ subjects are (by default) checked for validity on entry to the relevant
+ functions. If an invalid UTF string is passed, a negative error code
+ is returned. The code unit offset to the offending character can be
+ extracted from the match data block by calling pcre2_get_startchar(),
+ which is used for this purpose after a UTF error.
+
+ UTF-16 and UTF-32 strings can indicate their endianness by a special
+ code known as a byte-order mark (BOM). The PCRE2 functions do not handle
+ this, expecting strings to be in host byte order.
+
+ A UTF string is checked before any other processing takes place. In the
+ case of pcre2_match() and pcre2_dfa_match() calls with a non-zero
+ starting offset, the check is applied only to that part of the subject
+ that could be inspected during matching, and there is a check that the
+ starting offset points to the first code unit of a character or to the
+ end of the subject. If there are no lookbehind assertions in the pat-
+ tern, the check starts at the starting offset. Otherwise, it starts at
+ the length of the longest lookbehind before the starting offset, or at
+ the start of the subject if there are not that many characters before
+ the starting offset. Note that the sequences \b and \B are one-charac-
+ ter lookbehinds.
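+
+ The rule for where the check starts can be sketched for a UTF-8
+ subject. This is an illustration of the description above, not PCRE2's
+ implementation; utf_check_start and its parameters are hypothetical,
+ and the sketch assumes the bytes before the starting offset are well
+ formed enough to step back over.
+

```python
def utf_check_start(subject, start_offset, max_lookbehind):
    """Return the byte offset at which validity checking would begin.

    subject is a bytes object, start_offset a byte offset on a character
    boundary, and max_lookbehind the longest lookbehind in characters
    (0 when the pattern has no lookbehind assertions).
    """
    pos = start_offset
    for _ in range(max_lookbehind):
        if pos == 0:               # fewer characters available: use start
            break
        pos -= 1
        # Step back over continuation bytes (binary 10xxxxxx).
        while pos > 0 and subject[pos] & 0xC0 == 0x80:
            pos -= 1
    return pos
```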
+
+ In addition to checking the format of the string, there is a check to
+ ensure that all code points lie in the range U+0 to U+10FFFF, excluding
+ the surrogate area. The so-called "non-character" code points are not
+ excluded because Unicode corrigendum #9 makes it clear that they should
+ not be.
+
+ Characters in the "Surrogate Area" of Unicode are reserved for use by
+ UTF-16, where they are used in pairs to encode code points with values
+ greater than 0xFFFF. The code points that are encoded by UTF-16 pairs
+ are available independently in the UTF-8 and UTF-32 encodings. (In
+ other words, the whole surrogate thing is a fudge for UTF-16 which
+ unfortunately messes up UTF-8 and UTF-32.)
+
+ In some situations, you may already know that your strings are valid,
+ and therefore want to skip these checks in order to improve perfor-
+ mance, for example in the case of a long subject string that is being
+ scanned repeatedly. If you set the PCRE2_NO_UTF_CHECK option at com-
+ pile time or at match time, PCRE2 assumes that the pattern or subject
+ it is given (respectively) contains only valid UTF code unit sequences.
+
+ Passing PCRE2_NO_UTF_CHECK to pcre2_compile() just disables the check
+ for the pattern; it does not also apply to subject strings. If you want
+ to disable the check for a subject string you must pass this option to
+ pcre2_match() or pcre2_dfa_match().
+
+ If you pass an invalid UTF string when PCRE2_NO_UTF_CHECK is set, the
+ result is undefined and your program may crash or loop indefinitely.
+
+ Note that setting PCRE2_NO_UTF_CHECK at compile time does not disable
+ the error that is given if an escape sequence for an invalid Unicode
+ code point is encountered in the pattern. If you want to allow escape
+ sequences such as \x{d800} (a surrogate code point) you can set the
+ PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES extra option. However, this is pos-
+ sible only in UTF-8 and UTF-32 modes, because these values are not rep-
+ resentable in UTF-16.
+
+ Errors in UTF-8 strings
+
+ The following negative error codes are given for invalid UTF-8 strings:
+
+ PCRE2_ERROR_UTF8_ERR1
+ PCRE2_ERROR_UTF8_ERR2
+ PCRE2_ERROR_UTF8_ERR3
+ PCRE2_ERROR_UTF8_ERR4
+ PCRE2_ERROR_UTF8_ERR5
+
+ The string ends with a truncated UTF-8 character; the code specifies
+ how many bytes are missing (1 to 5). Although RFC 3629 restricts UTF-8
+ characters to be no longer than 4 bytes, the encoding scheme (origi-
+ nally defined by RFC 2279) allows for up to 6 bytes, and this is
+ checked first; hence the possibility of 4 or 5 missing bytes.
+
+ PCRE2_ERROR_UTF8_ERR6
+ PCRE2_ERROR_UTF8_ERR7
+ PCRE2_ERROR_UTF8_ERR8
+ PCRE2_ERROR_UTF8_ERR9
+ PCRE2_ERROR_UTF8_ERR10
+
+ The two most significant bits of the 2nd, 3rd, 4th, 5th, or 6th byte of
+ the character do not have the binary value 0b10 (that is, either the
+ most significant bit is 0, or the next bit is 1).
+
+ PCRE2_ERROR_UTF8_ERR11
+ PCRE2_ERROR_UTF8_ERR12
+
+ A character that is valid by the RFC 2279 rules is either 5 or 6 bytes
+ long; these code points are excluded by RFC 3629.
+
+ PCRE2_ERROR_UTF8_ERR13
+
+ A 4-byte character has a value greater than 0x10ffff; these code points
+ are excluded by RFC 3629.
+
+ PCRE2_ERROR_UTF8_ERR14
+
+ A 3-byte character has a value in the range 0xd800 to 0xdfff; this
+ range of code points is reserved by RFC 3629 for use with UTF-16, and
+ so is excluded from UTF-8.
+
+ PCRE2_ERROR_UTF8_ERR15
+ PCRE2_ERROR_UTF8_ERR16
+ PCRE2_ERROR_UTF8_ERR17
+ PCRE2_ERROR_UTF8_ERR18
+ PCRE2_ERROR_UTF8_ERR19
+
+ A 2-, 3-, 4-, 5-, or 6-byte character is "overlong", that is, it codes
+ for a value that can be represented by fewer bytes, which is invalid.
+ For example, the two bytes 0xc0, 0xae give the value 0x2e, whose cor-
+ rect coding uses just one byte.
+
+ PCRE2_ERROR_UTF8_ERR20
+
+ The two most significant bits of the first byte of a character have the
+ binary value 0b10 (that is, the most significant bit is 1 and the sec-
+ ond is 0). Such a byte can only validly occur as the second or subse-
+ quent byte of a multi-byte character.
+
+ PCRE2_ERROR_UTF8_ERR21
+
+ The first byte of a character has the value 0xfe or 0xff. These values
+ can never occur in a valid UTF-8 string.
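+
+ The UTF-8 error cases above can be sketched as a checker. This is an
+ illustration in Python, not PCRE2's validator. The function check_utf8
+ is hypothetical and returns shortened names as strings, whereas PCRE2
+ returns negative integer codes. The order of the tests, and the pairing
+ of each numbered code with a case in the order the text lists them, are
+ assumptions. Because 5- and 6-byte forms are rejected here as ERR11 or
+ ERR12, the sketch never reports ERR18 or ERR19.
+

```python
def check_utf8(data):
    """Return None for valid UTF-8, else (error_name, byte_offset)."""
    i, n = 0, len(data)
    while i < n:
        first = data[i]
        if first < 0x80:                 # single-byte (ASCII) character
            i += 1
            continue
        if first & 0xC0 == 0x80:         # 10xxxxxx cannot start a character
            return ("UTF8_ERR20", i)
        if first >= 0xFE:                # 0xfe and 0xff never occur
            return ("UTF8_ERR21", i)
        # Continuation bytes implied by the first byte (RFC 2279 scheme).
        if first < 0xE0:
            extra = 1
        elif first < 0xF0:
            extra = 2
        elif first < 0xF8:
            extra = 3
        elif first < 0xFC:
            extra = 4
        else:
            extra = 5
        missing = extra - (n - i - 1)
        if missing > 0:                  # truncated: 1 to 5 bytes missing
            return ("UTF8_ERR%d" % missing, i)
        value = first & (0x7F >> (extra + 1))
        for k in range(1, extra + 1):
            byte = data[i + k]
            if byte & 0xC0 != 0x80:      # k = 1 is the 2nd byte, giving ERR6
                return ("UTF8_ERR%d" % (5 + k), i)
            value = (value << 6) | (byte & 0x3F)
        if extra >= 4:                   # 5- or 6-byte form
            return ("UTF8_ERR11" if extra == 4 else "UTF8_ERR12", i)
        if extra == 3 and value > 0x10FFFF:
            return ("UTF8_ERR13", i)
        if extra == 2 and 0xD800 <= value <= 0xDFFF:
            return ("UTF8_ERR14", i)
        if value < (0x80, 0x800, 0x10000)[extra - 1]:   # overlong form
            return ("UTF8_ERR%d" % (14 + extra), i)
        i += extra + 1
    return None
```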
+
+ Errors in UTF-16 strings
+
+ The following negative error codes are given for invalid UTF-16
+ strings:
+
+ PCRE2_ERROR_UTF16_ERR1 Missing low surrogate at end of string
+ PCRE2_ERROR_UTF16_ERR2 Invalid low surrogate follows high surrogate
+ PCRE2_ERROR_UTF16_ERR3 Isolated low surrogate
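+
+ The three UTF-16 cases can be sketched as follows. This is an
+ illustration in Python, not PCRE2's validator; check_utf16 is a
+ hypothetical function that takes a list of 16-bit code units and
+ returns shortened error names as strings rather than PCRE2's negative
+ integer codes.
+

```python
def check_utf16(units):
    """Return None if the code units are valid, else (error_name, offset)."""
    i, n = 0, len(units)
    while i < n:
        unit = units[i]
        if 0xD800 <= unit <= 0xDBFF:           # high (leading) surrogate
            if i + 1 == n:
                return ("UTF16_ERR1", i)       # nothing follows it
            if not 0xDC00 <= units[i + 1] <= 0xDFFF:
                return ("UTF16_ERR2", i)       # follower is not a low one
            i += 2
        elif 0xDC00 <= unit <= 0xDFFF:         # low surrogate on its own
            return ("UTF16_ERR3", i)
        else:
            i += 1
    return None
```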
+
+
+ Errors in UTF-32 strings
+
+ The following negative error codes are given for invalid UTF-32
+ strings:
+
+ PCRE2_ERROR_UTF32_ERR1 Surrogate character (0xd800 to 0xdfff)
+ PCRE2_ERROR_UTF32_ERR2 Code point is greater than 0x10ffff
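+
+ The two UTF-32 cases reduce to range tests on each code unit. The
+ sketch below is an illustration in Python, not PCRE2's validator;
+ check_utf32 is a hypothetical function that returns shortened error
+ names as strings rather than PCRE2's negative integer codes.
+

```python
def check_utf32(units):
    """Return None if the code units are valid, else (error_name, offset)."""
    for i, unit in enumerate(units):
        if 0xD800 <= unit <= 0xDFFF:      # surrogate code point
            return ("UTF32_ERR1", i)
        if unit > 0x10FFFF:               # beyond the Unicode range
            return ("UTF32_ERR2", i)
    return None
```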
+
+
+AUTHOR
+
+ Philip Hazel
+ University Computing Service
+ Cambridge, England.
+
+
+REVISION
+
+ Last updated: 02 September 2018
+ Copyright (c) 1997-2018 University of Cambridge.
+------------------------------------------------------------------------------
+
+