GNU regex is a cross-platform POSIX regular expression library for C, provided by GNU. Leaving aside the extension functions that GNU adds, the POSIX regex interface consists of four functions: regcomp, regerror, regexec and regfree. A single regexec call cannot find every position in a string that matches the pattern, so to find all matches we have to call it in a loop, stepping an offset through the string. Each search starts at the end offset of the previous match.
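The stepping loop described above can be sketched in a few lines. This is a minimal illustration, not the implementation developed in this post: the helper name count_matches is hypothetical, and the REG_NOTBOL flag on later iterations is an assumption of this sketch, added so that a pattern anchored with ^ does not match again in the middle of the string.

```c
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts the non-overlapping matches of pattern in input.
 * Returns the count, or -1 if the pattern does not compile or regexec fails. */
int count_matches(const char *input, const char *pattern)
{
    regex_t re;
    regmatch_t m;      /* only group 0 (the whole match) is recorded */
    size_t offset = 0; /* where the next search starts */
    int count = 0;
    int rc;

    if (regcomp(&re, pattern, REG_EXTENDED) != 0) {
        return -1;
    }
    for (;;) {
        /* Assumption of this sketch: after the first search, input + offset
         * is no longer the beginning of the line. */
        int eflags = (offset > 0) ? REG_NOTBOL : 0;
        rc = regexec(&re, input + offset, 1, &m, eflags);
        if (rc != 0) {
            break; /* REG_NOMATCH ends the loop normally */
        }
        ++count;
        if (m.rm_eo > 0) {
            /* rm_eo is relative to input + offset */
            offset += (size_t)m.rm_eo;
        } else if (input[offset] != '\0') {
            offset += 1; /* empty match: step one character forward */
        } else {
            break;       /* empty match at the end of the input */
        }
    }
    regfree(&re);
    return (rc == 0 || rc == REG_NOMATCH) ? count : -1;
}
```

Calling count_matches("hello,welcome to my party", "(we|par)([a-z]+)") should report two matches, "welcome" and "party".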
In my previous post, "C: GNU regex library (regex.h) regular expression call example", I showed how to match a regular expression with multiple capture groups and how to call regexec repeatedly. This post improves on that implementation and wraps the loop-matching logic in an easy-to-use function, rx_search.
This wrapper has practical value for me: a recent project runs on an embedded platform whose device SDK ships a GNU regex library, but a very old version. It has only the four functions regcomp, regerror, regexec and regfree, and lacks the re_search function found in newer versions. So to get multiple matches you have to implement the loop yourself.
Here is the implementation of rx_search (the function is spelled rx_serach in the code):
rx_serach
//************************************
// Finds all matches of a regular expression in a string.
// @param input     string to search
// @param pattern   regular expression
// @param groupcnt  number of capture groups to record, including the default group 0.
//                  When 0, the default is used: re_nsub + 1 of the compiled regex_t.
//                  regex_t.re_nsub is the number of parenthesized subexpressions,
//                  so re_nsub + 1 is at least the number of capture groups
//                  (including group 0) in the expression.
// @param eflags    execution flags for matching, see regexec
// @param _psmatch  [out] receives the positions of all matches
// @return the number of matches on success, 0 if nothing matched, -1 on failure.
//         The caller must call rx_search_match_uninit to free the allocated space.
//************************************
int rx_serach(const char* input, const char* pattern, size_t groupcnt, int eflags, search_match_t* _psmatch)
{
	if (NULL == input || NULL == pattern || NULL == _psmatch)
	{
		printf("%s:%d NULL ARGUMENT\n", __FILE__, __LINE__);
		return -1;
	}
	/* Start from an empty result so rx_search_match_uninit is safe on every return path */
	memset(_psmatch, 0, sizeof(search_match_t));
	regex_t reg;
	/* Compile the pattern; the compiled regex_t is then used by regexec */
	int c = regcomp(&reg, pattern, REG_EXTENDED);
	if (0 != c)
	{
		/* Compilation failed: regerror writes the message into regerrbuf.
		 * The last byte is set to 0 so the string stays terminated even if
		 * the message was truncated. */
		regerror(c, &reg, regerrbuf, sizeof(regerrbuf));
		regerrbuf[sizeof(regerrbuf) - 1] = '\0';
		printf("%s:%d %s\n", __FILE__, __LINE__, regerrbuf);
		return -1;
	}
	if (0 == groupcnt)
	{
		groupcnt = reg.re_nsub + 1;
	}
	c = rx_search_match_init(_psmatch, groupcnt);
	if (0 != c)
	{
		/* search_match_t initialization failed: release the compiled regex_t */
		regfree(&reg);
		return c;
	}
	/* Offset at which the next search starts */
	size_t offset = 0;
	/* One regexec call finds only one match, so step through the string:
	 * each search starts at the end offset of the previous match. */
	do
	{
		printf("MATCH start %d\n", (int)offset);
		/* Make sure the output buffer has room for one more match */
		regmatch_t* pmatch = rx_search_match_ensure(_psmatch, 1);
		if (NULL == pmatch)
		{
			printf("%s:%d MEMORY ERROR for rx_search_match_ensure\n", __FILE__, __LINE__);
			c = -1;
			break;
		}
		/* Start address of this search */
		const char* p = input + offset;
		/* Each regmatch_t records the start and end offsets of one capture group.
		 * If regexec is given no regmatch_t (nmatch is 0, pmatch is NULL), or fewer
		 * than the pattern has capture groups, it still matches normally but
		 * records none, or only some, of the positions. */
		c = regexec(&reg, p, _psmatch->groupcnt, pmatch, eflags);
		if (REG_NOMATCH == c)
		{
			/* No further match: end the loop */
			printf("MATCH FINISHED\n");
			break;
		}
		else if (0 == c)
		{
			_psmatch->matchcnt++;
			/* Print every capture group of this match */
			printf("%d MATCH (%d-%d)\n", (int)_psmatch->matchcnt, (int)pmatch[0].rm_so, (int)pmatch[0].rm_eo);
			for (size_t i = 0; i < _psmatch->groupcnt; ++i)
			{
				printf("group %d :<<", (int)i);
				print_str(p, pmatch[i].rm_so, pmatch[i].rm_eo);
				printf(">>\n");
			}
			/* End offset of group 0, relative to p */
			size_t eo = (size_t)pmatch[0].rm_eo;
			for (size_t i = 0; i < _psmatch->groupcnt; ++i)
			{
				/* Make the offsets relative to the start of input;
				 * groups that did not take part in the match keep -1 */
				if (pmatch[i].rm_so >= 0)
				{
					pmatch[i].rm_so += (int)offset;
					pmatch[i].rm_eo += (int)offset;
				}
			}
			if (0 == eo)
			{
				/* Empty match: stop at the end of input, otherwise step one
				 * character so the loop cannot run forever */
				if ('\0' == *p)
				{
					c = REG_NOMATCH;
					break;
				}
				eo = 1;
			}
			/* The next search starts at the end of group 0 of this match */
			offset += eo;
			continue;
		}
		else
		{
			/* regexec failed: print the error message and end the loop */
			regerror(c, &reg, regerrbuf, sizeof(regerrbuf));
			regerrbuf[sizeof(regerrbuf) - 1] = '\0';
			printf("%s\n", regerrbuf);
			c = -1;
			break;
		}
	} while (1);
	printf("%d MATCH FOUND\n", (int)_psmatch->matchcnt);
	/* regfree must be paired with a successful regcomp, otherwise memory leaks */
	regfree(&reg);
	/* REG_NOMATCH is the normal end-of-loop state */
	if (c != REG_NOMATCH)
	{
		/* On error, release the memory held by the search_match_t */
		rx_search_match_uninit(_psmatch);
		return c;
	}
	return (int)_psmatch->matchcnt;
}
search_match_t
Because the number of matches in a string cannot be known in advance, I designed a search_match_t structure to hold the match results. rx_search stores its results in a search_match_t.
/* Stores the data from repeated regexec matching */
typedef struct search_match_t
{
	/** Number of capture groups (including group 0) */
	size_t groupcnt;
	/** Number of matches that can be stored */
	size_t capacity;
	/** Number of matches found */
	size_t matchcnt;
	/* Match data, stored in the order the matches occur in the string.
	 * The array length is capacity * groupcnt; while rx_search runs,
	 * the array is expanded automatically as needed. */
	regmatch_t* pmatch;
} search_match_t;
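Since pmatch is one flat array of capacity * groupcnt entries, the entry for group g of match m sits at index m * groupcnt + g. The accessor below is a small sketch of that arithmetic; the name rx_match_group is hypothetical and not part of the code in this post, and the structure is repeated only so the sketch stands on its own.

```c
#include <regex.h>
#include <stddef.h>

/* Same layout as the structure shown in this post */
typedef struct search_match_t
{
    size_t groupcnt;    /* capture groups per match, including group 0 */
    size_t capacity;    /* matches the buffer can hold */
    size_t matchcnt;    /* matches actually stored */
    regmatch_t* pmatch; /* capacity * groupcnt entries */
} search_match_t;

/* Hypothetical accessor: returns the regmatch_t of capture group `group`
 * in match number `match` (both 0-based), or NULL when out of range. */
const regmatch_t* rx_match_group(const search_match_t* sm, size_t match, size_t group)
{
    if (NULL == sm || NULL == sm->pmatch)
    {
        return NULL;
    }
    if (match >= sm->matchcnt || group >= sm->groupcnt)
    {
        return NULL;
    }
    /* Pointer arithmetic counts regmatch_t elements, not bytes */
    return sm->pmatch + match * sm->groupcnt + group;
}
```

With three groups per match, for example, group 2 of the second match is element 1 * 3 + 2 = 5 of the array.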
rx_search_match_ensure
Because the number of matches cannot be known in advance, when rx_search finds more matches than the search_match_t.pmatch array can hold, the array is expanded automatically as needed. The following is the implementation of rx_search_match_ensure, the function that expands a search_match_t:
//************************************
// Expands a search_match_t so that it has free space for freecnt more matches.
// The newly added part of the buffer is zeroed.
// @param _psmatch
// @param freecnt   number of additional matches that must fit in the free space
//                  (each match needs groupcnt regmatch_t entries)
// @return the start of the first free regmatch_t on success, otherwise NULL
//************************************
static regmatch_t* rx_search_match_ensure(search_match_t* _psmatch, size_t freecnt)
{
	regmatch_t* newbuffer = NULL;
	size_t newsize = 0;
	size_t newcapacity = 0;

	if ((_psmatch == NULL) || (_psmatch->pmatch == NULL))
	{
		printf("%s:%d NULL ARGUMENT\n", __FILE__, __LINE__);
		return NULL;
	}
	/* matchcnt may equal capacity (buffer full) but must never exceed it */
	if (_psmatch->matchcnt > _psmatch->capacity)
	{
		printf("%s:%d INVALID matchcnt %d\n", __FILE__, __LINE__, (int)_psmatch->matchcnt);
		return NULL;
	}
	if (freecnt > (INT_MAX / 64))
	{
		printf("%s:%d TOO LARGE ARGUMENT freecnt %d\n", __FILE__, __LINE__, (int)freecnt);
		return NULL;
	}
	if (freecnt <= (_psmatch->capacity - _psmatch->matchcnt))
	{
		/* Enough free space already */
		return _psmatch->pmatch + (_psmatch->matchcnt * _psmatch->groupcnt);
	}
	/* Expand the capacity to a multiple of 16 */
	newcapacity = ((freecnt + _psmatch->matchcnt + 16 - 1) >> 4 << 4);
	newsize = newcapacity * _psmatch->groupcnt * sizeof(regmatch_t);

	newbuffer = (regmatch_t*)realloc(_psmatch->pmatch, newsize);
	if (newbuffer == NULL)
	{
		printf("%s:%d MEM ERROR\n", __FILE__, __LINE__);
		free(_psmatch->pmatch);
		_psmatch->capacity = 0;
		_psmatch->groupcnt = 0;
		_psmatch->matchcnt = 0;
		_psmatch->pmatch = NULL;
		return NULL;
	}
	/* Zero the part after the stored matches. The pointer offset is counted
	 * in regmatch_t elements, the memset length in bytes. */
	size_t oldcnt = _psmatch->matchcnt * _psmatch->groupcnt;
	size_t expsize = (newcapacity - _psmatch->matchcnt) * _psmatch->groupcnt * sizeof(regmatch_t);
	memset(newbuffer + oldcnt, 0, expsize);
	_psmatch->capacity = newcapacity;
	_psmatch->pmatch = newbuffer;
	printf("%s:%d pmatch buffer expand to %d match\n", __FILE__, __LINE__, (int)newcapacity);
	return _psmatch->pmatch + (_psmatch->matchcnt * _psmatch->groupcnt);
}
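The capacity calculation in the expansion function rounds the required number of matches up to the next multiple of 16 with a pair of shifts. The sketch below isolates just that arithmetic; the helper name grown_capacity is hypothetical and used only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper: the capacity needed to hold matchcnt stored matches
 * plus freecnt new ones, rounded up to a multiple of 16.
 * Adding 15 and clearing the low four bits rounds up without a division. */
size_t grown_capacity(size_t matchcnt, size_t freecnt)
{
    return (freecnt + matchcnt + 16 - 1) >> 4 << 4;
}
```

For a full buffer of 16 matches and one more match requested, this gives 32; for 15 stored matches plus one new match it stays at 16.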
rx_search_match_uninit
After rx_search returns, the caller is responsible for releasing the memory allocated in the search_match_t. A separate function, rx_search_match_uninit, does this:
//************************************
// Releases the space allocated in a search_match_t
// @param _psmatch
//************************************
void rx_search_match_uninit(search_match_t* _psmatch)
{
	if (_psmatch)
	{
		free(_psmatch->pmatch);
		memset(_psmatch, 0, sizeof(search_match_t));
	}
}
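Because the release function frees the buffer and then zeroes the whole structure, calling it on a zeroed structure, or calling it twice, is harmless: free(NULL) does nothing. The sketch below demonstrates this under the assumption that the caller zero-initializes the structure before use; the function release_is_repeatable is hypothetical, and the structure and release function are repeated only so the sketch is self-contained.

```c
#include <regex.h>
#include <stdlib.h>
#include <string.h>

typedef struct search_match_t
{
    size_t groupcnt;
    size_t capacity;
    size_t matchcnt;
    regmatch_t* pmatch;
} search_match_t;

/* Same body as the release function shown in this post */
void rx_search_match_uninit(search_match_t* _psmatch)
{
    if (_psmatch)
    {
        free(_psmatch->pmatch);
        memset(_psmatch, 0, sizeof(search_match_t));
    }
}

/* Hypothetical demo: returns 0 when releasing an empty structure and
 * releasing twice both leave the structure empty, -1 otherwise. */
int release_is_repeatable(void)
{
    search_match_t sm;
    memset(&sm, 0, sizeof(sm));  /* assumption: caller starts from a zeroed struct */
    rx_search_match_uninit(&sm); /* nothing allocated yet: free(NULL) is a no-op */

    sm.groupcnt = 1;
    sm.capacity = 4;
    sm.pmatch = (regmatch_t*)calloc(sm.capacity * sm.groupcnt, sizeof(regmatch_t));
    if (NULL == sm.pmatch)
    {
        return -1;
    }
    rx_search_match_uninit(&sm); /* frees the buffer and zeroes the fields */
    rx_search_match_uninit(&sm); /* second call sees pmatch == NULL */
    return (NULL == sm.pmatch && 0 == sm.capacity && 0 == sm.groupcnt) ? 0 : -1;
}
```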
Complete code
The following is the complete code (including a test), which can be compiled and run directly with MSVC or GCC:
rx_search_test.c
/************************************************************************/
/* rx_search_test.c                                                     */
/* GNU Regex test                                                       */
/* rx_serach calls regexec repeatedly to find every match of a regular  */
/* expression in a string and saves the results in a search_match_t     */
/* author guyadong                                                      */
/************************************************************************/
#include <stdio.h>
#include <stdlib.h>
#include <regex.h>
#include <limits.h>
#include <string.h>

/** Buffer for regex error messages */
static char regerrbuf[256];

/** Prints the characters of input in the range [_start, _end) to the console */
void print_str(const char* input, size_t _start, size_t _end)
{
	if (input)
	{
		for (size_t i = _start; i < _end; ++i)
		{
			printf("%c", input[i]);
		}
	}
}

/* Stores the data from repeated regexec matching */
typedef struct search_match_t
{
	/** Number of capture groups (including group 0) */
	size_t groupcnt;
	/** Number of matches that can be stored */
	size_t capacity;
	/** Number of matches found */
	size_t matchcnt;
	/* Match data, stored in the order the matches occur in the string.
	 * The array length is capacity * groupcnt; while rx_serach runs,
	 * the array is expanded automatically as needed. */
	regmatch_t* pmatch;
} search_match_t;

//************************************
// Initializes a search_match_t: allocates zeroed memory with an initial
// capacity of 16 matches. When the number of matches exceeds the capacity,
// rx_search_match_ensure expands it automatically.
// @param _psmatch
// @param groupcnt
// @return 0 on success, otherwise -1
//************************************
static int rx_search_match_init(search_match_t* _psmatch, size_t groupcnt)
{
	if (NULL == _psmatch)
	{
		return -1;
	}
	_psmatch->capacity = 16;
	_psmatch->matchcnt = 0;
	_psmatch->groupcnt = groupcnt;
	size_t size = sizeof(regmatch_t) * groupcnt * _psmatch->capacity;
	_psmatch->pmatch = (regmatch_t*)malloc(size);
	if (!_psmatch->pmatch)
	{
		printf("%s:%d MEM ERROR\n", __FILE__, __LINE__);
		return -1;
	}
	/** Zero the memory */
	memset(_psmatch->pmatch, 0, size);
	return 0;
}

//************************************
// Expands a search_match_t so that it has free space for freecnt more matches.
// The newly added part of the buffer is zeroed.
// @param _psmatch
// @param freecnt   number of additional matches that must fit in the free space
//                  (each match needs groupcnt regmatch_t entries)
// @return the start of the first free regmatch_t on success, otherwise NULL
//************************************
static regmatch_t* rx_search_match_ensure(search_match_t* _psmatch, size_t freecnt)
{
	regmatch_t* newbuffer = NULL;
	size_t newsize = 0;
	size_t newcapacity = 0;

	if ((_psmatch == NULL) || (_psmatch->pmatch == NULL))
	{
		printf("%s:%d NULL ARGUMENT\n", __FILE__, __LINE__);
		return NULL;
	}
	/* matchcnt may equal capacity (buffer full) but must never exceed it */
	if (_psmatch->matchcnt > _psmatch->capacity)
	{
		printf("%s:%d INVALID matchcnt %d\n", __FILE__, __LINE__, (int)_psmatch->matchcnt);
		return NULL;
	}
	if (freecnt > (INT_MAX / 64))
	{
		printf("%s:%d TOO LARGE ARGUMENT freecnt %d\n", __FILE__, __LINE__, (int)freecnt);
		return NULL;
	}
	if (freecnt <= (_psmatch->capacity - _psmatch->matchcnt))
	{
		/* Enough free space already */
		return _psmatch->pmatch + (_psmatch->matchcnt * _psmatch->groupcnt);
	}
	/* Expand the capacity to a multiple of 16 */
	newcapacity = ((freecnt + _psmatch->matchcnt + 16 - 1) >> 4 << 4);
	newsize = newcapacity * _psmatch->groupcnt * sizeof(regmatch_t);

	newbuffer = (regmatch_t*)realloc(_psmatch->pmatch, newsize);
	if (newbuffer == NULL)
	{
		printf("%s:%d MEM ERROR\n", __FILE__, __LINE__);
		free(_psmatch->pmatch);
		_psmatch->capacity = 0;
		_psmatch->groupcnt = 0;
		_psmatch->matchcnt = 0;
		_psmatch->pmatch = NULL;
		return NULL;
	}
	/* Zero the part after the stored matches. The pointer offset is counted
	 * in regmatch_t elements, the memset length in bytes. */
	size_t oldcnt = _psmatch->matchcnt * _psmatch->groupcnt;
	size_t expsize = (newcapacity - _psmatch->matchcnt) * _psmatch->groupcnt * sizeof(regmatch_t);
	memset(newbuffer + oldcnt, 0, expsize);
	_psmatch->capacity = newcapacity;
	_psmatch->pmatch = newbuffer;
	printf("%s:%d pmatch buffer expand to %d match\n", __FILE__, __LINE__, (int)newcapacity);
	return _psmatch->pmatch + (_psmatch->matchcnt * _psmatch->groupcnt);
}

//************************************
// Releases the space allocated in a search_match_t
// @param _psmatch
//************************************
void rx_search_match_uninit(search_match_t* _psmatch)
{
	if (_psmatch)
	{
		free(_psmatch->pmatch);
		memset(_psmatch, 0, sizeof(search_match_t));
	}
}

//************************************
// Finds all matches of a regular expression in a string.
// @param input     string to search
// @param pattern   regular expression
// @param groupcnt  number of capture groups to record, including the default group 0.
//                  When 0, the default is used: re_nsub + 1 of the compiled regex_t.
//                  regex_t.re_nsub is the number of parenthesized subexpressions,
//                  so re_nsub + 1 is at least the number of capture groups
//                  (including group 0) in the expression.
// @param eflags    execution flags for matching, see regexec
// @param _psmatch  [out] receives the positions of all matches
// @return the number of matches on success, 0 if nothing matched, -1 on failure.
//         The caller must call rx_search_match_uninit to free the allocated space.
//************************************
int rx_serach(const char* input, const char* pattern, size_t groupcnt, int eflags, search_match_t* _psmatch)
{
	if (NULL == input || NULL == pattern || NULL == _psmatch)
	{
		printf("%s:%d NULL ARGUMENT\n", __FILE__, __LINE__);
		return -1;
	}
	/* Start from an empty result so rx_search_match_uninit is safe on every return path */
	memset(_psmatch, 0, sizeof(search_match_t));
	regex_t reg;
	/* Compile the pattern; the compiled regex_t is then used by regexec */
	int c = regcomp(&reg, pattern, REG_EXTENDED);
	if (0 != c)
	{
		/* Compilation failed: regerror writes the message into regerrbuf.
		 * The last byte is set to 0 so the string stays terminated even if
		 * the message was truncated. */
		regerror(c, &reg, regerrbuf, sizeof(regerrbuf));
		regerrbuf[sizeof(regerrbuf) - 1] = '\0';
		printf("%s:%d %s\n", __FILE__, __LINE__, regerrbuf);
		return -1;
	}
	if (0 == groupcnt)
	{
		groupcnt = reg.re_nsub + 1;
	}
	c = rx_search_match_init(_psmatch, groupcnt);
	if (0 != c)
	{
		/* search_match_t initialization failed: release the compiled regex_t */
		regfree(&reg);
		return c;
	}
	/* Offset at which the next search starts */
	size_t offset = 0;
	/* One regexec call finds only one match, so step through the string:
	 * each search starts at the end offset of the previous match. */
	do
	{
		printf("MATCH start %d\n", (int)offset);
		/* Make sure the output buffer has room for one more match */
		regmatch_t* pmatch = rx_search_match_ensure(_psmatch, 1);
		if (NULL == pmatch)
		{
			printf("%s:%d MEMORY ERROR for rx_search_match_ensure\n", __FILE__, __LINE__);
			c = -1;
			break;
		}
		/* Start address of this search */
		const char* p = input + offset;
		/* Each regmatch_t records the start and end offsets of one capture group.
		 * If regexec is given no regmatch_t (nmatch is 0, pmatch is NULL), or fewer
		 * than the pattern has capture groups, it still matches normally but
		 * records none, or only some, of the positions. */
		c = regexec(&reg, p, _psmatch->groupcnt, pmatch, eflags);
		if (REG_NOMATCH == c)
		{
			/* No further match: end the loop */
			printf("MATCH FINISHED\n");
			break;
		}
		else if (0 == c)
		{
			_psmatch->matchcnt++;
			/* Print every capture group of this match */
			printf("%d MATCH (%d-%d)\n", (int)_psmatch->matchcnt, (int)pmatch[0].rm_so, (int)pmatch[0].rm_eo);
			for (size_t i = 0; i < _psmatch->groupcnt; ++i)
			{
				printf("group %d :<<", (int)i);
				print_str(p, pmatch[i].rm_so, pmatch[i].rm_eo);
				printf(">>\n");
			}
			/* End offset of group 0, relative to p */
			size_t eo = (size_t)pmatch[0].rm_eo;
			for (size_t i = 0; i < _psmatch->groupcnt; ++i)
			{
				/* Make the offsets relative to the start of input;
				 * groups that did not take part in the match keep -1 */
				if (pmatch[i].rm_so >= 0)
				{
					pmatch[i].rm_so += (int)offset;
					pmatch[i].rm_eo += (int)offset;
				}
			}
			if (0 == eo)
			{
				/* Empty match: stop at the end of input, otherwise step one
				 * character so the loop cannot run forever */
				if ('\0' == *p)
				{
					c = REG_NOMATCH;
					break;
				}
				eo = 1;
			}
			/* The next search starts at the end of group 0 of this match */
			offset += eo;
			continue;
		}
		else
		{
			/* regexec failed: print the error message and end the loop */
			regerror(c, &reg, regerrbuf, sizeof(regerrbuf));
			regerrbuf[sizeof(regerrbuf) - 1] = '\0';
			printf("%s\n", regerrbuf);
			c = -1;
			break;
		}
	} while (1);
	printf("%d MATCH FOUND\n", (int)_psmatch->matchcnt);
	/* regfree must be paired with a successful regcomp, otherwise memory leaks */
	regfree(&reg);
	/* REG_NOMATCH is the normal end-of-loop state */
	if (c != REG_NOMATCH)
	{
		/* On error, release the memory held by the search_match_t */
		rx_search_match_uninit(_psmatch);
		return c;
	}
	return (int)_psmatch->matchcnt;
}

int main()
{
	/** String to search */
	const char* inputstr = "hello,welcome to my party";
	/** Regular expression */
	const char* pattern = "(we|par)([a-z]+)";
	printf("==rx_serach Test==\n");
	printf("Pattern :%s\n", pattern);
	printf("Input String:%s\n", inputstr);
	search_match_t _smatch;
	int c = rx_serach(inputstr, pattern, 0, 0, &_smatch);
	if (c > 0)
	{
		/* Print all match results recorded in the search_match_t */
		printf("====MATCH RESULT====\n");
		size_t off = 0;
		for (int i = 0; i < c; ++i, off += _smatch.groupcnt)
		{
			printf("MATCH %d\n", i);
			regmatch_t* gm = _smatch.pmatch + off;
			for (size_t g = 0; g < _smatch.groupcnt; ++g)
			{
				printf("\tgroup %d <<", (int)g);
				print_str(inputstr, gm[g].rm_so, gm[g].rm_eo);
				printf(">>\n");
			}
		}
	}
	/* After calling rx_serach, rx_search_match_uninit must be called to free
	 * the allocated memory, otherwise a memory leak occurs */
	rx_search_match_uninit(&_smatch);
	return 0;
}
Compilation example
gcc/linux
Because the POSIX regex functions are built into the C library on Linux (glibc), the code above compiles under Linux without any extra library:
# compile
$ gcc rx_search_test.c
# Run test
$ ./a.out
==rx_serach Test==
Pattern :(we|par)([a-z]+)
Input String:hello,welcome to my party
MATCH start 0
1 MATCH (6-13)
group 0 :<<welcome>>
group 1 :<<we>>
group 2 :<<lcome>>
MATCH start 13
2 MATCH (7-12)
group 0 :<<party>>
group 1 :<<par>>
group 2 :<<ty>>
MATCH start 25
MATCH FINISHED
2 MATCH FOUND
====MATCH RESULT====
MATCH 0
	group 0 <<welcome>>
	group 1 <<we>>
	group 2 <<lcome>>
MATCH 1
	group 0 <<party>>
	group 1 <<par>>
	group 2 <<ty>>
MSVC/Windows
MSVC does not provide a GNU regex library. For the library needed to compile the code above under Windows, see my other post, "Using GNU regex (regular expression C language interface regex.h) under MSVC".
The complete code above and the GNU regex library for MSVC are stored in a Gitee repository: https://gitee.com/l0km/libgnurx-msvc.git
You can run the following commands to compile and run rx_search_test.c:
# Must be run from the VS2015 developer command prompt (CMD); otherwise the nmake command cannot be found
J:\>git clone https://gitee.com/l0km/libgnurx-msvc.git
J:\>cd libgnurx-msvc
J:\libgnurx-msvc>nmake /f NMakefile test3
Microsoft (R) Program Maintenance Utility Version 14.00.24210.0
Copyright (C) Microsoft Corporation. All rights reserved.
        cl.exe /D WIN32 /D _WINDOWS /I . /MD /wd4819 rx_search_test.c regex.lib
Microsoft (R) C/C++ Optimizing Compiler Version 19.00.24215.1 for x64
Copyright (C) Microsoft Corporation. All rights reserved.
rx_search_test.c
Microsoft (R) Incremental Linker Version 14.00.24215.1
Copyright (C) Microsoft Corporation. All rights reserved.
/out:rx_search_test.exe
rx_search_test.obj
regex.lib
        rx_search_test.exe
==rx_serach Test==
Pattern :(we|par)([a-z]+)
Input String:hello,welcome to my party
MATCH start 0
1 MATCH (6-13)
group 0 :<<welcome>>
group 1 :<<we>>
group 2 :<<lcome>>
MATCH start 13
2 MATCH (7-12)
group 0 :<<party>>
group 1 :<<par>>
group 2 :<<ty>>
MATCH start 25
MATCH FINISHED
2 MATCH FOUND
====MATCH RESULT====
MATCH 0
	group 0 <<welcome>>
	group 1 <<we>>
	group 2 <<lcome>>
MATCH 1
	group 0 <<party>>
	group 1 <<par>>
	group 2 <<ty>>
J:\libgnurx-msvc>