gram_grep is a search tool that goes far beyond the capabilities of grep. Searches can span multiple lines, can be chained together in a variety of ways and can even utilise bison style grammars.
Maybe you want a search to ignore comments, or search only within strings. Maybe you have code that has SQL within strings and that SQL itself contains strings that you want to search in. The possibilities are endless and there is no limit to the sequence of sub-searches.
For example, here is how you would search for memory_file outside of C and C++ style comments:
gram_grep -vE "[/][/].*|[/][*](?s:.)*?[*][/]" -E memory_file main.cpp
Note that '^' is the escape character in the command prompt, so if you want to use the '^' character you will have to double it up. The same goes for double quotes ('"'), and in addition any regex using them will need surrounding with double quotes, as will any regex containing the pipe symbol ('|') (although in this case you do not need to double it up).
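To make the idea of a negated search followed by an ordinary search more concrete, here is a minimal C++ sketch. It is not gram_grep's implementation: found_outside() and comments are hypothetical names, std::regex is used in place of the DFA regex engine, and [\s\S] stands in for (?s:.) which std::regex does not support.

```cpp
#include <cstddef>
#include <regex>
#include <string>

// Hypothetical helper (not part of gram_grep): returns true if `needle`
// occurs in `text` outside every region matched by `skip`.
bool found_outside(const std::string& text, const std::regex& skip,
                   const std::string& needle)
{
    std::string kept;
    std::size_t pos = 0;
    const std::sregex_iterator end;

    for (std::sregex_iterator it(text.begin(), text.end(), skip);
         it != end; ++it)
    {
        const std::size_t start = static_cast<std::size_t>(it->position());

        // Keep the text in front of the skipped region
        kept.append(text, pos, start - pos);
        // Separator so that text either side of the region is not joined up
        kept += '\n';
        pos = start + static_cast<std::size_t>(it->length());
    }

    // Keep whatever follows the last skipped region
    kept.append(text, pos, std::string::npos);
    return kept.find(needle) != std::string::npos;
}

// C and C++ style comments, rewritten for std::regex
const std::regex comments(R"re(//.*|/\*[\s\S]*?\*/)re");
```

Calling found_outside(source, comments, "memory_file") then plays the role of the -vE ... -E memory_file pair above. Like the command line version, this sketch does not yet know about strings, so a "//" inside a string literal would be treated as a comment.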
It quickly gets tedious trying to correctly escape characters in a command shell, so we switch to a configuration file to also exclude strings:
gram_grep -f nosc.g main.cpp
The config file nosc.g looks like this:
%%
%%
%%
'([^'\\]|\\.)*'                  skip()
["]([^"\\]|\\.)*["]              skip()
R["][(](?s:.)*?[)]["]            skip()
[/][/].*|[/][*](?s:.)*?[*][/]    skip()
memory_file                      1
%%
Note how character literals are also skipped, just in case there is one containing a double quote! Also note how we have moved our search for memory_file directly into the config file, as this part of the config lists regexes that are passed to a lexer generator. This means that we specify the things we want to match (using 1 for the id in this case) or explicitly skip (using skip() in this case) all within the same section. This mode alone has already given us far more searching power than traditional techniques.
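The following C++ sketch approximates what a regex section with skip() rules and a matching id does. It is an illustration only, under several assumptions: Rule, scan() and nosc_rules() are hypothetical names, id 0 is used here to mean skip(), std::regex replaces the generated lexer, and ties between rules are settled by taking the longest match with earlier rules winning, which may differ in detail from the real lexer generator.

```cpp
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// One lexer rule: a regex plus an id (0 is used here to mean skip()).
struct Rule
{
    std::regex re;
    int id;
};

// Walks through `text`, trying every rule at the current position and
// taking the longest match (earlier rules win ties). Text matched by a
// skip rule is discarded; text matched by any other rule is recorded.
std::vector<std::string> scan(const std::string& text,
                              const std::vector<Rule>& rules)
{
    std::vector<std::string> hits;
    auto pos = text.cbegin();

    while (pos != text.cend())
    {
        std::ptrdiff_t best_len = 0;
        int best_id = 0;

        for (const Rule& rule : rules)
        {
            std::smatch m;

            // match_continuous anchors the match at the current position
            if (std::regex_search(pos, text.cend(), m, rule.re,
                    std::regex_constants::match_continuous) &&
                m.length() > best_len)
            {
                best_len = m.length();
                best_id = rule.id;
            }
        }

        if (best_len == 0)
        {
            ++pos; // No rule matched here: move on one character
            continue;
        }

        if (best_id != 0)
            hits.emplace_back(pos, pos + best_len);

        pos += best_len;
    }

    return hits;
}

// The rules from nosc.g, rewritten for std::regex
std::vector<Rule> nosc_rules()
{
    return {
        { std::regex(R"re('([^'\\]|\\.)*')re"), 0 },
        { std::regex(R"re("([^"\\]|\\.)*")re"), 0 },
        { std::regex(R"re(R"\([\s\S]*?\)")re"), 0 },
        { std::regex(R"re(//.*|/\*[\s\S]*?\*/)re"), 0 },
        { std::regex("memory_file"), 1 }
    };
}
```

The point of the design is that skipping and matching happen in a single pass: by the time memory_file is tried, any string, character literal or comment starting at that position has already been consumed.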
If we wanted to only search in strings or comments, we would use 1 instead of skip() for those regexes and omit the memory_file line altogether. We would then pass memory_file with -E or -P as a command line parameter.
Note that it is possible to issue a command to check out files from source control:
gram_grep -r -E "v4\.5\.1" -replace v4.5.2 -o -checkout "tf.exe checkout $1" *.csproj
The above example would replace v4.5.1 with v4.5.2 in *.csproj, checking out the files from TFS as they match.
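As a rough illustration of the $1 placeholder in the checkout command, the substitution can be pictured as below. This is a sketch under assumptions, not the tool's code: expand_command() is a hypothetical name, and whether gram_grep quotes or otherwise escapes the pathname is not covered here.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: substitutes every "$1" in a command template with
// the pathname of the matching file.
std::string expand_command(std::string command, const std::string& pathname)
{
    const std::string placeholder = "$1";
    std::size_t pos = 0;

    while ((pos = command.find(placeholder, pos)) != std::string::npos)
    {
        command.replace(pos, placeholder.size(), pathname);
        pos += pathname.size(); // Carry on after the inserted pathname
    }

    return command;
}
```

For example, expand_command("tf.exe checkout $1", "src/app.csproj") yields the command "tf.exe checkout src/app.csproj".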
Note that there are also switches -startup and -shutdown where you can run other commands at startup and exit respectively if required (e.g., "tf.exe workspace /new /collection:http://... refactor /noprompt" and "tf.exe workspace /delete /collection:http://... refactor /noprompt").
The config file has the following format:
<grammar/lexer directives>
%%
<grammar>
%%
<regex macros>
%%
<regexes>
%%
As implied above, the grammar/lexer directives, grammar and regex macros are all optional.
Here is an example of a simple grammar that recognises C++ strings split over multiple lines (strings.g):
/*
NOTE: in order to successfully find strings it is necessary to filter out
comments and chars. As a subtlety, comments could contain apostrophes (or
even unbalanced double quotes in an extreme case)!
*/
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
["]([^"\\]|\\.)*["]                         String
R["][(](?s:.)*?[)]["]                       RawString
'([^'\\]|\\.)*'                             skip()
[ \t\r\n]+|[/][/].*|[/][*](?s:.)*?[*][/]    skip()
%%
Although the grammar is just about as simple as it gets, note the scripting added. Each string fragment is joined into a match, which can then be searched by a following search. This means we can search within C++ strings without worrying about how they are split over lines.
Note how we have switched from using 1 as the matching regex id to names which we have specified using %token and used in the grammar.
Example usage:
gram_grep -f sample_configs/strings.g -E memory_file main.cpp
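The effect of the scripting in strings.g can be sketched in C++ as follows. This is only an illustration of how substr($n, <omit from left>, <omit from right>) and match += combine the fragments; omit_ends() and join_literals() are hypothetical names, and the tokens are assumed to have been produced by the lexer already, with raw strings recognised simply by their leading R.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Mirrors substr($n, <omit from left>, <omit from right>): drops `left`
// characters from the front of the token and `right` from the back.
std::string omit_ends(const std::string& token, std::size_t left,
                      std::size_t right)
{
    if (left + right >= token.size())
        return std::string();

    return token.substr(left, token.size() - left - right);
}

// Joins adjacent string literal tokens into a single match, as the four
// grammar rules do. A token starting with R is treated as a raw string,
// so R"( and )" are removed instead of the plain quotes.
std::string join_literals(const std::vector<std::string>& tokens)
{
    std::string match;

    for (const std::string& token : tokens)
    {
        const bool raw = !token.empty() && token[0] == 'R';

        match += raw ? omit_ends(token, 3, 2) : omit_ends(token, 1, 1);
    }

    return match;
}
```

Given the two tokens "memory" and "_file" (each including its quotes), the joined match is memory_file, which is why the following -E memory_file search succeeds even when the literal is split over lines.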
The full list of scripting commands is given below. You can see their use in the more sophisticated examples that follow later. $n, $from and $to refer to the item in the production you are interested in (numbering starts at 1).
erase($n);
erase($from, $to);
erase($from.second, $to.first);
insert($n, 'text');
insert($n.second, 'text');
match = $n;
match = substr($n, <omit from left>, <omit from right>);
match += $n;
match += substr($n, <omit from left>, <omit from right>);
replace($n, 'text');
replace($from, $to, 'text');
replace($from.second, $to.first, 'text');
replace_all($n, 'regex', 'text');
By default, the entire grammar will match. However, there are times when you are only interested in whether specific parts of your grammar match. If you want to match only on particular grammar rules, use {} just before the terminating semi-colon for that rule. This technique is shown in a later example.
Most of the time, the only grammar/lexer directive you will care about will be %token, although others are supported (for example %x, which declares the lexer states used in a later example).
The following command line switches are supported:
--help (Shows help)
-checkout <checkout command (include $1 for pathname)>
-E <regex> (Search using DFA regex)
-exclude <wildcard> (Exclude any pathname matching wildcard)
-f <config file> (Search using config file)
-force_write (If a file is read only, force it to be writable)
-hits (Show hit count per file)
-i (Case insensitive searching)
-l (Output pathname only)
-o (Output changes to matching file)
-P <regex> (Search using std::regex)
-r, -R, --recursive (Recurse subdirectories)
-replace <Replacement literal text>
-shutdown <command to run when exiting>
-startup <command to run at startup>
-utf8 (In the absence of a BOM assume UTF-8)
-vE <regex> (Search using DFA regex, negated: match all text other than regex)
-VE <regex> (Search using DFA regex, all negated: match if regex not found)
-vf <config file> (Search using config file, negated: match all text other than config)
-Vf <config file> (Search using config file, all negated: match if config not found)
-vP <regex> (Search using std::regex, negated: match all text other than regex)
-VP <regex> (Search using std::regex, all negated: match if regex not found)
-writable (Only process files that are writable)
<pathname> ... (Files to search (wildcards supported))
If an input file has a BOM (byte order marker), then that will be recognised. In the case of UTF-16, the contents will be automatically converted to UTF-8 in memory to allow uniform processing.
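For readers unfamiliar with byte order markers, the recognition step can be sketched as below. This is a minimal illustration rather than gram_grep's own code: Encoding and detect_bom() are hypothetical names, and only the UTF-8 and UTF-16 markers mentioned above are handled.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { unknown, utf8, utf16_le, utf16_be };

// Inspects the first bytes of a buffer for a byte order marker.
// Only UTF-8 and UTF-16 are considered; a UTF-32 LE marker (FF FE 00 00)
// would be reported as UTF-16 LE by this sketch.
Encoding detect_bom(const std::string& bytes)
{
    const auto byte = [&bytes](std::size_t i)
    {
        return static_cast<unsigned char>(bytes[i]);
    };

    if (bytes.size() >= 3 &&
        byte(0) == 0xEF && byte(1) == 0xBB && byte(2) == 0xBF)
    {
        return Encoding::utf8;
    }

    if (bytes.size() >= 2 && byte(0) == 0xFF && byte(1) == 0xFE)
        return Encoding::utf16_le;

    if (bytes.size() >= 2 && byte(0) == 0xFE && byte(1) == 0xFF)
        return Encoding::utf16_be;

    return Encoding::unknown; // No BOM: the caller decides what to assume
}
```

A result of Encoding::unknown corresponds to the case where the -utf8 switch becomes relevant.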
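Unicode support can be enabled with the -utf8 switch. Two things happen with this switch enabled:
- In the absence of a BOM, input files are assumed to be UTF-8.
- Unicode support is turned on for the lexer based searches (-E, -vE, -VE, -f, -vf, -Vf). Note that the std::regex support (-P, -vP, -VP) does not currently support Unicode.
insert.g: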
%token INSERT INTO Name String VALUES
%%
start: insert;
insert: INSERT into name VALUES;
into: INTO | %empty;
name: Name | Name '.' Name | Name '.' Name '.' Name;
%%
%%
(?i:INSERT)                                             INSERT
(?i:INTO)                                               INTO
(?i:VALUES)                                             VALUES
[.]                                                     '.'
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\])    Name
'([^']|'')*'                                            String
\s+|--.*|[/][*](?s:.)*?[*][/]                           skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/insert.g *.sql
First the string extraction (strings.g):
%token RawString String
%%
list: String { match = substr($1, 1, 1); };
list: RawString { match = substr($1, 3, 2); };
list: list String { match += substr($2, 1, 1); };
list: list RawString { match += substr($2, 3, 2); };
%%
%%
["]([^"\\]|\\.)*["]                         String
R["][(](?s:.)*?[)]["]                       RawString
'([^'\\]|\\.)*'                             skip()
[ \t\r\n]+|[/][/].*|[/][*](?s:.)*?[*][/]    skip()
%%
Or if we wanted to scan C#:
%token String VString
%%
list: String { match = substr($1, 1, 1); };
list: VString { match = substr($1, 2, 1); };
list: list '+' String { match += substr($3, 1, 1); };
list: list '+' VString { match += substr($3, 2, 1); };
%%
ws [ \t\r\n]+
%%
[+]                                   '+'
[\"]([^"\\]|\\.)*[\"]                 String
@[\"]([^\"]|[\"][\"])*["]             VString
'([^'\\]|\\.)*'                       skip()
{ws}|[/][/].*|[/][*](?s:.)*?[*][/]    skip()
%%
Now the grammar to search inside the strings (merge.g):
%token AS Integer INTO MERGE Name PERCENT TOP USING
%%
merge: MERGE opt_top opt_into name opt_alias USING;
opt_top: %empty | TOP '(' Integer ')' opt_percent;
opt_percent: %empty | PERCENT;
opt_into: %empty | INTO;
name: Name | Name '.' Name | Name '.' Name '.' Name;
opt_alias: %empty | opt_as Name;
opt_as: %empty | AS;
%%
%%
(?i:AS)                                                 AS
(?i:INTO)                                               INTO
(?i:MERGE)                                              MERGE
(?i:PERCENT)                                            PERCENT
(?i:TOP)                                                TOP
(?i:USING)                                              USING
\.                                                      '.'
\(                                                      '('
\)                                                      ')'
\d+                                                     Integer
(?i:[a-z_][a-z0-9@$#_]*|\[[a-z_][a-z0-9@$#_]*[ ]*\])    Name
\s+                                                     skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/strings.g -f sample_configs/merge.g *.cpp
Note the use of {} here to specify that we only care when the rule item: Name; matches.
%token Bool Char Name NULLPTR Number String Type
%%
start: decl;
decl: Type list ';';
list: item | list ',' item;
item: Name {};
item: Name '=' value;
value: Bool | Char | Number | NULLPTR | String;
%%
NAME [_A-Za-z][_0-9A-Za-z]*
%%
=                                                  '='
,                                                  ','
;                                                  ';'
true|TRUE|false|FALSE                              Bool
nullptr                                            NULLPTR
BOOL|BSTR|BYTE|COLORREF|D?WORD|DWORD_PTR           Type
DROPEFFECT|HACCEL|HANDLE|HBITMAP|HBRUSH            Type
HCRYPTHASH|HCRYPTKEY|HCRYPTPROV|HCURSOR|HDBC       Type
HICON|HINSTANCE|HMENU|HMODULE|HSTMT|HTREEITEM      Type
HWND|LPARAM|LPCTSTR|LPDEVMODE|POSITION|SDWORD      Type
SQLHANDLE|SQLINTEGER|SQLSMALLINT|UINT|U?INT_PTR    Type
UWORD|WPARAM                                       Type
bool|(unsigned\s+)?char|double|float               Type
(unsigned\s+)?int((32|64)_t)?|long|size_t          Type
{NAME}(\s*::\s*{NAME})*(\s*[*])+                   Type
{NAME}                                             Name
-?\d+([.]\d+)?                                     Number
'([^'\\]|\\.)*'                                    Char
["]([^\"\\]|\\.)*["]                               String
[ \t\r\n]+|[/][/].*|[/][*](?s:.)*?[*][/]           skip()
%%
The command line looks like this:
gram_grep -r -f sample_configs/uninit.g *.h
Note the use of a variety of scripting commands:
%token Integer Name RawString String
%%
start: '(' format list ')' '.' 'str' '(' ')'
    /* Erase the first "(" and the trailing ".str()" */
    { erase($1);
    erase($5, $8); };
start: 'str' '(' format list ')'
    /* Erase "str(" */
    { erase($1, $2); };
format: 'boost' '::' 'format' '(' string ')'
    /* Replace "boost" with "std" */
    /* Replace the format specifiers within the strings */
    { replace($1, 'std');
    replace_all($5, '%(\d+[Xdsx])', '{:$1}');
    replace_all($5, '%((?:\d+)?\.\d+f)', '{:$1}');
    replace_all($5, '%x', '{:x}');
    replace_all($5, '%[ds]', '{}');
    replace_all($5, '%%', '%');
    erase($6); };
string: String;
string: RawString;
string: string String;
string: string RawString;
list: %empty;
list: list '%' param
    /* Replace "%" with ", " */
    { replace($2, ', '); };
param: Integer;
param: name
    /* Replace any trailing ".c_str()" calls with "" */
    { replace_all($1, '\.c_str\(\)$', ''); };
name: Name opt_func | name deref Name opt_func;
opt_func: %empty | '(' opt_param ')';
deref: '.' | '->' | '::';
opt_param: %empty | Integer | name;
%%
%%
\(                              '('
\)                              ')'
\.                              '.'
%                               '%'
::                              '::'
->                              '->'
boost                           'boost'
format                          'format'
str                             'str'
-?\d+                           Integer
\"([^"\\]|\\.)*\"               String
R\"\((?s:.)*?\)\"               RawString
'([^'\\]|\\.)*'                 skip()
[_a-zA-Z][_0-9a-zA-Z]*          Name
\s+|\/\/.*|\/\*(?s:.)*?\*\/     skip()
%%
The command line looks like this:
gram_grep -o -r -f format.g *.cpp
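The five replace_all() calls in the format rule above are applied one after the other to the format string. As a rough check of what they do to a single string, here is the same sequence written with std::regex_replace. This is a sketch only: convert_format_string() is a hypothetical name, and it is assumed that replace_all() behaves like a global regex replace with $1 referring to the first capture group.

```cpp
#include <regex>
#include <string>

// Applies, in order, the same substitutions as the replace_all() calls
// in format.g to turn boost::format specifiers into std::format ones.
std::string convert_format_string(std::string text)
{
    static const struct
    {
        const char* pattern;
        const char* replacement;
    } steps[] = {
        { R"re(%(\d+[Xdsx]))re", "{:$1}" },      // %5d    -> {:5d}
        { R"re(%((?:\d+)?\.\d+f))re", "{:$1}" }, // %5.2f  -> {:5.2f}
        { "%x", "{:x}" },                        // %x     -> {:x}
        { "%[ds]", "{}" },                       // %d, %s -> {}
        { "%%", "%" }                            // %%     -> %
    };

    for (const auto& step : steps)
    {
        text = std::regex_replace(text, std::regex(step.pattern),
            step.replacement);
    }

    return text;
}
```

For instance, "%d items at %5.2f%%" becomes "{} items at {:5.2f}%". The order matters: the width and precision forms are converted first so that the plainer %x and %[ds] patterns only see what is left.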
This example finds an if statement, its opening parenthesis and its closing parenthesis and copes with any parenthesis nested in between. We introduce the nonsense token anything so that we stop matching directly after the closing parenthesis and we rely on lexer states to cope with the nesting.
%token if anything
%x PREBODY BODY PARENS
%%
start: if '(' ')';
%%
any (?s:.)
char '([^'\\]|\\.)+'
name [A-Z_a-z][0-9A-Z_a-z]*
string ["]([^"\\]|\\.)*["]|R["][(](?s:.)*?[)]["]
ws [ \t\r\n]+|[/][/].*|[/][*](?s:.)*?[*][/]
%%
<INITIAL>if<PREBODY>            if
<PREBODY>[(]<BODY>              '('
<PREBODY>(?s:.)<.>              skip()
<BODY,PARENS>[(]<>PARENS>       skip()
<PARENS>[)]<<>                  skip()
<BODY>[)]<INITIAL>              ')'
<BODY,PARENS>{string}<.>        skip()
<BODY,PARENS>{char}<.>          skip()
<BODY,PARENS>{ws}<.>            skip()
<BODY,PARENS>{name}<.>          skip()
<BODY,PARENS>{any}<.>           skip()
{string}                        anything
{char}                          anything
{ws}                            anything
{name}                          anything
{any}                           anything
%%
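The pushing and popping of the PARENS state in the config above amounts to keeping track of nesting depth. A hand written C++ sketch of the same idea follows. It is a simplification under stated assumptions: find_if() and skip_quoted() are hypothetical names, only whitespace is allowed between if and the opening parenthesis, and comments and raw strings are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Returns the index just past the quoted literal starting at s[i]
// (a double or single quote), honouring backslash escapes.
static std::size_t skip_quoted(const std::string& s, std::size_t i)
{
    const char quote = s[i++];

    while (i < s.size() && s[i] != quote)
        i += (s[i] == '\\') ? 2 : 1;

    return i < s.size() ? i + 1 : s.size();
}

// Finds the first "if (...)" in s and returns it, up to and including
// the matching closing parenthesis. Returns an empty string if none.
std::string find_if(const std::string& s)
{
    for (std::size_t i = 0; i + 1 < s.size(); ++i)
    {
        if (s[i] == '"' || s[i] == '\'')
        {
            i = skip_quoted(s, i) - 1; // Ignore "if" inside literals
            continue;
        }

        const bool word_start = i == 0 ||
            !(std::isalnum(static_cast<unsigned char>(s[i - 1])) ||
                s[i - 1] == '_');

        if (!word_start || s.compare(i, 2, "if") != 0)
            continue;

        std::size_t j = i + 2;

        while (j < s.size() && std::isspace(static_cast<unsigned char>(s[j])))
            ++j;

        if (j == s.size() || s[j] != '(')
            continue;

        // depth plays the role of the stack of PARENS states
        int depth = 0;

        for (; j < s.size(); ++j)
        {
            if (s[j] == '"' || s[j] == '\'')
                j = skip_quoted(s, j) - 1; // Parentheses in literals don't count
            else if (s[j] == '(')
                ++depth;
            else if (s[j] == ')' && --depth == 0)
                return s.substr(i, j - i + 1);
        }

        break; // Unbalanced parentheses: give up
    }

    return std::string();
}
```

The lexer state version has the advantage that the same skipping rules for strings, characters, whitespace and comments are declared once as macros and reused in every state, rather than being coded by hand as here.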
All of these example configs are available in the zip with a .g extension.
I used the following command line to build under Linux/g++:
g++ -o gram_grep main.cpp -std=c++17 -lstdc++fs