The lexertl Blog

2015-07-01

stl@microsoft.com has announced that std::auto_ptr will be opt in following VC++ 2015 and so this seems a good point to start work on the C++11 conversion of lexertl. So far I have switched to std::unique_ptr and abandoned my pointer containers seeing as std::unique_ptr works with STL containers and I'm using Scott Meyers "Effective Modern C++" to work through other changes. I would like to use fewer containers of pointers and instead rely on move semantics, so we'll see how that goes. As time goes on, the thought of a simpler codebase does appeal I must say!

I think making all the changes will take a while as there always seems to be more modernisation that is possible. It will be nice to add in typedefs for the new Unicode types too. When I first wrote lexertl the idea was to use all the best practice I was aware of and so I would like to attempt that again now.

2014-03-24

The new version of lexertl has been released.

I converted the SQL tokens (baring keywords) as follows:

        // Macros built into GOLD Parser:
        rules_.insert_macro("All_Valid", R"([\x01-\xFF])");
        rules_.insert_macro("Alphanumeric", R"([\x30-\x39\x41-\x5A\x61-\x7A])");
        rules_.insert_macro("Control_Codes", R"([\0-\x1F\x7F-\x9F])");
        rules_.insert_macro("Digit", R"([\x30-\x39])");
        rules_.insert_macro("Letter", R"([\x41-\x5A\x61-\x7A])");
        rules_.insert_macro("Whitespace", R"([\x09-\x0D\x20\xA0])");

        // Macros defined by SQL grammar:
        rules_.insert_macro("Any_Ch", "{All_Valid}{-}{Control_Codes}");
        rules_.insert_macro("String_Ch", "{Any_Ch}{-}'{+}{Whitespace}");
        rules_.insert_macro("Id_Ch_Standard", "{Alphanumeric}{+}_");
        rules_.insert_macro("Id_Ch_Delimited_Bracket", R"({Any_Ch}{-}[\][])");
        rules_.insert_macro("Id_Ch_Delimited_Bracket_Start",
            "{Id_Ch_Delimited_Bracket}{-}#");
        rules_.insert_macro("Id_Ch_Delimited_Quote", R"({Any_Ch}{-}["])");
        rules_.insert_macro("Id_Ch_Delimited_Quote_Start",
            "{Id_Ch_Delimited_Quote}{-}#");
        rules_.insert_macro("Hex_Ch", "[0-9A-F]");
        rules_.insert_macro("Nonspace_Ch", "{Any_Ch}{-}{Whitespace}");

        rules_.push(R"(--.*|[/][*](.|\n)*?[*][/])", eComment);
        rules_.push("\n", eNewLine);
        rules_.push("[\r \t]+", eWhitespace);
        rules_.push("-", eMinus);
        rules_.push("!<", eNLT);
        rules_.push("!=", eNE);
        rules_.push("!>", eNGT);
        rules_.push("%", ePercent);
        rules_.push("&", eBitAnd);
        rules_.push("[(]", eOpenParen);
        rules_.push("[)]", eCloseParen);
        rules_.push("[*]", eTimes);
        rules_.push(",", eComma);
        rules_.push("[.]", eDot);
        rules_.push("[/]", eDivide);
        rules_.push(":", eColon);
        rules_.push(";", eSemiColon);
        rules_.push("[|]", eBitOr);
        rules_.push("~", eBitNot);
        rules_.push("[+]", ePlus);
        rules_.push("<", eLT);
        rules_.push("<=", eLTE);
        rules_.push("<>", eNotEqual);
        rules_.push("=", eEqual);
        rules_.push(">", eGT);
        rules_.push(">=", eGTE);

I also got the lookup working today for the SQL parser :-)

2014-03-23

For the first time at work there is a requirement for a full blown (SQL) parser.

In the absence of a parser generator to complement lexertl, I turned to the internet fully intending to use lemon (http://www.hwaci.com/sw/lemon/). It turns out that Arsène von Wyss has had a good go at the SQL Server SQL grammar here: https://code.google.com/p/bsn-modulestore/source/browse/bsn.ModuleStore.Parser/Sql/ModuleStoreSQL.grm, but that this grammar is coded for use with the Gold Parsing System (http://goldparser.org/).

Luckily a compiled state machine, via the Gold Parser Builder GUI, can be exported to XML. I then used lexertl to massage this data into C++ tables which I am going to play around with this week.

Of course I am using lexertl for all the tokens and what was interesting is that the non-keyword tokens look like this:

    {Any Ch} = {All Valid} - {Control Codes}

So the obvious thing to do was:

    lexertl::rules rules_;

    rules_.insert_macro("AllValid", "[\\x1-\\xff]");
    rules_.insert_macro("ControlCodes", "[\\0-\\x1f\\x7f-\\x9f]");
    rules_.insert_macro("AnyCh", "{AllValid}{-}{ControlCodes}");

However, lexertl doesn't currently support the use of regex macros with the difference operators. Today I decided to fix this which rather opened a can of worms, so even though changes are now complete I need to do some testing before releasing the new version. The good news is this means that rules are now tokenised and validated as they are added, which will make it a lot easier to spot errors in the debugger. It was also a good excuse to change the rules API on suggestion of the London C++ User Group. Rules are now added with push() and macros and states are added with insert_macro() and push_state().

2013-07-09

A simple example of how to logically reverse a regular expression and search a string backwards using it:

First, build a regular expression tokeniser (std::regex syntax in ECMAScript mode - the default):

//#include "../lexertl/lexertl/debug.hpp"
#include "../lexertl/lexertl/generator.hpp"
#include <iostream>
#include "../lexertl/lexertl/iterator.hpp"
#include <regex>

enum eToken {eEOF, eCharset, eQuantifier, eBOL, eEOL, eOr,
    eBackreference, eOpen, eClose};
typedef std::pair<eToken, std::string> token_pair;
typedef std::stack<token_pair> token_stack;

lexertl::state_machine g_sm;

// std::regex (ECMAScript) syntax:
void build_lexer()
{
    lexertl::rules rules_;

    rules_.insert_macro("posix_name", "alnum|alpha|blank|cntrl|digit|d|graph|"
        "lower|print|punct|space|s|upper|xdigit|w");
    rules_.insert_macro("posix", "\\[(:{posix_name}:|[.]{posix_name}[.]|"
        "={posix_name}=)\\]");
    rules_.insert_macro("escape", "\\\\([^xuc]|x\\d{2}|u\\d{4}|c[a-zA-Z])");

    rules_.push("\\\\[0-9]{1,2}", eBackreference);
    rules_.push("{escape}", eCharset);
    rules_.push("\\[^?({escape}|{posix}|[^\\\\\\]])*\\]", eCharset);
    rules_.push("\\^", eBOL);
    rules_.push("\\$", eEOL);
    rules_.push("[|]", eOr);
    rules_.push("[*+?][?]?", eQuantifier);
    rules_.push("[{][0-9]+(,([0-9]+)?)?[}]", eQuantifier);
    rules_.push("[(]([?][:=!])?", eOpen);
    rules_.push("[)]", eClose);
    rules_.push("(.|\n)", eCharset);

    lexertl::generator::build(rules_, g_sm);
//    lexertl::debug::dump(g_sm, std::cout);
}

Then, the regex reverser:

void reduce_stack(token_stack &stack_)
{
    std::string str_;

    for (; !stack_.empty() && (stack_.top().first == eCharset ||
        stack_.top().first == eOr); stack_.pop())
    {
        str_ += stack_.top().second;
    }

    stack_.push(token_pair(eCharset, str_));
}

std::string rev_regex(const std::string &rx_)
{
    std::string str_;
    std::string curr_;
    token_stack stack_;
    lexertl::citerator iter_(rx_.c_str(), rx_.c_str() + rx_.length(),
        g_sm);
    lexertl::citerator end_;
    std::size_t parens_ = 0;

    for (; iter_ != end_; ++iter_)
    {
        switch (iter_->id)
        {
            case eCharset:
                stack_.push(token_pair((eToken)iter_->id, iter_->str()));
                break;
            case eQuantifier:
                if (stack_.empty())
                {
                    throw std::runtime_error("Quantifier without regex.");
                }

                stack_.top().second += iter_->str();
                break;
            case eBOL:
                // Reverse to eEOL
                stack_.push(token_pair(eEOL, std::string(1, '$')));
                break;
            case eEOL:
                // Reverse to eBOL
                stack_.push(token_pair(eBOL, std::string(1, '^')));
                break;
            case eOr:
                reduce_stack (stack_);
                stack_.push(token_pair((eToken)iter_->id, iter_->str()));
                break;
            case eBackreference:
                // See http://www.perlmonks.org/?node_id=457884
                throw std::runtime_error("Backreferences not supported!");
                break;
            case eOpen:
                stack_.push(token_pair((eToken)iter_->id, iter_->str()));
                ++parens_;
                break;
            case eClose:
            {
                if (parens_ < 1)
                {
                    throw std::runtime_error("Unbalanced parens in regex.");
                }

                std::string str_;

                reduce_stack(stack_);
                str_ = stack_.top().second;
                str_ += iter_->str();
                stack_.pop();
                stack_.top().second += str_;
                stack_.top().first = eCharset;
                --parens_;
                break;
            }
        }
    }

    for (; !stack_.empty(); stack_.pop())
    {
        str_ += stack_.top().second;
    }

    return str_;
}

Finally, the calling routine:

int main(int argc, char *argv[])
{
    try
    {
        std::string regex_("[A-Z][a-z]+");
        std::string str_ = "Ben Test";

        build_lexer();
        regex_ = rev_regex(regex_);

        std::regex rx_(regex_);
        std::regex_iterator<std::string::const_reverse_iterator>
            iter_(str_.crbegin(), str_.crend(), rx_);
        std::regex_iterator<std::string::const_reverse_iterator>
            end_;

        for (; iter_ != end_; ++iter_)
        {
            std::cout << iter_->str() << std::endl;
        }
    }
    catch (const std::exception &e_)
    {
        std::cout << e_.what() << std::endl;
    }

    return 0;
}

2013-06-30

Introduced lexertl::iterator to ape std::regex_iterator.

2013-06-13

I finally got around to writing code for a skeletal (very basic) C++ source analyser. I'm still frustrated by the lack of C++11 support in Visual C++, but I'm going for the best techniques I can given the limitations of VC++ 2010. I present the source code below, in case anyone else is interested.

I thought it was about time I used boost::filesystem and sure enough, it is very nice to use. Having used std::regex for previous simple refactoring (I even did auto checkout and replacement previously), I've now introduced lexertl. C++ Rocks! has a nice example for boost::filesystem and informIT has a nice article on std::regex. Although I own the Josuttis (2nd edition) std library book, It was the informIT article that made me think about regex_iterator and how we should use it in the code base at work (a previous refactor I did recently (this one by hand) had me changing all occurrences of boost::regex to std::regex).

The obvious next step is to introduce lexertl::iterator in the same vein, especially as I have already named lexertl::match_results (with associated typedefs) in line with the std::regex version.

Anyway, without further ado here is the new source code scanning code as it currently stands:

#include <assert.h>
//#include "../Ben/Private/lexertl/lexertl/debug.hpp"
#include <boost/filesystem.hpp>
#include "../Ben/Private/lexertl/lexertl/generator.hpp"
#include "../Ben/Private/lexertl/lexertl/lookup.hpp"
#include "../Ben/Private/lexertl/lexertl/memory_file.hpp"
#include <regex>

enum {eEOF, eWhitespace, eComment, eChar, eString, eOpenCurly, eCloseCurly,
    eOpenParen, eCloseParen, eComma, eSemiColon, eNewline};
lexertl::state_machine sm_;

void process(const std::string &pathname_)
{
    lexertl::memory_file file_(pathname_.c_str());
    lexertl::cmatch results_(file_.data(), file_.data() + file_.size());

    do
    {
        lexertl::lookup(sm_, results_);

        if (results_.id == 100)
        {
            std::size_t brack_ = 1;
            std::size_t params_ = 0;
            std::string format_;

            do
            {
                lexertl::lookup(sm_, results_);
            } while (results_.id != eOpenParen);

            do
            {
                lexertl::lookup(sm_, results_);

                switch (results_.id)
                {
                case eOpenParen:
                    ++brack_;
                    break;
                case eCloseParen:
                    --brack_;
                    break;
                case eString:
                    if (brack_ == 1)
                    {
                        format_ += std::string(results_.first + 1,
                            results_.second - 1);
                    }

                    break;
                case eComma:
                    if (brack_ == 1)
                    {
                        ++params_;
                    }

                    break;
                }
            } while (brack_ != 0);

            assert(brack_ == 0);

            std::regex rx_("\\{\\d+\\}");
            std::cregex_iterator riter_(format_.c_str(),
                format_.c_str() + format_.length(), rx_);
            std::cregex_iterator rend_;

            for (; riter_ != rend_; ++riter_)
            {
                std::size_t i = atoi((*riter_)[0].first + 1);

                if (i >= params_)
                {
                    std::cout << pathname_ << std::endl;
                }
            }
        }
    } while (results_.id != 0);
}

void iterate(const char *path_)
{
    namespace fs = boost::filesystem;

    for (auto iter_ = fs::directory_iterator(path_),
        end_ = fs::directory_iterator(); iter_ != end_; ++iter_)
    {
        // file object contains relative path in the case of 
        // directory iterators (i.e. just the file name)
        const auto &file_ = iter_->path();

        if (file_.extension() == ".cpp")
        {
            process(file_.string());
        }
    }
}

void recurse(const char *path_)
{
    namespace fs = boost::filesystem;

    for (auto iter_ = fs::recursive_directory_iterator(path_),
        end_ = fs::recursive_directory_iterator(); iter_ != end_; ++iter_)
    {
        // file object contains relative path in the case of 
        // directory iterators (i.e. just the file name)
        const auto &file_ = iter_->path();

        if (file_.extension() == ".cpp")
        {
            process(file_.string());
        }
    }
}

void build_lexer()
{
    lexertl::rules rules_;

    rules_.push("[ \t]+", eWhitespace);
    rules_.push("[/][*](.|\n)*?[*][/]|[/][/].*", eComment);
    rules_.push("'(\\\\('|[^']+)|[^\\\\'])'", eChar);
    rules_.push("[\"]([^\"\\\\]|\\\\.)*[\"]", eString);
    rules_.push("\\{", eOpenCurly);
    rules_.push("\\}", eCloseCurly);
    rules_.push("\\(", eOpenParen);
    rules_.push("\\)", eCloseParen);
    rules_.push(",", eComma);
    rules_.push(";", eSemiColon);
    rules_.push("\r|\n|\r\n", eNewline);

    rules_.push("tfb::db::sprintf", 100);

    lexertl::generator::build(rules_, sm_);
//    lexertl::debug::dump(sm_, std::cout);
}

int main(int argc, char *argv[])
{
    if (argc < 2 || argc > 3 ||
        argc == 3 && strcmp(argv[1], "-r") != 0)
    {
        std::cout << "USAGE: refactor [-r] <pathname>\n";
        return 1;
    }

    try
    {
        build_lexer();

        if (argc == 2)
        {
            iterate(argv[1]);
        }
        else
        {
            recurse(argv[2]);
        }
    }
    catch (const std::exception &e)
    {
        std::cout << e.what() << std::endl;
    }

    return 0;
}

2012-08-08

Added advance flag to match_results for use by boost.spirit (spirit will turn this flag off).

2012-08-06

Added compile time flags to match_results in order to enable lookup to be optimised depending on the state_machine features you are using. By default all features are enabled as before.

2012-07-29

Fixed a push bug in rules.hpp - false was being stored instead of npos.

2012-07-26

Fixed EOL bugs in lookup.hpp and generate_cpp.hpp. Thanks to Markus Dreyer for pointing these out.

Renamed match_results to bring in line with std::regex.

2012-07-22

Moved match_results out of lookup.hpp into match_results.hpp. Added str() function to match_results.

2012-06-24

Added POSIX character set support. Note that the help page has been updated and also an examples directory has been added to the zip file. A test directory will also be added at some point.

2012-06-21

Added support for (?x:) and (?#).

2012-06-15

Added support for {+} and {-}. Examples are [a-z]{-}[aeiou] (creates charset of consonants) and [a-z]{+}[0-9] which results in [0-9a-z].

2012-06-04

\n and $ ambiguity was not being resolved for state_machines with wide chars. This is now finally fixed.

2012-05-26

Hari Rangarajan pointed me to A memory mapped file implementation. This works so well for files that fit into memory, I have reworked the example as a template class and included it as file memory_file.hpp. See main.cpp for example usage.

2012-05-22

More help from Hari Rangarajan. The basic_stream_shared_iterator has been sped up. This has been achieved by eliminating the linear search when adding and removing iterators to the internal shared list.

2012-05-21

Added an experimental file buffer basic_fast_filebuf. Construct with a (text file) filename and drop into the constructor of an istream instance to use it. Various people have asked why C++ streams are so slow. Hopefully the use of FILE * internally in this buffer will solve the problem for file parsing. All feedback appreciated.

2012-05-20

The bug fixes recently are due to reports from Hari Rangarajan, so thanks to him for his valuable contributions. He pointed out that file_shared_iterator just needed a typedef change and rename in order to work with other C++ streams, so that change has now been made too.

2012-03-08

A colleague used lexertl yesterday and made the classic mistake of creating a rule that could match zero characters. This inevitably lead to an infinite loop in his user code. This convinced me that rules that can match zero characters should be disallowed by default, but for the expert user there is now a flag to force the issue - match_zero_len.

2011-11-10

Finally had a requirement to create a .NET interface to lexertl. So far I have done the proof of concept at work (C++/CLR) and can now call it from C#. I will see about releasing the source on CodeProject (as the wrapper was developed in work time..!) when it is in better shape.

2011-10-27

Agreed with Hartmut to rename unique_id to user_id. The auto counter behaviour is gone and instead this id is optionally settable when adding a rule. This should mean this version is ready for integration with boost::spirit again, bar testing.

2011-10-23

Fixed up the recursive rule handling. Improved the lookup so that leftmost longest is maintained when using recursive rules and added the possibility to push a state other than current.

The code generator still needs attention, but a new release is in order regardless.

2011-10-15

I wondered whether it was a retrograde step to make minimise_dfa() a member function of state_machine, but after having tried it I am now convinced that the code is different enough that it is worth it. char_state_machine can now also be minimised.

2011-10-14

Made a change I have been meaning to make for ages: the EOL index is now the last column in each row. This means if $ is not used in the rules then no space is wasted for it in the state machine.

When using recursive rules, lookup now checks whether the stack is empty before attempting to pop states.

I've realised that state machine minimisation should be a member function for each state machine type and currently char_state_machine does not support minimisation. I will look into this along with pushing recursive matching as far as possible. I have started with a simple calculator example in main.cpp. I will mull how far it is logical to push the recursive rules (i.e. are there too many problems introduced by supporting simple recursive grammars.)

2011-09-11

Decided to add iterator conversion from char32_t to UTF-8 and UTF-16 (mainly for seeing Unicode tokens on Linux).

2011-08-27

Looked in Unicode Explained from O'Reilly and decided to implement Simple Case Mapping (page 245).

2011-08-21

Tried compiling with VC++ 7.1 and was pleased to note that it still compiles OK! I have fixed warnings highlighted by that compiler and added a default constructor as well as clear() and reset() functions to match_results. These also work fine with VC++ 7.1.

The char types for rules and state_machine have now been separated. This means that you can use char based rules with a char32_t based state_machine (for example).

2011-08-15

I'm now ploughing ahead with Unicode features. More Unicode character sets have been added, rules and state machines can now cope with char32_t and I have even added UTF-8 and UTF-16 iterators after being inspired by UTF8-CPP! (http://utfcpp.sourceforge.net/)

I have decided to definitely allow char based rules with wide char state machines and then hopefully that will be the end of the Unicode work for now.

2011-07-30

Unicode character sets added ( http://www.unicode.org/Public/5.1.0/ucd/UCD.html#General_Category_Values). Syntax is, for example, \p{Lu}. So far this is still only using wchar_t, as I'm still waiting for Microsoft to support the new char types for the 32 bit char stuff.

It's a relief to finally have wide char slicing working correctly!

2011-01-31

Added a parameter to the code generator for the lookup function name. Also added a parameter so that the DFA can use pointers directly (according to Anteru's testing this is 10% faster).

2011-01-24

I've generated code using each 'feature' and found and fixed lots of errors. Also, the skip constant was still defined as std::size_t which would have completely failed for non std::size_t based state machines!

Next up is traits for lookup (performance) and user data for use by the boost::spirit lexer.

2011-01-23

At the request of Anteru (http://anteru.net/) I have finally rewritten the C++ (table based) code generator. I have introduced a 'features' enum and therefore generate only the necessary code for any given state machine. I am still testing this code so I am not updating the rss until that is complete. The zip already contains the latest code however.

2010-12-24

Inspired by the work of Roberto Ierusalimschy (http://www.cs.tufts.edu/~nr/cs257/archive/roberto-ierusalimschy/peg.pdf) It dawned on me that I could easily add recursive rule support to lexertl (after all a lexer + stack = parser!) This technique would also be very interesting for a DFA based regex library (it would be cool to be able to match nested brackets when grepping source code).

2010-11-13

VC++ 6 support has finally been removed. The code base was actually already broken on that compiler and there was no obvious way to fix it. I first wrote a regex library in 2004 when VC++ 6 was still very much in use. With C++11 now on the horizon, it seems fair enough to move on (and of course boost dropped support for VC++ 6 ages ago).

2010-10-19

I have finally whipped the new version into shape. You can now specify the id_type for the state_machine if you wish. The default is std::size_t as before.

The generator is now capable of generating a char_state_machine directly and in fact will cope with a custom state machine type if you set it up right! In conjunction with this, char_state_machine doesn't use iterators any more as the model didn't really work that well. You can dump state machines as before. Note that a char_state_machine will store *real character ranges* even in wchar_t mode, whereas state_machine always slices wide characters, allowing it to consistently use a 256 entry first phase lookup (Most of you won't care about that, but for anyone doing code generators this is an important distinction).

The code generator is next in the list to be reworked.

2010-08-27

I think I've finally got a proper solution to eol ambiguity resolution (slapped wrist for not testing my previous solution more thoroughly). As usual, doing the right thing (TM) was surprisingly easy, although I'm sure there must be simpler approaches to ^ and $ handling without all this post processing. I'm now happily in a position to tackle meatier issues, such as a more flexible generator interface and another ponder on look-ahead. I think I can solve at least half of the look-ahead problem with some regex syntax tree shenanigans, but I need both a*/a and a/a* to work correctly to really solve it. I may even be able to use some more post processing thinking about it. Maybe that's the easiest way.

2010-07-28

Spoke to Hartmut and agreed to ditch the boost.lexer.zip and instead only update that code base in boosts SVN repository. Any boost review is still ages away and it is a moot point seeing as spirit has been using the library for years already!

I finally fixed the ambiguity with $ and \n in the lexertl code base. The boost version will be fixed soon. I will also start work on some kind of abstract state machine interface with Hartmut so that generator::build() can produce a char_state_machine directly soon.

2010-03-01

As I haven't got around to making changes in preparation for the re2c style code generator, I have switched lexertl.zip to the latest version which includes the changes mentioned below. I'm thinking that it is probably better to either build to a char_state_machine directly in the generator class and that it probably doesn't really need iterators. There will be an option as to whether to group transitions by state or by char cluster. The latter is needed for a re2c style code generator.

2009-09-29

As I have recently started a revamp of lexertl I have decided to start a blog to keep everybody up to date. As this version is not feature complete yet, I have added a separate zip file which you can find here. Note: These changes are now part of the main distribution!

So far I have implemented the following improvements:

Auto compression of wchar_t based state machines (over-ridable).
A generic lookup mechanism based around iterators.
Added the lexertl::skip token constant.
Removed regex macro length limitation.
Made the BOL (^) link a singleton (as it can only occur at the beginning of a token).
debug::dump() now compresses ranges.

This dramatically reduces the list of (easier) features I wanted to add and just leaves the following for the immediate future:

file_iterator
Turn size_t into a templated type for state machine creation.
Re-write the code generator.
Redo serialisation.