I'm sure you all remember me mentioning my current unemployment side project, a C preprocessor written in Python, and that by far the hardest part in it is correct function macro expansion. Indeed, Microsoft's preprocessor has long gotten it wrong, and it's worth looking into some of the problems.
Superficially a C preprocessor looks very, very straightforward. It originated in the 1970s as a simple string match and substitute preprocessor, so:
#define FOO foo
printf("FOO is %d\n", FOO);
... after preprocessing substitutes FOO for foo, but not in string literals (anything surrounded by double quotes), so output is:
printf("FOO is %d\n", foo);
You can implement one of those very easily, simply scan the line for string literals and replace them with placeholders, then scan the remainder for the macros you know about and substitute them, after restoring the string literals. You'd hope any first year compsci student could implement one of those.
The above is what is called an "object macro", those are very easy. Much later in the 1970s they decided the parameterise macros, so you could do:
#define FOO(x) (foo, x)
printf("FOO(5) is %d\n", FOO(5));
... and get:
printf("FOO is %d\n", (foo, 5));
This looks straightforward right? So it may seem, but it is not. Here is what pcpp was choking on last night:
#define f(a) f(x * (a))
#define x 2
#define z z
... needs to become:
f(2 * (f(2 * (z))))
The hard part is not expanding too far. When you expand f(a), you get another f(x * (a)). You must NOT expand that new f(), but yet you MUST expand the user supplied parameter of f(z). So you immediately think surely the easiest way of doing that is to expand from inner-to-outer, so you expand the z exactly once into z, then the f(z) into f(x * (z)) and then finally the outer f() into the correct expansion?
Well, you'd think that, and that would work for this particular instance. But that approach would fail this expansion:
#define TWO_ARGS a,b
#define sub( x, y) (x - y)
sub( TWO_ARGS, 1)
Here sub() takes two parameters, and if we expand TWO_ARGS first then we call sub(a,b, 1) which will fail as that's three parameters. So clearly the C preprocessor does not do inner-to-outer expansion for function macros.
So how are function macros implemented then? Real implementations use tokenisation rather than string matching, so:
... becomes IDENTIFIER LPAREN IDENTIFIER LPAREN IDENTIFIER RPAREN RPAREN
During expansion, we expand to different "coloured" tokens to distinguish original text from expansion text, so original text can be expanded from an outer to inner approach. This means TWO_ARGS from before gets expanded after sub() is expanded, but also f(f(z)) gets expanded as follows:
f(x * (f(z)))
f(x * (f(x * (z))))
f(x * (f(x * (z))))
... where hopefully Google Plus is showing the original, unexpanded text in bold.
I had hoped to avoid tokenisation in pcpp entirely as it's slow and requires bringing in a proper full strength "Look Ahead Left-to-Right (LALR) parser generator". For these you need to supply an analytic grammar with which the computer can parse a language syntax, something I've definitely not touched since my compsci degree a very long time ago (nearly two decades!). With one of those in hand, colouring tokens to mark expanded vs original text is straightforward and is probably the least awful way of implementing function macro expansion which is standards conforming.
Function macros are one of two places in a C preprocessor that you have no choice but to use a proper LALR parser. I had hoped to avoid them as I'd hacked an evil-but-sorta-works solution for the other mandatory use case, but I've come to the conclusion that I'm just going to have to bite the bullet.
My original goal with this side project was to get away from Boost.Outcome and C++ development in general for a bit, and freshen up my Python and other skillsets which I don't get practice in regularly. That's the "continuing professional development" you do in between contracts. It is turning into an awfully large mouthful to swallow, but then it always seems to!