David Bakin’s programming blog.

Breaking a Windows command line into separate arguments, respecting quotes and backslashes

I went on a side track recently and discovered the strangely intricate world of breaking a Windows command line into arguments.  That is, how do you do Windows command line lexing?  (By established convention, command line parsing refers to interpreting arguments as options to programs: interpreting flags, collecting file names, handling missing required arguments, etc.)

TL;DR: I wrote a fully tested C# library to do this and it is on GitHub and NuGet for your public domain amusement.  (It’s also on symbolsource.org, for your source debugging needs, but I can’t get it to work in my VS2013 environment … let me know if it works for you.)

Most of the time there’s no need to worry about breaking up a command line into arguments.  Your C/C++ program gets them pre-lexed as arguments to main(): the well known argv and argc, handled by your compiler’s runtimes. And your C# program gets a string[] args array, handled by the .NET assembly launcher. And for most occasions, that’s sufficient.

But maybe it isn’t. For example, I was trying to use Clang’s libclang to process some C++ source code. It is an excellent resource if you want your C++ lexed, parsed, and indexed. But to get it going you’ve got to pass compiler command line arguments to the function which parses a translation unit. Those arguments must include all the include directories, preprocessor symbol definitions, and everything else that you’d ordinarily pass to your compiler (in clang’s case, these are normally gcc’s options). A lot of the time these are built into makefile macros or even more difficult-to-reach locations, like inside Visual Studio’s project files.

For my purposes I wanted to grab them from MSBuild logfiles so I could get the actual command lines as seen by Visual C++. And that meant I needed to lex a command line into arguments.

So that turns out to be intricate, as I said above. The key issue is caused by an…unfortunate design choice?…mistake?…that dates back to MS-DOS/PC-DOS 2.0: the use of the backslash as the directory separator character in a path string. In C and C-derived languages (and many other languages) the backslash is used as an escape character in a double-quoted string literal. Paths containing backslashes are often passed as arguments to programs, and those paths are frequently in double-quoted arguments (to protect blanks and other special characters). The result is a conflict that leads to confusing interactions between quoted arguments, escaped characters, and path strings.

In this article on MSDN, Parsing C++ Command Line Arguments, Microsoft describes the rules: note the special cases for even or odd sets of backslashes immediately followed by a double quote character, versus a set of backslashes not so followed. But it’s more complex than that. There is a special rule for backslashes at the end of the string. There is special handling of the first (“zeroth”) argument on the command line: The executable path. The rules changed slightly in 2008. And some programs don’t use Visual C++’s runtime to lex arguments, they use the Windows API CommandLineToArgvW to do it—and wouldn’t you know, it handles things slightly differently.
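To make the even/odd backslash rule concrete, here is a minimal sketch in Python (not the C# library's code, and not a complete implementation). It assumes only the documented Visual C++ rules for backslashes and double quotes: backslashes are literal unless they immediately precede a double quote; 2n backslashes plus a quote yield n backslashes and a quote that opens or closes a quoted region; 2n+1 backslashes plus a quote yield n backslashes and a literal quote. The name split_args is hypothetical, and the sketch deliberately leaves out the special handling of the executable path, the 2008 rule change, and the CommandLineToArgvW differences mentioned above.

```python
def split_args(cmdline):
    """Split the argument part of a command line using the basic
    Visual C++ backslash/double-quote rules (simplified sketch)."""
    args = []
    current = []        # characters of the argument being built
    in_quotes = False   # are we inside a double-quoted region?
    have_arg = False    # lets an empty "" still count as an argument
    i, n = 0, len(cmdline)
    while i < n:
        ch = cmdline[i]
        if ch == "\\":
            # Measure the whole run of backslashes.
            j = i
            while j < n and cmdline[j] == "\\":
                j += 1
            count = j - i
            if j < n and cmdline[j] == '"':
                # Run is followed by a quote: each pair becomes one backslash.
                current.append("\\" * (count // 2))
                if count % 2 == 1:
                    # Odd run: the leftover backslash escapes the quote.
                    current.append('"')
                    j += 1
                # Even run: the quote is left for the next iteration,
                # where it acts as a delimiter.
            else:
                # Not followed by a quote: backslashes are literal.
                current.append("\\" * count)
            have_arg = True
            i = j
        elif ch == '"':
            in_quotes = not in_quotes
            have_arg = True
            i += 1
        elif ch in " \t" and not in_quotes:
            # Unquoted whitespace ends the current argument, if any.
            if have_arg:
                args.append("".join(current))
                current = []
                have_arg = False
            i += 1
        else:
            current.append(ch)
            have_arg = True
            i += 1
    if have_arg:
        args.append("".join(current))
    return args
```

Under these rules a quoted path that ends in a backslash, such as "C:\dir\", comes out as C:\dir" because the final backslash escapes the closing quote. That is the kind of surprise the rules produce.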

I ended up writing a C# library that lexes arguments, letting you choose between the Visual C++ way of doing it or the CommandLineToArgvW way of doing it. There are also routines for “requoting” arguments properly so that you can form them back up into a command line. (I haven’t done globbing yet, but that’s coming.) I’ve put it on GitHub (with a public domain license, so party on) and it’s on NuGet as well. Bug reports, discussion, and praise are all cheerfully accepted at the GitHub site (or as comments here).
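Requoting is the reverse problem: given one argument, produce text that the lexer will turn back into exactly that argument. The following is a minimal Python sketch of one common approach, not the library's actual routine. The name quote_arg is hypothetical. It assumes the Visual C++ style rules: backslashes only need doubling when they sit directly before a double quote, which includes the closing quote that the routine itself adds.

```python
def quote_arg(arg):
    """Quote one argument so the basic Visual C++ backslash/quote
    rules would recover it unchanged (simplified sketch)."""
    # Non-empty arguments without whitespace or quotes can pass through as-is.
    if arg and not any(c in arg for c in ' \t\n\v"'):
        return arg
    out = ['"']
    backslashes = 0  # length of the backslash run seen so far
    for ch in arg:
        if ch == "\\":
            backslashes += 1
        elif ch == '"':
            # Double the pending run, then add one backslash to escape the quote.
            out.append("\\" * (2 * backslashes + 1))
            out.append('"')
            backslashes = 0
        else:
            # Run not followed by a quote: emit it unchanged.
            out.append("\\" * backslashes)
            out.append(ch)
            backslashes = 0
    # A trailing run sits before the closing quote, so it must be doubled.
    out.append("\\" * (2 * backslashes))
    out.append('"')
    return "".join(out)
```

A command line can then be formed by joining the quoted arguments with single spaces. For example, the argument C:\my dir\ becomes "C:\my dir\\", with the trailing backslash doubled so that it does not swallow the closing quote.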

Naturally, I didn’t figure out the crafty little details myself. I relied on reports written by a bunch of people who got there first. Here are links to that work, which was quite useful to me: