pop·cy·cli·cal

Saturday, September 11, 2010

Splitting Pascal/Camel Case with RegEx Enhancements

In Jon Galloway’s Splitting Camel Case with RegEx blog post, he introduced a simple regular expression replacement which can split “ThisIsInPascalCase” into “This Is In Pascal Case”.  Here’s the original code:

output = System.Text.RegularExpressions.Regex.Replace(
    input,
    "([A-Z])",
    " $1",
    System.Text.RegularExpressions.RegexOptions.Compiled).Trim();

Simple and effective: it matches any capital letter and inserts a space before it.  But there’s room for improvement.  First, the call to String.Trim() exists only to remove the leading space that gets added when the first letter is uppercase.  That can be avoided with a “Match if prefix is absent” group containing the “beginning of line” anchor ^.  This prevents a match from occurring on the first character, which eliminates the need for the String.Trim() call.  The formal name for this grouping construct is “Zero-width negative lookbehind assertion”, but just think of it as “if you see what’s in here, don’t match the next thing”.

    (?<!^)([A-Z])
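
To make the difference concrete, here is a minimal sketch that runs the original pattern without Trim() next to the lookbehind version. The class and method names are made up for illustration and are not from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

static class LookbehindDemo
{
    // Original pattern with the Trim() call left off, to expose the leading space.
    public static string SplitOriginalWithoutTrim(string input) =>
        Regex.Replace(input, "([A-Z])", " $1");

    // Negative lookbehind on ^ skips a capital letter at the very start of the string.
    public static string SplitWithLookbehind(string input) =>
        Regex.Replace(input, "(?<!^)([A-Z])", " $1");

    static void Main()
    {
        // Brackets make the leading space visible.
        Console.WriteLine("[" + SplitOriginalWithoutTrim("ThisIsInPascalCase") + "]"); // [ This Is In Pascal Case]
        Console.WriteLine("[" + SplitWithLookbehind("ThisIsInPascalCase") + "]");      // [This Is In Pascal Case]
    }
}
```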

Next, there’s a potential issue with how this handles acronyms.  Given the fictional book title “WCFForNoobs”, the split will occur on each uppercase letter, resulting in “W C F For Noobs”.  The fix is simple, though: require that the uppercase letter be followed by a lowercase letter:

    (?<!^)([A-Z][a-z])

…Now it’ll result in “WCF For Noobs” (aren’t we all!).  But now it won’t add a space before an acronym: for “LearnWCFInSixEasyMonths”, the result will be “LearnWCF In Six Easy Months”.  No problem: add an alternate match for an uppercase letter that comes right after a lowercase letter.  The replace pattern makes this more difficult, because we don’t want the space to go before the lowercase letter; we want it between the lowercase letter and the first capital letter of the acronym.  RegEx can handle this with another lookbehind group, “Match prefix but exclude it”, written (?<=) and formally called a “Zero-width positive lookbehind assertion”.  The lowercase letter must be present for the match to succeed, but only the uppercase letter is part of the match, so when it comes time to run the replacement, the space gets inserted between the two letters.  By itself, that’ll look like this:

    ((?<=[a-z])[A-Z])

Great!  But this needs to be combined with the previous expression.  That’s easily accomplished with an either/or match using the vertical bar “or” construct:

    (?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])

The example “LearnWCFInSixEasyMonths” will now be split into “Learn WCF In Six Easy Months”.  These same techniques can be used for additional splits – perhaps on numbers or underscores.  More generally, lookbehind and lookahead are great tools to have in your RegEx toolbelt.
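
As a rough sketch of how the last two steps compare, the snippet below runs the acronym-only pattern and the combined pattern over the same title. The class name, constant names, and the digit-splitting variant are assumptions added for illustration; the digit pattern in particular is just one possible way to extend the idea and is not from the original post.

```csharp
using System;
using System.Text.RegularExpressions;

static class PascalCaseSplitter
{
    // Patterns taken from the walkthrough above.
    public const string AcronymAware = "(?<!^)([A-Z][a-z])";
    public const string Combined = "(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z])";

    // Hypothetical extension: also split before a digit that directly follows a letter.
    public const string CombinedWithDigits =
        "(?<!^)([A-Z][a-z]|(?<=[a-z])[A-Z]|(?<=[A-Za-z])[0-9])";

    // Inserts a space in front of every match of the given pattern.
    public static string Split(string input, string pattern) =>
        Regex.Replace(input, pattern, " $1");

    static void Main()
    {
        const string title = "LearnWCFInSixEasyMonths";
        Console.WriteLine(Split(title, AcronymAware));              // LearnWCF In Six Easy Months
        Console.WriteLine(Split(title, Combined));                  // Learn WCF In Six Easy Months
        Console.WriteLine(Split("Top10Lists", CombinedWithDigits)); // Top 10 Lists
    }
}
```

Passing the pattern as a parameter keeps the comparison in one place; in real code a single fixed pattern would likely be enough.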

Tuesday, March 23, 2010

Programming Language Misuse

I’m feeling a bit guilty about some code I wrote:

using (new OperationTimer("MyOperation", this))
{
    // ... complete operation
}

This innocent-looking C# snippet is hiding a tricky secret: the using statement is being misused (no pun intended).  The documentation defines the intended usage clearly:

using Statement
Defines a scope, outside of which an object or objects will be disposed.

The problem?  The notion of “object disposal” is being hijacked!  In your garden variety IDisposable implementation, you’d be dealing with an external resource that needs to be released before the object can be removed from memory.  Instead, I’m using it to time a block of code like so:

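using System;
using System.Diagnostics;

// ITimable is the author's own interface (not shown here); it receives the timing result.
class OperationTimer : IDisposable
{
    private readonly string _operationName;
    private readonly ITimable _obj;
    private readonly Stopwatch _stopwatch;

    public OperationTimer(string operationName, ITimable obj)
    {
        _operationName = operationName;
        _obj = obj;
        // Timing starts as soon as the object is constructed.
        _stopwatch = new Stopwatch();
        _stopwatch.Start();
    }

    public void Dispose()
    {
        // Called when the using block exits: stop the clock and report the result.
        _stopwatch.Stop();
        _obj.OnOperationCompleted(_operationName, _stopwatch.Elapsed);
    }
}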

The constructor starts a timer and the Dispose() method stops it and reports the elapsed time.  (Aside: if you’re interested in how I’m using the timer, check out my previous article, Simplified Performance Counters.)  There are certainly other ways to accomplish the same behavior, but they lack the elegance of a neatly scoped code block.  It’s arguably an acceptable way to repurpose the language.  In fact, the ASP.NET MVC authors saw fit to use it in a similar fashion with the BeginForm helper: the only “disposal” it performs is rendering the closing </form> tag.
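
For comparison, here is a sketch of one of those other ways: a helper that takes the operation as a delegate and reports the elapsed time in a finally block. The OperationTiming class, the Time method, and the callback signature are all hypothetical; the callback stands in for the ITimable.OnOperationCompleted call so the example stays self-contained.

```csharp
using System;
using System.Diagnostics;
using System.Threading;

static class OperationTiming
{
    // Hypothetical helper: runs the operation, then reports how long it took.
    // The finally block mirrors what Dispose() does at the end of a using block.
    public static void Time(string operationName, Action operation,
                            Action<string, TimeSpan> onCompleted)
    {
        var stopwatch = Stopwatch.StartNew();
        try
        {
            operation();
        }
        finally
        {
            stopwatch.Stop();
            onCompleted(operationName, stopwatch.Elapsed);
        }
    }

    static void Main()
    {
        Time("MyOperation",
             () => Thread.Sleep(50), // stand-in for the real work
             (name, elapsed) => Console.WriteLine(name + " took " + elapsed.TotalMilliseconds + " ms"));
    }
}
```

It works without touching IDisposable, but the operation now lives inside a lambda instead of a plain block, which is the loss of elegance mentioned above.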

My question is: When does repurposing language constructs turn from “acceptable language use” to a “dirty trick”, or worse, “illegible line noise”?

It seems like a slippery slope.  One instance that I don’t care for is controlling execution flow by way of logical operator precedence and short-circuit evaluation in most C-like languages:

expression1 && expression2 || expression3

Which is meant to behave like:

if (expression1)
    expression2
else
    expression3

This takes advantage of short-circuit evaluation in a logical expression: expression2 will never be evaluated if expression1 evaluates to false, and expression3 will get to run instead.  Likewise, if the first two evaluate to true, the truth value of the whole expression is known and expression3 is never evaluated.  The equivalence is not exact, though: if expression1 is true but expression2 evaluates to false, expression3 runs as well.  This is clearly not the usage the language designers had in mind, but it works, and it saves a few keywords from being written.
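
A small sketch can show which expressions actually run. The ShortCircuitDemo class and its helper methods are invented for this example; each helper records its name before returning a fixed value. Note the third case, where the && / || form runs expression3 even though expression1 was true, which an if/else would not do.

```csharp
using System;
using System.Collections.Generic;

static class ShortCircuitDemo
{
    static readonly List<string> Log = new List<string>();

    // Records that an expression was evaluated, then returns its fixed result.
    static bool Run(string name, bool result)
    {
        Log.Add(name);
        return result;
    }

    // Evaluates expression1 && expression2 || expression3 and lists what ran.
    public static string Trace(bool first, bool second)
    {
        Log.Clear();
        bool ignored = Run("expression1", first) && Run("expression2", second)
                       || Run("expression3", true);
        return string.Join(", ", Log);
    }

    static void Main()
    {
        Console.WriteLine(Trace(true, true));  // expression1, expression2
        Console.WriteLine(Trace(false, true)); // expression1, expression3
        Console.WriteLine(Trace(true, false)); // expression1, expression2, expression3
    }
}
```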

Some truly beautiful code has been written by way of hijacking the language.  For instance, here’s a program that will calculate the value of pi using an ASCII circle.  Truly neat, but also completely useless from a software development standpoint.

What do you think?  Should I just get over my guilt about repurposing IDisposable?  Or, should I be true to the original intent of the language and find another way?