Wednesday, April 25, 2012

How Do You Solve a Problem Like a Euler? Using F#

I’ve been slowly making my way through the first 100 Project Euler problems while learning F# in an attempt to be more pragmatic - and to try and prevent my grey matter from getting too stale.  After many hours of keyboard-head-banging, I’m now getting to the point where I don’t flail uselessly when beginning to type in some F# code – the pattern matching, automatic generalization and higher-order functions now feel like useful tools rather than strange curiosities.  Since I’ve primarily coded in C# for many years, I’ve been using the book “Real World Functional Programming” by Tomas Petricek as a starting point – it’s been a pretty good read. So, for this post the code will be in F#, but it’ll be secondary to the more general topic of problem solving for puzzle-type scenarios commonly found on Project Euler.

I found this particular problem, #98, a bit more challenging than usual - and kind of fun to work through.  It involves matching anagrams with square numbers using character replacement – here’s the full description:

By replacing each of the letters in the word CARE with 1, 2, 9, and 6 respectively, we form a square number: 1296 = 362. What is remarkable is that, by using the same digital substitutions, the anagram, RACE, also forms a square number: 9216 = 962. We shall call CARE (and RACE) a square anagram word pair and specify further that leading zeroes are not permitted, neither may a different letter have the same digital value as another letter.

Using words.txt, a 16K text file containing nearly two-thousand common English words, find all the square anagram word pairs (a palindromic word is NOT considered to be an anagram of itself).

What is the largest square number formed by any member of such a pair?

This is what I’d consider a pretty good puzzle – the elements (a word list and square numbers) are easy enough to grasp, the way they’re related together is unique for this scenario, and an effective solution does not immediately pop into mind.  Well, it didn’t pop into my mind, at least.  So – where to begin?

Many of the problems (at least the ones I’ve solved so far) on Project Euler involve a few common steps to reach a solution:

  • Generate a set of input values for the problem – usually this will be a large set of potential values, such as all the prime numbers below 10,000,000, or in this case, 16K of words and a bunch of square numbers.
  • Given these inputs, build an algorithm that can test for the solution.
  • Add constraints to reduce the size of the input set so that the solution can quickly be found.  On modern hardware, this is usually under 1 second – but anything under a minute is considered kosher.

Input Values

The input values are easy here – they’ve provided the word list, and a sequence of square numbers should be trivial.  First, reading the word list.  Instead of placing the words each on a new line, the file instead contains a single line formatted like:


Ok – so it’ll take a little parsing.  No problem.

let words =
    |> Array.collect (fun line -> line.Split(','))
    |> Array.map (fun w -> w.Replace("\"", "").ToLower())

This loads the line from the file, splits the words by commas, and removes the quotations… and I threw in making the words lowercase – since I hate having my puzzles constantly shouting at me.  A little aside for the language particulars: the |> is the magical-looking-but-actually-quite-simple pipeline operator – it passes the output from the function on the left to be input for the function on the right. And the fun –> is F# syntax for lambda expressions.  At the end of the pipeline comes a simple array containing the parsed words.


So for the rest of the input - the square numbers.  F# has the means to generate an infinite sequence which can then be sliced and diced to get at the bits that are needed.  The perfect squares can be generated with:

let allSquares = 
Seq.unfold(fun (square,odd) -> Some(square, (square+odd, odd+2))) (0,1)

The square numbers are “unfolded” – that is, each element is calculated based on the result from the previous element, like you’d do with.  Square numbers have a property where they can be generated by as the sum of a list of successive odd numbers. The tuple of (0,1) on the far right is fed in as an initial value, with 0 being the first “square” number and 1 being the first odd number.  At each step of the unfold, the odd number is increased by 2, and the square number is calculated by the adding the previous odd number. So, the (square,odd) tuples being computed will look like:(0,1),(1,3),(4,5),(9,7),(16,9), with only the squares being captured as output.  Works for me!

As with most simple-yet-effective bits of code, this one went through a couple of refinements before nailing it.  The first pass looked like this:

Seq.unfold(fun i -> Some(i, i + 1)) 0
|> Seq.map (fun i -> i*i)

…where the squares are generated in two steps instead of one. The first bit unfolds all the natural numbers from 0 to infinity (or more likely, maxint), and the second part maps these each by squaring them.  It produces the same result, but is less elegant because it uses more moving pieces.

That’s all the input this problem should require – next, to design the…

Success Algorithm

A useful bit of the Project Euler problem descriptions is that they usually include an example which can be used as a test case for a solution algorithm. Here they’ve provided:

care, race // <- anagrams map to
1296, 9216 // <- squares

…with a replacement dictionary of (c,1),(a,2),(r,9),(e,6).  It’s important to note that the anagrams and squares might be symmetric, where the numbers can be swapped (i.e. (c,9),(a,2),(r,1),(e,6) works as well), but there are cases (such as the final answer) where this doesn’t hold.

To devise an algorithm to test for an answer, the first step is to think it over before doing any typing.  It’s better to have some notion of a strategy than to rush in with the codes.  I struggle with adhering to this – it’s just too easy to think “I’m so good, I’ll just figure it out as I program.”  I’ve found it’s more productive to get away - get a pen and some paper and start sketching ideas, take a walk, etc.  There’ll be plenty of time to work through all the details in code, but it’s best to go in with a plan – a vague or even an incorrect strategy is better than none at all.

For this problem, I started with the idea of comparing each letter of the first word with the first square number, building a replacement dictionary while iterating through them.  If a valid replacement dictionary for the first set is found, it would then be tested against the second word-square-number pair. Writing in algorithm in F# meant I could treat the words as lists and iterate them using a common combination of recursive function calls and pattern matching.

let rec getReplacementDictionary (sToT, tToS) (source, target) =
    match source, target with
    // have already seen this exact mapping -> skip it
    | s::ss, t::tt when Map.containsKey s sToT && (Map.find s sToT) = t 
        -> getReplacementDictionary (sToT, tToS) (ss, tt)
    // have a mapping for the source, but it's not the target -> failure
    | s::_, _ when Map.containsKey s sToT
        -> None
    // have a mapping for the target, but it's not the source -> failure
    | _::_, t::_ when Map.containsKey t tToS 
        -> None
    // never before seen mapping -> add it
    | s::ss, t::tt
        -> getReplacementDictionary (Map.add s t sToT, Map.add t s tToS) (ss, tt)
    // end of the line - a successful translation!
    | [], [] -> Some(sToT, tToS)
    | _ -> raise(System.ArgumentException("words not equal length"))

I’ve named the variables “source” and “target” – they’ll get passed the word and the number, respectively.  They are “matched” against conditions using the patterns seen on each line following the pipe “|” character.  The easiest of these to grasp is near the end – “| [], [] –> Some(sToT, tToS)“ which matches two empty lists, indicating that the words have been completely checked and that there is a valid dictionary.  For the dictionary itself, I found that it was necessary to keep tabs on both the source-to-target values (sToT) as well as the target-to-source (tToS) values.  A bi-directional mapping structure would be ideal, but it would have been more effort to construct that than to fudge it with an extra variable.  If there is a better way to handle this, I’d be interested to hear it… 

At any rate, the result of this function will either be a failure - with the None value - or with the completed dictionary(s) – the Some(sToT, tToS).  Testing it against “care” and “1296” produces the expect result:

> getReplacementDictionary (Map.empty, Map.empty) ("care" |> List.ofSeq, "1296" |> List.ofSeq);;
val it : (Map<char,char> * Map<char,char>) option =
    (map [('a', '2'); ('c', '1'); ('e', '6'); ('r', '9')],
     map [('1', 'c'); ('2', 'a'); ('6', 'e'); ('9', 'r')])

…and changing a digit in the number to one of the already mapped digits results (I’m using 1292 which repeats the 2) in a failure:

> getReplacementDictionary (Map.empty, Map.empty) ("care" |> List.ofSeq, "1292" |> List.ofSeq);;
val it : (Map<char,char> * Map<char,char>) option = None

To test this dictionary against the second group, “race” and “9216”, the original function may be reused because all of the character mappings will have been seen in the first group and they will simply be verified via the first pattern. 

let dict = getReplacementDictionary (Map.empty, Map.empty) ("care" |> List.ofSeq, "1296" |> List.ofSeq)
getReplacementDictionary dict.Value ("race" |> List.ofSeq, "9216" |> List.ofSeq);;

val dict : (Map<char,char> * Map<char,char>) option =
    (map [('a', '2'); ('c', '1'); ('e', '6'); ('r', '9')],
     map [('1', 'c'); ('2', 'a'); ('6', 'e'); ('9', 'r')])

This is making an assumption that the input words are anagrams of each other – this is an OK assumption to make for now.  Indeed, it can be even assumed that the input data may be filtered for anagrams – I’ll cover that in the next section.

I noticed that this could be further streamlined by simply appending the words and the numbers together, since that is more-or-less what is occurring when calling it twice.  And in that case, the function no longer needs to return the mapping dictionary – just a simple boolean indicating if a valid dictionary can be applied to the input.   So, the function can be modified as follows, making sure to rename it:

let rec hasReplacementDictionary (sToT, tToS) (source, target) =
    match source, target with
    | s::ss, t::tt when Map.containsKey s sToT && (Map.find s sToT) = t 
        -> hasReplacementDictionary (sToT, tToS) (ss, tt)
    | s::_, _ when Map.containsKey s sToT
        -> false
    | _::_, t::_ when Map.containsKey t tToS 
        -> false
    | s::ss, t::tt
        -> hasReplacementDictionary (Map.add s t sToT, Map.add t s tToS) (ss, tt)
    | [], [] -> true
    | _ -> raise(System.ArgumentException("words not equal length"))

and run a test:

> let los = List.ofSeq
let s = List.append (List.ofSeq "care") (List.ofSeq "race")
let t = List.append (List.ofSeq "1296") (List.ofSeq "9216");;

> hasReplacementDictionary (Map.empty, Map.empty) (s, t);;
val it : bool = true

Success!  Now, having all the necessary input and a valid algorithm, it would be possible to test every possibly combination of inputs to find the correct answer. 

Input Constraints

Project Euler problems are usually infeasible to answer without a computer.  (There’s sometimes really smart math folks in the forums who will find clever ways around this, but for the rest of us…) 

How many combinations of the input data could there be, anyway?  There’s a total of 1,786 words ranging in length from 1 to 14 characters, which leads to ~3.2 million combinations of just the words.  My script to calculate the count of all square numbers under 14 digits ran out of memory, but it turns out only up to 9 digit squares are needed.  There are 31,623 of those.  The number of combinations would be 1,7862 × 31,6232 ≈ 3.2E15.  That’s several thousand billion – got to keep filtering or it’ll take a year!

The next obvious step is to pair up the anagrams and discard the rest of the words.  A simple way to do this is to alphabetically sort each character in each word and group together the ones that are exactly the same.  For example “care” and “race” will both map to “acer” – neat trick!  Here’s the code

let groupAnagrams ws =
    // sort by characters in each string and then glue them back together
    |> Seq.groupBy (Array.ofSeq >> Array.sort >> (fun cs -> new string(cs)))
    // take only the anagrams - where the number of words is greater than one
    |> Seq.map snd
    |> Seq.filter (Seq.length >> ((<) 1))

This is a great first step – but it’s not quite enough.  I want all the pairs of anagrams and there are cases where there are more than 2 words that all have the same letters.  For example, “post”, “spot” and “stop” will all group together.  What I need is combinations of all the pairs – so I’ll end up with (“post”, “spot”), (“spot”, stop”), and (“post”, “stop”).  Fortunately, I had already borrowed/stolen a generic combination function that Tomas had posted on Stack Overflow.

let combinations size set = 
    let rec combinations acc size set = seq {
        match size, set with 
        | n, x::xs -> 
            if n > 0 then yield! combinations (x::acc) (n - 1) xs
            if n >= 0 then yield! combinations acc n xs 
        | 0, [] -> yield acc 
        | _, [] -> () }
    combinations [] size (set |> List.ofSeq)

Now I can run all the whole list of anagram groups through the combinations algorithm and produce a single list of anagram pairs:

let pairwiseTuples =
    Seq.collect (
        combinations 2
        >> Seq.map (fun l -> l.[0], l.[1])

And indeed – it does its job:

[|("cat", "act"); ... ("race", "care");
  ("spot", "post"); ("stop", "post"); ("stop", "spot"); 

Running the words through this results in 44 pairs of anagrams with the lengthiest pair being 9 letters long ("reduction", "introduce"). This is starting to sound reasonable to process – and it’s where I got stuck for a bit.  Looping through all of the squares for each pair still seems computationally prohibitive. 

I walked away from it for the day and in the shower the next morning – “duh, the square numbers can be bundled up as anagrams as well!"  These are the ah-ha moments that make it all worthwhile.  A few examples – (“1024”, “2401”), (“14400”, “10404”).  There turned out to still be quite a few for the larger digit counts (for example, there are 70052 anagram pairs for square numbers with 9 digits), but since there are so few words in the list that are above 5 characters long, it turned out to be OK.

All was left was to write a few more lines to run all the equal-length anagram pairs through the success function and find the result.  It ran in about a second on my hardware, so that is an acceptable solution in my mind.  There are undoubtedly numerous additional optimizations that could be applied – and perhaps even a way to word it out on paper (I doubt it for this one) – so I’d be eager to hear any comments.