March 11, 2024

Three Genuinely Different Solutions: Week 7 of 48 in 24

Week 7 sees us take on the challenge of turning a string into an acronym. If we’re being picky, these are mostly initialisms rather than acronyms because they don’t spell a word. And in a break with tradition, I actually took three fairly different approaches to solving this problem in the three languages I tried.

In fact, I also tried doing this exercise in PowerShell (because I had ambitions to go for the gold medal of doing Haskell, PowerShell and TCL). I gave up because learning bash and elixir pretty much from scratch for this project is already a lot, so trying to wrap my head around PowerShell as well very quickly became more effort than I was willing to put in. I like the idea of PowerShell (what if a shell, but with objects) but in practice I find it kind of annoying.

Bash

Let’s start with bash, the most straightforward of my solutions. To create an acronym, we need to find the first letter of each word, and add it to a list, and then turn that list into a string.

#!/usr/bin/env bash

main(){
    s="$*"
    len="${#s}"
    acc=("${s:0:1}")
    for ((i=1;i<len;i++)); do
        l=${s:i:2}
        case $l in
            [[:space:]-_][[:alpha:]]) acc+=("${l:1:1}")
            ;;
        esac
    done
    echo "${acc[@]}" | tr '[:lower:]' '[:upper:]' | tr -d '[:space:]'
}
main "$@"

So what we do is, as we loop through the string, we look at each character and the one after it, if that pair of chracters matches the pattern “a space, and then a letter” we add the letter to acc. Why did we do a case statement with only one case? Well, I was expecting I’d have to handle a bunch of different cases, but it turns out that I could use bash’s pattern matching to catch all the first letters in one case. (I was expecting more complicated cases because of wrinkles we’ll encounter in the haskell solution). We saw case before in the Protein Translation solution but here we’ve got a bit more of a complicated matching expression. Bash pattern matching is a bit like regular expressions (which we’ll discuss further later) but it’s not the same. If you look at the mess of brackets in the case statement, you’ll see we have two pairs of square brackets. The first is [[:space:]-_] and the second is [[:alpha:]]. There’s no wildcards or repeats or anything so we’re matching two characters. The first is either a space character ([:space:]) or a hypen (-) or an underscore (_). And the second is an alphabetic character ([:alpha:]). In hindsight, I should not have put the hyphen in the middle there because you can also use the hyphen to signify a range – [0-9A-F] matches a hexadecimal digit for example – but I think it’s safe because the thing to its left (the character class [:space:]) can’t be the endpoint of a range? Probably better to put the hyphen first in the list though.

Now you’ll note that this never matches the first character of the string, but it’s ok because acc is initialised with the first character. What if the first character is a space? It’s ok because at the end we use tr to delete (-d) anything matching [:space:] from our string. The tr -d [:space:] is actually turning our array “P N G” into “PNG”, and the removing leading spaces is just an added bonus sdie-effect. We also use tr to replace any lowercase characters with uppercase ones.

Elixir

Speaking of regular expressions, I knew as soon as I saw the exercise that I wanted to try to solve it with regex. I tried to do it in bash, but bash doesn’t really support extracting capture groups in a flexible enough way for me to do what I had planned. So I fell back on a different approach in bash, and decided to use regex in elixir instead.

defmodule Acronym do
  @doc """
  Generate an acronym from a string.
  "This is a string" => "TIAS"
  """
  @spec abbreviate(String.t()) :: String.t()
  def abbreviate(string) do
    Regex.scan(~r/([A-Z]|(?<=\s)[a-z])\w*/, string)
    |> Enum.map(fn [_, y] -> y end)
    |> Enum.join("")
    |> String.upcase()
  end
end

All the action, really, is in that first argument to Regex.scan. Before we go through that, let’s look at the remainder of the solution. Regex scan will give you a list of all the matches of your regular expression. Each match will consist of a list where the first element is the full match, and then the succeeding elements are the captured groups. For our solution we just want the first captured group (there is only one). So the line after the scan extracts the second element from each list, the next line joins those characters into a string, and the last line converts to upper case.

But what is that regex doing? ~/r.../ is elixir’s way of signifying we’re creating a regex. Then we have (...|...)\w*: so that means a capturing group with two possible patterns to match and then zero or more word characters. \w is equivalent to [a-zA-Z0-9_]. The inclusion of the underscore here has tripped me up numerous times. If you thought that “british_broadcasting_corporation” should be abbreviated “BBC” then you’d want to count the underscore as a “space-like” character rather than a “word-like” one. We don’t have to worry about that here because the test cases are quite lenient.

So what are the two possibilities for the content of the matching group? The first is simply [A-Z] so an uppercase letter. The second is (?<=\s)[a-z]. That’s a positive lookbehind for a \s – a space character – and then a lowercase letter. So that’s saying “match if we’ve got a lowercase letter where the character before it was a space”.

In unpacking this regex, I don’t think it actually addresses the cases I thought it did, and I got away with it because the test cases aren’t very comprehensive. I’ll come back to this once I’ve discussed my haskell solution.

Haskell

Another thing that I quickly realised when I started thinking about this exercise was that it would be a good opportunity to try out one of haskell’s parsing libraries. I chose Parsec, primarily because it’s the one that is discussed in Real World Haskell.

Haskell does have regex support, but there appear to be a number of options and it seems like the prevailing wisdom is that it’s often better to build a bespoke parser rather than using regex. Going into this, I wasn’t sure this is true if you don’t know how haskell parser libraries work and you do know regex, but I thought I’d give building a parser a go anyway.

Making this solution work will require adding parsec to the dependencies in the package.yaml file in exercism.

module Acronym (abbreviate) where

import Text.Parsec(sepBy,many1,letter,ParseError,parse, upper,many, (<|>), char,lower, oneOf)
import Data.Char(toUpper)

continuations = many1 (lower <|> char '\'') <|> many (upper <|> char '\'')
initial = letter <* continuations
initials = sepBy (many initial) (many1 (oneOf " -,_"))

parseInput :: String -> Either ParseError [String]
parseInput = parse initials "ERR" 

extractInput :: Either ParseError [String] -> [String]
extractInput (Left err)  = [show err]
extractInput (Right xs) = xs

abbreviate :: String -> String
abbreviate xs = map toUpper $ concat parsed
  where parsed = extractInput . parseInput $ xs

I am not going to attempt to explain what’s actually going on behind the scenes here, because this library is using some of haskell’s more funky features. But the overall structure is as follows: you build a parser – called initials here – by chaining together various component parsers, then parse initials "ERR" returns a function that takes an input string and returns either a ParseError or a list of strings. (The ParseError will make use of your "ERR" string.)

The basic parser functions char, letter or lower for example consume a character of the input, and then if it’s a character they were expecting they return it, otherwise they trigger a parser error. For example, char '\'' will consume and return an apostrophe, or trigger an error if the chracter is not an apostrophe. These functions can then be combined in a bunch of ways. many for example keeps repeating the same parser until it errors. many1 does the same thing, but it will trigger an error if it doesn’t successfully consume at least one character. The <|> operator is used to chain parsers together. It works kind of like an “or”. So the continuations parser is looking for many (at least one) lowercase characters or apostrophe; OR many uppercase characters of apostrophe. So this parser would successfully consume “ab’c” and it would happily consume “A’BC”, but it would not reach the end of “Abc”: it would consume the “A” and then stop.

initial uses another kind of combination operator <*. In the context of combining parsers, what this does is it applies the left hand parser (in this case, letter) and then it applies the right hand parser (continuations) and it throws away the output of the right hand parser and just keeps the left hand parser output. So the initial parser consumes the first letter of the input and keeps it, and then consumes the continuation of the input and throws it away. There are also *> and <*> which keep the right hand only and both sides respectively. For example, if you wanted to parse and output a capitalised word – an uppercase letter followed by zero or more lowercase letters – you could do that with upper <*> (many lower).

We then also have the sepBy operator for combining parsers. This takes two parsers, and applies them alternately, keeping only the outputs of the first. (The idea is that you use sepBy when there’s some data split up by separators and you want to extract the data and ignore the separators). So the parsers here are many initial – we keep the output of these – and many1 (oneOf " -,_"). So the things we’re interested in are separated by spaces, hyphens, commas or underscores (and we throw away these irrelevant separators).

You might think that this seems overcomplex for the problem at hand. Why can’t the first argument to sepBy not just be initial? Why do we need to handle the upper and lower cases separately in continuations? The answer is that the haskell track, unlike the other two tracks I explored for this exercise, wants you to handle camel case acronyms: “HyperText Markup Language” should become “HTML” rather than “HML”, but “GNU Network Object Model Environment” should be “GNOME” rather than “GNUNOME”.1 Convince yourself that the bash and elixir solutions would fail on these cases.

So after initial consumes H, continuations needs to consume yper but then let many trigger initial to consume the T before continuations gets to eat ext. On the other hand after initial consumes the G, it’s the right disjunct of continuations that consumes NU.

You’ll notice that I have not given these parser functions type signatures. That’s because they would be horrible and complex. The best approach is to use them in a function that does get a type signature (like parseInput) and then let the compiler infer the parser’s types from that. This is, in fact, an important point: if you don’t give the compiler this clue as to what the types should be, it will probably complain about ambiguous types.

So once the parser is built, all we need to do is extract the successfully parsed list of strings from the Either, uppercase them, and concat them into a string.

I don’t really understand the dark magic and weird invocations that make Parsec tick, but I’ve read enough tutorials and docs to piece together this parser. I think this is an approach I will come back to in the future, because it is already clear that building a parser like this is has the potential to be more readable and more powerful than resorting to regex. Next time, I might try the apparently superior megaparsec library. I’ve seen two good megaparsec tutorials recommended online (one, two) and I’ve since learned that parsec is no longer being actively developed. So expect more of this sort of nonsense.

I have a kind of ulterior motive in finding an excuse to learn about haskell parsers. 2023’s attempt at Advent of Code fell apart very quickly because I found parsing the inputs in haskell to be something of a chore.2 So I’m planning to go into the 2024 edition with a stronger grasp of haskell parsing libraries in the hope of getting further than day 4.

I don’t want to fall too far behind with 48 in 24, but I am considering a bonus post on what you’d need to do to get the elixir and bash solutions to correctly handle camelcase. I’m not making any promises though.


  1. The actual exercism test uses the GNU Image Manipulation Program as the example, but I think we should just retire that acronym at this stage. ↩︎

  2. I have plans to write up a slightly longer AoC postmortem soon. ↩︎

© Seamus Bradley 2021–3

Powered by Hugo & Kiss.