February 23, 2024

Haskellbrain Harder: Week 5 of 48 in 24

Week 5 of 48 in 24! This time around we’re doing Protein Translation. Basically, take a sequence of amino acids and translate that into a list of proteins. Each triple of amino acids – a codon – gets mapped to a protein. Or rather, some codons are valid proteins, some represent a “stop” signal, and others are garbage and should throw an error.

Bash

The first language I thought I’d try this week is bash. I use bash shell most days, but I very rarely do anything beyond simple commands. (Maybe a little grep, some modest piping…) It would, however, be kind of useful to know a little more about how to write simple scripts. Often if I need to do something like that, I’d reach for python which I know better than I do bash, but sometimes that seems like overkill. And I did run into a situation at work recently where a little bash script was useful. The container I was troubleshooting didn’t have python installed, indeed, it didn’t have lsof installed, but I needed a way to get an overview of what was eating all the file descriptors. I cobbled together something where I looped over ls /proc/[0-9]* and pulled out all the cmdlines. It sort of worked but it was painful and it made me realise a better handle on bash (and unix tools more generally) would be a good investment. So here is my prosaic, straightforward solution to the exercise in bash.

#!/usr/bin/env bash
main() {
    v=$1 #get input
    l=${#v} #get length of input
    x=""
    for ((i=0; i<l; i=i+3))
    do
        r=${v:i:3}
        case "$r" in
            AUG) x+="Methionine ";;
            UUU|UUC) x+="Phenylalanine ";;
            UUA|UUG) x+="Leucine ";;
            UCU|UCC|UCA|UCG) x+="Serine ";;
            UAU|UAC) x+="Tyrosine ";;
            UGU|UGC) x+="Cysteine ";;
            UGG) x+="Tryptophan ";;
            UAA|UAG|UGA) break;;
            *) echo "Invalid codon"; return 1;;
        esac
    done
    echo "${x% }"
}
main "$@"

OK so we define a function called main and then in the last line we call it. "$@" is bash’s way of saying “pass the whole of any argument passed into this script into main”. You can use $1 etc to refer to the different arguments passed in (where arguments are delimited by whitespace). The input, in this case, is the sequence of amino acids we need to translate. Within the function, the first thing we do is throw away everything except the first argument and call that first argument v. Then we get its length with ${#v}. Then, basically we proceed as you might expect. We loop over the input string in threes using index i. We grab the substring with r=${v:i:3}, that is, let r be the substring of v that starts at index i and goes on for three positions. Then we use a case – bash’s switch statement – to pattern match on the valid codons. If we find a match, we add the relevant protein to the output string x. If we find a stop codon, we break out of the loop. One nice thing about this pattern matching is you can use a|b to match a or b. If we fall through all of those possibilities, we echo the error statement, and return a non-zero value. If we get to the end of the loop (through running out of codons or through break) we echo the output string.

My original final echo statement there was just echo $x which passes all the tests, but when I submit the solution, exercism runs a linter called ShellCheck on the code and the linter complains that I should enclosed $x in quotes. But when I did that, the tests no longer passed because of the trailing whitespace. Note each time we append a protein to x we also append a space, so there’s an extra space on the end of the string. So the ${x% }was my solution for removing the trailing whitespace. ${a%b} is bash’s way of removing the shortest string matching b from the end of a.

Elixir

This week, again, I spent more time doing the elixir learning exercises to unlock Protein Translation than I did actually doing Protein Translation. I didn’t particularly like the constraints imposed by the way the exercise was being tested, so I ended up effectively writing my implementation how I wanted to, and then writing wrappers to make it fit the way exercism wanted the tests to look.

defmodule ProteinTranslation do
  @doc """
  Given an RNA string, return a list of proteins specified by codons, in order.
  """
  @spec of_rna(String.t()) :: {:ok, list(String.t())} | {:error, String.t()}
  def of_rna(rna) do
    p = Enum.take_while(map_to_proteins(rna), &(&1!=:STOP))
    cond do
      Enum.any?(p, &(&1==:ERROR)) -> {:error, "invalid RNA"}
      p == [] -> {:ok, []}
      true -> {:ok, p}
    end
  end
  
  defp map_to_proteins(rna) do
    to_charlist(rna)
    |> Enum.chunk_every(3)
    |> Enum.map(&_codon/1)
  end
  
  @spec of_codon(String.t()) :: {:ok, String.t()} | {:error, String.t()}
  def of_codon(codon) do
    c = to_charlist(codon)
        |> _codon()
    case c do
      :ERROR -> {:error, "invalid codon"}
      :STOP -> {:ok, "STOP"}
      :EMPTY -> nil
      _ -> {:ok, c}
    end
  end
  
  defp _codon(codon) do
    case codon do
      ~c"UGU" -> "Cysteine"
      ~c"UGC" -> "Cysteine"
      ~c"UUA" -> "Leucine"
      ~c"UUG" -> "Leucine"
      ~c"AUG" -> "Methionine"
      ~c"UUU" -> "Phenylalanine"
      ~c"UUC" -> "Phenylalanine"
      ~c"UCU" -> "Serine"
      ~c"UCC" -> "Serine"
      ~c"UCA" -> "Serine"
      ~c"UCG" -> "Serine"
      ~c"UGG" -> "Tryptophan"
      ~c"UAU" -> "Tyrosine"
      ~c"UAC" -> "Tyrosine"
      ~c"UAA" -> :STOP
      ~c"UAG" -> :STOP
      ~c"UGA" -> :STOP
      ~c"" -> :EMPTY
      _ -> :ERROR
    end
  end
end

The first thing to note is that in the elixir version of the problem, you aren’t just outputting a string of space-separated proteins. Instead you are outputting a tuple consisting of an atom that indicates how the translation went – :ok if it went ok, :error if it didn’t – and a list of proteins. Kind of like the Go convention where you return the value and an error (which is null if the return value is good). Elixir doesn’t have multiple return values, so it goes for a tuple. This is actually a case where something like Haskell’s Maybe could make sense. We’ll see what I do with that later. The tests were also broken down into a function that turns a codon into a tuple, and a function that turns a string into a tuple. So in the way exercism want you to solve the exercise, you write a of_codon function that returns one of these return tuples with a protein in it, and then of_rna takes the full string, breaks it into codons, presumably uses of_codon internally to do the translation, handles all the returned tuples, and compiles a result tuple.

I didn’t particularly like that approach. So I kind of bypassed it. The real work is being done by _codon and map_to_proteins and the other two functions are really just wrappers to provide the tuples that exercism expects. So _codon takes a charlist – which, remember, is not the same as a string in elixir – and maps codons to proteins, or to atoms that signify some other status such as stop codons, errors and so on. This is then wrapped by of_codon which just turns the input string into a charlist and passes it to _codon.

A word on elixir’s |> pipe construct. These work pretty much how pipes work in, say, bash. So the output from the line above is passed in to a line that starts with |> as the first argument of the function on that line. So

    c = to_charlist(codon)
        |> _codon()

means the same as

    c = _codon(to_charlist(codon))

I don’t love this “implicit” argument thing, it’s sort of confusing to write _codon() but mean “pass some unnamed value into _codon”. But it does make the code pretty readable, to be fair.

of_codon then takes the output from _codon and wraps it in the appropriate tuple box. So that’s the first part.

Part two involves taking the full string, breaking it down, and mapping those chunks to the right proteins. You’ll note here that I don’t call of_codon, I just call _codon. map_to_proteins does all the manipulation really. We turn the string into a charlist first. Enum.chunk_every(ls, n) turns a list ls into a list of n-element sublists. There’s a variety of more complex chunk functions to handle variable length sublists, how to manage an underfull final chunk etc, but not of that is relevant here. So at this stage in the pipeline we have a list of 3-element charlists. We then use Enum.map(ls, &_codon/1) to map the function &_codon/1 over the list ls. (That first argument ls isn’t actually explicitly named in the above code: it comes from the line above through the pipe).

Elixir has a number of ways of referring to anonymous functions. Most basically, fn x -> f(x) end refers to the function that does f to x. Then there’s &(f(&1)) would refer to the same thing. So & signifies we’re defining an anonymous function, and then within the parentheses, &1 would refer to the first parameter passed in to the anonymous function. So &(&1 + &2) is the anonymous function that returns the sum of its first two parameters. Finally, if you need to refer to an already named function such as f you can do so with &f/1 which means “the form of f that takes one argument”. This is necessary since elixir permits multiple function definitions for the same name. That’s what we’re doing with &_codon/1 we’re passing in _codon as the function to map over the list of charlist chunks. The following all mean the same:

    |> Enum.map(&_codon/1)
    ---
    |> Enum.map(&(_codon(&1)))
    ---
    |> Enum.map(fn x -> _codon(x) end)

All of_rna needs to do then is, take all of the elements from that output list up to where the first :STOP codon appears. We do this with a Enum.take_while which checks each element of the list against a function that returns a boolean and returns the initial segment up to but not including the first element where the condition is true. The boolean function in this instance is &(&1!=:STOP) which is using that second form of anonymous function we mentioned earlier. This translates into fn x -> x != :STOP end. Once we’ve done that, check if there’s any :ERROR atoms in the list. If there are, return {:error, "invalid codon"}. (I don’t actually know if atoms are case sensitive but it doesn’t matter here, these can be different atoms or the same). If there are not any :ERROR atoms in the list, just return the whole list. Because the exercism tests insisted on treating the empty string as a weird edge case in of_codon we also have to do the same here. But other than that, if we’ve passed the above checks, we can just return our result wrapped up in the :ok tuple box.

Haskell

Now both of the above solutions are fairly orthodox ways to solve the Protein Translation problem. In bash we use a loop, in elixir we use a map, but in both cases it’s mostly just plumbing around repeatedly running chunks of input through a big switch statement, more or less. For this week’s haskell solution, I ended up getting really silly.

Bear with me, it’s going to be a long-ish explanation. OK. So instead of consuming the input data three characters at a time, what if we consumed the characters one at a time, and used some sort of little state machine to keep track of where we are, and what are the valid moves available to us? If you look at the list of valid codons, you can see that, for example, the only valid characters for the first third of a codon are U or A. Now, if we get a U as our first character, then the valid continuations are U, C, A, or G. on the other hand, if we get an A, the only valid continuation is a U, So imagine that our first letter was A and our second character was U. Then the only valid completion of the codon is G which gives use AUG: Methionine. In effect, there’s a little sort of tree for each starting letter corresponding to what the valid continuations are. And at the leaf nodes of this tree is the corresponding protein or stop node for the codon you have traversed. Yeah sure, let’s code it up that way instead of just mapping over a switch statement.

The code for this one is a bit longer, so I’m going to explain it in chunks, and then only at the end show the whole thing.

The Tree Structure

module ProteinTranslation(proteins) where

type Nucleotide = Char
type Protein = String
data ProteinTree = StopNode Nucleotide | Node Nucleotide [ProteinTree] | LeafNode Nucleotide Protein deriving Show

The first thing we do is set up some types. Most things we’re going to be handling are characters, strings and lists of strings so it would get confusing if we couldn’t add a little clarity on what’s what. Hence type synonyms Nucleotide and Protein. Then we have a new type definition which is basically the type for the tree structure we are going to be using. It’s a recursive definition: the Node type contains a list of ProteinTrees.

Using this we can define the data structure that we will be using in solving this problem:

starterTree :: ProteinTree
starterTree = Node 'Z' [ Node 'A' [ Node 'U' [ LeafNode 'G' "Methionine" ]],
                         Node 'U' [ Node 'U' [ LeafNode 'U' "Phenylalanine",
                                               LeafNode 'C' "Phenylalanine",
                                               LeafNode 'A' "Leucine",
                                               LeafNode 'G' "Leucine" ],
                                    Node 'C' [ LeafNode 'U' "Serine",
                                               LeafNode 'C' "Serine",
                                               LeafNode 'A' "Serine",
                                               LeafNode 'G' "Serine" ],
                                    Node 'A' [ LeafNode 'U' "Tyrosine",
                                               LeafNode 'C' "Tyrosine",
                                               StopNode 'A',
                                               StopNode 'G'],
                                    Node 'G' [ LeafNode 'U' "Cysteine",
                                               LeafNode 'C' "Cysteine",
                                               LeafNode 'G' "Tryptophan",
                                               StopNode 'A']]
             ]

We start at the initial node, its ProteinTree component contains a U node and a A node. Each of them contains nodes within and so on down until we hit the terminal nodes, LeafNode or StopNode.

Let’s also define some helper functions that allow us to get bits of data out of our tree. getRNA takes a ProteinTree and extracts the Nucleotide from its root node. getChildFromParent tries to look in the [ProteinTree] list of a Node for a child node whose nucleotide matches the input. It returns a Maybe ProteinTree so if it can’t find a match, it returns Nothing. getChildFromList does basically the same thing, but starting from a [ProteinTree] list itself, rather than the node that contains it.

getRNA :: ProteinTree -> Nucleotide
getRNA (Node s _) = s
getRNA (LeafNode s _) =  s
getRNA (StopNode s) =  s

getChildFromParent :: Nucleotide -> ProteinTree -> Maybe ProteinTree
getChildFromParent n (Node _ pt) = getChildFromList n pt
getChildFromParent _ _ = Nothing

getChildFromList :: Nucleotide -> [ProteinTree] -> Maybe ProteinTree
getChildFromList _ [] = Nothing
getChildFromList s (x:xs) = if getRNA x ==  s then Just x else getChildFromList s xs

The State Machine

We’ve got our data structure, now we need to figure out how to traverse it. Before we get to the real heart of the state machine, let’s set up a type for the state, and a helper function.

type TreeState = (String, ProteinTree, [Protein])

pushOutMaybe :: (String, Maybe ProteinTree, [Protein]) -> Maybe TreeState
pushOutMaybe (_, Nothing, _) = Nothing
pushOutMaybe (s, Just p, l) = Just (s,p,l)

TreeState will keep track of all the information we need while performing this calculation. The first element of the TreeState tuple is the remaining input data that hasn’t yet been consumed. The ProteinTree will contain the node we are currently positioned on and [Protein] is where the output goes. Recall that getChildFromParent and getChildFromList yield a Maybe ProteinTree. The function pushOutMaybe does what it says, it turns that inner Maybe ProteinTree into a Maybe TreeState by effectively turning a Nothing into a Nothing and a Just inside the tuple into a Just tuple. I’m sure there’s a more haskell-y way to do this, I just haven’t had time to go back and figure it out yet.

Now we get to where the real action is. This function performs one step of our state machine.

step :: TreeState -> Maybe TreeState
step ("",     (LeafNode _ p), proteinList) = Just ("", StopNode 'Z', p:proteinList)
step ((x:xs), (Node _ pt) ,   proteinList) = pushOutMaybe (xs, getChildFromList x pt, proteinList)
step ((x:xs), (LeafNode _ p), proteinList) = pushOutMaybe (xs, getChildFromParent x starterTree, p:proteinList)
step _ = Nothing

We’re using some pattern matching here so the first of these patterns to match will be the one performed. Starting from the top, then. If we’re at a leaf node and the input string is empty, we’re done: output a TreeState where the leaf node is replaced with a stop node, and we’ve added the leaf node’s protein to the front of the protein list. (Appending to the front of a list is apparently faster in haskell than appending to the end). Wrap it all up in a Just. If that first pattern didn’t match, and we’re at a Node node, then take the first character of the input and try to find a child node matching that nucleotide. The resulting TreeState will be the remaining input string less the character you consumed, the child node you found and an unchanged protein list. If you didn’t find a matching node, then pushOutMaybe will make this result in a Nothing. If we didn’t match that line, try matching on LeafNode. If we are at a leaf node, we have to add the leaf node’s protein to the protein list. We then, in the same step consume the next character of the input and try to match it against the children of the starterTree. Again, pushOutMaybe will ensure we return Nothing if there is no such child. Finally, if we hit any other kind of situation, return Nothing.

OK so that’s how we do one step. How do we do many?

isStopState :: Maybe TreeState -> Bool
isStopState Nothing = True
isStopState (Just (_, StopNode _, _)) = True
isStopState _ = False

multiStep ::  Maybe TreeState -> Maybe TreeState
multiStep ts = until (isStopState) (>>= step) ts

First, we need to figure out when we are supposed to stop, because if we don’t catch those successful stops, then we’ll always eventually run out of input and hit that final step _ = Nothing line. So, if we’ve already hit a Nothing then, yeah, no point continuing. If we see a StopNode that also means stop. Otherwise we should keep going.

Then we have multiStep which packs a lot in to one line. I explained until in a previous week, but it basically keeps doing its second argument to its third until its first argument is true. So basically we keep doing step to ts until isStopState is true.

But we don’t keep just doing step ts, we in fact keep applying >>= step to ts. This is haskell’s monad bind operator. I’m only just getting the hang of this stuff myself, so I’m not sure how good a job I’ll do at explaining it, and I might get things wrong, but here’s what’s going on as I understand it.

step ts would cause a type error, because according to the type declaration for multiStep, ts is of type Maybe TreeState, whereas according to the type declaration for step, its argument should be of type TreeState. So that would cause a problem. We want to repeatedly apply a function of type TreeState -> Maybe TreeState. That is, we want to apply the output of one instance of that function to another. But the input and output types don’t match up, so how are we supposed to do it?

This is a pretty common pattern: you want to perform some actions, and each of them might cause an error. Rather than have each step have to handle the possible errors from the previous step, you just write all the functions as if they’re guaranteed to receive a succesful output from the previous step. Then you just have to wire them together in such a way that, at each step, an error state is caught and handled by the thing doing the wiring, rather than the functions themselves. In this case, each of the functions is the same step function applied consecutively. We let the monad bind operator work its magic here. Basically what it does is, if the output of the previous step was a Just ts, then it passes ts in to the next iteration, and if the previous output was a Nothing then the next step outputs Nothing too.

So until (_) (step) ts would have ended up with a series of applications that looks like: step . step . step ... ts, with until (_) (>>= step) ts we get: ts >>= step >>= step >>= step...

And with that, we’re pretty much done. We just need to turn Nothing into a string complaining about “invalid codon” and reverse our list of proteins (remember we were adding to the front of the list, but we want the list to mirror the order of the codons).

gatherProteins :: Maybe TreeState -> [Protein]
gatherProteins Nothing = ["Invalid codon"]
gatherProteins (Just (_,_,pl)) = reverse pl

proteins :: String -> Maybe [String]
proteins s = Just. gatherProteins . multiStep $  Just (s, starterTree, [])

Revisiting this now, it strikes me that the tests are set up very weirdly. I’m supposed to return a Maybe [String] but when I hit the fail state, I don’t return a Nothing, I return a Just singleton list with an error message in. You’ll notice that gatherProteins takes in a maybe and outputs a… not-Maybe. And then proteins claims to output a Maybe [String] but there’s no way it can return a Nothing. It strikes me these tests are just not configured to actually test proper use of a Maybe.

There are a couple of things that I am not super happy with in this code. The use of Z as a placeholder in a couple of places where I define a node, but it doesn’t matter what value it has. (In the starter tree, in the first line of multiStep). I also think there has to be a neater way to handle what pushOutMaybe is currently doing: turning a Maybe inside some structure into a Maybe for the whole structure.

Conclusion

That haskell solution was silly, and unnecessary. I’m sure the bash and elixir solutions would be more efficient, and they’re both significantly easier to understand than the haskell solution. On the other hand, I learned a lot from trying to do something silly, and that is, after all, what the point is.

© Seamus Bradley 2021–3

Powered by Hugo & Kiss.