TL;DR
This article describes how to refactor complex data processing and improve its readability, using syntax similar to Haskell's do-notation and the State monad.
Motivation
The latest project I've been involved in deals with a lot of data processing.
A typical case was a method that receives a data chunk and chains it through a couple of Enumerable methods like map, each_with_object, etc.
Some of the chained blocks were simple, some of them not.
But overall method readability degraded significantly even with just two of them chained together.
I'd love to have this refactored, and decomposition looks like the obvious solution: slice the big method into small ones and then chain them together. Seems straightforward.
Refactoring
Let’s say we have a public interface with a method like this:
```ruby
module Work
  def process(data, parameter)
    data.group_by do |point|
      compute_point_key(point)
    end.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Don't try to guess what is going on here: the method is completely made up.
Let’s try to refactor it in several ways.
Iteration I: extra variable
```ruby
module Work
  def process(data, parameter)
    grouped_data = group_by_key(data)
    compute_values(grouped_data, parameter)
  end

  def group_by_key(points)
    points.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(points, parameter)
    points.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
This looks fine; however, notice the extra grouped_data variable.
The more blocks you chain, the more extra variables you'll have to deal with.
Without extra variables it gets even clumsier, due to the parameter argument and the reversed order of function invocations (from right to left):
```ruby
def process(data, parameter)
  compute_values (group_by_key data), parameter
end
```
Iteration II: adding state
```ruby
class Work < Struct.new(:data)
  def process(parameter)
    group_by_key
    compute_values(parameter)
  end

  def group_by_key
    self.data = data.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(parameter)
    self.data = data.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Now process looks much better (I would even say ideal). The trade-off is that group_by_key and compute_values are forced to use the data state variable.
But I don't want to convert all my modules into classes every time I refactor the code, especially when a module is shared among multiple classes.
ChainFlow
Could we somehow preserve the syntax from Iteration II without constraining ourselves to keep the state?
Hang tight, meet ChainFlow:
```ruby
require 'chain_flow'

module Work
  include ChainFlow

  def process(data, parameter)
    flow(data) do
      group_by_key
      compute_values(parameter)
    end
  end

  def group_by_key(points)
    points.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(points, parameter)
    points.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Now we have an emulation of Iteration II's syntactic beauty.
The order of the flow is not reversed, which is a great benefit for the readability of our public interface.
Another variation provided by chain_flow is similar to the Arel chains we've grown used to:
```ruby
def process(data, parameter)
  chain { data }.group_by_key.compute_values(parameter).fetch
end
```
Notice how compute_values expects two arguments, but only the second one (parameter) is passed explicitly.
The first argument is considered to be the state and is passed along silently.
Interested? Let’s see how it works.
It’s a kind of magic, magic, magic…
Actually, it's plain metaprogramming.
Both the chain and flow methods provided by the ChainFlow module capture the calling context using closures.
All the subsequent processing-function calls (group_by_key and compute_values) are intercepted by method_missing behind the scenes and re-executed one by one in the captured context, pipelining the initial data through them along with the other parameters.
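To make the mechanics concrete, here is a minimal sketch of how a flow-like method could be built from closures and method_missing. This is an illustration of the technique, not the actual chain_flow source; MiniFlow and Collector are made-up names:

```ruby
# A minimal sketch, NOT the real chain_flow internals.
module MiniFlow
  # Records every bare method call made inside the flow block.
  class Collector < BasicObject
    attr_reader :calls

    def initialize
      @calls = []
    end

    def method_missing(name, *args)
      @calls << [name, args]
      self
    end
  end

  def flow(state, &block)
    collector = Collector.new
    collector.instance_eval(&block)
    # Replay the recorded calls in the original context, threading the
    # running state through as the hidden first argument of each call.
    collector.calls.reduce(state) do |acc, (name, args)|
      send(name, acc, *args)
    end
  end
end
```

With this sketch, the process method from the ChainFlow example above works as written: group_by_key and compute_values are recorded inside the block and then replayed with the data threaded through them.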
By omitting the first parameter in the new syntax, we emphasize that the state (hidden under the hood) is unimportant.
We concentrate not on the temporary variables needed to pass it along, but on the processing calls that form the pipeline.
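The chain variant can be sketched along the same lines: a proxy object lazily accumulates the calls, and fetch replays them against the captured context (again, MiniChain and Proxy are made-up names, not chain_flow internals):

```ruby
# A minimal sketch, NOT the real chain_flow internals.
module MiniChain
  class Proxy < BasicObject
    def initialize(context, state)
      @context = context
      @state   = state
      @calls   = []
    end

    # Accumulate calls lazily; each call returns the proxy for chaining.
    def method_missing(name, *args)
      @calls << [name, args]
      self
    end

    # Replay the recorded calls, threading the state through.
    def fetch
      @calls.reduce(@state) do |acc, (name, args)|
        @context.send(name, acc, *args)
      end
    end
  end

  def chain(&block)
    Proxy.new(self, block.call)
  end
end
```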
The chain_flow code itself is quite small.
Feel the Functional flavor
Those who are into Haskell might notice that the syntax provided by flow resembles Haskell's do-notation.
Do-notation is syntactic sugar aimed at improving the look of monadic function composition.
It produces especially elegant syntax in the case of the State monad. See this code snippet manipulating a stack:
```haskell
stackManip :: State Stack Int
stackManip = do
  push 3
  pop
  pop
```
While the actual state stays hidden, stackManip composes three stateful computations and produces a computation which, when executed on an initial stack, will push 3 onto the stack and then pop from it twice.
The idea behind chain_flow was to build similar syntax in Ruby.
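To see the parallel, here is the stack example expressed with the MiniFlow sketch from the previous section. push and pop are hypothetical helpers that receive the current stack as their hidden first argument; unlike the Haskell version, this sketch threads only the state, so the popped values are simply discarded:

```ruby
class StackMachine
  include MiniFlow  # the sketch from the previous section

  def stack_manip(stack)
    flow(stack) do
      push(3)
      pop
      pop
    end
  end

  # Each helper takes the current stack and returns the next one.
  def push(stack, n)
    [n] + stack
  end

  def pop(stack)
    stack.drop(1)
  end
end

StackMachine.new.stack_manip([5, 8, 2, 1])  # => [8, 2, 1]
```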
Performance
Of course, nothing comes for free, and here the trade-off is speed: eval and other metaprogramming tricks are quite expensive (as are lambdas).
That said, if you're dealing with a reasonable amount of data and using chain_flow only to sequence processing method calls, the time and resources consumed by the chain_flow 'magic' are rather small compared with the actual data processing.
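If you want to gauge the cost for your own workload, a microbenchmark of direct dispatch versus method_missing dispatch shows the shape of the overhead. The snippet below uses the benchmark-ips gem; Direct and ViaMethodMissing are made-up classes standing in for the two styles, not chain_flow itself:

```ruby
require 'benchmark/ips'

class Direct
  def work(x)
    x + 1
  end
end

class ViaMethodMissing
  # Every call pays the extra method_missing lookup and dispatch cost.
  def method_missing(name, *args)
    name == :work ? args.first + 1 : super
  end

  def respond_to_missing?(name, include_private = false)
    name == :work || super
  end
end

direct  = Direct.new
dynamic = ViaMethodMissing.new

Benchmark.ips do |x|
  x.report('direct call')    { direct.work(1) }
  x.report('method_missing') { dynamic.work(1) }
  x.compare!
end
```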
P.S.
See the funkify library, which provides Haskell-style partial application and composition for Ruby methods. It relies heavily on Ruby lambdas, though.
A good example of monad implementations in Ruby is monadic. do-notation provides a sort of do-notation syntax with a couple of monad implementations as well.
See also the docile gem: its very first example, modifying an Array, looks great!
After finishing this post, I came across an awesome article by Pat Shaughnessy. It's good to know other people are moving in the same direction. Here's an attempt to refactor Pat's initial parse1 method from the article with chain_flow.