TL;DR
This article describes how to refactor complex data processing and improve its readability, using syntax similar to Haskell's do-notation and the State monad.
Motivation
The latest project I've been involved in deals with a lot of data processing.
A typical case was a method that receives a data chunk and chains it through a couple of Enumerable methods like map, each_with_object, etc.
Some of the chained blocks were simple, some of them not.
But overall method readability degraded significantly even with just two of them chained together.
I'd love to have this refactored, and decomposition looks like the obvious solution: slice the big method into small ones and then chain them together. Seems straightforward.
Refactoring
Let’s say we have a public interface with a method like this:
```ruby
module Work
  def process(data, parameter)
    data.group_by do |point|
      compute_point_key(point)
    end.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Don't try to guess what is going on here: the method is completely made up.
Let’s try to refactor it in several ways.
Iteration I: extra variable
```ruby
module Work
  def process(data, parameter)
    grouped_data = group_by_key(data)
    compute_values(grouped_data, parameter)
  end

  def group_by_key(points)
    points.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(points, parameter)
    points.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
This looks fine; however, notice the extra grouped_data variable.
The more blocks you chain, the more extra variables you'll have to deal with.
Without extra variables it gets even clumsier, due to the parameter argument and the reversed order of function invocations (from right to left):
```ruby
def process(data, parameter)
  compute_values (group_by_key data), parameter
end
```
Iteration II: adding state
```ruby
class Work < Struct.new(:data)
  def process(parameter)
    group_by_key
    compute_values(parameter)
  end

  def group_by_key
    self.data = data.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(parameter)
    self.data = data.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Now process looks much better (I would even say ideal). The trade-off is that group_by_key and compute_values are forced to use the data state variable.
But I don't want to convert all my modules into classes every time I refactor the code, especially when a module is shared among multiple classes.
ChainFlow
Could we somehow preserve the syntax from Iteration II without constraining ourselves to keep the state?
Hang tight, meet ChainFlow:
```ruby
require 'chain_flow'

module Work
  include ChainFlow

  def process(data, parameter)
    flow(data) do
      group_by_key
      compute_values(parameter)
    end
  end

  def group_by_key(points)
    points.group_by do |point|
      compute_point_key(point)
    end
  end

  def compute_values(points, parameter)
    points.each_with_object({}) do |(key, points), memo|
      memo[key] = compute_value(points, parameter)
    end
  end
end
```
Now we have an emulation of Iteration II's syntactic beauty.
The order of the flow is not reversed, which is a great benefit for the readability of our public interface.
Another variation provided by chain_flow is similar to the Arel chains we've grown used to:
```ruby
def process(data, parameter)
  chain { data }.group_by_key.compute_values(parameter).fetch
end
```
Notice how compute_values expects two arguments, but only the second one (parameter) is passed explicitly.
The first argument is considered to be the state and is passed along silently.
Interested? Let’s see how it works.
It’s a kind of magic, magic, magic…
Actually, it's plain metaprogramming.
Both the chain and flow methods provided by the ChainFlow module capture the calling context using closures.
All the subsequent processing-function calls (group_by_key and compute_values) are intercepted by method_missing behind the scenes and re-executed one by one in the captured context, pipelining the initial data through them along with the other parameters.
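To make the mechanics concrete, here is a minimal sketch of how a flow-like method could be built from closures and method_missing. This is an illustration of the technique, not the actual chain_flow source; MiniFlow and Collector are made-up names:

```ruby
# A minimal sketch, NOT the real chain_flow internals.
module MiniFlow
  # Records every bare method call made inside the flow block.
  class Collector < BasicObject
    attr_reader :calls

    def initialize
      @calls = []
    end

    def method_missing(name, *args)
      @calls << [name, args]
      self
    end
  end

  def flow(state, &block)
    collector = Collector.new
    collector.instance_eval(&block)
    # Replay the recorded calls in the original context, threading the
    # running state through as the hidden first argument of each call.
    collector.calls.reduce(state) do |acc, (name, args)|
      send(name, acc, *args)
    end
  end
end
```

With this sketch, the process method from the ChainFlow example above works as written: group_by_key and compute_values are recorded inside the block and then replayed with the data threaded through them.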
By omitting the first parameter in the new syntax, we emphasize that the state (hidden under the hood) is unimportant.
We concentrate not on the temporary variables needed to pass it along, but on the processing calls that form the pipeline.
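The chain variant can be sketched along the same lines: a proxy object lazily accumulates the calls, and fetch replays them against the captured context (again, MiniChain and Proxy are made-up names, not chain_flow internals):

```ruby
# A minimal sketch, NOT the real chain_flow internals.
module MiniChain
  class Proxy < BasicObject
    def initialize(context, state)
      @context = context
      @state   = state
      @calls   = []
    end

    # Accumulate calls lazily; each call returns the proxy for chaining.
    def method_missing(name, *args)
      @calls << [name, args]
      self
    end

    # Replay the recorded calls, threading the state through.
    def fetch
      @calls.reduce(@state) do |acc, (name, args)|
        @context.send(name, acc, *args)
      end
    end
  end

  def chain(&block)
    Proxy.new(self, block.call)
  end
end
```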
The chain_flow code itself is quite small.
Feel the Functional flavor
Those who are into Haskell might notice that the syntax provided by flow resembles Haskell's do-notation.
Do-notation is syntactic sugar aimed at improving the look of monadic function composition.
It produces especially elegant syntax in the case of the State monad. See this code snippet manipulating a stack:
```haskell
stackManip :: State Stack Int
stackManip = do
  push 3
  pop
  pop
```
While the actual state stays hidden, stackManip composes three stateful computations and produces a computation which, when executed on an initial stack, will push 3 onto the stack and then pop from it twice.
The idea behind chain_flow was to build similar syntax in Ruby.
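To see the parallel, here is the stack example expressed with the MiniFlow sketch from the previous section. push and pop are hypothetical helpers that receive the current stack as their hidden first argument; unlike the Haskell version, this sketch threads only the state, so the popped values are simply discarded:

```ruby
class StackMachine
  include MiniFlow  # the sketch from the previous section

  def stack_manip(stack)
    flow(stack) do
      push(3)
      pop
      pop
    end
  end

  # Each helper takes the current stack and returns the next one.
  def push(stack, n)
    [n] + stack
  end

  def pop(stack)
    stack.drop(1)
  end
end

StackMachine.new.stack_manip([5, 8, 2, 1])  # => [8, 2, 1]
```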
Performance
Of course, nothing comes for free, and here the trade-off is speed: eval and other metaprogramming tricks are quite expensive (as are lambdas).
That said, if you're dealing with a reasonable amount of data and using chain_flow only to sequence processing method calls, the time and resources consumed by the chain_flow 'magic' are rather small compared with the actual data processing.
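If you want to gauge the cost for your own workload, a microbenchmark of direct dispatch versus method_missing dispatch shows the shape of the overhead. The snippet below uses the benchmark-ips gem; Direct and ViaMethodMissing are made-up classes standing in for the two styles, not chain_flow itself:

```ruby
require 'benchmark/ips'

class Direct
  def work(x)
    x + 1
  end
end

class ViaMethodMissing
  # Every call pays the extra method_missing lookup and dispatch cost.
  def method_missing(name, *args)
    name == :work ? args.first + 1 : super
  end

  def respond_to_missing?(name, include_private = false)
    name == :work || super
  end
end

direct  = Direct.new
dynamic = ViaMethodMissing.new

Benchmark.ips do |x|
  x.report('direct call')    { direct.work(1) }
  x.report('method_missing') { dynamic.work(1) }
  x.compare!
end
```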
P.S.
See the funkify library, which provides Haskell-style partial application and composition for Ruby methods. It relies heavily on Ruby lambdas, though.
A good example of monad implementations in Ruby is monadic. do-notation provides a sort of do-notation syntax with a couple of monad implementations as well.
See also the docile gem: its very first example, modifying an Array, looks great!
After finishing this post, I came across an awesome article by Pat Shaughnessy. It's good to know other people are moving in the same direction. Here's an attempt to refactor Pat's initial parse1 method from the article with chain_flow.