
ChainFlow – refactor your data processing


TL;DR

This article describes how to refactor complex data processing and improve its readability with a syntax similar to Haskell's do-notation and the State monad.

Motivation

The latest project I’ve been involved in deals with a lot of data processing. A typical case was a method that receives a chunk of data and chains it through a couple of Enumerable methods like map, each_with_object, etc. Some of the chained blocks were simple, some were not. But the overall readability of the method degraded significantly even with just two of them chained together.

I’d love to have this refactored, and decomposition looks like the obvious solution: slice the big method into small ones and then chain them together. Seems straightforward.

Refactoring

Let’s say we have a public interface with a method like this:
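
As a stand-in for such a method, here is a minimal sketch. The ReportBuilder module name, the shape of data (hashes with :key and :value) and the body of each block are assumptions made for illustration; only the process(data, parameter) signature echoes names used elsewhere in this post.

```ruby
# Hypothetical module: the name and the data shape are illustrative only.
module ReportBuilder
  module_function

  # data: array of hashes such as { key: :a, value: 1 }
  # parameter: a multiplier applied to every group total (an assumption)
  def process(data, parameter)
    data
      .group_by { |point| point[:key] }
      .each_with_object({}) do |(key, points), memo|
        total = points.sum { |point| point[:value] }
        memo[key] = total * parameter
      end
  end
end

data = [
  { key: :a, value: 1 },
  { key: :b, value: 2 },
  { key: :a, value: 3 }
]

result = ReportBuilder.process(data, 10)
# result == { a: 40, b: 20 }
```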

Don’t try to guess what is going on here – the method is completely made up. Let’s try to refactor it in several ways.

Iteration I: extra variable

This looks fine; however, notice the extra grouped_data variable. The more blocks you chain, the more extra variables you’ll have to deal with.

Without the extra variable it gets even clumsier, because of the parameter argument and the reversed order of function invocations (they read from right to left).
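
The two variants discussed above could be sketched roughly as follows. The names group_by_key, compute_values and grouped_data come from this post; the StepwiseProcessor module, the process_nested name, the data shape and the method bodies are assumptions.

```ruby
# Hypothetical module used only to illustrate Iteration I.
module StepwiseProcessor
  module_function

  # Variant with an intermediate variable.
  def process(data, parameter)
    grouped_data = group_by_key(data)
    compute_values(grouped_data, parameter)
  end

  # Variant without it: the calls nest and read from right to left.
  def process_nested(data, parameter)
    compute_values(group_by_key(data), parameter)
  end

  # Assumed body: group points by their :key.
  def group_by_key(data)
    data.group_by { |point| point[:key] }
  end

  # Assumed body: sum each group's :value and scale it by parameter.
  def compute_values(grouped_data, parameter)
    grouped_data.transform_values do |points|
      points.sum { |point| point[:value] } * parameter
    end
  end
end
```

Both variants return the same hash; they differ only in how the intermediate result travels from one step to the next.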

Iteration II: adding state

Now the process looks much better (I would even say ideal). The trade-off is that group_by_key and compute_values are forced to use the data state variable. But I don’t want to convert all my modules into classes every time I refactor the code, especially when a module is shared between multiple other classes.
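
A minimal sketch of the stateful variant described above, assuming a hypothetical StatefulProcessor class that keeps the data in an instance variable. The class name and the method bodies are illustrative, not the original listing.

```ruby
# Hypothetical class: each step reads and overwrites the shared @data state.
class StatefulProcessor
  def initialize(data)
    @data = data
  end

  # The public method is now a plain list of steps, read top to bottom.
  def process(parameter)
    group_by_key
    compute_values(parameter)
    @data
  end

  private

  # Assumed body: replaces @data with points grouped by :key.
  def group_by_key
    @data = @data.group_by { |point| point[:key] }
  end

  # Assumed body: replaces @data with scaled totals per key.
  def compute_values(parameter)
    @data = @data.transform_values do |points|
      points.sum { |point| point[:value] } * parameter
    end
  end
end
```

The steps no longer take the data as an argument, which is exactly what ties them to the instance and forces the module-to-class conversion.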

ChainFlow

Could we somehow preserve the syntax from Iteration II without constraining ourselves to keep state?

Hang tight, meet ChainFlow:

Now we have an emulation of the Iteration II syntax in all its beauty. The order of the flow is not reversed, which is a great benefit for the readability of our public interface.

Another variation provided by chain_flow is similar to Arel chains we’ve got used to:

Notice that compute_values takes two arguments, but only the second one (parameter) is passed explicitly. The first argument is considered to be the state and is passed silently.

Interested? Let’s see how it works.

It’s a kind of magic, magic, magic…

Actually, it’s plain metaprogramming.

Notice that both the chain and flow methods provided by the ChainFlow module capture the context using closures. All subsequent calls to the processing functions (group_by_key and compute_values) are intercepted with method_missing behind the scenes and then re-executed one by one in the captured context, pipelining the initial data through them along with the other params. By omitting the first parameter in our new syntax, we emphasize that the state (hidden under the hood) is unimportant. We concentrate not on the temporary variables that carry it, but on the processing calls that form the pipeline.
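
To make the mechanism concrete, here is a minimal sketch of the record-and-replay idea. It is not the chain_flow source: MiniFlow, Recorder and FlowProcessor are hypothetical names, the data shape is assumed, and the real gem may differ in its details.

```ruby
# Hypothetical re-implementation of the idea, not the chain_flow gem itself.
module MiniFlow
  # Records every method call made on it instead of executing it.
  class Recorder < BasicObject
    def initialize
      @calls = []
    end

    def __calls__
      @calls
    end

    # Intercepts any call, stores it, and returns self so that
    # dotted chains such as a.b(c) are recorded as well.
    def method_missing(name, *args, &block)
      @calls << [name, args, block]
      self
    end
  end

  # Evaluates the block against a Recorder, then replays the recorded
  # calls on the caller, threading the state through as first argument.
  def flow(initial, &block)
    recorder = Recorder.new
    recorder.instance_eval(&block)
    recorder.__calls__.reduce(initial) do |state, (name, args, blk)|
      send(name, state, *args, &blk)
    end
  end
end

module FlowProcessor
  include MiniFlow
  extend self

  def process(data, parameter)
    flow(data) do
      group_by_key
      compute_values(parameter) # parameter is reached through the closure
    end
  end

  # Same pipeline, written as a dotted chain.
  def process_dotted(data, parameter)
    flow(data) { group_by_key.compute_values(parameter) }
  end

  def group_by_key(data)
    data.group_by { |point| point[:key] }
  end

  def compute_values(grouped_data, parameter)
    grouped_data.transform_values do |points|
      points.sum { |point| point[:value] } * parameter
    end
  end
end
```

Because the block is a closure, local variables such as parameter stay visible inside it, while bare method names fall through to method_missing on the recorder. The processing methods stay ordinary stateless module methods that take the state as their first argument.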

The chain_flow code itself is quite small.

Feel the Functional flavor

Anyone who is into Haskell might notice that the syntax provided by flow resembles Haskell do-notation. Do-notation is syntactic sugar aimed at improving the look of monadic function composition. It produces especially beautiful syntax in the case of the State monad. See this code snippet manipulating the Stack:

While the actual state is hidden, stackManip composes 3 stateful computations and as a result produces a computation which (when executed on an initial stack) will push 3 onto the stack and then pop from it twice. The idea behind chain_flow was to build a similar syntax in Ruby.

Performance

Of course nothing comes for free, and here the trade-off is speed. eval and other metaprogramming tricks are quite expensive (as are lambdas). That said, if you’re dealing with a reasonable amount of data and using chain_flow only for the processing method calls, the time and resources needed for the chain_flow ‘magic’ are rather small in comparison with the actual data processing.
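
One rough way to get a feel for this cost is to time a direct method call against the same call routed through method_missing. The sketch below uses Ruby's standard Benchmark module; the class names and iteration count are arbitrary, it measures only the method_missing part of the overhead, and the numbers will vary by machine and Ruby version.

```ruby
require "benchmark"

# Hypothetical classes: the same trivial step, dispatched two ways.
class DirectCalls
  def step(value)
    value + 1
  end
end

class InterceptedCalls
  # Handles :step dynamically; anything else falls back to the default.
  def method_missing(name, *args)
    return super unless name == :step

    args.first + 1
  end

  def respond_to_missing?(name, include_private = false)
    name == :step || super
  end
end

ITERATIONS = 200_000
direct = DirectCalls.new
intercepted = InterceptedCalls.new

direct_time = Benchmark.realtime do
  ITERATIONS.times { |i| direct.step(i) }
end

intercepted_time = Benchmark.realtime do
  ITERATIONS.times { |i| intercepted.step(i) }
end

puts format("direct: %.4fs, method_missing: %.4fs", direct_time, intercepted_time)
```

If each step does real work on a sizeable collection, a fixed per-call dispatch cost like this should be a small share of the total.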

P.S.

See the funkify library, which provides Haskell-style partial application and composition for Ruby methods. It relies heavily on Ruby lambdas though. A good example of monad implementations in Ruby is monadic. do-notation provides a sort of do-notation syntax with a couple of monad implementations as well. See also the docile gem – the very first example for Array modification looks great!

After finishing this post, I came across an awesome article by Pat Shaughnessy. It’s good to know other people are moving in the same direction. Here’s an attempt to refactor Pat’s initial parse1 method from the article with chain_flow.

  • Konstantin Tennhard

    This is a funny coincidence; I gave a talk on basically the same concept but a different implementation yesterday at RubyC in Kiev. In the project I’m currently working on, we need to deal with highly complex but very linear business processes. We ended up modeling each data processing step with a separate class and then assemble the resulting components in a processing pipeline. If you like the idea, please take a look at the gem I published and let me know what you think: https://github.com/t6d/composable_operations Sadly the link to your gem seems to be broken. I’d love to get the chance to compare it with my implementation. I’m sure there is something interesting to learn.

    • Innokenty Mihailov

      Aah, I knew I should join the RubyC :-(
      I looked through the README of your gem and it makes sense to me.
      The functionality it provides is definitely much more mature than chain_flow, where the atomic entity you operate on is just a plain method. The chain_flow implementation is therefore quite small and simple – sorry, I forgot to publish the repo – now the link should work.
      Indeed, ComposableOperations can solve the same issue – split the complex flow into small encapsulated parts (sharable across multiple other flows), which is great.

      In chain_flow I concentrated more on a syntax similar to the one I love in Haskell.

  • Kache

    Meh, I think the gains of this refactoring aren’t worth the cost of having to deal with the uncommon syntax of flow(data) {}.

    I think I’d be happy with the first example. Actually, I would do this instead:

    def process(data, parameter)
      keyval = data.map do |point|
        [compute_key(point), compute_value(point, parameter)]
      end
      memo = Hash[keyval]
    end

    If the chains start getting really complex, I would consider creating
    a class to instantiate for each data to handle data manipulation:

    def process(data, parameter)
      datapoints = data.map { |p| DataPoint.new(p, parameter) }
      memo = Hash[datapoints.map(&:compute_keyval)]
    end

    • Innokenty Mihailov

      chain_flow is all about code style. If you like functional programming, chances are high you’d like to organize (refactor?) your public interface as a composition of several independent functions (each covered with tests independently).

      Hiding internals behind them will shorten your public interface to a couple of human readable calls – your colleagues will probably thank you for this readability improvement.
      Personally, I prefer this:
      flow(data) do
        group_by_key
        compute_values(parameter)
      end
      instead of your example:
      datapoints = data.map { |p| DataPoint.new(p, parameter) }
      memo = Hash[datapoints.map(&:compute_keyval)]

      just because I as a developer can easily get what’s going on (via the proper method naming) without diving into internals right away.

  • patshaughnessy

    Wow – really interesting stuff! I love the idea of using ideas from Haskell or other functional languages in Ruby. Now if only you could get it introduced into Ruby’s core syntax (like you did with Enumerable::Lazy!) so we don’t need to include a special module, etc.

    The only worry I have about it is just that it’s so “magical.” I’ve been using metaprogramming less and less these days, preferring verbose but readable code instead. But the way you pass state along is very readable, once you understand what’s going on.

    Nice job! And thanks for the nice link to my article :)
