{"id":6899,"date":"2014-06-02T17:44:43","date_gmt":"2014-06-02T14:44:43","guid":{"rendered":"http:\/\/railsware.com\/blog\/?p=6899"},"modified":"2021-08-16T14:08:22","modified_gmt":"2021-08-16T11:08:22","slug":"chainflow-refactor-your-data-processing","status":"publish","type":"post","link":"https:\/\/railsware.com\/blog\/chainflow-refactor-your-data-processing\/","title":{"rendered":"ChainFlow &#8211; refactor your data processing"},"content":{"rendered":"\n<h2 class=\"wp-block-heading\">TL;DR<\/h2>\n\n\n\n<p>This article describes how to refactor complex data processing and improve its readability with a syntax similar to Haskell do-notation and the <code>State<\/code> monad.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Motivation<\/h2>\n\n\n\n<p>The latest project I&#8217;ve been involved in deals with a lot of data processing.<br>A typical case was a method that receives a data chunk and chains it through a couple of <code>Enumerable<\/code> methods like <code>map<\/code>,<br><code>each_with_object<\/code> etc.<br>Some of the chained blocks were simple, some were not.<br>But overall method readability degraded significantly even with just two of them chained together.<\/p>\n\n\n\n<p>I&#8217;d love to have this refactored, and decomposition looks like the obvious solution. Slice the big method into small ones and then chain them together. 
Seems straightforward.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Refactoring<\/h2>\n\n\n\n<p>Let&#8217;s say we have a public interface with a method like this:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">module Work\n  def process(data, parameter)\n    data.group_by do |point|\n      compute_point_key(point)\n    end.each_with_object({}) do |(key, points), memo|\n      memo[key] = compute_value(points, parameter)\n    end\n  end\nend\n<\/pre>\n\n\n\n<p>Don&#8217;t try to guess what is going on here &#8211; the method is completely made up.<br>Let&#8217;s try to refactor it in several ways.<\/p>\n\n\n\n<h3 class=\"wp-block-heading\">Iteration I: extra variable<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">module Work\n  def process(data, parameter)\n    grouped_data = group_by_key(data)\n    compute_values(grouped_data, parameter)\n  end\n\n  def group_by_key(points)\n    points.group_by do |point|\n      compute_point_key(point)\n    end\n  end\n\n  def compute_values(points, parameter)\n    points.each_with_object({}) do |(key, points), memo|\n      memo[key] = compute_value(points, parameter)\n    end\n  end\nend\n<\/pre>\n\n\n\n<p>This looks fine, but notice the extra <code>grouped_data<\/code> variable.<br>The more blocks you chain, the more extra variables you&#8217;ll have to deal with.<\/p>\n\n\n\n<p>Without extra variables it gets even clumsier due to the <code>parameter<\/code> argument and the reversed order of function invocations (from right to left).<\/p>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">def process(data, parameter)\n  compute_values (group_by_key data), parameter\nend\n<\/pre>\n\n\n\n<h3 class=\"wp-block-heading\">Iteration II: adding state<\/h3>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">class Work &lt; Struct.new(:data)\n  def process(parameter)\n    group_by_key\n    compute_values(parameter)\n  end\n\n  def group_by_key\n    self.data = data.group_by 
do |point|\n      compute_point_key(point)\n    end\n  end\n\n  def compute_values(parameter)\n    self.data = data.each_with_object({}) do |(key, points), memo|\n      memo[key] = compute_value(points, parameter)\n    end\n  end\nend\n<\/pre>\n\n\n\n<p>Now <code>process<\/code> looks much better (I would even say ideal). The trade-off is that <code>group_by_key<\/code> and <code>compute_values<\/code> are forced to use the <code>data<\/code> state variable.<br>But I don&#8217;t want to convert all my modules into classes every time I refactor the code.<br>Especially when the module is shared between multiple classes.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">ChainFlow<\/h2>\n\n\n\n<p>Could we somehow preserve the syntax from Iteration II and not constrain ourselves<br>to keeping state?<\/p>\n\n\n\n<p>Hang tight, meet ChainFlow:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">require 'chain_flow'\n\nmodule Work\n  include ChainFlow\n\n  def process(data, parameter)\n    flow(data) do\n      group_by_key\n      compute_values(parameter)\n    end\n  end\n\n  def group_by_key(points)\n    points.group_by do |point|\n      compute_point_key(point)\n    end\n  end\n\n  def compute_values(points, parameter)\n    points.each_with_object({}) do |(key, points), memo|\n      memo[key] = compute_value(points, parameter)\n    end\n  end\nend\n<\/pre>\n\n\n\n<p>Now we have an emulation of the beautiful Iteration II syntax.<br>The order of the flow is not reversed, which is a great benefit for the readability of our public interface.<\/p>\n\n\n\n<p>Another variation provided by chain_flow is similar to the Arel chains we&#8217;ve got used to:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted lang:ruby decode:true\">def process(data, parameter)\n  chain { data }.group_by_key.compute_values(parameter).fetch\nend\n<\/pre>\n\n\n\n<p>Notice how <code>compute_values<\/code> receives 2 arguments, but only the second (<code>parameter<\/code>) is passed.<br>The first argument is 
considered to be the state and is passed silently.<\/p>\n\n\n\n<p>Interested? Let&#8217;s see how it works.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">It&#8217;s a kind of magic, magic, magic&#8230;<\/h2>\n\n\n\n<p>Actually, it&#8217;s plain metaprogramming.<\/p>\n\n\n\n<p>Notice that both the <code>chain<\/code> and <code>flow<\/code> methods provided by the ChainFlow module capture the context using closures.<br>All the subsequent processing function calls (<code>group_by_key<\/code> and <code>compute_values<\/code>) are intercepted with <code>method_missing<\/code> behind the scenes.<br>They are then re-executed one by one in the captured context, pipelining the initial data through them along with the other params.<br>By omitting the first parameter in our new syntax we emphasize the fact that the state (hidden under the hood) is unimportant.<br>We concentrate not on the temporary variables that carry it, but rather on the processing calls which form the pipeline.<\/p>\n\n\n\n<p>The <a href=\"https:\/\/github.com\/railsware\/chain_flow\" target=\"_blank\" rel=\"noreferrer noopener\">chain_flow code<\/a> itself is quite small.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Feel the Functional flavor<\/h2>\n\n\n\n<p>Anyone who is into Haskell might notice that the syntax provided by <code>flow<\/code> resembles Haskell do-notation.<br>Haskell do-notation is syntactic sugar aimed at improving the look of monadic function composition.<br>The do-notation produces especially beautiful syntax in the case of the <a href=\"http:\/\/learnyouahaskell.com\/for-a-few-monads-more#state\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">State monad<\/a>. 
See this code snippet manipulating a stack:<\/p>\n\n\n\n<pre class=\"wp-block-preformatted lang:haskell decode:true\">stackManip :: State Stack Int\nstackManip = do\n  push 3\n  pop\n  pop\n<\/pre>\n\n\n\n<p>While the actual state is hidden, <code>stackManip<\/code> composes 3 stateful computations and<br>as a result produces a computation which (when executed on an initial stack) will push 3 to the stack and then pop 2 times from it.<br>The idea behind chain_flow was to build similar syntax in Ruby.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">Performance<\/h2>\n\n\n\n<p>Of course, nothing comes for free, and here the trade-off is speed. <code>eval<\/code> and other metaprogramming tricks are quite expensive (as are lambdas).<br>That said, if you&#8217;re dealing with a reasonable amount of data and using chain_flow only for processing method calls, the time\/resources necessary for the chain_flow &#8216;magic&#8217; are rather small in comparison with the actual data processing.<\/p>\n\n\n\n<h2 class=\"wp-block-heading\">P.S.<\/h2>\n\n\n\n<p>See the <a href=\"https:\/\/github.com\/banister\/funkify\" target=\"_blank\" rel=\"noreferrer noopener\">funkify<\/a> library, which provides <em>Haskell-style partial application and composition for Ruby methods<\/em>. It relies heavily on Ruby lambdas though.<br>A good example of monad implementations in Ruby is <a href=\"https:\/\/github.com\/pzol\/monadic\" target=\"_blank\" rel=\"noreferrer noopener\">monadic<\/a>. 
<a href=\"https:\/\/github.com\/aanand\/do_notation\" target=\"_blank\" rel=\"noreferrer noopener\">do-notation<\/a> provides a sort of do-notation syntax with a couple of monad implementations as well.<br>See also the <a href=\"https:\/\/github.com\/ms-ati\/docile\" target=\"_blank\" rel=\"noreferrer noopener\">docile gem<\/a> &#8211; the very first example for Array modification looks great!<\/p>\n\n\n\n<p>After finishing this post, I came across an <a href=\"http:\/\/patshaughnessy.net\/2014\/4\/8\/using-a-ruby-class-to-write-functional-code\" target=\"_blank\" rel=\"noreferrer noopener nofollow\">awesome article<\/a> by Pat Shaughnessy. It&#8217;s good to know other people are moving in the same direction. Here&#8217;s <a href=\"https:\/\/gist.github.com\/gregolsen\/0c29a4dc253830cf0ad5\" target=\"_blank\" rel=\"noreferrer noopener\">an attempt<\/a> to refactor Pat&#8217;s initial <code>parse1<\/code> method from the article with chain_flow.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>TL; DR; This article describes how to refactor and improve readability of complex data processing with syntax similar to Haskell do-notation and State monad. 
Motivation Latest project I&#8217;ve been involved in deals with a lot of data processing.Typical case was a method that receives a data chunk and chains it with a couple of Enumerator&#8230;<\/p>\n","protected":false},"author":34,"featured_media":9436,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"_acf_changed":false,"footnotes":""},"categories":[3],"tags":[],"coauthors":["Innokenty Mihailov"],"class_list":["post-6899","post","type-post","status-publish","format-standard","has-post-thumbnail","hentry","category-development"],"acf":[],"aioseo_notices":[],"categories_data":[{"name":"Engineering","link":"https:\/\/railsware.com\/blog?category=development"}],"post_thumbnails":"https:\/\/railsware.com\/blog\/wp-content\/themes\/railsware\/vendors\/images\/article-thumbnail-default.jpg","amp_enabled":true,"_links":{"self":[{"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/posts\/6899","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/users\/34"}],"replies":[{"embeddable":true,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/comments?post=6899"}],"version-history":[{"count":24,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/posts\/6899\/revisions"}],"predecessor-version":[{"id":14125,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/posts\/6899\/revisions\/14125"}],"wp:featuredmedia":[{"embeddable":true,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/media\/9436"}],"wp:attachment":[{"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/media?parent=6899"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/categories?post=6899"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/railsware.c
om\/blog\/wp-json\/wp\/v2\/tags?post=6899"},{"taxonomy":"author","embeddable":true,"href":"https:\/\/railsware.com\/blog\/wp-json\/wp\/v2\/coauthors?post=6899"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}