Ivan Kusalic - home page

Adventure With MongoDB's Map-reduce

While working with some analytics data, I stumbled upon puzzling MongoDB behaviour. I was trying to aggregate the data and suddenly a light bulb popped just above my head. Notwithstanding the quality of produced light, the idea was brilliant: I can just use map-reduce.

I have to admit, I love the idea of map-reduce, both the programming model for processing large datasets and the functions sharing those names in the functional programming paradigm. So I was already getting into a good mood; after all, I got to play with it for the first time in MongoDB.

What could go wrong?

Getting the hands dirty

I was coding in Ruby, so the first draft was something like this:

aggregate daily data with map-reduce
def daily_data(collection_name)
  map = <<-HERE
    function() {
      var unix_time = parseInt(this.time, 10);
      var today = new Date(unix_time * 1000);
      var d = today.getDate();
      var m = today.getMonth() + 1;
      var y = today.getFullYear();
      if(d<10) d = '0' + d;
      if(m<10) m = '0' + m;

      var date_key = y + '-' + m + '-' + d;

      emit(date_key, 1);
    }
  HERE

  reduce = <<-HERE
    function(key, vals) {
      return vals.length;
    }
  HERE

  db = mongo_connection
  coll = db[collection_name]
  opts = {:out => {:inline => 1}, :raw => true }
  return coll.map_reduce(BSON::Code.new(map), BSON::Code.new(reduce), opts)['results']
end

The map function uses each document’s date in “YYYY-MM-DD” format as the key, and emits the number 1 to signal that the key was encountered. The reduce part is even simpler: it just counts the number of times a key was encountered. That’s all, really simple.
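To make the key format concrete, here is a plain JavaScript sketch of the key-building logic that runs outside MongoDB. The helper name `dateKey` is hypothetical, and the sketch uses the UTC getters instead of the local-time getters from the map function so that the output does not depend on the machine's time zone; that swap is an assumption made only for illustration.

```javascript
// Hypothetical helper mirroring the key-building part of the map function.
// Assumption: UTC getters are used here, while the map function uses local time.
function dateKey(time) {
  const unixTime = parseInt(time, 10);     // seconds since the epoch
  const date = new Date(unixTime * 1000);  // Date expects milliseconds
  let d = date.getUTCDate();
  let m = date.getUTCMonth() + 1;          // months are zero-based
  const y = date.getUTCFullYear();
  if (d < 10) d = '0' + d;
  if (m < 10) m = '0' + m;
  return y + '-' + m + '-' + d;
}

console.log(dateKey('0'));         // 1970-01-01
console.log(dateKey(86400 * 31));  // 1970-02-01
```

Every document that falls on the same day produces the same key, so all of its emitted 1s end up grouped together for the reduce step.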

It seemed to work, so I went on with my business.

Forecast: cloudy

After playing with the data for a while, I was getting results that seemed a bit off. It looked like the problem was in the aggregated data itself.

Not to rain on my own map-reduce parade, but I decided to code a small, throw-away, sanity-checking method in Ruby:

simple Ruby substitute method
def daily_data_rb(collection_name)  # FIXME replacement for map-reduce
  db = mongo_connection
  coll = db[collection_name]

  aggregated_data = coll.find().reduce({}) do |acc, e|
    key = to_date(e['time']).to_s
    acc[key] = (acc[key] || 0) + 1
    acc
  end

  return aggregated_data.entries.sort_by(&:first).map do |k, v|
    { '_id' => k, 'value' => v }
  end
end

I plugged in the method and got different results: exactly the results I had initially expected for this data. The lights dimmed a bit and there was a smell of rain in the air.

“What’s going on? Could I have a bug in map-reduce code?”

The map function is a bit more complicated, so I checked it first. No bugs detected. The reduce function is so simple that it “can’t possibly contain a bug ™”, but I checked it anyway. Let’s just inspect the array content. Here’s the new reduce function:

reduce function, take two
function(key, vals) {
  return vals.toString();
}

It returned “1,1,1,1(…),1,1” with the expected number of “1”s.

“Hm, this is confusing, but interesting.”

So I decided to skip Ruby and go directly to the MongoDB console. Results are shown as inline comments.

switching to MongoDB console
var m = function() {
  var unix_time = parseInt(this.time, 10);
  var today = new Date(unix_time * 1000);
  var d = today.getDate();
  var m = today.getMonth() + 1;
  var y = today.getFullYear();
  if(d<10) d = '0' + d;
  if(m<10) m = '0' + m;

  var date_key = y + '-' + m + '-' + d;

  emit(date_key, 1);
}

var r1 = function(key, vals) {
  return vals.length;
}

var r2 = function(key, vals) {
  return vals.toString();
}

db['mr_test'].mapReduce(m, r1, { out: { inline: 1 } })  // 3
db['mr_test'].mapReduce(m, r2, { out: { inline: 1 } })  // "102" ("1,1,1,1, (...) 1,1")

The same results as from Ruby: 3 (incorrect) from “r1” and a string of 102 ones (correct) from “r2”. Now I’m getting more confused.

Ok, let’s check if I’m really dealing with an array of values:

reduce, take 3 experiments
var r3 = function(key, vals) {
  return Object.prototype.toString.call(vals);
}

db['mr_test'].mapReduce(m, r3, { out: { inline: 1 } })  // "[object Array]"

Looks like I am.

How about just iterating through the array for the heck of it:

reduce, take 4
var r4 = function(key, vals) {
  var r = 0;
  for(var i = 0; i < vals.length; i++)
    r += vals[i];
  return r;
}

db['mr_test'].mapReduce(m, r4, { out: { inline: 1 } })  // 102

Iteration gets me the correct result.

Let’s play a bit more:

reduce, take 5
var r5 = function(key, vals) {
  var r = 0;
  for(var i = 0; i < vals.length; i++)
    r += 1;
  return r;
}
db['mr_test'].mapReduce(m, r5, { out: { inline: 1 } })  // 3

Now I was really confused: how can “r5” return the wrong result while “r4” returns the correct one? The only difference is counting the elements instead of summing them up.

Just to be sure it’s not nested arrays I’m dealing with, a quick check in the JS console: 0 + [1,1,1] // "01,1,1". It looks like <number> + <array> = <string> in JavaScript, so “r4” would have returned a string if any element were an array. So no nested arrays.

Enlightenment

Suddenly a new light bulb appeared, manifesting out of thin air. Maybe MongoDB calls reduce multiple times, cascading the calls, with the result of one being the input for the next?

Yes, from the documentation it looks like that’s the case.

Now suddenly everything makes sense: that’s why the sum of elements produced the correct result. Summing partial sums still gives the total, while counting partial results does not.
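The cascading can be imitated in plain JavaScript, outside MongoDB. This is only a sketch: the helper `cascade` is hypothetical, and the batch size of 100 is an assumption picked to reproduce the numbers seen above, not a statement about how MongoDB actually splits the work.

```javascript
// Simulates a cascading reduce: values for one key are reduced in batches,
// and each partial result is fed into the next reduce call.
// Assumption: the batching scheme and batchSize are illustrative only.
// batchSize must be at least 2, otherwise the loop would never shrink the list.
function cascade(reduce, key, values, batchSize) {
  let pending = values.slice();
  while (pending.length > batchSize) {
    const partial = reduce(key, pending.slice(0, batchSize));
    pending = [partial].concat(pending.slice(batchSize));
  }
  return reduce(key, pending);
}

// Counting reduce, in the spirit of "r1".
const countReduce = (key, vals) => vals.length;

// Summing reduce, in the spirit of "r4".
const sumReduce = (key, vals) => vals.reduce((acc, v) => acc + v, 0);

const ones = new Array(102).fill(1);  // 102 emitted values for one key

console.log(cascade(countReduce, 'some-day', ones, 100));  // 3
console.log(cascade(sumReduce, 'some-day', ones, 100));    // 102
```

In this simulation the first call reduces 100 ones to a single value, and the second call sees three values: that partial result plus the two remaining ones. Counting them gives 3, while summing them gives 102.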

new approach
var r6 = function(key, vals) {
  var r = 0;
  for(var i = 0; i < vals.length; i++) {
    r += vals[i];
    if(vals[i] > 1) return "FOUND BIG ONE: " + vals[i];  // signal error
  }
  return r;
}

db['mr_test'].mapReduce(m, r6, { out: { inline: 1 } })  // "FOUND BIG ONE: 100"

Bingo.

Harsh reality

In real life I actually called it a day and plugged the pure-Ruby replacement method back in, while thinking “It is more than efficient enough anyway. For now…”

The enlightenment came later, on my way home. There is also a small lesson here: if stuck, sometimes it’s best to call it a day or take a break.

I still find it kinda unfortunate that “r2”, the one with string representation, worked. But:

slipping on toString()
[1,"1,1,1",1].toString()  // "1,1,1,1,1"

Oh well…

Let’s just go with “r4”.
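The MongoDB documentation asks for a reduce function that can safely be applied to its own output, since it may be called more than once for the same key. A minimal way to check that property by hand is sketched below; the helper `survivesReReduce` and the sample values are hypothetical, and passing this single check does not prove that a function is correct in general.

```javascript
// Checks one property a reduce function needs in order to survive re-reduce:
// reducing a partial result together with the remaining values must give
// the same answer as reducing all values at once.
function survivesReReduce(reduce, key, first, rest) {
  const allAtOnce = reduce(key, first.concat(rest));
  const inTwoSteps = reduce(key, [reduce(key, first)].concat(rest));
  return allAtOnce === inTwoSteps;
}

// Counting version, same body as "r1".
const countVersion = function (key, vals) {
  return vals.length;
};

// Summing version, same body as "r4".
const sumVersion = function (key, vals) {
  var r = 0;
  for (var i = 0; i < vals.length; i++)
    r += vals[i];
  return r;
};

console.log(survivesReReduce(countVersion, 'some-day', [1, 1, 1], [1, 1]));  // false
console.log(survivesReReduce(sumVersion, 'some-day', [1, 1, 1], [1, 1]));    // true
```

The counting version fails because the partial result 3 is later counted as a single element, while the summing version adds it back in as 3.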

Conclusion

This was a fun and humbling programming session. Maybe I should have read the docs first. I should have also figured it out earlier. I should be more careful with conclusions I draw from debugging steps (toString()…). But sometimes you’re just tired and do not function at 100% of your capacity.

… and that’s ok.

If you have stories like this one, I’d welcome them (and not only them) in my inbox.