While working with some analytics data, I stumbled upon some puzzling MongoDB behaviour. I was trying to aggregate the data and suddenly a light bulb popped just above my head. Notwithstanding the quality of the light produced, the idea was brilliant: I could just use map-reduce.
I have to admit, I love the idea of map-reduce, both the programming model for processing large datasets and the functions sharing those names in the functional programming paradigm. So I was already getting into a good mood; after all, I would get to play with it for the first time in MongoDB.
What could go wrong?
Getting the hands dirty
I was coding in Ruby, so the first draft was something like this:
The map function uses the current date in “YYYY-MM-DD” format as the key, and emits the number 1 to signal that the key was encountered. The reduce part is even simpler: it just counts the number of times a key was encountered. That’s all, really simple.
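A minimal sketch of the two functions as described, written in JavaScript since that is what MongoDB runs for map-reduce. The field name `created_at`, the sample documents, and the small `emit` stand-in are assumptions for illustration, added so the sketch can run outside the database; inside MongoDB, `emit` is provided by the server and `this` is the current document.

```javascript
// First-draft map and reduce, as described in the text.
// `created_at` is a hypothetical field name.
function map() {
  // "YYYY-MM-DD" (UTC) becomes the key; 1 marks one occurrence.
  const day = this.created_at.toISOString().slice(0, 10);
  emit(day, 1);
}

function reduce(key, values) {
  // Count how many values arrived for this key.
  return values.length;
}

// Stand-in for the server: collect emitted values per key.
const emitted = {};
function emit(key, value) {
  (emitted[key] = emitted[key] || []).push(value);
}

// Hypothetical sample documents.
const docs = [
  { created_at: new Date("2012-03-01T10:00:00Z") },
  { created_at: new Date("2012-03-01T18:30:00Z") },
  { created_at: new Date("2012-03-02T09:15:00Z") },
];

docs.forEach((doc) => map.call(doc));

// Call reduce exactly once per key.
const counts = {};
for (const key of Object.keys(emitted)) {
  counts[key] = reduce(key, emitted[key]);
}
console.log(counts);
```

Note that this harness calls reduce exactly once per key, which is the mental model the first draft was written with.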
It seemed to work, so I went on with my business.
After playing with the data for a while, I was getting results that seemed a bit off. It looked like the problem was in the aggregated data itself.
Not to rain on my own map-reduce parade, but I decided to code a small, throw-away, sanity-checking method in Ruby:
I plugged in the method and got different results, the exact results I initially expected for this data. The lights dimmed a bit and there was a smell of rain in the air.
“What’s going on? Could I have a bug in map-reduce code?”
The map function is a bit more complicated, so I checked it first. No bugs detected. The reduce function is so simple that it “can’t possibly contain a bug ™”, but I checked it anyway. Let’s just inspect the array content. Here’s the new reduce function:
It returned “1,1,1,1(…),1,1” with the expected number of “1”s.
“Hm, this is confusing, but interesting.”
So I decided to skip Ruby and go directly to the MongoDB console. Results are shown as inline comments.
The same results: 102 (correct) and 3 (incorrect). Now I’m getting more confused.
Ok, let’s check if I’m really dealing with an array of values:
Looks like I am.
How about just iterating through the array for the heck of it:
Iteration gets me the correct result.
Let’s play a bit more:
Now I was really confused. How could “r5” return the wrong result while “r4” returned the correct one? The only difference is that “r5” counts the elements instead of summing them up.
Just to be sure it’s not nested arrays I’m dealing with, a quick check in JavaScript:
0 + [1,1,1] // "01,1,1", looks like <number> + <array>
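The reasoning behind this check can be spelled out as a small sketch: had the values been nested arrays, summing them would have produced a string rather than a number, so a numeric sum rules that possibility out. The `sum` helper is hypothetical, written only for this illustration.

```javascript
// Sum values the way a summing reduce would, starting from 0.
const sum = (values) => values.reduce((acc, v) => acc + v, 0);

// Plain numbers add up to a number.
const flat = sum([1, 1, 1]);

// Nested arrays are coerced to strings by `+`, so the result is a string.
const nested = sum([[1, 1], [1]]);

console.log(typeof flat, flat);     // number 3
console.log(typeof nested, nested); // string 01,11
```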
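Suddenly a new light bulb appeared, manifesting out of thin air. Maybe MongoDB calls reduce multiple times, cascading the calls? The result of one being an input for the next one?
Yes, according to the documentation that is the case: MongoDB may invoke reduce more than once for the same key, and the output of a previous call then becomes one of the input values of the next.
Suddenly everything made sense. Summing still works when an earlier result comes back as an input, while counting treats that earlier result as just one more element.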
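To see the effect outside MongoDB, here is a small simulation in plain JavaScript. `cascadeReduce` is a hypothetical helper, not a MongoDB API, and the batch size of 100 is an assumption chosen for illustration; the helper simply feeds each reduce result back in as an input to the next call. The key `"some-day"` is a placeholder. With 102 emitted ones, it happens to reproduce the numbers seen above.

```javascript
// Reduce `values` in batches, feeding each batch's result back in
// as the first value of the next batch. batchSize must be at least 2.
function cascadeReduce(reduceFn, key, values, batchSize) {
  let pending = values.slice();
  while (pending.length > 1) {
    const batch = pending.slice(0, batchSize);
    const rest = pending.slice(batchSize);
    pending = [reduceFn(key, batch), ...rest];
  }
  return pending[0];
}

// Counting: breaks once an earlier count shows up as a single element.
const countReduce = (key, values) => values.length;

// Summing: an earlier partial sum is just another number to add.
const sumReduce = (key, values) => values.reduce((acc, v) => acc + v, 0);

// Debug version: joining flattens earlier strings, so it looks right.
const joinReduce = (key, values) => values.toString();

const ones = new Array(102).fill(1);

const counted = cascadeReduce(countReduce, "some-day", ones, 100);
const summed = cascadeReduce(sumReduce, "some-day", ones, 100);
const joined = cascadeReduce(joinReduce, "some-day", ones, 100);

console.log(counted);                  // 3
console.log(summed);                   // 102
console.log(joined.split(",").length); // 102
```

Only the summing version gives the same answer however the values are batched. The string version merely looks right, because joining a previously joined string with more values yields the same flat list of ones.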
In real life I actually called it a day and plugged the pure-Ruby replacement method back in, while thinking “It is more than efficient enough anyway. For now…”
The enlightenment came later, on my way home. There is also a small lesson here: if stuck, sometimes it’s best to call it a day or take a break.
I still find it kinda unfortunate that “r2”, the one with string representation, worked. But:
Let’s just go with “r4”.
This was a fun and humbling programming session. Maybe I should have read the docs first. I should have also figured it out earlier. I should be more careful with conclusions I draw from debugging steps (toString()…). But sometimes you’re just tired and do not function at 100% of your capacity.
… and that’s ok.
If you have stories like this one, I’d welcome them (and not only them) in my inbox.