Twitter HyperLogLog monoids in Spark

Want to count unique elements in a stream without blowing up memory? In more specific words, do you want to use a HyperLogLog counter in Spark? Until today, I’d never heard the word “monoid” before. However, Twitter Algebird is a project that contains a collection of monoids including a HyperLogLog monoid, which can be used to aggregate a stream into unique elements. The code looks like this:

import com.twitter.algebird._
val aggregator = new HyperLogLogMonoid(12)
inputData.reduceByKey(aggregator.plus(_, _))

This young man tells you all about it, and then some:

The video also mentions another Twitter project, the Storehaus project, which can be used to integrate Spark with a lot of NoSQL databases like DynamoDB. Looks very useful indeed.

And just to go completely crazy with the Twitter project references, the talk also brings on Summingbird. The Twitter team has a separate blog post
about using Summingbird with Spark Streaming.

Leave a Reply