Want to count unique elements in a stream without blowing up memory? In more specific words, do you want to use a HyperLogLog counter in Spark? Until today, I’d never heard the word “monoid” before. However, Twitter Algebird is a project that contains a collection of monoids including a HyperLogLog monoid, which can be used to aggregate a stream into unique elements. The code looks like this:
import com.twitter.algebird._ val aggregator = new HyperLogLogMonoid(12) inputData.reduceByKey(aggregator.plus(_, _))
This young man tells you all about it, and then some:
The video also mentions another Twitter project, the Storehaus project, which can be used to integrate Spark with a lot of NoSQL databases like DynamoDB. Looks very useful indeed.