As part of a project for my day job, I've been getting to grips with Flume. Chances are that if you've found this post, you're already aware of what Flume does, but for the uninitiated:
Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Hadoop’s HDFS. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. The system is centrally managed and allows for intelligent dynamic management. It uses a simple extensible data model that allows for online analytic applications.
The work that I'm doing requires me to manipulate events as they traverse a data flow. To do this, I will extend Flume through its plugin functionality with a custom sink decorator:
Sink decorators can add properties to the sink and can modify the data streams that pass through them. For example, you can use them to increase reliability via write ahead logging, increase network throughput via batching/compression, sampling, benchmarking, and even lightweight analytics.
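To make the idea concrete before touching Flume itself, here is a minimal sketch of the decorator pattern that the description above relies on. Every type in it (`Event`, `Sink`, `CollectingSink`, `TaggingDecorator`) is a hypothetical stand-in I've made up for illustration, not one of Flume's real classes, so treat it as a picture of the shape of the thing rather than code that plugs into Flume.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT Flume's real classes.
final class Event {
    final String body;
    final Map<String, String> attrs = new HashMap<>();

    Event(String body) {
        this.body = body;
    }
}

// Anything that can receive an event.
interface Sink {
    void append(Event e);
}

// A terminal sink that simply collects events in memory.
final class CollectingSink implements Sink {
    final List<Event> received = new ArrayList<>();

    @Override
    public void append(Event e) {
        received.add(e);
    }
}

// The decorator: wraps another sink, adds an attribute to each event,
// then forwards the event to the wrapped sink.
final class TaggingDecorator implements Sink {
    private final Sink downstream;
    private final String key;
    private final String value;

    TaggingDecorator(Sink downstream, String key, String value) {
        this.downstream = downstream;
        this.key = key;
        this.value = value;
    }

    @Override
    public void append(Event e) {
        e.attrs.put(key, value); // manipulate the event in flight
        downstream.append(e);    // pass it along unchanged otherwise
    }
}

final class DecoratorDemo {
    public static void main(String[] args) {
        CollectingSink sink = new CollectingSink();
        Sink decorated = new TaggingDecorator(sink, "source", "webapp");
        decorated.append(new Event("GET /index.html 200"));
        System.out.println(sink.received.get(0).attrs);
    }
}
```

Because the decorator implements the same interface as the sink it wraps, decorators can be stacked, and the wrapped sink never needs to know that the event was modified on the way in.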