linkedin / cruise-control
- Friday, September 1, 2017, 03:16:23
Cruise Control is the first of its kind to fully automate the dynamic workload rebalancing and self-healing of a Kafka cluster. It simplifies the operation of Kafka clusters for its users.
Cruise Control is a product that helps run Apache Kafka clusters at large scale. Due to the popularity of Apache Kafka, many companies have bigger and bigger Kafka clusters. At LinkedIn, we have 1800+ Kafka brokers, which means broker deaths are an almost daily occurrence and balancing the workload of Kafka also becomes a big overhead.
Kafka Cruise Control is designed to address this operational scalability issue.
Kafka Cruise Control provides the following features out of the box:
Resource utilization tracking for brokers, topics and partitions.
Multi-goal rebalance proposal generation
Anomaly detection and alerting for the Kafka cluster including
Admin operations including:
In config/cruisecontrol.properties:
Point bootstrap.servers and zookeeper.connect to the Kafka cluster to be monitored.
Set metric.sampler.class to your implementation (the default sampler class is CruiseControlMetricsReporterSampler).
Set sample.store.class to your implementation if necessary (the default SampleStore is KafkaSampleStore).
Then build and start Cruise Control:
./gradlew jar copyDependantLibs
./kafka-cruise-control-start.sh [-jars PATH_TO_YOUR_JAR_1,PATH_TO_YOUR_JAR_2] config/cruisecontrol.properties [port]
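Before starting, it can help to confirm that the properties file actually sets the two connection keys. The following is a minimal sketch in Python, assuming plain `key=value` lines. The helper names `parse_properties` and `missing_keys` and the placeholder address are illustrative only and are not part of Cruise Control.

```python
REQUIRED_KEYS = ("bootstrap.servers", "zookeeper.connect")


def parse_properties(text):
    """Parse simple `key=value` lines, skipping blank lines and # or ! comments."""
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line[0] in "#!":
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props, required=REQUIRED_KEYS):
    """Return the required keys that are absent or set to an empty value."""
    return [k for k in required if not props.get(k)]


# Placeholder content; replace the address with your own cluster.
example = """
# connection settings
bootstrap.servers=localhost:9092
zookeeper.connect=
"""
print(missing_keys(parse_properties(example)))  # ['zookeeper.connect']
```

To check a real file, pass the result of `open("config/cruisecontrol.properties").read()` to `parse_properties`.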
Cruise Control provides a REST API for users to interact with. See the wiki page for more details.
Cruise Control tries to understand the workload of each replica and provides an optimization solution for the current cluster based on this knowledge.
Cruise Control periodically gets the resource utilization samples at both broker and partition level to understand the traffic pattern of each partition. Based on the traffic characteristics of all the partitions, it derives the load impact of each partition in the brokers. Cruise Control then builds a workload model to simulate the workload of the Kafka cluster. The goal optimizer will explore different ways to generate the cluster workload optimization proposals based on the list of goals specified by the users.
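The idea of building a workload model and searching it for a better placement can be illustrated with a toy example. This sketch is not Cruise Control's optimizer: it assumes one load number per partition, and a single greedy "maximum spread" goal stands in for the user-specified goal list. All names (`broker_loads`, `propose_moves`, `max_spread`) are hypothetical.

```python
def broker_loads(assignment, partition_load, brokers):
    """Workload model: total load per broker, including empty brokers."""
    loads = {b: 0 for b in brokers}
    for partition, broker in assignment.items():
        loads[broker] += partition_load[partition]
    return loads


def propose_moves(assignment, partition_load, brokers, max_spread, max_moves=1000):
    """Greedy toy optimizer for one goal: keep the gap between the most and
    least loaded broker at or below `max_spread`."""
    simulated = dict(assignment)  # simulate on a copy; nothing is executed
    moves = []
    for _ in range(max_moves):
        loads = broker_loads(simulated, partition_load, brokers)
        hot = max(brokers, key=loads.get)
        cold = min(brokers, key=loads.get)
        gap = loads[hot] - loads[cold]
        if gap <= max_spread:
            break  # goal satisfied
        # A partition lighter than the gap narrows the gap between these two brokers.
        candidates = [p for p, b in simulated.items()
                      if b == hot and 0 < partition_load[p] < gap]
        if not candidates:
            break  # no single move helps; return a partial proposal
        heaviest = max(candidates, key=partition_load.get)
        simulated[heaviest] = cold
        moves.append((heaviest, hot, cold))
    return moves, broker_loads(simulated, partition_load, brokers)


partition_load = {"t-0": 40, "t-1": 30, "t-2": 20, "t-3": 10}
assignment = {p: 1 for p in partition_load}  # everything starts on broker 1
moves, loads = propose_moves(assignment, partition_load, brokers=[1, 2, 3], max_spread=20)
print(moves)  # [('t-0', 1, 2), ('t-1', 1, 3)]
print(loads)  # {1: 30, 2: 40, 3: 30}
```

The proposal is computed entirely against the simulated model, so it can be inspected before any partition is moved.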
Cruise Control also monitors the liveness of all the brokers in the cluster. When a broker fails in the cluster, Cruise Control will automatically move the replicas on the failed broker to the healthy brokers to avoid the loss of redundancy.
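As a rough illustration of this self-healing step, the sketch below reassigns every replica from a failed broker to a healthy one. It assumes the replacement is simply the healthy broker holding the fewest replicas that does not already host the partition. The function `heal` and its selection rule are hypothetical, not Cruise Control's actual logic.

```python
def heal(replicas, failed, brokers):
    """Reassign every replica hosted on `failed` to a healthy broker.

    `replicas` maps partition -> list of broker ids. A replacement broker must
    be alive and must not already hold a replica of the same partition.
    """
    healthy = [b for b in brokers if b != failed]
    # Replica count per healthy broker, used as a crude load measure.
    counts = {b: sum(b in rs for rs in replicas.values()) for b in healthy}
    healed = {}
    for partition, rs in replicas.items():
        new_rs = list(rs)
        if failed in new_rs:
            options = [b for b in healthy if b not in new_rs]
            if not options:
                raise RuntimeError(f"no healthy broker available for {partition}")
            target = min(options, key=counts.get)
            new_rs[new_rs.index(failed)] = target
            counts[target] += 1
        healed[partition] = new_rs
    return healed


replicas = {"t-0": [1, 2], "t-1": [2, 3], "t-2": [3, 1], "t-3": [1, 4]}
healed = heal(replicas, failed=1, brokers=[1, 2, 3, 4])
print(healed)  # {'t-0': [4, 2], 't-1': [2, 3], 't-2': [3, 2], 't-3': [3, 4]}
```

Every partition keeps its original replica count, which is the redundancy the text refers to.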
For more details about how Cruise Control achieves that, see these slides.
To read more about the configurations, check the configurations wiki page.
More about pluggable components can be found in the pluggable components wiki page.
The metric sampler is one of the most important pluggables in Cruise Control. It allows users to easily deploy Cruise Control to various environments and work with any existing metric system.
Cruise Control provides a metrics reporter which can be configured in your Apache Kafka server. It will produce performance metrics to a Kafka metrics topic which can be consumed by Cruise Control.
The Sample Store is used to store the collected metric samples and training samples in external storage. One problem in metric sampling is that Cruise Control uses data derived from the raw metrics, and the derivation relies on the cluster metadata at the time of sampling. If we look at old metrics without knowing the metadata at the point they were collected, the derived data will not be accurate. The Sample Store helps solve this problem by writing the derived data directly to external storage for later loading.
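A small example shows why the derived data is stored rather than recomputed. The sketch assumes, purely for illustration, that a broker's CPU is attributed to the partitions it leads in proportion to their bytes-in. `derive_cpu` and `InMemorySampleStore` are hypothetical names, and the in-memory list stands in for external storage.

```python
import json


def derive_cpu(broker_cpu, leader_of, bytes_in):
    """Attribute each broker's CPU to the partitions it leads, by bytes-in share."""
    derived = {}
    for broker, cpu in broker_cpu.items():
        led = [p for p, b in leader_of.items() if b == broker]
        total = sum(bytes_in[p] for p in led)
        for p in led:
            derived[p] = cpu * bytes_in[p] / total if total else 0.0
    return derived


class InMemorySampleStore:
    """Stand-in for external storage; keeps one JSON record per sampling time."""

    def __init__(self):
        self._records = []

    def store(self, timestamp, derived):
        self._records.append(json.dumps({"ts": timestamp, "cpu": derived}))

    def load(self):
        return [json.loads(r) for r in self._records]


broker_cpu = {1: 60.0, 2: 20.0}
bytes_in = {"t-0": 300, "t-1": 100, "t-2": 100}
leaders_then = {"t-0": 1, "t-1": 1, "t-2": 2}  # metadata when the sample was taken
leaders_now = {"t-0": 1, "t-1": 2, "t-2": 2}   # leadership of t-1 moved later

sample_store = InMemorySampleStore()
sample_store.store(1000, derive_cpu(broker_cpu, leaders_then, bytes_in))

stored = sample_store.load()[0]["cpu"]
recomputed = derive_cpu(broker_cpu, leaders_now, bytes_in)
print(stored)      # {'t-0': 45.0, 't-1': 15.0, 't-2': 20.0}
print(recomputed)  # {'t-0': 60.0, 't-1': 10.0, 't-2': 10.0}
```

Recomputing from the same raw metrics with the later metadata gives different per-partition values, while the stored sample preserves what was true at sampling time.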
The default sample store implementation produces the metric samples back to Kafka.
The goals in Cruise Control are pluggable with different priorities. The default goals are (in order of decreasing priority):
The anomaly notifier allows users to be notified when an anomaly is detected. Anomalies include:
In addition to anomaly notifications, users can specify actions to be taken in response to an anomaly. The following actions are supported: