The challenges of running Druid at large scale, and future directions, part 2

In the previous post I described how the Druid time series data store is used at Metamarkets and discussed some of the major challenges that we face when scaling this system to hundreds of nodes. In this post, I want to present more of the scalability issues with Druid.

To understand this post better, read through the overview of the Druid architecture given in the previous post.

Druid cluster view, simplified and without “indexing” part

Issues with ultra-large queries

In ad analytics, time series data sources are generally very “thick”. Reporting queries in our cluster over many months of historical data may touch up to millions of segments. The amount of computation required for such queries is enough to saturate the processing capacity of the entire historical layer for up to tens of seconds.

If we simply let such queries execute alongside interactive queries over recent data, the latency and user experience for the interactive queries would be horrible. Currently we isolate interactive queries from reporting queries using time-based tiering in the historical layer, i.e. recent segments in the data sources are loaded on one set of historical nodes (we call it the hot tier), while older segments, which constitute the majority of the segments that need to be processed for reporting queries, are loaded on another set of historical nodes (the cold tier).
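The routing logic behind time-based tiering can be sketched as follows. This is a hypothetical simplification, not Druid's actual implementation: the `HOT_WINDOW` value and the function names are assumptions made for illustration.

```python
from datetime import datetime, timedelta

# Hypothetical sketch of time-based tier routing: a query is sent to the
# "hot" tier only if part of its interval falls within the recent window
# served by hot-tier historical nodes; older intervals go to the "cold" tier.
HOT_WINDOW = timedelta(days=7)  # assumed hot-tier retention, for illustration

def route_query(interval_start, interval_end, now):
    """Return the set of tiers that must serve this query interval."""
    hot_boundary = now - HOT_WINDOW
    tiers = set()
    if interval_end > hot_boundary:
        tiers.add("hot")
    if interval_start < hot_boundary:
        tiers.add("cold")
    return tiers
```

Under this scheme an interactive query over the last day hits only the hot tier, while a multi-month reporting query spans both tiers and its result has to be merged from them.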

This approach has some flaws:

Tiers in Druid historical layer, too rigid resource isolation

This set of issues has been recognised in the Druid community for years. One solution proposed to make super-heavy queries run faster, without hogging the resources of the Druid cluster, was to employ Spark to run such queries. However, this is going to be woefully inefficient, because Spark can only download entire segments from Druid’s deep storage, sometimes consisting of dozens of columns, instead of just the few columns required for query execution. Also, the Spark query processing engine is not optimized as tightly for time series data and specific query types as the Druid query processing engine. On the other hand, if there is a large Spark cluster with a lot of spare resources anyway, this solution is practical.

A more efficient solution within the current Druid architecture would be to isolate thread and memory resources for running interactive and reporting queries within historical nodes (allowing both types of queries to opportunistically use all resources of a node), rather than between historical nodes, and also to make it possible to bypass the page cache when loading a segment into the memory of a historical node. The downside of this solution is that it’s difficult to implement: it makes Druid a more complex, not a simpler, piece of engineering.
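One way such in-node isolation could work is with capped "lanes" over a shared thread pool: reporting queries may hold at most a fraction of the processing threads, while interactive queries may use all of them. The following is a minimal sketch under those assumptions, not Druid's actual mechanism; the class and lane names are hypothetical.

```python
import threading

# Hypothetical sketch of in-node resource isolation between query lanes.
# Reporting queries must hold both a lane permit and a global permit, so
# they can never starve interactive queries of all processing threads,
# while interactive queries may opportunistically use the whole pool.
class LanedThreadPool:
    def __init__(self, total_threads, reporting_cap):
        self.total = threading.Semaphore(total_threads)
        self.reporting = threading.Semaphore(reporting_cap)

    def try_acquire(self, lane):
        """Try to reserve one processing thread for a query in `lane`."""
        if lane == "reporting":
            if not self.reporting.acquire(blocking=False):
                return False  # reporting lane already at its cap
            if not self.total.acquire(blocking=False):
                self.reporting.release()
                return False  # no free threads on the node at all
            return True
        return self.total.acquire(blocking=False)

    def release(self, lane):
        if lane == "reporting":
            self.reporting.release()
        self.total.release()
```

Compared to node-level tiering, both lanes share one node's full capacity when the other lane is idle, which is exactly the opportunistic use described above.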

Decoupling of storage and compute, proposed in the previous post, also solves the problems with large queries efficiently. First, since the compute resources (CPU and memory) available for processing a segment are not coupled to the node where the segment resides, query interference ceases to be an issue at all, as long as there are enough compute resources in the entire historical layer (which, by the way, could be scaled up and down very quickly, because historical nodes are almost ephemeral). To make the processing of particularly huge queries faster, new historical nodes could be provisioned just in time.
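With compute decoupled from storage, a scheduler is free to spread the segments of one huge query over whichever nodes are least loaded, instead of the fixed nodes that happen to host the data. A minimal greedy sketch of that idea (hypothetical names, not a Druid API):

```python
import heapq

# Hypothetical sketch: any historical node can process any segment, so we
# greedily assign each segment scan to the currently least-loaded node.
def assign_segments(segments, nodes):
    """Return a segment -> node assignment balancing pending scans."""
    heap = [(0, node) for node in nodes]  # (pending scans, node name)
    heapq.heapify(heap)
    assignment = {}
    for seg in segments:
        load, node = heapq.heappop(heap)
        assignment[seg] = node
        heapq.heappush(heap, (load + 1, node))
    return assignment
```

Because nodes are nearly ephemeral in this design, newly provisioned nodes can simply be added to the pool and immediately picked up by the same assignment loop.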

Brokers need to keep the view of the whole cluster in memory

About ten million segments are stored in our Druid cluster. As currently implemented, all broker nodes keep metadata about all of the segments in the cluster in their JVM heap memory, which already requires much more than 10 GB of memory and is the cause of significant JVM GC pauses on brokers.
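A back-of-envelope estimate shows why the heap grows this large. The per-segment figure below is an assumption for illustration, not a measured value:

```python
# Rough estimate (assumed numbers, not measured): if each segment's
# metadata object takes on the order of 1 KB on the JVM heap (id,
# interval, dimension and metric lists, load spec, etc.), then ten
# million segments already consume about 10 GB per broker.
segments = 10_000_000
bytes_per_segment_metadata = 1_000  # assumed average per segment
total_gb = segments * bytes_per_segment_metadata / 1e9
print(total_gb)  # → 10.0
```

And since every broker holds the full view, this cost is paid on every broker node, not amortized across them.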

An intermediate solution is to omit some parts of segment metadata from announcements, but at 10x scale this problem is going to hit again. It should be possible to partition segments between brokers, so that each broker operates only on a subset of the whole cluster view.
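Such partitioning could be as simple as hashing segment identifiers, so that each broker tracks metadata only for the segments that hash into its partition. A hypothetical sketch (the function name and scheme are illustrative, not a Druid feature):

```python
import hashlib

# Hypothetical sketch of partitioning the cluster view between brokers:
# each broker is responsible only for segments whose id hashes into its
# partition, so no single broker holds metadata for all ten million.
def broker_for_segment(segment_id, num_brokers):
    """Map a segment id deterministically to one of the brokers."""
    digest = hashlib.md5(segment_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_brokers
```

A router in front of the brokers would then fan a query out to the brokers owning the relevant partitions, at the cost of one extra merge step per query.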

Issues with Marathon

This is not about Druid itself, but about how we deploy it at scale. Marathon is an orchestration solution on top of Mesos. Our experience with it has been negative. The typical issues are: