Comparison of the Open Source OLAP Systems for Big Data: ClickHouse, Druid, and Pinot

In this post I want to compare ClickHouse, Druid, and Pinot, the three open source data stores that run analytical queries over big volumes of data with interactive latencies.

Warning: this post is pretty big, you may want to read just the “Summary” section in the end.

Sources of Information

I’m a committer to Druid, but I don’t have a vested interest in this system (in fact, I’m probably going to stop being involved in its development soon), so readers can expect me to be fairly objective about Druid.

Everything that I write in this post about Pinot is based on the Architecture page in Pinot’s wiki and the other wiki pages in the “Design Docs” section, which were last updated in June 2017, more than half a year ago.

This post was also reviewed by Alexey Zatelepin and Vitaliy Lyudvichenko (developers of ClickHouse), Gian Merlino (PMC member and the most active developer of Druid), Kishore Gopalakrishna (architect of Pinot) and Jean-François Im (developer of Pinot). Thanks to the reviewers.

Similarities between the Systems

Coupled Data and Compute

Differences from Big Data SQL Systems: Indexes and Static Data Distribution

  • Have their own format for storing data with indexes, which is tightly integrated with their query processing engines. SQL-on-Hadoop systems are generally agnostic of the data format and therefore less “intrusive” in Big Data backends.
  • Have data distributed relatively “statically” between the nodes, and the distributed query execution takes advantage of this knowledge. On the flip side, ClickHouse, Druid and Pinot don’t support queries that require movements of large amounts of data between the nodes, e. g. joins between two large tables.

No Point Updates and Deletes

ClickHouse developers at Yandex aim to support updates and deletes in the future, but I’m not sure whether those will be true point updates and deletes, or updates and deletes of ranges of data.

Big Data Style Ingestion

Proven at Large Scale

Immature

This brings us to the next important section —

On Performance Comparisons and Choice of the System

I think this approach is wrong, at least in the case of Open Source Big Data OLAP systems. The problem of crafting a generic Big Data OLAP system that would work efficiently across the majority of use cases and features (and the power set of their combinations!) is really, really huge — I would estimate that it takes at least 100 man-years to build such a system.

ClickHouse, Druid and Pinot are currently optimized only for the specific use cases that their developers care about, and have almost exclusively only the features that their developers need. If you are going to deploy a big cluster of one of those systems and care about efficiency, I guarantee that your use case is going to hit its own unique bottlenecks, which the developers of the subject OLAP systems didn’t encounter before or don’t care about. Not to mention that the aforementioned approach of “throw data into a system that you know little about and measure efficiency” has a very big chance of suffering from some major bottleneck that could be fixed by just changing some configuration, the data schema, or by making queries differently.

CloudFlare: ClickHouse vs. Druid

I don’t challenge their ultimate decision to choose ClickHouse, because on the scale of about ten nodes, and for their use case, I also think that ClickHouse is a better choice than Druid (I’m explaining this below in this post). But their conclusion that ClickHouse is at least an order of magnitude more efficient (in terms of infrastructure costs) than Druid is a complete fallacy. In fact, among the three systems discussed here, Druid offers the most to enable really cheap installations, see the section “Tiering of Query Processing Nodes in Druid” below.

When choosing a Big Data OLAP system, don’t compare how optimal they currently are for your use case. They are all very suboptimal at the moment. Instead, compare how fast your organization could make those systems move in the direction of being more optimal for your use case.

  • Either your organization should have engineers who are able to read, understand and modify the source code of the chosen system, and have capacity to do that. Note that ClickHouse is written in C++, Druid and Pinot — in Java.
  • Or, your organization should sign a contract with a company that provides support for the chosen system. There is Altinity for ClickHouse, and Imply and Hortonworks for Druid. There are no such companies for Pinot at the moment.

Other development considerations:

  • ClickHouse developers at Yandex stated that they spend 50% of their time building features that they need internally at the company, and 50% on the features with the most “community votes”. However, to be able to benefit from this, the features that you need in ClickHouse should match the features that most other people in the community need.
  • Druid developers from Imply are motivated to build widely applicable features in order to maximize their future business.
  • Druid’s development process is very similar to the Apache model: for several years it has been developed by several companies with quite different priorities, without dominance of any one of those companies. Contributions to Druid have the least chance of being declined or reverted later for misaligning with the goals of a primary developer company, because Druid doesn’t have one. ClickHouse and Pinot are currently far from that state; they are developed almost exclusively at Yandex and LinkedIn, respectively.
  • Druid commits to support a “developer API” that allows contributing custom column types, aggregation algorithms, “deep storage” options, etc. and maintaining them separately from the codebase of core Druid. Druid developers document this API and track its compatibility with prior versions. However, this API is not mature yet and is broken in pretty much every Druid release. As far as I know, ClickHouse and Pinot don’t maintain similar APIs.
  • According to GitHub, Pinot has the most people working on it; it seems that at least 10 man-years have been invested in Pinot over the last year. This figure is probably around 6 for ClickHouse and about 7 for Druid. In theory, it means that Pinot is being improved the fastest among the subject systems.

Architectures of Druid and Pinot are almost identical to each other, while ClickHouse stands a little apart from them. I will first compare the architecture of ClickHouse with “generic” Druid/Pinot architecture, and then discuss smaller differences between Druid and Pinot.

Differences between ClickHouse and Druid/Pinot

Data Management: Druid and Pinot

Segments are persisted in “deep storage” (e. g. HDFS) and could be loaded on query processing nodes, but the latter are not responsible for durability of the segments, so query processing nodes could be replaced relatively freely. Segments are not rigidly attached to particular nodes; they could be loaded on more or less any node. A special dedicated server (called “Coordinator” in Druid and “Controller” in Pinot, but I’m going to generically refer to it as “master” below) is responsible for assigning segments to nodes and moving segments between nodes, if needed. (This doesn’t contradict what I pointed out above in this post, that all three subject systems, including Druid and Pinot, have “static” data distribution between the nodes, because segment loads and movements in Druid (and Pinot, I suppose) are expensive operations, not done for each particular query, and usually happen only every several minutes, hours, or days.)

Metadata about segments is persisted in ZooKeeper: directly in Druid, and via the Helix framework in Pinot. In Druid, metadata is also persisted in an SQL database; this is explained in more detail in the section “Differences between Druid and Pinot” below in this post.

Data Management: ClickHouse

ClickHouse has partitioned tables, consisting of specific sets of nodes. There is no “central authority” or metadata server. All nodes, between which some table is partitioned, have full, identical copies of the table metadata, including addresses of all other nodes, on which partitions of this table are stored.

Metadata of a partitioned table includes “weights” of nodes for distribution of the newly written data, e. g. 40% of data should go to node A, 30% to node B and 30% to node C. Normally the distribution should just be equal among the nodes; a “skew”, as in the example above, is only required when a new node is added to the partitioned table, in order to fill the new node with data faster. Updates of those “weights” should be done manually by ClickHouse cluster administrators, or an automated system should be built on top of ClickHouse.
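For illustration, here is a minimal sketch (in Python, purely illustrative, not actual ClickHouse code; the node names and weights are made up) of how a weighted choice of the target node for a batch write could look:

```python
import random

# Hypothetical "weights" from the partitioned table metadata: node C was
# just added to the cluster, so it temporarily receives a larger share
# of the newly written data.
node_weights = {"node-a": 30, "node-b": 30, "node-c": 40}

def pick_node_for_batch(weights):
    """Pick the node that should receive the next batch of rows,
    proportionally to the configured weights."""
    nodes = list(weights)
    return random.choices(nodes, weights=[weights[n] for n in nodes], k=1)[0]

# Distribute 10 batches; roughly 40% of them should land on node-c.
for _ in range(10):
    print(pick_node_for_batch(node_weights))
```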

Data Management: Comparison

Data distribution tradeoff in ClickHouse

In the example that is given in the picture above, the table data is distributed between three nodes in Druid or Pinot, but a query for a little interval of data will usually hit just two nodes (unless that interval crosses a segment interval boundary). In ClickHouse, any queries will need to hit three nodes, if the table is partitioned between three nodes. In this example it doesn’t seem like a dramatic difference, but imagine that the number of nodes is 100, while the partitioning factor could still be e. g. 10 in Druid or Pinot.

To mitigate this problem, the largest ClickHouse cluster at Yandex (of hundreds of nodes) is in fact split into many “subclusters” of a few dozen nodes each. This ClickHouse cluster is used to power website analytics, and each data point has a “website ID” dimension. There is a strict assignment of each website ID to a specific subcluster, where all data for that website ID goes. There is some business logic layer on top of that ClickHouse cluster to manage such data separation on both the data ingestion and querying sides. Thankfully, in their use case few queries need to hit data across multiple website IDs, and such queries don’t come from customers of the service, so they don’t have strict real-time SLAs.

Another drawback of ClickHouse’s approach is that when a cluster grows rapidly, data is not rebalanced automatically; humans have to manually change the “node weights” in the partitioned table.

Tiering of Query Processing Nodes in Druid

This feature allows Metamarkets to save hundreds of thousands of dollars of Druid infrastructure spend per month, compared to running a “flat” cluster.

Tiering of query processing nodes in Druid

As far as I know, ClickHouse and Pinot don’t have similar features yet; all nodes in their clusters are supposed to be the same.

Since the architecture of Pinot is very similar to Druid’s, I think it would not be very hard to introduce a similar feature in Pinot. It might be harder to do this in ClickHouse, because the concept of segments is really helpful for the implementation of such a feature, but it is still possible.

Data Replication: Druid and Pinot

Data Replication: ClickHouse

In the largest ClickHouse cluster at Yandex, there are two equal sets of nodes in different data centers, and they are paired: in each pair the nodes are replicas of each other (i. e. a replication factor of two is used) and located in different data centers.

ClickHouse depends on ZooKeeper for replication management, but ZooKeeper is not needed for anything else; in particular, it is not needed at all for a single-node ClickHouse deployment.

Data Ingestion: Druid and Pinot

When a table could be updated with a delay of an hour or more, segments are created using batch processing engines, such as Hadoop or Spark. Both Druid and Pinot have “first class” out-of-the-box support for Hadoop. There is a third-party plugin for Druid indexing in Spark, but it’s unsupported at the moment. As far as I know, Pinot doesn’t have even that level of support for Spark, i. e. you would need to contribute it yourself: understand Pinot’s interfaces and code, and write some Java or Scala code. But it shouldn’t be very hard to do. (Update: Ananth PackkilDurai from Slack is contributing support for Spark in Pinot now.)

When a table should be updated in real time, both Druid and Pinot introduce the concept of “real time nodes”, which do three things: accept new data from Kafka (Druid supports other sources too), serve queries over the recent data, and create segments in the background, later pushing them to the “deep storage”.

Data Ingestion: ClickHouse

If a table is partitioned, the node that accepts a batch write (e. g. 10k rows) distributes the data according to the “weights” of all nodes in the partitioned table itself (see section “Data management: ClickHouse” above).

Rows written in a single batch form a small “set”. The set is immediately converted into columnar format. There is a background process on each ClickHouse node that merges row sets into larger ones. ClickHouse’s documentation heavily refers to this principle as “MergeTree” and highlights its similarity with log-structured merge trees, although IMO this is a little confusing, because the data is not organized in trees; it’s stored in a flat columnar format.

Data Ingestion: Comparison

Data ingestion in ClickHouse is much simpler (at the expense of more complicated historical data management, see above), although there is one caveat: you should be able to “batch” the data in front of ClickHouse itself. Automatic ingestion and batching of data from Kafka is available out-of-the-box, but if you have a different source of real-time data, ranging from queueing infrastructure alternatives to Kafka, to stream processing engines, to simple HTTP endpoints, you need to create an intermediate batching service, or contribute code to ClickHouse directly.
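To make the caveat concrete, here is a minimal sketch of such an intermediate batching service in Python (the `insert_batch` callback stands in for the actual write to ClickHouse and is an assumption of mine, not a real API):

```python
import threading
import time

class BatchingBuffer:
    """Accumulates individual rows and flushes them in large batches,
    either when the batch is big enough or when the oldest buffered
    row has waited too long."""

    def __init__(self, insert_batch, max_rows=10_000, max_delay_sec=1.0):
        self.insert_batch = insert_batch  # callable taking a list of rows
        self.max_rows = max_rows
        self.max_delay_sec = max_delay_sec
        self.rows = []
        self.first_row_at = None
        self.lock = threading.Lock()

    def add(self, row):
        with self.lock:
            if not self.rows:
                self.first_row_at = time.monotonic()
            self.rows.append(row)
            if len(self.rows) >= self.max_rows:
                self._flush()

    def flush_if_stale(self):
        """Call periodically from a background thread or timer."""
        with self.lock:
            if self.rows and time.monotonic() - self.first_row_at >= self.max_delay_sec:
                self._flush()

    def _flush(self):
        self.insert_batch(self.rows)
        self.rows = []
        self.first_row_at = None

# Usage sketch:
# buffer = BatchingBuffer(lambda rows: print(f"inserting {len(rows)} rows"))
# buffer.add({"ts": 1514764800, "url": "/", "status": 200})
```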

Query Execution

I can only speculate why the decision to extract yet another type of node was made when Druid and Pinot were designed. But now it seems essential, because with the total number of segments in clusters going beyond ten million, the segment-to-node mapping information takes gigabytes of memory. It’s too wasteful to allocate so much memory on all query processing nodes. So, this is another drawback imposed by Druid’s and Pinot’s “segmented” data management architecture.
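A rough back-of-envelope illustration (the per-entry size here is my assumption, only to show the order of magnitude):

```python
# Assume each segment-to-node mapping entry (segment id, interval,
# version, list of replica nodes) takes on the order of 200 bytes.
segments = 10_000_000
bytes_per_entry = 200  # assumed for illustration, not a measured number
print(f"{segments * bytes_per_entry / 1024**3:.1f} GiB for the mapping alone")  # ~1.9 GiB
```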

In ClickHouse, dedicating a separate set of nodes for “query brokering” is usually not needed. There is a special ephemeral “distributed” table type in ClickHouse that could be set up on any node, and queries to this table do everything that “broker” nodes are responsible for in Druid and Pinot. Usually such ephemeral tables are set up on each node that participates in the partitioned table, so, in practice, every node could be the “entry point” for a query to a ClickHouse cluster. This node will issue the necessary subqueries to other partitions, process its part of the query itself, and merge it with the partial results from other partitions.
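For illustration, here is a minimal sketch of setting up such an entry point with the third-party clickhouse-driver Python client; the cluster, database and table names are hypothetical:

```python
from clickhouse_driver import Client  # pip install clickhouse-driver

client = Client("node-a.example.com")

# A "distributed" table over the local table `hits_local`, which is assumed
# to already exist on every node of the (hypothetical) cluster `my_cluster`.
client.execute("""
    CREATE TABLE IF NOT EXISTS hits_all AS hits_local
    ENGINE = Distributed(my_cluster, default, hits_local, rand())
""")

# A query against `hits_all` fans out subqueries to all partitions and merges
# the partial results, i.e. this node plays the role of the "broker".
rows = client.execute("SELECT count() FROM hits_all WHERE CounterID = 42")
print(rows)
```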

When a node (either one of the processing nodes in ClickHouse, or a “broker” node in Druid and Pinot) issues subqueries to other nodes, and one or a few subqueries fail for whatever reason, ClickHouse and Pinot handle this situation properly: they merge the results of all subqueries that succeeded and still return a partial result to the user. Druid notoriously lacks this feature at the moment: if any subquery fails, the whole query fails as well.
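Conceptually, the fault-tolerant scatter-gather behavior of ClickHouse and Pinot could be sketched like this (illustrative Python, not code from either system):

```python
from concurrent.futures import ThreadPoolExecutor

def scatter_gather(subqueries, run_subquery, merge):
    """Run subqueries against the partitions in parallel and merge whatever
    succeeded, instead of failing the whole query (as Druid currently does)."""
    partial_results, failures = [], 0
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(run_subquery, q) for q in subqueries]
        for future in futures:
            try:
                partial_results.append(future.result())
            except Exception:
                failures += 1  # a failed subquery only makes the result partial
    return merge(partial_results), failures
```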

ClickHouse vs. Druid or Pinot: Conclusions

ClickHouse resembles traditional RDBMSs, e. g. PostgreSQL. In particular, ClickHouse could be deployed on just a single server. If the projected size of the deployment is small, e. g. no bigger than on the order of 100 CPU cores for query processing and 1 TB of data, I would say that ClickHouse has a significant advantage over Druid and Pinot, due to its simplicity and not requiring additional types of nodes, such as “master”, “real-time ingestion nodes” and “brokers”. On this field, ClickHouse competes with InfluxDB rather than with Druid or Pinot.

Druid and Pinot resemble Big Data systems such as HBase. Not by their performance characteristics, but by their dependency on ZooKeeper, dependency on persistent replicated storage (such as HDFS), focus on resilience to failures of single nodes, and autonomous work and data management that doesn’t require regular human attention.

For a wide range of applications, neither ClickHouse nor Druid nor Pinot is an obvious winner. First and foremost, I recommend taking into account which system’s source code you are able to understand, fix bugs in, add features to, etc. The section “On Performance Comparisons and Choice of the System” discusses this more.

Secondly, you could look at the table below. Each cell in this table describes a property of an application that makes either ClickHouse or Druid/Pinot probably a better choice. Rows are not ordered by importance. The relative importance of each row is different for different applications, but if your application is described by many properties from one column of the table and by few or none from the other, it’s likely that you should choose the corresponding system from the column header.

Note: none of the properties above means that you must use the corresponding system(s), or must avoid the other system(s). For example, if your cluster is projected to be big, it doesn’t mean that you should only consider Druid or Pinot, and never ClickHouse. It rather means that Druid or Pinot become more likely to be the better solution, but other properties could outweigh that, and ClickHouse could ultimately be the more optimal choice even for large clusters, in some applications.

Differences between Druid and Pinot

There is just one difference between Druid and Pinot that is probably too big to be eliminated in the foreseeable future: the implementation of segment management in the “master” node. The developers of both systems probably wouldn’t want to eliminate it anyway, because each approach has its pros and cons; it’s not that one is totally better than the other.

Segment Management in Druid

  • Less data is stored in ZooKeeper. Only the minimal information about the mapping from segment ids to the query processing nodes on which the segments are loaded is kept in ZooKeeper. The remaining extended metadata, such as the size of the segment, the list of dimensions and metrics in its data, etc., is stored only in the SQL database (see the sketch after this list).
  • When segments of data are evicted from the cluster because they become too old (this is a commonplace feature of time-series databases; ClickHouse, Druid and Pinot all have it), they are offloaded from the query processing nodes and metadata about them is removed from ZooKeeper, but not from the “deep storage” and the SQL database. As long as they are not removed manually from those places, this allows really old data to be “revived” quickly, in case it is needed for some reporting or investigation.
  • It was unlikely an original intention, but now there are plans in Druid to make the dependency on ZooKeeper optional. Currently ZooKeeper is used for three different things: segment management, service discovery, and property store, e. g. for realtime data ingestion management. The service discovery and property store functionality could be provided by Consul. Segment management could be implemented with HTTP announcements and commands, and this is partially enabled by the fact that the persistence function of ZooKeeper is “backed up” by the SQL database.
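For illustration, the split of segment metadata in Druid could be pictured roughly like this (field names and values are hypothetical, just to show which side holds what):

```python
# Kept in ZooKeeper: only the minimal mapping from segment id to the
# query processing nodes the segment is currently loaded on.
zk_entry = {
    "segment_id": "events_2018-01-01_v1_0",
    "loaded_on": ["historical-node-3", "historical-node-7"],
}

# Kept in the SQL database: the extended, durable metadata about the segment.
sql_row = {
    "segment_id": "events_2018-01-01_v1_0",
    "size_bytes": 734_003_200,
    "dimensions": ["country", "device", "website_id"],
    "metrics": ["impressions", "clicks"],
    "deep_storage_path": "hdfs://example/events/2018-01-01/v1/0/index.zip",
}
```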

The downside of having an SQL database as a dependency is greater operational burden, especially if an SQL database is not already set up in the organization. Druid supports MySQL and PostgreSQL, and there is a community extension for Microsoft SQL Server. Also, when Druid is deployed in the cloud, convenient managed RDBMS services could be used, such as Amazon RDS.

Segment Management in Pinot

On the other hand, Helix probably constrains Pinot with its “framework bounds”. Helix, and consequently Pinot, are probably going to depend on ZooKeeper forever.

Now I’m going to enumerate more shallow differences between Druid and Pinot. By “shallow” here I mean that there is a clear path to replicate those features in the system that lacks them, if anybody wants.

“Predicate pushdown” in Pinot

This feature is important for performance in some applications.

Currently Druid supports key-based partitioning if segments are created in Hadoop, but not yet when segments are created during realtime ingestion. Druid doesn’t currently implement “predicate pushdown” on brokers.

“Pluggable” Druid and Opinionated Pinot

  • HDFS, or Cassandra, or Amazon S3, or Google Cloud Storage, or Azure Blob Storage, etc. as “deep storage”;
  • Kafka, RabbitMQ, Samza, Flink, Spark, Storm, etc. (via Tranquility) as a real-time data ingestion source;
  • Druid itself, or Graphite, or Ambari, or StatsD, or Kafka as a sink for Druid cluster telemetry (metrics).

Since Pinot was developed almost exclusively at LinkedIn, to meet LinkedIn’s needs to date, it currently doesn’t offer users much choice: HDFS or Amazon S3 must be used as deep storage, and only Kafka is supported for real-time data ingestion. But if somebody needs it, I can imagine that it’s not hard to introduce support of multiple pluggable options for any service in Pinot. Maybe it’s going to change soon, since Uber and Slack have started to use Pinot.

Data Format and Query Execution Engine are Optimized Better in Pinot

  • Compression of indexed columns with bit granularity, versus byte granularity in Druid.
  • The inverted index is optional for each column; in Druid it’s obligatory, which is sometimes unnecessary and consumes a lot of space. The difference in space consumption between Druid and Pinot observed by Uber could probably be attributed to this.
  • Min and max values in numeric columns are recorded per segment.
  • Out-of-the-box support for data sorting. In Druid it could be achieved only manually and in a hackish way, as explained in the section “CloudFlare: ClickHouse vs. Druid” above. Data sorting means better compression, so this feature of Pinot is another probable reason for the difference in space consumption (and query performance!) between Druid and Pinot observed by Uber.
  • A more optimized format than Druid’s is used for multi-value columns.

All those things could be implemented in Druid, though. And although Pinot’s format is substantially better optimized than Druid’s at the moment, it is also pretty far from being really optimal. For example, Pinot (as well as Druid) uses only general-purpose compression such as Zstd, and hasn’t yet implemented any of the compression ideas from the Gorilla paper.
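As one example of what the Gorilla paper proposes, here is a simplified sketch of delta-of-delta encoding of timestamps (just the idea, without the bit-level packing the paper actually describes):

```python
def delta_of_delta_encode(timestamps):
    """Regularly spaced timestamps encode into long runs of zeros,
    which then compress extremely well (or can be bit-packed)."""
    out, prev, prev_delta = [], None, None
    for ts in timestamps:
        if prev is None:
            out.append(ts)                  # first value stored as-is
        elif prev_delta is None:
            prev_delta = ts - prev
            out.append(prev_delta)          # second value stored as a delta
        else:
            delta = ts - prev
            out.append(delta - prev_delta)  # everything else: delta of deltas
            prev_delta = delta
        prev = ts
    return out

# 10-second interval data encodes to [1514764800, 10, 0, 0, 0]
print(delta_of_delta_encode([1514764800, 1514764810, 1514764820, 1514764830, 1514764840]))
```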

Regarding query execution, it’s unfortunate that Uber mainly used count(*) queries to compare the performance of Druid with Pinot (1, 2), because this is just a dumb linear scan in Druid at the moment, although it’s really easy to replace it with a proper O(1) implementation. It’s an illustration of the meaninglessness of “black box” comparisons, covered in the section “On Performance Comparisons and Choice of the System” above in this post.

I think that the difference in GROUP BY query performance, observed by Uber, should be attributed to the lack of data sorting in Druid’s segments, as noted above in this section.

Druid Has a Smarter Segment Assignment (Balancing) Algorithm

I don’t know how LinkedIn is fine with such a simple segment balancing algorithm in Pinot, but huge gains are probably awaiting them around the corner, if they take the time to improve the algorithm.

Pinot is More Fault Tolerant on the Query Execution Path

Druid doesn’t implement this feature at the moment.

Tiering of Query Processing Nodes in Druid

As far as I know, Pinot doesn’t currently have a similar feature.

Summary

Due to their architectural similarity, ClickHouse, Druid and Pinot have approximately the same “optimization limit”. But as of now, all three systems are immature and very far from that limit. Substantial efficiency improvements to any of those systems (when applied to a specific use case) are possible within a few engineer-months of work. I don’t recommend comparing the performance of the subject systems at all; choose the one whose source code you are able to understand and modify, or in which you want to invest.

Among those three systems, ClickHouse stands a little apart from Druid and Pinot, while the latter two are almost identical; they are pretty much two independently developed implementations of exactly the same system.

ClickHouse more resembles “traditional” databases like PostgreSQL. A single-node installation of ClickHouse is possible. On a small scale (less than 1 TB of memory, less than 100 CPU cores) ClickHouse is much more interesting than Druid or Pinot, if you still want to compare it with them, because ClickHouse is simpler and has fewer moving parts and services. I would say that on this scale it competes with InfluxDB or Prometheus rather than with Druid or Pinot.

Druid and Pinot more resemble other Big Data systems in the Hadoop ecosystem. They retain “self-driving” properties even at a very large scale (more than 500 nodes), while ClickHouse requires a lot of attention from professional SREs. Also, Druid and Pinot are in a better position to optimize for the infrastructure costs of large clusters, and are better suited for cloud environments than ClickHouse.

The only sustainable difference between Druid and Pinot is that Pinot depends on the Helix framework and is going to continue to depend on ZooKeeper, while Druid could move away from its dependency on ZooKeeper. On the other hand, Druid installations are going to continue to depend on the presence of some SQL database.

Currently Pinot is optimized better than Druid. (But please read again above — “I don’t recommend to compare performance of the subject systems at all”, and corresponding sections in the post.)
