Roman Leventov
1 min readMay 19, 2021


I think this is still a *broadcast* JOIN, i.e. a join of a large timeseries table with small tables that are distributed to all the data nodes.
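The idea can be sketched in plain Python, using hypothetical in-memory lists to stand in for data nodes: the large timeseries table is pre-partitioned across nodes, the small table is copied ("broadcast") to every node, and each node joins its own partition locally, so no shuffling of the large table is needed.

```python
# Small dimension table: metric -> category. This is what gets broadcast.
small = {"cpu": "hardware", "http_errors": "network"}

# Large timeseries table, pre-partitioned across three data nodes
# (hypothetical rows: (metric, timestamp, value)).
partitions = [
    [("cpu", 1620000000, 0.71), ("cpu", 1620000060, 0.74)],
    [("http_errors", 1620000000, 3)],
    [("cpu", 1620000120, 0.69), ("http_errors", 1620000060, 5)],
]

def local_join(partition, broadcast_table):
    """Join one node's partition against its local copy of the small table."""
    return [
        (metric, ts, value, broadcast_table[metric])
        for metric, ts, value in partition
        if metric in broadcast_table
    ]

# Every node receives the same copy of `small`; results are concatenated.
result = [row for part in partitions for row in local_join(part, small)]
```

This is only a toy illustration of the data movement pattern, not how any particular timeseries store implements it: the point is that the small table travels to the data, while the large table never leaves its nodes.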

Distributed joins of two or more large tables are very hard to implement efficiently and robustly. Entire distributed query engines such as Apache Impala, Presto, and Apache Spark focus on solving pretty much this exact problem. It would be smart for timeseries data stores to leverage the hundreds of person-years of effort that have gone into these query engines instead of trying to implement distributed joins on their own. Apache Kudu does this (in fact, it was developed as a storage engine for Apache Impala from the beginning). Apache Pinot achieves it through the Pinot-Presto connector. There is an ongoing effort to integrate Apache Druid with Apache Spark. A prominent proprietary datastore that has columnar (timeseries-ready) storage and supports distributed joins is Snowflake.
