May 15, 2026

Building an Effective Reporting System with an OLAP Cluster

olapanalyticsdatabasearchitecturereporting

Transactional databases (OLTP) are optimised for writes and point lookups. Reporting systems (OLAP) need to scan millions of rows, aggregate across dimensions, and return results in seconds. Trying to run both on the same store eventually breaks — which is why dedicated OLAP clusters exist.

Here is how to design a reporting system that stays fast, fresh, and maintainable.

Star schema over normalised models

OLAP queries are read-heavy and column-selective. The star schema is the proven data model:

-- Fact table: one row per measurable event
CREATE TABLE fact_sales (
  id         BIGSERIAL,
  date_key   INTEGER NOT NULL REFERENCES dim_date(id),
  product_key INTEGER NOT NULL REFERENCES dim_product(id),
  store_key  INTEGER NOT NULL REFERENCES dim_store(id),
  quantity   INTEGER NOT NULL,
  amount     NUMERIC(12,2) NOT NULL,
  cost       NUMERIC(12,2) NOT NULL
) ENGINE = MergeTree()
ORDER BY (date_key, product_key, store_key);

Fact tables store metrics and foreign keys to dimensions. Keep them narrow — every extra column slows scans.
Dimension tables store descriptive attributes (product name, category, region). They are small, denormalised, and rarely updated.
ORDER BY in the fact table determines the sort key. Choose columns that appear in WHERE and GROUP BY most frequently.

Columnar storage is non-negotiable

A columnar store (ClickHouse, DuckDB, Redshift, BigQuery) reads only the columns referenced in a query. If your fact table has 50 columns but a report touches 4, a row-oriented database reads 46× more data from disk. This alone accounts for 10–100× performance differences on analytical workloads.

Key features to look for:

Vectorised execution — processes batches of values, not individual rows.
Lightweight compression — columns with low cardinality compress heavily (run-length encoding, delta encoding).
Projections/materialised views — pre-compute common aggregations.

Cluster topology

A typical OLAP cluster has three tiers:

Ingestion layer → Compute layer → Storage layer

Ingestion handles streaming (Kafka, Kinesis) and batch (S3, HDFS) data. It buffers, partitions, and distributes writes so the compute layer is not overwhelmed.

Compute is a set of stateless or semi-stateless nodes that execute queries. In MPP systems (ClickHouse, Redshift), data is co-located with compute on the same nodes for performance. In disaggregated architectures (Snowflake, BigQuery), compute and storage scale independently — useful for variable workloads.

Storage can be local SSD (fastest), network-attached, or object store (cheapest). Object-store-backed systems trade latency for elastic scaling.

Ingestion patterns

Batch (daily/hourly loads)

Source DB → CDC or full dump → staging bucket → OLAP cluster

Use INSERT INTO ... SELECT or COPY from Parquet files. Batch is simpler to reason about and easier to backfill.

Streaming (near-real-time)

Source DB → Debezium → Kafka → OLAP table

Streaming reduces data freshness from hours to seconds. The trade-off is more complex error handling and exactly-once semantics.

Most production systems use hybrid ingestion: streaming for recent data (last 24 hours), batch for historical loads and backfills.

Query optimisation

Pre-aggregation

The single most impactful optimisation. Create materialised views for common aggregation levels:

CREATE MATERIALIZED VIEW agg_sales_daily
ENGINE = SummingMergeTree()
PARTITION BY toYYYYMM(date)
ORDER BY (date, product_key, store_key)
AS SELECT
  date,
  product_key,
  store_key,
  sum(quantity)  AS total_qty,
  sum(amount)    AS total_amount
FROM fact_sales
GROUP BY date, product_key, store_key;

Queries hitting pre-aggregated tables return in milliseconds instead of seconds.

Partitioning and TTL

Partition by time — it makes dropping old data a metadata operation instead of a delete:

PARTITION BY toYYYYMM(date_key)
TTL date_key + INTERVAL 2 YEAR DELETE

Most OLAP systems support TTL-based expiry natively. Use it to cap storage costs automatically.

Indexing

OLAP indexes are different from B-trees:

Min-max indexes skip granules that fall outside the query range.
Bloom filters accelerate equality filters on high-cardinality columns.
Sparse indexes (the primary sort key) limit which granules are scanned.

Design the sort key to match your most common filter patterns.

Handling concurrency

OLAP clusters are shared resources. A single expensive query can saturate CPU and starve other users. Mitigations:

Resource groups — isolate heavy reporting from dashboard queries.
Query queues — limit concurrent queries per user or per priority.
Result caching — cache identical queries for a TTL window.
Async queries — run long queries in the background and poll for results.

Schema evolution

Fact tables accumulate billions of rows. Changing a column type or adding a new one can be expensive:

Additive changes (new nullable columns, new materialised views) are cheap.
Destructive changes (drop column, change sort key) often require table rebuilds.
Use versioned schemas — land new columns alongside old ones, migrate consumers, then drop.

Monitoring

Key metrics to track:

Query latency — P50, P95, P99.
Rows scanned per query — high numbers indicate missing pre-aggregation or poor sort key design.
Merge backlog — in MergeTree engines, a pending merge backlog degrades read performance.
Disk utilisation per node — skewed data distribution causes straggler nodes.

When an OLAP cluster is overkill

A reporting system does not always need a cluster. Consider simpler alternatives:

Materialised views on the OLTP database — works up to millions of rows.
DuckDB or SQLite in read-only mode — suitable for per-tenant reporting.
Dedicated read replica — more capacity without architectural change.

Reach for an OLAP cluster when your fact table exceeds 100 million rows, query latency targets are sub-second, or you need concurrent reporting across teams.

Conclusion

An effective OLAP reporting system combines a star-schema data model, columnar storage, thoughtful partitioning, pre-aggregation, and resource isolation. Start simple — batch loads and a single node are fine for most teams — and add streaming, clustering, and caching as volume and concurrency grow.

The best reporting system is the one your team can operate confidently at 3 AM.