Cloud Data Analytics Is a Scam

When analytical queries and workloads outgrow transactional databases like Postgres, most companies default to a cloud vendor like Snowflake, moving and transforming their data in the process. Cloud is the industry standard here. We have one of the fastest growing open source database projects, and every advisor tells us to “build a cloud platform”. It’s what the industry expects, and it’s also the easiest revenue model.
But we’re taking a contrarian path. I’ll explain why we think it’s time to rethink data analytics and why proprietary data analytics platforms are a trap—namely because of high costs, vendor lock-in, and unnecessary complexity.
Complexity
Cloud data analytics is a scam today. Snowflake and other proprietary cloud data warehouses were built on the assumption that distributed clusters, data movement orchestration, and constant infrastructure maintenance are necessary for analytics at scale. Let’s revisit whether this Hadoop-era way of thinking still holds true given today’s technical shifts:
- Hardware advances: a single server today is orders of magnitude more powerful than it used to be. A single machine with 64 CPU cores can easily scan 1TB of data. And since 98% of Redshift queries on >10TB datasets scan less than 1TB, distributed systems and their CAP theorem tradeoffs are no longer necessary for most analytical workloads.
- Embedded analytics engines: projects like DuckDB show you can quickly scan and analyze billions of rows of data on even your laptop. It runs fully in-process, sharing memory with your application, which means there’s no separate database server or network overhead.
- Separation of compute and storage: with accessible durable object storage like S3, you can store massive datasets cheaply without needing always-on infrastructure. You only spin up compute for the data you actually query, which reduces overhead—especially since most queries only scan a fraction of what’s stored.
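To make the single-node claim above concrete, here’s a back-of-the-envelope sketch. The throughput and compression figures are assumptions for illustration, not measurements:

```python
# Rough single-node scan math. All figures below are assumed,
# order-of-magnitude numbers, not benchmarks.
nvme_read_gbps = 10        # assumed sequential read throughput of a modern NVMe array, GB/s
compression_ratio = 3      # assumed columnar (Parquet-style) compression
logical_scan_gb = 1000     # a full 1 TB logical scan

physical_gb = logical_scan_gb / compression_ratio
scan_seconds = physical_gb / nvme_read_gbps
print(f"Full 1 TB scan: ~{scan_seconds:.0f}s on one machine")

# Per the Redshift statistic above, the vast majority of queries
# scan far less than 1 TB, so typical latencies are even lower:
typical_scan_gb = 50
typical_seconds = typical_scan_gb / compression_ratio / nvme_read_gbps
print(f"Typical query scanning {typical_scan_gb} GB: ~{typical_seconds:.1f}s")
```

Even with conservative numbers, a single box finishes a full-table scan in tens of seconds, which is why the distributed-by-default assumption is worth questioning.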
Together, these shifts enable a much simpler architecture: you can self-host without the bloat and cost of managing complex infrastructure. For example, our BemiDB project is just a single binary. It embeds DuckDB, automatically syncs data from Postgres databases to an analytics-optimized S3 bucket, and speaks the Postgres wire protocol, so existing tools connect to it like any Postgres database. In practice, spinning up a fast and scalable data warehouse can be as simple as:
> curl -sSL https://raw.githubusercontent.com/BemiHQ/BemiDB/refs/heads/main/scripts/install.sh | bash
> ./bemidb sync --pg-sync-interval 10m --pg-database-url postgres://<user>:<pass>@<host>:5432/<dbname>
> ./bemidb start
> psql postgres://localhost:54321/bemidb
bemidb=> SELECT country, COUNT(*) FROM users GROUP BY country;
The lock-in tax
Snowflake and similar vendors store data in proprietary formats, meaning you’re effectively stuck and unable to leave. Especially as you scale, this means exorbitant fees and no flexibility to use other tools and services with your data. This is in stark contrast to modern open table formats like Apache Iceberg, which are now becoming the standard.
Iceberg provides a consistent structure for your data on object storage that keeps it readable by all data tools and services. Paired with open-standard columnar files like Apache Parquet, your data remains fully portable.
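To make that portability concrete, an Iceberg table is just plain files on object storage that any compatible engine can read. A rough sketch of the layout (the bucket name and file names below are illustrative):

```
s3://analytics-bucket/db/users/
├── metadata/
│   ├── v1.metadata.json    # table schema, partition spec, snapshot history
│   ├── snap-*.avro         # manifest list for a snapshot
│   └── *.avro              # manifests pointing at individual data files
└── data/
    └── country=US/
        └── part-00000.parquet   # plain Parquet, readable by any engine
```

Because every layer is an open, documented format, switching query engines means pointing a different tool at the same bucket rather than exporting and re-ingesting your data.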
Additionally, proprietary cloud query engines limit where you can run and optimize workloads. Open source query engines let you deploy on VMs, bare metal, or containers—whatever fits your performance and security needs.
Embracing open formats and open source helps avoid the lock-in tax and keeps your data truly yours.
Monetization
Data teams end up paying well over $1,000 per month on cloud data warehouses—often climbing to tens of thousands when you factor in ETL pipelines, data egress, and infrastructure overhead. By storing data in cheap, durable object storage within the same region and running queries on a modest VM, you can cut those monthly costs down to hundreds.
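A hedged sketch of that cost claim, using assumed list prices (the per-GB storage rate and VM cost below are illustrative assumptions; check current provider pricing):

```python
# Rough monthly cost of a self-hosted setup. All prices are assumptions
# for illustration; verify against your provider's current pricing.
s3_price_per_gb_month = 0.023   # assumed standard object storage rate, $/GB-month
storage_gb = 1000               # 1 TB of columnar data
vm_monthly = 150.0              # assumed cost of a modest always-on VM, $/month

storage_cost = storage_gb * s3_price_per_gb_month
total = storage_cost + vm_monthly
print(f"Storage: ${storage_cost:.0f}/mo + VM: ${vm_monthly:.0f}/mo = ~${total:.0f}/mo")
```

Under these assumptions the total lands in the low hundreds of dollars per month, versus the four-to-five-figure bills common with managed warehouses.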

We’re a venture-backed startup that’s after profit, but our mission is to simplify data, and we encourage everyone to use our open source version and self-host. Not everything can be packaged into a single binary, so we charge for support and for extra features that require integrating additional components.
Give self-hosted a try
Exorbitant cloud costs and lock-in are worth it only when the alternative is building and maintaining complex infrastructure yourself. Unless you’re in the big data one percent, that’s no longer the case for data analytics.
Check out our GitHub repo and give BemiDB a star! We’re always pushing to make data simpler and more open.