A Guide to Reducing Snowflake Costs

Join StarRocks Community on Slack

TABLE OF CONTENTS

It's no secret. Snowflake has a lot to offer when it comes to empowering your analytics operations. It probably felt like everything you were looking for when you signed up, but that honeymoon period lasted about as long as it took for that first bill to arrive. Often, one of the first questions teams ask when they start using Snowflake isn't "how can we do more with Snowflake?" It's "how can we cut down or better control Snowflake's costs?".

The common approaches to Snowflake cost control tend to focus on limiting the business’ usage or demand a lot of extra work from data warehouse developers. All of this results in only marginal cost savings. So how can you preserve all those great Snowflake analytics benefits while meaningfully reducing your costs?

In this article, we'll go over the basics of Snowflake and its pricing, examine the most popular approaches to controlling Snowflake costs, and also share a new, alternative approach capable of saving companies using Snowflake up to 80% of what they're paying today.

Snowflake: An Introduction

Snowflake is one of the world’s leading cloud-based data warehouse companies. Founded in 2012 by three data warehousing experts, Snowflake now supports nearly 7,000 customers. Snowflake’s data warehouse offering is fully managed. This means users shouldn’t have to concern themselves with back-end work like server installation and maintenance. Snowflake is also fairly flexible in its deployment options, with its data warehouse instances able to be spun up on AWS, GCP, and Azure.

The Advantages of Snowflake

Snowflake boasts plenty of advantages. This is especially true for small businesses that can put all of their data into Snowflake for analysis. Here are some of Snowflake’s biggest advantages:

Ease of Use

Snowflake is celebrated for its simple and intuitive interface. You can get started with the service quickly, and you can automatically spin up and down compute clusters of any size or workload without impacting other jobs. With near-infinite, instant elasticity and concurrency, Snowflake promises to deliver the performance your organization needs.

Outstanding Performance

In many cases, the performance of Snowflake is outstanding, and it offers different virtual warehouse sizes that can meet a wide variety of performance requirements. With Snowflake, if you want better performance, you just need to upgrade your virtual warehouse. Snowflake also ensures their query processing is on par with other competitive offerings.

Separation of Compute and Storage

Snowflake’s multi-cluster, shared data architecture shines because compute and storage resources are kept physically separate. At the same time, they are logically part of a single, integrated and modern cloud-built data warehouse system. Snowflake’s architecture also includes built-in cloud services such as the transparent provisioning of resources and automatic metadata management and resilience.

Seamless Data Sharing

Snowflake’s architecture allows for smooth and safe data sharing among Snowflake’s users. Companies can also share data with consumers regardless of the consumers being Snowflake’s customers or not. A client account can be created for them directly from the user interface and managed by the provider.

Cost Is a Major Issue With Snowflake

Although Snowflake’s capabilities have been a tremendous boon to data analytics teams, the honeymoon period tends to last up until the first invoice. Customers are often surprised by their month end Snowflake bills. This is because, as users ingest larger volumes of data into Snowflake, queries become more and more complex, and the concurrency becomes higher too. To guarantee their desired query latency, users need to use a bigger virtual warehouse for these complex queries. At the same time, the high concurrency of the queries leads to spinning up more virtual warehouses. This gets expensive very quickly.

As a result, most customers would want to find some way to reduce snowflake costs after using this product for a while. Let's look at some of the most popular approaches to reducing Snowflake costs.

A Brief Look at Snowflake's Pricing Structure

First, it's important to understand Snowflake's pricing structure and the options you have when working with them.

Snowflake pricing normally falls into one of four levels:

Standard
Enterprise
Business Critical
Virtual Private Snowflake (VPS)

Snowflake also offers the ability to pay for usage month-to-month and even pay for usage up front. While there can be no doubt Snowflake pricing is flexible, this does not necessarily equate to cheap. Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we focus on the compute costs for the remainder of this article.

How To Reduce Snowflake Costs

If you've spent anytime looking for solutions to Snowflake pricing online, you can summarize the most popular recommended approaches as follows:

Use Resource Monitors To Analyze Your Usage

Organizations tend to have a monthly threshold or quota that they want to spend for compute on Snowflake. It is recommended that you set up resource monitors for raising alerts and taking action upon reaching your monthly credits consumption threshold. In these instances, the business should abort all running queries and suspend its virtual warehouse immediately. Alternatively, you can complete all the running queries then suspend the virtual warehouse.

Scale Your Virtual Warehouses Appropriately

A good rule of thumb with Snowflake warehouses is to start small and increase the size as needed. Since the cost of a warehouse doubles each time you size up, starting smaller with longer running queries will allow for better cost management and monitoring.

Also, by creating virtual warehouses for specific use cases, you’ll be able to customize them for each case. That has the potential to deliver a better ROI to your business.

Use Query Control Parameters

In Snowflake, both query processing and concurrency can be controlled by configuring several parameters. For example, STATEMENT_TIMEOUT_IN_SECONDS controls the amount of time, in seconds, after which a running SQL statement (query, DDL, DML, etc.) is canceled by the system. This parameter can help to prevent wasting money on some unexpectedly large queries. Making good use of these kinds of parameters can be very useful for controlling costs.

Optimize Data Loading

How you load data and how efficient your queries are, are two of the most important Snowflake cost control factors. By default, data is automatically partitioned in Snowflake according to its ingestion order. It can, however, be more efficient to sort the data in its cloud bucket before loading rather than sorting the data with an `ORDER BY` after loading.

Compression also helps to improve load times, and not all options are made equal. A CSV file that has been Gzipped loads 2–3 times faster than Parquet or ORC.

Even the move command matters: COPY INTO is more efficient than INSERT because it applies an optimized SQL bulk loading process.

Using Materialized Views

A materialized view is a pre-computed data set stored for later use. This feature is also an effective Snowflake cost optimization strategy when running frequent or complex queries on a subsection of large data sets. Materialized views are only executed against all the data in a table once. After executing against all the available data on its initial run, it only runs against new data added. It does this by utilizing INSERT and UPDATE commands.

However, materialized views in Snowflake can only be utilized on data models that use simple SELECT statements rather than those that use joins or aggregation.

The above mentioned strategies are common approaches you will find in most of the tech blogs. They help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally reduce costs by a limited amount. On the other hand, some of these approaches may impact how businesses are run by limiting or even suspending queries, which makes them unpopular choices for enterprises.

To significantly reduce cloud analytics costs without negatively impacting business, we need to look at the problem from a different angle.

Common Challenges in Reducing Snowflake Costs

The approaches mentioned above are what you will most likely find when scouring tech blogs online. These suggestions can help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally yield marginal savings. Any amount of savings may still sound attractive, but they come with their own hidden costs. Most of the approaches will constrain business operations by limiting or even suspending queries, which makes them unpopular choices in most cases.

In order to significantly reduce cloud analytics costs with Snowflake, you will often need to sacrifice business flexibility and scalability. Often, this tradeoff is not worth the savings. The reality is that Snowflake pricing isn't going to decrease much. Even if you're willing to give up significant business performance. If you're not happy with your Snowflake costs, you need to consider another option: finding a new solution.

That's easier said than done. After all, you started using Snowflake because it met your original evaluation criteria. How can you find a solution that can easily meet your needs and is simple to transition to?

At CelerData, we work with customers every day who are unsatisfied with their Snowflake costs and who are looking for a more cost-effective way to power their data analytics. We know what that journey looks like, what makes it difficult, and how to make it easy. You have your share of solutions to pick from, but working alongside our customers, we regularly find that the open source StarRocks project is the best way forward.

How StarRocks Can Help

Launched in 2020, StarRocks has quickly become a leading open source project in the analytics space. StarRocks delivers record-breaking query performance, for both streaming and batch-based historical data analytics, supported by a sophisticated MPP architecture.

StarRocks’ architecture is mainly composed of two modules, Front End (FE) and Back End (BE). It doesn’t rely on any external components. You can easily manage a StarRocks cluster in your private environment. Managed SaaS services are also available through vendors like CelerData.

StarRocks Architecture

StarRocks Architecture Overview

StarRocks divides a single table into various tablets, each of which is replicated and then evenly distributed among BE nodes. There are two types of division: partitioning and bucketing. Partitioning (or sharding) helps a table to decompose into multiple partitions. In bucketing, partitions can be subdivided as tablets into buckets based on the hash function of one or more columns.

Sub-Second Query Response Time

Thanks to its fully vectorized query engine and cost-based optimizer (CBO), StarRocks can provide a sub-second query experience for different analytical scenarios even when the data volume is at TB scale. As demonstrated in the benchmark results later in this article, StarRocks has more than 2x the performance advantage over Snowflake at a much lower cost.

Real-Time Analytics

Unlike Snowflake, which primarily focuses on batch-based analytics, StarRocks also has great real-time analytics performance. It can load streaming data from Kafka and Pulsar easily. In Snowflake, the continuous streaming data loaded from Snowpipe isn’t available for at least a few minutes. In StarRocks, the latest data can be queried immediately after it is ingested. This high ingestion performance guarantees efficient real-time ingestion. What’s more, StarRocks can also conduct upserts and deletes as well, which is very important for real-time analytics.

Data Lakehouse Analytics

StarRocks can directly query data stored in data lakes such as Hive, Iceberg, and Hudi. These data lakes are usually built on object storage (e.g. S3) or HDFS. With the statistical information provided by the data lake module, StarRocks can optimize the query plan using its own CBO. From the results of several standardized benchmark tests, the query performance on the data lake can be as much as 3-5x that of Trino.

Ease of Use

StarRocks is easy to scale up and scale out, and users can choose the hardware that best fits their needs. If a user wants to add more nodes to a StarRocks cluster, they only need to send a command to StarRocks, and everything else can be done automatically. Storage and concurrency capacity scales linearly with the cluster size, and StarRocks also supports the separation of storage and compute. Users can utilize Kubernetes to manage the cluster and make good use of the cloud’s elasticity with ease: scale out during peak usage and scale in to save costs when there is no workload.

Advanced Techniques for Performance Improvements

StarRocks leverages different technologies to improve its query performance. Indexes in StarRocks (e.g. short key index, bloom filter index, and bitmap index) can help to greatly reduce the data scan time. Materialized views are another useful tool which makes use of pre-compute to improve query performance. Unlike Snowflake, which can only support building materialized views on single tables, StarRocks can build materialized views on multi-tables and grants more flexibility when processing data.

In short, StarRocks offers a unified, feature-rich analytics platform with the best performance numbers amongst analytics databases available to the industry today. Its community and user base continues to grow and includes some of the largest enterprises in the world. Best of all, it’s free to use.

StarRocks’ performance and capabilities make it an attractive alternative to Snowflake, and a viable option for reducing your analytics costs. We’ll dig deeper into this in the next section where we will take a look at the price and performance comparisons between StarRocks and Snowflake.

Price-Performance Comparison Between StarRocks and Snowflake

Thanks to the flexibility it offers in choosing your own hardware, StarRocks can deliver much better price-performance results compared to Snowflake.

Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we’ll focus on compute costs in this comparison.

Since individual use cases can differ, it wouldn’t be as helpful to examine any one specific use case in this comparison. Instead, we’ll use standard benchmarks, TPC-H and PTC-DS, to compare the performance of the two products. This will make it easier to reproduce and validate the test results.

In the following benchmark testing, we used Snowflake Standard Edition on its XL Virtual Warehouse. To run StarRocks, we used 16 instances of AWS EC2 r5.2xlarge. The total number of CPU cores here is 128. The data volume we used in this test was 1TB.

Table 1: TPC-DS 1T Benchmark (in ms, shorter is better)

SQL	AWS-SR(2.4.0)	AWS-SF(XL)	AWS-SR/AWS-SF
tpcds_1t.query01	807	1,751	0.46
tpcds_1t.query02	3,137	3,827	0.82
tpcds_1t.query03	435	1,442	0.3
tpcds_1t.query04	35,751	44,763	0.8
tpcds_1t.query05	2,084	3,249	0.64
tpcds_1t.query06	367	2,033	0.18
tpcds_1t.query07	693	2,172	0.32
tpcds_1t.query08	278	1,847	0.15
tpcds_1t.query09	4,039	12,427	0.33
tpcds_1t.query10	402	1,318	0.31
tpcds_1t.query11	23,390	21,858	1.07
tpcds_1t.query12	151	702	0.22
tpcds_1t.query13	409	3,124	0.13
tpcds_1t.query14	24,011	19,979	1.2
tpcds_1t.query15	458	2,454	0.19
tpcds_1t.query16	1,499	1,369	1.09
tpcds_1t.query17	944	3,809	0.25
tpcds_1t.query18	859	2,656	0.32
tpcds_1t.query19	261	2,226	0.12
tpcds_1t.query20	159	732	0.22
tpcds_1t.query21	117	725	0.16
tpcds_1t.query22	1,414	2,246	0.63
tpcds_1t.query23	67,976	59,069	1.15
tpcds_1t.query24	6,593	14,177	0.47
tpcds_1t.query25	751	4,726	0.16
tpcds_1t.query26	407	1,372	0.3
tpcds_1t.query27	406	1,648	0.25
tpcds_1t.query28	3,203	7,616	0.42
tpcds_1t.query29	1,522	4,558	0.33
tpcds_1t.query30	431	2,755	0.16
tpcds_1t.query31	2,163	3,681	0.59
tpcds_1t.query32	165	728	0.23
tpcds_1t.query33	639	2,378	0.27
tpcds_1t.query34	404	3,224	0.13
tpcds_1t.query35	1,278	7,659	0.17
tpcds_1t.query36	467	1,548	0.3
tpcds_1t.query37	284	872	0.33
tpcds_1t.query38	8,054	10,091	0.8
tpcds_1t.query39	422	5,089	0.08
tpcds_1t.query40	414	1,381	0.3
tpcds_1t.query41	94	358	0.26
tpcds_1t.query42	110	439	0.25
tpcds_1t.query43	701	1,451	0.48
tpcds_1t.query44	844	3,627	0.23
tpcds_1t.query45	646	2,734	0.24
tpcds_1t.query46	673	4,621	0.15
tpcds_1t.query47	14,554	4,301	3.38
tpcds_1t.query48	479	2,964	0.16
tpcds_1t.query49	541	7,189	0.08
tpcds_1t.query50	1,589	3,129	0.51
tpcds_1t.query51	4,599	4,702	0.98
tpcds_1t.query52	111	426	0.26
tpcds_1t.query53	481	1,196	0.4
tpcds_1t.query54	623	1,563	0.4
tpcds_1t.query55	115	427	0.27
tpcds_1t.query56	543	1,994	0.27
tpcds_1t.query57	8,224	3,297	2.49
tpcds_1t.query58	587	1,845	0.32
tpcds_1t.query59	4,247	4,727	0.9
tpcds_1t.query60	630	2,545	0.25
tpcds_1t.query61	422	2,770	0.15
tpcds_1t.query62	678	1,612	0.42
tpcds_1t.query63	456	1,067	0.43
tpcds_1t.query64	9,184	14,504	0.63
tpcds_1t.query65	5,640	5,747	0.98
tpcds_1t.query66	616	3,208	0.19
tpcds_1t.query67	53,480	35,793	1.49
tpcds_1t.query68	354	3,851	0.09
tpcds_1t.query69	432	1,774	0.24
tpcds_1t.query70	3,055	2,514	1.22
tpcds_1t.query71	482	1,952	0.25
tpcds_1t.query72	1,398	5,127	0.27
tpcds_1t.query73	232	2,413	0.1
tpcds_1t.query74	12,918	15,108	0.86
tpcds_1t.query75	5,428	15,569	0.35
tpcds_1t.query76	755	3,441	0.22
tpcds_1t.query77	555	1,648	0.34
tpcds_1t.query78	13,752	23,796	0.58
tpcds_1t.query79	1,296	3,618	0.36
tpcds_1t.query80	1,382	4,659	0.3
tpcds_1t.query81	589	4,424	0.13
tpcds_1t.query82	911	1,065	0.86
tpcds_1t.query83	430	1,430	0.3
tpcds_1t.query84	248	1,058	0.23
tpcds_1t.query85	457	4,291	0.11
tpcds_1t.query86	965	1,202	0.8
tpcds_1t.query87	7,713	16,944	0.46
tpcds_1t.query88	6,241	11,188	0.56
tpcds_1t.query89	563	1,369	0.41
tpcds_1t.query90	350	1,895	0.18
tpcds_1t.query91	197	1,375	0.14
tpcds_1t.query92	154	616	0.25
tpcds_1t.query93	1,021	2,908	0.35
tpcds_1t.query94	900	1,360	0.66
tpcds_1t.query95	3,414	9,551	0.36
tpcds_1t.query96	808	1,636	0.49
tpcds_1t.query97	4,324	7,467	0.58
tpcds_1t.query98	213	1,620	0.13
tpcds_1t.query99	1,273	2,327	0.55
Total	380,921	546,713	0.46

Table 2: TPC-H 1T Benchmark (in ms, shorter is better)

sql_name	AWS-SR(2.4.0)	AWS-SF(XL)	AWS-SR/AWS-SF
tpch_1t.q01	10020	7528	1.33
tpch_1t.q02	769	3658	0.21
tpch_1t.q03	3536	6215	0.57
tpch_1t.q04	1595	5364	0.30
tpch_1t.q05	4820	8088	0.60
tpch_1t.q06	197	664	0.30
tpch_1t.q07	3769	5317	0.71
tpch_1t.q08	3730	7976	0.47
tpch_1t.q09	15,385	62,692	0.25
tpch_1t.q10	2349	7143	0.33
tpch_1t.q11	798	1512	0.53
tpch_1t.q12	1204	3840	0.31
tpch_1t.q13	9280	11975	0.77
tpch_1t.q14	732	1423	0.51
tpch_1t.q15	1604	2259	0.71
tpch_1t.q16	1046	3224	0.32
tpch_1t.q17	3414	4240	0.81
tpch_1t.q18	7,784	24,594	0.32
tpch_1t.q19	2609	3919	0.67
tpch_1t.q20	946	3862	0.24
tpch_1t.q21	5931	14315	0.41
tpch_1t.q22	1831	4870	0.38
Total	83349	194678	0.50

From these results, we can see that StarRocks is able to deliver 2x the performance of Snowflake on standard benchmark data sets.

Next, Let’s take the price of Snowflake’s virtual warehouse and AWS’ EC2 into consideration. The standard version of Snowflake’s XL virtual warehouse costs 16 credits per hour, and every credit costs 2 dollars. AWS EC2 r5.2xlarge, however, costs 0.504 dollars per hour. Sixteen EC2 instances of this size were used to complete this test. Since StarRocks performance is 2x faster than Snowflake, if we want to finish the same workload that StarRocks can finish in one hour, we need to use Snowflake with the same hardware for two hours.

(0.504 * 16) / (16 * 2 *2) = 12.6%

As we can see, the total hardware and software cost of StarRocks is 12.6% the cost of Snowflake. We should note that this comparison is limited in some aspects:

The storage costs and cloud service costs of Snowflake haven't been taken into account.
The elasticity of Snowflake’s virtual warehouse hasn’t been taken into account. Users may suspend the virtual warehouse when their work is finished in some scenarios. However, in most online cases the query workload is stable.
The operating costs of StarRocks haven't been taken into account.

If one considers all of these points, it may add more costs to the StarRocks solution, but not by much. It is still reasonable to estimate that one could save up to 80% by adopting StarRocks compared to using Snowflake as their only data warehouse.

Detailed Comparison: StarRocks vs. Snowflake

Here's a breakdown of the major differences between StarRocks and Snowflake:

Capability	StarRocks	Snowflake
Open source	Yes	No
Deployment mode	On-premise and SaaS	SaaS
Separation of storage and compute	Yes	Yes
Isolated tenancy	Multi-tenant pooled resources	Multi-tenant pooled resources
Control vs abstraction of compute	Configurable cluster size and hardware type	Configurable cluster size, no control over hardware
Elasticity	Cluster resize with customized node size, automatically scaling out	Cluster resize, no choice of node size, auto scaling out
High concurrency	A node can handle up to several thousand concurrent queries. Adding more nodes can support more concurrent queries.	8 concurrent queries per warehouse by default, Autoscaling up to 10 warehouses.
Real-time support	Continuous streaming data is available in seconds. Supports real-time update.	Continuous streaming data isn't available for a few minutes.
Indexes	- Sort key index - Bitmap index - Bloom filter index	“Search optimization service” indexes fields for accelerated point lookup queries, at additional cost
Materialized views	- Synchronized with tables - Supports query rewrites - Work on multi-table and aggregation	- Synchronized with tables - Supports query rewrites - only works on a single table
Stored procedure	No	Yes
Table-level data distribution	Partitioning divides a table into multiple segments called partitions. Bucketing divides a partition into multiple more manageable parts called tablets	Data is automatically divided into micro-partitions. Pruning at micro-partition level. Table clustering (cluster keys)
Query latency	Sub-second level	Second level
Query Federation	Hive, Iceberg, Hudi, Delta Lake	Hive, Delta Lake, Iceberg

Using StarRocks for Enterprise

As an open source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on StarRocks' GitHub repository, and participate in discussions in the project's Slack channels.

But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.

CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer accompanied by enterprise-scale features and support.

When it comes to deployment, CelerData provides two options:

CelerData Enterprise - which is deployed in your data center or private VPC in the cloud.
CelerData Cloud - which is an SaaS cloud offering managed by CelerData.

You can learn more about CelerData by visiting CelerData.Com.

When it comes to reducing your Snowflake costs, you have a lot of options. Traditional approaches to cost savings can only go so far. If you’re ready to really bring costs down (and boost performance at the same time), it’s time to take a serious look at StarRocks.

copy success

Flexible and Fresh Real-Time Analytics with CelerData and Confluent

Today, we’re excited to share the news that CelerData has teamed up with Confluent as a registered ISV partner. While this is only the ...

Albert Wong

The Open Data Lakehouse: Towards Democratized Data Analytics

The future looks bright for open data lakehouses, but why is that? It's not just some marketing term, but a real, intelligent approach ...

Sida Shen

How Real-Time Analytics Works: A Step-by-Step Breakdown

How does real-time analytics work? How do we seamlessly transition from generating data to making data-driven decisions in a matter of ...

Sida Shen

A Guide to Reducing Snowflake Costs

Snowflake: An Introduction

The Advantages of Snowflake

Ease of Use

Outstanding Performance

Separation of Compute and Storage

Seamless Data Sharing

Cost Is a Major Issue With Snowflake

A Brief Look at Snowflake's Pricing Structure

How To Reduce Snowflake Costs

Use Resource Monitors To Analyze Your Usage

Scale Your Virtual Warehouses Appropriately

Use Query Control Parameters

Optimize Data Loading

Using Materialized Views

Common Challenges in Reducing Snowflake Costs

How StarRocks Can Help

Sub-Second Query Response Time

Real-Time Analytics

Data Lakehouse Analytics

Ease of Use

Advanced Techniques for Performance Improvements

Price-Performance Comparison Between StarRocks and Snowflake

Detailed Comparison: StarRocks vs. Snowflake

Using StarRocks for Enterprise

Related Articles

Have questions? Talk to a CelerData expert.