It's no secret: Snowflake has a lot to offer when it comes to empowering your analytics operations. It probably felt like everything you were looking for when you signed up, but that honeymoon period lasted about as long as it took for the first bill to arrive. Often, one of the first questions teams ask when they start using Snowflake isn't "How can we do more with Snowflake?" It's "How can we cut down on, or better control, Snowflake's costs?"
The common approaches to Snowflake cost control tend to either limit the business's usage or demand a lot of extra work from data warehouse developers, and all of that effort yields only marginal savings. So how can you preserve all those great Snowflake analytics benefits while meaningfully reducing your costs?
In this article, we'll go over the basics of Snowflake and its pricing, examine the most popular approaches to controlling Snowflake costs, and also share a new, alternative approach capable of saving companies using Snowflake up to 80% of what they're paying today.
Snowflake is one of the world’s leading cloud-based data warehouse companies. Founded in 2012 by three data warehousing experts, Snowflake now supports nearly 7,000 customers. Snowflake’s data warehouse offering is fully managed. This means users shouldn’t have to concern themselves with back-end work like server installation and maintenance. Snowflake is also fairly flexible in its deployment options, with its data warehouse instances able to be spun up on AWS, GCP, and Azure.
Snowflake boasts plenty of advantages. This is especially true for small businesses that can put all of their data into Snowflake for analysis. Here are some of Snowflake’s biggest advantages:
Snowflake is celebrated for its simple and intuitive interface. You can get started with the service quickly, and you can automatically spin up and down compute clusters of any size or workload without impacting other jobs. With near-infinite, instant elasticity and concurrency, Snowflake promises to deliver the performance your organization needs.
In many cases, the performance of Snowflake is outstanding, and it offers different virtual warehouse sizes that can meet a wide variety of performance requirements. With Snowflake, if you want better performance, you just need to upgrade your virtual warehouse. Snowflake also ensures their query processing is on par with other competitive offerings.
Snowflake’s multi-cluster, shared data architecture shines because compute and storage resources are kept physically separate. At the same time, they are logically part of a single, integrated and modern cloud-built data warehouse system. Snowflake’s architecture also includes built-in cloud services such as the transparent provisioning of resources and automatic metadata management and resilience.
Snowflake’s architecture allows for smooth and safe data sharing among Snowflake users. Companies can also share data with consumers regardless of whether those consumers are Snowflake customers themselves: a client account can be created for them directly from the user interface and managed by the provider.
Although Snowflake’s capabilities have been a tremendous boon to data analytics teams, the honeymoon period tends to last right up until the first invoice. Customers are often surprised by their month-end Snowflake bills. This is because, as users ingest larger volumes of data into Snowflake, queries become more and more complex, and concurrency climbs as well. To guarantee their desired query latency, users need a bigger virtual warehouse for these complex queries, while the high concurrency of those queries leads to spinning up even more virtual warehouses. This gets expensive very quickly.
As a result, most customers start looking for ways to reduce Snowflake costs after using the product for a while. Let's look at some of the most popular approaches.
First, it's important to understand Snowflake's pricing structure and the options you have when working with them.
Snowflake pricing normally falls into one of four levels:
Standard
Enterprise
Business Critical
Virtual Private Snowflake (VPS)
Snowflake also offers the ability to pay for usage month-to-month and even pay for usage up front. While there can be no doubt Snowflake pricing is flexible, this does not necessarily equate to cheap. Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we focus on the compute costs for the remainder of this article.
If you've spent any time looking online for ways to rein in Snowflake costs, you'll find the most popular recommended approaches can be summarized as follows:
Organizations tend to have a monthly threshold or quota that they want to spend on Snowflake compute. It is recommended that you set up resource monitors to raise alerts and take action once your monthly credit consumption reaches that threshold. At that point, the business can abort all running queries and suspend its virtual warehouse immediately, or let all running queries complete and then suspend the virtual warehouse.
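Resource monitors are configured in SQL. Here's a sketch; the monitor name, warehouse name, and quota are hypothetical, so adjust them to your account:

```sql
-- Hypothetical names and quota, for illustration only.
CREATE RESOURCE MONITOR monthly_compute_quota WITH
  CREDIT_QUOTA = 1000
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80 PERCENT DO NOTIFY              -- warn before the quota is reached
    ON 100 PERCENT DO SUSPEND            -- let running queries finish, then suspend
    ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- abort running queries and suspend

ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_compute_quota;
```

`SUSPEND` corresponds to the gentler option (completing running queries first), while `SUSPEND_IMMEDIATE` aborts them on the spot.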
A good rule of thumb with Snowflake warehouses is to start small and increase the size as needed. Since the cost of a warehouse doubles each time you size up, starting smaller with longer running queries will allow for better cost management and monitoring.
Also, by creating virtual warehouses for specific use cases, you’ll be able to customize them for each case. That has the potential to deliver a better ROI to your business.
In Snowflake, both query processing and concurrency can be controlled by configuring several parameters. For example, STATEMENT_TIMEOUT_IN_SECONDS controls the amount of time, in seconds, after which a running SQL statement (query, DDL, DML, etc.) is canceled by the system. This parameter can help to prevent wasting money on some unexpectedly large queries. Making good use of these kinds of parameters can be very useful for controlling costs.
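As a sketch, the timeout can be scoped to a warehouse or to a single session (the warehouse name here is hypothetical):

```sql
-- Cancel any statement that runs longer than 10 minutes on this warehouse.
ALTER WAREHOUSE analytics_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 600;

-- The same parameter can also be set for just the current session.
ALTER SESSION SET STATEMENT_TIMEOUT_IN_SECONDS = 600;
```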
How you load data, and how efficient your queries are, rank among the most important Snowflake cost-control factors. By default, data is automatically partitioned in Snowflake according to its ingestion order. It can, however, be more efficient to sort the data in its cloud bucket before loading rather than sorting it with an `ORDER BY` after loading.
Compression also helps to improve load times, and not all formats are created equal: a gzipped CSV file loads 2–3 times faster than Parquet or ORC.
Even the loading command matters: COPY INTO is more efficient than INSERT because it applies an optimized bulk-loading process.
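Putting these pieces together, a bulk load of gzipped CSV files from a stage might look like the following sketch (the table, stage, and file-format details are illustrative):

```sql
-- Bulk-load pre-sorted, gzipped CSV files from a named stage.
COPY INTO sales
  FROM @sales_stage/2023/
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP' SKIP_HEADER = 1);
```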
A materialized view is a pre-computed data set stored for later use. Materialized views are also an effective Snowflake cost-optimization strategy when running frequent or complex queries on a subset of a large data set. A materialized view is executed against all the data in a table only once; after that initial run, it processes only data that has since been inserted or updated.
However, materialized views in Snowflake can only be utilized on data models that use simple SELECT statements rather than those that use joins or aggregation.
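A minimal example consistent with that restriction uses a simple filtering SELECT (the table and columns are hypothetical):

```sql
-- Pre-compute a frequently queried subset of a large table.
CREATE MATERIALIZED VIEW us_orders AS
SELECT order_id, customer_id, amount
FROM orders
WHERE country = 'US';
```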
The approaches mentioned above are what you will most likely find when scouring tech blogs online. These suggestions can help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally yield marginal savings. Any amount of savings may still sound attractive, but they come with their own hidden costs. Most of the approaches will constrain business operations by limiting or even suspending queries, which makes them unpopular choices in most cases.
In order to significantly reduce cloud analytics costs with Snowflake, you will often need to sacrifice business flexibility and scalability, and that tradeoff is frequently not worth the savings. The reality is that your Snowflake bill isn't going to decrease much, even if you're willing to give up significant business performance. If you're not happy with your Snowflake costs, you need to consider another option: finding a new solution.
That's easier said than done. After all, you started using Snowflake because it met your original evaluation criteria. How can you find a solution that can easily meet your needs and is simple to transition to?
At CelerData, we work with customers every day who are unsatisfied with their Snowflake costs and who are looking for a more cost-effective way to power their data analytics. We know what that journey looks like, what makes it difficult, and how to make it easy. You have your share of solutions to pick from, but working alongside our customers, we regularly find that the open source StarRocks project is the best way forward.
Launched in 2020, StarRocks has quickly become a leading open source project in the analytics space. StarRocks delivers record-breaking query performance, for both streaming and batch-based historical data analytics, supported by a sophisticated MPP architecture.
StarRocks’ architecture is mainly composed of two modules, Front End (FE) and Back End (BE). It doesn’t rely on any external components. You can easily manage a StarRocks cluster in your private environment. Managed SaaS services are also available through vendors like CelerData.
StarRocks Architecture Overview
StarRocks divides a single table into multiple tablets, each of which is replicated and then evenly distributed among BE nodes. This division happens in two stages: partitioning and bucketing. Partitioning (or sharding) decomposes a table into multiple partitions, and bucketing then subdivides each partition into tablets based on the hash value of one or more columns.
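In StarRocks DDL, both stages are declared at table creation time. A sketch follows; the schema, partition ranges, and bucket count are illustrative:

```sql
CREATE TABLE site_visits (
    event_day DATE NOT NULL,
    site_id   INT  NOT NULL,
    visits    BIGINT
)
DUPLICATE KEY(event_day, site_id)
-- Partitioning: split the table into monthly partitions by date range.
PARTITION BY RANGE (event_day) (
    PARTITION p202301 VALUES LESS THAN ("2023-02-01"),
    PARTITION p202302 VALUES LESS THAN ("2023-03-01")
)
-- Bucketing: hash each partition into tablets by site_id.
DISTRIBUTED BY HASH(site_id) BUCKETS 16;
```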
Thanks to its fully vectorized query engine and cost-based optimizer (CBO), StarRocks can provide a sub-second query experience for different analytical scenarios even when the data volume is at TB scale. As demonstrated in the benchmark results later in this article, StarRocks has more than 2x the performance advantage over Snowflake at a much lower cost.
Unlike Snowflake, which primarily focuses on batch-based analytics, StarRocks also delivers great real-time analytics performance. It can easily load streaming data from Kafka and Pulsar. In Snowflake, continuous streaming data loaded via Snowpipe isn't available for at least a few minutes; in StarRocks, the latest data can be queried immediately after it is ingested. What's more, StarRocks supports upserts and deletes, which is very important for real-time analytics.
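Streaming ingestion from Kafka is typically set up with a Routine Load job. A sketch, with hypothetical database, job, table, broker, and topic names:

```sql
-- Continuously consume a Kafka topic into the site_visits table.
CREATE ROUTINE LOAD demo_db.visits_kafka_load ON site_visits
COLUMNS TERMINATED BY ","
PROPERTIES (
    "desired_concurrent_number" = "3"
)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic" = "site_visits"
);
```

Once the job is running, newly arrived messages become queryable as soon as each mini-batch is committed.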
StarRocks can directly query data stored in data lakes such as Hive, Iceberg, and Hudi. These data lakes are usually built on object storage (e.g. S3) or HDFS. With the statistical information provided by the data lake module, StarRocks can optimize the query plan using its own CBO. From the results of several standardized benchmark tests, the query performance on the data lake can be as much as 3-5x that of Trino.
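Data lake queries go through an external catalog. As a sketch (the catalog name and metastore URI are hypothetical):

```sql
-- Register a Hive metastore as an external catalog.
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

-- Hive tables can then be queried in place, without ingesting the data.
SELECT COUNT(*) FROM hive_catalog.sales_db.orders;
```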
StarRocks is easy to scale up and scale out, and users can choose the hardware that best fits their needs. If a user wants to add more nodes to a StarRocks cluster, they only need to send a command to StarRocks, and everything else can be done automatically. Storage and concurrency capacity scales linearly with the cluster size, and StarRocks also supports the separation of storage and compute. Users can utilize Kubernetes to manage the cluster and make good use of the cloud’s elasticity with ease: scale out during peak usage and scale in to save costs when there is no workload.
StarRocks leverages different technologies to improve its query performance. Indexes in StarRocks (e.g. short key index, bloom filter index, and bitmap index) can greatly reduce data scan time. Materialized views are another useful tool, using pre-computation to improve query performance. Unlike Snowflake, which only supports building materialized views on single tables, StarRocks can build materialized views across multiple tables, granting more flexibility when processing data.
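For instance, an asynchronously refreshed materialized view can pre-join and pre-aggregate across tables (the schema here is illustrative):

```sql
-- Pre-join orders with customers and pre-aggregate revenue by region.
CREATE MATERIALIZED VIEW region_revenue
REFRESH ASYNC
AS
SELECT o.order_date, c.region, SUM(o.amount) AS total_amount
FROM orders AS o
JOIN customers AS c ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region;
```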
In short, StarRocks offers a unified, feature-rich analytics platform with the best performance numbers amongst analytics databases available to the industry today. Its community and user base continues to grow and includes some of the largest enterprises in the world. Best of all, it’s free to use.
StarRocks’ performance and capabilities make it an attractive alternative to Snowflake, and a viable option for reducing your analytics costs. We’ll dig deeper into this in the next section where we will take a look at the price and performance comparisons between StarRocks and Snowflake.
Thanks to the flexibility it offers in choosing your own hardware, StarRocks can deliver much better price-performance results compared to Snowflake.
As covered earlier, Snowflake pricing is based on compute, storage, and cloud services usage, and compute costs usually make up the largest part of the bill. So we'll focus on compute costs in this comparison.
Since individual use cases can differ, it wouldn’t be as helpful to examine any one specific use case in this comparison. Instead, we’ll use standard benchmarks, TPC-H and TPC-DS, to compare the performance of the two products. This will make it easier to reproduce and validate the test results.
In the following benchmark testing, we used Snowflake Standard Edition on its XL Virtual Warehouse. To run StarRocks, we used 16 instances of AWS EC2 r5.2xlarge. The total number of CPU cores here is 128. The data volume we used in this test was 1TB.
Table 1: TPC-DS 1T Benchmark (in ms, shorter is better)

| SQL | AWS-SR(2.4.0) | AWS-SF(XL) | AWS-SR/AWS-SF |
|---|---|---|---|
| tpcds_1t.query01 | 807 | 1,751 | 0.46 |
| tpcds_1t.query02 | 3,137 | 3,827 | 0.82 |
| tpcds_1t.query03 | 435 | 1,442 | 0.3 |
| tpcds_1t.query04 | 35,751 | 44,763 | 0.8 |
| tpcds_1t.query05 | 2,084 | 3,249 | 0.64 |
| tpcds_1t.query06 | 367 | 2,033 | 0.18 |
| tpcds_1t.query07 | 693 | 2,172 | 0.32 |
| tpcds_1t.query08 | 278 | 1,847 | 0.15 |
| tpcds_1t.query09 | 4,039 | 12,427 | 0.33 |
| tpcds_1t.query10 | 402 | 1,318 | 0.31 |
| tpcds_1t.query11 | 23,390 | 21,858 | 1.07 |
| tpcds_1t.query12 | 151 | 702 | 0.22 |
| tpcds_1t.query13 | 409 | 3,124 | 0.13 |
| tpcds_1t.query14 | 24,011 | 19,979 | 1.2 |
| tpcds_1t.query15 | 458 | 2,454 | 0.19 |
| tpcds_1t.query16 | 1,499 | 1,369 | 1.09 |
| tpcds_1t.query17 | 944 | 3,809 | 0.25 |
| tpcds_1t.query18 | 859 | 2,656 | 0.32 |
| tpcds_1t.query19 | 261 | 2,226 | 0.12 |
| tpcds_1t.query20 | 159 | 732 | 0.22 |
| tpcds_1t.query21 | 117 | 725 | 0.16 |
| tpcds_1t.query22 | 1,414 | 2,246 | 0.63 |
| tpcds_1t.query23 | 67,976 | 59,069 | 1.15 |
| tpcds_1t.query24 | 6,593 | 14,177 | 0.47 |
| tpcds_1t.query25 | 751 | 4,726 | 0.16 |
| tpcds_1t.query26 | 407 | 1,372 | 0.3 |
| tpcds_1t.query27 | 406 | 1,648 | 0.25 |
| tpcds_1t.query28 | 3,203 | 7,616 | 0.42 |
| tpcds_1t.query29 | 1,522 | 4,558 | 0.33 |
| tpcds_1t.query30 | 431 | 2,755 | 0.16 |
| tpcds_1t.query31 | 2,163 | 3,681 | 0.59 |
| tpcds_1t.query32 | 165 | 728 | 0.23 |
| tpcds_1t.query33 | 639 | 2,378 | 0.27 |
| tpcds_1t.query34 | 404 | 3,224 | 0.13 |
| tpcds_1t.query35 | 1,278 | 7,659 | 0.17 |
| tpcds_1t.query36 | 467 | 1,548 | 0.3 |
| tpcds_1t.query37 | 284 | 872 | 0.33 |
| tpcds_1t.query38 | 8,054 | 10,091 | 0.8 |
| tpcds_1t.query39 | 422 | 5,089 | 0.08 |
| tpcds_1t.query40 | 414 | 1,381 | 0.3 |
| tpcds_1t.query41 | 94 | 358 | 0.26 |
| tpcds_1t.query42 | 110 | 439 | 0.25 |
| tpcds_1t.query43 | 701 | 1,451 | 0.48 |
| tpcds_1t.query44 | 844 | 3,627 | 0.23 |
| tpcds_1t.query45 | 646 | 2,734 | 0.24 |
| tpcds_1t.query46 | 673 | 4,621 | 0.15 |
| tpcds_1t.query47 | 14,554 | 4,301 | 3.38 |
| tpcds_1t.query48 | 479 | 2,964 | 0.16 |
| tpcds_1t.query49 | 541 | 7,189 | 0.08 |
| tpcds_1t.query50 | 1,589 | 3,129 | 0.51 |
| tpcds_1t.query51 | 4,599 | 4,702 | 0.98 |
| tpcds_1t.query52 | 111 | 426 | 0.26 |
| tpcds_1t.query53 | 481 | 1,196 | 0.4 |
| tpcds_1t.query54 | 623 | 1,563 | 0.4 |
| tpcds_1t.query55 | 115 | 427 | 0.27 |
| tpcds_1t.query56 | 543 | 1,994 | 0.27 |
| tpcds_1t.query57 | 8,224 | 3,297 | 2.49 |
| tpcds_1t.query58 | 587 | 1,845 | 0.32 |
| tpcds_1t.query59 | 4,247 | 4,727 | 0.9 |
| tpcds_1t.query60 | 630 | 2,545 | 0.25 |
| tpcds_1t.query61 | 422 | 2,770 | 0.15 |
| tpcds_1t.query62 | 678 | 1,612 | 0.42 |
| tpcds_1t.query63 | 456 | 1,067 | 0.43 |
| tpcds_1t.query64 | 9,184 | 14,504 | 0.63 |
| tpcds_1t.query65 | 5,640 | 5,747 | 0.98 |
| tpcds_1t.query66 | 616 | 3,208 | 0.19 |
| tpcds_1t.query67 | 53,480 | 35,793 | 1.49 |
| tpcds_1t.query68 | 354 | 3,851 | 0.09 |
| tpcds_1t.query69 | 432 | 1,774 | 0.24 |
| tpcds_1t.query70 | 3,055 | 2,514 | 1.22 |
| tpcds_1t.query71 | 482 | 1,952 | 0.25 |
| tpcds_1t.query72 | 1,398 | 5,127 | 0.27 |
| tpcds_1t.query73 | 232 | 2,413 | 0.1 |
| tpcds_1t.query74 | 12,918 | 15,108 | 0.86 |
| tpcds_1t.query75 | 5,428 | 15,569 | 0.35 |
| tpcds_1t.query76 | 755 | 3,441 | 0.22 |
| tpcds_1t.query77 | 555 | 1,648 | 0.34 |
| tpcds_1t.query78 | 13,752 | 23,796 | 0.58 |
| tpcds_1t.query79 | 1,296 | 3,618 | 0.36 |
| tpcds_1t.query80 | 1,382 | 4,659 | 0.3 |
| tpcds_1t.query81 | 589 | 4,424 | 0.13 |
| tpcds_1t.query82 | 911 | 1,065 | 0.86 |
| tpcds_1t.query83 | 430 | 1,430 | 0.3 |
| tpcds_1t.query84 | 248 | 1,058 | 0.23 |
| tpcds_1t.query85 | 457 | 4,291 | 0.11 |
| tpcds_1t.query86 | 965 | 1,202 | 0.8 |
| tpcds_1t.query87 | 7,713 | 16,944 | 0.46 |
| tpcds_1t.query88 | 6,241 | 11,188 | 0.56 |
| tpcds_1t.query89 | 563 | 1,369 | 0.41 |
| tpcds_1t.query90 | 350 | 1,895 | 0.18 |
| tpcds_1t.query91 | 197 | 1,375 | 0.14 |
| tpcds_1t.query92 | 154 | 616 | 0.25 |
| tpcds_1t.query93 | 1,021 | 2,908 | 0.35 |
| tpcds_1t.query94 | 900 | 1,360 | 0.66 |
| tpcds_1t.query95 | 3,414 | 9,551 | 0.36 |
| tpcds_1t.query96 | 808 | 1,636 | 0.49 |
| tpcds_1t.query97 | 4,324 | 7,467 | 0.58 |
| tpcds_1t.query98 | 213 | 1,620 | 0.13 |
| tpcds_1t.query99 | 1,273 | 2,327 | 0.55 |
| Total | 380,921 | 546,713 | 0.46 |
Table 2: TPC-H 1T Benchmark (in ms, shorter is better)

| sql_name | AWS-SR(2.4.0) | AWS-SF(XL) | AWS-SR/AWS-SF |
|---|---|---|---|
| tpch_1t.q01 | 10,020 | 7,528 | 1.33 |
| tpch_1t.q02 | 769 | 3,658 | 0.21 |
| tpch_1t.q03 | 3,536 | 6,215 | 0.57 |
| tpch_1t.q04 | 1,595 | 5,364 | 0.30 |
| tpch_1t.q05 | 4,820 | 8,088 | 0.60 |
| tpch_1t.q06 | 197 | 664 | 0.30 |
| tpch_1t.q07 | 3,769 | 5,317 | 0.71 |
| tpch_1t.q08 | 3,730 | 7,976 | 0.47 |
| tpch_1t.q09 | 15,385 | 62,692 | 0.25 |
| tpch_1t.q10 | 2,349 | 7,143 | 0.33 |
| tpch_1t.q11 | 798 | 1,512 | 0.53 |
| tpch_1t.q12 | 1,204 | 3,840 | 0.31 |
| tpch_1t.q13 | 9,280 | 11,975 | 0.77 |
| tpch_1t.q14 | 732 | 1,423 | 0.51 |
| tpch_1t.q15 | 1,604 | 2,259 | 0.71 |
| tpch_1t.q16 | 1,046 | 3,224 | 0.32 |
| tpch_1t.q17 | 3,414 | 4,240 | 0.81 |
| tpch_1t.q18 | 7,784 | 24,594 | 0.32 |
| tpch_1t.q19 | 2,609 | 3,919 | 0.67 |
| tpch_1t.q20 | 946 | 3,862 | 0.24 |
| tpch_1t.q21 | 5,931 | 14,315 | 0.41 |
| tpch_1t.q22 | 1,831 | 4,870 | 0.38 |
| Total | 83,349 | 194,678 | 0.50 |
From these results, we can see that StarRocks delivers roughly 2x the performance of Snowflake on standard benchmark data sets.
Next, let’s take the price of Snowflake’s virtual warehouse and AWS EC2 into consideration. The Standard edition of Snowflake’s XL virtual warehouse costs 16 credits per hour, and every credit costs $2. An AWS EC2 r5.2xlarge instance, however, costs $0.504 per hour, and sixteen instances of this size were used to complete this test. Since StarRocks is 2x faster than Snowflake, finishing the same workload that StarRocks completes in one hour requires running Snowflake’s XL warehouse for two hours:

($0.504/hr × 16 instances) / (16 credits/hr × $2/credit × 2 hrs) = $8.064 / $64 = 12.6%

As we can see, the total hardware and software cost of StarRocks is 12.6% of the cost of Snowflake. We should note that this comparison is limited in some respects:
The storage costs and cloud service costs of Snowflake haven't been taken into account.
The elasticity of Snowflake’s virtual warehouses hasn’t been taken into account. In some scenarios users may suspend a virtual warehouse when their work is finished, though in most online-serving cases the query workload is steady.
The operating costs of StarRocks haven't been taken into account.
Accounting for all of these points adds some cost to the StarRocks solution, but not much. It is still reasonable to estimate that one could save up to 80% by adopting StarRocks compared to using Snowflake as their only data warehouse.
Here's a breakdown of the major differences between StarRocks and Snowflake:
| Capability | StarRocks | Snowflake |
|---|---|---|
| Open source | Yes | No |
| Deployment mode | On-premise and SaaS | SaaS |
| Separation of storage and compute | Yes | Yes |
| Isolated tenancy | Multi-tenant pooled resources | Multi-tenant pooled resources |
| Control vs. abstraction of compute | Configurable cluster size and hardware type | Configurable cluster size, no control over hardware |
| Elasticity | Cluster resize with customized node size, automatic scale-out | Cluster resize, no choice of node size, automatic scale-out |
| High concurrency | A node can handle up to several thousand concurrent queries; adding more nodes supports more concurrent queries | 8 concurrent queries per warehouse by default; autoscaling up to 10 warehouses |
| Real-time support | Continuous streaming data is available in seconds; supports real-time updates | Continuous streaming data isn't available for a few minutes |
| Indexes | Sort key index, bitmap index, bloom filter index | "Search optimization service" indexes fields for accelerated point lookup queries, at additional cost |
| Materialized views | Synchronized with tables; supports query rewrites; works on multiple tables and aggregations | Synchronized with tables; supports query rewrites; only works on a single table |
| Stored procedures | No | Yes |
| Table-level data distribution | Partitioning divides a table into multiple segments called partitions; bucketing divides a partition into more manageable parts called tablets | Data is automatically divided into micro-partitions, with pruning at the micro-partition level; table clustering (cluster keys) |
| Query latency | Sub-second level | Second level |
| Query federation | Hive, Iceberg, Hudi, Delta Lake | Hive, Delta Lake, Iceberg |
As an open source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on StarRocks' GitHub repository, and participate in discussions in the project's Slack channels.
But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.
CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer accompanied by enterprise-scale features and support.
When it comes to deployment, CelerData provides two options:
CelerData Enterprise - which is deployed in your data center or private VPC in the cloud.
CelerData Cloud - which is a SaaS cloud offering managed by CelerData.
You can learn more about CelerData by visiting celerdata.com.
When it comes to reducing your Snowflake costs, you have a lot of options. Traditional approaches to cost savings can only go so far. If you’re ready to really bring costs down (and boost performance at the same time), it’s time to take a serious look at StarRocks.