A Guide to Reducing Snowflake Costs | CelerData

Written by CelerData | Nov 19, 2024 5:15:38 PM

It's no secret. Snowflake has a lot to offer when it comes to empowering your analytics operations. It probably felt like everything you were looking for when you signed up, but that honeymoon period lasted about as long as it took for the first bill to arrive. Often, one of the first questions teams ask when they start using Snowflake isn't "how can we do more with Snowflake?" It's "how can we cut down or better control Snowflake's costs?"

The common approaches to Snowflake cost control tend to focus on limiting the business's usage or demanding a lot of extra work from data warehouse developers, all for only marginal cost savings. So how can you preserve all those great Snowflake analytics benefits while meaningfully reducing your costs?

In this article, we'll go over the basics of Snowflake and its pricing, examine the most popular approaches to controlling Snowflake costs, and also share a new, alternative approach capable of saving companies using Snowflake up to 80% of what they're paying today.

 

Snowflake: An Introduction

Snowflake is one of the world’s leading cloud-based data warehouse companies. Founded in 2012 by three data warehousing experts, Snowflake now supports nearly 7,000 customers. Snowflake’s data warehouse offering is fully managed. This means users shouldn’t have to concern themselves with back-end work like server installation and maintenance. Snowflake is also fairly flexible in its deployment options, with its data warehouse instances able to be spun up on AWS, GCP, and Azure.

 

The Advantages of Snowflake

Snowflake boasts plenty of advantages. This is especially true for small businesses that can put all of their data into Snowflake for analysis. Here are some of Snowflake’s biggest advantages:

 

Ease of Use

Snowflake is celebrated for its simple and intuitive interface. You can get started with the service quickly, and you can automatically spin up and down compute clusters of any size or workload without impacting other jobs. With near-infinite, instant elasticity and concurrency, Snowflake promises to deliver the performance your organization needs.

 

Outstanding Performance

In many cases, Snowflake's performance is outstanding, and it offers different virtual warehouse sizes to meet a wide variety of performance requirements. With Snowflake, if you want better performance, you simply upgrade to a larger virtual warehouse. Snowflake's query processing is also on par with other competitive offerings.

 

Separation of Compute and Storage

Snowflake’s multi-cluster, shared data architecture shines because compute and storage resources are kept physically separate. At the same time, they are logically part of a single, integrated and modern cloud-built data warehouse system. Snowflake’s architecture also includes built-in cloud services such as the transparent provisioning of resources and automatic metadata management and resilience.

 

Seamless Data Sharing

Snowflake's architecture allows for smooth and safe data sharing among Snowflake users. Companies can also share data with consumers regardless of whether those consumers are Snowflake customers: a reader account can be created for them directly from the user interface and managed by the provider.

 

Cost Is a Major Issue With Snowflake

Although Snowflake's capabilities have been a tremendous boon to data analytics teams, the honeymoon period tends to last right up until the first invoice. Customers are often surprised by their month-end Snowflake bills. This is because, as users ingest larger volumes of data into Snowflake, queries become more complex and concurrency rises. To hold their desired query latency, users need a bigger virtual warehouse for these complex queries, while high query concurrency leads to spinning up additional virtual warehouses. This gets expensive very quickly.

As a result, most customers start looking for ways to reduce Snowflake costs after using the product for a while. Let's look at some of the most popular approaches.

 

A Brief Look at Snowflake's Pricing Structure

First, it's important to understand Snowflake's pricing structure and the options you have when working with them.

Snowflake pricing normally falls into one of four editions:

 

  • Standard

  • Enterprise

  • Business Critical

  • Virtual Private Snowflake (VPS)

 

Snowflake also offers the ability to pay for usage month-to-month or to pre-purchase capacity up front. While there can be no doubt Snowflake pricing is flexible, flexible does not necessarily equate to cheap. Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated based on how long your warehouses are running, storage costs are based on the amount of data you are storing, and cloud services costs depend on the Snowflake features you are using. Usually, compute costs are the largest part of your Snowflake bill, so we'll focus on compute costs for the remainder of this article.

 

How To Reduce Snowflake Costs

If you've spent any time looking online for ways to manage Snowflake pricing, you can summarize the most popular recommended approaches as follows:

 

Use Resource Monitors To Analyze Your Usage

Organizations tend to have a monthly threshold or quota that they want to spend on Snowflake compute. It is recommended that you set up resource monitors that raise alerts and take action when your monthly credit consumption reaches that threshold. At that point, the monitor can either abort all running queries and suspend the virtual warehouse immediately, or let running queries complete and then suspend the warehouse.
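As a sketch of how this looks in practice (the monitor and warehouse names, quota, and thresholds here are hypothetical), a resource monitor can notify at one threshold and suspend at another:

```sql
-- Hypothetical monthly quota: notify at 80%, suspend gracefully at 100%,
-- and cancel running queries if consumption overshoots to 110%.
CREATE RESOURCE MONITOR monthly_quota WITH
  CREDIT_QUOTA = 1000
  FREQUENCY = MONTHLY
  START_TIMESTAMP = IMMEDIATELY
  TRIGGERS
    ON 80  PERCENT DO NOTIFY
    ON 100 PERCENT DO SUSPEND            -- let running queries finish, then suspend
    ON 110 PERCENT DO SUSPEND_IMMEDIATE; -- cancel running queries and suspend

-- Attach the monitor to a warehouse.
ALTER WAREHOUSE analytics_wh SET RESOURCE_MONITOR = monthly_quota;
```

`DO SUSPEND` corresponds to completing running queries before suspending, while `DO SUSPEND_IMMEDIATE` aborts them right away.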

 

Scale Your Virtual Warehouses Appropriately

A good rule of thumb with Snowflake warehouses is to start small and increase the size only as needed. Since the cost of a warehouse doubles each time you size up, starting small, even if queries run a bit longer, allows for better cost management and monitoring.

Also, by creating virtual warehouses for specific use cases, you’ll be able to customize them for each case. That has the potential to deliver a better ROI to your business.
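A sketch of the start-small approach (warehouse name and settings are illustrative): create the smallest size for a dedicated use case, lean on auto-suspend to stop billing during idle periods, and only resize when the workload proves it's needed.

```sql
-- Dedicated warehouse for a specific use case, starting at the smallest size.
CREATE WAREHOUSE reporting_wh WITH
  WAREHOUSE_SIZE = 'XSMALL'
  AUTO_SUSPEND = 60     -- suspend after 60 seconds of inactivity to stop billing
  AUTO_RESUME = TRUE;   -- wake up automatically when a query arrives

-- Size up one step only if the workload demands it
-- (each step up roughly doubles the credit cost).
ALTER WAREHOUSE reporting_wh SET WAREHOUSE_SIZE = 'SMALL';
```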

 

Use Query Control Parameters

In Snowflake, both query processing and concurrency can be controlled by configuring several parameters. For example, STATEMENT_TIMEOUT_IN_SECONDS controls the amount of time, in seconds, after which a running SQL statement (query, DDL, DML, etc.) is canceled by the system. This parameter can help to prevent wasting money on some unexpectedly large queries. Making good use of these kinds of parameters can be very useful for controlling costs.
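For instance (the warehouse name and values are illustrative), a statement timeout can be set at the warehouse or session level, and a queued-statement timeout can keep a backlog of waiting queries from accumulating costs:

```sql
-- Cancel any statement on this warehouse that runs longer than 10 minutes.
ALTER WAREHOUSE reporting_wh SET STATEMENT_TIMEOUT_IN_SECONDS = 600;

-- Cancel statements that sit in the queue for more than 2 minutes.
ALTER SESSION SET STATEMENT_QUEUED_TIMEOUT_IN_SECONDS = 120;
```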

 

Optimize Data Loading

How you load data and how efficient your queries are rank among the most important Snowflake cost control factors. By default, Snowflake automatically partitions data according to its ingestion order. It can, however, be more efficient to sort the data in its cloud bucket before loading rather than sorting it with an `ORDER BY` after loading.

Compression also helps to improve load times, and not all formats are equal: a gzipped CSV file can load 2-3 times faster than Parquet or ORC.

Even the load command matters: `COPY INTO` is more efficient than `INSERT` because it uses an optimized bulk loading path.
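A minimal sketch of such a bulk load from a gzipped CSV stage (the table, stage, and file layout are hypothetical):

```sql
-- Bulk-load pre-sorted, gzipped CSV files from an external stage.
COPY INTO sales
  FROM @sales_stage/2024/
  FILE_FORMAT = (TYPE = 'CSV' COMPRESSION = 'GZIP' SKIP_HEADER = 1);
```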

 

Using Materialized Views

A materialized view is a pre-computed data set stored for later use. This feature is also an effective Snowflake cost optimization strategy when running frequent or complex queries on a subsection of a large data set. A materialized view's query is executed against all the data in a table only once, on its initial run; after that, Snowflake maintains the view incrementally, applying only the inserts and updates that have occurred in the base table since.

However, materialized views in Snowflake can only be defined with simple SELECT statements on a single table; queries that use joins are not supported, and aggregation support is limited.
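A sketch of such a view, assuming a hypothetical `orders` table (note that Snowflake materialized views require the Enterprise edition or above):

```sql
-- Pre-compute a frequently queried slice of a large table.
-- Snowflake maintains the view incrementally as new rows arrive.
CREATE MATERIALIZED VIEW recent_us_orders AS
  SELECT order_id, customer_id, order_total
  FROM orders
  WHERE region = 'US';
```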

 


 

Common Challenges in Reducing Snowflake Costs

The approaches mentioned above are what you will most likely find when scouring tech blogs online. These suggestions can help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally yield marginal savings. Any amount of savings may still sound attractive, but they come with their own hidden costs. Most of the approaches will constrain business operations by limiting or even suspending queries, which makes them unpopular choices in most cases.

In order to significantly reduce cloud analytics costs with Snowflake, you will often need to sacrifice business flexibility and scalability, and that tradeoff is often not worth the savings. The reality is that Snowflake pricing isn't going to decrease much, even if you're willing to give up significant business performance. If you're not happy with your Snowflake costs, you need to consider another option: finding a new solution.

That's easier said than done. After all, you started using Snowflake because it met your original evaluation criteria. How can you find a solution that can easily meet your needs and is simple to transition to?

At CelerData, we work with customers every day who are unsatisfied with their Snowflake costs and who are looking for a more cost-effective way to power their data analytics. We know what that journey looks like, what makes it difficult, and how to make it easy. You have your share of solutions to pick from, but working alongside our customers, we regularly find that the open source StarRocks project is the best way forward.

 

How StarRocks Can Help

Launched in 2020, StarRocks has quickly become a leading open source project in the analytics space. StarRocks delivers record-breaking query performance, for both streaming and batch-based historical data analytics, supported by a sophisticated MPP architecture.

StarRocks' architecture is mainly composed of two modules, the frontend (FE) and the backend (BE), and it doesn't rely on any external components. You can easily manage a StarRocks cluster in your private environment, and managed SaaS services are also available through vendors like CelerData.

 

StarRocks Architecture Overview

 

StarRocks divides each table into tablets, which are replicated and then evenly distributed among BE nodes. Data is divided in two steps: partitioning and bucketing. Partitioning (or sharding) splits a table into multiple partitions, typically by a range over a date or time column. Bucketing then subdivides each partition into tablets based on the hash of one or more columns.
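A sketch of how partitioning and bucketing are declared in StarRocks DDL (the table, columns, and partition boundaries are hypothetical):

```sql
-- Range-partition by date, then hash-bucket each partition into tablets.
CREATE TABLE orders (
    order_date DATE   NOT NULL,
    order_id   BIGINT NOT NULL,
    amount     DECIMAL(10, 2)
)
DUPLICATE KEY(order_date, order_id)
PARTITION BY RANGE(order_date) (
    PARTITION p202401 VALUES LESS THAN ('2024-02-01'),
    PARTITION p202402 VALUES LESS THAN ('2024-03-01')
)
DISTRIBUTED BY HASH(order_id) BUCKETS 32;  -- 32 tablets per partition
```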

 

Sub-Second Query Response Time

Thanks to its fully vectorized query engine and cost-based optimizer (CBO), StarRocks can provide a sub-second query experience for different analytical scenarios even when the data volume is at TB scale. As demonstrated in the benchmark results later in this article, StarRocks has more than 2x the performance advantage over Snowflake at a much lower cost.

 

Real-Time Analytics

Unlike Snowflake, which primarily focuses on batch-based analytics, StarRocks also delivers strong real-time analytics performance. It can easily load streaming data from Kafka and Pulsar. In Snowflake, continuous streaming data loaded through Snowpipe isn't available for at least a few minutes; in StarRocks, the latest data can be queried immediately after it is ingested. What's more, StarRocks supports upserts and deletes, which is very important for real-time analytics.
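Streaming ingestion from Kafka is typically set up with a Routine Load job; a sketch with hypothetical database, table, broker, and topic names:

```sql
-- Continuously consume CSV messages from a Kafka topic into the events table.
CREATE ROUTINE LOAD demo_db.events_job ON events
COLUMNS TERMINATED BY ",",
COLUMNS (event_time, user_id, action)
FROM KAFKA (
    "kafka_broker_list" = "broker1:9092",
    "kafka_topic"       = "user_events"
);
```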

 

Data Lakehouse Analytics

StarRocks can directly query data stored in data lakes such as Hive, Iceberg, and Hudi. These data lakes are usually built on object storage (e.g. S3) or HDFS. With the statistical information provided by the data lake module, StarRocks can optimize the query plan using its own CBO. From the results of several standardized benchmark tests, the query performance on the data lake can be as much as 3-5x that of Trino.
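Lake data is exposed in StarRocks through an external catalog; a sketch assuming a hypothetical Hive metastore endpoint and lake table:

```sql
-- Register a Hive catalog, then query lake tables directly.
CREATE EXTERNAL CATALOG hive_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://metastore-host:9083"
);

SELECT COUNT(*)
FROM hive_catalog.sales_db.orders
WHERE order_date >= '2024-01-01';
```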

 

Ease of Use

StarRocks is easy to scale up and scale out, and users can choose the hardware that best fits their needs. If a user wants to add more nodes to a StarRocks cluster, they only need to send a command to StarRocks, and everything else can be done automatically. Storage and concurrency capacity scales linearly with the cluster size, and StarRocks also supports the separation of storage and compute. Users can utilize Kubernetes to manage the cluster and make good use of the cloud’s elasticity with ease: scale out during peak usage and scale in to save costs when there is no workload.

 

Advanced Techniques for Performance Improvements

StarRocks leverages different technologies to improve its query performance. Indexes in StarRocks (e.g. short key index, bloom filter index, and bitmap index) can greatly reduce data scan time. Materialized views are another useful tool that uses pre-computation to improve query performance. Unlike Snowflake, which only supports building materialized views on single tables, StarRocks can build materialized views across multiple tables, granting more flexibility when processing data.
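A sketch of a multi-table asynchronous materialized view in StarRocks (table and column names are hypothetical):

```sql
-- Pre-aggregate a join of two base tables; StarRocks refreshes the view
-- asynchronously and can transparently rewrite matching queries to use it.
CREATE MATERIALIZED VIEW daily_revenue
REFRESH ASYNC
AS
SELECT o.order_date, c.region, SUM(o.amount) AS revenue
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
GROUP BY o.order_date, c.region;
```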

 

In short, StarRocks offers a unified, feature-rich analytics platform with the best performance numbers amongst analytics databases available to the industry today. Its community and user base continues to grow and includes some of the largest enterprises in the world. Best of all, it’s free to use.

StarRocks’ performance and capabilities make it an attractive alternative to Snowflake, and a viable option for reducing your analytics costs. We’ll dig deeper into this in the next section where we will take a look at the price and performance comparisons between StarRocks and Snowflake.

 

Price-Performance Comparison Between StarRocks and Snowflake

Thanks to the flexibility it offers in choosing your own hardware, StarRocks can deliver much better price-performance results compared to Snowflake.

As covered earlier, Snowflake pricing is based on compute, storage, and cloud services usage, and compute costs are usually the largest part of the bill. So, we'll focus on compute costs in this comparison.

Since individual use cases can differ, it wouldn't be as helpful to examine any one specific use case in this comparison. Instead, we'll use the standard TPC-H and TPC-DS benchmarks to compare the performance of the two products. This will make it easier to reproduce and validate the test results.

In the following benchmark testing, we used Snowflake Standard Edition on its XL Virtual Warehouse. To run StarRocks, we used 16 instances of AWS EC2 r5.2xlarge. The total number of CPU cores here is 128. The data volume we used in this test was 1TB.

 

Table 1: TPC-DS 1T Benchmark (in ms, shorter is better)

 

| SQL | AWS-SR(2.4.0) | AWS-SF(XL) | AWS-SR/AWS-SF |
| --- | --- | --- | --- |
| tpcds_1t.query01 | 807 | 1,751 | 0.46 |
| tpcds_1t.query02 | 3,137 | 3,827 | 0.82 |
| tpcds_1t.query03 | 435 | 1,442 | 0.3 |
| tpcds_1t.query04 | 35,751 | 44,763 | 0.8 |
| tpcds_1t.query05 | 2,084 | 3,249 | 0.64 |
| tpcds_1t.query06 | 367 | 2,033 | 0.18 |
| tpcds_1t.query07 | 693 | 2,172 | 0.32 |
| tpcds_1t.query08 | 278 | 1,847 | 0.15 |
| tpcds_1t.query09 | 4,039 | 12,427 | 0.33 |
| tpcds_1t.query10 | 402 | 1,318 | 0.31 |
| tpcds_1t.query11 | 23,390 | 21,858 | 1.07 |
| tpcds_1t.query12 | 151 | 702 | 0.22 |
| tpcds_1t.query13 | 409 | 3,124 | 0.13 |
| tpcds_1t.query14 | 24,011 | 19,979 | 1.2 |
| tpcds_1t.query15 | 458 | 2,454 | 0.19 |
| tpcds_1t.query16 | 1,499 | 1,369 | 1.09 |
| tpcds_1t.query17 | 944 | 3,809 | 0.25 |
| tpcds_1t.query18 | 859 | 2,656 | 0.32 |
| tpcds_1t.query19 | 261 | 2,226 | 0.12 |
| tpcds_1t.query20 | 159 | 732 | 0.22 |
| tpcds_1t.query21 | 117 | 725 | 0.16 |
| tpcds_1t.query22 | 1,414 | 2,246 | 0.63 |
| tpcds_1t.query23 | 67,976 | 59,069 | 1.15 |
| tpcds_1t.query24 | 6,593 | 14,177 | 0.47 |
| tpcds_1t.query25 | 751 | 4,726 | 0.16 |
| tpcds_1t.query26 | 407 | 1,372 | 0.3 |
| tpcds_1t.query27 | 406 | 1,648 | 0.25 |
| tpcds_1t.query28 | 3,203 | 7,616 | 0.42 |
| tpcds_1t.query29 | 1,522 | 4,558 | 0.33 |
| tpcds_1t.query30 | 431 | 2,755 | 0.16 |
| tpcds_1t.query31 | 2,163 | 3,681 | 0.59 |
| tpcds_1t.query32 | 165 | 728 | 0.23 |
| tpcds_1t.query33 | 639 | 2,378 | 0.27 |
| tpcds_1t.query34 | 404 | 3,224 | 0.13 |
| tpcds_1t.query35 | 1,278 | 7,659 | 0.17 |
| tpcds_1t.query36 | 467 | 1,548 | 0.3 |
| tpcds_1t.query37 | 284 | 872 | 0.33 |
| tpcds_1t.query38 | 8,054 | 10,091 | 0.8 |
| tpcds_1t.query39 | 422 | 5,089 | 0.08 |
| tpcds_1t.query40 | 414 | 1,381 | 0.3 |
| tpcds_1t.query41 | 94 | 358 | 0.26 |
| tpcds_1t.query42 | 110 | 439 | 0.25 |
| tpcds_1t.query43 | 701 | 1,451 | 0.48 |
| tpcds_1t.query44 | 844 | 3,627 | 0.23 |
| tpcds_1t.query45 | 646 | 2,734 | 0.24 |
| tpcds_1t.query46 | 673 | 4,621 | 0.15 |
| tpcds_1t.query47 | 14,554 | 4,301 | 3.38 |
| tpcds_1t.query48 | 479 | 2,964 | 0.16 |
| tpcds_1t.query49 | 541 | 7,189 | 0.08 |
| tpcds_1t.query50 | 1,589 | 3,129 | 0.51 |
| tpcds_1t.query51 | 4,599 | 4,702 | 0.98 |
| tpcds_1t.query52 | 111 | 426 | 0.26 |
| tpcds_1t.query53 | 481 | 1,196 | 0.4 |
| tpcds_1t.query54 | 623 | 1,563 | 0.4 |
| tpcds_1t.query55 | 115 | 427 | 0.27 |
| tpcds_1t.query56 | 543 | 1,994 | 0.27 |
| tpcds_1t.query57 | 8,224 | 3,297 | 2.49 |
| tpcds_1t.query58 | 587 | 1,845 | 0.32 |
| tpcds_1t.query59 | 4,247 | 4,727 | 0.9 |
| tpcds_1t.query60 | 630 | 2,545 | 0.25 |
| tpcds_1t.query61 | 422 | 2,770 | 0.15 |
| tpcds_1t.query62 | 678 | 1,612 | 0.42 |
| tpcds_1t.query63 | 456 | 1,067 | 0.43 |
| tpcds_1t.query64 | 9,184 | 14,504 | 0.63 |
| tpcds_1t.query65 | 5,640 | 5,747 | 0.98 |
| tpcds_1t.query66 | 616 | 3,208 | 0.19 |
| tpcds_1t.query67 | 53,480 | 35,793 | 1.49 |
| tpcds_1t.query68 | 354 | 3,851 | 0.09 |
| tpcds_1t.query69 | 432 | 1,774 | 0.24 |
| tpcds_1t.query70 | 3,055 | 2,514 | 1.22 |
| tpcds_1t.query71 | 482 | 1,952 | 0.25 |
| tpcds_1t.query72 | 1,398 | 5,127 | 0.27 |
| tpcds_1t.query73 | 232 | 2,413 | 0.1 |
| tpcds_1t.query74 | 12,918 | 15,108 | 0.86 |
| tpcds_1t.query75 | 5,428 | 15,569 | 0.35 |
| tpcds_1t.query76 | 755 | 3,441 | 0.22 |
| tpcds_1t.query77 | 555 | 1,648 | 0.34 |
| tpcds_1t.query78 | 13,752 | 23,796 | 0.58 |
| tpcds_1t.query79 | 1,296 | 3,618 | 0.36 |
| tpcds_1t.query80 | 1,382 | 4,659 | 0.3 |
| tpcds_1t.query81 | 589 | 4,424 | 0.13 |
| tpcds_1t.query82 | 911 | 1,065 | 0.86 |
| tpcds_1t.query83 | 430 | 1,430 | 0.3 |
| tpcds_1t.query84 | 248 | 1,058 | 0.23 |
| tpcds_1t.query85 | 457 | 4,291 | 0.11 |
| tpcds_1t.query86 | 965 | 1,202 | 0.8 |
| tpcds_1t.query87 | 7,713 | 16,944 | 0.46 |
| tpcds_1t.query88 | 6,241 | 11,188 | 0.56 |
| tpcds_1t.query89 | 563 | 1,369 | 0.41 |
| tpcds_1t.query90 | 350 | 1,895 | 0.18 |
| tpcds_1t.query91 | 197 | 1,375 | 0.14 |
| tpcds_1t.query92 | 154 | 616 | 0.25 |
| tpcds_1t.query93 | 1,021 | 2,908 | 0.35 |
| tpcds_1t.query94 | 900 | 1,360 | 0.66 |
| tpcds_1t.query95 | 3,414 | 9,551 | 0.36 |
| tpcds_1t.query96 | 808 | 1,636 | 0.49 |
| tpcds_1t.query97 | 4,324 | 7,467 | 0.58 |
| tpcds_1t.query98 | 213 | 1,620 | 0.13 |
| tpcds_1t.query99 | 1,273 | 2,327 | 0.55 |
| Total | 380,921 | 546,713 | 0.46 |

 

Table 2: TPC-H 1T Benchmark (in ms, shorter is better)

 

| sql_name | AWS-SR(2.4.0) | AWS-SF(XL) | AWS-SR/AWS-SF |
| --- | --- | --- | --- |
| tpch_1t.q01 | 10,020 | 7,528 | 1.33 |
| tpch_1t.q02 | 769 | 3,658 | 0.21 |
| tpch_1t.q03 | 3,536 | 6,215 | 0.57 |
| tpch_1t.q04 | 1,595 | 5,364 | 0.30 |
| tpch_1t.q05 | 4,820 | 8,088 | 0.60 |
| tpch_1t.q06 | 197 | 664 | 0.30 |
| tpch_1t.q07 | 3,769 | 5,317 | 0.71 |
| tpch_1t.q08 | 3,730 | 7,976 | 0.47 |
| tpch_1t.q09 | 15,385 | 62,692 | 0.25 |
| tpch_1t.q10 | 2,349 | 7,143 | 0.33 |
| tpch_1t.q11 | 798 | 1,512 | 0.53 |
| tpch_1t.q12 | 1,204 | 3,840 | 0.31 |
| tpch_1t.q13 | 9,280 | 11,975 | 0.77 |
| tpch_1t.q14 | 732 | 1,423 | 0.51 |
| tpch_1t.q15 | 1,604 | 2,259 | 0.71 |
| tpch_1t.q16 | 1,046 | 3,224 | 0.32 |
| tpch_1t.q17 | 3,414 | 4,240 | 0.81 |
| tpch_1t.q18 | 7,784 | 24,594 | 0.32 |
| tpch_1t.q19 | 2,609 | 3,919 | 0.67 |
| tpch_1t.q20 | 946 | 3,862 | 0.24 |
| tpch_1t.q21 | 5,931 | 14,315 | 0.41 |
| tpch_1t.q22 | 1,831 | 4,870 | 0.38 |
| Total | 83,349 | 194,678 | 0.50 |

 

From these results, we can see that StarRocks delivers roughly 2x the performance of Snowflake on standard benchmark data sets.

 

Next, let's take the price of Snowflake's virtual warehouse and AWS EC2 into consideration. The Standard edition of Snowflake's XL virtual warehouse costs 16 credits per hour, and every credit costs 2 dollars. An AWS EC2 r5.2xlarge instance, however, costs 0.504 dollars per hour, and sixteen instances of this size were used to complete this test. Since StarRocks is roughly 2x faster than Snowflake, finishing the workload that StarRocks completes in one hour requires running Snowflake's XL warehouse for two hours.

 

($0.504/hour * 16 instances) / (16 credits/hour * $2/credit * 2 hours) = $8.064 / $64 = 12.6%

 

As we can see, the total hardware and software cost of StarRocks is 12.6% the cost of Snowflake. We should note that this comparison is limited in some aspects:

 

  • The storage costs and cloud service costs of Snowflake haven't been taken into account.

  • The elasticity of Snowflake’s virtual warehouse hasn’t been taken into account. Users may suspend the virtual warehouse when their work is finished in some scenarios. However, in most online cases the query workload is stable.

  • The operating costs of StarRocks haven't been taken into account.

 

Accounting for all of these points would add some cost to the StarRocks solution, but not much. It is still reasonable to estimate savings of up to 80% by adopting StarRocks compared to using Snowflake as your only data warehouse.

 

Detailed Comparison: StarRocks vs. Snowflake

Here's a breakdown of the major differences between StarRocks and Snowflake:

 
| Capability | StarRocks | Snowflake |
| --- | --- | --- |
| Open source | Yes | No |
| Deployment mode | On-premise and SaaS | SaaS |
| Separation of storage and compute | Yes | Yes |
| Isolated tenancy | Multi-tenant pooled resources | Multi-tenant pooled resources |
| Control vs. abstraction of compute | Configurable cluster size and hardware type | Configurable cluster size, no control over hardware |
| Elasticity | Cluster resize with customized node size, automatic scale-out | Cluster resize, no choice of node size, auto scale-out |
| High concurrency | A node can handle up to several thousand concurrent queries; adding more nodes supports more concurrent queries | 8 concurrent queries per warehouse by default; autoscaling up to 10 warehouses |
| Real-time support | Continuous streaming data is available in seconds; supports real-time updates | Continuous streaming data isn't available for a few minutes |
| Indexes | Sort key index, bitmap index, bloom filter index | "Search optimization service" indexes fields for accelerated point lookup queries, at additional cost |
| Materialized views | Synchronized with tables; supports query rewrites; works on multiple tables and aggregation | Synchronized with tables; supports query rewrites; only works on a single table |
| Stored procedures | No | Yes |
| Table-level data distribution | Partitioning divides a table into multiple segments called partitions; bucketing divides a partition into smaller, more manageable parts called tablets | Data is automatically divided into micro-partitions; pruning at the micro-partition level; table clustering (cluster keys) |
| Query latency | Sub-second level | Second level |
| Query federation | Hive, Iceberg, Hudi, Delta Lake | Hive, Delta Lake, Iceberg |

 

Using StarRocks for Enterprise

As an open source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on StarRocks' GitHub repository, and participate in discussions in the project's Slack channels.

But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.

CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer accompanied by enterprise-scale features and support.

When it comes to deployment, CelerData provides two options:

 

 

You can learn more about CelerData by visiting CelerData.Com.

When it comes to reducing your Snowflake costs, you have a lot of options. Traditional approaches to cost savings can only go so far. If you’re ready to really bring costs down (and boost performance at the same time), it’s time to take a serious look at StarRocks.