Join StarRocks Community on Slack

Connect on Slack
TABLE OF CONTENTS
     

    It's no secret. Snowflake has a lot to offer when it comes to empowering your analytics operations. It probably felt like everything you were looking for when you signed up, but that honeymoon period lasted about as long as it took for that first bill to arrive. Often, one of the first questions teams ask when they start using Snowflake isn't "how can we do more with Snowflake?" It's "how can we cut down or better control Snowflake's costs?".

    The common approaches to Snowflake cost control tend to focus on limiting the business’ usage or demand a lot of extra work from data warehouse developers. All of this results in only marginal cost savings. So how can you preserve all those great Snowflake analytics benefits while meaningfully reducing your costs?

    In this article, we'll go over the basics of Snowflake and its pricing, examine the most popular approaches to controlling Snowflake costs, and also share a new, alternative approach capable of saving companies using Snowflake up to 80% of what they're paying today.

     

    Snowflake: An Introduction

    Snowflake is one of the world’s leading cloud-based data warehouse companies. Founded in 2012 by three data warehousing experts, Snowflake now supports nearly 7,000 customers. Snowflake’s data warehouse offering is fully managed. This means users shouldn’t have to concern themselves with back-end work like server installation and maintenance. Snowflake is also fairly flexible in its deployment options, with its data warehouse instances able to be spun up on AWS, GCP, and Azure.

     

    The Advantages of Snowflake

    Snowflake boasts plenty of advantages. This is especially true for small businesses that can put all of their data into Snowflake for analysis. Here are some of Snowflake’s biggest advantages:

     

    Ease of Use

    Snowflake is celebrated for its simple and intuitive interface. You can get started with the service quickly, and you can automatically spin up and down compute clusters of any size or workload without impacting other jobs. With near-infinite, instant elasticity and concurrency, Snowflake promises to deliver the performance your organization needs.

     

    Outstanding Performance

    In many cases, the performance of Snowflake is outstanding, and it offers different virtual warehouse sizes that can meet a wide variety of performance requirements. With Snowflake, if you want better performance, you just need to upgrade your virtual warehouse. Snowflake also ensures their query processing is on par with other competitive offerings.

     

    Separation of Compute and Storage

    Snowflake’s multi-cluster, shared data architecture shines because compute and storage resources are kept physically separate. At the same time, they are logically part of a single, integrated and modern cloud-built data warehouse system. Snowflake’s architecture also includes built-in cloud services such as the transparent provisioning of resources and automatic metadata management and resilience.

     

    Seamless Data Sharing

    Snowflake’s architecture allows for smooth and safe data sharing among Snowflake’s users. Companies can also share data with consumers regardless of the consumers being Snowflake’s customers or not. A client account can be created for them directly from the user interface and managed by the provider.

     

    Cost Is a Major Issue With Snowflake

    Although Snowflake’s capabilities have been a tremendous boon to data analytics teams, the honeymoon period tends to last up until the first invoice. Customers are often surprised by their month end Snowflake bills. This is because, as users ingest larger volumes of data into Snowflake, queries become more and more complex, and the concurrency becomes higher too. To guarantee their desired query latency, users need to use a bigger virtual warehouse for these complex queries. At the same time, the high concurrency of the queries leads to spinning up more virtual warehouses. This gets expensive very quickly.

    As a result, most customers would want to find some way to reduce snowflake costs after using this product for a while. Let's look at some of the most popular approaches to reducing Snowflake costs.

     

    A Brief Look at Snowflake's Pricing Structure

    First, it's important to understand Snowflake's pricing structure and the options you have when working with them.

    Snowflake pricing normally falls into one of four levels:

     

    • Standard

    • Enterprise

    • Business Critical

    • Virtual Private Snowflake (VPS)

     

    Snowflake also offers the ability to pay for usage month-to-month and even pay for usage up front. While there can be no doubt Snowflake pricing is flexible, this does not necessarily equate to cheap. Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we focus on the compute costs for the remainder of this article.

     

    How To Reduce Snowflake Costs

    If you've spent anytime looking for solutions to Snowflake pricing online, you can summarize the most popular recommended approaches as follows:

     

    Use Resource Monitors To Analyze Your Usage

    Organizations tend to have a monthly threshold or quota that they want to spend for compute on Snowflake. It is recommended that you set up resource monitors for raising alerts and taking action upon reaching your monthly credits consumption threshold. In these instances, the business should abort all running queries and suspend its virtual warehouse immediately. Alternatively, you can complete all the running queries then suspend the virtual warehouse.

     

    Scale Your Virtual Warehouses Appropriately

    A good rule of thumb with Snowflake warehouses is to start small and increase the size as needed. Since the cost of a warehouse doubles each time you size up, starting smaller with longer running queries will allow for better cost management and monitoring.

    Also, by creating virtual warehouses for specific use cases, you’ll be able to customize them for each case. That has the potential to deliver a better ROI to your business.

     

    Use Query Control Parameters

    In Snowflake, both query processing and concurrency can be controlled by configuring several parameters. For example, STATEMENT_TIMEOUT_IN_SECONDS controls the amount of time, in seconds, after which a running SQL statement (query, DDL, DML, etc.) is canceled by the system. This parameter can help to prevent wasting money on some unexpectedly large queries. Making good use of these kinds of parameters can be very useful for controlling costs.

     

    Optimize Data Loading

    How you load data and how efficient your queries are, are two of the most important Snowflake cost control factors. By default, data is automatically partitioned in Snowflake according to its ingestion order. It can, however, be more efficient to sort the data in its cloud bucket before loading rather than sorting the data with an `ORDER BY` after loading.

    Compression also helps to improve load times, and not all options are made equal. A CSV file that has been Gzipped loads 2–3 times faster than Parquet or ORC.

    Even the move command matters: COPY INTO is more efficient than INSERT because it applies an optimized SQL bulk loading process.

     

    Using Materialized Views

    A materialized view is a pre-computed data set stored for later use. This feature is also an effective Snowflake cost optimization strategy when running frequent or complex queries on a subsection of large data sets. Materialized views are only executed against all the data in a table once. After executing against all the available data on its initial run, it only runs against new data added. It does this by utilizing INSERT and UPDATE commands.

    However, materialized views in Snowflake can only be utilized on data models that use simple SELECT statements rather than those that use joins or aggregation.

     

    The above mentioned strategies are common approaches you will find in most of the tech blogs. They help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally reduce costs by a limited amount. On the other hand, some of these approaches may impact how businesses are run by limiting or even suspending queries, which makes them unpopular choices for enterprises.

    To significantly reduce cloud analytics costs without negatively impacting business, we need to look at the problem from a different angle.

     

    Common Challenges in Reducing Snowflake Costs

    The approaches mentioned above are what you will most likely find when scouring tech blogs online. These suggestions can help users effectively manage the costs of Snowflake systems to avoid month-end surprises, but they generally yield marginal savings. Any amount of savings may still sound attractive, but they come with their own hidden costs. Most of the approaches will constrain business operations by limiting or even suspending queries, which makes them unpopular choices in most cases.

    In order to significantly reduce cloud analytics costs with Snowflake, you will often need to sacrifice business flexibility and scalability. Often, this tradeoff is not worth the savings. The reality is that Snowflake pricing isn't going to decrease much. Even if you're willing to give up significant business performance. If you're not happy with your Snowflake costs, you need to consider another option: finding a new solution.

    That's easier said than done. After all, you started using Snowflake because it met your original evaluation criteria. How can you find a solution that can easily meet your needs and is simple to transition to?

    At CelerData, we work with customers every day who are unsatisfied with their Snowflake costs and who are looking for a more cost-effective way to power their data analytics. We know what that journey looks like, what makes it difficult, and how to make it easy. You have your share of solutions to pick from, but working alongside our customers, we regularly find that the open source StarRocks project is the best way forward.

     

    How StarRocks Can Help

    Launched in 2020, StarRocks has quickly become a leading open source project in the analytics space. StarRocks delivers record-breaking query performance, for both streaming and batch-based historical data analytics, supported by a sophisticated MPP architecture.

    StarRocks’ architecture is mainly composed of two modules, Front End (FE) and Back End (BE). It doesn’t rely on any external components. You can easily manage a StarRocks cluster in your private environment. Managed SaaS services are also available through vendors like CelerData.

     

    StarRocks Architecture

    StarRocks Architecture Overview

     

    StarRocks divides a single table into various tablets, each of which is replicated and then evenly distributed among BE nodes. There are two types of division: partitioning and bucketing. Partitioning (or sharding) helps a table to decompose into multiple partitions. In bucketing, partitions can be subdivided as tablets into buckets based on the hash function of one or more columns.

     

    Sub-Second Query Response Time

    Thanks to its fully vectorized query engine and cost-based optimizer (CBO), StarRocks can provide a sub-second query experience for different analytical scenarios even when the data volume is at TB scale. As demonstrated in the benchmark results later in this article, StarRocks has more than 2x the performance advantage over Snowflake at a much lower cost.

     

    Real-Time Analytics

    Unlike Snowflake, which primarily focuses on batch-based analytics, StarRocks also has great real-time analytics performance. It can load streaming data from Kafka and Pulsar easily. In Snowflake, the continuous streaming data loaded from Snowpipe isn’t available for at least a few minutes. In StarRocks, the latest data can be queried immediately after it is ingested. This high ingestion performance guarantees efficient real-time ingestion. What’s more, StarRocks can also conduct upserts and deletes as well, which is very important for real-time analytics.

     

    Data Lakehouse Analytics

    StarRocks can directly query data stored in data lakes such as Hive, Iceberg, and Hudi. These data lakes are usually built on object storage (e.g. S3) or HDFS. With the statistical information provided by the data lake module, StarRocks can optimize the query plan using its own CBO. From the results of several standardized benchmark tests, the query performance on the data lake can be as much as 3-5x that of Trino.

     

    Ease of Use

    StarRocks is easy to scale up and scale out, and users can choose the hardware that best fits their needs. If a user wants to add more nodes to a StarRocks cluster, they only need to send a command to StarRocks, and everything else can be done automatically. Storage and concurrency capacity scales linearly with the cluster size, and StarRocks also supports the separation of storage and compute. Users can utilize Kubernetes to manage the cluster and make good use of the cloud’s elasticity with ease: scale out during peak usage and scale in to save costs when there is no workload.

     

    Advanced Techniques for Performance Improvements

    StarRocks leverages different technologies to improve its query performance. Indexes in StarRocks (e.g. short key index, bloom filter index, and bitmap index) can help to greatly reduce the data scan time. Materialized views are another useful tool which makes use of pre-compute to improve query performance. Unlike Snowflake, which can only support building materialized views on single tables, StarRocks can build materialized views on multi-tables and grants more flexibility when processing data.

     

    In short, StarRocks offers a unified, feature-rich analytics platform with the best performance numbers amongst analytics databases available to the industry today. Its community and user base continues to grow and includes some of the largest enterprises in the world. Best of all, it’s free to use.

    StarRocks’ performance and capabilities make it an attractive alternative to Snowflake, and a viable option for reducing your analytics costs. We’ll dig deeper into this in the next section where we will take a look at the price and performance comparisons between StarRocks and Snowflake.

     

    Price-Performance Comparison Between StarRocks and Snowflake

    Thanks to the flexibility it offers in choosing your own hardware, StarRocks can deliver much better price-performance results compared to Snowflake.

    Snowflake pricing is based on your compute, storage, and cloud services usage. Compute costs are calculated depending on how long your warehouses are running. Storage costs are calculated based on the amount of data you are storing, and cloud services depend on the Snowflake features you are using. Usually, compute costs are the most important part of your Snowflake bills. So, we’ll focus on compute costs in this comparison.

    Since individual use cases can differ, it wouldn’t be as helpful to examine any one specific use case in this comparison. Instead, we’ll use standard benchmarks, TPC-H and PTC-DS, to compare the performance of the two products. This will make it easier to reproduce and validate the test results.

    In the following benchmark testing, we used Snowflake Standard Edition on its XL Virtual Warehouse. To run StarRocks, we used 16 instances of AWS EC2 r5.2xlarge. The total number of CPU cores here is 128. The data volume we used in this test was 1TB.

     

    Table 1: TPC-DS 1T Benchmark (in ms, shorter is better)

     

    SQL
    AWS-SR(2.4.0)
    AWS-SF(XL)
    AWS-SR/AWS-SF
    tpcds_1t.query01
    807
    1,751
    0.46
    tpcds_1t.query02
    3,137
    3,827
    0.82
    tpcds_1t.query03
    435
    1,442
    0.3
    tpcds_1t.query04
    35,751
    44,763
    0.8
    tpcds_1t.query05
    2,084
    3,249
    0.64
    tpcds_1t.query06
    367
    2,033
    0.18
    tpcds_1t.query07
    693
    2,172
    0.32
    tpcds_1t.query08
    278
    1,847
    0.15
    tpcds_1t.query09
    4,039
    12,427
    0.33
    tpcds_1t.query10
    402
    1,318
    0.31
    tpcds_1t.query11
    23,390
    21,858
    1.07
    tpcds_1t.query12
    151
    702
    0.22
    tpcds_1t.query13
    409
    3,124
    0.13
    tpcds_1t.query14
    24,011
    19,979
    1.2
    tpcds_1t.query15
    458
    2,454
    0.19
    tpcds_1t.query16
    1,499
    1,369
    1.09
    tpcds_1t.query17
    944
    3,809
    0.25
    tpcds_1t.query18
    859
    2,656
    0.32
    tpcds_1t.query19
    261
    2,226
    0.12
    tpcds_1t.query20
    159
    732
    0.22
    tpcds_1t.query21
    117
    725
    0.16
    tpcds_1t.query22
    1,414
    2,246
    0.63
    tpcds_1t.query23
    67,976
    59,069
    1.15
    tpcds_1t.query24
    6,593
    14,177
    0.47
    tpcds_1t.query25
    751
    4,726
    0.16
    tpcds_1t.query26
    407
    1,372
    0.3
    tpcds_1t.query27
    406
    1,648
    0.25
    tpcds_1t.query28
    3,203
    7,616
    0.42
    tpcds_1t.query29
    1,522
    4,558
    0.33
    tpcds_1t.query30
    431
    2,755
    0.16
    tpcds_1t.query31
    2,163
    3,681
    0.59
    tpcds_1t.query32
    165
    728
    0.23
    tpcds_1t.query33
    639
    2,378
    0.27
    tpcds_1t.query34
    404
    3,224
    0.13
    tpcds_1t.query35
    1,278
    7,659
    0.17
    tpcds_1t.query36
    467
    1,548
    0.3
    tpcds_1t.query37
    284
    872
    0.33
    tpcds_1t.query38
    8,054
    10,091
    0.8
    tpcds_1t.query39
    422
    5,089
    0.08
    tpcds_1t.query40
    414
    1,381
    0.3
    tpcds_1t.query41
    94
    358
    0.26
    tpcds_1t.query42
    110
    439
    0.25
    tpcds_1t.query43
    701
    1,451
    0.48
    tpcds_1t.query44
    844
    3,627
    0.23
    tpcds_1t.query45
    646
    2,734
    0.24
    tpcds_1t.query46
    673
    4,621
    0.15
    tpcds_1t.query47
    14,554
    4,301
    3.38
    tpcds_1t.query48
    479
    2,964
    0.16
    tpcds_1t.query49
    541
    7,189
    0.08
    tpcds_1t.query50
    1,589
    3,129
    0.51
    tpcds_1t.query51
    4,599
    4,702
    0.98
    tpcds_1t.query52
    111
    426
    0.26
    tpcds_1t.query53
    481
    1,196
    0.4
    tpcds_1t.query54
    623
    1,563
    0.4
    tpcds_1t.query55
    115
    427
    0.27
    tpcds_1t.query56
    543
    1,994
    0.27
    tpcds_1t.query57
    8,224
    3,297
    2.49
    tpcds_1t.query58
    587
    1,845
    0.32
    tpcds_1t.query59
    4,247
    4,727
    0.9
    tpcds_1t.query60
    630
    2,545
    0.25
    tpcds_1t.query61
    422
    2,770
    0.15
    tpcds_1t.query62
    678
    1,612
    0.42
    tpcds_1t.query63
    456
    1,067
    0.43
    tpcds_1t.query64
    9,184
    14,504
    0.63
    tpcds_1t.query65
    5,640
    5,747
    0.98
    tpcds_1t.query66
    616
    3,208
    0.19
    tpcds_1t.query67
    53,480
    35,793
    1.49
    tpcds_1t.query68
    354
    3,851
    0.09
    tpcds_1t.query69
    432
    1,774
    0.24
    tpcds_1t.query70
    3,055
    2,514
    1.22
    tpcds_1t.query71
    482
    1,952
    0.25
    tpcds_1t.query72
    1,398
    5,127
    0.27
    tpcds_1t.query73
    232
    2,413
    0.1
    tpcds_1t.query74
    12,918
    15,108
    0.86
    tpcds_1t.query75
    5,428
    15,569
    0.35
    tpcds_1t.query76
    755
    3,441
    0.22
    tpcds_1t.query77
    555
    1,648
    0.34
    tpcds_1t.query78
    13,752
    23,796
    0.58
    tpcds_1t.query79
    1,296
    3,618
    0.36
    tpcds_1t.query80
    1,382
    4,659
    0.3
    tpcds_1t.query81
    589
    4,424
    0.13
    tpcds_1t.query82
    911
    1,065
    0.86
    tpcds_1t.query83
    430
    1,430
    0.3
    tpcds_1t.query84
    248
    1,058
    0.23
    tpcds_1t.query85
    457
    4,291
    0.11
    tpcds_1t.query86
    965
    1,202
    0.8
    tpcds_1t.query87
    7,713
    16,944
    0.46
    tpcds_1t.query88
    6,241
    11,188
    0.56
    tpcds_1t.query89
    563
    1,369
    0.41
    tpcds_1t.query90
    350
    1,895
    0.18
    tpcds_1t.query91
    197
    1,375
    0.14
    tpcds_1t.query92
    154
    616
    0.25
    tpcds_1t.query93
    1,021
    2,908
    0.35
    tpcds_1t.query94
    900
    1,360
    0.66
    tpcds_1t.query95
    3,414
    9,551
    0.36
    tpcds_1t.query96
    808
    1,636
    0.49
    tpcds_1t.query97
    4,324
    7,467
    0.58
    tpcds_1t.query98
    213
    1,620
    0.13
    tpcds_1t.query99
    1,273
    2,327
    0.55
    Total
    380,921
    546,713
    0.46

     

    Table 2: TPC-H 1T Benchmark (in ms, shorter is better)

     

    sql_name
    AWS-SR(2.4.0)
    AWS-SF(XL)
    AWS-SR/AWS-SF
    tpch_1t.q01
    10020
    7528
    1.33
    tpch_1t.q02
    769
    3658
    0.21
    tpch_1t.q03
    3536
    6215
    0.57
    tpch_1t.q04
    1595
    5364
    0.30
    tpch_1t.q05
    4820
    8088
    0.60
    tpch_1t.q06
    197
    664
    0.30
    tpch_1t.q07
    3769
    5317
    0.71
    tpch_1t.q08
    3730
    7976
    0.47
    tpch_1t.q09
    15,385
    62,692
    0.25
    tpch_1t.q10
    2349
    7143
    0.33
    tpch_1t.q11
    798
    1512
    0.53
    tpch_1t.q12
    1204
    3840
    0.31
    tpch_1t.q13
    9280
    11975
    0.77
    tpch_1t.q14
    732
    1423
    0.51
    tpch_1t.q15
    1604
    2259
    0.71
    tpch_1t.q16
    1046
    3224
    0.32
    tpch_1t.q17
    3414
    4240
    0.81
    tpch_1t.q18
    7,784
    24,594
    0.32
    tpch_1t.q19
    2609
    3919
    0.67
    tpch_1t.q20
    946
    3862
    0.24
    tpch_1t.q21
    5931
    14315
    0.41
    tpch_1t.q22
    1831
    4870
    0.38
    Total
    83349
    194678
    0.50

     

    From these results, we can see that StarRocks is able to deliver 2x the performance of Snowflake on standard benchmark data sets.

     

    Next, Let’s take the price of Snowflake’s virtual warehouse and AWS’ EC2 into consideration. The standard version of Snowflake’s XL virtual warehouse costs 16 credits per hour, and every credit costs 2 dollars. AWS EC2 r5.2xlarge, however, costs 0.504 dollars per hour. Sixteen EC2 instances of this size were used to complete this test. Since StarRocks performance is 2x faster than Snowflake, if we want to finish the same workload that StarRocks can finish in one hour, we need to use Snowflake with the same hardware for two hours.

     

    (0.504 * 16) / (16 * 2 *2) = 12.6%

     

    As we can see, the total hardware and software cost of StarRocks is 12.6% the cost of Snowflake. We should note that this comparison is limited in some aspects:

     

    • The storage costs and cloud service costs of Snowflake haven't been taken into account.

    • The elasticity of Snowflake’s virtual warehouse hasn’t been taken into account. Users may suspend the virtual warehouse when their work is finished in some scenarios. However, in most online cases the query workload is stable.

    • The operating costs of StarRocks haven't been taken into account.

     

    If one considers all of these points, it may add more costs to the StarRocks solution, but not by much. It is still reasonable to estimate that one could save up to 80% by adopting StarRocks compared to using Snowflake as their only data warehouse.

     

    Detailed Comparison: StarRocks vs. Snowflake

    Here's a breakdown of the major differences between StarRocks and Snowflake:

     
    Capability
    StarRocks
    Snowflake
    Open source
    Yes
    No
    Deployment mode
    On-premise and SaaS
    SaaS
    Separation of storage and compute
    Yes
    Yes
    Isolated tenancy
    Multi-tenant pooled resources
    Multi-tenant pooled resources
    Control vs abstraction of compute
    Configurable cluster size and hardware type
    Configurable cluster size, no control over hardware
    Elasticity
    Cluster resize with customized node size, automatically scaling out
    Cluster resize, no choice of node size, auto scaling out
    High concurrency
    A node can handle up to several thousand concurrent queries. Adding more nodes can support more concurrent queries.
    8 concurrent queries per warehouse by default, Autoscaling up to 10 warehouses.
    Real-time support
    Continuous streaming data is available in seconds. Supports real-time update.
    Continuous streaming data isn't available for a few minutes.
    Indexes
    - Sort key index
    - Bitmap index
    - Bloom filter index
    “Search optimization service” indexes fields for accelerated point lookup queries, at additional cost
    Materialized views
    - Synchronized with tables
    - Supports query rewrites
    - Work on multi-table and aggregation
    - Synchronized with tables
    - Supports query rewrites
    - only works on a single table
    Stored procedure
    No
    Yes
    Table-level data distribution
    Partitioning divides a table into multiple segments called partitions. Bucketing divides a partition into multiple more manageable parts called tablets
    Data is automatically divided into micro-partitions. Pruning at micro-partition level. Table clustering (cluster keys)
    Query latency
    Sub-second level
    Second level
    Query Federation
    Hive, Iceberg, Hudi, Delta Lake
    Hive, Delta Lake, Iceberg

     

    Using StarRocks for Enterprise

    As an open source solution, StarRocks can be downloaded for free from starrocks.io and used in production environments without any performance or capacity limit. We also encourage database developers and users to join the hundreds of contributors worldwide on StarRocks' GitHub repository, and participate in discussions in the project's Slack channels.

    But if you’re in need of a more enterprise-ready solution that can deliver the fantastic price-performance numbers StarRocks promises, you’ll want to check out CelerData.

    CelerData was founded by the original creators of StarRocks to help businesses enjoy the performance benefits StarRocks has to offer accompanied by enterprise-scale features and support.

    When it comes to deployment, CelerData provides two options:

     

     

    You can learn more about CelerData by visiting CelerData.Com.

    When it comes to reducing your Snowflake costs, you have a lot of options. Traditional approaches to cost savings can only go so far. If you’re ready to really bring costs down (and boost performance at the same time), it’s time to take a serious look at StarRocks.

    copy success