This article compares two data warehouses: Snowflake and BigQuery. Both are second-generation data warehouses, which improved scalability by separating storage from compute and simplified administration.
What are Snowflake and BigQuery?
Snowflake is a data warehousing solution offered as software as a service (SaaS). It was built for the cloud and can be hosted on the three major public clouds: Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). Snowflake completely separates compute from storage, which gives greater flexibility while reducing costs.
BigQuery, or Google BigQuery, is a data warehouse owned by Google and an important part of the Google Cloud Platform. It is a cloud-based big data analytics service for processing very large datasets.
Snowflake is based on a mix of traditional shared-disk and shared-nothing architectures. It makes the data available to all compute nodes in the platform by using a central repository for persisted data.
Snowflake supports ANSI SQL and is fully serverless. It uses massively parallel processing (MPP) to execute queries: each server stores a portion of the entire data set locally. For storage, data is organized into micro-partitions that are internally optimized and compressed into a columnar format. Snowflake automatically manages all aspects of data storage, such as file size, structure, compression, metadata, and statistics.
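To give an intuition for why that per-partition metadata matters, here is a toy sketch in Python (my own simplified illustration, not Snowflake's actual implementation) of how keeping min/max statistics per micro-partition lets a query skip entire partitions without reading their data:

```python
# Toy illustration of micro-partition pruning: each partition keeps
# min/max statistics in metadata, so a range query can skip partitions
# whose range cannot possibly match. Not Snowflake's real code.

from dataclasses import dataclass, field

@dataclass
class MicroPartition:
    """A chunk of a column plus the statistics kept in metadata."""
    values: list
    min: int = field(init=False)
    max: int = field(init=False)

    def __post_init__(self):
        self.min = min(self.values)
        self.max = max(self.values)

def scan(partitions, lo, hi):
    """Return rows in [lo, hi], pruning partitions via metadata only."""
    scanned = 0
    hits = []
    for p in partitions:
        if p.max < lo or p.min > hi:
            continue  # pruned: metadata check, no data read
        scanned += 1
        hits.extend(v for v in p.values if lo <= v <= hi)
    return hits, scanned

parts = [MicroPartition([1, 5, 9]),
         MicroPartition([20, 25, 30]),
         MicroPartition([41, 47])]
rows, scanned = scan(parts, 22, 28)
print(rows, scanned)  # -> [25] 1 : only the middle partition is read
```

Snowflake maintains these statistics automatically as data is loaded, which is why the user never has to define partitions by hand.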
BigQuery is similar to Snowflake in its support for ANSI SQL, but its architecture is different. BigQuery relies on a vast set of multi-tenant services built on Google infrastructure: Dremel, Colossus, Jupiter, and Borg.
Dremel is a large multi-tenant compute cluster used to execute SQL queries. Concretely, Dremel turns SQL queries into execution trees. Slots, the leaves of the tree in BigQuery, read data from storage and perform computation. Mixers, the branches of the tree, handle the aggregations. BigQuery compresses data into a columnar format and stores it in Colossus, Google's global storage system. Colossus handles data replication, recovery, and distributed management. Jupiter is Google's network, used to move data quickly from one location to another. Hardware resource allocation and orchestration in BigQuery are done with Borg, Google's precursor to Kubernetes.
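The slot-and-mixer idea can be sketched in a few lines of Python. This is a deliberately simplified model of a Dremel-style execution tree (my own illustration, not Google's code): leaf "slots" each scan one shard of storage and compute a partial aggregate, and a "mixer" merges the partial results on the way up:

```python
# Simplified model of a Dremel-style execution tree for SELECT SUM(x):
# leaf slots compute partial aggregates over storage shards, a mixer
# node merges them. Illustrative only, not Google's implementation.

def slot(shard):
    """Leaf: read one shard of storage and compute a partial SUM."""
    return sum(shard)

def mixer(partials):
    """Branch: aggregate the partial results from its children."""
    return sum(partials)

# Data split across storage shards, as Colossus would distribute it.
shards = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]

# Two-level tree: slots feed one root mixer.
partials = [slot(s) for s in shards]   # runs in parallel in Dremel
total = mixer(partials)
print(total)  # -> 45, same as summing all rows directly
```

In real BigQuery the tree can be many levels deep and the slots run in parallel across the cluster, but the shape of the computation is the same.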
According to a GigaOm benchmark, Snowflake consistently outperformed BigQuery: across 103 queries, Snowflake took 5,793 seconds, whereas BigQuery took 37,283 seconds (more than six times longer). That said, judging performance by query speed alone is reductive.
Still, GigaOm found that BigQuery surpassed Snowflake in 44 of the benchmark queries. In any case, both platforms remain under active development, with frequent new features and performance enhancements.
- Scalability: Snowflake allows users to scale their compute and storage resources up and down independently. While the platform is running, it automatically optimizes performance and monitors the workload in order to improve query times.
BigQuery, on the other hand, automatically provisions additional compute resources on an as-needed basis to manage large data workloads. It can process petabytes of data in a few minutes.
- Security: Both Snowflake and BigQuery offer security features to protect user data, which matters especially when enterprises fill their data warehouse with confidential or sensitive data.
Snowflake offers SOC 1 Type II and SOC 2 Type II compliance, as well as HIPAA and PCI DSS compliance. Depending on the edition, it offers additional features such as multi-factor authentication, support for OAuth and user SSO (single sign-on), IP address whitelisting and blacklisting, access control, and automatic data encryption.
BigQuery provides automatic encryption for data in transit and at rest. Google's Cloud Identity and Access Management (IAM) feature gives users the ability to fine-tune access to cloud resources. BigQuery is compliant with the HIPAA and PCI DSS standards.
The price of Snowflake depends on usage. As I said above, compute and storage are separated, so their costs are separated too. For storage, the pricing is $23 per terabyte per month with upfront payment (or $40 on-demand). For compute, the pricing is a little more complex: the cheapest, "Standard" tier starts at $2 per hour. The complete pricing guide has all the details, if you are interested.
Like Snowflake, BigQuery separates its prices for storage and compute. For storage, there is a flat rate of $20 per terabyte per month for active, uncompressed storage, or $10 for long-term storage, and the first ten gigabytes of storage each month are free. For compute, on-demand queries cost $5 per terabyte scanned, and the first terabyte of queries each month is free. I encourage you to read the full pricing guide for all the specifics.
For storage, BigQuery is cheaper than Snowflake. For compute, it is harder to estimate the cost of the warehouse, because Google charges per amount of data scanned rather than per hour of compute used.
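To make the two models concrete, here is a back-of-the-envelope estimator using only the list prices quoted above. The example workload (2 TB stored, 60 warehouse-hours vs. 11 TB scanned per month) is an assumption of mine; real bills also depend on compression, warehouse size, and any flat-rate or reserved-capacity plans:

```python
# Rough monthly cost estimator from the list prices quoted above.
# Snowflake: $23/TB-month storage (upfront) + $2/hour Standard compute.
# BigQuery:  $20/TB-month active storage + $5/TB scanned on demand,
#            with the first TB of queries free each month.
# The workload numbers below are illustrative assumptions only.

def snowflake_monthly_cost(storage_tb, warehouse_hours):
    return storage_tb * 23 + warehouse_hours * 2

def bigquery_monthly_cost(storage_tb, tb_scanned):
    billable_scan = max(0, tb_scanned - 1)  # first TB of queries is free
    return storage_tb * 20 + billable_scan * 5

# Example: 2 TB stored, ~60 warehouse-hours vs ~11 TB scanned per month.
print(snowflake_monthly_cost(2, 60))  # 2*23 + 60*2 = 166 dollars
print(bigquery_monthly_cost(2, 11))   # 2*20 + 10*5 = 90 dollars
```

The interesting point is that the two bills move with different variables: Snowflake's grows with how long your warehouses run, BigQuery's with how much data your queries scan, so the cheaper option depends entirely on the shape of your workload.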
Based on these comparisons, neither Snowflake nor BigQuery is strictly better than the other. The only important differences, the ones that will make you choose one over the other, are those related to your use case.
Very often, if you are already using GCP, BigQuery is the obvious choice: you don't need a new contract, and everything just works. If you are on Azure or AWS, Snowflake is a great alternative, but you can still consider BigQuery, since you are probably already using Google Workspace and already have Google accounts.
My main advice: since both warehouses are very close, pick one and stick with it until you are mature enough to form your own opinion on what you really need.