Mastering Snowflake: Advanced Pricing, Policies, Editions, and Data Loading Techniques
Snowflake is a cloud-based data warehousing platform designed to store and analyze large volumes of data efficiently and cost-effectively. It addresses the complexity and limitations of traditional data warehouses by operating entirely in the cloud: where traditional data warehouses rely on on-premises infrastructure, Snowflake's cloud-native model gives it a competitive edge. This means you can store, manage, and analyze your data without costly hardware or complicated setups.
Virtual warehouse sizes
In Snowflake, virtual warehouses are a fundamental component of the system, and understanding the concept of virtual warehouse sizes is crucial for optimizing query performance and resource allocation. Virtual warehouses determine the computing power available for executing queries and tasks within Snowflake’s cloud-based data warehousing environment.
- Choosing the right size: Snowflake offers a range of virtual warehouse sizes, from X-Small to 4X-Large, each with varying levels of CPU and memory resources. Selecting the appropriate virtual warehouse size is essential for ensuring that your queries run efficiently. Smaller virtual warehouses may be suitable for ad-hoc or less resource-intensive tasks, while larger ones are designed for handling complex analytical workloads with high concurrency.
- Scaling on Demand: One of the key benefits of Snowflake is its ability to dynamically scale virtual warehouses up or down based on workload requirements. You can easily resize a virtual warehouse to allocate more computing resources during a spike in query load and scale it back down during periods of lower activity (see the sketch after this list). This scalability ensures that you're not overpaying for resources you don't need.
- Concurrency Control: The size of your virtual warehouse impacts the number of concurrent queries it can handle effectively. Larger virtual warehouses can support more concurrent queries without performance degradation. Understanding the concurrency limitations of your chosen virtual warehouse size is important for managing user workloads effectively.
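To make this concrete, here is a minimal sketch of resizing a warehouse on demand with SQL; the warehouse name is hypothetical:

```sql
-- Scale up ahead of a heavy analytical workload (warehouse name is hypothetical)
ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'LARGE';

-- Scale back down once the spike has passed
ALTER WAREHOUSE ANALYTICS_WH SET WAREHOUSE_SIZE = 'XSMALL';
```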
What is Multi-Clustering?
Multi-clustering is a powerful feature in Snowflake designed to optimize the organization and storage of data within the data warehousing platform. It enables users to efficiently manage and query large datasets by physically organizing data into multiple clusters based on different columns, which in turn enhances query performance and reduces costs.
- Column-Based Clustering: Multi-clustering involves organizing data based on the values in specific columns. Instead of a single monolithic organization, data is physically stored in multiple clusters, each corresponding to a different set of columns. This approach is particularly beneficial when dealing with large and complex datasets.
- Improved Query Performance: By physically grouping data based on columns commonly used in queries, multi-clustering helps minimize the amount of data scanned when executing queries. When a query involves filtering or sorting by the columns associated with a particular cluster, Snowflake can leverage the cluster to efficiently access only the relevant data, resulting in faster query performance.
- Dynamic Clustering: Snowflake's multi-clustering feature is dynamic, meaning you can add or remove clustering keys as needed without manually copying or moving data. This flexibility allows you to adapt your data organization to changing query patterns and optimize performance over time.
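As a brief sketch of what this looks like in practice (the table and column names below are hypothetical), a clustering key can be added or dropped with a single ALTER statement:

```sql
-- Cluster a large table on the columns most queries filter or sort by
ALTER TABLE SALES CLUSTER BY (REGION, ORDER_DATE);

-- Drop the clustering key later if query patterns change
ALTER TABLE SALES DROP CLUSTERING KEY;
```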
Scaling Policies in Snowflake
Scaling policies control when Snowflake starts and shuts down the clusters of a multi-cluster warehouse in response to query load. Snowflake offers two policies:
- Standard (the default): Minimizes queuing by favoring starting additional clusters over conserving credits. A cluster starts immediately when either a query is queued or the system detects that there are more queries than the currently running clusters can execute. Clusters shut down after 2 to 3 consecutive successful checks, performed at 1-minute intervals, which verify that the load on the least-loaded cluster could be redistributed to the other clusters.
- Economy: Conserves credits by favoring keeping running clusters fully loaded rather than starting additional clusters. This may result in queries being queued and taking longer to complete. A cluster starts only if the system estimates there is enough query load to keep it busy for at least 6 minutes, and it shuts down after 5 to 6 consecutive successful checks.
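The scaling policy is set when creating (or altering) a multi-cluster warehouse. Below is a minimal sketch; the warehouse name and cluster limits are hypothetical, and note that multi-cluster warehouses require Enterprise Edition or higher:

```sql
CREATE WAREHOUSE REPORTING_WH
  WAREHOUSE_SIZE    = 'MEDIUM'
  MIN_CLUSTER_COUNT = 1
  MAX_CLUSTER_COUNT = 4
  SCALING_POLICY    = 'ECONOMY'  -- or 'STANDARD' (the default)
  AUTO_SUSPEND      = 300        -- suspend after 300 seconds of inactivity
  AUTO_RESUME       = TRUE;
```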
How do we create warehouses?
In Snowflake, warehouses can be created in two ways: through the GUI or through SQL queries.
Through the GUI: After logging in to your Snowflake account, go to the 'Warehouses' section under the 'Admin' tab. Click the blue '+ Warehouse' button in the top-right corner of the screen. You will then be prompted to enter a warehouse name and choose the type and size of your warehouse. The layout is quite intuitive, which makes creating a warehouse easy.
Through SQL Queries: To create a warehouse using a SQL query, we first create a new SQL worksheet, which gives us the space to write and execute our SQL code. The following query creates a warehouse called 'Example Warehouse', with parameters such as the warehouse size and the maximum cluster count set inside the query. To run a query, click the play button at the top right of the screen or use a keyboard shortcut such as Cmd+Return (Mac) or Ctrl+Enter (Windows).
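Since the original figure is not reproduced here, the query below is a reconstruction; the exact size and cluster count in the original may have differed:

```sql
-- Identifiers containing spaces must be double-quoted in Snowflake
CREATE WAREHOUSE "Example Warehouse"
  WAREHOUSE_SIZE    = 'XSMALL'
  MAX_CLUSTER_COUNT = 2;
```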
Snowflake Roles
- AccountAdmin Role: The highest-level role in Snowflake, reserved for superusers who manage the entire Snowflake account. AccountAdmins can create and manage users, roles, warehouses, and databases.
- SecurityAdmin Role: Responsible for managing security aspects of Snowflake. SecurityAdmins can grant and revoke privileges, manage access controls, and create and manage roles. This role ensures the enforcement of data security policies.
- SystemAdmin Role: Focuses on object-level administrative tasks. SystemAdmins can create and manage warehouses, databases, and the other objects within them, and can grant access to the objects they own to other roles.
- Custom Roles: Snowflake allows you to create custom roles tailored to your organization’s specific needs. Custom roles can have a combination of privileges and can be granted to users or other roles. You can define these roles to align with your organization’s security and data governance policies.
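As a sketch of what a custom role might look like (all names below are hypothetical), a read-only analyst role could be created and granted like this:

```sql
CREATE ROLE ANALYST_READONLY;

-- Grant just enough privileges to query one schema
GRANT USAGE ON WAREHOUSE ANALYTICS_WH TO ROLE ANALYST_READONLY;
GRANT USAGE ON DATABASE SALES_DB TO ROLE ANALYST_READONLY;
GRANT USAGE ON SCHEMA SALES_DB.PUBLIC TO ROLE ANALYST_READONLY;
GRANT SELECT ON ALL TABLES IN SCHEMA SALES_DB.PUBLIC TO ROLE ANALYST_READONLY;

-- Assign the role to a user
GRANT ROLE ANALYST_READONLY TO USER JANE_DOE;
```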
Loading Data in Snowflake
Snowflake supports two kinds of loading: Bulk Loading and Continuous Loading. Bulk loading is the most commonly used type, and it utilizes warehouses. It consists of loading data from 'stages', and it is the right choice when we want to build a data warehouse and refine our data by applying various transformations to it. In contrast, Continuous Loading (handled by Snowpipe) is designed to load small volumes of data, and it happens automatically once data is added to a stage.
What are Stages? Stages in Snowflake are locations of data files from which data can be loaded. There are two kinds: External Stages and Internal Stages.
- External Stage: These stages reference files kept with external cloud providers, such as AWS (S3), GCP, and Azure. The stage itself is a database object created in a schema.
- Internal Stage: This is local storage maintained by Snowflake inside your Snowflake account.
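For reference, both kinds of stage are created with CREATE STAGE. The sketch below uses placeholder names, URL, and credentials; production setups typically use a storage integration instead of inline keys:

```sql
-- Internal stage: Snowflake-managed storage inside your account
CREATE STAGE my_internal_stage;

-- External stage: points at files in an S3 bucket (placeholder URL and keys)
CREATE STAGE my_s3_stage
  URL = 's3://my-bucket/data/'
  CREDENTIALS = (AWS_KEY_ID = '...' AWS_SECRET_KEY = '...');
```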
Now, let us take a brief look at how we can load data into Snowflake from an S3 bucket.
Step 1. We need to create a Database. We can do this by clicking the blue '+' icon under the 'Databases' tab, as shown in the screenshot from our Snowflake dashboard.
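If you prefer SQL over the GUI, the equivalent is a one-line statement (the database name here is hypothetical):

```sql
CREATE DATABASE DEMO_DB;
```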
Step 2. We create a Table called Loan Payment with the SQL query shown below. Remember that to run SQL queries, we need to be in a SQL worksheet. We also select the Database that we are working with as shown in the screenshot.
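The original screenshot is not reproduced here, so the column list below is an illustrative reconstruction based on a typical loan-payments CSV; adjust it to match your file's layout:

```sql
-- Select the database we are working with
USE DATABASE DEMO_DB;

CREATE TABLE LOAN_PAYMENT (
  LOAN_ID        STRING,
  LOAN_STATUS    STRING,
  PRINCIPAL      NUMBER,
  TERMS          NUMBER,
  EFFECTIVE_DATE DATE,
  DUE_DATE       DATE,
  PAID_OFF_TIME  STRING,
  PAST_DUE_DAYS  NUMBER,
  AGE            NUMBER,
  EDUCATION      STRING,
  GENDER         STRING
);
```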
It is always a good practice to check if our newly created table is empty. We do this by executing the following SQL query.
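A simple row count does the job (a minimal check, using the table from the sketch above):

```sql
SELECT COUNT(*) FROM LOAN_PAYMENT;  -- should return 0 before loading
```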
Step 3. Now, we need to execute the query that pulls the data from S3 into Snowflake. For this example, we are using a publicly accessible CSV file stored in S3. As the query output shows, we were able to populate our table with all 500 rows from the CSV file.
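The query we executed was a COPY INTO statement along these lines; the S3 URL below is a placeholder for the public bucket used in the original post:

```sql
COPY INTO LOAN_PAYMENT
  FROM 's3://bucket-name/Loan_payments_data.csv'
  FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);
```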
We can always look at what data we have populated in the table by using a simple SELECT statement.
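For example (using the table created in the steps above):

```sql
SELECT * FROM LOAN_PAYMENT LIMIT 10;
```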
That’s it! We have successfully populated data into Snowflake from an external S3 bucket.
Snowflake Editions
Snowflake is available in the following editions at the time of writing this blog post:
- Standard Edition
- Enterprise Edition
- Business Critical Edition
- Virtual Private Snowflake (VPS)
Standard Edition: This is Snowflake's introductory offering, providing full, unlimited access to all of Snowflake's standard features. It offers a balance of features, support, and cost.
Enterprise Edition: This edition provides all the features and services of the Standard Edition, but has additional features designed specifically for the needs of large-scale enterprises.
Business Critical Edition: This edition was previously known as Enterprise for Sensitive Data (ESD), and it offers even higher levels of data protection to support organizations with highly sensitive data, for example protected health information (PHI) that must comply with HIPAA and HITRUST CSF regulations.
It includes all the features and services of the Enterprise Edition, but with the addition of enhanced security and data protection. It also has database failover/failback support to maintain business continuity.
Virtual Private Snowflake (VPS): This edition offers the highest level of security for organizations with the most stringent requirements, for instance financial institutions and other large enterprises that collect, analyze, and share sensitive data.
It provides users with a separate Snowflake environment, isolated from all other Snowflake accounts.
Snowflake Pricing
Pricing in Snowflake depends on two major factors: Compute and Storage. For Compute, users are charged for active warehouses. Usage is metered in Snowflake credits, quoted per hour but billed by the second, with a minimum of 60 seconds each time a warehouse starts or resumes.
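As a rough worked example: an X-Small warehouse consumes 1 credit per hour, and consumption doubles with each size step (Small = 2, Medium = 4, and so on). A query that keeps an X-Small warehouse running for 10 minutes therefore consumes about 10/60 ≈ 0.17 credits, while a 45-second query is billed for the 60-second minimum, about 60/3600 ≈ 0.017 credits.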
From the Storage aspect, a monthly storage fee is charged to the user, calculated from the average storage used per month by the account. Further, users can choose between 'On-Demand' and 'Capacity' storage options: with the former, the user pays for actual usage month to month, whereas with the latter, the user commits to a set capacity with an upfront fee, typically at a discounted rate.