Snowflake PERCENTILE: How to Calculate Percentiles
Percentile functions in Snowflake, specifically PERCENTILE_CONT
and PERCENTILE_DISC
, are used to find the value at a specified percentile in your data. They help in understanding the distribution of data points by providing a value below which a given percentage of observations fall.
What They Do
- PERCENTILE_CONT: This function calculates the percentile based on a continuous distribution of your data. If the exact percentile falls between two data points, it interpolates (or estimates) the value, providing a smooth and precise measurement.
- PERCENTILE_DISC: In contrast, this function calculates the percentile using a discrete distribution. It picks the closest actual data point without estimating between values, giving you the exact data point that corresponds to or surpasses the percentile.
Syntax
Both functions are used similarly but with a key difference in how they calculate the percentile:
_10-- For continuous interpolation_10PERCENTILE_CONT(<percentile>) WITHIN GROUP (ORDER BY <order_by_expr>) [OVER ([PARTITION BY <expr3>])]_10_10-- For discrete values_10PERCENTILE_DISC(<percentile>) WITHIN GROUP (ORDER BY <order_by_expr>) [OVER ([PARTITION BY <expr3>])]
where:
<percentile>
: The percentile to find (between 0.0 and 1.0). For example, 0.9 for the 90th percentile.<order_by_expr>
: The column by which to order the data. This column is also where the percentile value is picked from.<expr3>
: (Optional) Used to divide data into partitions, treating each as a separate group for percentile calculations.
Key concepts
- 📈 Interpolation vs. Exact Match: Choose
PERCENTILE_CONT
for a smoothed estimate orPERCENTILE_DISC
for the exact data point matching the percentile. - 🔢 Data Ordering: Both functions require an order specification because percentiles are inherently dependent on the ranking of data.
- ❌ No DISTINCT Support: These functions do not support the DISTINCT keyword.
- 🔄 Grouping Optional: Can be used with or without grouping. Without PARTITION BY, the entire data set is considered a single group.
Examples
Finding the 90th percentile with PERCENTILE_CONT and PERCENTILE_DISC
Imagine you have a table student_scores that lists students’ scores on a math test. You want to find the 90th percentile score to understand how the top 10% of students performed.
Input
Table: student_scores
student_id | name | math_score |
---|---|---|
1 | John Doe | 88 |
2 | Jane Smith | 92 |
3 | Alice Johnson | 85 |
4 | Dave Wilson | 90 |
Snowflake SQL Query
_10-- Continuous percentile calculation_10SELECT PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY math_score) _10FROM student_scores;_10_10-- Discrete percentile calculation_10SELECT PERCENTILE_DISC(0.9) WITHIN GROUP (ORDER BY math_score)_10FROM student_scores;
Output
PERCENTILE_CONT
: 91.4
PERCENTILE_DISC
: 92
Explanation
- PERCENTILE_CONT(0.9): This function calculates the 90th percentile using continuous distribution, which means it can interpolate between existing data points if the percentile doesn’t exactly match an existing value. The 90th percentile falls between the scores of 90 and 92, which represent the 75th and 100th percentiles, respectively. Since the 90th percentile is 15% of the way between these scores, the interpolation calculates: 90 + (0.9-0.75) * [(92-90)/1.0-0.75] = 91.4
- PERCENTILE_DISC(0.9): This function chooses the smallest value that is at least the 90th percentile. Since 90 (3rd rank) is the 75th percentile and 92 is the next score and also the maximum score here, 92 is chosen directly by PERCENTILE_DISC for the 90th percentile.