Computing percentiles in SQL
UNNEST
allows splatting arrays into rows, which can then be used in window functions like PERCENTILE_CONT
.SQL is a language that never stops giving, and I’ll probably never stop learning new things about it. I recently stumbled upon the following problem: how to compute the median value of a float array?
Given the following websites table:
id | name | durations |
---|---|---|
1 | example.com | {0.2, 0.4} |
2 | gautier.dev | {0.3, 0.2, 0.6} |
3 | wikipedia.org | {} |
Note: I’ll use Postgres syntax to represent arrays, with the rather unconventional curly brackets for people used to web languages. You can also use ARRAY[0.2, 0.4]
to declare arrays.
Is there a way to compute the p50 (median value) of the durations arrays in pure SQL for Postgres?
#PERCENTILE_CONT
PostgreSQL features a PERCENTILE_CONT
function that computes the percentile of a set of values. Unfortunately, it is a window function, meant to be used over numeric columns.
If duration was a numeric column, we could use the following query:
id | name | duration |
---|---|---|
1 | example.com | 0.6 |
2 | gautier.dev | 0.7 |
3 | wikipedia.org | null |
SELECT PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) AS p50
FROM websites
This would return:
p50 |
---|
0.6499999999999999 |
The null value is properly ignored, and rounding is still hard for computers in 2023.
#UNNEST
In our original problem we have arrays of floats. To compute the median value of each array, the solution is to use the UNNEST
function, which allows splatting/spreading arrays into rows.
SELECT id, name, UNNEST(durations) AS duration FROM sites
id | name | duration |
---|---|---|
1 | example.com | 0.2 |
1 | example.com | 0.4 |
2 | gautier.dev | 0.3 |
2 | gautier.dev | 0.2 |
2 | gautier.dev | 0.6 |
It’s worth noting that id is no longer unique and that empty arrays produce no rows.
We now have a virtual table on which we can use any window function:
SELECT id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) AS p50
FROM (
SELECT id, UNNEST(durations) AS duration FROM sites
) t
GROUP BY id
id | p50 |
---|---|
1 | 0.30000000000000004 |
2 | 0.3 |
And then we can LEFT JOIN
the websites table to retrieve other details:
SELECT websites.id, name, p50
FROM websites
LEFT JOIN (
SELECT id, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY duration) AS p50
FROM (SELECT id, UNNEST(durations) AS duration FROM sites) t
GROUP BY id
) t ON t.id = websites.id
id | name | p50 |
---|---|---|
1 | example.com | 0.30000000000000004 |
2 | gautier.dev | 0.3 |
3 | wikipedia.org | null |
Using LEFT JOIN
rather than (INNER) JOIN
allows us to keep all the websites, even those with an empty array.