AI

Drawing From a Random Distribution in SQL | by Sami Abboud | Feb, 2024

[ad_1]

From a chance density perform to random samples

Photograph by Moritz Kindler on Unsplash

Tlisted below are totally different strategies for updating a reinforcement studying agent’s coverage at every iteration. Just a few weeks in the past we began experimenting with changing our present technique with a Bayesian inference step. A number of the knowledge workloads inside our agent are written in SQL that’s executed on GCP’s BigQuery engine. We use this stack as a result of it gives scalable computational capabilities, ML packages and an easy SQL interface.

The Bayesian inference step we regarded to implement makes use of one of many alternative parametrizations of a beta distribution. Which implies that we’ll want to have the ability to draw from a beta distribution in SQL. Whereas engaged on this, I found that drawing from the random distribution in SQL is a subject with only a few nicely documented examples. So I’m writing about it right here..

BigQuery doesn’t have a beta distribution. It doesn’t have the potential to attract from any random distribution. So my first instinct was to take the definition of the beta distribution, write it in SQL, set the parameters utilizing a CTA, draw a random quantity between 0 and 1 and compute the worth of that perform.

Nevertheless it’s 2024, so I requested ChatGPT how it might do it:

Me: “​How do you create random attracts from a beta distribution in BigQuery?

ChatGPT:

WITH beta_draws AS (
SELECT
(alpha — 1) / (alpha + beta — 2) AS a,
(beta — 1) / (alpha + beta — 2) AS b
FROM
(SELECT 2 AS alpha, 5 AS beta) — Set your required alpha and beta values
)
SELECT
(a * POW(RAND(), alpha - 1) * POW(1 - RAND(), beta - 1) + b) AS beta_draw
FROM
beta_draws,
UNNEST(GENERATE_ARRAY(1, 1000)) AS rand_num

Me pondering to myself: Proper, in order that clearly gained’t work.

Do you see the issue within the code? ChatGPT attracts two totally different x values for the presumed beta distribution PDF. I fastened this, cleaned up the question a little bit and sampled 1,000 values. And right here’s the SQL code for doing that:

WITH raw_data AS (
SELECT
‘user_a’ AS contact_id
, 0.674 AS chance
, 72 AS lifetime_messages_received
), parameters AS (
SELECT
contact_id
, chance*lifetime_messages_received AS alpha
, (1.0 — chance)*lifetime_messages_received AS beta
, RAND() AS x
FROM raw_data
CROSS JOIN UNNEST(GENERATE_ARRAY(1, 1000)) AS draw_id
)
SELECT
contact_id
, ARRAY_AGG(POW(x, alpha — 1.0) * POW(1.0 — x, beta — 1)) AS beta_x
FROM parameters
GROUP BY contact_id

Thanks all, that’s a wrap 🎁 See you within the subsequent put up!

WRONG! 🔴

Let’s take a trusted implementation of drawing from a beta distribution utilizing the identical parameters and examine the outcomes. I’ve used SciPy’s beta.rvs() in Python and listed below are two 100-bin histograms that can enable evaluating the 2 drawn distributions.

from scipy.stats import beta

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72

scipy_beta_draw = beta.rvs(alpha_param, beta_param, dimension=1000)

Left: Naive draws using BigQuery. Right: Draws using SciPy’s beta.rvs()
(Left): Naive attracts utilizing BigQuery. (Proper): Attracts utilizing SciPy’s beta.rvs()

Properly, it doesn’t take a magnifying glass to appreciate that the distributions are totally different. I went again the beta distribution definition and realized that it is likely to be as a result of the beta distribution additionally has a scaling fixed which is determined by the gamma function that I didn’t embody within the calculation 🤦.

Downside: the gamma perform doesn’t have a closed-form expression, and BigQuery doesn’t present an implementation that approximates it. So at this level I made a decision to modify to Python, a language that I’m extra accustomed to and can make my experimentation extra environment friendly. The pondering was that if I nail it down in Python, I’ll be capable of translate it to SQL. I might nonetheless want some method to approximate a gamma perform, however one step at a time.

Let’s implement a handbook draw from a beta distribution in Python, however now with the right fixed utilizing SciPy’s gamma perform:

import numpy as np
from scipy.particular import gamma
from scipy.stats import uniform

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72

fixed = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
scipy_manual_beta_draw = np.array([
constant*pow(x, alpha_param-1)*pow(1-x, beta_param-1)
for x in uniform.rvs(size=1000)
])

Let’s look at the distribution utilizing a 100-bin histogram once more:

Naive attracts utilizing Python

The very first thing we discover is that the size is now totally different, however the distribution nonetheless seems just like the one drawn in BigQuery.

… one thing is fallacious… it’s time for a brief stroll to assume 🚶

After a brief stroll:

What does drawing from a random distribution really imply? What I’ve applied to date is randomly sampling from the beta chance density perform (PDF) and it wasn’t working.

So I needed to dig up some statistics courses.

Listed below are a few good refreshers on:

Briefly, the conclusion is that drawing from a random variable really means sampling from the inverse cumulative distribution perform (CDF), not from the chance density perform (PDF) like I used to be doing to date.

In fact 🤦. My probability professor, who I simply realized had handed away from sickness in 2020, would have inspired me to “assessment the fundamentals” at this level..

Okay. Let’s revisit the Python code, now drawing samples from the inverse CDF (which can be referred to as the quantile function) of our beta distribution, and examine it to the distribution drawn utilizing SciPy’s beta.rvs():

import numpy as np
from scipy.particular import gamma
from scipy.stats import uniform, beta

alpha_param = 0.674 * 72
beta_param = (1–0.674) * 72
n_draws = 1000

# Use SciPy RVS for comparability
scipy_beta_draw = beta.rvs(alpha_param, beta_param, dimension=n_draws)

# Guide beta draw with the assistance of the SciPy Gamma perform

# We begin with a discrete analogue of the Beta PDF we want to draw from.
# That is simply sampling from the PDF at fastened intervals however do take a look at
# this assessment for a extra in-depth remedy of the topic:
# https://jsdajournal.springeropen.com/articles/10.1186/s40488-015-0028-6

# Set the decision for producing the discrete PDF
n_samples = 1000

# The beta distribution is supported on the vary [0, 1], so we set the
# pdf min and max parameters accordingly
pdf_min = 0.0
pdf_max = 1.0

x_span = np.linspace(pdf_min, pdf_max, n_samples)
fixed = gamma(alpha_param + beta_param) / (gamma(alpha_param) * gamma(beta_param))
beta_pdf = np.array([
constant * pow(x, alpha_param — 1) * pow(1 — x, beta_param — 1)
for x in x_span
])

# Utilizing the discrete Beta PDF, we now compute a discrete Beta CDF.
# To do this, we combine the PDF. For every level x, we sum the PDF till
# that time and a number of with the width of every pattern.
freq = 1.0 / n_samples
beta_cdf = beta_pdf.cumsum() * freq

def inv(cdf, q):
“””Return inverse CDF for worth q utilizing the quantile perform”””
return x_span[np.argmin(cdf < q)]

# Lastly, we will now draw n_draws from the discrete inverse of CDF, aka
# generate random samples from a beta distribution
manual_beta_draw = np.array([
inv(beta_cdf, x)
for x in uniform.rvs(size=n_draws)
])

*phew* this seems significantly better:

An overlay of two histograms evaluating a 1,000 attracts utilizing SciPy’s beta.rvs() and a handbook draw

Now that we’ve acquired drawing samples from a random variable straight, it’s time to maneuver again to SQL. For the sake of simplicity, and since BigQuery doesn’t readily include an implementation of a Gamma function¹ I’m going to attract from the logistic distribution (with parameters a=0 and b=1).

 — The next 3 parameters must be adjusted based mostly on the assist of the
— PDF of the distribution you want to draw from. This values are set for a logistic
— distribution with a=0 and b=1

DECLARE pdf_min INT64 DEFAULT -10;
DECLARE pdf_max INT64 DEFAULT 10;
DECLARE n_samples INT64 DEFAULT 5000;
DECLARE sampling_step FLOAT64 DEFAULT (pdf_max — pdf_min) / n_samples;

— The variety of random attracts you want to carry out
DECLARE n_draws INT64 DEFAULT 1000;

WITH pdf AS (

— The discrete sampling of the logistic distribution PDF

SELECT
x
, exp(-x) / pow(1 + exp(-x), 2) AS y — a=0, b=1
FROM UNNEST(GENERATE_ARRAY(pdf_min, pdf_max, sampling_step)) AS x
), cdf AS (

— The discrete CDF

SELECT
x
, SUM(y)
OVER (
ORDER BY x
) * (1.0 / n_samples) AS y
FROM pdf
), random_draws AS (

— Random attracts within the vary of [0, max(cdf)]

SELECT
RAND() * (SELECT MAX(y) FROM cdf) as q
, draw_id
FROM UNNEST(GENERATE_ARRAY(1, n_draws)) AS draw_id
)

— Calculate the inverse CDF per draw utilizing the quantile perform by producing
— and array of the discrete assist of the distribution and returning the worth
— of the index simply earlier than the randomly generated quantity is bigger than the CDF

SELECT
ARRAY_AGG(x ORDER BY x)[OFFSET(SUM(CAST(y < q AS INT64)))] AS x
FROM random_draws
JOIN cdf
ON TRUE
GROUP BY draw_id;

Let’s now examine the distributions of the three sampling strategies:

  1. SciPy’s logistic.rvs()
  2. Manually sampling the logistic distribution PDF in Python and drawing a random pattern as per Step 2 above
  3. Doing the identical in SQL
An overlay of three histograms evaluating 1,000 attracts utilizing SciPy’s beta.rvs(), a handbook attract Python and a handbook attract SQL

This seems like a hit to me! 💪

This SQL code above samples from the logistic distribution, however it ought to work on any distribution the place you’ll be able to get a discrete illustration of the PDF by sampling it at constant intervals!

[ad_2]

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Back to top button