Chapter 3

Probability & Statistics for Finance

Distributions, hypothesis testing, and regression analysis.

Quantitative finance is built on a foundation of probability and statistics. From modeling asset returns to estimating risk measures, statistical tools are indispensable for anyone building financial systems or analyzing market data.

Probability Fundamentals

Random Variables and Distributions

A random variable maps outcomes of a random process to numerical values. In finance, returns, prices, and trading volumes are all random variables.

Expected value (mean):

$$E[X] = \mu = \sum_{i} x_i P(x_i) \quad \text{(discrete)}$$

$$E[X] = \mu = \int_{-\infty}^{\infty} x f(x)\, dx \quad \text{(continuous)}$$

Variance measures dispersion around the mean:

$$\mathrm{Var}(X) = \sigma^2 = E[(X - \mu)^2] = E[X^2] - (E[X])^2$$

Standard deviation is $\sigma = \sqrt{\mathrm{Var}(X)}$.
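
As a minimal sketch of the discrete formulas above, the following computes the mean, variance, and standard deviation of a return from a small scenario table. The scenario returns and probabilities are hypothetical values chosen for illustration, and the function names are not from any library.

```python
import math

def expected_value(values, probs):
    """E[X] = sum_i x_i * P(x_i) for a discrete random variable."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var(X) = E[X^2] - (E[X])^2."""
    mean = expected_value(values, probs)
    second_moment = expected_value([x * x for x in values], probs)
    return second_moment - mean ** 2

# Hypothetical one-period return scenarios and their probabilities (sum to 1).
scenario_returns = [-0.10, 0.00, 0.05, 0.20]
scenario_probs = [0.10, 0.30, 0.40, 0.20]

mu = expected_value(scenario_returns, scenario_probs)
var = variance(scenario_returns, scenario_probs)
sigma = math.sqrt(var)
print(f"mean={mu:.4f} variance={var:.6f} std={sigma:.4f}")
```

The shortcut form E[X^2] - (E[X])^2 is convenient by hand, but it can lose precision when the mean is large relative to the spread; the centered form E[(X - mu)^2] is usually safer numerically.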

Key Distributions in Finance

Normal (Gaussian) Distribution:

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

The normal distribution is central to finance largely because of the Central Limit Theorem: the sum of many independent, finite-variance shocks is approximately normal. Log-returns are often assumed to be normally distributed.

Log-Normal Distribution: If $\ln(X)$ is normal, then $X$ is log-normal. Stock prices are often modeled as log-normal since they cannot be negative.
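
To illustrate why this modeling choice keeps prices positive, here is a small sketch that draws i.i.d. normal log-returns and compounds them into a price path. The function name, the drift and volatility values, and the starting price are all assumptions made for the example.

```python
import math
import random

def simulate_price_path(p0, mu, sigma, n_steps, seed=0):
    """Simulate prices whose per-step log-returns are i.i.d. Normal(mu, sigma)."""
    rng = random.Random(seed)
    prices = [p0]
    for _ in range(n_steps):
        log_return = rng.gauss(mu, sigma)
        # exp() of any real number is positive, so the price never crosses zero.
        prices.append(prices[-1] * math.exp(log_return))
    return prices

# Hypothetical daily parameters: 0.05% drift, 2% volatility, 250 trading days.
path = simulate_price_path(p0=100.0, mu=0.0005, sigma=0.02, n_steps=250)
print(f"min price on path: {min(path):.2f}")
```

Because each price is the previous price times an exponential, the terminal price is the starting price times the exponential of a normal sum, which is log-normal by definition.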

Student's t-Distribution: Has heavier tails than the normal distribution, better capturing the fat tails observed in financial returns:

$$f(x) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

where $\nu$ is the degrees of freedom.
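
The tail difference can be seen by evaluating both densities at a few points. The sketch below implements the t density from the formula above using `math.gamma` and compares it with the standard normal density from `statistics.NormalDist`. The choice of $\nu = 3$ and the evaluation points are arbitrary, and the comparison is not volatility-matched, since a t-distribution with $\nu = 3$ has variance 3 rather than 1.

```python
import math
from statistics import NormalDist

def student_t_pdf(x, nu):
    """Density of the standard Student's t-distribution with nu degrees of freedom."""
    coeff = math.gamma((nu + 1) / 2) / (math.sqrt(nu * math.pi) * math.gamma(nu / 2))
    return coeff * (1 + x * x / nu) ** (-(nu + 1) / 2)

std_normal = NormalDist(mu=0.0, sigma=1.0)

for x in (0.0, 2.0, 4.0):
    normal_density = std_normal.pdf(x)
    t_density = student_t_pdf(x, nu=3)
    print(f"x={x:.1f} normal={normal_density:.6f} t(3)={t_density:.6f}")
```

At x = 4 the normal density has decayed exponentially, while the t density decays only polynomially, which is what "heavier tails" means in practice.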

Statistical Moments and Financial Metrics

Skewness

Measures asymmetry of the distribution:

$$\text{Skewness} = E\left[\left(\frac{X - \mu}{\sigma}\right)^3\right]$$

  • Negative skew: Long left tail (crash risk)
  • Positive skew: Long right tail (upside potential)

Kurtosis

Measures tail heaviness:

$$\text{Kurtosis} = E\left[\left(\frac{X - \mu}{\sigma}\right)^4\right]$$

Excess kurtosis = Kurtosis - 3 (since normal distribution has kurtosis of 3).

Financial returns typically exhibit leptokurtosis (excess kurtosis > 0), meaning extreme events occur more often than a normal distribution predicts.
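
A short sketch of how these two moments might be estimated from data follows. It uses the population (divide-by-n) versions of the formulas above rather than bias-corrected estimators, and the return series is hypothetical, with one crash day included on purpose.

```python
import math

def standardized_moment(data, k):
    """Sample analogue of E[((X - mu) / sigma)^k], using population (1/n) moments."""
    n = len(data)
    mean = sum(data) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in data) / n)
    return sum(((x - mean) / std) ** k for x in data) / n

def skewness(data):
    return standardized_moment(data, 3)

def excess_kurtosis(data):
    return standardized_moment(data, 4) - 3.0

# Hypothetical daily returns; the -9% day stands in for a market crash.
daily_returns = [0.010, 0.012, -0.008, 0.005, -0.090, 0.007, 0.003, -0.004, 0.009, 0.006]

print(f"skewness={skewness(daily_returns):.2f}")
print(f"excess kurtosis={excess_kurtosis(daily_returns):.2f}")
```

The single large loss pulls skewness negative and pushes excess kurtosis above zero, matching the "long left tail" and "fat tail" descriptions above.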

Covariance and Correlation

Covariance measures how two variables move together:

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]$$

Correlation normalizes covariance to [-1, 1]:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}$$

The correlation matrix for multiple assets is crucial for portfolio construction:

$$\Sigma = \begin{pmatrix} 1 & \rho_{12} & \cdots \\ \rho_{21} & 1 & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$
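
The definitions above translate fairly directly into code. The following sketch builds a correlation matrix for three return series using population (1/n) moments. The asset names and return values are hypothetical.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def covariance(xs, ys):
    """Population covariance: average product of deviations from the means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

def correlation_matrix(series):
    """Pairwise correlations; the diagonal is 1 and the matrix is symmetric."""
    return [[correlation(a, b) for b in series] for a in series]

# Hypothetical monthly returns for three assets.
asset_returns = {
    "stock_a": [0.02, -0.01, 0.03, 0.01, -0.02, 0.04],
    "stock_b": [0.01, -0.02, 0.02, 0.02, -0.01, 0.03],
    "bond_c": [-0.005, 0.010, -0.002, 0.001, 0.008, -0.006],
}

corr = correlation_matrix(list(asset_returns.values()))
for name, row in zip(asset_returns, corr):
    print(name, [round(r, 3) for r in row])
```

Unlike covariance, the entries are unit-free, so a pair of low-volatility assets can be compared with a pair of high-volatility ones on the same scale.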

Hypothesis Testing

Framework

  1. State the null hypothesis $H_0$ and the alternative $H_1$
  2. Choose a significance level $\alpha$ (typically 0.05 or 0.01)
  3. Calculate the test statistic
  4. Compare it to a critical value or compute the p-value
  5. Reject or fail to reject $H_0$

Common Tests in Finance

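t-test for means:

$$t = \frac{\bar{x} - \mu_0}{s / \sqrt{n}}$$

where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $\mu_0$ is the mean under $H_0$.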
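
The five-step framework can be sketched with a one-sample t-test of whether a strategy's mean excess return differs from zero. The return data are hypothetical. Since the standard library has no t-distribution CDF, this sketch compares against a hard-coded two-sided 5% critical value for 9 degrees of freedom (2.262) instead of computing a p-value.

```python
import math

def t_statistic(sample, mu0):
    """t = (sample mean - mu0) / (s / sqrt(n)), with s the sample (n-1) std dev."""
    n = len(sample)
    xbar = sum(sample) / n
    s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
    return (xbar - mu0) / (s / math.sqrt(n))

# Step 1: H0: mean excess return = 0, H1: mean excess return != 0.
# Hypothetical monthly excess returns for a strategy (n = 10, so df = 9).
excess_returns = [0.020, 0.010, -0.010, 0.030, 0.015, 0.005, 0.020, -0.005, 0.025, 0.010]

# Step 2: significance level 0.05, two-sided.
# Assumption: critical value taken from a t-table for df = 9.
CRITICAL_VALUE_DF9 = 2.262

# Step 3: test statistic.
t_stat = t_statistic(excess_returns, mu0=0.0)

# Steps 4 and 5: compare and decide.
reject_h0 = abs(t_stat) > CRITICAL_VALUE_DF9
print(f"t={t_stat:.3f} reject H0: {reject_h0}")
```

For this sample the statistic is roughly 2.98, so the null is rejected at the 5% level. With real strategies, a single test like this says nothing about how many other strategies were tried first.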

F-test for comparing variances or in regression analysis.

Jarque-Bera test for normality:

$$JB = \frac{n}{6}\left(S^2 + \frac{(K-3)^2}{4}\right)$$

where $S$ is skewness and $K$ is kurtosis.
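
A possible implementation of the statistic is sketched below. Under the null of normality, JB is asymptotically chi-squared with 2 degrees of freedom, whose survival function is exp(-x/2), so an approximate p-value needs no external library. The approximation can be poor for small samples like the hypothetical one used here.

```python
import math

def jarque_bera(data):
    """Return (JB statistic, asymptotic p-value) using population (1/n) moments."""
    n = len(data)
    mean = sum(data) / n
    m2 = sum((x - mean) ** 2 for x in data) / n
    m3 = sum((x - mean) ** 3 for x in data) / n
    m4 = sum((x - mean) ** 4 for x in data) / n
    skew = m3 / m2 ** 1.5      # S in the formula
    kurt = m4 / m2 ** 2        # K in the formula (not excess kurtosis)
    jb = (n / 6.0) * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)  # chi-squared(2) survival function
    return jb, p_value

# Hypothetical daily returns with one crash day.
daily_returns = [0.010, 0.012, -0.008, 0.005, -0.090, 0.007, 0.003, -0.004, 0.009, 0.006]

jb_stat, jb_p = jarque_bera(daily_returns)
print(f"JB={jb_stat:.2f} p-value={jb_p:.5f}")
```

A small p-value is evidence against normality; here the crash day alone is enough to push both the skewness and kurtosis terms far from their normal values.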

Regression Analysis

Ordinary Least Squares (OLS)

Linear regression models the relationship between variables:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \epsilon$$

OLS minimizes the sum of squared residuals:

$$\min_{\beta} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$

In matrix form:

$$\hat{\beta} = (X^T X)^{-1} X^T y$$
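
For a single predictor plus an intercept, the matrix formula reduces to a 2x2 system that can be solved by hand. The sketch below does exactly that, writing out $X^T X$ and $X^T y$ explicitly. The function name and data are illustrative, and for more predictors a linear algebra library with a least-squares solver would normally be preferable to forming the inverse.

```python
def ols_simple(x, y):
    """Solve beta = (X^T X)^{-1} X^T y for X = [1, x] (intercept plus one predictor)."""
    n = len(x)
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(v * v for v in x)
    sum_xy = sum(a * b for a, b in zip(x, y))

    # X^T X = [[n, sum_x], [sum_x, sum_xx]] and X^T y = [sum_y, sum_xy].
    det = n * sum_xx - sum_x * sum_x
    if det == 0:
        raise ValueError("x has no variation; the slope is not identified")

    # Apply the 2x2 inverse (1/det) * [[sum_xx, -sum_x], [-sum_x, n]] to X^T y.
    beta0 = (sum_xx * sum_y - sum_x * sum_xy) / det
    beta1 = (n * sum_xy - sum_x * sum_y) / det
    return beta0, beta1

# Hypothetical data generated near y = 1 + 2x.
x_obs = [0.0, 1.0, 2.0, 3.0, 4.0]
y_obs = [1.1, 2.9, 5.2, 6.8, 9.0]

intercept, slope = ols_simple(x_obs, y_obs)
print(f"intercept={intercept:.3f} slope={slope:.3f}")
```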

Evaluation Metrics

R-squared ($R^2$): Proportion of variance explained:

$$R^2 = 1 - \frac{SS_{res}}{SS_{tot}}$$

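Adjusted R-squared: Penalizes additional predictors: $\bar{R}^2 = 1 - (1 - R^2)\frac{n - 1}{n - p - 1}$, where $n$ is the number of observations and $p$ is the number of predictors.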

Standard errors: Measure precision of coefficient estimates.
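
As an illustration of the first two metrics, the sketch below computes R-squared from actual and fitted values, and adjusted R-squared using the standard correction 1 - (1 - R^2)(n - 1)/(n - p - 1), where p is the number of predictors excluding the intercept. The data and function names are hypothetical.

```python
def r_squared(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot."""
    y_mean = sum(y) / len(y)
    ss_res = sum((a - f) ** 2 for a, f in zip(y, y_hat))
    ss_tot = sum((a - y_mean) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(y, y_hat, n_predictors):
    """Adjusted R^2 shrinks R^2 as predictors are added without improving fit."""
    n = len(y)
    r2 = r_squared(y, y_hat)
    return 1.0 - (1.0 - r2) * (n - 1) / (n - n_predictors - 1)

# Hypothetical observed values and fitted values from a one-predictor model.
observed = [1.0, 2.0, 3.0, 4.0]
fitted = [1.1, 1.9, 3.2, 3.8]

print(f"R^2={r_squared(observed, fitted):.3f}")
print(f"adjusted R^2={adjusted_r_squared(observed, fitted, n_predictors=1):.3f}")
```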

Financial Applications

  • CAPM regression: $R_i - R_f = \alpha + \beta(R_m - R_f) + \epsilon$
  • Factor models: Multiple factors explaining returns
  • Pairs trading: Cointegration analysis
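
The CAPM regression in the list above is a one-predictor OLS, so its slope is the covariance of asset and market excess returns divided by the variance of market excess returns. A minimal sketch follows; the function name is an assumption, and the asset series is constructed without noise so the recovered coefficients are known in advance.

```python
def capm_alpha_beta(asset_excess, market_excess):
    """Estimate alpha and beta in R_i - R_f = alpha + beta * (R_m - R_f) + eps."""
    n = len(asset_excess)
    mean_a = sum(asset_excess) / n
    mean_m = sum(market_excess) / n
    cov_am = sum((a - mean_a) * (m - mean_m)
                 for a, m in zip(asset_excess, market_excess)) / n
    var_m = sum((m - mean_m) ** 2 for m in market_excess) / n
    beta = cov_am / var_m
    alpha = mean_a - beta * mean_m
    return alpha, beta

# Hypothetical market excess returns (already net of the risk-free rate).
market_excess = [0.010, -0.020, 0.015, 0.005, -0.010, 0.020]
# Synthetic asset: alpha = 0.001 and beta = 1.5 by construction, no noise term.
asset_excess = [0.001 + 1.5 * m for m in market_excess]

alpha_hat, beta_hat = capm_alpha_beta(asset_excess, market_excess)
print(f"alpha={alpha_hat:.4f} beta={beta_hat:.2f}")
```

With real data the residual term is not zero, and the standard errors of alpha and beta matter as much as the point estimates.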

Time Series Analysis

Financial data is inherently sequential. Key concepts:

Autocorrelation: Correlation of a series with its lagged values:

$$\rho_k = \frac{E[(X_t - \mu)(X_{t-k} - \mu)]}{\sigma^2}$$
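
One common sample version of this formula is sketched below. It divides the lag-k autocovariance by n rather than n - k, which is a frequently used convention but an assumption here, and the return series is hypothetical.

```python
def autocorrelation(series, k):
    """Sample autocorrelation at lag k, using the full-sample mean and variance."""
    n = len(series)
    if not 0 <= k < n:
        raise ValueError("lag must satisfy 0 <= k < len(series)")
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    autocov = sum((series[t] - mean) * (series[t - k] - mean) for t in range(k, n)) / n
    return autocov / var

# Hypothetical daily returns.
daily_returns = [0.010, -0.004, 0.006, -0.012, 0.008, 0.003, -0.007, 0.011, -0.002, 0.005]

for lag in (1, 2, 3):
    print(f"lag {lag}: {autocorrelation(daily_returns, lag):.3f}")
```

Applying the same function to squared returns is a quick, informal check for volatility clustering.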

Stationarity: Statistical properties (mean, variance, autocovariance) don't change over time. Most financial price series are non-stationary in levels, while their returns are usually treated as approximately stationary.

Autoregressive (AR) Model:

$$X_t = c + \phi_1 X_{t-1} + \phi_2 X_{t-2} + \cdots + \epsilon_t$$
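
The simplest case, AR(1), keeps only the first lag. The sketch below simulates such a process with Gaussian noise; the parameter values are hypothetical. For an AR(1) with $|\phi_1| < 1$, the process is stationary with long-run mean $c / (1 - \phi_1)$, which the simulated path should hover around.

```python
import random

def simulate_ar1(c, phi, sigma, n_steps, x0=0.0, seed=0):
    """Simulate X_t = c + phi * X_{t-1} + eps_t with eps_t ~ Normal(0, sigma)."""
    rng = random.Random(seed)
    path = []
    prev = x0
    for _ in range(n_steps):
        current = c + phi * prev + rng.gauss(0.0, sigma)
        path.append(current)
        prev = current
    return path

# Hypothetical parameters: long-run mean is c / (1 - phi) = 0.002 / 0.2 = 0.01.
ar_path = simulate_ar1(c=0.002, phi=0.8, sigma=0.005, n_steps=5000)
sample_mean = sum(ar_path) / len(ar_path)
print(f"sample mean={sample_mean:.4f} long-run mean={0.002 / (1 - 0.8):.4f}")
```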

GARCH Models: Capture volatility clustering. The GARCH(1,1) specification is:

$$\sigma_t^2 = \omega + \alpha \epsilon_{t-1}^2 + \beta \sigma_{t-1}^2$$
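
The recursion above can be run forward once the parameters are known. The sketch below filters a conditional variance path from a residual series. It does not estimate the parameters, which is normally done by maximum likelihood, and the parameter values and residuals are hypothetical. The first variance is initialized at the unconditional level $\omega / (1 - \alpha - \beta)$, which requires $\alpha + \beta < 1$.

```python
def garch11_variance(residuals, omega, alpha, beta):
    """Conditional variances: sigma2[t] = omega + alpha * eps[t-1]^2 + beta * sigma2[t-1]."""
    if alpha + beta >= 1:
        raise ValueError("alpha + beta must be below 1 for a finite unconditional variance")
    sigma2 = [omega / (1.0 - alpha - beta)]  # start at the unconditional variance
    for eps in residuals[:-1]:
        sigma2.append(omega + alpha * eps ** 2 + beta * sigma2[-1])
    return sigma2

# Hypothetical daily return residuals with a large shock at index 3.
residuals = [0.002, -0.003, 0.001, -0.050, 0.004, -0.002, 0.003]
variances = garch11_variance(residuals, omega=1e-6, alpha=0.10, beta=0.85)

for t, v in enumerate(variances):
    print(f"t={t} conditional vol={v ** 0.5:.4f}")
```

The variance at index 4 jumps after the shock and then decays gradually rather than dropping back at once, which is the volatility clustering the model is meant to capture.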

Programming Implementation

Key statistical computations in finance software:

  • Efficient calculation of rolling statistics
  • Matrix operations for portfolio optimization
  • Numerical methods for maximum likelihood estimation
  • Bootstrap methods for confidence intervals
  • Monte Carlo simulation for complex distributions
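
Two of the items above are easy to sketch with the standard library alone: a rolling mean that updates a running sum instead of re-summing each window, and a percentile bootstrap confidence interval for the mean. The function names, resample count, and data are assumptions made for illustration. The bootstrap shown resamples observations independently, which ignores any serial dependence in the returns.

```python
import random

def rolling_mean(values, window):
    """Rolling mean in O(n): add the entering value, subtract the leaving one."""
    if window <= 0 or window > len(values):
        raise ValueError("window must be between 1 and len(values)")
    running = sum(values[:window])
    out = [running / window]
    for i in range(window, len(values)):
        running += values[i] - values[i - window]
        out.append(running / window)
    return out

def bootstrap_mean_ci(sample, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement and record the mean of each resample.
    means = sorted(sum(rng.choices(sample, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical monthly returns.
monthly_returns = [-0.020, 0.010, 0.030, 0.000, 0.015, -0.005, 0.020, 0.010]

print("3-month rolling mean:", [round(m, 4) for m in rolling_mean(monthly_returns, 3)])
ci_low, ci_high = bootstrap_mean_ci(monthly_returns)
print(f"95% bootstrap CI for the mean: [{ci_low:.4f}, {ci_high:.4f}]")
```

Over very long series the running sum can accumulate floating-point error, so production code sometimes recomputes the sum periodically or uses a compensated summation.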

ELI10 Explanation

Simple analogy for better understanding

Statistics in finance is like being a weather forecaster for money. Just like meteorologists look at past weather patterns to predict tomorrow's weather, financial analysts look at past stock prices and market data to understand what might happen next. They use math to answer questions like "What's the average return?" and "How risky is this investment?" Think of it like calculating your average test score to understand how you usually perform, but also noticing that sometimes you score much higher or lower than average. Finance people do the same thing with stock prices, trying to understand the typical behavior and the surprises.

Self-Examination

Q1.

Why do financial returns typically exhibit fat tails (leptokurtosis)? What are the implications for risk management?

Q2.

Explain the difference between correlation and covariance. Why is correlation preferred when comparing relationships between different assets?

Q3.

What assumptions underlie OLS regression? How might violations of these assumptions affect financial models?

Q4.

Describe the GARCH model and explain why it's useful for modeling financial volatility. What is volatility clustering?

Q5.

How would you test whether a trading strategy generates statistically significant returns? What pitfalls should you be aware of (e.g., multiple testing, survivorship bias)?