Asset Pricing using Extreme Liquidity Risk with Python (Part-1)

Post Outline

  • Introduction
  • Get Data
  • Calculate Cross-Sectional Extreme Liquidity Risk
  • Quick and Dirty Observations
  • Next Steps
  • References


One of the primary goals of quantitative investing is effectively managing tail risk. Failure to do so can result in crushing drawdowns or a total blowup of your fund/portfolio. Commonly known tools for estimating tail risk, e.g. Value-at-Risk, often underestimate the likelihood and magnitude of risk-off events. Furthermore, tail risk events are increasingly associated with liquidity events. 

Theory links the catalyst of systemic risk events to the funding difficulties of major financial intermediaries. For example, an unexpected default by a major institution would lead that firm's counterparties to reduce risk while they assess the fallout. Those counterparties are likely to reduce risk by selling assets and/or withdrawing funding resources from the market. This could lead to margin calls and further selling as the default works its way through the financial network, cascading into a negative feedback loop.

A good theoretical risk model will address the relationship between liquidity and tail risk. Ying Wu of the Stevens Institute of Technology School of Business may have discovered a framework that links these two concepts in a parsimonious and practical manner. His paper 'Asset Pricing with Extreme Liquidity Risk'[1] combines Amihud's[2] stock illiquidity metric with the Hill estimator for modeling tail distributions. He then constructs a normalized Extreme Liquidity Risk (ELR) metric and runs a simple linear regression for each stock to assess its sensitivity to the ELR. He finds that a long-short portfolio that buys the stocks with the highest sensitivity to ELR and shorts those with the lowest earns an empirically and economically significant return over the time period studied.

The Amihud stock illiquidity metric is a stock's daily absolute return divided by its dollar volume, averaged over some time period. It was constructed for use as a rough measure of price impact and designed to be easily calculated for long time series. 
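To make the definition concrete, here is a minimal sketch of the calculation on synthetic data. The function name `amihud_illiquidity`, the 21-day averaging window, and the scaling of dollar volume to millions are my own illustrative choices, not values taken from Amihud's paper.

```python
import numpy as np
import pandas as pd

def amihud_illiquidity(close, volume, window=21):
    """Rolling mean of |log return| / dollar volume (dollar volume in $ millions)."""
    lret = np.log(close / close.shift(1))      # daily log return
    dollar_volume = close * volume / 1e6       # daily dollar volume, $ millions
    daily = lret.abs() / dollar_volume         # daily price-impact proxy
    return daily.rolling(window).mean()        # average over the window

# Synthetic prices and volumes, for illustration only
rng = np.random.default_rng(0)
idx = pd.bdate_range("2016-01-04", periods=60)
close = pd.Series(50 * np.exp(np.cumsum(rng.normal(0, 0.01, 60))), index=idx)
volume = pd.Series(rng.integers(100_000, 500_000, 60), index=idx).astype(float)

illiq = amihud_illiquidity(close, volume, window=21)
print(illiq.dropna().tail())
```

Later in this post I work with the unaveraged daily ratio, which is the same calculation without the rolling mean.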

The Hill estimator[3] is a mathematical tool that allows us to focus on the tail of a sample distribution. Instead of trying to fit a single distribution over the entire sample, we can use the formal framework of Extreme Value Theory to evaluate only the extreme (tail) values. Wu's choice of this estimator is motivated by empirical evidence of power law behavior in the tails of the price-impact series. This further supports the use of Amihud's illiquidity metric, as it was designed to be a crude yet effective measure of price impact.
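As a rough illustration of the idea, the sketch below applies a Hill-type estimator to a synthetic Pareto sample whose true tail exponent is known. The function name `hill_tail_index` and the percentile-based threshold are assumptions made for this example; they mirror the convention used later in this post rather than any particular library API.

```python
import numpy as np

def hill_tail_index(sample, q=95):
    """Hill-type estimate of the tail exponent using values above the q-th percentile."""
    x = np.asarray(sample, dtype=float)
    x = x[np.isfinite(x)]
    threshold = np.percentile(x, q)            # boundary between body and tail
    tail = x[x > threshold]                    # keep tail observations only
    if tail.size == 0 or threshold <= 0:
        return np.nan
    # reciprocal of the mean log-exceedance over the threshold
    return 1.0 / np.mean(np.log(tail / threshold))

# Classical Pareto sample with minimum 1 and tail exponent alpha = 3
rng = np.random.default_rng(42)
alpha = 3.0
sample = rng.pareto(alpha, size=200_000) + 1.0

estimate = hill_tail_index(sample)
print(estimate)  # should land close to alpha
```

Only the top 5% of the sample enters the estimate, which is the sense in which the estimator lets us ignore the body of the distribution.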

I urge readers to explore the paper further as some of the deeper mathematical underpinnings are beyond the scope of this post.

Get Data

For this exploratory study I used the pandas Yahoo Finance API to download 20 years of stock data using a symbol list constructed by CRSP. 

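# Import

from copy import copy

import pandas as pd
# assumed module path: web.DataReader is provided by the pandas-datareader package
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
from tqdm import tqdm  # progress bar used in the illiquidity loop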

# get symbols
datasets = '/YOUR/DATASETS/LOCATION/_Datasets/'
symbols = pd.read_csv(datasets+'CSRP_symbol_list.txt',sep='\t').values.flatten()

Here is the text file of symbols I used --> Symbols.

Next we construct our convenience functions to aggregate the stock data.

# Get Prices Function

def _get_px(symbol, start, end):
    return web.DataReader(symbol, 'yahoo', start, end)

# Create HDF5 data store for fast read write

def _create_symbol_datastore(symbols, start, end):
    prices_hdf = pd.HDFStore(datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf')
    symbol_count = len(symbols)
    N = copy(symbol_count)
    missing_symbols = []
    for i, sym in enumerate(symbols, start=1):
        if not pd.isnull(sym):
            try:
                prices_hdf[sym] = _get_px(sym, start, end)
            except Exception as e:
                # record symbols that failed to download
                print(e, sym)
                missing_symbols.append(sym)
            N -= 1
            pct_total_left = (N / symbol_count)
            print('{}..[done] | {} of {} symbols collected | {:>.2%}'.format(
                sym, i, symbol_count, pct_total_left))
    prices_hdf.close()
    return missing_symbols

# Get past 20 years of data from today
# Evaluate missing symbols if you so choose

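# assumed: the right-hand side is missing in the original; any pandas Timestamp works here
today = pd.Timestamp.today().normalize()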
start = today - 252 * BDay() * 20

missing = _create_symbol_datastore(symbols, start, today) 

This takes roughly 30 minutes to run, which is a good time for a coffee break.

Next we need to calculate each stock's daily illiquidity measure according to Amihud. I also save this data to its own HDF5 store. I find it good practice to save intermediate calculations where possible for reference and ease of reproducibility. 

# calculate each symbol's returns and dollar volumes
# add to dataframe with symbol_lret, symbol_dv, symbol_illiq

FILE = datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf'

start = pd.to_datetime('1999-01-01')
end = pd.to_datetime('2016-11-22')
idx = pd.bdate_range(start, end)

# assumed: `keys` is the list of symbol keys held in the price store
with pd.HDFStore(FILE, mode='r') as store:
    keys = store.keys()

DF = pd.DataFrame(index=idx)
for sym in tqdm(keys):
    tmp_hdf = pd.read_hdf(FILE, mode='r', key=sym)
    tmp_hdf = tmp_hdf[['Volume', 'Adj Close']]
    # I want at least 1000 daily datapoints per stock
    if len(tmp_hdf) > 1000:
        try:
            # dollar volume in $ millions; drop the first row to align with returns
            dv = (tmp_hdf['Adj Close'] * tmp_hdf['Volume'] / 1e6)[1:]
            lret = np.log(tmp_hdf['Adj Close'] / tmp_hdf['Adj Close'].shift(1)).dropna()
            daily_illiq = np.abs(lret) / dv
            name = sym.lstrip('/')
            tmp_df = pd.DataFrame({name + '_lret': lret,
                                   name + '_dv': dv,
                                   name + '_illiq': daily_illiq})
            DF = DF.join(tmp_df, how='outer')
        except Exception:
            continue


# Illiquidity HDF originally run on 2016-Nov-11
# DataFrame key is "Illiquidity_Set"

ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq_set = DF.loc[:, DF.columns.to_series().str.contains('_illiq').tolist()]
ilq_set.to_hdf(ILQ_FILE, 'Illiquidity_Set')

8487 * 4954 = 42,044,598 data points! Some of these are np.nan but still, clearly CSV storage is a non-starter.


Now we are in a position to calculate the Extreme Liquidity Risk metric (ELR) or "Tail Index" for the aggregated stocks. We read in our 'Illiquidity_Set' dataframe from the HDF5 file, then create a convenience function to calculate the daily ELR. But first, let's take a quick glance at the ELR formula:

[Formula image. Source: Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016)]

My understanding is that this is a log average of the relative "distance" between the aggregated stocks' illiquidity measures and the threshold p*. The threshold p* is the line in the sand between the distribution "body" and the distribution "tail". The paper uses the convention of the 95th percentile as the threshold value, so I use that here as well.
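Written out to match the function that follows (this is my transcription of what the code computes, not a copy of the paper's equation, and the symbol names are my own), with K_t the number of stocks whose illiquidity exceeds the threshold on day t:

```latex
\gamma_t = \left[ \frac{1}{K_t} \sum_{k=1}^{K_t} \ln \frac{\mathrm{ILLIQ}_{k,t}}{p^*_t} \right]^{-1},
\qquad
\mathrm{ELR}_t = \frac{\gamma_t - \bar{\gamma}}{\sigma_{\gamma}}
```

Here p*_t is the 95th percentile of the cross-section of daily illiquidity values on day t, and the mean and standard deviation of gamma are taken over the full sample of days.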

# Read hdf illiquidity

ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq = pd.read_hdf(ILQ_FILE, 'Illiquidity_Set')

# function to get daily values for gamma calc

def _ext_lq_risk(series):
    # threshold is 95th percentile
    # right tailed convention
    p_star = np.nanpercentile(series, 95)
    illiq = series[series > p_star]
    lg_illiq = np.log(illiq / p_star)
    lg_illiq = lg_illiq[np.isfinite(lg_illiq)]
    try:
        gamma = 1./ ((1./len(lg_illiq)) * sum(lg_illiq))
    except ZeroDivisionError:
        gamma = np.nan
    return gamma

Now we can calculate the Tail Index and normalize the values to get the ELR series.

# Calculate Tail Index for all dates greater than cutoff
df = ilq.copy()
gs = {} # gammas dictionary
nan_dates = []
# assumed placeholder: the original cutoff value is not shown in this post
cutoff = 100

for d in df.index:
    # we want at least N nonnull values
    if df.loc[d].notnull().sum() > cutoff:
        gamma = _ext_lq_risk(df.loc[d])
        gs[d] = gamma
    else:
        # keep track of dates with too few observations
        nan_dates.append(d)
gdf = pd.DataFrame.from_dict(gs, orient='index').sort_index()
gdf.columns = ['Tail_Index']

# the ELR metric is a normalized version of the tail index
# normalize gamma dataframe to calc "ELR"

gdfz = (gdf - gdf.mean())/gdf.std()
gdfz.columns = ['ELR']

Let's plot it and take a look.

[Figure: normalized ELR time series. Credit: Blackarbs LLC]

Quick and Dirty Observations

First another plot. I skip the code here to save space, but would be happy to post it if requested. The plot below is the IWM used as a market proxy, its drawdown chart, and below that is the ELR. The shaded regions are official NBER recessions. 

[Figure: IWM price, IWM drawdown, and ELR, with NBER recessions shaded. Credit: Blackarbs LLC]
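For readers who want to rebuild the drawdown panel, here is a minimal sketch of one common way to compute a drawdown series. It is not the code behind the chart in this post; the function name `drawdown` and the toy price series are illustrative only.

```python
import pandas as pd

def drawdown(prices):
    """Fractional decline of each price from its running peak (0 at new highs)."""
    running_max = prices.cummax()       # highest price seen so far
    return prices / running_max - 1.0   # 0 at a peak, negative below it

# Toy price path: two peaks, each followed by a 10% decline
prices = pd.Series([100.0, 110.0, 99.0, 104.5, 121.0, 108.9])
dd = drawdown(prices)
print(dd.round(3))
print("max drawdown:", round(dd.min(), 3))
```

Applied to a market proxy's adjusted close, the same function gives a series that can be plotted directly beneath the price chart.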

The ELR appears to rise prior to the official beginning of the Dot-Com bust. It stays relatively elevated throughout the period and begins to decline sometime during the first persistent rally off the lows. Prior to the beginning of 2008's official recession, the ELR is mixed. However, the ELR rises sharply sometime prior to the massive decline in the broad market. In fact it was rising during a period where the market bounced, providing an early warning of the cataclysmic dropoff to come. Furthermore it begins declining shortly after the official NBER recession end date, providing investors with support for getting back into the market. Interestingly the ELR is in a downtrend for most of the low-volatility period that followed the recession. Clearly the metric is not a perfect predictor, but there seems to be evidence that it could be a useful tool, and certainly warrants more rigorous investigation. 

Next Steps

There are several directions to pursue regarding the Extreme Liquidity Risk index. We can explore the time series itself using Time Series Analysis (TSA), with either frequentist or Bayesian inference. Or we can get straight to the good stuff and simulate the long-short portfolio based on each stock's return sensitivity to the ELR, as reported in the paper that inspired this post. Check back for part 2, as we explore this concept further.
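As a preview of the sensitivity step, here is a hedged sketch of estimating each stock's OLS slope on the ELR and picking the long and short legs. The function name `elr_betas`, the contemporaneous (unlagged) regression, and the synthetic returns are all assumptions for illustration; the paper's exact regression specification may differ.

```python
import numpy as np
import pandas as pd

def elr_betas(returns, elr):
    """OLS slope of each return column on the ELR series (one regression per stock)."""
    # simplification: keep only dates where every stock and the ELR are observed
    data = returns.join(elr.rename("ELR"), how="inner").dropna()
    x = data["ELR"] - data["ELR"].mean()          # demeaned regressor
    y = data.drop(columns="ELR")
    y = y.sub(y.mean())                           # demeaned returns
    return y.mul(x, axis=0).sum() / (x ** 2).sum()  # cov(x, y) / var(x)

# Synthetic ELR and returns with known sensitivities, for illustration only
rng = np.random.default_rng(7)
idx = pd.bdate_range("2016-01-04", periods=500)
elr = pd.Series(rng.normal(size=500), index=idx)
true_betas = {"AAA": 0.02, "BBB": 0.0, "CCC": -0.02}
returns = pd.DataFrame(
    {s: b * elr + rng.normal(0, 0.01, 500) for s, b in true_betas.items()}
)

betas = elr_betas(returns, elr).sort_values()
long_leg, short_leg = betas.index[-1], betas.index[0]  # highest vs lowest sensitivity
print(betas)
print("long:", long_leg, "| short:", short_leg)
```

With thousands of stocks the two legs would be quantile buckets of the sorted betas rather than single names.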


  1. Wu, Ying. "Asset Pricing with Extreme Liquidity Risk." October 10, 2016. Available at SSRN.
  2. Amihud, Yakov. "Illiquidity and Stock Returns: Cross-section and Time-series Effects." Journal of Financial Markets 5.1 (2002): 31-56. Web.
  3. "Heavy-tailed Distribution." Wikipedia. Wikimedia Foundation, n.d. Web. 29 Nov. 2016.