**Introduction****Links + Datasets****Notebook****Next Steps**

This article series provides an opportunity to move towards more interactive analysis. My plan is to integrate more **Jupyter notebooks** and **Github repos** into my research/publishing workflow. For datasets that are too big to share through github I will provide a download link both here and in the github readme.

I will be posting the notebooks into this blog using iframes. If you experience any issues with formatting I recommend viewing the notebook at github directly. If you're using mobile, you will have to "request the desktop site" for the ipynb to render.

- Github Repo
- Notebook Link
- Raw Hourly Options Data from 2017-09-13 to 2017-10-18
- Processed Hourly Options Data from 2017-09-13 to 2017-10-18
****Note: please select download all**

The next step in this process will be analysis of the skew metric and how we might apply it to develop a trading strategy.

]]>**Purpose****Intuitive explanation****Code****Next Steps**

This is a simple reference article for readers that might wonder where I get/got my options data from. In this regard I would like to shout out the contributors to the pandas-datareader, without their efforts this process would be much more complex.

So this code consists of three components. The first is the actual script that wraps the pandas-datareader functions and downloads the options data. The second is a helper script to save the aggregated data to disk. The helper script which I call *file_handler* is designed to save the data in multiple formats in a structured file directory. Internally it checks to see if today's folder is created with a particular date and naming convention, if it isn't it will create the folder and then store all the data files there. What gives this code the ability to aggregate intraday data is the third component which simply requires making use of your system's task scheduler. For example, if you have Linux/Ubuntu you can package this script to run as a cronjob quite easily. After the code below I show an example cronjob template that works.

```
import sys
import os
import time
PROJECT_DIR = '/YOUR/CODE/DIR/option_skew_project/'
sys.path.append(PROJECT_DIR)
from pandas_datareader.data import Options
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
import numpy as np
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
from file_handler import file_handler
# -----------------------------------------------------------------------------
# import symbols
# -----------------------------------------------------------------------------
symbols = (pd.read_csv(PROJECT_DIR+'data/symbols.csv', header=None, index_col=False).rename(columns={0:'symbols'}))
# ------------------------------------------------------------------------------
# define conv. fn.
# ------------------------------------------------------------------------------
def cprint(df):
print('-'*50)
print(df.sample(5))
print()
print(df.info())
print()
def random_wait():
"""fn: randomly choose a wait time based on probability"""
wait_times = [0.2, 0.5, 1, 2]
probs = [0.3, 0.4, 0.2, 0.1 ]
choice = np.random.choice(wait_times, size=1, p=probs)
return choice
# ------------------------------------------------------------------------------
# init file handler
# ------------------------------------------------------------------------------
fh = file_handler(PROJECT_DIR)
# ------------------------------------------------------------------------------
# run aggregation func
# ------------------------------------------------------------------------------
errors = []
dfs_dict = {}
for sym in tqdm(symbols.symbols.values):
print('-'*50)
print('downloading {} ...'.format(sym))
try:
tmp_df = Options(sym, 'yahoo').get_all_data()
dfs_dict[sym] = tmp_df
except Exception as e:
errors.append(sym)
print('{} error: {}'.format(sym, e))
continue
else:
print('{} complete'.format(sym))
print()
time.sleep(random_wait())
# ------------------------------------------------------------------------------
# concat dfs drop unnecessary columns
# ------------------------------------------------------------------------------
data = (pd.concat(list(dfs_dict.values())).drop(['JSON'], axis=1))
error_series = pd.Series(errors)
cprint(data)
print(error_series)
# ------------------------------------------------------------------------------
# save data
# ------------------------------------------------------------------------------
fh.save_data(error_series, format='csv', resolution='date', errors=True)
try:
fh.save_data(data, format='parquet')
except Exception as e:
print(e)
fh.save_data(data, format='h5')
```

This is the code for the *file_handler* script. It can save in 1 of the following 4 formats: parquet, h5, feather, csv. I save the list of symbol errors as a CSV since this list is generally quite small. As seen above I save the options data in parquet format first, and a backup in the form of an h5 file. Generally I prefer to work with parquet files because the are compressed by default, contain metadata, and integrate better with the Dask. This code requires the installation of the pyarrow package.

```
import os
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
class file_handler:
'''
class for handling directory/folder creation + data saving
Attributes
project_dir : str(), main project directory
Methods
save_data : actual public save function
|__> _create_dir : internal fn to create dir if it does not exist
|_> __check_exists_or_create : private fn to check if file exists
|__> __create_date_str : private fn to create date str
|__> __create_timestamp_str : private fn to create timestamp str
'''
def __init__(self, project_dir):
self.project_dir = project_dir
def __check_exists_or_create(self, _dir):
"""fn: to check if file/path exists"""
if not os.path.exists(_dir):
try:
os.mkdir(_dir)
except Exception as e:
print(e)
return
def _create_dir(self):
"""fn: create daily directory if not already created"""
_dir = self.project_dir+'/Yahoo_Options_Data/'+str(pd.to_datetime('now').date())+'/'
self.__check_exists_or_create(_dir)
return _dir
def __create_timestamp_str(self):
"""fn: to create time stamp str"""
return str(pd.to_datetime('now').tz_localize('utc').tz_convert('US/Eastern')).replace(' ', '_').replace(':','.')
def __create_date_str(self):
"""fn: to create date str"""
return str(pd.to_datetime('now').date())
def save_data(self, data, format='parquet', resolution='time', errors=False):
"""fn: to save data to directory
Args
data : pd.DataFrame
format : str, ('parquet', 'h5', 'csv', 'feather')
resolution : str, date or time
if date uses default str format,
if time will use YYYY-MM-DD_HH.MM.SS
errors : bool,
if True change filepath name
if False use options data filepath name
"""
_dir = self._create_dir()
if resolution=='time':
_timestamp = self.__create_timestamp_str()
elif resolution=='date':
_timestamp = self.__create_date_str()
if errors:
_fp = _dir + f'yahoo_options_scraper_errors_{_timestamp}.{format}'
else:
_fp = _dir + f'yahoo_options_data_{_timestamp}.{format}'
if format=='parquet':
_table = pa.Table.from_pandas(data)
pq.write_table(_table, _fp)
elif format == 'h5': data.to_hdf(_fp, key='data')
elif format == 'csv': data.to_csv(_fp, index=False)
elif format == 'feather': data.to_feather(_fp)
return
```

Finally, below is an example of my cronjob. It is set to run Monday through Friday, hourly, from market open to close. Note the log directory and log file after the **">>"**; all the print statements contained in the script will output to that log file including any exceptions.

** 30 7-15 * * 1-5 /YOUR/CODE/DIR/option_skew_project/scripts/options_downloader.py' >> /YOUR/LOG/DIR/options_downloader_cronlog.log 2>&1**

The next article will document the code I refactored to calculate the option skew metric from the paper "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. If you have been a long time reader, you may recall I did a series where I tracked a theoretical ETF equity strategy that was based on this metric. Over time, people have asked how it is performing, and I did not have an answer because I stopped tracking it, as I have been busy with other projects. However, the strategy showed promise then and I wondered if it could be applied directly in options trading. My goal is to research the possibility of implementing this strategy live, and if the results show an edge, implementing it and tracking the results publicly.

To accomplish this task I first needed to gather data which this article shows. In the next article I make heavy use of Dask because the volume of intraday data aggregated over a month is over 14 million rows and operating on the dataframe in-memory is slow and/or unfeasible on most people's systems including mine.

Additionally the next article will be a jupyter notebook I will embed as a blog post here directly, but recommend it be viewed on the github repo I will make public.

]]>- Notes on Part-2
- The Data
- Bid-Ask Spread Analysis
- How Do Aggregate Bid-Ask Spreads Vary with Days To Expiration?
- How Do Bid-Ask Spreads Vary with Volume?
- How Do Bid-Ask Spreads Vary with Volatility?

- Summary Conclusions

Some astute readers in the comments noted that analysis based on the absolute difference in bid-ask price is not robust when considering the price of the underlying option and can lead to spurious conclusions. They recommended defining bid-ask spread as a percent of the option's spot price.

Additionally, I failed to constrain the analysis to include only options with a certain level of "moneyness". That is, options far away from the strike price behave differently than options that are closer, and the prior analysis failed to incorporate that understanding. In Part 2 of this exploration we re-examine the conclusions drawn in Part-1, after incorporating the aforementioned suggestions. With that said, this post will largely follow the format of Part-1, so if you feel you are missing context for this analysis start there.

The data is a cleaned **hdf5**/**.h5** file comprised of a collection of daily options data collected over the period of 05/17/2017 to 07/24/2017. By cleaned I mean I aggregated the daily data into one set, removed some unnecessary columns, cleaned up the data types and added the underlying ETF prices from Yahoo. I make no claims about the accuracy of the data itself, and I present it as is. It is approximately a 1 GB in size and I have made it available for download at the following link:

Options Data

To import the data into your python environment:

import pandas as pd; data = pd.read_hdf('option_data_2017-05-17_to_2017-07-24.h5', key='data')

```
%load_ext watermark
%watermark
import sys
import os
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
import numpy as np
import scipy.stats as stats
import pymc3 as pm
from mpl_toolkits import mplot3d
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-muted')
import plotnine as pn
import mizani.breaks as mzb
import mizani.formatters as mzf
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
p=print
p()
%watermark -p pymc3,pandas,pandas_datareader,numpy,scipy,matplotlib,seaborn,plotnine
```

```
# convenience functions
# add spread as percentage of spot
def create_spread_percent(df):
return (df.assign(spread_pct = lambda df: df.spread / df.askPrice))
# add instrinsic values
def create_intrinsic(df):
# create intrinsic value column
call_intrinsic = df.query('optionType == "Call"').loc[:, 'underlyingPrice']\
- df.query('optionType == "Call"').loc[:, 'strikePrice']
put_intrinsic = df.query('optionType == "Put"').loc[:, 'strikePrice']\
- df.query('optionType == "Put"').loc[:, 'underlyingPrice']
df['intrinsic_value'] = [np.nan] * df.shape[0]
(df.loc[df['optionType'] == "Call", ['intrinsic_value']]) = call_intrinsic
(df.loc[df['optionType'] == "Put", ['intrinsic_value']]) = put_intrinsic
return df
# fn: code adapted from https://github.com/jonsedar/pymc3_vs_pystan/blob/master/convenience_functions.py
def custom_describe(df, nidx=3, nfeats=20):
''' Concat transposed topN rows, numerical desc & dtypes '''
print(df.shape)
nrows = df.shape[0]
rndidx = np.random.randint(0,len(df),nidx)
dfdesc = df.describe().T
for col in ['mean','std']:
dfdesc[col] = dfdesc[col].apply(lambda x: np.round(x,2))
dfout = pd.concat((df.iloc[rndidx].T, dfdesc, df.dtypes), axis=1, join='outer')
dfout = dfout.loc[df.columns.values]
dfout.rename(columns={0:'dtype'}, inplace=True)
# add count nonNAN, min, max for string cols
nan_sum = df.isnull().sum()
dfout['count'] = nrows - nan_sum
dfout['min'] = df.min().apply(lambda x: x[:6] if type(x) == str else x)
dfout['max'] = df.max().apply(lambda x: x[:6] if type(x) == str else x)
dfout['nunique'] = df.apply(pd.Series.nunique)
dfout['nan_count'] = nan_sum
dfout['pct_nan'] = nan_sum / nrows
return dfout.iloc[:nfeats, :]
```

```
%%time
op_data = (pd.read_hdf('option_data_2017-05-17_to_2017-07-24.h5', key='data')
.dropna(subset=['underlyingPrice', 'spread', 'askPrice'])
.pipe(create_spread_percent)
.pipe(create_intrinsic)
.reset_index(drop=True))
### filter by moneyness
# within 20% of strike in either direction
def filter_by_moneyness(df, pct_cutoff=0.2):
crit1 = (1-pct_cutoff)*df.strikePrice < df.underlyingPrice
crit2 = df.underlyingPrice < (1+pct_cutoff)*df.strikePrice
return (df.loc[crit1 & crit2].reset_index(drop=True))
data = filter_by_moneyness(op_data)
data_describe = custom_describe(data)
data_describe
```

```
sprd_by_dtm = (data.groupby(['symbol', 'daysToExpiration', 'optionType'],
as_index=False)['spread_pct'].median()
.groupby(['daysToExpiration', 'optionType'], as_index=False).median()
.assign(bins = lambda x: pd.qcut(x.daysToExpiration, 10, labels=False)))
sprd_by_dtm.sample(5)
```

```
def plot_spread_dtm(sprd_by_dtm):
"""
given df plot scatter with regression line
# Params
df: pd.DataFrame()
# Returns
g: plotnine figure
"""
g = (pn.ggplot(sprd_by_dtm, pn.aes('daysToExpiration', 'spread_pct', color='factor(bins)'))
+ pn.geom_point(pn.aes(shape='factor(bins)'))
+ pn.stat_smooth(method='glm')
+ pn.scale_y_continuous(breaks=mzb.mpl_breaks(),
labels=mzf.percent_format(),
limits=(0, sprd_by_dtm.spread_pct.max()))
+ pn.scale_x_continuous(breaks=range(0, sprd_by_dtm.daysToExpiration.max(), 50),
limits=(0, sprd_by_dtm.daysToExpiration.max()))
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50),)
+ pn.ylab('bid-ask spread')
+ pn.ggtitle('Option Spread by DTM'))
return g
# ------------------------------
# Example use of func for both calls and puts
g = plot_spread_dtm(sprd_by_dtm)
g.save(filename='call-put option bid-ask spreads - daysToExpiration scatter plot-PERCENT.png')
g.draw();
```

What jumps out at me is how large the spread is as a percentage of the option's ask price as you move closer to expiration. From ~220 days and below (or bin 4.5+) the pattern appears to show a a significant increase in spreads. With days to expiration longer than ~220 both calls and puts show a flattening.

My first guess as to what could cause this pattern is that, as the contract expiration approaches, the probability of being ITM is low for a vast majority of contracts. As a result the demand from market participants dries up so the cost to the market maker increases and to compensate spreads widen. I welcome any insight readers may have on this.

```
median_sprd = data.groupby(['symbol', 'daysToExpiration', 'optionType'],
as_index=False)['spread_pct'].median()
test_syms = ['SPY', 'DIA', 'QQQ', 'TLT', 'GLD', 'USO', 'SLV', 'XLF']
sel_med_sprd = median_sprd.query('symbol in @test_syms').dropna(subset=['spread_pct'])
# to plot symbols have to cast to type str
sel_med_sprd.symbol = sel_med_sprd.symbol.astype(str)
p(sel_med_sprd.head())
p()
p(sel_med_sprd.info())
```

```
def plot_boxplot(df, x, y, optionType='Call'):
"""given df plot boxplot
# Params
df: pd.DataFrame()
x: str(), column
y: str(), column
optionType: str()
# Returns
g: plotnine figure
"""
df = df.query('optionType == @optionType')
g = (pn.ggplot(df, pn.aes(x, y, color=f'factor({x})'))
+ pn.geom_boxplot()
+ pn.scale_y_continuous(breaks=mzb.minor_breaks(10),
labels=mzf.percent_format(),
limits=(0., 1.))
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'))
+ pn.ylab('bid-ask spread')
+ pn.ggtitle(f'Selected Symbol {optionType} Option Spreads'))
return g
# ------------------------------
# example of box plot function
g = plot_boxplot(sel_med_sprd, 'symbol', 'spread_pct')
g.save(filename='call-option bid-ask spreads - boxplot-PERCENT.png')
g.draw();
```

From these two plots we can see that the bulk of the bid-ask spreads are below 15% for both calls and puts. I find it interesting that for calls SLV, and XLF have more extreme tails than the others. DIA and XLF calls also appear to be priced consistently higher than the other symbols.

Looking at the put options we see DIA is more expensive with more extreme values than any other symbol. The tails for SPY, TLT, QQQ, and GLD are more extreme/dispersed than their call option counterparts.

```
grp_cols = ['symbol', 'daysToExpiration', 'optionType']
agg_cols = ['spread_pct', 'openInterest', 'volume', 'volatility', 'intrinsic_value']
median_sprd = data.groupby(grp_cols, as_index=False)[agg_cols].median()
test_syms = ['SPY', 'DIA', 'QQQ', 'TLT', 'GLD', 'USO', 'SLV', 'XLF']
sel_med_sprd = (median_sprd.query('symbol in @test_syms')
.dropna(subset=['spread_pct', 'openInterest']))
# to plot symbols have to cast to type str
sel_med_sprd.symbol = sel_med_sprd.symbol.astype(str)
p(sel_med_sprd.head())
p()
p(sel_med_sprd.info())
```

```
def plot_log_points(df, x, y, color='factor(symbol)', size='openInterest'):
g = (pn.ggplot(df, pn.aes(x, y, color=color))
+ pn.geom_point(pn.aes(size=size, shape='factor(symbol)'), alpha=0.75, stroke=.75)
+ pn.geom_hline(yintercept=pm.hpd(df[y]), size=2, color='red')
+ pn.scale_x_log10(breaks=[0,0.5,1,10,100,250,500,750,1_000])
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50))
+ pn.scale_y_continuous(breaks=mzb.minor_breaks(10),
labels=mzf.percent_format(),
limits=(0., 1.))
+ pn.ylab('bid-ask spread'))
return g
# ------------------------------
df = sel_med_sprd.copy()
# example with both call and puts
g = plot_log_points(df, x='volume', y='spread_pct')
g.save(filename='call-put option bid-ask spreads - volume scatter plot-PERCENT.png')
g.draw();
```

The red lines indicate the 95% interval for the data. We can see that the two plots are very similar except for minor cosmetic differences. Looking at the puts It still appears that, as volume increases the spreads are compressed a bit more than the calls even though the 95% intervals are nearly identical. Looking at the calls, there appears to be more extreme values at lower volumes than the puts.

Furthermore it appears that in this admittedly small sampling, spreads decline as open-interest and volume increase. This should not be surprising to readers, but it is noteworthy. The hypothesized mechanism for this is simple, as volume/open-interest increase, it becomes less risky for market-makers to provide their services, thus lowering the overall cost to trade.

The following two plots make this point a little bit clearer...

```
def facet_plot_log_points(df, x, y, color='factor(symbol)', size='openInterest'):
g = (pn.ggplot(df, pn.aes(x, y, color=color))
+ pn.geom_point(pn.aes(size=size, shape='factor(symbol)'), alpha=0.75, stroke=.75)
+ pn.stat_smooth(method='loess')
+ pn.scale_x_log10(breaks=[0,0.5,1,10,100,250,500,750,1_000])
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50))
+ pn.scale_y_continuous(breaks=mzb.minor_breaks(5),
labels=mzf.percent_format(),
limits=(0., 1.))
+ pn.facet_wrap('~symbol', ncol=2)
+ pn.ylab('bid-ask spread'))
return g
# ------------------------------
# example use
g = facet_plot_log_points(df.query('optionType=="Call"'), x='volume', y='spread_pct')
g.save(filename='FACET-call option bid-ask spreads - volume scatter plot-PERCENT.png')
g.draw();
```

You could argue that the above plots show that market makers overall are pretty good at keeping spreads low regardless of the volume.

Also notice how much volume/open-interest there is in USO; both calls and puts are traded at a sharply higher volume than the other symbols. Next closest appears to be SLV, with XLF having some very popular contracts functioning as outliers. DIA and TLT appear to be least traded however DIA appears to be priced most inefficiently compared to the other symbols.

```
def facet_plot_points(df, x, y, color='factor(symbol)', size='openInterest'):
g = (pn.ggplot(df, pn.aes(x, y, color=color))
+ pn.geom_point(pn.aes(size=size, shape='factor(symbol)'), alpha=0.75, stroke=.75)
+ pn.stat_smooth(method='loess')
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50))
+ pn.scale_y_continuous(breaks=mzb.minor_breaks(5),
labels=mzf.percent_format(),
limits=(0., 1.))
+ pn.facet_wrap('~symbol', ncol=2)
+ pn.ylab('bid-ask spread'))
return g
# ------------------------------
# example use
g = facet_plot_points(df, 'volatility', 'spread_pct')
g.save(filename='FACET-call-put option bid-ask spreads - volatility scatter plot.png')
g.draw();
```

*In aggregate* it appears that there is some relationship between volatility and spreads, with DIA, SPY, USO, SLV, TLT, and XLF showing increases in spreads co-occurring with increases in volatility. However, the relationship looks more tenuous when we disaggregate the options into calls and puts. For example USO calls appear to show a relationship between spreads and volatility quite clearly, but USO puts show no relationship at all. The same can be said about SLV, and XLF.

- Spreads increase dramatically as the contract nears expiration. The exact cause of this is only speculative and worthy of more investigation.
- Examining selected symbols, it appears that most of the contracts are priced competitively with each other with DIA and XLF showing the most extreme outliers.
- USO options have high interest from market participants as both calls and puts are traded at a higher volume.
- The sample size is too small to conclude anything about volatility and spreads. This relationship needs to be researched further, as common wisdom suggests spreads get wider as volatility increases. Is that true in aggregate, for calls or puts? Is that relationship stronger intraday? Would it even show up in daily or weekly samplings?

- The Objective
- The Data
- Basic Data Analysis
- Bid-Ask Spread Analysis
- How Do Aggregate Bid-Ask Spreads Vary with Days To Expiration?
- How Do Bid-Ask Spreads Vary with Volume?
- How Do Bid-Ask Spreads Vary with Volatility?

- Summary Conclusions

Compared to the equity market, the options market is a level up in complexity. For each symbol there are multiple expiration dates, strike prices for each expiration date, implied volatilities, and that's before we get to the option greeks.

The increased complexity presents us with more opportunity. More complexity means less ground truth, more errors, more gaps, and more structural asymmetries. Consider that THE dominant factor underlying options pricing - implied volatility - cannot be directly measured only estimated! To estimate it requires other observable factors and a pricing **model. **We already know *"All models are wrong. Some are Useful" *thus there are opportunities to exploit the errors of others. To do that requires a better understanding than our competitors thus beginning our study of the options market.

This is the next step in the series for developing an options trading dashboard using Python and Python based tools. Thus far I have demonstrated two methods [1] [2] of scraping the necessary data. Now that the data has been collecting for a bit we can begin some initial exploratory analysis. As this is a purpose driven process we should set an objective for our study.

In this particular article I want to focus on exploring bid-ask spreads as that data is often unavailable for free.

The data is a cleaned **hdf5**/**.h5** file comprised of a collection of daily options data collected over the period of 05/17/2017 to 07/24/2017. By cleaned I mean I aggregated the daily data into one set, removed some unnecessary columns, cleaned up the data types and added the underlying ETF prices from Yahoo. I make no claims about the accuracy of the data itself, and I present it as is. It is approximately a 1 GB in size and I have made it available for download at the following link:

Options Data

To import the data into your python environment:

import pandas as pd; data = pd.read_hdf('option_data_2017-05-17_to_2017-07-24.h5', key='data')

First the package imports.

```
%load_ext watermark
%watermark
import sys
import os
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format
import numpy as np
import scipy.stats as stats
import pymc3 as pm
from mpl_toolkits import mplot3d
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use('seaborn-muted')
import plotnine as pn
import mizani.breaks as mzb
import mizani.formatters as mzf
import seaborn as sns
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
p=print
p()
%watermark -p pymc3,pandas,pandas_datareader,numpy,scipy,matplotlib,seaborn,plotnine
```

Some convenience functions...

```
# convenience functions
# add instrinsic values
def create_intrinsic(df):
# create intrinsic value column
call_intrinsic = df.query('optionType == "Call"').loc[:, 'underlyingPrice']\
- df.query('optionType == "Call"').loc[:, 'strikePrice']
put_intrinsic = df.query('optionType == "Put"').loc[:, 'strikePrice']\
- df.query('optionType == "Put"').loc[:, 'underlyingPrice']
df['intrinsic_value'] = [np.nan] * df.shape[0]
(df.loc[df['optionType'] == "Call", ['intrinsic_value']]) = call_intrinsic
(df.loc[df['optionType'] == "Put", ['intrinsic_value']]) = put_intrinsic
return df
# fn: code adapted from https://github.com/jonsedar/pymc3_vs_pystan/blob/master/convenience_functions.py
def custom_describe(df, nidx=3, nfeats=20):
''' Concat transposed topN rows, numerical desc & dtypes '''
print(df.shape)
nrows = df.shape[0]
rndidx = np.random.randint(0,len(df),nidx)
dfdesc = df.describe().T
for col in ['mean','std']:
dfdesc[col] = dfdesc[col].apply(lambda x: np.round(x,2))
dfout = pd.concat((df.iloc[rndidx].T, dfdesc, df.dtypes), axis=1, join='outer')
dfout = dfout.loc[df.columns.values]
dfout.rename(columns={0:'dtype'}, inplace=True)
# add count nonNAN, min, max for string cols
nan_sum = df.isnull().sum()
dfout['count'] = nrows - nan_sum
dfout['min'] = df.min().apply(lambda x: x[:6] if type(x) == str else x)
dfout['max'] = df.max().apply(lambda x: x[:6] if type(x) == str else x)
dfout['nunique'] = df.apply(pd.Series.nunique)
dfout['nan_count'] = nan_sum
dfout['pct_nan'] = nan_sum / nrows
return dfout.iloc[:nfeats, :]
```

Let's import the data and view some basic info...

```
sprd_by_dtm = (data.groupby(['symbol', 'daysToExpiration', 'optionType'],
as_index=False)['spread'].median()
.groupby(['daysToExpiration', 'optionType'], as_index=False).median()
.assign(bins = lambda x: pd.qcut(x.daysToExpiration, 10, labels=False)))
sprd_by_dtm.head()
```

Let's define a convenience function to plot the data.

```
def plot_spread_dtm(sprd_by_dtm):
"""given df plot scatter with regression line
# Params
df: pd.DataFrame()
# Returns
g: plotnine figure
"""
g = (pn.ggplot(sprd_by_dtm, pn.aes('daysToExpiration', 'spread', color='factor(bins)'))
+ pn.geom_point(pn.aes(shape='factor(bins)'))
+ pn.stat_smooth(method='lm')
+ pn.scale_y_continuous(breaks=range(0, int(sprd_by_dtm.spread.max()+2)),
labels=mzf.currency_format(), limits=(0, sprd_by_dtm.spread.max()))
+ pn.scale_x_continuous(breaks=range(0, sprd_by_dtm.daysToExpiration.max(), 50),
limits=(0, sprd_by_dtm.daysToExpiration.max()))
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50),)
+ pn.ylab('bid-ask spread')
+ pn.ggtitle('Option Spread by DTM'))
return g
```

```
# Example use of func for both calls and puts
g = plot_spread_dtm(sprd_by_dtm)
g.save(filename='call-put option bid-ask spreads - daysToExpiration scatter plot.png')
g.draw();
```

Some things are interesting. From ~250 through ~600 days in both call and put options the bid-ask spreads are compressed towards zero. There also appears to be less dispersion in put bid-ask spreads overall.

We can look at a few select ETFs.

```
median_sprd = data.groupby(['symbol', 'daysToExpiration', 'optionType'],
as_index=False)['spread'].median()
test_syms = ['SPY', 'DIA', 'QQQ', 'TLT', 'GLD', 'USO', 'SLV', 'XLF']
sel_med_sprd = median_sprd.query('symbol in @test_syms').dropna(subset=['spread'])
# to plot symbols have to cast to type str
sel_med_sprd.symbol = sel_med_sprd.symbol.astype(str)
p(sel_med_sprd.head())
p(sel_med_sprd.info())
```

A convenience plotting function for boxplots.

```
def plot_boxplot(df, x, y, optionType='Call'):
"""given df plot boxplot
# Params
df: pd.DataFrame()
x: str(), column
y: str(), column
optionType: str()
# Returns
g: plotnine figure
"""
df = df.query('optionType == @optionType')
g = (pn.ggplot(df, pn.aes(x, y, color=f'factor({x})'))
+ pn.geom_boxplot()
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'))
+ pn.ylab('bid-ask spread')
+ pn.ggtitle(f'Selected Symbol {optionType} Option Spreads'))
return g
```

```
g = plot_boxplot(sel_med_sprd, 'symbol', 'spread')
g.save(filename='call-option bid-ask spreads - boxplot.pdf')
g.draw();
```

Looking at these plots we see further evidence of bid-ask spreads showing less dispersion across puts vs calls. Also it's surprising to see DIA options having such a wide range of values compared to SPY and QQQ; this is especially true for the call options.

```
grp_cols = ['symbol', 'daysToExpiration', 'optionType']
agg_cols = ['spread', 'openInterest', 'volume', 'volatility', 'intrinsic_value']
median_sprd = data.groupby(grp_cols, as_index=False)[agg_cols].median()
test_syms = ['SPY', 'DIA', 'QQQ', 'TLT', 'GLD', 'USO', 'SLV', 'XLF']
sel_med_sprd = (median_sprd.query('symbol in @test_syms')
.dropna(subset=['spread', 'openInterest']))
# to plot symbols have to cast to type str
sel_med_sprd.symbol = sel_med_sprd.symbol.astype(str)
p(sel_med_sprd.head())
p(sel_med_sprd.info())
```

A convenience function for plotting...

```
def plot_log_points(df, x, y, color='factor(symbol)', size='openInterest'):
g = (pn.ggplot(df, pn.aes(x, y, color=color))
+ pn.geom_point(pn.aes(size=size))
+ pn.scale_x_log10(breaks=[0,0.5,1,10,100,250,500,750,1_000])
+ pn.theme_linedraw()
+ pn.theme(figure_size=(12,6), panel_background=pn.element_rect(fill='black'),
axis_text_x=pn.element_text(rotation=50))
+ pn.scale_y_continuous(breaks=range(0, int(df.spread.max()+2)),
labels=mzf.currency_format(), limits=(0, df.spread.max()))
+ pn.ylab('bid-ask spread'))
return g
```

```
df = sel_med_sprd.copy()
# example with both call and puts
g = plot_log_points(df, x='volume', y='spread')
g.save(filename='call-put option bid-ask spreads - volume scatter plot.png')
g.draw();
```

Again we see put bid-ask spreads squeezed towards zero even as volume increases. We also see SPY and USO with small spreads as both volume and open interest increases. This suggests there are symbols/contracts with higher relative trading capacity.

```
# example with both call and puts
g = plot_log_points(df, 'volatility', 'spread')
g.save(filename='call-put option bid-ask spreads - volatility scatter plot.png')
g.draw();
```

Some notes. DIA again appears to have the highest dispersion in bid-ask spreads for both calls and puts. GLD is also notable. It is also somewhat surprising that for these selected ETFs increased volatility doesn't appear with increased bid-ask spreads.

- Put options have less overall dispersion in bid-ask spreads than calls relative to days to expiration, volume, and volatility.
- Bid-ask spreads have a major compression range between ~250 to ~600 days to maturity that appear smaller than all other buckets.
- Bid-ask spreads show greater dispersion at lower levels of implied volatility.
- DIA in particular shows the greatest variability in bid-ask spreads of the selected ETFs.
- SPY and USO show high capacity as bid-ask spreads remain near zero even at elevated volume and open interest levels.

**Recap****The Problem****The Solution****Barchart Scraper Class****Barchart Parser Class****Utility Functions****Putting it all together****The Simple Trick****Next Steps**

In the previous post I revealed a web scraping trick that allows us to defeat AJAX/JavaScript based web pages and extract the tables we need. We also covered how to use that trick to scrape a large volume of options prices quickly and asynchronously using the combination of **aiohttp** and** asyncio**.

It worked beautifully until... I told people about it. Shortly after publishing, my code stopped functioning. After investigating, it was clear no data was being returned during the aiohttp call to the Barchart server. I attempted to fix the code by adding the **semaphore** option to the asyncio call. Roughly speaking, in this context the semaphore option allows you to specify the max number of calls that can be made simultaneously. I tried, 100, 50, 10, 2 and they all failed.

I do not know what happened for sure, but if I had to guess, the increase in server loads per unit time measure, was significant enough for Barchart system/network staff to update their server settings and squash the multiple simultaneous calls.

We simply build a sequential scraper instead of an asynchronous one. To make it more robust we have to add a simple twist to the code that makes it more difficult to diagnose human vs automated traffic.

This class is similar to the previous version except asyncio is stripped out. It's main function is to create the POST url, call the server and return the response data. Please note, I tested this class with a dynamic referer symbol and random user agents and this simple hardcoded setup has worked most consistently for me.

```
import requests as r
class barchart_scraper:
def __init__(self, symbol):
self.__request_headers = {
"Accept":"application/json",
"Accept-Encoding":"gzip, deflate, sdch, br",
"Accept-Language":"en-US,en;q=0.8",
"Connection":"keep-alive",
"Host":"core-api.barchart.com",
"Origin":"https://www.barchart.com",
"Referer":"https://www.barchart.com/etfs-funds/quotes/SPY/options",
"User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
}
self.__base_url_str = 'https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description'
self.__expiry_url_str = "https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&expirationDate={}&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description"
self.symbol = symbol
# ------------------------------------------------
def _construct_url(self):
return self.__base_url_str.format(self.symbol)
def _construct_expiry_url(self, expiry):
return self.__expiry_url_str.format(self.symbol, expiry)
# ------------------------------------------------
def post_url(self, expiry=None):
if not expiry:
return r.post(
url = self._construct_url(),
headers = self.__request_headers
)
else:
return r.post(
url = self._construct_expiry_url(expiry=expiry),
headers = self.__request_headers
)
# ------------------------------------------------
def get_expirys(self, response):
return response.json()['meta']['expirations']
```

This class is essentially identical to the previous parser class and simply extracts call/put data into pandas dataframes.

```
import pandas as pd
import numpy as np
class barchart_parser:
def __init__(self, symbol, response):
self.symbol = symbol
self.response = response
# ------------------------------------------------
# create call df
def create_call_df(self):
"""fn: to create call df"""
json_calls = self.response.json()['data']['Call']
list_dfs = []
for quote in json_calls:
list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
df = (
pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
.replace('NA', np.nan)
.apply(pd.to_numeric, errors='ignore')
.assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
)
df['symbol'] = [self.symbol] * len(df.index)
return df
# ------------------------------------------------
# create put df
def create_put_df(self):
"""fn: to create put df"""
json_puts = self.response.json()['data']['Put']
list_dfs = []
for quote in json_puts:
list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
df = (
pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
.replace('NA', np.nan)
.apply(pd.to_numeric, errors='ignore')
.assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
)
df['symbol'] = [self.symbol] * len(df.index)
return df
```

Next we devise 2 utility functions. The first function is simply a convenience function to run the first iteration of the scraper. We need to do that for each symbol in order to extract the expiration dates dynamically.

```
def get_first_data(symbol):
"""fn: to get first data and extract expiry dates"""
# scrape
scraper = barchart_scraper(symbol)
response = scraper.post_url()
expirys = scraper.get_expirys(response)
# parse response
parser = barchart_parser(symbol, response)
first_call_df = parser.create_call_df()
first_put_df = parser.create_put_df()
# merge calls + puts
first_concat = pd.concat([first_call_df, first_put_df], axis=0)
return first_concat, expirys
```

The second function is a little lambda function that gets the symbol's last daily price from Google Finance which we add to our dataset before saving to disk.

```
get_price = lambda symbol: web.DataReader(
symbol, 'google', today - 1*BDay(), today)['Close']
```

Next we can implement the main script body. Essentially it runs a main loop and an inner loop. For each symbol get the default first data, extract the expirys, and then for each expiration extract the data. At the end of the inner loop, all data for that symbol is concatenated and then appended to a list containing all the symbols' dataframes. Finally all the symbols dataframes are concatenated and saved to hdf.

```
import requests as r
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
import numpy as np
import time
from tqdm import tqdm
from barchart_scraper import barchart_scraper
from barchart_parser import barchart_parser
today = pd.datetime.today().date()
project_dir = '/YOUR/PROJECT/DIR'
# -----------------------------------------------------------------------------
# define utility functions
# -----------------------------------------------------------------------------
def get_first_data(symbol):
"""fn: to get first data and extract expiry dates"""
# scrape
scraper = barchart_scraper(symbol)
response = scraper.post_url()
expirys = scraper.get_expirys(response)
# parse response
parser = barchart_parser(symbol, response)
first_call_df = parser.create_call_df()
first_put_df = parser.create_put_df()
# merge calls + puts
first_concat = pd.concat([first_call_df, first_put_df], axis=0)
return first_concat, expirys
# function to get last daily close from Google Finance
get_price = lambda symbol: web.DataReader(
symbol, 'google', today - 1*BDay(), today)['Close']
# -----------------------------------------------------------------------------
# import symbols
# -----------------------------------------------------------------------------
FILE = project_dir + 'ETFList.Options.Nasdaq__M.csv'
ALL_ETFS = pd.read_csv(FILE)['Symbol']
drop_symbols = ['ADRE', 'AUNZ', 'CGW', 'DGT', 'DSI', \
'EMIF', 'EPHE', 'EPU', 'EUSA', 'FAN', \
'FDD', 'FRN', 'GAF', 'GII', 'GLDI', 'GRU', \
'GUNR', 'ICN', 'INXX', 'IYY', 'KLD', 'KWT', \
'KXI', 'MINT', 'NLR', 'PBP', 'PBS', 'PEJ', \
'PIO', 'PWB', 'PWV', 'SCHO', 'SCHR', 'SCPB', \
'SDOG', 'SHM', 'SHV', 'THRK', 'TLO', 'UHN', \
'USCI', 'USV', 'VCSH']
ETFS = [x for x in ALL_ETFS if x not in set(drop_symbols)]
# -----------------------------------------------------------------------------
# run main script body
#
# loop through all etfs
# loop through expirys for each etf
# -----------------------------------------------------------------------------
t0 = time.time()
all_etfs_data = []
error_symbols = []
for symbol in tqdm(ETFS):
print()
print('-'*79)
print('scraping: ', symbol)
try:
last_close_price = get_price(symbol).iloc[0]
first_concat, expirys = get_first_data(symbol)
list_dfs_by_expiry = []
list_dfs_by_expiry.append(first_concat)
for expiry in tqdm(expirys[1:]):
print()
print('scraping expiry: ', expiry)
scraper = barchart_scraper(symbol)
tmp_response = scraper.post_url(expiry=expiry)
print('parsing... ')
parser = barchart_parser(symbol, tmp_response)
call_df = parser.create_call_df()
put_df = parser.create_put_df()
concat = pd.concat([call_df, put_df], axis=0)
concat['underlyingPrice'] = [last_close_price] * concat.shape[0]
list_dfs_by_expiry.append(concat)
print('parsing complete')
random_wait = np.random.choice([1,1.25,2.5,3], p=[0.3,0.3,0.25,0.15])
time.sleep(random_wait)
all_etfs_data.append(pd.concat(list_dfs_by_expiry, axis=0))
except Exception as e:
error_symbols.append(symbol)
print(f'symbol: {symbol}\n error: {e}')
print()
continue
# -----------------------------------------------------------------------------
duration = time.time() - t0
print(f'script run time: ', pd.to_timedelta(duration, unit='s'))
dfx = pd.concat(all_etfs_data, axis=0)
print(dfx.head())
print(dfx.info())
print(f'error symbols:\n{error_symbols}')
# -----------------------------------------------------------------------------
# store table as hdf
# -----------------------------------------------------------------------------
today = pd.datetime.today().date()
file_ = project_dir + f'/Barchart_Options_Data/ETF_options_data_{today}.h5'
dfx.to_hdf(file_, key='data', format='table', mode='w')
# -----------------------------------------------------------------------------
# kill python process after running script to prevent leakage
# -----------------------------------------------------------------------------
time.sleep(5)
os.kill(os.getpid(), 9)
```

Did you notice the **random_wait** at the end of the inner loop? We simply pass an array of reasonable wait times *(measured in seconds)* and their probabilities to numpy's **random_choice()** and pass the result to the **time.sleep()** function before iterating to the next symbol. This isn't guaranteed to always work, but in cases where servers *may* be restricting traffic loads it makes it much harder to identify your traffic as automated.

Ultimately, it's also a respectful way to operate our scraper.

Next up in the series I plan to explore the data collected over the last 6 weeks I've been running this script. I hope to explore multiple angles and dynamics in the data.

Do you have any suggestions for exploration topics? If so, leave a comment or contact me via email or twitter.

]]>**Intro****Disclaimers****The Secret to Scraping AJAX Sites****The async_option_scraper script****first_async_scraper class****expirys class****xp_async_scraper class****last_price_scraper class**

**The option_parser Module****The Implementation Script****References**

This is Part 1 of a new series I'm doing in semi real-time to build a functional options data dashboard using Python. There are many underlying motivations to attempt this, and several challenges to implementing a tool like this from scratch.

- Where to get the data? Is it affordable? Easily accessible? API?
- How to parse the results?
- How to aggregate and organize the data for analysis?
- How to store the data? TXT, CSV, SQL database, HDF5??
- How often should it run?
- How to display the data? What dynamic graphic library to use? D3.js, MPL3d, Plotly, Bokeh, etc.?

These are some of the problems that need to be solved in order to create the tool.

In this post I show a current working solution to where to get the data, how to scrape it, how to parse it, and a storage method for fast read write access. We will scrape Barchart.com's basic option quotes using **aiohttp** and **asyncio**, both are included in Python 3.6 standard library. We will parse it using **Pandas** and **Numpy** and store the data in the **HDF5** file format.

This is primarily an academic exercise. I have no intent to harm or cause others to harm Barchart.com or its vendors. My belief is that, by facilitating knowledge sharing, we will increase the number of educated participants in the options markets; thereby increasing the total addressable market for businesses like Barchart and its vendors. By designing tools like this we improve our own understanding of the use cases and applications (option valuation and trading) and can provide better feedback to those in the product development process.

First let's create a mental model of what AJAX really is.

So looking at this, we can say AJAX is a set of web development techniques to increase the efficiency and user experience during website interaction. For example, you go to a website with cool data tables on it. You want to change one of the filters on the data so you select the option you want and click. What happens from there?

In simply designed or older websites your request would be sent to the server, then to update the data table with your selected filters would require the server response to reload the entire page. This is inefficient for many reasons but one is that, often the element in need of updating is only a fraction of the entire webpage.

AJAX allows websites to send requests to the server and update page elements on an element by element basis negating the need for reloading the entire page every time you interact with the page.

This improvement in efficiency comes at the added cost of complexity, for web designers and developers and for web scrapers. Generally speaking the url you use to go to an AJAX page is not the actual url that gets sent to the server to load the page you view.

To build this understanding, let's look at a sample option quote page using the following link <https://www.barchart.com/stocks/quotes/spy/options>.

Warning: To follow along with the rest of this example you need access to developer mode in Chrome or its equivalent in other browsers.

Let's look behind the curtain so to speak. Click anywhere in the page and click inspect. Navigate to the Network tab in Chrome developer tools.

We're going to press F5 to reload the page and look for the following: Request Headers, and the Request URL.

We will need the **Request URL **and the **Request Headers** in order to construct our calls to the server a little later. Simply put, this is the secret! We can replicate our browser's behavior when it requests data from the server if we know the *actual *request url and the request headers. This will be made clearer in the next section.

This is the key module for scraping the data. First the imports.

```
import asyncio
import aiohttp
```

If you noticed when the page loads, it loads the nearest expiration date by default.

We know there are generally multiple expiration dates per symbol. However, some ETFs have weekly contracts, monthly, and/or quarterly. Instead of guessing the expiration dates, the **first_async_scraper** class scrapes the default pages so we can later extract the expiration dates directly from the page's JSON/dict response.

This class takes no initialization parameters.

```
# ================================================
# for first run only
class first_async_scraper:
def __init__(self):
pass
async def _fetch(self, symbol, url, session, headers):
"""fn: to retrieve option quotes as JSON
Params:
symbol : str(), ETF
url : str(), request url
session : aiohttp.ClientSession() object
headers : dict() containing header info
Returns:
response : JSON/Python Dict
"""
async with session.post(url.format(symbol), headers=headers) as response:
return await response.json(content_type=None)
async def run(self, symbols, user_agent):
"""fn: to aggregate response option quotes
Params:
symbols : list of str(), ETF symbols
user_agent : str()
Returns:
responses : list of JSON
"""
url = 'https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description'
headers = {
"Accept":"application/json",
"Accept-Encoding":"gzip, deflate, sdch, br",
"Accept-Language":"en-US,en;q = 0.8",
"Connection":"keep-alive",
"Host":"core-api.barchart.com",
"Origin":"https://www.barchart.com",
"Referer":"https://www.barchart.com/etfs-funds/quotes/{}/options",
"User-Agent":user_agent,
}
tasks = []
async with aiohttp.ClientSession() as session:
for symbol in symbols:
headers['Referer'] = headers['Referer'].format(symbol)
task = asyncio.ensure_future(self._fetch(symbol, url, session, headers))
tasks.append(task)
# gather returns responses in original order not arrival order
# https://docs.python.org/3/library/asyncio-task.html#task-functions
responses = await asyncio.gather(*tasks)
return responses
```

The workhorse function is **run** which calls the internal function **_fetch**. Inside the run function I've hardcoded a request url similar to the one we found before. I've also hardcoded the headers we found earlier as well. Notice both objects are string formats which can be dynamically updated with our ETF symbol.

The **_fetch** function takes the ETF symbol, the url string, session object, and our request headers and makes the call to the server returning the response as a JSON /dict object.

The **run **function takes a list of symbols, and a user agent string - *more on this later.*

The aiohttp package has a very similar interface to the requests module. We first create a **ClientSession **object which acts like a context manager. After creating the session object, we loop through each symbol using the **asyncio.ensure_future** function to create and schedule the event task. The **gather** function executes the tasks asynchronously waiting until all tasks have completed. It returns a list of JSON responses, each representing one ETF.

Once we have the list of responses we need to extract the expiry dates from each page source, collecting them for later use. The class is initialized with two parameters - a list of ETF symbols, and the list of page responses from the first scrape job.

It uses two functions. The internal function **_get_dict_expiry **takes a single response object and returns the list of expirations for a single symbol. The exposed function **get_expirys** loops through the list of ETFs and responses aggregating them into a dictionary. The dictionary keys are the ETF symbols and the values are lists of expirations for that symbol.

```
# ================================================
class expirys:
def __init__(self, ETFS, first_future_result):
"""Class to extract expiration data from Dict
Params:
ETFS : list of ETF symbol str()
first_future_result : list of response objects (dict/JSON) from the first scraper
"""
self.ETFS = ETFS
self.first_future_result = first_future_result
def _get_dict_expiry(self, response):
"""fn: to get expirations from response dict
Params:
response : dict/JSON object
Returns:
list() of date str(), "YYYY-MM-DD"
"""
if response['count'] == 0:
return None
else:
return response['meta']['expirations']
def get_expirys(self):
"""fn: to create dict with k, v = symbol, list of expirys
we have to do this b/c JSON/dict response data doesn't
contain symbol identifier
Returns:
dict(symbol = list of expiry dates)
"""
from itertools import zip_longest
expirys = {}
for symbol, resp in zip_longest(self.ETFS, self.first_future_result):
# we can do this because results are in order of submission not arrival
# gather returns responses in original order not arrival order
# https://docs.python.org/3/library/asyncio-task.html#task-functions
expirys[symbol] = self._get_dict_expiry(resp)
return expirys
```

The final scraper class is nearly identical to the **first_async_scraper** except for some additional arguments for the functions **xp_run()**, and **_xp_fetch()** to accept the expiry dates. Also notice that the hard coded URL in the **xp_run** function is slightly different in that it is formatted to accept the ETF symbol and an expiration date.

```
# ================================================
# async by url + expirations
class xp_async_scraper:
def __init__(self):
pass
async def _xp_fetch(self, symbol, expiry, url, session, headers):
"""fn: to retrieve option quotes as JSON
Params:
symbol : str(), ETF
expiry : str(), "YYYY-MM-DD"
url : str(), request url
session : aiohttp.ClientSession() object
headers : dict() containing header info
Returns:
response : JSON/Python Dict
"""
async with session.post(url.format(symbol, expiry), headers=headers) as response:
return await response.json(content_type=None)
async def xp_run(self, symbol, expirys, user_agent):
"""fn: to aggregate response option quotes
Params:
symbol : str(), ETF
expirys : list of date str() "YYYY-MM-DD"
user_agent : str()
Returns:
responses : list of JSON
"""
url = "https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&expirationDate={}&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description"
headers = {
"Accept":"application/json",
"Accept-Encoding":"gzip, deflate, sdch, br",
"Accept-Language":"en-US,en;q=0.8",
"Connection":"keep-alive",
"Host":"core-api.barchart.com",
"Origin":"https://www.barchart.com",
"Referer":"https://www.barchart.com/etfs-funds/quotes/{}/options",
"User-Agent":user_agent,
}
tasks = []
async with aiohttp.ClientSession() as session:
for expiry in expirys:
headers['Referer'] = headers['Referer'].format(symbol)
task = asyncio.ensure_future(self._xp_fetch(symbol, expiry, url, session, headers))
tasks.append(task)
# gather returns responses in original order not arrival order
# https://docs.python.org/3/library/asyncio-task.html#task-functions
responses = await asyncio.gather(*tasks)
return responses
```

This class has the same structure and form as the other scraper classes except slightly simpler. The purpose of this class is to simply retrieve the basic html source for each ETF so that we can later extract the last quote price for the underlying equity.

```
# ================================================
# async get html page source
class last_price_scraper:
def __init__(self):
pass
async def _fetch(self, symbol, url, session):
"""fn: to retrieve option quotes as JSON
Params:
symbol : str(), ETF
url : str(), request url
session : aiohttp.ClientSession() object
Returns:
response : text object
"""
async with session.get(url.format(symbol)) as response:
return await response.text()
async def run(self, symbols):
"""fn: to aggregate response option quotes
Params:
symbols : list of str(), ETF symbols
Returns:
responses : list of text
"""
url = 'https://www.barchart.com/stocks/quotes/{}/options'
tasks = []
async with aiohttp.ClientSession() as session:
for symbol in symbols:
task = asyncio.ensure_future(self._fetch(symbol, url, session))
tasks.append(task)
# gather returns responses in original order not arrival order
# https://docs.python.org/3/library/asyncio-task.html#task-functions
responses = await asyncio.gather(*tasks)
return responses
```

Once we have all the data we need to be able to parse it for easy analysis and storage. Fortunately this is relatively simple to do with Pandas. The **option_parser.py **module contains one class-**option_parser**, and three functions-**extract_last_price(), ****create_call_df(), create_put_df()**.

The **option_parser** class is initialized with an ETF symbol and the appropriate response object. The create dataframe functions extract the call/put data from the JSON/dict response, then iterates through each quote combining them into dataframes taking care to clean the data set and change the datatypes from objects to numeric/datetime where appropriate. The **extract_last_price **function is used to get the underlying quote price from the basic html source.

```
import pandas as pd
import numpy as np
# ================================================
class option_parser:
def __init__(self, symbol, response):
self.symbol = symbol
self.response = response
# ------------------------------------------------
# extract last price from html
def extract_last_price(self, html_text):
"""fn: extract price from html"""
reg_exp = r'(?<="lastPrice":)(\d{1,3}.{1}\d{2})'
prices = re.findall(reg_exp, html_text)
if len(prices) < 1:
return np.nan
else:
return float(prices[0])
# ------------------------------------------------
# create call df
def create_call_df(self):
"""fn: to create call df"""
json_calls = self.response['data']['Call']
list_dfs = []
for quote in json_calls:
list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
df = (
pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
.replace('NA', np.nan)
.apply(pd.to_numeric, errors='ignore')
.assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
)
df['symbol'] = [self.symbol] * len(df.index)
return df
# ------------------------------------------------
# create put df
def create_put_df(self):
"""fn: to create put df"""
json_puts = self.response['data']['Put']
list_dfs = []
for quote in json_puts:
list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
df = (
pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
.replace('NA', np.nan)
.apply(pd.to_numeric, errors='ignore')
.assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
)
df['symbol'] = [self.symbol] * len(df.index)
return df
```

Finally we can combine the modules into a script and run it. Note that this script requires the **fake-useragent **package. This package has a nice feature where it generates a random user agent string on every call. We need to do this so our requests are not blocked by the server.

The script imports a list of ETF symbols originally sourced from Nasdaq. Some of these symbols don't have options data, so they are filtered out. The script runs in the following order: basic html scraper -> first async scraper -> extracts the expiry dates -> xp async scraper which aggregates all the option data -> parses the collected data into a dataframe format -> downloads and inserts any missing underlying prices -> then saves it to disk as an HDF5 file.

```
import os
import sys
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import time
import asyncio
from fake_useragent import UserAgent
'''set path variables'''
project_dir = "YOUR/PROJECT/DIR"
sys.path.append(project_dir)
import async_option_scraper
import option_parser
# ================================================
today = pd.datetime.today().date()
# ================================================
file_start = time.time()
print('\nAsync Barchart Scraper starting...')
# --------------- \\\
# import symbols
FILE = project_dir + 'ETFList.Options.Nasdaq__M.csv'
ALL_ETFS = pd.read_csv(FILE)['Symbol']
drop_symbols = ['ADRE', 'AUNZ', 'CGW', 'DGT', 'DSI', 'EMIF', 'EPHE', 'EPU', 'EUSA', 'FAN', 'FDD', 'FRN', 'GAF', 'GII', 'GLDI', 'GRU', 'GUNR', 'ICN', 'INXX', 'IYY', 'KLD', 'KWT', 'KXI', 'MINT', 'NLR', 'PBP', 'PBS', 'PEJ', 'PIO', 'PWB', 'PWV', 'SCHO', 'SCHR', 'SCPB', 'SDOG', 'SHM', 'SHV', 'THRK', 'TLO', 'UHN', 'USCI', 'USV', 'VCSH']
ETFS = [x for x in ALL_ETFS if x not in set(drop_symbols)]
# ================================================
# GET HTML SOURCE FOR LAST SYMBOL EQUITY PRICE
# ================================================
t0_price = time.time()
# --------------- \\\
loop = asyncio.get_event_loop()
px_scraper = async_option_scraper.last_price_scraper()
px_run_future = asyncio.ensure_future(px_scraper.run(ETFS))
loop.run_until_complete(px_run_future)
px_run = px_run_future.result()
# ------------- ///
duration_price = time.time() - t0_price
print('\nprice scraper script run time: ',
pd.to_timedelta(duration_price, unit='s'))
# ------------- ///
# create price dictionary
px_dict = {}
for k, v in zip(ETFS, px_run):
px_dict[k] = v
# ================================================
# RUN FIRST ASYNC SCRAPER
# ================================================
t0_first = time.time()
# --------------- \\\
ua = UserAgent()
loop = asyncio.get_event_loop()
first_scraper = async_option_scraper.first_async_scraper()
first_run_future = asyncio.ensure_future(
first_scraper.run(ETFS, ua.random)
)
loop.run_until_complete(first_run_future)
first_run = first_run_future.result()
# ------------- ///
first_duration = time.time() - t0_first
print('\nfirst async scraper script run time: ',
pd.to_timedelta(first_duration, unit='s'))
# ================================================
# EXTRACT EXPIRYS FROM FIRST RUN SCRAPER
# ================================================
xp = async_option_scraper.expirys(ETFS, first_run)
expirys = xp.get_expirys()
# ================================================
# SCRAPE AND AGGREGATE ALL SYMBOLS BY EXPIRY
# ================================================
t0_xp = time.time()
# -------------- \\\
# dict key=sym, values=list of json data by expiry
# create helper logic to test if expirys is None before passing
sym_xp_dict = {}
ua = UserAgent()
xp_scraper = async_option_scraper.xp_async_scraper()
for symbol in ETFS:
print()
print('-'*50)
print('scraping: ', symbol)
if not expirys[symbol]:
print('symbol ' + symbol + ' missing expirys')
continue
try:
xp_loop = asyncio.get_event_loop()
xp_future = asyncio.ensure_future(
xp_scraper.xp_run(symbol, expirys[symbol], ua.random)
)
xp_loop.run_until_complete(xp_future)
sym_xp_dict[symbol] = xp_future.result()
except Exception as e:
print(symbol + ' error: ' + e)
# ------------- ///
duration_xp = time.time() - t0_xp
print('\nall async scraper script run time: ',
pd.to_timedelta(duration_xp, unit='s'))
# ================================================
# PARSE ALL COLLECTED DATA
# ================================================
t0_agg = time.time()
# -------------- \\\
all_etfs_data = []
for symbol, xp_list in sym_xp_dict.items():
print()
print('-'*50)
print('parsing: ', symbol)
list_dfs_by_expiry = []
try:
for i in range(len(xp_list)):
try:
parser = option_parser.option_parser(
symbol, xp_list[i])
call_df = parser.create_call_df()
put_df = parser.create_put_df()
concat = pd.concat([call_df, put_df], axis=0)
concat['underlyingPrice'] = np.repeat(
parser.extract_last_price(px_dict[symbol]),
len(concat.index))
list_dfs_by_expiry.append(concat)
except: continue
except Exception as e:
print(f'symbol: {symbol}\n error: {e}')
print()
continue
all_etfs_data.append(pd.concat(list_dfs_by_expiry, axis=0))
# ------------- ///
duration_agg = time.time() - t0_agg
print('\nagg parse data script run time: ',
pd.to_timedelta(duration_agg, unit='s'))
# -------------- \\\
dfx = pd.concat(all_etfs_data, axis=0).reset_index(drop=True)
print(dfx.info())
# ------------- ///
# ================================================
# GET ANY MISSING UNDERLYING PRICE
# ================================================
print('\nCollecting missing prices...')
grp = dfx.groupby(['symbol'])['underlyingPrice'].count()
missing_symbol_prices = grp[grp == 0].index
get_price = lambda symbol: web.DataReader(
symbol, 'google', today)['Close']
prices = []
for symbol in missing_symbol_prices:
px = get_price(symbol).iloc[0]
prices.append((symbol, px))
df_prices = pd.DataFrame(prices).set_index(0)
for symbol in df_prices.index:
(dfx.loc[dfx['symbol'] == symbol,
['underlyingPrice']]) = df_prices.loc[symbol].iloc[0]
dfx['underlyingPrice'] = dfx.underlyingPrice.astype(float)
print('\nmissing prices added')
# ================================================
# store dataframe as hdf
# ================================================
print(dfx.head(20))
print(dfx.info())
file_duration = time.time() - file_start
print('\nfile script run time: ', pd.to_timedelta(file_duration, unit='s'))
file_ = project_dir + f'/ETF_options_data_{today}.h5'
dfx.to_hdf(file_, key='data', mode='w')
# ================================================
# kill python process after running script
# ================================================
time.sleep(2)
os.kill(os.getpid(), 9)
```

Here's some sample output:

UPDATE: Here is the list of Nasdaq ETF symbols for download <ETF Symbol List CSV>

- Wikipedia.org - AJAX definition
- W3Schools.com - AJAX introduction
- Making 1 Million Requests with Python-aiohttp via https://pawelmhm.github.io - Great article on implementing asyncio with aiohttp
- Barchart.com - "Barchart, the leading provider of market data solutions for individuals and businesses."

**Recap****Webinar Hypothesis****Anaylsis/Conclusions****Jupyter (IPython) Notebook****Github Links and Resources**

Thus far in the series we've explored the idea of using Gaussian mixture models (GMM) to predict outlier returns. Specifically, we were measuring two things:

- The accuracy of the strategy implementation in predicting return distributions.
- The return pattern after an outlier event.

During the exploratory phase of this project there were some interesting results worthy of more investigation. The initial results implied that the strategy implementation was adaptable to changes in the means and volatilities of a small number of ETF's returns.

Recently I had the opportunity to present my first webinar with QuantInsti.com. I definitely have some areas for improvement, but the experience was great overall, and I learned a lot.

I chose this topic to present, and through the process I was able to refine the hypothesis, the code, and my thinking on the subject. The hypothesis is simple:

Can a GMM based strategy predict asset return distributions such that a strategy which "buys" the asset post an outlier event can "earn" a positive return?

There were a couple of takeaways from the project. Overall the strategy showed promise. What really impressed me was the difference in the sampled confidence intervals when using the Normal distribution vs. the JohnsonSU distribution. See the following example:

On the left, we have the same strategy except the sampled confidence intervals are drawn from a normal distribution. On the right we use the JohnsonSU distribution. In terms of predicted return distribution accuracy it's not even close-JohnsonSU is the clear winner, even showing an ability to adjust to periods of clustered volatility.

However note the equity curves in the example. The normal distribution wins handily but that is because the strategy is so inaccurate that it predicts outlier returns occurred ~97% of the time, so technically that would be a buy and hold strategy which benefits from the strong uptrend in SPY post 2009.

Another takeaway is that the model shows a bias towards US based ETFs. You can see that by examining the Seaborn facetgrid plots in the notebook I will share at the end. First, by aggregating the results in to a **tidy-data **format the analysis was rendered so simple, I kicked myself for not adhering to these principles sooner. In the examples I examine the strategy results according to median returns and the sum_ratio.

Median returns are simply the median returns of the strategy for that set of parameters. The sum_ratio is the sum of all strategy returns that ended positively divided by the sum of all returns that ended negatively for a set of parameters. A "successful" strategy should have a sum_ratio > 1 across multiple dimensions as well as consistent positive median returns.

In the analysis I look at the two metrics across different lookback periods (1 year, 3 year, and expanding), different numbers of mixture model components (k=2, 3, 5, 7, 9, 13, 17, 21) and across a number of holding periods in days (steps = 1, 2, 3, 5, 7, 10, 21).

When applied to SPY, QQQ, and TLT the strategy showed consistent positive results across a wide spectrum of parameter combinations whereas the application to GLD, EFA, and EEM were a little more mixed and definitely not as encouraging.

One theory I have for this result is that the factors I used as input to the GMM are US based interest rate spreads. These are likely to have a much stronger relationship to the behavior of SPY, QQQ, TLT vs the other ETFs. To improve performance I believe one would have to locate indicators based on the asset/ETF one wants to trade.

To sum up, I'm encouraged by the strategy framework, but would like to see a wider array of stocks, asset classes, and ETFs tested with various combinations of factors.

Here is a sample exploratory notebook I put together for the webinar that demonstrates the conclusions drawn above.

- Recap
- Model Update
- Model Testing
- Model Results
- Conclusions
- Code

In the previous post I gave a basic "proof" of concept, where we designed a trading strategy using Sklearn's implementation of Gaussian mixture models. The strategy attempts to predict an asset's return distribution such that returns that fall outside the predicted distribution are considered *outliers* and likely to mean revert. It showed some promise but had many areas in need of improvement.

In this version I've refactored a lot of the code into a more object oriented structure. Now the code uses three classes.

- ModelRunner() class - This is the class for executing the model and returning our prediction dataframe and some key parameters.
- ResultEval() class - This takes the data from the prediction dataframe and key parameters and outputs our strategy returns and summary information.
- ModelPlots() class - This takes our data and outputs key plots to help visualize the strategy performance.

I did this for several reasons.

- Reduce the likelihood of input errors by creating objects that share parameters.
- Increase the ease of model testing.
- Increase interpretability.

In this version, we are going to expand the analysis to include other, actively traded ETFs, and test the reproducibility of the results, and generalization ability of the model.

Here are the ETFs we will examine:

symbols = ['SPY', 'DIA', 'QQQ', 'GLD', 'TLT', 'EEM', 'ACWI']

Assuming the correct imports, with the refactored code we can run the model in the following fashion. We'll focus on the **TOO_LOW** events although I encourage readers to experiment with both.

```
# Project Directory
DIR = 'YOUR/PROJECT/DIRECTORY/'
# get fed data
f1 = 'TEDRATE' # ted spread
f2 = 'T10Y2Y' # constant maturity ten yer - 2 year
f3 = 'T10Y3M' # constant maturity 10yr - 3m
factors = [f1, f2, f3]
ft_cols = factors + ['lret']
start = pd.to_datetime('2002-01-01')
end = pd.to_datetime('2017-01-01')
symbols = ['SPY', 'DIA', 'QQQ', 'GLD', 'TLT', 'EEM', 'ACWI']
for mkt in symbols:
data = get_mkt_data(mkt, start, end, factors)
# Model Params
# ------------
a, b = (.2, .7) # found via coarse parameter search
alpha = 0.99
max_iter = 100
k = 2 # n_components
init = 'random' # or 'kmeans'
nSamples = 2_000
year = 2009 # cutoff
lookback = 1 # years
step_fwd = 5 # days
MR = ModelRunner(data, ft_cols, k, init, max_iter)
dct = MR.prediction_cycle(year, alpha, a, b, nSamples)
res = ResultEval(dct, step_fwd=step_fwd)
event_dict = res._get_event_states()
event = list(event_dict.keys())[1] # TOO_LOW
post_events = res.get_post_events(event_dict[event])
end_vals = res.get_end_vals(post_events)
smry = res.create_summary(end_vals)
p()
p('*'*25)
p(mkt, event.upper())
p(smry.T)
mp = ModelPlots(mkt, post_events, event, DIR, year)
mp.plot_pred_results(dct['pred'], dct['year'], dct['a'], dct['b'])
mp.plot_equity_timeline()
```

In this post I'm going to skip to the results and conclusions, and provide the refactored code at the end.

First let's look at the model results using SPY.

The first thing I noticed was that the confidence intervals were less responsive to increases in return volatility. The difference shows up in the reduction in accuracy. In Part 1, I believe the accuracy was ~71% whereas in the updated model the accuracy has dipped to ~68%! Does that hurt our strategy?

Judging by the equity curve, our strategy is not noticeably impacted by the reduced model accuracy!

The plotted equity curve is the cumulative sum of each event's returns assuming every event was a "trade". This should include overlapping events.

Let's look at the model results for the other ETFs.

The model has some interesting output. Notice that model accuracy ranges from ~57% (TLT) to ~83% (EEM). However, both of these equity curves end positively. GLD is distinctly volatile, and ends poorly, however the model was 75% accurate. DIA, QQQ, SPY, and ACWI all have stable sharply positive equity curves.

This supports my initial findings that model accuracy seems loosely, if at all, related to the strategy's equity curve. These results do indicate that the strategy is worth further evaluation but I'm hesitant to declare success.

I need to test the strategy over a longer period of time and make sure to include 2008/9. Also, I need to drill down into evaluating the strategy results vs the correlation of asset returns. For example, DIA, QQQ, and SPY are highly correlated, so we would expect the strategy to have similar results among those ETFs, but what about negatively and uncorrelated assets? TLT is generally negatively correlated with SPY while GLD is likely uncorrelated. Is the strategy performance for those two ETFs representative of other negatively/uncorrelated ETFs?

```
%load_ext watermark
%watermark
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import sklearn.mixture as mix
import scipy.stats as scs
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import missingno as msno
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
import affirm
sns.set(font_scale=1.25)
style_kwds = {'xtick.major.size': 3, 'ytick.major.size': 3,
'font.family':u'courier prime code', 'legend.frameon': True}
sns.set_style('white', style_kwds)
p=print
p()
%watermark -p pandas,pandas_datareader,numpy,scipy,sklearn,matplotlib,seaborn
# **********************************************************************
def get_mkt_data(mkt, start, end, factors):
"""Function to get benchmark data from
Yahoo and Factor data from FRED
Params:
mkt : str(), symbol
start : pd.DateTime()
end : pd.DateTime()
factors : list() of str()
Returns:
data : pd.DataFrame()
"""
MKT = (web.DataReader([mkt], 'yahoo', start, end)['Adj Close']
.rename(columns={mkt:mkt})
.assign(lret=lambda x: np.log(x[mkt]/x[mkt].shift(1)))
.dropna())
data = (web.DataReader(factors, 'fred', start, end)
.join(MKT, how='inner')
.dropna())
return data
# **********************************************************************
class ModelRunner():
def __init__(self, *args, **kwargs):
"""Class to run mixture model model
Params:
data : pd.DataFrame()
ft_cols : list() of feature columns str()
k : int(), n_components
max_iter : int(), max iterations
init : str() {random, kmeans}
"""
self.data = data
self.ft_cols = ft_cols
self.k = k
self.max_iter = max_iter
self.init = init
np.random.seed(123457) # make results reproducible
def _run_model(self, bgm=None, **kwargs):
"""Function to run mixture model
Params:
data : pd.DataFrame()
ft_cols : list of str()
k : int(), n_components
max_iter : int()
init : str() {random, kmeans}
Returns:
model : sklearn model object
hidden_states : array-like, hidden states
"""
X = self.data[self.ft_cols].values
if bgm:
model = mix.BayesianGaussianMixture(n_components=self.k,
max_iter=self.max_iter,
init_params=self.init,
**kwargs,
).fit(X)
else:
model = mix.GaussianMixture(n_components=self.k,
max_iter=self.max_iter,
init_params=self.init,
**kwargs,
).fit(X)
hidden_states = model.predict(X)
return model, hidden_states
def _get_state_est(self, model, hidden_states):
"""Function to return estimated state mean and state variance
Params:
model : sklearn model object
hidden_states : {array-like}
Returns:
mr_i : mean return of last estimated state
mvar_i : model variance of last estimated state
"""
# get last state
last_state = hidden_states[-1]
# last value is mean return for ith state
mr_i = model.means_[last_state][-1]
mvar_i = np.diag(model.covariances_[last_state])[-1]
return mr_i, mvar_i
def _get_ci(self, mr_i, mvar_i, alpha, a, b, nSamples):
"""Function to sample confidence intervals
from the JohnsonSU distribution
Params:
mr_i : float()
mvar_i : float()
alpha : float()
a : float()
b : float()
nsamples : int()
Returns:
ci : tuple(float(), float()), (low_ci, high_ci)
"""
rvs_ = scs.johnsonsu.rvs(a, b, loc=mr_i, scale=mvar_i, size=nSamples)
ci = scs.johnsonsu.interval(alpha=alpha, a=a, b=b,
loc=np.mean(rvs_), scale=np.std(rvs_))
return ci
def prediction_cycle(self, *args, **kwargs):
"""Function to make walk forward predictions from cutoff year onwards
Params:
year : int(), cutoff year
alpha : float()
a : float()
b : float()
nsamples : int()
Returns:
dict() :
pred : pd.DataFrame()
year : str()
a, b : float(), float()
"""
cutoff = year
train_df = self.data.ix[str(cutoff - lookback):str(cutoff)].dropna()
oos = self.data.ix[str(cutoff+1):].dropna()
# confirm that train_df end index is different than oos start index
assert train_df.index[-1] != oos.index[0]
# create pred list to hold tuple rows
preds = []
for t in tqdm(oos.index):
if t == oos.index[0]:
insample = train_df
# run model func to return model object and hidden states using params
model, hstates = self._run_model(**kwargs)
# get hidden state mean and variance
mr_i, mvar_i = self._get_state_est(model, hstates)
# get confidence intervals from sampled distribution
low_ci, high_ci = self._get_ci(mr_i, mvar_i, alpha, a, b, nSamples)
# append tuple row to pred list
preds.append((t, hstates[-1], mr_i, mvar_i, low_ci, high_ci))
# increment insample dataframe
insample = data.ix[:t]
cols = ['ith_state', 'ith_ret', 'ith_var', 'low_ci', 'high_ci']
pred = (pd.DataFrame(preds, columns=['Dates']+cols)
.set_index('Dates').assign(tgt = oos['lret']))
# logic to see if error exceeds neg or pos CI
pred_copy = pred.copy().reset_index()
# Identify indices where target return falls between CI
win = pred_copy.query("low_ci < tgt < high_ci").index
# create list of binary variables representing in/out CI
in_rng_list = [1 if i in win else 0 for i in pred_copy.index]
# assign binary variables sequence to new column
pred['in_rng'] = in_rng_list
return {'pred':pred, 'year':year, 'a':a, 'b':b}
# **********************************************************************
class ResultEval():
def __init__(self, data, step_fwd):
"""Class to evaluate prediction results
Params:
data : dict() containing results of ModelRunner()
step_fwd : int(), number of days to evalute post event
"""
self.df = data['pred'].copy().reset_index()
self.step_fwd=step_fwd
def _get_event_states(self):
"""Function to get event indexes
Index bjects must be called 'too_high', 'too_low'
Returns:
dict() : values are index objects
"""
too_high = self.df.query("tgt > high_ci").index
too_low = self.df.query("tgt < low_ci").index
return {'too_high':too_high, 'too_low':too_low}
def get_post_events(self, event):
"""Function to return dictionary where key, value is integer
index, and Pandas series consisting of returns post event
Params:
df : pd.DataFrame(), prediction df
event : {array-like}, index of target returns that exceed CI high or low
step_fwd : int(), how many days to include after event
Returns:
after_event : dict() w/ values = pd.Series()
"""
after_event = {}
for i in range(len(event)):
tmp_ret = self.df.ix[event[i]:event[i]+self.step_fwd, ['Dates','tgt']]
# series of returns with date index
after_event[i] = tmp_ret.set_index('Dates', drop=True).squeeze()
return after_event
def get_end_vals(self, post_events):
"""Function to sum and agg each post events' returns"""
end_vals = []
for k in post_events.keys():
tmp = post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
end_vals.append(tmp.sum())
return end_vals
def create_summary(self, end_vals):
"""Function to take ending values and calculate summary
Will fail if count of ending values (>0) or (<0) is less than 1
"""
gt0 = [x for x in end_vals if x>0]
lt0 = [x for x in end_vals if x<0]
assert len(gt0) > 1
assert len(lt0) > 1
summary = (pd.DataFrame(index=['value'])
.assign(mean = f'{np.mean(end_vals):.4f}')
.assign(median = f'{np.median(end_vals):.4f}')
.assign(max_ = f'{np.max(end_vals):.4f}')
.assign(min_ = f'{np.min(end_vals):.4f}')
.assign(gt0_cnt = f'{len(gt0):d}')
.assign(lt0_cnt = f'{len(lt0):d}')
.assign(sum_gt0 = f'{sum(gt0):.4f}')
.assign(sum_lt0 = f'{sum(lt0):.4f}')
.assign(sum_ratio = f'{sum(gt0) / abs(sum(lt0)):.4f}')
.assign(gt_pct = f'{len(gt0) / (len(gt0) + len(lt0)):.4f}')
.assign(lt_pct = f'{len(lt0) / (len(gt0) + len(lt0)):.4f}')
)
return summary
# **********************************************************************
class ModelPlots():
def __init__(self, mkt, post_events, event_state, project_dir, year):
"""Class to visualize prediction results and summary
Params:
mkt : str(), symbol
post_events : dict() of pd.Series()
event_state : str(), 'too_high', 'too_low'
project_dir : str()
year : int(), cutoff year
"""
self.mkt = mkt
self.post_events = post_events
self.event_state = event_state
self.DIR = project_dir
self.year = year
def plot_equity_timeline(self):
"""Function to plot event timeline with equity curve second axis"""
agg_tmp = []
fig, ax = plt.subplots(figsize=(10, 7))
ax1 = ax.twinx()
ax.axhline(y=0, color='k', lw=3)
for k in self.post_events.keys():
tmp = self.post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
agg_tmp.append(tmp)
if tmp.sum() > 0: color = 'dodgerblue'
else: color = 'red'
ax.plot(tmp.index, tmp.cumsum(), color=color, alpha=0.5)
ax.set_xlim(pd.to_datetime(str(self.year) + '-12-31'), tmp.index[-1])
ax.set_xlabel('Dates')
ax.set_title(f"{self.mkt} {self.event_state.upper()}", fontsize=16)
#sns.despine(offset=2)
agg_df = pd.concat(agg_tmp).cumsum()
ax1.plot(agg_df.index, agg_df.values, color='k', lw=5)
ax.set_ylabel('Event Returns')
ax1.set_ylabel('Equity Curve')
fig.savefig(self.DIR + f'{self.mkt} {self.event_state.upper()} post events timeline {pd.datetime.today()}.png', dpi=300)
return
def plot_events_timeline(self):
"""Function to plot even timeline only"""
fig, ax = plt.subplots(figsize=(10, 7))
ax.axhline(y=0, color='k', lw=3)
for k in self.post_events.keys():
tmp = self.post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
if tmp.sum() > 0: color = 'dodgerblue'
else: color = 'red'
ax.plot(tmp.index, tmp.cumsum(), color=color, alpha=0.5)
ax.set_xlim(pd.to_datetime('2009-12-31'), tmp.index[-1])
ax.set_xlabel('Dates')
ax.set_title(f"{self.mkt} {self.event_state.upper()}", fontsize=16, fontweight='demi')
sns.despine(offset=2)
fig.savefig(self.DIR + f'{self.mkt} {self.event_state.upper()} post events timeline.png', dpi=300)
return
def plot_events_post(self):
"""Function to plot events from zero until n days after"""
fig, ax = plt.subplots(figsize=(10, 7))
ax.axhline(y=0, color='k', lw=3)
for k in self.post_events.keys():
tmp = self.post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
if tmp.sum() > 0: color = 'dodgerblue'
else: color = 'red'
tmp.cumsum().reset_index(drop=True).plot(color=color, alpha=0.5, ax=ax)
ax.set_xlabel('Days')
ax.set_title(f"{self.mkt} {self.event_state.upper()}", fontsize=16, fontweight='demi')
sns.despine(offset=2)
fig.savefig(self.DIR + f'{self.mkt} {self.event_state.upper()} post events.png', dpi=300)
return
def plot_distplot(self, ending_values, summary):
"""Function to plot histogram of ending values"""
colors = sns.color_palette('RdYlBu', 4)
fig, ax = plt.subplots(figsize=(10, 7))
sns.distplot(pd.DataFrame(ending_values), bins=15, color=colors[0],
kde_kws={"color":colors[3]}, hist_kws={"color":colors[3], "alpha":0.35}, ax=ax)
ax.axvline(x=float(summary['mean'][0]), label='mean', color='dodgerblue', lw=3, ls='-.')
ax.axvline(x=float(summary['median'][0]), label='median', color='red', lw=3, ls=':')
ax.axvline(x=0, color='black', lw=1, ls='-')
ax.legend(loc='best')
sns.despine(offset=2)
ax.set_title(f"{self.mkt} {self.event_state.upper()}", fontsize=16, fontweight='demi')
fig.savefig(self.DIR + f'{self.mkt} {self.event_state.upper()} distplot.png', dpi=300)
return
def plot_pred_results(self, df, year, a, b):
"""Function to plot prediction results and confidence intervals"""
# colorblind safe palette http://colorbrewer2.org/
colors = sns.color_palette('RdYlBu', 4)
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(df.index, df.tgt, c=[colors[1] if x==1 else colors[0] for x in df['in_rng']], alpha=0.85)
df['high_ci'].plot(ax=ax, alpha=0.65, marker='.', color=colors[2])
df['low_ci'].plot(ax=ax, alpha=0.65, marker='.', color=colors[3])
ax.set_xlim(df.index[0], df.index[-1])
nRight = df.query('in_rng==1').shape[0]
accuracy = nRight / df.shape[0]
ax.set_title('{:^10}\ncutoff year: {} | accuracy: {:2.3%} | errors: {} | a={}, b={}'
.format(self.mkt, year, accuracy, df.shape[0] - nRight, a, b))
in_ = mpl.lines.Line2D(range(1), range(1), color="white", marker='o', markersize=10, markerfacecolor=colors[1])
out_ = mpl.lines.Line2D(range(1), range(1), color="white", marker='o', markersize=10, markerfacecolor=colors[0])
hi_ci = mpl.lines.Line2D(range(1), range(1), color="white", marker='.', markersize=15, markerfacecolor=colors[2])
lo_ci = mpl.lines.Line2D(range(1), range(1), color="white", marker='.', markersize=15, markerfacecolor=colors[3])
leg = ax.legend([in_, out_, hi_ci, lo_ci],["in", "out", 'high_ci', 'low_ci'],
loc = "center left", bbox_to_anchor = (1, 0.85), numpoints = 1)
sns.despine(offset=2)
file_str = self.DIR+f'{self.mkt} prediction success {pd.datetime.today()}.png'
fig.savefig(file_str, dpi=300, bbox_inches="tight")
return
```

]]>**Recap****Hypothesis****Strategy****Conclusion****Caveats and Areas of Exploration****References**

In Part 1 we learned about Hidden Markov Models and their application using a toy example involving a lazy pet dog. In Part 2 we learned about the expectation-maximization algorithm, K-Means, and how Mixture Models improve on K-Means weaknesses. If you still have some questions or fuzzy understanding about these topics, I would recommend reviewing the prior posts. In those posts I also provide links to resources that really helped my understanding.

Given what we know about Mixture Models and their ability to characterize general distributions, can we use it to model a return series, such that we can identify **outlier** returns that are likely to mean revert?

**This strategy attempts to predict an asset's return distribution**. Actual returns that fall outside the predicted confidence intervals are considered ** outliers** and likely to revert to the mean.

We first fit a Gaussian Mixture Model to the historical daily return series. We use the model's estimate of the hidden state's mean and variance as parameters to a random sampling from the JohnsonSU distribution. We then calculate confidence intervals from the sampled distribution.

From there we evaluate model accuracy and the n days cumulative returns after each outlier event. We compute some summary statistics and try to answer the hypothesis.

Searching the net I found a useful bit of code from this site. Instead of assuming our asset return distribution is normal, we can use Python and Scipy.stats to find the brute force answer. We can cycle through each continuous distribution and run a goodness-of-fit procedure called the KS-test. The KS-test is a non-parametric method which examines the distance between a *known *cumulative distribution function and the CDF of the your sample data. The KS-test outputs the probability that your sample data comes from the benchmark distribution.

```
# code sample from:
# http://www.aizac.info/simple-check-of-a-sample-against-80-distributions/
cdfs = [
"norm", #Normal (Gaussian)
"alpha", #Alpha
"anglit", #Anglit
"arcsine", #Arcsine
"beta", #Beta
"betaprime", #Beta Prime
"bradford", #Bradford
"burr", #Burr
"cauchy", #Cauchy
"chi", #Chi
"chi2", #Chi-squared
"cosine", #Cosine
"dgamma", #Double Gamma
"dweibull", #Double Weibull
"erlang", #Erlang
"expon", #Exponential
"exponweib", #Exponentiated Weibull
"exponpow", #Exponential Power
"fatiguelife", #Fatigue Life (Birnbaum-Sanders)
"foldcauchy", #Folded Cauchy
"f", #F (Snecdor F)
"fisk", #Fisk
"foldnorm", #Folded Normal
"frechet_r", #Frechet Right Sided, Extreme Value Type II
"frechet_l", #Frechet Left Sided, Weibull_max
"gamma", #Gamma
"gausshyper", #Gauss Hypergeometric
"genexpon", #Generalized Exponential
"genextreme", #Generalized Extreme Value
"gengamma", #Generalized gamma
"genlogistic", #Generalized Logistic
"genpareto", #Generalized Pareto
"genhalflogistic", #Generalized Half Logistic
"gilbrat", #Gilbrat
"gompertz", #Gompertz (Truncated Gumbel)
"gumbel_l", #Left Sided Gumbel, etc.
"gumbel_r", #Right Sided Gumbel
"halfcauchy", #Half Cauchy
"halflogistic", #Half Logistic
"halfnorm", #Half Normal
"hypsecant", #Hyperbolic Secant
"invgamma", #Inverse Gamma
"invnorm", #Inverse Normal
"invweibull", #Inverse Weibull
"johnsonsb", #Johnson SB
"johnsonsu", #Johnson SU
"laplace", #Laplace
"logistic", #Logistic
"loggamma", #Log-Gamma
"loglaplace", #Log-Laplace (Log Double Exponential)
"lognorm", #Log-Normal
"lomax", #Lomax (Pareto of the second kind)
"maxwell", #Maxwell
"mielke", #Mielke's Beta-Kappa
"nakagami", #Nakagami
"ncx2", #Non-central chi-squared
# "ncf", #Non-central F
"nct", #Non-central Student's T
"norm" # Normal
"pareto", #Pareto
"powerlaw", #Power-function
"powerlognorm", #Power log normal
"powernorm", #Power normal
"rdist", #R distribution
"reciprocal", #Reciprocal
"rayleigh", #Rayleigh
"rice", #Rice
"recipinvgauss", #Reciprocal Inverse Gaussian
"semicircular", #Semicircular
"t", #Student's T
"triang", #Triangular
"truncexpon", #Truncated Exponential
"truncnorm", #Truncated Normal
"tukeylambda", #Tukey-Lambda
"uniform", #Uniform
"vonmises", #Von-Mises (Circular)
"wald", #Wald
"weibull_min", #Minimum Weibull (see Frechet)
"weibull_max", #Maximum Weibull (see Frechet)
"wrapcauchy", #Wrapped Cauchy
"ksone", #Kolmogorov-Smirnov one-sided (no stats)
"kstwobign"] #Kolmogorov-Smirnov two-sided test for Large N
sample = data['lret'].values
for cdf in cdfs:
try:
#fit our data set against every probability distribution
parameters = eval("scs."+cdf+".fit(sample)");
#Applying the Kolmogorov-Smirnof one sided test
D, p = scs.kstest(sample, cdf, args=parameters);
#pretty-print the results
D = round(D, 5)
p = round(p, 5)
#pretty-print the results
print (cdf.ljust(16) + ("p: "+str(p)).ljust(25)+"D: "+str(D));
except: continue
```

After running this code you should see output similar to the below code. For simplicity sake, just remember the higher the p-value, the more confident the ks-test is that our data came from the given distribution.

I had never heard of the Johnson SU distribution before this code. I had to research it, and I found that the Johnson SU was developed to in order to apply the established methods and theory of the normal distribution to non-normal data sets. What gives it this flexibility is the two shape parameters, gamma and delta, or a, b in Scipy. For more information I recommend this Wolfram reference link and this Scipy.stats link.

After selecting the distribution we can code up the experiment.

```
# import packages
# code was done in Jupyter Notebook
%load_ext watermark
%watermark
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import sklearn.mixture as mix
import scipy.stats as scs
import matplotlib as mpl
import matplotlib.pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
%matplotlib inline
import seaborn as sns
import missingno as msno
from tqdm import tqdm
import warnings
warnings.filterwarnings("ignore")
sns.set(font_scale=1.25)
style_kwds = {'xtick.major.size': 3, 'ytick.major.size': 3,
'font.family':u'courier prime code', 'legend.frameon': True}
sns.set_style('white', style_kwds)
p=print
p()
%watermark -p pandas,pandas_datareader,numpy,scipy,sklearn,matplotlib,seaborn
```

Now let's get some data.

```
# get fed data
f1 = 'TEDRATE' # ted spread
f2 = 'T10Y2Y' # constant maturity ten yer - 2 year
f3 = 'T10Y3M' # constant maturity 10yr - 3m
start = pd.to_datetime('2002-01-01')
end = pd.to_datetime('2017-01-01')
mkt = 'SPY'
MKT = (web.DataReader([mkt], 'yahoo', start, end)['Adj Close']
.rename(columns={mkt:mkt})
.assign(lret=lambda x: np.log(x[mkt]/x[mkt].shift(1)))
.dropna())
data = (web.DataReader([f1, f2, f3], 'fred', start, end)
.join(MKT, how='inner')
.dropna())
p(data.head())
# gives us a quick visual inspection of the data
msno.matrix(data)
```

Now we create our convenience functions. The first is the **run_model()** function which takes the data, feature columns, and Sklearn mixture parameters to produce a fitted model object and the predicted hidden states. Note that you can use a Bayesian Gaussian mixture if you so choose. The difference between the two models is that the Bayesian mixture model will try to derive the correct number of mixture components up to a chosen maximum. For more information on the Bayesian mixture model I recommend consulting the Sklearn docs.

```
def _run_model(df, ft_cols, k, max_iter, init, bgm=None, **kwargs):
"""Function to run mixture model
Params:
df : pd.DataFrame()
ft_cols : list of str()
k : int(), n_components
max_iter : int()
init : str() {random, kmeans}
Returns:
model : sklearn model object
hidden_states : array-like, hidden states
"""
X = df[ft_cols].values
if bgm:
model = mix.BayesianGaussianMixture(n_components=k,
max_iter=max_iter,
init_params=init,
**kwargs,
).fit(X)
else:
model = mix.GaussianMixture(n_components=k,
max_iter=max_iter,
init_params=init,
**kwargs,
).fit(X)
hidden_states = model.predict(X)
return model, hidden_states
```

The next function takes the model object and predicted hidden states and returns the estimated mean and variance of the last state.

```
def _get_state_est(model, hidden_states):
"""Function to return estimated state mean and state variance
Params:
model : sklearn model object
hidden_states : {array-like}
Returns:
mr_i : model mean return of last estimated state
mvar_i : model variance of last estimated state
"""
# get last state
last_state = hidden_states[-1]
# last value is mean return for ith state
mr_i = model.means_[last_state][-1]
mvar_i = np.diag(model.covariances_[last_state])[-1]
return mr_i, mvar_i
```

Now we take the estimated state mean and variance of the last predicted state and feed it into the **_get_ci() **function. This function takes the alpha and shape parameters, estimated mean and variance and randomly samples from the JohnsonSU distribution. From this distribution we derive confidence intervals.

```
def _get_ci(mr_i, mvar_i, alpha, a, b, nSamples):
"""Function to sample confidence intervals from the JohnsonSU distribution
Params:
mr_i : float()
mvar_i : float()
alpha : float()
a : float()
b : float()
nsamples : int()
Returns:
ci : tuple(float(), float()), (low_ci, high_ci)
"""
np.random.RandomState(0)
rvs_ = scs.johnsonsu.rvs(a, b, loc=mr_i, scale=mvar_i, size=nSamples)
ci = scs.johnsonsu.interval(alpha=alpha, a=a, b=b, loc=np.mean(rvs_), scale=np.std(rvs_))
return ci
```

The final function visualizes our model predictions, by highlighting target returns that fell inside and outside the confidence intervals, along with our predicted confidence intervals.

```
def plot_pred_success(df, year, a, b):
# colorblind safe palette http://colorbrewer2.org/
colors = sns.color_palette('RdYlBu', 4)
fig, ax = plt.subplots(figsize=(10, 7))
ax.scatter(df.index, df.tgt, c=[colors[1] if x==1 else colors[0] for x in df['in_rng']], alpha=0.85)
df['high_ci'].plot(ax=ax, alpha=0.65, marker='.', color=colors[2])
df['low_ci'].plot(ax=ax, alpha=0.65, marker='.', color=colors[3])
ax.set_xlim(df.index[0], df.index[-1])
nRight = df.query('in_rng==1').shape[0]
accuracy = nRight / df.shape[0]
ax.set_title(r'cutoff year: {} | accuracy: {:2.3%} | errors: {} | a={}, b={}'
.format(year, accuracy, df.shape[0] - nRight, a, b))
in_ = mpl.lines.Line2D(range(1), range(1), color="white", marker='o', markersize=10, markerfacecolor=colors[1])
out_ = mpl.lines.Line2D(range(1), range(1), color="white", marker='o', markersize=10, markerfacecolor=colors[0])
hi_ci = mpl.lines.Line2D(range(1), range(1), color="white", marker='.', markersize=15, markerfacecolor=colors[2])
lo_ci = mpl.lines.Line2D(range(1), range(1), color="white", marker='.', markersize=15, markerfacecolor=colors[3])
leg = ax.legend([in_, out_, hi_ci, lo_ci],["in", "out", 'high_ci', 'low_ci'],
loc = "center left", bbox_to_anchor = (1, 0.85), numpoints = 1)
sns.despine(offset=2)
plt.tight_layout()
return
```

Now we can run the model in a walk-forward fashion. The code uses a chosen lookback period up until the cutoff year to fit the model. From there, the code iterates refitting the model each day, outputting the predicted confidence intervals. The code is setup to run using successive cutoff years, however I will leave that to you readers to experiment with. In this demo we will break the loop after the first cutoff year.

```
%%time
# Model Params
# ------------
a, b = (.2, .7) # found via coarse parameter search
alpha = 0.99
max_iter = 100
k = 2
init = 'random' #'kmeans'
nSamples = 2_000
ft_cols = [f1, f2, f3, 'lret']
years = range(2009,2016)
lookback = 1 # chosen for ease of computation
# Iterate Model
# ------------
for year in years:
cutoff = year
train_df = data.ix[str(cutoff - lookback):str(cutoff)].dropna()
oos = data.ix[str(cutoff+1):].dropna()
# confirm that train_df end index is different than oos start index
assert train_df.index[-1] != oos.index[0]
# create pred list to hold tuple rows
preds = []
for t in tqdm(oos.index):
if t == oos.index[0]:
insample = train_df
# run model func to return model object and hidden states using params
model, hstates = _run_model(insample, ft_cols, k, max_iter, init, random_state=0)
# get hidden state mean and variance
mr_i, mvar_i = _get_state_est(model, hstates)
# get confidence intervals from sampled distribution
low_ci, high_ci = _get_ci(mr_i, mvar_i, alpha, a, b, nSamples)
# append tuple row to pred list
preds.append((t, hstates[-1], mr_i, mvar_i, low_ci, high_ci))
# increment insample dataframe
insample = data.ix[:t]
cols = ['ith_state', 'ith_ret', 'ith_var', 'low_ci', 'high_ci']
pred = (pd.DataFrame(preds, columns=['Dates']+cols)
.set_index('Dates').assign(tgt = oos['lret']))
# logic to see if error exceeds neg or pos CI
pred_copy = pred.copy().reset_index()
# Identify indices where target return falls between CI
win = pred_copy.query("low_ci < tgt < high_ci").index
# create list of binary variables representing in/out CI
in_rng_list = [1 if i in win else 0 for i in pred_copy.index]
# assign binary variables sequence to new column
pred['in_rng'] = in_rng_list
plot_pred_success(pred, year, a, b)
break
```

After that's complete we need to set up our analytics functions to evaluate the return patterns post each event. Recall that an **event** is an actual return that fell outside of our predicted confidence intervals.

```
def post_event(df, event, step_fwd=None):
"""Function to return dictionary where key, value is integer
index, and Pandas series consisting of returns post event
Params:
df : pd.DataFrame(), prediction df
event : {array-like}, index of target returns that exceed CI high or low
step_fwd : int(), how many days to include after event
Returns:
after_event : dict()
"""
after_event = {}
for i in range(len(event)):
tmp_ret = df.ix[event[i]:event[i]+step_fwd, ['Dates','tgt']]
# series of returns with date index
after_event[i] = tmp_ret.set_index('Dates', drop=True).squeeze()
return after_event
def plot_events_timeline(post_events, event_state):
fig, ax = plt.subplots(figsize=(10, 7))
ax.axhline(y=0, color='k', lw=3)
for k in post_events.keys():
tmp = post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
if tmp.sum() > 0: color = 'dodgerblue'
else: color = 'red'
ax.plot(tmp.index, tmp.cumsum(), color=color, alpha=0.5)
ax.set_xlim(pd.to_datetime('2009-12-31'), tmp.index[-1])
ax.set_xlabel('Dates')
ax.set_title(f"{mkt} {event_state.upper()}", fontsize=16, fontweight='demi')
sns.despine(offset=2)
return
def plot_events_post(post_events, event_state):
fig, ax = plt.subplots(figsize=(10, 7))
ax.axhline(y=0, color='k', lw=3)
for k in post_events.keys():
tmp = post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
if tmp.sum() > 0: color = 'dodgerblue'
else: color = 'red'
tmp.cumsum().reset_index(drop=True).plot(color=color, alpha=0.5, ax=ax)
ax.set_xlabel('Days')
ax.set_title(f"{mkt} {event_state.upper()}", fontsize=16, fontweight='demi')
sns.despine(offset=2)
return
def plot_distplot(ending_values, summary):
colors = sns.color_palette('RdYlBu', 4)
fig, ax = plt.subplots(figsize=(10, 7))
sns.distplot(pd.DataFrame(ending_values), bins=15, color=colors[0],
kde_kws={"color":colors[3]}, hist_kws={"color":colors[3], "alpha":0.35}, ax=ax)
ax.axvline(x=float(summary['mean'][0]), label='mean', color='dodgerblue', lw=3, ls='-.')
ax.axvline(x=float(summary['median'][0]), label='median', color='red', lw=3, ls=':')
ax.axvline(x=0, color='black', lw=1, ls='-')
ax.legend(loc='best')
sns.despine(offset=2)
ax.set_title(f"{mkt} {event_state.upper()}", fontsize=16, fontweight='demi')
return
def get_end_vals(post_events):
"""Function to sum and agg each post events' returns"""
end_vals = []
for k in post_events.keys():
tmp = post_events[k].copy()
tmp.iloc[0] = 0 # set initial return to zero
end_vals.append(tmp.sum())
return end_vals
def create_summary(end_vals):
gt0 = [x for x in end_vals if x>0]
lt0 = [x for x in end_vals if x<0]
assert len(gt0) > 1
assert len(lt0) > 1
summary = (pd.DataFrame(index=['value'])
.assign(mean = f'{np.mean(end_vals):.4f}')
.assign(median = f'{np.median(end_vals):.4f}')
.assign(max_ = f'{np.max(end_vals):.4f}')
.assign(min_ = f'{np.min(end_vals):.4f}')
.assign(gt0_cnt = f'{len(gt0):d}')
.assign(lt0_cnt = f'{len(lt0):d}')
.assign(sum_gt0 = f'{sum(gt0):.4f}')
.assign(sum_lt0 = f'{sum(lt0):.4f}')
.assign(sum_ratio = f'{sum(gt0) / abs(sum(lt0)):.4f}')
.assign(gt_pct = f'{len(gt0) / (len(gt0) + len(lt0)):.4f}')
.assign(lt_pct = f'{len(lt0) / (len(gt0) + len(lt0)):.4f}')
)
return summary
```

Now we can run the following code to extract the events, output the plots and view the summary.

```
df = pred.copy().reset_index()
too_high = df.query("tgt > high_ci").index
too_low = df.query("tgt < low_ci").index
step_fwd=5 # how many days to look forward
event_states = ['too_high', 'too_low']
for event in event_states:
after_event = post_event(df, eval(event), step_fwd=step_fwd)
ev = get_end_vals(after_event)
smry = create_summary(ev)
p()
p('*'*25)
p(mkt, event.upper())
p(smry.T)
plot_events_timeline(after_event, event)
plot_events_post(after_event, event)
plot_distplot(ev, smry, event)
```

To answer the original hypothesis about finding market bottoms, we can examine the returns after a *too low* event. Looking at the summary we can see that **the mean and median return are +62 and +82 bps respectively**. Looking at the **sum_ratio** we can see that that the sum of all positive return events is almost 2x the sum of all negative returns. We can also see that, given a *too low* event, after 5 days SPY had positive returns 65% of the time!

These are positive indicators that we may be able to predict market bottoms. However, I would emphasize more testing is needed.

- We don't consider market frictions such as commissions or slippage
- The daily prices may or may not represent actual traded values.
- I used a coarse search to find the JohnsonSU shape parameters, a and b. These may or may not be the best values. Just note that we can use these parameters to arbitrarily adjust the confidence intervals to be more or less conservative. I leave this for the reader to explore.
- In many cases both
*too high,*and*too low*events result in majority positive returns, this*could*be an indication of the overall bullishness of the sample period that may or may not affect model results in the future. - I chose k=2 components for computational simplicity, but there may be better values.
- I chose the lookback period for computational simplicity, but there may be better values.
- Varying the
**step_fwd**parameter may hurt or hinder the strategy. - What makes this approach particularly interesting, is that we don't
**want**anything close to 100% accuracy from our predicted confidence intervals, otherwise we won't have enough "trades".**This adds a level of artistry/complexity because the parameter values we choose should create predictable mean reversion opportunities, but the model accuracy is not a good indicator of this.**Testing the strategy with other assets shows "profitability" in some cases where the model accuracy is sub 60%.

**Please contact me if you find any errors. **

- Part 1 Recap
- Part 2 Goals
- Jupyter (IPython) Notebook
- References

In part 1 of this series we got a feel for Markov Models, Hidden Markov Models, and their applications. We went through the process of using a hidden Markov model to solve a toy problem involving a pet dog. We concluded the article by going through a high level quant finance application of Gaussian mixture models to detect historical regimes.

In this post, my goal is to impart a basic understanding of the expectation maximization algorithm which, not only forms the basis of several machine learning algorithms, including K-Means, and Gaussian mixture models, but also has lots of applications beyond finance. We will also cover the K-Means algorithm which is a form of EM, and its weaknesses. Finally we will discuss how Gaussian mixture models improve on several of K-Means weaknesses.

This post is structured as a Jupyter (IPython) Notebook. I used several different resources\references and tried to give proper credit. Please contact me if you find errors, have suggestions, or if any sources were not attributed correctly.

*Click here to view this notebook directly on NBviewer.jupyter.org*

- Who is Andrey Markov?
- What is the Markov Property?
- What is a Markov Model?
- What makes a Markov Model Hidden?
- A Hidden Markov Model for Regime Detection
- Conclusion
- References

Markov was a Russian mathematician best known for his work on stochastic processes. The focus of his early work was number theory but after 1900 he focused on probability theory, so much so that he taught courses after his official retirement in 1905 until his deathbed [2]. During his research Markov was able to extend the law of large numbers and the central limit theorem to apply to certain sequences of dependent random variables, now known as **Markov Chains** [1][2]. Markov chains are widely applicable to physics, economics, statistics, biology, etc. Two of the most well known applications were Brownian motion [3], and random walks.

"...a random process where the future is independent of the past given the present." [4]

Assume a simplified coin toss game with a fair coin. Suspend disbelief and assume that the Markov property is not yet known and we would like to predict the probability of flipping heads after 10 flips. Under the assumption of conditional dependence (the coin has memory of past states and the future state depends on the sequence of past states) we must record the specific sequence that lead up to the 11th flip and the joint probabilities of those flips. So imagine after 10 flips we have a random sequence of heads and tails. The joint probability of that sequence is 0.5^10 = 0.0009765625. Under conditional dependence, the probability of heads on the next flip is 0.0009765625 * 0.5 = 0.00048828125.

Is that the real probability of flipping heads on the 11th flip? Hell no!

We know that the event of flipping the coin does not depend on the result of the flip before it. The coin has no memory. The process of successive flips does not encode the prior results. Each flip is a unique event with equal probability of heads or tails, aka conditionally independent of past states. This is the Markov property.

A Markov chain (model) describes a stochastic process where the assumed probability of future state(s) depends only on the current process state and not on any the states that preceded it (*shocker*).

Let's get into a simple example. Assume you want to model the future probability that your dog is in one of three states given its current state. To do this we need to specify the state space, the initial probabilities, and the transition probabilities.

Imagine you have a very lazy fat dog, so we define the **state space **as sleeping, eating, or pooping. We will set the initial probabilities to 35%, 35%, and 30% respectively.

```
import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline
# create state space and initial state probabilities
states = ['sleeping', 'eating', 'pooping']
pi = [0.35, 0.35, 0.3]
state_space = pd.Series(pi, index=states, name='states')
print(state_space)
print(state_space.sum())
```

The next step is to define the transition probabilities. They are simply the probabilities of staying in the same state or moving to a different state given the current state.

```
# create transition matrix
# equals transition probability matrix of changing states given a state
# matrix is size (M x M) where M is number of states
q_df = pd.DataFrame(columns=states, index=states)
q_df.loc[states[0]] = [0.4, 0.2, 0.4]
q_df.loc[states[1]] = [0.45, 0.45, 0.1]
q_df.loc[states[2]] = [0.45, 0.25, .3]
print(q_df)
q = q_df.values
print('\n', q, q.shape, '\n')
print(q_df.sum(axis=1))
```

Now that we have the initial and transition probabilities setup we can create a Markov diagram using the **Networkx** package.

To do this requires a little bit of flexible thinking. Networkx creates *Graphs* that consist of *nodes *and *edges*. In our toy example the dog's possible states are the nodes and the edges are the lines that connect the nodes. The transition probabilities are the *weights. *They represent the probability of transitioning to a state given the current state.

Something to note is networkx deals primarily with dictionary objects. With that said, we need to create a dictionary object that holds our edges and their weights.

```
from pprint import pprint
# create a function that maps transition probability dataframe
# to markov edges and weights
def _get_markov_edges(Q):
edges = {}
for col in Q.columns:
for idx in Q.index:
edges[(idx,col)] = Q.loc[idx,col]
return edges
edges_wts = _get_markov_edges(q_df)
pprint(edges_wts)
```

Now we can create the graph. To visualize a Markov model we need to use *nx.MultiDiGraph().* A multidigraph is simply a directed graph which can have multiple arcs such that a single node can be both the origin and destination.

In the following code, we create the graph object, add our nodes, edges, and labels, then draw a bad networkx plot while outputting our graph to a dot file.

```
# create graph object
G = nx.MultiDiGraph()
# nodes correspond to states
G.add_nodes_from(states_)
print(f'Nodes:\n{G.nodes()}\n')
# edges represent transition probabilities
for k, v in edges_wts.items():
tmp_origin, tmp_destination = k[0], k[1]
G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
print(f'Edges:')
pprint(G.edges(data=True))
pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos)
# create edge labels for jupyter plot but is not necessary
edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G , pos, edge_labels=edge_labels)
nx.drawing.nx_pydot.write_dot(G, 'pet_dog_markov.dot')
```

Now a look at the dot file.

Not bad. If you follow the edges from any node, it will tell you the probability that the dog will transition to another state. For example, if the dog is sleeping, we can see there is a 40% chance the dog will keep sleeping, a 40% chance the dog will wake up and poop, and a 20% chance the dog will wake up and eat.

Consider a situation where your dog is acting strangely and you wanted to model the probability that your dog's behavior is due to sickness or simply quirky behavior when otherwise healthy.

In this situation the **true **state of the dog is *unknown*, thus **hidden** from you. One way to model this is to *assume* that the dog has **observable** behaviors that represent the true, hidden state. Let's walk through an example.

First we create our state space - healthy or sick. We assume they are equiprobable.

```
# create state space and initial state probabilities
hidden_states = ['healthy', 'sick']
pi = [0.5, 0.5]
state_space = pd.Series(pi, index=hidden_states, name='states')
print(state_space)
print('\n', state_space.sum())
```

Next we create our transition matrix for the hidden states.

```
# create hidden transition matrix
# a or alpha
# = transition probability matrix of changing states given a state
# matrix is size (M x M) where M is number of states
a_df = pd.DataFrame(columns=hidden_states, index=hidden_states)
a_df.loc[hidden_states[0]] = [0.7, 0.3]
a_df.loc[hidden_states[1]] = [0.4, 0.6]
print(a_df)
a = a_df.values
print('\n', a, a.shape, '\n')
print(a_df.sum(axis=1))
```

This is where it gets a little more interesting. Now we create the **emission or observation** probability matrix. This matrix is size M x O where M is the number of hidden states and O is the number of possible observable states.

The emission matrix tells us the probability the dog is in one of the hidden states, given the current, observable state.

Let's keep the same observable states from the previous example. The dog can be either sleeping, eating, or pooping. For now we make our best guess to fill in the probabilities.

```
# create matrix of observation (emission) probabilities
# b or beta = observation probabilities given state
# matrix is size (M x O) where M is number of states
# and O is number of different possible observations
observable_states = states
b_df = pd.DataFrame(columns=observable_states, index=hidden_states)
b_df.loc[hidden_states[0]] = [0.2, 0.6, 0.2]
b_df.loc[hidden_states[1]] = [0.4, 0.1, 0.5]
print(b_df)
b = b_df.values
print('\n', b, b.shape, '\n')
print(b_df.sum(axis=1))
```

Now we create the graph edges and the graph object.

```
# create graph edges and weights
hide_edges_wts = _get_markov_edges(a_df)
pprint(hide_edges_wts)
emit_edges_wts = _get_markov_edges(b_df)
pprint(emit_edges_wts)
```

```
# create graph object
G = nx.MultiDiGraph()
# nodes correspond to states
G.add_nodes_from(hidden_states)
print(f'Nodes:\n{G.nodes()}\n')
# edges represent hidden probabilities
for k, v in hide_edges_wts.items():
tmp_origin, tmp_destination = k[0], k[1]
G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
# edges represent emission probabilities
for k, v in emit_edges_wts.items():
tmp_origin, tmp_destination = k[0], k[1]
G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)
print(f'Edges:')
pprint(G.edges(data=True))
pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw_networkx(G, pos)
# create edge labels for jupyter plot but is not necessary
emit_edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G , pos, edge_labels=emit_edge_labels)
nx.drawing.nx_pydot.write_dot(G, 'pet_dog_hidden_markov.dot')
```

The hidden Markov graph is a little more complex but the principles are the same. For example, you would expect that if your dog is eating there is a high probability that it is healthy (60%) and a very low probability that the dog is sick (10%).

Now, what if you needed to discern the health of your dog over time given a sequence of observations?

```
# observation sequence of dog's behaviors
# observations are encoded numerically
obs_map = {'sleeping':0, 'eating':1, 'pooping':2}
obs = np.array([1,1,2,1,0,1,2,1,0,2,2,0,1,0,1])
inv_obs_map = dict((v,k) for k, v in obs_map.items())
obs_seq = [inv_obs_map[v] for v in list(obs)]
print( pd.DataFrame(np.column_stack([obs, obs_seq]),
columns=['Obs_code', 'Obs_seq']) )
```

Using the **Viterbi** algorithm we can identify the most likely sequence of hidden states given the sequence of observations.

High level, the Viterbi algorithm increments over each time step, finding the **maximum** probability of any path that gets to state **i**at time **t**, that ** also** has the correct observations for the sequence up to time

The algorithm also keeps track of the state with the highest probability at each stage. At the end of the sequence, the algorithm will iterate backwards selecting the state that "won" each time step, and thus creating the most likely path, or likely sequence of hidden states that led to the sequence of observations.

```
# define Viterbi algorithm for shortest path
# code adapted from Stephen Marsland's, Machine Learning An Algorthmic Perspective, Vol. 2
# https://github.com/alexsosn/MarslandMLAlgo/blob/master/Ch16/HMM.py
def viterbi(pi, a, b, obs):
nStates = np.shape(b)[0]
T = np.shape(obs)[0]
# init blank path
path = np.zeros(T)
# delta --> highest probability of any path that reaches state i
delta = np.zeros((nStates, T))
# phi --> argmax by time step for each state
phi = np.zeros((nStates, T))
# init delta and phi
delta[:, 0] = pi * b[:, obs[0]]
phi[:, 0] = 0
print('\nStart Walk Forward\n')
# the forward algorithm extension
for t in range(1, T):
for s in range(nStates):
delta[s, t] = np.max(delta[:, t-1] * a[:, s]) * b[s, obs[t]]
phi[s, t] = np.argmax(delta[:, t-1] * a[:, s])
print('s={s} and t={t}: phi[{s}, {t}] = {phi}'.format(s=s, t=t, phi=phi[s, t]))
# find optimal path
print('-'*50)
print('Start Backtrace\n')
path[T-1] = np.argmax(delta[:, T-1])
#p('init path\n t={} path[{}-1]={}\n'.format(T-1, T, path[T-1]))
for t in range(T-2, -1, -1):
path[t] = phi[path[t+1], [t+1]]
#p(' '*4 + 't={t}, path[{t}+1]={path}, [{t}+1]={i}'.format(t=t, path=path[t+1], i=[t+1]))
print('path[{}] = {}'.format(t, path[t]))
return path, delta, phi
path, delta, phi = viterbi(pi, a, b, obs)
print('\nsingle best state path: \n', path)
print('delta:\n', delta)
print('phi:\n', phi)
```

Let's take a look at the result.

```
state_map = {0:'healthy', 1:'sick'}
state_path = [state_map[v] for v in path]
(pd.DataFrame()
.assign(Observation=obs_seq)
.assign(Best_Path=state_path))
```

By now you're probably wondering how we can apply what we have learned about hidden Markov models to quantitative finance.

Consider that the largest hurdle we face when trying to apply predictive techniques to asset returns is nonstationary time series. In brief, this means that the expected mean and volatility of asset returns changes over time.

Most time series models assume that the data is stationary. This is a major weakness of these models.

Instead, let us frame the problem differently. We know that time series exhibit temporary periods where the expected means and variances are stable through time. These periods or *regimes* can be likened to *hidden states*.

If that's the case, then all we need are observable variables whose behavior allows us to infer the true hidden state(s). If we can better estimate an asset's most likely regime, including the associated means and variances, then our predictive models become more adaptable and will likely improve. We can also become better risk managers as the estimated regime parameters gives us a great framework for better scenario analysis.

In this example, the observable variables I use are: the underlying asset returns, the Ted Spread, the 10 year - 2 year constant maturity spread, and the 10 year - 3 month constant maturity spread.

```
import pandas as pd
import pandas_datareader.data as web
import sklearn.mixture as mix
import numpy as np
import scipy.stats as scs
import matplotlib as mpl
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
%matplotlib inline
import seaborn as sns
import missingno as msno
from tqdm import tqdm
p=print
```

Using pandas we can grab data from Yahoo Finance and FRED.

```
# get fed data
f1 = 'TEDRATE' # ted spread
f2 = 'T10Y2Y' # constant maturity ten yer - 2 year
f3 = 'T10Y3M' # constant maturity 10yr - 3m
start = pd.to_datetime('2002-01-01')
end = pd.datetime.today()
mkt = 'SPY'
MKT = (web.DataReader([mkt], 'yahoo', start, end)['Adj Close']
.rename(columns={mkt:mkt})
.assign(sret=lambda x: np.log(x[mkt]/x[mkt].shift(1)))
.dropna())
data = (web.DataReader([f1, f2, f3], 'fred', start, end)
.join(MKT, how='inner')
.dropna()
)
p(data.head())
# gives us a quick visual inspection of the data
msno.matrix(data)
```

Next we will use the **sklearn's GaussianMixture **to fit a model that estimates these regimes. We will explore *mixture models * in more depth in part 2 of this series. The important takeaway is that mixture models implement a closely related unsupervised form of density estimation. It makes use of the expectation-maximization algorithm to estimate the means and covariances of the hidden states (regimes). For now, it is ok to think of it as a magic button for guessing the transition and emission probabilities, and most likely path.

We have to specify the number of components for the mixture model to fit to the time series. In this example the components can be thought of as regimes. We will arbitrarily classify the regimes as High, Neutral and Low Volatility and set the number of components to three.

```
# code adapted from http://hmmlearn.readthedocs.io
# for sklearn 18.1
col = 'sret'
select = data.ix[:].dropna()
ft_cols = [f1, f2, f3, 'sret']
X = select[ft_cols].values
model = mix.GaussianMixture(n_components=3,
covariance_type="full",
n_init=100,
random_state=7).fit(X)
# Predict the optimal sequence of internal hidden state
hidden_states = model.predict(X)
print("Means and vars of each hidden state")
for i in range(model.n_components):
print("{0}th hidden state".format(i))
print("mean = ", model.means_[i])
print("var = ", np.diag(model.covariances_[i]))
print()
sns.set(font_scale=1.25)
style_kwds = {'xtick.major.size': 3, 'ytick.major.size': 3,
'font.family':u'courier prime code', 'legend.frameon': True}
sns.set_style('white', style_kwds)
fig, axs = plt.subplots(model.n_components, sharex=True, sharey=True, figsize=(12,9))
colors = cm.rainbow(np.linspace(0, 1, model.n_components))
for i, (ax, color) in enumerate(zip(axs, colors)):
# Use fancy indexing to plot data in each state.
mask = hidden_states == i
ax.plot_date(select.index.values[mask],
select[col].values[mask],
".-", c=color)
ax.set_title("{0}th hidden state".format(i), fontsize=16, fontweight='demi')
# Format the ticks.
ax.xaxis.set_major_locator(YearLocator())
ax.xaxis.set_minor_locator(MonthLocator())
sns.despine(offset=10)
plt.tight_layout()
fig.savefig('Hidden Markov (Mixture) Model_Regime Subplots.png')
```

In the above image, I've highlighted each regime's daily expected mean and variance of SPY returns. It appears the 1th hidden state is our low volatility regime. Note that the 1th hidden state has the largest expected return and the smallest variance.The 0th hidden state is the neutral volatility regime with the second largest return and variance. Lastly the 2th hidden state is high volatility regime. We can see the expected return is negative and the variance is the largest of the group.

```
sns.set(font_scale=1.5)
states = (pd.DataFrame(hidden_states, columns=['states'], index=select.index)
.join(select, how='inner')
.assign(mkt_cret=select.sret.cumsum())
.reset_index(drop=False)
.rename(columns={'index':'Date'}))
p(states.head())
sns.set_style('white', style_kwds)
order = [0, 1, 2]
fg = sns.FacetGrid(data=states, hue='states', hue_order=order,
palette=scolor, aspect=1.31, size=12)
fg.map(plt.scatter, 'Date', mkt, alpha=0.8).add_legend()
sns.despine(offset=10)
fg.fig.suptitle('Historical SPY Regimes', fontsize=24, fontweight='demi')
fg.savefig('Hidden Markov (Mixture) Model_SPY Regimes.png')
```

Here is the SPY price chart with the color coded regimes overlaid.

In this post we've discussed the concepts of the Markov property, Markov models and hidden Markov models. We used the networkx package to create Markov chain diagrams, and sklearn's GaussianMixture to estimate historical regimes. In part 2 we will discuss mixture models more in depth. For more detailed information I would recommend looking over the references. Setosa.io is especially helpful in covering any gaps due to the highly interactive visualizations.

- https://en.wikipedia.org/wiki/Andrey_Markov
- https://www.britannica.com/biography/Andrey-Andreyevich-Markov
- https://www.reddit.com/r/explainlikeimfive/comments/vbxfk/eli5_brownian_motion_and_what_it_has_to_do_with/
- http://www.math.uah.edu/stat/markov/Introduction.html
- http://setosa.io/ev/markov-chains/
- http://www.cs.jhu.edu/~langmea/resources/lecture_notes/hidden_markov_models.pdf
- https://github.com/alexsosn/MarslandMLAlgo/blob/master/Ch16/HMM.py
- http://hmmlearn.readthedocs.io

**Motivating the Journey****Where Do Edges Come From?****The Problem with Traditional Research****The Hidden Side**

**A Brief Description:****Part 1 - A Visual Introduction to Hidden Markov Models with Python****Part 2 - Exploring Mixture Models with Scikit-Learn and Python****Part 3 - Predicting Market Bottoms with Scikit-Learn and Python**

Edges come from superior ability to identify and execute profitable strategies.

You can see this simply by imagining the first strategy able to identify pricing errors on identical items in different markets. This knowledge is valuable in two scenarios: you can execute the transaction yourself or you know someone who can and will pay you for the "signal".

Abstractly, a signal can be thought of as a glitch in the matrix allowing us a view through a window into probabilistic future states. Signals can come from anywhere and are not always understood.

Our job is to find these signals, vet them, and implement them. This is difficult in practice. The competitive environment we seek to understand is dynamic with positive and negative feedback loops operating at various scales. The system processes are very noisy making signal extraction confusing and difficult. Competitors are always seeking strategies that "work" until they don't.

Generally profitable edges stop working when both your identification and execution strategies are well known. Thus a profit motive for secrecy and obfuscation exists among participants. If you are familiar with poker this will sound very familiar.

This also means that using well known identification techniques puts you at a strategic disadvantage because your competitors have likely incorporated knowledge of your methods into their own strategies.

Therefore we must continuously search for strategies that are not well understood, not well known, or otherwise difficult for our competitors to implement.

Too much published "research" focuses on using well known statistical tools to draw conclusions that do not improve the odds of profitable investment. Worse still, many research papers' results are not reproducible.

For periods of time, techniques involving technical analysis, regression, and simple correlations, were good enough to beat the market. This worked because the methodology was not well known or well understood. Times have changed.

These methods have been taught and promoted to generations of practitioners. These techniques form the foundation of many market participants investment strategies. Therefore the majority of well known strategies are already in use by the market.

This means sophisticated participants have had time and opportunity to develop counter strategies to take advantage of the limitations of publicly known methods.

Typical business finance teachings focus on the theory that stock values are directly tied to the expected value of net cash flows produced by the underlying operating business from now into some future period. Other research links stock prices to any number of other observable factors. My perception is that these well taught methods can bias our exploratory research when it comes to the art and science of **prediction.**

Successful prediction does not require understanding or logic. Prediction does not require expertise in the industry or business which generated the data. These things can help solidify our belief in the power of the prediction, however successful prediction methods only require a stable, positive payoff function relative to prediction accuracy over an expanding time period. Nothing more, nothing less.

Rigid knowledge structures can blind us to potential opportunities. Using statistics to explore observable factors only, ignores the entire spectrum of hidden, unobserved factors influencing asset returns.

By definition a hidden factor is not directly observable. Its presence or influence is detected by its effect on observable factor(s) or on a delayed basis.

Conceptualizing the influence of hidden factors is difficult for many decision makers to either understand or incorporate into already existing processes.

The combination of bias created by traditional finance and difficulty conceptualizing hidden factors, creates the barriers to entry we need for successful strategy development. We can reasonably assume this research pathway is still rich with profitable edges and worth pursuing.

In part 1, we will discuss Markov Models, Hidden Markov Models and a toy application for regime detection.

In part 2, we will explore the motivation behind mixture models and how they improve on the weaknesses of K-means algorithms. We will also discuss the connection between Mixture Models and Hidden Markov Models. Finally we will extend our toy regime detector to use a mixture model instead.

In part 3, we will implement a toy strategy using mixture models to predict market bottoms. The strategy assumes that we can calibrate a model to predict the market return distribution such that actual returns that fall below the confidence intervals are profitable long entries over short time periods.

Post thumbnail picture taken from Bayesian Intelligence Slideshare presentation.

]]>- Motivation
- Get Data
- Default Plot with Recession Shading
- Add Chart Titles, Axis Labels, Fancy Legend, Horizontal Line
- Format X and Y Axis Tick Labels
- Change Font and Add Data Markers
- Add Annotations
- Add Logo/Watermarks

Since I started this blog a few years ago, one of my obsessions is creating good looking, informative plots/charts. I've spent an inordinate amount of time learning how to do this and it is still a work in a progress. However all my work is not in vain as several of you readers have commented and messaged me for the code behind some of my time series plots. Beginning with basic time series data, I will show you how I produce these charts.

Import packages

```
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white', {"xtick.major.size": 2, "ytick.major.size": 2})
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71","#f4cae4"]
sns.set_palette(sns.color_palette(flatui,7))
import missingno as msno
p=print
save_loc = '/YOUR/PROJECT/LOCATION/'
logo_loc = '/YOUR/WATERMARK/LOCATION/'
```

Get time series data from Yahoo finance and recession data from FRED.

```
# get index and fed data
f1 = 'USREC' # recession data from FRED
start = pd.to_datetime('1999-01-01')
end = pd.datetime.today()
mkt = '^GSPC'
MKT = (web.DataReader([mkt,'^VIX'], 'yahoo', start, end)['Adj Close']
.resample('MS') # month start b/c FED data is month start
.mean()
.rename(columns={mkt:'SPX','^VIX':'VIX'})
.assign(SPX_returns=lambda x: np.log(x['SPX']/x['SPX'].shift(1)))
.assign(VIX_returns=lambda x: np.log(x['VIX']/x['VIX'].shift(1)))
)
data = (web.DataReader([f1], 'fred', start, end)
.join(MKT, how='outer')
.dropna())
p(data.head())
p(data.info())
msno.matrix(data)
```

Now we have to setup our recession data so we can get the official begin and end dates for each recession over the period.

```
# recessions are marked as 1 in the data
recs = data.query('USREC==1')
# Select the two recessions over the time period
recs_2k = recs.ix['2001']
recs_2k8 = recs.ix['2008':]
# now we can grab the indices for the start
# and end of each recession
recs2k_bgn = recs_2k.index[0]
recs2k_end = recs_2k.index[-1]
recs2k8_bgn = recs_2k8.index[0]
recs2k8_end = recs_2k8.index[-1]
```

Now we can plot the default chart with recession shading. Let's take a look.

```
# Let's plot SPX and VIX cumulative returns with recession overlay
plot_cols = ['SPX_returns', 'VIX_returns']
# 2 axes for 2 subplots
fig, axes = plt.subplots(2,1, figsize=(10,7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
for ax in axes:
ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
```

The default plot is ok but we can do better. Let's add chart titles, axis labels, spruce up the legend, and add a horizontal line for 0.

```
fig, axes = plt.subplots(2,1, figsize=(10,7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
# lets add horizontal zero lines
ax.axhline(0, color='k', linestyle='-', linewidth=1)
# add titles
ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
# add axis labels
ax.set_ylabel('Returns')
ax.set_xlabel('Date')
# add cool legend
ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
# now to use tight layout
plt.tight_layout()
```

This is a step up but still not good enough. I prefer more informative dates on the x-axis, and percent formatting on the y-axis.

```
# better but I prefer more advanced axis tick labels
fig, axes = plt.subplots(2,1, figsize=(12,9), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
# lets add horizontal zero lines
ax.axhline(0, color='k', linestyle='-', linewidth=1)
# add titles
ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
# add axis labels
ax.set_ylabel('Returns')
ax.set_xlabel('Date')
# upgrade axis tick labels
yticks = ax.get_yticks()
ax.set_yticklabels(['{:3.1f}%'.format(x*100) for x in yticks]);
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m') for dtz in dates_rng], rotation=45)
# add cool legend
ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
# now to use tight layout
plt.tight_layout()
```

It's an improvement, but I hate Arial font, and would like to add data point markers.

```
# I want markers for the data points, and change to font
mpl.rcParams['font.family'] = 'Ubuntu Mono'
fig, axes = plt.subplots(2,1, figsize=(10,7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes, marker='o', ms=3)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
# lets add horizontal zero lines
ax.axhline(0, color='k', linestyle='-', linewidth=1)
# add titles
ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
# add axis labels
ax.set_ylabel('Returns')
ax.set_xlabel('Date')
# upgrade axis tick labels
yticks = ax.get_yticks()
ax.set_yticklabels(['{:3.2f}%'.format(x*100) for x in yticks]);
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m') for dtz in dates_rng], rotation=45)
# add cool legend
ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
# now to use tight layout
plt.tight_layout()
```

It's starting to look pretty good, but we can get even more fancy. Say we wanted to annotate the global maximum and minimum returns in each subplot along with their respective dates for SPX and VIX . That could be a challenge. To do this we first need to extract the max/mins and idxmax/idxmin for both series.

```
# I want to know show the global max and mins and their dates
# --------------------------------------------------------------- #
# MAX SPX Returns
spx_max_ = data[plot_cols[0]].max()
spx_max_idx_ = data[plot_cols[0]].idxmax(axis=0, skipna=True)
# MIN SPX Returns
spx_min_ = data[plot_cols[0]].min()
spx_min_idx_ = data[plot_cols[0]].idxmin(axis=0, skipna=True)
# MAX VIX Returns
vix_max_ = data[plot_cols[1]].max()
vix_max_idx_ = data[plot_cols[1]].idxmax(axis=0, skipna=True)
# MIN VIX Returns
vix_min_ = data[plot_cols[1]].min()
vix_min_idx_ = data[plot_cols[1]].idxmin(axis=0, skipna=True)
```

Now that we have this information we can get clever with the annotation tools Matplotlib provides. Also, I want to touch up some of the axis labels and axis tick labels as well.

```
mpl.rcParams['font.family'] = 'Ubuntu Mono'
fig, axes = plt.subplots(2,1, figsize=(12,9), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes, marker='o', ms=3)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
# lets add horizontal zero lines
ax.axhline(0, color='k', linestyle='-', linewidth=1)
# add titles
ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray', fontsize=14, fontweight='demi')
# add axis labels
ax.set_ylabel('Returns', fontsize=12, fontweight='demi')
ax.set_xlabel('Date', fontsize=12, fontweight='demi')
# upgrade axis tick labels
yticks = ax.get_yticks()
ax.set_yticklabels(['{:3.1f}%'.format(x*100) for x in yticks]);
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m-%d') for dtz in dates_rng], rotation=45)
# bold up tick axes
ax.tick_params(axis='both', which='major', labelsize=11)
# add cool legend
ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
# add global max/min annotations
# add cool annotation box
bbox_props = dict(boxstyle="round4, pad=0.6", fc="cyan", ec="b", lw=.5)
axes[0].annotate('Global Max = {:.2%}\nDate = {}'
.format(spx_max_, spx_max_idx_.strftime('%a, %Y-%m-%d')),
fontsize=9,
fontweight='bold',
xy=(spx_max_idx_, spx_max_),
xycoords='data',
xytext=(-150, -30),
textcoords='offset points',
arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[0].annotate('Global Min = {:.2%}\nDate = {}'
.format(spx_min_, spx_min_idx_.strftime('%a, %Y-%m-%d')),
fontsize=9,
fontweight='demi',
xy=(spx_min_idx_, spx_min_),
xycoords='data',
xytext=(-150, 30),
textcoords='offset points',
arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[1].annotate('Global Max = {:.2%}\nDate = {}'
.format(vix_max_, vix_max_idx_.strftime('%a, %Y-%m-%d')),
fontsize=9,
fontweight='bold',
xy=(vix_max_idx_, vix_max_),
xycoords='data',
xytext=(-150, -30),
textcoords='offset points',
arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[1].annotate('Global Min = {:.2%}\nDate = {}'
.format(vix_min_, vix_min_idx_.strftime('%a, %Y-%m-%d')),
fontsize=9,
fontweight='demi',
xy=(vix_min_idx_, vix_min_),
xycoords='data',
xytext=(-150, -20),
textcoords='offset points',
arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
# now to use tight layout
plt.tight_layout()
```

Wow, now it's looking really good. But what if you wanted to insert branding via a watermark? That's simple, add the following line of code before the **plt.tight_layout() **line and voila.

```
# add logo watermark
im = mpl.image.imread(logo_loc)
axes[0].figure.figimage(im, origin='upper', alpha=0.125, zorder=10)
```

]]>**Strategy Summary****References****4-Week Holding Period Strategy Update****1-Week Holding Period Strategy Updated (Target Leverage=2)**

This is a stylized implementation of the strategy described in the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor predicts individual equity returns up to 6 months!

**ABSTRACT**

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns.Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis.This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?

My strategy differs in that I arbitrarily chose 1 and 4 week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.

- Zhang, Xiaoyan and Zhao, Rui and Xing, Yuhang, What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns? (August 14, 2008). AFA 2009 San Francisco Meetings Paper. Available at SSRN:http://ssrn.com/abstract=1107464 orhttp://dx.doi.org/10.2139/ssrn.1107464

**Results simulated using the Quantopian Platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

]]>- Part-1 Recap
- Part-1 Error Corrections
- Part-2 Implementation Details, Deviations, Goals
- Prepare Data
- Setup PYMC3 Generalized Linear Models (GLM)
- Evaluate and Interprate Models
- Conclusions
- References

In part 1 We discussed the theorized underpinnings of Ying Wu of Stevens Institute of Technology - School's asset pricing model. Theory links the catalyst of systemic risk events to the funding difficulties of major financial intermediaries. Thus crisis risk is linked to liquidity events. The model proposes a method to estimate a proxy index for the systematic liquidity risk. We use an illiquidity metric calculated across a large group of stocks, then apply a tool called the Hill estimator to measure the average distance of extreme illiquidity events from the tail cutoff. We explored the high level intuition behind the Hill estimator.

We created an implementation of the Hill estimator, aggregated the stock data, calculated the illiquidity metrics and the ELR index, and finally output the intermediate data into a hdf5 file for quick read/write access.

We did not get this far in part-1, but the paper asserts that we can use this index as an asset pricing component. This could also be thought of as the primary feature or independent variable in a simple linear regression (think CAPM). The target variable is the expected aggregate returns. From there the paper says we can create long-short portfolios by ranking the stocks according to their factor betas and sorting them into quantiles.

In part 1 the implementation of the hill estimator was incorrect. The ELR index is supposed to be comprised of the values that exceed 95th percentile. In the original implementation I calculated the average of *all *values not just those in the tail region. Therefore the quick and dirty observations made previously are for a different index. On the left is the original incorrect index. On the right is the corrected index.

After spending some time rereading the research paper there is a subtle bit of additional complexity I have not included in this implementation that may affect the results I get vs those found in the paper.

In the paper the threshold value is calculated as the 95th percentile cross-sectionally for the entire month. Then the index is constructed by calculating the average log distance from that threshold for any datapoint located in the tail. To create an index like this requires binning the data by month, getting the threshold value of that month by aggregating the *daily illiquidity *metrics of a few thousand stocks for that month, then calculating the log average distance between those tail values and the threshold.

This means we likely need a whole month of data before we can calculate the ELR value. We can potentially use a rolling 21 or 30 day window to simulate a monthly lookback but based on the paper it does not seem that the author used this method. If, instead we go by calendar months, this likely means we need *a lot *more data before we can draw any conclusions. For example the author's sample period is from 1968-2011 and only includes NYSE stocks among other stock universe selection details.

In my exploration of ELR index, I prefer to keep it simpler, and calculate the 95th percentile threshold based on the cross sectional *daily *illiquidity values instead of the whole month.

Part-2 Goals:

- Import the calculated the daily illiquidity values
- Resample the illiquidity measures by week, taking the median and max illiquidity values, then calculate the ELR Index
- Use pymc3's generalized linear models function to fit a model for predicting the cross-sectional scaled returns.
- Interpret and Evaluate the models.

First we need to import packages and get our data ready.

```
# import packages
import sys
import os
# ------------------- % import datasets % ------------------- #
datasets = '/YOUR/DATASET/LOCATION/_Datasets/'
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import *
import numpy as np
import scipy.stats as scs
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('bmh')
%matplotlib inline
import seaborn as sns
sns.set_style('white', {"xtick.major.size": 3, "ytick.major.size": 3})
import pymc3 as pm
from scipy import optimize
import time
from tqdm import tqdm
p = print
```

I created a hdf5 file for the aggregated returns because I want to use them as a proxy for our target variable of expected market returns. I import those here and create a time series consisting of the cross sectional median and average log returns.

```
## read in data
# log returns for aggregrate mkt proxy
LRET_FILE = datasets + 'LRET_Set_2016-11-22.h5'
lret_set = pd.read_hdf(LRET_FILE, 'RETURNS_DV_SET')
lret_set = lret_set.loc[:,lret_set.columns.to_series().str.contains('_lret').tolist()]
# calc median and mean cross sectional
mkt = pd.DataFrame({'cross_mdn_rets':lret_set.median(axis='columns'),
'cross_avg_rets':lret_set.mean(axis='columns')},
index=lret_set.index
)
## read in illiquidity data for ELR calculations
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq = pd.read_hdf(ILQ_FILE, 'Illiquidity_Set')
```

After loading the data into our environment, we resample the data to a weekly frequency using both median and max values for comparison. On my outdated laptop this took approximately 7 minutes.

```
# weekly resample
freq = '1W'
df = ilq.resample(freq).median()
df_max = ilq.resample(freq).max()
```

Next we define our convenience functions for calculating our ELR index. Notice that I deviate from the traditional z-score scaling method and implement the Gelman scaler which divides the centered values by 2 times the standard deviation. You can read more details from Andrew Gelman's paper[2] about why we use this method. The high-level intuition is that this scale improves regression coefficient interpretability across binary, discrete, and continuous variables.

```
# convenience functions for gamma calculation and scaler
# gamma estimate
def _ext_lq_risk(series):
# threshold is 95th percentile
p_star = np.nanpercentile(series, 95)
illiq = series[series > p_star]
#illiq = series # looks better on chart but less explanatory power
lg_illiq = np.log(illiq / p_star)
lg_illiq = lg_illiq[np.isfinite(lg_illiq)]
try:
gamma = 1./ ((1./len(lg_illiq)) * sum(lg_illiq))
except ZeroDivisionError:
gamma = np.nan
return gamma
# scaler function
gelman_scaler = lambda ser: (ser - ser.mean()) / (2*ser.std())
# calculate elr index
def _calculate_elr(df, cutoff=100, scaler=None):
gs = {} # gammas dictionary
nan_dates = []
for d in df.index:
# we want at least N nonnull values
if df.loc[d].notnull().sum() > cutoff:
gamma = _ext_lq_risk(df.loc[d])
gs[d] = gamma
else:
nan_dates.append(d)
gdf = pd.DataFrame.from_dict(gs, orient='index').sort_index()
gdfz = scaler(gdf)
gdfz.columns = ['ELR']
return gdfz, nan_dates
```

Now we can set up our main experimental dataframe. We need to make sure the our market proxy dataframe, which consists of the aggregate sample returns, has the same index as our ELR dataframe before we merge them. Also, remember we are going to experiment with two resampled dataframes, one with the weekly median illiquidity, and one with the weekly maximum illiquidity. Our final step after creating our merged dataframes is to add a column for our Gelman scaled aggregate returns.

```
# calculate ELR index on resampled data
gdfz_mdn, _ = _calculate_elr(df, scaler=gelman_scaler)
gdfz_max, _ = _calculate_elr(df_max, scaler=gelman_scaler)
# market resample must match gdfz before merge
# merge dataframes
mkt_rs = mkt.resample(freq).mean()
mrg_mdn = pd.concat([gdfz_mdn, mkt_rs], join='inner', axis=1)
mrg_max = pd.concat([gdfz_max, mkt_rs], join='inner', axis=1)
# add cross sectional average Gelman scored returns
avg_col = 'cross_avg_rets'
mrg_mdn['cross_avg_zrets'] = gelman_scaler(mrg_mdn[avg_col])
mrg_max['cross_avg_zrets'] = gelman_scaler(mrg_max[avg_col])
mrg_mdn.head()
```

Before running our model I define some output convenience functions adapted from the excellent blog Applied AI[3].

```
# pymc3 convenience functions adapted from blog.applied.ai
def trace_median(x):
return pd.Series(np.median(x,0), name='median')
def plot_traces(trcs, retain=1000, varnames=None):
''' Convenience fn: plot traces with overlaid means and values '''
nrows = len(trcs.varnames)
if varnames is not None:
nrows = len(varnames)
ax = pm.traceplot(trcs[-retain:], varnames=varnames, figsize=(12,nrows*1.4)
,lines={k: v['mean'] for k, v in
pm.df_summary(trcs[-retain:],varnames=varnames).iterrows()})
for i, mn in enumerate(pm.df_summary(trcs[-retain:], varnames=varnames)['mean']):
ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), xycoords='data'
,xytext=(5,10), textcoords='offset points', rotation=90
,va='bottom', fontsize='large', color='#AA0022')
def plot_pm_acf(trace, varnames=None, burn=None):
pm.autocorrplot(trace, varnames=varnames, burn=burn, figsize=(7,5))
return
```

Now we can set up our model. I will gloss over some of the particulars pymc3 and the *Generalized Linear Model (glm) *functions for now. I'm also skipping over why I'm using a Bayesian methodology vs. a frequentist one. Generally speaking, Bayesian modeling is the preferred methodology due to robustness and explicit modeling of the uncertainty in our point estimates. I plan to revisit this topic in more detail in the future, but there are plenty of tutorials and explanations of why Bayesian is the way to go.

Anyone familiar with R will appreciate the following simplicity of model setup. First we need to define our model formula as a string.

```
# predicting cross sectional average returns using the ELR index
ft_endog = 'cross_avg_zrets'
ft_exog = ['ELR'] # this format allows easy addition of more variables
fml = '{} ~ '.format(ft_endog) + ' + '.join(ft_exog)
p(fml)
# 'cross_avg_zrets ~ ELR'
```

Next we follow pymc3's glm model convention and choose the number of samples we wish to draw from the predicted posterior.

```
# choose samples and run model
samples = 5000
with pm.Model() as mdl:
## Use GLM submodule for simplified model specification
## Betas are Normal (as per default settings (for Ridge)
## Likelihood is Normal (with HalfCauchy for error prior)
pm.glm.glm(fml, mrg_mdn, family=pm.glm.families.Normal())
start_MAP = pm.find_MAP(fmin=optimize.fmin_powell)
## take samples using NUTS sampler
trc_ols = pm.sample(samples, start=start_MAP, step=pm.NUTS())
rvs = [rv.name for rv in mdl.unobserved_RVs]
rvs.remove('sd_log_')
plot_traces(trc_ols, varnames=rvs)
plot_pm_acf(trc_ols, varnames=rvs, burn=1000)
p(pm.df_summary(trc_ols[-1000:], varnames=rvs))
p('\nMedian Illiquidity ELR Model\nDIC:', pm.dic(trc_ols[-1000:], model=mdl))
```

We run the model for both the median and max illiquidity estimates.

First we need to decide how we will evaluate which model is best. For this I have chosen the Deviance Information Criterion (DIC) which is implemented in pymc3 and designed specifically for Bayesian modelling using MCMC. Like similar alternative measures, the smaller the number the better our model.

First we evaluate the resampled median illiquidity model.

median model trace

On the left we can examine the distribution of our sample estimate for the intercept, ELR, and model error. On the right we can see the sample trace. This should look like white noise and it does. We can see the intercept is basically zero, the ELR beta is -0.06 and the standard deviation is 0.5

median acf

We plot the ACF of our variables to confirm that the sample traces are white noise. However we can see a strongly negative autocorrelation for each variable at its first lag.

median model summary and dic

We can see that the both the ELR and sd are significant as their highest posterior density does NOT include zero in the interval. The DIC is 1531. Now let's compare the median model to the max model.

elr max model trace

We observe that the models are similar in their output, however notice the ELR in this instance has a stronger negative correlation with our target variable than does the median model. The traces on the right side appear to resemble white noise. Let's confirm by looking at the ACF plot.

elr max model acf

This confirms our intuition that the series is close to white noise. We can also see a pretty strong negative autocorrelation at lag 1 for each of our variables. This is not ideal but ok for our exploratory purposes.

max model hpd and dic

We can see that the ELR and sd are both significant as neither interval includes zero. The magnitude of the ELR coefficient is larger in the max model which corresponds to a lower *better* DIC.

We designed an experiment to evaluate the relationship between the ELR index and the cross sectional scaled returns. We deviated from the original paper in a couple notable ways. We used the daily illiquidity measure and resampled to a weekly frequency using both the weekly median, and the weekly max. We then calculated the ELR index using the weekly cross sectional data as opposed to the highly nuanced monthly methodology used in the paper.

We then designed a basic linear model using pymc3 to explore the ELR index's impact on the scaled cross sectional returns. After examining the results I am somewhat disappointed we weren't able to show as strong a link as demonstrated in the paper. The max model is clearly the better model according to the DIC, but even then we can see the ELR index is only weakly related to the cross sectional returns.

A positive takeaway is that the *sign* of the relationship is what we would expect. The ELR we calculated is negatively correlated with cross section of returns.

As currently constructed, using this method to form the basis of an asset pricing model seems dubious at best and definitely lowers my expectations when I simulate a long-short strategy in Quantopian.

- Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016). Available at SSRN:https://ssrn.com/abstract=2850278 orhttp://dx.doi.org/10.2139/ssrn.2850278
- Gelman, A. (2008), Scaling regression inputs by dividing by two standard deviations. Statist. Med., 27: 2865–2873. doi:10.1002/sim.3107
- Sedar, Jonathan. "Bayesian Inference with PyMC3 - Part 2."
*The Sampler*. Applied AI, 06 Sept. 2016. Web. 13 Dec. 2016.

**Strategy Summary****References****4-Week Holding Period Strategy Update****1-Week Holding Period Strategy Updated (Target Leverage=2)**

This is a stylized implementation of the strategy described in the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor predicts individual equity returns up to 6 months!

**ABSTRACT**

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns.Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis.This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?

My strategy differs in that I arbitrarily chose 1 and 4 week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.

- Zhang, Xiaoyan and Zhao, Rui and Xing, Yuhang, What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns? (August 14, 2008). AFA 2009 San Francisco Meetings Paper. Available at SSRN:http://ssrn.com/abstract=1107464 orhttp://dx.doi.org/10.2139/ssrn.1107464

**Results simulated using the Quantopian Platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

]]>- Introduction
- Get Data
- Calculate Cross-Sectional Extreme Liquidity Risk
- Quick and Dirty Observations
- Next Steps
- References

One of the primary goals of quantitative investing is effectively managing tail risk. Failure to do so can result in crushing drawdowns or a total blowup of your fund/portfolio. Commonly known tools for estimating tail risk, e.g. Value-at-Risk, often underestimate the likelihood and magnitude of risk-off events. Furthermore, tail risk events are increasingly associated with liquidity events.

Theory links the catalyst of systemic risk events to the funding difficulties of major financial intermediaries. For example, an unexpected default by a major institution would lead to that firm's counterparties reducing risk while they assessing the fallout. Those counterparties are likely to reduce risk by selling assets and/or withdrawing funding resources from the market. This could lead to margin calls, and more selling as the default works its way across the financial network cascading into a negative feedback loop.

A good theoretical risk model will address the relationship between liquidity and tail risk. Ying Wu of Stevens Institute of Technology - School of Business, may have discovered a framework that links these two concepts in a parsimonious and practical manner. His paper 'Asset Pricing with Extreme Liquidity Risk'[1] combines Amihud's[2] stock illiquidity metric with the Hill estimator for modeling tail distributions. He then constructs a normalized Extreme Liquidity Risk (ELR) metric and runs a simple linear regression for each stock to assess its sensitivity to the ELR. His results find that a long-short portfolio based on buying stocks with the highest sensitivity to ELR and shorting those with the lowest, earns a empirically and economically significant return over the time period studied.

The Amihud stock illiquidity metric is a stock's daily absolute return divided by its dollar volume, averaged over some time period. It was constructed for use as a rough measure of price impact and designed to be easily calculated for long time series.

The Hill estimator[3] is a mathematical tool that allows us to focus on the tail of a sample distribution. This tool allows us to "skip" over trying to fit a single distribution over the entire sample and instead we can use the formal framework of Extreme Value Theory to evaluate the *extreme (tail)* values only. The link between Wu's choice of this estimator is based on the empirical evidence of power law behavior in the tails of the price-impact series. This further supports the use of Amihud's illiquidity metric as it was designed to be a crude yet effective measure of price impact.

I urge readers to explore the paper further as some of the deeper mathematical underpinnings are beyond the scope of this post.

For this exploratory study I used the pandas Yahoo Finance API to download 20 years of stock data using a symbol list constructed by CRSP.

```
# Import
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
# get symbols
datasets = '/YOUR/DATASETS/LOCATION/_Datasets/'
symbols = pd.read_csv(datasets+'CSRP_symbol_list.txt',sep='\t').values.flatten()
```

Here is the text file of symbols I used --> Symbols.

Next we construct our convenience functions to aggregate the stock data.

```
# Get Prices Function
def _get_px(symbol, start, end):
return web.DataReader(symbol, 'yahoo', start, end)
# Create HDF5 data store for fast read write
def _create_symbol_datastore(symbols, start, end):
prices_hdf = pd.HDFStore(datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf')
symbol_count = len(symbols)
N = copy(symbol_count)
missing_symbols = []
for i, sym in enumerate(symbols, start=1):
if not pd.isnull(sym):
try:
prices_hdf[sym] = _get_px(sym, start, end)
except Exception as e:
print(e, sym)
missing_symbols.append(sym)
N -= 1
pct_total_left = (N / symbol_count)
print('{}..[done] | {} of {} symbols collected | {:>.2%}'.format(\
sym, i, symbol_count, pct_total_left))
prices_hdf.close()
print(prices_hdf)
return missing_symbols
# Get past 20 years of data from today
# Evaluate missing symbols if you so choose
today = pd.datetime.today().date()
start = today - 252 * BDay() * 20
missing = _create_symbol_datastore(symbols, start, today)
```

This takes roughly 30 minutes to run, which is a good time for a coffee break.

Next we need to calculate each stock's daily illiquidity measure according to Amihud. I also save this data to its own HDF5 store. I find it good practice to save intermediate calculations where possible for reference and ease of reproducibility.

```
# calculate each symbols returns and dollar volumes
# add to dataframe with symbol_lret, symbol_dv, symbol_illiq
FILE = datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf'
start = pd.to_datetime('1999-01-01')
end = pd.to_datetime('2016-11-22')
idx = pd.bdate_range(start, end)
DF = pd.DataFrame(index=idx)
for sym in tqdm(keys):
tmp_hdf = pd.read_hdf(FILE,
mode='r', key=sym)
tmp_hdf = tmp_hdf[['Volume', 'Adj Close']]
# I want at least 1000 daily datapoints per stock
if len(tmp_hdf) > 1000:
try:
dv = (tmp_hdf['Adj Close'] * tmp_hdf['Volume'] / 1e6)[1:]
lret = np.log(tmp_hdf['Adj Close'] / tmp_hdf['Adj Close'].shift(1)).dropna()
daily_illiq = np.abs(lret) / dv
tmp_df = pd.DataFrame({sym.lstrip('/')+'_lret':lret,
sym.lstrip('/')+'_dv':dv,
sym.lstrip('/')+'_illiq':daily_illiq},
index=lret.index)
DF = DF.join(tmp_df, how='outer')
except: continue
print(DF.info())
# Illiquidity HDF originally run on 2016-Nov-11
# DataFrame key is "Illiquidity_Set"
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq_set = DF.loc[:, DF.columns.to_series().str.contains('_illiq').tolist()]
ilq_set.to_hdf(ILQ_FILE, 'Illiquidity_Set')
```

8487 * 4954 = 42,044,598 data points! Some of these are np.nan but still, clearly CSV storage is a non-starter.

Now we are in a position to calculate the Extreme Liquidity Risk metric (ELR) or "Tail Index" for the aggregated stocks. First we read in our 'Illiquidity_set' dataframe from the HDF5 file. Then we create a convenience function to calculate the daily ELR. First lets take a quick glance at the ELR formula:

Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016)

My understanding is that this is a log average of the relative "distance" between the aggregated stocks' illiquidity measures and the threshold *p*. P** is the line in the sand between distribution "body" and distribution "tail". The paper uses the convention of the 95% percentile as the threshold value so I use that here as well.

```
# Read hdf illiquidity
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq = pd.read_hdf(ILQ_FILE, 'Illiquidity_Set')
# function to get daily values for gamma calc
def _ext_lq_risk(series):
# UPDATED: DEC 5TH
# threshold is 95th percentile
# right tailed convention
p_star = np.nanpercentile(series, 95)
illiq = series[series > p_star]
lg_illiq = np.log(illiq / p_star)
lg_illiq = lg_illiq[np.isfinite(lg_illiq)]
try:
gamma = 1./ ((1./len(lg_illiq)) * sum(lg_illiq))
except ZeroDivisionError:
gamma = np.nan
return gamma
```

Now we can calculate the Tail Index and normalize the values to get the ELR series.

```
# Calculate Tail Index for all dates greater than cutoff
df = ilq.copy()
gs = {} # gammas dictionary
cutoff=100
nan_dates = []
for d in df.index:
# we want at least N nonnull values
if df.loc[d].notnull().sum() > cutoff:
gamma = _ext_lq_risk(df.loc[d])
gs[d] = gamma
else:
nan_dates.append(d)
gdf = pd.DataFrame.from_dict(gs, orient='index').sort_index()
gdf.columns = ['Tail_Index']
# the ELR metric is a normalized version of the tail index
# normalize gamma dataframe to calc "ELR"
gdfz = (gdf - gdf.mean())/gdf.std()
gdfz.columns = ['ELR']
```

Let's plot it and take a look.

Blackarbs LLC

First another plot. I skip the code here to save space, but would be happy to post it if requested. The plot below is the IWM used as a market proxy, its drawdown chart, and below that is the ELR. The shaded regions are official NBER recessions.

Blackarbs llc

The ELR appears to rise prior to the official beginning of the Dot-Com bust. It stays relatively elevated throughout the period and begins to decline sometime during the first persistent rally off the lows. Prior to the beginning of 2008's official recession, the ELR is mixed. However, the ELR rises sharply sometime prior to the massive decline in the broad market. In fact it was rising during a period where the market bounced, providing an early warning of the cataclysmic dropoff to come. Furthermore it begins declining shortly after the official NBER recession end date, providing investors with support for getting back into the market. Interestingly the ELR is in a downtrend for most of the low-volatility period that followed the recession. Clearly the metric is not a perfect predictor, but there seems to be evidence that it could be a useful tool, and certainly warrants more rigorous investigation.

There are several directions to pursue regarding Extreme Liquidity Risk Index. We can explore the time series itself using Time Series Analysis (TSA), we can use frequentist or bayesian inference to this end. Or we can get straight to the good stuff, and simulate the long-short portfolio based on each stock's return sensitivity to the ELR as reported in the paper that inspired this post. Check back for part 2, as we explore this concept further.

- Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016). Available at SSRN:https://ssrn.com/abstract=2850278 or http://dx.doi.org/10.2139/ssrn.2850278
- Amihud, Yakov. "Illiquidity and Stock Returns: Cross-section and Time-series Effects."
*Journal of Financial Markets*5.1 (2002): 31-56. Web. - "Heavy-tailed Distribution." Wikipedia. Wikimedia Foundation, n.d. Web. 29 Nov. 2016.

**Strategy Summary****References****4-Week Holding Period Strategy Update****1-Week Holding Period Strategy Updated (Target Leverage=2)**

This is a stylized implementation of the strategy described in the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor predicts individual equity returns up to 6 months!

**ABSTRACT**

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns.Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis.This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?

My strategy differs in that I arbitrarily chose 1 and 4 week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.

- Zhang, Xiaoyan and Zhao, Rui and Xing, Yuhang, What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns? (August 14, 2008). AFA 2009 San Francisco Meetings Paper. Available at SSRN:http://ssrn.com/abstract=1107464 orhttp://dx.doi.org/10.2139/ssrn.1107464

**Results simulated using the Quantopian Platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

]]>- Motivation
- The Basics
- Stationarity
- Serial Correlation (Autocorrelation)
- Why do we care about Serial Correlation?

- White Noise and Random Walks
- Linear Models
- Log-Linear Models
- Autoregressive Models - AR(p)
- Moving Average Models - MA(q)
- Autoregressive Moving Average Models - ARMA(p, q)
- Autoregressive Integrated Moving Average Models - ARIMA(p, d, q)
- Autoregressive Conditionally Heterskedastic Models - ARCH(p)
- Generalized Autoregressive Conditionally Heterskedastic Models - GARCH(p, q)
- References

Early in my quant finance journey, I learned various time series analysis techniques and how to use them but I failed to develop a deeper understanding of how the pieces fit together. I struggled to see the bigger picture of why we use certain models vs others, or how these models build on each other's weaknesses. The underlying purpose for employing these techniques eluded me for too long. That is, until I came to understand this:

By developing our time series analysis (TSA) skillset we are better able to understand what has already happened, *and *make better, more profitable, predictions of the future. Example applications include predicting future asset returns, future correlations/covariances, and future volatility.

This post is inspired by the great work Michael Halls Moore has done on his blog, Quantstart, especially his series on TSA. I thought translating some of his work to Python could help others who are less familiar with R. I have also adapted code from other bloggers as well. See References.

Before we begin let's import our Python libraries.

```
import os
import sys
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from arch import arch_model
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
p = print
p('Machine: {} {}\n'.format(os.uname().sysname,os.uname().machine))
p(sys.version)
# Machine: Linux x86_64
# 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
# [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
```

Let's use the **pandas_datareader** package to grab some sample data using the Yahoo Finance API.

```
end = '2015-01-01'
start = '2007-01-01'
get_px = lambda x: web.DataReader(x, 'yahoo', start=start, end=end)['Adj Close']
symbols = ['SPY','TLT','MSFT']
# raw adjusted close prices
data = pd.DataFrame({sym:get_px(sym) for sym in symbols})
# log returns
lrets = np.log(data/data.shift(1)).dropna()
```

A time series is a series of data points indexed (or listed or graphed) in time order. - Wikipedia

Here I use an infogrpahic found on SeanAbu.com. I find the pictures very intuitive.

Seanabu.com

*So what? Why do we care about stationarity?*

- A stationary time series (TS) is simple to predict as we can assume that future statistical properties are the same or proportional to current statistical properties.
- Most of the models we use in TSA assume
**covariance-stationarity (#3 above)**. This means**the descriptive statistics these models predict e.g. means, variances, and correlations, are only reliable if the TS is stationary and invalid otherwise.**

"For example, if the series is consistently increasing over time, the sample mean and variance will grow with the size of the sample, and they will always underestimate the mean and variance in future periods. And if the mean and variance of a series are not well-defined, then neither are its correlations with other variables."- http://people.duke.edu/~rnau/411diff.htm

With that said, **most TS we encounter in finance is NOT stationary.** Therefore a large part of TSA involves identifying if the series we want to predict is stationary, and if it is not we must find ways to transform it such that it is stationary. (More on that later)

Essentially when we model a time series we decompose the series into three components: trend, seasonal/cyclical, and random. The random component is called the residual or error. It is simply the difference between our predicted value(s) and the observed value(s). Serial correlation is when the residuals (errors) of our TS models are correlated with each other.

We care about serial correlation because it is critical for the validity of our model predictions, and is intrinsically related to stationarity. Recall that the residuals (errors) of a *stationary* TS are serially *uncorrelated* by definition! If we fail to account for this in our models the standard errors of our coefficients are underestimated, inflating the size of our T-statistics. The result is too many Type-1 errors, where we reject our null hypothesis even when it is True! **In layman's terms, ignoring autocorrelation means our model predictions will be bunk, and we're likely to draw incorrect conclusions about the impact of the independent variables in our model.**

White noise is the first Time Series Model (TSM) we need to understand. By definition a time series that is a white noise process has serially UNcorrelated errors and the expected mean of those errors is equal to zero. Another description for serially uncorrelated errors is, independent and identically distributed (i.i.d.). This is important because, if our TSM is appropriate and successful at capturing the underlying process, the residuals of our model will be i.i.d. and resemble a white noise process. Therefore part of TSA is literally trying to fit a model to the time series such that the residual series is indistinguishable from white noise.

Let's simulate a white noise process and view it. Below I introduce a convenience function for plotting the time series and analyzing the serial correlation visually. This code was adapted from the blog Seanabu.com

```
def tsplot(y, lags=None, figsize=(10, 8), style='bmh'):
if not isinstance(y, pd.Series):
y = pd.Series(y)
with plt.style.context(style):
fig = plt.figure(figsize=figsize)
#mpl.rcParams['font.family'] = 'Ubuntu Mono'
layout = (3, 2)
ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
acf_ax = plt.subplot2grid(layout, (1, 0))
pacf_ax = plt.subplot2grid(layout, (1, 1))
qq_ax = plt.subplot2grid(layout, (2, 0))
pp_ax = plt.subplot2grid(layout, (2, 1))
y.plot(ax=ts_ax)
ts_ax.set_title('Time Series Analysis Plots')
smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
sm.qqplot(y, line='s', ax=qq_ax)
qq_ax.set_title('QQ Plot')
scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)
plt.tight_layout()
return
```

We can model a white noise process easily and output the TS plot for visual inspection.

```
np.random.seed(1)
# plot of discrete white noise
randser = np.random.normal(size=1000)
tsplot(randser, lags=30)
```

guassian white noise

We can see that process appears to be random and centered about zero. The autocorrelation (ACF) and partial autocorrelation (PACF) plots also indicate no significant serial correlation. Keep in mind we should see approximately 5% significance in the autocorrelation plots due to pure chance as a result of sampling from the Normal distribution. Below that we can see the QQ and Probability Plots, which compares the distribution of our data with another theoretical distribution. In this case, that theoretical distribution is the standard normal distribution. Clearly our data is distributed randomly, and appears to follow Gaussian (Normal) white noise, as it should.

```
p("Random Series\n -------------\nmean: {:.3f}\nvariance: {:.3f}\nstandard deviation: {:.3f}"
.format(randser.mean(), randser.var(), randser.std()))
# Random Series
# -------------
# mean: 0.039
# variance: 0.962
# standard deviation: 0.981
```

A Random Walk is defined below:

- Michael halls moore [quantsart.com]

The significance of a random walk is that it is **non-stationary **because the covariance between observations is time-dependent. If the TS we are modeling is a random walk it is unpredictable.

Let's simulate a random walk using the "numpy.random.normal(size=our_sample_size)" function to sample from the standard normal distribution.

```
# Random Walk without a drift
np.random.seed(1)
n_samples = 1000
x = w = np.random.normal(size=n_samples)
for t in range(n_samples):
x[t] = x[t-1] + w[t]
_ = tsplot(x, lags=30)
```

Random walk without a drift

Clearly our TS is not stationary. Let's find out if the random walk model is a good fit for our simulated data. Recall that a random walk is ** xt = xt-1 + wt**. Using algebra we can say that

```
# First difference of simulated Random Walk series
_ = tsplot(np.diff(x), lags=30)
```

first difference of a random walk series

Our definition holds as this looks exactly like a white noise process. What if we fit a random walk to the first difference of SPY's prices?

```
# First difference of SPY prices
_ = tsplot(np.diff(data.SPY), lags=30)
```

fitting a random walk model to SPY ETF prices

Wow, it's quite similar to white noise. However, notice the shape of the QQ and Probability plots. This indicates that the process is close to normality but with **'heavy tails'. **There also appears to be some significant serial correlation in the ACF, and PACF plots around lags 1, 5?, 16?, 18 and 21. This means that there should be better models to describe the actual price change process.

Linear models aka trend models represent a TS that can be graphed using a straight line. The basic equation is:

In this model the value of the dependent variable is determined by the beta coefficients and a singular independent variable, *time. *An example could be a company's sales that increase by the same amount at each time step. Let's look at a contrived example below. In this simulation we assume Firm ABC sales regardless of time are -$50.00 (*beta 0 or the intercept term*) and +$25.00 (*beta 1*) at every time step.

```
# simulate linear trend
# example Firm ABC sales are -$50 by default and +$25 at every time step
w = np.random.randn(100)
y = np.empty_like(w)
b0 = -50.
b1 = 25.
for t in range(len(w)):
y[t] = b0 + b1*t + w[t]
_ = tsplot(y, lags=lags)
```

Linear trend model simulation

Here we can see that the residuals of the model are correlated and linearly decreasing as a function of the lag. The distribution is approximately normal. Before using this model to make predictions we would have to account for and remove the obvious autocorrelation present in the series. The significance of the PACF at lag 1 indicates that an *autoregressive* model may be appropriate.

These models are similar to linear models except that the data points form an exponential function that represent a constant rate of change with respect to each time step. For example, firm ABC's sales increasing X% at each time step. When plotting the simulated sales data you get a curve that looks like this:

```
# Simulate ABC exponential growth
# fake dates
idx = pd.date_range('2007-01-01', '2012-01-01', freq='M')
# fake sales increasing at exponential rate
sales = [np.exp( x/12 ) for x in range(1, len(idx)+1)]
# create dataframe and plot
df = pd.DataFrame(sales, columns=['Sales'], index=idx)
with plt.style.context('bmh'):
df.plot()
plt.title('ABC Sales')
```

simulated exponential function

We can then transform the data by taking the natural logarithm of sales. Now a linear regression is a much better fit to the data.

```
# ABC log sales
with plt.style.context('bmh'):
pd.Series(np.log(sales), index=idx).plot()
plt.title('ABC Log Sales')
```

Natural logarithm of exponential function

These models have a fatal weakness as discussed previously. They assume serially UNcorrelated errors, which as we have seen in the linear model example is not true. In real life, TS data usually violates our stationary assumptions which motivates our progression to autoregressive models.

When the dependent variable is regressed against one or more lagged values of itself the model is called autoregressive. The formula looks like this:

AR(p) model formula

When you describe the **"order"** of the model, as in, an AR model of order **"p", **the p represents the number of lagged variables used within the model. For example an AR(2) model or *second-order *autoregressive model looks like this:

AR(2) model formula

Here, alpha (a) is the coefficient, and omega (w) is a white noise term. Alpha cannot equal zero in an AR model. Note that an AR(1) model with alpha set equal to 1 is a *random walk* and therefore not stationary.

AR(1) model with alpha = 1; random walk

Let's simulate an AR(1) model with alpha set equal to 0.6

```
# Simulate an AR(1) process with alpha = 0.6
np.random.seed(1)
n_samples = int(1000)
a = 0.6
x = w = np.random.normal(size=n_samples)
for t in range(n_samples):
x[t] = a*x[t-1] + w[t]
_ = tsplot(x, lags=lags)
```

AR(1) Model with alpha = 0.6

As expected the distribution of our simulated AR(1) model is normal. There is significant serial correlation between lagged values especially at lag 1 as evidenced by the PACF plot.

Now we can fit an AR(p) model using Python's statsmodels. First we fit the AR model to our simulated data and return the estimated alpha coefficient. Then we use the statsmodels function **"select_order()" **to see if the fitted model will select the correct lag. If the AR model is correct the estimated alpha coefficient will be close to our true alpha of 0.6 and the selected order will equal 1.

```
# Fit an AR(p) model to simulated AR(1) model with alpha = 0.6
mdl = smt.AR(x).fit(maxlag=30, ic='aic', trend='nc')
%time est_order = smt.AR(x).select_order(
maxlag=30, ic='aic', trend='nc')
true_order = 1
p('\nalpha estimate: {:3.5f} | best lag order = {}'
.format(mdl.params[0], est_order))
p('\ntrue alpha = {} | true order = {}'
.format(a, true_order))
```

Looks like we were able to recover the underlying parameters of our simulated data. Let's simulate an AR(2) process with alpha_1 = 0.666 and alpha_2 = -0.333. For this we make use of statsmodel's **"arma_generate_samples()"** function. This function allows us to simulate an AR model of arbitrary orders. Note that there are some peculiarities of Python's version which requires us to take some extra steps before using the function.

```
# Simulate an AR(2) process
n = int(1000)
alphas = np.array([.666, -.333])
betas = np.array([0.])
# Python requires us to specify the zero-lag value which is 1
# Also note that the alphas for the AR model must be negated
# We also set the betas for the MA equal to 0 for an AR(p) model
# For more information see the examples at statsmodels.org
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ar2 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ar2, lags=lags)
```

AR(2) simulation with alpha_1 = 0.666 and alpha_2 = -0.333

Let's see if we can recover the correct parameters.

```
# Fit an AR(p) model to simulated AR(2) process
max_lag = 10
mdl = smt.AR(ar2).fit(maxlag=max_lag, ic='aic', trend='nc')
est_order = smt.AR(ar2).select_order(
maxlag=max_lag, ic='aic', trend='nc')
true_order = 2
p('\ncoef estimate: {:3.4f} {:3.4f} | best lag order = {}'
.format(mdl.params[0],mdl.params[1], est_order))
p('\ntrue coefs = {} | true order = {}'
.format([.666,-.333], true_order))
# coef estimate: 0.6291 -0.3196 | best lag order = 2
# true coefs = [0.666, -0.333] | true order = 2
```

Not bad. Let's see how the AR(p) model will fit MSFT log returns. Here is the return TS.

MSFT log returns time series

```
# Select best lag order for MSFT returns
max_lag = 30
mdl = smt.AR(lrets.MSFT).fit(maxlag=max_lag, ic='aic', trend='nc')
est_order = smt.AR(lrets.MSFT).select_order(
maxlag=max_lag, ic='aic', trend='nc')
p('best estimated lag order = {}'.format(est_order))
# best estimated lag order = 23
```

The best order is 23 lags or 23 parameters! Any model with this many parameters is unlikely to be useful in practice. Clearly there is more complexity underlying the returns process than this model can explain.

MA(q) models are very similar to AR(p) models. The difference is that the MA(q) model is a linear combination of past white noise error terms as opposed to a linear combo of past observations like the AR(p) model. The motivation for the MA model is that we can observe "shocks" in the error process directly by fitting a model to the error terms. In an AR(p) model these shocks are observed indirectly by using the ACF on the series of past observations. The formula for an MA(q) model is:

Omega (w) is white noise with E(wt) = 0 and variance of sigma squared. Let's simulate this process using beta=0.6 and specifying the AR(p) alpha equal to 0.

```
# Simulate an MA(1) process
n = int(1000)
# set the AR(p) alphas equal to 0
alphas = np.array([0.])
betas = np.array([0.6])
# add zero-lag and negate alphas
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ma1 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ma1, lags=30)
```

Simulated ma(1) process with beta=0.6

The ACF function shows that lag 1 is significant which indicates that a MA(1) model may be appropriate for our simulated series. I'm not sure how to interpret the PACF showing significance at lags 2, 3, and 4 when the ACF only shows significance at lag 1. Regardless we can now attempt to fit a MA(1) model to our simulated data. We can use the same statsmodels **"ARMA()" **function specifying our chosen orders. We call on its **"fit()"** method to return the model output.

```
# Fit the MA(1) model to our simulated time series
# Specify ARMA model with order (p, q)
max_lag = 30
mdl = smt.ARMA(ma1, order=(0, 1)).fit(
maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
```

MA(1) model summary

The model was able to correctly estimate the lag coefficent as 0.58 is close to our true value of 0.6. Also notice that our 95% confidence interval does contain the true value. Let's try simulating an MA(3) process, then use our ARMA function to fit a third order MA model to the series and see if we can recover the correct lag coefficients (betas). Betas 1-3 are equal to 0.6, 0.4, and 0.2 respectively.

```
# Simulate MA(3) process with betas 0.6, 0.4, 0.2
n = int(1000)
alphas = np.array([0.])
betas = np.array([0.6, 0.4, 0.2])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ma3 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ma3, lags=30)
```

Simulated ma(3) process with betas = [0.6, 0.4, 0.2]

```
# Fit MA(3) model to simulated time series
max_lag = 30
mdl = smt.ARMA(ma3, order=(0, 3)).fit(
maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
```

MA(3) model summary

The model was able to estimate the real coefficients effectively. Our 95% confidence intervals also contain the true parameter values of 0.6, 0.4, and 0.3. Now let's try fitting an MA(3) model to the SPY's log returns. Keep in mind we do not know the *true *parameter values.

```
# Fit MA(3) to SPY returns
max_lag = 30
Y = lrets.SPY
mdl = smt.ARMA(Y, order=(0, 3)).fit(
maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
_ = tsplot(mdl.resid, lags=max_lag)
```

SPY MA(3) model summary

Let's look at the model residuals.

SPY MA(3) model Residuals

Not bad. Some of the ACF lags concern me especially at 5, 16, and 18. It could be sampling error but that combined with the heaviness of the tails makes me think this isn't the best model to predict future SPY returns.

As you may have guessed, the ARMA model is simply the merger between AR(p) and MA(q) models. Let's recap what these models represent to us from a quant finance perspective:

- AR(p) models try to capture
*(explain)*the momentum and mean reversion effects often observed in trading markets. - MA(q) models try to capture
*(explain)*the shock effects observed in the white noise terms. These shock effects could be thought of as unexpected events affecting the observation process e.g. Surprise earnings, A terrorist attack, etc.

"For a set of products in a grocery store, the number of active coupon campaigns introduced at different times would constitute multiple 'shocks' that affect the prices of the products in question."

- AM207: Pavlos Protopapas, Harvard University

ARMA's weakness is that it ignores the *volatility clustering *effects found in most financial time series.

The model formula is:

arma(p, q) equation

Let's simulate an ARMA(2, 2) process with given parameters, then fit an ARMA(2, 2) model and see if it can correctly estimate those parameters. Set alphas equal to [0.5,-0.25] and betas equal to [0.5,-0.3].

```
# Simulate an ARMA(2, 2) model with alphas=[0.5,-0.25] and betas=[0.5,-0.3]
max_lag = 30
n = int(5000) # lots of samples to help estimates
burn = int(n/10) # number of samples to discard before fit
alphas = np.array([0.5, -0.25])
betas = np.array([0.5, -0.3])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
arma22 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma22, lags=max_lag)
mdl = smt.ARMA(arma22, order=(2, 2)).fit(
maxlag=max_lag, method='mle', trend='nc', burnin=burn)
p(mdl.summary())
```

simulated ARma(2, 2) process

ARMA(2, 2) Model summary

The model has correctly recovered our parameters, and our true parameters are contained within the 95% confidence interval.

Next we simulate a ARMA(3, 2) model. After, we cycle through a non trivial number of combinations of p, q to fit an ARMA model to our simulated series. We choose the best combination based on which model produces the lowest Akaike Information Criterion (AIC).

```
# Simulate an ARMA(3, 2) model with alphas=[0.5,-0.25,0.4] and betas=[0.5,-0.3]
max_lag = 30
n = int(5000)
burn = 2000
alphas = np.array([0.5, -0.25, 0.4])
betas = np.array([0.5, -0.3])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
arma32 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma32, lags=max_lag)
# pick best order by aic
# smallest aic value wins
best_aic = np.inf
best_order = None
best_mdl = None
rng = range(5)
for i in rng:
for j in rng:
try:
tmp_mdl = smt.ARMA(arma32, order=(i, j)).fit(method='mle', trend='nc')
tmp_aic = tmp_mdl.aic
if tmp_aic < best_aic:
best_aic = tmp_aic
best_order = (i, j)
best_mdl = tmp_mdl
except: continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: 14108.27213 | order: (3, 2)
```

The correct order was recovered above. Below we see the output of our simulated time series before any model fitting.

Simulated arma(3, 2) series with alphas = [0.5,-0.25,0.4] and betas = [0.5,-0.3]

ARmA(3, 2) BEst model summary

We see that the correct order was selected and the model correctly estimated our parameters. However notice the MA.L1.y coefficent; the true value of 0.5 is almost outside of the 95% confidence interval!

Below we observe the model's residuals. Clearly it is a white noise process, thus the best model has been fit to *explain* the data.

ARMA(3, 2) best model residual white noise

Next we fit an ARMA model to SPY returns. The plot below is the time series before model fitting.

SPY Returns

```
# Fit ARMA model to SPY returns
best_aic = np.inf
best_order = None
best_mdl = None
rng = range(5) # [0,1,2,3,4,5]
for i in rng:
for j in rng:
try:
tmp_mdl = smt.ARMA(lrets['SPY'], order=(i, j)).fit(
method='mle', trend='nc'
)
tmp_aic = tmp_mdl.aic
if tmp_aic < best_aic:
best_aic = tmp_aic
best_order = (i, j)
best_mdl = tmp_mdl
except: continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: -11518.22902 | order: (4, 4)
```

We plot the model residuals.

SPY best model residuals arma(4, 4)

The ACF and PACF are showing no significant autocorrelation. The QQ and Probability Plots show the residuals are approximately normal with heavy tails. However, this model's residuals do NOT look like white noise! Look at the highlighted areas of obvious conditional heteroskedasticity (*conditional volatility*) that the model has not captured.

ARIMA is a natural extension to the class of ARMA models. As previously mentioned many of our TS are not stationary, however they can be made stationary by differencing. We saw an example of this when we took the first difference of a Guassian random walk and proved that it equals white noise. Said another way, we took the nonstationary random walk and transformed it to stationary white noise by first-differencing.

Without diving too deeply into the equation, just know the **"d"** references the number of times we are differencing the series. A side note, in Python we must use **np.diff()** function if we need to difference a series more than once. The **pandas **functions **DataFrame****.diff()/Series.diff() only takes the first difference of a dataframe/series and does not implement the recursive differencing needed in TSA. **

In the following example, we iterate through a non-trivial number of combinations of (p, d, q) orders, to find the best ARIMA model to fit SPY returns. We use the AIC to evaluate each model. The lowest AIC wins.

```
# Fit ARIMA(p, d, q) model to SPY Returns
# pick best order and final model based on aic
best_aic = np.inf
best_order = None
best_mdl = None
pq_rng = range(5) # [0,1,2,3,4]
d_rng = range(2) # [0,1]
for i in pq_rng:
for d in d_rng:
for j in pq_rng:
try:
tmp_mdl = smt.ARIMA(lrets.SPY, order=(i,d,j)).fit(method='mle', trend='nc')
tmp_aic = tmp_mdl.aic
if tmp_aic < best_aic:
best_aic = tmp_aic
best_order = (i, d, j)
best_mdl = tmp_mdl
except: continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: -11518.22902 | order: (4, 0, 4)
# ARIMA model resid plot
_ = tsplot(best_mdl.resid, lags=30)
```

It should be no surprise that the best model has a differencing of 0. Recall that we already took the first difference of log prices to calculate the stock returns. Below, I plot the model residuals. The result is essentially identical to the ARMA(4, 4) model we fit above. Clearly this ARIMA model has not explained the conditional volatility in the series either!

ARIMA model fit to spy returns

Now we have at least accumulated enough knowledge to make a simple forecast of future returns. Here we make use of our model's **forecast() **method. As arguments, it takes an integer for the number of time steps to predict, and a decimal for the alpha argument to specify the confidence intervals. The default setting is 95% confidence. For 99% set alpha equal to 0.01.

```
# Create a 21 day forecast of SPY returns with 95%, 99% CI
n_steps = 21
f, err95, ci95 = best_mdl.forecast(steps=n_steps) # 95% CI
_, err99, ci99 = best_mdl.forecast(steps=n_steps, alpha=0.01) # 99% CI
idx = pd.date_range(data.index[-1], periods=n_steps, freq='D')
fc_95 = pd.DataFrame(np.column_stack([f, ci95]),
index=idx, columns=['forecast', 'lower_ci_95', 'upper_ci_95'])
fc_99 = pd.DataFrame(np.column_stack([ci99]),
index=idx, columns=['lower_ci_99', 'upper_ci_99'])
fc_all = fc_95.combine_first(fc_99)
fc_all.head()
```

```
# Plot 21 day forecast for SPY returns
plt.style.use('bmh')
fig = plt.figure(figsize=(9,7))
ax = plt.gca()
ts = lrets.SPY.iloc[-500:].copy()
ts.plot(ax=ax, label='Spy Returns')
# in sample prediction
pred = best_mdl.predict(ts.index[0], ts.index[-1])
pred.plot(ax=ax, style='r-', label='In-sample prediction')
styles = ['b-', '0.2', '0.75', '0.2', '0.75']
fc_all.plot(ax=ax, style=styles)
plt.fill_between(fc_all.index, fc_all.lower_ci_95, fc_all.upper_ci_95, color='gray', alpha=0.7)
plt.fill_between(fc_all.index, fc_all.lower_ci_99, fc_all.upper_ci_99, color='gray', alpha=0.2)
plt.title('{} Day SPY Return Forecast\nARIMA{}'.format(n_steps, best_order))
plt.legend(loc='best', fontsize=10)
```

21-Day spy Return forecast - Arima(4,0,4)

ARCH(p) models can be thought of as simply an AR(p) model applied to the variance of a time series. Another way to think about it, is that the variance of our time series NOW *at time t*, is conditional on past observations of the variance in previous periods.

arch(1) model formula - penn state

Assuming the series has zero mean we can express the model as:

ARCH(1) model if zero mean

```
# Simulate ARCH(1) series
# Var(yt) = a_0 + a_1*y{t-1}**2
# if a_1 is between 0 and 1 then yt is white noise
np.random.seed(13)
a0 = 2
a1 = .5
y = w = np.random.normal(size=1000)
Y = np.empty_like(y)
for t in range(len(y)):
Y[t] = w[t] * np.sqrt((a0 + a1*y[t-1]**2))
# simulated ARCH(1) series, looks like white noise
tsplot(Y, lags=30)
```

simulated ARCH(1) Process

Simulated arch(1)**2 process

Notice the ACF, and PACF seem to show significance at lag 1 indicating an AR(1) model for the variance may be appropriate.

Simply put GARCH(p, q) is an ARMA model applied to the variance of a time series i.e., it has an autoregressive term and a moving average term. The AR(p) models the variance of the residuals (squared errors) or simply our time series squared. The MA(q) portion models the variance of the process. The basic GARCH(1, 1) formula is:

garch(1, 1) formula from quantstart.com

Omega (w) is white noise, and alpha and beta are parameters of the model. Also alpha_1 + beta_1 must be less than 1 or the model is unstable. We can simulate a GARCH(1, 1) process below.

```
# Simulating a GARCH(1, 1) process
np.random.seed(2)
a0 = 0.2
a1 = 0.5
b1 = 0.3
n = 10000
w = np.random.normal(size=n)
eps = np.zeros_like(w)
sigsq = np.zeros_like(w)
for i in range(1, n):
sigsq[i] = a0 + a1*(eps[i-1]**2) + b1*sigsq[i-1]
eps[i] = w[i] * np.sqrt(sigsq[i])
_ = tsplot(eps, lags=30)
```

Simulated Garch(1, 1) process

Again, notice that overall this process closely resembles white noise, however take a look when we view the squared **eps** series.

simulated garch(1, 1) process squared

There is clearly autocorrelation present and the significance of the lags in both the ACF and PACF indicate we need both AR and MA components for our model. Let's see if we can recover our process parameters using a GARCH(1, 1) model. Here we make use of the **arch_model **function from the **ARCH **package.

```
# Fit a GARCH(1, 1) model to our simulated EPS series
# We use the arch_model function from the ARCH package
am = arch_model(eps)
res = am.fit(update_freq=5)
p(res.summary())
```

garch model fit summary

Now let's run through an example using SPY returns. The process is as follows:

- Iterate through combinations of ARIMA(p, d, q) models to best fit our time series.
- Pick the GARCH model orders according to the ARIMA model with lowest AIC.
- Fit the GARCH(p, q) model to our time series.
- Examine the model residuals and squared residuals for autocorrelation

Also note that I've chosen a specific time period to better highlight key points. However the results will be different depending on the time period under study.

```
def _get_best_model(TS):
best_aic = np.inf
best_order = None
best_mdl = None
pq_rng = range(5) # [0,1,2,3,4]
d_rng = range(2) # [0,1]
for i in pq_rng:
for d in d_rng:
for j in pq_rng:
try:
tmp_mdl = smt.ARIMA(TS, order=(i,d,j)).fit(
method='mle', trend='nc'
)
tmp_aic = tmp_mdl.aic
if tmp_aic < best_aic:
best_aic = tmp_aic
best_order = (i, d, j)
best_mdl = tmp_mdl
except: continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
return best_aic, best_order, best_mdl
# Notice I've selected a specific time period to run this analysis
TS = lrets.SPY.ix['2012':'2015']
res_tup = _get_best_model(TS)
# aic: -5255.56673 | order: (3, 0, 2)
```

residuals of arima(3,0,2) model fit to SPY returns

Looks like white noise.

squared RESIDUALS OF ARIMA(3,0,2) MODEL FIT TO SPY RETURNS

Squared residuals show autocorrelation. Let's fit a GARCH model and see how it does.

```
# Now we can fit the arch model using the best fit arima model parameters
p_ = order[0]
o_ = order[1]
q_ = order[2]
# Using student T distribution usually provides better fit
am = arch_model(TS, p=p_, o=o_, q=q_, dist='StudentsT')
res = am.fit(update_freq=5, disp='off')
p(res.summary())
```

GARCH(3, 2) model fit to spy returns

Convergence warnings can occur when dealing with very small numbers. Multiplying the numbers by factors of 10 to scale the magnitude can help when necessary, however for this demonstration it isn't necessary. Below are the model residuals.

residuals of garch(3, 2) model fit to SPY returns

Looks like white noise above. Now let's view the ACF and PACF of the squared residuals.

Looks like we have achieved a good model fit as there is no obvious autocorrelation in the squared residuals.

- Quantstart.com - https://www.quantstart.com/articles#time-series-analysis
- Harvard Lectures in Python - http://iacs-courses.seas.harvard.edu/courses/am207/blog/lecture-17.html
- Penn State Stats - https://onlinecourses.science.psu.edu/stat510/node/78
- stationary pic + tsplot - http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
- stationary quote, etc - http://people.duke.edu/~rnau/411diff.htm
- interpreting qq plots - http://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot
- Kaplan SchweserNotes (Level 2) - Quantitative Methods

**Strategy Summary****Results****Conclusions/Analysis**

First, if you're unfamiliar with the Implied Volatility Skew Strategy you can find a recent deep dive into the strategy and its performance **here.**

In this short post, I look at the effect of using only the top N ranked ETFs from each Long/Short portfolio. In this case, N is equal to 3. This is an arbitrary selection and this study could be done with the top 1, 2, 4, etc. This differs from the original strategy in that the original strategy simply bins the ETFs into quantiles (according to the factor value) and creates the portfolio from the top and bottom quantiles; **this strategy takes the top 3 ETFs from within the top and bottom quantiles.**

**I will reference the original strategy as S1 and the modified strategy as S2 for the remainder of the post.**

This strategy comparison exposes some key themes found in quant finance, but first some highlights:

- S2 outperforms S1 on a pure total return basis over the period. (~23% vs ~18%)
- S2 has a higher annualized alpha. (26% vs 20%)
- S2 has a lower annualized beta. (0% vs 10%)
- S2 has nearly double the volatility of S1. (13% vs 7%)
- S2 has nearly double the maximum drawdown of S1. (-4.4% vs -2.3%)
- S2 has max drawdown duration 3x larger than S1. (72 days vs 24 days)
- Both strategies have a positive information ratio of approximately 6%.

We can extract a few lessons here.

S2 trades a maximum of six ETFs at a time vs the ~10+ ETFs of S1, so **we would expect S2 to have a higher level of volatility than S1. **You might ask, "Why is that?". Well, this is basic Market Portfolio Theory which relates diversification and portfolio volatility to the number of assets traded and their correlations with each other. Essentially, **the theory says that the volatility of a portfolio of assets will be less than the sum of their individual volatility's as long as the correlations between the assets is not equal to 1.** This is the power and benefit of diversification in a nutshell.

Unfortunately for S2, the increase in total returns is not enough to compensate us for the increase in volatility (risk). This shows up primarily in the annualized sharpe ratio. **S1 clearly has the edge in return per unit of risk, with a sharpe ratio of ~2.8 vs ~2.** This also manifests in the Calmar ratio, which tracks the ratio of Avg. Annual Returns over Maximum Drawdown. Here **S1 is also superior sporting a Calmar ratio of ~9.8 vs ~6.7.**

S2's lower beta surprised me initially, but this is likely explained by trading less ETFs overall combined with those ETFs having low market correlation.

**Another issue we encounter with S2 vs S1 has to do with liquidity**. In the equity curve charts, I have highlighted the general region where the strategy was on hiatus. Notice in S2, that there are tiny trades still occurring throughout this period. There should be no trades happening. **What this means is that because S2 has to invest more cash into a smaller number of ETFs, if any of those ETFs are very thinly traded, it results in lots of unfilled orders that can take a significant amount of time to finish allocating or liquidating.** Again, this is another side effect of S2's reduced diversification.

Thus far all the evidence supports S1 as the superior strategy indicating that using a top N factor rank within top and bottom quantiles does not improve the strategy.

]]>