- Motivating the Journey
- Where Do Edges Come From?
- The Problem with Traditional Research
- The Hidden Side

**A Brief Description:**

- Part 1 - A Visual Introduction to Hidden Markov Models with Python
- Part 2 - Exploring Mixture Models with Scikit-Learn and Python
- Part 3 - Predicting Market Bottoms with Scikit-Learn and Python

Edges come from superior ability to identify and execute profitable strategies.

To see this, imagine the first strategy that could identify pricing errors on identical items traded in different markets. That knowledge is valuable in two scenarios: you can execute the transaction yourself, or you know someone who can and will pay you for the "signal".

Abstractly, a signal can be thought of as a glitch in the matrix allowing us a view through a window into probabilistic future states. Signals can come from anywhere and are not always understood.

Our job is to find these signals, vet them, and implement them. This is difficult in practice. The competitive environment we seek to understand is dynamic, with positive and negative feedback loops operating at various scales. The system processes are very noisy, making signal extraction confusing and difficult. Competitors are always seeking strategies that "work" until they don't.

Generally, profitable edges stop working once both your identification and execution strategies are well known. Thus participants have a profit motive for secrecy and obfuscation. If you play poker, this dynamic will sound familiar.

This also means that using well known identification techniques puts you at a strategic disadvantage because your competitors have likely incorporated knowledge of your methods into their own strategies.

Therefore we must continuously search for strategies that are not well understood, not well known, or otherwise difficult for our competitors to implement.

Too much published "research" focuses on using well known statistical tools to draw conclusions that do not improve the odds of profitable investment. Worse still, many research papers' results are not reproducible.

For periods of time, techniques involving technical analysis, regression, and simple correlations were good enough to beat the market. They worked because the methodology was not yet well known or well understood. Times have changed.

These methods have been taught and promoted to generations of practitioners and form the foundation of many market participants' investment strategies. Therefore the majority of well known strategies are already in use by the market.

This means sophisticated participants have had time and opportunity to develop counter strategies to take advantage of the limitations of publicly known methods.

Typical business finance teachings focus on the theory that stock values are directly tied to the expected value of net cash flows produced by the underlying operating business from now into some future period. Other research links stock prices to any number of other observable factors. My perception is that these well-taught methods can bias our exploratory research when it comes to the art and science of **prediction**.

Successful prediction does not require understanding or logic. Prediction does not require expertise in the industry or business which generated the data. These things can help solidify our belief in the power of the prediction; however, successful prediction methods only require a stable, positive payoff function relative to prediction accuracy over an expanding time period. Nothing more, nothing less.
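To make the payoff-function requirement concrete, here is a toy calculation with purely hypothetical numbers: what matters is not raw accuracy, but accuracy combined with the size of wins and losses.

```python
# Toy illustration with hypothetical numbers: a prediction method is
# worth using when its expected payoff per prediction is positive and
# stable, regardless of whether we understand why it works.
def expected_payoff(accuracy, avg_win, avg_loss):
    """Expected payoff per prediction given hit rate and payoff sizes."""
    return accuracy * avg_win - (1 - accuracy) * avg_loss

# A 45% hit rate is profitable when winners outsize losers...
low_acc = expected_payoff(0.45, avg_win=2.0, avg_loss=1.0)    # ~0.35
# ...while a 60% hit rate loses money with a poor payoff profile.
high_acc = expected_payoff(0.60, avg_win=1.0, avg_loss=2.0)   # ~-0.20
print(low_acc, high_acc)
```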

Rigid knowledge structures can blind us to potential opportunities. Using statistics to explore only observable factors ignores the entire spectrum of hidden, unobserved factors influencing asset returns.

By definition a hidden factor is not directly observable. Its presence or influence is detected by its effect on observable factor(s) or on a delayed basis.

Conceptualizing the influence of hidden factors is difficult for many decision makers to either understand or incorporate into already existing processes.

The combination of the bias created by traditional finance and the difficulty of conceptualizing hidden factors creates the barriers to entry we need for successful strategy development. We can reasonably assume this research pathway is still rich with profitable edges and worth pursuing.

In part 1, we will discuss Markov Models, Hidden Markov Models and a toy application for regime detection.

In part 2, we will explore the motivation behind mixture models and how they improve on the weaknesses of K-means algorithms. We will also discuss the connection between Mixture Models and Hidden Markov Models. Finally we will extend our toy regime detector to use a mixture model instead.

In part 3, we will implement a toy strategy using mixture models to predict market bottoms. The strategy assumes that we can calibrate a model to predict the market return distribution such that actual returns that fall below the confidence intervals are profitable long entries over short time periods.
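As a minimal sketch of that entry rule, using synthetic returns and an assumed two-sigma lower bound (the actual strategy in part 3 calibrates a mixture model rather than assuming normality):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.normal(loc=0.0005, scale=0.01, size=500)  # stand-in daily returns

# Hypothetical calibration: estimate the return distribution from a
# trailing window, then flag realized returns that fall below the lower
# confidence bound as candidate long entries.
window = 250
mu = returns[:window].mean()
sigma = returns[:window].std()
lower_bound = mu - 2 * sigma  # ~95% lower bound under normality

entries = [i for i in range(window, len(returns)) if returns[i] < lower_bound]
print(f"{len(entries)} candidate long entries flagged")
```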

Post thumbnail picture taken from Bayesian Intelligence Slideshare presentation.

- Motivation
- Get Data
- Default Plot with Recession Shading
- Add Chart Titles, Axis Labels, Fancy Legend, Horizontal Line
- Format X and Y Axis Tick Labels
- Change Font and Add Data Markers
- Add Annotations
- Add Logo/Watermarks

Since I started this blog a few years ago, one of my obsessions has been creating good-looking, informative plots/charts. I've spent an inordinate amount of time learning how to do this, and it is still a work in progress. However, my work is not in vain, as several of you readers have commented and messaged me for the code behind some of my time series plots. Beginning with basic time series data, I will show you how I produce these charts.

Import packages

```
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set_style('white', {"xtick.major.size": 2, "ytick.major.size": 2})
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71","#f4cae4"]
sns.set_palette(sns.color_palette(flatui,7))
import missingno as msno
p=print
save_loc = '/YOUR/PROJECT/LOCATION/'
logo_loc = '/YOUR/WATERMARK/LOCATION/'
```

Get time series data from Yahoo finance and recession data from FRED.

```
# get index and fed data
f1 = 'USREC' # recession data from FRED
start = pd.to_datetime('1999-01-01')
end = pd.Timestamp.today()  # pd.datetime is deprecated
mkt = '^GSPC'
MKT = (web.DataReader([mkt, '^VIX'], 'yahoo', start, end)['Adj Close']
       .resample('MS') # month start b/c FED data is month start
       .mean()
       .rename(columns={mkt: 'SPX', '^VIX': 'VIX'})
       .assign(SPX_returns=lambda x: np.log(x['SPX'] / x['SPX'].shift(1)))
       .assign(VIX_returns=lambda x: np.log(x['VIX'] / x['VIX'].shift(1)))
       )
data = (web.DataReader([f1], 'fred', start, end)
        .join(MKT, how='outer')
        .dropna())
p(data.head())
p(data.info())
msno.matrix(data)
```

Now we have to set up our recession data so we can get the official begin and end dates for each recession over the period.

```
# recessions are marked as 1 in the data
recs = data.query('USREC==1')
# Select the two recessions over the time period
recs_2k = recs.loc['2001']   # .ix is deprecated; use .loc
recs_2k8 = recs.loc['2008':]
# now we can grab the indices for the start
# and end of each recession
recs2k_bgn = recs_2k.index[0]
recs2k_end = recs_2k.index[-1]
recs2k8_bgn = recs_2k8.index[0]
recs2k8_end = recs_2k8.index[-1]
```

Now we can plot the default chart with recession shading. Let's take a look.

```
# Let's plot SPX and VIX cumulative returns with recession overlay
plot_cols = ['SPX_returns', 'VIX_returns']
# 2 axes for 2 subplots
fig, axes = plt.subplots(2, 1, figsize=(10, 7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
for ax in axes:
    ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
```

The default plot is ok but we can do better. Let's add chart titles, axis labels, spruce up the legend, and add a horizontal line for 0.

```
fig, axes = plt.subplots(2, 1, figsize=(10, 7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
    ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    # lets add horizontal zero lines
    ax.axhline(0, color='k', linestyle='-', linewidth=1)
    # add titles
    ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
    # add axis labels
    ax.set_ylabel('Returns')
    ax.set_xlabel('Date')
    # add cool legend
    ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
# now to use tight layout
plt.tight_layout()
```

This is a step up but still not good enough. I prefer more informative dates on the x-axis, and percent formatting on the y-axis.

```
# better but I prefer more advanced axis tick labels
fig, axes = plt.subplots(2, 1, figsize=(12, 9), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
    ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    # lets add horizontal zero lines
    ax.axhline(0, color='k', linestyle='-', linewidth=1)
    # add titles
    ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
    # add axis labels
    ax.set_ylabel('Returns')
    ax.set_xlabel('Date')
    # upgrade axis tick labels
    yticks = ax.get_yticks()
    ax.set_yticklabels(['{:3.1f}%'.format(x*100) for x in yticks])
    # add cool legend
    ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m') for dtz in dates_rng], rotation=45)
# now to use tight layout
plt.tight_layout()
```

It's an improvement, but I hate Arial font, and would like to add data point markers.

```
# I want markers for the data points, and to change the font
mpl.rcParams['font.family'] = 'Ubuntu Mono'
fig, axes = plt.subplots(2, 1, figsize=(10, 7), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes, marker='o', ms=3)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
    ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    # lets add horizontal zero lines
    ax.axhline(0, color='k', linestyle='-', linewidth=1)
    # add titles
    ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray')
    # add axis labels
    ax.set_ylabel('Returns')
    ax.set_xlabel('Date')
    # upgrade axis tick labels
    yticks = ax.get_yticks()
    ax.set_yticklabels(['{:3.2f}%'.format(x*100) for x in yticks])
    # add cool legend
    ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m') for dtz in dates_rng], rotation=45)
# now to use tight layout
plt.tight_layout()
```

It's starting to look pretty good, but we can get even fancier. Say we wanted to annotate the global maximum and minimum returns in each subplot, along with their respective dates, for SPX and VIX. That could be a challenge. To do this we first need to extract the max/min values and their idxmax/idxmin dates for both series.

```
# I want to know show the global max and mins and their dates
# --------------------------------------------------------------- #
# MAX SPX Returns
spx_max_ = data[plot_cols[0]].max()
spx_max_idx_ = data[plot_cols[0]].idxmax(axis=0, skipna=True)
# MIN SPX Returns
spx_min_ = data[plot_cols[0]].min()
spx_min_idx_ = data[plot_cols[0]].idxmin(axis=0, skipna=True)
# MAX VIX Returns
vix_max_ = data[plot_cols[1]].max()
vix_max_idx_ = data[plot_cols[1]].idxmax(axis=0, skipna=True)
# MIN VIX Returns
vix_min_ = data[plot_cols[1]].min()
vix_min_idx_ = data[plot_cols[1]].idxmin(axis=0, skipna=True)
```

Now that we have this information we can get clever with the annotation tools Matplotlib provides. Also, I want to touch up some of the axis labels and axis tick labels as well.

```
mpl.rcParams['font.family'] = 'Ubuntu Mono'
fig, axes = plt.subplots(2, 1, figsize=(12, 9), sharex=True)
data[plot_cols].plot(subplots=True, ax=axes, marker='o', ms=3)
# for subplots we must add features by subplot axis
for ax, col in zip(axes, plot_cols):
    ax.axvspan(recs2k_bgn, recs2k_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    ax.axvspan(recs2k8_bgn, recs2k8_end, color=sns.xkcd_rgb['grey'], alpha=0.5)
    # lets add horizontal zero lines
    ax.axhline(0, color='k', linestyle='-', linewidth=1)
    # add titles
    ax.set_title('Monthly ' + col + ' \nRecessions Shaded Gray', fontsize=14, fontweight='demi')
    # add axis labels
    ax.set_ylabel('Returns', fontsize=12, fontweight='demi')
    ax.set_xlabel('Date', fontsize=12, fontweight='demi')
    # upgrade axis tick labels
    yticks = ax.get_yticks()
    ax.set_yticklabels(['{:3.1f}%'.format(x*100) for x in yticks])
    # bold up tick axes
    ax.tick_params(axis='both', which='major', labelsize=11)
    # add cool legend
    ax.legend(loc='upper left', fontsize=11, frameon=True).get_frame().set_edgecolor('blue')
dates_rng = pd.date_range(data.index[0], data.index[-1], freq='6M')
plt.xticks(dates_rng, [dtz.strftime('%Y-%m-%d') for dtz in dates_rng], rotation=45)
# add global max/min annotations
# add cool annotation box
bbox_props = dict(boxstyle="round4, pad=0.6", fc="cyan", ec="b", lw=.5)
axes[0].annotate('Global Max = {:.2%}\nDate = {}'
                 .format(spx_max_, spx_max_idx_.strftime('%a, %Y-%m-%d')),
                 fontsize=9,
                 fontweight='bold',
                 xy=(spx_max_idx_, spx_max_),
                 xycoords='data',
                 xytext=(-150, -30),
                 textcoords='offset points',
                 arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[0].annotate('Global Min = {:.2%}\nDate = {}'
                 .format(spx_min_, spx_min_idx_.strftime('%a, %Y-%m-%d')),
                 fontsize=9,
                 fontweight='demi',
                 xy=(spx_min_idx_, spx_min_),
                 xycoords='data',
                 xytext=(-150, 30),
                 textcoords='offset points',
                 arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[1].annotate('Global Max = {:.2%}\nDate = {}'
                 .format(vix_max_, vix_max_idx_.strftime('%a, %Y-%m-%d')),
                 fontsize=9,
                 fontweight='bold',
                 xy=(vix_max_idx_, vix_max_),
                 xycoords='data',
                 xytext=(-150, -30),
                 textcoords='offset points',
                 arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
axes[1].annotate('Global Min = {:.2%}\nDate = {}'
                 .format(vix_min_, vix_min_idx_.strftime('%a, %Y-%m-%d')),
                 fontsize=9,
                 fontweight='demi',
                 xy=(vix_min_idx_, vix_min_),
                 xycoords='data',
                 xytext=(-150, -20),
                 textcoords='offset points',
                 arrowprops=dict(arrowstyle="->"), bbox=bbox_props)
# now to use tight layout
plt.tight_layout()
```

Wow, now it's looking really good. But what if you wanted to insert branding via a watermark? That's simple: add the following lines of code before the **plt.tight_layout()** line and voila.

```
# add logo watermark
im = mpl.image.imread(logo_loc)
axes[0].figure.figimage(im, origin='upper', alpha=0.125, zorder=10)
```

- Strategy Summary
- References
- 4-Week Holding Period Strategy Update
- 1-Week Holding Period Strategy Update (Target Leverage=2)

This is a stylized implementation of the strategy described in the research paper "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor predicts individual equity returns up to 6 months out!

**ABSTRACT**

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns. Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis. This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?
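In code, the measure reduces to the implied volatility of out-of-the-money puts minus that of at-the-money calls. The sketch below uses illustrative moneyness bands and a toy option chain, not the paper's exact option-selection filters:

```python
# Simplified sketch: SKEW = implied vol of OTM puts minus implied vol of
# ATM calls. Moneyness bands and chain layout are illustrative assumptions.
def smirk_skew(chain):
    """chain: list of dicts with 'type', 'moneyness' (K/S), and 'iv'."""
    otm_puts = [o for o in chain
                if o['type'] == 'put' and 0.80 <= o['moneyness'] < 0.95]
    atm_calls = [o for o in chain
                 if o['type'] == 'call' and 0.95 <= o['moneyness'] <= 1.05]
    if not otm_puts or not atm_calls:
        return float('nan')
    put_iv = sum(o['iv'] for o in otm_puts) / len(otm_puts)
    call_iv = sum(o['iv'] for o in atm_calls) / len(atm_calls)
    return put_iv - call_iv

# hypothetical weekly implied vols for one underlying
chain = [
    {'type': 'put', 'moneyness': 0.90, 'iv': 0.32},
    {'type': 'call', 'moneyness': 1.00, 'iv': 0.25},
]
print(round(smirk_skew(chain), 2))  # 0.07 -> steep smirk
```

A larger value indicates a steeper smirk, which per the abstract predicts underperformance.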

My strategy differs in that I arbitrarily chose 1 and 4 week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.

- Zhang, Xiaoyan and Zhao, Rui and Xing, Yuhang, What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns? (August 14, 2008). AFA 2009 San Francisco Meetings Paper. Available at SSRN: http://ssrn.com/abstract=1107464 or http://dx.doi.org/10.2139/ssrn.1107464

*Results simulated using the Quantopian Platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

- Part-1 Recap
- Part-1 Error Corrections
- Part-2 Implementation Details, Deviations, Goals
- Prepare Data
- Setup PYMC3 Generalized Linear Models (GLM)
- Evaluate and Interpret Models
- Conclusions
- References

In part 1 we discussed the theoretical underpinnings of Ying Wu's (Stevens Institute of Technology) asset pricing model. The theory links the catalyst of systemic risk events to the funding difficulties of major financial intermediaries; thus crisis risk is linked to liquidity events. The model proposes a method to estimate a proxy index for systematic liquidity risk. We use an illiquidity metric calculated across a large group of stocks, then apply a tool called the Hill estimator to measure the average distance of extreme illiquidity events from the tail cutoff. We explored the high-level intuition behind the Hill estimator.

We created an implementation of the Hill estimator, aggregated the stock data, calculated the illiquidity metrics and the ELR index, and finally output the intermediate data into an HDF5 file for quick read/write access.

We did not get this far in part-1, but the paper asserts that we can use this index as an asset pricing component. This could also be thought of as the primary feature or independent variable in a simple linear regression (think CAPM). The target variable is the expected aggregate returns. From there the paper says we can create long-short portfolios by ranking the stocks according to their factor betas and sorting them into quantiles.

In part 1 the implementation of the Hill estimator was incorrect. The ELR index is supposed to be comprised of the values that exceed the 95th percentile. In the original implementation I calculated the average of *all* values, not just those in the tail region. Therefore the quick-and-dirty observations made previously are for a different index. On the left is the original incorrect index; on the right is the corrected index.

After spending some time rereading the research paper there is a subtle bit of additional complexity I have not included in this implementation that may affect the results I get vs those found in the paper.

In the paper the threshold value is calculated as the 95th percentile cross-sectionally for the entire month. Then the index is constructed by calculating the average log distance from that threshold for any datapoint located in the tail. Creating an index like this requires binning the data by month, getting that month's threshold value by aggregating the *daily illiquidity* metrics of a few thousand stocks, then calculating the average log distance between those tail values and the threshold.
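Based on my reading of that monthly procedure, a minimal sketch on toy data (assuming a wide layout: one column per stock, one row per day) might look like:

```python
import numpy as np
import pandas as pd

def monthly_elr(illiq):
    """illiq: DataFrame of daily illiquidity values, one column per stock
    (assumed layout). Sketch of the monthly index construction described above."""
    out = {}
    for month, block in illiq.groupby(illiq.index.to_period('M')):
        vals = block.values.ravel()
        vals = vals[np.isfinite(vals)]
        if vals.size == 0:
            continue
        # monthly cross-sectional 95th percentile threshold
        p_star = np.percentile(vals, 95)
        tail = vals[vals > p_star]
        if tail.size == 0:
            continue
        # average log distance of tail observations from the threshold
        out[month] = np.mean(np.log(tail / p_star))
    return pd.Series(out, name='ELR_monthly')

# toy data: 3 stocks, Jan-Feb 2016 daily observations
idx = pd.date_range('2016-01-01', periods=60, freq='D')
toy = pd.DataFrame(np.random.default_rng(1).lognormal(size=(60, 3)), index=idx)
print(monthly_elr(toy))
```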

This means we likely need a whole month of data before we can calculate the ELR value. We could potentially use a rolling 21- or 30-day window to simulate a monthly lookback, but based on the paper it does not seem that the author used this method. If, instead, we go by calendar months, we likely need *a lot* more data before we can draw any conclusions. For example, the author's sample period runs from 1968-2011 and only includes NYSE stocks, among other stock universe selection details.

In my exploration of the ELR index, I prefer to keep it simpler and calculate the 95th percentile threshold based on the cross-sectional *daily* illiquidity values instead of the whole month.

Part-2 Goals:

- Import the calculated daily illiquidity values
- Resample the illiquidity measures by week, taking the median and max illiquidity values, then calculate the ELR index
- Use pymc3's generalized linear models function to fit a model for predicting the cross-sectional scaled returns
- Interpret and evaluate the models

First we need to import packages and get our data ready.

```
# import packages
import sys
import os
# ------------------- % import datasets % ------------------- #
datasets = '/YOUR/DATASET/LOCATION/_Datasets/'
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import *
import numpy as np
import scipy.stats as scs
import matplotlib as mpl
import matplotlib.pyplot as plt
plt.style.use('bmh')
%matplotlib inline
import seaborn as sns
sns.set_style('white', {"xtick.major.size": 3, "ytick.major.size": 3})
import pymc3 as pm
from scipy import optimize
import time
from tqdm import tqdm
p = print
```

I created an HDF5 file for the aggregated returns because I want to use them as a proxy for our target variable of expected market returns. I import those here and create a time series consisting of the cross-sectional median and average log returns.

```
## read in data
# log returns for aggregate mkt proxy
LRET_FILE = datasets + 'LRET_Set_2016-11-22.h5'
lret_set = pd.read_hdf(LRET_FILE, 'RETURNS_DV_SET')
lret_set = lret_set.loc[:, lret_set.columns.to_series().str.contains('_lret').tolist()]
# calc median and mean cross sectional
mkt = pd.DataFrame({'cross_mdn_rets': lret_set.median(axis='columns'),
                    'cross_avg_rets': lret_set.mean(axis='columns')},
                   index=lret_set.index)
## read in illiquidity data for ELR calculations
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq = pd.read_hdf(ILQ_FILE, 'Illiquidity_Set')
```

After loading the data into our environment, we resample the data to a weekly frequency using both median and max values for comparison. On my outdated laptop this took approximately 7 minutes.

```
# weekly resample
freq = '1W'
df = ilq.resample(freq).median()
df_max = ilq.resample(freq).max()
```

Next we define our convenience functions for calculating our ELR index. Notice that I deviate from the traditional z-score scaling method and implement the Gelman scaler which divides the centered values by 2 times the standard deviation. You can read more details from Andrew Gelman's paper[2] about why we use this method. The high-level intuition is that this scale improves regression coefficient interpretability across binary, discrete, and continuous variables.

```
# convenience functions for gamma calculation and scaler
# gamma estimate
def _ext_lq_risk(series):
    # threshold is 95th percentile
    p_star = np.nanpercentile(series, 95)
    illiq = series[series > p_star]
    #illiq = series # looks better on chart but less explanatory power
    lg_illiq = np.log(illiq / p_star)
    lg_illiq = lg_illiq[np.isfinite(lg_illiq)]
    try:
        gamma = 1. / ((1. / len(lg_illiq)) * sum(lg_illiq))
    except ZeroDivisionError:
        gamma = np.nan
    return gamma

# scaler function
gelman_scaler = lambda ser: (ser - ser.mean()) / (2 * ser.std())

# calculate elr index
def _calculate_elr(df, cutoff=100, scaler=None):
    gs = {} # gammas dictionary
    nan_dates = []
    for d in df.index:
        # we want at least N nonnull values
        if df.loc[d].notnull().sum() > cutoff:
            gamma = _ext_lq_risk(df.loc[d])
            gs[d] = gamma
        else:
            nan_dates.append(d)
    gdf = pd.DataFrame.from_dict(gs, orient='index').sort_index()
    gdfz = scaler(gdf)
    gdfz.columns = ['ELR']
    return gdfz, nan_dates
```

Now we can set up our main experimental dataframe. We need to make sure our market proxy dataframe, which consists of the aggregate sample returns, has the same index as our ELR dataframe before we merge them. Also, remember we are going to experiment with two resampled dataframes: one with the weekly median illiquidity, and one with the weekly maximum illiquidity. Our final step after creating our merged dataframes is to add a column for our Gelman-scaled aggregate returns.

```
# calculate ELR index on resampled data
gdfz_mdn, _ = _calculate_elr(df, scaler=gelman_scaler)
gdfz_max, _ = _calculate_elr(df_max, scaler=gelman_scaler)
# market resample must match gdfz before merge
# merge dataframes
mkt_rs = mkt.resample(freq).mean()
mrg_mdn = pd.concat([gdfz_mdn, mkt_rs], join='inner', axis=1)
mrg_max = pd.concat([gdfz_max, mkt_rs], join='inner', axis=1)
# add cross sectional average Gelman scored returns
avg_col = 'cross_avg_rets'
mrg_mdn['cross_avg_zrets'] = gelman_scaler(mrg_mdn[avg_col])
mrg_max['cross_avg_zrets'] = gelman_scaler(mrg_max[avg_col])
mrg_mdn.head()
```

Before running our model I define some output convenience functions adapted from the excellent blog Applied AI[3].

```
# pymc3 convenience functions adapted from blog.applied.ai
def trace_median(x):
    return pd.Series(np.median(x, 0), name='median')

def plot_traces(trcs, retain=1000, varnames=None):
    '''Convenience fn: plot traces with overlaid means and values'''
    nrows = len(trcs.varnames)
    if varnames is not None:
        nrows = len(varnames)
    ax = pm.traceplot(trcs[-retain:], varnames=varnames, figsize=(12, nrows*1.4),
                      lines={k: v['mean'] for k, v in
                             pm.df_summary(trcs[-retain:], varnames=varnames).iterrows()})
    for i, mn in enumerate(pm.df_summary(trcs[-retain:], varnames=varnames)['mean']):
        ax[i, 0].annotate('{:.2f}'.format(mn), xy=(mn, 0), xycoords='data',
                          xytext=(5, 10), textcoords='offset points', rotation=90,
                          va='bottom', fontsize='large', color='#AA0022')

def plot_pm_acf(trace, varnames=None, burn=None):
    pm.autocorrplot(trace, varnames=varnames, burn=burn, figsize=(7, 5))
    return
```

Now we can set up our model. I will gloss over some of the particulars of pymc3 and the *Generalized Linear Model (GLM)* functions for now. I'm also skipping over why I'm using a Bayesian methodology vs. a frequentist one. Generally speaking, Bayesian modeling is preferred due to its robustness and explicit modeling of the uncertainty in our point estimates. I plan to revisit this topic in more detail in the future, but there are plenty of tutorials and explanations of why Bayesian is the way to go.

Anyone familiar with R will appreciate the following simplicity of model setup. First we need to define our model formula as a string.

```
# predicting cross sectional average returns using the ELR index
ft_endog = 'cross_avg_zrets'
ft_exog = ['ELR'] # this format allows easy addition of more variables
fml = '{} ~ '.format(ft_endog) + ' + '.join(ft_exog)
p(fml)
# 'cross_avg_zrets ~ ELR'
```

Next we follow pymc3's glm model convention and choose the number of samples we wish to draw from the predicted posterior.

```
# choose samples and run model
samples = 5000
with pm.Model() as mdl:
    ## Use GLM submodule for simplified model specification
    ## Betas are Normal (as per default settings, for Ridge)
    ## Likelihood is Normal (with HalfCauchy for error prior)
    pm.glm.glm(fml, mrg_mdn, family=pm.glm.families.Normal())
    start_MAP = pm.find_MAP(fmin=optimize.fmin_powell)
    ## take samples using NUTS sampler
    trc_ols = pm.sample(samples, start=start_MAP, step=pm.NUTS())

rvs = [rv.name for rv in mdl.unobserved_RVs]
rvs.remove('sd_log_')
plot_traces(trc_ols, varnames=rvs)
plot_pm_acf(trc_ols, varnames=rvs, burn=1000)
p(pm.df_summary(trc_ols[-1000:], varnames=rvs))
p('\nMedian Illiquidity ELR Model\nDIC:', pm.dic(trc_ols[-1000:], model=mdl))
```

We run the model for both the median and max illiquidity estimates.

First we need to decide how we will evaluate which model is best. For this I have chosen the Deviance Information Criterion (DIC), which is implemented in pymc3 and designed specifically for Bayesian modeling using MCMC. Like similar alternative measures, the smaller the number, the better the model.

First we evaluate the resampled median illiquidity model.

median model trace

On the left we can examine the distribution of our sample estimates for the intercept, ELR, and model error. On the right we can see the sample trace, which should look like white noise, and it does. We can see the intercept is basically zero, the ELR beta is -0.06, and the standard deviation is 0.5.

median acf

We plot the ACF of our variables to confirm that the sample traces are white noise. However, we can see a strong negative autocorrelation for each variable at its first lag.

median model summary and dic

We can see that both the ELR and sd are significant, as their highest posterior density intervals do NOT include zero. The DIC is 1531. Now let's compare the median model to the max model.

elr max model trace

We observe that the models are similar in their output; however, notice that the ELR in this instance has a stronger negative correlation with our target variable than in the median model. The traces on the right side appear to resemble white noise. Let's confirm by looking at the ACF plot.

elr max model acf

This confirms our intuition that the series is close to white noise. We can also see a pretty strong negative autocorrelation at lag 1 for each of our variables. This is not ideal, but acceptable for our exploratory purposes.

max model hpd and dic

We can see that the ELR and sd are both significant, as neither interval includes zero. The magnitude of the ELR coefficient is larger in the max model, which corresponds to a lower (better) DIC.

We designed an experiment to evaluate the relationship between the ELR index and the cross sectional scaled returns. We deviated from the original paper in a couple notable ways. We used the daily illiquidity measure and resampled to a weekly frequency using both the weekly median, and the weekly max. We then calculated the ELR index using the weekly cross sectional data as opposed to the highly nuanced monthly methodology used in the paper.

We then designed a basic linear model using pymc3 to explore the ELR index's impact on the scaled cross sectional returns. After examining the results I am somewhat disappointed we weren't able to show as strong a link as demonstrated in the paper. The max model is clearly the better model according to the DIC, but even then we can see the ELR index is only weakly related to the cross sectional returns.

A positive takeaway is that the *sign* of the relationship is what we would expect: the ELR we calculated is negatively correlated with the cross section of returns.

As currently constructed, using this method to form the basis of an asset pricing model seems dubious at best and definitely lowers my expectations when I simulate a long-short strategy in Quantopian.

- Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016). Available at SSRN: https://ssrn.com/abstract=2850278 or http://dx.doi.org/10.2139/ssrn.2850278
- Gelman, A. (2008), Scaling regression inputs by dividing by two standard deviations. Statist. Med., 27: 2865–2873. doi:10.1002/sim.3107
- Sedar, Jonathan. "Bayesian Inference with PyMC3 - Part 2." *The Sampler*. Applied AI, 06 Sept. 2016. Web. 13 Dec. 2016.

**Strategy Summary****References****4-Week Holding Period Strategy Update****1-Week Holding Period Strategy Updated (Target Leverage=2)**

This is a stylized implementation of the strategy described in the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor predicts individual equity returns up to 6 months!

**ABSTRACT**

The shape of the volatility smirk has significant cross-sectional predictive power for future equity returns. Stocks exhibiting the steepest smirks in their traded options underperform stocks with the least pronounced volatility smirks in their options by around 10.9% per year on a risk-adjusted basis. This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?

My strategy differs in that I arbitrarily chose 1- and 4-week holding periods to study. Additionally this strategy only analyzes a cross-section of ETFs instead of individual stocks. I chose ETFs because liquidity and data quality concerns are minimized. Here are the selected ETFs under analysis.

- Zhang, Xiaoyan and Zhao, Rui and Xing, Yuhang. "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" (August 14, 2008). AFA 2009 San Francisco Meetings Paper. Available at SSRN: http://ssrn.com/abstract=1107464 or http://dx.doi.org/10.2139/ssrn.1107464

*Results simulated using the Quantopian Platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

- Introduction
- Get Data
- Calculate Cross-Sectional Extreme Liquidity Risk
- Quick and Dirty Observations
- Next Steps
- References

One of the primary goals of quantitative investing is effectively managing tail risk. Failure to do so can result in crushing drawdowns or a total blowup of your fund/portfolio. Commonly known tools for estimating tail risk, e.g. Value-at-Risk, often underestimate the likelihood and magnitude of risk-off events. Furthermore, tail risk events are increasingly associated with liquidity events.
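As a toy illustration of that underestimation (my own sketch, not from any paper cited here), compare a Gaussian VaR estimate against the empirical tail quantile of a simulated heavy-tailed return series:

```python
import numpy as np
import scipy.stats as scs

# hypothetical heavy-tailed daily 'returns': Student-t(3), scaled to ~1% vol
np.random.seed(5)
r = scs.t.rvs(df=3, size=100000) * 0.01 / np.sqrt(3.0)  # t(3) has variance 3

# Gaussian VaR assumes normality with the sample mean/std
var_gauss = scs.norm.ppf(0.001, loc=r.mean(), scale=r.std())
# empirical VaR reads the 0.1% quantile straight off the sample
var_emp = np.percentile(r, 0.1)

print('0.1% Gaussian VaR:  {:.4f}'.format(var_gauss))
print('0.1% empirical VaR: {:.4f}'.format(var_emp))
# the empirical loss quantile is noticeably deeper than the Gaussian one
```

The Gaussian estimate materially understates the loss at the 0.1% level whenever the true return distribution has fat tails, which is the point the paragraph above is making.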

Theory links the catalyst of systemic risk events to the funding difficulties of major financial intermediaries. For example, an unexpected default by a major institution would lead to that firm's counterparties reducing risk while they assess the fallout. Those counterparties are likely to reduce risk by selling assets and/or withdrawing funding resources from the market. This could lead to margin calls and more selling as the default works its way across the financial network, cascading into a negative feedback loop.

A good theoretical risk model will address the relationship between liquidity and tail risk. Ying Wu of Stevens Institute of Technology - School of Business may have discovered a framework that links these two concepts in a parsimonious and practical manner. His paper 'Asset Pricing with Extreme Liquidity Risk'[1] combines Amihud's[2] stock illiquidity metric with the Hill estimator for modeling tail distributions. He then constructs a normalized Extreme Liquidity Risk (ELR) metric and runs a simple linear regression for each stock to assess its sensitivity to the ELR. He finds that a long-short portfolio that buys the stocks most sensitive to ELR and shorts those least sensitive earns an empirically and economically significant return over the time period studied.
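To make that sorting step concrete, here is a minimal sketch of my own on synthetic data (not the paper's CRSP universe, and a deliberately simplified stand-in for its methodology): estimate each stock's ELR sensitivity as an OLS slope, then sort into long and short legs.

```python
import numpy as np
import pandas as pd

# synthetic illustration only: fake returns that load on a fake ELR series
np.random.seed(0)
T, N = 250, 20
elr = np.random.normal(size=T)
true_betas = np.linspace(-0.5, 0.5, N)           # each stock's true sensitivity
rets = pd.DataFrame(np.outer(elr, true_betas)
                    + np.random.normal(scale=0.5, size=(T, N)))

# estimate each stock's ELR sensitivity as a simple OLS slope
betas = rets.apply(lambda col: np.polyfit(elr, col, 1)[0])

# paper-style sort: long the most ELR-sensitive names, short the least
long_leg = betas.nlargest(5).index
short_leg = betas.nsmallest(5).index
print('long:', list(long_leg), '| short:', list(short_leg))
```

With enough observations the estimated slopes line up with the true sensitivities, which is all the sort requires.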

The Amihud stock illiquidity metric is a stock's daily absolute return divided by its dollar volume, averaged over some time period. It was constructed for use as a rough measure of price impact and designed to be easily calculated for long time series.
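In code, the daily Amihud measure for a single stock is a one-liner (the numbers below are hypothetical, just to show the calculation):

```python
import pandas as pd

# toy price/volume history for one stock (hypothetical values)
prices = pd.Series([10.0, 10.2, 10.1, 10.4, 10.3])
volume = pd.Series([1.0e6, 1.2e6, 0.8e6, 1.5e6, 1.1e6])

ret = prices.pct_change().dropna()               # daily simple returns
dollar_vol = (prices * volume).iloc[1:] / 1e6    # dollar volume in $ millions
illiq = (ret.abs() / dollar_vol).mean()          # Amihud illiquidity, averaged
print('Amihud illiquidity: {:.6f}'.format(illiq))
```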

The Hill estimator[3] is a mathematical tool that allows us to focus on the tail of a sample distribution. It lets us "skip" trying to fit a single distribution over the entire sample; instead we can use the formal framework of Extreme Value Theory to evaluate only the *extreme (tail)* values. Wu's choice of this estimator is motivated by empirical evidence of power law behavior in the tails of the price-impact series. This further supports the use of Amihud's illiquidity metric, as it was designed to be a crude yet effective measure of price impact.
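As a sanity check on the Hill estimator itself (my own illustration, not from the paper), we can confirm it recovers a known tail index from simulated Pareto data, using the same 95th-percentile threshold convention used later in this post:

```python
import numpy as np

# Pareto-distributed sample with a known tail index
np.random.seed(9)
alpha_true = 2.5
x = np.random.pareto(alpha_true, size=200000) + 1.0   # Pareto(alpha), x_min = 1

p_star = np.percentile(x, 95)            # threshold: 95th percentile
tail = x[x > p_star]
# Hill estimate: reciprocal of the mean log-excess over the threshold
hill = 1.0 / np.mean(np.log(tail / p_star))
print('true tail index: {} | Hill estimate: {:.2f}'.format(alpha_true, hill))
```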

I urge readers to explore the paper further as some of the deeper mathematical underpinnings are beyond the scope of this post.

For this exploratory study I used the pandas-datareader Yahoo Finance API to download 20 years of stock data using a symbol list constructed by CRSP.

```
# Import
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
import numpy as np
import scipy.stats as scs
import matplotlib.pyplot as plt
# get symbols
datasets = '/YOUR/DATASETS/LOCATION/_Datasets/'
symbols = pd.read_csv(datasets+'CSRP_symbol_list.txt',sep='\t').values.flatten()
```

Here is the text file of symbols I used --> Symbols.

Next we construct our convenience functions to aggregate the stock data.

```
# Get Prices Function
def _get_px(symbol, start, end):
    return web.DataReader(symbol, 'yahoo', start, end)

# Create HDF5 data store for fast read/write
def _create_symbol_datastore(symbols, start, end):
    prices_hdf = pd.HDFStore(datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf')
    symbol_count = len(symbols)
    N = symbol_count
    missing_symbols = []
    for i, sym in enumerate(symbols, start=1):
        if not pd.isnull(sym):
            try:
                prices_hdf[sym] = _get_px(sym, start, end)
            except Exception as e:
                print(e, sym)
                missing_symbols.append(sym)
                N -= 1
            pct_total_left = (N / symbol_count)
            print('{}..[done] | {} of {} symbols collected | {:>.2%}'.format(
                sym, i, symbol_count, pct_total_left))
    prices_hdf.close()
    print(prices_hdf)
    return missing_symbols

# Get past 20 years of data from today
# Evaluate missing symbols if you so choose
today = pd.datetime.today().date()
start = today - 252 * BDay() * 20
missing = _create_symbol_datastore(symbols, start, today)
```

This takes roughly 30 minutes to run, which is a good time for a coffee break.

Next we need to calculate each stock's daily illiquidity measure according to Amihud. I also save this data to its own HDF5 store. I find it good practice to save intermediate calculations where possible for reference and ease of reproducibility.

```
# calculate each symbol's returns, dollar volumes and illiquidity
# add to dataframe with symbol_lret, symbol_dv, symbol_illiq
from tqdm import tqdm

FILE = datasets + 'CRSP_Symbol_Data_Yahoo_20y.hdf'
start = pd.to_datetime('1999-01-01')
end = pd.to_datetime('2016-11-22')
idx = pd.bdate_range(start, end)
DF = pd.DataFrame(index=idx)

# gather the symbol keys stored in the HDF5 file
with pd.HDFStore(FILE, mode='r') as store:
    keys = store.keys()

for sym in tqdm(keys):
    tmp_hdf = pd.read_hdf(FILE, mode='r', key=sym)
    tmp_hdf = tmp_hdf[['Volume', 'Adj Close']]
    # I want at least 1000 daily datapoints per stock
    if len(tmp_hdf) > 1000:
        try:
            dv = (tmp_hdf['Adj Close'] * tmp_hdf['Volume'] / 1e6)[1:]
            lret = np.log(tmp_hdf['Adj Close'] / tmp_hdf['Adj Close'].shift(1)).dropna()
            daily_illiq = np.abs(lret) / dv
            tmp_df = pd.DataFrame({sym.lstrip('/')+'_lret': lret,
                                   sym.lstrip('/')+'_dv': dv,
                                   sym.lstrip('/')+'_illiq': daily_illiq},
                                  index=lret.index)
            DF = DF.join(tmp_df, how='outer')
        except Exception:
            continue
print(DF.info())

# Illiquidity HDF originally run on 2016-Nov-22
# DataFrame key is "Illiquidity_Set"
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq_set = DF.loc[:, DF.columns.to_series().str.contains('_illiq').tolist()]
ilq_set.to_hdf(ILQ_FILE, 'Illiquidity_Set')
```

8487 * 4954 = 42,044,598 data points! Some of these are np.nan but still, clearly CSV storage is a non-starter.

Now we are in a position to calculate the Extreme Liquidity Risk metric (ELR) or "Tail Index" for the aggregated stocks. First we read in our 'Illiquidity_Set' dataframe from the HDF5 file. Then we create a convenience function to calculate the daily tail index. But first, let's take a quick glance at the ELR formula:

Wu, Ying, Asset Pricing with Extreme Liquidity Risk (October 10, 2016)

My understanding is that this is a log average of the relative "distance" between the aggregated stocks' illiquidity measures and the threshold p-star. p-star is the line in the sand between the distribution "body" and the distribution "tail". The paper uses the 95th percentile as the threshold value, so I use that here as well.

```
# Read hdf illiquidity
ILQ_FILE = datasets + 'Illiquidity_Set_2016-11-22.h5'
ilq = pd.read_hdf(ILQ_FILE, 'Illiquidity_Set')

# function to get daily values for gamma calc
def _ext_lq_risk(series):
    # UPDATED: DEC 5TH
    # threshold is 95th percentile
    # right-tailed convention
    p_star = np.nanpercentile(series, 95)
    illiq = series[series > p_star]
    lg_illiq = np.log(illiq / p_star)
    lg_illiq = lg_illiq[np.isfinite(lg_illiq)]
    try:
        gamma = 1. / ((1. / len(lg_illiq)) * sum(lg_illiq))
    except ZeroDivisionError:
        gamma = np.nan
    return gamma
```

Now we can calculate the Tail Index and normalize the values to get the ELR series.

```
# Calculate Tail Index for all dates greater than cutoff
df = ilq.copy()
gs = {}  # gammas dictionary
cutoff = 100
nan_dates = []
for d in df.index:
    # we want at least N nonnull values
    if df.loc[d].notnull().sum() > cutoff:
        gamma = _ext_lq_risk(df.loc[d])
        gs[d] = gamma
    else:
        nan_dates.append(d)

gdf = pd.DataFrame.from_dict(gs, orient='index').sort_index()
gdf.columns = ['Tail_Index']

# the ELR metric is a normalized version of the tail index
# normalize gamma dataframe to calc "ELR"
gdfz = (gdf - gdf.mean()) / gdf.std()
gdfz.columns = ['ELR']
```

Let's plot it and take a look.

Blackarbs LLC

Next, another plot. I skip the code here to save space, but would be happy to post it if requested. The plot below shows IWM used as a market proxy, its drawdown chart, and below that the ELR. The shaded regions are official NBER recessions.

Blackarbs llc

The ELR appears to rise prior to the official beginning of the Dot-Com bust. It stays relatively elevated throughout the period and begins to decline sometime during the first persistent rally off the lows. Prior to the beginning of 2008's official recession, the ELR is mixed. However, the ELR rises sharply sometime prior to the massive decline in the broad market. In fact it was rising during a period where the market bounced, providing an early warning of the cataclysmic dropoff to come. Furthermore it begins declining shortly after the official NBER recession end date, providing investors with support for getting back into the market. Interestingly the ELR is in a downtrend for most of the low-volatility period that followed the recession. Clearly the metric is not a perfect predictor, but there seems to be evidence that it could be a useful tool, and certainly warrants more rigorous investigation.

There are several directions to pursue regarding the Extreme Liquidity Risk index. We can explore the time series itself using Time Series Analysis (TSA), via either frequentist or Bayesian inference. Or we can get straight to the good stuff and simulate the long-short portfolio based on each stock's return sensitivity to the ELR, as reported in the paper that inspired this post. Check back for part 2, as we explore this concept further.

- Wu, Ying. "Asset Pricing with Extreme Liquidity Risk" (October 10, 2016). Available at SSRN: https://ssrn.com/abstract=2850278 or http://dx.doi.org/10.2139/ssrn.2850278
- Amihud, Yakov. "Illiquidity and Stock Returns: Cross-section and Time-series Effects." *Journal of Financial Markets* 5.1 (2002): 31-56. Web.
- "Heavy-tailed Distribution." Wikipedia. Wikimedia Foundation, n.d. Web. 29 Nov. 2016.


- Motivation
- The Basics
- Stationarity
- Serial Correlation (Autocorrelation)
- Why do we care about Serial Correlation?
- White Noise and Random Walks
- Linear Models
- Log-Linear Models
- Autoregressive Models - AR(p)
- Moving Average Models - MA(q)
- Autoregressive Moving Average Models - ARMA(p, q)
- Autoregressive Integrated Moving Average Models - ARIMA(p, d, q)
- Autoregressive Conditionally Heteroskedastic Models - ARCH(p)
- Generalized Autoregressive Conditionally Heteroskedastic Models - GARCH(p, q)
- References

Early in my quant finance journey, I learned various time series analysis techniques and how to use them, but I failed to develop a deeper understanding of how the pieces fit together. I struggled to see the bigger picture of why we use certain models versus others, or how these models build on each other's weaknesses. The underlying purpose for employing these techniques eluded me for too long. That is, until I came to understand this:

By developing our time series analysis (TSA) skillset we are better able to understand what has already happened, *and* make better, more profitable, predictions of the future. Example applications include predicting future asset returns, future correlations/covariances, and future volatility.

This post is inspired by the great work Michael Halls-Moore has done on his blog, QuantStart, especially his series on TSA. I thought translating some of his work to Python could help others who are less familiar with R. I have also adapted code from other bloggers as well. See References.

Before we begin let's import our Python libraries.

```
import os
import sys
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import statsmodels.formula.api as smf
import statsmodels.tsa.api as smt
import statsmodels.api as sm
import scipy.stats as scs
from arch import arch_model
import matplotlib.pyplot as plt
import matplotlib as mpl
%matplotlib inline
p = print
p('Machine: {} {}\n'.format(os.uname().sysname,os.uname().machine))
p(sys.version)
# Machine: Linux x86_64
# 3.5.2 |Anaconda custom (64-bit)| (default, Jul 2 2016, 17:53:06)
# [GCC 4.4.7 20120313 (Red Hat 4.4.7-1)]
```

Let's use the **pandas_datareader** package to grab some sample data using the Yahoo Finance API.

```
end = '2015-01-01'
start = '2007-01-01'
get_px = lambda x: web.DataReader(x, 'yahoo', start=start, end=end)['Adj Close']
symbols = ['SPY','TLT','MSFT']
# raw adjusted close prices
data = pd.DataFrame({sym:get_px(sym) for sym in symbols})
# log returns
lrets = np.log(data/data.shift(1)).dropna()
```

A time series is a series of data points indexed (or listed or graphed) in time order. - Wikipedia

Here I use an infographic found on SeanAbu.com. I find the pictures very intuitive.

Seanabu.com

*So what? Why do we care about stationarity?*

- A stationary time series (TS) is simple to predict as we can assume that future statistical properties are the same or proportional to current statistical properties.
- Most of the models we use in TSA assume **covariance-stationarity (#3 above)**. This means **the descriptive statistics these models predict, e.g. means, variances, and correlations, are only reliable if the TS is stationary and invalid otherwise.**

"For example, if the series is consistently increasing over time, the sample mean and variance will grow with the size of the sample, and they will always underestimate the mean and variance in future periods. And if the mean and variance of a series are not well-defined, then neither are its correlations with other variables."- http://people.duke.edu/~rnau/411diff.htm

With that said, **most TS we encounter in finance are NOT stationary.** Therefore a large part of TSA involves identifying whether the series we want to predict is stationary, and if it is not, finding ways to transform it such that it is stationary. (More on that later.)

Essentially when we model a time series we decompose the series into three components: trend, seasonal/cyclical, and random. The random component is called the residual or error. It is simply the difference between our predicted value(s) and the observed value(s). Serial correlation is when the residuals (errors) of our TS models are correlated with each other.

We care about serial correlation because it is critical to the validity of our model predictions and is intrinsically related to stationarity. A well-specified model should leave residuals that are serially uncorrelated, like white noise. If we fail to account for residual autocorrelation, the standard errors of our coefficients are underestimated, inflating the size of our t-statistics. The result is too many Type 1 errors, where we reject our null hypothesis even when it is true! **In layman's terms, ignoring autocorrelation means our model predictions will be bunk, and we're likely to draw incorrect conclusions about the impact of the independent variables in our model.**

White noise is the first Time Series Model (TSM) we need to understand. By definition a white noise process has serially UNcorrelated errors and an expected error mean of zero. A stronger condition is that the errors are independent and identically distributed (i.i.d.). This is important because, if our TSM is appropriate and successful at capturing the underlying process, its residuals will resemble a white noise process. Therefore part of TSA is literally trying to fit a model to the time series such that the residual series is indistinguishable from white noise.

Let's simulate a white noise process and view it. Below I introduce a convenience function for plotting the time series and analyzing the serial correlation visually. This code was adapted from the blog SeanAbu.com.

```
def tsplot(y, lags=None, figsize=(10, 8), style='bmh'):
    if not isinstance(y, pd.Series):
        y = pd.Series(y)
    with plt.style.context(style):
        fig = plt.figure(figsize=figsize)
        # mpl.rcParams['font.family'] = 'Ubuntu Mono'
        layout = (3, 2)
        ts_ax = plt.subplot2grid(layout, (0, 0), colspan=2)
        acf_ax = plt.subplot2grid(layout, (1, 0))
        pacf_ax = plt.subplot2grid(layout, (1, 1))
        qq_ax = plt.subplot2grid(layout, (2, 0))
        pp_ax = plt.subplot2grid(layout, (2, 1))

        y.plot(ax=ts_ax)
        ts_ax.set_title('Time Series Analysis Plots')
        smt.graphics.plot_acf(y, lags=lags, ax=acf_ax, alpha=0.5)
        smt.graphics.plot_pacf(y, lags=lags, ax=pacf_ax, alpha=0.5)
        sm.qqplot(y, line='s', ax=qq_ax)
        qq_ax.set_title('QQ Plot')
        scs.probplot(y, sparams=(y.mean(), y.std()), plot=pp_ax)
        plt.tight_layout()
    return
```

We can model a white noise process easily and output the TS plot for visual inspection.

```
np.random.seed(1)
# plot of discrete white noise
randser = np.random.normal(size=1000)
tsplot(randser, lags=30)
```

Gaussian white noise

We can see the process appears to be random and centered about zero. The autocorrelation (ACF) and partial autocorrelation (PACF) plots also indicate no significant serial correlation. Keep in mind we should see approximately 5% significance in the autocorrelation plots due to pure chance, as a result of sampling from the Normal distribution. Below that we can see the QQ and Probability Plots, which compare the distribution of our data with a theoretical distribution. In this case, that theoretical distribution is the standard normal distribution. Clearly our data is distributed randomly and appears to follow Gaussian (Normal) white noise, as it should.
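We can verify that roughly-5% claim by brute force (a quick sketch of my own): sample many white noise series and count how often a single lag's sample autocorrelation falls outside the approximate 95% band of +/-1.96/sqrt(n).

```python
import numpy as np

np.random.seed(7)
n, trials, lag = 1000, 2000, 5
band = 1.96 / np.sqrt(n)   # approximate 95% band for a sample autocorrelation

hits = 0
for _ in range(trials):
    w = np.random.normal(size=n)
    r = np.corrcoef(w[:-lag], w[lag:])[0, 1]   # sample autocorrelation at `lag`
    hits += abs(r) > band

print('fraction outside the band: {:.3f}'.format(hits / trials))
```

The fraction lands near 0.05, so a handful of "significant" lags in an ACF plot of pure noise is expected, not evidence of structure.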

```
p("Random Series\n -------------\nmean: {:.3f}\nvariance: {:.3f}\nstandard deviation: {:.3f}"
.format(randser.mean(), randser.var(), randser.std()))
# Random Series
# -------------
# mean: 0.039
# variance: 0.962
# standard deviation: 0.981
```

A random walk is defined below; each observation equals the previous observation plus a white noise term: **xt = xt-1 + wt**.

- Michael Halls-Moore [QuantStart.com]

The significance of a random walk is that it is **non-stationary** because the covariance between observations is time-dependent. If the TS we are modeling is a random walk it is unpredictable.
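We can see that time-dependence directly (my own quick check): simulate many independent walks and compare the cross-sectional variance at two points in time. For a driftless random walk, Var(xt) grows roughly linearly with t.

```python
import numpy as np

# 5000 independent random walks of length 200
np.random.seed(4)
walks = np.cumsum(np.random.normal(size=(5000, 200)), axis=1)

# variance across walks at two points in time
print('variance at t=50:  {:6.1f}'.format(walks[:, 49].var()))
print('variance at t=200: {:6.1f}'.format(walks[:, 199].var()))
# for a driftless random walk Var(x_t) ~ t, so the second is ~4x the first
```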

Let's simulate a random walk using the "numpy.random.normal(size=our_sample_size)" function to sample from the standard normal distribution.

```
# Random Walk without a drift
np.random.seed(1)
n_samples = 1000
x = w = np.random.normal(size=n_samples)
for t in range(n_samples):
    x[t] = x[t-1] + w[t]

_ = tsplot(x, lags=30)
```

Random walk without a drift

Clearly our TS is not stationary. Let's find out if the random walk model is a good fit for our simulated data. Recall that a random walk is **xt = xt-1 + wt**. Using algebra we can say that **xt - xt-1 = wt**. Thus the first difference of a random walk series should equal a white noise process! We can use the "np.diff()" function on our TS and see if this holds.

```
# First difference of simulated Random Walk series
_ = tsplot(np.diff(x), lags=30)
```

first difference of a random walk series

Our definition holds, as this looks exactly like a white noise process. What if we take the first difference of SPY's prices?

```
# First difference of SPY prices
_ = tsplot(np.diff(data.SPY), lags=30)
```

fitting a random walk model to SPY ETF prices

Wow, it's quite similar to white noise. However, notice the shape of the QQ and Probability plots. This indicates that the process is close to normality but with **'heavy tails'**. There also appears to be some significant serial correlation in the ACF and PACF plots around lags 1, 5?, 16?, 18 and 21. This means that there should be better models to describe the actual price change process.

Linear models, aka trend models, represent a TS that can be graphed using a straight line. The basic equation is **yt = b0 + b1t + wt**:

In this model the value of the dependent variable is determined by the beta coefficients and a single independent variable, *time*. An example could be a company's sales that increase by the same amount at each time step. Let's look at a contrived example below. In this simulation we assume Firm ABC's baseline sales are -$50.00 (*beta 0, the intercept term*) and that sales increase by +$25.00 (*beta 1*) at every time step.

```
# simulate linear trend
# example Firm ABC sales are -$50 by default and +$25 at every time step
w = np.random.randn(100)
y = np.empty_like(w)
b0 = -50.
b1 = 25.
for t in range(len(w)):
    y[t] = b0 + b1*t + w[t]

_ = tsplot(y, lags=30)
```

Linear trend model simulation

Here we can see that the series' autocorrelations are significant and decrease linearly as a function of the lag. The distribution is approximately normal. Before using this model to make predictions we would have to account for and remove the obvious autocorrelation present in the series. The significance of the PACF at lag 1 indicates that an *autoregressive* model may be appropriate.

These models are similar to linear models, except that the data points form an exponential function representing a constant *rate* of change with respect to each time step. For example, firm ABC's sales increasing X% at each time step. When plotting the simulated sales data you get a curve that looks like this:

```
# Simulate ABC exponential growth
# fake dates
idx = pd.date_range('2007-01-01', '2012-01-01', freq='M')
# fake sales increasing at exponential rate
sales = [np.exp(x/12) for x in range(1, len(idx)+1)]
# create dataframe and plot
df = pd.DataFrame(sales, columns=['Sales'], index=idx)
with plt.style.context('bmh'):
    df.plot()
    plt.title('ABC Sales')
```

simulated exponential function

We can then transform the data by taking the natural logarithm of sales. Now a linear regression is a much better fit to the data.

```
# ABC log sales
with plt.style.context('bmh'):
    pd.Series(np.log(sales), index=idx).plot()
    plt.title('ABC Log Sales')
```

Natural logarithm of exponential function

These models have a fatal weakness, as discussed previously: they assume serially UNcorrelated errors, which, as we saw in the linear model example, is often not true. In real life, TS data usually violates our stationarity assumptions, which motivates our progression to autoregressive models.

When the dependent variable is regressed against one or more lagged values of itself the model is called autoregressive. The formula looks like this:

AR(p) model formula

When you describe the **"order"** of the model, as in, an AR model of order **"p"**, the p represents the number of lagged variables used within the model. For example an AR(2) model or *second-order* autoregressive model looks like this:

AR(2) model formula

Here, alpha (a) is the coefficient, and omega (w) is a white noise term. Alpha cannot equal zero in an AR model. Note that an AR(1) model with alpha set equal to 1 is a *random walk* and therefore not stationary.

AR(1) model with alpha = 1; random walk

Let's simulate an AR(1) model with alpha set equal to 0.6

```
# Simulate an AR(1) process with alpha = 0.6
np.random.seed(1)
n_samples = int(1000)
a = 0.6
x = w = np.random.normal(size=n_samples)
for t in range(n_samples):
    x[t] = a*x[t-1] + w[t]

_ = tsplot(x, lags=30)
```

AR(1) Model with alpha = 0.6

As expected the distribution of our simulated AR(1) model is normal. There is significant serial correlation between lagged values especially at lag 1 as evidenced by the PACF plot.

Now we can fit an AR(p) model using Python's statsmodels. First we fit the AR model to our simulated data and return the estimated alpha coefficient. Then we use the statsmodels function **"select_order()"** to see if the fitted model will select the correct lag. If the AR model is correct the estimated alpha coefficient will be close to our true alpha of 0.6 and the selected order will equal 1.

```
# Fit an AR(p) model to simulated AR(1) model with alpha = 0.6
mdl = smt.AR(x).fit(maxlag=30, ic='aic', trend='nc')
%time est_order = smt.AR(x).select_order(
maxlag=30, ic='aic', trend='nc')
true_order = 1
p('\nalpha estimate: {:3.5f} | best lag order = {}'
.format(mdl.params[0], est_order))
p('\ntrue alpha = {} | true order = {}'
.format(a, true_order))
```

Looks like we were able to recover the underlying parameters of our simulated data. Let's simulate an AR(2) process with alpha_1 = 0.666 and alpha_2 = -0.333. For this we make use of statsmodels' **"arma_generate_sample()"** function. This function allows us to simulate an AR model of arbitrary order. Note that there are some peculiarities of Python's version which require us to take some extra steps before using the function.

```
# Simulate an AR(2) process
n = int(1000)
alphas = np.array([.666, -.333])
betas = np.array([0.])
# Python requires us to specify the zero-lag value which is 1
# Also note that the alphas for the AR model must be negated
# We also set the betas for the MA equal to 0 for an AR(p) model
# For more information see the examples at statsmodels.org
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ar2 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ar2, lags=30)
```
```

AR(2) simulation with alpha_1 = 0.666 and alpha_2 = -0.333

Let's see if we can recover the correct parameters.

```
# Fit an AR(p) model to simulated AR(2) process
max_lag = 10
mdl = smt.AR(ar2).fit(maxlag=max_lag, ic='aic', trend='nc')
est_order = smt.AR(ar2).select_order(
maxlag=max_lag, ic='aic', trend='nc')
true_order = 2
p('\ncoef estimate: {:3.4f} {:3.4f} | best lag order = {}'
.format(mdl.params[0],mdl.params[1], est_order))
p('\ntrue coefs = {} | true order = {}'
.format([.666,-.333], true_order))
# coef estimate: 0.6291 -0.3196 | best lag order = 2
# true coefs = [0.666, -0.333] | true order = 2
```

Not bad. Let's see how the AR(p) model will fit MSFT log returns. Here is the return TS.

MSFT log returns time series

```
# Select best lag order for MSFT returns
max_lag = 30
mdl = smt.AR(lrets.MSFT).fit(maxlag=max_lag, ic='aic', trend='nc')
est_order = smt.AR(lrets.MSFT).select_order(
maxlag=max_lag, ic='aic', trend='nc')
p('best estimated lag order = {}'.format(est_order))
# best estimated lag order = 23
```

The best order is 23 lags or 23 parameters! Any model with this many parameters is unlikely to be useful in practice. Clearly there is more complexity underlying the returns process than this model can explain.

MA(q) models are very similar to AR(p) models. The difference is that the MA(q) model is a linear combination of past white noise error terms, as opposed to a linear combination of past observations like the AR(p) model. The motivation for the MA model is that we can observe "shocks" in the error process directly by fitting a model to the error terms. In an AR(p) model these shocks are observed indirectly by using the ACF on the series of past observations. The formula for an MA(q) model is:

Omega (w) is white noise with E(wt) = 0 and variance of sigma squared. Let's simulate this process using beta=0.6 and specifying the AR(p) alpha equal to 0.

```
# Simulate an MA(1) process
n = int(1000)
# set the AR(p) alphas equal to 0
alphas = np.array([0.])
betas = np.array([0.6])
# add zero-lag and negate alphas
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ma1 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ma1, lags=30)
```

Simulated ma(1) process with beta=0.6

The ACF shows that lag 1 is significant, which indicates that an MA(1) model may be appropriate for our simulated series. I'm not sure how to interpret the PACF showing significance at lags 2, 3, and 4 when the ACF only shows significance at lag 1. Regardless, we can now attempt to fit an MA(1) model to our simulated data. We use the same statsmodels **"ARMA()"** function, specifying our chosen orders, and call its **"fit()"** method to return the model output.

```
# Fit the MA(1) model to our simulated time series
# Specify ARMA model with order (p, q)
max_lag = 30
mdl = smt.ARMA(ma1, order=(0, 1)).fit(
maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
```

MA(1) model summary

The model was able to correctly estimate the lag coefficient, as 0.58 is close to our true value of 0.6. Also notice that our 95% confidence interval does contain the true value. Let's try simulating an MA(3) process, then use our ARMA function to fit a third-order MA model to the series and see if we can recover the correct lag coefficients (betas). Betas 1-3 are equal to 0.6, 0.4, and 0.2 respectively.

```
# Simulate MA(3) process with betas 0.6, 0.4, 0.2
n = int(1000)
alphas = np.array([0.])
betas = np.array([0.6, 0.4, 0.2])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
ma3 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n)
_ = tsplot(ma3, lags=30)
```

Simulated ma(3) process with betas = [0.6, 0.4, 0.2]

```
# Fit MA(3) model to simulated time series
max_lag = 30
mdl = smt.ARMA(ma3, order=(0, 3)).fit(
    maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
```

MA(3) model summary

The model was able to estimate the real coefficients effectively. Our 95% confidence intervals also contain the true parameter values of 0.6, 0.4, and 0.2. Now let's try fitting an MA(3) model to SPY's log returns. Keep in mind we do not know the *true* parameter values.

```
# Fit MA(3) to SPY returns
max_lag = 30
Y = lrets.SPY
mdl = smt.ARMA(Y, order=(0, 3)).fit(
    maxlag=max_lag, method='mle', trend='nc')
p(mdl.summary())
_ = tsplot(mdl.resid, lags=max_lag)
```

SPY MA(3) model summary

Let's look at the model residuals.

SPY MA(3) model Residuals

Not bad. Some of the ACF lags concern me, especially at 5, 16, and 18. It could be sampling error, but that combined with the heaviness of the tails makes me think this isn't the best model to predict future SPY returns.

As you may have guessed, the ARMA model is simply the merger between AR(p) and MA(q) models. Let's recap what these models represent to us from a quant finance perspective:

- AR(p) models try to capture the momentum and mean reversion effects often observed in trading markets.
- MA(q) models try to capture the shock effects observed in the white noise terms. These shock effects could be thought of as unexpected events affecting the observation process, e.g. surprise earnings, a terrorist attack, etc.

"For a set of products in a grocery store, the number of active coupon campaigns introduced at different times would constitute multiple 'shocks' that affect the prices of the products in question."

- AM207: Pavlos Protopapas, Harvard University

ARMA's weakness is that it ignores the *volatility clustering* effects found in most financial time series.

The model formula is:

ARMA(p, q) equation
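Spelled out in the notation used throughout (alphas for the AR coefficients, betas for the MA coefficients, w for white noise), the ARMA(p, q) model combines both pieces:

```
x_t = \alpha_1 x_{t-1} + \dots + \alpha_p x_{t-p}
      + w_t + \beta_1 w_{t-1} + \dots + \beta_q w_{t-q}
```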

Let's simulate an ARMA(2, 2) process with given parameters, then fit an ARMA(2, 2) model and see if it can correctly estimate those parameters. Set alphas equal to [0.5,-0.25] and betas equal to [0.5,-0.3].

```
# Simulate an ARMA(2, 2) model with alphas=[0.5,-0.25] and betas=[0.5,-0.3]
max_lag = 30
n = int(5000)  # lots of samples to help estimates
burn = int(n/10)  # number of samples to discard before fit
alphas = np.array([0.5, -0.25])
betas = np.array([0.5, -0.3])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
arma22 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma22, lags=max_lag)
mdl = smt.ARMA(arma22, order=(2, 2)).fit(
    maxlag=max_lag, method='mle', trend='nc', burnin=burn)
p(mdl.summary())
```

Simulated ARMA(2, 2) process

ARMA(2, 2) Model summary

The model has correctly recovered our parameters, and our true parameters are contained within the 95% confidence interval.

Next we simulate an ARMA(3, 2) model. Afterwards, we cycle through a non-trivial number of combinations of p and q to fit an ARMA model to our simulated series, choosing the combination that produces the lowest Akaike Information Criterion (AIC).

```
# Simulate an ARMA(3, 2) model with alphas=[0.5,-0.25,0.4] and betas=[0.5,-0.3]
max_lag = 30
n = int(5000)
burn = 2000
alphas = np.array([0.5, -0.25, 0.4])
betas = np.array([0.5, -0.3])
ar = np.r_[1, -alphas]
ma = np.r_[1, betas]
arma32 = smt.arma_generate_sample(ar=ar, ma=ma, nsample=n, burnin=burn)
_ = tsplot(arma32, lags=max_lag)

# pick best order by aic; smallest aic value wins
best_aic = np.inf
best_order = None
best_mdl = None
rng = range(5)
for i in rng:
    for j in rng:
        try:
            tmp_mdl = smt.ARMA(arma32, order=(i, j)).fit(method='mle', trend='nc')
            tmp_aic = tmp_mdl.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_order = (i, j)
                best_mdl = tmp_mdl
        except Exception:
            continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: 14108.27213 | order: (3, 2)
```

The correct order was recovered above. Below we see the output of our simulated time series before any model fitting.

Simulated ARMA(3, 2) series with alphas = [0.5, -0.25, 0.4] and betas = [0.5, -0.3]

ARMA(3, 2) best model summary

We see that the correct order was selected and the model correctly estimated our parameters. However, notice the MA.L1.y coefficient; the true value of 0.5 is almost outside of the 95% confidence interval!

Below we observe the model's residuals. Clearly it is a white noise process, thus the best model has been fit to *explain* the data.

ARMA(3, 2) best model residual white noise

Next we fit an ARMA model to SPY returns. The plot below is the time series before model fitting.

SPY Returns

```
# Fit ARMA model to SPY returns
best_aic = np.inf
best_order = None
best_mdl = None
rng = range(5)  # [0,1,2,3,4]
for i in rng:
    for j in rng:
        try:
            tmp_mdl = smt.ARMA(lrets['SPY'], order=(i, j)).fit(
                method='mle', trend='nc')
            tmp_aic = tmp_mdl.aic
            if tmp_aic < best_aic:
                best_aic = tmp_aic
                best_order = (i, j)
                best_mdl = tmp_mdl
        except Exception:
            continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: -11518.22902 | order: (4, 4)
```

We plot the model residuals.

SPY best model residuals, ARMA(4, 4)

The ACF and PACF are showing no significant autocorrelation. The QQ and Probability Plots show the residuals are approximately normal with heavy tails. However, this model's residuals do NOT look like white noise! Look at the highlighted areas of obvious conditional heteroskedasticity (*conditional volatility*) that the model has not captured.

ARIMA is a natural extension to the class of ARMA models. As previously mentioned, many of our TS are not stationary, however they can be made stationary by differencing. We saw an example of this when we took the first difference of a Gaussian random walk and showed that it equals white noise. Said another way, we took the nonstationary random walk and transformed it to stationary white noise by first-differencing.

Without diving too deeply into the equation, just know that the **"d"** references the number of times we difference the series. A side note: in Python, the **np.diff()** function handles differencing a series more than once (via its `n` argument), whereas the **pandas** **DataFrame.diff()/Series.diff()** methods only take a single difference at a time and do not implement the recursive differencing needed in TSA.
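A quick sketch of the distinction, using hypothetical numbers: `np.diff` takes an `n` argument for repeated differencing, while with pandas you chain `.diff()` calls.

```python
import numpy as np
import pandas as pd

x = np.array([1.0, 4.0, 9.0, 16.0, 25.0])  # perfect squares: 2nd difference is constant
d2_np = np.diff(x, n=2)                    # difference the series twice in one call
d2_pd = pd.Series(x).diff().diff().dropna().values  # pandas: one difference per .diff()
print(d2_np)  # [2. 2. 2.]
assert np.allclose(d2_np, d2_pd)           # both routes agree
```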

In the following example, we iterate through a non-trivial number of combinations of (p, d, q) orders, to find the best ARIMA model to fit SPY returns. We use the AIC to evaluate each model. The lowest AIC wins.

```
# Fit ARIMA(p, d, q) model to SPY Returns
# pick best order and final model based on aic
best_aic = np.inf
best_order = None
best_mdl = None
pq_rng = range(5)  # [0,1,2,3,4]
d_rng = range(2)  # [0,1]
for i in pq_rng:
    for d in d_rng:
        for j in pq_rng:
            try:
                tmp_mdl = smt.ARIMA(lrets.SPY, order=(i, d, j)).fit(
                    method='mle', trend='nc')
                tmp_aic = tmp_mdl.aic
                if tmp_aic < best_aic:
                    best_aic = tmp_aic
                    best_order = (i, d, j)
                    best_mdl = tmp_mdl
            except Exception:
                continue
p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
# aic: -11518.22902 | order: (4, 0, 4)

# ARIMA model resid plot
_ = tsplot(best_mdl.resid, lags=30)
```

It should be no surprise that the best model has a differencing of 0. Recall that we already took the first difference of log prices to calculate the stock returns. Below, I plot the model residuals. The result is essentially identical to the ARMA(4, 4) model we fit above. Clearly this ARIMA model has not explained the conditional volatility in the series either!

ARIMA model fit to SPY returns

Now we have accumulated enough knowledge to make a simple forecast of future returns. Here we make use of our model's **forecast()** method. As arguments, it takes an integer for the number of time steps to predict and a decimal for the alpha argument to specify the confidence intervals. The default setting is a 95% confidence interval; for 99%, set alpha equal to 0.01.

```
# Create a 21 day forecast of SPY returns with 95%, 99% CI
n_steps = 21
f, err95, ci95 = best_mdl.forecast(steps=n_steps)  # 95% CI
_, err99, ci99 = best_mdl.forecast(steps=n_steps, alpha=0.01)  # 99% CI
idx = pd.date_range(data.index[-1], periods=n_steps, freq='D')
fc_95 = pd.DataFrame(np.column_stack([f, ci95]),
                     index=idx, columns=['forecast', 'lower_ci_95', 'upper_ci_95'])
fc_99 = pd.DataFrame(np.column_stack([ci99]),
                     index=idx, columns=['lower_ci_99', 'upper_ci_99'])
fc_all = fc_95.combine_first(fc_99)
fc_all.head()
```

```
# Plot 21 day forecast for SPY returns
plt.style.use('bmh')
fig = plt.figure(figsize=(9,7))
ax = plt.gca()
ts = lrets.SPY.iloc[-500:].copy()
ts.plot(ax=ax, label='Spy Returns')
# in sample prediction
pred = best_mdl.predict(ts.index[0], ts.index[-1])
pred.plot(ax=ax, style='r-', label='In-sample prediction')
styles = ['b-', '0.2', '0.75', '0.2', '0.75']
fc_all.plot(ax=ax, style=styles)
plt.fill_between(fc_all.index, fc_all.lower_ci_95, fc_all.upper_ci_95, color='gray', alpha=0.7)
plt.fill_between(fc_all.index, fc_all.lower_ci_99, fc_all.upper_ci_99, color='gray', alpha=0.2)
plt.title('{} Day SPY Return Forecast\nARIMA{}'.format(n_steps, best_order))
plt.legend(loc='best', fontsize=10)
```

21-day SPY return forecast, ARIMA(4, 0, 4)

ARCH(p) models can be thought of as simply an AR(p) model applied to the variance of a time series. Another way to think about it is that the variance of our time series NOW, *at time t*, is conditional on the squared observations of the series in previous periods.

ARCH(1) model formula (Penn State)

Assuming the series has zero mean we can express the model as:

ARCH(1) model if zero mean
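In symbols, with x_t the observed series and w_t standard white noise (matching the simulation below), the zero-mean ARCH(1) model is:

```
x_t = \sigma_t w_t, \qquad \sigma_t^2 = \alpha_0 + \alpha_1 x_{t-1}^2
```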

```
# Simulate ARCH(1) series
# Var(yt) = a_0 + a_1*y{t-1}**2
# if a_1 is between 0 and 1 then yt is white noise
np.random.seed(13)
a0 = 2
a1 = .5
y = w = np.random.normal(size=1000)
Y = np.empty_like(y)
for t in range(len(y)):
    Y[t] = w[t] * np.sqrt((a0 + a1*y[t-1]**2))

# simulated ARCH(1) series, looks like white noise
tsplot(Y, lags=30)
```

Simulated ARCH(1) process

Simulated ARCH(1) process, squared

Notice the ACF, and PACF seem to show significance at lag 1 indicating an AR(1) model for the variance may be appropriate.

Simply put, GARCH(p, q) is an ARMA model applied to the variance of a time series, i.e., it has an autoregressive term and a moving average term. The autoregressive piece models the conditional variance using its own past values, while the moving average piece models it using past squared errors (the "shocks"). The basic GARCH(1, 1) formula is:

GARCH(1, 1) formula from QuantStart.com

Omega (w) is white noise, and alpha and beta are parameters of the model. Also alpha_1 + beta_1 must be less than 1 or the model is unstable. We can simulate a GARCH(1, 1) process below.
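Written out, matching the simulation code below (a0, a1, b1 in code):

```
\epsilon_t = \sigma_t w_t, \qquad
\sigma_t^2 = \alpha_0 + \alpha_1 \epsilon_{t-1}^2 + \beta_1 \sigma_{t-1}^2,
\qquad \alpha_1 + \beta_1 < 1
```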

```
# Simulating a GARCH(1, 1) process
np.random.seed(2)
a0 = 0.2
a1 = 0.5
b1 = 0.3
n = 10000
w = np.random.normal(size=n)
eps = np.zeros_like(w)
sigsq = np.zeros_like(w)
for i in range(1, n):
    sigsq[i] = a0 + a1*(eps[i-1]**2) + b1*sigsq[i-1]
    eps[i] = w[i] * np.sqrt(sigsq[i])
_ = tsplot(eps, lags=30)
```

Simulated GARCH(1, 1) process

Again, notice that overall this process closely resembles white noise, however take a look when we view the squared **eps** series.

Simulated GARCH(1, 1) process, squared

There is clearly autocorrelation present, and the significance of the lags in both the ACF and PACF indicates we need both AR and MA components for our model. Let's see if we can recover our process parameters using a GARCH(1, 1) model. Here we make use of the **arch_model** function from the **arch** package.

```
# Fit a GARCH(1, 1) model to our simulated EPS series
# We use the arch_model function from the ARCH package
am = arch_model(eps)
res = am.fit(update_freq=5)
p(res.summary())
```

GARCH model fit summary

Now let's run through an example using SPY returns. The process is as follows:

- Iterate through combinations of ARIMA(p, d, q) models to best fit our time series.
- Pick the GARCH model orders according to the ARIMA model with lowest AIC.
- Fit the GARCH(p, q) model to our time series.
- Examine the model residuals and squared residuals for autocorrelation

Also note that I've chosen a specific time period to better highlight key points. However the results will be different depending on the time period under study.

```
def _get_best_model(TS):
    best_aic = np.inf
    best_order = None
    best_mdl = None
    pq_rng = range(5)  # [0,1,2,3,4]
    d_rng = range(2)  # [0,1]
    for i in pq_rng:
        for d in d_rng:
            for j in pq_rng:
                try:
                    tmp_mdl = smt.ARIMA(TS, order=(i, d, j)).fit(
                        method='mle', trend='nc')
                    tmp_aic = tmp_mdl.aic
                    if tmp_aic < best_aic:
                        best_aic = tmp_aic
                        best_order = (i, d, j)
                        best_mdl = tmp_mdl
                except Exception:
                    continue
    p('aic: {:6.5f} | order: {}'.format(best_aic, best_order))
    return best_aic, best_order, best_mdl

# Notice I've selected a specific time period to run this analysis
TS = lrets.SPY.loc['2012':'2015']  # .loc replaces the deprecated .ix here
res_tup = _get_best_model(TS)
# aic: -5255.56673 | order: (3, 0, 2)
```

Residuals of ARIMA(3, 0, 2) model fit to SPY returns

Looks like white noise.

Squared residuals of ARIMA(3, 0, 2) model fit to SPY returns

Squared residuals show autocorrelation. Let's fit a GARCH model and see how it does.

```
# Now we can fit the arch model using the best fit arima model parameters
_, order, _ = res_tup  # unpack the (aic, order, model) tuple returned above
p_ = order[0]
o_ = order[1]
q_ = order[2]
# Using student T distribution usually provides better fit
am = arch_model(TS, p=p_, o=o_, q=q_, dist='StudentsT')
res = am.fit(update_freq=5, disp='off')
p(res.summary())
```

GARCH(3, 2) model fit to SPY returns

Convergence warnings can occur when dealing with very small numbers; scaling the series by a factor of 10 or 100 can help when necessary, though it isn't needed for this demonstration. Below are the model residuals.

Residuals of GARCH(3, 2) model fit to SPY returns

Looks like white noise above. Now let's view the ACF and PACF of the squared residuals.

Looks like we have achieved a good model fit as there is no obvious autocorrelation in the squared residuals.

- QuantStart - https://www.quantstart.com/articles#time-series-analysis
- Harvard AM207 Lectures in Python - http://iacs-courses.seas.harvard.edu/courses/am207/blog/lecture-17.html
- Penn State Stats (STAT 510) - https://onlinecourses.science.psu.edu/stat510/node/78
- Stationarity figure + tsplot - http://www.seanabu.com/2016/03/22/time-series-seasonal-ARIMA-model-in-python/
- Stationarity quote, etc. - http://people.duke.edu/~rnau/411diff.htm
- Interpreting QQ plots - http://stats.stackexchange.com/questions/101274/how-to-interpret-a-qq-plot
- Kaplan SchweserNotes (Level 2) - Quantitative Methods

**Strategy Summary** | **Results** | **Conclusions/Analysis**

First, if you're unfamiliar with the Implied Volatility Skew Strategy you can find a recent deep dive into the strategy and its performance **here.**

In this short post, I look at the effect of using only the top N ranked ETFs from each Long/Short portfolio. In this case, N is equal to 3. This is an arbitrary selection and this study could be done with the top 1, 2, 4, etc. This differs from the original strategy in that the original strategy simply bins the ETFs into quantiles (according to the factor value) and creates the portfolio from the top and bottom quantiles; **this strategy takes the top 3 ETFs from within the top and bottom quantiles.**

**I will reference the original strategy as S1 and the modified strategy as S2 for the remainder of the post.**

This strategy comparison exposes some key themes found in quant finance, but first some highlights:

- S2 outperforms S1 on a pure total return basis over the period. (~23% vs ~18%)
- S2 has a higher annualized alpha. (26% vs 20%)
- S2 has a lower annualized beta. (0% vs 10%)
- S2 has nearly double the volatility of S1. (13% vs 7%)
- S2 has nearly double the maximum drawdown of S1. (-4.4% vs -2.3%)
- S2 has max drawdown duration 3x larger than S1. (72 days vs 24 days)
- Both strategies have a positive information ratio of approximately 6%.

We can extract a few lessons here.

S2 trades a maximum of six ETFs at a time vs the ~10+ ETFs of S1, so **we would expect S2 to have a higher level of volatility than S1.** You might ask, "Why is that?" This is basic Modern Portfolio Theory, which relates diversification and portfolio volatility to the number of assets traded and their correlations with each other. Essentially, **the theory says that the volatility of a portfolio of assets will be less than the weighted sum of their individual volatilities as long as the correlations between the assets are not all equal to 1.** This is the power and benefit of diversification in a nutshell.
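The claim is easy to verify with a two-asset toy example (all numbers hypothetical, not the strategy's actual ETFs): portfolio volatility equals the weighted sum of individual volatilities only when the correlation is exactly 1.

```python
import numpy as np

# Toy two-asset portfolio (hypothetical numbers)
vols = np.array([0.20, 0.20])   # each asset has 20% volatility
w = np.array([0.5, 0.5])        # equal weights

def port_vol(rho):
    """Portfolio volatility for a given pairwise correlation rho."""
    corr = np.array([[1.0, rho], [rho, 1.0]])
    cov = np.outer(vols, vols) * corr   # covariance matrix
    return np.sqrt(w @ cov @ w)

print(round(port_vol(1.0), 3))  # 0.2 -- perfectly correlated: no benefit
print(round(port_vol(0.3), 3))  # 0.161 -- diversification lowers volatility
```

Dropping the correlation from 1.0 to 0.3 cuts portfolio volatility from 20% to roughly 16% in this toy case; holding fewer assets means fewer such offsets, which is why S2 runs hotter.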

Unfortunately for S2, the increase in total returns is not enough to compensate us for the increase in volatility (risk). This shows up primarily in the annualized Sharpe ratio. **S1 clearly has the edge in return per unit of risk, with a Sharpe ratio of ~2.8 vs ~2.** This also manifests in the Calmar ratio, which tracks the ratio of average annual returns to maximum drawdown. Here **S1 is also superior, sporting a Calmar ratio of ~9.8 vs ~6.7.**

S2's lower beta surprised me initially, but it is likely explained by trading fewer ETFs overall, combined with those ETFs having low market correlation.

**Another issue we encounter with S2 vs S1 has to do with liquidity**. In the equity curve charts, I have highlighted the general region where the strategy was on hiatus. Notice in S2, that there are tiny trades still occurring throughout this period. There should be no trades happening. **What this means is that because S2 has to invest more cash into a smaller number of ETFs, if any of those ETFs are very thinly traded, it results in lots of unfilled orders that can take a significant amount of time to finish allocating or liquidating.** Again, this is another side effect of S2's reduced diversification.

Thus far all the evidence supports S1 as the superior strategy indicating that using a top N factor rank within top and bottom quantiles does not improve the strategy.


**Strategy Summary** | **References** | **4-Week Holding Period Strategy Update** | **1-Week Holding Period Strategy Update (Target Leverage=2)**

**ABSTRACT**

This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?
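As I read the paper, the smirk measure is the implied volatility of out-of-the-money puts minus that of at-the-money calls (check the source for the exact moneyness cutoffs):

```
SKEW_{i,t} = VOL^{OTMP}_{i,t} - VOL^{ATMC}_{i,t}
```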

The following symbols were removed from analysis: IPW, IYC, PLTM, GAF, GUNR, IYK

*Results simulated using the Quantopian platform.*

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

**Strategy Summary** | **References** | **4-Week Holding Period Strategy Update** | **1-Week Holding Period Strategy Update (Target Leverage=2)** | **Deep Dive into the Weekly Strategy using Quantopian's Pyfolio** | **Strategy Concerns**


Here I use Quantopian's portfolio analytics tool to further analyze the strategy. Note that all returns are simulated.

Annualized returns are extremely healthy at 26% with annualized volatility in the single digits. The Sharpe ratio is impressive at 3.2 and the Calmar ratio is double digits, reflecting the low maximum drawdown this strategy has sustained. Some features we would hope for include a positive skew, which indicates a higher likelihood of positive returns. Annualized alpha is substantial at 22% while maintaining very low beta exposure to the benchmark (SPY) at 10%. These numbers are fantastic, but we must pump the brakes on our excitement as the strategy only has 9 months of results. Furthermore, for at least 1.5 of those 9 months the strategy was not tracked due to the demise of the Yahoo Finance options API.

A quick glance shows that the strategy has had strong cumulative return performance over the first three quarters of 2016. We can see the gap in the strategy results during the period I was searching for a new options data provider.

Monthly and weekly returns skew to the positive. The only losing month thus far appears to be March. Even without the near ~8% gain in April the strategy still skews positively.

Rolling portfolio beta relative to the SPY benchmark is near zero. Six month rolling sharpe has been consistently high around 3 for the last three months. The strategy has near zero sensitivity to the rolling Fama-French factors which is a positive sign.

This strategy turns over almost once a week, as designed, with daily trading volume averaging a little over 20,000 shares. Gross leverage very rarely exceeds two. Based on the gross leverage we can see that the strategy struggles to get filled on the more thinly traded ETFs, which results in the gross leverage rarely achieving its target of two. This is despite my custom slippage model, which increases the bar volume limit to 50%, meaning up to 50% of a minute bar's volume can be traded by the strategy.

Drawdowns have been manageable and very small, with the max at 2.26% only lasting 16 days. Max drawdown duration has been approximately 1 month or 24 days.

Here are the maximum long and short portfolio weights. ~22% isn't awful but it's not ideal. Again this is a reflection of slippage and lack of timely fills. However it is not all bad as seen below.

Clearly those overweight positions are the result of slippage during portfolio turnover, as those allocations appear as short-duration spikes on both the long and short side in the **Portfolio Allocation Over Time** plot. More importantly, in the **Long/Short Max and Median Position Concentration** chart we can see that during the position weight spikes, the portfolio has had nearly offsetting positions, such that it is not unhedged or overexposed to either the long or short side for long periods of time. We can confirm this by plotting the net leverage and average net leverage.

Let's examine some trading statistics.

The first item that jumps out is that the quantity of **losing round trips** is greater than that of **winning round trips** for **short trades**, which is interesting. Only 38% of the short trades are profitable. However, look at the **Profit factor** row: for short trades it is very close to 1, indicating near breakeven. How is that possible? Look at the **Ratio Avg. Win:Avg Loss**. Short trades have the highest ratio at $1.58, meaning the average winning trade is almost 1.6x the size of the average losing trade in dollar terms. The **Profit factor** for the long trades is a healthy $1.93 and the strategy overall shows an investable profit factor of $1.41.
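To make those statistics concrete, here is how they relate, computed on a hypothetical list of per-trade P&L (not the strategy's actual round trips):

```python
import numpy as np

# Hypothetical per-trade dollar P&L, NOT the strategy's actual trades
pnl = np.array([120.0, -80.0, 95.0, -60.0, 150.0, -110.0])
wins, losses = pnl[pnl > 0], pnl[pnl < 0]

pct_profitable = len(wins) / len(pnl)               # share of winning trades
profit_factor = wins.sum() / abs(losses.sum())      # gross profit / gross loss
win_loss_ratio = wins.mean() / abs(losses.mean())   # avg win : avg loss

print(pct_profitable, round(profit_factor, 2), round(win_loss_ratio, 2))
# 0.5 1.46 1.46
```

A win rate below 50% can still be near breakeven or better when the win:loss ratio is high enough, which is exactly the short-trade pattern above.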

The strategy looks great overall. However there are key concerns that must be highlighted. I will do so in bullet form:

- Small sample size: only 9 months of data, and for ~1.5 of those months the strategy was not tracked.
- Liquidity and slippage can be an issue when the strategy is run live. The question is whether, during live trading, there is more liquidity available for some of the thinly traded ETFs than is reflected in historical volume, due to the creation/redemption process.
- If gross and net leverage are brought into better balance the strategy returns could differ from the simulation results for better or for worse.
- Execution and commission costs could, in theory, be improved and/or reduced for the benefit of the strategy due to the algorithm's high turnover. This gives the portfolio manager/administrator leverage in negotiating commissions.
- Bad data is always a concern that could bias the simulation results up or down.
- Simulated results can be substantially different from live trading results.



**Strategy Restart** | **Strategy Summary** | **References** | **4-Week Holding Period Strategy Update** | **1-Week Holding Period Strategy Update (Target Leverage=2)**

After a ~2 month pause, the implied volatility long/short strategy has returned! If you were previously unaware, this strategy relied on aggregating free options data via the now defunct Yahoo Finance Options API. After some time I was able to track down another free, reliable source for options data via Barchart.com. I show how to create a web scraper to aggregate the data here: *Aggregating Free Options Data with Python*. Without further delay, I present the strategy updates below.


- Motivation
- Code Requirements
- Creating our Scraper Class
- Aggregating the Data
- Github Gist Code
- Disclaimers

This year I implemented a simulated trading strategy based on the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor has predictive power for equity returns for up to 6 months.

Because historical options data is difficult to find and/or prohibitively expensive I tracked the results of the simulated strategy in near real time using a combination of the Yahoo Finance Options API made available via the Pandas package and the Quantopian platform for realistic backtesting. Unfortunately, the Yahoo Finance API has changed and it appears that the options data is no longer offered. Therefore my last strategy update took place on July 12, 2016 which can be seen here.

The strategy has thus far exceeded all expectations. I tracked two versions of the strategy, one maintained a 4-week holding period, the other a weekly holding period. The 4-week holding period strategy showed a cumulative return of ~9%, with 25 of 28 weeks showing positive gains! The weekly strategy (with target leverage of 2) fared slightly better with total returns ~16%, double-digit alpha ~24%, near-zero beta ~8%, single digit volatility ~8%, with a max drawdown of ~2.2%!

The strategy showed immense promise, but with only 28 weeks of results the sample size is unfortunately too small. As a Python programmer, when one API goes down it's time to find another. With that said, I created my own using the excellent free Barchart.com resource.

First you must sign up for a free account with Barchart.com and note your username and password. For reference, my current system is running Windows 8.1, 64-Bit with a WinPython distribution using Python 3.5. The code requires the following packages:

- pandas
- numpy
- requests
- bs4
- re
- logging
- more_itertools
- tqdm

Assuming you have the requisite packages and have signed up for a Barchart.com user account we can now code our scraper class. High-level, there are a few things to note when designing our scraper.

We can get basic options data without an account. It looks like this:

source: http://www.barchart.com/options/stocks/SPY

That's decent but we need volatility and greeks data which requires a free account.

source: http://www.barchart.com/options/stocks/&view=vol&sym=SPY&date=Sep-02-2016

This data is much more interesting. Note that there are some unique columns from each table that we would like to extract and combine later on.

Let's get into the code. First we need to initialize the `barchart_options` class.

```
class barchart_options():
    # --------------------------------------------
    def __init__(self, symbol):
        self.symbol = symbol
        self.payload = {'email': 'YOUR_USER_EMAIL', 'password': 'YOUR_PASSWORD'}
        self.base_url = r"http://www.barchart.com/options/stocks/"
        self.login_url = r'http://www.barchart.com/login.php'
        self.basic_URL = self.base_url + symbol
        self.base_SRC = self.__get_base_src()
        self.greeks_url_suffix = '&view=vol&sym={SYMBOL}&date={DATE}'
        self.reidx_cols = ['Symbol', 'Expiry', 'Type', 'Strike', 'Ask', 'Bid', 'Last', 'Change',
                           'Underlying_Price', 'ImpliedVolatility', 'TheoreticalValue', 'Delta',
                           'Gamma', 'Rho', 'Theta', 'Vega', 'Open Int', 'Volume']
```

Let's go through the class variables:

- `self.symbol`: the equity symbol we want data for
- `self.payload`: your Barchart login information
- `self.base_url`: the free Barchart basic options URL
- `self.basic_URL`: the URL combining our symbol and the base_url to get the basic options data
- `self.login_url`: the login page of Barchart.com
- `self.base_SRC`: the HTML page source returned after requesting the basic_URL; we need this page to extract all the available options expiration dates
- `self.greeks_url_suffix`: the URL suffix appended to the base_url to get the volatility and greeks data
- `self.reidx_cols`: a list of the combined column headers in the order I preferred (not a requirement)

Here is the function to retrieve the basic options data source page.

```
    # --------------------------------------------
    # get basic options data source
    # --------------------------------------------
    def __get_base_src(self):
        with requests.session() as S:
            res = S.get(self.basic_URL)
            return bs(res.text, 'lxml')
```

Now we need a helper function to search the HTML and extract the expiration dates. The expiration dates are required to create the volatility/greeks url we need to get the good data.

```
    # --------------------------------------------
    # extract expiry dates
    # --------------------------------------------
    def _extract_expiry_dates(self):
        raw_date_links = []
        for link in self.base_SRC.find_all('a', href=re.compile('date')):
            raw_date_links.append(list(link.attrs.values())[0])
        reg_exp = r'[A-Za-z]{3}-\d{2}-\d{4}'
        dates = []
        for raw_date in raw_date_links:
            ms = re.search(reg_exp, raw_date)
            dates.append(ms.group())
        return list(unique_everseen(dates))
```

Let's break down what's happening in this function. First we use a combination of the `bs4` and `re` packages to search for all the 'a' tags, which represent links in the page. We don't want all of them, only those that correspond to expiry dates. If you examine the HTML source you will find that each link with an expiration date contains the word 'date' within its 'href'. We pass `re.compile('date')` to `find_all()` to extract only these links. Note the `raw_date_links.append()` call. To retrieve the text we need as a string, we have to manipulate the `link.attrs.values()` result. `link.attrs.values()` returns a *view* of the dictionary's values and NOT the actual value we want. To convert it to a usable data structure we first convert it to a list using `list()`. The list has only one item, which is the text we need, so we use indexing to grab the first and only item.
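The view-to-list dance can be seen in isolation (the href below is a hypothetical example of what such a link's attributes might contain):

```
# A tag's attributes come back as a dict; .values() is a view, not a list.
d = {'href': 'options/stocks/SPY?date=Sep-02-2016'}  # hypothetical attrs dict
vals = d.values()       # dict_values view -- not indexable
first = list(vals)[0]   # materialize as a list, then index the only item
print(first)            # options/stocks/SPY?date=Sep-02-2016
```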

Next we use the `re` library to search each item in the list of `raw_date_links` and extract only the date from a raw and verbose string. We use a regular expression to find any sequence of text that takes the example form of 'Sep-02-2016'. Once we loop through the list extracting the dates we return a list of expiry dates. Also note we use the `unique_everseen()` function (from the `more_itertools` package) on the list of dates. This function loops through an iterable, yielding only the first occurrence of each item while maintaining the original order. We use it for two reasons. First, there are duplicate dates in the list. Second, functions like `set()` or `np.unique()` return only unique items but do not maintain the order. `unique_everseen()` avoids both issues.
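Here is a minimal stdlib sketch of the same idea, using hypothetical hrefs and `dict.fromkeys()` in place of `unique_everseen()` (both preserve first-seen order):

```
import re

# Hypothetical hrefs of the kind found on the options page.
raw_date_links = [
    'options/stocks/SPY?date=Sep-02-2016',
    'options/stocks/SPY?date=Sep-02-2016',   # duplicate expiry link
    'options/stocks/SPY?date=Sep-09-2016',
]
dates = [re.search(r'[A-Za-z]{3}-\d{2}-\d{4}', link).group()
         for link in raw_date_links]

# dict.fromkeys() keeps first-seen order, like unique_everseen();
# set() would drop duplicates but scramble the ordering.
ordered_unique = list(dict.fromkeys(dates))
print(ordered_unique)  # ['Sep-02-2016', 'Sep-09-2016']
```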

Next we need the helper functions for creating the volatility/greeks url, logging in to access the page, and extracting the correct headers and call/put tables.

```
    # --------------------------------------------
    # get option greeks source data
    # --------------------------------------------
    def __login_greeks(self, symbol, date):
        with requests.session() as S:
            _ = S.post(self.login_url, data=self.payload)
            res = S.get(self.__create_greeks_url(symbol, date))
            return bs(res.text, 'lxml')

    def __create_greeks_url(self, symbol, date):
        url = self.base_url + self.greeks_url_suffix.format(SYMBOL=symbol, DATE=date)
        return url

    def _get_greeks_src(self, symbol, date):
        src = self.__login_greeks(symbol, date)
        return src

    def __clean_headers(self, headers):
        hdr = headers.replace('\t', '').split('\n')
        hdrs = [h for h in hdr if len(h) > 0]
        return hdrs

    def __get_greek_headers(self, greek_src):
        hdrs = [head.get_text() for head in greek_src.find_all('tr', class_='datatable_header')][0]
        return self.__clean_headers(hdrs)

    def __get_greeks_tables(self, greek_src):
        tables = []
        for tbl in greek_src.find_all('table', class_='datatable'):
            tables.append(tbl)
        return tables
```
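The header cleaning step is easy to verify in isolation. Below is a standalone copy of the `__clean_headers` logic applied to a made-up header string of the kind the table rows produce:

```
def clean_headers(headers):
    """Strip tabs, split on newlines, drop empty strings --
    the same logic as __clean_headers above."""
    hdr = headers.replace('\t', '').split('\n')
    return [h for h in hdr if len(h) > 0]

print(clean_headers('Strike\n\tBid\n\tAsk\n'))  # ['Strike', 'Bid', 'Ask']
```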

Next we are going to create our pandas dataframes from the HTML tables. To reiterate, there are columns we want from both the basic options data page and the volatility/greeks data page. To do this we convert both pages' tables into dataframes and then combine them, so we obtain one dataframe with only the unique columns we want from both. We also add two columns, 'Underlying_Price' and 'Expiry', to the combined dataframe.

Here are the functions to first create the basic options dataframes.

```
    # --------------------------------------------
    # create basic options dfs
    # --------------------------------------------
    def __get_underlying_last_price(self, greek_src):
        last = [float(d.text) for d in greek_src.find_all('span', class_='last')]
        return last

    def __create_base_call_df(self, expiry):
        tables = []
        for tbl in self.base_SRC.find_all('table', class_='datatable'):
            tables.append(tbl)
        call_rows = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
        cols = ['Strike', 'Symbol', 'Type', 'Bid', 'Ask', 'Last', 'Change', 'Volume', 'Open Int']
        call_df = pd.DataFrame(call_rows[1:], columns=cols).apply(pd.to_numeric, errors='ignore')
        return call_df

    def __create_base_put_df(self, expiry):
        tables = []
        for tbl in self.base_SRC.find_all('table', class_='datatable'):
            tables.append(tbl)
        put_rows = [[td.text for td in tr.find_all('td')] for tr in tables[1].find_all('tr')]
        cols = ['Strike', 'Symbol', 'Type', 'Bid', 'Ask', 'Last', 'Change', 'Volume', 'Open Int']
        put_df = pd.DataFrame(put_rows[1:], columns=cols).apply(pd.to_numeric, errors='ignore')
        return put_df
```
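The string-to-number conversion these functions rely on can be demonstrated with toy rows standing in for the scraped HTML table (here using `errors='coerce'`, which is safer on modern pandas than the older `errors='ignore'`):

```
import pandas as pd

# Scraped cells arrive as strings; made-up rows mimic the HTML table.
rows = [['Strike', 'Bid'], ['217.00', '1.25'], ['218.00', '0.95']]
df = pd.DataFrame(rows[1:], columns=rows[0]).apply(pd.to_numeric, errors='coerce')
print(df['Strike'].dtype)  # float64
```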

Next I show the functions for creating the master call and put dataframes. These functions:

- take the HTML source for the volatility/greeks page
- extract the tables and create the rows
- create and add the column headers
- extract the most recent underlying stock price and create a new column
- create the expiry column
- use the pandas function `combine_first()` to merge the basic and advanced dataframes
- use the pandas function `reindex()` and the `self.reidx_cols` list to reorder the columns

```
    # --------------------------------------------
    # create, merge basic and greek options dfs
    # --------------------------------------------
    def _create_calls_df(self, greek_src, expiry):
        tables = self.__get_greeks_tables(greek_src)
        rows = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
        calls = pd.DataFrame(rows[1:],
                             columns=self.__get_greek_headers(greek_src)
                             ).apply(pd.to_numeric, errors='ignore')
        calls['Underlying_Price'] = self.__get_underlying_last_price(greek_src) * len(calls.index)
        calls['Expiry'] = [pd.to_datetime(expiry)] * len(calls.index)
        base_call_df = self.__create_base_call_df(expiry)
        mrg = calls.combine_first(base_call_df)
        CALLS = mrg.reindex(columns=self.reidx_cols)
        return CALLS

    def _create_puts_df(self, greek_src, expiry):
        tables = self.__get_greeks_tables(greek_src)
        rows = [[td.text for td in tr.find_all('td')] for tr in tables[1].find_all('tr')]
        puts = pd.DataFrame(rows[1:],
                            columns=self.__get_greek_headers(greek_src)
                            ).apply(pd.to_numeric, errors='ignore')
        puts['Underlying_Price'] = self.__get_underlying_last_price(greek_src) * len(puts.index)
        puts['Expiry'] = [pd.to_datetime(expiry)] * len(puts.index)
        base_put_df = self.__create_base_put_df(expiry)
        mrg = puts.combine_first(base_put_df)
        PUTS = mrg.reindex(columns=self.reidx_cols)
        return PUTS
# --------------------------------------------
```
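The `combine_first()` / `reindex()` merge is easier to see on toy frames. The two dataframes below stand in for the greeks and basic tables (values are made up); `combine_first()` produces the union of columns aligned on the index, and `reindex()` imposes our preferred column order:

```
import pandas as pd

# Toy stand-ins for the greeks and basic tables (illustrative values).
greeks = pd.DataFrame({'Strike': [100.0, 105.0], 'Delta': [0.55, 0.45]})
basic = pd.DataFrame({'Strike': [100.0, 105.0], 'Volume': [10, 20]})

merged = greeks.combine_first(basic)  # union of columns, aligned on the index
merged = merged.reindex(columns=['Strike', 'Delta', 'Volume'])
print(list(merged.columns))  # ['Strike', 'Delta', 'Volume']
```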

Now that we have created the scraper class we can write the script to aggregate options data for multiple symbols.

To start we need to import the necessary libraries.

```
import time
import pandas as pd
import numpy as np
from copy import copy
from tqdm import tqdm
import logging
from _barchart_options_scraper_ import barchart_options # <- this is our scraper class
p = print
```

Now we set up our logging infrastructure to track and debug our code. If you have never used Python logging and instead rely on lots of `print()` statements, it's time to make the transition.

```
# ------------------------------------------ \\\\
today = pd.Timestamp.today().strftime('%m-%d-%y')
datapath = "..\\YOUR_PROJECT_DIR\\Barchart_Options_Data\\"
logpath = datapath + "Barchart_OptionsData_LogFiles\\"
logging.basicConfig(filename=logpath + 'BRC_Options_Log_{}.log'.format(today),
                    format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p',
                    level=logging.INFO)
```

Next we define our list of symbols, in this case a dictionary. I've used this dictionary before; it uses the ETF categories as its keys and lists of ETF symbols as its values.

```
# ------------------------------------------ \\\\
ETFS = {
'Large Cap' :['SPY','IVV','VOO','IWB'],
'Mid Cap' :['MDY','IJH','VO','IWR'],
'Small Cap' :['IWM','IJR','VB'],
'Global Equity' :['VEU','ACWI','VXUS','DGT'],
'AsiaPac Equity' :['EWT','EWY','EWA','EWS','AAXJ','FXI','EWH','EWM','EPI','INDA','RSX'],
'Europe Equity' :['FEZ','EZU','VGK','HEDJ','EWU','EWI','EWP','EWQ','EWL','EWD'],
'Emerging | Frontier' :['EWZ','EWW','ECH','GAF','FM','EEM','VWO'],
'Real Estate' :['RWO','RWX','RWR','IYR','VNQ'],
'Consumer Discretionary':['XLY','XRT','FXD','VCR','RTH','IYC'],
'Consumer Staples' :['XLP','FXG','VDC','ECON','IYK'],
'Energy' :['XLE','IPW','XOP','VDE','IYE','IXC','OIH'],
'Financials' :['XLF','KBE','KIE','IYG','KRE'],
'Healthcare' :['XLV','XBI','IBB'],
'Industrial' :['XLI','IYT','VIS','IYJ'],
'Materials' :['XLB','XHB','XME','IGE','MOO','LIT','GUNR'],
'Technology' :['XLK','SMH','HACK','FDN'],
'Telecom' :['IYZ','IXP','VOX'],
'Utilities' :['IDU','XLU','VPU'],
'Oil | Gas' :['UNG','BNO','OIL'],
'Precious Metals' :['GLD','SLV','IAU'],
'Bonds' :['BND','AGG','JNK','LQD'],
'T-Bond' :['TLT','IEF','IEI','SHY','BIL','MINT'],
'Precious Metals Miners':['SIL','GDX','GDXJ','PLTM']
}
```

Now we create our HDF5 store using pandas, initialize an empty list to hold any symbols that return an error, and count the number of symbols for tracking purposes.

```
# ------------------------------------------ \\\\
etfstore = pd.HDFStore(datapath + 'ETF_options_data_{}.h5'.format(today))
etf_missing_symbols = []
etf_val_count = 0
for i in ETFS.values():
    etf_val_count += len(i)
etf_sym_count = etf_val_count
N = copy(etf_sym_count)
```

Now we create the master loop which will aggregate our data. I will post the code and then describe what is happening at each layer.

```
# ------------------------------------------ \\\\
for category, symbols in tqdm(ETFS.items()):
    logging.info("---------- {} ---------".format(category))
    # \\\
    for i, symbol in tqdm(enumerate(symbols, start=1)):
        N -= 1
        # \\\
        if not pd.isnull(symbol):
            try:
                brc = barchart_options(symbol)
                expirys = brc._extract_expiry_dates()
                appended_data = []
                # \\\
                for expiry in tqdm(expirys):
                    grk = brc._get_greeks_src(symbol, expiry)
                    calls = brc._create_calls_df(grk, expiry)
                    puts = brc._create_puts_df(grk, expiry)
                    mrg = pd.concat([calls, puts])
                    appended_data.append(mrg)
                # \\\
                data = pd.concat(appended_data)
                logging.info(data.describe())
                etfstore.put(symbol, data, format='table')
            except Exception as e:
                logging.error("ERROR: {}".format(e), exc_info=True)
                etf_missing_symbols.append(symbol)
                continue
            pct_total_left = (N / etf_sym_count)
            logging.info('{}..[done] | {} of {} ETF symbols collected | {:>.2%}'.
                         format(symbol, i, len(symbols), pct_total_left))
            p('{}..[done] | {} of {} ETF symbols collected | {:>.2%}'.
              format(symbol, i, len(symbols), pct_total_left))
        time.sleep(np.random.choice(np.random.uniform(0.5, 1.5, [3]), p=[.7, .2, .1]))
```

First we loop through the category level; then, for each category, we loop through its list of symbols. At the symbol level the code first checks that the symbol is not null, then uses a `try`/`except` block to handle any errors while the code is running, so the loop continues even if we encounter timeouts or missing symbols.

Next the code retrieves the basic options data HTML source and extracts the list of option expiration dates. The code then loops through each expiration date creating the master call and put dataframes. Then we merge the call and put dataframes together using the `pd.concat()` function. We then append the merged dataframe to the empty list called `appended_data`. Once we have finished looping through the list of expiration dates, creating and appending our merged dataframes, we merge ALL of that symbol's dataframes into ONE large dataframe containing all the data from each expiry. We then `put()` that dataframe into our HDFStore called 'etfstore'.

If there are any exceptions or errors, we log them, append the symbol which caused the error to our `etf_missing_symbols` list for tracking, and continue the loop. The code below the `try`/`except` is for tracking progress while the script runs. Note the `time.sleep()` call, which is probably overly complicated but on some level necessary. Essentially all it does is force the code to pause for a random period between 0.5 and 1.5 seconds. This is done so we don't overwhelm the servers, although I'm sure Barchart.com can handle the load.
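If the weighted `np.random.choice` construction feels heavy, a plain `random.uniform` achieves the same polite pause. A minimal sketch (the function name is my own):

```
import random
import time

def polite_pause(lo=0.5, hi=1.5):
    """Sleep a random interval so requests aren't fired in lockstep."""
    delay = random.uniform(lo, hi)
    time.sleep(delay)
    return delay

d = polite_pause()
```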

Finally, after looping through and aggregating each symbol's data we close our HDFStore and log any errors, the symbol names, and the error rate.

```
# ------------------------------------------ \\\\
etfstore.close()
N_etf_errors = len(etf_missing_symbols)
etf_error_rate = N_etf_errors / etf_sym_count
logging.info('etf missing symbols:\n{}'.format(etf_missing_symbols))
logging.info('etf error rate: {} errors / {} symbols = {:3.2%}'
             .format(N_etf_errors, etf_sym_count, etf_error_rate))
```

When we run the script we should see something like this:

The red output is actually created by the tqdm library. Here's the description from the GitHub page: "Instantly make your loops show a smart progress meter - just wrap any iterable with `tqdm(iterable)`, and you're done!"

Your log file will likely look similar to this:

As you can see it's a plain text file tracking our connection to Barchart.com, and after each symbol is aggregated it outputs the `data.describe()` function from pandas on our supersized symbol dataframe before we add it to our HDF5 store.

Let's confirm our data output is as it should be:

```
with pd.HDFStore(datapath + 'ETF_options_data_{}.h5'.format(today), mode='r') as hdf:
    p(hdf.keys())
    p(hdf['/SPY'].info())
    p(hdf['/SPY'].iloc[10:20])
```

The HDF store keys:

A sample dataframe (SPY) from the HDF store:

The entire script takes roughly 1 hour and 20 minutes on this collection of symbols.

Please use this code responsibly. Blackarbs has no intent to violate any terms of use or legal protections.

**Background**
**Strategy Summary**
**Implementation Details**
**Strategy Results and Analysis**
**References (Notes)**

Recently I was blessed with an introduction to Matt Morrissy, S&P Global Market Intelligence's Director of Business Development, and gifted a trial membership to the S&P Capital IQ platform [1]. Specifically, I was given an opportunity to review and provide feedback on their "Alpha Factor Library." Their team, the Quantamental Group [2], has researched and developed several hundred factors across the spectrum of investment themes, including but not limited to price momentum, earnings quality, valuation, and volatility.

The Capital IQ offerings and the Quantamental Group's research are fantastic. However, due to the size, diversity, and sophistication of their client base [3], most factors are not tested in a production-ready fashion. This means the backtests do not include assumptions about market frictions such as commissions and liquidity constraints.

Pouncing on the opening, I asked for, and was granted, the opportunity to test some of the factors on the Quantopian platform. As always, at Blackarbs, the focus is on realistic implementation and profitability. Unfortunately, too many research papers publish amazing results that cannot be replicated, or that fail with even the slightest change in market friction assumptions. Let's put the first strategy to the test.

The factor we are testing is in the valuation theme [4]. We look at the ratio of operating cash flow (OCF) to the enterprise value (EV). OCF is a measure of cash generated by a firm's "normal" business operations. EV is the theoretical takeover price or buy price for a firm. OCF/EV can be thought of as a measure of how much cash the firm generates relative to its takeover price. A high ratio indicates great value as the firm generates lots of cash compared to its EV. A low ratio indicates low value as the firm is worth way more than the cash flow it generates.
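As a quick worked example with made-up figures, consider a firm generating $500M of operating cash flow against a $5B enterprise value:

```
# Illustrative (made-up) figures, not drawn from any real firm.
ocf = 500e6   # operating cash flow: $500M
ev = 5e9      # enterprise value: $5B
ratio = ocf / ev
print(ratio)  # 0.1 -- the firm generates 10% of its takeover price in cash each year
```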

This strategy first filters the Quantopian universe to approximately 2500-3000 stocks. Then it sorts the stocks according to the OCF/EV ratio and bins the stocks into quintiles. The top quintile is bought and the bottom quintile is sold.
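The sort-and-bin step can be sketched with pandas' `qcut` on a hypothetical factor cross-section (random made-up values over 100 ticker-like names; the real strategy uses ~2500-3000 stocks):

```
import numpy as np
import pandas as pd

# Hypothetical OCF/EV values for a toy 100-name universe.
rng = np.random.default_rng(0)
ocf_ev = pd.Series(rng.normal(0.08, 0.03, 100),
                   index=['STK%d' % i for i in range(100)])

quintile = pd.qcut(ocf_ev, 5, labels=False)  # 0 = lowest OCF/EV ... 4 = highest
longs = ocf_ev[quintile == 4].index          # top quintile: buy
shorts = ocf_ev[quintile == 0].index         # bottom quintile: sell
print(len(longs), len(shorts))  # 20 20
```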

- **Market Neutral:** the strategy attempts to maintain equal dollar value invested in both long and short legs.
- **Sector Neutral:** the strategy attempts to maintain sector weights equal to those of a specified index/benchmark. [5]
- **Monthly Holding Period:** the strategy holds positions for one month before factor rebalancing.
- **Weekly Target Portfolio Rebalancing:** the strategy compares the current portfolio weights to the target portfolio weights. If any position is outside the target weight +/- 5%, the algorithm sends orders to rebalance the portfolio back to the target. This prevents the portfolio from drifting too far from the target weights.
- **No other factors are considered:** no momentum overlays, no stop losses or profit targets, no crash protection, etc.
- **Simulated portfolio size is $10,000,000.**
- **Commissions and slippage constraints are shown below.**
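The weekly +/- 5% drift check described above can be sketched as follows (function name, symbols, and weights are all hypothetical):

```
def needs_rebalance(current, target, band=0.05):
    """Return True if any position's weight drifts outside target +/- band.

    current/target map symbol -> portfolio weight; a sketch of the weekly
    check described above, not the strategy's actual implementation."""
    return any(abs(current.get(sym, 0.0) - w) > band
               for sym, w in target.items())

target = {'AAA': 0.12, 'BBB': -0.12}
print(needs_rebalance({'AAA': 0.10, 'BBB': -0.10}, target))  # False: within band
print(needs_rebalance({'AAA': 0.20, 'BBB': -0.10}, target))  # True: AAA drifted
```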

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

Annual return is a solid 8% over an ~12 year period. Annual volatility is in the single digits at ~7% per year! Sharpe ratio is greater than 1, and max drawdown is also single digits at 9%! There is slight negative skewness which is ok. Alpha is a healthy 8%. Beta is effectively zero! All are solid traits of an effective market neutral strategy.

Cumulative returns are great. Volatility matched returns are impressive.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

In 10 of 12 years the strategy finished in positive territory, which is excellent. The strategy was able to tread water in 2009 and 2010, suffering very small losses.

Drawdowns are manageable and realistic (<10%). The drawdown duration is also reasonable in that the longest duration was approximately 8 months. It is somewhat concerning that the largest drawdowns all occurred within the last 3 years. This would definitely be something to track closely.

Rolling Sharpe is ok; the average is greater than 1 which is a positive. Rolling beta is effectively zero which is indicative of a true market neutral strategy. Overall the strategy is indifferent to the Fama-French factors except for ~2008 and ~2014-15.

Gross leverage never exceeds the Quantopian max of 3x. However, what is not shown is that net leverage was consistently near zero indicating long and short positions were balanced. The Long/Short Exposure plot shows this quite well. Turnover seems reasonable given that the strategy maintained anywhere from ~400-500 positions at a time. Trading volume seems to be increasing over time but that could be illusory due to the recent large spike in late 2015/early 2016. Position exposure is what we would expect with the top long position weighted less than 2% and the top short position less than 3%.

The strategy is mostly indifferent to the stress events identified above. During 13 of 17 events the strategy recorded non-negative average returns. Overall, the strategy appears to be unaffected by various types of market stress events.

- S&P Global Market Intelligence - Alpha Factor Library. For Inquiries contact Matt Morrissy via mmorrissy@spglobal.com
- S&P Capital IQ - Quantamental Research
- Institutional clients have unique strategy/implementation costs that are likely impossible to generalize accurately.
- My strategy is not an EXACT replication of the S&P alpha factor strategy. I did not ensure that the relevant fundamental data points encompassed the same lookback period. Instead, I simply used the default point-in-time data provided by Morningstar via the Quantopian platform.
- The sector neutral benchmark in this example is a stylized version of the Russell 3000 index. The Russell 3000 index is rule based so creating a proxy index via the Quantopian platform was simpler. However, it is not exact as I added additional screens to exclude certain stock issues, foreign ADR's, stocks with missing data, etc. Sector neutrality in this case means portfolio sector weights matched to the weights of the filtered stylized index.

**Strategy Summary**
**References**
**4-Week Holding Period Strategy Update**
**1-Week Holding Period Strategy Update (Target Leverage=1)**
**1-Week Holding Period Strategy Update (Target Leverage=2)**

**ABSTRACT**

This predictability persists for at least six months, and firms with the steepest volatility smirks are those experiencing the worst earnings shocks in the following quarter. The results are consistent with the notion that informed traders with negative news prefer to trade out-of-the-money put options, and that the equity market is slow in incorporating the information embedded in volatility smirks. [1]

Here is the skew measure they use.

SOURCE: WHAT DOES INDIVIDUAL OPTION VOLATILITY SMIRK TELL US ABOUT FUTURE EQUITY RETURNS?
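The formula image did not survive here, but as I read the cited paper, the smirk (SKEW) is the implied volatility of an out-of-the-money put minus that of an at-the-money call. A hypothetical sketch, omitting the paper's exact moneyness cutoffs:

```
def volatility_smirk(iv_otm_put, iv_atm_call):
    """SKEW = IV(OTM put) - IV(ATM call); inputs here are illustrative
    annualized implied volatilities, not the paper's data."""
    return iv_otm_put - iv_atm_call

print(round(volatility_smirk(0.28, 0.22), 2))  # 0.06
```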

Download the spreadsheet here.

Download a text file of all the portfolio stocks here.

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

RESULTS SIMULATED USING QUANTOPIAN PLATFORM

**References**

- Zhang, Xiaoyan, Zhao, Rui, and Xing, Yuhang, "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?"