Exploring the Relationship Between SPY and TLT


In this post I examine the relationship between the SPY and TLT ETFs. This can be considered Part 2.5 of my series exploring the 2-Asset Leveraged ETF portfolio of UPRO and TMF. Thus far I've posted results of the strategy using two implementations: "Inverse Risk-Parity" and "Risk-Parity". I've also covered some key concepts behind investing in leveraged ETFs, including convexity and beta-slippage/decay. Now we can explore the strengths and weaknesses of the strategy.

Read More

Could SPY ETF Component Participation Have Alerted Us to Sell (Hedge) Prior to the Recent Market Downturn?

This is the Python code version of a guest post presented here on RectitudeMarket.com. If you would like to read the analysis without the Python code, please click the link above.

To market pundits and casual observers, the recent correction in equity markets came as a surprise. Overall, headline economic data was positive at best and mixed at worst. Domestically, capital markets had been looking ‘ok’ while most of the major volatility was taking place abroad, in emerging markets and commodity-based economies like China.

That did not stop SPY from collapsing from an August high of $210.56 to an August low of $187.23, a decline of roughly 11%. We’ve yet to have a daily close above $200 since the drop, even with a dovish Fed delaying the rate-hike cycle once again.

As an investor I’m always searching for intuitive indicators that, in essence, tip the collective market’s hand. One indicator that always seemed interesting to me was ‘ETF Participation’.

What is ETF Participation?

There are likely several definitions on the web. For this analysis, I define ETF participation as the number of ETF component stocks whose returns are positive or negative on a given day. I selected the SPY ETF for this study because SPY is considered by most to be the de facto US market proxy.
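
To make that definition concrete, here's a minimal toy sketch using a made-up three-stock returns DataFrame (not the actual SPY holdings); the real calculation on SPY's components appears later in the post.


import pandas as pd

# toy daily returns for three hypothetical components
toy_rets = pd.DataFrame({'AAA': [0.010, -0.020, 0.003],
                         'BBB': [-0.005, 0.010, -0.001],
                         'CCC': [0.002, 0.004, -0.010]})

n_up = (toy_rets > 0).sum(axis=1)    # components with positive returns each day
n_down = (toy_rets < 0).sum(axis=1)  # components with negative returns each day
print(pd.concat([n_up.rename('advancers'), n_down.rename('decliners')], axis=1))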

To conduct this analysis I collected SPY holdings data from State Street. I used only a 252-day lookback due to the changing composition of the ETF. The first chart is a simple bar plot showing the number of current SPY component stocks with returns above and below zero on each day.

To begin I set up my exploratory research environment within the IPython notebook.


"""Windows 8.1 64bit, Python 3.4, Pandas 0.17"""
import pandas_datareader.data as web
import pandas as pd
pd.set_option('display.precision', 5)
pd.set_option('display.show_dimensions', True)
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 50)
import numpy as np
from pandas.tseries.offsets import *
import datetime as dt
import os
import time
from pprint import pprint as pp
import matplotlib.pyplot as plt
import matplotlib as mpl
import matplotlib.dates as dates

mpl.rcParams['font.family'] = 'Rambla'
%matplotlib inline
base_size = 11
size=(1.5 * base_size, base_size)

import seaborn as sns
sns.set_style('white', {"xtick.major.size": 3, "ytick.major.size": 3})
flatui = ["#9b59b6", "#3498db", "#95a5a6", "#e74c3c", "#34495e", "#2ecc71", "#f4cae4"]
sns.set_palette(sns.color_palette(flatui, 7))
# =========================================================== #
# filepath management
project_dir = r'C:\\myprojectdir\\'
watermark_path = project_dir + 'mywatermarkpath.png'
spdrs_data_dir = project_dir + r"_SPDR_holdings\\"
# =========================================================== #
# Datetime management  
d = dt.date.today()
# ---------- Days ---------- 
l10 = d - 10 * BDay()
l21 = d - 21 * BDay()
l63 = d - 63 * BDay()
l252 = d - 252 * BDay()
# ---------- Years ---------- 
l252_x2 = d - 252 * 2 * BDay() 
l252_x3 = d - 252 * 3 * BDay() 
l252_x5 = d - 252 * 5 * BDay()
l252_x7 = d - 252 * 7 * BDay() 
l252_x10 = d - 252 * 10 * BDay() 
l252_x20 = d - 252 * 20 * BDay() 
l252_x25 = d - 252 * 25 * BDay() 

After setting up my environment, I import the SPY ETF holdings data and make some small edits to the ticker symbols so they match the format expected by the Yahoo Finance API. Finally, I run a function to collect price data and return a Pandas Panel.


etf = 'SPY'
etf_holdings = pd.read_excel(spdrs_data_dir + 'holdings-{}.xls'.format(etf.lower()), skiprows=3)
etf_components = etf_holdings[['Identifier', 'Weight']]

def _get_px(sym, start, end):
    return web.DataReader(sym, 'yahoo', start, end)

def _create_symbol_list():
    symbols = etf_components['Identifier'].dropna()
    # escape the dot so only literal periods are replaced (e.g. BRK.B -> BRK-B for Yahoo);
    # newer pandas treats str.replace patterns as regexes by default
    symbols = symbols.str.replace(r'\.', '-')
    symbols.iloc[-1] = 'SPY'  # overwrite the last entry so SPY itself is included
    symbols = symbols.sort_values()
    return symbols

def _get_stock_price_datapanel():
    missing_symbols = []
    panel_data = {}
    for n, sym in enumerate(symbols):
        try:
            panel_data[sym] = _get_px(sym, l252, d)
            print('[ {} ] downloaded || Remaining = {:<.2%} '.format(\
                                         sym, np.round(1 - n/len(etf_components),4) ))
        except Exception as e:
            missing_symbols.append(sym)
            #pp(e)
            pp('...{} symbol missing...'.format(sym))
        finally:
            pass

    num_errors = len(missing_symbols)        
    print('number of missing symbols: {}\nsymbol error rate: {}'.format(\
                                num_errors, np.round(num_errors/len(etf_components),4) )) 
    pp(missing_symbols)
        
    datapanel = pd.Panel.from_dict(panel_data)
    return datapanel

symbols = _create_symbol_list()  
datapanel = _get_stock_price_datapanel()

My output looks like this:

Next I calculate both log returns and cumulative log returns, then isolate and sum the quantity of stocks whose daily returns were greater than or less than 0.0.


last_n = 252
datapanel = datapanel.ix[:,-last_n:,:]

# ------ Log Returns 
cl = pd.DataFrame()
for S in datapanel:
    cl[S] = datapanel[S]['Adj Close']
    
lrets = np.log(cl / cl.shift(1)) 
crets = lrets.cumsum()
# ------ Number of Stocks whose returns gt/lt zero 
sym_gt_zero = (lrets > 0)
sym_lt_zero = (lrets < 0)

lr = lrets.copy()
gt_df = sym_gt_zero
lt_df = sym_lt_zero

gt_label = r'N Stocks Returns $(> 0)$'
lt_label = r'N Stocks Returns $(< 0)$'

lr[gt_label] = gt_df.sum(axis=1)
lr[lt_label] = lt_df.sum(axis=1) * -1

Next I construct a plot function to give an overview of the data collected so far. 


def _basic_participation_plot():
    mpl.rcParams['font.family'] = 'Rambla'
    lbl_size = 15
    f, (ax1, ax2) = plt.subplots(nrows=2, figsize=size)
    axes = (ax1, ax2)
    # ----- Date Formatting
    fmt = '%A, %m.%d.%Y'
    dt_txt_fmt = '[{} - {}]'.format((d - last_n * BDay()).strftime(fmt), (d - BDay()).strftime(fmt))    
    
    crets.SPY.plot(ax=ax1, sharex=ax2, label='SPY Returns')
    ax2.bar(lr.index, lr[gt_label].values, width=1, color=sns.xkcd_rgb['neon green'], label=gt_label)    
    ax2.bar(lr.index, lr[lt_label].values, width=1, color=sns.xkcd_rgb['bright red'], label=lt_label)
    ax1.set_title(r'SPY ETF Component Participation L/{} Trading Days'.format(\
                                                        last_n), fontsize=20, fontweight='demi', y=1.02)
    
    for ax in axes:
        ax.legend(loc='upper right', fontsize=lbl_size, frameon=True).get_frame().set_edgecolor('blue')
        dates_rng = pd.date_range(crets.index[0], crets.index[-1], freq='10D')
        plt.xticks(dates_rng, [dtz.strftime('%m-%d-%y') for dtz in dates_rng], rotation=45) 
        ax.xaxis.set_minor_formatter(dates.DateFormatter(''))
        ax.tick_params(axis='both', labelsize=lbl_size)
        
    ax1.set_ylabel('Cumulative Log Returns', fontsize=lbl_size, fontweight='demi')
    ax2.set_ylabel('Quantity of Stocks', fontsize=lbl_size, fontweight='demi')
    ax2.set_xlabel('Date', fontsize=lbl_size, fontweight='demi')        
    plt.axhline(0, color='k')    
    
    im = mpl.image.imread(watermark_path)
    f.figimage(im, xo=-50, yo=-50, origin='upper', alpha=0.125, zorder=10)
    #sns.despine()  
    
    plt.tight_layout()
    plt.savefig(project_dir + r'_blog_RM_basicSPY_participation_plot_L{} Days_{}.png'.format(\
                                                    last_n, dt_txt_fmt))
    
"""Run the function"""    
_basic_participation_plot()    

The chart may look good, but it doesn’t tell us a whole lot. We need a way to see the general trends in participation. To do this I decided to use a basic 63-day exponential moving average of both time series.


"""compute exponential moving average"""
gt_ema = pd.ewma(lr[gt_label], span=63, min_periods=5)
lt_ema = pd.ewma(lr[lt_label], span=63, min_periods=5) * -1

def _ema_participation_plot():
    mpl.rcParams['font.family'] = 'Rambla'
    lbl_size = 15
    f, (ax1, ax2) = plt.subplots(nrows=2, figsize=size)
    axes = (ax1, ax2)
    # ----- Date Formatting
    fmt = '%A, %m.%d.%Y'
    dt_txt_fmt = '[{} - {}]'.format((d - last_n * BDay()).strftime(fmt), (d - BDay()).strftime(fmt))    
    
    crets.SPY.plot(ax=ax1, sharex=ax2, label='SPY Returns')   
    gt_ema.plot(ax=ax2, label='EMA of ' + gt_label)
    lt_ema.plot(ax=ax2, label='EMA of ' + lt_label)
    ax1.set_title(r'SPY ETF Component Participation L/{} Trading Days'.format(\
                                                        last_n), fontsize=20, fontweight='demi', y=1.02)
    for ax in axes:
        ax.legend(loc='upper right', fontsize=lbl_size, frameon=True).get_frame().set_edgecolor('blue')
        dates_rng = pd.date_range(crets.index[0], crets.index[-1], freq='10D')
        plt.xticks(dates_rng, [dtz.strftime('%m-%d-%y') for dtz in dates_rng], rotation=45) 
        ax.xaxis.set_minor_formatter(dates.DateFormatter(''))
        ax.tick_params(axis='both', labelsize=lbl_size)
        
    ax1.set_ylabel('Cumulative Log Returns', fontsize=lbl_size, fontweight='demi')
    ax2.set_ylabel('Quantity of Stocks', fontsize=lbl_size, fontweight='demi')
    ax2.set_xlabel('Date', fontsize=13, fontweight='demi')
    im = mpl.image.imread(watermark_path)
    f.figimage(im, xo=-50, yo=-50, origin='upper', alpha=0.125, zorder=10)
    #sns.despine()    
    plt.tight_layout()
    plt.savefig(project_dir + r'_blog_RM_SPY_Components_EMA_participation_plot_L{} Days_{}.png'.format(\
                                                    last_n, dt_txt_fmt)) 
"""Run the plot function"""                                                   
_ema_participation_plot() 

This looks a little more interesting. We can clearly see that many more stocks were participating to the upside late in 2014, during the strong rebound from the October lows. Additionally, we can see the indecision of market participants throughout the entirety of 2015 as the spread compressed and oscillated. Even more interesting is that the two time series crossed over repeatedly during the period when SPY was testing 52-week highs.

To get a better understanding of the two EMA series I decided to plot the spread between them. I did not attempt to standardize the spread and made no assumptions about whether it is mean-reverting or random. However, I do believe it crystallizes our naked-eye observations from the previous chart.


"""calculate spread between ema time series"""
spread = (gt_ema - lt_ema)

def _plot_raw_spread():
    mpl.rcParams['font.family'] = 'Rambla'
    lbl_size = 14
    f = plt.figure(figsize=size)
    ax = plt.gca()
    # ----- Date Formatting
    fmt = '%A, %m.%d.%Y'
    dt_txt_fmt = '[{} - {}]'.format((d - last_n * BDay()).strftime(fmt), (d - BDay()).strftime(fmt))      
    
    spread.plot(figsize=size, label='Raw EMA Spread')
    ax.set_title(r'SPY ETF Components Raw EMA Spread L/{} Trading Days'.format(\
                                                        last_n), fontsize=20, fontweight='demi', y=1.02)
    ax.legend(loc='upper right', fontsize=lbl_size, frameon=True).get_frame().set_edgecolor('blue')
    dates_rng = pd.date_range(crets.index[0], crets.index[-1], freq='10D')
    plt.xticks(dates_rng, [dtz.strftime('%Y-%m-%d') for dtz in dates_rng], rotation=45) 
    ax.xaxis.set_minor_formatter(dates.DateFormatter(''))
    ax.tick_params(axis='both', labelsize=lbl_size)    
    plt.axhline(0, color='k')
    ax.set_ylabel('Quantity of Stocks', fontsize=lbl_size, fontweight='demi')
    ax.set_xlabel('Date', fontsize=lbl_size, fontweight='demi')
    im = mpl.image.imread(watermark_path)
    f.figimage(im, xo=-50, yo=-50, origin='upper', alpha=0.125, zorder=10)
    #sns.despine()    
    plt.tight_layout()
    plt.savefig(project_dir + r'_blog_RM_SPY_Components_EMA_Spread_plot_L{} Days_{}.png'.format(\
                                                    last_n, dt_txt_fmt))     
"""Run plot function"""    
_plot_raw_spread()

If it wasn’t clear before, it definitely is now. Market participants have not been as bullish as previously advertised. The spread was declining as far back as January 2015 and mostly oscillated around zero before breaking into negative territory around June 2015. In mid-August we see the real break, which appears to coincide with the aforementioned ~11% plunge.

Looking at the EMA spread, the zero line appears to be the important psychological support/resistance line. Theoretically the zero line would represent the maximum indecision point of all market participants, so a move above or below the line would indicate increasing conviction in either direction. Therefore it only makes sense to study those zero crossing points a little further.


"""get the datetime index of zero line spread crosses within a tolerance level""" 
crosses = spread[np.isclose(spread, 0, atol=5)].index

def _plot_zero_crosses():
    mpl.rcParams['font.family'] = 'Rambla'
    lbl_size = 14
    
    f, (ax1, ax2, ax3) = plt.subplots(nrows=3, figsize=size)
    axes = (ax1, ax2, ax3)
    # ----- Date Formatting
    fmt = '%A, %m.%d.%Y'
    dt_txt_fmt = '[{} - {}]'.format((d - last_n * BDay()).strftime(fmt), (d - BDay()).strftime(fmt))    
    
    crets.SPY.plot(ax=ax1, sharex=True, label='SPY Returns')
    gt_ema.plot(ax=ax2, label='EMA of ' + gt_label)    
    lt_ema.plot(ax=ax2, label='EMA of ' + lt_label)
    spread.plot(ax=ax3, label='Raw_EMA_Spread')
    ax1.set_title(r'SPY ETF Component Participation L/{} Trading Days'.format(\
                                                        last_n), fontsize=20, fontweight='demi', y=1.02)
    for ax in axes:
        axymin, axymax = ax.get_ylim()
        ax.set_autoscaley_on(False)
        ax.vlines(crosses, ymin=axymin, ymax=axymax, color='g', linestyle='-', alpha=0.4)        
        ax.legend(loc='upper left', fontsize=lbl_size, frameon=True).get_frame().set_edgecolor('blue')
        dates_rng = pd.date_range(crets.index[0], crets.index[-1], freq='10D')
        plt.xticks(dates_rng, [dtz.strftime('%Y-%m-%d') for dtz in dates_rng], rotation=45) 
        ax.xaxis.set_minor_formatter(dates.DateFormatter(''))
        ax.tick_params(axis='both', labelsize=lbl_size)

    ax1.axhline(0, color='k')
    ax3.axhline(0, color='k')
    
    ax1.set_ylabel('Cumulative Log Returns', fontsize=lbl_size, fontweight='demi')
    ax2.set_ylabel('Quantity of Stocks', fontsize=lbl_size, fontweight='demi')
    ax3.set_ylabel('Quantity of Stocks', fontsize=lbl_size, fontweight='demi')
    ax3.set_xlabel('Date', fontsize=13, fontweight='demi')
    im = mpl.image.imread(watermark_path)
    f.figimage(im, xo=-50, yo=-50, origin='upper', alpha=0.125, zorder=10)
    #sns.despine()    
    plt.tight_layout()
    plt.savefig(project_dir + r'_blog_RM_SPY_Components_EMA_participation_Multiplot_L{} Days_{}.png'.format(\
                                                    last_n, dt_txt_fmt))    
"""Run plot function"""    
_plot_zero_crosses()  

This chart plots vertical lines at each zero crossing for the Raw EMA Spread. Again the chart appears to reinforce my previous theory that market participants have been largely indecisive regarding overall market direction as expressed through ETF component participation.

However, that’s not enough information to make a decision about whether we should have been alert to an impending decline. We need a quantitative method to really drill down into this theory.

To quantify this analysis and add structure to the theory I constructed a simple model. Assuming the zero line is an important psychological barrier, I calculated the subsequent cumulative return over N days following each instance of the spread crossing zero. I then calculated the average cumulative return for each N-day lookahead period.

The lookahead periods I chose were 1, 2, 3, 5, 10, and 21 days. The results are unambiguous.


def _cumlrets_given_trigger(trigger_dates, look):
    '''function to calculate cumulative returns given a date and lookahead period'''
    cr = {}
    for date in trigger_dates:
        start_int = lrets.SPY.index.get_loc(date)
        start = lrets.SPY.index[start_int]
        end = start + look * BDay()
        cumlrets = lrets.SPY.ix[start:end].cumsum().iloc[-1]
        cr[date] = cumlrets 
    conditional_rets = pd.Series(cr, index=cr.keys(), \
                                 name='_{} days lookahead_'.format(look)).sort_index()
    return conditional_rets

def _get_avgrets_bylook(trigger_dates):
    '''function to aggregate cumulative returns given a date and lookahead period'''
    lookahead = [1, 2, 3, 5, 10, 21]
    rets_dict = {}
    avg_returns_looks = {} 
    for look in lookahead:
        rets = _cumlrets_given_trigger(trigger_dates, look)
        mean_rets = rets.mean()
        avg_returns_looks['{}_Days'.format(look)] = mean_rets
        rets_dict['{}_Days'.format(look)] = rets

    avg_rets_looks = pd.Series(avg_returns_looks, index=avg_returns_looks.keys(),\
                               name='_avg_returns_given_lookahead_')
    avg_rets_looks = avg_rets_looks[['{}_Days'.format(str(i)) for i in lookahead]]
    return avg_rets_looks, rets_dict   
    
"""looking at the average"""
avg_crets, avgdict = _get_avgrets_bylook(crosses)
print(avg_crets)

Current average cumulative returns are negative across all lookahead periods!

Notice a pattern? Looking at the table, the trend appears to change for all lookahead timeframes in late May. It never got better. Obvious confirmation of the near-term trend change could be seen as late as the end of July 2015.
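
For reference, a table like the one described above can be assembled from the avgdict dictionary returned by _get_avgrets_bylook. A minimal sketch (the column ordering is my choice, not part of the original output):


"""assemble the per-trigger-date cumulative returns into one table"""
lookahead_cols = ['{}_Days'.format(i) for i in [1, 2, 3, 5, 10, 21]]
trigger_table = pd.DataFrame(avgdict)[lookahead_cols]
print(trigger_table)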

Conclusions

Circling back to the original question: analyzing the SPY ETF components’ participation gave clear indications that broad market internals were weakening as early as late May, and they have continued to break down through the present day.

Unfortunately, market internals still appear weak. For my personal portfolio, I would not feel confident as a long-term dip buyer until ETF participation crosses back over the zero line and we see some positive cumulative returns accumulating on the 10- and 21-day periods.
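
As a rough sketch of how those two conditions could be monitored programmatically (variable names follow the earlier code, and treating 'positive' as simply 'above zero' is my own simplification):


"""rough check of the two conditions described above"""
spread_above_zero = spread.iloc[-1] > 0
lookaheads_positive = (avg_crets[['10_Days', '21_Days']] > 0).all()
print('EMA spread above zero: {}'.format(spread_above_zero))
print('10 and 21 day avg cumulative returns positive: {}'.format(lookaheads_positive))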

This analysis provides a good foundation for further research into this indicator’s effectiveness. There are, however, some weaknesses to consider. The EMA span was chosen simply to represent a quarterly period but was otherwise arbitrary, as were the lookahead periods. Furthermore, I did not have access to the SPY ETF’s components ‘as of’ each date in the analysis, which would have been more rigorous and largely eliminated survivorship and look-ahead bias. Therefore, I analyzed only the last 252 trading days to minimize those effects on the results presented.

How to get Free Intraday Stock Data with Python and BarCharts OnDemand API

To this day the most popular article I have ever written on this blog is "How to get Free Intraday Stock Data with Netfonds". Unfortunately, the Netfonds API has really declined in terms of usability, with too many popular stocks missing and irregular trade and price quotes. Simply put, as the API went down, so did the code.

However, all hope is not lost. The wonderful people at BarChart.com have created a well-documented, easily accessible API for intraday stock data and even near real-time quote access. The only caveat is that you must request access to get a personal API key. Again, this is FREE, and the process is extremely simple and straightforward. I received my API key within the same day, 24 hours at most.

Step 1: Go to http://www.barchartondemand.com/api.php and request an API key. 

Step 2: Use or modify my code to get FREE intraday stock data. 

Something to note: in this example I use the SP500 components as my list of stock symbols. I covered how to get fresh SPY holdings data directly from the provider in a previous post titled "GET FREE FINANCIAL DATA W/ PYTHON (STATE STREET ETF HOLDINGS - SPY)". Now onto the code...

First I import the necessary modules.


# -*- coding: utf-8 -*-
import time
t0 = time.clock()

import pandas as pd
from pandas.tseries.offsets import BDay
import numpy as np
import datetime as dt
from copy import copy
import warnings
warnings.filterwarnings('ignore',category=pd.io.pytables.PerformanceWarning)

Next I set up what I refer to as a 'datetime management' section of my code. I do this for ALL my time series analysis as a convenient way to standardize my code across projects. Sometimes I only use one of the variables, as I do in this case, but it's so convenient when doing any sort of exploratory analysis with time series. I also do the same for my filepaths.


# ================================================================== #
# datetime management

d = dt.date.today()
# ---------- Days ---------- 
l10 = d - 10 * BDay()
l21 = d - 21 * BDay()
l63 = d - 63 * BDay()
l252 = d - 252 * BDay()
# ---------- Years ---------- 
l252_x2 = d - 252 * 2 * BDay() 
l252_x3 = d - 252 * 3 * BDay() 
l252_x5 = d - 252 * 5 * BDay()
l252_x7 = d - 252 * 7 * BDay() 
l252_x10 = d - 252 * 10 * BDay() 
l252_x20 = d - 252 * 20 * BDay() 
l252_x25 = d - 252 * 25 * BDay()

# ================================================================== #
# filepath management

project_dir = r'D:\\' 
price_path = project_dir + r'Stock_Price_Data\\'

Next I set up a convenience function for creating the BarChart url to access the API. 


# ================================================================== #
apikey = 'insert_your_api_key'
def construct_barChart_url(sym, start_date, freq, api_key=apikey):
    '''Function to construct barchart api url'''
    
    url = 'http://marketdata.websol.barchart.com/getHistory.csv?' +\
            'key={}&symbol={}&type={}&startDate={}'.format(api_key, sym, freq, start_date)
    return url

Now for the fun part. I create a function that does the following:

  1. initializes an empty dictionary and the minimum required API variables,
  2. iterates through my list of SP500 stocks,
  3. constructs the proper API url,
  4. reads the data returned by the API as a csv file, conveniently making use of the Pandas read_csv function,
  5. adds each price dataframe to the dictionary, and
  6. converts the dictionary into a Pandas Panel and returns the Panel.

def get_minute_data():
    '''Function to Retrieve <= 3 months of minute data for SP500 components'''
    
    # This is the required format for datetimes to access the API
    # You could make a function to translate datetime to this format
    start = '20150831000000'
    #end = d
    freq = 'minutes'    
    prices = {}
    symbol_count = len(syms)
    N = copy(symbol_count)
    try:
        for i, sym in enumerate(syms, start=1):
            api_url = construct_barChart_url(sym, start, freq, api_key=apikey)
            try:
                csvfile = pd.read_csv(api_url, parse_dates=['timestamp'])
                csvfile.set_index('timestamp', inplace=True)
                prices[sym] = csvfile
            except Exception:
                # skip symbols the API cannot return
                continue
            N -= 1
            pct_total_left = (N/symbol_count)
            print('{}..[done] | {} of {} symbols collected | percent remaining: {:>.2%}'.format(\
                                                                sym, i, symbol_count, pct_total_left)) 
    except Exception as e: 
        print(e)
    finally:
        pass
    px = pd.Panel.from_dict(prices)

    return px

Now I import our list of stock symbols, make some minor formatting edits and run the code.


# ================================================================== #

# header=3 to skip unnecessary file metadata included by State Street
spy_components = pd.read_excel(project_dir +\
                             '_SPDR_holdings/holdings-spy.xls', header=3)
syms = spy_components.Identifier.dropna()
# drop the last row and sort the symbols (sort_values replaces the deprecated .order())
syms = syms.drop(syms.index[-1]).sort_values()

pxx = get_minute_data()

This script takes roughly 40 minutes to run: longer if you try to get the full 3 months they provide, less if you need less data.
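
If you just want to sanity-check the pipeline before committing to a full run, one option (my suggestion, not part of the original workflow) is to temporarily slice the symbol list down to a handful of names:


"""quick test run on a handful of symbols"""
syms_full = syms.copy()   # keep the full list so it can be restored
syms = syms_full[:10]     # get_minute_data() reads the global syms
pxx_test = get_minute_data()
print(pxx_test)
syms = syms_full          # restore the full symbol list for the real run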

Now let's test our output to make sure we got what we expected. 


print(pxx)
print(pxx['AAL'].tail())
print(pxx['ZTS'].tail())

It appears the code ran correctly, and the output is what we expected. One thing you may have noticed is that the timestamps are not 'EST'. If you want to convert them, use the following one-liner.


# convert timestamps to EST
pxx.major_axis = pxx.major_axis.tz_localize('utc').tz_convert('US/Eastern')

There is one last consideration that is easy to overlook if you're unfamiliar with some of the technical challenges of 'big data'. When you first run a script like this it is tempting to use the usual storage techniques that pandas provides, such as '.to_csv()' or '.to_excel()'. However, consider the volume of data we just collected: 502 (items) x 5866 (major_axis) x 7 (minor_axis) = 20,613,124 data points.
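
If you want to verify that count against your own run (your dimensions may differ depending on the holdings file and date range), the panel's shape gives it directly:


# total number of data points collected in this run
print(pxx.shape)
print(np.prod(pxx.shape))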

Look at it again and consider that this simple script collected over 20 million data points! In the past I ran into trouble with Python/Excel I/O at only 3.5 million data points, meaning importing and exporting the data took minutes. That's a serious hangup for any type of exploratory research, especially if you plan on sharing and/or collaborating using this dataset.

Pandas' HDF5 file storage format to the rescue! Feel free to investigate the power, speed, and scalability of HDF5 via the pandas docs or any of the numerous quality blogs accessible through a Google search. Needless to say, I/O went from several minutes each way to seconds. Here is the code I used to store the panel.


store = pd.HDFStore(price_path + 'Minute_Symbol_Data.h5')
try:
    store['minute_prices'] = pxx
except Exception as e:
    print(e)
finally:
    store.close()  # ensure the file handle is released even if the write fails
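
Reading the panel back later is just as fast; here's a minimal sketch assuming the same file and key as above:


"""reload the stored panel from the HDF5 file"""
store = pd.HDFStore(price_path + 'Minute_Symbol_Data.h5')
pxx = store['minute_prices']
store.close()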

Here's a sample plot with the intraday data. 

The entire code is posted below using Gist.