Using ETF Internal Analytics to Identify Mean Reversion Opportunities (python)

Using ETF Internal Analytics to Identify Mean Reversion Opportunities (python)

Since I started producing the following graphic for the ETF Internal Analytics product, I found the weekly return bin information compelling. I became curious about whether there was an opportunity to be exploited in the distribution patterns. I distilled all the questions I had into two: 

  1. Does the percentage of ETF component stocks at various return levels provide actionable information?
  2. Can a long-short market-neutral strategy be constructed by analyzing the relative return dispersion of each ETF's stock components?

To answer these questions I used a combination of tools/data sources including State Street's SPDR Holdings data, the Yahoo Finance API, and Python. 

Read More


Today I present a simple function to extract Earnings Estimates from Yahoo Finance. If you have any questions feel free to leave it in the comments. 

This code uses Python 3 on Windows 8.1 but could be easily adapted for Python 2 by changing the 'urllib' import. 

First we import the necessary packages into our programming environment. 

import pandas as pd
import urllib as u
from bs4 import BeautifulSoup as bs
import warnings

I also suppress warnings for the deprecation warning for the Pandas "Dataframe.convert_objects()"method within the scraper function that follows. 

This function takes the Yahoo Finance URL with our symbol of interest and uses BeautifulSoup to parse the resulting HTML. I also added some formatting code to clean up the readability of the headers. 

def _get_eps_estimates(url):
        html_source = u.request.urlopen(url).read()
        soup = bs(html_source, 'lxml')
        # table
        table = soup.find_all('table', attrs={'class': 'yfnc_tableout1'})
        header = [th.text for th in table[0].find_all(class_='yfnc_tablehead1')]
        header_title = header[0]
        header_cols = header[1:5]
        index_row_labels = header[-5:]
        body = [[td.text for td in'td')] for row in table[0].find_all('tr')]
        body = body[1:]
        df = pd.DataFrame.from_records(body)
        df = df.ix[:, 1:]
        df.index = index_row_labels
        header_cols = pd.Series(header_cols)
        header_cols = header_cols.str.replace(
            'Year', 'Year ').str.replace('Qtr.', 'Qtr. ')
        df.columns = header_cols
        eps_est = df.convert_objects(convert_numeric=True)
    except Exception as e:
    return eps_est

Now let's test the function using the proper URL. I'm using the symbol 'SWKS' in this example.

symbol = 'SWKS'

base_url = r'{}+Analyst+Estimates'.format(symbol)
eps_est = _get_eps_estimates(base_url)

Your output should appear like the following:

How to get Free Intraday Stock Data with Python and BarCharts OnDemand API

To this day the most popular article I have ever written on this blog was "How to get Free Intraday Stock Data with Netfonds". Unfortunately the Netfonds API has really declined in terms of usability, with too many popular stocks missing, and irregular trade and price quotes. Simply put, as the API went down, so did the code.

However, all hope is not lost. The wonderful people at have created a well documented, easily accessible API for intraday stock data and even near real-time quote access. The only caveat is that you must request access to get a personal API key. Again this is FREE, and the process is extremely simple and straightforward. I think I received my API key within the same day, max 24 hours. 

Step 1: Go to and request an API key. 

Step 2: Use or modify my code to get FREE intraday stock data. 

Something to note, in this example I use the SP500 components as my list of stock symbols. I covered how to get fresh SPY holdings data directly from the provider in a previous post titled "GET FREE FINANCIAL DATA W/ PYTHON (STATE STREET ETF HOLDINGS - SPY)".  Now onto the code...

First I import the necessary modules.

# -*- coding: utf-8 -*-
import time
t0 = time.clock()

import pandas as pd
from pandas.tseries.offsets import BDay
import numpy as np
import datetime as dt
from copy import copy
import warnings

Next I set up what I refer to as a 'datetime management' section of my code. I do this for ALL my time series analysis as a convenient way to standardize my code across projects. Sometimes I only use one of the variables, as I do in this case, but it's so convenient when doing any sort of exploratory analysis with time series. I also do the same for my filepaths.

# ================================================================== #
# datetime management

d =
# ---------- Days ---------- 
l10 = d - 10 * BDay()
l21 = d - 21 * BDay()
l63 = d - 63 * BDay()
l252 = d - 252 * BDay()
# ---------- Years ---------- 
l252_x2 = d - 252 * 2 * BDay() 
l252_x3 = d - 252 * 3 * BDay() 
l252_x5 = d - 252 * 5 * BDay()
l252_x7 = d - 252 * 7 * BDay() 
l252_x10 = d - 252 * 10 * BDay() 
l252_x20 = d - 252 * 20 * BDay() 
l252_x25 = d - 252 * 25 * BDay()

# ================================================================== #
# filepath management

project_dir = r'D:\\' 
price_path = project_dir + r'Stock_Price_Data\\'

Next I set up a convenience function for creating the BarChart url to access the API. 

# ================================================================== #
apikey = 'insert_your_api_key'
def construct_barChart_url(sym, start_date, freq, api_key=apikey):
    '''Function to construct barchart api url'''
    url = '' +\
            'key={}&symbol={}&type={}&startDate={}'.format(api_key, sym, freq, start_date)
    return url

Now for the fun part. I create a function that does the following:

  1. initializes an empty dictionary and the minimum required API variables,
  2. iterates through my list of SP500 stocks,
  3. constructs the proper API url,
  4. reads the data returned by the db as a csv file conveniently making use of Pandas read_csv function.
  5. adds the price dataframe to the dictionary dynamically
  6. converts the python dictionary into a Pandas Panel and returns the Panel

def get_minute_data():
    '''Function to Retrieve <= 3 months of minute data for SP500 components'''
    # This is the required format for datetimes to access the API
    # You could make a function to translate datetime to this format
    start = '20150831000000'
    #end = d
    freq = 'minutes'    
    prices = {}
    symbol_count = len(syms)
    N = copy(symbol_count)
        for i, sym in enumerate(syms, start=1):
            api_url = construct_barChart_url(sym, start, freq, api_key=apikey)
                csvfile = pd.read_csv(api_url, parse_dates=['timestamp'])
                csvfile.set_index('timestamp', inplace=True)
                prices[sym] = csvfile
            N -= 1
            pct_total_left = (N/symbol_count)
            print('{}..[done] | {} of {} symbols collected | percent remaining: {:>.2%}'.format(\
                                                                sym, i, symbol_count, pct_total_left)) 
    except Exception as e: 
    px = pd.Panel.from_dict(prices)

    return px

Now I import our list of stock symbols, make some minor formatting edits and run the code.

# ================================================================== #

# header=3 to skip unnecesary file metadata included by State Street    
spy_components = pd.read_excel(project_dir +\
                             '_SPDR_holdings/holdings-spy.xls', header=3)
syms = spy_components.Identifier.dropna()
syms = syms.drop(syms.index[-1]).order()

pxx = get_minute_data()

This script takes roughly 40 minutes to run, longer if you try to get the full 3 months they provide, less if you need less data. 

Now let's test our output to make sure we got what we expected. 


The code ran correctly it appears, and the output is what we expected. One thing you may have noticed is that time stamps are not 'EST'. If you want to convert them use the following one liner. 

# convert timestamps to EST
pxx.major_axis = pxx.major_axis.tz_localize('utc').tz_convert('US/Eastern')

There is one last consideration that is easy to overlook if you're unfamiliar with some of the technical challenges of 'big data'. When you first run a script like this it is tempting to use the usual storage techniques that pandas provides such as 'pd.to_csv()' or 'pd.to_excel()'. However, consider the volume of data we just collected: 502 (items) x 5866 (major_axis) x 7 (minor_axis) = 20,613,124. 

Look at it again and consider this simple code collected over 20 million data points! I ran into trouble with Python/Excel I/O with only 3.5 million data points in the past. Meaning importing and exporting the data took minutes. That's a serious hangup for any type of exploratory research, especially if you plan on sharing and/or collaborating using this dataset. 

Pandas HDF5 file storage format to the rescue! Feel free to investigate the power, speed and scalability of HDF5 via the Pandas docs or any of the numerous quality blogs out there accessible by a google search. Needless to say, I/O was reduced from several minutes both ways to seconds. Here is the code I used to store the panel.  

    store = pd.HDFStore(price_path + 'Minute_Symbol_Data.h5')
    store['minute_prices'] = pxx
except Exception as e:

Here's a sample plot with the intraday data. 

The entire code is posted below using Gist.

Get Free Financial Data w/ Python (State street ETF Holdings - SPY)

One issue I frequently encounter during my research is the need to compare an individual stock, or collection of stocks vs its ETF benchmark. To do this I need accurate ETF holdings data. 

Generally this information is located on the ETF provider's website. However,  this information is often inconvenient to access. Most websites including the ETF provider will do something like the following, where they only show the top 10 holdings, when what we really need is accessible only by clicking the highlighted download link.

SPY ETF Holdings Page

This isn't a major issue until you need to access multiple ETF holdings pages. State Street Global Advisors is the ETF provider and this is the structure they use most frequently, therefore I figured it would be a major time saver to write a script to automate this important yet redundant task. 

This code requires the following third-party modules to execute: 

  • Selenium
  • Google Chromedriver (allows Python to open Chrome browser)

Before we get to the code, you must have Chromedriver downloaded and unzipped. Make sure to grab the filepath as we will need it. 

# ----- import modules -----
from selenium import webdriver
from import By
from import WebDriverWait
from import expected_conditions as EC
import os
import time
from pprint import pprint as pp

Next you will need to grab the correct xpath's from the webpages of interest. I use xpath in this situation because the python script was able to find the correct clickable links every time without issue. 

When you first open the State Street Website you will need to navigate to the 'Holdings' tab and then to the .xls

# ----- webpage xpath -----
holdings_xpath = r"//*[@id='tabs']/a[3]"
xls_xpath = r"//*[@id='FUND_TOP_HOLDINGS']/a"

Next you need to construct a reusable generalized url string which can be used for any of the State Street ETF's. In this example we will be using SPY only.  Additionally I recommend creating a generalized filepath string for the actual downloaded file. This is so we can confirm that the download has completed correctly before exiting the browser in a later step. 

# ----- generalized URL string -----
symbol = 'SPY'
url = r"{}".format(symbol)
The default naming convention for the holdings file is 'holdings-spy.xls' where the ETF label is lowercase
file_string = my_etf_data_dir + 'holdings-{}.xls'.format(symbol.lower())

Now it's time to setup our chromedriver preferences via the 'ChromeOptions' method. You must define a default directory for this to work properly. During this step I also define my chromedriver filepath for convenience. 

# ----- Chromedriver options/preferences -----
chromeOptions = webdriver.ChromeOptions()
prefs = {'download.default_directory':insert_my_default_dir}
chromeOptions.add_experimental_option('prefs', prefs)
chromedriver_path = insert_my_chromedriver_filepath

Now for the 'money' code. In this step we will instantiate the webdriver (fancy word for automated browser), tell it to navigate to our previously defined URL, tell it to wait until the 'Holdings' tab is visible, click the tab link, then wait again until the 'Download All Holdings .xls' link is visible, click it, confirm the file has downloaded and finally exit the browser. 

I often use prettyprint functions to tell me what's happening with the code, feel free to delete them if you like they are not required.
pp('{} running holdings download..[start]'.format(symbol)) 
driver = webdriver.Chrome(executable_path=chromedriver_path, chrome_options=chromeOptions)
driver.set_page_load_timeout(90) # avoid hanging browser
    holdings_element = WebDriverWait(driver, 30) \
		.until(EC.presence_of_element_located((By.XPATH, holdings_xpath)))
    csv_element = WebDriverWait(driver, 30) \
		.until(EC.presence_of_element_located((By.XPATH, csv_xpath))) # start download
    # the code below checks the file exists before exiting the browser
    for i in range(1,10,2):
        if os.path.isfile(file_string)==True:
except Exception as e:
    pp('{} running holdings download..[complete]'.format(symbol))            

That's it. Now you should have the SPY holdings .xls file on your local hard drive.  If you want to get fancy you can throw this code into a function or class structure like I have. This allows you to run the code in a loop if, for example, you have 10 different State Street ETFs whose holdings data you need.

Get Free Financial Data w/ Python (Fundamental Ratios-From

A simple script to scrape fundamental ratios from This basic code can be tailored to suit your application.

"""IPython 3.1, Python 3.4, Windows 8.1"""

import pandas as pd
import urllib as u
from bs4 import BeautifulSoup as bs

First visit and get the base url for the quote page.

Then write a simple function to retrieve the desired ratio. 
In this example I'm grabbing Price-to-Book (mrq) ratio

def get_price2book( symbol ):
    	url = r'{}'\
        html = u.request.urlopen(url).read()
        soup = bs(html, 'lxml')
        # Change the text below to get a diff metric
        pb =  soup.find(text = r'P/B')
        pb_ = pb.find_next(class_='snapshot-td2').text
        print( '{} price to book = {}'.format(symbol, pb_) )
        return pb_
    except Exception as e:
Construct a pandas series whose index is the list/array
of stock symbols of interest.

Run a loop assigning the function output to the series
stock_list = ['XOM','AMZN','AAPL','SWKS']
p2b_series = pd.Series( index=stock_list )

for sym in stock_list:
	p2b_series[sym] = get_price2book(sym)

The function should produce the following:

XOM price to book = 1.89
AMZN price to book = 20.74
AAPL price to book = 5.27
SWKS price to book = 5.52

Very simple adaptable code, allowing you to spend more time analyzing the data and less time aggregating it.