Aggregating Free Options Data with Python

Post Outline

  • Motivation
  • Code Requirements
  • Creating our Scraper Class
  • Aggregating the Data
  • Github Gist Code
  • Disclaimer

Motivation

This year I implemented a simulated trading strategy based on the research paper titled "What Does Individual Option Volatility Smirk Tell Us About Future Equity Returns?" by Yuhang Xing, Xiaoyan Zhang and Rui Zhao. The authors show that their SKEW factor has predictive power for equity returns for up to 6 months. 

Because historical options data is difficult to find and/or prohibitively expensive, I tracked the results of the simulated strategy in near real time using a combination of the Yahoo Finance options API made available via the pandas package and the Quantopian platform for realistic backtesting. Unfortunately, the Yahoo Finance API has changed and it appears that the options data is no longer offered. Therefore my last strategy update took place on July 12, 2016, which can be seen here.

The strategy has thus far exceeded all expectations. I tracked two versions of the strategy: one maintained a 4-week holding period, the other a weekly holding period. The 4-week version showed a cumulative return of ~9%, with 25 of 28 weeks showing positive gains! The weekly version (with a target leverage of 2) fared slightly better, with total returns of ~16%, double-digit alpha of ~24%, near-zero beta (~0.08), single-digit volatility of ~8%, and a max drawdown of ~2.2%!

The strategy showed immense promise, but with only 28 weeks of results the sample size is unfortunately too small. As a Python programmer, when one API goes down it's time to find another. With that said, I created my own scraper using the excellent free resources at Barchart.com.

Code Requirements

First, you must sign up for a free account with Barchart.com and note your username and password. For reference, my current system runs Windows 8.1 64-bit with a WinPython distribution using Python 3.5. The code requires the following packages:

  • pandas
  • numpy
  • requests
  • bs4
  • re
  • logging
  • more_itertools
  • tqdm 
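
Note that `re` and `logging` are part of the Python standard library, so only the remaining packages need to be installed with pip. The scraper also parses pages with lxml, so make sure that is available as well. One quick way to sanity-check your environment is to confirm that these imports all succeed:


# quick environment check -- all of these imports should succeed
import re
import logging

import numpy as np
import pandas as pd
import requests
import bs4
import lxml
from more_itertools import unique_everseen
from tqdm import tqdm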

Creating our Scraper Class

Assuming you have the requisite packages and have signed up for a Barchart.com user account, we can now code our scraper class. At a high level, there are a few things to note when designing it.

We can get basic options data without an account. It looks like this:

That's decent, but we need the volatility and greeks data, which requires a free account.

This data is much more interesting. Note that there are some unique columns from each table that we would like to extract and combine later on. 

Let's get into the code. First we need to initialize the `barchart_options` class.


# imports used throughout the scraper class
import re

import requests
import pandas as pd
from bs4 import BeautifulSoup as bs
from more_itertools import unique_everseen


class barchart_options():
    # --------------------------------------------
    def __init__(self, symbol):
        self.symbol = symbol

        self.payload = {'email':'YOUR_USER_EMAIL', 'password':'YOUR_PASSWORD'}
        self.base_url = r"http://www.barchart.com/options/stocks/"
        self.login_url = r'http://www.barchart.com/login.php'
        self.basic_URL = self.base_url + symbol
        self.base_SRC = self.__get_base_src()
        self.greeks_url_suffix = '&view=vol&sym={SYMBOL}&date={DATE}'
        self.reidx_cols = ['Symbol', 'Expiry', 'Type', 'Strike', 'Ask', 'Bid', 'Last', 'Change',
                           'Underlying_Price', 'ImpliedVolatility','TheoreticalValue', 'Delta',
                           'Gamma', 'Rho', 'Theta', 'Vega', 'Open Int', 'Volume']

Let's go through the class variables:

  • self.symbol: the equity symbol we want data for
  • self.payload: this is your barchart login information
  • self.base_url: this is the free Barchart basic options url
  • self.basic_URL: this is the url that combines our symbol and the base_url to get the basic options data
  • self.login_url: this is the login page of Barchart.com
  • self.base_SRC: this holds the parsed HTML source of the basic options page, returned by the private __get_base_src() method using the basic_URL. We need this page to extract all the available options expiration dates
  • self.greeks_url_suffix: this is the url suffix that is appended to the base_url to get the volatility and greeks data (see the short sketch after this list).
  • self.reidx_cols: this is a list of the combined column headers in the order I preferred. This isn't a requirement.
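
Since a couple of these are just string templates, here is a short standalone sketch, using 'SPY' and an example expiry date, of how the basic and greeks urls end up looking:


# standalone illustration of the url construction -- 'SPY' and the expiry
# string below are example values only, mirroring how the class builds its urls
base_url = r"http://www.barchart.com/options/stocks/"
greeks_url_suffix = '&view=vol&sym={SYMBOL}&date={DATE}'

basic_url = base_url + 'SPY'
greeks_url = base_url + greeks_url_suffix.format(SYMBOL='SPY', DATE='Sep-02-2016')

print(basic_url)   # http://www.barchart.com/options/stocks/SPY
print(greeks_url)  # http://www.barchart.com/options/stocks/&view=vol&sym=SPY&date=Sep-02-2016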

Here is the function to retrieve the basic options data source page. 


    # --------------------------------------------
    # get basic options data source
    # --------------------------------------------
    def __get_base_src(self):
        with requests.session() as S:
            res = S.get(self.basic_URL)
        return bs(res.text, 'lxml')

Now we need a helper function to search the HTML and extract the expiration dates. The expiration dates are required to create the volatility/greeks url we need to get the good data.


    # --------------------------------------------
    # extract expiry dates
    # --------------------------------------------
    def _extract_expiry_dates(self):
        raw_date_links = []
        for link in self.base_SRC.find_all('a', href=re.compile('date')):
            raw_date_links.append(list(link.attrs.values())[0])

        reg_exp = r'[A-Za-z]{3}-\d{2}-\d{4}'

        dates = []
        for raw_date in raw_date_links:
            ms = re.search(reg_exp, raw_date)
            dates.append(ms.group())
        return list(unique_everseen(dates))

Let's break down what's happening in this function. First we use a combination of the `bs4` and `re` packages to search for all the 'a' tags, which represent links in the page. But we don't want all of them, only those that correspond to expiry dates. After examining the HTML source you will find that each link with an expiration date contains the word 'date' in its 'href', so we pass re.compile('date') to find_all() to match only those links. Note the `raw_date_links.append()` call. To retrieve the text we need as a string we have to do some manipulation of the `link.attrs.values()` result: it returns a view of the dictionary's values, NOT the actual string we want. To convert it to a usable data structure we first convert it to a list using the 'list()' function. The list has only one value, which is the text we need, so we use list indexing to get the first and only item.
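
As a quick illustration of that conversion, here is a made-up anchor tag (the real hrefs on Barchart look different) run through the same pattern:


# illustrative only -- a made-up <a> tag showing the attrs -> list conversion
from bs4 import BeautifulSoup

tag = BeautifulSoup('<a href="/options/stocks/SPY?date=Sep-02-2016">Sep 2</a>', 'lxml').a
print(tag.attrs.values())           # dict_values(['/options/stocks/SPY?date=Sep-02-2016'])
print(list(tag.attrs.values())[0])  # '/options/stocks/SPY?date=Sep-02-2016'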

Next we use the `re` library to search each item in `raw_date_links` and extract only the date from an otherwise verbose string. The regular expression matches any sequence of text of the form 'Sep-02-2016'. Once we have looped through the list extracting the dates, we return the list of expiry dates. Also note that we apply the `unique_everseen()` function to the list of dates. This function loops through an iterable, yielding each item only the first time it is seen, while preserving the original order. We use it for two reasons. First, there are duplicate dates in the list. Second, a function like 'set()' or 'np.unique()' would return unique items but would not preserve the order. `unique_everseen()` avoids both issues.
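
A minimal comparison, with made-up dates, shows why `unique_everseen()` is the better fit here:


# order-preserving de-duplication vs. set() -- the dates are made-up examples
from more_itertools import unique_everseen

dates = ['Sep-02-2016', 'Sep-02-2016', 'Sep-09-2016', 'Aug-26-2016']
print(list(unique_everseen(dates)))  # ['Sep-02-2016', 'Sep-09-2016', 'Aug-26-2016'] -- order kept
print(set(dates))                    # unique values, but in arbitrary order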

Next we need the helper functions for creating the volatility/greeks url, logging in to access the page, and extracting the correct headers and call/put tables.


    # --------------------------------------------
    # get option greeks source data
    # --------------------------------------------
    def __login_greeks(self, symbol, date):
        with requests.session() as S:
            _ = S.post(self.login_url, data=self.payload)
            res = S.get(self.__create_greeks_url(symbol, date))
        return bs(res.text, 'lxml')

    def __create_greeks_url(self, symbol, date):
        url = self.base_url + self.greeks_url_suffix.format(SYMBOL=symbol, DATE=date)
        return url

    def _get_greeks_src(self, symbol, date):
        src = self.__login_greeks(symbol, date)
        return src

    def __clean_headers(self, headers):
        hdr = headers.replace('\t', '').split('\n')
        hdrs = [h for h in hdr if len(h) > 0]
        return hdrs

    def __get_greek_headers(self, greek_src):
        hdrs = [head.get_text() for head in greek_src.find_all('tr', class_='datatable_header')][0]
        return self.__clean_headers(hdrs)

    def __get_greeks_tables(self, greek_src):
        tables = []
        for tbl in greek_src.find_all('table', class_='datatable'):
            tables.append(tbl)
        return tables

Next we are going to create our pandas dataframes from the HTML tables. To reiterate, there are columns we want from both the basic options data page and the volatility/greeks data page. To do this we will convert both pages' tables into dataframes and then combine them, so we obtain one dataframe with only the unique columns we want from both. We also add the 'Underlying_Price' and 'Expiry' columns to the combined dataframe.
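
If you have not used `combine_first()` before, here is a toy sketch of the merge behavior we rely on; the columns and values are made up and much smaller than the real option tables:


# toy example of combine_first() -- the calling frame's values take priority and
# any missing columns/values are filled in from the other frame (made-up data)
import pandas as pd

greeks = pd.DataFrame({'Strike': [100, 105], 'Delta': [0.55, 0.48]})
basic  = pd.DataFrame({'Strike': [100, 105], 'Volume': [1500, 900]})

merged = greeks.combine_first(basic)
print(merged)                                                 # columns: Delta, Strike, Volume
print(merged.reindex(columns=['Strike', 'Delta', 'Volume']))  # reorder, like self.reidx_cols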

Here are the functions to first create the basic options dataframes.


    # --------------------------------------------
    # create basic options dfs
    # --------------------------------------------
    def __get_underlying_last_price(self, greek_src):
        last = [float(d.text) for d in greek_src.find_all('span', class_='last')]
        return last

    def __create_base_call_df(self, expiry):
        tables = []
        for tbl in self.base_SRC.find_all('table', class_='datatable'):
            tables.append(tbl)
        call_rows = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
        cols = ['Strike', 'Symbol', 'Type', 'Bid', 'Ask', 'Last', 'Change', 'Volume', 'Open Int']
        call_df = pd.DataFrame(call_rows[1:], columns=cols).apply(pd.to_numeric, errors='ignore')
        return call_df

    def __create_base_put_df(self, expiry):
        tables = []
        for tbl in self.base_SRC.find_all('table', class_='datatable'):
            tables.append(tbl)
        put_rows = [[td.text for td in tr.find_all('td')] for tr in tables[1].find_all('tr')]
        cols = ['Strike', 'Symbol', 'Type', 'Bid', 'Ask', 'Last', 'Change', 'Volume', 'Open Int']
        put_df = pd.DataFrame(put_rows[1:], columns=cols).apply(pd.to_numeric, errors='ignore')
        return put_df

Next I show the functions for creating the master call and put dataframes. The functions below do the following:

  • takes the HTML source for the volatility/greeks page
  • extracts the tables and creates the rows
  • creates and adds the column headers
  • extracts the most recent underlying stock price and creates a new column
  • creates the expiry column
  • uses the pandas function 'combine_first()' to merge both the basic and advanced dataframes
  • uses the pandas function 'reindex()' and the self.reidx_cols list to reorder the columns.

    # --------------------------------------------
    # create, merge basic and greek options dfs
    # --------------------------------------------
    def _create_calls_df(self, greek_src, expiry):
        tables = self.__get_greeks_tables(greek_src)
        rows = [[td.text for td in tr.find_all('td')] for tr in tables[0].find_all('tr')]
        calls = pd.DataFrame(rows[1:], columns=self.__get_greek_headers(greek_src)
                            ).apply(pd.to_numeric, errors='ignore')
        calls['Underlying_Price'] = self.__get_underlying_last_price(greek_src) * len(calls.index)
        calls['Expiry'] = [pd.to_datetime(expiry)] * len(calls.index)

        base_call_df = self.__create_base_call_df(expiry)
        mrg = calls.combine_first(base_call_df)
        CALLS = mrg.reindex(columns=self.reidx_cols)
        return CALLS

    def _create_puts_df(self, greek_src, expiry):
        tables = self.__get_greeks_tables(greek_src)
        rows = [[td.text for td in tr.find_all('td')] for tr in tables[1].find_all('tr')]
        puts = pd.DataFrame(rows[1:], columns=self.__get_greek_headers(greek_src)
                           ).apply(pd.to_numeric, errors='ignore')
        puts['Underlying_Price'] = self.__get_underlying_last_price(greek_src) * len(puts.index)
        puts['Expiry'] = [pd.to_datetime(expiry)] * len(puts.index)

        base_put_df = self.__create_base_put_df(expiry)
        mrg = puts.combine_first(base_put_df)
        PUTS = mrg.reindex(columns=self.reidx_cols)
        return PUTS
    # --------------------------------------------

Aggregating the Data

Now that we have created the scraper class we can write the script to aggregate options data for multiple symbols. 

To start we need to import the necessary libraries. 


import time
import pandas as pd
import numpy as np
from copy import copy
from tqdm import tqdm
import logging
from _barchart_options_scraper_ import barchart_options # <- this is our scraper class

p = print

Now we set up our logging infrastructure to track and debug our code. If you have never used Python logging and instead use lots of print() statements, it's time for a transition.


# ------------------------------------------ \\\\
today = pd.Timestamp.today().date().strftime('%m-%d-%y')

datapath = "..\YOUR_PROJECT_DIR\Barchart_Options_Data\\"
logpath = datapath + r'Barchart_OptionsData_LogFiles\\'

logging.basicConfig(filename=logpath + 'BRC_Options_Log_{}.log'.format(today),
                    format='%(asctime)s %(message)s', datefmt='%m/%d/%Y %I:%M:%S %p',
                    level=logging.INFO)

Next we define our symbols, in this case as a dictionary. I've used this dictionary before; it uses ETF categories as its keys and lists of ETF symbols as its values.


# ------------------------------------------ \\\\
ETFS =  {
       'Large Cap'             :['SPY','IVV','VOO','IWB'],
       'Mid Cap'               :['MDY','IJH','VO','IWR'],
       'Small Cap'             :['IWM','IJR','VB'],
       'Global Equity'         :['VEU','ACWI','VXUS','DGT'],
       'AsiaPac Equity'        :['EWT','EWY','EWA','EWS','AAXJ','FXI','EWH','EWM','EPI','INDA','RSX'],
       'Europe Equity'         :['FEZ','EZU','VGK','HEDJ','EWU','EWI','EWP','EWQ','EWL','EWD'],
       'Emerging | Frontier'   :['EWZ','EWW','ECH','GAF','FM','EEM','VWO'],
       'Real Estate'           :['RWO','RWX','RWR','IYR','VNQ'],
       'Consumer Discretionary':['XLY','XRT','FXD','VCR','RTH','IYC'],
       'Consumer Staples'      :['XLP','FXG','VDC','ECON','IYK'],
       'Energy'                :['XLE','IPW','XOP','VDE','IYE','IXC','OIH'],
       'Financials'            :['XLF','KBE','KIE','IYG','KRE'],
       'Healthcare'            :['XLV','XBI','IBB'],
       'Industrial'            :['XLI','IYT','VIS','IYJ'],
       'Materials'             :['XLB','XHB','XME','IGE','MOO','LIT','GUNR'],
       'Technology'            :['XLK','SMH','HACK','FDN'],
       'Telecom'               :['IYZ','IXP','VOX'],
       'Utilities'             :['IDU','XLU','VPU'],
       'Oil | Gas'             :['UNG','BNO','OIL'],
       'Precious Metals'       :['GLD','SLV','IAU'],
       'Bonds'                 :['BND','AGG','JNK','LQD'],
       'T-Bond'                :['TLT','IEF','IEI','SHY','BIL','MINT'],
       'Precious Metals Miners':['SIL','GDX','GDXJ','PLTM']
        }

Now we create our HDF5 store using pandas, initialize an empty list to hold any symbols that return an error, and count the number of symbols for tracking purposes.


# ------------------------------------------ \\\\
etfstore = pd.HDFStore(datapath + 'ETF_options_data_{}.h5'.format(today))
etf_missing_symbols = []

etf_val_count = 0
for i in ETFS.values():
    etf_val_count += len(i)
etf_sym_count = etf_val_count
N = copy(etf_sym_count)

Now we create the master loop which will aggregate our data. I will post the code and then describe what is happening at each layer. 


# ------------------------------------------ \\\\
for category, symbols in tqdm(ETFS.items()):
    logging.info("---------- {} ---------".format(category))
    # \\\
    for i, symbol in tqdm(enumerate(symbols, start=1)):
        N -= 1
        # \\\
        if not pd.isnull(symbol):
            try:
                brc = barchart_options(symbol)
                expirys = brc._extract_expiry_dates()
                appended_data = []
                # \\\
                for expiry in tqdm(expirys):
                    grk = brc._get_greeks_src(symbol, expiry)
                    calls = brc._create_calls_df(grk, expiry)
                    puts = brc._create_puts_df(grk, expiry)
                    mrg = pd.concat([calls, puts])
                    appended_data.append(mrg)
                # \\\
                data = pd.concat(appended_data)
                logging.info(data.describe())
                etfstore.put(symbol, data, format='table')
            except Exception as e:
                logging.error("ERROR: {}".format(e), exc_info=True)
                etf_missing_symbols.append(symbol)
                continue
            pct_total_left = (N / etf_sym_count)  # fraction of all symbols still remaining
            logging.info('{}..[done] | {} of {} ETF symbols collected | {:>.2%}'.
                         format(symbol, i, len(symbols), pct_total_left))
            p('{}..[done] | {} of {} ETF symbols collected | {:>.2%}'.
                         format(symbol, i, len(symbols), pct_total_left))
            time.sleep(np.random.choice(np.random.uniform(0.5,1.5, [3]), p=[.7, .2, .1]))

First we loop through the category level, then for each category we loop through its list of symbols. At the symbol level the code first checks that the symbol is not null, then wraps the work in a try/except clause so the loop keeps running if we encounter any timeouts or missing symbols.

Next the code retrieves the basic options data HTML source and extracts the list of option expiration dates. The code then loops through each expiration date creating the master call and put dataframes. Then we merge the call and put dataframes together using the `pd.concat()` function. We then append the merged dataframe to the empty list called `appended_data`.  Once we have finished looping through the list of expiration dates, creating and appending our merged dataframes, we merge ALL of that symbol's dataframes into ONE large dataframe containing all the data from each expiry. We then `put()` that dataframe into our HDFStore called 'etfstore'. 

If there are any exceptions or errors, we log them, append the offending symbol to our `etf_missing_symbols` list for tracking, and continue the loop. The code after the try/except block is there to track progress while the script runs. Note the `time.sleep()` call, which is probably overly complicated but necessary on some level. Essentially all it does is force the code to pause for a random period of time between 0.5 and 1.5 seconds so we don't overwhelm the servers, although I'm sure Barchart.com can handle the load.
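
If you prefer less ceremony, a plain uniform draw accomplishes essentially the same pause:


# simpler alternative: pause for a uniformly random 0.5 to 1.5 seconds
time.sleep(np.random.uniform(0.5, 1.5))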

Finally, after looping through and aggregating each symbol's data we close our HDFStore and log any errors, the symbol names, and the error rate. 


# ------------------------------------------ \\\\
etfstore.close()
N_etf_errors = len(etf_missing_symbols)
etf_error_rate = N_etf_errors / etf_sym_count
logging.info('etf missing symbols:\n{}'.format(etf_missing_symbols))
logging.info('etf error rate: {} errors / {} symbols = {:3.2%}'
             .format(N_etf_errors, etf_sym_count, etf_error_rate))

When we run the script we should see something like this: 

The red output is actually created by the tqdm library. Here's the description from the Github page: "Instantly make your loops show a smart progress meter - just wrap any iterable with `tqdm(iterable)`, and you're done!" 
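
It really is that simple; a minimal example:


# minimal tqdm usage -- wrap any iterable to get a live progress bar
from tqdm import tqdm
import time

for _ in tqdm(range(100)):
    time.sleep(0.01)  # stand-in for real work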

Your log file will likely look similar to this: 

As you can see it's a plain text file tracking our connection to Barchart.com, and after each symbol is aggregated it logs the output of pandas' `data.describe()` for our supersized symbol dataframe before we add it to our HDF5 store.

Let's confirm our data output is as it should be:


with pd.HDFStore(datapath + 'ETF_options_data_{}.h5'.format(today), mode='r') as hdf:
    p(hdf.keys())
    hdf['/SPY'].info()  # info() prints its summary directly (it returns None)
    p(hdf['/SPY'].iloc[10:20])

The HDF store keys:

A sample dataframe (SPY) from the HDF store:

The entire script takes roughly 1 hour and 20 minutes on this collection of symbols.

Github Gist Code

Barchart options scraper

aggregate options data

Disclaimer

Please use this code responsibly. Blackarbs has no intent to violate any terms of use or legal protections.