How to Scrape and Parse 600 ETF Options in 10 mins with Python and Asyncio

Post Outline

  • Intro
  • Disclaimers
  • The Secret to Scraping AJAX Sites
  • The async_option_scraper.py Module
    • first_async_scraper class
    • expirys class
    • xp_async_scraper class
    • last_price_scraper class
  • The option_parser.py Module
  • The Implementation Script
  • References

Intro

This is Part 1 of a new series I'm doing in semi-real-time to build a functional options data dashboard using Python. There are many motivations for attempting this, and several challenges to implementing a tool like this from scratch.

  • Where to get the data? Is it affordable? Easily accessible? API?
  • How to parse the results?
  • How to aggregate and organize the data for analysis?
  • How to store the data? TXT, CSV, SQL database, HDF5??
  • How often should it run?
  • How to display the data? What dynamic graphic library to use? D3.js, MPL3d, Plotly, Bokeh, etc.?

These are some of the problems that need to be solved in order to create the tool.

In this post I show a working solution for where to get the data, how to scrape it, how to parse it, and how to store it for fast read/write access. We will scrape Barchart.com's basic option quotes using asyncio, which is included in the Python 3.6 standard library, and aiohttp, a third-party async HTTP client. We will parse the results using Pandas and Numpy and store the data in the HDF5 file format.

Disclaimers

This is primarily an academic exercise. I have no intent to harm or cause others to harm Barchart.com or its vendors. My belief is that, by facilitating knowledge sharing, we will increase the number of educated participants in the options markets, thereby increasing the total addressable market for businesses like Barchart and its vendors. By designing tools like this, we improve our own understanding of the use cases and applications (option valuation and trading) and can provide better feedback to those involved in the product development process.

The Secret to Scraping AJAX Sites

First let's create a mental model of what AJAX really is. 

AJAX (Asynchronous JavaScript and XML) is a set of web development techniques used to improve the efficiency and user experience of interacting with a website. For example, you go to a website with cool data tables on it. You want to change one of the filters on the data, so you select the option you want and click. What happens from there?

In older or more simply designed websites, your request is sent to the server, and updating the data table with your selected filters requires the server's response to reload the entire page. This is inefficient for many reasons, not least because the element that needs updating is often only a fraction of the entire webpage.

AJAX allows websites to send requests to the server and update page elements on an element-by-element basis, negating the need to reload the entire page every time you interact with it.

This improvement in efficiency comes at the added cost of complexity, both for web designers and developers and for web scrapers. Generally speaking, the URL you use to visit an AJAX page is not the URL that is actually sent to the server to load the data you view.

To build this understanding, let's look at a sample option quote page using the following link <https://www.barchart.com/stocks/quotes/spy/options>. 

Warning: To follow along with the rest of this example you need access to developer mode in Chrome or its equivalent in other browsers.

Let's look behind the curtain, so to speak. Right-click anywhere on the page and click Inspect, then navigate to the Network tab in Chrome developer tools.

We're going to press F5 to reload the page and look for two things: the Request URL and the Request Headers.

We will need the Request URL and the Request Headers in order to construct our calls to the server a little later. Simply put, this is the secret! We can replicate our browser's behavior when it requests data from the server if we know the actual request url and the request headers. This will be made clearer in the next section.
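
Before wiring up asyncio, it's worth confirming the idea with a single, plain synchronous request. Below is a minimal sketch using the requests library with a trimmed-down version of the request URL and a few of the headers captured from the Network tab; the shortened query string, the POST verb, and the placeholder User-Agent are assumptions here, and the full versions appear in the scraper classes below.

import requests

# hypothetical one-off check that we can replicate the browser's AJAX call;
# the query string is a shortened version of the Request URL found above
url = ('https://core-api.barchart.com/v1/options/chain'
       '?symbol=SPY&groupBy=optionType&raw=1')
headers = {
    'Accept': 'application/json',
    'Referer': 'https://www.barchart.com/stocks/quotes/spy/options',
    'User-Agent': 'Mozilla/5.0',  # placeholder; swap in your browser's UA string
}

resp = requests.post(url, headers=headers)
print(resp.status_code)          # 200 if the server accepted our "browser" request
print(resp.json().get('count'))  # number of option quotes returned, if any

If the server rejects the request, compare your headers against the full set captured in developer tools.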

The async_option_scraper.py Module

This is the key module for scraping the data. First the imports.


import asyncio
import aiohttp

first_async_scraper class

If you noticed, when the page loads it shows quotes for the nearest expiration date by default.

We know there are generally multiple expiration dates per symbol; however, some ETFs have weekly, monthly, and/or quarterly contracts. Instead of guessing the expiration dates, the first_async_scraper class scrapes each symbol's default page so we can later extract the expiration dates directly from the JSON/dict response.

This class takes no initialization parameters. 


# ================================================
# for first run only
class first_async_scraper:
    def __init__(self):
        pass

    async def _fetch(self, symbol, url, session, headers):
        """fn: to retrieve option quotes as JSON
        Params:
            symbol : str(), ETF
            url : str(), request url
            session : aiohttp.ClientSession() object
            headers : dict() containing header info
        Returns:
            response : JSON/Python Dict
        """
        async with session.post(url.format(symbol), headers=headers) as response:
            return await response.json(content_type=None)

    async def run(self, symbols, user_agent):
        """fn: to aggregate response option quotes
        Params:
            symbols : list of str(), ETF symbols
            user_agent : str()
        Returns:
            responses : list of JSON
        """
        url = 'https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description'

        headers = {
                "Accept":"application/json",
                "Accept-Encoding":"gzip, deflate, sdch, br",
                "Accept-Language":"en-US,en;q = 0.8",
                "Connection":"keep-alive",
                "Host":"core-api.barchart.com",
                "Origin":"https://www.barchart.com",
                "Referer":"https://www.barchart.com/etfs-funds/quotes/{}/options",
                "User-Agent":user_agent,
                }

        tasks = []
        async with aiohttp.ClientSession() as session:
            for symbol in symbols:
                # build a per-symbol copy of the headers; formatting the shared dict
                #   in place would consume the {} placeholder after the first symbol
                hdrs = dict(headers, Referer=headers['Referer'].format(symbol))
                task = asyncio.ensure_future(self._fetch(symbol, url, session, hdrs))
                tasks.append(task)
            # gather returns responses in original order not arrival order
            #   https://docs.python.org/3/library/asyncio-task.html#task-functions
            responses = await asyncio.gather(*tasks)
            return responses

The workhorse function is run, which calls the internal coroutine _fetch. Inside run I've hardcoded a request URL similar to the one we found before, along with the headers we found earlier. Notice that both the URL and the Referer header contain a {} placeholder so they can be dynamically formatted with each ETF symbol.

The _fetch function takes the ETF symbol, the URL string, the session object, and our request headers, makes the call to the server, and returns the response as a JSON/dict object.

The run function takes a list of symbols and a user agent string (more on this later).

The aiohttp package has an interface very similar to the requests module. We first create a ClientSession object, used here as an async context manager. Inside the session, we loop through each symbol, using asyncio.ensure_future to create and schedule a task for it. The gather function runs the tasks concurrently and waits until all of them have completed, returning a list of JSON responses, one per ETF, in submission order.
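
If the ensure_future/gather pattern is new to you, here is a stripped-down sketch of the same structure with the scraping details removed (example.com is just a stand-in URL):

import asyncio
import aiohttp

async def fetch_text(url, session):
    # one request; awaiting here lets the event loop start the other requests
    async with session.get(url) as response:
        return await response.text()

async def run_all(urls):
    async with aiohttp.ClientSession() as session:
        tasks = [asyncio.ensure_future(fetch_text(u, session)) for u in urls]
        # gather returns results in submission order, not arrival order
        return await asyncio.gather(*tasks)

loop = asyncio.get_event_loop()
pages = loop.run_until_complete(run_all(['http://example.com'] * 3))
print(len(pages))  # 3 responses, fetched concurrently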

expirys class

Once we have the list of responses, we need to extract the expiry dates from each response and collect them for later use. The class is initialized with two parameters: a list of ETF symbols and the list of responses from the first scrape job.

It uses two functions. The internal function _get_dict_expiry takes a single response object and returns the list of expirations for a single symbol. The exposed function get_expirys loops through the lists of ETFs and responses, aggregating them into a dictionary whose keys are the ETF symbols and whose values are the lists of expirations for each symbol.


# ================================================
class expirys:
    def __init__(self, ETFS, first_future_result):
        """Class to extract expiration data from Dict
        Params:
            ETFS : list of ETF symbol str()
            first_future_result : list of response objects (dict/JSON) from the first scraper
        """
        self.ETFS = ETFS
        self.first_future_result = first_future_result

    def _get_dict_expiry(self, response):
        """fn: to get expirations from response dict
        Params:
            response : dict/JSON object
        Returns:
            list() of date str(), "YYYY-MM-DD"
        """
        if response['count'] == 0:
            return None
        else:
            return response['meta']['expirations']

    def get_expirys(self):
        """fn: to create dict with k, v = symbol, list of expirys
                we have to do this b/c JSON/dict response data doesn't
                contain symbol identifier
        Returns:
            dict(symbol = list of expiry dates)
        """
        from itertools import zip_longest
        expirys = {}
        for symbol, resp in zip_longest(self.ETFS, self.first_future_result):
            # we can do this because results are in order of submission not arrival
            #   gather returns responses in original order not arrival order
            #       https://docs.python.org/3/library/asyncio-task.html#task-functions
            expirys[symbol] = self._get_dict_expiry(resp)
        return expirys
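
As a quick usage sketch, assuming ETFS is the list of symbols and first_run already holds the list of responses returned by first_async_scraper.run:

# hypothetical usage; ETFS and first_run come from the first scrape job
expirys_dict = expirys(ETFS, first_run).get_expirys()
print(expirys_dict['SPY'])  # a list of "YYYY-MM-DD" strings, or None if no options data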

xp_async_scraper class

The final scraper class is nearly identical to first_async_scraper, except that xp_run() and _xp_fetch() take additional arguments to accept the expiry dates. Also notice that the hardcoded URL in xp_run is slightly different: it is formatted to accept both the ETF symbol and an expiration date.


# ================================================
# async by url + expirations
class xp_async_scraper:
    def __init__(self):
        pass

    async def _xp_fetch(self, symbol, expiry, url, session, headers):
        """fn: to retrieve option quotes as JSON
        Params:
            symbol : str(), ETF
            expiry : str(), "YYYY-MM-DD"
            url : str(), request url
            session : aiohttp.ClientSession() object
            headers : dict() containing header info
        Returns:
            response : JSON/Python Dict
        """
        async with session.post(url.format(symbol, expiry), headers=headers) as response:
            return await response.json(content_type=None)

    async def xp_run(self, symbol, expirys, user_agent):
        """fn: to aggregate response option quotes
        Params:
            symbol : str(), ETF
            expirys : list of date str() "YYYY-MM-DD"
            user_agent : str()
        Returns:
            responses : list of JSON
        """
        url = "https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&expirationDate={}&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description"

        headers = {
                "Accept":"application/json",
                "Accept-Encoding":"gzip, deflate, sdch, br",
                "Accept-Language":"en-US,en;q=0.8",
                "Connection":"keep-alive",
                "Host":"core-api.barchart.com",
                "Origin":"https://www.barchart.com",
                "Referer":"https://www.barchart.com/etfs-funds/quotes/{}/options",
                "User-Agent":user_agent,
                }

        # the symbol is fixed for this run, so format the Referer once up front
        headers['Referer'] = headers['Referer'].format(symbol)

        tasks = []
        async with aiohttp.ClientSession() as session:
            for expiry in expirys:
                task = asyncio.ensure_future(self._xp_fetch(symbol, expiry, url, session, headers))
                tasks.append(task)
            # gather returns responses in original order not arrival order
            #   https://docs.python.org/3/library/asyncio-task.html#task-functions
            responses = await asyncio.gather(*tasks)
            return responses

last_price_scraper class

This class has the same structure and form as the other scraper classes, just slightly simpler. Its purpose is to retrieve the basic HTML source for each ETF so that we can later extract the last quoted price for the underlying equity.


# ================================================
# async get html page source
class last_price_scraper:
    def __init__(self):
        pass

    async def _fetch(self, symbol, url, session):
        """fn: to retrieve option quotes as JSON
        Params:
            symbol : str(), ETF
            url : str(), request url
            session : aiohttp.ClientSession() object
        Returns:
            response : text object
        """
        async with session.get(url.format(symbol)) as response:
            return await response.text()

    async def run(self, symbols):
        """fn: to aggregate response option quotes
        Params:
            symbols : list of str(), ETF symbols
        Returns:
            responses : list of text
        """
        url = 'https://www.barchart.com/stocks/quotes/{}/options'

        tasks = []
        async with aiohttp.ClientSession() as session:
            for symbol in symbols:
                task = asyncio.ensure_future(self._fetch(symbol, url, session))
                tasks.append(task)
            # gather returns responses in original order not arrival order
            #   https://docs.python.org/3/library/asyncio-task.html#task-functions
            responses = await asyncio.gather(*tasks)
            return responses

The option_parser.py Module

Once we have all the data, we need to parse it for easy analysis and storage. Fortunately this is relatively simple to do with Pandas. The option_parser.py module contains one class, option_parser, with three methods: extract_last_price(), create_call_df(), and create_put_df().

The option_parser class is initialized with an ETF symbol and the corresponding response object. The create-dataframe methods extract the call/put data from the JSON/dict response, then iterate through each quote, combining them into dataframes while cleaning the data set and converting the datatypes from objects to numeric/datetime where appropriate. The extract_last_price method gets the underlying quote price from the basic HTML source.


import re
import pandas as pd
import numpy as np
# ================================================
class option_parser:
    def __init__(self, symbol, response):
        self.symbol = symbol
        self.response = response
    # ------------------------------------------------
    # extract last price from html
    def extract_last_price(self, html_text):
        """fn: extract price from html"""
        reg_exp = r'(?<="lastPrice":)(\d{1,3}\.\d{2})'
        prices = re.findall(reg_exp, html_text)
        if len(prices) < 1:
            return np.nan
        else:
            return float(prices[0])        
    # ------------------------------------------------
    # create call df
    def create_call_df(self):
        """fn: to create call df"""
        json_calls = self.response['data']['Call']
        list_dfs = []
        for quote in json_calls:
            list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
        df = (
            pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
            .replace('NA', np.nan)
            .apply(pd.to_numeric, errors='ignore')
            .assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
        )
        df['symbol'] = [self.symbol] * len(df.index)
        return df
    # ------------------------------------------------
    # create put df
    def create_put_df(self):
        """fn: to create put df"""
        json_puts = self.response['data']['Put']
        list_dfs = []
        for quote in json_puts:
            list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
        df = (
            pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
            .replace('NA', np.nan)
            .apply(pd.to_numeric, errors='ignore')
            .assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
        )
        df['symbol'] = [self.symbol] * len(df.index)
        return df

The Implementation Script

Finally we can combine the modules into a script and run it. Note that this script requires the fake-useragent package, which has a nice feature: it generates a random user agent string every time ua.random is accessed. We need this so our requests are not blocked by the server.

The script imports a list of ETF symbols originally sourced from Nasdaq. Some of these symbols don't have options data, so they are filtered out. The script then runs in the following order: basic html scraper -> first async scraper -> extract the expiry dates -> xp async scraper to aggregate all the option data -> parse the collected data into dataframes -> download and insert any missing underlying prices -> save to disk as an HDF5 file.


import os
import sys
import pandas as pd
import pandas_datareader.data as web
import numpy as np
import time
import asyncio
from fake_useragent import UserAgent
'''set path variables'''
project_dir = "YOUR/PROJECT/DIR"
sys.path.append(project_dir)

import async_option_scraper
import option_parser
# ================================================
today = pd.Timestamp.today().date()
# ================================================
file_start = time.time()
print('\nAsync Barchart Scraper starting...')
# --------------- \\\
# import symbols
FILE = project_dir + 'ETFList.Options.Nasdaq__M.csv'
ALL_ETFS =  pd.read_csv(FILE)['Symbol']
drop_symbols = ['ADRE', 'AUNZ', 'CGW', 'DGT', 'DSI', 'EMIF', 'EPHE', 'EPU', 'EUSA', 'FAN', 'FDD', 'FRN', 'GAF', 'GII', 'GLDI', 'GRU', 'GUNR', 'ICN', 'INXX', 'IYY', 'KLD', 'KWT', 'KXI', 'MINT', 'NLR', 'PBP', 'PBS', 'PEJ', 'PIO', 'PWB', 'PWV', 'SCHO', 'SCHR', 'SCPB', 'SDOG', 'SHM', 'SHV', 'THRK', 'TLO', 'UHN', 'USCI', 'USV', 'VCSH']
ETFS = [x for x in ALL_ETFS if x not in set(drop_symbols)]

# ================================================
# GET HTML SOURCE FOR LAST SYMBOL EQUITY PRICE
# ================================================
t0_price = time.time()
# --------------- \\\
loop = asyncio.get_event_loop()

px_scraper = async_option_scraper.last_price_scraper()
px_run_future = asyncio.ensure_future(px_scraper.run(ETFS))

loop.run_until_complete(px_run_future)
px_run = px_run_future.result()
# ------------- ///
duration_price =  time.time() - t0_price
print('\nprice scraper script run time: ',
    pd.to_timedelta(duration_price, unit='s'))
# ------------- ///
# create price dictionary
px_dict = {}
for k, v in zip(ETFS, px_run):
    px_dict[k] = v

# ================================================
# RUN FIRST ASYNC SCRAPER
# ================================================
t0_first = time.time()
# --------------- \\\
ua = UserAgent()
loop = asyncio.get_event_loop()

first_scraper = async_option_scraper.first_async_scraper()
first_run_future = asyncio.ensure_future(
    first_scraper.run(ETFS, ua.random)
    )

loop.run_until_complete(first_run_future)
first_run = first_run_future.result()
# ------------- ///
first_duration =  time.time() - t0_first
print('\nfirst async scraper script run time: ',
    pd.to_timedelta(first_duration, unit='s'))

# ================================================
# EXTRACT EXPIRYS FROM FIRST RUN SCRAPER
# ================================================
xp = async_option_scraper.expirys(ETFS, first_run)
expirys = xp.get_expirys()

# ================================================
# SCRAPE AND AGGREGATE ALL SYMBOLS BY EXPIRY
# ================================================
t0_xp = time.time()
# -------------- \\\
# dict key=sym, values=list of json data by expiry
# create helper logic to test if expirys is None before passing
sym_xp_dict = {}
ua = UserAgent()
xp_scraper = async_option_scraper.xp_async_scraper()
for symbol in ETFS:
    print()
    print('-'*50)
    print('scraping: ', symbol)
    if not expirys[symbol]:
        print('symbol ' + symbol + ' missing expirys')
        continue
    try:
        xp_loop = asyncio.get_event_loop()
        xp_future = asyncio.ensure_future(
            xp_scraper.xp_run(symbol, expirys[symbol], ua.random)
            )
        xp_loop.run_until_complete(xp_future)
        sym_xp_dict[symbol] = xp_future.result()
    except Exception as e:
        print(symbol + ' error: ' + str(e))
# ------------- ///
duration_xp =  time.time() - t0_xp
print('\nall async scraper script run time: ', 
    pd.to_timedelta(duration_xp, unit='s'))

# ================================================
# PARSE ALL COLLECTED DATA
# ================================================
t0_agg = time.time()
# -------------- \\\
all_etfs_data = []
for symbol, xp_list in sym_xp_dict.items():
    print()
    print('-'*50)
    print('parsing: ', symbol)
    list_dfs_by_expiry = []
    try:
        for i in range(len(xp_list)):
            try:
                parser = option_parser.option_parser(
                    symbol, xp_list[i])
                call_df = parser.create_call_df()
                put_df = parser.create_put_df()
                concat = pd.concat([call_df, put_df], axis=0)
                concat['underlyingPrice'] = np.repeat(
                    parser.extract_last_price(px_dict[symbol]),
                    len(concat.index))
                list_dfs_by_expiry.append(concat)
            except Exception: continue  # skip any expiry that fails to parse
    except Exception as e:
        print(f'symbol: {symbol}\n error: {e}')
        print()
        continue
    all_etfs_data.append(pd.concat(list_dfs_by_expiry, axis=0))
# ------------- ///
duration_agg =  time.time() - t0_agg
print('\nagg parse data script run time: ', 
    pd.to_timedelta(duration_agg, unit='s'))
# -------------- \\\
dfx = pd.concat(all_etfs_data, axis=0).reset_index(drop=True)
print(dfx.info())
# ------------- ///

# ================================================
# GET ANY MISSING UNDERLYING PRICE
# ================================================
print('\nCollecting missing prices...')
grp = dfx.groupby(['symbol'])['underlyingPrice'].count()
missing_symbol_prices = grp[grp == 0].index

get_price = lambda symbol: web.DataReader(
    symbol, 'google', today)['Close']
prices = []
for symbol in missing_symbol_prices:
    px = get_price(symbol).iloc[0]
    prices.append((symbol, px))

df_prices = pd.DataFrame(prices).set_index(0)
for symbol in df_prices.index:
    (dfx.loc[dfx['symbol'] == symbol,
         ['underlyingPrice']]) = df_prices.loc[symbol].iloc[0]

dfx['underlyingPrice'] = dfx.underlyingPrice.astype(float)
print('\nmissing prices added')

# ================================================
# store dataframe as hdf
# ================================================
print(dfx.head(20))
print(dfx.info())

file_duration =  time.time() - file_start
print('\nfile script run time: ', pd.to_timedelta(file_duration, unit='s'))

file_ = project_dir + f'/ETF_options_data_{today}.h5'
dfx.to_hdf(file_, key='data', mode='w')

# ================================================
# kill python process after running script
# ================================================
time.sleep(2)
os.kill(os.getpid(), 9)
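
One of the reasons for choosing HDF5 is fast read access later on. Reading the file back for analysis is a one-liner; a sketch assuming the same project_dir and today values as above (both to_hdf and read_hdf require the PyTables package):

import pandas as pd

# hypothetical read-back of the file written by the script above
dfx = pd.read_hdf(project_dir + f'/ETF_options_data_{today}.h5', key='data')
print(dfx.info())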

Here's some sample output:

Get the code at the following GitHub Gist links:

UPDATE: Here is the list of Nasdaq ETF symbols for download <ETF Symbol List CSV>

References