How to Build a Sequential Option Scraper with Python and Requests

Post Outline

  • Recap
  • The Problem
  • The Solution
  • Barchart Scraper Class
  • Barchart Parser Class
  • Utility Functions
  • Putting it all together
  • The Simple Trick
  • Next Steps

Recap

In the previous post I revealed a web scraping trick that allows us to defeat AJAX/JavaScript-based web pages and extract the tables we need. We also covered how to use that trick to scrape a large volume of options prices quickly and asynchronously using the combination of aiohttp and asyncio.

The Problem

It worked beautifully until... I told people about it. Shortly after publishing, my code stopped functioning. After investigating, it was clear that no data was being returned by the aiohttp calls to the Barchart server. I attempted to fix the code by adding a semaphore to the asyncio calls. Roughly speaking, in this context a semaphore lets you cap the number of requests that can be in flight simultaneously. I tried caps of 100, 50, 10, and 2, and they all failed.
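
For context, the throttled version I tried looked roughly like the sketch below. This is not the exact code from the previous post; the urls list, the missing headers, and the max_concurrent value are placeholders:


import asyncio
import aiohttp

async def fetch(session, url, semaphore):
    # the semaphore caps how many requests can be in flight at once
    async with semaphore:
        async with session.post(url) as response:  # headers omitted for brevity
            return await response.json()

async def fetch_all(urls, max_concurrent=10):
    semaphore = asyncio.Semaphore(max_concurrent)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url, semaphore) for url in urls]
        return await asyncio.gather(*tasks)

# e.g. results = asyncio.get_event_loop().run_until_complete(fetch_all(urls, 10))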

I do not know for sure what happened, but if I had to guess, the increase in server load per unit time was significant enough for Barchart's system/network staff to update their server settings and squash the multiple simultaneous calls.

The Solution

We simply build a sequential scraper instead of an asynchronous one. To make it more robust we add a simple twist to the code that makes it more difficult to distinguish human from automated traffic.

Barchart Scraper Class

This class is similar to the previous version except asyncio is stripped out. Its main job is to construct the POST URL, call the server, and return the response data. Please note, I tested this class with a dynamic referer symbol and random user agents, and this simple hardcoded setup has worked most consistently for me.


import requests as r

class barchart_scraper:
    def __init__(self, symbol):
        self.__request_headers = {
            "Accept":"application/json",
            "Accept-Encoding":"gzip, deflate, sdch, br",
            "Accept-Language":"en-US,en;q=0.8",
            "Connection":"keep-alive",
            "Host":"core-api.barchart.com",
            "Origin":"https://www.barchart.com",
            "Referer":"https://www.barchart.com/etfs-funds/quotes/SPY/options",
            "User-Agent":"Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.81 Safari/537.36",
            }

        self.__base_url_str = 'https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description'

        self.__expiry_url_str = "https://core-api.barchart.com/v1/options/chain?symbol={}&fields=strikePrice%2ClastPrice%2CpercentFromLast%2CbidPrice%2Cmidpoint%2CaskPrice%2CpriceChange%2CpercentChange%2Cvolatility%2Cvolume%2CopenInterest%2CoptionType%2CdaysToExpiration%2CexpirationDate%2CsymbolCode%2CsymbolType&groupBy=optionType&expirationDate={}&raw=1&meta=field.shortName%2Cfield.type%2Cfield.description"
        self.symbol = symbol
    # ------------------------------------------------
    def _construct_url(self):
        """fn: to build the default option chain URL for the symbol"""
        return self.__base_url_str.format(self.symbol)

    def _construct_expiry_url(self, expiry):
        """fn: to build the option chain URL for a specific expiration"""
        return self.__expiry_url_str.format(self.symbol, expiry)
    # ------------------------------------------------
    def post_url(self, expiry=None):
        """fn: to call the server and return the response object"""
        if not expiry:
            return r.post(
                url = self._construct_url(),
                headers = self.__request_headers
                )
        else:
            return r.post(
                url = self._construct_expiry_url(expiry=expiry),
                headers = self.__request_headers
                )
    # ------------------------------------------------
    def get_expirys(self, response):
        """fn: to extract the available expiration dates from the response"""
        return response.json()['meta']['expirations']
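
For reference, here is a quick usage sketch of the class, assuming the request succeeds (SPY is just an example symbol):


# quick usage sketch -- 'SPY' is just an example symbol
scraper = barchart_scraper('SPY')
response = scraper.post_url()            # default chain request (no expiration specified)
expirys = scraper.get_expirys(response)  # available expiration dates from the response meta
print(expirys)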

Barchart Parser Class

This class is essentially identical to the previous parser class and simply extracts call/put data into pandas dataframes.


import pandas as pd
import numpy as np

class barchart_parser:
    def __init__(self, symbol, response):
        self.symbol = symbol
        self.response = response
    # ------------------------------------------------
    # create call df
    def create_call_df(self):
        """fn: to create call df"""
        json_calls = self.response.json()['data']['Call']
        list_dfs = []
        for quote in json_calls:
            list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
        df = (
            pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
            .replace('NA', np.nan)
            .apply(pd.to_numeric, errors='ignore')
            .assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
        )
        df['symbol'] = [self.symbol] * len(df.index)
        return df
    # ------------------------------------------------
    # create put df
    def create_put_df(self):
        """fn: to create put df"""
        json_puts = self.response.json()['data']['Put']
        list_dfs = []
        for quote in json_puts:
            list_dfs.append(pd.DataFrame.from_dict(quote['raw'], orient='index'))
        df = (
            pd.concat(list_dfs, axis=1).T.reset_index(drop=True)
            .replace('NA', np.nan)
            .apply(pd.to_numeric, errors='ignore')
            .assign(expirationDate = lambda x: pd.to_datetime(x['expirationDate']))
        )
        df['symbol'] = [self.symbol] * len(df.index)
        return df

Utility Functions

Next we define two utility functions. The first is simply a convenience function to run the first iteration of the scraper. We need to do that for each symbol in order to extract its expiration dates dynamically.


def get_first_data(symbol):
    """fn: to get first data and extract expiry dates"""

    # scrape
    scraper = barchart_scraper(symbol)
    response = scraper.post_url()
    expirys = scraper.get_expirys(response)
    # parse response
    parser = barchart_parser(symbol, response)
    first_call_df = parser.create_call_df()
    first_put_df = parser.create_put_df()
    # merge calls + puts
    first_concat = pd.concat([first_call_df, first_put_df], axis=0)
    return first_concat, expirys

The second is a little lambda function that gets the symbol's last daily close price from Google Finance, which we add to our dataset before saving to disk.


get_price = lambda symbol: web.DataReader(
    symbol, 'google', today - 1*BDay(), today)['Close']
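
For example, the same call the main script makes per symbol looks like this (SPY again is purely an illustration):


# example call -- 'SPY' is just an illustration; returns a Series of closes
last_close_price = get_price('SPY').iloc[0]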

Putting It All Together

Next we can implement the main script body. Essentially it runs an outer loop and an inner loop: for each symbol we get the default first data and extract the expirys, then for each remaining expiration we extract that data. At the end of the inner loop, all data for that symbol is concatenated and appended to a list containing all of the symbols' dataframes. Finally, all of the symbols' dataframes are concatenated and saved to HDF.



import requests as r
import pandas as pd
import pandas_datareader.data as web
from pandas.tseries.offsets import BDay
import numpy as np
import time
import os
from tqdm import tqdm

from barchart_scraper import barchart_scraper
from barchart_parser import barchart_parser

today = pd.datetime.today().date()
project_dir = '/YOUR/PROJECT/DIR'
# -----------------------------------------------------------------------------
# define utility functions
# -----------------------------------------------------------------------------

def get_first_data(symbol):
    """fn: to get first data and extract expiry dates"""

    # scrape
    scraper = barchart_scraper(symbol)
    response = scraper.post_url()
    expirys = scraper.get_expirys(response)
    # parse response
    parser = barchart_parser(symbol, response)
    first_call_df = parser.create_call_df()
    first_put_df = parser.create_put_df()
    # merge calls + puts
    first_concat = pd.concat([first_call_df, first_put_df], axis=0)
    return first_concat, expirys

# function to get last daily close from Google Finance
get_price = lambda symbol: web.DataReader(
    symbol, 'google', today - 1*BDay(), today)['Close']

# -----------------------------------------------------------------------------
# import symbols
# -----------------------------------------------------------------------------
FILE = project_dir + '/ETFList.Options.Nasdaq__M.csv'
ALL_ETFS =  pd.read_csv(FILE)['Symbol']
drop_symbols = ['ADRE', 'AUNZ', 'CGW', 'DGT', 'DSI', \
                'EMIF', 'EPHE', 'EPU', 'EUSA', 'FAN', \
                'FDD', 'FRN', 'GAF', 'GII', 'GLDI', 'GRU', \
                'GUNR', 'ICN', 'INXX', 'IYY', 'KLD', 'KWT', \
                'KXI', 'MINT', 'NLR', 'PBP', 'PBS', 'PEJ', \
                'PIO', 'PWB', 'PWV', 'SCHO', 'SCHR', 'SCPB', \
                'SDOG', 'SHM', 'SHV', 'THRK', 'TLO', 'UHN', \
                'USCI', 'USV', 'VCSH']
ETFS = [x for x in ALL_ETFS if x not in set(drop_symbols)]
# -----------------------------------------------------------------------------
# run main script body
#
# loop through all etfs
#   loop through expirys for each etf
# -----------------------------------------------------------------------------
t0 = time.time()
all_etfs_data = []
error_symbols = []
for symbol in tqdm(ETFS):
    print()
    print('-'*79)
    print('scraping: ', symbol)
    try:
        last_close_price = get_price(symbol).iloc[0]
        first_concat, expirys = get_first_data(symbol)
        list_dfs_by_expiry = []
        list_dfs_by_expiry.append(first_concat)
        for expiry in tqdm(expirys[1:]):
            print()
            print('scraping expiry: ', expiry)
            scraper = barchart_scraper(symbol)
            tmp_response = scraper.post_url(expiry=expiry)
            print('parsing... ')
            parser = barchart_parser(symbol, tmp_response)
            call_df = parser.create_call_df()
            put_df = parser.create_put_df()
            concat = pd.concat([call_df, put_df], axis=0)
            concat['underlyingPrice'] = [last_close_price] * concat.shape[0]
            list_dfs_by_expiry.append(concat)
            print('parsing complete')
            random_wait = np.random.choice([1,1.25,2.5,3], p=[0.3,0.3,0.25,0.15])
            time.sleep(random_wait)
        all_etfs_data.append(pd.concat(list_dfs_by_expiry, axis=0))
    except Exception as e:
        error_symbols.append(symbol)
        print(f'symbol: {symbol}\n error: {e}')
        print()
        continue
# -----------------------------------------------------------------------------
duration =  time.time() - t0
print('script run time: ', pd.to_timedelta(duration, unit='s'))

dfx = pd.concat(all_etfs_data, axis=0)
print(dfx.head())
print(dfx.info())
print(f'error symbols:\n{error_symbols}')
# -----------------------------------------------------------------------------
# store table as hdf
# -----------------------------------------------------------------------------
today = pd.datetime.today().date()
file_ = project_dir + f'/Barchart_Options_Data/ETF_options_data_{today}.h5'
dfx.to_hdf(file_, key='data', format='table', mode='w') 
# -----------------------------------------------------------------------------
# kill python process after running script to prevent leakage
# -----------------------------------------------------------------------------
time.sleep(5)
os.kill(os.getpid(), 9)
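
If you want to sanity check the saved data later (in a separate session, since this script kills its own process), the table can be read back with pandas:


# read the saved options table back into a dataframe
df = pd.read_hdf(file_, key='data')
print(df.info())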

The Simple Trick

Did you notice the random_wait at the end of the inner loop? We simply pass an array of reasonable wait times (measured in seconds) and their probabilities to numpy's np.random.choice() and pass the result to time.sleep() before requesting the next expiration. This isn't guaranteed to always work, but in cases where servers may be restricting traffic loads it makes it much harder to identify your traffic as automated.
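
As a quick sanity check on the cost of this trick, the expected extra delay per request is just the probability-weighted average of the wait times, about 1.75 seconds with the numbers used in the script above:


import numpy as np

waits = np.array([1, 1.25, 2.5, 3])
probs = np.array([0.3, 0.3, 0.25, 0.15])
# probability-weighted average: 1*0.3 + 1.25*0.3 + 2.5*0.25 + 3*0.15 = 1.75 seconds
print(np.dot(waits, probs))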

Ultimately, it's also a respectful way to operate our scraper.

Next Steps

Next up in the series I plan to dig into the data collected over the last six weeks of running this script, exploring multiple angles and dynamics in the data.

Do you have any suggestions for exploration topics? If so, leave a comment or contact me via email or Twitter.