Synthetic Data Generation (Part-1) - Block Bootstrapping

Outline

  • Introduction

  • An Alternative Solution?

  • Notebook Description and Links

  • Conclusions

  • Future Work

  • Resources and Links

Introduction

Data is at the core of quantitative research. The problem is that history has only one path, so our studies are limited to the single historical path a particular asset has taken. To gather more data, we collect more assets and sample them at higher and higher resolutions, but the main problem remains: one historical path per asset.

Derivatives pricing has developed working solutions to this problem, albeit for a different purpose. The task there is to compute a net present value for an asset based on return paths that have not happened yet, which requires generating many potential future paths in order to value the asset.

At the core of both problems lies the same need: the ability to generate multiple price paths that have not happened, whether to test the robustness of a trading algorithm or to price a derivative.

Generally speaking, most synthetic return paths are generated using a parametric model that captures the salient behavioral features of the asset in question. All of these approaches have drawbacks, but in a recent project the primary issues were speed, scalability, and underfitting. The dataset to fit contained several million data points, and several thousand return paths needed to be generated. A fast, simple model loses many features of the time series, specifically the volatility clustering so infamous in asset returns. A more complex model, while robust, took too long to fit and too long to generate the required number of return paths.

An Alternative Solution?

While researching alternative approaches, I came across the bootstrap methodology. Traditionally, bootstrap methods are used to estimate a parameter of a sampling distribution or model, and they generally require that the data being bootstrapped be independent and identically distributed (IID). This doesn't work well for time series, where serial correlation is present.

One approach that addresses this limitation is the Moving Block Bootstrap (MBB). The MBB randomly draws fixed-size blocks from the data and pastes them together to form a new series the same length as the original. It has a major limitation, however: observations near the beginning and end of the series are systematically underrepresented, because fewer blocks contain them.
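
To make the mechanics concrete, here is a minimal sketch of the MBB idea in plain NumPy; the function name and parameters are my own illustration, not taken from any library.

```python
import numpy as np

def moving_block_bootstrap(returns, block_size, rng=None):
    """Resample a series by pasting together randomly drawn fixed-size blocks."""
    rng = np.random.default_rng() if rng is None else rng
    returns = np.asarray(returns)
    n = len(returns)
    n_blocks = int(np.ceil(n / block_size))
    # A block may start anywhere a full block fits. The first and last few
    # observations appear in fewer blocks, which is the underrepresentation
    # problem mentioned above.
    starts = rng.integers(0, n - block_size + 1, size=n_blocks)
    resampled = np.concatenate([returns[s:s + block_size] for s in starts])
    return resampled[:n]  # trim back to the original length
```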

To address this limitation, an extension called the Circular Block Bootstrap (CBB) was developed. It works much the same way, except that blocks are allowed to wrap around from the end of the series to the beginning, so every observation is drawn with roughly equal probability. I found an excellent implementation of this method in the ARCH package. What follows is some of my initial experimentation with this approach, along with some caveats, my conclusions, and ideas for future experiments.
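
Below is a minimal sketch of how the CBB from the ARCH package can be used to generate resampled return series. The stand-in `returns` array, the block size, and the number of replications are illustrative assumptions, not recommendations.

```python
import numpy as np
from arch.bootstrap import CircularBlockBootstrap

# Stand-in for real intraday returns; in practice this would be the IEX data.
returns = np.random.standard_normal(10_000)

block_size = 50   # illustrative choice
n_paths = 100     # number of synthetic return series to generate

cbb = CircularBlockBootstrap(block_size, returns)

# bootstrap() yields (positional_data, keyword_data) tuples;
# the first positional element is the resampled return series.
synthetic_paths = [pos_data[0] for pos_data, _ in cbb.bootstrap(n_paths)]
```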

Notebook Description and Links

The data comes from IEX, sampled intraday approximately every 30 seconds, and covers a 3-month period ending December 2018. To see how I aggregated this data, see this linked post.

In the experiments I wanted to see whether the CBB did a good job of capturing the descriptive statistics of the time series, including the mean, standard deviation, min, max, and autocorrelation of returns. I also wanted to see how well the synthetic datasets captured the volatility structure of the original, and how realistic the generated price paths looked relative to the real series.
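
As a rough illustration of that comparison, the sketch below (continuing the hypothetical `returns` and `synthetic_paths` from the earlier example) tabulates a few of those descriptive statistics for the original series and a handful of bootstrapped ones.

```python
import numpy as np
import pandas as pd

def describe_returns(r, name):
    r = np.asarray(r).ravel()
    lag1_ac = np.corrcoef(r[:-1], r[1:])[0, 1]  # first-order autocorrelation
    return pd.Series(
        {"mean": r.mean(), "std": r.std(), "min": r.min(),
         "max": r.max(), "lag1_autocorr": lag1_ac},
        name=name,
    )

summary = pd.concat(
    [describe_returns(returns, "original")]
    + [describe_returns(p, f"synthetic_{i}") for i, p in enumerate(synthetic_paths[:5])],
    axis=1,
)
print(summary.round(4))
```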

Github Repo | Notebook Link

Conclusions

First, if one wanted to use this technique for trading strategy development, I would recommend holding out at least a portion of the real series as an out-of-sample test set. This ensures the model/strategy doesn't suffer from look-ahead bias during development. Ideally you would split your data into three components: a training set for development, a test set for any hyperparameter optimization, and a validation (OOS) set as a final check.
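
A minimal illustration of such a chronological split is given below; the 60/20/20 proportions are an arbitrary assumption, not a recommendation.

```python
import numpy as np

series = np.asarray(returns)                 # placeholder for the real series
n = len(series)

train = series[: int(n * 0.6)]               # strategy/model development
test = series[int(n * 0.6): int(n * 0.8)]    # hyperparameter optimization
validation = series[int(n * 0.8):]           # final out-of-sample (OOS) test, used once
```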

One limitation of this approach is the fixed block size. Different block sizes emphasize different lengths of autocorrelation (memory). At one extreme, a block size so small that no serial correlation is captured; at the other, a block size so large that you essentially resample the original series. In my initial experimentation, however, there is a wide range of reasonable block sizes that generate realistic price paths and, overall, capture the behavior of the original series.
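
To get a feel for this trade-off, one could sweep a few block sizes and compare a simple memory measure such as the average lag-1 autocorrelation of the resampled series, as in this hedged sketch (the block sizes and replication count are arbitrary).

```python
import numpy as np
from arch.bootstrap import CircularBlockBootstrap

def mean_lag1_autocorr(data, block_size, reps=50):
    """Average lag-1 autocorrelation across `reps` bootstrapped series."""
    cbb = CircularBlockBootstrap(block_size, data)
    acs = [np.corrcoef(p[0][:-1], p[0][1:])[0, 1] for p, _ in cbb.bootstrap(reps)]
    return float(np.mean(acs))

for bs in (5, 25, 100, 500):  # illustrative block sizes
    print(bs, round(mean_lag1_autocorr(returns, bs), 4))
```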

Another potential limitation is data diversity. Since the bootstrapped series contain only returns that have actually occurred, the generated paths tend to be biased toward the overall direction of the original series. This is evident when looking at samples of the random price paths.

Future Work

  • To address the fixed block size, test the output and data diversity using the Stationary Block Bootstrap (SBB), which uses a randomly (exponentially) distributed block size (see the sketch after this list).

  • One proposal to improve data diversity is to sample blocks from different assets within the same asset class.

  • Test whether prediction models trained on bootstrapped data are less prone to overfitting.
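
For the first item, the ARCH package already ships a StationaryBootstrap class that draws randomly sized blocks around an average length. A minimal sketch of swapping it in (with an arbitrary average block size, reusing the hypothetical `returns` from earlier) might look like this.

```python
from arch.bootstrap import StationaryBootstrap

avg_block_size = 50  # expected block length; actual block lengths are random
sbb = StationaryBootstrap(avg_block_size, returns)

sbb_paths = [pos_data[0] for pos_data, _ in sbb.bootstrap(100)]
```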

Resources and Links

  • https://arch.readthedocs.io/en/latest/index.html

    • Kevin Sheppard. (2018, October 3). bashtage/arch: Release 4.6.0 (Version 4.6.0). Zenodo. http://doi.org/10.5281/zenodo.1443315

    • https://arch.readthedocs.io/en/latest/bootstrap/timeseries-bootstraps.html

  • http://www.blackarbs.com/blog/download-intraday-stock-data-with-iex-and-parquet