This post summarizes a more detailed Jupyter (IPython) notebook in which I demonstrate how to use Python, Scikit-Learn, and Gaussian Mixture Models to generate realistic-looking return series. Here we compare real ETF returns against synthetic realizations. To evaluate how similar the real and synthetic returns are, we compare the following:

visual inspection
histogram comparisons
descriptive statistics
correlations
autocorrelations

The data set contains daily ETF data covering 2004 to 2017. The ETFs are SPY, QQQ, TLT, GLD, EFA, and EEM, and the data is sourced from the Tiingo API.
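As a rough sketch of the numeric checks listed above (descriptive statistics, correlations, autocorrelations), something like the following could be used. The helper name `compare_series` and the random stand-in data are hypothetical and not taken from the notebook; in practice the two inputs would be the real and the synthetic return frames.

```python
import numpy as np
import pandas as pd


def compare_series(real: pd.DataFrame, synth: pd.DataFrame, lag: int = 1) -> pd.DataFrame:
    """Side-by-side summary statistics for real and synthetic returns."""

    def summarize(df: pd.DataFrame) -> pd.DataFrame:
        return pd.DataFrame(
            {
                "mean": df.mean(),
                "std": df.std(),
                "skew": df.skew(),
                "kurt": df.kurt(),
                # Autocorrelation of each column at the chosen lag.
                f"autocorr_{lag}": df.apply(lambda s: s.autocorr(lag=lag)),
            }
        )

    # Columns become a two-level index: ("real", stat) and ("synthetic", stat).
    return pd.concat({"real": summarize(real), "synthetic": summarize(synth)}, axis=1)


# Hypothetical stand-in data; replace with the real and synthetic return frames.
rng = np.random.default_rng(0)
real = pd.DataFrame(rng.normal(0, 1, size=(500, 2)), columns=["A", "B"])
synth = pd.DataFrame(rng.normal(0, 1, size=(500, 2)), columns=["A", "B"])

stats = compare_series(real, synth)
# Element-wise absolute difference between the two correlation matrices.
corr_gap = (real.corr() - synth.corr()).abs()
```

Small values in `corr_gap` and similar numbers in the two halves of `stats` would suggest the synthetic series resembles the real one on these measures.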

Import Data

  
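from pathlib import Path
import pandas as pd

# data_dir (a Path) and cprint (a display helper) are assumed to be defined in the full notebook.
infp = Path(data_dir / "tiingo_etf_returns_ending_2017-12-31.parq")
R = pd.read_parquet(infp).assign(year=lambda df: df.index.year)
cprint(R)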

Scale and Transform Data

Next we scale the data before fitting. I tried several approaches, including PCA (not technically a scaler), MaxAbsScaler, PowerTransformer, RobustScaler, and no scaling at all. Surprisingly, the results were quite similar, with PowerTransformer and RobustScaler doing slightly better on the descriptive statistics. However, my findings may be the result of randomness, so I encourage readers to experiment for themselves.
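To make the RobustScaler choice a little more concrete, here is a minimal sketch of what it does with a `quantile_range` of (2.5, 97.5): it subtracts the median and divides by the distance between the 2.5th and 97.5th percentiles. The fat-tailed input below is hypothetical stand-in data, not the ETF returns.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

rng = np.random.default_rng(42)
# Hypothetical fat-tailed "returns" (Student-t with 3 degrees of freedom).
x = rng.standard_t(df=3, size=(1000, 1))

scaler = RobustScaler(quantile_range=(2.5, 97.5))
z = scaler.fit_transform(x)

# Manual equivalent: center on the median, scale by the 2.5%-97.5% range.
lo, hi = np.percentile(x, [2.5, 97.5], axis=0)
z_manual = (x - np.median(x, axis=0)) / (hi - lo)

# inverse_transform undoes the scaling, which is what lets samples drawn
# in scaled space be mapped back to return space later on.
x_back = scaler.inverse_transform(z)
```

Because the center and scale come from quantiles rather than the mean and standard deviation, a handful of extreme days has little influence on the transform.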

  
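from sklearn.preprocessing import PowerTransformer, MaxAbsScaler, RobustScaler

# Drop the trailing "year" column and express returns in percent.
R_pct = R.iloc[:, :-1].mul(100)
# Center on the median and scale by the 2.5%-97.5% quantile range.
# Alternative tried: PowerTransformer(method="yeo-johnson", standardize=True)
scaler = RobustScaler(quantile_range=(2.5, 97.5))
data = scaler.fit_transform(R_pct)
data.shape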

# (3301, 6)

Fit Mixture Model and Generate Simulation Path

  
    
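import sklearn.mixture as mix

# aics: AIC score per candidate component count, computed in the notebook (not shown here).
gmm = mix.GaussianMixture(aics.idxmin(), n_init=3, covariance_type="full")
gmm.fit(data)
print(gmm.converged_)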

# True
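The `aics` object used above is not defined in this summary. One common way to build something like it, shown here only as a sketch and not necessarily how the notebook does it, is to fit a model for each candidate number of components and record its AIC in a pandas Series indexed by that count. The two-regime data and the candidate range of 1 to 5 are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
import sklearn.mixture as mix

rng = np.random.default_rng(0)
# Hypothetical two-regime data: a calm cluster and a more volatile one.
calm = rng.normal(0.0, 0.5, size=(400, 2))
stressed = rng.normal(-0.5, 2.0, size=(100, 2))
X = np.vstack([calm, stressed])

candidates = range(1, 6)
scores = {}
for k in candidates:
    model = mix.GaussianMixture(k, n_init=3, covariance_type="full", random_state=0)
    scores[k] = model.fit(X).aic(X)  # lower AIC is better

aics = pd.Series(scores)
best_k = int(aics.idxmin())  # component count with the lowest AIC

gmm = mix.GaussianMixture(best_k, n_init=3, covariance_type="full", random_state=0)
gmm.fit(X)
```

BIC (`model.bic(X)`) could be swapped in the same way if a stronger penalty on extra components is preferred.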

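import numpy as np

def sample_path(gmm_model, scaler, n_samples, seed=0):
    """Draw n_samples from the fitted GMM and map them back to percent returns."""
    # Seeds NumPy's global RNG, which sample() uses when the model has no random_state.
    np.random.seed(seed)
    # sample() returns (samples, component_labels); keep only the samples.
    data_new = gmm_model.sample(n_samples)[0]
    paths = scaler.inverse_transform(data_new)
    path_df = pd.DataFrame(paths)
    return path_df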

N_SAMPLES = len(data)
path_df = sample_path(gmm, scaler, N_SAMPLES)
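One detail worth knowing when treating the samples as a time-ordered path: scikit-learn's `GaussianMixture.sample` returns its draws grouped by component, so all rows from component 0 come first, then component 1, and so on. The sketch below demonstrates this on hypothetical data and shows an optional shuffle. The shuffle is my own illustration and is not part of the code above.

```python
import numpy as np
import sklearn.mixture as mix

rng = np.random.default_rng(1)
# Hypothetical data with a low-volatility and a high-volatility cluster.
X = np.vstack([rng.normal(0, 0.5, (300, 2)), rng.normal(0, 2.0, (100, 2))])
gmm = mix.GaussianMixture(2, covariance_type="full", random_state=0).fit(X)

samples, labels = gmm.sample(400)

# Labels are non-decreasing, i.e. rows arrive grouped by component.
is_grouped = bool(np.all(np.diff(labels) >= 0))

# Optional: permute the rows so components are interleaved along the path.
perm = np.random.default_rng(0).permutation(len(samples))
shuffled, shuffled_labels = samples[perm], labels[perm]
```

Shuffling changes only the order of the rows, so histograms, descriptive statistics, and cross-asset correlations are unaffected. Only order-dependent views such as cumulative paths and autocorrelations change.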

  

Compare Real vs Synthetic

REAL


_ = R_pct.cumsum().plot(figsize=(12, 10))

SYNTHETIC


_ = path_df.cumsum().plot(figsize=(12, 10))

COMPARE HISTOGRAMS

Mean returns of a single random generation vs. mean returns of the real series

Not bad, right? To see the detailed comparisons, including the weaknesses of this approach, view the notebook below or click the link to view it on nbviewer.org.

[Jupyter Notebook Link]
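For readers who want a number to go with the histogram comparison, one simple option is an overlap coefficient computed on shared bins: 1.0 means the two histograms coincide and 0.0 means they do not overlap at all. This is a sketch of my own rather than something from the notebook. The function name `hist_overlap` and the normal stand-in data are hypothetical.

```python
import numpy as np


def hist_overlap(real, synth, bins=50):
    """Overlap of two density histograms built on the same bin edges."""
    edges = np.histogram_bin_edges(np.concatenate([real, synth]), bins=bins)
    p, _ = np.histogram(real, bins=edges, density=True)
    q, _ = np.histogram(synth, bins=edges, density=True)
    # Integrate the pointwise minimum of the two densities over the bins.
    return float(np.sum(np.minimum(p, q) * np.diff(edges)))


rng = np.random.default_rng(0)
a = rng.normal(0, 1, 5000)  # stand-in for a real return series
b = rng.normal(0, 1, 5000)  # stand-in for a good synthetic series
c = rng.normal(5, 1, 5000)  # stand-in for a poor synthetic series

same = hist_overlap(a, a)
close = hist_overlap(a, b)
far = hist_overlap(a, c)
```

The same `edges` array can also be passed to a plotting call so that the real and synthetic histograms are drawn on identical bins.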

References

In Depth: Gaussian Mixture Models by Jake VanderPlas

Synthetic ETF Data Generation (Part-2) - Gaussian Mixture Models

BLACKARBS LLC
