Introduction to Hidden Markov Models with Python Networkx and Sklearn

Post Outline

  • Who is Andrey Markov?
  • What is the Markov Property?
  • What is a Markov Model?
  • What makes a Markov Model Hidden?
  • A Hidden Markov Model for Regime Detection
  • Conclusion
  • References

Who is Andrey Markov?

Markov was a Russian mathematician best known for his work on stochastic processes. The focus of his early work was number theory but after 1900 he focused on probability theory, so much so that he taught courses after his official retirement in 1905 until his deathbed [2]. During his research Markov was able to extend the law of large numbers and the central limit theorem to apply to certain sequences of dependent random variables, now known as Markov Chains [1][2]. Markov chains are widely applicable to physics, economics, statistics, biology, etc. Two of the most well known applications were Brownian motion [3], and random walks.

What is the Markov Property?

"...a random process where the future is independent of the past given the present." [4]

Assume a simplified coin toss game with a fair coin. Suspend disbelief and assume that the Markov property is not yet known and we would like to predict the probability of flipping heads after 10 flips. Under the assumption of conditional dependence (the coin has memory of past states and the future state depends on the sequence of past states) we must record the specific sequence that lead up to the 11th flip and the joint probabilities of those flips. So imagine after 10 flips we have a random sequence of heads and tails. The joint probability of that sequence is 0.5^10 = 0.0009765625. Under conditional dependence, the probability of heads on the next flip is 0.0009765625 * 0.5 = 0.00048828125. 

Is that the real probability of flipping heads on the 11th flip? Hell no!

We know that the event of flipping the coin does not depend on the result of the flip before it. The coin has no memory. The process of successive flips does not encode the prior results. Each flip is a unique event with equal probability of heads or tails, aka conditionally independent of past states. This is the Markov property.  

What is a Markov Model?

A Markov chain (model) describes a stochastic process where the assumed probability of future state(s) depends only on the current process state and not on any the states that preceded it (shocker).

Let's get into a simple example. Assume you want to model the future probability that your dog is in one of three states given its current state. To do this we need to specify the state space, the initial probabilities, and the transition probabilities.

Imagine you have a very lazy fat dog, so we define the state space as sleeping, eating, or pooping. We will set the initial probabilities to 35%, 35%, and 30% respectively.

import numpy as np
import pandas as pd
import networkx as nx
import matplotlib.pyplot as plt
%matplotlib inline

# create state space and initial state probabilities

states = ['sleeping', 'eating', 'pooping']
pi = [0.35, 0.35, 0.3]
state_space = pd.Series(pi, index=states, name='states')

The next step is to define the transition probabilities. They are simply the probabilities of staying in the same state or moving to a different state given the current state. 

# create transition matrix
# equals transition probability matrix of changing states given a state
# matrix is size (M x M) where M is number of states

q_df = pd.DataFrame(columns=states, index=states)
q_df.loc[states[0]] = [0.4, 0.2, 0.4]
q_df.loc[states[1]] = [0.45, 0.45, 0.1]
q_df.loc[states[2]] = [0.45, 0.25, .3]


q = q_df.values
print('\n', q, q.shape, '\n')

Now that we have the initial and transition probabilities setup we can create a Markov diagram using the Networkx package.

To do this requires a little bit of flexible thinking. Networkx creates Graphs that consist of nodes and edges. In our toy example the dog's possible states are the nodes and the edges are the lines that connect the nodes. The transition probabilities are the weights. They represent the probability of transitioning to a state given the current state.

Something to note is networkx deals primarily with dictionary objects. With that said, we need to create a dictionary object that holds our edges and their weights.

from pprint import pprint 

# create a function that maps transition probability dataframe 
# to markov edges and weights

def _get_markov_edges(Q):
    edges = {}
    for col in Q.columns:
        for idx in Q.index:
            edges[(idx,col)] = Q.loc[idx,col]
    return edges

edges_wts = _get_markov_edges(q_df)

Now we can create the graph. To visualize a Markov model we need to use nx.MultiDiGraph(). A multidigraph is simply a directed graph which can have multiple arcs such that a single node can be both the origin and destination. 

In the following code, we create the graph object, add our nodes, edges, and labels, then draw a bad networkx plot while outputting our graph to a dot file. 

# create graph object
G = nx.MultiDiGraph()

# nodes correspond to states

# edges represent transition probabilities
for k, v in edges_wts.items():
    tmp_origin, tmp_destination = k[0], k[1]
    G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)

pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='dot')
nx.draw_networkx(G, pos)

# create edge labels for jupyter plot but is not necessary
edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G , pos, edge_labels=edge_labels)
nx.drawing.nx_pydot.write_dot(G, '')

Now a look at the dot file. 

Not bad. If you follow the edges from any node, it will tell you the probability that the dog will transition to another state. For example, if the dog is sleeping, we can see there is a 40% chance the dog will keep sleeping, a 40% chance the dog will wake up and poop, and a 20% chance the dog will wake up and eat. 

What Makes a Markov Model Hidden?

Consider a situation where your dog is acting strangely and you wanted to model the probability that your dog's behavior is due to sickness or simply quirky behavior when otherwise healthy.

In this situation the true state of the dog is unknown, thus hidden from you. One way to model this is to assume that the dog has observable behaviors that represent the true, hidden state. Let's walk through an example.

First we create our state space - healthy or sick. We assume they are equiprobable.  

# create state space and initial state probabilities

hidden_states = ['healthy', 'sick']
pi = [0.5, 0.5]
state_space = pd.Series(pi, index=hidden_states, name='states')
print('\n', state_space.sum())

Next we create our transition matrix for the hidden states.

# create hidden transition matrix
# a or alpha 
#   = transition probability matrix of changing states given a state
# matrix is size (M x M) where M is number of states

a_df = pd.DataFrame(columns=hidden_states, index=hidden_states)
a_df.loc[hidden_states[0]] = [0.7, 0.3]
a_df.loc[hidden_states[1]] = [0.4, 0.6]


a = a_df.values
print('\n', a, a.shape, '\n')

This is where it gets a little more interesting. Now we create the emission or observation probability matrix. This matrix is size M x O where M is the number of hidden states and O is the number of possible observable states. 

The emission matrix tells us the probability the dog is in one of the hidden states, given the current, observable state. 

Let's keep the same observable states from the previous example. The dog can be either sleeping, eating, or pooping. For now we make our best guess to fill in the probabilities. 

# create matrix of observation (emission) probabilities
# b or beta = observation probabilities given state
# matrix is size (M x O) where M is number of states 
# and O is number of different possible observations

observable_states = states

b_df = pd.DataFrame(columns=observable_states, index=hidden_states)
b_df.loc[hidden_states[0]] = [0.2, 0.6, 0.2]
b_df.loc[hidden_states[1]] = [0.4, 0.1, 0.5]


b = b_df.values
print('\n', b, b.shape, '\n')

Now we create the graph edges and the graph object. 

# create graph edges and weights

hide_edges_wts = _get_markov_edges(a_df)

emit_edges_wts = _get_markov_edges(b_df)

# create graph object
G = nx.MultiDiGraph()

# nodes correspond to states

# edges represent hidden probabilities
for k, v in hide_edges_wts.items():
    tmp_origin, tmp_destination = k[0], k[1]
    G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)

# edges represent emission probabilities
for k, v in emit_edges_wts.items():
    tmp_origin, tmp_destination = k[0], k[1]
    G.add_edge(tmp_origin, tmp_destination, weight=v, label=v)

pos = nx.drawing.nx_pydot.graphviz_layout(G, prog='neato')
nx.draw_networkx(G, pos)

# create edge labels for jupyter plot but is not necessary
emit_edge_labels = {(n1,n2):d['label'] for n1,n2,d in G.edges(data=True)}
nx.draw_networkx_edge_labels(G , pos, edge_labels=emit_edge_labels)
nx.drawing.nx_pydot.write_dot(G, '')

The hidden Markov graph is a little more complex but the principles are the same. For example, you would expect that if your dog is eating there is a high probability that it is healthy (60%) and a very low probability that the dog is sick (10%).

Now, what if you needed to discern the health of your dog over time given a sequence of observations?  

# observation sequence of dog's behaviors
# observations are encoded numerically

obs_map = {'sleeping':0, 'eating':1, 'pooping':2}
obs = np.array([1,1,2,1,0,1,2,1,0,2,2,0,1,0,1])

inv_obs_map = dict((v,k) for k, v in obs_map.items())
obs_seq = [inv_obs_map[v] for v in list(obs)]

print( pd.DataFrame(np.column_stack([obs, obs_seq]), 
                columns=['Obs_code', 'Obs_seq']) )

Using the Viterbi algorithm we can identify the most likely sequence of hidden states given the sequence of observations.

High level, the Viterbi algorithm increments over each time step, finding the maximum probability of any path that gets to state iat time t, that also has the correct observations for the sequence up to time t.

The algorithm also keeps track of the state with the highest probability at each stage. At the end of the sequence, the algorithm will iterate backwards selecting the state that "won" each time step, and thus creating the most likely path, or likely sequence of hidden states that led to the sequence of observations.

# define Viterbi algorithm for shortest path
# code adapted from Stephen Marsland's, Machine Learning An Algorthmic Perspective, Vol. 2

def viterbi(pi, a, b, obs):
    nStates = np.shape(b)[0]
    T = np.shape(obs)[0]
    # init blank path
    path = np.zeros(T)
    # delta --> highest probability of any path that reaches state i
    delta = np.zeros((nStates, T))
    # phi --> argmax by time step for each state
    phi = np.zeros((nStates, T))
    # init delta and phi 
    delta[:, 0] = pi * b[:, obs[0]]
    phi[:, 0] = 0

    print('\nStart Walk Forward\n')    
    # the forward algorithm extension
    for t in range(1, T):
        for s in range(nStates):
            delta[s, t] = np.max(delta[:, t-1] * a[:, s]) * b[s, obs[t]] 
            phi[s, t] = np.argmax(delta[:, t-1] * a[:, s])
            print('s={s} and t={t}: phi[{s}, {t}] = {phi}'.format(s=s, t=t, phi=phi[s, t]))
    # find optimal path
    print('Start Backtrace\n')
    path[T-1] = np.argmax(delta[:, T-1])
    #p('init path\n    t={} path[{}-1]={}\n'.format(T-1, T, path[T-1]))
    for t in range(T-2, -1, -1):
        path[t] = phi[path[t+1], [t+1]]
        #p(' '*4 + 't={t}, path[{t}+1]={path}, [{t}+1]={i}'.format(t=t, path=path[t+1], i=[t+1]))
        print('path[{}] = {}'.format(t, path[t]))
    return path, delta, phi

path, delta, phi = viterbi(pi, a, b, obs)
print('\nsingle best state path: \n', path)
print('delta:\n', delta)
print('phi:\n', phi)

Let's take a look at the result. 

state_map = {0:'healthy', 1:'sick'}
state_path = [state_map[v] for v in path]


A Hidden Markov Model for Regime Detection

By now you're probably wondering how we can apply what we have learned about hidden Markov models to quantitative finance.

Consider that the largest hurdle we face when trying to apply predictive techniques to asset returns is nonstationary time series.  In brief, this means that the expected mean and volatility of asset returns changes over time. 

Most time series models assume that the data is stationary. This is a major weakness of these models. 

Instead, let us frame the problem differently. We know that time series exhibit temporary periods where the expected means and variances are stable through time. These periods or regimes can be likened to hidden states.

If that's the case, then all we need are observable variables whose behavior allows us to infer the true hidden state(s). If we can better estimate an asset's most likely regime, including the associated means and variances, then our predictive models become more adaptable and will likely improve. We can also become better risk managers as the estimated regime parameters gives us a great framework for better scenario analysis.  

In this example, the observable variables I use are: the underlying asset returns, the Ted Spread, the 10 year - 2 year constant maturity spread, and the 10 year - 3 month constant maturity spread.

import pandas as pd
import as web
import sklearn.mixture as mix

import numpy as np
import scipy.stats as scs

import matplotlib as mpl
from matplotlib import cm
import matplotlib.pyplot as plt
from matplotlib.dates import YearLocator, MonthLocator
%matplotlib inline

import seaborn as sns
import missingno as msno
from tqdm import tqdm

Using pandas we can grab data from Yahoo Finance and FRED. 

# get fed data

f1 = 'TEDRATE' # ted spread
f2 = 'T10Y2Y' # constant maturity ten yer - 2 year
f3 = 'T10Y3M' # constant maturity 10yr - 3m

start = pd.to_datetime('2002-01-01')
end =

mkt = 'SPY'
MKT = (web.DataReader([mkt], 'yahoo', start, end)['Adj Close']
       .assign(sret=lambda x: np.log(x[mkt]/x[mkt].shift(1)))

data = (web.DataReader([f1, f2, f3], 'fred', start, end)
        .join(MKT, how='inner')


# gives us a quick visual inspection of the data

Next we will use the sklearn's GaussianMixture to fit a model that estimates these regimes. We will explore mixture models  in more depth in part 2 of this series. The important takeaway is that mixture models implement a closely related unsupervised form of density estimation. It makes use of the expectation-maximization algorithm to estimate the means and covariances of the hidden states (regimes). For now, it is ok to think of it as a magic button for guessing the transition and emission probabilities, and most likely path. 

We have to specify the number of components for the mixture model to fit to the time series. In this example the components can be thought of as regimes. We will arbitrarily classify the regimes as High, Neutral and Low Volatility and set the number of components to three.

# code adapted from
# for sklearn 18.1

col = 'sret'
select = data.ix[:].dropna()

ft_cols = [f1, f2, f3, 'sret']
X = select[ft_cols].values

model = mix.GaussianMixture(n_components=3, 

# Predict the optimal sequence of internal hidden state
hidden_states = model.predict(X)

print("Means and vars of each hidden state")
for i in range(model.n_components):
    print("{0}th hidden state".format(i))
    print("mean = ", model.means_[i])
    print("var = ", np.diag(model.covariances_[i]))

style_kwds = {'xtick.major.size': 3, 'ytick.major.size': 3,
              '':u'courier prime code', 'legend.frameon': True}
sns.set_style('white', style_kwds)

fig, axs = plt.subplots(model.n_components, sharex=True, sharey=True, figsize=(12,9))
colors = cm.rainbow(np.linspace(0, 1, model.n_components))

for i, (ax, color) in enumerate(zip(axs, colors)):
    # Use fancy indexing to plot data in each state.
    mask = hidden_states == i
                 ".-", c=color)
    ax.set_title("{0}th hidden state".format(i), fontsize=16, fontweight='demi')

    # Format the ticks.

fig.savefig('Hidden Markov (Mixture) Model_Regime Subplots.png')

In the above image, I've highlighted each regime's daily expected mean and variance of SPY returns. It appears the 1th hidden state is our low volatility regime. Note that the 1th hidden state has the largest expected return and the smallest variance.The 0th hidden state is the neutral volatility regime with the second largest return and variance. Lastly the 2th hidden state is high volatility regime. We can see the expected return is negative and the variance is the largest of the group. 

states = (pd.DataFrame(hidden_states, columns=['states'], index=select.index)
          .join(select, how='inner')

sns.set_style('white', style_kwds)
order = [0, 1, 2]
fg = sns.FacetGrid(data=states, hue='states', hue_order=order,
                   palette=scolor, aspect=1.31, size=12), 'Date', mkt, alpha=0.8).add_legend()
fg.fig.suptitle('Historical SPY Regimes', fontsize=24, fontweight='demi')
fg.savefig('Hidden Markov (Mixture) Model_SPY Regimes.png')

Here is the SPY price chart with the color coded regimes overlaid. 


In this post we've discussed the concepts of the Markov property, Markov models and hidden Markov models. We used the networkx package to create Markov chain diagrams, and sklearn's GaussianMixture to estimate historical regimes. In part 2 we will discuss mixture models more in depth. For more detailed information I would recommend looking over the references. is especially helpful in covering any gaps due to the highly interactive visualizations.