datpark

Data & Financial Analytics Examples

This IPython Notebook provides five advanced analytics examples:

  • derivatives analytics with DX Analytics
  • parallel computations with multiprocessing
  • regression analysis with R
  • Bayesian regression with PyMC3
  • vector autoregression with statsmodels

DX Analytics

DX Analytics is a Python library for advanced financial and derivatives analytics written by The Python Quants. It is particularly suited to modeling multi-risk derivatives and to consistently valuing portfolios of complex derivatives. It mainly relies on Monte Carlo simulation, one of the few numerical methods capable of valuing and risk managing complex, multi-risk derivatives books.
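The DX Analytics API itself is used below; as background, the core idea of Monte Carlo valuation (simulate risk-neutral terminal values, then average the discounted payoff) can be sketched in plain NumPy for a single-asset European call. All parameter values here are illustrative, not taken from the notebook:

```python
import numpy as np

# illustrative parameters (not from the notebook)
S0, K, T, r, sigma, I = 100., 105., 1.0, 0.05, 0.2, 200000

rng = np.random.default_rng(42)
z = rng.standard_normal(I)
# risk-neutral terminal values under geometric Brownian motion
ST = S0 * np.exp((r - 0.5 * sigma ** 2) * T + sigma * np.sqrt(T) * z)
# discounted mean payoff of a European call
C0 = np.exp(-r * T) * np.mean(np.maximum(ST - K, 0))
```

For this parameter set the estimate lands close to the Black-Scholes value of roughly 8.0.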

As an example, consider a European maximum call option on two underlyings.

In [3]:
import numpy as np
import dx
%run dx_example.py
  # sets up market environments
  # and defines derivative instrument
In [4]:
max_call.payoff_func
  # payoff of a maximum call option
  # on two underlyings (European exercise)
Out[4]:
"np.maximum(np.maximum(maturity_value['gbm'], maturity_value['jd']) - 34., 0)"
In [5]:
max_call.vega('jd')
  # numerical Vega with respect
  # to one risk factor
Out[5]:
2.200000000000024
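The Vega above is computed numerically by dx; the generic idea is a finite difference in the volatility parameter. A minimal sketch against a Black-Scholes call (the closed-form value serves only as the pricing function here; all numbers are illustrative):

```python
from math import log, sqrt, exp
from statistics import NormalDist

def bs_call(S0, K, T, r, sigma):
    # Black-Scholes-Merton value of a European call
    d1 = (log(S0 / K) + (r + 0.5 * sigma ** 2) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    N = NormalDist().cdf
    return S0 * N(d1) - K * exp(-r * T) * N(d2)

def numerical_vega(S0, K, T, r, sigma, h=1e-4):
    # central difference with respect to volatility
    return (bs_call(S0, K, T, r, sigma + h)
            - bs_call(S0, K, T, r, sigma - h)) / (2 * h)
```

For an at-the-money call (S0 = K = 100, T = 1, r = 0.05, sigma = 0.2) this reproduces the analytic Vega of about 37.5.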

We are going to generate a Vega surface for one risk factor with respect to the initial values of both risk factors.

In [6]:
asset_1 = np.arange(28., 46.1, 2.)
asset_2 = asset_1
a_1, a_2 = np.meshgrid(asset_1, asset_2)
value = np.zeros_like(a_1)
In [7]:
%%time
vega_gbm = np.zeros_like(a_1)
for i in range(np.shape(vega_gbm)[0]):
    for j in range(np.shape(vega_gbm)[1]):
        max_call.update('gbm', initial_value=a_1[i, j])
        max_call.update('jd', initial_value=a_2[i, j])
        vega_gbm[i, j] = max_call.vega('gbm')
CPU times: user 4 s, sys: 1 ms, total: 4 s
Wall time: 4 s

In [9]:
%matplotlib inline
dx.plot_greeks_3d([a_1, a_2, vega_gbm], ['gbm', 'jd', 'vega gbm'])
  # Vega surface plot

Parallel Processing

Monte Carlo simulation is a computationally demanding task that the financial industry nowadays generally runs at large scale (e.g., for Value-at-Risk or Credit Valuation Adjustment calculations).

In [10]:
import math
import numpy as np
import matplotlib.pyplot as plt

This function simulates a geometric Brownian motion.

In [11]:
def simulate_geometric_brownian_motion(p):
    M, I = p
      # time steps, paths
    S0 = 100; r = 0.05; sigma = 0.2; T = 1.0
      # model parameters
    dt = T / M
    paths = np.zeros((M + 1, I))
    paths[0] = S0
    for t in range(1, M + 1):
        paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt +
                    sigma * math.sqrt(dt) * np.random.standard_normal(I))
    return paths

An example simulation with the function.

In [12]:
%time paths = simulate_geometric_brownian_motion((50, 100000))
  # example simulation
CPU times: user 292 ms, sys: 2 ms, total: 294 ms
Wall time: 293 ms
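As a quick sanity check (not part of the original notebook), the mean simulated terminal value should approximate S0 * exp(r * T); a compact, seeded variant of the function above:

```python
import math
import numpy as np

def simulate_gbm(M, I, S0=100., r=0.05, sigma=0.2, T=1.0, seed=0):
    # same log-Euler scheme as above, with a fixed seed for reproducibility
    rng = np.random.default_rng(seed)
    dt = T / M
    paths = np.zeros((M + 1, I))
    paths[0] = S0
    for t in range(1, M + 1):
        paths[t] = paths[t - 1] * np.exp((r - 0.5 * sigma ** 2) * dt
                       + sigma * math.sqrt(dt) * rng.standard_normal(I))
    return paths

paths = simulate_gbm(50, 100000)
# martingale property: mean terminal value is close to S0 * exp(r * T)
```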

In [13]:
plt.plot(paths[:, :10]); plt.grid()

Now using the multiprocessing module of Python.

In [14]:
from time import time
import multiprocessing as mp
In [15]:
I = 7500  # number of paths
M = 50  # number of time steps
t = 100  # number of tasks/simulations
In [16]:
# running with up to 8 worker processes
times = []
for w in range(1, 9):
    t0 = time()
    pool = mp.Pool(processes=w)
    result = pool.map(simulate_geometric_brownian_motion, t * [(M, I)])
    pool.close(); pool.join()
      # shut down worker processes
    times.append(time() - t0)

And the performance results visualized.

In [17]:
plt.plot(range(1, 9), times)
plt.plot(range(1, 9), times, 'ro')
plt.grid(True)
plt.xlabel('number of processes')
plt.ylabel('time in seconds')
plt.title('%d Monte Carlo simulations' % t)
Out[17]:
<matplotlib.text.Text at 0x7f56d954e0d0>

Statistics with R

We analyze the statistical correlation between the EURO STOXX 50 stock index and the VSTOXX volatility index.

First the EURO STOXX 50 data.

In [18]:
import pandas as pd
cols = ['Date', 'SX5P', 'SX5E', 'SXXP', 'SXXE',
        'SXXF', 'SXXA', 'DK5F', 'DKXF', 'DEL']
es_url = 'http://www.stoxx.com/download/historical_values/hbrbcpe.txt'
try:
    es = pd.read_csv(es_url,  # filename
                     header=None,  # ignore column names
                     index_col=0,  # index column (dates)
                     parse_dates=True,  # parse these dates
                     dayfirst=True,  # format of dates
                     skiprows=4,  # ignore these rows
                     sep=';',  # data separator
                     names=cols)  # use these column names

    # deleting the helper column
    del es['DEL']
except Exception:
    # read stored data if there is no Internet connection
    es = pd.HDFStore('data/SX5E.h5', 'r')['SX5E']

Second, the VSTOXX data.

In [19]:
vs_url = 'http://www.stoxx.com/download/historical_values/h_vstoxx.txt'

try:
    vs = pd.read_csv(vs_url,  # filename
                     index_col=0,  # index column (dates)
                     parse_dates=True,  # parse date information
                     dayfirst=True, # day before month
                     header=2)  # header/column names
except Exception:
    # read stored data if there is no Internet connection
    vs = pd.HDFStore('data/V2TX.h5', 'r')['V2TX']

Bridging to R from within IPython Notebook and pushing Python data to the R run-time.

In [20]:
%load_ext rpy2.ipython
In [21]:
import numpy as np
# log returns for the major indices' time series data
datv = pd.DataFrame({'SX5E' : es['SX5E'], 'V2TX': vs['V2TX']}).dropna()
rets = np.log(datv / datv.shift(1)).dropna()
ES = rets['SX5E'].values
VS = rets['V2TX'].values
In [22]:
%Rpush ES VS

Plotting with R in IPython Notebook.

In [23]:
%R plot(ES, VS, pch=19, col='blue'); grid(); title("Log returns ES50 & VSTOXX")

Linear regression with R.

In [24]:
%R c = coef(lm(VS~ES))
Out[24]:
<FloatVector - Python:0x7f56db08be18 / R:0x45893e8>
[0.000050, -2.789670]
In [25]:
%R plot(ES, VS, pch=19, col='blue'); grid(); abline(c, col='red', lwd=5)

Pulling data from R back to Python.

In [26]:
%Rpull c
In [27]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(9, 6))
plt.plot(ES, VS, 'b.')
plt.plot(ES, c[0] + c[1] * ES, 'r', lw=3)
plt.grid(); plt.xlabel('ES'); plt.ylabel('VS')
Out[27]:
<matplotlib.text.Text at 0x7f56dae39450>
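The same least-squares fit does not require R; np.polyfit delivers it directly. A sketch on synthetic return series (the real ES/VS arrays require the downloads above, so the data below is simulated with an assumed slope of -2.8):

```python
import numpy as np

rng = np.random.default_rng(7)
# synthetic stand-ins for the ES (index) and VS (volatility) log returns
es = rng.normal(0.0, 0.01, 1000)
vs = -2.8 * es + rng.normal(0.0, 0.005, 1000)  # negative, leverage-type relation

slope, intercept = np.polyfit(es, vs, 1)
# slope recovers the assumed -2.8 up to sampling noise
```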

Bayesian Statistics

The example we use is a "classical" pairs trading strategy with gold and stocks of gold mining companies, both represented by ETFs (with symbols GLD and GDX, respectively).

We use zipline and PyMC3 for the analysis.

In [28]:
import numpy as np
import pymc as pm
import zipline
import pytz
import datetime as dt

First, we load the data from the Web.

In [29]:
try:
    datg = zipline.data.load_from_yahoo(stocks=['GLD', 'GDX'], 
             end=dt.datetime(2014, 3, 15, 0, 0, 0, 0, pytz.utc)).dropna()
except Exception:
    datg = pd.HDFStore('data/gold.h5', 'r')['datg']
%matplotlib inline
datg.plot(figsize=(9, 5))

Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f56744dfed0>

A scatter plot of the value pairs over time and a simple linear regression.

In [30]:
import matplotlib as mpl; import matplotlib.pyplot as plt
mpl_dates = mpl.dates.date2num(datg.index.to_pydatetime())
plt.figure(figsize=(9, 5))
plt.scatter(datg['GDX'], datg['GLD'], c=mpl_dates, marker='o')
reg = np.polyfit(datg['GDX'], datg['GLD'], 1)
plt.plot(datg['GDX'], np.polyval(reg, datg['GDX']), 'r-', lw=3)
plt.grid(True); plt.xlabel('GDX'); plt.ylabel('GLD')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
             format=mpl.dates.DateFormatter('%d %b %y'))
Out[30]:
<matplotlib.colorbar.Colorbar instance at 0x7f56740e8050>

We implement a Bayesian random walk model (I).
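In symbols, the model assembled over the next few cells lets the regression coefficients follow Gaussian random walks (a summary of the code, in standard notation):

```latex
\alpha_t \sim \mathcal{N}\bigl(\alpha_{t-1}, \sigma_\alpha^2\bigr), \qquad
\beta_t \sim \mathcal{N}\bigl(\beta_{t-1}, \sigma_\beta^2\bigr), \qquad
\mathrm{GLD}_t \sim \mathcal{N}\bigl(\alpha_t + \beta_t \,\mathrm{GDX}_t,\; \mathrm{sd}^2\bigr)
```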

In [31]:
model_randomwalk = pm.Model()
with model_randomwalk:
    sigma_alpha, log_sigma_alpha = \
            model_randomwalk.TransformedVar('sigma_alpha', 
                            pm.Exponential.dist(1. / .02, testval=.1), 
                            pm.logtransform)
    sigma_beta, log_sigma_beta = \
            model_randomwalk.TransformedVar('sigma_beta', 
                            pm.Exponential.dist(1. / .02, testval=.1),
                            pm.logtransform)

We implement a Bayesian random walk model (II).

In [32]:
from pymc.distributions.timeseries import GaussianRandomWalk

# take samples of 50 elements each
subsample_alpha = 50
subsample_beta = 50

with model_randomwalk:
    alpha = GaussianRandomWalk('alpha', sigma_alpha**-2, 
                               shape=len(datg) / subsample_alpha)
    beta = GaussianRandomWalk('beta', sigma_beta**-2, 
                              shape=len(datg) / subsample_beta)

    alpha_r = np.repeat(alpha, subsample_alpha)
    beta_r = np.repeat(beta, subsample_beta)

We implement a Bayesian random walk model (III).

In [33]:
with model_randomwalk:
    # define regression
    regression = alpha_r + beta_r * datg.GDX.values[:1950]
    
    sd = pm.Uniform('sd', 0, 20)
    likelihood = pm.Normal('GLD', mu=regression, 
                sd=sd, observed=datg.GLD.values[:1950])

We implement a Bayesian random walk model (IV).

In [34]:
%%time
import warnings; warnings.simplefilter('ignore')
import scipy.optimize as sco
with model_randomwalk:
    # first optimize random walk
    start = pm.find_MAP(vars=[alpha, beta], fmin=sco.fmin_l_bfgs_b)
    
    # sampling
    step = pm.NUTS(scaling=start)
    trace_rw = pm.sample(100, step, start=start, progressbar=False)

The plot of the regression coefficients over time.

In [35]:
part_dates = np.linspace(min(mpl_dates), max(mpl_dates), 39)
fig, ax1 = plt.subplots(figsize=(10, 5))
plt.plot(part_dates, np.mean(trace_rw['alpha'], axis=0),
         'b', lw=2.5, label='alpha')
for i in range(45, 55):
    plt.plot(part_dates, trace_rw['alpha'][i], 'b-.', lw=0.75)
plt.xlabel('date'); plt.ylabel('alpha'); plt.axis('tight')
plt.grid(True); plt.legend(loc=2)
ax1.xaxis.set_major_formatter(mpl.dates.DateFormatter('%d %b %y') )
ax2 = ax1.twinx()
plt.plot(part_dates, np.mean(trace_rw['beta'], axis=0),
         'r', lw=2.5, label='beta')
for i in range(45, 55):
    plt.plot(part_dates, trace_rw['beta'][i], 'r-.', lw=0.75)
plt.ylabel('beta'); plt.legend(loc=4); fig.autofmt_xdate()

The plot of the regression lines over time.

In [36]:
plt.figure(figsize=(10, 5))
plt.scatter(datg['GDX'], datg['GLD'], c=mpl_dates, marker='o')
plt.colorbar(ticks=mpl.dates.DayLocator(interval=250),
             format=mpl.dates.DateFormatter('%d %b %y'))
plt.grid(True); plt.xlabel('GDX'); plt.ylabel('GLD')
x = np.linspace(min(datg['GDX']), max(datg['GDX'])) 
for i in range(39):
    alpha_rw = np.mean(trace_rw['alpha'].T[i])
    beta_rw = np.mean(trace_rw['beta'].T[i]) 
    plt.plot(x, alpha_rw + beta_rw * x, color=plt.cm.jet(256 * i / 39))

Vector Autoregression

Let us apply vector autoregression (VAR) to the financial time series data we have:

  • GLD
  • GDX
  • EURO STOXX 50
  • VSTOXX
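For reference, a vector autoregression of order p models the vector y_t of the four series jointly as a linear function of its own lags (standard notation, not from the notebook):

```latex
y_t = c + A_1 y_{t-1} + \cdots + A_p y_{t-p} + u_t,
\qquad u_t \sim \mathcal{N}(0, \Sigma)
```

With the BIC criterion used in the fit below, statsmodels selects the lag order automatically; the L1.* coefficients in the summary show that p = 1 is chosen here.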

Let us resample and join the data sets.

In [37]:
datv.index = datv.index.tz_localize(pytz.utc)
datf = datg.join(datv, how='left')
datf = datf.resample('1M', how='last')
  # resampling to monthly data
# datf = datf / datf.ix[0] * 100
  # uncomment for normalized starting values
# datf = np.log(datf / datf.shift(1)).dropna()
  # uncomment for log return based analysis

The starting values of the time series data we use.

In [38]:
datf.head()
Out[38]:
                             GDX    GLD     SX5E     V2TX
Date
2006-05-31 00:00:00+00:00  36.67  64.23  3637.17  23.0529
2006-06-30 00:00:00+00:00  36.53  61.23  3648.92  18.3282
2006-07-31 00:00:00+00:00  36.58  63.16  3691.87  18.5171
2006-08-31 00:00:00+00:00  38.31  62.29  3808.70  16.1689
2006-09-30 00:00:00+00:00  33.65  59.47  3899.41  16.2455

We use the VAR class of the statsmodels library.

In [39]:
from statsmodels.tsa.api import VAR
model = VAR(datf)
In [40]:
lags = 5
  # number of lags used for fitting
results = model.fit(maxlags=lags, ic='bic')
  # model fitted to data

The summary statistics of the model fit.

In [41]:
results.summary()
Out[41]:
  Summary of Regression Results   
==================================
Model:                         VAR
Method:                        OLS
Date:           Tue, 24, Mar, 2015
Time:                     22:06:55
--------------------------------------------------------------------
No. of Equations:         4.00000    BIC:                    18.7202
Nobs:                     94.0000    HQIC:                   18.3976
Log likelihood:          -1367.94    FPE:                7.85666e+07
AIC:                      18.1791    Det(Omega_mle):     6.38570e+07
--------------------------------------------------------------------
Results for equation GDX
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           3.630380         7.689488            0.472           0.638
L1.GDX          0.951839         0.052641           18.082           0.000
L1.GLD         -0.018955         0.025582           -0.741           0.461
L1.SX5E        -0.000277         0.001433           -0.193           0.847
L1.V2TX         0.050241         0.076313            0.658           0.512
==========================================================================

Results for equation GLD
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           4.573951        13.017921            0.351           0.726
L1.GDX          0.031190         0.089119            0.350           0.727
L1.GLD          0.959550         0.043309           22.156           0.000
L1.SX5E        -0.000581         0.002425           -0.240           0.811
L1.V2TX         0.047063         0.129194            0.364           0.717
==========================================================================

Results for equation SX5E
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const         673.458976       294.958237            2.283           0.025
L1.GDX          0.722644         2.019235            0.358           0.721
L1.GLD         -1.660690         0.981288           -1.692           0.094
L1.SX5E         0.872325         0.054955           15.874           0.000
L1.V2TX        -5.247642         2.927250           -1.793           0.076
==========================================================================

Results for equation V2TX
==========================================================================
             coefficient       std. error           t-stat            prob
--------------------------------------------------------------------------
const           3.537528         9.677326            0.366           0.716
L1.GDX          0.005033         0.066249            0.076           0.940
L1.GLD          0.002704         0.032195            0.084           0.933
L1.SX5E         0.000172         0.001803            0.095           0.924
L1.V2TX         0.820605         0.096041            8.544           0.000
==========================================================================

Correlation matrix of residuals
             GDX       GLD      SX5E      V2TX
GDX     1.000000  0.813648  0.061733 -0.158820
GLD     0.813648  1.000000 -0.058486 -0.081772
SX5E    0.061733 -0.058486  1.000000 -0.789067
V2TX   -0.158820 -0.081772 -0.789067  1.000000


Historical data and forecasts.

In [42]:
results.plot_forecast(50, figsize=(8, 8), offset='M')
  # historical/input data and
  # forecasts given model fit

Integrated simulation of the future development of the instruments' prices/levels, given the fitted model.

In [43]:
results.plotsim(paths=5, steps=50, prior=False,
                initial_values=datf.ix[-1].values,
                figsize=(8, 12), offset='M')
  # simulated paths given fitted model