datpark

TsTables – High Frequency Times Series Data with PyTables

TsTables is a Python library by Andy Fiedler built on top of the popular PyTables HDF5 database library. It is meant to handle large amounts of high frequency time series data in append once, retrieve many times scenarios (cf. Gihub page). The focus lies on retrieving chunks of data from large data sets as quickly as possible.

Sample Time Series Data

In [1]:
import numpy as np
import pandas as pd
import tables as tb
import tstables as tstb
from datetime import datetime 
%matplotlib inline

Let us generate a decent amount of sample data points.

In [2]:
no = 5000000
co = 3
dt = 1. / (12 * 30 * 24 * 60)
vol = 0.2

We generate one second data.

In [3]:
dr = pd.date_range('2014-1-1', periods=no, freq='1s')
In [4]:
dr
Out[4]:
<class 'pandas.tseries.index.DatetimeIndex'>
[2014-01-01 00:00:00, ..., 2014-02-27 20:53:19]
Length: 5000000, Freq: S, Timezone: None

In memory generation is quite quick.

In [5]:
%%time
da = 100 * np.exp(np.cumsum(-0.5 * vol ** 2 * dt +
        vol * np.sqrt(dt) * np.random.standard_normal((no, co)), axis=0))
da[0] = 100
CPU times: user 955 ms, sys: 44 ms, total: 999 ms
Wall time: 998 ms

In [6]:
df = pd.DataFrame(da, index=dr, columns=['ts1', 'ts2', 'ts3'])
In [7]:
df.count()
Out[7]:
ts1    5000000
ts2    5000000
ts3    5000000
dtype: int64

The starting values of the three time series.

In [8]:
df.head()
Out[8]:
ts1 ts2 ts3
2014-01-01 00:00:00 100.000000 100.000000 100.000000
2014-01-01 00:00:01 99.959477 99.979072 100.102487
2014-01-01 00:00:02 99.915401 100.026083 100.099791
2014-01-01 00:00:03 99.952855 100.061231 100.071574
2014-01-01 00:00:04 99.992656 100.025757 100.079362

And a plot of the time series data (every 100000th point).

In [9]:
df[::100000].plot()
Out[9]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f490327a0d0>

Storage and Retrieval with TsTables

To store the time series data in a PyTables table we first define the table structure.

In [10]:
class TS(tb.IsDescription):
    timestamp = tb.Int64Col(pos=0)
    ts1 = tb.Float64Col(pos=1)
    ts2 = tb.Float64Col(pos=1)
    ts3 = tb.Float64Col(pos=1)

Second, open a database file and create the table object.

In [11]:
h5 = tb.open_file('ts.h5','w')

TsTables adds a new function create_ts to PyTables.

In [12]:
ts = h5.create_ts('/','TS', TS)

Third, we append the time series data to the table object.

In [13]:
%time ts.append(df)
CPU times: user 401 ms, sys: 167 ms, total: 568 ms
Wall time: 596 ms

In [14]:
ls -n *.h5
-rw-r--r-- 1 1954 8 157038320 Apr  8 21:41 ts.h5

The strength of TsTables lies in retrieving chunks of time series data defined by a start date and an end date (which obviously is a typical case in finance, e.g. in backtesting strategies or risk management).

In [15]:
read_start_dt = datetime(2014, 2, 1, 0, 0)
read_end_dt = datetime(2014, 2, 5, 23, 59)

TsTables tries to make such an operation as fast as possible.

In [16]:
%time rows = ts.read_range(read_start_dt, read_end_dt)
CPU times: user 36 ms, sys: 10 ms, total: 46 ms
Wall time: 45.4 ms

In [17]:
%timeit rows = ts.read_range(read_start_dt, read_end_dt)
10 loops, best of 3: 31.9 ms per loop

Conveniently, the returned object is a pandas DataFrame.

In [18]:
rows.count()
Out[18]:
ts1    431941
ts2    431941
ts3    431941
dtype: int64
In [19]:
rows.head()
Out[19]:
ts1 ts2 ts3
2014-02-01 00:00:00 163.014116 114.570739 62.729378
2014-02-01 00:00:01 163.047768 114.609545 62.760659
2014-02-01 00:00:02 163.023880 114.627010 62.763612
2014-02-01 00:00:03 162.971144 114.649092 62.754083
2014-02-01 00:00:04 162.987766 114.623626 62.748843
In [20]:
rows[::100].plot()
Out[20]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f4902ff46d0>
In [21]:
h5.close()
In [22]:
!rm ts.h5