# 7.3. Time series

Note

Time series are available starting with quasardb 2.1.0.

## 7.3.1. Goal

Time series are meant to be used in demanding environments where in one second, millions of events have to be recorded and indexed with very high precision, and kept for as long as needed.

At quasardb, when we decided to implement time series, we did it with the following goals:

• Limitless: It must be possible to record thousands of events every microsecond, for one hundred years, without ever removing a single line.
• Reliable: Writes must be durable and reads consistent. Period.
• Flexible: Although quasardb has server-side aggregations and computations, the user may manipulate the data with her own tools such as Python or R. Extracting a subset of a time series must be simple, fast, and efficient. When inserting data, the user must not have to define a complex schema and can change her mind afterwards.
• Interactive: Transfers, computations and aggregations must be so fast that analytics can access quasardb directly, regardless of the amount of data stored, to enable the analyst to work interactively.
• Transparent: When a user wants to average the values of a column, it should not be her concern whether the time series resides on a single node or is distributed over a large cluster, or whether 10,000 users are doing the same thing at the same time. The database must solve all the distribution, concurrency and reduction problems and present a simple interface to the user.


## 7.3.2. Features

### Reliability

The transactional backend of quasardb ensures that insertion within a time series is consistent and reliable (see Transactions).

### Performance

Thanks to its unique I/O engine, a single quasardb node can typically deliver millions of insertions and lookups per second.

The I/O engine is capable of leveraging the Mellanox Messaging Accelerator (VMA) of Mellanox network cards.

### Columns

Each time series supports an arbitrary number of columns, each identified by a unique name.

A database may have as many time series as the storage allows. Each time series can be looked up by its name or by one of its tags (see queries).

For each column, the following data types are currently supported:

• Double precision (64-bit) floating point numbers
• Binary large objects (blob) of any size
• 64-bit signed integers
• Nanosecond timestamps

### Distribution

Time series are distributed over the cluster transparently. Each node in the cluster is responsible for a certain time interval of any given time series. This time interval is set when you create the time series, and can be different for every time series. By default, in most APIs, it is set to 24 hours (one day).
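The routing can be pictured with a small sketch (purely illustrative, not quasardb's actual implementation): with a fixed shard duration, the bucket owning a timestamp follows from an integer division, so all points of the same day land in the same bucket.

```python
import datetime

# Illustrative sketch only: quasardb's actual routing is internal.
# With a 24-hour shard, every timestamp of the same day maps to the
# same bucket, and buckets are distributed across the cluster.
SHARD = datetime.timedelta(hours=24)
EPOCH = datetime.datetime(1970, 1, 1)

def bucket_index(ts: datetime.datetime) -> int:
    """Index of the shard that owns this timestamp."""
    return (ts - EPOCH) // SHARD

a = bucket_index(datetime.datetime(2016, 11, 28, 14, 28, 32))
b = bucket_index(datetime.datetime(2016, 11, 28, 15, 12, 33))
c = bucket_index(datetime.datetime(2016, 11, 29, 0, 0, 1))
# a == b (same day, same shard); c falls in the next shard
```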

Time series support value by value insertion, as well as efficient bulk loading. Inserting multiple lines in one call can yield significant performance improvements as it gives the opportunity for the database to optimize network traffic, memory allocations and disk writes.

Look-up by time index is efficient, with an amortized constant complexity O(1): the cost of fetching a sub-part of a time series is independent of the total length of the series.

### Size

There is no limitation on the number of entries a time series may hold, other than the storage capacity of the cluster.

### Resolution

Each point in time is represented with a very-high precision 128-bit timestamp with nanosecond granularity, which makes it possible to represent any point in time for the duration of the universe.
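A quick back-of-the-envelope calculation shows why 64 bits are not enough at this granularity:

```python
# A signed 64-bit counter of nanoseconds overflows after a few centuries.
ns_per_year = 365 * 24 * 3600 * 10**9
years_in_63_bits = (2**63) // ns_per_year   # signed 64-bit range: 292 years

# A 128-bit timestamp extends the range far beyond the age of the
# universe (roughly 1.4e10 years).
years_in_127_bits = (2**127) // ns_per_year
```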

### Real-time aggregation

Time series values are stored in cache-aware data structures. Aggregations are vectorized using the available enhanced instructions set of the processor.

For example, an Intel Xeon E5-2670 can sum a column in the region of 3 billion rows per second per core thanks to the SSE 4.2 and AVX instruction sets.

When the time interval spans several nodes, the API will transparently distribute the computation over multiple nodes, and perform the adequate reductions.
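The idea can be sketched in a few lines (illustrative only; the node names and data are made up, and the real protocol is internal to quasardb): each node computes a partial aggregate over the shards it owns, and the client reduces the partials into the final result.

```python
# Hypothetical layout: two nodes, each owning some shards of one column.
node_buckets = {
    "node-a": [[1.0, 2.0], [3.0]],   # shards owned by node A
    "node-b": [[4.0, 5.0]],          # shards owned by node B
}

# map: each node computes a partial sum over its own shards
partials = {node: sum(v for shard in shards for v in shard)
            for node, shards in node_buckets.items()}

# reduce: the partial sums are combined into the final answer
total = sum(partials.values())  # 15.0
```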

### Efficient storage

While preserving the 128-bit resolution of each timestamp, each bucket only stores the 64-bit index relative to its time interval. Each 64-bit index is compressed using a lossless compression algorithm that stores only the second-degree variations, leveraging the column-oriented nature of quasardb.
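As an illustration of the principle only (not quasardb's actual on-disk format), a delta-of-delta encoder stores each index as the variation of its variation, so regularly spaced timestamps collapse to runs of zeros, which compress extremely well:

```python
def delta_of_delta(indices):
    """Encode integer indices as second-degree variations."""
    out = []
    prev, prev_delta = 0, 0
    for idx in indices:
        delta = idx - prev
        out.append(delta - prev_delta)  # second-degree variation
        prev, prev_delta = idx, delta
    return out

# One tick every 1000 ns: after the first two values, everything is zero.
ticks = [0, 1000, 2000, 3000, 4000]
encoded = delta_of_delta(ticks)   # [0, 1000, 0, 0, 0]
```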

Integer values are compressed to disk using an algorithm similar to the one used for timestamps.

Double and blob values are currently not compressed.

When a time interval does not contain data, it does not use any space. Thus, discontinuous data is natively supported and there is no need to “clean up” the data before inserting it into quasardb.

## 7.3.3. Usage

In the next examples, we’ll assume we want to work on the following time series, named “tick42”:

| Timestamp               | Price | Size | Exchange |
| ----------------------- | ----- | ---- | -------- |
| 2016-11-28 14:28:32.213 | 243   | 100  | P        |
| 2016-11-28 14:28:33.124 | 243   | 200  | P        |
| 2016-11-28 15:12:33.024 | 300   | 400  | T        |

### Creation

A time series must first be created, and its columns defined. The type of a column is fixed for the lifetime of the column. It is possible to add, remove and rename columns after the time series has been created.

For example, to create the following time series in Python:

import quasardb

# assuming a node on the localhost
c = quasardb.Cluster('qdb://127.0.0.1:2836')
ts = c.ts("tick42")
ts.create([quasardb.TimeSeries.DoubleColumnInfo("Price"),
           quasardb.TimeSeries.DoubleColumnInfo("Size"),
           quasardb.TimeSeries.BlobColumnInfo("Exchange")])


### Data insertion

quasardb supports two modes of data insertion for time series:

• Column oriented (default API)
• Row oriented (batch table API)

Data can be sparse and concurrent insertion is supported.

Which API you should use depends on the form of the data you wish to insert, its availability, and which API is the most convenient for you.

Warning

Not every API delivers nanosecond resolution for timestamps during insertion and lookup. This can be due to inherent limitations of the language. Internally, every value has a timestamp with nanosecond granularity regardless of the language and platform used.
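For instance, Python's built-in datetime type stops at microsecond resolution, which is why the Python examples in this section build timestamps with numpy's nanosecond-capable np.datetime64 instead:

```python
import datetime

# The finest unit a Python datetime can carry is one microsecond;
# any nanosecond component is simply not representable with it.
t = datetime.datetime(2016, 11, 28, 14, 28, 32, 213000)

# datetime.resolution is the smallest representable difference
finest = datetime.datetime.resolution   # timedelta(microseconds=1)
```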

#### Column-oriented data insertion

To insert the first line in our example:

import numpy as np

ts.double_insert("Price", [np.datetime64('2016-11-28T14:28:32.213', 'ns')], [243.0])
ts.double_insert("Size", [np.datetime64('2016-11-28T14:28:32.213', 'ns')], [100.0])
ts.blob_insert("Exchange", [np.datetime64('2016-11-28T14:28:32.213', 'ns')], [b'P'])


Column-oriented insertion can be extremely efficient if your data is already available in columns, as the protocol performs minimal data transformation when sending it to the server, and the server stores it almost “as is” (after sanitization).

When using the column-oriented API, you ideally want to insert each block by shard size.
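For example, with the default 24-hour shard, rows can be grouped by calendar day before insertion, so that each insert call covers exactly one shard (a hypothetical pre-processing step, not part of the quasardb API):

```python
import datetime
from itertools import groupby

# Hypothetical (timestamp, value) pairs, already sorted by time.
rows = [
    (datetime.datetime(2016, 11, 28, 14, 28, 32), 243.0),
    (datetime.datetime(2016, 11, 28, 15, 12, 33), 300.0),
    (datetime.datetime(2016, 11, 29, 9, 0, 0), 310.0),
]

def shard_key(row):
    # one shard per calendar day with the default 24-hour interval
    return row[0].date()

# one batch of rows per shard; each batch can then be sent in a
# single column-oriented insert call
batches = {day: list(group) for day, group in groupby(rows, key=shard_key)}
```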

#### Row-oriented data insertion

The row oriented API is available through the batch table API.

You first need to tell the batch API how the data should be pushed and what the typical batch size is. The batch size hint helps the API pre-allocate structures for more efficient operations.

Here we will describe our three columns and tell the API to expect 100 rows every push:

columns = (quasardb.BatchColumnInfo("tick42", "Price", 100),
           quasardb.BatchColumnInfo("tick42", "Size", 100),
           quasardb.BatchColumnInfo("tick42", "Exchange", 100))


Then we create an inserter, and add the data to it:

b = c.ts_batch(columns)
b.start_row(np.datetime64('2016-11-28T14:28:32.213', 'ns'))
b.set_double(0, 243.0)   # Price
b.set_double(1, 100.0)   # Size
b.set_blob(2, b'P')      # Exchange


When you add data to the batch, the data isn’t actually written to the database but buffered in the client. Only when you call the push method will the data be sent to the database.

# insert all data at once
b.push()


Push is transactional and will return when all data is written. The batch API optimizes the message sizes for the best performance and is aware of the shard size of the time series you are pushing to.

The column-oriented and row-oriented data insertion methods described above are synchronous and transactional. When the amount of data to insert is large enough, this both yields excellent performance and very good reliability.

However, it is not always possible to buffer the updates long enough and this can create the following problems:

• Poor performance due to the overhead of the transactions
• Write amplification because of MVCC (small updates create multiple versions of the same bucket)
• Inefficient network usage because of retries caused by transaction conflicts

This is why the batch API has an additional call that enables asynchronous updates (see Asynchronous time series inserter). Taking the example above, you would do:

# push all the data to the cluster that will buffer the update
# the call returns when the cluster has successfully buffered the update
# however the data is NOT committed to the database
b.push_async()


Asynchronous updates can offer a significant performance boost for burst updates or when insertions are done row by row. The asynchronous time series inserter can be multithreaded and supports a configurable deadline as well as buffer size to fit every use case.

However, this comes at a price: updates are no longer transactional and will not be immediately visible.

### Time series lookup

Time series can be discovered by key, by tag or by affix, like any other entry (see queries).

It is possible to enumerate the columns of a time series at any time.

In Python, to enumerate the columns of a time series:

# columns will be returned in the order they were created
cols = ts.columns()

# it is also possible to access a specific column
col_price = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Price"))
line_ts = np.datetime64('2016-11-28T14:28:33.124', 'ns')
col_price.insert([(line_ts, 243.0)])


### Fetching the values of time series

Values are obtained by time interval. The complexity of the operation is independent of the size of the time series. Most APIs support querying multiple ranges in one call to minimize network traffic.

Warning

The number of returned values can be very large.

For example, to get all the prices of March 25th, 2016:

import datetime

col_price = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Price"))
all_prices = col_price.get_ranges([(datetime.datetime(2016, 3, 25, 0, 0, 0),
                                    datetime.datetime(2016, 3, 25, 23, 59, 59, 999999))])


### Server-side aggregation

Aggregations are done on ranges. A single aggregation will not be multithreaded on a single server, however, a server supports multiple aggregations on the same (or different) time series in parallel and these aggregations will occur in separate threads.

Aggregations on floating-point values are done at 64-bit precision.

If we wanted the total volume for March 25th, 2016:

col_size = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Size"))

# volumes[0] will have the total volume
volumes = col_size.aggregate(quasardb.TimeSeries.Aggregation.sum,
                             [(datetime.datetime(2016, 3, 25, 0, 0, 0),
                               datetime.datetime(2016, 3, 25, 23, 59, 59, 999999))])