# 6.3. Time series¶

Note

Time series are available starting with quasardb 2.1.0.

## 6.3.1. Goal¶

Time series are meant to be used in demanding environments where in one second, millions of events have to be recorded and indexed with very high precision, and kept for as long as needed.

At quasardb, when we decided to implement time series, we did it with the following goals:

Limitless: It must be possible to record thousands of events every microsecond, for one hundred year, without ever removing a single line.Reliable: Writes must be durable and reads consistent. Period.Flexible:Although quasardb has server-side aggregations and computations, the user may manipulate the data with her own tools such as Python or R. Extracting a subset of a time series must be simple, fast, and efficient. When inserting data, the user must not have to define a complex schema and can change her mind afterwards.Interactive: Transfers, computations and aggregations must be so fast that analytics can access quasardb directly, regardless of the amount of data stored, to enable the analyst to work interactively.Transparent: When a user wants to average the value of a column, it should not be her concern whether or not the time series resides on a single node or is distributed over a big cluster, and if 10,000 users are doing the same thing at the same time. The database must solve all the distribution, concurrence and reduction problems and present a naive interface to the user.

Time series are available since quasardb 2.1.0.

## 6.3.2. Features¶

### Reliability¶

The transactional backend of quasardb ensures that insertion within a time series is consistent and reliable (see Transactions).

### Performance¶

Thanks to its unique quasardb I/O engine a single node can typically deliver millions of insertions and lookups per second.

The I/O engine is capable of leveraging the Messaging Accelerator (VMA) of Mellanox cards.

### Columns¶

Each time series supports an arbitrary number of columns, represented by an unique name.

A database may have as many time series as the storage allows. Each time series can be looked-up by its name or by one of its tags (see Queries).

For each column, the following data types are currently supported:

- Double precision (64-bit) floating point numbers
- Binary large objects (blob) of any size

Note

64-bit signed integers values will be made available in a future release.

### Distribution¶

Time series are distributed over the cluster transparently. Each node in the cluster is responsible for a certain time interval for any given time series. This time interval is currently not configurable.

For I/O efficiency, a time interval can be sub-divided in smaller buckets on a single node. This is done transparently.

### Bulk loading¶

Time series support value by value insertion, as well as efficient bulk loading. Inserting multiple lines in one call can yield significant performance improvements as it gives the opportunity for the database to optimize network traffic, memory allocations and disk writes.

Look-up by time index is efficient with an amortized constant lookup complexity O(1). That means that fetching a sub-part of the time series is independent of the length of the time series.

### Size¶

There is no limitation on the number of entries a time series may hold, other by the storage capabilities of the cluster.

### Resolution¶

Each point in time is represented with a very-high precision 128-bit timestamp with nanosecond granularity, which makes it possible to represent any point in time for the duration of the universe.

Note

100-picosecond granularity is planned for a future release.

### Real-time aggregation¶

Time series values are stored in cache-aware data structures. Aggregations are vectorized using the available enhanced instructions set of the processor.

For example, an Intel Xeon E5-2670 can sum a column in the region of 3 billions of rows per second per core thanks to the SSE 4.2 and AVX instruction sets.

When the time interval spans several nodes, the API will transparently distribute the computation over multiple nodes, and perform the adequate reductions.

### Efficient storage¶

While preserving the 128-bit resolution of each timestamp, each bucket only stores the 64-bit index relative to its time interval. Values are stored to disk using variadic encoding to minimize storage space.

When a time interval does not contain data, it does not use any space. Thus, discontinuous data is natively supported and there is no need to *“clean up”* the data before inserting it into quasardb.

Blobs are compressed using LZ4.

For more information, see Data Storage.

Note

Lossless temporal compression of values and timestamps is planned for a future release.

## 6.3.3. Supported server side functions¶

All functions are transparently distributed over the cluster.

Operation Applies to Complexity Vectorized First element Any column Constant No Last element Any column Constant No Minimum element Double columns Linear Yes Minimum element (absolute values) Double columns Linear No Maximum element Double columns Linear Yes Maximum element (absolute values) Double columns Linear No Spread Double columns Linear Yes Population variance Double columns Linear No Sample variance Double columns Linear No Population standard deviation Double columns Linear No Sample standard deviation Double columns Linear No Arithmetic mean Double columns Linear Yes Harmonic mean Double columns Linear No Geometric mean Double columns Linear No Quadratic mean (root mean square) Double columns Linear No Number of elements Any column Constant No Sum Double columns Linear Yes Sum of squares Double columns Linear No Product Double columns Linear No Skewness Double columns Linear No Kurtosis Double columns Linear No

Note

The following functions are planned in the short term: distinct values count, median, most frequent value, least frequent value, moving average and percentile.

## 6.3.4. Usage¶

In the next example, we’ll assume we want to work on the following time series, named “tick42” :

Timestamp | Price | Size | Exchange |
---|---|---|---|

2016-11-28 14:28:32.213 | 243 | 100 | P |

2016-11-28 14:28:33.124 | 243 | 200 | P |

… | … | … | … |

2016-11-28 15:12:33.024 | 300 | 400 | T |

### Creation¶

A time series needs to be initially created, and column must be defined. The type of the column is fixed for the lifetime of the column. It is possible to insert, remove and rename columns after the time series has been created.

For example, to create the following time series in Python:

```
import quasardb
# assuming a node on the localhost
c = quasardb.Cluster('qdb://127.0.0.1:2836')
ts = c.ts("tick42")
cols = ts.create([quasardb.TimeSeries.DoubleColumnInfo("Price"), quasardb.TimeSeries.DoubleColumnInfo("Size"), quasardb.TimeSeries.BlobColumnInfo("Exchange")])
```

### Insertion¶

Once the time series is created, values are inserted in each column. It is not required to have a value for each column at every timestamp. Concurrent insertion is supported.

It is not possible to insert in a non-existing time series or in a non-existing column.

Warning

Not every API deliver nanosecond resolution for the timestamps during insertion and lookup. This can be due to the inherent limitation of the language. Internally, every value has a timestamp with nanosecond granularity regardless of the language and platform used.

To insert the first line in our example:

```
line_ts = datetime.datetime(2016, 11, 28, 14, 28, 32, 213000)
cols[0].insert([(line_ts, 243)])
cols[1].insert([(line_ts, 100)])
cols[2].insert([(line_ts, "P")])
```

### Time series lookup¶

Time series can be discovered by key, by tag or by affix, like any other entry (see Queries).

It is possible to enumerate the columns of a time series at any time.

In Python, to enumerate the columns of a time series:

```
# columns will be returned in the order they were created
cols = ts.columns()
# it is also possible to access a specific column
col_price = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Price"))
col_price.insert([(line_ts, 243)])
```

### Fetching the values of time series¶

Values are obtained by time interval. The complexity of the operation is independent of the size of the time series. Most APIs support querying multiple range in one call to minimize network traffic.

Warning

The number of returned values can be very large.

For example, to get all the prices of March, 25th 2016:

```
col_price = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Price"))
all_prices = col_price.get_ranges([(datetime.datetime(2016,3,25,00,00,00), datetime.datetime(2016,3,25,23,59,59,999999))])
```

### Server-side aggregation¶

Aggregations are done on ranges. A single aggregation will not be multithreaded on a single server, however, a server supports multiple aggregations on the same (or different) time series in parallel and these aggregations will occur in separate threads.

Aggregations on floating-point values are done at 64-bit precision.

If we wanted to have the total volume for March, 25th 2016:

```
col_size = ts.column(quasardb.TimeSeries.DoubleColumnInfo("Size"))
# volumes[0] will have the total volume
volumes = col_size.aggregate(quasardb.TimeSeries.Aggregation.sum, [(datetime.datetime(2016,3,25,00,00,00), datetime.datetime(2016,3,25,23,59,59,999999))])
```