Print
Hits: 21636

Explore time-series database processing

Analyzing time-series data for the financial markets is one case where speed matters. A faster answer means beating your competition to the market. For those who use kdb+/q, you understand this need and how choosing the right tool can make a huge difference in the speed at which queries get answered. As you may have assumed, there has always been an active community of kdb+ users, but until now there has never been a place to call your own.

That situation has changed thanks to kxcommunity.com. The new Kx Community site has been made available by Kx Systems, Inc. to assist and support kdb+ users. If you land on the main page, you are just a click away from current blogs, kdb+ events, job listings, local MeetUps, a monthly update signup, free code, and a real time twitter feed with kdb+ conversations. There is plenty more and the site has loads to offer for both new and experienced kdb+ users.

If you are not familiar with kdb+ and want to learn more about column-oriented databases and time series analysis, then you may want to read the following two sections; otherwise you can skip down to see what Kx Community has to offer.

Using Columns Instead of Rows

A traditional database consists of tables with data ordered by row (row-oriented). As an example, consider Table One below. Each row usually has two or more columns (or fields) that hold similar data. When data are organized by rows, data look-up can be done quickly by using a key-value to identify the row and then its associated data. For many operations this scheme works well, but it does not work so well for certain types of problems where operations on entire columns of data are important. When data are organized by column (column-oriented), operations on columns are extremely fast because there is no key-value look-up.

SQL databases are usually row-oriented and have some advantages over column-oriented databases. Namely, insertion of new rows anywhere in a table is a simple operation because data is indexed using a key-value. Inserting random rows in a column-oriented database is slow due to the amount of data movement that is required. There are, however, certain types of data that do not require random row inserts and only append new data to the end of the table. Time-series data similar to that shown in Table One below fit this criteria — there is never a case where new data is created in the past! Thus, column-oriented databases are the optimal choice for analyzing time-series data.

Another way to understand the difference between column- and row-oriented databases is to consider how the user operates the data. In a row-oriented RDBMS the user queries the database using SQL, which is based on relational algebra and set theory. Column-oriented databases, on the other hand, are based on vectors of ordered lists. This abstraction allows the easy computation over an entire column.

Stock Price Table
Table One: Example of time series data for daily stock trades.

The kdb+ Time-series Database

The most popular time-series database is kdb+ and its q query language. kdb+ is used by virtually all financial institutions to analyze time series data (e.g., any type of stock or commodity exchange). Kdb/q's origin began with an obscure academic language called APL. Though powerful, APL was somewhat difficult to use and even required a special non-standard keyboard. Arthur Whitney, working at various financial firms, refined the APL approach and created a more user-friendly variant called kdb+ (k database) for time series analysis. To interact with kdb+, Whitney developed a user-friendly "q" language interface. q is an interpreted vector-based dynamically-typed language built for speed and expressiveness. As mentioned, the use of vector commands eliminated the need for virtually all looping structures (for/while) as part of the standard q program. Whitney also co-founded Kx Systems to further develop and support the kdb+/q technology that is used today.

kdb+ has been refined over the years to produce a powerful and expressive time-series database. Some of its important features are:

Joining the Growing Community

If you are a kdb+ user or are intrigued with the kdb+ approach, there are resources waiting for you. First, there is a free version of kdb+ (32 bit) available for download. The free version has all the functionality of the 64-bit version and can be used for commercial, non-commercial, or educational purposes. Second, the Kx Community site provides a 2-page Getting Started Guide, a developer’s tutorial, a kdb+ community wiki, information on using R with kdb+, and a white paper that explains the "q" query language. In addition, there are links to contributed interfaces for Java and C# as well as a Python (from DEVnet as part of Exxerleron), a kdb+ production system frameworks (TorQ by AquaQ Analytics and Enterprise Components from DEVnet), support for WebSockets, and a How-To for pivoting tables using the q language.

All In One Place

Kx Community has also collected all the important kdb+/q websites in one location. There are links to kdb+ developer blogs, background information, white papers, an FAQ site, reddit topics, and stack overflow questions and discussions. kxcommunity.com is the best place to start searching for kdb+/q information. There is a Google Group (Kdb+ Personal Developers) where the community gathers to talk and ask questions about all versions of Kdb+ -- including the free 32-bit version of kdb+.

Community Blogs

Rounding out the Kx Community page are kdb+ Blogs that cover a range of topics, including: The Nature of Ticker Plant Log Files, Open Source Building Blocks for kdb+, WebSockets, HTML5 and kdb+, and more.

Going Deeper with the Free Version of kdb+

We are using a 32-bit CentOS 6.5 machine for the free version. If you are using a 64-bit machine, make sure you install the ia32-libs package (yum install ia32-libs). Installation is very simple. First, download the package from kx.com. Then, choose an installation directory

$ mkdir kdb
$ mv ~/linux.zip kdb
$ cd kdb
$ unzip linux.zip

You will see a single directory, q, with the following

l32  q.k  q.q  q.sh  README.txt  s.k  sp.q  trade.q

The README.txt file can be consulted for other installation scenarios. Next, set the QHOME variable to the location of the q directory, move to the q directory, and start q:

$ export QHOME=/home/deadline/kdb/q
$ cd q
$ l32/q

If everything extracted correctly, you should see the following:


KDB+ 3.1 2014.08.22 Copyright (C) 1993-2014 Kx Systems
l32/ 2()core 1884MB deadline gromit 255.255.255.255 NONEXPIRE  

Welcome to kdb+ 32bit edition
For support please see http://groups.google.com/d/forum/personal-kdbplus
Tutorials can be found at http://code.kx.com/wiki/Tutorials
To exit, type \\
To remove this startup msg, edit q.q
q)

You can begin using kdb+ from the q) prompt. To exit exit kdb+ enter \\. Before we look at any examples, let's add the ability to scroll through our q commands by using the rlwrap package. If rlwrap is not installed, it can be installed using yum (as root):

# yum install rlwrap

Note: the rlwrap package is in the epel repository. Next, create a file in your q directory called q.sh with the following contents:

#!/bin/bash
cd ~/q
rlwrap l32/q "$@"

Finally, make the file executable:

chmod u+x q.sh

If we use this script to start q, then we can scroll back through previous commands we entered from the q) prompt.

We are now ready to try a few simple examples. To get a full understanding of using kdb+, please consult the links below. This tutorial is designed to provide you some basic working knowlege of kdb+. Start kdn+ and from the q) prompt type:

q)"Hello world!"
"Hello world!"

q)\ls *.q
"q.q"
"sp.q"
"trade.q"

q)5+7
12
q)5 + 5 7 12
10 12 17

In the above examples. we printed "Hello World!", listed the files that end in "q" in the current directory, summed two numbers and then incremented a list of numbers by 5.

Next, let's define the factorial function:

factorial:{prd 1+til x}

As you can see, q has a very terse expression. If we now use the factorial function we get the following:

q)factorial[7]
5040

Next, we will create a table with 1 million rows filled with random time-series data. Again, you can investigate the syntax later. For now, the examples illustrate the ease with which data can generate and be processed using kdb+. Our table is created as follows (note the ; at the end of each command):


q)n:1000000;
q)item:`apple`banana`orange`pear;
q)city:`beijing`chicago`london`paris;
q)table:([]time:asc n?0D0;n?item;amount:n?100;n?city);

Now that the data is in memory, we can ask a simple query selecting all rows from the table where the item sold is a banana:

q) select from table where item=`banana
time                 item   amount city   
------------------------------------------
0D00:00:00.048360228 banana 62     beijing
0D00:00:00.159745663 banana 27     london
0D00:00:00.480262935 banana 40     london
0D00:00:00.548035651 banana 32     chicago
0D00:00:00.705146044 banana 22     paris  
0D00:00:00.712388008 banana 98     london
0D00:00:00.958473980 banana 48     paris  
0D00:00:01.071770489 banana 40     london
0D00:00:04.457911849 banana 65     paris  
..

Only a subset of answers is printed. We can also generate an aggregate query that calculates the sum of the amounts sold of all items by each city:

q)select sum amount by city from table
city   | amount  
-------| --------
beijing| 12418161
chicago| 12342736
london | 12367712
paris  | 12383797

Even in these simple examples the real power of kdb+/q is easily seen. To learn more about using kdb+ and q, please consult the following resources:

Wrap Up

Column-oriented databases and time-series analysis are fast and powerful tools that have been serving the financial sector for many years. kxcommunity.com is a new resource for the growing kdb+ community that has a lot to offer for both new and experienced users/developers. The free availability of the fully functional 32-bit version allows anyone to try column-oriented database design and the power of vector-based queries. The resources are ready and waiting. You are invited to jump in and start looking down the column, instead of across the row, for real performance.