Welcome!

Blog Feed Post

Building a Back Testing Platform for Algorithmic Trading

In this series, I’m going to outline in general, how to build a back-testing platform for the creation, tweaking, and subsequent execution of algorithms used in electronic trading.

Part One – The Data

I recently made some comments on Vertica’s blog in regards to what I considered to be a fairly bold claim.  They said that Vertica was the only real column store.  But even if they are, so what? In my comments, I alluded to my belief that we optimize problems to solutions – we try to fix stuff using what we’ve got in our toolbox without having to run to Home Depot.

The real test is when the rubber hits the road – how do you actually solve a problem in a new way that’s motivating.  And by motivating I mean the solution addresses the issues, enables new capabilities, and is economically attractive.

So rather than tell you that DarkStar and our approach to processing both real-time and historical data (there’s a difference?) is the Real Enchilada, I thought I would illustrate a real world use case.

Let’s say you want to store a bunch of market data.  And I mean a bunch.  You want to store every piece of market data for the whole US Equities market.

And you’d like to have this data so that you can run analytics on it.  Or maybe even back-test strategies for buying and selling stocks.  So let’s assume that you’ve got some java code lying around to do that.

For our example, we are interested in seeing whether or not using volume average weighted price strategies actually work.   In our example, we will pretend that we are buying a lot of stock, and the theory we want test is whether or not buying that stock during the day when it’s lower than it’s weighted average price will give us a better average price during the day than just going with the flow (often referred to as volume participation).

We are all familiar with how relational databases work, and anyone who’s been in capital markets for a while knows how futile it would be to use something like Oracle for this due to cost, hardware and just how difficult it is to get the data into the database in the first place,

Oh that’s right, I forgot to tell you, we are going to have to load this data first.

I am not going to go into the relevant benefits of a column store here either, you can check out many other websites for that.

Instead, let’s look at some issues.  First, I would rather load the data directly into the database as it happens.  Staging the data separately is costly and error prone. In addition, what happens when you decide to load that data and encounter a problem that can’t be fixed in time for the next market day?  What if you actually run out of space or compute to get caught up?  Well then you can’t back test the next day to further refine your algorithms.  Algo’s should be tweaked every day.  New algo’s need to be developed to remain competitive.  So here, a database error costs real money.

So I need a fast data store.

As I am loading the data, what happens if one of my disks goes boom? Or one of my machines go boom? Well, now I have a problem.  If I fail over to another datacenter, how do i reconcile? What a nightmare!

So I need a data store that we can take a sledgehammer to and it will keep running.

Hey, if I have this big historical data store, I still need to query it while it is being updated.  Ideally, I would like to also be running analysis and back testing during the day.  Scheduling jobs to run at night is so very ’90′s.

So my data store has to facilitate both interactive query and batch analysis.

But wait, doing all of this means that I am going to have to figure out how to use the same code for back-testing that i use to generate orders during market hours.  It’s either that or use some visual, script based or different harnesses for my java or C++ code.  Yet another nightmare.

So, I would Iike to run the same code against my historical data store that I also use to generate orders during the day.

There’s a bunch of other stuff too, management, instrumentation, removing old data that I don’t need for back-testing, all the stuff we associate with normal day to day big data operations. We need to know what’s going on during the day so that we can be proactive.  There’s gold in that data!

And one last thing, it would be really cool if most of this technology wasn’t proprietary.  I mean let’s face it, firms that talk more about their investors on their websites than their clients can’t possibly have my best interests at heart.

This is a tall list.  Let’s knock it down, one by one.

Here is a diagram for your consideration.

The diagram isn’t very technical, and that’s on purpose – I’m outlining an algorithm, or methodology that may or may not solve our problem.

In the diagram, I’ve depicted the database as a cluster of machines.  Instead of using one big machine backed by a SAN, I’m going to use a number of machines.  Each of those machines is going to connect to the Market Data source and get data.

As we receive the data, we’re going to take a peek at it, and determine where in the cluster that data needs to live and while we’re doing that, we’re going to right it to disk.  A background process will make sure that the data ends up on the node we want it on.  More of why that’s so incredibly important in Part Two – Analyzing the Data in this series.

Also, I’m going to ask the cluster to replicate everything we’re writing to it – we’re going to end up writing the data a total of 2 times in this example.  I might usually suggest 3, but we’ve got two data centers running the same solution, so I’ll actually have 4 copies of the data.

Why write the data to three nodes in the cluster?  First, if a node goes down, I still want to be able to write data.  If the node that goes down is the primary node, I’m going to remember that and when that node comes back up, I’m going to write all the data to it as part of its “Welcome Back to the Cluster Party!”  And second, if I’m reading data from the cluster (remember, we’ve got algo’s running and users querying this data), I want my data.  If a node goes down, your users don’t really care – they just want their data.  By replicating the data across multiple nodes, I achieve high availability without having to fail-over to another instance or data center.

Ok, we’ve got the Sledge Hammer test handled, which is cool, but everything I’ve described above sounds like it’s going to take a lot of time and that the system is going to very slow.

Not true.  Each node in the cluster above is subscribing to market data.  So if one machine can ingest X messages per second, then a cluster of 10 machines should be able to ingest 10 * X messages per second, right?  Let see what that means in a real world example:

On May 20, 2010, there were about 1.1 billion BBO messages as published via SuperFeed (NYSE’s market data platform), those quotes represent the best bid, bid size, offer, and offer size for each stock at any given time.  In terms of messages per second, that’s about 50,000.  In terms of size per second, that’s about 4,500k per second.  Hmmm, chunky!

These are intimidating numbers.  But if we divide the problem up a bit, and use 10 nodes in a cluster, each node only needs to ingest about 450k per second across 5,000 messages.  All of a sudden, we’re dealing with something quite reasonable.

So, now we’ve got a cluster that can load the entire market real time and it’s redundant.  What about analyzing the data?  That’s in Part Two – Analyzing the Data which I’ll post next week.

Print

Read the original blog entry...

More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http:\\twitter.com\EventCloudPro to learn more about cloud based event processing using map/reduce, complex event processing, and event driven pattern matching agents. You can also send topic suggestions or questions to [email protected]

Latest Stories
Mobile device usage has increased exponentially during the past several years, as consumers rely on handhelds for everything from news and weather to banking and purchases. What can we expect in the next few years? The way in which we interact with our devices will fundamentally change, as businesses leverage Artificial Intelligence. We already see this taking shape as businesses leverage AI for cost savings and customer responsiveness. This trend will continue, as AI is used for more sophistica...
Nordstrom is transforming the way that they do business and the cloud is the key to enabling speed and hyper personalized customer experiences. In his session at 21st Cloud Expo, Ken Schow, VP of Engineering at Nordstrom, discussed some of the key learnings and common pitfalls of large enterprises moving to the cloud. This includes strategies around choosing a cloud provider(s), architecture, and lessons learned. In addition, he covered some of the best practices for structured team migration an...
Most technology leaders, contemporary and from the hardware era, are reshaping their businesses to do software. They hope to capture value from emerging technologies such as IoT, SDN, and AI. Ultimately, irrespective of the vertical, it is about deriving value from independent software applications participating in an ecosystem as one comprehensive solution. In his session at @ThingsExpo, Kausik Sridhar, founder and CTO of Pulzze Systems, discussed how given the magnitude of today's application ...
Recently, REAN Cloud built a digital concierge for a North Carolina hospital that had observed that most patient call button questions were repetitive. In addition, the paper-based process used to measure patient health metrics was laborious, not in real-time and sometimes error-prone. In their session at 21st Cloud Expo, Sean Finnerty, Executive Director, Practice Lead, Health Care & Life Science at REAN Cloud, and Dr. S.P.T. Krishnan, Principal Architect at REAN Cloud, discussed how they built...
In his session at 21st Cloud Expo, Raju Shreewastava, founder of Big Data Trunk, provided a fun and simple way to introduce Machine Leaning to anyone and everyone. He solved a machine learning problem and demonstrated an easy way to be able to do machine learning without even coding. Raju Shreewastava is the founder of Big Data Trunk (www.BigDataTrunk.com), a Big Data Training and consulting firm with offices in the United States. He previously led the data warehouse/business intelligence and B...
The “Digital Era” is forcing us to engage with new methods to build, operate and maintain applications. This transformation also implies an evolution to more and more intelligent applications to better engage with the customers, while creating significant market differentiators. In both cases, the cloud has become a key enabler to embrace this digital revolution. So, moving to the cloud is no longer the question; the new questions are HOW and WHEN. To make this equation even more complex, most ...
As you move to the cloud, your network should be efficient, secure, and easy to manage. An enterprise adopting a hybrid or public cloud needs systems and tools that provide: Agility: ability to deliver applications and services faster, even in complex hybrid environments Easier manageability: enable reliable connectivity with complete oversight as the data center network evolves Greater efficiency: eliminate wasted effort while reducing errors and optimize asset utilization Security: imple...
In his Opening Keynote at 21st Cloud Expo, John Considine, General Manager of IBM Cloud Infrastructure, led attendees through the exciting evolution of the cloud. He looked at this major disruption from the perspective of technology, business models, and what this means for enterprises of all sizes. John Considine is General Manager of Cloud Infrastructure Services at IBM. In that role he is responsible for leading IBM’s public cloud infrastructure including strategy, development, and offering m...
With tough new regulations coming to Europe on data privacy in May 2018, Calligo will explain why in reality the effect is global and transforms how you consider critical data. EU GDPR fundamentally rewrites the rules for cloud, Big Data and IoT. In his session at 21st Cloud Expo, Adam Ryan, Vice President and General Manager EMEA at Calligo, examined the regulations and provided insight on how it affects technology, challenges the established rules and will usher in new levels of diligence arou...
The past few years have brought a sea change in the way applications are architected, developed, and consumed—increasing both the complexity of testing and the business impact of software failures. How can software testing professionals keep pace with modern application delivery, given the trends that impact both architectures (cloud, microservices, and APIs) and processes (DevOps, agile, and continuous delivery)? This is where continuous testing comes in. D
Modern software design has fundamentally changed how we manage applications, causing many to turn to containers as the new virtual machine for resource management. As container adoption grows beyond stateless applications to stateful workloads, the need for persistent storage is foundational - something customers routinely cite as a top pain point. In his session at @DevOpsSummit at 21st Cloud Expo, Bill Borsari, Head of Systems Engineering at Datera, explored how organizations can reap the bene...
Digital transformation is about embracing digital technologies into a company's culture to better connect with its customers, automate processes, create better tools, enter new markets, etc. Such a transformation requires continuous orchestration across teams and an environment based on open collaboration and daily experiments. In his session at 21st Cloud Expo, Alex Casalboni, Technical (Cloud) Evangelist at Cloud Academy, explored and discussed the most urgent unsolved challenges to achieve f...
The dynamic nature of the cloud means that change is a constant when it comes to modern cloud-based infrastructure. Delivering modern applications to end users, therefore, is a constantly shifting challenge. Delivery automation helps IT Ops teams ensure that apps are providing an optimal end user experience over hybrid-cloud and multi-cloud environments, no matter what the current state of the infrastructure is. To employ a delivery automation strategy that reflects your business rules, making r...
The 22nd International Cloud Expo | 1st DXWorld Expo has announced that its Call for Papers is open. Cloud Expo | DXWorld Expo, to be held June 5-7, 2018, at the Javits Center in New York, NY, brings together Cloud Computing, Digital Transformation, Big Data, Internet of Things, DevOps, Machine Learning and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding busin...
In a recent survey, Sumo Logic surveyed 1,500 customers who employ cloud services such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP). According to the survey, a quarter of the respondents have already deployed Docker containers and nearly as many (23 percent) are employing the AWS Lambda serverless computing framework. It’s clear: serverless is here to stay. The adoption does come with some needed changes, within both application development and operations. Tha...