Cassandra’s Data Model

As we prepare to implement our Market Data repository to facilitate algo development and back-testing, you should have downloaded Cassandra and installed it by now.  What, you haven’t?  Well, click here, get it done and then come back for some fun.  To get things up and running once you’ve downloaded Cassandra, click here for some guidance (this assumes you’re running Linux but should point you in the right direction if you’re running Windoze).

CONFUSION

Most of the explanations I’ve read about Cassandra’s data model first extol the virtues of NoSQL and the evils of relational databases.  While the reader is caught up in this mythic struggle, which summons images from Tolkien’s Middle-earth, the point is lost.  And that point is?

IT’S ALL ACTUALLY QUITE EASY

Cassandra thinks about data the way we think about data.  Most of us think about data in rows and columns.  So does Cassandra.  But it also does away with some extra stuff we don’t need while adding some stuff that we do need.  And that can be a little disconcerting initially.  To make things easier, let’s first describe a goal for our exercise.  We’d like to get a day’s worth of market data, by symbol, in ascending time order.  Also, we might like to get the data for a slice of time within that day.  Like, “give me all the BBOs (best bid and offer quotes) for American Airlines for May 20th, 2010,” or, “I’d like to see the BBOs for American Airlines for May 20th, 2010 between 1 and 2pm.”  Let’s jump right in.

LET’S GET OUR DATA

As we subscribe to our favorite market data feed, we receive something like:

  • Symbol,
  • Bid,
  • Offer,
  • Bid Size,
  • Offer Size,
  • Time Stamp, and
  • Seq # (most quote vendors provide a Sequence # because multiple quotes can occur for any given Time Stamp)

We’re going to call this a column family.  Cassandra’s analog for a table is a Column Family.  You can see why this fits so well above – the columns that belong to the symbol AA comprise a family of related information.  I’d like to store this data by symbol, so later, I can retrieve it.  Using the Cassandra client (cassandra-cli – it’s in the bin directory where you installed Cassandra), let’s create the BBO Column Family.  It looks like this:

create column family bbo with comparator = UTF8Type
and column_metadata = [
{column_name: symbol, validation_class: UTF8Type},
{column_name: bb, validation_class: UTF8Type},
{column_name: bo, validation_class: UTF8Type},
{column_name: bbSize, validation_class: UTF8Type},
{column_name: boSize, validation_class: UTF8Type},
{column_name: timeStamp, validation_class: LongType},
{column_name: seqNum, validation_class: LongType}
];

And now that we’ve created the schema, let’s insert some quotes.

set bbo['AA']['symbol'] = 'AA';
set bbo['AA']['bb'] = '123.34';
set bbo['AA']['bo'] = '123.84';
set bbo['AA']['bbSize'] = '100';
set bbo['AA']['boSize'] = '200';
set bbo['AA']['timeStamp'] = 1234;

Run a list bbo command now and you should see the columns we just set.  Easy enough.  So what happens when we get the next quote?  Well, we go to insert our data like this:

set bbo['AA']['symbol'] = 'AA';
set bbo['AA']['bb'] = '125.34';
set bbo['AA']['bo'] = '125.84';
set bbo['AA']['bbSize'] = '100';
set bbo['AA']['boSize'] = '200';
set bbo['AA']['timeStamp'] = 1235;

And then to see our data, enter this command (again):

list bbo;

When we use the ‘list bbo’ command, we’re only going to see the data last inserted for that row key.  What happened to the previous data?  It was overwritten by the new data, because both quotes write the same column names under the same row key.  If we wanted to save each quote, we could combine the timestamp with the column name; then we’d be inserting unique columns each time and we’d be fine.  But there’s a different way to do this.
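To make the overwrite concrete, here is a minimal sketch in Python.  It is only an in-memory model (a dict of rows, each row a dict of columns), not Cassandra itself, and the set_column helper and the ‘bb:1234’ naming scheme are my own assumptions for illustration.

```python
# In-memory model only: a column family is a dict of rows,
# and each row is a dict of column name -> value.
bbo = {}

def set_column(cf, row_key, column, value):
    """Mimic a CLI 'set': create the row if needed, then write the column."""
    cf.setdefault(row_key, {})[column] = value

# First quote for AA.
set_column(bbo, "AA", "bb", "123.34")
set_column(bbo, "AA", "bo", "123.84")
set_column(bbo, "AA", "timeStamp", 1234)

# Second quote for AA reuses the same column names, so it overwrites.
set_column(bbo, "AA", "bb", "125.34")
set_column(bbo, "AA", "bo", "125.84")
set_column(bbo, "AA", "timeStamp", 1235)

# Workaround mentioned above: fold the timestamp into the column name
# (the 'bb:<timestamp>' format is a hypothetical choice).
bbo_unique = {}
for ts, bid in [(1234, "123.34"), (1235, "125.34")]:
    set_column(bbo_unique, "AA", f"bb:{ts}", bid)
```

In this model bbo['AA'] ends up holding only the second quote, while bbo_unique['AA'] keeps one column per timestamp.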

BIG DEAL, I DON’T SEE ANYTHING DIFFERENT HERE

And you don’t, because we haven’t started introducing the special sauce yet.  Well, we kind of did.  In the schema definitions above, you’ll notice we didn’t say that much about what we could or couldn’t insert into a row.  We just started adding columns dynamically.  So, each row, which is identified by a key, can have different columns in it and even a different number of columns.
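As a rough illustration of rows carrying different columns, here is the same idea as plain Python dicts.  This is an in-memory model only; the IBM row and its values are made up for the example, and columns_of is a hypothetical helper.

```python
# Hypothetical rows: each row key maps to its own set of columns.
bbo = {
    "AA": {"bb": "123.34", "bo": "123.84", "bbSize": "100", "boSize": "200"},
    "IBM": {"bb": "130.10", "timeStamp": 1236},  # fewer, different columns
}

def columns_of(cf, row_key):
    """Return the sorted column names present in one row."""
    return sorted(cf[row_key])

aa_columns = columns_of(bbo, "AA")
ibm_columns = columns_of(bbo, "IBM")
```

Nothing forces the two rows to agree on their columns, which is the point: the row key identifies the row, and the columns are whatever was written to it.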

WELL, THAT’S NOT GOING TO WORK

So, how do we keep track of all the quotes for our symbol?  First, a little clarification: the Column Family above is really BBO, and we’ve inserted a row identified by the key ‘AA’ and some associated tag/data value pairs.  Think of this as a map of maps.  So now we need to insert the bits that change for a given symbol over time.  How could we do that?  We create a Super Column Family, of course.  A Super Column Family contains Super Columns.  A Super Column is kind of like another row of data, so using our example above, the Super Column we’ll be inserting consists of the BB, BO, BB Size, etc.  The data above gets inserted using AA as our row key, and we need to pick a key for the Super Column that contains the quote data.  Let’s pick Seq# as our Super Column key.  Our row key is still the symbol, but I’ve prepended the date to it.  This way, all the data for a day’s worth of AA will be in the same row.  This is called a compound, or aggregate, key.  It looks like this:

create column family sbbo with column_type = 'Super' and comparator = BytesType
and column_metadata = [
{column_name: bb, validation_class: UTF8Type},
{column_name: bo, validation_class: UTF8Type},
{column_name: bbSize, validation_class: UTF8Type},
{column_name: boSize, validation_class: UTF8Type},
{column_name: timeStamp, validation_class: LongType}
];

And the insert statements look like this (the Seq# is the Super Column key; it comes right after the row key, which is ‘20100124:AA’ below):

set sbbo['20100124:AA'][1234]['bb'] = '100.00';
set sbbo['20100124:AA'][1234]['bo'] = '101.00';
set sbbo['20100124:AA'][1235]['bb'] = '101.00';
set sbbo['20100124:AA'][1235]['bo'] = '102.00';
set sbbo['20100125:AA'][1234]['bb'] = '100.00';
set sbbo['20100125:AA'][1234]['bo'] = '101.00';
set sbbo['20100125:AA'][1235]['bb'] = '101.00';
set sbbo['20100125:AA'][1235]['bo'] = '102.00';
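If you’re generating these statements from a feed handler, the compound row key has to be built somewhere.  Here’s a small sketch in Python; make_row_key and split_row_key are hypothetical helper names, and the ‘YYYYMMDD:SYMBOL’ layout is simply read off the keys used in the statements above.

```python
from datetime import date

def make_row_key(day: date, symbol: str) -> str:
    """Build a 'YYYYMMDD:SYMBOL' row key, e.g. '20100124:AA'."""
    return f"{day.strftime('%Y%m%d')}:{symbol}"

def split_row_key(row_key: str):
    """Recover the (date string, symbol) pair from a compound key."""
    day, symbol = row_key.split(":", 1)
    return day, symbol

key = make_row_key(date(2010, 1, 24), "AA")
```

Putting the date first keeps each trading day of a symbol in its own row, which is what the day-at-a-time goal calls for.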

Now let’s see what’s in the column family:

list sbbo;

So now it looks like we’re able to store a set of quotes for a symbol for any given day.  Bingo.

All we’ve really done here is add another map – so we now have a map (Date, Symbol) that contains a map (Symbol, Quote) that contains another map (Quote, QuoteField).  Or, what we’ve done is figured out a way to represent the potentially sparse fact tables resulting from large data analysis (OLAP) projects in a concise and easily addressable fashion.  Told you it wasn’t that hard.
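The extra level of nesting is easy to picture as nested Python dicts.  This is a sketch of the shape only, not of how Cassandra stores anything, and set_super is a hypothetical helper.

```python
# In-memory model: row key -> super column key (Seq#) -> column -> value.
sbbo = {}

def set_super(cf, row_key, super_key, column, value):
    """Mimic a CLI 'set' on a super column family, creating levels as needed."""
    cf.setdefault(row_key, {}).setdefault(super_key, {})[column] = value

# Same data as the insert statements above.
for row_key in ("20100124:AA", "20100125:AA"):
    set_super(sbbo, row_key, 1234, "bb", "100.00")
    set_super(sbbo, row_key, 1234, "bo", "101.00")
    set_super(sbbo, row_key, 1235, "bb", "101.00")
    set_super(sbbo, row_key, 1235, "bo", "102.00")
```

Each row now holds one entry per sequence number, so a second quote for the same symbol and day no longer overwrites the first.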

GIVE ME MY DATA

So, now that we’ve inserted a couple of rows of data, let’s see how to get our data.  From above, we want to:

  1. Get all the data for a day’s worth of a symbol, and
  2. Get all the data for a slice of time during a day for a symbol

Assuming you’ve entered the statements above to insert the data, we can retrieve an entire day’s worth of AA with this simple statement:

get sbbo['20100124:AA'];
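For comparison, here is what fetching one day’s row looks like against an in-memory model in Python.  Cassandra keeps the columns in a row sorted by the comparator; a plain dict doesn’t, so this sketch sorts by sequence number explicitly.  quotes_for_day is a hypothetical helper, not part of any Cassandra client.

```python
def quotes_for_day(cf, row_key):
    """Return (seq, columns) pairs for one row, ordered by sequence number."""
    row = cf.get(row_key, {})
    return [(seq, row[seq]) for seq in sorted(row)]

# Nested-dict model of the data inserted above (one day shown,
# deliberately written out of order).
sbbo = {
    "20100124:AA": {
        1235: {"bb": "101.00", "bo": "102.00"},
        1234: {"bb": "100.00", "bo": "101.00"},
    }
}

day = quotes_for_day(sbbo, "20100124:AA")
```

A row key that was never written simply comes back as an empty list.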

Now that we’ve gone over some of Cassandra’s basics, we’ll get a little more into it in upcoming posts.  That’s where we’ll cover the goal in #2.

THANKS FOR READING

More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http://twitter.com/EventCloudPro to learn more about cloud-based event processing using map/reduce, complex event processing, and event-driven pattern matching agents. You can also send topic suggestions or questions to [email protected]
