Welcome!

Blog Feed Post

Cassandra’s Data Model

As we prepare to implement our Market Data repository to facilitate algo development and back-testing, you should have downloaded Cassandra and installed it by now.  What, you haven’t?  Well, click here, get it done and then come back for some fun.  To get things up and running once you’ve downloaded Cassandra, click here for some guidance (this assumes you’re running Linux but should point you in the right direction if you’re running Windoze).

CONFUSION

Most of the explanations I’ve read about Cassandra’s data model first extol the virtues of NoSQL and the evils of Relational Databases.  And so while getting the reader caught up in this mythic struggle that summons images from Tolkien’s middle earth, the point is lost.  And that point is?

IT’S ALL ACTUALLY QUITE EASY

Cassandra thinks about data the way we think about data.  Most of us think about data in rows and columns.  So does Cassandra.  But it also alleviates some extra stuff we don’t need while adding some stuff that we do need.  And that can be a little disconcerting initially.  To make things easier, let’s first describe a goal for our exercise.  We’d like to get a day’s worth of market data, by symbol, in ascending time order.  Also, we might like to get the data for a slice of time within that day.  Like, “give me all the BBO’s for American Airlines for May 20th, 2010,” or, “I’d like to see the BBO’s for American Airlines for May 20th, 2010 between 1 and 2pm.”  Let’s jump right in.

LET’S GET OUR DATA

As we subscribe to our favorite market data feed, we receive something like:

  • Symbol,
  • Bid,
  • Offer,
  • Bid Size,
  • Offer Size.
  • Time Stamp, and
  • Seq # (most quote vendors provide a Sequence # because multiple quotes can occur for any given Time Stamp)

We’re going to call this a column family.  Cassandra’s analog for a table is a Column Family.  You can see why this fits so well above – the columns that belong to the symbol AA comprise a family of related information.  I’d like to store this data by symbol, so later, I can retrieve it.  Using the Cassandra client (cassandra-cli – it’s in the bin directory where you installed Cassandra), let’s create the BBO Column Family.  It looks like this:

create column family bbo with comparator = UTF8Type
and column_metadata = [
{column_name: symbol, validation_class:UTF8Type},
{column_name: bb, validation_class: UTF8Type},
{column_name: bo, validation_class: UTF8Type},
{column_name: bbSize, validation_class: UTF8Type},
{column_name: bSize, validation_class: UTF8Type},
{column_name: timeStamp, validation_class: LongType},
{column_name: seqNum, validation_class: LongType},
];

And now that we’ve created the schema, let’s insert some quotes.

Set bbo[‘symbol’=’AA’;
Set bbo[‘bb’]=’123.34’;
Set bbo[‘bo’]=’123.84’;
Set bbo[‘bbSize’]=’100’;
Set bbo[‘boSize’]=’200’;
Set bbo[‘timeStamp’]=1234;

What happens when you execute a list bbo command now?  So, that’s easy enough.   So what happens as we get the next quote?  Well, we go to insert our data like this:

Set bbo[‘symbol’=’AA’;
Set bbo[‘bb’]=’125.34’;
Set bbo[‘bo’]=’125.84’;
Set bbo[‘bbSize’]=’100’;
Set bbo[‘boSize’]=’200’;
Set bbo[‘timeStamp’]=1235;

And then to see our data, enter this command (again):

List bbo;

When we use the ‘list bbo’ command, we’re only go see that data last inserted for that row key.  What happened to the previous data?  It was over-written with the new data.  So if we wanted to save each quote, we could combine the timestamp with the column name and then we’d be inserting unique columns each time and we’d be fine.  But there’s a different way to do this.

BIG DEAL, I DON’T SEE ANYTHING DIFFERENT HERE

And you don’t, because we haven’t started introducing the special sauce yet.  Well, we kind of did.  In the schema definitions above, you’ll notice we didn’t say that much about what we could or couldn’t insert into a row.  We just started adding columns dynamically.  So, each row, which is identified by a key, can have different columns in it and even a different number of columns.

WELL, THAT’S NOT GOING TO WORK

So, how do we keep track of all the quotes for our symbol?  First a little clarification, the Column Family above is really BBO, and we’ve inserted a row identified by the key, ‘AA” and some associated tag/data value pairs.  Think of this as a map of maps.  So now, we need to insert the bits that change for a given symbol over time.  How could we do that?  We create a Super Column Family of course.  A Super Column Family contains Super Columns.  A Super Column is kind of like another row of data – so using our example above, the Super Column we’ll be inserting consists of the BB, BO, BB Size, etc. The data above gets inserted using [AA] as our row key, and we need to pick a key for the Super Column that contains the quote data.    Let’s pick Seq# as our Super Column key.  Our row key is still Symbol, and I’ve prepended the date to it.  This way, all the data for a day’s worth of AA will be in the same row.  This is called a compound, or aggregate, key.  It looks like this:

create column family sbbo with column_type = 'Super' and comparator = ‘BytesType’
and column_metadata = [
{column_name: bb, validation_class: UTF8Type},
{column_name: bo, validation_class: UTF8Type},
{column_name: bbSize, validation_class: UTF8Type},
{column_name: bSize, validation_class: UTF8Type},
{column_name: timeStamp, validation_class: LongType},
];

And the insert statements look like this (we’re using the Seq# as the key – that’s the Super Column key right after the row key or, ’20100124:AA’ below):

Set sbbo[‘20100124:AA’][1234][‘bb’]=’100.00’;
Set sbbo[‘20100124:AA’][1234][‘bo’]=’101.00’;
Set sbbo[‘20100124:AA’][1235][‘bb’]=’101.00’;
Set sbbo[‘20100124:AA’][1235][‘bo’]=’102.00’;
Set sbbo[‘20100125:AA’][1234][‘bb’]=’100.00’;
Set sbbo[‘20100125:AA’][1234][‘bo’]=’101.00’;
Set sbbo[‘20100125:AA’][1235][‘bb’]=’101.00’;
Set sbbo[‘20100125:AA’][1235][‘bo’]=’102.00’;

Now let’s see what’s in the column family:

List sbbo;

So now it looks like we’re able to store a set of quotes for a symbol for any given day.  Bingo.

All we’ve really done here is add another map – so we now have a map (Date, Symbol) that contains a map (Symbol, Quote) that contains another map (Quote, QuoteField).  Or, what we’ve done is figured out a way to represent the potentially sparse fact tables resulting from large data analysis (OLAP) projects in a concise and easily addressable fashion.  Told you it wasn’t that hard.

GIVE ME MY DATA

So, now that we’ve inserted a couple of rows of data, let’s see how to get our data.  From above, we want to:

  1. Get all the data for a day’s worth of a symbol, and
  2. Get all the data for a slice of time during a day for a symbol

Assuming you’ve entered the statements above to insert the data, we can retrieve an entire day’s worth of AA with this simple statement:

List sbbo[‘20100124:AA’];

Now that we’ve gone over some of Cassandra’s basics, we’ll get a little more into it in upcoming posts.  That’s where we’ll cover the goal in #2.

THANKS FOR READING

PrintFriendly

Read the original blog entry...

More Stories By Colin Clark

Colin Clark is the CTO for Cloud Event Processing, Inc. and is widely regarded as a thought leader and pioneer in both Complex Event Processing and its application within Capital Markets.

Follow Colin on Twitter at http:\\twitter.com\EventCloudPro to learn more about cloud based event processing using map/reduce, complex event processing, and event driven pattern matching agents. You can also send topic suggestions or questions to [email protected]

Latest Stories
Charles Araujo is an industry analyst, internationally recognized authority on the Digital Enterprise and author of The Quantum Age of IT: Why Everything You Know About IT is About to Change. As Principal Analyst with Intellyx, he writes, speaks and advises organizations on how to navigate through this time of disruption. He is also the founder of The Institute for Digital Transformation and a sought after keynote speaker. He has been a regular contributor to both InformationWeek and CIO Insight...
Michael Maximilien, better known as max or Dr. Max, is a computer scientist with IBM. At IBM Research Triangle Park, he was a principal engineer for the worldwide industry point-of-sale standard: JavaPOS. At IBM Research, some highlights include pioneering research on semantic Web services, mashups, and cloud computing, and platform-as-a-service. He joined the IBM Cloud Labs in 2014 and works closely with Pivotal Inc., to help make the Cloud Found the best PaaS.
It is of utmost importance for the future success of WebRTC to ensure that interoperability is operational between web browsers and any WebRTC-compliant client. To be guaranteed as operational and effective, interoperability must be tested extensively by establishing WebRTC data and media connections between different web browsers running on different devices and operating systems. In his session at WebRTC Summit at @ThingsExpo, Dr. Alex Gouaillard, CEO and Founder of CoSMo Software, presented ...
@DevOpsSummit at Cloud Expo, taking place November 12-13 in New York City, NY, is co-located with 22nd international CloudEXPO | first international DXWorldEXPO and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
One of the biggest challenges with adopting a DevOps mentality is: new applications are easily adapted to cloud-native, microservice-based, or containerized architectures - they can be built for them - but old applications need complex refactoring. On the other hand, these new technologies can require relearning or adapting new, oftentimes more complex, methodologies and tools to be ready for production. In his general session at @DevOpsSummit at 20th Cloud Expo, Chris Brown, Solutions Marketi...
In a world where the internet rules all, where 94% of business buyers conduct online research, and where e-commerce sales are poised to fall between $427 billion and $443 billion by the end of this year, we think it's safe to say that your website is a vital part of your business strategy. Whether you're a B2B company, a local business, or an e-commerce site, digital presence is key to maintain in your drive towards success. Digital Performance will take priority in 2018 for the following reason...
At the keynote this morning we spoke about the value proposition of Nutanix, of having a DevOps culture and a mindset, and the business outcomes of achieving agility and scale, which everybody here is trying to accomplish," noted Mark Lavi, DevOps Solution Architect at Nutanix, in this SYS-CON.tv interview at @DevOpsSummit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
I think DevOps is now a rambunctious teenager - it's starting to get a mind of its own, wanting to get its own things but it still needs some adult supervision," explained Thomas Hooker, VP of marketing at CollabNet, in this SYS-CON.tv interview at DevOps Summit at 20th Cloud Expo, held June 6-8, 2017, at the Javits Center in New York City, NY.
What's the role of an IT self-service portal when you get to continuous delivery and Infrastructure as Code? This general session showed how to create the continuous delivery culture and eight accelerators for leading the change. Don Demcsak is a DevOps and Cloud Native Modernization Principal for Dell EMC based out of New Jersey. He is a former, long time, Microsoft Most Valuable Professional, specializing in building and architecting Application Delivery Pipelines for hybrid legacy, and cloud ...
In this presentation, you will learn first hand what works and what doesn't while architecting and deploying OpenStack. Some of the topics will include:- best practices for creating repeatable deployments of OpenStack- multi-site considerations- how to customize OpenStack to integrate with your existing systems and security best practices.
The “Digital Era” is forcing us to engage with new methods to build, operate and maintain applications. This transformation also implies an evolution to more and more intelligent applications to better engage with the customers, while creating significant market differentiators. In both cases, the cloud has become a key enabler to embrace this digital revolution. So, moving to the cloud is no longer the question; the new questions are HOW and WHEN. To make this equation even more complex, most ...
@DevOpsSummit at Cloud Expo, taking place November 12-13 in New York City, NY, is co-located with 22nd international CloudEXPO | first international DXWorldEXPO and will feature technical sessions from a rock star conference faculty and the leading industry players in the world. The widespread success of cloud computing is driving the DevOps revolution in enterprise IT. Now as never before, development teams must communicate and collaborate in a dynamic, 24/7/365 environment. There is no time t...
22nd International Cloud Expo, taking place June 5-7, 2018, at the Javits Center in New York City, NY, and co-located with the 1st DXWorld Expo will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud ...
CloudEXPO New York 2018, colocated with DXWorldEXPO New York 2018 will be held November 11-13, 2018, in New York City and will bring together Cloud Computing, FinTech and Blockchain, Digital Transformation, Big Data, Internet of Things, DevOps, AI, Machine Learning and WebRTC to one location.
"ZeroStack is a startup in Silicon Valley. We're solving a very interesting problem around bringing public cloud convenience with private cloud control for enterprises and mid-size companies," explained Kamesh Pemmaraju, VP of Product Management at ZeroStack, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.