Welcome!

Blog Feed Post

The Structure of Big Data

First things first, all data is more or less structured. That being said, there is…

  • Structured Data
  • Semi-Structured Data
  • Unstructured Data

I tend to think of it as: data, composite or simple, with or without content. In that context, email is structured composite data (from, to, subject, date) with unstructured content (message body). The composite data is structured. The content is unstructured. Though simple data may or may not be structured. The ‘subject’ data is unstructured. The ‘to’ data is structured. It is composed of a local-part (username) and a domain.

While content is unstructured, there may be an implied structure.

So what is the difference between structured data, semi-structured data, and unstructured data?

Structured Data

The structure is externally enforced. Data

  • The data is stored in a database.
    • Relational
      • Transaction
    • XML
      • Catalog

The data itself is not structured. The structure is defined by the database. The data would be semi-structured if it was exported and transformed into JSON or XML.

Semi-Structured Data

The structure is self defined. Data and / or Content

  • The data is stored as text.
    • JSON
      • User Profile
    • XML
      • Application / Form

Unstructured Data

The structure of the data is externally defined. Metadata and Content

  • The data is stored in a binary format and / or document.
    • Media (e.g. Ogg).
    • Video (e.g. Vorbis).
    • Audio (e.g. Theora).
    • Image (e.g. PNG).
    • Microsoft Word
    • Adobe PDF

The structure is defined by the file format. The data is composed of structured metadata and unstructured content. That being said, a video is composed of frames and an image is composed of pixels.

  • The data is stored as plain text.
    • Log

The structure is defined by a pattern in the logging configuration file. The data is composed of structured metadata (e.g. severity) and unstructured content (log message).

  • The data is user generated.
    • Status (Facebook)
    • Tweet (Twitter)
    • Comment (WordPress)

The structure is defined by the application / form. The data is composed of structured metadata (e.g. user ID) and unstructured content (user message).

Update

I would say that the structure of content is user defined and thus interpreted. However, content is often a component of data (unstructured, externally defined). Though if I typed up this post in gedit and saved it as a text file, that might constitute content independent of data.


With all all that said, the structure of data is not exactly black and white.


Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

Latest Stories
DevOps promotes continuous improvement through a culture of collaboration. But in real terms, how do you: Integrate activities across diverse teams and services? Make objective decisions with system-wide visibility? Use feedback loops to enable learning and improvement? With technology insights and real-world examples, in his general session at @DevOpsSummit, at 21st Cloud Expo, Andi Mann, Chief Technology Advocate at Splunk, explored how leading organizations use data-driven DevOps to close th...
"Digital transformation - what we knew about it in the past has been redefined. Automation is going to play such a huge role in that because the culture, the technology, and the business operations are being shifted now," stated Brian Boeggeman, VP of Alliances & Partnerships at Ayehu, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
The past few years have brought a sea change in the way applications are architected, developed, and consumed—increasing both the complexity of testing and the business impact of software failures. How can software testing professionals keep pace with modern application delivery, given the trends that impact both architectures (cloud, microservices, and APIs) and processes (DevOps, agile, and continuous delivery)? This is where continuous testing comes in. D
"Evatronix provides design services to companies that need to integrate the IoT technology in their products but they don't necessarily have the expertise, knowledge and design team to do so," explained Adam Morawiec, VP of Business Development at Evatronix, in this SYS-CON.tv interview at @ThingsExpo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
"WineSOFT is a software company making proxy server software, which is widely used in the telecommunication industry or the content delivery networks or e-commerce," explained Jonathan Ahn, COO of WineSOFT, in this SYS-CON.tv interview at 21st Cloud Expo, held Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA.
Smart cities have the potential to change our lives at so many levels for citizens: less pollution, reduced parking obstacles, better health, education and more energy savings. Real-time data streaming and the Internet of Things (IoT) possess the power to turn this vision into a reality. However, most organizations today are building their data infrastructure to focus solely on addressing immediate business needs vs. a platform capable of quickly adapting emerging technologies to address future ...
Mobile device usage has increased exponentially during the past several years, as consumers rely on handhelds for everything from news and weather to banking and purchases. What can we expect in the next few years? The way in which we interact with our devices will fundamentally change, as businesses leverage Artificial Intelligence. We already see this taking shape as businesses leverage AI for cost savings and customer responsiveness. This trend will continue, as AI is used for more sophistica...
There is a huge demand for responsive, real-time mobile and web experiences, but current architectural patterns do not easily accommodate applications that respond to events in real time. Common solutions using message queues or HTTP long-polling quickly lead to resiliency, scalability and development velocity challenges. In his session at 21st Cloud Expo, Ryland Degnan, a Senior Software Engineer on the Netflix Edge Platform team, will discuss how by leveraging a reactive stream-based protocol,...
In his Opening Keynote at 21st Cloud Expo, John Considine, General Manager of IBM Cloud Infrastructure, led attendees through the exciting evolution of the cloud. He looked at this major disruption from the perspective of technology, business models, and what this means for enterprises of all sizes. John Considine is General Manager of Cloud Infrastructure Services at IBM. In that role he is responsible for leading IBM’s public cloud infrastructure including strategy, development, and offering m...
Sanjeev Sharma Joins June 5-7, 2018 @DevOpsSummit at @Cloud Expo New York Faculty. Sanjeev Sharma is an internationally known DevOps and Cloud Transformation thought leader, technology executive, and author. Sanjeev's industry experience includes tenures as CTO, Technical Sales leader, and Cloud Architect leader. As an IBM Distinguished Engineer, Sanjeev is recognized at the highest levels of IBM's core of technical leaders.
Product connectivity goes hand and hand these days with increased use of personal data. New IoT devices are becoming more personalized than ever before. In his session at 22nd Cloud Expo | DXWorld Expo, Nicolas Fierro, CEO of MIMIR Blockchain Solutions, will discuss how in order to protect your data and privacy, IoT applications need to embrace Blockchain technology for a new level of product security never before seen - or needed.
The 22nd International Cloud Expo | 1st DXWorld Expo has announced that its Call for Papers is open. Cloud Expo | DXWorld Expo, to be held June 5-7, 2018, at the Javits Center in New York, NY, brings together Cloud Computing, Digital Transformation, Big Data, Internet of Things, DevOps, Machine Learning and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding busin...
Digital transformation is about embracing digital technologies into a company's culture to better connect with its customers, automate processes, create better tools, enter new markets, etc. Such a transformation requires continuous orchestration across teams and an environment based on open collaboration and daily experiments. In his session at 21st Cloud Expo, Alex Casalboni, Technical (Cloud) Evangelist at Cloud Academy, explored and discussed the most urgent unsolved challenges to achieve f...
SYS-CON Events announced today that Synametrics Technologies will exhibit at SYS-CON's 22nd International Cloud Expo®, which will take place on June 5-7, 2018, at the Javits Center in New York, NY. Synametrics Technologies is a privately held company based in Plainsboro, New Jersey that has been providing solutions for the developer community since 1997. Based on the success of its initial product offerings such as WinSQL, Xeams, SynaMan and Syncrify, Synametrics continues to create and hone inn...
Digital Transformation (DX) is not a "one-size-fits all" strategy. Each organization needs to develop its own unique, long-term DX plan. It must do so by realizing that we now live in a data-driven age, and that technologies such as Cloud Computing, Big Data, the IoT, Cognitive Computing, and Blockchain are only tools. In her general session at 21st Cloud Expo, Rebecca Wanta explained how the strategy must focus on DX and include a commitment from top management to create great IT jobs, monitor ...