Index XML Documents with VTD-XML

How to turn the indexing capability on in your application

VTD+XML in 30 Seconds
The key to the example above is the index file "po.vxl," which conforms to the VTD+XML spec and allows XML parsing to be decoupled from application logic. What is VTD+XML? Because VTD-XML's internal representation of the XML infoset is inherently persistent, VTD+XML, as the name suggests, is simply the binary packaging format that combines VTD records, LC (location cache) entries, and the XML text into a single file. The detailed technical spec can be found at http://vtd-xml.sourceforge.net/persistence.html.

A Simple Example
This section gets down to the nitty-gritty of the specification by manually composing, byte by byte, a VTD+XML index. For the sake of simplicity, the example indexes a simple XML document containing a single childless root element whose parsed representation has no location cache entries. The example also assumes a big-endian byte order (as in Java) and UTF-8 document encoding (the default character set). Namespace awareness is set to false.

<root/>

The first four-byte word of the corresponding index file is 0x0102A000 containing:

  • The VTD+XML version number (0x01) in the first byte
  • The character encoding format (0x02) in the second byte
  • The namespace awareness, word length of LC entries in the last level, byte endianness of the platform, and VTD version, encoded in various bit fields in the third byte (0xA0)
  • The document depth in the fourth byte (0x00, as the root element has no child)

The second four-byte word has the value 0x00040001, containing:

  • The number of LC levels supported by the VTD-XML implementation in the upper 16 bits (0x0004 in big endian)
  • The root element index value in the lower 16 bits (0x0001 in big endian)
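The field boundaries of the two header words can be sketched in Java as follows. This is only an illustration of the byte and bit positions listed above; the class and field names are hypothetical and not part of the VTD-XML API, and the packed bit fields of the third byte are left undecoded because their exact layout is not given here.

```java
// Illustrative only: splits the two example header words into their fields.
// Names are hypothetical; they are not taken from the VTD-XML library.
final class VtdHeaderFields {
    final int version;    // first byte of word one
    final int encoding;   // second byte of word one
    final int flags;      // third byte: packed bit fields, kept raw here
    final int depth;      // fourth byte of word one
    final int lcLevels;   // upper 16 bits of word two
    final int rootIndex;  // lower 16 bits of word two

    VtdHeaderFields(int word1, int word2) {
        version   = (word1 >>> 24) & 0xFF;
        encoding  = (word1 >>> 16) & 0xFF;
        flags     = (word1 >>> 8) & 0xFF;
        depth     = word1 & 0xFF;
        lcLevels  = (word2 >>> 16) & 0xFFFF;
        rootIndex = word2 & 0xFFFF;
    }

    public static void main(String[] args) {
        VtdHeaderFields h = new VtdHeaderFields(0x0102A000, 0x00040001);
        // prints: version=1 encoding=2 flags=0xA0 depth=0 lcLevels=4 rootIndex=1
        System.out.printf("version=%d encoding=%d flags=0x%02X depth=%d lcLevels=%d rootIndex=%d%n",
                h.version, h.encoding, h.flags, h.depth, h.lcLevels, h.rootIndex);
    }
}
```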
The next four four-byte words are reserved and set to zero.
The byte order of all the ensuing 32-bit or 64-bit words is platform-dependent and is specified in the third byte of the VTD+XML index. The next eight-byte word indicates the size (in bytes) of the XML document, which is seven in this example. Immediately following (0x3C726F6F742F3E00) is the byte content of the XML, rounded up to an integer multiple of eight bytes by padding the end with zeros.
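The padding rule can be checked with a few lines of Java. This is a minimal sketch using only the standard library; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative only: pads XML bytes to a multiple of eight, as described above.
final class XmlPadding {
    // Rounds the length up to the next multiple of eight; Arrays.copyOf zero-fills the tail.
    static byte[] padToEightBytes(byte[] xml) {
        int paddedLength = (xml.length + 7) & ~7;
        return Arrays.copyOf(xml, paddedLength);
    }

    // Renders a byte array as one uppercase hex string with a 0x prefix.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder("0x");
        for (byte b : data) {
            sb.append(String.format("%02X", b));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] xml = "<root/>".getBytes(StandardCharsets.UTF_8);
        System.out.println(xml.length);                   // prints 7
        System.out.println(toHex(padToEightBytes(xml)));  // prints 0x3C726F6F742F3E00
    }
}
```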

The remaining part of the VTD+XML index consists of multiple adjacent segments, each containing an eight-byte word (the VTD record or LC entry count) followed by the actual content of the VTD records or LC entries. The first eight-byte word (0x0000000000000002) indicates that there are two VTD records, which are 0xDFF0000000000000 and 0x0000000400000001.
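To see what the two records carry, the sketch below unpacks them in Java. It assumes the 64-bit big-endian record layout described in the VTD-XML documentation (4-bit token type, 8-bit depth, 20-bit length, 2 reserved bits, 30-bit offset); that layout is not restated in this article, so treat the field positions as an assumption. The class and method names are hypothetical.

```java
// Illustrative only. ASSUMPTION: record layout is
// [63..60] token type, [59..52] depth, [51..32] length, [31..30] reserved, [29..0] offset.
final class VtdRecordFields {
    static int tokenType(long rec) { return (int) ((rec >>> 60) & 0xF); }
    static int depth(long rec)     { return (int) ((rec >>> 52) & 0xFF); }
    static int length(long rec)    { return (int) ((rec >>> 32) & 0xFFFFF); }
    static int offset(long rec)    { return (int) (rec & 0x3FFFFFFF); }

    public static void main(String[] args) {
        long[] records = { 0xDFF0000000000000L, 0x0000000400000001L };
        for (long rec : records) {
            System.out.printf("type=%d depth=%d length=%d offset=%d%n",
                    tokenType(rec), depth(rec), length(rec), offset(rec));
        }
        // Under the assumed layout, the second record points at offset 1, length 4.
        String xml = "<root/>";
        long second = records[1];
        System.out.println(xml.substring(offset(second), offset(second) + length(second)));
    }
}
```

Under that assumption, the second record's offset and length select the element name "root" in the XML text, which is consistent with the example document.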

The remaining three eight-byte words all have the value zero, indicating that the location caches at levels one, two, and three have no entries in the VTD+XML index.

As the final output, the VTD+XML index for "<root/>" is 88 bytes long and looks like the following in hex:

0x0102A00000040001 0x0000000000000000
0x0000000000000000 0x0000000000000007
0x3C726F6F742F3E00 0x0000000000000002
0xDFF0000000000000 0x0000000400000001
0x0000000000000000 0x0000000000000000
0x0000000000000000
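The same 88 bytes can be assembled programmatically. The following Java sketch writes the words in the order described in this section, using only java.nio from the standard library. It does not use the VTD-XML library, the class and method names are hypothetical, and the constants are copied from the walkthrough rather than computed by a parser.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Illustrative only: hand-assembles the example index for "<root/>".
final class VtdIndexSketch {
    static byte[] buildIndex() {
        byte[] xml = "<root/>".getBytes(StandardCharsets.UTF_8);
        int paddedLength = (xml.length + 7) & ~7;           // 7 rounds up to 8

        ByteBuffer buf = ByteBuffer.allocate(88).order(ByteOrder.BIG_ENDIAN);
        buf.putInt(0x0102A000);                             // version, encoding, flags, depth
        buf.putInt(0x00040001);                             // LC levels, root element index
        for (int i = 0; i < 4; i++) buf.putInt(0);          // four reserved words
        buf.putLong(xml.length);                            // XML size in bytes
        buf.put(xml);                                       // XML content
        for (int i = xml.length; i < paddedLength; i++) buf.put((byte) 0); // zero padding
        buf.putLong(2);                                     // VTD record count
        buf.putLong(0xDFF0000000000000L);                   // first VTD record
        buf.putLong(0x0000000400000001L);                   // second VTD record
        for (int i = 0; i < 3; i++) buf.putLong(0);         // three empty LC segments
        return buf.array();
    }

    // Formats the eight-byte word at the given word index as 0x followed by 16 hex digits.
    static String hexWord(byte[] data, int wordIndex) {
        return String.format("0x%016X", ByteBuffer.wrap(data).getLong(wordIndex * 8));
    }

    public static void main(String[] args) {
        byte[] index = buildIndex();
        System.out.println(index.length + " bytes");
        for (int i = 0; i < index.length / 8; i++) {
            System.out.println(hexWord(index, i));
        }
    }
}
```

Running main prints the length followed by the eleven eight-byte words, one per line, which can be compared against the listing above.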

Benefits and Limitations
Because VTD+XML straightforwardly combines VTD and XML, it inherits all the benefits of VTD-XML parsing. Compared with existing XML indices (e.g., various pure-binary XML indices that model labeled, ordered trees), VTD+XML possesses many unique technical benefits:

•  General Purpose - Before VTD+XML, most native XML indices optimized only specific types (e.g., the axis) of XPath lookups. If an input query differs even slightly from the index type, query execution still has to resort to expensive parsing. Because of this limitation, many native XML databases today require users to create multiple indices, one for each input query type, to benefit from them. The problem is that XML database applications usually serve many types of queries that are unpredictable and complex in nature, often rendering the benefits of indexing insignificant. In comparison, VTD+XML is the first index that completely eliminates the cost of XML parsing and predictably speeds up any type of XPath query. It also works exceptionally well with namespaces.
•  Human Readable - VTD+XML is also the first human-readable XML index. You can actually open it in a text editor to examine the XML text. Figure 1 is what "po.vxl" looks like in "vim." More than just a nice property, VTD+XML's human-readability offers distinct advantages over pure binary indexing schemes. Everything else being equal, keeping XML in its original format avoids the processing cost of converting to and from any binary format. Moreover, what if your application just wants to modify the XML payload, such as inserting into it a chunk of XML text extracted from another SOAP message? What's the point of converting XML to a binary format? In a service-oriented heterogeneous environment, maintaining XML in its original format automatically retains openness and interoperability. It just seems to me that the only lossless equivalent of XML is XML itself, no less.


Jimmy Zhang is a cofounder of XimpleWare, a provider of high-performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.
