Welcome!

Related Topics: Industrial IoT, Microservices Expo

Industrial IoT: Article

Index XML Documents with VTD-XML

How to turn the indexing capability on in your application

VTD+XML in 30 Seconds
Allowing XML parsing to be decoupled from application logic, the key in the example above is the index file "po.vxl," which conforms to the VTD+XML spec. What is VTD+XML? Since VTD-XML's internal representation of XML infoset is inherently persistent, VTD+XML, as the name suggests, is simply the binary packaging format that combines VTD records, LCs entries, and XML into a single file. The detailed technical spec can be found at http://vtd-xml.sourceforge.net/persistence.html.

A Simple Example
This section gets down to the nitty-gritty of the specification by manually composing, byte-by-byte, a VTD+XML index. For the sake of simplicity, this example chooses to index a simple XML document containing a single child-less root element whose parsed representation doesn't have location cache entries. This example also assumes a big-endian byte order (as in Java) and UTF-8 document encoding (the default character set). The name space awareness is set to false.

<root/>

The first four-byte word of the corresponding index file is 0x0102A000 containing:

  • The VTD+XML version number (0x01) in the first byte
  • The character encoding format (0x02) in the second byte (Jimmy1)
  • The name space awareness, word length of LC entries in the last level, byte endian-ness of the platform, and VTD version as encoded in various bit fields in the third byte (0xA0)(Jimmy2)
  • The document depth (0x0 as the root element has no child)(Jimmy3)

    The second four-byte word has the value of 0x00040001 containing:

  • The number of LC levels supported by the VTD-XML implementation in the upper 16 bits (0x0004 in big endian)(Jimmy4)
  • The root element index value in the lower 16 bits (0x0001 in big endian)(Jimmy5)
The next four four-byte words are reserved and set to zero.
The byte order of all the ensuing 32-bit or 64-bit words is platform-dependent and specified in the third byte of the VTD+XML spec. The next eight-byte words indicate the size (in bytes) of the XML document, which equals seven in this example. Immediately following (0x3C726F6F742F3E00) is the byte content of the XML rounded up to an integer multiple of eight bytes by padding zero to the end.

The remaining part of VTD+XML index consists of multiple adjacent segments each containing an eight-byte word (0x0000000000000002 indicating the VTD record or LC entry count) followed by the actual content of the VTD records or LC entries. The first eight-byte word (0x000000000000000002) indicates that there are two VTD records that are 0xDFF0000000000000 and 0x0000000400000001.

The remaining three eight-byte words all have the value of zero indicating that the location caches in level one, two, and three have zero entry in the VTD+XML index.

As the final output, the VTD+XML index for "<root/>" is 88-bytes long and looks like the following hex:

0x0102A00000040001 0x0000000000000000
0x0000000000000000 0x0000000000000007
0x3C726F6F742F3E00 0x0000000000000002
0xDFF0000000000000 0x0000000400000001
0x0000000000000000 0x0000000000000000
0x0000000000000000

Benefits and Limitations
Because VTD+XML straightforwardly combines VTD and XML, it inherits all the benefits of VTD-XML parsing. When compared with existing XML indices (e.g., various pure-binary XML indices modeling labeled, ordered tree etc.), VTD+XML possesses many unique technical benefits:

•  General Purpose - Before VTD+XML, most native XML indices only optimize specific types (e.g., the axis) of Xpath lookups. If an input query differs slightly from the index type, the query execution still has to resort to expensive parsing. Due to this limitation, many native XML databases today require users to create multiple indices, one for each input query type so users can benefit from those indices. The problem is that XML database applications usually serve many types of queries that are unpredictable and complex in nature, often rendering the benefits of indexing insignificant. In comparison, VTD+XML is the first index that completely eliminates the cost of XML parsing and predictably speeds up any type of XPath query. It also works with namespaces exceptionally well.
•  Human Readable - VTD+XML is also the first human-readable XML index. You can actually open it in a text editor to examine the XML text. Figure 1 is what "po.vxl" looks like in "vim." More than just a nice property, VTD+XML's human-readability offers distinct advantages over pure binary indexing schemes. Everything else being equal, keeping XML in its original format avoids the processing cost of converting to and from any binary formats. Moreover, what if your applications just wants to modify the XML payload, such as inserting into it a chunk of XML text extracted out of another SOAP message? What's the point of converting XML to binary formats? In a service-oriented heterogeneous environment, maintaining XML in its original format automatically retains the openness and interoperability. It just seems to me that the only loss-less equivalent of XML is XML itself, no less.


  • More Stories By Jimmy Zhang

    Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

    Comments (0)

    Share your thoughts on this story.

    Add your comment
    You must be signed in to add a comment. Sign-in | Register

    In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


    Latest Stories
    Enterprises have taken advantage of IoT to achieve important revenue and cost advantages. What is less apparent is how incumbent enterprises operating at scale have, following success with IoT, built analytic, operations management and software development capabilities – ranging from autonomous vehicles to manageable robotics installations. They have embraced these capabilities as if they were Silicon Valley startups. As a result, many firms employ new business models that place enormous impor...
    SYS-CON Events announced today that TidalScale will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale is the leading provider of Software-Defined Servers that bring flexibility to modern data centers by right-sizing servers on the fly to fit any data set or workload. TidalScale’s award-winning inverse hypervisor technology combines multiple commodity servers (including their ass...
    As popularity of the smart home is growing and continues to go mainstream, technological factors play a greater role. The IoT protocol houses the interoperability battery consumption, security, and configuration of a smart home device, and it can be difficult for companies to choose the right kind for their product. For both DIY and professionally installed smart homes, developers need to consider each of these elements for their product to be successful in the market and current smart homes.
    SYS-CON Events announced today that MIRAI Inc. will exhibit at the Japan External Trade Organization (JETRO) Pavilion at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. MIRAI Inc. are IT consultants from the public sector whose mission is to solve social issues by technology and innovation and to create a meaningful future for people.
    In his general session at 21st Cloud Expo, Greg Dumas, Calligo’s Vice President and G.M. of US operations, will go over the new Global Data Protection Regulation and how Calligo can help business stay compliant in digitally globalized world. Greg Dumas is Calligo's Vice President and G.M. of US operations. Calligo is an established service provider that provides an innovative platform for trusted cloud solutions. Calligo’s customers are typically most concerned about GDPR compliance, applicatio...
    Companies are harnessing data in ways we once associated with science fiction. Analysts have access to a plethora of visualization and reporting tools, but considering the vast amount of data businesses collect and limitations of CPUs, end users are forced to design their structures and systems with limitations. Until now. As the cloud toolkit to analyze data has evolved, GPUs have stepped in to massively parallel SQL, visualization and machine learning.
    In his Opening Keynote at 21st Cloud Expo, John Considine, General Manager of IBM Cloud Infrastructure, will lead you through the exciting evolution of the cloud. He'll look at this major disruption from the perspective of technology, business models, and what this means for enterprises of all sizes. John Considine is General Manager of Cloud Infrastructure Services at IBM. In that role he is responsible for leading IBM’s public cloud infrastructure including strategy, development, and offering ...
    As hybrid cloud becomes the de-facto standard mode of operation for most enterprises, new challenges arise on how to efficiently and economically share data across environments. In his session at 21st Cloud Expo, Dr. Allon Cohen, VP of Product at Elastifile, will explore new techniques and best practices that help enterprise IT benefit from the advantages of hybrid cloud environments by enabling data availability for both legacy enterprise and cloud-native mission critical applications. By rev...
    SYS-CON Events announced today that Dasher Technologies will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. Dasher Technologies, Inc. ® is a premier IT solution provider that delivers expert technical resources along with trusted account executives to architect and deliver complete IT solutions and services to help our clients execute their goals, plans and objectives. Since 1999, we'v...
    SYS-CON Events announced today that NetApp has been named “Bronze Sponsor” of SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 – Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. NetApp is the data authority for hybrid cloud. NetApp provides a full range of hybrid cloud data services that simplify management of applications and data across cloud and on-premises environments to accelerate digital transformation. Together with their partners, NetApp emp...
    SYS-CON Events announced today that TidalScale, a leading provider of systems and services, will exhibit at SYS-CON's 21st International Cloud Expo®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. TidalScale has been involved in shaping the computing landscape. They've designed, developed and deployed some of the most important and successful systems and services in the history of the computing industry - internet, Ethernet, operating s...
    Join IBM November 1 at 21st Cloud Expo at the Santa Clara Convention Center in Santa Clara, CA, and learn how IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Cognitive analysis impacts today’s systems with unparalleled ability that were previously available only to manned, back-end operations. Thanks to cloud processing, IBM Watson can bring cognitive services and AI to intelligent, unmanned systems. Imagine a robot vacuum that becomes your personal assistant tha...
    SYS-CON Events announced today that Massive Networks, that helps your business operate seamlessly with fast, reliable, and secure internet and network solutions, has been named "Exhibitor" of SYS-CON's 21st International Cloud Expo ®, which will take place on Oct 31 - Nov 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. As a premier telecommunications provider, Massive Networks is headquartered out of Louisville, Colorado. With years of experience under their belt, their team of...
    Widespread fragmentation is stalling the growth of the IIoT and making it difficult for partners to work together. The number of software platforms, apps, hardware and connectivity standards is creating paralysis among businesses that are afraid of being locked into a solution. EdgeX Foundry is unifying the community around a common IoT edge framework and an ecosystem of interoperable components.
    Infoblox delivers Actionable Network Intelligence to enterprise, government, and service provider customers around the world. They are the industry leader in DNS, DHCP, and IP address management, the category known as DDI. We empower thousands of organizations to control and secure their networks from the core-enabling them to increase efficiency and visibility, improve customer service, and meet compliance requirements.