Welcome!

Related Topics: Industrial IoT, Microservices Expo

Industrial IoT: Article

Index XML Documents with VTD-XML

How to turn the indexing capability on in your application

  • Superior Indexing Performance - The performance of generating an VTD+XML index is identical to the parsing performance of VTD-XML, since both are essentially the same operation viewed from two different angles. On a 1.7GHz Pentium machine, it's reasonable to expect a sustained indexing performance of 50 MB/s-70 MB/s.
    •  Easy To Use - Usually adding a couple of lines (loadIndex(...) and writeIndex(...) as seen in the previous example) to your existing VTD-XML code is all that's needed to enable VTD+XML in your applications.
    •  Compact - The size of VTD+XML is usually about 30%-50% bigger than the size of the corresponding XML document. This is again consistent with the memory use of the VTD-XML processing model.
    •  Platform Neutral - Just like XML, VTD+XML is designed to be platform-neutral in that it explicitly includes information about the byte endian-ness of the platform on which the index is generated. Users of the C or C# version VTD-XML code can automatically recognize and make use of the index generated by the Java version.

    At the same time, users of VTD+XML need to be aware of the following limitations:
    •  Upper Limit on Document Size - The maximum XML document size supported by VTD+XML is 2GB without name space support. With name space, VTD+XML supports a maximum of 1GB.
    •  Lack of Support for External Entities - VTD-XML currently supports five built-in entity references (<, >, &, ', and ") as defined in XML 1.0.

    The Case Involving XML Content Update
    Some of you may wonder: What if the subsequent XML operations involve content updates that shift the offset value? In general, those use cases often require the updated XML document to be re-indexed. And for large XML documents, you may argue that the cost of re-indexing can be quite significant. However, there are actually several workarounds, all aimed at reducing, even eliminating, the cost of re-indexing.

    The first workaround: Instead of creating the VTD+XML index for a single big XML document, split the XML document into multiple smaller ones, each of which is then indexed using VTD+XML. From this point on, you only need to regenerate a VTD+XML index for those "updated" XML fragments that are usually a lot smaller and therefore cheaper to re-index.

    VTD-XML 2.0 also introduced the "overwrite" feature that lets you modify XML content without needing to regenerate the index. The code below makes use of the VTDNav class's new "overWrite(...)" to change the text node of "<root>good</root>" from "good" or "bad." If the new content is shorter or equal in length to that of the old content, the method "overWrite(...)" fills up the non-overlapping portion of the text with white spaces and returns true. Otherwise, no change to the original content and "overWrite(...)" returns false.

    import com.ximpleware.*;
    class Overwrite{
       public static void main(String s[]) throws Exception{
         VTDGen vg = new VTDGen();
         vg.setDoc("<root>good</root>".getBytes());
         vg.parse(true);
         VTDNav vn = vg.getNav();
         int i=vn.getText();
         //print "good"
         System.out.println("text ---> "+vn.toString(i));
         if (vn.overWrite(i,"bad".getBytes())){
           //overwrite, if successful, returns true
           //print "bad" here
           System.out.println("text ---> "+vn.toString(i));
         }
       }
    }

    The "overWrite" feature may look simple, but it actually has unexpected implications for the performance of XML. Consider the database table design in which you specify the column width. You can now borrow the same technique for XML composition: By pre-serializing some extra spaces into text nodes or attribute values, you can make "in situ" updates to those nodes and do so without regenerating the index. You can even pre-serialize, in an XML document, dummy elements containing text nodes or attribute values whose initial values are entirely white spaces. Those dummy elements serve as templates in anticipation of a future content update, as shown in the example below.

    The template

      <purchaseOrder orderDate="     ">
       <item partNum="     " >
         <productName>     </productName>
         <quantity>     </quantity>
         <USPrice>     </USPrice>
       </item>
     </purchaseOrder>

    After "stamping" in the data

       <purchaseOrder orderDate="1999-10-21">
         <item partNum="872-AA" >
           <productName>Lawnmower </productName>
           <quantity>1 </quantity>
           <USPrice> 100 </USPrice>
         </item>
       </purchaseOrder>

    And, by the same token, the concept of XML content deletion deserves a bit of rethinking as well. Instead of physically deleting an XML element, you can disable the XML elements by making them "invisible" to your applications to achieve the same goal. The benefit: you again avoid the need to re-index. Notice that this plays favorably to XML's strength as a loose encoding data format. Below is an example of setting the value of the attribute "enable" of an element to make it "invisible."

    Before

      <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA" enable="1">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
      </purchaseOrder>

    After

      <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA" enable='0'>
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
      </purchaseOrder>


  • More Stories By Jimmy Zhang

    Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

    Comments (0)

    Share your thoughts on this story.

    Add your comment
    You must be signed in to add a comment. Sign-in | Register

    In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


    Latest Stories
    The 20th International Cloud Expo has announced that its Call for Papers is open. Cloud Expo, to be held June 6-8, 2017, at the Javits Center in New York City, brings together Cloud Computing, Big Data, Internet of Things, DevOps, Containers, Microservices and WebRTC to one location. With cloud computing driving a higher percentage of enterprise IT budgets every year, it becomes increasingly important to plant your flag in this fast-expanding business opportunity. Submit your speaking proposal ...
    SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
    SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help globa...
    Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across supply chain networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost and time for product recall as well as advance trade. Are you curious about Blockchain and how it can provide you with new opportunities for innovation and growth? In her session at 20th Cloud Exp...
    Cloud promises the agility required by today’s digital businesses. As organizations adopt cloud based infrastructures and services, their IT resources become increasingly dynamic and hybrid in nature. Managing these require modern IT operations and tools. In his session at 20th Cloud Expo, Raj Sundaram, Senior Principal Product Manager at CA Technologies, will discuss how to modernize your IT operations in order to proactively manage your hybrid cloud and IT environments. He will be sharing be...
    Back in February of 2017, Andrew Clay Schafer of Pivotal tweeted the following: “seriously tho, the whole software industry is stuck on deployment when we desperately need architecture and telemetry.” Intrigue in a 140 characters. For me, I hear Andrew saying, “we’re jumping to step 5 before we’ve successfully completed steps 1-4.”
    In his keynote at 19th Cloud Expo, Sheng Liang, co-founder and CEO of Rancher Labs, discussed the technological advances and new business opportunities created by the rapid adoption of containers. With the success of Amazon Web Services (AWS) and various open source technologies used to build private clouds, cloud computing has become an essential component of IT strategy. However, users continue to face challenges in implementing clouds, as older technologies evolve and newer ones like Docker c...
    Five years ago development was seen as a dead-end career, now it’s anything but – with an explosion in mobile and IoT initiatives increasing the demand for skilled engineers. But apart from having a ready supply of great coders, what constitutes true ‘DevOps Royalty’? It’ll be the ability to craft resilient architectures, supportability, security everywhere across the software lifecycle. In his keynote at @DevOpsSummit at 20th Cloud Expo, Jeffrey Scheaffer, GM and SVP, Continuous Delivery Busine...
    NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Val...
    DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In his Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, will explore t...
    SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help globa...
    As pervasive as cloud technology is -- and as persuasive as the arguments are for using it -- the cloud has its limits. Some companies will always have security concerns about storing data in the cloud and certain high-transaction applications will always be better suited for on-premises storage. Those statements were among the bottom-line takeaways delivered at Cloud Expo this week, a three day, bi-annual event focused on cloud technologies, adoption and associated challenges.
    The age of Digital Disruption is evolving into the next era – Digital Cohesion, an age in which applications securely self-assemble and deliver predictive services that continuously adapt to user behavior. Information from devices, sensors and applications around us will drive services seamlessly across mobile and fixed devices/infrastructure. This evolution is happening now in software defined services and secure networking. Four key drivers – Performance, Economics, Interoperability and Trust ...
    SYS-CON Events announced today that Hitachi Data Systems, a wholly owned subsidiary of Hitachi LTD., will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City. Hitachi Data Systems (HDS) will be featuring the Hitachi Content Platform (HCP) portfolio. This is the industry’s only offering that allows organizations to bring together object storage, file sync and share, cloud storage gateways, and sophisticated search an...
    DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.