Welcome!

Related Topics: Industrial IoT, Microservices Expo

Industrial IoT: Article

Index XML Documents with VTD-XML

How to turn the indexing capability on in your application

  • Superior Indexing Performance - The performance of generating an VTD+XML index is identical to the parsing performance of VTD-XML, since both are essentially the same operation viewed from two different angles. On a 1.7GHz Pentium machine, it's reasonable to expect a sustained indexing performance of 50 MB/s-70 MB/s.
    •  Easy To Use - Usually adding a couple of lines (loadIndex(...) and writeIndex(...) as seen in the previous example) to your existing VTD-XML code is all that's needed to enable VTD+XML in your applications.
    •  Compact - The size of VTD+XML is usually about 30%-50% bigger than the size of the corresponding XML document. This is again consistent with the memory use of the VTD-XML processing model.
    •  Platform Neutral - Just like XML, VTD+XML is designed to be platform-neutral in that it explicitly includes information about the byte endian-ness of the platform on which the index is generated. Users of the C or C# version VTD-XML code can automatically recognize and make use of the index generated by the Java version.

    At the same time, users of VTD+XML need to be aware of the following limitations:
    •  Upper Limit on Document Size - The maximum XML document size supported by VTD+XML is 2GB without name space support. With name space, VTD+XML supports a maximum of 1GB.
    •  Lack of Support for External Entities - VTD-XML currently supports five built-in entity references (<, >, &, ', and ") as defined in XML 1.0.

    The Case Involving XML Content Update
    Some of you may wonder: What if the subsequent XML operations involve content updates that shift the offset value? In general, those use cases often require the updated XML document to be re-indexed. And for large XML documents, you may argue that the cost of re-indexing can be quite significant. However, there are actually several workarounds, all aimed at reducing, even eliminating, the cost of re-indexing.

    The first workaround: Instead of creating the VTD+XML index for a single big XML document, split the XML document into multiple smaller ones, each of which is then indexed using VTD+XML. From this point on, you only need to regenerate a VTD+XML index for those "updated" XML fragments that are usually a lot smaller and therefore cheaper to re-index.

    VTD-XML 2.0 also introduced the "overwrite" feature that lets you modify XML content without needing to regenerate the index. The code below makes use of the VTDNav class's new "overWrite(...)" to change the text node of "<root>good</root>" from "good" or "bad." If the new content is shorter or equal in length to that of the old content, the method "overWrite(...)" fills up the non-overlapping portion of the text with white spaces and returns true. Otherwise, no change to the original content and "overWrite(...)" returns false.

    import com.ximpleware.*;
    class Overwrite{
       public static void main(String s[]) throws Exception{
         VTDGen vg = new VTDGen();
         vg.setDoc("<root>good</root>".getBytes());
         vg.parse(true);
         VTDNav vn = vg.getNav();
         int i=vn.getText();
         //print "good"
         System.out.println("text ---> "+vn.toString(i));
         if (vn.overWrite(i,"bad".getBytes())){
           //overwrite, if successful, returns true
           //print "bad" here
           System.out.println("text ---> "+vn.toString(i));
         }
       }
    }

    The "overWrite" feature may look simple, but it actually has unexpected implications for the performance of XML. Consider the database table design in which you specify the column width. You can now borrow the same technique for XML composition: By pre-serializing some extra spaces into text nodes or attribute values, you can make "in situ" updates to those nodes and do so without regenerating the index. You can even pre-serialize, in an XML document, dummy elements containing text nodes or attribute values whose initial values are entirely white spaces. Those dummy elements serve as templates in anticipation of a future content update, as shown in the example below.

    The template

      <purchaseOrder orderDate="     ">
       <item partNum="     " >
         <productName>     </productName>
         <quantity>     </quantity>
         <USPrice>     </USPrice>
       </item>
     </purchaseOrder>

    After "stamping" in the data

       <purchaseOrder orderDate="1999-10-21">
         <item partNum="872-AA" >
           <productName>Lawnmower </productName>
           <quantity>1 </quantity>
           <USPrice> 100 </USPrice>
         </item>
       </purchaseOrder>

    And, by the same token, the concept of XML content deletion deserves a bit of rethinking as well. Instead of physically deleting an XML element, you can disable the XML elements by making them "invisible" to your applications to achieve the same goal. The benefit: you again avoid the need to re-index. Notice that this plays favorably to XML's strength as a loose encoding data format. Below is an example of setting the value of the attribute "enable" of an element to make it "invisible."

    Before

      <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA" enable="1">
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
      </purchaseOrder>

    After

      <purchaseOrder orderDate="1999-10-21">
       <item partNum="872-AA" enable='0'>
         <productName>Lawnmower</productName>
         <quantity>1</quantity>
         <USPrice>148.95</USPrice>
       </item>
      </purchaseOrder>


  • More Stories By Jimmy Zhang

    Jimmy Zhang is a cofounder of XimpleWare, a provider of high performance XML processing solutions. He has working experience in the fields of electronic design automation and Voice over IP for a number of Silicon Valley high-tech companies. He holds both a BS and MS from the department of EECS from U.C. Berkeley.

    Comments (0)

    Share your thoughts on this story.

    Add your comment
    You must be signed in to add a comment. Sign-in | Register

    In accordance with our Comment Policy, we encourage comments that are on topic, relevant and to-the-point. We will remove comments that include profanity, personal attacks, racial slurs, threats of violence, or other inappropriate material that violates our Terms and Conditions, and will block users who make repeated violations. We ask all readers to expect diversity of opinion and to treat one another with dignity and respect.


    Latest Stories
    SYS-CON Events announced today that Hitrons Solutions will exhibit at the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA. Hitrons Solutions Inc. is distributor in the North American market for unique products and services of small and medium-size businesses, including cloud services and solutions, SEO marketing platforms, and mobile applications.
    With the introduction of IoT and Smart Living in every aspect of our lives, one question has become relevant: What are the security implications? To answer this, first we have to look and explore the security models of the technologies that IoT is founded upon. In his session at @ThingsExpo, Nevi Kaja, a Research Engineer at Ford Motor Company, will discuss some of the security challenges of the IoT infrastructure and relate how these aspects impact Smart Living. The material will be delivered i...
    For organizations that have amassed large sums of software complexity, taking a microservices approach is the first step toward DevOps and continuous improvement / development. Integrating system-level analysis with microservices makes it easier to change and add functionality to applications at any time without the increase of risk. Before you start big transformation projects or a cloud migration, make sure these changes won’t take down your entire organization.
    Historically, some banking activities such as trading have been relying heavily on analytics and cutting edge algorithmic tools. The coming of age of powerful data analytics solutions combined with the development of intelligent algorithms have created new opportunities for financial institutions. In his session at 20th Cloud Expo, Sebastien Meunier, Head of Digital for North America at Chappuis Halder & Co., will discuss how these tools can be leveraged to develop a lasting competitive advanta...
    What if you could build a web application that could support true web-scale traffic without having to ever provision or manage a single server? Sounds magical, and it is! In his session at 20th Cloud Expo, Chris Munns, Senior Developer Advocate for Serverless Applications at Amazon Web Services, will show how to build a serverless website that scales automatically using services like AWS Lambda, Amazon API Gateway, and Amazon S3. We will review several frameworks that can help you build serverle...
    Your homes and cars can be automated and self-serviced. Why can't your storage? From simply asking questions to analyze and troubleshoot your infrastructure, to provisioning storage with snapshots, recovery and replication, your wildest sci-fi dream has come true. In his session at @DevOpsSummit at 20th Cloud Expo, Dan Florea, Director of Product Management at Tintri, will provide a ChatOps demo where you can talk to your storage and manage it from anywhere, through Slack and similar services ...
    VeriStor Systems has announced that CRN has named VeriStor to its 2017 Managed Service Provider (MSP) 500 list in the Elite 150 category. This annual list recognizes North American solution providers with cutting-edge approaches to delivering managed services. Their offerings help companies navigate the complex and ever-changing landscape of IT, improve operational efficiencies, and maximize their return on IT investments. In today’s fast-paced business environments, MSPs play an important role...
    SYS-CON Events announced today that CA Technologies has been named “Platinum Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY, and the 21st International Cloud Expo®, which will take place October 31-November 2, 2017, at the Santa Clara Convention Center in Santa Clara, CA. CA Technologies helps customers succeed in a future where every business – from apparel to energy – is being rewritten by software. From ...
    SYS-CON Events announced today that Cloudistics, an on-premises cloud computing company, has been named “Bronze Sponsor” of SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Cloudistics delivers a complete public cloud experience with composable on-premises infrastructures to medium and large enterprises. Its software-defined technology natively converges network, storage, compute, virtualization, and management into a ...
    Keeping pace with advancements in software delivery processes and tooling is taxing even for the most proficient organizations. Point tools, platforms, open source and the increasing adoption of private and public cloud services requires strong engineering rigor - all in the face of developer demands to use the tools of choice. As Agile has settled in as a mainstream practice, now DevOps has emerged as the next wave to improve software delivery speed and output. To make DevOps work, organization...
    SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...
    My team embarked on building a data lake for our sales and marketing data to better understand customer journeys. This required building a hybrid data pipeline to connect our cloud CRM with the new Hadoop Data Lake. One challenge is that IT was not in a position to provide support until we proved value and marketing did not have the experience, so we embarked on the journey ourselves within the product marketing team for our line of business within Progress. In his session at @BigDataExpo, Sum...
    DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm. In his Day 3 Keynote at 20th Cloud Expo, Chris Brown, a Solutions Marketing Manager at Nutanix, will explore t...
    DevOps is often described as a combination of technology and culture. Without both, DevOps isn't complete. However, applying the culture to outdated technology is a recipe for disaster; as response times grow and connections between teams are delayed by technology, the culture will die. A Nutanix Enterprise Cloud has many benefits that provide the needed base for a true DevOps paradigm.
    SYS-CON Events announced today that Ocean9will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Ocean9 provides cloud services for Backup, Disaster Recovery (DRaaS) and instant Innovation, and redefines enterprise infrastructure with its cloud native subscription offerings for mission critical SAP workloads.