How to Parse Invalid (Bad/Not Well-Formed) XML?

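Parsing invalid or not well-formed XML can be a necessity when dealing with data from diverse sources. Strictly speaking, "invalid" XML fails validation against a DTD or schema, while "not well-formed" XML breaks the basic syntax rules (for example, an unclosed tag); this article is concerned with the latter. Standard XML parsers reject input that is not well-formed, but there are strategies and techniques to handle and extract information from such documents.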

In this article, we will explore how to parse such invalid XML using Java.



Parse invalid XML

Parsing invalid XML involves a combination of corrective actions and flexible parsing techniques. One common approach is to use a lenient XML parser that can tolerate errors and retrieve information despite the malformed structure.

Corrective Actions:



Before parsing, consider pre-processing the XML to correct common errors such as unclosed or mismatched tags. Tools like Tidy (JTidy in Java) or jsoup, which offers an XML parser mode, can help in cleaning and repairing XML documents.
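To illustrate what such a repair step does, here is a minimal sketch that balances tags using only the JDK. It is a hand-rolled stand-in, not JTidy or jsoup: the class name TagBalancer and the method repair are hypothetical, and the sketch assumes simple markup (no comments, CDATA sections, or ">" characters inside attribute values).

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration only; not part of any library.
class TagBalancer {
    // group 1 = "/" for end tags, group 2 = element name, group 3 = "/" for self-closing tags
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.-]*)[^<>]*?(/?)>");

    static String repair(String xml) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>();   // stack of currently open element names
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start());      // copy text between tags unchanged
            last = m.end();
            String name = m.group(2);
            boolean isEndTag = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (!isEndTag) {
                out.append(m.group());
                if (!selfClosing) {
                    open.push(name);
                }
            } else if (open.contains(name)) {
                // Close any elements that were left open inside the one being closed.
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            }
            // An end tag with no matching start tag is silently dropped.
        }
        out.append(xml.substring(last));
        // Close whatever is still open at the end of the input.
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String repaired = repair("<root><element>Value</element><element>UnclosedTag</root>");
        System.out.println(repaired);
        // The repaired string is accepted by the standard JDK DOM parser.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(repaired)));
        System.out.println(doc.getElementsByTagName("element").getLength());
    }
}
```

For real-world input, a maintained library is likely a better choice than a regular-expression approach like this one, which only covers the unclosed-tag and stray-end-tag cases.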

Step-by-Step Implementation

Let’s walk through a step-by-step example of parsing invalid XML.

Step 1: Define Invalid XML

<root>
<element>Value</element>
<element>UnclosedTag
</root>
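Before attempting any repair, it can help to confirm that the document really is malformed and to see where the first problem lies. The following sketch uses the JDK's built-in DOM parser for that check; the class name WellFormedCheck and the method firstError are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only.
class WellFormedCheck {
    /** Returns null if the XML is well-formed, otherwise a description of the first fatal error. */
    static String firstError(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return null;
        } catch (SAXParseException e) {
            return "line " + e.getLineNumber() + ", column " + e.getColumnNumber()
                    + ": " + e.getMessage();
        } catch (Exception e) {
            return e.toString();
        }
    }

    public static void main(String[] args) {
        String invalidXml = "<root>\n<element>Value</element>\n<element>UnclosedTag\n</root>";
        // Prints the position and message of the first well-formedness error.
        System.out.println(firstError(invalidXml));
    }
}
```

The exact wording and position of the message depend on the parser implementation, so it is safer to branch on whether an error was reported than on its text.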

Step 2: Use Lenient Parser

Java Program to Parse Invalid XML Using Lenient XML Parsing

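Ideally, use a parser that can tolerate errors. The example below uses the DOMParser from the Apache Xerces2 library. Note that Xerces2 is a conforming XML parser: by default it treats a well-formedness violation as a fatal error and stops parsing, so this program reports the error rather than recovering data from the invalid XML.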




import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
  
import java.io.StringReader;
  
public class LenientXMLParser {
    public static void main(String[] args) {
        String invalidXml = "<root><element>Value</element><element>UnclosedTag</root>";
  
        try {
            // Create a DOM parser
            DOMParser parser = new DOMParser();
  
            // Disable deferred (lazy) DOM node expansion; this does NOT make the parser lenient
            parser.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", false);
  
            // Parse the invalid XML
            parser.parse(new InputSource(new StringReader(invalidXml)));
  
            // Retrieve the document
            Document document = parser.getDocument();
  
            // Process the document as needed
            // ...
  
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Output:

Because the input is not well-formed, the parser throws an exception and the program prints a stack trace like the following:

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 59; The element type "root" must be terminated by the matching end-tag "</root>".
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at LenientXMLParser.main(LenientXMLParser.java:21)
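Since a conforming parser stops at the first well-formedness error, one way to still extract data is to read the document as a stream and keep whatever was parsed before the error. The following is a minimal sketch of that idea using the JDK's StAX API (javax.xml.stream) rather than Xerces' DOMParser; the class name PartialXmlReader and the method salvage are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical helper for illustration only.
class PartialXmlReader {
    /** Collects "name=text" for every element that was fully read before the first error. */
    static List<String> salvage(String xml) {
        List<String> values = new ArrayList<>();
        StringBuilder text = new StringBuilder();
        try {
            XMLStreamReader reader = XMLInputFactory.newInstance()
                    .createXMLStreamReader(new StringReader(xml));
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    text.setLength(0);                 // start collecting text for this element
                } else if (event == XMLStreamConstants.CHARACTERS) {
                    text.append(reader.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String value = text.toString().trim();
                    if (!value.isEmpty()) {
                        values.add(reader.getLocalName() + "=" + value);
                    }
                    text.setLength(0);
                }
            }
        } catch (XMLStreamException e) {
            // The stream is not well-formed from this point on; keep what was read so far.
        }
        return values;
    }

    public static void main(String[] args) {
        String invalidXml = "<root><element>Value</element><element>UnclosedTag</root>";
        System.out.println(salvage(invalidXml));
    }
}
```

This approach does not repair anything: content from the point of the error onward (here, the text of the unclosed element) is lost, so it suits cases where partial data is acceptable.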

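Explanation of the LenientXMLParser Program:

The LenientXMLParser program creates a Xerces DOMParser and sets the defer-node-expansion feature to false. That feature only controls whether DOM nodes are built lazily; it does not relax well-formedness checking. Because the second <element> tag is never closed, the call to parse() throws a SAXParseException, so getDocument() is never reached and no data is extracted. The catch block then prints the stack trace shown in the Output section. To actually recover data from such input, repair the XML before parsing it, or read it with a tool designed to tolerate errors.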

