How to Parse Invalid (Bad/Not Well-Formed) XML?

Last Updated : 26 Feb, 2024

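Parsing XML that is not well-formed is sometimes necessary when the data comes from sources you do not control. Standard XML parsers reject such input, because the XML specification treats well-formedness violations as fatal errors, but there are techniques for repairing malformed documents or extracting information from them. Strictly speaking, "invalid" XML fails validation against a DTD or schema, while "not well-formed" XML breaks the basic syntax rules; this article deals with the latter.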

In this article, we will explore how to parse such invalid XML using Java.

Parse invalid XML

Handling such input usually combines corrective pre-processing with tolerant parsing techniques. One common approach is to use a lenient parser that tolerates errors and recovers as much information as it can despite the malformed structure.

Corrective Actions:

Before parsing, consider pre-processing the XML to correct common errors. Tools such as JTidy (a Java port of HTML Tidy) or jsoup, which includes a lenient XML parsing mode, can clean up and repair malformed markup.
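
To illustrate what a corrective pre-processing step might look like, here is a minimal, dependency-free sketch that closes tags left open. It is only a stand-in for the idea and does not reflect how Tidy or jsoup work internally. The class and method names (NaiveTagRepair, closeUnclosedTags) are hypothetical, and the regex-based approach assumes simple markup without comments, CDATA sections, or '>' characters inside attribute values.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a naive repair pass, not a general-purpose XML fixer.
class NaiveTagRepair {

    // group 1: "/" for an end tag, group 2: tag name, group 3: "/" for a self-closing tag
    private static final Pattern TAG =
            Pattern.compile("<(/?)([A-Za-z_][\\w.:-]*)[^<>]*?(/?)>");

    static String closeUnclosedTags(String xml) {
        StringBuilder out = new StringBuilder();
        Deque<String> open = new ArrayDeque<>(); // stack of currently open elements
        Matcher m = TAG.matcher(xml);
        int last = 0;
        while (m.find()) {
            out.append(xml, last, m.start()); // copy text between tags unchanged
            last = m.end();
            String name = m.group(2);
            boolean closing = !m.group(1).isEmpty();
            boolean selfClosing = !m.group(3).isEmpty();
            if (closing) {
                if (!open.contains(name)) {
                    continue; // stray end tag with no matching start tag: drop it
                }
                // Close every element still open inside the one being closed
                while (!open.peek().equals(name)) {
                    out.append("</").append(open.pop()).append('>');
                }
                open.pop();
                out.append(m.group());
            } else {
                if (!selfClosing) {
                    open.push(name);
                }
                out.append(m.group());
            }
        }
        out.append(xml.substring(last));
        // Close anything still open at the end of the input
        while (!open.isEmpty()) {
            out.append("</").append(open.pop()).append('>');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String broken = "<root><element>Value</element><element>UnclosedTag</root>";
        System.out.println(closeUnclosedTags(broken));
        // prints <root><element>Value</element><element>UnclosedTag</element></root>
    }
}
```

The repaired string can then be passed to any strict XML parser. A repair like this guesses where the missing end tag belongs, so the result should be reviewed when the data matters.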

Step-by-Step Implementation

Let’s walk through a step-by-step example of parsing invalid XML.

Step 1: Define Invalid XML

<root>
<element>Value</element>
<element>UnclosedTag
</root>
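
To confirm that this snippet is not well-formed, it can be run through a strict parser. The sketch below uses the JDK's built-in DocumentBuilder rather than the Xerces DOMParser class; the class and method names (WellFormedCheck, isWellFormed) are illustrative assumptions.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: checks well-formedness with the JDK's strict DOM parser.
class WellFormedCheck {

    // Returns true if the parser accepts the input, false if it reports a SAX error.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String invalidXml =
                "<root>\n<element>Value</element>\n<element>UnclosedTag\n</root>";
        String validXml = "<root><element>Value</element></root>";
        System.out.println(isWellFormed(invalidXml)); // false
        System.out.println(isWellFormed(validXml));   // true
    }
}
```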

Step 2: Use Lenient Parser

  • Create a DOMParser instance.
  • Configure the parser using parser.setFeature().
  • Parse the XML using parser.parse() and handle any SAXParseException it throws.

Java Program to Parse Invalid XML Using Lenient XML Parsing

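The program below uses the DOMParser class from the Apache Xerces2 library. Note that Xerces2 is a conforming parser and rejects malformed input by default. It does offer the feature http://apache.org/xml/features/continue-after-fatal-error, which lets parsing continue after a fatal error, but the Xerces documentation warns that the parser's behavior in that mode is undetermined, so it should be used with caution.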
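
Another tolerant strategy, which needs no special parser feature, is to stream the input with SAX and keep whatever was read before the first fatal error. The following is a minimal sketch under that assumption. It uses the JDK's built-in SAX parser instead of the Xerces DOMParser class, and the names SalvagingSaxReader and readCompleteElements are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper: salvages data that was parsed before the first fatal error.
class SalvagingSaxReader {

    // Returns the text of every <element> that was fully closed before parsing stopped.
    static List<String> readCompleteElements(String xml) {
        List<String> values = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            private final StringBuilder text = new StringBuilder();

            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes atts) {
                text.setLength(0); // start collecting text for the new element
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("element".equals(qName)) {
                    values.add(text.toString().trim());
                }
                text.setLength(0);
            }
        };
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), handler);
        } catch (SAXParseException e) {
            // Malformed input: keep what was collected before the error position
            System.err.println("Stopped at line " + e.getLineNumber()
                    + ": " + e.getMessage());
        } catch (SAXException | IOException | ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
        return values;
    }

    public static void main(String[] args) {
        String invalidXml =
                "<root><element>Value</element><element>UnclosedTag</root>";
        System.out.println(readCompleteElements(invalidXml)); // [Value]
    }
}
```

The trade-off is that everything after the error position is lost, and the text of the unclosed element is discarded because its end tag never arrives.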

Java

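import org.apache.xerces.parsers.DOMParser;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

import java.io.IOException;
import java.io.StringReader;

public class LenientXMLParser {
    public static void main(String[] args) {
        // The second <element> is never closed, so this input is not well-formed
        String invalidXml = "<root><element>Value</element><element>UnclosedTag</root>";

        try {
            // Create a Xerces DOM parser
            DOMParser parser = new DOMParser();

            // Build the full DOM eagerly instead of expanding nodes on demand.
            // Note: this feature does NOT make the parser tolerate malformed input.
            parser.setFeature("http://apache.org/xml/features/dom/defer-node-expansion", false);

            // Parse the XML; a well-formedness error raises SAXParseException here
            parser.parse(new InputSource(new StringReader(invalidXml)));

            // Reached only if parsing succeeded
            Document document = parser.getDocument();

            // Process the document as needed
            // ...

        } catch (SAXException | IOException e) {
            // SAXParseException (a SAXException) carries the line and column of the error
            e.printStackTrace();
        }
    }
}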


Output:

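The parser does not recover from the unclosed tag. Instead, parser.parse() throws a SAXParseException similar to the following (the exact message and position depend on the parser version):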

org.xml.sax.SAXParseException; lineNumber: 1; columnNumber: 59; The element type "root" must be terminated by the matching end-tag "</root>".
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at org.apache.xerces.parsers.DOMParser.parse(Unknown Source)
at LenientXMLParser.main(LenientXMLParser.java:21)

Explanation of the above Program:

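  • We create a Xerces DOMParser.
  • We call parser.setFeature() to turn off deferred node expansion. This setting only controls when DOM nodes are built; it does not make the parser lenient.
  • We parse the XML using parser.parse(). Because the input is not well-formed, this call throws a SAXParseException.
  • If parsing had succeeded, parser.getDocument() would return the resulting Document.
  • The document could then be processed as needed.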

