Document Object Model (DOM) in R

The Document Object Model (DOM) is a programming interface that represents structured documents. It is a platform- and language-neutral interface that lets programs and scripts dynamically access and edit a document's content, structure, and style. In web development, the DOM is most often associated with HTML and XML documents, but it can be applied to other types of documents as well.

What is the Document Object Model (DOM)?

The Document Object Model is a programming interface for HTML (HyperText Markup Language) and XML (Extensible Markup Language) documents that represents a document's hierarchical structure as a tree of objects. The tree's nodes correspond to the document's elements, attributes, and text. The root of the tree represents the complete document, and the nodes branching out from it hold its structural components.
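The tree structure described above can be made visible with a short sketch. It assumes the xml2 package is installed; the inline HTML string and the helper name print_tree are hypothetical and used only for illustration.

```r
library(xml2)

# Hypothetical inline page, used only for illustration
tree_doc <- read_html(
  "<html><head><title>Demo</title></head><body><h1>Heading</h1><p>Some <b>bold</b> text.</p></body></html>"
)

# Recursively print each element name, indented by its depth in the tree
print_tree <- function(node, depth = 0) {
  cat(strrep("  ", depth), xml_name(node), "\n", sep = "")
  for (child in xml_children(node)) {
    print_tree(child, depth + 1)
  }
}

print_tree(tree_doc)
```

Each level of indentation in the printed result corresponds to one level of nesting in the document, with the html element at the root.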



Understanding the DOM is essential for navigating, manipulating, and extracting data from HTML documents. In R, several packages, such as rvest and xml2, provide functions for working with the DOM, which makes them useful for tasks such as web scraping and data extraction.

xml2 Package in R

In the R programming language, the xml2 package is widely used for working with XML and HTML files. It provides functions to read, manipulate, and save documents, which makes it a valuable tool for tasks such as web scraping and data extraction.



Key Concepts Related to the DOM

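1. Nodes:

Nodes are the fundamental units of the DOM. Every element, attribute, and piece of text in a document is a node in the tree.

2. XPath:

XPath is a query language for addressing nodes in an XML or HTML tree. For example, //p matches every <p> element in a document.

3. Selecting Nodes:

In xml2, xml_find_all() returns every node that matches an XPath expression, and xml_find_first() returns only the first match.

4. Modifying DOM:

Nodes can be added, changed, or removed after a document has been parsed. For example, xml_add_child() appends a new child node to an existing node.

5. Attributes and Text:

xml_attr() reads the value of an attribute such as src, and xml_text() returns the text content of a node.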
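The five concepts listed above can be tried together in one small sketch. It assumes the xml2 package; the catalog XML, its id values, and the variable names are hypothetical and exist only for this example.

```r
library(xml2)

# Hypothetical XML fragment, used only for illustration
cat_doc <- read_xml('<catalog><item id="a1">Pen</item><item id="b2">Ink</item></catalog>')

# Nodes, XPath, selecting nodes: pick every <item> element
items <- xml_find_all(cat_doc, "//item")

# Attributes and text: read the id attribute and the text of each node
item_ids <- xml_attr(items, "id")
item_labels <- xml_text(items)
print(item_ids)
print(item_labels)

# Modifying the DOM: change one attribute and append a new child node
xml_set_attr(items[[1]], "id", "a1-new")
catalog <- xml_find_first(cat_doc, "/catalog")
xml_add_child(catalog, "item", "Paper", id = "c3")

# Query again to see the effect of the modifications
updated_ids <- xml_attr(xml_find_all(cat_doc, "//item"), "id")
print(updated_ids)
```

The second query reflects both changes because xml2 modifies the parsed document in place rather than returning a copy.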

Working with DOM in R

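R itself is not normally used for client-side web development, as it is a statistical programming language. However, R can interact with the DOM when used alongside web scraping or web automation packages. Popular packages for working with the DOM in R include:

rvest: a web scraping package built on top of xml2 that selects nodes with CSS selectors or XPath expressions.

RSelenium: R bindings for Selenium WebDriver, used to automate a real browser when a page builds its DOM with JavaScript.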
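As a rough illustration of the rvest side of this list, the sketch below selects nodes with a CSS selector instead of XPath. It assumes rvest 1.0 or later is installed; the HTML fragment and variable names are hypothetical. RSelenium is left out because it needs a running browser driver.

```r
library(rvest)

# Hypothetical fragment; minimal_html() wraps it in a complete HTML document
scrape_page <- minimal_html('<h1>Title</h1><p class="lead">First</p><p>Second</p>')

# Select every <p> element with a CSS selector
para_nodes <- html_elements(scrape_page, "p")

# Extract the text and the class attribute of each selected node
para_strings <- html_text2(para_nodes)
para_classes <- html_attr(para_nodes, "class")

print(para_strings)
print(para_classes)  # NA where a node has no class attribute
```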

Step 1: Install and Load xml2 Package




install.packages("xml2")
library(xml2)

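Step 2: Read HTML/XML Document

Parse the page with read_html() (or read_xml() for XML input) and store the result in doc, which the following steps query. Printing doc shows the parsed document.

Output: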

{html_document}
<html lang="en-US" prefix="og: http://ogp.me/ns#">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta ...
[2] <body class="post-template-default single single-post postid-971700 single-format- ...
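A minimal sketch of the reading step, assuming the xml2 package. The page content here is a hypothetical inline string rather than the page whose output is shown above; read_html() also accepts a URL or a file path in place of the string.

```r
library(xml2)

# Hypothetical page content, used in place of a real URL or file path
page_source <- '<html lang="en"><head><title>Sample Page</title></head><body><p>Hello, DOM.</p></body></html>'

# read_html() uses a lenient HTML parser
sample_doc <- read_html(page_source)
print(class(sample_doc))
print(xml_name(sample_doc))

# read_xml() expects well-formed XML
sample_xml <- read_xml("<note><to>Reader</to></note>")
print(xml_name(sample_xml))
```

Using read_html() for web pages is usually the safer choice, since real pages are often not well-formed XML.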

Step 3: Access Elements

Access Paragraphs




paragraphs <- xml_find_all(doc, "//p")
print(paragraphs)

Output:

{xml_nodeset (13)}
[1] <p>Learn the basics and advanced concepts of natural language processing (NLP) wi ...
[2] <p>NLP tutorial is designed for both beginners and professionals. Whether you’re ...
[3] <p>NLP stands for Natural Language Processing. It is the branch of Artificial Int ...
[4] <p>Natural Language Processing started in 1950 When <strong>Alan Mathison Turing< ...
[5] <p>There are two components of Natural Language Processing:</p>
...
...
...
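The node set printed above still contains markup. A small sketch of turning paragraph nodes into plain text, assuming xml2 and a hypothetical inline page:

```r
library(xml2)

# Hypothetical page with two paragraphs, one containing nested markup
para_page <- read_html(
  "<body><p>First <strong>bold</strong> paragraph.</p><p>  Second paragraph.  </p></body>"
)

para_set <- xml_find_all(para_page, "//p")

# xml_text() drops the tags and keeps the text; trim = TRUE removes
# leading and trailing whitespace
para_text <- xml_text(para_set, trim = TRUE)
print(para_text)
```

The result is an ordinary character vector with one entry per paragraph, so it can be passed to any of R's string or text-analysis functions.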

Access Title




title <- xml_find_all(doc, "//title")
print(title)

Output:

{xml_nodeset (1)}
[1] <title>Natural Language Processing (NLP) Tutorial - GeeksforGeeks</title>\n
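Since a page normally has a single title, xml_find_first() combined with xml_text() gives the title as a plain string. A sketch under the assumption of a hypothetical inline page:

```r
library(xml2)

# Hypothetical page, used only for illustration
title_page <- read_html(
  "<html><head><title>Sample Title</title></head><body></body></html>"
)

# xml_find_first() returns a single node rather than a node set
title_node <- xml_find_first(title_page, "//title")
title_text <- xml_text(title_node)
print(title_text)
```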

Access h3 heading




first_h3 <- xml_find_first(doc, "//h3")
print(first_h3)

Output:

{html_node}
<h3>
[1] <strong>What is the most difficult part of natural language processing?</strong>
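When a page might not contain the requested heading, it helps to know how xml2 reports a miss. The sketch below assumes a hypothetical inline page with two h3 headings and no h4 heading.

```r
library(xml2)

# Hypothetical page with two <h3> headings and no <h4>
heading_page <- read_html(
  "<body><h2>Overview</h2><h3><strong>First question</strong></h3><h3>Second question</h3></body>"
)

# First match only, as text
first_h3_text <- xml_text(xml_find_first(heading_page, "//h3"))

# All matches, as a character vector
all_h3_text <- xml_text(xml_find_all(heading_page, "//h3"))

# No match: xml_find_first() yields a missing node, whose text is NA
missing_h4_text <- xml_text(xml_find_first(heading_page, "//h4"))

print(first_h3_text)
print(all_h3_text)
print(is.na(missing_h4_text))
```

Checking for NA before using the result avoids silent errors when a page layout changes.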

Step 4: Extract Image Source




img_src <- xml_attr(xml_find_first(doc, "//img"), "src")
print(img_src)

Output:

[1] "https://media.geeksforgeeks.org/gfg-gg-logo.svg"
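The same attribute lookup works on every image at once. A sketch assuming xml2, with a hypothetical page and the hypothetical base URL https://example.com/ used to resolve relative paths:

```r
library(xml2)

# Hypothetical page: one relative src, one missing src, one absolute src
img_page <- read_html(
  '<body><img src="/logo.svg" alt="Logo"><img alt="No source"><img src="https://example.com/a.png"></body>'
)

img_nodes <- xml_find_all(img_page, "//img")

# xml_attr() returns NA for nodes that lack the attribute
img_srcs <- xml_attr(img_nodes, "src")
print(img_srcs)

# Drop missing values, then resolve relative paths against the assumed base URL
found_srcs <- img_srcs[!is.na(img_srcs)]
abs_srcs <- url_absolute(found_srcs, base = "https://example.com/")
print(abs_srcs)
```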

Step 5: Save Modified Document




new_element <- read_xml("<div>New Element</div>")
xml_add_child(xml_find_first(doc, "//body"), new_element)
write_html(doc, "modified_document.html")

Output:

The file modified_document.html is written to the working directory, with the new <div> appended as the last child of <body>.
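One way to confirm that a modification was saved is to read the file back and query it. The sketch below assumes xml2, uses a hypothetical inline page, and writes to a temporary file instead of the working directory.

```r
library(xml2)

# Hypothetical starting page
mod_page <- read_html("<html><body><p>Existing</p></body></html>")

# Append a new <div> with text as the last child of <body>
body_node <- xml_find_first(mod_page, "//body")
xml_add_child(body_node, "div", "New Element")

# Save to a temporary file, then parse the saved file again
out_file <- tempfile(fileext = ".html")
write_html(mod_page, out_file)
reloaded <- read_html(out_file)

# The added node should be present in the reloaded document
added_text <- xml_text(xml_find_all(reloaded, "//body/div"))
print(added_text)
```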

Reading and Accessing XML Elements




# Install and load the XML package
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML")
}
library(XML)
  
# Parse XML document (asText = TRUE marks the input as XML content, not a file name)
xml_tree <- xmlParse("<book><title>R Programming</title></book>", asText = TRUE)
  
# Access root element
root_element <- xmlRoot(xml_tree)
  
# Display element information
cat("Element Name: ", xmlName(root_element), "\n")
cat("Element Value: ", xmlValue(root_element), "\n")

Output:

Element Name:  book 
Element Value:  R Programming 
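Beyond the root element, the XML package can also reach child elements and attributes. The sketch below assumes the XML package is installed; the year attribute and the author element are hypothetical additions to the example document.

```r
library(XML)

# Hypothetical document; the year attribute and <author> are added for illustration
book_doc <- xmlParse(
  "<book year='2024'><title>R Programming</title><author>John Doe</author></book>",
  asText = TRUE
)
book_root <- xmlRoot(book_doc)

# Number of child elements and their names
child_count <- xmlSize(book_root)
child_names <- unname(xmlSApply(book_root, xmlName))

# Access one child by name, and read an attribute of the root
book_title <- xmlValue(book_root[["title"]])
book_year <- xmlGetAttr(book_root, "year")

cat("Children:", child_names, "\n")
cat("Title:", book_title, "\n")
cat("Year:", book_year, "\n")
```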

Extracting Values from an XML Document with XPath




# Load XML package
library(XML)
  
# Create a sample XML document
xml_string <- '<bookstore>
                  <book>
                    <title>R Programming</title>
                    <author>John Doe</author>
                    <price>29.99</price>
                  </book>
                  <book>
                    <title>Data Science Essentials</title>
                    <author>Jane Smith</author>
                    <price>39.99</price>
                  </book>
              </bookstore>'
  
# Parse the XML document (asText = TRUE marks the input as XML content, not a file name)
xml_doc <- xmlParse(xml_string, asText = TRUE)
  
# Access the root element
root_node <- xmlRoot(xml_doc)
  
# Extract information from the XML document
titles <- xpathSApply(root_node, "//title", xmlValue)
authors <- xpathSApply(root_node, "//author", xmlValue)
prices <- xpathSApply(root_node, "//price", xmlValue)
  
# Display the extracted information
cat("Titles: ", titles, "\n")
cat("Authors: ", authors, "\n")
cat("Prices: ", prices, "\n")

Output:

Titles:  R Programming Data Science Essentials 
Authors:  John Doe Jane Smith 
Prices:  29.99 39.99 
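For analysis, the extracted vectors are usually more convenient as a data frame. A sketch assuming the XML package and the same bookstore data as above; the variable names are hypothetical.

```r
library(XML)

# Same bookstore data as in the example above, written on fewer lines
store_xml <- paste0(
  "<bookstore>",
  "<book><title>R Programming</title><author>John Doe</author><price>29.99</price></book>",
  "<book><title>Data Science Essentials</title><author>Jane Smith</author><price>39.99</price></book>",
  "</bookstore>"
)
store_doc <- xmlParse(store_xml, asText = TRUE)

# One column per XPath query; prices are converted from text to numbers
books <- data.frame(
  title = xpathSApply(store_doc, "//book/title", xmlValue),
  author = xpathSApply(store_doc, "//book/author", xmlValue),
  price = as.numeric(xpathSApply(store_doc, "//book/price", xmlValue)),
  stringsAsFactors = FALSE
)

print(books)
cat("Mean price:", mean(books$price), "\n")
```

Converting price with as.numeric() matters because xmlValue() always returns text, and arithmetic on text fails.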

Conclusion

In conclusion, the Document Object Model (DOM) is a key concept in web development that provides a structured interface for representing the hierarchy of an HTML or XML document as a tree of objects. Although R is primarily a statistical programming language, it can be used with web scraping and automation packages such as xml2, rvest, and RSelenium to access and modify the DOM.

