Document Object Model (DOM) in R
Last Updated: 25 Jan, 2024
The Document Object Model (DOM) is a programming interface that represents structured documents. It is a platform- and language-neutral interface that enables programs and scripts to dynamically access and edit a document's content, structure, and style. In web development, the DOM is most commonly associated with HTML and XML documents, but it can be applied to other types of documents as well.
What is the Document Object Model (DOM)?
The Document Object Model is a programming interface for HTML (HyperText Markup Language) and XML (Extensible Markup Language) documents that represents a document's hierarchical structure as a tree of objects. The tree's nodes correspond to the document's elements, attributes, and text. The root of the tree represents the complete document, with nodes branching out to hold its structural components.
Understanding the DOM is essential for navigating, manipulating, and extracting data from HTML documents programmatically. In R, packages such as rvest and xml2 provide functions for working with the DOM, which makes them useful for tasks such as web scraping and data extraction.
xml2 Package in R
In the R programming language, the xml2 package is widely used for working with XML and HTML files. It provides functions to read, manipulate, and save documents, making it a valuable tool for tasks such as web scraping and data extraction.
Key Concepts Related to the DOM
1. Nodes:
Nodes are the fundamental units of the DOM.
- Elements, attributes, and text content in a document are represented as nodes in the DOM.
- Elements are the building blocks of HTML or XML files.
- Attributes provide extra information about elements.
- Text nodes represent the actual content within elements.
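The node types described above can be inspected with xml2. The following is a minimal sketch that assumes the xml2 package is installed; the HTML fragment and variable names are made up for illustration.

```r
library(xml2)

# A small, made-up HTML fragment used only for illustration
html <- '<div id="intro" class="note"><p>Hello, <b>DOM</b>!</p></div>'
doc <- read_html(html)

# Element node: the <div> element and its attributes
div <- xml_find_first(doc, "//div")
xml_name(div)    # name of the element
xml_attrs(div)   # named character vector of its attributes

# xml_contents() returns text nodes as well as element nodes,
# whereas xml_children() returns element nodes only
p <- xml_find_first(doc, "//p")
node_types <- xml_type(xml_contents(p))
child_names <- xml_name(xml_children(p))
print(node_types)
print(child_names)
```

Here the paragraph holds three nodes: a text node, the `<b>` element, and a second text node, which is why `xml_contents()` and `xml_children()` return different numbers of nodes.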
2. XPath:
- XPath is a query language used to navigate XML and HTML documents.
- It provides a syntax for defining paths to specific elements or sets of elements inside the DOM.
3. Selecting Nodes:
- Functions like xml_find_all() and xml_find_first() select nodes based on XPath expressions.
- xml_find_all(doc, "//p") selects all paragraphs in the document.
- xml_find_first(doc, "//h1") selects the first h1 element.
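As a rough illustration of XPath-based selection, the sketch below runs both functions against a made-up HTML string. It assumes xml2 is installed; the document content and variable names are hypothetical.

```r
library(xml2)

# Made-up page used only for illustration
html <- '<html><body>
  <h1>Title</h1>
  <p class="lead">First paragraph</p>
  <p>Second paragraph</p>
</body></html>'
doc <- read_html(html)

all_p    <- xml_find_all(doc, "//p")                  # every <p> element
lead_p   <- xml_find_all(doc, "//p[@class='lead']")   # predicate on an attribute
first_h1 <- xml_find_first(doc, "//h1")               # first match only
no_h2    <- xml_find_first(doc, "//h2")               # no <h2> in this page

length(all_p)
xml_text(lead_p)
xml_text(first_h1)

# When nothing matches, xml_find_first() returns an object of class
# "xml_missing" instead of raising an error
inherits(no_h2, "xml_missing")
```

xml_find_all() always returns a node set (possibly empty), while xml_find_first() returns a single node, so checking for a missing result before extracting text avoids surprises with NA values.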
4. Modifying DOM:
- Functions like xml_add_child() and xml_remove() allow users to add or remove nodes from the document.
- xml_add_child(parent_node, new_element) adds a new element as a child of the specified parent node.
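A minimal sketch of adding and removing nodes, assuming xml2 is installed. The XML content is made up, and the example relies on xml_add_child() accepting an element name followed by unnamed text for the new node.

```r
library(xml2)

# Made-up XML document used only for illustration
doc <- read_xml("<list><item>a</item><item>b</item></list>")

# Append a new <item> with text "c" as the last child of the root element
xml_add_child(doc, "item", "c")

# Remove the first <item> from the tree
xml_remove(xml_find_first(doc, "//item"))

# Text of the items that remain
remaining <- xml_text(xml_find_all(doc, "//item"))
print(remaining)
```

Both functions modify the document in place, so there is no need to reassign `doc` after each call.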
5. Attributes and Text:
- xml_attr(node, "attribute_name") extracts the value of the named attribute from a node.
- xml_text(node) retrieves the text content of a node.
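The two accessors can be tried on a single link element. This is a sketch under the assumption that xml2 is installed; the link and its attributes are invented for the example.

```r
library(xml2)

# Made-up link used only for illustration
doc <- read_html('<a href="https://example.com" title="Example"> Visit site </a>')
link <- xml_find_first(doc, "//a")

href    <- xml_attr(link, "href")       # value of an existing attribute
rel     <- xml_attr(link, "rel")        # NA, because the attribute is absent
trimmed <- xml_text(link, trim = TRUE)  # text with surrounding whitespace removed

print(href)
print(rel)
print(trimmed)
```

Because a missing attribute yields NA rather than an error, results from scraped pages are usually worth checking with is.na() before further processing.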
Working with DOM in R
R is a statistical programming language and is not normally used for client-side web development. However, R can interact with the DOM when used together with web scraping or browser automation packages. Popular packages for working with the DOM in R include:
rvest:
- rvest is a web scraping package that allows you to extract data from HTML web pages.
- It provides functions to navigate the DOM, select elements, and extract information.
RSelenium:
- RSelenium is an R interface to the Selenium WebDriver, which allows you to control a web browser programmatically.
- This can be used to interact with and manipulate the DOM of a web page.
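To give a feel for the rvest workflow mentioned above, here is a minimal sketch that parses a made-up HTML string instead of a live page. It assumes rvest version 1.0.0 or later is installed, since html_elements() and html_text2() were introduced in that release.

```r
library(rvest)

# Made-up page used only for illustration
html <- '<html><body>
  <h1>Books</h1>
  <ul>
    <li><a href="/r">R Programming</a></li>
    <li><a href="/ds">Data Science Essentials</a></li>
  </ul>
</body></html>'
page <- read_html(html)

# Select every link inside the list with a CSS selector
links <- html_elements(page, "ul li a")

link_text <- html_text2(links)        # visible text of each link
link_href <- html_attr(links, "href") # value of the href attribute
print(link_text)
print(link_href)
```

rvest builds on xml2, so the objects returned here can also be passed to xml2 functions such as xml_find_all() when an XPath expression is more convenient than a CSS selector.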
Step 1: Install and Load xml2 Package
R
install.packages("xml2")
library(xml2)
Step 2: Read HTML/XML Document
R
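# `url` must hold the address of the page to read as a character string;
# the sample output below comes from a GeeksforGeeks NLP tutorial page
doc <- read_html(url)
print(doc)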
Output:
{html_document}
<html lang="en-US" prefix="og: http://ogp.me/ns#">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">\n<meta ...
[2] <body class="post-template-default single single-post postid-971700 single-format- ...
Step 3: Access Elements
Access Paragraphs
R
# Select every <p> element in the document
paragraphs <- xml_find_all(doc, "//p")
print(paragraphs)
Output:
{xml_nodeset (13)}
[1] <p>Learn the basics and advanced concepts of natural language processing (NLP) wi ...
[2] <p>NLP tutorial is designed for both beginners and professionals. Whether you’re ...
[3] <p>NLP stands for Natural Language Processing. It is the branch of Artificial Int ...
[4] <p>Natural Language Processing started in 1950 When <strong>Alan Mathison Turing< ...
[5] <p>There are two components of Natural Language Processing:</p>
...
...
...
Access Title
R
# Select the <title> element
title <- xml_find_all(doc, "//title")
print(title)
Output:
{xml_nodeset (1)}
[1] <title>Natural Language Processing (NLP) Tutorial - GeeksforGeeks</title>\n
Access h3 heading
R
# Select only the first <h3> element
first_h3 <- xml_find_first(doc, "//h3")
print(first_h3)
Output:
{html_node}
<h3>
[1] <strong>What is the most difficult part of natural language processing?</strong>
Step 4: Extract an Image Source
R
# Read the src attribute of the first <img> element
img_src <- xml_attr(xml_find_first(doc, "//img"), "src")
print(img_src)
Output:
[1] "https://media.geeksforgeeks.org/gfg-gg-logo.svg"
Step 5: Modify and Save the Document
R
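# Parse a new <div> element and append it as the last child of <body>
new_element <- read_xml("<div>New Element</div>")
xml_add_child(xml_find_first(doc, "//body"), new_element)
# Write the modified document to a file in the working directory
write_html(doc, "modified_document.html")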
Reading and Accessing XML Elements
R
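# Install the XML package only if it is not already available
if (!requireNamespace("XML", quietly = TRUE)) {
  install.packages("XML")
}
library(XML)

# Parse an XML string; asText = TRUE marks the input as XML content, not a file path
xml_tree <- xmlParse("<book><title>R Programming</title></book>", asText = TRUE)
root_element <- xmlRoot(xml_tree)

# xmlValue() returns the text of the element and all of its descendants
cat("Element Name:", xmlName(root_element), "\n")
cat("Element Value:", xmlValue(root_element), "\n")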
Output:
Element Name: book
Element Value: R Programming
Extracting Values from an XML Document with XPath
R
library(XML)

# Sample XML document held in a string
xml_string <- '<bookstore>
  <book>
    <title>R Programming</title>
    <author>John Doe</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Data Science Essentials</title>
    <author>Jane Smith</author>
    <price>39.99</price>
  </book>
</bookstore>'

# Parse the string and get the root node
xml_doc <- xmlParse(xml_string, asText = TRUE)
root_node <- xmlRoot(xml_doc)

# Apply xmlValue to every node matched by each XPath expression
titles <- xpathSApply(root_node, "//title", xmlValue)
authors <- xpathSApply(root_node, "//author", xmlValue)
prices <- xpathSApply(root_node, "//price", xmlValue)

cat("Titles:", titles, "\n")
cat("Authors:", authors, "\n")
cat("Prices:", prices, "\n")
Output:
Titles: R Programming Data Science Essentials
Authors: John Doe Jane Smith
Prices: 29.99 39.99
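The same extraction can also be sketched with xml2, the package used in the earlier steps, collecting the values into a data frame instead of printing them. This is one possible approach rather than the article's method; it assumes xml2 is installed, and the name `book_df` is chosen only for illustration.

```r
library(xml2)

# Same sample data as in the XML package example above
xml_string <- '<bookstore>
  <book>
    <title>R Programming</title>
    <author>John Doe</author>
    <price>29.99</price>
  </book>
  <book>
    <title>Data Science Essentials</title>
    <author>Jane Smith</author>
    <price>39.99</price>
  </book>
</bookstore>'
doc <- read_xml(xml_string)

# One node per <book>; the relative XPath "./title" is evaluated per book
books <- xml_find_all(doc, "//book")
book_df <- data.frame(
  title  = xml_text(xml_find_first(books, "./title")),
  author = xml_text(xml_find_first(books, "./author")),
  price  = as.numeric(xml_text(xml_find_first(books, "./price"))),
  stringsAsFactors = FALSE
)
print(book_df)
```

Querying relative to each `<book>` node keeps the title, author, and price of one book on the same row, even if a field were missing from some entry.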
Conclusion
In conclusion, the Document Object Model (DOM) is a central concept in web development: it provides a structured interface that represents the hierarchy of an HTML or XML document as a tree of objects. Although R is primarily a statistical programming language, it can be used together with web scraping and automation packages to interact with and manipulate the DOM.