Prerequisite: Introduction to NLP
In this article, we discuss how to obtain text from a text file hosted online and extract the required data from it. For the purpose of this article, we will use a text file that is available online.
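The following must be available in the current working environment:
- NLTK library
- urllib library (part of the Python 3 standard library, so it needs no separate installation)
- BeautifulSoup library (imported in code as bs4)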
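A quick way to confirm that these libraries can be imported is sketched below. The helper name `missing_modules` is hypothetical and only serves this illustration; it relies on the standard library function `importlib.util.find_spec`.

```python
from importlib.util import find_spec

# Import names of the three libraries listed above.
# Note: BeautifulSoup is imported under the name "bs4".
REQUIRED_MODULES = ["nltk", "urllib", "bs4"]


def missing_modules(names):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules are available.")
```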
Step #1: Import the required libraries.
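The imports could look like the following minimal sketch. It assumes Python 3, where the URL-opening function lives in `urllib.request`; the article's original import lines may have differed.

```python
import nltk                          # tokenizers and other NLP tools
from urllib.request import urlopen   # opens the URL of the text file
from bs4 import BeautifulSoup        # parses and strips HTML/XML markup
```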
Some basic information about the above-mentioned libraries:
- NLTK library: nltk (the Natural Language Toolkit) is a collection of libraries and programs, written in Python, for processing English-language text.
- urllib library: This is Python's URL handling library.
- BeautifulSoup library: This is a library used for extracting data from HTML and XML documents.
Step #2: Extract the entire contents of the text file.
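A minimal sketch of this step is shown below. The function name `fetch_raw` and the UTF-8 default are assumptions, and the `data:` URL is only an offline stand-in for the article's text file so that the example runs without network access; replace it with the real URL.

```python
from urllib.parse import quote
from urllib.request import urlopen


def fetch_raw(url, encoding="utf-8"):
    """Download the resource at `url` and return its contents as a string."""
    with urlopen(url) as response:
        return response.read().decode(encoding)


# Offline stand-in for the online text file: a data: URL carrying a
# short snippet. Replace it with the real URL of the text file.
sample_url = "data:text/plain;charset=utf-8," + quote("<p>Hello, world!</p>")

raw = fetch_raw(sample_url)
print(raw)
```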
Thus, the unprocessed data is loaded into the variable raw.
Step #3: Next, we process the data to remove any HTML/XML tags that might be present in the ‘raw’ variable.
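One way to do this is to hand the data to BeautifulSoup, as in the sketch below. The sample string is a hypothetical stand-in for the downloaded contents, and the choice of the built-in `"html.parser"` is an assumption rather than something the article specifies.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded contents of the text file.
raw = "<html><body><p>Hello, <b>world</b>!</p></body></html>"

# "html.parser" ships with Python, so no extra parser package is needed.
soup = BeautifulSoup(raw, "html.parser")
```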
Step #4: Now we obtain the text present in the ‘raw’ variable.
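Assuming the data has been parsed with BeautifulSoup, the text can be pulled out with `get_text()`. The sample string below is a hypothetical stand-in for the real data.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded contents of the text file.
raw = "<p>Hello, <b>world</b>!</p>"

soup = BeautifulSoup(raw, "html.parser")

# get_text() concatenates every text node and drops the tags themselves.
text = soup.get_text()
print(text)
```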
Step #5: Next we tokenize the text into words.
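The sketch below illustrates tokenization on a hypothetical sample sentence. `nltk.word_tokenize` is commonly used for this step, but it needs the Punkt tokenizer models to be downloaded first; to stay runnable without extra downloads, this sketch swaps in the regular-expression-based `wordpunct_tokenize`, which splits punctuation slightly differently.

```python
from nltk.tokenize import wordpunct_tokenize

# Hypothetical stand-in for the text obtained in the previous step.
text = "Hello, world! This is a sample."

# Splits the text into runs of word characters and runs of punctuation.
tokens = wordpunct_tokenize(text)
print(tokens)
```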
This is done as preprocessing for the next step, where we will obtain the final text.
Step #6: Finally, we obtain our final text.
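How the final text is built is not spelled out here. One simple possibility, shown below as an assumption, is to join the tokens back into a single space-separated string; wrapping the tokens in `nltk.Text(tokens)` would be another option.

```python
# Hypothetical tokens, standing in for the output of the previous step.
tokens = ["Hello", ",", "world", "!"]

# Join the tokens with single spaces to form one normalized string.
final_text = " ".join(tokens)
print(final_text)  # Hello , world !
```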
Below is the complete code:
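The steps can be combined as in the following sketch. It is not the article's original listing: the function name `extract_text`, the `"html.parser"` choice, the use of `wordpunct_tokenize` in place of `nltk.word_tokenize`, and the space-joined final text are all assumptions. The `data:` URL is an offline stand-in for the real text file.

```python
from urllib.parse import quote
from urllib.request import urlopen

from bs4 import BeautifulSoup
from nltk.tokenize import wordpunct_tokenize


def extract_text(url, encoding="utf-8"):
    """Fetch `url`, strip markup, tokenize, and return (tokens, final_text)."""
    # Step 2: download the unprocessed contents.
    with urlopen(url) as response:
        raw = response.read().decode(encoding)
    # Steps 3 and 4: parse the data and keep only the text.
    text = BeautifulSoup(raw, "html.parser").get_text()
    # Step 5: split the text into word and punctuation tokens.
    tokens = wordpunct_tokenize(text)
    # Step 6: join the tokens into the final text (an assumed definition).
    return tokens, " ".join(tokens)


# Offline stand-in for the online text file; replace with the real URL.
sample_url = "data:text/plain;charset=utf-8," + quote("<p>Hello, <b>world</b>!</p>")

tokens, final_text = extract_text(sample_url)
print(tokens)
print(final_text)
```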