In this blog, we will learn about web scraping and how to implement it in the Python programming language.
Readers may want to go through our previous blogs on NLP to understand the concepts better.
Web scraping is used whenever we want to extract large amounts of information from websites without manually visiting each page to copy the data.
Automating the process makes this task easier and faster.
Applications of Web Scraping
Web scraping can be used for a number of reasons. Here are some common reasons for collecting such large amounts of data from websites:
- Some companies use email as a marketing channel, so they scrape websites to collect email addresses and then send emails in bulk.
- Social media websites such as Twitter are sometimes scraped to collect data and find out what is trending.
- Reviews are gathered from different review and forum websites so that sentiment analysis can be performed on them.
- Web scraping is also done to gather data for training and testing machine learning models.
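However, some websites restrict web scraping. To find out what a website permits, all we need to do is look at its ‘robots.txt’ file, which lists the paths that automated clients are asked not to access.
This file sits at the root of the site, so we just need to append /robots.txt to the root URL of the website that we want to scrape (for example, http://php.net/robots.txt).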
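The check described above can also be automated. The following is a minimal sketch using Python's built-in urllib.robotparser module. The rules in SAMPLE_ROBOTS_TXT, the example.com URLs, and the helper name is_allowed are made up for illustration, and the rules are inlined so that the sketch runs without a network connection.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data.html"))
```

For a live website, the same parser can download the file itself: call set_url() with the site's robots.txt address and then read() before calling can_fetch().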
How Web Scraping Works
First, we choose the URL that we want to scrape. When the code runs, it sends a request to that URL. The server sends the data back as a response, which allows us to read the HTML/XML page.
The code then parses the page, finds the data and extracts it.
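To make the parse and extract steps concrete, here is a small sketch that works on an inlined HTML string instead of a live response. It uses Python's built-in html.parser module rather than Beautiful Soup, purely so that it runs without any extra installs. The names TextExtractor, extract_text and SAMPLE_HTML are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped tag.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html: str) -> str:
    """Parse an HTML string and return its visible text as one line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Stand-in for the HTML that a server would return in its response.
SAMPLE_HTML = """
<html><head><title>Demo page</title><style>p { color: red; }</style></head>
<body><h1>Hello</h1><p>Web scraping <b>extracts</b> data.</p>
<script>var x = 1;</script></body></html>
"""

print(extract_text(SAMPLE_HTML))
```

In a real scraper, the HTML string would come from the server's response rather than from a constant.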
In this blog, we will use urllib and BeautifulSoup to extract the text from a web page and find the frequency of each word in it. We will then remove the stopwords and plot the word frequencies on a graph.
Beautiful Soup is a Python library for pulling data out of HTML and XML files.
Let us see how to do web scraping using Python:
We’ll begin by importing all the necessary libraries, fetching the page and parsing it:
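    from bs4 import BeautifulSoup
    import urllib.request
    import nltk

    # Request the page and read the raw HTML returned by the server
    response = urllib.request.urlopen('http://php.net/')
    html = response.read()

    # Parse the HTML (the "html5lib" parser requires the html5lib package)
    soup = BeautifulSoup(html, "html5lib")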
The urllib.request module is used to open URLs. The Beautiful Soup package is used to extract data from HTML files.
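    # Extract the visible text; the separator keeps words from adjacent tags apart
    text = soup.get_text(separator=' ', strip=True)
    tokens = text.split()

    # Count how often each token occurs and print the result
    freq = nltk.FreqDist(tokens)
    for key, val in freq.items():
        print(f'{key}:{val}')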
‘soup.get_text’ is used to get the text of the webpage and ‘nltk.FreqDist’ is used to get the frequency of each vocabulary item in the text.
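Splitting on whitespace treats "PHP", "php" and "PHP." as three different words. As a hedged illustration of one way to tidy this up, the sketch below lowercases the text and keeps only runs of letters, digits and apostrophes before counting. It uses collections.Counter from the standard library (the class that nltk.FreqDist is built on) so that it runs without NLTK. The function name word_frequencies and the sample sentence are made up for this example.

```python
import re
from collections import Counter


def word_frequencies(text: str) -> Counter:
    """Count lowercase words, ignoring punctuation and letter case."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return Counter(words)


sample_text = "PHP is popular. PHP is fast, and PHP is everywhere!"
sample_freq = word_frequencies(sample_text)

# Print the three most frequent words with their counts
for word, count in sample_freq.most_common(3):
    print(f"{word}: {count}")
```

Whether this kind of normalization is appropriate depends on the analysis, since case and punctuation sometimes carry meaning.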
Plotting the frequency graph
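The plotting code itself is not shown here, so the following is only a sketch of one possible way to draw such a graph, assuming matplotlib is installed. The helper names top_words and plot_top_words are hypothetical. NLTK's FreqDist also provides its own plot method, for example freq.plot(20, cumulative=False), which relies on matplotlib as well.

```python
from collections import Counter


def top_words(tokens, n=20):
    """Return the n most frequent tokens as (word, count) pairs."""
    return Counter(tokens).most_common(n)


def plot_top_words(tokens, n=20):
    """Draw a bar chart of the n most frequent tokens (assumes matplotlib)."""
    import matplotlib.pyplot as plt  # imported here so top_words works without it

    pairs = top_words(tokens, n)
    words = [word for word, _ in pairs]
    counts = [count for _, count in pairs]

    plt.figure(figsize=(10, 4))
    plt.bar(words, counts)
    plt.xticks(rotation=90)
    plt.xlabel("Word")
    plt.ylabel("Frequency")
    plt.tight_layout()
    plt.show()
```

Calling plot_top_words(tokens) with the token list built earlier would show the 20 most frequent words. Limiting the chart to the top few words keeps the axis labels readable.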
In the resulting graph, each word is plotted against its frequency.
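Now that we have fetched all the words from the website, our objective is to remove the stopwords from them. Stopwords are very common words such as "the", "is" and "and" that carry little meaning on their own. We can remove them by using the ‘stopwords’ corpus from nltk.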
    # Download the stopwords corpus if required
    nltk.download('stopwords')

    # Import the stopwords corpus
    from nltk.corpus import stopwords
We will remove the stopwords and print all the words with their respective frequency.
    text = soup.get_text(separator=' ', strip=True)
    tokens = text.split()

    # Build the stopword set once; membership tests on a set are fast
    sr = set(stopwords.words('english'))

    # Keep only the tokens that are not stopwords (NLTK's list is lowercase)
    clean_tokens = [t for t in tokens if t.lower() not in sr]

    freq = nltk.FreqDist(clean_tokens)
    for key, val in freq.items():
        print(f'{key}:{val}')
Plotting the frequency graph after removing the stopwords
This is how we do web scraping using BeautifulSoup. I hope you liked this blog. If you have any queries or suggestions, do leave us a comment.
Keep visiting our website for more blogs on Data Science and Data Analytics.