How to create a web scraping package to extract hyperlinks in 10 minutes using Python?

Modern web pages are full of hyperlinks. A hyperlink connects one web page to another; it usually appears as underlined text that viewers can click to be redirected to the linked page. This article introduces a custom web scraper module that extracts the hyperlinks present in a web page.

Contents

  1. Introduction to Web Scraping
  2. Creating a custom python file (py)
  3. Running custom python (py) file
  4. Summary

Introduction to web scraping

Web scraping is the process of lawfully collecting data or information from the web in a required format. Python offers extensive support for it through powerful and efficient modules and libraries.

There are various web scraping packages in Python; Selenium, urllib and BeautifulSoup (bs4) are a few of the popular ones. In this article, a custom Python file is implemented using the built-in methods of BeautifulSoup to extract the hyperlinks present in a single web page.

Any Python package implemented for web-based data collection must stay within legal data collection practices and should request data only from web pages that allow it.
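One common courtesy check, which is a convention rather than a legal guarantee, is to consult a site's robots.txt rules before requesting a page. The sketch below uses the standard library's urllib.robotparser. The robots.txt content, the example.com URLs and the helper name is_allowed are hypothetical, and the rules are inlined so the example runs offline.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, url, user_agent="*"):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/index.html"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data.html"))
```

For a live site, the same parser can read the real file through its set_url() and read() methods instead of parse().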

Creating a custom python file (py)

A custom Python file can easily be created in Google Colab or in Jupyter. Since Colab is a cloud-based working environment, we start with an ipynb file first.

The first cells of the ipynb file should contain the import statements for the libraries the task requires. In this article, the custom web scraper is built using Beautiful Soup, and the required imports are shown below.

from bs4 import BeautifulSoup
import re
import requests

After the required libraries are imported, a user-defined function is created that sends a request for the web page and stores the response in a variable. Only the text of that response is then returned. The user-defined function is shown below.

def original_htmldoc(url):
    response = requests.get(url)  # requests.get() sends an HTTP GET request to the url
    return response.text  # the text attribute holds the response body decoded as a string
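The function above assumes the request always succeeds. A minimal sketch of a more defensive variant follows; the helper name fetch_html_or_none and the 10 second timeout are assumptions made for illustration, not part of the original script.

```python
import requests


def fetch_html_or_none(url, timeout=10):
    """Return the page text, or None if the request fails (hypothetical helper)."""
    try:
        # The timeout stops the call from waiting forever on an unresponsive server.
        response = requests.get(url, timeout=timeout)
        # raise_for_status() turns 4xx and 5xx status codes into exceptions.
        response.raise_for_status()
    except requests.exceptions.RequestException as error:
        print(f"Could not fetch {url!r}: {error}")
        return None
    return response.text
```

Returning None keeps the calling code simple: it can check the result once instead of wrapping every call in its own try block.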

If needed, custom print and input statements can be added. The custom print statement used in the Python web scraping file is shown below.

print('Enter a url to scrape for links present in it')

A custom input is also declared, which lets the user enter the link of the web page they want to scrape using the input() function, as shown below.

url_to_scrape = input('Enter a website link to extract links: ')
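Since the user can type anything at the prompt, it may help to check the text before requesting it. The sketch below is one possible check using the standard library's urllib.parse; the helper name looks_like_http_url and the sample inputs are hypothetical.

```python
from urllib.parse import urlparse


def looks_like_http_url(text):
    """Return True if text has an http or https scheme and a host part."""
    parsed = urlparse(text.strip())
    return parsed.scheme in ("http", "https") and bool(parsed.netloc)


for candidate in ["https://example.com/page", "example.com", "ftp://example.com/file"]:
    print(candidate, "->", looks_like_http_url(candidate))
```

This only checks the shape of the text; it does not confirm that the page exists or can be reached.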

The web page entered by the user is now passed to the user-defined function given above, which requests the page, and the returned HTML text is stored in a variable as shown below.

html_doc = original_htmldoc(url_to_scrape)

Now Beautiful Soup is used with Python's built-in html.parser to parse the web page so that the hyperlinks in it can be identified, as shown below.

soup = BeautifulSoup(html_doc, 'html.parser')  # html.parser builds a searchable tree from the HTML text

Now the parsed content of the web page is searched with BeautifulSoup's find_all() method for the hyperlinks in the page entered by the user, and each link's target is read with the get() method. The corresponding code is shown below.

for link in soup.find_all('a', attrs={'href': re.compile("https://")}):  # find_all returns a list of the <a> tags whose href contains "https://"
    print(link.get('href'))
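To see what this loop selects without contacting a real site, the same find_all() call can be tried on a small inline document. The HTML snippet, the example.com and example.org URLs and the helper name extract_absolute_links are made up for illustration. The pattern here is anchored with ^ and also accepts http://, which is a slight variation on the loop above.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical page content used only for this demonstration.
SAMPLE_HTML = """
<html><body>
  <a href="https://example.com/docs">Docs</a>
  <a href="/about">About</a>
  <a href="http://example.org/">Plain HTTP</a>
  <a name="anchor-without-href">No link</a>
</body></html>
"""


def extract_absolute_links(html_doc):
    """Return href values that start with http:// or https://."""
    soup = BeautifulSoup(html_doc, "html.parser")
    pattern = re.compile(r"^https?://")
    # Tags without an href attribute, or with a relative href, do not match.
    return [tag.get("href") for tag in soup.find_all("a", attrs={"href": pattern})]


print(extract_absolute_links(SAMPLE_HTML))
# prints ['https://example.com/docs', 'http://example.org/']
```

Collecting the links in a list, rather than printing them inside the loop, makes the result easier to reuse or test.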

The link entered when running the python file in the custom input function is shown below.

The generated output for the above mentioned link is shown below.

The generated output lists the hyperlinks present in the web page entered by the user. This Python (py) file can therefore be used as a module or as an executable script in other environments. Using the Python (py) file in another working instance is explained below.
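Because the statements shown so far sit at the top level of the file, importing it as a module would trigger the prompt immediately. One possible arrangement, sketched below, keeps the logic in functions and runs it only when the file is executed directly. Taking the URL from the command line instead of input(), the function names extract_links and main, and the 10 second timeout are assumptions of this sketch, not the original file's layout.

```python
import re
import sys

import requests
from bs4 import BeautifulSoup


def extract_links(html_doc):
    """Return the href values that contain "https://", as in the loop above."""
    soup = BeautifulSoup(html_doc, "html.parser")
    tags = soup.find_all("a", attrs={"href": re.compile("https://")})
    return [tag.get("href") for tag in tags]


def main(argv):
    """Print the links found at the URL given as the first argument."""
    if len(argv) != 2:
        print("usage: python link_extractor_py.py <url>")
        return 1
    response = requests.get(argv[1], timeout=10)
    for href in extract_links(response.text):
        print(href)
    return 0


# This block runs only when the file is executed, not when it is imported.
if __name__ == "__main__":
    main(sys.argv)
```

With this layout, another notebook or script can import extract_links and call it on HTML it already has, without being prompted for input.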

Running custom python (py) file

As mentioned before, the custom Python (py) file can now be run in a different working instance. For this article, the file was downloaded as a py file and uploaded to a working directory using Google Cloud Platform. The Python file appears in the working directory as shown below.

Once the custom Python file is available, an ipynb file is opened in the same working directory. First, the drive is mounted in the working environment so that the directory containing the Python (py) file can be reached, as shown below.

from google.colab import drive
drive.mount('/content/drive')

If the drive is mounted successfully, an output like the one shown below is produced.

Now the command-line utilities are used as shown below to change into the directory containing the Python (py) file.

!ln -s /content/drive/MyDrive /mydrive
%cd /content/drive/MyDrive/Colab notebooks/Web_Scrapping
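Before running the script, it can be worth confirming that the current directory really contains it. The sketch below does this with pathlib; the helper name find_script is hypothetical, and the default file name is the one used when the script is run later in this article.

```python
from pathlib import Path


def find_script(directory, script_name="link_extractor_py.py"):
    """Return the script's path if it exists in directory, otherwise None."""
    candidate = Path(directory) / script_name
    return candidate if candidate.is_file() else None


# Path.cwd() is the directory the notebook is currently working in.
print(find_script(Path.cwd()))
```

A result of None suggests that the directory change did not land in the expected folder or that the file was uploaded somewhere else.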

If the commands above run correctly, the output confirms that the working directory is now the one containing the Python (py) file, as shown below.

Once the working directory has been changed successfully, the Python file can be run as shown below to get the hyperlinks in any web page the user requires.

!python link_extractor_py.py

When the above command is executed in a notebook cell, it prompts the user for the web page whose hyperlinks should be extracted, as shown below.

The user now enters a web page link in the blank field, and the script prints the hyperlinks present in that web page according to the logic in the Python (py) file. Some of the hyperlinks identified are shown below.
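As written, the script prints only href values that contain "https://", so relative links such as /about are skipped. If those are wanted as well, one option is to resolve them against the page's own URL with the standard library's urljoin. The helper name resolve_links and the example.com and example.org URLs below are hypothetical.

```python
from urllib.parse import urljoin


def resolve_links(base_url, hrefs):
    """Turn relative href values into absolute URLs using the page's own URL."""
    return [urljoin(base_url, href) for href in hrefs]


base = "https://example.com/blog/post.html"
hrefs = ["/about", "contact.html", "https://example.org/"]
print(resolve_links(base, hrefs))
# prints ['https://example.com/about', 'https://example.com/blog/contact.html', 'https://example.org/']
```

Links that are already absolute pass through unchanged, so the same helper can be applied to every href on the page.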

Summary

This article showed how to create a custom Python (py) file using standard web scraping packages and then run it in different instances or working environments, giving the user the ability to view the hyperlinks present in a single web page and open them with a single click to get the necessary information.
