Technoarch Softwares - Web Scraping in Python

Web Scraping in Python

Imagine that you need a large amount of data and you want it as quickly as possible. How would you go about gathering it? Now add that the data is spread across multiple websites and that visiting each one manually is not an option.

Well, “Web Scraping” is the answer. Web scraping, or data extraction, is a technique for extracting useful data from websites and saving it to a local file or in a table format (such as a spreadsheet) for later analysis. It helps collect unstructured data and store it in a structured form. In this article, we’ll go through the implementation of web scraping with Python.
 


 

Now, what bothers most people about web scraping is whether scraping data from other websites is even legal. That brings us to the next question.
 

Is Web Scraping legal?

In many cases, scraping publicly available data is legal, but this depends on the jurisdiction, the website’s terms, and how the data is used. Some websites allow web scraping and some don’t. To see which parts of a website automated clients are allowed to access, look at the website’s “robots.txt” file.
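
As a minimal sketch, the robots.txt check can be automated with Python's standard `urllib.robotparser` module. The robots.txt content and the `example.com` URLs below are made-up placeholders so that the example runs offline, and `is_allowed` is a hypothetical helper name:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used here so the example runs offline.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/search?q=phone"))  # True
print(is_allowed(SAMPLE_ROBOTS_TXT, "https://example.com/private/data"))    # False
```

For a live site, `RobotFileParser` can download the file itself: call `set_url()` with the address of the site's robots.txt, then `read()`, and then `can_fetch()` as above.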

Before moving to the implementation, let’s understand why we will be using Python for web scraping.
 

Python for Web Scraping

When it comes to choosing a programming language for web scraping, Python is one of the easiest to work with. Here is why:

  • Large Collection Of Libraries :
    Python has a very large collection of libraries that provide methods and services for various purposes, for example NumPy, Matplotlib and Pandas.

  • Small code, large functionality :
    Web scraping is meant to save time, but what’s the use if you spend more time writing complex code? Well, you don’t have to. In Python, a short and simple script can do a great job with the task, so you save time even while writing the code.

  • Dynamically Typed Language :
    Python is a dynamically typed language, which means you do not have to declare a data type for a variable. This makes the job easier and faster.

  • Easy to code :
    Python is easy to write, and the basics can be learned in a few hours or days, which makes it a developer-friendly language.
     

How does Web Scraping work?

Before going straight to the implementation, let’s take a look at a few of the Python libraries that can be used to scrape data from the web -

  • BeautifulSoup :
    Beautiful Soup is a Python library for parsing HTML and XML documents. It helps pull out data by navigating the HTML structure of the page.

  • Selenium :
    Selenium is one of the most popular automation testing tools for web applications. Because it drives a real browser, it can also be used to scrape pages that render their content with JavaScript.

  • Pandas :
    Pandas is a library used for data manipulation and analysis. It lets us store the extracted data in a desired format.
     

Let’s scrape some data

For a demo, we are going to scrape the Flipkart website for mobile phones, along with their names, prices and ratings.
 

Step 1: Look for the URL to scrape

The URL for this demo -

 https://www.flipkart.com/search?q=phone&otracker=search&otracker1=search&marketplace=FLIPKART&as-show=on&as=off
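
To see what this URL actually asks for, it can be split into its parts with the standard library. This is only a small sketch. Reading `q` as the search term is an assumption based on the URL itself, not on any Flipkart documentation:

```python
from urllib.parse import urlsplit, parse_qs

URL = ("https://www.flipkart.com/search?q=phone&otracker=search"
       "&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")

parts = urlsplit(URL)
params = parse_qs(parts.query)  # maps each query parameter to a list of values

print(parts.netloc)  # www.flipkart.com
print(parts.path)    # /search
print(params["q"])   # ['phone']
```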
 

Step 2: Inspect the page

Inspecting the page is required to find the HTML tags that hold the information we need. Use the browser’s developer tools (right-click on the page, then Inspect) to find the classes of the tags that contain the name, price and rating of each phone.
 

 

Step 3: Writing Code

First, import all the necessary libraries: requests, BeautifulSoup and Pandas.
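
A minimal set of imports for this demo might look like the following. It assumes the `requests`, `beautifulsoup4` and `pandas` packages are already installed:

```python
# Assumed to be installed beforehand, e.g.:
#   pip install requests beautifulsoup4 pandas
import requests                # sends the HTTP request for the page
from bs4 import BeautifulSoup  # parses the returned HTML
import pandas as pd            # stores the extracted data in a table
```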


 

Here, `requests` lets us request the web page at the URL mentioned above.

Next, we request the web page and keep the HTML it returns.
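
One way to request the page is sketched below. The `fetch_page` helper, the `User-Agent` value and the 10-second timeout are assumptions made for illustration. Some sites may still refuse automated requests or deliver their content through JavaScript, in which case this simple approach will not return the product data:

```python
import requests

URL = ("https://www.flipkart.com/search?q=phone&otracker=search"
       "&otracker1=search&marketplace=FLIPKART&as-show=on&as=off")

# Assumption: a browser-like User-Agent header; some sites reject the default one.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; scraping-demo)"}


def fetch_page(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML as text."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raise an exception on 4xx/5xx responses
    return response.text
```

Calling `html = fetch_page(URL)` then gives the raw HTML as a string, ready to be parsed in the next step.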
 

 

It’s time to extract the data from the requested page. Using BeautifulSoup, we parse the HTML document and find all the anchor tag `a` containers that belong to a specific `class`.
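
The parsing step can be sketched as follows. To keep the example runnable offline, it parses a small placeholder HTML string instead of the real page. The class names (`product-card`, `product-name` and so on) and the product values are hypothetical; the real ones have to be read from the inspected page:

```python
from bs4 import BeautifulSoup

# Placeholder HTML standing in for the downloaded page.
# All class names and values here are hypothetical.
SAMPLE_HTML = """
<html><body>
  <a class="product-card" href="/p/1">
    <div class="product-name">Phone A</div>
    <div class="product-price">9,999</div>
    <div class="product-rating">4.3</div>
  </a>
  <a class="product-card" href="/p/2">
    <div class="product-name">Phone B</div>
    <div class="product-price">14,499</div>
    <div class="product-rating">4.1</div>
  </a>
  <a class="footer-link" href="/about">About</a>
</body></html>
"""

# "html.parser" is Python's built-in parser, so no extra package is needed.
soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

# Keep only the anchor tags that carry the product class.
containers = soup.find_all("a", class_="product-card")
print(len(containers))  # 2
```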

 


 

The data we want to extract is nested in div tags, so by looping over the `containers` variable from the previous step, we extract the text values by the respective class names of the divs.
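
A sketch of that loop is shown below, again on placeholder HTML with hypothetical class names and made-up values. The `text_or_none` helper is an addition for illustration: it stores `None` when a product has no matching div (here, a missing rating), so the three lists always stay the same length:

```python
from bs4 import BeautifulSoup

# Placeholder HTML; class names and values are hypothetical.
SAMPLE_HTML = """
<a class="product-card">
  <div class="product-name">Phone A</div>
  <div class="product-price">9,999</div>
  <div class="product-rating">4.3</div>
</a>
<a class="product-card">
  <div class="product-name">Phone B</div>
  <div class="product-price">14,499</div>
</a>
"""


def text_or_none(container, class_name):
    """Return the stripped text of the first matching div, or None if absent."""
    tag = container.find("div", class_=class_name)
    return tag.get_text(strip=True) if tag else None


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
containers = soup.find_all("a", class_="product-card")

names, prices, ratings = [], [], []
for container in containers:
    names.append(text_or_none(container, "product-name"))
    prices.append(text_or_none(container, "product-price"))
    ratings.append(text_or_none(container, "product-rating"))

print(names)    # ['Phone A', 'Phone B']
print(prices)   # ['9,999', '14,499']
print(ratings)  # ['4.3', None]
```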
 

 

We now have the required text stored in the lists `names`, `prices` and `ratings`. But this is still not structured data and not convenient to use. So now we will store it in a structured format by using a Pandas DataFrame.
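
As a minimal sketch, the three lists can be combined into a DataFrame and written to a CSV file as follows. The list contents are made-up placeholder values, and the column names and the `products.csv` file name are assumptions:

```python
import pandas as pd

# Placeholder values standing in for the scraped lists.
names = ["Phone A", "Phone B"]
prices = ["9,999", "14,499"]
ratings = ["4.3", None]

# Each list becomes one column; the lists must have the same length.
df = pd.DataFrame({"Name": names, "Price": prices, "Rating": ratings})

# index=False leaves the row numbers out of the file.
df.to_csv("products.csv", index=False, encoding="utf-8")
print(df)
```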

 


 

Great, we have managed to extract the required products into a CSV file.
 


 

Conclusion

That was fun, right? We have reached the end of this article. We used a very basic example of scraping data from a website into a CSV file, which companies often need and which an individual can analyse to find the best product in the market.

You can do many things with web scraping, such as collecting data for testing and machine learning or gathering information for a particular research topic. Most importantly, there is plenty of content on the web that can help you master web scraping.

Hope you enjoyed the article. Feel free to comment below with any query or suggestion.
