Web Scraping using Python to create a data set.

In the data science life cycle, data mining, or data collection, is a fundamental phase. Depending on business requirements, data can be gathered from SAP servers, logs, databases, APIs, online archives, or the web. Selenium, a web automation tool often used for scraping, can collect a huge volume of data, such as text and images, in a short period of time.
Web scraping a web page involves fetching it and extracting data from it. Fetching is the downloading of a page (which a browser does when a user views a page). Web crawling is therefore a main component of web scraping: fetching pages for later processing.
Once fetched, extraction can take place. The content of a page may be parsed, searched, or reformatted, and its data copied into a spreadsheet or loaded into a database. Web scrapers typically take something out of a page to make use of it for another purpose somewhere else. An example would be finding and copying names and telephone numbers, companies and their URLs, or e-mail addresses to a list (contact scraping).
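A minimal sketch of this fetch-then-extract pattern (contact scraping), assuming https://example.com stands in for a page you are permitted to scrape:

# Minimal sketch: fetch a page, then extract e-mail addresses from it
import re
import requests

html = requests.get("https://example.com").text          # fetch the page
emails = re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", html)   # extract e-mail addresses
print(emails)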




Web scraping, also known as "crawling" or "spidering," is a technique for collecting data from an online source, most commonly a website. Although web scraping is a fast way to get a large amount of data in a short amount of time, it puts a huge strain on the web server hosting the source. This is one of the main reasons why many websites do not allow full-scale scraping.

To understand consumer purchasing habits, employee turnover behavior, customer sentiment, and so on, data collected from websites such as e-commerce portals, job portals, and social media platforms can be used. BeautifulSoup, Scrapy, and Selenium are the most common libraries and frameworks for web scraping in Python.

In this post, we'll look at how to use Selenium in Python to scrape the web. Finally, we'll look at how we can gather images from the web to create training data.

What is Selenium?

Selenium is an open-source, web-based automation tool. It is primarily used for testing in the industry, but it can also be used for web scraping. We'll use the Chrome browser here, but you can try any browser; the steps are almost the same.

Setup & tools:-

  1. Installation:
    • Install selenium using pip
      pip install selenium
    • Install selenium using conda
      conda install -c conda-forge selenium
  2. Download the ChromeDriver:
    • Download the ChromeDriver that matches your Chrome browser version.
    • Place the downloaded file in the Python root folder on the C: drive (the folder is named like Python36 for Python 3.6); a sketch of pointing Selenium at it follows this list.
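A minimal sketch of launching Chrome with that manually downloaded driver (the C:/Python36 path is an assumption; adjust it to your own folder):

# Minimal sketch: point Selenium 3 at the manually downloaded chromedriver
from selenium import webdriver
driver = webdriver.Chrome(executable_path='C:/Python36/chromedriver.exe')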


The following methods help us find elements on a web page (each returns a list); a short usage sketch follows the list:

find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector
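For instance, assuming driver is an already-created WebDriver instance, we could collect every link on the current page with the tag-name locator:

# Hypothetical usage: print the text and href of every link on the page
links = driver.find_elements_by_tag_name('a')
for link in links:
    print(link.text, link.get_attribute('href'))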

Now let's write some Python code to scrape images from the web.

Implementation of Image Web Scraping using Selenium in Python: –

Import libraries

import os
import io
import time
import requests
import selenium
from selenium import webdriver
from PIL import Image
from webdriver_manager.chrome import ChromeDriverManager
from selenium.common.exceptions import ElementClickInterceptedException, ElementNotInteractableException
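The code below calls driver.get(...) before any driver exists, so we first create one. A minimal sketch using webdriver-manager, which downloads a matching ChromeDriver automatically:

# Create the Chrome WebDriver; webdriver-manager fetches a matching chromedriver
driver = webdriver.Chrome(ChromeDriverManager().install())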

Specify search URL

Here we search Google for planes, to create a plane-only image data set.
#Specify search URL
search_url = "https://www.google.com/search?q={q}&tbm=isch&tbs=sur%3Afc&hl=en&ved=0CAIQpwVqFwoTCKCa1c6s4-oCFQAAAAAdAAAAABAC&biw=1251&bih=568"
# Assign the search word to the placeholder q
driver.get(search_url.format(q='Plane'))
Here we search for 'Plane' via our search URL. Paste the link into the driver.get(" Your Link Here ") function and run the cell. This will open a new browser window for that link.

[Screenshot: Google search of Planes]



Scroll to the end of the page

#Scroll to the end of the page
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
time.sleep(5)  # sleep between interactions


This line of code scrolls to the bottom of the page. We then sleep for 5 seconds so we don't run into a problem where we try to read elements that have not yet loaded. Because Google Images keeps loading more thumbnails as you scroll, a single scroll may not reach the true end of the results; see the sketch below.
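A sketch of repeated scrolling (the helper name scroll_to_end and the counts are illustrative assumptions):

def scroll_to_end(driver, pause_time=5, num_scrolls=5):
    # Scroll several times so lazily loaded thumbnails have time to appear
    for _ in range(num_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause_time)

scroll_to_end(driver)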

We need to locate the image elements using the browser's inspect tool (right-click an image and choose Inspect) to learn their class name. The class name is common to all images of the same type.

Locate the images to be scraped from the page

Here we locate the images by class name; the class name is the same for all plane thumbnails. Because each thumbnail carries two classes (rg_i and Q4LuWd), we use a CSS selector rather than find_elements_by_class_name, which does not accept compound class names. (Note that Google's class names are auto-generated and may change over time.)

#Locate the images to be scraped from the current page
imgResults = driver.find_elements_by_css_selector("img.rg_i.Q4LuWd")
totalResults = len(imgResults)
At this point, the photos on the page are still thumbnails, not the originals. To download each image, we first click its thumbnail and then extract the relevant information for the full-size image.
So, in the code below, we're performing the following tasks:

  1. Iterate through each thumbnail and click it.
  2. Make the browser sleep for 2 seconds, so the interaction looks human to the web server.
  3. Find the unique HTML tag corresponding to the full-size image to locate it on the page.
  4. We may still get more than one result for a particular image, but all we're interested in is the link for downloading it.
  5. So, we iterate through each result, extract its 'src' attribute, and check whether 'https' is present in it, since web links typically start with 'https'.
#Click on each image to extract its corresponding link to download

img_urls = set()
for i in range(0, len(imgResults)):
    img = imgResults[i]
    try:
        img.click()
        time.sleep(2)
        actual_images = driver.find_elements_by_css_selector('img.n3VNCb')
        for actual_image in actual_images:
            if actual_image.get_attribute('src') and 'https' in actual_image.get_attribute('src'):
                img_urls.add(actual_image.get_attribute('src'))
    except (ElementClickInterceptedException, ElementNotInteractableException) as err:
        print(err)
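After the loop finishes, you can check how many unique image links were collected:

print(f"Collected {len(img_urls)} image URLs out of {totalResults} thumbnails")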

Download & save each image in the destination directory

os.chdir('C:/WebScrapping/Dataset1')
baseDir = os.getcwd()

for i, url in enumerate(img_urls):
    file_name = f"{i}.jpg"
    try:
        image_content = requests.get(url).content
    except Exception as e:
        print(f"ERROR - COULD NOT DOWNLOAD {url} - {e}")
        continue

    try:
        # Decode the downloaded bytes and re-encode as JPEG
        image_file = io.BytesIO(image_content)
        image = Image.open(image_file).convert('RGB')

        file_path = os.path.join(baseDir, file_name)

        with open(file_path, 'wb') as f:
            image.save(f, "JPEG", quality=85)
        print(f"SAVED - {url} - AT: {file_path}")
    except Exception as e:
        print(f"ERROR - COULD NOT SAVE {url} - {e}")
Now you have finally extracted the images for your data set.
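Once the downloads are finished, it is good practice to close the browser session:

# Close the browser window and end the WebDriver session
driver.quit()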

Note: Before collecting data from a website, check whether the site allows web scraping (for example, in its robots.txt file and terms of service); otherwise, the website owner may take legal action.

