Pages

Web Scraping - Collecting Data From a Webpage

Web pages have various HTML tags. So the data we want to collect can be contained in any of these tags. Because of this you will need to carefully examine the HTML structure of the webpages before develop codes.

Web browsers have special tool which helps us to look at the structure of the web pages very easily. For an example, think you need to collect names of the people which is shown in a webpages. In this example each name is shown in separate page. But structure of each page is identical. So what you need to do first is, take your cursor near the name. Then write click. You will see this kind of options.

Then click on the Inspect.

You will able to see the structure of the web-page. Assume the code relate to name is look like this.

<span itemprop="full name" class="full name block">James Smith</span>

So I will show you how to collect name from this kind of structure. Here we can extract name using class name as follows.

objIE.document.getElementsByClassName("name block")(0).innerText

objIE should be defined before like

Set objIE = CreateObject("InternetExplorer.Application")

If you are not familiar with those basic web scraping, you can learn them from my earlier post.

Web Scraping - Basics

So let's now look at how to extract data from hyperlinks. Assume telephone numbers are stored in a hyperlink like below.

<a class="click-link" data-gaaction="ourbusiness" data-gaevent="pho_number" href="tel:9128602884" itemprop="phone">912-860-2884</a>

We can use below type of VBA code to get the telephone number.

Dim telephone_number As String

Set Alllinks = objIE.document.getElementsByTagName("A")

For Each Hyperlink In Alllinks
    If InStr(1, Hyperlink.href, "tel:", vbTextCompare) > 0 Then
        telephone_number = Hyperlink.innerText
        Exit For
    End If
Next

Also we can use following technique to get text inside div tags.

Dim AllString As String
       
Set AllDiv = objIE.document.getElementsByTagName("div")
For Each DivTag In AllDiv
    AllString = DivTag.innerText
Next

Then we can use functions like InStr, Split etc.. to extract relavent information from that "AllString" variable.

Web Scraping Techniques

In this post we are going to discuss about few more techniques used in web scraping. If you are new to web scraping, please read my earlier posts from below.

Web Scraping - Basics

Useful References for VBA Web Scraping

In this post I'm going to explain about more advance techniques.

When we do the data mining, sometimes we need to go to certain websites, put some value in text box and click on "Search" button. Then website gives list of results. These results sometimes may in several pages. If you want to automate this kind of process, you can develop a code to go to that url, then put value to text box, and then write a code to click button and so and so. You can learn how to write this kind of code from my earlier post.

But sometimes this can be easier than you think. Because it may be possible to get result of each page by changing the url. For an example if the website gives results in several pages, url's of each page sometimes have part like "?page=1", "?page=2" etc. So you can develop a program to get data from each page directly. And if you are searching for some value in textbox, then that value also can contained in the url. So it is always better to carefully look at the url and try to find some patterns.

Then sometimes we need to open the each result in separate pages to get all the information of each and every result. Think website have hyperlink like "View" for each result. So we can collect url of each of this "View" link from following code.

Set objIE = CreateObject("InternetExplorer.Application")

url="url of particular result page"

objIE.Navigate (url)

Dim WS as worksheet

set WS=activesheet

'wait to load page...
Do
DoEvents
Loop Until objIE.readystate = 4

Set Alllinks = objIE.document.getElementsByTagName("A")
    For Each Hyperlink In Alllinks
        If InStr(Hyperlink.innerText, "View") > 0 Then
            WS.Cells(row, 1).Value = Hyperlink.href
            row = row + 1
        End If
    Next

So what it does is, it lists url of each result in a worksheet. Also you should have some idea about how many result pages the website will give. Then you can set suitable upper limit to the number of pages and should use some technique to exit the loop when there are no more result pages. Here below is example of such program.

Res = InputBox("Please enter number You want to search?", "")

Set objIE = CreateObject("InternetExplorer.Application")

objIE.Top = 0
objIE.Left = 0
objIE.Width = 1100
objIE.Height = 700

objIE.Visible = True

Dim i, row As Integer
Dim url, isdata As String

Dim WS As Worksheet

Set WS = Worksheets("DataBase")

i = 1
row = 1
For i = 1 To 100
    'genarate url...
    url = "your website url+ ?page=" + CStr(i) + "There may be some additional parts of url here" + CStr(Res)
    objIE.Navigate (url)
   
    'wait to load page...
    Do
    DoEvents
    Loop Until objIE.readystate = 4
   
    'collect links...
    isdata = "n"
   
    Set Alllinks = objIE.document.getElementsByTagName("A")
    For Each Hyperlink In Alllinks
        If InStr(Hyperlink.innerText, "View") > 0 Then
            WS.Cells(row, 1).Value = Hyperlink.href
            isdata = "y"
            row = row + 1
        End If
    Next


    If (isdata = "n") Then
        i = 101
    End If
Next i

objIE.Quit

Determine whether characters inside a cell are bold or regular

Sometimes we need to determine whether characters inside a cell are bold or regular. And some cells can contain both bold and regular characters. Here is an example.


So today we are going to develop a macro to determine whether each and every character inside a cell is regular or bold. What it does is, it outputs each letter in immediate window and tells whether that each letter is bold or regular. Here below is the full code.

Dim i As Integer
Dim Sentence_Length As Integer

Sentence_Length = Len(Range("B2").Value)

For i = 1 To Sentence_Length
    Debug.Print Range("B2").Characters(i, 1).Text
   
    If Range("B2").Characters(i, 1).Font.Bold = True Then
        Debug.Print "Bold"
    Else
        Debug.Print "Regular"
    End If
Next i

Below is the explanation of above code.

First we need to define our variables. Then below line of code measure the number of characters inside B2 cell.

Sentence_Length = Len(Range("B2").Value)

And this for next loop, loop through all the letters in B2 cell.

For i = 1 To Sentence_Length
   
Next i

Below line prints the each character in immediate window. Please note that this program will consider spaces also as characters.

Debug.Print Range("B2").Characters(i, 1).Text

This part of the code checks whether the character is bold or regular and print the result in immediate window.

If Range("B2").Characters(i, 1).Font.Bold = True Then
        Debug.Print "Bold"
Else
        Debug.Print "Regular"
End If

Here is an image of immediate window after running the macro.


So this is just a very simple example. You can alter this code to suit with your requirements.