In this post we are going to discuss a few more techniques used in web scraping. If you are new to web scraping, please read my earlier posts listed below.
Useful References for VBA Web Scraping
In this post I'm going to explain some more advanced techniques.
When we do data mining, sometimes we need to go to a certain website, enter a value in a text box, and click a "Search" button. The website then gives a list of results, and these results may span several pages. If you want to automate this kind of process, you can develop code that goes to that URL, enters the value in the text box, clicks the button, and so on. You can learn how to write this kind of code from my earlier post.
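As a quick refresher, the form-filling approach can be sketched roughly as below. The URL and the element ids "searchBox" and "searchButton" are hypothetical; you would find the real ones by inspecting the page's HTML.

Dim objIE As Object
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Visible = True
objIE.Navigate "https://example.com/search"   'hypothetical URL

'wait for the page to load...
Do
    DoEvents
Loop Until objIE.ReadyState = 4

'fill the text box and click the button (element ids are assumptions)...
objIE.document.getElementById("searchBox").Value = "12345"
objIE.document.getElementById("searchButton").Click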
But sometimes this can be easier than you think, because it may be possible to get the results of each page just by changing the URL. For example, if a website gives results in several pages, the URL of each page often contains a part like "?page=1", "?page=2", etc. So you can develop a program to get data from each page directly. And if you are searching for a value entered in a text box, that value may also be contained in the URL. So it is always better to look carefully at the URL and try to find patterns.
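To illustrate, assuming the site uses a "?page=" parameter and includes the search term in the URL (both assumptions; check the actual URLs of your target site), the URL for each page can be built like this:

Dim pageNo As Integer
Dim searchTerm As String, url As String

searchTerm = "12345"   'the value you would otherwise type into the text box
For pageNo = 1 To 5
    'hypothetical URL pattern; inspect the real site's URLs for the actual one...
    url = "https://example.com/results?search=" & searchTerm & "&page=" & CStr(pageNo)
    Debug.Print url
Next pageNo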
Then sometimes we need to open each result in a separate page to get all the information about it. Suppose the website has a hyperlink like "View" for each result. We can collect the URL of each of these "View" links with the following code.
Dim objIE As Object, Alllinks As Object, Hyperlink As Object
Dim url As String
Dim row As Integer
Dim WS As Worksheet

Set WS = ActiveSheet
row = 1

Set objIE = CreateObject("InternetExplorer.Application")
url = "url of particular result page"
objIE.Navigate (url)

'wait for the page to load...
Do
    DoEvents
Loop Until objIE.ReadyState = 4

Set Alllinks = objIE.document.getElementsByTagName("A")
For Each Hyperlink In Alllinks
    If InStr(Hyperlink.innerText, "View") > 0 Then
        WS.Cells(row, 1).Value = Hyperlink.href
        row = row + 1
    End If
Next
So what it does is list the URL of each result in a worksheet. You should also have some idea of how many result pages the website will give. Then you can set a suitable upper limit on the number of pages, and you should use some technique to exit the loop when there are no more result pages. Below is an example of such a program.
Res = InputBox("Please enter the number you want to search for", "")

Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 1100
objIE.Height = 700
objIE.Visible = True

Dim i As Integer, row As Integer
Dim url As String, isdata As String
Dim Alllinks As Object, Hyperlink As Object
Dim WS As Worksheet
Set WS = Worksheets("DataBase")
row = 1

For i = 1 To 100
    'generate url...
    url = "your website url" & "?page=" & CStr(i) & "There may be some additional parts of url here" & CStr(Res)
    objIE.Navigate (url)

    'wait for the page to load...
    Do
        DoEvents
    Loop Until objIE.ReadyState = 4

    'collect links...
    isdata = "n"
    Set Alllinks = objIE.document.getElementsByTagName("A")
    For Each Hyperlink In Alllinks
        If InStr(Hyperlink.innerText, "View") > 0 Then
            WS.Cells(row, 1).Value = Hyperlink.href
            isdata = "y"
            row = row + 1
        End If
    Next

    'no "View" links on this page means there are no more result pages...
    If isdata = "n" Then Exit For
Next i

objIE.Quit