Programmers use various languages to write programs that collect data from websites. Even programmers who use the same language extract the data with different techniques. If you are new to web scraping, a typical question you will have is which method to use when. getElementsByClassName and getElementById are two methods that often confuse novices, because many beginners are unsure why they should choose one over the other in a given situation.
So in this post I will show you how to use these two methods appropriately. The first thing to understand is that websites are not built with web scraping in mind. Web developers use various techniques to add functionality, make the website pleasing to the eye, improve its speed and keep it easy to change. So when we write a web scraping program, we have to take the particular features of that website into account in our code.
Now let’s consider the following HTML code. As you can see, there are two input tags of type "submit" in this code. Assume we want to click the first submit button.
If you examine the code carefully, you can see that both buttons belong to the same class, "a-button-input". However, each button also has its own unique id. Because the first button has a unique id, the easiest way to click it is the getElementById method. This is how you can do it.
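' Start Internet Explorer and position the window
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
' Load the page ("Url here" is a placeholder for the real address)
objIE.Navigate ("Url here")
' Wait until the page has finished loading (readyState 4 = complete)
Do
    DoEvents
Loop Until objIE.readyState = 4
' Click the button whose id is "button-search"
objIE.document.getElementById("button-search").Click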
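One thing worth knowing about getElementById is that it returns Nothing when no element on the page has the requested id, and calling .Click on Nothing raises a runtime error. The following is only a minimal sketch of a safer click. The procedure name ClickById and its parameters are my own and purely illustrative, and it assumes you pass in the document of a page that has already finished loading.

```vb
' Hypothetical helper (not part of the original example):
' clicks the element with the given id, if such an element exists.
Sub ClickById(doc As Object, elementId As String)
    Dim btn As Object

    ' getElementById returns Nothing when the id is not on the page
    Set btn = doc.getElementById(elementId)

    If btn Is Nothing Then
        Debug.Print "No element found with id: " & elementId
    Else
        btn.Click
    End If
End Sub
```

With the browser object from the example above, it could be called as ClickById objIE.document, "button-search".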
If you want to click the same button using the getElementsByClassName method, then you can do it as follows.
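' Start Internet Explorer and position the window
Set objIE = CreateObject("InternetExplorer.Application")
objIE.Top = 0
objIE.Left = 0
objIE.Width = 800
objIE.Height = 600
objIE.Visible = True
' Load the page ("Url here" is a placeholder for the real address)
objIE.Navigate ("Url here")
' Wait until the page has finished loading (readyState 4 = complete)
Do
    DoEvents
Loop Until objIE.readyState = 4
' Click the first element (index 0) that has the class "a-button-input"
objIE.document.getElementsByClassName("a-button-input")(0).Click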
Several elements can share the same class name, so getElementsByClassName returns a collection of elements and we have to pick the one we want by its index number. The index starts at 0, which is why the first button is (0) in the code above. You may not see a big difference between the two examples, as there are only two input tags in the HTML code. But real pages are rarely this simple. Some web pages have a large number of elements with the same class name, and then it is difficult to find the index number of the element we need. If that element has an id, we can use the getElementById method without thinking about the index number at all.
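When an element has no id and you do have to rely on the index, one way to find it is to list every element of that class together with its position. The following is a rough sketch of that idea rather than a definitive solution. The procedure name ListElementsByClass is hypothetical, and it assumes the page has finished loading before it is called.

```vb
' Hypothetical helper (not part of the original example):
' prints the index, tag name and id of every element with the given
' class, so that the index of the wanted element can be read off.
Sub ListElementsByClass(doc As Object, className As String)
    Dim elems As Object
    Dim i As Long

    Set elems = doc.getElementsByClassName(className)

    ' The collection is zero-based, so the last index is Length - 1
    For i = 0 To elems.Length - 1
        Debug.Print i, elems(i).tagName, elems(i).id
    Next i
End Sub
```

Calling ListElementsByClass objIE.document, "a-button-input" writes one line per matching element to the Immediate window (Ctrl+G in the VBA editor). The number in the first column is the index to put in the parentheses after getElementsByClassName.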