Scraping and Cleaning Your Data with Google Sheets: A Closer Look
Posted by tags. We also happen to notice that the tags are “children” of the
tags. What these values mean is a topic for another post, but what we need to keep in mind is that for this particular URL, this is where our desired information lives that we need to extract. It will almost certainly live in a different area with different modifiers on any other given URL; this is just the information that’s unique to the site we’re on.
Let’s get to the scary stuff (maybe?): how to write the formula.
I put the formula in cell A3, where I have the red arrow. As can be seen in the highlighted rectangle, I wrote =IMPORTXML(A1, “//h3//a[@target=’_blank’]”), which yielded a wonderful, organized list of all the top Twitter handles to follow from the page. Voila. Cool, right?
Something to remember when doing this is that the values have been created via a formula, so trying to copy and paste them regularly can get messy; you’ll need to copy and paste as values.
Now, let’s break down the madness.
Like any other function in Sheets, you’ll need to begin with an equal sign, so we start with =IMPORTXML. Next, we find the cell with our targeted URL (in this case, cell A1) and then add a comma. Double quotation marks are always required to begin the query, followed by two forward slashes (“//”). Next, you select the element you want to scrape (in this case, the h3 tag). We don’t want all of the information in the h3 elements, just a particular part of the tags—specifically, the “target” part where we find the Twitter handles. To capture this part, we add //a[@target=’_blank’], which specifies only the target=’_blank” part of the tag. Putting it all together, the formula =IMPORTXML(A1, “//h3//a[@target=’_blank’]”) can be translated as “From the URL within cell A1, select the data with an
No comments