2
\$\begingroup\$

I wanted to build a recursive web crawler in VBA. Since I don't have much VBA experience, it took me a while to work out the pattern, but I finally have one that works. It starts on the first page of a torrent site, extracts the movie names, then follows the site's "Next" page link and repeats until there are no more pages. Any suggestions for making it more robust would be a great help. Thanks in advance.

Here is what I've written:

Sub yify(dynamic_link As String)

    Application.ScreenUpdating = False
    Const main_link As String = "https://yts.ag"
    Dim http As New XMLHTTP60, html As New HTMLDocument
    Dim movie As Object, link As Object

    With http
        .Open "GET", dynamic_link, False
        .send
        html.body.innerHTML = .responseText
    End With
    For Each movie In html.getElementsByClassName("browse-movie-title")
        ActiveCell.Value = movie.innerText   ''Scraping movie names
        ActiveCell.Offset(1, 0).Select
    Next movie

    For Each link In html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
        If InStr(link.innerText, "Next") > 0 Then
            yify (main_link & Split(link.href, ":")(1))  ''Feeding next page link to the crawler 
        End If
    Next link
    Application.ScreenUpdating = True

End Sub

Sub RecursiveCrawler()
    Range("A1").Select
    yify ("https://yts.ag/browse-movies/0/all/documentary/0/latest")  ''Crawling process starts here
End Sub
\$\endgroup\$

1 Answer

1
\$\begingroup\$

This is quite clean in general. I would probably only add blank lines to visually separate the blocks of code, put a comment above each block saying what it does, and apply the "Extract Variable" refactoring:

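''The two extracted variables need declaring (required under Option Explicit)
Dim movieTitles As Object, paginationLinks As Object

With http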
    .Open "GET", dynamic_link, False
    .send
    html.body.innerHTML = .responseText
End With

''Extracting movie names
Set movieTitles = html.getElementsByClassName("browse-movie-title")
For Each movie In movieTitles
    ActiveCell.Value = movie.innerText   
    ActiveCell.Offset(1, 0).Select
Next movie

''Feeding next page link to the crawler 
Set paginationLinks = html.getElementsByClassName("tsc_pagination")(0).getElementsByTagName("a")
For Each link In paginationLinks
    If InStr(link.innerText, "Next") > 0 Then
        yify (main_link & Split(link.href, ":")(1)) 
    End If
Next link

Application.ScreenUpdating = True

I don't particularly like the way you are getting the next link, but I don't see that we can use CSS selectors or XPath with this HTML document API, so I don't have a cleaner alternative to suggest.
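
Staying with the same API, the next-link lookup could at least be isolated in a small helper, so the crawler body only asks "is there a next page?". The following is a minimal sketch, not a tested drop-in: `ResolveHref` and `NextPageUrl` are hypothetical names that are not in the original code, and the sketch assumes that relative links come back from the document as `about:/path` (which appears to be why the original splits on `":"`). It also stops at the first "Next" link, so a page with several matching links would not trigger several recursive calls.

```vba
Option Explicit

''Hypothetical helper (not in the original post): turns the href reported by
''the HTML document into an absolute URL. Assumes relative links are reported
''as "about:/path" when the markup was assigned through innerHTML.
Function ResolveHref(ByVal baseUrl As String, ByVal href As String) As String
    Const ABOUT_PREFIX As String = "about:"

    If LCase$(Left$(href, Len(ABOUT_PREFIX))) = ABOUT_PREFIX Then
        ''Drop only the leading "about:", so a later ":" in the path survives
        ResolveHref = baseUrl & Mid$(href, Len(ABOUT_PREFIX) + 1)
    ElseIf Left$(href, 1) = "/" Then
        ''Plain site-relative path
        ResolveHref = baseUrl & href
    Else
        ''Anything else (e.g. "https://...") is returned unchanged
        ResolveHref = href
    End If
End Function

''Hypothetical helper: returns the absolute URL of the first "Next" link,
''or an empty string when there is no pagination block or no such link.
''Needs the same "Microsoft HTML Object Library" reference as the original.
Function NextPageUrl(ByVal html As HTMLDocument, ByVal baseUrl As String) As String
    Dim pagination As Object, link As Object

    Set pagination = html.getElementsByClassName("tsc_pagination")
    If pagination.Length = 0 Then Exit Function

    For Each link In pagination.Item(0).getElementsByTagName("a")
        If InStr(link.innerText, "Next") > 0 Then
            NextPageUrl = ResolveHref(baseUrl, link.href)
            Exit Function
        End If
    Next link
End Function
```

With something like this in place, the second loop in `yify` could shrink to storing `NextPageUrl(html, main_link)` in a string variable and calling `yify` again only when that string is not empty.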

\$\endgroup\$
1
  • \$\begingroup\$ It's always a pleasure to hear from you sir alecxe. Thanks for the pointer. \$\endgroup\$ Commented Sep 26, 2017 at 14:35
