Colly for Scraping in Go

Mads B. Cordes · Nov 11, 2020 · 3 min read

Price Watching - The New Bird Watching

Manual Labour

You can choose to watch price changes manually, but that is tedious work. It may be fine in some cases, but who really wants to navigate to a page, search for the desired products, open them in new tabs and check the price, only to find out it's still too expensive?

So I ask you this: why watch price changes yourself when you can have a service do it for you?

Granted, there are plenty of SaaS products out there to help you with this, but why not build something yourself and learn from the experience?

SaaS

I know there are a lot of scraping services out there; a simple Google search for scraping services fills the page with more than 44 million results.

Creating The Service

Manual Labour - Again

First, we have a fair bit of manual work ahead of us, but hey, that's how we grow as developers, so why not?

Like I wrote in the beginning of this post, we still need to navigate to each of our desired products and note the URL, and that's it, really! Simply copy the URLs and save them for later.

Now we need to go to one of these URLs and find the title of the product.

I've chosen these two URLs for testing purposes:

  • https://www.amazon.com/s?k=roku+streaming+stick
  • https://www.amazon.com/s?k=xiaomi+mi+10

With these two we'll watch for price changes of a Roku Streaming Stick and the Xiaomi Mi 10.

MVP

First, let's add the necessary code:

package main

import (
    "log"
    "regexp"

    "github.com/gocolly/colly"
)

// lines matches carriage returns and newlines, so we can strip them from scraped text later.
var lines = regexp.MustCompile("(\r|\n)")

func main() {
    // The search result pages we want to watch.
    links := []string{
        "https://www.amazon.com/s?k=roku+streaming+stick",
        "https://www.amazon.com/s?k=xiaomi+mi+10",
    }

    c := colly.NewCollector(
        colly.CacheDir("./sites"),   // cache responses on disk between runs
        colly.Async(true),           // visit the links concurrently
        colly.AllowURLRevisit(),     // allow scraping the same URL more than once
    )

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "PriceWatch/v1.0")
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Println(err)
    })

    c.OnScraped(func(r *colly.Response) {
        log.Println("Done", r.Request.URL)
    })

    // The scraping logic will go here.
    c.OnHTML("#search", func(h *colly.HTMLElement) {

    })

    for _, link := range links {
        if err := c.Visit(link); err != nil {
            log.Println(err)
        }
    }

    // Wait for the async visits to finish.
    c.Wait()
}

The imports will be added automatically if you use an IDE with Go integration (gopls/goimports).

The first few lines in func main() are the links we will be watching. After the links comes a new Colly collector, where we add a CacheDir to cache the sites locally, so we don't hammer Amazon and risk getting IP banned.

Note: I'm not sure they would actually ban us, but better safe than sorry.
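If you want to be extra careful, Colly can also rate-limit requests per domain. Here's a minimal sketch, placed right after the collector is created; the domain glob, parallelism and delay are assumptions you should tune yourself (it needs "time" in the import block):

// Throttle requests to Amazon so they don't arrive in a rapid burst.
err := c.Limit(&colly.LimitRule{
    DomainGlob:  "*amazon.*",     // apply this rule to Amazon domains
    Parallelism: 2,               // at most two concurrent requests
    RandomDelay: 5 * time.Second, // up to 5s of random delay between requests
})
if err != nil {
    log.Println(err)
}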

Next up, we set the User-Agent to PriceWatch/v1.0, which is more or less the standard way to identify scrapers and bots.
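A common convention is to also include a contact URL in the user agent so site owners can reach you. A small sketch, replacing the OnRequest callback above (the URL is just a placeholder):

c.OnRequest(func(r *colly.Request) {
    // Placeholder contact URL -- point it at a page describing your bot.
    r.Headers.Set("User-Agent", "PriceWatch/1.0 (+https://example.com/pricewatch)")
})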

The Scraping

Let's add some scraping functionality to our scraper.

c.OnHTML("#search", func(h *colly.HTMLElement) {
    h.ForEach("div.s-main-slot.s-result-list.s-search-results.sg-row > div > div", func(_ int, e *colly.HTMLElement) {
        title := e.ChildText("h2.a-size-mini.a-spacing-none.a-color-base.s-line-clamp-2")
        price := e.ChildText("span.a-price")

        title = lines.ReplaceAllString(title, "")
        price = lines.ReplaceAllString(price, "")

        title = strings.TrimSpace(title)
        price = strings.TrimSpace(price)

        priceStrs := strings.Split(price, "$")
        prices := []float64{}

        for _, priceStr := range priceStrs {
            p, err := strconv.ParseFloat(priceStr, 64)
            if err == nil {
                prices = append(prices, p)
            }
        }

        log.Println(title, prices)
    })
})

As you can see, I've added four CSS selectors to the scraper, namely:

  • #search
  • div.s-main-slot.s-result-list.s-search-results.sg-row > div > div
  • h2.a-size-mini.a-spacing-none.a-color-base.s-line-clamp-2
  • span.a-price

Finding these selectors was quite hit or miss; trial and error, really. But in the end I managed to find them, and you can too, I believe in you!
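If you also want the product URL for each result, the same element can give it to you via ChildAttr. A small sketch that would go inside the ForEach callback above; the "h2 a" selector is an assumption about Amazon's markup and may need the same trial and error:

// Hypothetical: grab the link of the result and turn it into an absolute URL.
link := e.ChildAttr("h2 a", "href")
productURL := e.Request.AbsoluteURL(link)
log.Println(productURL)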

Now, if we run our program:

[Screenshot: log output listing each product title with its parsed prices]

Which is exactly what we want!

Watching

I did kind of lure you into this post by saying we'd watch the prices, but from here it's simply a matter of saving our products in a datastore of sorts, be it a database or a JSON/CSV file. It's up to you.

I'll leave this next part as an exercise for you to complete.
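That said, if you want a starting point, here's a minimal sketch that appends the scraped results to a JSON-lines file. The PriceEntry type, the appendEntries helper and the watch.json filename are all just assumptions for the example (it needs "encoding/json", "os" and "time" in the imports):

// A hypothetical record of one scraped result at a given point in time.
type PriceEntry struct {
    Title  string    `json:"title"`
    Prices []float64 `json:"prices"`
    Seen   time.Time `json:"seen"`
}

// appendEntries appends the entries to a JSON-lines file, one entry per line,
// so every run of the scraper adds a fresh snapshot.
func appendEntries(path string, entries []PriceEntry) error {
    f, err := os.OpenFile(path, os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0o644)
    if err != nil {
        return err
    }
    defer f.Close()

    enc := json.NewEncoder(f)
    for _, e := range entries {
        if err := enc.Encode(e); err != nil {
            return err
        }
    }
    return nil
}

Collect the entries in a slice inside the ForEach callback instead of logging them, then call appendEntries("watch.json", entries) after c.Wait(). One thing to keep in mind: with colly.Async(true) the callbacks can run concurrently, so guard that shared slice with a mutex. Comparing the latest snapshot against the previous one is what turns this scraper into an actual price watcher.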

Please, by all means, do post how you did it in a gist, or just a simple explanation in a comment, if you want to, that is.


{ Best, Mads Bram Cordes }
