Colly for Scraping in Go
Price Watching - The New Bird Watching
Manual Labour
You can choose to watch the price changes manually, but that is hard, tedious work. It might make sense in some cases, but really, who wants to navigate to a page, search for the desired products, open them in new tabs and check the price, only to find out it's still too expensive?
So I ask you this: why would you watch price changes yourself when you can have a service do it for you?
Granted, there are a lot of SaaS products out there to help you with this, but why not build something yourself and learn from the experience?
SaaS
I know there are a lot of scraping services out there; a simple Google search for scraping services fills the whole page with more than 44 million results.
Creating The Service
Manual Labour - Again
First, we have a bit of manual work ahead of us, but well, that is what turns us into developers, so why not?
Like I wrote in the beginning of this post, we still need to navigate to all of our desired products, note the URL, and that is it, really! Simply copy the URLs and save them for later.
Then we go to each of these URLs and find the title of the product.
I've chosen two URLs for testing purposes: one to watch the price of a Roku Streaming Stick and one for the Xiaomi Mi 10.
MVP
First, let's add the necessary code:
package main

import (
    "log"
    "regexp"

    "github.com/gocolly/colly"
)

// lines matches carriage returns and newlines, so we can strip them from scraped text.
var lines = regexp.MustCompile("(\r|\n)")

func main() {
    // The product search pages we want to watch.
    links := []string{
        "https://www.amazon.com/s?k=roku+streaming+stick",
        "https://www.amazon.com/s?k=xiaomi+mi+10",
    }

    c := colly.NewCollector(
        colly.CacheDir("./sites"),  // cache responses on disk
        colly.Async(true),          // visit the links concurrently
        colly.AllowURLRevisit(),    // allow visiting the same URL again on later runs
    )

    c.OnRequest(func(r *colly.Request) {
        r.Headers.Set("User-Agent", "PriceWatch/v1.0")
    })

    c.OnError(func(r *colly.Response, err error) {
        log.Println(err)
    })

    c.OnScraped(func(r *colly.Response) {
        log.Println("Done", r.Request.URL)
    })

    c.OnHTML("#search", func(h *colly.HTMLElement) {
        // The scraping logic goes here - we'll fill it in below.
    })

    for _, link := range links {
        c.Visit(link)
    }

    c.Wait()
}
The imports will be added for you automatically if you use an IDE or editor with Go support.
The next few lines in func main() are the links we will be watching. After the links we construct a new Colly collector, to which we add a CacheDir option so the sites are cached and we don't get IP banned by Amazon.
Note: I'm not sure whether they would actually ban us, but better safe than sorry.
Next up we set the User-Agent to PriceWatch/v1.0, which is more or less the standard way of identifying a scraper or bot.
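As a side note, Colly also ships with a UserAgent collector option, so if you prefer you can set it once in the constructor instead of in the OnRequest callback. A minimal sketch of that variant:

    c := colly.NewCollector(
        colly.CacheDir("./sites"),
        colly.Async(true),
        colly.AllowURLRevisit(),
        // Set the User-Agent once for every request made by this collector.
        colly.UserAgent("PriceWatch/v1.0"),
    )

Both approaches end up sending the same header, so it's purely a matter of taste.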
The Scraping
Let's add some scraping functionality to our scraper.
c.OnHTML("#search", func(h *colly.HTMLElement) {
h.ForEach("div.s-main-slot.s-result-list.s-search-results.sg-row > div > div", func(_ int, e *colly.HTMLElement) {
title := e.ChildText("h2.a-size-mini.a-spacing-none.a-color-base.s-line-clamp-2")
price := e.ChildText("span.a-price")
title = lines.ReplaceAllString(title, "")
price = lines.ReplaceAllString(price, "")
title = strings.TrimSpace(title)
price = strings.TrimSpace(price)
priceStrs := strings.Split(price, "$")
prices := []float64{}
for _, priceStr := range priceStrs {
p, err := strconv.ParseFloat(priceStr, 64)
if err == nil {
prices = append(prices, p)
}
}
log.Println(title, prices)
})
})
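Since the handler now uses strings and strconv, remember to add them to the import block as well:

    import (
        "log"
        "regexp"
        "strconv"
        "strings"

        "github.com/gocolly/colly"
    )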
As you can see, I've added four CSS selectors to the scraper, namely:
#search
div.s-main-slot.s-result-list.s-search-results.sg-row > div > div
h2.a-size-mini.a-spacing-none.a-color-base.s-line-clamp-2
span.a-price
Finding these selectors was quite hit or miss, pure trial and error, but in the end I succeeded. You can too, I believe in you!
Now, if we run our program, each product's title and its prices are logged to the console, which is exactly what we want!
Watching
I did kinda lure you into this post by saying we'd watch the prices, but from here it's simply a matter of saving our products in a datastore of sorts, be it a database or a JSON/CSV file. It's up to you.
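If you want a head start, here is a minimal sketch of one possible approach, appending each scraped result to a JSON-lines file. The Product type, the saveProduct helper and the prices.jsonl file name are just my own assumptions, not part of the code above; it also needs "encoding/json", "os" and "time" added to the imports.

    // Product is a hypothetical record for one scraped search result.
    type Product struct {
        Title string    `json:"title"`
        Price float64   `json:"price"`
        Seen  time.Time `json:"seen"`
    }

    // saveProduct appends one product as a single JSON line to prices.jsonl.
    func saveProduct(p Product) error {
        f, err := os.OpenFile("prices.jsonl", os.O_APPEND|os.O_CREATE|os.O_WRONLY, 0644)
        if err != nil {
            return err
        }
        defer f.Close()
        return json.NewEncoder(f).Encode(p)
    }

You could then call saveProduct from inside the ForEach handler for every price you find, for example saveProduct(Product{Title: title, Price: prices[0], Seen: time.Now()}), and compare the new prices against the previous run.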
I'll leave the rest of this part as an exercise for you to complete.
Please, by all means, do post the results of how you did it in a gist, or simply as a short explanation in a comment, if you want to, that is.
{ Best, Mads Bram Cordes }