
Web scraping with Golang and Colly

Mradul Mishra

Learn how to create a web scraper in Go using the Colly library. With Colly, you can extract data from websites and web pages using simple and efficient Go code. In this tutorial, we'll walk you through setting up a Colly collector, defining HTML callbacks, and visiting a website to start scraping. We'll also show you how to customize and extend your web scraper to fit your needs, such as handling cookies, following links, and storing data in a database. By the end of this tutorial, you'll have a fully functional web scraper built with Golang and Colly.

What is Colly?

Colly is a popular open-source library for web scraping and crawling written in Go (also known as Golang). It provides a simple and flexible API for building web scrapers and crawlers that can extract data from websites and web pages.

Colly is designed to be easy to use and efficient, with a focus on concurrency and performance. It fetches pages over HTTP and HTTPS and provides a number of features that make it well-suited for web scraping and crawling, including the ability to follow links, support for custom request headers and cookies, and pluggable storage backends (e.g. in-memory, Redis, MongoDB). Note that these storage backends hold the collector's own state, such as visited URLs and cookies, rather than the data you scrape.

Colly is widely used by developers to build web scrapers and crawlers for a variety of purposes, such as data mining, price comparison, and search engine indexing. It is a powerful tool for extracting data from the web and can be used to build a wide range of applications and services.
 

Here is a simple example of how you can use the Colly library to scrape a website in Go:

package main

import (
	"fmt"
	"log"

	"github.com/gocolly/colly"
)

func main() {
	// Create a new colly collector
	c := colly.NewCollector()

	// Set HTML callback: runs once for every <a> element that has an href
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		// Print link
		fmt.Println(e.Attr("href"))
	})

	// Start scraping on https://www.example.com and stop if the request fails
	if err := c.Visit("https://www.example.com"); err != nil {
		log.Fatal(err)
	}
}

This code creates a new Colly collector and sets an HTML callback function that will be called for each a element with an href attribute found in the HTML document. The callback function prints the value of the href attribute to the console.

To start the scraping, the code calls the Visit method with the URL of the website to be scraped (https://www.example.com in this case). Colly will then send an HTTP request to this URL, download the HTML document, and extract all the links from the document.

This is a very basic example of web scraping with Colly and Go, and you can customize and extend it to fit your specific needs. For example, you can add more callbacks to extract other types of data, use Colly's features to set request headers, handle cookies, and follow links, or write the extracted values to a database from inside your callbacks.

How to store data in a database using Colly

Colly does not write scraped data to a database for you. Its storage backends (in-memory by default, with Redis and MongoDB among the alternatives) persist the collector's own state, such as visited URLs and cookies, and you select one by calling SetStorage on the collector. To store the data you extract, collect it in your callbacks and write it to the database yourself, for example with Go's database/sql package.

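Here is an example that crawls asynchronously and keeps every link it finds in memory, in a slice guarded by a mutex, ready to be written to a database once the crawl finishes:

package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/gocolly/colly"
	"github.com/gocolly/colly/debug"
)

func main() {
	// Create a collector that limits crawl depth, sends requests
	// asynchronously, and logs debug events
	c := colly.NewCollector(
		colly.MaxDepth(1),
		colly.Async(true),
		colly.Debugger(&debug.LogDebugger{}),
	)

	// Links found during the crawl; the mutex guards the slice because
	// callbacks can run concurrently in async mode
	var (
		mu    sync.Mutex
		links []string
	)

	// Set HTML callback
	c.OnHTML("a[href]", func(e *colly.HTMLElement) {
		link := e.Request.AbsoluteURL(e.Attr("href"))
		if link == "" {
			return
		}

		// Keep the link in memory
		mu.Lock()
		links = append(links, link)
		mu.Unlock()

		// Queue the link through the current request so MaxDepth applies
		// (c.Visit would restart the depth count); errors such as
		// already-visited or max-depth are expected here and ignored
		_ = e.Request.Visit(link)
	})

	// Set request callback
	c.OnRequest(func(r *colly.Request) {
		// Print request information
		fmt.Println("Visiting", r.URL)
	})

	// Start scraping on https://www.example.com
	if err := c.Visit("https://www.example.com"); err != nil {
		log.Fatal(err)
	}

	// Wait until all scrapes are finished
	c.Wait()

	// Print the stored data
	fmt.Println("Stored", len(links), "links:")
	for _, link := range links {
		fmt.Println(link)
	}
}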
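To actually persist what a crawl finds, one option is to write the values to a database with Go's database/sql package. The following is a minimal sketch under stated assumptions: the links table, its page_url and href columns, the "?" placeholder style (which depends on the driver), and the names execer, saveLinks, and recorder are all hypothetical. The recorder type stands in for a real database so the sketch runs without a driver.

```go
package main

import (
	"database/sql"
	"fmt"
)

// execer is the subset of *sql.DB (and *sql.Tx) that saveLinks needs.
type execer interface {
	Exec(query string, args ...interface{}) (sql.Result, error)
}

// insertLink is a hypothetical statement; the table, the columns, and the
// "?" placeholder style are assumptions that depend on your schema and driver.
const insertLink = "INSERT INTO links (page_url, href) VALUES (?, ?)"

// saveLinks inserts one row per non-empty link found on pageURL and returns
// the number of rows inserted.
func saveLinks(db execer, pageURL string, links []string) (int, error) {
	saved := 0
	for _, href := range links {
		if href == "" {
			continue // skip hrefs that could not be resolved
		}
		if _, err := db.Exec(insertLink, pageURL, href); err != nil {
			return saved, fmt.Errorf("saving %q: %w", href, err)
		}
		saved++
	}
	return saved, nil
}

// recorder is a stand-in for a real database: it only remembers the
// arguments of each Exec call.
type recorder struct {
	rows [][]interface{}
}

func (r *recorder) Exec(query string, args ...interface{}) (sql.Result, error) {
	r.rows = append(r.rows, args)
	return nil, nil
}

func main() {
	db := &recorder{}
	n, err := saveLinks(db, "https://www.example.com", []string{
		"https://www.example.com/a",
		"",
		"https://www.example.com/b",
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("saved", n, "links") // prints "saved 2 links"
}
```

In a real scraper you would open a connection with sql.Open and the driver of your choice, then pass the resulting *sql.DB to saveLinks, either after the crawl has finished or from inside an OnHTML callback.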

 
