IoT Shaman

An Easy Method for Scraping 3rd Party Websites

The internet has become an endless resource for information on nearly every subject imaginable. As application developers, consuming this information for analysis, presentation, reporting, etc., is often as easy as calling an API; however, sometimes getting access to this information can be more difficult. When there is no API for a developer to access, or access to these APIs is too expensive, another option for gathering this information is to do what is called "Web Scraping".

What is Web Scraping?

Web scraping is the process of consuming a 3rd party website's information in its raw form, so that you can parse the text and gather the information you need. Please note that just because you have the ability to access a website, that doesn't mean you have the right to pull down its data. Please make sure to only access publicly available information, or information that you have been given the right to access.

Downloading the Application

Fortunately, I have developed an easy-to-use web API that handles all of the hard work. To get started, please click this link and clone the git repository. While you are there, check out the README file; it contains all the information you need to understand the API options. Once you have cloned the repository, install the dependencies and start the server by opening a command line, navigating to the project folder, and running the following commands:

$ npm install
$ npm start

Scraping your First Website

Accessing the scraper API is easy: all you need to do is make an HTTP GET request to the server (by default 'http://localhost:3000/') and provide the parameters as query string arguments. The example GET request below shows how to scrape our landing page here at IoT Shaman:

http://localhost:3000/scrape?to_url=www.iotshaman.com&https=true

The list below details the query string arguments you can provide, each with a short description:

  • to_url (required): address of the page you want to scrape (leave off the 'http://' or 'https://' prefix and use the 'https' parameter instead)
  • https (optional): set to true if site requires https, leave blank or set to false if http is allowed
  • iso_body (optional): only return html body
  • remove_origin (optional): remove references to 3rd party links (scripts, styles, iframes, etc); use this if you are injecting the html into another page.
  • prep_html (optional): changes html container tags into injectable tags; use this if you are injecting the html into another page.
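
If you would rather call the scraper from code than type the URL by hand, the query string arguments above can be assembled programmatically. The following is only a minimal sketch, not part of the scraper project itself: the helper names buildScrapeUrl and scrape are hypothetical, it assumes the server is running at the default address, that optional flags are switched on by passing the string 'true', that the response body is the scraped HTML text, and that you are on Node 18 or later (for the built-in fetch).

```javascript
// Assumption: the scraper server is running locally on its default address.
const SCRAPER_BASE = 'http://localhost:3000/';

// Hypothetical helper: builds a /scrape URL from an options object.
function buildScrapeUrl({ toUrl, https = false, isoBody = false, removeOrigin = false, prepHtml = false }) {
  if (!toUrl) throw new Error('toUrl is required');
  const url = new URL('scrape', SCRAPER_BASE);
  // The API expects the address without its scheme; the 'https' flag carries that instead.
  url.searchParams.set('to_url', toUrl.replace(/^https?:\/\//i, ''));
  // Assumption: optional flags are enabled by passing the string 'true'.
  if (https) url.searchParams.set('https', 'true');
  if (isoBody) url.searchParams.set('iso_body', 'true');
  if (removeOrigin) url.searchParams.set('remove_origin', 'true');
  if (prepHtml) url.searchParams.set('prep_html', 'true');
  return url.toString();
}

// Hypothetical helper: requests a page through the scraper and resolves to the response text.
// Assumes Node 18+ so that fetch is available globally.
async function scrape(options) {
  const response = await fetch(buildScrapeUrl(options));
  if (!response.ok) throw new Error(`Scrape failed with status ${response.status}`);
  return response.text();
}
```

With these helpers, calling scrape({ toUrl: 'www.iotshaman.com', https: true }) should request the same URL as the example GET request shown earlier. Keeping the URL construction in its own function makes it easy to check the query string without the server running.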

Parsing the Result Set

Once you have determined which sites you are going to scrape, you will then need a strategy to parse the text. One of my favorite options is an npm module called Cheerio, which offers a jQuery-like interface for parsing scraped HTML text. Another option is to store this information in a database with full-text index capabilities; you can then develop database queries to pull the required data from the data-set.

Conclusion

Hopefully, after reading this article, you have the tools you need to access information even when it's not available through traditional APIs. If you enjoy this project, feel free to get involved. This scraper API is one of the open-source projects here at IoT Shaman, and we are always interested in getting more people involved in our projects.