gaojiuli / gain

  • Sunday, June 4, 2017, 03:12:09
https://github.com/gaojiuli/gain

Python
Web crawling framework for everyone.



Gain

Web crawling framework for everyone. Written with asyncio, uvloop and aiohttp. Everyone can easily write their own web crawler with the Gain framework, which provides a simple API.

Road map

  • Basic spider
  • Custom header
  • Documentation

Requirements

  • Python 3.5+

Based on

  • asyncio
  • uvloop
  • aiohttp
  • pybloomfiltermmap
  • pyquery

Installation

pip install gain

Usage

Write spider.py:

from gain import Css, Item, Parser, Spider


class Post(Item):
    title = Css('.entry-title')
    content = Css('.entry-content')

    async def save(self):
        # Append the extracted title to a text file, one title per line.
        with open('scrapinghub.txt', 'a+') as f:
            f.write(self.results['title'] + '\n')


class MySpider(Spider):
    frequency = 5
    start_url = 'https://blog.scrapinghub.com/'
    # Raw strings keep regex escapes such as \d from being read as string escapes.
    parsers = [Parser(r'https://blog.scrapinghub.com/page/\d+/'),
               Parser(r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/', Post)]


MySpider.run()

Run it with: python spider.py
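
The two Parser rules in spider.py are ordinary regular expressions, so it can help to check which URLs each one accepts before starting a crawl. The sketch below uses only the standard re module and does not call Gain at all. The classify helper and the sample URLs are hypothetical, and the use of re.fullmatch is an assumption, since how Gain applies the patterns internally is not described here.

```python
import re

# Patterns copied from the parsers in spider.py, written as raw strings.
PAGE_PATTERN = r'https://blog.scrapinghub.com/page/\d+/'
POST_PATTERN = r'https://blog.scrapinghub.com/\d{4}/\d{2}/\d{2}/[a-z0-9\-]+/'


def classify(url):
    """Hypothetical helper: return 'post', 'page', or None for a URL.

    Uses re.fullmatch, which is an assumption; Gain may match differently.
    """
    if re.fullmatch(POST_PATTERN, url):
        return 'post'
    if re.fullmatch(PAGE_PATTERN, url):
        return 'page'
    return None


# Sample URLs made up for illustration only.
samples = [
    'https://blog.scrapinghub.com/page/2/',
    'https://blog.scrapinghub.com/2017/05/30/some-post-title/',
    'https://blog.scrapinghub.com/about/',
]

for url in samples:
    print(url, '->', classify(url))
```

In this sketch, only URLs classified as 'post' correspond to the rule that has the Post item attached. The 'page' rule has no item, so it presumably serves only to discover further links.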

Example

The examples are in the /example/ directory.

Contribution

Just open a pull request or an issue.