D4Vinci / Scrapling
- четверг, 26 февраля 2026 г. в 00:00:02
🕷️ An adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl!
العربيه | Español | Deutsch | 简体中文 | 日本語 | Русский
Selection methods · Fetchers · Spiders · Proxy Rotation · CLI · MCP
Scrapling is an adaptive Web Scraping framework that handles everything from a single request to a full-scale crawl.
Its parser learns from website changes and automatically relocates your elements when pages update. Its fetchers bypass anti-bot systems like Cloudflare Turnstile out of the box. And its spider framework lets you scale up to concurrent, multi-session crawls with pause/resume and automatic proxy rotation — all in a few lines of Python. One library, zero compromises.
Blazing fast crawls with real-time stats and streaming. Built by Web Scrapers for Web Scrapers and regular users, there's something for everyone.
from scrapling.fetchers import Fetcher, AsyncFetcher, StealthyFetcher, DynamicFetcher
StealthyFetcher.adaptive = True
p = StealthyFetcher.fetch('https://example.com', headless=True, network_idle=True) # Fetch website under the radar!
products = p.css('.product', auto_save=True) # Scrape data that survives website design changes!
products = p.css('.product', adaptive=True) # Later, if the website structure changes, pass `adaptive=True` to find them!Or scale up to full crawls
from scrapling.spiders import Spider, Response
class MySpider(Spider):
name = "demo"
start_urls = ["https://example.com/"]
async def parse(self, response: Response):
for item in response.css('.product'):
yield {"title": item.css('h2::text').get()}
MySpider().start()Do you want to show your ad here? Click here and choose the tier that suites you!
start_urls, async parse callbacks, and Request/Response objects.async for item in spider.stream() with real-time stats — ideal for UI, pipelines, and long-running crawls.result.items.to_json() / result.items.to_jsonl() respectively.Fetcher class. Can impersonate browsers' TLS fingerprint, headers, and use HTTP/3.DynamicFetcher class supporting Playwright's Chromium and Google's Chrome.StealthyFetcher and fingerprint spoofing. Can easily bypass all types of Cloudflare's Turnstile/Interstitial with automation.FetcherSession, StealthySession, and DynamicSession classes for cookie and state management across requests.ProxyRotator with cyclic or custom rotation strategies across all session types, plus per-request proxy overrides.Let's give you a quick glimpse of what Scrapling can do without deep diving.
HTTP requests with session support
from scrapling.fetchers import Fetcher, FetcherSession
with FetcherSession(impersonate='chrome') as session: # Use latest version of Chrome's TLS fingerprint
page = session.get('https://quotes.toscrape.com/', stealthy_headers=True)
quotes = page.css('.quote .text::text').getall()
# Or use one-off requests
page = Fetcher.get('https://quotes.toscrape.com/')
quotes = page.css('.quote .text::text').getall()Advanced stealth mode
from scrapling.fetchers import StealthyFetcher, StealthySession
with StealthySession(headless=True, solve_cloudflare=True) as session: # Keep the browser open until you finish
page = session.fetch('https://nopecha.com/demo/cloudflare', google_search=False)
data = page.css('#padded_content a').getall()
# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = StealthyFetcher.fetch('https://nopecha.com/demo/cloudflare')
data = page.css('#padded_content a').getall()Full browser automation
from scrapling.fetchers import DynamicFetcher, DynamicSession
with DynamicSession(headless=True, disable_resources=False, network_idle=True) as session: # Keep the browser open until you finish
page = session.fetch('https://quotes.toscrape.com/', load_dom=False)
data = page.xpath('//span[@class="text"]/text()').getall() # XPath selector if you prefer it
# Or use one-off request style, it opens the browser for this request, then closes it after finishing
page = DynamicFetcher.fetch('https://quotes.toscrape.com/')
data = page.css('.quote .text::text').getall()Build full crawlers with concurrent requests, multiple session types, and pause/resume:
from scrapling.spiders import Spider, Request, Response
class QuotesSpider(Spider):
name = "quotes"
start_urls = ["https://quotes.toscrape.com/"]
concurrent_requests = 10
async def parse(self, response: Response):
for quote in response.css('.quote'):
yield {
"text": quote.css('.text::text').get(),
"author": quote.css('.author::text').get(),
}
next_page = response.css('.next a')
if next_page:
yield response.follow(next_page[0].attrib['href'])
result = QuotesSpider().start()
print(f"Scraped {len(result.items)} quotes")
result.items.to_json("quotes.json")Use multiple session types in a single spider:
from scrapling.spiders import Spider, Request, Response
from scrapling.fetchers import FetcherSession, AsyncStealthySession
class MultiSessionSpider(Spider):
name = "multi"
start_urls = ["https://example.com/"]
def configure_sessions(self, manager):
manager.add("fast", FetcherSession(impersonate="chrome"))
manager.add("stealth", AsyncStealthySession(headless=True), lazy=True)
async def parse(self, response: Response):
for link in response.css('a::attr(href)').getall():
# Route protected pages through the stealth session
if "protected" in link:
yield Request(link, sid="stealth")
else:
yield Request(link, sid="fast", callback=self.parse) # explicit callbackPause and resume long crawls with checkpoints by running the spider like this:
QuotesSpider(crawldir="./crawl_data").start()Press Ctrl+C to pause gracefully — progress is saved automatically. Later, when you start the spider again, pass the same crawldir, and it will resume from where it stopped.
from scrapling.fetchers import Fetcher
# Rich element selection and navigation
page = Fetcher.get('https://quotes.toscrape.com/')
# Get quotes with multiple selection methods
quotes = page.css('.quote') # CSS selector
quotes = page.xpath('//div[@class="quote"]') # XPath
quotes = page.find_all('div', {'class': 'quote'}) # BeautifulSoup-style
# Same as
quotes = page.find_all('div', class_='quote')
quotes = page.find_all(['div'], class_='quote')
quotes = page.find_all(class_='quote') # and so on...
# Find element by text content
quotes = page.find_by_text('quote', tag='div')
# Advanced navigation
quote_text = page.css('.quote')[0].css('.text::text').get()
quote_text = page.css('.quote').css('.text::text').getall() # Chained selectors
first_quote = page.css('.quote')[0]
author = first_quote.next_sibling.css('.author::text')
parent_container = first_quote.parent
# Element relationships and similarity
similar_elements = first_quote.find_similar()
below_elements = first_quote.below_elements()You can use the parser right away if you don't want to fetch websites like below:
from scrapling.parser import Selector
page = Selector("<html>...</html>")And it works precisely the same way!
import asyncio
from scrapling.fetchers import FetcherSession, AsyncStealthySession, AsyncDynamicSession
async with FetcherSession(http3=True) as session: # `FetcherSession` is context-aware and can work in both sync/async patterns
page1 = session.get('https://quotes.toscrape.com/')
page2 = session.get('https://quotes.toscrape.com/', impersonate='firefox135')
# Async session usage
async with AsyncStealthySession(max_pages=2) as session:
tasks = []
urls = ['https://example.com/page1', 'https://example.com/page2']
for url in urls:
task = session.fetch(url)
tasks.append(task)
print(session.get_pool_stats()) # Optional - The status of the browser tabs pool (busy/free/error)
results = await asyncio.gather(*tasks)
print(session.get_pool_stats())Scrapling includes a powerful command-line interface:
Launch the interactive Web Scraping shell
scrapling shellExtract pages to a file directly without programming (Extracts the content inside the body tag by default). If the output file ends with .txt, then the text content of the target will be extracted. If it ends in .md, it will be a Markdown representation of the HTML content; if it ends in .html, it will be the HTML content itself.
scrapling extract get 'https://example.com' content.md
scrapling extract get 'https://example.com' content.txt --css-selector '#fromSkipToProducts' --impersonate 'chrome' # All elements matching the CSS selector '#fromSkipToProducts'
scrapling extract fetch 'https://example.com' content.md --css-selector '#fromSkipToProducts' --no-headless
scrapling extract stealthy-fetch 'https://nopecha.com/demo/cloudflare' captchas.html --css-selector '#padded_content a' --solve-cloudflareNote
There are many additional features, but we want to keep this page concise, including the MCP server and the interactive Web Scraping Shell. Check out the full documentation here
Scrapling isn't just powerful—it's also blazing fast. The following benchmarks compare Scrapling's parser with the latest versions of other popular libraries.
| # | Library | Time (ms) | vs Scrapling |
|---|---|---|---|
| 1 | Scrapling | 2.02 | 1.0x |
| 2 | Parsel/Scrapy | 2.04 | 1.01 |
| 3 | Raw Lxml | 2.54 | 1.257 |
| 4 | PyQuery | 24.17 | ~12x |
| 5 | Selectolax | 82.63 | ~41x |
| 6 | MechanicalSoup | 1549.71 | ~767.1x |
| 7 | BS4 with Lxml | 1584.31 | ~784.3x |
| 8 | BS4 with html5lib | 3391.91 | ~1679.1x |
Scrapling's adaptive element finding capabilities significantly outperform alternatives:
| Library | Time (ms) | vs Scrapling |
|---|---|---|
| Scrapling | 2.39 | 1.0x |
| AutoScraper | 12.45 | 5.209x |
All benchmarks represent averages of 100+ runs. See benchmarks.py for methodology.
Scrapling requires Python 3.10 or higher:
pip install scraplingThis installation only includes the parser engine and its dependencies, without any fetchers or commandline dependencies.
If you are going to use any of the extra features below, the fetchers, or their classes, you will need to install fetchers' dependencies and their browser dependencies as follows:
pip install "scrapling[fetchers]"
scrapling installThis downloads all browsers, along with their system dependencies and fingerprint manipulation dependencies.
Extra features:
pip install "scrapling[ai]"extract command):
pip install "scrapling[shell]"pip install "scrapling[all]"Remember that you need to install the browser dependencies with scrapling install after any of these extras (if you didn't already)
You can also install a Docker image with all extras and browsers with the following command from DockerHub:
docker pull pyd4vinci/scraplingOr download it from the GitHub registry:
docker pull ghcr.io/d4vinci/scrapling:latestThis image is automatically built and pushed using GitHub Actions and the repository's main branch.
We welcome contributions! Please read our contributing guidelines before getting started.
Caution
This library is provided for educational and research purposes only. By using this library, you agree to comply with local and international data scraping and privacy laws. The authors and contributors are not responsible for any misuse of this software. Always respect the terms of service of websites and robots.txt files.
This work is licensed under the BSD-3-Clause License.
This project includes code adapted from: