CrawlSharp 1.0.22
```
# .NET CLI
dotnet add package CrawlSharp --version 1.0.22

# Package Manager Console
NuGet\Install-Package CrawlSharp -Version 1.0.22

# Paket CLI
paket add CrawlSharp --version 1.0.22
```

```xml
<!-- PackageReference in the project file -->
<PackageReference Include="CrawlSharp" Version="1.0.22" />

<!-- Central package management: version in Directory.Packages.props,
     versionless reference in the project file -->
<PackageVersion Include="CrawlSharp" Version="1.0.22" />
<PackageReference Include="CrawlSharp" />
```

```
# Script & Interactive (F# Interactive, C# scripting)
#r "nuget: CrawlSharp, 1.0.22"

# File-based apps
#:package CrawlSharp@1.0.22

# Cake
#addin nuget:?package=CrawlSharp&version=1.0.22
#tool nuget:?package=CrawlSharp&version=1.0.22
```
<img src="https://raw.githubusercontent.com/jchristn/CrawlSharp/refs/heads/main/assets/icon.png" width="256" height="256">
CrawlSharp
CrawlSharp is a library and integrated webserver for crawling basic web content.
New in v1.0.22
- Added opt-in auto-expansion of common collapsible content for headless crawls
- Added tunable headless expansion delays, expansion pass count, and custom expansion selectors
- Added a top-right dashboard server endpoint selector for proxy, localhost, and custom server URLs
- Clarified rendered HTML capture behavior for headless navigable pages and direct-download handling for non-navigable assets
- Added automated coverage for rendered HTML capture, revealed-link discovery, and PDF fallback behavior
Bugs, Feedback, or Enhancement Requests
Please feel free to start an issue or a discussion!
Simple Example, Embedded
Embedding CrawlSharp into your application is simple and requires minimal configuration. Refer to the Test project for a full example.
```csharp
using System;
using System.Collections.Generic;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "http://www.mywebpage.com";
settings.Crawl.UseHeadlessBrowser = true; // slow, but useful for sites that block bots or render content with JavaScript

using (WebCrawler crawler = new WebCrawler(settings))
{
    await foreach (WebResource resource in crawler.CrawlAsync())
        Console.WriteLine(resource.Status + ": " + resource.Url);
}
```
WebCrawler.CrawlAsync can be awaited and returns an IAsyncEnumerable<WebResource>; WebCrawler.Crawl cannot be awaited and returns an IEnumerable<WebResource>.
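For synchronous consumption, the same loop can be written against Crawl. A minimal sketch, reusing the placeholder URL from above:

```csharp
using System;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "http://www.mywebpage.com";

using (WebCrawler crawler = new WebCrawler(settings))
{
    // Crawl() blocks between resources rather than yielding asynchronously.
    foreach (WebResource resource in crawler.Crawl())
        Console.WriteLine(resource.Status + ": " + resource.Url);
}
```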
Opt-in auto-expansion can be enabled for headless crawls when you need CrawlSharp to open common collapsible UI patterns before HTML capture:
```csharp
using System;
using System.Collections.Generic;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "https://www.mywebpage.com";
settings.Crawl.UseHeadlessBrowser = true;
settings.Crawl.AutoExpandCollapsibles = true;
settings.Crawl.PostLoadDelayMs = 500;       // wait for post-load hydration before expanding
settings.Crawl.ExpansionSelectors = new List<string>
{
    ".faq-toggle"                           // site-specific toggle to click before capture
};

using (WebCrawler crawler = new WebCrawler(settings))
{
    await foreach (WebResource resource in crawler.CrawlAsync())
        Console.WriteLine(resource.Status + ": " + resource.Url);
}
```
Crawl Settings
| Setting | Type | Default | Description |
|---|---|---|---|
| `UserAgent` | `string` | `CrawlSharp` | User agent string sent with requests |
| `StartUrl` | `string` | `null` | The URL from which to begin crawling |
| `UseHeadlessBrowser` | `bool` | `false` | Use a headless browser (Playwright) for crawling |
| `AutoExpandCollapsibles` | `bool` | `false` | Opt in to expanding common collapsible UI patterns before headless HTML capture |
| `PostLoadDelayMs` | `int` | `0` | Delay in milliseconds after navigation and before headless auto-expansion starts |
| `PostInteractionDelayMs` | `int` | `250` | Delay in milliseconds after each headless expansion pass |
| `MaxExpansionPasses` | `int` | `2` | Maximum number of headless expansion passes before HTML capture |
| `ExpansionSelectors` | `List<string>` | `[]` | Additional CSS selectors to click during headless auto-expansion |
| `IgnoreRobotsText` | `bool` | `false` | Ignore the robots.txt file |
| `IncludeSitemap` | `bool` | `true` | Include URLs from sitemap.xml |
| `FollowLinks` | `bool` | `true` | Follow links found on crawled pages |
| `FollowRedirects` | `bool` | `true` | Follow HTTP redirect responses |
| `RestrictToChildUrls` | `bool` | `true` | Only follow links that are children of the start URL |
| `RestrictToSameSubdomain` | `bool` | `true` | Only follow links within the same subdomain |
| `RestrictToSameRootDomain` | `bool` | `true` | Only follow links within the same root domain |
| `AllowedDomains` | `List<string>` | `[]` | If non-empty, only these domains will be crawled |
| `DeniedDomains` | `List<string>` | `[]` | If non-empty, these domains will be excluded |
| `MaxCrawlDepth` | `int` | `5` | Maximum depth of links to follow from the start URL |
| `ExcludeLinkPatterns` | `List<Regex>` | `[]` | Regex patterns for URLs to exclude from crawling |
| `FollowExternalLinks` | `bool` | `true` | Follow links to external domains |
| `MaxParallelTasks` | `int` | `8` | Maximum number of concurrent crawl tasks |
| `PageTimeoutMs` | `int` | `30000` | Timeout in milliseconds for retrieving each page (minimum 1000) |
| `ThrottleMs` | `int` | `5000` | Delay in milliseconds when a 429 response is received and retries are exhausted |
| `RetryOn429` | `bool` | `true` | Enable automatic retry with backoff on 429 responses |
| `MaxRetries` | `int` | `3` | Maximum number of retry attempts on 429 (minimum 1) |
| `RetryMinBackoffMs` | `int` | `1000` | Minimum backoff delay in milliseconds (minimum 100) |
| `RetryMaxBackoffMs` | `int` | `30000` | Maximum backoff delay in milliseconds (minimum 1000) |
| `RetryBackoffJitter` | `bool` | `true` | Add random jitter to backoff delay to avoid thundering herd |
| `RequestDelayMs` | `int` | `2500` | Delay in milliseconds between each HTTP request |
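As an illustrative sketch of combining several of these settings (the URLs, domains, and patterns below are placeholders, not recommendations):

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "https://www.mywebpage.com";
settings.Crawl.MaxCrawlDepth = 3;          // stop three link levels below the start URL
settings.Crawl.MaxParallelTasks = 4;       // fewer concurrent tasks than the default 8
settings.Crawl.RequestDelayMs = 1000;      // one second between HTTP requests
settings.Crawl.DeniedDomains = new List<string> { "ads.mywebpage.com" };
settings.Crawl.ExcludeLinkPatterns = new List<Regex>
{
    new Regex(@"/logout")                  // never follow logout links
};
settings.Crawl.RetryOn429 = true;
settings.Crawl.MaxRetries = 5;             // more persistent than the default 3
```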
Rendered HTML in Headless Mode
When UseHeadlessBrowser is enabled, CrawlSharp captures the rendered DOM HTML from Playwright for navigable pages and stores it in WebResource.Data.
When headless crawling is not used, CrawlSharp returns the server response bytes directly. For non-navigable assets such as PDFs, CrawlSharp uses direct HTTP retrieval even when headless crawling is enabled.
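A sketch of consuming both cases, assuming HTML arrives as UTF-8 text; the output file name is hypothetical:

```csharp
using System;
using System.IO;
using System.Text;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "https://www.mywebpage.com";
settings.Crawl.UseHeadlessBrowser = true;

using (WebCrawler crawler = new WebCrawler(settings))
{
    await foreach (WebResource resource in crawler.CrawlAsync())
    {
        if (resource.Data == null || resource.ContentType == null) continue;

        if (resource.ContentType.Contains("text/html"))
        {
            // Rendered DOM HTML when headless crawling is enabled; assumes UTF-8.
            string html = Encoding.UTF8.GetString(resource.Data);
            Console.WriteLine(resource.Url + ": " + html.Length + " characters");
        }
        else if (resource.ContentType.Contains("application/pdf"))
        {
            // Non-navigable assets arrive as raw bytes via direct HTTP retrieval.
            File.WriteAllBytes("download.pdf", resource.Data);
        }
    }
}
```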
Headless Auto-Expand
AutoExpandCollapsibles is disabled by default. Enable it when a page only inserts usable content into the DOM after a collapsible control is opened.
When enabled in headless mode, CrawlSharp will:
- Open closed `<details>` elements
- Click a conservative set of common collapsible controls, such as ARIA-backed toggles and Bootstrap collapse buttons
- Apply any additional selectors supplied through `ExpansionSelectors`
Use PostLoadDelayMs when a page hydrates UI after the browser load event. Use PostInteractionDelayMs and MaxExpansionPasses to give nested lazy content time to appear between expansion passes.
ExpansionSelectors should stay narrow. Over-broad selectors can trigger unintended clicks and change the captured output.
Retry on 429 (Too Many Requests)
When RetryOn429 is enabled, the crawler will automatically retry individual page retrievals that receive a 429 status code. Retries use exponential backoff: the delay for each attempt is calculated as RetryMinBackoffMs * 2^attempt, capped at RetryMaxBackoffMs. When RetryBackoffJitter is enabled, the actual delay is randomized between 0 and the computed value to avoid synchronized retries across parallel tasks.
If all retry attempts are exhausted and the server still returns 429, the crawler falls back to the ThrottleMs delay and returns the 429 response as the result for that URL.
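The backoff arithmetic can be sketched as follows; the helper name and signature are local to this example, not CrawlSharp APIs. With the defaults (1000 ms minimum, 30000 ms cap), attempts 0, 1, and 2 compute to 1000, 2000, and 4000 ms before jitter.

```csharp
using System;

// Illustrative reimplementation of the documented backoff policy.
static int ComputeBackoffMs(int attempt, int minBackoffMs, int maxBackoffMs,
    bool jitter, Random rng)
{
    // Exponential backoff: RetryMinBackoffMs * 2^attempt, capped at RetryMaxBackoffMs.
    int computed = (int)Math.Min(minBackoffMs * Math.Pow(2, attempt), maxBackoffMs);

    // With jitter, the actual delay is randomized between 0 and the computed value.
    return jitter ? rng.Next(0, computed + 1) : computed;
}
```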
Web Resources
Objects crawled using CrawlSharp have the following properties:
- `Url` - the URL from which the resource was retrieved
- `ParentUrl` - the URL from which the `Url` was identified
- `Filename` - the filename component from the URL, if any
- `Depth` - the depth level at which the `Url` was identified
- `Status` - the HTTP status code returned when retrieving the `Url`
- `ContentLength` - the content length of the body returned when retrieving `Url`
- `ContentType` - the content type returned while retrieving `Url`
- `MD5Hash` - the MD5 hash of the `Data`
- `SHA1Hash` - the SHA1 hash of the `Data`
- `SHA256Hash` - the SHA256 hash of the `Data`
- `LastModified` - the `DateTime` from when the headers indicate the object was last modified
- `Headers` - a `NameValueCollection` with the headers returned while retrieving `Url`
- `Data` - a `byte[]` containing the data returned while retrieving `Url`
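As a sketch of filtering on these properties (this assumes `Status` is the integer status code and `SHA256Hash` is a hex string; verify the actual types in the library):

```csharp
using System;
using System.Collections.Generic;
using CrawlSharp.Web;

Settings settings = new Settings();
settings.Crawl.StartUrl = "https://www.mywebpage.com";

HashSet<string> seenHashes = new HashSet<string>();

using (WebCrawler crawler = new WebCrawler(settings))
{
    await foreach (WebResource resource in crawler.CrawlAsync())
    {
        // Keep successful retrievals only, and skip bodies already seen
        // (an identical SHA256 hash means identical content).
        if (resource.Status != 200 || resource.Data == null) continue;
        if (!seenHashes.Add(resource.SHA256Hash)) continue;

        Console.WriteLine(resource.Depth + " | " + resource.ContentType + " | " + resource.Url);
    }
}
```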
REST API
CrawlSharp includes a project called CrawlSharp.Server which allows you to deploy a RESTful front-end for CrawlSharp. Refer to REST_API.md and also the Postman collection in the root of this repository for details.
CrawlSharp.Server will by default listen on host localhost and port 8000, meaning it will not accept requests from outside of the machine.
To change this, specify the hostname as the first argument and the port as the second, e.g. `dotnet CrawlSharp.Server myhostname.com 8888`.
```
$ dotnet CrawlSharp.Server

 _ _ _
___ _ __ __ ___ _| | _| || |_
/ __| '__/ _` \ \ /\ / / | |_ .. _|
| (__| | | (_| |\ V V /| | |_ _|
\___|_| \__,_| \_/\_/ |_| |_||_|

(c)2026 Joel Christner

Usage:
  crawlsharp [hostname] [port]

Where:
  [hostname] is the hostname or IP address on which to listen
  [port] is the port number, greater than or equal to zero, and less than 65536

NOTICE
------
Configured to listen on local address 'localhost'
Service will not receive requests from outside of localhost

Webserver started on http://localhost:8000/

2025-03-01 20:39:17 joel-laptop Info [CrawlSharpServer] server started
```
Refer to REST_API.md for more information about using the RESTful API.
Dashboard
CrawlSharp includes a web-based dashboard for configuring, launching, and monitoring crawls through your browser. The dashboard is a React (Vite) application located in the dashboard/ directory.
Features
- Server selector - switch the dashboard between proxy, localhost, and custom server endpoints from the top-right toolbar
- New Crawl - configure all crawl and authentication settings through the UI and launch a crawl against the CrawlSharp server
- Active Crawl - monitor a running crawl in real time, with a live feed of discovered resources, status code distribution, and content type breakdown
- Crawl History - view past crawl results, including per-page status, content types, sizes, and hashes
- Templates - save, duplicate, and reuse crawl configurations for repeated jobs
Running the Dashboard Locally
Prerequisites: Node.js (v18 or later).
```
cd dashboard
npm install
npm run dev
```
The dashboard will start on http://localhost:8001 and expects the CrawlSharp server to be running on http://localhost:8000. The Vite dev server proxies /crawl requests to the server automatically.
Building for Production
```
cd dashboard
npm run build
```
The compiled output is written to dashboard/dist/ and can be served by any static file server.
Configuring the Server URL
The dashboard determines the CrawlSharp server URL in the following order of precedence:

1. localStorage - the value saved at key `crawlsharp_server_url` (set through the dashboard UI)
2. Runtime config - the `CRAWLSHARP_SERVER_URL` value in `public/config.js`, which is overridden at container startup when running in Docker
3. Default - `http://localhost:8000`

Use the top-right server endpoint icon in the dashboard toolbar to change the active endpoint without editing local storage by hand.
Running with Docker Compose
The easiest way to run both the server and dashboard together is with Docker Compose. The Docker/compose.yaml includes both the crawlsharp-server and crawlsharp-ui services. The dashboard container uses nginx to reverse-proxy API requests to the CrawlSharp server internally, so no direct browser-to-server connectivity is needed.
The CRAWLSHARP_SERVER_URL environment variable controls the server URL used by the dashboard. When left empty (the default in Docker Compose), the dashboard routes API requests through its own nginx proxy. When running the dashboard outside of Docker, set it to the server's URL (e.g. http://localhost:8000).
To start both services:
```
cd Docker
docker compose up -d
```
The server is available at http://localhost:8000 and the dashboard at http://localhost:8001.
Use docker compose down (or the provided compose-down scripts) to stop.
Running in Docker
A Docker image is available in Docker Hub under jchristn77/crawlsharp. Use the Docker Compose start (compose-up.sh and compose-up.bat) and stop (compose-down.sh and compose-down.bat) scripts in the Docker directory if you wish to run within Docker Compose.
Using Headless Browser
CrawlSharp can use Microsoft.Playwright to crawl challenging websites that detect and block bots or that render content with JavaScript. If you run this code on an Ubuntu machine, use the following script to install the required dependencies. Note also that the $HOME directory must be owned by the user running the code.
```bash
#!/bin/bash

# Detect Ubuntu version
VERSION=$(lsb_release -rs)

if [[ "$VERSION" == "24.04" ]]; then
    # Ubuntu 24.04 packages
    PACKAGES="libasound2t64 libatk-bridge2.0-0t64 libatk1.0-0t64 libcups2t64 libgtk-3-0t64"
else
    # Ubuntu 22.04 and earlier
    PACKAGES="libasound2 libatk-bridge2.0-0 libatk1.0-0 libcups2 libgtk-3-0"
fi

# Install common packages plus version-specific ones
sudo apt-get update
sudo apt-get install -y \
    $PACKAGES \
    libnspr4 \
    libnss3 \
    libdrm2 \
    libxkbcommon0 \
    libxcomposite1 \
    libxdamage1 \
    libxrandr2 \
    libgbm1 \
    libxss1 \
    fonts-liberation \
    ca-certificates
```
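Playwright also needs its browser binaries downloaded. One way to do this from .NET uses the CLI entry point bundled with the Microsoft.Playwright package (equivalent to running `playwright install chromium` from a shell); treat this as a sketch and consult the Playwright documentation for your setup:

```csharp
using System;

// Downloads Playwright's Chromium build into the per-user cache under $HOME.
int exitCode = Microsoft.Playwright.Program.Main(new[] { "install", "chromium" });
if (exitCode != 0)
    throw new Exception("Playwright browser install failed with exit code " + exitCode + ".");
```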
Third-Party Data
CrawlSharp is licensed under MIT. It uses the Nager.PublicSuffix library (MIT license) for domain matching, together with third-party public suffix data licensed under the Mozilla Public License v2.0. Please be aware of the licenses governing this data.
Version History
Please refer to CHANGELOG.md for version history.
| Product | Compatible and additional computed target framework versions |
|---|---|
| .NET | net8.0 is compatible. net8.0-android was computed. net8.0-browser was computed. net8.0-ios was computed. net8.0-maccatalyst was computed. net8.0-macos was computed. net8.0-tvos was computed. net8.0-windows was computed. net9.0 was computed. net9.0-android was computed. net9.0-browser was computed. net9.0-ios was computed. net9.0-maccatalyst was computed. net9.0-macos was computed. net9.0-tvos was computed. net9.0-windows was computed. net10.0 is compatible. net10.0-android was computed. net10.0-browser was computed. net10.0-ios was computed. net10.0-maccatalyst was computed. net10.0-macos was computed. net10.0-tvos was computed. net10.0-windows was computed. |
Dependencies

net10.0
- HtmlAgilityPack (>= 1.12.4)
- Microsoft.Playwright (>= 1.58.0)
- RestWrapper (>= 3.1.8)
- SerializationHelper (>= 2.0.3)

net8.0
- HtmlAgilityPack (>= 1.12.4)
- Microsoft.Playwright (>= 1.58.0)
- RestWrapper (>= 3.1.8)
- SerializationHelper (>= 2.0.3)
NuGet packages
This package is not used by any NuGet packages.
GitHub repositories
This package is not used by any popular GitHub repositories.
| Version | Downloads | Last Updated |
|---|---|---|
| 1.0.22 | 96 | 4/27/2026 |
| 1.0.20 | 544 | 3/6/2026 |
| 1.0.19 | 92 | 3/6/2026 |
| 1.0.18 | 260 | 2/28/2026 |
| 1.0.17 | 108 | 2/27/2026 |
| 1.0.16 | 375 | 11/23/2025 |
| 1.0.15 | 349 | 8/23/2025 |
| 1.0.14 | 151 | 8/22/2025 |
| 1.0.13 | 231 | 7/17/2025 |
| 1.0.12 | 200 | 7/17/2025 |
| 1.0.11 | 221 | 7/16/2025 |
| 1.0.10 | 217 | 6/6/2025 |
| 1.0.9 | 166 | 5/25/2025 |
| 1.0.8 | 139 | 5/24/2025 |
| 1.0.7 | 124 | 5/24/2025 |
| 1.0.6 | 237 | 3/23/2025 |
| 1.0.5 | 220 | 3/23/2025 |
| 1.0.4 | 223 | 3/23/2025 |
| 1.0.3 | 213 | 3/23/2025 |
| 1.0.2 | 286 | 3/4/2025 |