feat: Zod-based Configuration class for cleaner SDK extension (#3387)
BREAKING CHANGE: The project is now native ESM without a CJS alternative. This is fine since all supported Node versions allow `require(esm)`. All dependencies are also updated to their latest versions, including cheerio v1.
BREAKING CHANGE: The following crawler options are removed:
- `handleRequestFunction` -> `requestHandler`
- `handlePageFunction` -> `requestHandler`
- `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
- `handleFailedRequestFunction` -> `failedRequestHandler`
BREAKING CHANGE: The crawling context no longer includes the `Error` object for failed requests. Use the second parameter of the `errorHandler` or `failedRequestHandler` callbacks to access the error. Previously, the crawling context extended a `Record` type, allowing access to any property. This was changed to a strict type, which means that you can only access properties that are defined in the context.
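The change above can be illustrated with a small sketch of the migrated callback shape (the context type here is simplified for illustration; only the `failedRequestHandler` name and the second-parameter convention come from the changelog entry):

```typescript
// After the change, the error arrives as the second callback parameter
// instead of living on the crawling context. Simplified context type.
type SimplifiedContext = { request: { url: string } };

const failedRequestHandler = async (context: SimplifiedContext, error: Error): Promise<string> => {
    // Format a log line from the failed request and the separately-passed error.
    return `Request to ${context.request.url} failed: ${error.message}`;
};
```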
BREAKING CHANGE: The `additionalBlockedStatusCodes` parameter of the `Session.retireOnBlockedStatusCodes` method is removed. Use the `blockedStatusCodes` crawler option instead.
Also bumps `better-sqlite3` to the latest version to get prebuilds for Node 22.
- closes #2479
- closes #3106
- closes #3107
- closes #3078

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

- [x] Introduce a `ContextPipeline` abstraction
- [x] Update crawlers to use it
- [x] Make sure that existing tests pass
- [ ] Refine the `ContextPipeline.compose` signature and the semantics of `BasicCrawlerOptions.contextPipelineEnhancer` to maximize DX
- [x] Write tests for the `contextPipelineEnhancer`
- [x] Resolve added TODO comments (fix immediately or make issues)
- [ ] Update documentation

The `context-pipeline` branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility problems in the current crawler hierarchy.

1. **Rigid inheritance hierarchy**: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of `BasicCrawler` lifecycle methods made the execution flow even harder to follow.
2. **Context enhancement via monkey-patching**: Manual property assignments (`crawlingContext.page = page`, `crawlingContext.$ = $`) were scattered everywhere, which made the code hard to follow and impossible to reason about.
3. **Cleanup coordination**: Resource cleanup was handled by separate `_cleanupContext` methods that were not co-located with the initialization.
4. **Broken extension mechanism**: The `CrawlerExtension.use()` API tried to let you extend crawlers (the ones based on `HttpCrawler`) by overwriting properties, which was completely type-unsafe and fragile.
Introduces `ContextPipeline` - a **middleware-based composition pattern** where:

- Each crawler layer defines how it enhances the context through explicit `action` functions
- Cleanup logic is co-located with initialization via optional `cleanup` functions
- Type safety is maintained through TypeScript generics that track context transformations
- The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup

Declarative middleware composition with co-located cleanup:

```typescript
contextPipeline.compose({
    action: async (context) => ({ page, $ }),
    cleanup: async (context) => {
        await page.close();
    },
});
```

The `ContextPipeline<TBase, TFinal>` tracks type transformations through the chain:

```typescript
ContextPipeline<CrawlingContext, CrawlingContext>
    .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
    .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>
```

The `CrawlerExtension.use()` API is gone. The new approach goes through `contextPipelineEnhancer`:

```typescript
new BasicCrawler({
    contextPipelineEnhancer: (pipeline) =>
        pipeline.compose({
            action: async (context) => ({ myCustomProp: ... }),
        }),
});
```

The current way to express a context pipeline middleware has some shortcomings (`ContextPipeline.compose`, `BasicCrawlerOptions.contextPipelineEnhancer`). I suggest resolving this in another PR.

For most legitimate use cases, this should be non-breaking. Those who extend the crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of `BasicCrawler` and `HttpCrawler` changed quite a bit.

The pipeline uses `Object.defineProperties` for each middleware. Is this a serious performance consideration?

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
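The middleware idea described above can be sketched in a few lines. This is a minimal toy version, not the actual Crawlee implementation (which, per the PR, also uses `Object.defineProperties`); the `MiniPipeline` name and the `run` method are illustrative:

```typescript
// Each step enhances the context via `action` and may register a co-located
// `cleanup` that runs after the consumer finishes, in reverse order.
type Middleware<TIn, TAdded> = {
    action: (context: TIn) => Promise<TAdded>;
    cleanup?: (context: TIn & TAdded) => Promise<void>;
};

class MiniPipeline<TBase, TFinal> {
    private steps: Middleware<any, any>[] = [];

    // Generics track how each step widens the final context type.
    compose<TAdded extends object>(step: Middleware<TFinal, TAdded>): MiniPipeline<TBase, TFinal & TAdded> {
        const next = new MiniPipeline<TBase, TFinal & TAdded>();
        next.steps = [...this.steps, step];
        return next;
    }

    async run(base: TBase, consumer: (context: TFinal) => Promise<void>): Promise<void> {
        const cleanups: Array<() => Promise<void>> = [];
        let context: any = base;
        try {
            for (const step of this.steps) {
                context = { ...context, ...(await step.action(context)) };
                if (step.cleanup) {
                    const current = context;
                    cleanups.push(() => step.cleanup!(current));
                }
            }
            await consumer(context as TFinal);
        } finally {
            // Guaranteed cleanup in reverse order, mirroring resource lifetimes.
            for (const cleanup of cleanups.reverse()) await cleanup();
        }
    }
}
```

Note how the cleanup closure captures the context as it existed when its step ran, so a step's `cleanup` never sees properties added later in the chain before they exist.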
Extracts `ProxyConfiguration` to `BasicCrawler` (related to discussion under #2917). Pass the `ProxyConfiguration` instance to the `SessionPool` for new `Session` object creation. Store and read the `ProxyInfo` from the `Session` instance instead of calling the `ProxyConfiguration` methods in the crawlers. closes #3198
Phasing out `got-scraping`-specific interfaces in favour of native `fetch` API. Related to #3071
Fixes build toolchain errors caused by the recent rebase onto the current `master` ([more details here](https://apify.slack.com/archives/C02JQSN79V4/p1764373034961859)). The largest thing is probably updating the dependency versions in `package.json` - if `turborepo` doesn't find the matching version in the local workspace, it will build against the package pulled from `npm` (which doesn't match the v4 API at this point).
Related to #3275 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes the response header handling in `GotScrapingHttpClient` (`got-scraping` headers contain unexpected `Symbol`s and HTTP2 pseudoheaders). Fixes omission from one of the previous commits - `GotScrapingHttpClient.stream` now uses proxy correctly again. Closes #2917
Removes `HttpClient.stream()`, leaving only `HttpClient.sendRequest()` (they were both returning a streamable `Response` since #3295). Extracts cookie- and redirect-related behaviour to a separate abstract `BaseHttpClient` class. Custom HTTP clients now only have to implement one `fetch` method (matches the native `fetch` API). This makes the custom HTTP client implementation easier and will hopefully drive community contributions. Closes #3071 Closes #3314 Blocked by apify/impit#348
Removes `got-scraping` from the dependency lists of `@crawlee`-scoped packages (except for `@crawlee/got-scraping`). `HttpClient` implementations now throw Node's `TimeoutError` as a result of `AbortSignal.timeout()` firing. Closes #3275
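A sketch of the timeout behaviour described above (this is not Crawlee's actual client code): the native `fetch` API aborts via `AbortSignal.timeout()`, whose abort reason is a `DOMException` named `"TimeoutError"` per the WHATWG spec, so callers can distinguish timeouts from other failures by the error name. The helper names here are hypothetical:

```typescript
// Abort the request after `timeoutMs` using the platform-native mechanism.
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
    return await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
}

// Timeouts surface as an abort reason whose `name` is "TimeoutError".
function isTimeoutError(error: unknown): boolean {
    return typeof error === 'object' && error !== null && (error as { name?: string }).name === 'TimeoutError';
}
```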
Refactors the Configuration class to use Zod for declarative field definitions
with automatic environment variable mapping and type coercion.
Key changes:
- Field definitions are now declarative with `field()` helper
- Each field defines its Zod schema and env var mapping in one place
- Exported helpers for SDK extension: `field`, `coerceBoolean`, `logLevelSchema`
- Generic Configuration class supports inheritance via type parameters
- Priority order: constructor options > env vars > crawlee.json > defaults
- Proper TypeScript types for input (constructor) vs output (get())
This enables cleaner extension in Apify SDK without monkey patching:
```ts
const apifyConfigFields = {
    ...crawleeConfigFields,
    token: field(z.string().optional(), { env: 'APIFY_TOKEN' }),
};

class Configuration extends CrawleeConfiguration<...> {
    static override fields = apifyConfigFields;
}
```
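The PR does not show the internals of `field()` or the resolution logic, but the stated priority order can be sketched as follows. Here `parse` stands in for a Zod schema's `.parse()`, and the `resolve` helper is hypothetical:

```typescript
// A field couples a schema-like parser with its env var mapping.
type FieldDef<T> = { parse: (raw: unknown) => T; env?: string[] };

function field<T>(schema: { parse: (raw: unknown) => T }, opts: { env?: string | string[] } = {}): FieldDef<T> {
    return { parse: schema.parse, env: opts.env ? ([] as string[]).concat(opts.env) : undefined };
}

// Resolution follows the documented priority:
// constructor options > env vars > crawlee.json > schema default.
function resolve<T>(
    def: FieldDef<T>,
    sources: { options?: unknown; envVars: Record<string, string | undefined>; fileConfig?: unknown },
): T {
    if (sources.options !== undefined) return def.parse(sources.options);
    for (const name of def.env ?? []) {
        const value = sources.envVars[name];
        if (value !== undefined) return def.parse(value);
    }
    if (sources.fileConfig !== undefined) return def.parse(sources.fileConfig);
    return def.parse(undefined); // the schema supplies the default
}
```

Because env vars are always strings, running each value through the schema's parser is also where type coercion (e.g. `"2048"` to `2048`) naturally happens.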
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Refactors the Configuration class to use Zod-based field definitions,
extending Crawlee's new Configuration class cleanly without monkey patching.
Key changes:
- Uses `crawleeConfigFields` spread with Apify-specific overrides and additions
- Each field defines schema and env var aliases in one place
- Supports multiple env var aliases per field (e.g., ACTOR_ID, APIFY_ACTOR_ID)
- Removes all monkey patching of CoreConfiguration
- Adds zod as direct dependency
Example field definition:
```ts
actorId: field(z.string().optional(), {
    env: ['ACTOR_ID', 'APIFY_ACTOR_ID'],
}),
```
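The commit message does not say which alias wins when both env vars are set; a plausible reading, sketched here as an assumption, is that aliases are checked in array order and the first defined value wins:

```typescript
// Assumption (not stated in the PR): first defined alias in array order wins.
function resolveEnvAlias(aliases: string[], envVars: Record<string, string | undefined>): string | undefined {
    for (const name of aliases) {
        if (envVars[name] !== undefined) return envVars[name];
    }
    return undefined;
}
```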
Requires: apify/crawlee#3387
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds `extendField()` helper that extends an existing field with additional
env var mappings while preserving the base field's env vars. This avoids
repetition when extending fields in the SDK.
Example:
```ts
// No need to repeat CRAWLEE_DEFAULT_DATASET_ID
defaultDatasetId: extendField(crawleeConfigFields.defaultDatasetId, {
    env: ['ACTOR_DEFAULT_DATASET_ID', 'APIFY_DEFAULT_DATASET_ID'],
}),
```
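The described merge behaviour - keeping the base field's env vars while appending the new aliases - can be sketched like this (the `EnvField` shape is simplified; the real field also carries its schema):

```typescript
// Simplified field shape carrying only the env alias list.
type EnvField = { env: string[] };

// Hypothetical extendField() sketch: preserves the base aliases and
// appends the SDK-specific ones, so nothing has to be repeated.
function extendField(base: EnvField, extra: { env: string[] }): EnvField {
    return { ...base, env: [...base.env, ...extra.env] };
}
```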
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Moves `extendField` from a standalone export to a static method on the Configuration class, providing better encapsulation while still being accessible for subclass field definitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```diff
  * To reset a value, we can omit the `value` argument or pass `undefined` there.
  */
- set(key: keyof ConfigurationOptions, value?: any): void {
+ set<K extends keyof TInput>(key: K, value?: TInput[K]): void {
```
I think we could just get rid of the set method. Internally, it's not used much and changing the configuration mid-flight is a heavy-duty footgun.
So how would you set stuff that is not crawler options? crawlee.json?
I'm not sure I understand, the Configuration class doesn't allow "unknown" options, right?
```ts
 */
get<K extends keyof TOutput>(key: K, defaultValue: NonNullable<TOutput[K]>): NonNullable<TOutput[K]>;
get<K extends keyof TOutput>(key: K, defaultValue?: TOutput[K]): TOutput[K];
get<K extends keyof TOutput>(key: K, defaultValue?: TOutput[K]): TOutput[K] {
```
Is there any way we could expose the config options by direct property access? I.e., `config.maxMemoryMbytes` instead of `config.get("maxMemoryMbytes")`?
There are ways, but I can't say I like them, since this is a rather internal API, right?
- config class returning a proxy from constructor
- adding getters dynamically
Both require some type-level magic (which is IMO fine on its own).
In plain crawlee, it is internal for sure. In Apify SDK, it's accessed by users frequently as a wrapper for the plethora of environment variables that the platform provides. It makes sense to me to make it as close to the POJO experience as possible... so, can I see the type level magic? 😁
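One of the two options floated above - a Proxy returned from the constructor - could look roughly like this. Everything here (the `withDirectAccess` wrapper, the simplified `Output` type) is illustrative, not the actual Crawlee or Apify SDK API:

```typescript
// Simplified stand-in for the configuration's TOutput type.
type Output = { maxMemoryMbytes: number; headless: boolean };

class Config {
    constructor(private values: Output) {}
    get<K extends keyof Output>(key: K): Output[K] {
        return this.values[key];
    }
}

// Wrap the instance so unknown property reads forward to get(),
// giving the POJO-like experience: config.maxMemoryMbytes.
function withDirectAccess(config: Config): Config & Readonly<Output> {
    return new Proxy(config, {
        get(target, prop, receiver) {
            if (prop in target) return Reflect.get(target, prop, receiver);
            return target.get(prop as keyof Output);
        },
    }) as Config & Readonly<Output>;
}
```

The intersection type `Config & Readonly<Output>` is the "type-level magic": callers see both the method API and the option names as read-only properties.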
```ts
    : {
          request: LoadedRequest<Context['request']>;
      } & Omit<Context, 'request'>;
export type LoadedContext<Context extends RestrictedCrawlingContext> =
```
This looks like a whitespace-only change, why? Same thing in the tests...
Summary
Refactors the Configuration class to use Zod for declarative field definitions with automatic environment variable mapping and type coercion.
- `field()` helper - single source of truth
- Exported helpers: `field`, `coerceBoolean`, `logLevelSchema`
- Priority order: constructor options > env vars > crawlee.json > defaults

Motivation
This aligns with how the Python Crawlee/SDK handles configuration and enables cleaner extension in Apify SDK without monkey patching:
Test plan
🤖 Generated with Claude Code