feat: Zod-based Configuration class for cleaner SDK extension (#3387)
BREAKING CHANGE: The project is now native ESM without a CJS alternative. This is fine since all supported Node versions allow `require(esm)`. All dependencies are also updated to their latest versions, including cheerio v1.
BREAKING CHANGE: The following crawler options are removed:
- `handleRequestFunction` -> `requestHandler`
- `handlePageFunction` -> `requestHandler`
- `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
- `handleFailedRequestFunction` -> `failedRequestHandler`
BREAKING CHANGE: The crawling context no longer includes the `Error` object for failed requests. Use the second parameter of the `errorHandler` or `failedRequestHandler` callbacks to access the error. Previously, the crawling context extended a `Record` type, allowing access to any property. This was changed to a strict type, which means that you can only access properties that are defined in the context.
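The change above can be illustrated with a small sketch of the migrated callback shape (the context type here is simplified for illustration; only the `failedRequestHandler` name and the second-parameter convention come from the changelog entry):

```typescript
// After the change, the error arrives as the second callback parameter
// instead of living on the crawling context. Simplified context type.
type SimplifiedContext = { request: { url: string } };

const failedRequestHandler = async (context: SimplifiedContext, error: Error): Promise<string> => {
    // Format a log line from the failed request and the separately-passed error.
    return `Request to ${context.request.url} failed: ${error.message}`;
};
```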
BREAKING CHANGE: The `additionalBlockedStatusCodes` parameter of the `Session.retireOnBlockedStatusCodes` method is removed. Use the `blockedStatusCodes` crawler option instead.
Also bumps `better-sqlite3` to the latest version to get prebuilds for Node 22.
- closes #2479
- closes #3106
- closes #3107
- closes #3078

In my opinion, it makes a lot of sense to do the remaining changes in a separate PR.

- [x] Introduce a `ContextPipeline` abstraction
- [x] Update crawlers to use it
- [x] Make sure that existing tests pass
- [ ] Refine the `ContextPipeline.compose` signature and the semantics of `BasicCrawlerOptions.contextPipelineEnhancer` to maximize DX
- [x] Write tests for the `contextPipelineEnhancer`
- [x] Resolve added TODO comments (fix immediately or make issues)
- [ ] Update documentation

The `context-pipeline` branch introduces a fundamental architectural change to how Crawlee crawlers build and enhance the crawling context passed to request handlers. The core motivation is to fix the composition and extensibility problems in the current crawler hierarchy.

1. **Rigid inheritance hierarchy**: Crawlers were stuck in a brittle inheritance chain where each layer manipulated the context object while assuming that it already satisfied its final type. Multiple overrides of `BasicCrawler` lifecycle methods made the execution flow even harder to follow.
2. **Context enhancement via monkey-patching**: Manual property assignments (`crawlingContext.page = page`, `crawlingContext.$ = $`) were scattered everywhere, which made the code hard to follow and impossible to reason about.
3. **Cleanup coordination**: Resource cleanup was handled by separate `_cleanupContext` methods that were not co-located with the initialization.
4. **Broken extension mechanism**: The `CrawlerExtension.use()` API tried to let you extend crawlers (the ones based on `HttpCrawler`) by overwriting properties, which was completely type-unsafe and fragile.
Introduces `ContextPipeline` - a **middleware-based composition pattern** where:

- Each crawler layer defines how it enhances the context through explicit `action` functions
- Cleanup logic is co-located with initialization via optional `cleanup` functions
- Type safety is maintained through TypeScript generics that track context transformations
- The pipeline executes middleware sequentially with proper error handling and guaranteed cleanup

Declarative middleware composition with co-located cleanup:

```typescript
contextPipeline.compose({
    action: async (context) => ({ page, $ }),
    cleanup: async (context) => {
        await page.close();
    },
});
```

The `ContextPipeline<TBase, TFinal>` tracks type transformations through the chain:

```typescript
ContextPipeline<CrawlingContext, CrawlingContext>
    .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
    .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>
```

The `CrawlerExtension.use()` API is gone. The new approach goes through `contextPipelineEnhancer`:

```typescript
new BasicCrawler({
    contextPipelineEnhancer: (pipeline) =>
        pipeline.compose({
            action: async (context) => ({ myCustomProp: ... }),
        }),
});
```

The current way to express a context pipeline middleware has some shortcomings (`ContextPipeline.compose`, `BasicCrawlerOptions.contextPipelineEnhancer`). I suggest resolving this in another PR.

For most legitimate use cases, this should be non-breaking. Those who extend the crawler classes in non-trivial ways may need to adjust their code though - the non-public interface of `BasicCrawler` and `HttpCrawler` changed quite a bit.

The pipeline uses `Object.defineProperties` for each middleware. Is this a serious performance consideration?

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
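The middleware idea described above can be sketched in a few lines. This is a minimal toy version, not the actual Crawlee implementation (which, per the PR, also uses `Object.defineProperties`); the `MiniPipeline` name and the `run` method are illustrative:

```typescript
// Each step enhances the context via `action` and may register a co-located
// `cleanup` that runs after the consumer finishes, in reverse order.
type Middleware<TIn, TAdded> = {
    action: (context: TIn) => Promise<TAdded>;
    cleanup?: (context: TIn & TAdded) => Promise<void>;
};

class MiniPipeline<TBase, TFinal> {
    private steps: Middleware<any, any>[] = [];

    // Generics track how each step widens the final context type.
    compose<TAdded extends object>(step: Middleware<TFinal, TAdded>): MiniPipeline<TBase, TFinal & TAdded> {
        const next = new MiniPipeline<TBase, TFinal & TAdded>();
        next.steps = [...this.steps, step];
        return next;
    }

    async run(base: TBase, consumer: (context: TFinal) => Promise<void>): Promise<void> {
        const cleanups: Array<() => Promise<void>> = [];
        let context: any = base;
        try {
            for (const step of this.steps) {
                context = { ...context, ...(await step.action(context)) };
                if (step.cleanup) {
                    const current = context;
                    cleanups.push(() => step.cleanup!(current));
                }
            }
            await consumer(context as TFinal);
        } finally {
            // Guaranteed cleanup in reverse order, mirroring resource lifetimes.
            for (const cleanup of cleanups.reverse()) await cleanup();
        }
    }
}
```

Note how the cleanup closure captures the context as it existed when its step ran, so a step's `cleanup` never sees properties added later in the chain before they exist.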
Extracts `ProxyConfiguration` to `BasicCrawler` (related to discussion under #2917). Pass the `ProxyConfiguration` instance to the `SessionPool` for new `Session` object creation. Store and read the `ProxyInfo` from the `Session` instance instead of calling the `ProxyConfiguration` methods in the crawlers. closes #3198
Phasing out `got-scraping`-specific interfaces in favour of native `fetch` API. Related to #3071
Fixes build toolchain errors caused by the recent rebase onto the current `master` ([more details here](https://apify.slack.com/archives/C02JQSN79V4/p1764373034961859)). The largest thing is probably updating the dependency versions in `package.json` - if `turborepo` doesn't find the matching version in the local workspace, it will build against the package pulled from `npm` (which doesn't match the v4 API at this point).
Related to #3275 --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes the response header handling in `GotScrapingHttpClient` (`got-scraping` headers contain unexpected `Symbol`s and HTTP2 pseudoheaders). Fixes omission from one of the previous commits - `GotScrapingHttpClient.stream` now uses proxy correctly again. Closes #2917
Removes `HttpClient.stream()`, leaving only `HttpClient.sendRequest()` (they were both returning a streamable `Response` since #3295). Extracts cookie- and redirect-related behaviour to a separate abstract `BaseHttpClient` class. Custom HTTP clients now only have to implement one `fetch` method (matches the native `fetch` API). This makes the custom HTTP client implementation easier and will hopefully drive community contributions. Closes #3071 Closes #3314 Blocked by apify/impit#348
Removes `got-scraping` from the dependency lists of `@crawlee`-scoped packages (except for `@crawlee/got-scraping`). `HttpClient` implementations now throw Node's `TimeoutError` as a result of `AbortSignal.timeout()` firing. Closes #3275
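A sketch of the timeout behaviour described above (this is not Crawlee's actual client code): the native `fetch` API aborts via `AbortSignal.timeout()`, whose abort reason is a `DOMException` named `"TimeoutError"` per the WHATWG spec, so callers can distinguish timeouts from other failures by the error name. The helper names here are hypothetical:

```typescript
// Abort the request after `timeoutMs` using the platform-native mechanism.
async function fetchWithTimeout(url: string, timeoutMs: number): Promise<Response> {
    return await fetch(url, { signal: AbortSignal.timeout(timeoutMs) });
}

// Timeouts surface as an abort reason whose `name` is "TimeoutError".
function isTimeoutError(error: unknown): boolean {
    return typeof error === 'object' && error !== null && (error as { name?: string }).name === 'TimeoutError';
}
```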
Refactors the Configuration class to use Zod for declarative field definitions
with automatic environment variable mapping and type coercion.
Key changes:
- Field definitions are now declarative with `field()` helper
- Each field defines its Zod schema and env var mapping in one place
- Exported helpers for SDK extension: `field`, `coerceBoolean`, `logLevelSchema`
- Generic Configuration class supports inheritance via type parameters
- Priority order: constructor options > env vars > crawlee.json > defaults
- Proper TypeScript types for input (constructor) vs output (get())
This enables cleaner extension in Apify SDK without monkey patching:
```ts
const apifyConfigFields = {
    ...crawleeConfigFields,
    token: field(z.string().optional(), { env: 'APIFY_TOKEN' }),
};

class Configuration extends CrawleeConfiguration<...> {
    static override fields = apifyConfigFields;
}
```
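The PR does not show the internals of `field()` or the resolution logic, but the stated priority order can be sketched as follows. Here `parse` stands in for a Zod schema's `.parse()`, and the `resolve` helper is hypothetical:

```typescript
// A field couples a schema-like parser with its env var mapping.
type FieldDef<T> = { parse: (raw: unknown) => T; env?: string[] };

function field<T>(schema: { parse: (raw: unknown) => T }, opts: { env?: string | string[] } = {}): FieldDef<T> {
    return { parse: schema.parse, env: opts.env ? ([] as string[]).concat(opts.env) : undefined };
}

// Resolution follows the documented priority:
// constructor options > env vars > crawlee.json > schema default.
function resolve<T>(
    def: FieldDef<T>,
    sources: { options?: unknown; envVars: Record<string, string | undefined>; fileConfig?: unknown },
): T {
    if (sources.options !== undefined) return def.parse(sources.options);
    for (const name of def.env ?? []) {
        const value = sources.envVars[name];
        if (value !== undefined) return def.parse(value);
    }
    if (sources.fileConfig !== undefined) return def.parse(sources.fileConfig);
    return def.parse(undefined); // the schema supplies the default
}
```

Because env vars are always strings, running each value through the schema's parser is also where type coercion (e.g. `"2048"` to `2048`) naturally happens.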
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Refactors the Configuration class to use Zod-based field definitions,
extending Crawlee's new Configuration class cleanly without monkey patching.
Key changes:
- Uses `crawleeConfigFields` spread with Apify-specific overrides and additions
- Each field defines schema and env var aliases in one place
- Supports multiple env var aliases per field (e.g., ACTOR_ID, APIFY_ACTOR_ID)
- Removes all monkey patching of CoreConfiguration
- Adds zod as direct dependency
Example field definition:
```ts
actorId: field(z.string().optional(), {
    env: ['ACTOR_ID', 'APIFY_ACTOR_ID'],
}),
```
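The commit message does not say which alias wins when both env vars are set; a plausible reading, sketched here as an assumption, is that aliases are checked in array order and the first defined value wins:

```typescript
// Assumption (not stated in the PR): first defined alias in array order wins.
function resolveEnvAlias(aliases: string[], envVars: Record<string, string | undefined>): string | undefined {
    for (const name of aliases) {
        if (envVars[name] !== undefined) return envVars[name];
    }
    return undefined;
}
```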
Requires: apify/crawlee#3387
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Adds `extendField()` helper that extends an existing field with additional
env var mappings while preserving the base field's env vars. This avoids
repetition when extending fields in the SDK.
Example:
```ts
// No need to repeat CRAWLEE_DEFAULT_DATASET_ID
defaultDatasetId: extendField(crawleeConfigFields.defaultDatasetId, {
    env: ['ACTOR_DEFAULT_DATASET_ID', 'APIFY_DEFAULT_DATASET_ID'],
}),
```
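The described merge behaviour - keeping the base field's env vars while appending the new aliases - can be sketched like this (the `EnvField` shape is simplified; the real field also carries its schema):

```typescript
// Simplified field shape carrying only the env alias list.
type EnvField = { env: string[] };

// Hypothetical extendField() sketch: preserves the base aliases and
// appends the SDK-specific ones, so nothing has to be repeated.
function extendField(base: EnvField, extra: { env: string[] }): EnvField {
    return { ...base, env: [...base.env, ...extra.env] };
}
```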
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Moves `extendField` from a standalone export to a static method on the Configuration class, providing better encapsulation while still being accessible for subclass field definitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```diff
  * To reset a value, we can omit the `value` argument or pass `undefined` there.
  */
- set(key: keyof ConfigurationOptions, value?: any): void {
+ set<K extends keyof TInput>(key: K, value?: TInput[K]): void {
```
I think we could just get rid of the set method. Internally, it's not used much and changing the configuration mid-flight is a heavy-duty footgun.
So how would you set stuff that is not crawler options? crawlee.json?
I'm not sure I understand, the Configuration class doesn't allow "unknown" options, right?
```ts
 */
get<K extends keyof TOutput>(key: K, defaultValue: NonNullable<TOutput[K]>): NonNullable<TOutput[K]>;
get<K extends keyof TOutput>(key: K, defaultValue?: TOutput[K]): TOutput[K];
get<K extends keyof TOutput>(key: K, defaultValue?: TOutput[K]): TOutput[K] {
```
Is there any way we could expose the config options by direct property access? I.e., `config.maxMemoryMbytes` instead of `config.get("maxMemoryMbytes")`?
There are ways, but I can't say I like them, since this is a rather internal API, right?
- config class returning a proxy from constructor
- adding getters dynamically
Both require some type-level magic (which is IMO fine on its own).
In plain crawlee, it is internal for sure. In Apify SDK, it's accessed by users frequently as a wrapper for the plethora of environment variables that the platform provides. It makes sense to me to make it as close to the POJO experience as possible... so, can I see the type level magic? 😁
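One of the two options floated above - a Proxy returned from the constructor - could look roughly like this. Everything here (the `withDirectAccess` wrapper, the simplified `Output` type) is illustrative, not the actual Crawlee or Apify SDK API:

```typescript
// Simplified stand-in for the configuration's TOutput type.
type Output = { maxMemoryMbytes: number; headless: boolean };

class Config {
    constructor(private values: Output) {}
    get<K extends keyof Output>(key: K): Output[K] {
        return this.values[key];
    }
}

// Wrap the instance so unknown property reads forward to get(),
// giving the POJO-like experience: config.maxMemoryMbytes.
function withDirectAccess(config: Config): Config & Readonly<Output> {
    return new Proxy(config, {
        get(target, prop, receiver) {
            if (prop in target) return Reflect.get(target, prop, receiver);
            return target.get(prop as keyof Output);
        },
    }) as Config & Readonly<Output>;
}
```

The intersection type `Config & Readonly<Output>` is the "type-level magic": callers see both the method API and the option names as read-only properties.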
```ts
    : {
          request: LoadedRequest<Context['request']>;
      } & Omit<Context, 'request'>;
export type LoadedContext<Context extends RestrictedCrawlingContext> =
```
This looks like a whitespace-only change, why? Same thing in the tests...
Summary
Refactors the Configuration class to use Zod for declarative field definitions with automatic environment variable mapping and type coercion.
- `field()` helper - single source of truth
- Exported helpers: `field`, `coerceBoolean`, `logLevelSchema`
- Priority order: constructor options > env vars > crawlee.json > defaults

Motivation
This aligns with how the Python Crawlee/SDK handles configuration and enables cleaner extension in Apify SDK without monkey patching:
Test plan
🤖 Generated with Claude Code