
feat: Zod-based Configuration class for cleaner SDK extension #3387

Open
B4nan wants to merge 41 commits into v4 from feat/zod-configuration-v4

Conversation

@B4nan (Member) commented on Feb 4, 2026

Summary

Refactors the Configuration class to use Zod for declarative field definitions with automatic environment variable mapping and type coercion.

  • Field definitions are now declarative with field() helper - single source of truth
  • Each field defines its Zod schema and env var mapping in one place
  • Exported helpers for SDK extension: field, coerceBoolean, logLevelSchema
  • Generic Configuration class supports inheritance via type parameters
  • Priority order: constructor options > env vars > crawlee.json > defaults
  • Proper TypeScript types for input (constructor) vs output (get())
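The priority order above can be sketched as a standalone resolution function. The real helper takes a Zod schema; here a tiny `parse`-compatible stand-in keeps the sketch self-contained, and all names (`SchemaLike`, `resolveField`, the option shapes) are illustrative, not the actual Crawlee API:

```typescript
// Minimal sketch of declarative field definitions and the resolution order:
// constructor options > env vars > crawlee.json > defaults.
interface SchemaLike<T> { parse(value: unknown): T }

interface FieldDef<T> {
    schema: SchemaLike<T>;
    env: string[];
    default?: T;
}

function field<T>(schema: SchemaLike<T>, opts: { env: string | string[]; default?: T }): FieldDef<T> {
    return {
        schema,
        env: Array.isArray(opts.env) ? opts.env : [opts.env],
        default: opts.default,
    };
}

function resolveField<T>(
    key: string,
    def: FieldDef<T>,
    options: Record<string, unknown>,
    env: Record<string, string | undefined>,
    fileConfig: Record<string, unknown>,
): T | undefined {
    // Constructor options win over everything else.
    if (options[key] !== undefined) return def.schema.parse(options[key]);
    // Then the env var aliases, in declaration order.
    for (const name of def.env) {
        if (env[name] !== undefined) return def.schema.parse(env[name]);
    }
    // Then crawlee.json, then the schema default.
    if (fileConfig[key] !== undefined) return def.schema.parse(fileConfig[key]);
    return def.default;
}
```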

Motivation

This aligns with how the Python Crawlee/SDK handles configuration and enables cleaner extension in Apify SDK without monkey patching:

// Old SDK approach (required monkey patching)
CoreConfiguration.ENV_MAP = Configuration.ENV_MAP;
CoreConfiguration.BOOLEAN_VARS = Configuration.BOOLEAN_VARS;
// ...

// New approach - just extend fields and class
const apifyConfigFields = {
    ...crawleeConfigFields,
    token: field(z.string().optional(), { env: 'APIFY_TOKEN' }),
    actorId: field(z.string().optional(), { env: ['ACTOR_ID', 'APIFY_ACTOR_ID'] }),
};

class Configuration extends CrawleeConfiguration<ApifyConfigFields, ...> {
    static override fields = apifyConfigFields;
}
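One of the exported helpers, `coerceBoolean`, presumably handles string-to-boolean coercion for env vars, which only ever arrive as strings. A minimal standalone sketch; the accepted spellings here are assumptions, not the actual implementation:

```typescript
// Coerce env-var strings like "true"/"1" into booleans; returns undefined
// for values that don't look boolean at all.
function coerceBoolean(value: unknown): boolean | undefined {
    if (typeof value === 'boolean') return value;
    if (typeof value !== 'string') return undefined;
    const normalized = value.trim().toLowerCase();
    if (['true', '1', 'yes'].includes(normalized)) return true;
    if (['false', '0', 'no'].includes(normalized)) return false;
    return undefined;
}
```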

Test plan

  • Type checking passes
  • Full test suite (needs v4 CI)
  • Integration with Apify SDK

🤖 Generated with Claude Code

B4nan and others added 30 commits November 28, 2025 14:55
BREAKING CHANGE:

The project is now native ESM without a CJS alternative. This is fine since all supported node versions allow `require(esm)`.

Also all the dependencies are updated to the latest versions, including cheerio v1.
BREAKING CHANGE:

The following crawler options are removed:

- `handleRequestFunction` -> `requestHandler`
- `handlePageFunction` -> `requestHandler`
- `handleRequestTimeoutSecs` -> `requestHandlerTimeoutSecs`
- `handleFailedRequestFunction` -> `failedRequestHandler`
BREAKING CHANGE:

The crawling context no longer includes the `Error` object for failed requests. Use the second parameter of the `errorHandler` or `failedRequestHandler` callbacks to access the error.

Previously, the crawling context extended a `Record` type, allowing access to any property. This was changed to a strict type, which means that you can only access properties that are defined in the context.
….retireOnBlockedStatusCodes`

BREAKING CHANGE:

`additionalBlockedStatusCodes` parameter of `Session.retireOnBlockedStatusCodes` method is removed. Use the `blockedStatusCodes` crawler option instead.
Also tries to bump `better-sqlite3` to the latest version to have prebuilds for Node 22.
- closes #2479
- closes #3106
- closes #3107
- closes #3078

In my opinion, it makes a lot of sense to do the remaining changes in a
separate PR.

- [x] Introduce a `ContextPipeline` abstraction
- [x] Update crawlers to use it
- [x] Make sure that existing tests pass
- [ ] Refine the `ContextPipeline.compose` signature and the semantics
of `BasicCrawlerOptions.contextPipelineEnhancer` to maximize DX
- [x] Write tests for the `contextPipelineEnhancer`
- [x] Resolve added TODO comments (fix immediately or make issues)
- [ ] Update documentation

The `context-pipeline` branch introduces a fundamental architectural
change to how Crawlee crawlers build and enhance the crawling context
passed to request handlers. The core motivation is to fix the
composition and extensibility nightmare in the current crawler
hierarchy.

1. **Rigid inheritance hierarchy**: Crawlers were stuck in a brittle
inheritance chain where each layer manipulated the context object while
assuming that it already satisfied its final type. Multiple overrides of
`BasicCrawler` lifecycle methods made the execution flow even harder to
follow.

2. **Context enhancement via monkey-patching**: Manual property
assignment (`crawlingContext.page = page`, `crawlingContext.$ = $`)
scattered everywhere. It was a mess to follow and impossible to reason
about.

3. **Cleanup coordination**: Resource cleanup was handled by separate
`_cleanupContext` methods that were not co-located with the
initialization.

4. **Extension mechanism was broken**: The `CrawlerExtension.use()` API
tried to let you extend crawlers (the ones based on `HttpCrawler`) by
overwriting properties - completely type-unsafe and fragile as hell.

Introduces `ContextPipeline` - a **middleware-based composition
pattern** where:

- Each crawler layer defines how it enhances the context through
explicit `action` functions
- Cleanup logic is co-located with initialization via optional `cleanup`
functions
- Type safety is maintained through TypeScript generics that track
context transformations
- The pipeline executes middleware sequentially with proper error
handling and guaranteed cleanup

Declarative middleware composition with co-located cleanup:

```typescript
contextPipeline.compose({
  action: async (context) => ({ page, $ }), // enhance the context with new properties
  cleanup: async (context) => { await context.page.close(); } // co-located teardown, receives the enhanced context
})
```

The `ContextPipeline<TBase, TFinal>` tracks type transformations through
the chain:

```typescript
ContextPipeline<CrawlingContext, CrawlingContext>
  .compose<{ page: Page }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page }>
  .compose<{ $: CheerioAPI }>(...) // ContextPipeline<CrawlingContext, CrawlingContext & { page: Page, $: CheerioAPI }>
```
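The composition and type-tracking shown above can be sketched end-to-end. This is an illustration of the pattern under assumed types, not the actual Crawlee implementation (which, per the commit message, uses `Object.defineProperties` rather than object spreading):

```typescript
// Each middleware widens the context type; cleanups run in reverse order.
type Middleware<TIn, TAdd> = {
    action: (context: TIn) => Promise<TAdd>;
    cleanup?: (context: TIn & TAdd) => Promise<void>;
};

class ContextPipeline<TBase, TFinal> {
    private constructor(
        private readonly runChain: (base: TBase) => Promise<{ ctx: TFinal; cleanups: Array<() => Promise<void>> }>,
    ) {}

    static create<T>(): ContextPipeline<T, T> {
        return new ContextPipeline<T, T>(async (base) => ({ ctx: base, cleanups: [] }));
    }

    compose<TAdd>(middleware: Middleware<TFinal, TAdd>): ContextPipeline<TBase, TFinal & TAdd> {
        return new ContextPipeline<TBase, TFinal & TAdd>(async (base) => {
            const { ctx, cleanups } = await this.runChain(base);
            const added = await middleware.action(ctx);
            const enhanced = Object.assign({}, ctx, added);
            // Register the co-located cleanup against the enhanced context.
            if (middleware.cleanup) cleanups.push(() => middleware.cleanup!(enhanced));
            return { ctx: enhanced, cleanups };
        });
    }

    // Runs the whole chain, guaranteeing cleanups even if the consumer throws.
    async execute(base: TBase, consumer: (ctx: TFinal) => Promise<void>): Promise<void> {
        const { ctx, cleanups } = await this.runChain(base);
        try {
            await consumer(ctx);
        } finally {
            for (const cleanup of cleanups.reverse()) await cleanup();
        }
    }
}
```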

The `CrawlerExtension.use()` is gone. New approach via
`contextPipelineEnhancer`:

```typescript
new BasicCrawler({
  contextPipelineEnhancer: (pipeline) =>
    pipeline.compose({
      action: async (context) => ({ myCustomProp: ... })
    })
})
```

The current way to express a context pipeline middleware has some
shortcomings (`ContextPipeline.compose`,
`BasicCrawlerOptions.contextPipelineEnhancer`). I suggest resolving this
in another PR.

For most legitimate use cases, this should be non-breaking. Those who
extend the Crawler classes in non-trivial ways may need to adjust their
code though - the non-public interface of `BasicCrawler` and
`HttpCrawler` changed quite a bit.

The pipeline uses `Object.defineProperties` for each middleware. Is this
a serious performance consideration?

---------

Co-authored-by: Martin Adámek <banan23@gmail.com>
Extracts `ProxyConfiguration` to `BasicCrawler` (related to discussion under #2917).

Pass the `ProxyConfiguration` instance to the `SessionPool` for new `Session` object creation.

Store and read the `ProxyInfo` from the `Session` instance instead of calling the `ProxyConfiguration` methods in the crawlers.

closes #3198
Phasing out `got-scraping`-specific interfaces in favour of native
`fetch` API.

Related to #3071
Fixes build toolchain errors caused by the recent rebase onto the
current `master` ([more details
here](https://apify.slack.com/archives/C02JQSN79V4/p1764373034961859)).

The largest thing is probably updating the dependency versions in
`package.json` - if `turborepo` doesn't find the matching version in the
local workspace, it will build against the package pulled from `npm`
(which doesn't match the v4 API at this point).
…client (#3286)

Removes the incorrect `KVS.getPublicUrl()` implementation from `@crawlee/core` and proxies the call to the storage client.

Closes #3272 
Closes #3076
…rfaces (#3295)

Works towards removing `got-scraping` as a direct Crawlee dependency.

Related to #3275 
Related to #3071
Related to #3275

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
barjin and others added 7 commits December 17, 2025 18:20
Fixes the response header handling in `GotScrapingHttpClient`
(`got-scraping` headers contain unexpected `Symbol`s and HTTP2
pseudoheaders).

Fixes omission from one of the previous commits -
`GotScrapingHttpClient.stream` now uses proxy correctly again.

Closes #2917
…quest` storages (#3306)

Adds an `httpClient` option to fetching methods in `RequestQueue`,
`RequestList`, and `@crawlee/utils`. Switches the default HTTP client
implementation to `ImpitHttpClient`.

Closes #3030 
Related to #3275
…awler` (#3309)

Adds unique crawler id to the `BasicCrawler` class. Prints a warning on
multiple crawlers sharing the state on `useState()`.

There might be more resources shared between different `BasicCrawler`
(-subclass) instances - this needs further investigation.

Closes #3024
…rrides (#3313)

Closes #3117 and fixes omissions from previous v4 PRs.
Removes `HttpClient.stream()`, leaving only `HttpClient.sendRequest()`
(they were both returning a streamable `Response` since
#3295).

Extracts cookie- and redirect-related behaviour to a separate abstract
`BaseHttpClient` class.

Custom HTTP clients now only have to implement one `fetch` method
(matches the native `fetch` API). This makes the custom HTTP client
implementation easier and will hopefully drive community contributions.
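A custom client under this design could look roughly like the sketch below. The single-`fetch`-method interface shape is an assumption based on the commit message ("matches the native `fetch` API"), not the actual `BaseHttpClient` contract:

```typescript
// Hypothetical minimal HTTP client: one fetch method, same signature as
// the platform fetch.
interface MinimalHttpClient {
    fetch(...args: Parameters<typeof fetch>): ReturnType<typeof fetch>;
}

class NativeFetchClient implements MinimalHttpClient {
    async fetch(...args: Parameters<typeof fetch>): ReturnType<typeof fetch> {
        // Delegate straight to the platform fetch; a real client might add
        // proxy support, retries, or browser-like TLS fingerprinting here.
        return fetch(...args);
    }
}
```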

Closes #3071 
Closes #3314

Blocked by apify/impit#348
Removes `got-scraping` from the dependency lists of `@crawlee`-scoped
packages (except for `@crawlee/got-scraping`).

`HttpClient` implementations now throw Node's `TimeoutError` as a result
of `AbortSignal.timeout()` firing.

Closes #3275
Refactors the Configuration class to use Zod for declarative field definitions
with automatic environment variable mapping and type coercion.

Key changes:
- Field definitions are now declarative with `field()` helper
- Each field defines its Zod schema and env var mapping in one place
- Exported helpers for SDK extension: `field`, `coerceBoolean`, `logLevelSchema`
- Generic Configuration class supports inheritance via type parameters
- Priority order: constructor options > env vars > crawlee.json > defaults
- Proper TypeScript types for input (constructor) vs output (get())

This enables cleaner extension in Apify SDK without monkey patching:
```ts
const apifyConfigFields = {
    ...crawleeConfigFields,
    token: field(z.string().optional(), { env: 'APIFY_TOKEN' }),
};
class Configuration extends CrawleeConfiguration<...> {
    static override fields = apifyConfigFields;
}
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
B4nan added a commit to apify/apify-sdk-js that referenced this pull request Feb 4, 2026
Refactors the Configuration class to use Zod-based field definitions,
extending Crawlee's new Configuration class cleanly without monkey patching.

Key changes:
- Uses `crawleeConfigFields` spread with Apify-specific overrides and additions
- Each field defines schema and env var aliases in one place
- Supports multiple env var aliases per field (e.g., ACTOR_ID, APIFY_ACTOR_ID)
- Removes all monkey patching of CoreConfiguration
- Adds zod as direct dependency

Example field definition:
```ts
actorId: field(z.string().optional(), {
    env: ['ACTOR_ID', 'APIFY_ACTOR_ID'],
}),
```

Requires: apify/crawlee#3387

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
B4nan and others added 3 commits February 6, 2026 10:10
Adds `extendField()` helper that extends an existing field with additional
env var mappings while preserving the base field's env vars. This avoids
repetition when extending fields in the SDK.

Example:
```ts
// No need to repeat CRAWLEE_DEFAULT_DATASET_ID
defaultDatasetId: extendField(crawleeConfigFields.defaultDatasetId, {
    env: ['ACTOR_DEFAULT_DATASET_ID', 'APIFY_DEFAULT_DATASET_ID'],
}),
```
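A minimal sketch of what such a helper might do: merge new env var aliases into an existing field definition while keeping the base field's schema and env vars. The `FieldDef` shape and the precedence of the new aliases over the base ones are assumptions for illustration:

```typescript
// Hypothetical field definition shape; the real one also carries the Zod
// schema and defaults, omitted here for brevity.
interface FieldDef {
    env: string[];
}

function extendField(base: FieldDef, extra: { env: string | string[] }): FieldDef {
    const extraEnv = Array.isArray(extra.env) ? extra.env : [extra.env];
    // New aliases are checked first here (an assumption); the base env vars
    // are preserved, so e.g. CRAWLEE_DEFAULT_DATASET_ID keeps working.
    return { ...base, env: [...extraEnv, ...base.env] };
}
```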

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Moves extendField from a standalone export to a static method on the
Configuration class, providing better encapsulation while still being
accessible for subclass field definitions.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
```ts
/**
 * To reset a value, we can omit the `value` argument or pass `undefined` there.
 */
// before:
set(key: keyof ConfigurationOptions, value?: any): void;
// after:
set<K extends keyof TInput>(key: K, value?: TInput[K]): void;
```
Contributor:
I think we could just get rid of the set method. Internally, it's not used much and changing the configuration mid-flight is a heavy-duty footgun.

Member Author:

so how would you set stuff that is not crawler options? crawlee.json?

Contributor:

I'm not sure I understand, the Configuration class doesn't allow "unknown" options, right?

```ts
get<K extends keyof TOutput>(key: K, defaultValue: NonNullable<TOutput[K]>): NonNullable<TOutput[K]>;
get<K extends keyof TOutput>(key: K, defaultValue?: TOutput[K]): TOutput[K];
```
Contributor:

Is there any way we could expose the config options by a direct property access? I.e., config.maxMemoryMbytes instead of config.get("maxMemoryMbytes")?

Member Author:

There are ways, but I can't say I like them, since this is rather an internal API, right?

  • config class returning a proxy from constructor
  • adding getters dynamically

Both require some type-level magic (which is IMO fine on its own).

Contributor:

In plain crawlee, it is internal for sure. In Apify SDK, it's accessed by users frequently as a wrapper for the plethora of environment variables that the platform provides. It makes sense to me to make it as close to the POJO experience as possible... so, can I see the type level magic? 😁
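The Proxy-based variant mentioned above could be sketched like this: wrap a `Configuration`-like object so that `config.maxMemoryMbytes` forwards to `config.get('maxMemoryMbytes')`. The names and the intersection typing are illustrative, not a proposal for the actual API:

```typescript
// Forward unknown property reads to get(), while real members (get, set,
// internals) keep working as before. The return type intersects the wrapped
// object with a readonly view of the config output shape.
function withPropertyAccess<TOutput extends object, T extends { get(key: keyof TOutput): unknown }>(
    config: T,
): T & Readonly<TOutput> {
    return new Proxy(config, {
        get(target, prop, receiver) {
            // Prefer actual members over config fields to avoid shadowing get().
            if (prop in target) return Reflect.get(target, prop, receiver);
            return target.get(prop as keyof TOutput);
        },
    }) as T & Readonly<TOutput>;
}
```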

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@B4nan force-pushed the feat/zod-configuration-v4 branch from 47a7493 to d4139b7 on February 6, 2026 at 14:03
```ts
export type LoadedContext<Context extends RestrictedCrawlingContext> = {
    request: LoadedRequest<Context['request']>;
} & Omit<Context, 'request'>;
```
Contributor:

This looks like a whitespace-only change, why? Same thing in the tests...
