Use common network runtime for telemetry messages #1796

Open

mrdimidium wants to merge 6 commits into main from wp/mrdimidium/telemetry-common-network

Conversation

@mrdimidium
Member

Adds support for libcurl requests to the network runtime. The first consumer is telemetry, which now serves as a thin event broker. Since Runtime doesn't support timeouts/intervals, events are reset either upon receiving an event or upon server termination.

@mrdimidium mrdimidium requested a review from karlseguin March 12, 2026 11:43
@mrdimidium mrdimidium marked this pull request as draft March 12, 2026 11:56
@mrdimidium mrdimidium marked this pull request as ready for review March 12, 2026 13:59
pub fn send(self: *LightPanda, iid: ?[]const u8, run_mode: Config.RunMode, raw_event: telemetry.Event) !void {
    const event = try self.mem_pool.create();
    event.* = .{
Collaborator

send is what the rest of the code ends up calling. On main, all this does is create an event, queue it, and signal the worker thread.

In this version, once the buffer is full, it serializes JSON, has to acquire 3 mutexes, and writes to the pipe. Quite a bit more that can slow down a caller.

Member Author

Yeah. I think a better solution is setInterval, which is called from main and flushes on the main thread, regardless of when the events are posted. I simplified it because serializing 20 events seems cheap. But if it worries you, let me fix it.

Collaborator

It's not so much the JSON serialization (the original version retained the allocation for the writer/buffer). It's just all those extra opaque network things that could block, either now or in the future. Very easy to change, say, getConnection and not realize "oh, this will impact all callers of telemetry".

Member Author

The last commit implements interval flush on the main thread. Now the CDP thread only acquires a mutex and stores the event in the buffer.

self.mutex.lock();
defer self.mutex.unlock();

self.pending[self.pcount] = .{
Collaborator

Pretty sure there's a window where this can overflow.

return into[0..i];
}
};
const conn = self.runtime.getConnection() orelse return;
Collaborator

If there's no connection available, we lose the data?

Member Author

The event buffer is twice the batch size, so we'll keep trying to send events. But when the buffer runs out, yes, we'll start losing events.

Collaborator

I don't think that's right? pfcount is reset to 0 a few lines above, so if there's no connection, there's no retry for those events.

Member Author

You're right, it's a regression. Connection should be taken first.
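
The ordering fix being agreed on here — acquire the connection before consuming the pending buffer, so a failed getConnection leaves the batch intact for the next flush — can be sketched as follows (C, with hypothetical names; this is the idea, not the actual patch):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch of "connection should be taken first". */
typedef struct { int fd; } Conn;

typedef struct {
    int pending[40];
    size_t pcount;
    Conn *free_conn; /* NULL when the pool is exhausted */
} Flusher;

static Conn *get_connection(Flusher *f) {
    Conn *c = f->free_conn;
    f->free_conn = NULL;
    return c;
}

/* Returns the number of events handed to the connection. */
size_t flush(Flusher *f) {
    Conn *conn = get_connection(f);
    if (conn == NULL)
        return 0;          /* nothing is reset: events retry later */
    size_t n = f->pcount;  /* only now is it safe to consume the batch */
    f->pcount = 0;
    /* ... serialize f->pending[0..n] and submit it on conn ... */
    return n;
}
```

The regression was the reverse order: resetting the pending count first, then discovering there is no connection, which silently drops the batch.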

if (self.pollfds[1].revents == 0) continue;
self.pollfds[1].revents = 0;
// accept new connections
if (self.pollfds[1].revents != 0) {
Collaborator

Minor, but I think self.pollfds[0] and self.pollfds[1] can be stored in a local outside the loop.

Member Author

I'm not sure I understand. They're set in init/bind, but revents are changed from poll at each iteration. Or am I misunderstanding the idea?

Collaborator

const poll_fd = &self.pollfds[0];
const listen_fd = &self.pollfds[1];
while (true) {
    self.drainQueue();

I guess without bounds-checking in release mode, it isn't a huge win.

}

listener.onAccept(listener.ctx, socket);
if (self.shutdown.load(.acquire) and running_handles == 0)
Collaborator

So if we're shutting down but there are handles, we'll continue to process the handles? Is that right?

Member Author

The idea is that we should try to send buffered telemetry events when we receive sigterm, but we shouldn't accept new connections.

Collaborator

Hmm.. that makes sense. It deserves a comment. But it would also "hang" the shutdown until any other network activity completes, right? Not just telemetry. Like if I have a website that downloads a script, in our donecallback we execute it, it could download another script and repeat forever. That would block the shutdown?

Member Author

I plan to stop the cdp threads when we receive a sigterm, and they'll explicitly cancel their requests.

SigHandler has logic for resuscitation, so if the user presses ctrl-c again, we'll kill the process. But yes, it could potentially hang. We can add a timeout to the sighandler to kill everyone after a minute.


// If we were woken up, perhaps everything was cancelled and the iteration can be completed.
if (self.shutdown.load(.acquire)) break;
while (true)
Collaborator

I'm not sure, but doesn't checking for shutdown when self.pollfds[0].revents != 0 make the most sense? You check it on acceptConnection, once, and at the end of the loop, but if it's true, it'll be true here first. The other two checks don't seem necessary if you check here, and I think it better documents the flow, e.g. pollfds[0] is used to signal shutdown.

Member Author

I'm not sure I understand the idea. We don't terminate the loop when we receive a shutdown because we're trying to send scheduled telemetry but are no longer accepting new connections.

Collaborator

@karlseguin karlseguin Mar 13, 2026

Yes, now that I understand that the shutdown doesn't want to shut down right away, I understand why it's how it is. Hence why I think this line needs a comment:

if (self.shutdown.load(.acquire) and running_handles == 0)

@karlseguin
Collaborator

Also, can we clean up the names. I know I'm guilty of this too, but as much as possible I'd like namespaces to match the file names.

const Network = @import("network/Runtime.zig");

Rename Runtime.zig to Network.zig ? When I see Network.doSomething, and I want to look at doSomething I always try to quick-nav to Network.zig

const net_http = @import("../network/http.zig");

rename the imports to http?

There's also 1 place where the http.zig is imported as Net: const Net = @import("../network/http.zig"); and 1 place where the Runtime is imported as Runtime: const Runtime = @import("../network/Runtime.zig");

@mrdimidium mrdimidium force-pushed the wp/mrdimidium/telemetry-common-network branch from 9165e9b to b4d92d2 March 13, 2026 14:28
@mrdimidium mrdimidium requested a review from karlseguin March 13, 2026 14:33
@karlseguin
Collaborator

karlseguin commented Mar 14, 2026

I haven't done a full re-review, because I think the design needs more work. It just has too much potential to lose messages. If I run go run crawler/main.go -pool 10 https://demo-browser.lightpanda.io/ I get a steady stream of "telemetry buffer exhausted". Out of curiosity, I tried to set the FLUSH_INTERVAL to 10, and that seems to freeze it (maybe it's spending all of its time trying to run the schedule?). The higher the concurrency, the more messages we'll lose.

In the first iteration, a call to telemetry.record() had a chance to block. Notwithstanding the data-loss issue, it seems like the current approach is on the right track, because telemetry.record() won't block (it's essentially like it was before). But thinking about it more, the processing now happens on the main thread, so it isn't telemetry.record() that can block, it's the main thread, which doesn't sound better.

(The flush on deinit also has me a little uneasy)

@mrdimidium mrdimidium force-pushed the wp/mrdimidium/telemetry-common-network branch from b4d92d2 to c7edd99 March 16, 2026 23:22
};
}
// const URL = "https://telemetry.lightpanda.io";
const URL = "http://localhost:9876";
Collaborator

🙈

}
};

const conn = self.runtime.getConnection() orelse {
Collaborator

Makes sense to make sure we can get a connection before serializing? Not sure, your call.

pub const Event = union(enum) {
run: void,
navigate: Navigate,
buffer_overflow: BufferOverflow,
Collaborator

@krichprollsch do we need to change anything to accept a new telemetry type?

Member

Yes, I have to update the telemetry side. But we can merge, we will just lose these specific events in the meantime.


const iid: ?[]const u8 = if (self.iid) |*id| id else null;

for (h..t) |i| {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

server has a max body size of 500KB (last I heard), so we might need to limit this to ~50?

Member Author

Different events have different sizes; it's easier to explicitly write events to the writer until we fit within 500 KB.
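
The "write events until the body fits" approach can be sketched like this (C, with illustrative names and a shrunk limit so the example stays readable; the real code would serialize JSON in Zig):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical sketch: serialize pre-rendered events one by one and
 * stop as soon as the next one would push the body past the server's
 * size limit; the remainder waits for the next batch. */
#define BODY_LIMIT 64 /* stand-in for the server's ~500 KB cap */

/* Appends events into `body` (must hold BODY_LIMIT + 1 bytes).
 * Returns how many events were consumed. */
size_t pack_batch(const char **events, size_t n, char *body) {
    size_t used = 0, i;
    body[0] = '\0';
    for (i = 0; i < n; i++) {
        size_t len = strlen(events[i]);
        if (used + len > BODY_LIMIT)
            break; /* the next flush picks up events[i..n] */
        memcpy(body + used, events[i], len + 1); /* keep body NUL-terminated */
        used += len;
    }
    return i;
}
```

Because the limit is enforced per event rather than by a fixed count like ~50, a batch of large navigate events and a batch of tiny run events both stay under the cap.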

Member

The limit on the server side can be changed; 500KB is conservative, we can safely go higher.

self.fireTicks();

listener.onAccept(listener.ctx, socket);
if (self.shutdown.load(.acquire) and running_handles == 0) {
Collaborator

When I tested it, the batch size was always 1, which is good because it's on the main thread, but a bit less efficient in other ways (including on the server, cc @krichprollsch). But, when I hit ctrl-c, the program takes a long time to exit. It just keeps processing CDP requests, and it keeps sending batches of 1. It took maybe 1 minute to properly shut down.

Is there a way to improve this? Killing a process but having it run for 1 minute firing off telemetry isn't great. Do we have any way to disconnect all other clients, and synchronously flush the telemetry?
