Use common network runtime for telemetry messages #1796
mrdimidium wants to merge 6 commits into main
Conversation
src/telemetry/lightpanda.zig
Outdated
```zig
pub fn send(self: *LightPanda, iid: ?[]const u8, run_mode: Config.RunMode, raw_event: telemetry.Event) !void {
    const event = try self.mem_pool.create();
    event.* = .{
```
send is what the rest of the code ends up calling. On main, all this does is create an event, queue it, and signal the worker thread.
In this version, once the buffer is full, it serializes JSON, has to acquire 3 mutexes, and writes to the pipe. Quite a bit more work that can slow down a caller.
Yeah. I think a better solution is setInterval, which is called from main and flushes on the main thread, regardless of when the events are posted. I simplified it because serializing 20 events seems cheap. But if you're worried, let me fix it.
It's not so much the JSON serialization (the original version retained the allocation for the writer/buffer). It's just all those extra opaque network things that could block, either now or in the future. It's very easy to change, say, getConnection and not realize "oh, this will impact all callers of telemetry".
The last commit implements interval flushing on the main thread. Now the CDP thread only acquires a mutex and stores the event in the buffer.
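A hedged sketch of what that split might look like (the field and helper names `pending`, `pcount`, `scratch`, and `sendBatch` are illustrative, not taken from the PR): CDP threads only append under the mutex, and the main-loop tick swaps the buffer out so serialization and network I/O happen outside the lock.

```zig
// Hypothetical main-thread interval flush: copy the pending events out
// under the mutex, then do the expensive work without holding it.
fn flushTick(self: *LightPanda) void {
    self.mutex.lock();
    const count = self.pcount;
    @memcpy(self.scratch[0..count], self.pending[0..count]);
    self.pcount = 0;
    self.mutex.unlock();

    if (count == 0) return;
    // serialize + write on the main thread, outside the lock
    self.sendBatch(self.scratch[0..count]);
}
```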
src/telemetry/lightpanda.zig
Outdated
```zig
self.mutex.lock();
defer self.mutex.unlock();

self.pending[self.pcount] = .{
```
Pretty sure there's a window where this can overflow.
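One way to close that window, sketched under the same assumptions as the diff (`pending`/`pcount` from the quoted code; `overflow_count` is a hypothetical field): check capacity under the same mutex that guards the write, and count the drop instead of writing past the end.

```zig
self.mutex.lock();
defer self.mutex.unlock();

if (self.pcount >= self.pending.len) {
    // Buffer is full: drop the event, but remember that we did so a
    // buffer_overflow event can be reported on the next flush.
    self.overflow_count += 1;
    return;
}
self.pending[self.pcount] = event;
self.pcount += 1;
```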
src/telemetry/lightpanda.zig
Outdated
```zig
        return into[0..i];
    }
};
const conn = self.runtime.getConnection() orelse return;
```
If there's no connection available, we lose the data?
The event buffer is twice the batch size, so we'll keep trying to send events. But when the buffer runs out, yes, we'll start losing events.
I don't think that's right? pfcount is reset to 0 a few lines above, so if there's no connection, there's no retry for those events.
You're right, it's a regression. Connection should be taken first.
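A sketch of the reordering being agreed on here (the `scratch` buffer and `writeBatch` helper are hypothetical names): acquire the connection before consuming the pending events, so a missing connection leaves them queued for the next flush instead of being silently dropped.

```zig
// Take the connection first: if none is available, the pending
// events stay in the buffer and will be retried on the next flush.
const conn = self.runtime.getConnection() orelse return;

self.mutex.lock();
const count = self.pcount;
@memcpy(self.scratch[0..count], self.pending[0..count]);
self.pcount = 0;
self.mutex.unlock();

try self.writeBatch(conn, self.scratch[0..count]);
```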
src/network/Runtime.zig
Outdated
```zig
if (self.pollfds[1].revents == 0) continue;
self.pollfds[1].revents = 0;
// accept new connections
if (self.pollfds[1].revents != 0) {
```
Minor, but I think self.pollfds[0] and self.pollfds[1] can be stored in a local outside the loop.
I'm not sure I understand. They're set in init/bind, but revents are changed by poll at each iteration. Or am I misunderstanding the idea?
```zig
const poll_fd = &self.pollfds[0];
const listen_fd = &self.pollfds[1];
while (true) {
    self.drainQueue();
```

I guess without bounds checking in release mode, it isn't a huge win.
src/network/Runtime.zig
Outdated
```zig
}

listener.onAccept(listener.ctx, socket);
if (self.shutdown.load(.acquire) and running_handles == 0)
```
So if we're shutting down but there are handles, we'll continue to process the handles? Is that right?
The idea is that we should try to send buffered telemetry events when we receive sigterm, but we shouldn't accept new connections.
Hmm... that makes sense. It deserves a comment. But it would also "hang" the shutdown until any other network activity completes, right? Not just telemetry. Like if I have a website that downloads a script, and in our done callback we execute it, it could download another script and repeat forever. That would block the shutdown?
I plan to stop the CDP threads when we receive a SIGTERM, and they'll explicitly cancel their requests.
SigHandler has logic for resuscitation, so if the user presses ctrl-c again, we'll kill the process. But yes, it could potentially hang. We can add a timeout to the SigHandler to kill everyone after a minute.
```zig
// If we were woken up, perhaps everything was cancelled and the iteration can be completed.
if (self.shutdown.load(.acquire)) break;
while (true)
```
I'm not sure, but doesn't checking for shutdown when self.pollfds[0].revents != 0 make the most sense? You check it in acceptConnection, once, and at the end of the loop, but if it's true, it'll be true here first. The other two checks don't seem necessary if you check here, and I think it better documents the flow, e.g. pollfds[0] is used to signal shutdown.
I'm not sure I understand the idea. We don't terminate the loop when we receive a shutdown because we're trying to send scheduled telemetry while no longer accepting new connections.
Yes, now that I understand that the shutdown doesn't want to shut down right away, I understand why it's how it is. Hence why I think this line needs a comment:

```zig
if (self.shutdown.load(.acquire) and running_handles == 0)
```
Also, can we clean up the names? I know I'm guilty of this too, but as much as possible I'd like namespaces to match the file names. Rename the imports to match. There's also one place where http.zig is imported as Net.
Force-pushed 9165e9b to b4d92d2
I haven't done a full re-review, because I think the design needs more work. It just has too much potential to lose messages. If I run …, then in the first iteration a call to …

(The flush on deinit also has me a little uneasy.)
Force-pushed b4d92d2 to c7edd99
src/telemetry/lightpanda.zig
Outdated
```zig
    };
}

// const URL = "https://telemetry.lightpanda.io";
const URL = "http://localhost:9876";
```
src/telemetry/lightpanda.zig
Outdated
```zig
    }
};

const conn = self.runtime.getConnection() orelse {
```
Makes sense to make sure we can get a connection before serializing? Not sure, your call.
```zig
pub const Event = union(enum) {
    run: void,
    navigate: Navigate,
    buffer_overflow: BufferOverflow,
```
@krichprollsch do we need to change anything to accept a new telemetry type?
Yes, I have to update the telemetry side. But we can merge; we will just lose these specific events in the meantime.
```zig
const iid: ?[]const u8 = if (self.iid) |*id| id else null;

for (h..t) |i| {
```
The server has a max body size of 500KB (last I heard), so we might need to limit this to ~50?
Different events have different sizes; it's easier to explicitly write events to the writer until we fit within 500KB.
The limit on the server side can be changed; 500KB is conservative, so we can safely go higher.
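A hedged sketch of the size-capped batching suggested above (all names here are illustrative; the JSON API shape depends on the Zig version in use): serialize events one at a time and stop before the body exceeds the server limit, leaving the rest buffered for the next flush.

```zig
// Hypothetical: cap the batch by serialized size rather than by count,
// since different events serialize to very different sizes.
const max_body_size = 500 * 1024; // conservative server-side limit

var body = std.ArrayList(u8).init(allocator);
defer body.deinit();

var sent: usize = 0;
for (events) |event| {
    const json = try std.json.stringifyAlloc(allocator, event, .{});
    defer allocator.free(json);
    // +1 for the trailing newline separator.
    if (body.items.len + json.len + 1 > max_body_size) break;
    try body.appendSlice(json);
    try body.append('\n');
    sent += 1;
}
// events[sent..] stay in the buffer for the next batch.
```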
```zig
self.fireTicks();

listener.onAccept(listener.ctx, socket);
if (self.shutdown.load(.acquire) and running_handles == 0) {
```
When I tested it, the batch size was always 1, which is good because it's on the main thread, but a bit less efficient in other ways (including on the server, cc @krichprollsch). But when I hit ctrl-c, the program takes a long time to exit. It just keeps processing CDP requests and keeps sending batches of 1. It took maybe 1 minute to properly shut down.
Is there a way to improve this? Killing a process but having it run for 1 minute firing off telemetry isn't great. Do we have any way to disconnect all other clients and synchronously flush the telemetry?
Adds support for libcurl requests to the network runtime. The first consumer is telemetry, which now serves as a thin event broker. Since Runtime doesn't support timeouts/intervals, events are flushed either upon receiving an event or upon server termination.