
When I've talked to people running this kind of AI scraping/agent workflow, the costs of the AI parts dwarf those of the web browser parts, which makes the computational cost of the browser irrelevant. I'm curious what situation you got yourself into where optimizing the browser results in meaningful savings. I'd also like to be in that place!

I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website, but it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.



The cost of the browser part is still a problem. In our previous startup, we were scraping >20 million webpages per day, with thousands of headless Chrome instances running in parallel.

Regarding the RAM usage, it's still ~10x better than Chrome :) Most of it seems to come from V8; I guess we could do better with a lightweight JS engine alternative.


Speaking as a web developer and server manager: AI trainers scraping websites with no throttling are the problem. lol


> there are hundreds of Web APIs, and for now we just support some of them (DOM, XHR, Fetch)

> it's still ~10x better than Chrome

Do you expect it to stay that way once you've reached parity?


I don't expect it to change a lot. All the main components are there, it's mainly a question of coverage now.


Playwright can run WebKit very easily, and it's dramatically less resource-intensive than Chrome.
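
For reference, a minimal sketch of what that looks like with Playwright's Python API (the URL is just a placeholder); WebKit ships with Playwright, so launching it is one line:

    from playwright.sync_api import sync_playwright

    # Launch Playwright's bundled WebKit build instead of Chromium.
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()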


Yes, but WebKit is not a browser per se; it's a rendering engine.

It's less resource-intensive than Chrome, but here we are talking about orders of magnitude between Lightpanda and Chrome. If you are ~10x faster while using ~10x less RAM, you are using ~100x fewer resources.


How well does it compare to specialized headless scraper browsers, like Camoufox (Firefox-based) or SecretAgent (Chrome-based)?

Either should reduce your RAM usage by a lot compared to stock Chrome.


Careful, as you implement missing features your RAM usage might grow too. That has happened to many projects: lean at the beginning, just as slow once they have to deal with real-world messiness.


Does it work nicely on Linux? I'm very curious about this


How about using QuickJS instead of full-blown V8? For example, Elinks has support for SpiderMonkey, QuickJS, MuJS: https://github.com/rkd77/elinks/blob/master/doc/ecmascript.t... and takes a few MB of RAM.


You may reduce RAM, but also performance. A good JIT costs RAM.


Yes, that's true. It's a balance between RAM and speed.

I was thinking more of use cases that require disabling the JIT anyway (WASM, iOS integration, security).


Yeah, it could be nice to let the user select the ECMAScript engine that fits their use case / performance requirements (balancing the resources available).


If your target is consistent enough (perhaps even stationary), then at some point "JIT" means wasting CPU cycles.


Generally, for consumer use cases, it's best to A) do it locally, preserving some of the original web contract, B) run JS to get the actual content, C) post-process to reduce inference cost, and D) get latency as low as possible.
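
As a rough illustration of B) and C), here is a minimal sketch assuming Playwright's Python API; fetch_text and the example URL are hypothetical placeholders. It renders the page so JS-generated content exists, then reduces it to visible text before anything is sent to a model:

    from playwright.sync_api import sync_playwright

    def fetch_text(url: str) -> str:
        # B) run JS: render the page so dynamic content is actually present
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            # C) post-process: keep only the visible text to cut inference cost
            text = page.inner_text("body")
            browser.close()
        return text

    # D) the smaller the string handed to the LLM, the lower the cost and latency
    content = fetch_text("https://example.com")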

Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.

It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.

I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.

> I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website.

I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: 10% of Chrome's usage is still too high?


The benchmark shows lower RAM usage on a very simple demo website. I expect that if the benchmark ran on a random set of real websites, RAM usage would not be meaningfully lower than Chrome's. Happy to be impressed and wrong if it remains lower.


I believe it will still be significantly lower, as we skip the graphical rendering.

But to validate that we need to increase our Web APIs coverage.


Then came DeepSeek.



