
When I've talked to people running this kind of AI scraping/agent workflow, the costs of the AI parts dwarf those of the web browser parts, which makes the computational cost of the browser irrelevant. I'm curious what situation you got yourself into where optimizing the browser results in meaningful savings. I'd also like to be in that place!

I think your RAM usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than Chrome on a minimal website, but it should even out or get worse as the websites get richer. The nature of web scraping is that the worst sites take up the vast majority of your CPU cycles. I don't think lowering the RAM usage of the browser process will have much real-world impact.



The cost of the browser part is still a problem. In our previous startup, we were scraping >20 million webpages per day, with thousands of headless Chrome instances running in parallel.

Regarding the RAM usage, it's still ~10x better than Chrome :) Most of it seems to come from V8; I guess we could do better with a lightweight JS engine alternative.


Speaking as a web developer and server manager: AI trainers scraping websites with no throttling are the problem. lol


> there are hundreds of Web APIs, and for now we just support some of them (DOM, XHR, Fetch)

> it's still ~10x better than Chrome

Do you expect it to stay that way once you've reached parity?


I don't expect it to change a lot. All the main components are there, it's mainly a question of coverage now.


Playwright can run WebKit very easily, and it's dramatically less resource-intensive than Chrome.
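
For reference, a minimal sketch of what that looks like with Playwright's Python API (the URL is just a placeholder); WebKit ships with Playwright, so launching it is one line:

    from playwright.sync_api import sync_playwright

    # Launch Playwright's bundled WebKit build instead of Chromium.
    with sync_playwright() as p:
        browser = p.webkit.launch(headless=True)
        page = browser.new_page()
        page.goto("https://example.com")
        print(page.title())
        browser.close()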


Yes, but WebKit is not a browser per se; it's a rendering engine.

It's less resource-intensive than Chrome, but here we are talking about orders of magnitude between Lightpanda and Chrome. If you are ~10x faster while using ~10x less RAM, you are using ~100x fewer resources.


How well does it compare to specialized headless scraper browsers, like Camoufox (Firefox-based) or SecretAgent (Chrome-based)?

Either should reduce your RAM usage by a lot compared to stock Chrome.


Careful, as you implement missing features your RAM usage might grow too. That has happened to many projects: lean at the beginning, just as slow once they have to deal with real-world messiness.


Does it work nicely on Linux? I'm very curious about this


How about using QuickJS instead of full-blown V8? For example, Elinks has support for SpiderMonkey, QuickJS, MuJS: https://github.com/rkd77/elinks/blob/master/doc/ecmascript.t... and takes a few MB of RAM.


You may reduce RAM, but also performance. A good JIT costs RAM.


Yes, that's true. It's a balance between RAM and speed.

I was thinking more of use cases that require disabling the JIT anyway (WASM, iOS integration, security).


Yeah, it could be nice to let the user select the ECMAScript engine that fits their use case / performance requirements (balancing the resources available).


If your target is consistent enough (perhaps even stationary), then at some point "JIT" means wasting CPU cycles.


Generally, for consumer use cases, it's best to A) do it locally, preserving some of the original web contract, B) run JS to get the actual content, C) post-process to reduce inference cost, and D) get latency as low as possible.
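
As a rough illustration of B) and C), here is a minimal sketch assuming Playwright's Python API; fetch_text and the example URL are hypothetical placeholders. It renders the page so JS-generated content exists, then reduces it to visible text before anything is sent to a model:

    from playwright.sync_api import sync_playwright

    def fetch_text(url: str) -> str:
        # B) run JS: render the page so dynamic content is actually present
        with sync_playwright() as p:
            browser = p.chromium.launch(headless=True)
            page = browser.new_page()
            page.goto(url, wait_until="networkidle")
            # C) post-process: keep only the visible text to cut inference cost
            text = page.inner_text("body")
            browser.close()
        return text

    # D) the smaller the string handed to the LLM, the lower the cost and latency
    content = fetch_text("https://example.com")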

Then, as the article points out, the Big Guns making the LLMs are a big use case for this because they get a 10x speedup and can begin contemplating running JS.

It sounds like the people you've talked to are in a messy middle: no incentive to improve efficiency of loading pages, simply because there's something else in the system that has a fixed cost to it.

I'm not sure why that would rule out improving anything else; it doesn't seem like they should be stuck doing nothing other than flailing around for cheaper LLM inference.

> I think your ram usage benchmark is deceptive. I'd expect a minimal browser to have much lower peak memory usage than chrome on a minimal website.

I'm a bit lost: the RAM usage benchmark says it's ~10x less, and you feel it's deceptive because you'd expect RAM usage to be less? Steelmanning: 10% of Chrome's usage is still too high?


The benchmark shows lower RAM usage on a very simple demo website. I expect that if the benchmark ran on a random set of real websites, RAM usage would not be meaningfully lower than Chrome's. Happy to be impressed and wrong if it remains lower.


I believe it will still be significantly lower, as we skip the graphical rendering.

But to validate that we need to increase our Web APIs coverage.


Then came DeepSeek.



