Good company-ready RAG benefits a lot from some basic pre-processing/labeling of the data instead of solely dumping unstructured data into a vector database and calling it a day. Different heuristics and different schemas of embedded data go a long way toward ensuring quality and flexibility of querying.
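As a rough sketch of what "labeling before embedding" can look like (all field names here are hypothetical, not any particular product's schema):

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    """A labeled unit of text, built at ingest time, before embedding."""
    text: str
    source: str                      # file path or URL
    doc_type: str                    # e.g. "report" vs. "note"
    created: str                     # ISO date string, or "unknown"
    tags: list = field(default_factory=list)

def preprocess(raw_docs):
    """Attach structured metadata to each doc instead of dumping raw text.
    The heuristics here (extension -> doc_type, quarter tags) are toy
    examples; real pipelines would use whatever signals the corpus has."""
    chunks = []
    for doc in raw_docs:
        body = doc["body"].strip()
        chunks.append(Chunk(
            text=body,
            source=doc["path"],
            doc_type="report" if doc["path"].endswith(".pdf") else "note",
            created=doc.get("date", "unknown"),
            tags=[q for q in ("Q1", "Q2", "Q3", "Q4") if q in body],
        ))
    return chunks
```

The payoff is that retrieval can later filter on doc_type or tags instead of hoping the embedding captures them.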

Then you can do ReAG, which lets you reason intelligently over the top-K results.
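The two-stage shape is roughly this (the embedding here is a toy word-overlap stand-in, and `judge` would be an LLM call in practice; both are assumptions for illustration):

```python
def embed(text):
    # Stand-in "embedding": a bag of lowercase words. A real system
    # would use an embedding model; only the two-stage shape matters.
    return set(text.lower().split())

def sim(a, b):
    """Jaccard overlap as a cheap similarity proxy."""
    return len(a & b) / max(len(a | b), 1)

def reag(query, docs, judge, k=3):
    """Stage 1: cheap similarity gets a top-K candidate set.
    Stage 2: a reasoning pass (judge = an LLM call in practice, here any
    callable) keeps only candidates it decides actually answer the query."""
    qv = embed(query)
    topk = sorted(docs, key=lambda d: -sim(qv, embed(d)))[:k]
    return [d for d in topk if judge(query, d)]
```

The point is that stage 1 is recall-oriented and dumb, and stage 2 is precision-oriented and can apply actual reasoning.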

Memory/knowledge-graph services can also help reduce your search space and provide extra context that gets updated over time, beyond just treating static docs as sources of truth. You can give the system more context on how to interpret older docs vs. newer ones, and let users (based on correctness feedback) help audit what is embedded in your RAG system.
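One concrete version of "interpret older docs vs. newer ones" is recency-weighting retrieval scores; a minimal sketch (the half-life value is an arbitrary assumption, something to tune per corpus):

```python
import math
from datetime import date

def recency_weight(score, doc_date, today=None, half_life_days=180):
    """Exponentially decay a similarity score by document age, so a newer
    doc outranks an equally similar older one. With half_life_days=180,
    a six-month-old doc keeps half its raw score."""
    today = today or date.today()
    age_days = (today - doc_date).days
    return score * math.exp(-math.log(2) * age_days / half_life_days)
```

This is only one heuristic; a knowledge graph can do better by recording which doc supersedes which, rather than assuming newer always wins.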

I appreciate the thorough write-up, but doing RAG systems seriously requires much more than just embeddings and a basic chromadb setup.

Happy to share any thoughts here or on a call if anyone wants to chat.


I agree. I attempted a similar project a year ago and the retrieval part is critical. To work even half decently you need a serious strategy for metadata, chunking, etc. E.g., how do you deal with time series data? I'm not looking for just any quarterly numbers, but the ones from Q2 2025, or the research report from four weeks ago. And how do you deal with images? We had heaps of company knowledge in pptx, which you can convert to text, but what about the pictures in the presentations? Our analyst presentations sometimes consist mostly of charts and visuals; how are those embedded? Also, IMO, 90% of the time companies don't need a RAG system but a good search/retrieval system.
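The "Q2 2025, not just any quarter" problem is usually solved by filtering on structured metadata first and only then ranking by similarity. A toy sketch (chunks as dicts, term overlap standing in for real similarity; all assumed, not any specific library's API):

```python
def filter_then_rank(chunks, must, query_terms):
    """First narrow by exact metadata match (cheap and precise), then
    rank the survivors by a similarity stand-in (term overlap here).
    Embeddings alone tend to blur 'Q2 2025' and 'Q2 2024' together;
    the structured filter does not."""
    pool = [c for c in chunks if all(c.get(k) == v for k, v in must.items())]
    return sorted(
        pool,
        key=lambda c: -len(query_terms & set(c["text"].lower().split())),
    )
```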


Yep. Semantically distinct and meaningful chunks win every time over any kind of windowing or slicing and dicing.
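The simplest version of this is splitting on natural boundaries (paragraphs here) instead of fixed-size windows, so no chunk ever cuts a thought in half. A minimal sketch, with max_chars as an arbitrary budget:

```python
def semantic_chunks(text, max_chars=500):
    """Split on paragraph boundaries instead of fixed windows, packing
    whole paragraphs into chunks up to max_chars. Unlike a sliding
    window, a paragraph is never split across two chunks."""
    chunks, cur = [], ""
    for para in text.split("\n\n"):
        if cur and len(cur) + len(para) > max_chars:
            chunks.append(cur.strip())
            cur = ""
        cur += para + "\n\n"
    if cur.strip():
        chunks.append(cur.strip())
    return chunks
```

Real semantic chunkers go further (headings, sentence embeddings, topic shifts), but even this beats blind windowing on boundary quality.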

Unfortunately, many people are looking for a fire-and-forget solution over an existing rat's nest of documentation debt.


> Happy to share any thoughts here

Please do.


Are there other ways we might tune the chunkers, or describe the data we want chunked, to get the best results?

Or perhaps, in the playground, a way to take a given type of input data and easily run different chunkers side by side, or pipe them into each other, to see which gives the best results?
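Even before such a playground exists, a crude side-by-side harness is easy to hack up yourself (this is a sketch, not any product's API; the stats reported are just examples):

```python
def compare_chunkers(text, chunkers):
    """Run each chunker on the same input and report simple stats,
    a rough stand-in for a side-by-side playground view.
    chunkers: dict mapping a name to a callable text -> list[str]."""
    report = {}
    for name, fn in chunkers.items():
        chunks = fn(text)
        report[name] = {
            "count": len(chunks),
            "avg_len": sum(map(len, chunks)) / max(len(chunks), 1),
        }
    return report
```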


We don't have this yet but we will soon. Finding the right setup for your data is definitely tougher than it needs to be.

