
Can you elaborate on what kind of system you built? I'm curious which specific prompts are getting worse responses with the newer models.


Linguistics, specifically as it pertains to language learning

Edit: Whoops, read your question wrong. I do a bunch of NLP on different languages, and use LLMs to pad out and interpret the data. Asking for things like translations, alternatives, transliterations; associating and validating data; transferring data from one language to another; segmentation and cross-lingual alignment; the list goes on.

I did manage to get higher quality in the end, so it’s not entirely a regression. But older LLMs were much more capable of interpreting disparate data and tying it together with less prompting.

Most of the work I do does not really have a “right answer,” just a lot of wrong ones, which I think is what trips up LLMs. If I turn on reasoning for any step in my pipeline, the token count goes up 100-fold and the quality gets cut in half.

Edit 2: I did have to move off GPT, though, to get the improvements mentioned. Go Mistral!


What kind of data are you interpreting? Do you mean document extraction from different languages? I have only used GPT5.5 for agentic coding, which did get significantly better in my experience, although that does align with your conjecture that their focus is on improving this. I haven't noticed a regression when interacting with it in different languages, though (specifically German and Russian). I have done data extraction from documents in different languages, but only with locally hosted LLMs (mainly Qwen3.5-397b), as I cannot legally use cloud-based solutions. My local solution was more than sufficient, so I would be surprised if a frontier model failed at that.



