I write intrinsics by hand all the time. It's very easy to make a problem too complicated to autovectorize. And even if you do get it to autovectorize, it's not exactly future-proof against compiler changes.
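For anyone who hasn't done it, here's a rough sketch of what the hand-written pattern tends to look like in Rust with `std::arch`. The function names (`add`, `add_avx`, `add_scalar`) are made up for illustration, and a loop this simple would very likely autovectorize on its own; the point is only the shape: an explicit vector body, a scalar tail, and a runtime feature check so the result doesn't depend on what the optimizer decides.

```rust
#[cfg(target_arch = "x86_64")]
use std::arch::x86_64::{_mm256_add_ps, _mm256_loadu_ps, _mm256_storeu_ps};

/// Scalar reference version; also the fallback when AVX is unavailable.
fn add_scalar(a: &[f32], b: &[f32], out: &mut [f32]) {
    for i in 0..out.len() {
        out[i] = a[i] + b[i];
    }
}

/// Hand-written AVX version: 8 f32 lanes per iteration, scalar tail.
/// Hypothetical helper; the caller must have verified AVX support.
#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "avx")]
unsafe fn add_avx(a: &[f32], b: &[f32], out: &mut [f32]) {
    let n = out.len();
    let main = n - n % 8; // largest multiple of 8 that fits
    let mut i = 0;
    while i < main {
        unsafe {
            // Unaligned loads/stores, so no alignment requirement on the slices.
            let va = _mm256_loadu_ps(a.as_ptr().add(i));
            let vb = _mm256_loadu_ps(b.as_ptr().add(i));
            _mm256_storeu_ps(out.as_mut_ptr().add(i), _mm256_add_ps(va, vb));
        }
        i += 8;
    }
    // Remaining 0..7 elements go through the scalar path.
    add_scalar(&a[main..], &b[main..], &mut out[main..]);
}

/// Safe entry point: picks the AVX path only if the CPU reports support.
pub fn add(a: &[f32], b: &[f32], out: &mut [f32]) {
    assert!(a.len() == out.len() && b.len() == out.len());
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx") {
            // Safety: AVX support was just checked at runtime.
            unsafe { add_avx(a, b, out) };
            return;
        }
    }
    add_scalar(a, b, out);
}
```

The price is the `unsafe` and the per-architecture code, which is roughly the trade being described: more work up front, less dependence on the compiler's mood.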
I use Agner as well. I started my own version for Rust, specifically targeting avx512[1], but I've hit enough snags that I think I'll abandon it. It's super green at the moment, and I haven't pushed it to Cargo. If I'm going to dedicate time to it, I need it to work for my purposes, and there's a thread-parallel problem that makes it unusable for me right now.
Note that simply using a SIMD vector class library does not make your code "go faster". In fact, it can make things worse (due to latency). What you usually need is a problem, and then a solution (algorithm), that parallelizes well.
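One way to read the latency point, as a minimal sketch rather than a benchmark: a reduction with a single accumulator is a chain of dependent adds, so each add waits on the previous one no matter how wide the registers are. Splitting the work into independent accumulators is the algorithm change that gives the hardware something to do in parallel. The names `sum_serial` and `sum_lanes` and the lane count of 8 are assumptions for illustration. As far as I know, the compiler generally won't make this change for you with floats, because it alters the order of the additions.

```rust
/// One accumulator: every add depends on the previous one (latency-bound).
pub fn sum_serial(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Eight independent accumulators: the adds within one chunk do not depend
/// on each other, which is the shape that maps onto SIMD lanes.
/// Note: this sums in a different order, so results can differ slightly
/// from `sum_serial` for general floating-point input.
pub fn sum_lanes(xs: &[f32]) -> f32 {
    const LANES: usize = 8; // assumed width: one 256-bit register of f32
    let mut acc = [0.0f32; LANES];
    let chunks = xs.chunks_exact(LANES);
    let tail = chunks.remainder();
    for c in chunks {
        for i in 0..LANES {
            acc[i] += c[i];
        }
    }
    // Horizontal reduction of the lanes, then the leftover elements.
    let mut total: f32 = acc.iter().sum();
    for &x in tail {
        total += x;
    }
    total
}
```

Whether the second version is actually faster depends on the data size and the machine, so measure before committing to it.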
Can you explain? My experience has been that you don't need much. I've benchmarked Agner's exp, and if you have 4 calculations to do, then calling it once with avx2 will be 4x faster than calling std::exp four times.
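For context on why a batch of four exps is friendly to SIMD, here's a rough sketch of a lane-wise exp. To be clear, this is not Agner's implementation, and `exp4` is a hypothetical name; it just uses the textbook approach (range reduction plus a polynomial) to show that every lane runs the same branch-free arithmetic. It assumes inputs of moderate size (roughly |x| < 700) and does no handling of overflow, underflow, NaN, or infinities.

```rust
/// exp() applied to four values with identical, branch-free steps per lane.
/// Illustrative only: assumes finite inputs with |x| well below ~700.
pub fn exp4(x: [f64; 4]) -> [f64; 4] {
    use std::f64::consts::{LN_2, LOG2_E};
    let mut out = [0.0f64; 4];
    for i in 0..4 {
        // Range reduction: x = k*ln2 + r, with |r| <= ln2/2 (approximately).
        let k = (x[i] * LOG2_E).round();
        let r = x[i] - k * LN_2;
        // Horner evaluation of the degree-13 Taylor polynomial of exp(r):
        // 1 + r/1 * (1 + r/2 * (1 + ... (1 + r/13))).
        let mut p = 1.0f64;
        for n in (1..=13).rev() {
            p = 1.0 + p * r / n as f64;
        }
        // Build 2^k by writing the biased exponent field directly.
        let scale = f64::from_bits(((k as i64 + 1023) as u64) << 52);
        out[i] = p * scale;
    }
    out
}
```

A production version would use precomputed coefficients instead of the divisions and a more careful reduction, but the structure is the same: no data-dependent branches, so four inputs cost about the same instruction stream as one.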