
Do you have (rough) numbers for inference latency on 4x 32GB V100?


(author here)

I don't have exact latency numbers, but the inference widget currently runs on a TPU v3-8 (which, if I am not mistaken, is roughly comparable to a cluster of 8 V100s). That should give you a rough idea of the latency for short inputs.

Also, a colleague just reminded me that it is possible to run inference for T5-11B (which is the size we use) on a single (big) GPU with offloading, provided you have enough CPU memory -> https://github.com/huggingface/transformers/issues/9996#issu...
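
For a sense of why offloading comes up at all, here is a back-of-the-envelope sketch of the weight memory alone. It assumes a nominal 11e9 parameters for T5-11B, treats a "32GB" V100 as 32 GiB, and ignores activations and framework overhead; the helper names are made up for illustration and are not part of any library.

```python
# Rough weight-memory arithmetic for an ~11B-parameter model.
# Assumptions (not from the thread): 11e9 parameters, weights only,
# no activations or framework overhead, "32GB" treated as 32 GiB.

BYTES_PER_PARAM = {"fp32": 4, "fp16": 2}


def weight_gib(num_params: float, dtype: str) -> float:
    """Memory needed to hold the weights, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


def fits_on_gpu(num_params: float, dtype: str, gpu_gib: float) -> bool:
    """True if the weights alone fit in the given amount of GPU memory."""
    return weight_gib(num_params, dtype) <= gpu_gib


if __name__ == "__main__":
    num_params = 11e9  # nominal T5-11B size (assumption)
    for dtype in BYTES_PER_PARAM:
        gib = weight_gib(num_params, dtype)
        print(f"{dtype}: {gib:.1f} GiB of weights, "
              f"fits on one 32 GiB GPU: {fits_on_gpu(num_params, dtype, 32)}, "
              f"fits across 4 x 32 GiB: {fits_on_gpu(num_params, dtype, 4 * 32)}")
```

Under these assumptions the fp32 weights come to roughly 41 GiB, which is more than a single 32GB card holds, so part of the model would have to live in CPU memory. Real usage will be higher once activations are counted.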



