I love this: "Our singular focus means no distraction by management overhead or product cycles, and our business model means safety, security, and progress are all insulated from short-term commercial pressures."
I was a little confused about this too. The authors say in the paper:
"The outputs of the ViT image encoder before pooling form the visual tokens, which are linearly projected and prepended to the embedded input text tokens."
I took a look at the HuggingFace implementation of ViT [1]. After the ViT encoder blocks there's a layer norm and then a pooling layer (line 595), where the pooler takes the first token of the layer-norm output and runs it through a dense layer. So, it looks like in PaLI-3 the visual tokens are the hidden states output by the layer norm after the ViT encoder blocks, before any pooling is applied.
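To make the distinction concrete, here is a minimal numpy sketch of the two paths: HF-style pooling (first token through a dense layer) versus the PaLI-3-style route described in the paper (skip pooling, linearly project all tokens, prepend to the text embeddings). All dimensions and weight matrices here are made up for illustration; this is not the actual PaLI-3 or HuggingFace code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions, chosen only for illustration.
num_patches = 196          # e.g. a 224px image with 16px patches -> 14x14
d_vit, d_text = 768, 512   # ViT hidden size, text-model embedding size

# Output of the ViT encoder blocks + final layer norm, *before* pooling:
# one vector per patch token, plus a leading [CLS]-style token.
hidden_states = rng.standard_normal((num_patches + 1, d_vit))

# HF-style pooling: take the first token and run it through a dense layer.
w_pool = rng.standard_normal((d_vit, d_vit))
pooled = np.tanh(hidden_states[0] @ w_pool)          # shape (d_vit,)

# PaLI-3-style visual tokens: no pooling; linearly project *all* tokens
# and prepend them to the embedded input text tokens.
w_proj = rng.standard_normal((d_vit, d_text))
visual_tokens = hidden_states @ w_proj               # (num_patches + 1, d_text)

text_embeds = rng.standard_normal((10, d_text))      # 10 embedded text tokens
model_input = np.concatenate([visual_tokens, text_embeds], axis=0)
print(model_input.shape)  # (207, 512): 197 visual tokens + 10 text tokens
```

The key point is just that the pooled path collapses everything to a single vector, while the PaLI-3 path keeps one projected token per patch.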
Not self-hosted/local, but from what I've heard, Claude by Anthropic is really good; however, the API is not publicly available. It's apparently accessible via Poe (https://poe.com).
It's good that they chose to continue support for code-davinci-002 (https://twitter.com/sama/status/1638576434485825536?s=20), but it'd be much better if they open-sourced it sooner or later, since even OpenAI didn't expect their model to be so widely used.