
Since publishing this I've found a few additional resources that are really useful for understanding embeddings at a lower level (my article is deliberately very high level and focuses mainly on their applications).

Cohere's Text Embeddings Visually Explained: https://txt.cohere.com/text-embeddings/

The Tensorflow Embedding Projector tool: https://projector.tensorflow.org/

What are embeddings? by Vicki Boykis is worth checking out as well: https://vickiboykis.com/what_are_embeddings/

Actually I'll add those as "further reading" at the bottom of the page.



I had exactly the same idea a while back:

https://blog.scottlogic.com/2022/02/23/word-embedding-recomm...

Using embeddings I increased engagement with related articles.

Personally I think embeddings are a powerful tool that is somewhat overlooked. They can be used to navigate between documents (and excerpts) based on similarities, or conversely to find unique content.

All without worrying about hallucinations. In other words, they are quite ‘safe’
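
To make the related-articles idea concrete, here's a minimal sketch. The article names and vectors are made up; in practice the vectors would come from whatever embedding model you use.

    import numpy as np

    # Hypothetical: one embedding vector per article (real ones have hundreds of dimensions).
    article_vectors = {
        "intro-to-embeddings":  np.array([0.12, 0.88, 0.35]),
        "vector-search-basics": np.array([0.10, 0.80, 0.40]),
        "gardening-tips":       np.array([0.90, 0.05, 0.10]),
    }

    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    def related(slug, k=2):
        query = article_vectors[slug]
        scores = {other: cosine(query, vec)
                  for other, vec in article_vectors.items() if other != slug}
        return sorted(scores, key=scores.get, reverse=True)[:k]

    print(related("intro-to-embeddings"))  # nearest neighbours by cosine similarity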


> All without worrying about hallucinations. In other words, they are quite ‘safe’

Within limits, yes. A vector notion of similarity isn't ideal for every use case, though.

For example, in the article "France" and "Germany" are considered similar. Yes, they are, but if you're searching for stuff about France then stuff about Germany is a false positive.

Embeddings can also struggle with logical opposites. Hot/cold are in many senses similar concepts, but they are also opposites. Finding the opposite of what you're searching for isn't always helpful.

I wouldn't say embeddings are overlooked exactly? Right now it feels like every man and his dog is building embedding-based search engines. The next frontier is probably going to be balancing conventional word-based approaches with embeddings to really maximize result quality, as sometimes you want "vibes" and sometimes you want control.
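
One simple way to do that balancing is a weighted blend of a keyword score and an embedding similarity. The weights and numbers below are purely illustrative:

    # Hypothetical hybrid ranking: blend a keyword score (e.g. BM25, normalised
    # to 0..1) with an embedding cosine similarity. The 0.6/0.4 weights are made
    # up; in practice you'd tune them on real relevance judgements.
    def hybrid_score(keyword_score, embedding_similarity,
                     keyword_weight=0.6, embedding_weight=0.4):
        return keyword_weight * keyword_score + embedding_weight * embedding_similarity

    # Exact term matches still win, but semantically close documents with no
    # term overlap can also surface.
    print(hybrid_score(keyword_score=1.0, embedding_similarity=0.3))  # 0.72
    print(hybrid_score(keyword_score=0.0, embedding_similarity=0.9))  # 0.36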


Simon, just wanted to say thanks for all the great content and writings you've been putting out - it's been super helpful for digesting a lot of the fast developments in this space. Always looking forward to the next one!


Thanks for saying that!


simon, the way you write makes it so accessible for people who have limited experience with AI, ML or LLMs. thank you!

maybe it is also interesting to explain how some embeddings are obtained, i.e. via training a model and cutting off the classification layer, or with things like EfficientNet
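
For the "cut off the classification layer" idea, a rough sketch with torchvision's EfficientNet (attribute names can differ between versions, so treat this as illustrative):

    import torch
    import torchvision.models as models

    # Load an ImageNet-pretrained EfficientNet-B0 and drop its classification head,
    # so the forward pass returns the pooled feature vector instead of class logits.
    model = models.efficientnet_b0(weights="IMAGENET1K_V1")
    model.classifier = torch.nn.Identity()
    model.eval()

    image_batch = torch.randn(1, 3, 224, 224)  # placeholder for a real preprocessed image
    with torch.no_grad():
        embedding = model(image_batch)

    print(embedding.shape)  # roughly (1, 1280): a reusable image embedding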


Did you stumble upon any resources discussing the history of embeddings and their use in CS and LLMs? They're becoming a cornerstone of ML.


Maybe someone can offer a richer history, but to my knowledge the first suggestion of word vectors was LSA, which originated the idea of dimensionality reduction on a term/document matrix. They just used SVD, but the more modern methods are all doing essentially the same thing. To my recollection they were an HCI lab and their goal was not to make a language model so much as to make a search function for files.
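
In today's terms that LSA recipe is only a few lines. The toy corpus and two components below are just for illustration:

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.decomposition import TruncatedSVD

    docs = [
        "the cat sat on the mat",
        "a kitten sleeps on the rug",
        "stock markets fell sharply today",
    ]

    term_doc = TfidfVectorizer().fit_transform(docs)  # sparse term/document matrix
    lsa = TruncatedSVD(n_components=2)                # dimensionality reduction via SVD
    doc_vectors = lsa.fit_transform(term_doc)         # dense "semantic" vector per document

    print(doc_vectors.shape)  # (3, 2)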


Not aside from Word2Vec and I'd like to learn more about that.


Do you not consider SVD/PCA to be "embeddings"?

Latent Semantic Indexing long predates word2vec (I believe the initial paper was 1988) and also attempted to derive semantic vectors (arguably successfully) using SVD on the term-frequency matrix representing a corpus.

I would certainly consider this an example of using "embeddings" that was quite heavily used in practice prior to the explosion of deep learning techniques.


Sounds like you know a great deal more about the history of this field than I do!


The classic Introduction to Information Retrieval, though out of date in many ways, does a great job covering the "old school" NLP approaches for IR and has an excellent section on Latent Semantic Indexing which you might find enjoyable: https://nlp.stanford.edu/IR-book/html/htmledition/latent-sem...


There are a handful of historical components that come together to make word2vec such a success:

* The idea of vectorial representations for language.

* "Distributed" representations, which are dense not sparse.

* Vectorial representations for words, not documents.

* The language modeling idea of predicting the next word.

* Neural approaches to inducing these representations.

* Unsupervised learning of these representations.

* The versatility of these representations for downstream tasks.

* Fast training techniques.

I'll give the history in roughly reverse chronological order.

Turian, Ratinov, Bengio (2010) "Word representations: A simple and general method for semi-supervised learning" is my work. It received the ACL ten-year test of time award and 3k citations. One main contribution was showing that unsupervised neural word embeddings can just be shoved into any existing model as features and get an improvement. This turned the NLP community on to neural networks at a time when sophisticated ML was still bleeding edge for NLP and expert linguistic knowledge was the preferred MO. [edit: We also showed that two other kinds of unsupervised embeddings also gave improvements when plugged into existing supervised models: Brown clusters, which are very old, and neural word embeddings from the log-bilinear model (Mnih & Hinton, 2007), a probabilistic and linear neural model. We also gained attention because we released our code AND all the trained embeddings for people just to try, which wasn't commonplace in NLP and ML at the time.]

We arbitraged the neural embedding model from Collobert and Weston (2008) "A unified architecture for natural language processing: Deep neural networks with multitask learning" and also "Natural Language Processing Almost From Scratch" which achieved amazing scores on many NLP tasks using semi-supervised neural networks that, for the first time, were very fast to train because of the use of contrastive learning. This work didn't get much attention at the time because it was aimed at an ML audience and also because neural networks were still gauche compared to SVMs.

Collobert and Weston had a much faster training approach than the neural language model of Bengio et al. 2000, "A neural probabilistic language model", which was, in my mind, what really precipitated all this: train a neural network to predict the next word. That approach was slow because the output prediction was a multiclass prediction over the entire vocabulary. (Collobert and Weston used Hadsell + LeCun siamese-style networks to rank the true next word with a higher score, by a margin, than a randomly selected noise word.)
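
Sketched with made-up scores, that ranking objective looks roughly like this (the network that actually scores a word in its context is omitted):

    import torch

    # Hinge / margin ranking loss in the Collobert & Weston spirit: the score of
    # the true word in its context should beat a random noise word by a margin.
    def ranking_loss(score_true, score_noise, margin=1.0):
        return torch.clamp(margin - score_true + score_noise, min=0.0).mean()

    score_true = torch.tensor([2.3, 0.1])   # scores for real (context, word) pairs
    score_noise = torch.tensor([1.0, 0.4])  # scores with a randomly substituted word
    print(ranking_loss(score_true, score_noise))  # zero once the margin is satisfied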

With that said, vector embeddings for documents have a longer history: LSA, then LDA, and even cool semantic hashing work by Salakhutdinov + Hinton (2007) that is one of the first deep learning approaches (the first?) to NLP, which unfortunately didn't get much attention but was so cool.

Earlier work using neural networks for modeling arbitrary length context that also didn't get much attention was Pollack (1990) "Recursive distributed representations" which introduced recursive autoassociative memory (RAAM) and later Sperduti (1994) "Labelling recursive auto-associative memory". The idea was that you have a representation for the sentence and you recursively consume the next token to generate a new fixed-length representation. You then have a STOP token at the end. And then you can unroll the representation because current representation + next token => next representation is an autoassociator.
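
A toy version of that RAAM encoding loop, just to make the mechanism concrete (the dimensions and layers are arbitrary, and the training of the decoder half of the autoassociator is only described in the comments):

    import torch
    import torch.nn as nn

    dim = 16
    encoder = nn.Linear(2 * dim, dim)  # (previous state, token embedding) -> new state
    decoder = nn.Linear(dim, 2 * dim)  # new state -> (previous state, token embedding)

    def encode(token_embeddings):
        state = torch.zeros(dim)
        for tok in token_embeddings:
            state = torch.tanh(encoder(torch.cat([state, tok])))
        return state  # fixed-length representation of the whole sequence

    # Training (not shown) would minimise the reconstruction error of `decoder`
    # at every step, so the final state can be unrolled back into the sequence.
    sentence = [torch.randn(dim) for _ in range(5)]
    print(encode(sentence).shape)  # torch.Size([16])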

The compute power wasn't really there to make this stuff work empirically during the 1990s. But there was other fringe work like Chrisman (1990) "Learning Recursive Distributed Representations for Holistic Computation". And this 90s work traces back to cool 80s conceptual work by Hinton on "associative" representations, for example Hinton (1984) "Distributed Representations" and Hinton (1986) "Learning distributed representations of concepts". A lot of this work had very interesting critical ideas, and was of the form: I thought about this for a very long time, here's how it would work, but we don't have large-scale training techniques yet.

I'm pretty sure Bottou also contributed here, but I'm forgetting the exact cite.

Feel free to email me if you like. (See profile.)


Thank you for this, most informative comment I've seen on Hacker News in ages!


Latent semantic indexing


nice article, thanks



