Hacker News | new | past | comments | ask | show | jobs | submit | craffel's comments

(different author, not Stella)

To your first question: Unpublished experiments done by the BigScience architecture and scaling WG suggest that training on book corpus yields a boost of 10-15% accuracy on LAMBADA.

To your second question: LAMBADA specifically is an interesting task, but it's a bit unsatisfying to work on since there are so many conflating factors in prior work on the dataset. We are planning quite a few follow-up projects along this general line of work (prompted multi-task training), though.


(author here)

The paper/model/code was just made public today. This may be why no one is talking about it yet.

Regarding whether the size is a hassle: It's possible to run inference on a single Google Cloud TPU v3-8 device or on a server with 4x 32GB v100 GPUs. Hugging Face also has an inference API for any model on the Hub: https://api-inference.huggingface.co/docs/python/html/index....


Do you have (rough) numbers for inference latency on 4x 32GB v100?


(author here)

I don't have exact numbers for latency, but the inference widget currently runs on a TPU v3-8 (which, if I am not mistaken, is roughly comparable to a cluster of 8 V100s). That gives you a rough idea of the latency for short inputs.

Note that a colleague just reminded me that it is possible to run inference for T5-11B (which is the size we use) on a single (big) GPU, given enough CPU memory, by using offloading -> https://github.com/huggingface/transformers/issues/9996#issu...


On the topic of GPT-3, I asked your creation:

"Who is better, you or GPT-3?"

> GPT-3


It somehow picked up Modesty.


Can this be used to generate prose at length? Or Reddit comment replies?


While in theory it could, the nature of its training favors shorter, more factual replies.


Agreed! The interesting thing is that basic unsupervised pre-training seems to produce a model that functions not only as a knowledge base but also as an NLU system that can effectively query that knowledge base using natural-text questions. This is exactly what our follow-up paper is on.


We include MASS in our empirical survey (see e.g. section 3.3.2 of our paper, https://arxiv.org/pdf/1910.10683.pdf). FWIW, people were pre-training Transformers before MASS, e.g. "Improving Language Understanding by Generative Pre-Training" by Radford et al. from 2018. Even further back, "Semi-Supervised Sequence Learning" by Dai et al. describe pre-training an RNN encoder-decoder model for subsequent transfer.


But Radford et al. just pre-train the decoder, which is qualitatively different from a seq2seq approach such as MASS. If we just look at the original paper from Vaswani, then "pre-training a Transformer" imho should only ever have meant pre-training the encoder and decoder. Obviously that ship has sailed.


The blogpost has a summary of our paper from October (a bit late, sorry!) but also has some (fun?) new results on closed-book question answering and fill-in-the-blank text generation.


Thanks, fixed!


Yes, unfortunately we have to rely on the very brittle "exact match" method of evaluating whether an answer is correct. FWIW and perhaps surprisingly, this is the primary way question-answering systems are evaluated in common benchmarks. I totally agree that fine-tuning T5 for answer grading would be super interesting!


I think it makes some sense to evaluate models like this, as you want to be conservative with the answers you accept (though my second example shows that it isn't always conservative), and models don't have feelings to hurt if they are docked points for not being precise enough. Humans, of course, are more sensitive.


Does that mean that answer grading would become like comparing summaries of a given text?


I'm sorry for being blunt, but is it possible that the `very brittle "exact match" method of evaluating whether an answer is correct` means value equality? Is `==` the secret sauce?


It's slightly more than that -- it also involves lowercasing and removing articles before testing for string equality.
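For anyone curious, a minimal sketch of what such a normalized exact-match check might look like (this mirrors common QA evaluation scripts; the exact normalization used in the paper's code may differ, and the function names here are illustrative):

```python
import re


def normalize(answer: str) -> str:
    """Lowercase and strip English articles before comparison."""
    answer = answer.lower()
    # Remove standalone articles ("a", "an", "the").
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    # Collapse any resulting extra whitespace.
    return " ".join(answer.split())


def exact_match(prediction: str, target: str) -> bool:
    """String equality after normalization."""
    return normalize(prediction) == normalize(target)


print(exact_match("The Eiffel Tower", "eiffel tower"))  # -> True
print(exact_match("Paris, France", "Paris"))            # -> False
```

Note how brittle this is: the second example fails only because of the trailing ", France", even though the answer is arguably correct.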


Why are you replying to every single comment?


I think craffel (probably "Colin Raffel, Senior Research Scientist, Google Research") was directly involved in this research!


Yes, that's me! Sorry if I'm being overeager, I like talking about my research!


I think it's amazing how frequently people involved in various CS and IT things are directly participating in threads about their work here on HN.


Hi, one of the paper authors here. Indeed this is a good question. A couple of comments:

- Common Crawl overall is a sparse web dump; it is unlikely that the month we used includes any of the data that appear in any of the test sets.

- In order for the data to be useful to our model, it would have to be in the correct preprocessed format ("mnli: hypothesis: ... premise: ...") with the label in a format our model could extract meaning from. We introduced this preprocessing format, so I don't believe this would ever happen.

- Further, most of these datasets live in .zip files. The Common Crawl dump doesn't unzip zip files.

- C4 is so large that our model sees each example (corresponding to a block of text from a website) roughly once over the entire course of training. Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps.
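To make the preprocessing point concrete, here is a rough sketch of how an MNLI example gets serialized into the text-to-text format (the field ordering follows the string quoted above; the exact spacing and the helper name are my illustrative assumptions):

```python
def serialize_mnli(premise: str, hypothesis: str, label: int):
    """Turn an MNLI example into a (input text, target text) pair."""
    # Input follows the quoted "mnli: hypothesis: ... premise: ..." shape.
    inputs = f"mnli: hypothesis: {hypothesis} premise: {premise}"
    # The label is also rendered as text, so the model can emit it directly.
    targets = ["entailment", "neutral", "contradiction"][label]
    return inputs, targets


inp, tgt = serialize_mnli("A man is sleeping.", "A man is awake.", 2)
# inp == "mnli: hypothesis: A man is awake. premise: A man is sleeping."
# tgt == "contradiction"
```

A raw web page scraped by Common Crawl would essentially never contain text in exactly this shape, paired with a target the model could extract a label from.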


> Big neural nets trained with SGD are unlikely to memorize something if they only see it once over the course of one million training steps

I am not so sure about that. Have you seen this thread: https://www.reddit.com/r/MachineLearning/comments/dfky70/dis...

Apparently lots of sentence fragments were memorized in GPT-2 (including real world URLs, entire conversations, username/emails and other PII).


It actually can be more pernicious than that: https://arxiv.org/abs/1802.08232

However, note that the dataset used to train GPT-2 is about 20x smaller than C4. I'm not 100% sure how many times the training set was repeated over the course of training for GPT-2, but it was likely many times. I stand by my statement (that memorization is unlikely with SGD and no repetition of training data), but I would be happy to be proven wrong.


Hi, one of the paper's authors here. We didn't submit our model's predictions for the AX-b task yet, we just copied over the predictions from the example submission. We will submit predictions for AX-b in the next few days.


RcouF1uZ4gsC makes a compelling case that the results on this test could be a significant caveat to the overall results, and to the claims of achieving near-human performance. If so, why make such claims before you have these results? Or at least mention this caveat wherever you make the claim, such as in the abstract.


To be clear, here is the claim we make in the paper (we did not write the title of this post to HN):

> For SuperGLUE, we improved upon the state-of-the-art by a large margin (from an average score of 84.6 [Liu et al., 2019c] to 88.9). SuperGLUE was designed to comprise of tasks that were “beyond the scope of current state-of-the-art systems, but solvable by most college-educated English speakers” [Wang et al., 2019b]. We nearly match the human performance of 89.8 [Wang et al., 2019b]. Interestingly, on the reading comprehension tasks (MultiRC and ReCoRD) we exceed human performance by a large margin, suggesting the evaluation metrics used for these tasks may be biased towards machine-made predictions. On the other hand, humans achieve 100% accuracy on both COPA and WSC, which is significantly better than our model’s performance. This suggests that there remain linguistic tasks that are hard for our model to perfect, particularly in the low-resource setting.

I'm not sure why the SuperGLUE/GLUE benchmark was designed to omit the AX-* scores from the benchmark score. It may be that they have no corresponding training set.


My mistake - I had overlooked that the AX-* scores are expressly omitted from these benchmarks. Perhaps, then, they could provide additional headroom for further research?

Regardless of the status of the AX-* tests, I am very impressed by your results on the SuperGLUE benchmark.

