
Thanks for the suggestion! We will add this to the pool of features for a future release. (We are currently running the 40+ annotations on the `tail` partitions.)

If you are interested in contributing the code for these features, feel free to open a PR at https://github.com/togethercomputer/RedPajama-Data! Otherwise we will try our best-effort implementation :) but we hope that this can become a community effort.

(Feel free to create more issues on GitHub for us to keep track. I created one for this: https://github.com/togethercomputer/RedPajama-Data/issues/76)


We did an exact dedup across all 84 dumps; there are 100T tokens before this exact dedup, and 30T tokens after. If we do further fuzzy dedup (we have simhash signatures pre-computed for different similarity levels), this can potentially be reduced further.

There is quite a lot of redundancy across dumps, but also a lot of unique/distinct documents.
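For intuition, exact dedup at this scale is conceptually just content hashing and keeping the first occurrence. A toy sketch (illustrative only, not the actual RedPajama pipeline code; `exact_dedup` is a made-up name):

```python
import hashlib

def exact_dedup(docs):
    """Yield each document only the first time its exact text appears.

    `docs` is an iterable of text strings. At production scale this
    would be a distributed job over hashed shards, not an in-memory set.
    """
    seen = set()
    for doc in docs:
        h = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if h not in seen:
            seen.add(h)
            yield doc

docs = ["hello world", "foo bar", "hello world"]
print(list(exact_dedup(docs)))  # ['hello world', 'foo bar']
```

Fuzzy dedup would replace the exact SHA-256 hash with simhash/minhash signatures and a similarity threshold, which is why those signatures are shipped pre-computed.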


It is around 100TB (84 CommonCrawl dumps, roughly 1TB per dump)


Yes, a small clarification: the 1TB per dump refers to the head+middle partition of the dataset and includes the text documents and the quality signals. There is another ~700GB for the minhash signatures and 1-1.5TB for the documents in the tail split.


What we make available is:

--

(A) the dataset after pre-processing the raw CommonCrawl data (e.g., text extraction and language identification) and some minimal filtering; and

(B) for each document in (A), we also pre-computed 40+ "features" (we call them "quality annotations") that you can use to further filter or deduplicate it. For example, one such feature is "how similar this document is to Wikipedia".

--

(A) is around 30T tokens, but you might want to use features in (B) to further filter/dedup it down, e.g., to 5T. For example, if in your application documents similar to Wikipedia are the most helpful documents, you can take the top documents with the highest score for the feature "how similar this document is to Wikipedia". Of course, the really interesting case happens when you consider a larger subset of these features (or maybe even automatically learn what the best way of filtering it is).
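The "take the top documents by one feature" idea can be sketched in a few lines. This is a hypothetical example with made-up field names, not the exact RedPajama-v2 schema:

```python
def top_k_by_signal(docs, signal, k):
    """Return the k documents with the highest score for one
    quality annotation. `docs` are dicts with a "quality_signals"
    mapping (illustrative structure, not the real schema)."""
    return sorted(
        docs,
        key=lambda d: d["quality_signals"][signal],
        reverse=True,
    )[:k]

docs = [
    {"text": "doc a", "quality_signals": {"wiki_similarity": 0.1}},
    {"text": "doc b", "quality_signals": {"wiki_similarity": 0.9}},
    {"text": "doc c", "quality_signals": {"wiki_similarity": 0.5}},
]
best = top_k_by_signal(docs, "wiki_similarity", 2)
```

The more interesting case in the comment above, combining many features or learning a filter, would replace the single `signal` lookup with a scoring function over all 40+ annotations.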

Our goal is to make this as flexible as possible so that you can fit this into your own application. What we have released is both (A) and (B).

If you have any questions, please let us know! Thanks for your interest; have fun with the data!


Thanks.

> how similar this document is to Wikipedia

So that’s a measure of how similar it is to the background vector of all (language in focus) Wikipedia data?


There are actually a few ways to do this, and we provide four:

- `rps_doc_ml_wikiref_score`: a classifier trained to distinguish random webpages from pages referenced by Wikipedia (used in Llama-1)

- `ccnet_perplexity`: perplexity of an LM trained on Wikipedia (used in CCNet)

- `rps_doc_ml_wikipedia_score`: classifier prediction for the document being a Wikipedia article

- `rps_doc_wikipedia_importance`: used in https://arxiv.org/abs/2302.03169
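These signals can also be combined into a single filter. A hedged sketch using two of the four signal names above; the document structure and both thresholds are made up for illustration:

```python
def wiki_like(doc, ppl_max=300.0, wikiref_min=0.25):
    """Keep a document if it looks Wikipedia-like under two signals:
    low perplexity under a Wikipedia-trained LM, and a high
    wiki-reference classifier score. Thresholds are illustrative only;
    real cutoffs would be tuned per application."""
    qs = doc["quality_signals"]
    return (
        qs["ccnet_perplexity"] < ppl_max
        and qs["rps_doc_ml_wikiref_score"] > wikiref_min
    )

doc = {
    "quality_signals": {
        "ccnet_perplexity": 120.0,
        "rps_doc_ml_wikiref_score": 0.6,
    }
}
keep = wiki_like(doc)
```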

You can see the full table here: https://together.ai/blog/redpajama-data-v2

