AI, whoa--! (ai-wo-katsuyo-shitai! -- "I want to make good use of AI!")

I want to show I can use AI well!! In my own way.

So, what data is BERT actually pre-trained on?

Tons of answers come back in zero seconds.... Something to reflect on.

According to Wikipedia:

(Both models were) pre-trained on the Toronto BookCorpus[12] (800 million words) and English Wikipedia (2.5 billion words).

Quoted from:

https://ja.wikipedia.org/wiki/BERT_(%E8%A8%80%E8%AA%9E%E3%83%A2%E3%83%87%E3%83%AB)

And according to the original BERT paper itself:

Pre-training data The pre-training procedure largely follows the existing literature on language model pre-training. For the pre-training corpus we use the BooksCorpus (800M words) (Zhu et al., 2015) and English Wikipedia (2,500M words).

https://arxiv.org/pdf/1810.04805.pdf
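
To make this a bit more hands-on, here is a minimal sketch (my addition, not from the paper) of how you might peek at publicly available stand-ins for these two corpora with the Hugging Face `datasets` library. The Hub dataset names "bookcorpus" and "wikipedia" (with the "20220301.en" dump) are assumptions on my part; they only approximate the original BooksCorpus and the 2018 Wikipedia snapshot actually used for BERT.

```python
# Sketch only: public stand-ins for BERT's pre-training corpora.
# "bookcorpus" and "wikipedia"/"20220301.en" are Hub dataset names I am
# assuming here; they approximate, not reproduce, the original 2018 data.
from datasets import load_dataset

# BooksCorpus stand-in (the paper's version: ~800M words)
books = load_dataset("bookcorpus", split="train", streaming=True)
print(next(iter(books))["text"][:200])

# English Wikipedia stand-in (the paper's version: ~2,500M words)
wiki = load_dataset("wikipedia", "20220301.en", split="train", streaming=True)
print(next(iter(wiki))["text"][:200])
```

Streaming mode avoids downloading the full corpora just to look at a few examples.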

Incidentally, for XLNet:

3.1 Pretraining and Implementation Following BERT [10], we use the BooksCorpus [40] and English Wikipedia as part of our pretraining data, which have 13GB plain text combined. In addition, we include Giga5 (16GB text) [26], ClueWeb 2012-B (extended from [5]), and Common Crawl [6] for pretraining. We use heuristics to aggressively filter out short or low-quality articles for ClueWeb 2012-B and Common Crawl, which results in 19GB and 110GB text respectively. After tokenization with SentencePiece [17], we obtain 2.78B, 1.09B, 4.75B, 4.30B, and 19.97B subword pieces for Wikipedia, BooksCorpus, Giga5, ClueWeb, and Common Crawl respectively, which are 32.89B in total.

https://arxiv.org/pdf/1906.08237.pdf
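
The XLNet quote above measures everything in SentencePiece "subword pieces", so here is a tiny sketch (my addition) of what that tokenization looks like, using the `transformers` library. The checkpoint name "xlnet-base-cased" is an assumption, not something either paper specifies.

```python
# Sketch only: what "subword pieces" means in the XLNet quote above.
# "xlnet-base-cased" is an assumed checkpoint name; any SentencePiece-based
# XLNet tokenizer would illustrate the same idea.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlnet-base-cased")

text = "BERT was pre-trained on BooksCorpus and English Wikipedia."
pieces = tokenizer.tokenize(text)
print(pieces)       # SentencePiece subword pieces (note the '▁' word-boundary marker)
print(len(pieces))  # XLNet's 32.89B total is this count summed over its whole corpus
```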

If you want to learn about all sorts of models, the following survey paper is... recommended:

https://arxiv.org/pdf/2108.05542.pdf

Kalyan, Katikapalli Subramanyam, Ajit Rajasekharan, and Sivanesan Sangeetha. "Ammus: A survey of transformer-based pretrained models in natural language processing." arXiv preprint arXiv:2108.05542 (2021).

Comments

If you have any advice or comments, please feel free to share them....
This article was zero-second information!!!