Researchers from EleutherAI have open-sourced GPT-NeoX-20B, a 20-billion-parameter natural language processing (NLP) AI model similar to GPT-3. The model was trained on 825GB of publicly available text data and performs comparably to similarly sized GPT-3 models.
The release was announced on the EleutherAI blog. GPT-NeoX-20B was trained on EleutherAI's open-source Pile dataset using NVIDIA A100-SXM4-40GB GPUs. When evaluated on several common NLP benchmark tasks, GPT-NeoX-20B achieved accuracy roughly matching a linear interpolation between OpenAI's Curie and DaVinci models, while its one-shot performance on the MATH test dataset exceeded that of GPT-3 175B. EleutherAI claims that GPT-NeoX-20B is the largest open-source pre-trained autoregressive language model available, and
We hope that the increased accessibility of models of this size will aid in research towards the safe use of AI systems, and encourage anyone interested in working in this direction to reach out to us.
OpenAI first published a paper on generative pre-trained transformers (GPT) in 2018 and released their 1.5B-parameter GPT-2 model in 2019. In 2020, OpenAI announced the 175B-parameter GPT-3 model but did not release the trained model files. Instead, OpenAI provided an API that allows developers to integrate the model into their code via web service calls. Since then, several models larger than GPT-2 have been open-sourced, including Megatron-11B, PanGu-α-13B, Meta's Fairseq 13B, and EleutherAI's earlier models GPT-Neo and GPT-J-6B, which InfoQ covered last year.
In addition to these open-source models, there are even larger models, such as GPT-3, with hundreds of billions or even trillions of parameters. However, according to EleutherAI, these are "almost universally" either gated by an API or not publicly available at all. Part of EleutherAI's motivation for releasing their models is their belief that open access to such models is necessary for advancing research in the field, since it is their large scale that makes them interesting.
The architecture of GPT-NeoX-20B is similar to that of GPT-3, with a few key differences. First, GPT-NeoX-20B uses rotary positional embeddings instead of learned embeddings for encoding token position. Second, GPT-NeoX-20B computes the attention and feed-forward layers in parallel rather than in series, yielding a 15% throughput increase. Finally, where GPT-3 alternates sparse and dense attention layers, GPT-NeoX-20B uses only dense layers.
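The first two differences can be illustrated with a short PyTorch sketch. This is not EleutherAI's implementation: the single shared layer norm, the use of PyTorch's built-in nn.MultiheadAttention, the omission of the causal mask, and applying the rotation to a full feature vector (rather than to a fraction of each attention head's query/key dimensions, as the real model does) are all simplifying assumptions.

import torch
import torch.nn as nn


def apply_rotary(x: torch.Tensor, base: int = 10000) -> torch.Tensor:
    """Rotary position embedding: rotate pairs of feature dimensions by an
    angle proportional to the token position (simplified, full-dimension)."""
    seq_len, dim = x.shape[-2], x.shape[-1]
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    pos = torch.arange(seq_len, dtype=torch.float32)
    angles = torch.einsum("s,d->sd", pos, inv_freq)          # (seq, dim/2)
    sin, cos = angles.sin(), angles.cos()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = torch.stack([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return rotated.flatten(-2)                               # back to (..., seq, dim)


class ParallelTransformerBlock(nn.Module):
    """Attention and feed-forward computed from the same normalized input and
    summed into the residual, rather than applied one after the other."""
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        # Parallel formulation: x + Attention(LN(x)) + FeedForward(LN(x)),
        # instead of the serial x -> attention -> feed-forward ordering.
        return x + attn_out + self.ff(h)

Because the attention and feed-forward branches read the same normalized input, their matrix multiplications can be scheduled together, which is where the reported throughput gain comes from.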
GPT-NeoX-20B was trained using EleutherAI's custom codebase (also called GPT-NeoX), which is based on Megatron and DeepSpeed and is implemented in PyTorch. Because the model is too large to fit into a single GPU, the team used model parallelism as well as data parallelism during training. In addition, since the team's compute budget constraints made hyperparameter search "intractable," they chose to re-use the hyperparameters published in the GPT-3 paper.
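The combination of the two forms of parallelism can be pictured with a toy PyTorch sketch. This is not the GPT-NeoX codebase, which uses Megatron-style tensor parallelism and DeepSpeed pipeline parallelism; the two-GPU layer split below is a deliberately simplified stand-in.

import torch
import torch.nn as nn


class TwoStagePipeline(nn.Module):
    """Toy model parallelism: the first half of the layers lives on cuda:0,
    the second half on cuda:1, and activations are moved between devices."""
    def __init__(self, d_model: int = 1024, n_layers: int = 8):
        super().__init__()
        half = n_layers // 2
        self.stage0 = nn.Sequential(
            *[nn.Linear(d_model, d_model) for _ in range(half)]
        ).to("cuda:0")
        self.stage1 = nn.Sequential(
            *[nn.Linear(d_model, d_model) for _ in range(n_layers - half)]
        ).to("cuda:1")

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.stage0(x.to("cuda:0"))
        return self.stage1(x.to("cuda:1"))


# Data parallelism then replicates this multi-GPU model shard across further
# GPU groups (e.g. via torch.nn.parallel.DistributedDataParallel), with each
# replica processing a different slice of every training batch.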
The researchers evaluated GPT-NeoX-20B on a "diverse collection" of NLP benchmarks, including LAMBADA and WinoGrande, as well as the HendrycksTest knowledge benchmark and MATH dataset. They compared its performance to their previous GPT-J-6B model as well as Meta's Fairseq 13B and several different sizes of GPT-3. According to the team, the performance of GPT-NeoX-20B on NLP tasks "could be improved," but its performance on science and math tasks "excels."
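Readers who want to probe the released checkpoint themselves can do so with the Hugging Face transformers library; the sketch below assumes the weights are mirrored on the Hugging Face Hub under the "EleutherAI/gpt-neox-20b" identifier and that enough GPU memory (roughly 40GB and up in fp16) or offloading capacity is available.

# Minimal sketch of prompting the released checkpoint via Hugging Face
# transformers; requires the accelerate package for device_map="auto".
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-neox-20b")
model = AutoModelForCausalLM.from_pretrained(
    "EleutherAI/gpt-neox-20b",
    torch_dtype=torch.float16,   # half precision to reduce memory use
    device_map="auto",           # spread layers across available devices
)

prompt = "EleutherAI's GPT-NeoX-20B is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(output[0], skip_special_tokens=True))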
EleutherAI researcher Connor Leahy answered several questions about the model on Twitter. When asked about the impact of trying different random initialization seeds, Leahy replied:
We only had enough compute for one 20B run, so we did not compare random seeds. We haven't seen noticeable fluctuations based on seed in smaller models though. [Large language models] tend to converge to similar loss, they aren't as unstable as [reinforcement learning].