I'm following the guide here (https://github.com/huggingface/blog/blob/master/how-to-train.md, https://huggingface.co/blog/how-to-train) to train a RoBERTa-like model from scratch (with my own tokenizer and dataset).
However, when I run run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) to train my model on the masked language modeling task, the following messages appear:
All model checkpoint weights were used when initializing RobertaForMaskedLM.
All the weights of RobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.
I'm wondering whether this means that I'm training from scratch with the pretrained weights of RoBERTa. And if it is training from the pretrained weights, is there a way to use randomly initialized weights rather than the pretrained ones?
==== 2021/10/26 Updated ====
I am training the model on the Masked Language Modeling task with the following command:
python transformer_run_mlm.py \
--model_name_or_path roberta-base \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--validation_split_percentage 10 \
--line_by_line \
--output_dir /my_output_dir/ \
--do_train \
--do_eval \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--max_seq_length 1024 \
--seed 42 \
--num_train_epochs 100
The ./my_dir/ directory consists of three files:
config.json, produced by the following code:
from transformers import RobertaModel
model = RobertaModel.from_pretrained('roberta-base')
model.config.save_pretrained(MODEL_CONFIG_PATH)
And here's the content:
{
"_name_or_path": "roberta-base",
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.12.0.dev0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}
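(For reference, the original guide builds the config directly instead of copying it from roberta-base; a minimal sketch of that alternative, where vocab_size=52_000 simply mirrors the value requested when training the tokenizer below:)
from transformers import RobertaConfig

# Build a fresh config instead of reusing roberta-base's;
# vocab_size must match the tokenizer that will actually be used.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
config.save_pretrained(MODEL_CONFIG_PATH)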
vocab.json and merges.txt, produced by the following code:
from tokenizers.implementations import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=OUTPUT_DIR + "seed.txt", vocab_size=52_000, min_frequency=2, special_tokens=[
"<s>",
"<pad>",
"</s>",
"<unk>",
"<mask>",
])
# Save files to disk
tokenizer.save_model(MODEL_CONFIG_PATH)
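(As a quick sanity check, and assuming the standard slow RobertaTokenizer loader picks up vocab.json and merges.txt from that directory, the saved tokenizer can be loaded back like this:)
from transformers import RobertaTokenizer

# Reload the freshly trained vocab.json / merges.txt from MODEL_CONFIG_PATH
tokenizer = RobertaTokenizer.from_pretrained(MODEL_CONFIG_PATH)
print(len(tokenizer))                      # size of the trained vocabulary
print(tokenizer.tokenize("Entity PROPN"))  # tokens drawn from the new merges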
And here's the content of vocab.json (a portion of it):
{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":12
And here's the content of merges.txt (a portion of it):
#version: 0.2 - Trained by `huggingface/tokenizers`
e n
T o
k en
Ġ To
ĠTo ken
E R
V ER
VER B
a t
P R
PR O
P N
PRO PN
Ġ n
U N
N O
NO UN
E n
i t
t it
En tit
Entit y
b j
c o
Ġ a
I think you are mixing two distinct actions.
- The first guide you posted explains how to create a model from scratch
- The run_mlm.py script is for fine-tuning an already existing model (see line 17 of the script)
So, if you just want to create a model from scratch, step 1 should be enough. If you want to fine-tune the model you just created, you have to run step 2. Note that training a RoBERTa model from scratch already implies an MLM phase, so this step is useful only if you will have a different dataset in the future and want to improve your model by further fine-tuning it.
However, you are not loading the model you just created; you are loading the roberta-base model from the Huggingface repository: --model_name_or_path roberta-base \
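If the goal is to start from randomly initialized weights, one option (assuming run_mlm.py's behaviour of building a new model from the config when no checkpoint is given, i.e. its "Training new model from scratch" branch) would be to drop that flag and keep only the config and tokenizer, for example:
python transformer_run_mlm.py \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--line_by_line \
--do_train \
--do_eval \
--output_dir /my_output_dir/
(keeping the remaining hyperparameters from the original command)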
Coming to the warning, it tells you that you loaded a model (roberta-base, as clarified above) that was pre-trained on the Masked Language Modeling (MaskedLM) task. This means you loaded a checkpoint of an already trained model.
So, quoting:
If your task is similar to the task the model of the checkpoint was
trained on, you can already use RobertaForMaskedLM for predictions
without further training.
This means that, if you are going to perform a MaskedLM task, the model is good to go. If you want to use it for another task (for example, question answering), you should probably fine-tune it, because the model as is would not provide satisfactory results.
Concluding, if you want to create a model from scratch to perform MLM, follow step 1. This will create a model that can perform MLM. If you want to fine-tune an already existing model (see the Huggingface repository) on MLM, follow step 2.
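To make the distinction concrete, this is roughly what run_mlm.py does internally depending on whether a checkpoint is passed (a sketch, not the script's exact code):
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("./my_dir/")

# What your current command does: load the pre-trained roberta-base checkpoint (fine-tuning)
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Training from scratch: same architecture, but randomly initialized weights
model = AutoModelForMaskedLM.from_config(config)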
Thanks for the answer. I'm loading the model to perform the Masked Language Modeling task on a dataset in a different language. I'm still wondering whether the messages mean that I'm training on top of the pretrained weights of the model, or on randomly initialized weights with the same model structure; the former is more like fine-tuning the given weights, while the latter is more like training a model of my own.
– Chaoannricardo
Oct 26, 2021 at 12:08
I would say that the fact that it's loading checkpoint weights means it's the first option, but the guide is about training from scratch... Can you edit the original post and provide the lines that produce the warning?
Oct 26, 2021 at 13:04
I've edited the post with my training command and the configs I'm using; please feel free to let me know if there's any more insight into the question.
Oct 26, 2021 at 14:00
I think that what you suggest is feasible; I suggest that you reach out to the Huggingface forum for further help. Regarding the language, I'm afraid that approach probably won't work. Which language are you planning to use? Try to look whether a pre-trained model already exists in the HF repo.
Oct 26, 2021