Sunday, 7 May 2023

Questions when training language models from scratch with Huggingface

I'm following the guide here (https://github.com/huggingface/blog/blob/master/how-to-train.md, https://huggingface.co/blog/how-to-train) to train a RoBERTa-like model from scratch, with my own tokenizer and dataset.

However, when I run run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) to train my model on the masked language modeling task, the following messages appear:

All model checkpoint weights were used when initializing RobertaForMaskedLM.

All the weights of RobertaForMaskedLM were initialized from the model checkpoint at roberta-base.

If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.

Does this mean that I'm training "from scratch" but starting from the pretrained weights of RoBERTa? And if it is training from the pretrained weights, is there a way to use randomly initialized weights instead?

==== 2021/10/26 Updated ===

I am training the model on the Masked Language Modeling task with the following command:

python transformer_run_mlm.py \
--model_name_or_path roberta-base  \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--validation_split_percentage 10 \
--line_by_line \
--output_dir /my_output_dir/ \
--do_train \
--do_eval \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--max_seq_length 1024 \
--seed 42 \
--num_train_epochs 100 

The ./my_dir/ consists of three files:

config.json, produced by the following code:

from transformers import RobertaModel

model = RobertaModel.from_pretrained('roberta-base')
model.config.save_pretrained(MODEL_CONFIG_PATH)

And here's the content:

{
  "_name_or_path": "roberta-base",
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": 0,
  "classifier_dropout": null,
  "eos_token_id": 2,
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "layer_norm_eps": 1e-05,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_hidden_layers": 12,
  "pad_token_id": 1,
  "position_embedding_type": "absolute",
  "transformers_version": "4.12.0.dev0",
  "type_vocab_size": 1,
  "use_cache": true,
  "vocab_size": 50265
}

vocab.json and merges.txt, produced by the following code:

from tokenizers.implementations import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

tokenizer.train(files=OUTPUT_DIR + "seed.txt", vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])

# Save files to disk
tokenizer.save_model(MODEL_CONFIG_PATH)

And here's the content of vocab.json (excerpt):

{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":12

And here's the content of merges.txt (excerpt):

#version: 0.2 - Trained by `huggingface/tokenizers`
e n
T o
k en
Ġ To
ĠTo ken
E R
V ER
VER B
a t
P R
PR O
P N
PRO PN
Ġ n
U N
N O
NO UN
E n
i t
t it
En tit
Entit y
b j
c o
Ġ a

I think you are mixing two distinct actions.

  1. The first guide you posted explains how to create a model from scratch.
  2. The run_mlm.py script is for fine-tuning an already existing model (see line 17 of the script).

So, if you just want to create a model from scratch, step 1 should be enough. If you want to fine-tune the model you just created, you have to run step 2. Note that training a RoBERTa model from scratch already includes an MLM phase, so step 2 is useful only if you later have a different dataset and want to improve your model by fine-tuning it further.

However, you are not loading the model you just created. You are loading the roberta-base model from the Huggingface repository, because your command passes --model_name_or_path roberta-base.
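
To make the difference concrete, here is a minimal sketch of building a RoBERTa MLM model from a config alone, so that no checkpoint is read and the weights are random. The helper name build_untrained_roberta and the tiny layer sizes are my own choices for illustration, not something from your setup or from the guide; for a real run you would use your own config values.

```python
import torch
from transformers import RobertaConfig, RobertaForMaskedLM


def build_untrained_roberta(vocab_size: int, seed: int = 0) -> RobertaForMaskedLM:
    """Return a RoBERTa MLM model whose weights are randomly initialized."""
    torch.manual_seed(seed)  # makes the random initialization reproducible
    # Deliberately tiny sizes so the sketch runs quickly (NOT the roberta-base sizes).
    config = RobertaConfig(
        vocab_size=vocab_size,
        hidden_size=64,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=128,
        max_position_embeddings=514,
        type_vocab_size=1,
    )
    # Constructing the model from a config never reads a checkpoint, so nothing
    # pretrained is loaded. from_pretrained("roberta-base") would load weights.
    return RobertaForMaskedLM(config)


model = build_untrained_roberta(vocab_size=52_000)
print(model.get_input_embeddings().weight.shape)
```

As far as I can tell from the source of run_mlm.py at the time, leaving out --model_name_or_path makes the script take this config-only path with --config_name and log "Training new model from scratch"; it is worth verifying that against the version you have installed.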


Coming to the message: it tells you that you loaded a model (roberta-base, as clarified above) that was pre-trained on the Masked Language Modeling (MaskedLM) task. In other words, you loaded a checkpoint of an already trained model. So, quoting:

If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.

This means that, if you are going to perform a MaskedLM task, the model is good to go. If you want to use it for another task (for example, question answering), you should probably fine-tune it, because the model as is would not give satisfactory results.


Concluding, if you want to create a model from scratch to perform MLM, follow step 1. This will create a model that can perform MLM.

If you want to fine-tune an already existing model with MLM (see the Huggingface repository), follow step 2.
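
One related thing that may be worth checking before training: the config.json in the question was copied from roberta-base (vocab_size 50265, max_position_embeddings 514), while the tokenizer was trained with vocab_size=52_000 and the command uses --max_seq_length 1024. The sketch below is a hypothetical helper (check_model_dir is my own name, not part of any library) that compares the two files using only the standard library. The usable-length formula is my assumption about how RoBERTa offsets position ids by the padding index.

```python
import json
from pathlib import Path


def check_model_dir(model_dir, max_seq_length=None):
    """Return a list of mismatches between config.json and vocab.json."""
    model_dir = Path(model_dir)
    config = json.loads((model_dir / "config.json").read_text(encoding="utf-8"))
    vocab = json.loads((model_dir / "vocab.json").read_text(encoding="utf-8"))

    problems = []
    # Every token id must index a row of the embedding matrix.
    largest_id = max(vocab.values())
    if largest_id >= config["vocab_size"]:
        problems.append(
            f"largest token id {largest_id} does not fit vocab_size {config['vocab_size']}"
        )
    # Assumption: position ids start after pad_token_id, so the usable
    # sequence length is max_position_embeddings - pad_token_id - 1.
    usable = config["max_position_embeddings"] - config.get("pad_token_id", 1) - 1
    if max_seq_length is not None and max_seq_length > usable:
        problems.append(f"max_seq_length {max_seq_length} exceeds usable length {usable}")
    return problems
```

Calling check_model_dir("./my_dir/", max_seq_length=1024) returns an empty list when the files agree, and one message per mismatch otherwise.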

  • Thanks for the answer. I'm loading the model to perform the Masked Language Modeling task on a dataset in a different language. I'm still wondering whether the messages mean that I'm training on top of the pretrained weights of the model, or on randomly initialized weights with the same model structure. The former is more like fine-tuning the given weights, while the latter is more like training a model of my own. Oct 26, 2021 at 12:08
  • I would say that the fact that it's loading checkpoint weights means it's the first option, but the guide is about training from scratch... Can you edit the original post and provide the lines that produce the warning? Oct 26, 2021 at 13:04
  • I've edited the post with my training commands and the configs I'm using. Please feel free to let me know if there's any more insight into the question. Oct 26, 2021 at 14:00
  • I think I got it now. Updated answer. Oct 26, 2021 at 14:34
  • I think that what you suggest is feasible. I suggest that you reach out to the Huggingface forum for further help. Regarding the language, I'm afraid that approach probably won't work. Which language are you planning to use? Try to check whether a pre-trained model already exists in the HF repo. Oct 26, 2021