I'm following the guide here (https://github.com/huggingface/blog/blob/master/how-to-train.md, https://huggingface.co/blog/how-to-train) to train a RoBERTa-like model from scratch (with my own tokenizer and dataset).
However, when I run run_mlm.py (https://github.com/huggingface/transformers/blob/master/examples/pytorch/language-modeling/run_mlm.py) to train my model on the masked language modeling task, the following messages appear:
All model checkpoint weights were used when initializing RobertaForMaskedLM.
All the weights of RobertaForMaskedLM were initialized from the model checkpoint at roberta-base.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaForMaskedLM for predictions without further training.
I'm wondering whether this means that I'm training from scratch with the pretrained weights of RoBERTa. And if it is training from the pretrained weights, is there a way to use randomly initialized weights rather than the pretrained ones?
==== 2021/10/26 Updated ====
I am training the model on the Masked Language Modeling task with the following command:
python transformer_run_mlm.py \
--model_name_or_path roberta-base \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--validation_split_percentage 10 \
--line_by_line \
--output_dir /my_output_dir/ \
--do_train \
--do_eval \
--per_device_train_batch_size 64 \
--per_device_eval_batch_size 16 \
--learning_rate 1e-4 \
--max_seq_length 1024 \
--seed 42 \
--num_train_epochs 100
The ./my_dir/ directory consists of three files:
config.json, produced by the following code:
from transformers import RobertaModel
model = RobertaModel.from_pretrained('roberta-base')
model.config.save_pretrained(MODEL_CONFIG_PATH)
And here's the content:
{
"_name_or_path": "roberta-base",
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"initializer_range": 0.02,
"intermediate_size": 3072,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_hidden_layers": 12,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.12.0.dev0",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}
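(For reference, the original guide builds the config directly instead of copying it from roberta-base; a minimal sketch of that alternative, where vocab_size=52_000 simply mirrors the value requested when training the tokenizer below:)
from transformers import RobertaConfig

# Build a fresh config instead of reusing roberta-base's;
# vocab_size must match the tokenizer that will actually be used.
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
config.save_pretrained(MODEL_CONFIG_PATH)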
vocab.json and merges.txt, produced by the following code:
from tokenizers.implementations import ByteLevelBPETokenizer
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(files=OUTPUT_DIR + "seed.txt", vocab_size=52_000, min_frequency=2, special_tokens=[
"<s>",
"<pad>",
"</s>",
"<unk>",
"<mask>",
])
# Save files to disk
tokenizer.save_model(MODEL_CONFIG_PATH)
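(As a quick sanity check, and assuming the standard slow RobertaTokenizer loader picks up vocab.json and merges.txt from that directory, the saved tokenizer can be loaded back like this:)
from transformers import RobertaTokenizer

# Reload the freshly trained vocab.json / merges.txt from MODEL_CONFIG_PATH
tokenizer = RobertaTokenizer.from_pretrained(MODEL_CONFIG_PATH)
print(len(tokenizer))                      # size of the trained vocabulary
print(tokenizer.tokenize("Entity PROPN"))  # tokens drawn from the new merges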
And here's the content of vocab.json (a portion of it):
{"<s>":0,"<pad>":1,"</s>":2,"<unk>":3,"<mask>":4,"!":5,"\"":6,"#":7,"$":8,"%":9,"&":10,"'":11,"(":12,")":13,"*":14,"+":15,",":16,"-":17,".":18,"/":19,"0":20,"1":21,"2":22,"3":23,"4":24,"5":25,"6":26,"7":27,"8":28,"9":29,":":30,";":31,"<":32,"=":33,">":34,"?":35,"@":36,"A":37,"B":38,"C":39,"D":40,"E":41,"F":42,"G":43,"H":44,"I":45,"J":46,"K":47,"L":48,"M":49,"N":50,"O":51,"P":52,"Q":53,"R":54,"S":55,"T":56,"U":57,"V":58,"W":59,"X":60,"Y":61,"Z":62,"[":63,"\\":64,"]":65,"^":66,"_":67,"`":68,"a":69,"b":70,"c":71,"d":72,"e":73,"f":74,"g":75,"h":76,"i":77,"j":78,"k":79,"l":80,"m":81,"n":82,"o":83,"p":84,"q":85,"r":86,"s":87,"t":88,"u":89,"v":90,"w":91,"x":92,"y":93,"z":94,"{":95,"|":96,"}":97,"~":98,"¡":99,"¢":100,"£":101,"¤":102,"¥":103,"¦":104,"§":105,"¨":106,"©":107,"ª":108,"«":109,"¬":110,"®":111,"¯":112,"°":113,"±":114,"²":115,"³":116,"´":117,"µ":118,"¶":119,"·":120,"¸":121,"¹":122,"º":123,"»":124,"¼":125,"½":126,"¾":12
And here's the content of merges.txt (a portion of it):
#version: 0.2 - Trained by `huggingface/tokenizers`
e n
T o
k en
Ġ To
ĠTo ken
E R
V ER
VER B
a t
P R
PR O
P N
PRO PN
Ġ n
U N
N O
NO UN
E n
i t
t it
En tit
Entit y
b j
c o
Ġ a
I think you are mixing two distinct actions.
- The first guide you posted explains how to create a model from scratch
- The run_mlm.py script is for fine-tuning an already existing model (see line 17 of the script)
So, if you just want to create a model from scratch, step 1 should be enough. If you want to fine-tune the model you just created, you have to run step 2. Note that training a RoBERTa model from scratch already implies an MLM phase, so this step is useful only if you will have a different dataset in the future and want to improve your model by further fine-tuning it.
However, you are not loading the model you just created; you are loading the roberta-base model from the Huggingface repository: --model_name_or_path roberta-base \
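If the goal is to start from randomly initialized weights, one option (assuming run_mlm.py's behaviour of building a new model from the config when no checkpoint is given, i.e. its "Training new model from scratch" branch) would be to drop that flag and keep only the config and tokenizer, for example:
python transformer_run_mlm.py \
--config_name ./my_dir/ \
--tokenizer_name ./my_dir/ \
--no_use_fast_tokenizer \
--train_file ./my_own_training_file.txt \
--line_by_line \
--do_train \
--do_eval \
--output_dir /my_output_dir/
(keeping the remaining hyperparameters from the original command)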
Coming to the warning, it tells you that you loaded a model (roberta-base, as clarified above) that was pre-trained on the Masked Language Modeling (MaskedLM) task. This means you loaded a checkpoint of an already trained model.
So, quoting:
If your task is similar to the task the model of the checkpoint was
trained on, you can already use RobertaForMaskedLM for predictions
without further training.
This means that, if you are going to perform a MaskedLM task, the model is good to go. If you want to use it for another task (for example, question answering), you should probably fine-tune it, because the model as is would not provide satisfactory results.
Concluding, if you want to create a model from scratch to perform MLM, follow step 1. This will create a model that can perform MLM. If you want to fine-tune an already existing model (see the Huggingface repository) on MLM, follow step 2.
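To make the distinction concrete, this is roughly what run_mlm.py does internally depending on whether a checkpoint is passed (a sketch, not the script's exact code):
from transformers import AutoConfig, AutoModelForMaskedLM

config = AutoConfig.from_pretrained("./my_dir/")

# What your current command does: load the pre-trained roberta-base checkpoint (fine-tuning)
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Training from scratch: same architecture, but randomly initialized weights
model = AutoModelForMaskedLM.from_config(config)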
Thanks for the answer. I'm loading the model to perform the Masked Language Modeling task on a dataset in a different language. I'm still wondering whether the messages mean that I'm training on top of the pretrained weights of the model, or on randomly initialized weights with the same model structure; the former is more like fine-tuning the given weights, while the latter is more like training a model of my own.
– Chaoannricardo
Oct 26, 2021 at 12:08
I would say that the fact that it's loading checkpoint weights means it's the first option, but the guide is about training from scratch... Can you edit the original post and provide the lines that produce the warning?
Oct 26, 2021 at 13:04
I've edited the post with my training command and the configs I'm using; please feel free to let me know if there's any more insight into the question.
Oct 26, 2021 at 14:00
I think that what you suggest is feasible; I suggest that you reach out to the Huggingface forum for further help. Regarding the language, I'm afraid that approach probably won't work. Which language are you planning to use? Try to look whether a pre-trained model already exists in the HF repo.
Oct 26, 2021