
Saturday, 6 May 2023

BertModel weights are randomly initialized?

Recently, I've been trying to re-implement DiffCSE.

While refactoring the code that the authors uploaded to GitHub, I've run into some issues.

I have two questions.

1. If I set a seed, e.g. set_seed(30), I was under the impression that the model would always get the same initial weights and therefore produce the same result when training. But it seems I was wrong. For example:

from transformers import AutoConfig, BertModel, set_seed

set_seed(30)
config = AutoConfig.from_pretrained('bert-base-uncased')
a = BertModel(config)
b = BertModel(config)
a_query = a.encoder.layer[0].attention.self.query.weight
b_query = b.encoder.layer[0].attention.self.query.weight
a_query == b_query
# tensor([[False, False, False,  ..., False, False, False],
#         [False, False, False,  ..., False, False, False],
#         [False, False, False,  ..., False, False, False],
#         ...,
#         [False, False, False,  ..., False, False, False],
#         [False, False, False,  ..., False, False, False],
#         [False, False, False,  ..., False, False, False]])

print(a_query, b_query)
Parameter containing:
tensor([[ 0.0168, -0.0072,  0.0141,  ...,  0.0060, -0.0098, -0.0361],
        [ 0.0121, -0.0106,  0.0169,  ..., -0.0512,  0.0154, -0.0251],
        [ 0.0252,  0.0375,  0.0215,  ..., -0.0097, -0.0009, -0.0102],
        ...,
        [ 0.0038,  0.0120, -0.0205,  ..., -0.0082, -0.0066,  0.0125],
        [ 0.0032, -0.0330,  0.0073,  ...,  0.0072,  0.0484,  0.0143],
        [-0.0153,  0.0207, -0.0086,  ..., -0.0087, -0.0032,  0.0022]],
       requires_grad=True) Parameter containing:
tensor([[ 0.0239,  0.0236,  0.0181,  ..., -0.0331,  0.0062,  0.0142],
        [-0.0116,  0.0417, -0.0379,  ...,  0.0059,  0.0207,  0.0155],
        [ 0.0178,  0.0017,  0.0064,  ..., -0.0007,  0.0405, -0.0170],
        ...,
        [ 0.0115,  0.0039, -0.0508,  ...,  0.0187,  0.0043, -0.0048],
        [ 0.0025, -0.0079, -0.0132,  ..., -0.0003, -0.0079,  0.0320],
        [-0.0105, -0.0097, -0.0076,  ...,  0.0214, -0.0068,  0.0016]],
       requires_grad=True)

I can't understand why this happens. Also, every time I execute this code, the weights differ from those of the previous run.

2. There are many models provided by Hugging Face. For BERT, they have BertModel, BertForPreTraining, BertForMaskedLM, etc. As far as I know, the only difference between these BERT models is whether they have heads on top of the final layer.
Are the heads also pretrained, or are they just randomly initialized weights provided for users' convenience?



====
A:

You have a small misunderstanding of how seeds work. The seed defines the sequence of random values that will be sampled; it doesn't reset after each sample. This means the sampled sequence is the same whenever you start from the same seed. For example, if you have code like:

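import torch

torch.manual_seed(1)    # fix the starting point of the random stream
sample = torch.rand(4)  # the same four values on every run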

You should always get the same four values because the seed defines this sequence. In your case, you create two BERT models without resetting the seed in between, so the starting point for each initialization isn't the same. To get the same weights, reset the seed before each initialization of BERT.
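As a minimal sketch of that fix, the snippet below reseeds before building a second copy of the model. The tiny BertConfig values and the query_weight helper are assumptions made only so the example runs quickly without downloading anything; they are not part of the original question.

```python
import torch
from transformers import BertConfig, BertModel, set_seed

# Hypothetical tiny config (an assumption) so the sketch runs fast and offline.
config = BertConfig(
    vocab_size=100,
    hidden_size=32,
    num_hidden_layers=1,
    num_attention_heads=2,
    intermediate_size=64,
)


def query_weight(model):
    """Return the query projection weight of the first encoder layer."""
    return model.encoder.layer[0].attention.self.query.weight


set_seed(30)
a = BertModel(config)
b = BertModel(config)  # continues the random stream where `a` stopped

set_seed(30)           # go back to the same starting point
c = BertModel(config)  # initialized from the same point as `a`

same_without_reset = torch.equal(query_weight(a), query_weight(b))
same_after_reset = torch.equal(query_weight(a), query_weight(c))
print(same_without_reset, same_after_reset)
```

Here I'd expect `a` and `c` to match while `a` and `b` differ, since only `c` was created right after resetting the seed.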

##Edit

To better understand what the seed does, think of it as a starting point. Imagine that setting the seed to 30 tells the computer to sample the following numbers: 1, 2, 3, 5, 6. Calling the sample function once returns 1; calling it again returns 2, and so on. What you are doing is sampling twice from the same sequence without going back to the start, so the second model begins where the first one stopped and gets different values.
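The same idea can be sketched with Python's built-in random module, used here only as a stand-in for the generator that initializes the model weights:

```python
import random

random.seed(30)
first = [random.random() for _ in range(3)]
second = [random.random() for _ in range(3)]  # continues the same stream

random.seed(30)                               # back to the starting point
replay = [random.random() for _ in range(3)]  # repeats `first`

print(first == second)
print(first == replay)
```

The second draw differs from the first because it picks up where the first one stopped; only the draw made right after reseeding reproduces it.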

===============

A:

When you use AutoConfig.from_pretrained, only the model's configuration is loaded, such as the number of layers and the dimension of each layer. Building BertModel(config) from that configuration therefore gives you randomly initialized weights. When you use AutoModel.from_pretrained, the actual pretrained parameters are loaded, so you should do the following:

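from transformers import AutoModel

# from_pretrained loads the pretrained weights, so a and b get identical parameters
a = AutoModel.from_pretrained('bert-base-uncased')
b = AutoModel.from_pretrained('bert-base-uncased')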
