Recently, I've been trying to re-implement DiffCSE.
While refactoring the code that the authors uploaded to GitHub, I've run into some issues.
I have two questions.
1.
If I set a seed with set_seed(30), I was under the impression
that the model would get the same initial weights every time, and would
therefore produce the same results when training. But it seems I was wrong.
For example:
from transformers import AutoConfig, BertModel
config = AutoConfig.from_pretrained('bert-base-uncased')
a = BertModel(config)
b = BertModel(config)
a_query = a.encoder.layer[0].attention.self.query.weight
b_query = b.encoder.layer[0].attention.self.query.weight
a_query == b_query
# tensor([[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
...,
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False],
[False, False, False, ..., False, False, False]])
print(a_query, b_query)
Parameter containing:
tensor([[ 0.0168, -0.0072, 0.0141, ..., 0.0060, -0.0098, -0.0361],
[ 0.0121, -0.0106, 0.0169, ..., -0.0512, 0.0154, -0.0251],
[ 0.0252, 0.0375, 0.0215, ..., -0.0097, -0.0009, -0.0102],
...,
[ 0.0038, 0.0120, -0.0205, ..., -0.0082, -0.0066, 0.0125],
[ 0.0032, -0.0330, 0.0073, ..., 0.0072, 0.0484, 0.0143],
[-0.0153, 0.0207, -0.0086, ..., -0.0087, -0.0032, 0.0022]],
requires_grad=True) Parameter containing:
tensor([[ 0.0239, 0.0236, 0.0181, ..., -0.0331, 0.0062, 0.0142],
[-0.0116, 0.0417, -0.0379, ..., 0.0059, 0.0207, 0.0155],
[ 0.0178, 0.0017, 0.0064, ..., -0.0007, 0.0405, -0.0170],
...,
[ 0.0115, 0.0039, -0.0508, ..., 0.0187, 0.0043, -0.0048],
[ 0.0025, -0.0079, -0.0132, ..., -0.0003, -0.0079, 0.0320],
[-0.0105, -0.0097, -0.0076, ..., 0.0214, -0.0068, 0.0016]],
requires_grad=True)
I can't understand why this happens. Also, every time I run this code, the weights are different from the previous run.
2.
Hugging Face provides many models. When it comes to BERT,
there are BertModel, BertForPreTraining, BertForMaskedLM, etc. As far
as I know, the only difference between these BERT models is whether they
have a head on top of the base model or not.
So, are the heads also pretrained, or are their weights just randomly initialized and provided for users' convenience?
====
A:
You have a small misunderstanding of how seeds work. The seed defines the sequence of random values that will be sampled; it doesn't reset after each sample. This means the sampled sequence will be the same whenever you start from the same seed. For example, if you have code like:
seed = 1
sample = sample_4_values()
You should always get the same four values, because the seed defines this sequence. In your case, you create two BERT models without resetting the seed in between, so the starting point for each initialization isn't the same. To get the same weights, reset the seed before each initialization of BERT.
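As a minimal sketch of resetting the seed before each initialization, the snippet below uses a small torch.nn.Linear layer as a stand-in for BertModel(config), so it runs without downloading anything. The helper name make_layer is hypothetical, and torch.manual_seed is used here in place of the set_seed call from the question; calling set_seed(30) before each BertModel(config) should play the same role.

```python
import torch
from torch import nn

def make_layer():
    # Hypothetical stand-in for BertModel(config):
    # any module whose weights are randomly initialized.
    return nn.Linear(8, 8)

torch.manual_seed(30)
a = make_layer()
b = make_layer()  # continues the same random sequence, so it gets different weights

torch.manual_seed(30)
c = make_layer()  # seed was reset, so the random sequence starts over

print(torch.equal(a.weight, b.weight))  # a and b were drawn from different points
print(torch.equal(a.weight, c.weight))  # a and c were drawn from the same point
```

Here a and c end up with identical weights because each was created right after seeding, while b was created partway through the sequence.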
##Edit
To better understand what the seed does, think of it as a starting point. Imagine that setting the seed to 30 tells the computer to sample the following numbers: 1, 2, 3, 5, 6. Calling the sample function once will return 1; calling it again will return 2, and so on. What you are doing is sampling twice, and each time your starting point in the sequence is different.
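The "starting point" idea can be illustrated with Python's standard random module alone. This is only a sketch of the general behavior of seeded generators, not of how BERT weights are drawn; the helper sample_values is a hypothetical name introduced for the example.

```python
import random

def sample_values(n):
    # Draw n floats from Python's global random generator.
    return [random.random() for _ in range(n)]

random.seed(30)
first = sample_values(4)
second = sample_values(4)  # picks up where `first` left off in the sequence

random.seed(30)            # back to the same starting point
first_again = sample_values(4)
second_again = sample_values(4)

print(first == first_again)    # same seed, same position in the sequence
print(second == second_again)  # same seed, same position in the sequence
print(first == second)         # same seed, but different positions
```

The whole sequence is reproducible from the seed, but two consecutive draws within one run are different parts of that sequence.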
===============
A:
When you use AutoConfig to load a model, only the model's configuration is loaded, such as the number of layers and the dimension of each layer. A model built from that configuration alone, as in BertModel(config), has randomly initialized weights. When you use AutoModel.from_pretrained to load the model, the actual pretrained parameters are loaded, so you should do the following:
from transformers import AutoModel
a = AutoModel.from_pretrained('bert-base-uncased')
b = AutoModel.from_pretrained('bert-base-uncased')