Hire a web Developer and Designer to upgrade and boost your online presence with cutting edge Technologies

Saturday, 6 May 2023

Running out of Memory Training Google Big Bird with Huggingface

 I've been struggling to train Google's Big Bird model using the Huggingface transformers library due to out of memory errors. I have two Tesla V100 GPU's with 32 GB RAM each. I'm trying to train the google/bigbird-roberta-base model (https://huggingface.co/google/bigbird-roberta-base) on Spider (a natural language to SQL dataset) using the Huggingface trainer API. I'm using a batch size of 1 and the smallest version of this model, and still get OOM errors. According to the Big Bird paper (https://arxiv.org/abs/2007.14062), Big Bird can be trained on chips with 16 GB of memory, so I'm not sure why I'm running into OOM. Has anyone encountered trouble training Big Bird due to memory problems?

Here's the code that does the training:

rouge = datasets.load_metric("rouge")

training_args = Seq2SeqTrainingArguments(
    predict_with_generate = True,
    evaluation_strategy = "steps",
    per_device_train_batch_size = batch_size,
    per_device_eval_batch_size = batch_size,
    output_dir = "./",
    logging_steps = 2,
    save_steps = 400,
    eval_steps = 4
)

def compute_metrics(pred):
    labels_ids = pred.label_ids
    pred_ids = pred.predictions

    pred_str = tokenizer.batch_decode(pred_ids, skip_special_tokens=True)
    labels_ids[labels_ids == -100] = tokenizer.pad_token_id
    label_str = tokenizer.batch_decode(labels_ids, skip_special_tokens=True)

    rouge_output = rouge.compute(predictions=pred_str, references=label_str, rouge_types=["rouge2"])["rouge2"].mid

    return {
        "rouge2_precision": round(rouge_output.precision, 4),
        "rouge2_recall": round(rouge_output.recall, 4),
        "rouge2_fmeasure": round(rouge_output.fmeasure, 4),
    }

trainer = Seq2SeqTrainer(
    model = model,
    tokenizer = tokenizer,
    args = training_args,
    compute_metrics = compute_metrics,
    train_dataset = train_data,
    eval_dataset = val_data
)

trainer.train()

Here's the exact error I get:

RuntimeError: CUDA out of memory. Tried to allocate 36.00 MiB (GPU 0; 31.75 GiB total capacity; 25.14 GiB already allocated; 21.50 MiB free; 26.23 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Thanks so much for sharing any experience you have with this!


============
A:

I guess it is the problem of huggingface. There is no OOM issue with the original code in https://github.com/google-research/bigbird.

No comments:

Post a Comment

Connect broadband

The Chain Rule of Calculus for Univariate and Multivariate Functions

The chain rule allows us to find the derivative of composite functions. It is computed extensively by the backpropagation algorithm, in orde...