In this article, you will learn how to export models from PyTorch,
scikit-learn, and TensorFlow/Keras to ONNX and compare PyTorch vs. ONNX
Runtime inference on CPU for accuracy and speed.
Topics we will cover include:
- Fine-tuning a ResNet-18 on CIFAR-10 and exporting it to ONNX.
- Verifying numerical parity and benchmarking CPU latency between PyTorch and ONNX Runtime.
- Converting scikit-learn and TensorFlow/Keras models to ONNX for portable deployment.
Without further delay, let’s begin.

Export Your ML Model in ONNX Format
Introduction
When building machine learning models, training is only half the
journey. Deploying those models reliably across different environments
is where many projects slow down. This is where ONNX
(Open Neural Network Exchange) becomes important. ONNX provides a
common, framework-agnostic format that allows models trained in PyTorch, TensorFlow, or scikit-learn to be exported once and run on any runtime that supports ONNX.
In this tutorial, we will go step by step through the complete ONNX
workflow. We will start by fine-tuning a model and saving the fine-tuned
version in native PyTorch format as well as in ONNX format.
Once both versions are ready, we will compare their inference
performance on CPU, focusing on two key aspects: accuracy and inference
speed. This comparison will help you understand the practical tradeoffs
between framework-native models and ONNX-based deployment.
Finally, we will also cover how to convert models trained with
scikit-learn and TensorFlow into ONNX format, so you can apply the same
deployment approach across different machine learning frameworks.
Exporting Fine-Tuned PyTorch Model To ONNX
In this section, we will fine-tune a ResNet-18 model on the CIFAR-10
dataset for image classification. After training, we will save the
fine-tuned model in the normal PyTorch format and also export it into
ONNX format. Then we will run both versions on CPU and compare their
inference results using accuracy and macro F1 score, along with
inference speed.
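Macro F1, one of the two quality metrics used here, is the unweighted mean of the per-class F1 scores, so every class counts equally regardless of how many samples it has. The following is a minimal standard-library sketch of that definition on made-up labels; the helper name macro_f1 is hypothetical, and the tutorial itself relies on scikit-learn's f1_score with average="macro".

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Made-up labels: class 1 is rare and is missed half of the time.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f}  macro F1={score:.3f}")
```

In this toy case the macro F1 comes out lower than the accuracy because the weak result on the rare class pulls the average down, which is why reporting both is useful.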
Setting Up
First, we install the libraries we need for training, exporting, and benchmarking. We use PyTorch and TorchVision to fine-tune the model, ONNX to store the exported model, and ONNX Runtime to run ONNX inference on CPU.
We also install scikit-learn because it provides simple evaluation metrics like accuracy and F1 score.
```python
!pip install -q torch torchvision onnx onnxruntime scikit-learn
!pip install -q skl2onnx tensorflow tf2onnx protobuf
```
Finally, we import all the required modules so we can train the model, export it, and measure performance.
```python
import time

import numpy as np
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torchvision import datasets, transforms, models

import onnx
import onnxruntime as ort
from sklearn.metrics import accuracy_score, f1_score
```
Loading CIFAR-10 And Building ResNet-18
Now we prepare the dataset and model.
The get_cifar10_loaders function loads CIFAR-10 and
returns two DataLoaders: one for training and one for testing. We resize
CIFAR-10 images from 32×32 to 224×224 because ResNet-18 is designed for
ImageNet-sized inputs.
We also apply ImageNet normalization values so the pretrained ResNet
weights work correctly. The training loader includes random horizontal
flipping to add basic data augmentation.
```python
def get_cifar10_loaders(batch_size: int = 64):
    """
    Returns train and test DataLoaders for CIFAR-10.
    We resize to 224x224 and use ImageNet normalization so ResNet18 works nicely.
    """
    imagenet_mean = [0.485, 0.456, 0.406]
    imagenet_std = [0.229, 0.224, 0.225]

    train_transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.RandomHorizontalFlip(),
            transforms.ToTensor(),
            transforms.Normalize(mean=imagenet_mean, std=imagenet_std),
        ]
    )
    test_transform = transforms.Compose(
        [
            transforms.Resize((224, 224)),
            transforms.ToTensor(),
            transforms.Normalize(mean=imagenet_mean, std=imagenet_std),
        ]
    )

    train_dataset = datasets.CIFAR10(
        root="./data", train=True, download=True, transform=train_transform
    )
    test_dataset = datasets.CIFAR10(
        root="./data", train=False, download=True, transform=test_transform
    )

    train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True, num_workers=2)
    test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False, num_workers=2)
    return train_loader, test_loader
```
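To make the normalization step concrete, here is a small standard-library sketch of the per-channel arithmetic: ToTensor scales 8-bit pixel values to the range 0 to 1, and Normalize then maps each channel value x to (x - mean) / std. The helper name normalize_pixel is hypothetical and exists only for illustration; it handles one RGB pixel rather than a full image tensor.

```python
# Per-channel ImageNet statistics, as used in get_cifar10_loaders.
IMAGENET_MEAN = [0.485, 0.456, 0.406]
IMAGENET_STD = [0.229, 0.224, 0.225]


def normalize_pixel(rgb_uint8):
    """Mimic ToTensor + Normalize for a single RGB pixel (hypothetical helper)."""
    scaled = [v / 255.0 for v in rgb_uint8]  # ToTensor: 0..255 -> 0..1
    return [
        (x - mean) / std
        for x, mean, std in zip(scaled, IMAGENET_MEAN, IMAGENET_STD)
    ]


black = normalize_pixel([0, 0, 0])
white = normalize_pixel([255, 255, 255])
print([round(v, 3) for v in black])
print([round(v, 3) for v in white])
```

After this step the inputs are roughly centered around zero, which is the input distribution the pretrained weights were trained on.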
The build_resnet18_cifar10 function loads a
ResNet-18 model with ImageNet pretrained weights and replaces the final
fully connected layer. ImageNet has 1000 classes, but CIFAR-10 has 10
classes, so we update the last layer to output 10 logits.
```python
def build_resnet18_cifar10(num_classes: int = 10) -> nn.Module:
    """
    ResNet18 backbone with ImageNet weights, but final layer adapted to CIFAR-10.
    """
    weights = models.ResNet18_Weights.IMAGENET1K_V1
    model = models.resnet18(weights=weights)

    in_features = model.fc.in_features
    model.fc = nn.Linear(in_features, num_classes)
    return model
```
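The size of the replaced head is easy to check by hand. The sketch below assumes the standard torchvision ResNet-18, whose final layer receives 512 input features; linear_params is a hypothetical helper, not part of PyTorch.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of a fully connected layer."""
    return in_features * out_features + out_features


in_features = 512  # assumption: model.fc.in_features for torchvision ResNet-18
imagenet_head = linear_params(in_features, 1000)
cifar10_head = linear_params(in_features, 10)
print(f"ImageNet head: {imagenet_head:,} parameters")
print(f"CIFAR-10 head: {cifar10_head:,} parameters")
```

Only these 5,130 head parameters start from random values; every other layer starts from the ImageNet weights.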
Quick Fine-Tuning
In this step, we do a small fine-tuning run to make the model adapt
to CIFAR-10. This is not meant to be a full training pipeline. It is a
fast demo training loop so we can later compare PyTorch inference vs.
ONNX inference.
The quick_finetune_cifar10 function trains the model for
a limited number of batches. It uses cross-entropy loss because
CIFAR-10 is a multi-class classification task. It uses the Adam
optimizer for quick learning. The loop runs through batches, performs a
forward pass, calculates the loss, runs backpropagation, and updates
model weights. At the end, it prints an average training loss so we can
see that training happened.
```python
def quick_finetune_cifar10(
    model: nn.Module,
    train_loader: DataLoader,
    device: torch.device,
    max_batches: int = 200,
):
    """
    Very light fine-tuning on CIFAR-10 to make metrics non-trivial.
    Trains for max_batches only (1 pass over subset of train data).
    """
    model.to(device)
    model.train()

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

    running_loss = 0.0
    n_batches = 0
    for batch_idx, (images, labels) in enumerate(train_loader):
        if batch_idx >= max_batches:
            break

        images = images.to(device)
        labels = labels.to(device)

        optimizer.zero_grad()
        outputs = model(images)
        loss = criterion(outputs, labels)
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        n_batches += 1

    # Average over the batches actually processed (may be fewer than max_batches)
    avg_loss = running_loss / max(n_batches, 1)
    print(f"[Train] Average loss over {n_batches} batches: {avg_loss:.4f}")


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device for training:", device)

train_loader, test_loader = get_cifar10_loaders(batch_size=64)
model = build_resnet18_cifar10(num_classes=10)

print("Starting quick fine-tuning on CIFAR-10 (demo)...")
quick_finetune_cifar10(model, train_loader, device, max_batches=200)

# Save weights for reuse (PyTorch + ONNX export)
torch.save(model.state_dict(), "resnet18_cifar10.pth")
print("Saved fine-tuned weights to resnet18_cifar10.pth")
```
After training, we save the model's state_dict using torch.save(). This creates a .pth file, the conventional extension for serialized PyTorch model parameters.
```text
Using device for training: cuda
Starting quick fine-tuning on CIFAR-10 (demo)...
[Train] Average loss over 200 batches: 0.7803
Saved fine-tuned weights to resnet18_cifar10.pth
```
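To put the reported average loss in context, it helps to know what cross-entropy looks like for a model that has learned nothing. The sketch below computes the loss for a single sample in plain Python; cross_entropy is a hypothetical helper and the logits are made up. nn.CrossEntropyLoss applies the same formula and, by default, averages it over the batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy for one sample: -log(softmax(logits)[target])."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Equal logits for all 10 classes: the model is guessing uniformly.
uniform_guess = cross_entropy([0.0] * 10, target=3)

# One clearly larger logit on the correct class.
confident = cross_entropy([0.0] * 3 + [4.0] + [0.0] * 6, target=3)

print(f"uniform guess:         {uniform_guess:.4f}")
print(f"confident and correct: {confident:.4f}")
```

A uniform guess over 10 classes gives a loss of ln(10), about 2.30, so an average of 0.78 over 200 batches indicates that the short run did move the model well away from chance.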
Exporting To ONNX
Now we export the fine-tuned PyTorch model into ONNX format so it can be deployed and executed using ONNX Runtime.
The export_resnet18_cifar10_to_onnx function loads the model architecture again, loads the fine-tuned weights, and switches the model into evaluation mode using model.eval() so inference behaves consistently.
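We also create a dummy input tensor with shape (1, 3, 224, 224). ONNX
export needs this dummy input to trace the model graph and understand
the input and output shapes. Because the dummy input fixes the batch size at 1, we also pass dynamic_axes to mark dimension 0 of the input and output as a variable batch size.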
```python
def export_resnet18_cifar10_to_onnx(
    weights_path: str = "resnet18_cifar10.pth",
    onnx_path: str = "resnet18_cifar10.onnx",
):
    device = torch.device("cpu")  # export on CPU

    model = build_resnet18_cifar10(num_classes=10).to(device)
    model.load_state_dict(torch.load(weights_path, map_location=device))
    model.eval()

    # Dummy input (batch_size=1)
    dummy_input = torch.randn(1, 3, 224, 224, device=device)

    input_names = ["input"]
    output_names = ["logits"]
    dynamic_axes = {
        "input": {0: "batch_size"},
        "logits": {0: "batch_size"},
    }

    torch.onnx.export(
        model,
        dummy_input,
        onnx_path,
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=dynamic_axes,
    )
    print(f"Exported ResNet18 (CIFAR-10) to ONNX: {onnx_path}")


export_resnet18_cifar10_to_onnx()
```
Finally, torch.onnx.export() generates the .onnx file.
```text
Exported ResNet18 (CIFAR-10) to ONNX: resnet18_cifar10.onnx
```
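As a rough mental model of the dynamic_axes argument used in the export: the graph declares its input shape as (batch_size, 3, 224, 224), where batch_size is a symbolic name rather than a number. The toy check below is not ONNX Runtime code; shape_matches is a hypothetical helper that only illustrates which concrete inputs such a declaration would accept, under the assumption that a shape traced from the dummy input alone would pin the batch dimension to 1.

```python
def shape_matches(declared, actual):
    """Return True if a concrete shape fits a declared shape.

    Strings in `declared` stand for symbolic (dynamic) dimensions and
    accept any size; integers must match exactly.
    """
    if len(declared) != len(actual):
        return False
    return all(
        isinstance(d, str) or d == a
        for d, a in zip(declared, actual)
    )


dynamic_input = ("batch_size", 3, 224, 224)  # declared with dynamic_axes
static_input = (1, 3, 224, 224)              # assumption: shape pinned to the dummy input

batch_of_64 = (64, 3, 224, 224)
print(shape_matches(dynamic_input, batch_of_64))         # batch size is free
print(shape_matches(static_input, batch_of_64))          # batch size is pinned to 1
print(shape_matches(dynamic_input, (64, 3, 32, 32)))     # height/width are still fixed
```

This matters for the benchmark later in the tutorial, which feeds batches of 64 images to a model that was exported with a single dummy image.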
Benchmarking Torch CPU Vs. ONNX Runtime
In this part, we evaluate both formats side by side. We keep everything on CPU so both runtimes are measured on the same hardware.
The following function performs four major tasks:
- Load the PyTorch model on CPU.
- Load and validate the ONNX model.
- Check output similarity on one batch.
- Warm up and benchmark inference speed.
Then we run timed inference for a fixed number of batches:
- We measure the time taken by PyTorch inference on CPU.
- We measure the time taken by ONNX Runtime inference on CPU.
- We collect predictions from both and compute accuracy and macro F1 score.
Finally, we print average latency per batch and show an estimated speedup ratio.
```python
def verify_and_benchmark(
    weights_path: str = "resnet18_cifar10.pth",
    onnx_path: str = "resnet18_cifar10.onnx",
    batch_size: int = 64,
    warmup_batches: int = 2,
    max_batches: int = 30,
):
    device = torch.device("cpu")  # fair CPU vs CPU comparison
    print("Using device for evaluation:", device)

    # 1) Load PyTorch model
    torch_model = build_resnet18_cifar10(num_classes=10).to(device)
    torch_model.load_state_dict(torch.load(weights_path, map_location=device))
    torch_model.eval()

    # 2) Load ONNX model and create session
    onnx_model = onnx.load(onnx_path)
    onnx.checker.check_model(onnx_model)
    print("ONNX model is well-formed.")

    ort_session = ort.InferenceSession(onnx_path, providers=["CPUExecutionProvider"])
    print("ONNXRuntime providers:", ort_session.get_providers())

    # 3) Data loader (test set)
    _, test_loader = get_cifar10_loaders(batch_size=batch_size)

    # -------------------------
    # A) Numeric closeness check on a single batch
    # -------------------------
    images, labels = next(iter(test_loader))
    images = images.to(device)
    labels = labels.to(device)

    with torch.no_grad():
        torch_logits = torch_model(images).cpu().numpy()

    ort_inputs = {"input": images.cpu().numpy().astype(np.float32)}
    ort_logits = ort_session.run(["logits"], ort_inputs)[0]

    abs_diff = np.abs(torch_logits - ort_logits)
    max_abs = abs_diff.max()
    mean_abs = abs_diff.mean()
    print(f"Max abs diff: {max_abs:.6e}")
    print(f"Mean abs diff: {mean_abs:.6e}")

    # Relaxed tolerance to account for small numerical noise
    np.testing.assert_allclose(torch_logits, ort_logits, rtol=1e-02, atol=1e-04)
    print("Outputs match closely between PyTorch and ONNXRuntime within relaxed tolerance.")

    # -------------------------
    # B) Warmup runs (on a couple of batches, not recorded)
    # -------------------------
    print(f"\nWarming up on {warmup_batches} batches (not timed)...")
    warmup_iter = iter(test_loader)
    for _ in range(warmup_batches):
        try:
            imgs_w, _ = next(warmup_iter)
        except StopIteration:
            break
        imgs_w = imgs_w.to(device)
        with torch.no_grad():
            _ = torch_model(imgs_w)
        _ = ort_session.run(["logits"], {"input": imgs_w.cpu().numpy().astype(np.float32)})

    # -------------------------
    # C) Timed runs + metric collection
    # -------------------------
    print(f"\nRunning timed evaluation on up to {max_batches} batches...")
    all_labels = []
    torch_all_preds = []
    onnx_all_preds = []
    torch_times = []
    onnx_times = []
    n_batches = 0

    for batch_idx, (images, labels) in enumerate(test_loader):
        if batch_idx >= max_batches:
            break
        n_batches += 1

        images = images.to(device)
        labels = labels.to(device)

        # Time PyTorch
        start = time.perf_counter()
        with torch.no_grad():
            torch_out = torch_model(images)
        end = time.perf_counter()
        torch_times.append(end - start)

        # Time ONNX
        ort_inp = {"input": images.cpu().numpy().astype(np.float32)}
        start = time.perf_counter()
        ort_out = ort_session.run(["logits"], ort_inp)[0]
        end = time.perf_counter()
        onnx_times.append(end - start)

        # Predictions
        torch_pred_batch = torch_out.argmax(dim=1).cpu().numpy()
        onnx_pred_batch = ort_out.argmax(axis=1)
        labels_np = labels.cpu().numpy()

        all_labels.append(labels_np)
        torch_all_preds.append(torch_pred_batch)
        onnx_all_preds.append(onnx_pred_batch)

    if n_batches == 0:
        print("No batches processed for evaluation. Check max_batches / dataloader.")
        return

    # Concatenate across batches
    all_labels = np.concatenate(all_labels, axis=0)
    torch_all_preds = np.concatenate(torch_all_preds, axis=0)
    onnx_all_preds = np.concatenate(onnx_all_preds, axis=0)

    # -------------------------
    # D) Metrics: accuracy & F1 (macro)
    # -------------------------
    torch_acc = accuracy_score(all_labels, torch_all_preds) * 100.0
    onnx_acc = accuracy_score(all_labels, onnx_all_preds) * 100.0
    torch_f1 = f1_score(all_labels, torch_all_preds, average="macro") * 100.0
    onnx_f1 = f1_score(all_labels, onnx_all_preds, average="macro") * 100.0

    print("\nEvaluation metrics on timed subset")
    print(f"PyTorch - accuracy: {torch_acc:.2f}% F1 (macro): {torch_f1:.2f}%")
    print(f"ONNX - accuracy: {onnx_acc:.2f}% F1 (macro): {onnx_f1:.2f}%")

    # -------------------------
    # E) Latency summary
    # -------------------------
    avg_torch = sum(torch_times) / len(torch_times)
    avg_onnx = sum(onnx_times) / len(onnx_times)

    print(f"\nLatency over {len(torch_times)} batches (batch size = {batch_size})")
    print(f"PyTorch avg: {avg_torch * 1000:.2f} ms / batch")
    print(f"ONNXRuntime avg: {avg_onnx * 1000:.2f} ms / batch")
    if avg_onnx > 0:
        print(f"Estimated speedup (Torch / ORT): {avg_torch / avg_onnx:.2f}x")
    else:
        print("Estimated speedup: N/A (onnx time is 0?)")


verify_and_benchmark(
    weights_path="resnet18_cifar10.pth",
    onnx_path="resnet18_cifar10.onnx",
    batch_size=64,
    warmup_batches=2,
    max_batches=30,
)
```
As a result, we get a detailed report. Both runtimes reach the same accuracy (78.18%) and macro F1 (77.81%) on the timed subset, while ONNX Runtime is about 1.66x faster per batch on this machine.
```text
Using device for evaluation: cpu
ONNX model is well-formed.
ONNXRuntime providers: ['CPUExecutionProvider']
Max abs diff: 3.814697e-06
Mean abs diff: 4.552072e-07
Outputs match closely between PyTorch and ONNXRuntime within relaxed tolerance.

Warming up on 2 batches (not timed)...

Running timed evaluation on up to 30 batches...

Evaluation metrics on timed subset
PyTorch - accuracy: 78.18% F1 (macro): 77.81%
ONNX - accuracy: 78.18% F1 (macro): 77.81%

Latency over 30 batches (batch size = 64)
PyTorch avg: 2192.50 ms / batch
ONNXRuntime avg: 1317.09 ms / batch
Estimated speedup (Torch / ORT): 1.66x
```
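The "relaxed tolerance" in the parity check refers to the rtol and atol arguments of np.testing.assert_allclose, which, per the NumPy documentation, requires |actual - desired| <= atol + rtol * |desired| for every element. The sketch below restates that criterion in plain Python; within_tolerance is a hypothetical helper and the logits are made up, with an offset chosen to mirror the reported maximum difference.

```python
def within_tolerance(actual, desired, rtol=1e-2, atol=1e-4):
    """Element-wise test: |actual - desired| <= atol + rtol * |desired|."""
    return all(
        abs(a - d) <= atol + rtol * abs(d)
        for a, d in zip(actual, desired)
    )


# Made-up logits; the offset mirrors the reported max abs diff (~3.8e-06).
torch_logits = [2.5, -1.0, 0.0]
ort_logits = [2.5 + 3.8e-6, -1.0, 0.0]
close = within_tolerance(torch_logits, ort_logits)

# A 5% error on one logit is far outside the tolerance.
drifted = within_tolerance([2.5 * 1.05, -1.0, 0.0], ort_logits)

print(close, drifted)
```

A difference of a few millionths is well below the absolute tolerance of 1e-4 on its own, so the check used in this tutorial has plenty of headroom for float32 rounding noise.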
Exporting Scikit-Learn And Keras Models To ONNX
In this section, we show how ONNX can also be used beyond deep
learning frameworks like PyTorch. We will export a traditional
scikit-learn model and a TensorFlow/Keras neural network into ONNX
format. This demonstrates how ONNX acts as a common deployment layer
across classical machine learning and deep learning models.
Exporting A Scikit-Learn Model To ONNX
We will now train a simple Random Forest classifier on the Iris
dataset using scikit-learn and then export it to ONNX format for
deployment.
Before conversion, we explicitly define the ONNX input type: the input
name, a 32-bit floating-point data type, a dynamic batch size (None),
and the number of input features. The converter needs this declaration
because a fitted scikit-learn model does not record the type or shape of its inputs.
We then convert the trained model, save the resulting .onnx file, and
finally validate it to ensure the exported model is well-formed and
ready for inference with ONNX Runtime.
```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from skl2onnx import convert_sklearn
from skl2onnx.common.data_types import FloatTensorType
import onnx

# 1) Train a small sklearn model
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=42
)

rf = RandomForestClassifier(n_estimators=50, random_state=42)
rf.fit(X_train, y_train)
print("Trained RandomForestClassifier on Iris")

# 2) Define input type for ONNX (batch_size x n_features)
n_features = X_train.shape[1]
initial_type = [("input", FloatTensorType([None, n_features]))]

# 3) Convert to ONNX
rf_onnx = convert_sklearn(rf, initial_types=initial_type, target_opset=17)

onnx_path_sklearn = "random_forest_iris.onnx"
with open(onnx_path_sklearn, "wb") as f:
    f.write(rf_onnx.SerializeToString())

# 4) Quick sanity check
onnx.checker.check_model(onnx.load(onnx_path_sklearn))
print(f"Exported sklearn model to {onnx_path_sklearn}")
```
Our model is now trained, converted, saved, and validated.
```text
Trained RandomForestClassifier on Iris
Exported sklearn model to random_forest_iris.onnx
```
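One detail worth keeping in mind: FloatTensorType declares a 32-bit float input, while the Iris feature array loaded by scikit-learn is 64-bit, so the data has to be cast to float32 before it is fed to the exported model. The standard-library sketch below shows the small rounding that such a cast introduces; to_float32 is a hypothetical helper and the sample value is made up.

```python
import struct


def to_float32(x: float) -> float:
    """Round a Python float (64-bit) to the nearest 32-bit float."""
    return struct.unpack("f", struct.pack("f", x))[0]


value = 5.1  # e.g. a sepal length in centimetres
as_f32 = to_float32(value)

print(as_f32 == value)       # the cast changes the stored value slightly
print(abs(as_f32 - value))   # size of the rounding error
```

Rounding of this size is one reason ONNX outputs are usually compared with the original model using a tolerance rather than exact equality.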
Exporting A TensorFlow/Keras Model To ONNX
We will now export a TensorFlow neural network to ONNX format to
demonstrate how deep learning models trained with TensorFlow can be
prepared for portable deployment.
The environment is configured to run on CPU with minimal logging to
keep the process clean and reproducible. A simple fully connected Keras
model is built using the Functional API, with a fixed input size and a
small number of layers to keep the conversion straightforward.
An input signature is then defined so the converter knows the expected
input shape, data type, and tensor name. Using this information, tf2onnx
converts the Keras model into ONNX format, and we save the result as a
.onnx file.
Finally, the exported model is validated to ensure it is well-formed
and ready to be executed using ONNX Runtime or any other ONNX-compatible
inference engine.
```python
import os

# 1) Run on CPU only and reduce TensorFlow logging
#    (set before importing TensorFlow so the settings take effect)
os.environ["TF_CPP_MIN_LOG_LEVEL"] = "2"
os.environ["CUDA_VISIBLE_DEVICES"] = "-1"

# 2) Import TensorFlow and the ONNX tooling
import tensorflow as tf
import tf2onnx
import onnx

# 3) Build a simple Keras model
inputs = tf.keras.Input(shape=(32,), name="input")
x = tf.keras.layers.Dense(64, activation="relu")(inputs)
x = tf.keras.layers.Dense(32, activation="relu")(x)
outputs = tf.keras.layers.Dense(10, activation="softmax", name="output")(x)
keras_model = tf.keras.Model(inputs=inputs, outputs=outputs)
keras_model.summary()

# 4) Convert to ONNX
spec = (
    tf.TensorSpec(
        keras_model.inputs[0].shape,
        keras_model.inputs[0].dtype,
        name="input",
    ),
)

onnx_model_keras, _ = tf2onnx.convert.from_keras(
    keras_model,
    input_signature=spec,
    opset=17,
)

onnx_path_keras = "keras_mlp.onnx"
with open(onnx_path_keras, "wb") as f:
    f.write(onnx_model_keras.SerializeToString())

onnx.checker.check_model(onnx.load(onnx_path_keras))
print(f"Exported Keras/TensorFlow model to {onnx_path_keras}")
```
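Our model is now built, converted, saved, and validated. Note that this example skips training, so the exported network carries its randomly initialized weights.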
```text
Model: "functional_4"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ input (InputLayer)              │ (None, 32)             │             0 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_8 (Dense)                 │ (None, 64)             │         2,112 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_9 (Dense)                 │ (None, 32)             │         2,080 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ output (Dense)                  │ (None, 10)             │           330 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 4,522 (17.66 KB)
 Trainable params: 4,522 (17.66 KB)
 Non-trainable params: 0 (0.00 B)
Exported Keras/TensorFlow model to keras_mlp.onnx
```
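The parameter counts in the summary can be reproduced by hand, which is a quick way to confirm that the exported network has the intended shape. Each Dense layer holds a weight matrix plus a bias vector; the sketch below assumes 4 bytes per parameter (float32), and dense_params is a hypothetical helper.

```python
def dense_params(n_in: int, n_out: int) -> int:
    """Parameters of a Dense layer: weight matrix plus bias vector."""
    return n_in * n_out + n_out


layer_sizes = [32, 64, 32, 10]  # input, two hidden layers, output
per_layer = [
    dense_params(n_in, n_out)
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
]
total = sum(per_layer)
size_kb = total * 4 / 1024  # assumption: 4 bytes per float32 parameter

print(per_layer, total, f"{size_kb:.2f} KB")
```

The three layer counts (2,112, 2,080, and 330), the total of 4,522, and the size of 17.66 KB all agree with the Keras summary.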
Final Thoughts
ONNX provides a practical bridge between model training and
real-world deployment by making machine learning models portable,
framework-independent, and easier to optimize for inference.
By fine-tuning a PyTorch model, exporting it to ONNX, and comparing
accuracy and CPU inference speed, we saw that ONNX Runtime matched
PyTorch's accuracy and macro F1 while running about 1.66x faster per batch in our benchmark.
It simplifies the path from experimentation to production and reduces
friction when deploying models across different environments.
With this level of portability, performance, and consistency, it is
worth asking: why not use ONNX for the deployment step of
your machine learning projects?