
5 Types of Loss Functions in Machine Learning

Analytics Vidhya, by Janvi Kumari, April 4, 2026

A loss function is what guides a model during training, translating predictions into a signal it can improve on. But not all losses behave the same—some amplify large errors, others stay stable in noisy settings, and each choice subtly shapes how learning unfolds.

Modern libraries add another layer with reduction modes and scaling effects that influence optimization. In this article, we break down the major loss families and how to choose the right one for your task.

Table of contents

  • Mathematical Foundations of Loss Functions
  • Regression Losses
  • Classification and Probabilistic Losses
  • Imbalance-Aware Losses
  • Segmentation and Detection Losses
  • Representation Learning Losses
  • Comparison Table and Practical Guidance
  • Conclusion
  • Frequently Asked Questions

Mathematical Foundations of Loss Functions

In supervised learning, the objective is typically to minimize the empirical risk,

$$\hat{R}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \ell\big(f_\theta(x_i), y_i\big),$$

where ℓ is the loss function, fθ(xi) is the model prediction, yi is the true target, and N is the number of training examples. In practice, this objective may also include sample weights and regularization terms. Most machine learning frameworks follow this formulation by computing per-example losses and then applying a reduction such as mean, sum, or none.

When discussing mathematical properties, it is important to state the variable with respect to which the loss is analyzed. Many loss functions are convex in the prediction or logit for a fixed label, although the overall training objective is usually non-convex in neural network parameters. Important properties include convexity, differentiability, robustness to outliers, and scale sensitivity. Common implementation pitfalls include confusing logits with probabilities and using a reduction that does not match the intended mathematical definition.
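The reduction step described above can be illustrated without any framework. The following is a minimal sketch, not a library API: `reduce_loss` is a hypothetical helper that mimics the usual `none`, `sum`, and `mean` modes on a toy batch of squared errors.

```python
# Toy batch; values are illustrative only.
y_true = [3.0, -0.5, 2.0, 7.0]
y_pred = [2.5, 0.0, 2.0, 8.0]

# Per-example squared errors, i.e. the loss before any reduction.
per_example = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]


def reduce_loss(losses, reduction="mean"):
    """Hypothetical helper mimicking the common reduction modes."""
    if reduction == "none":
        return list(losses)
    if reduction == "sum":
        return sum(losses)
    if reduction == "mean":
        return sum(losses) / len(losses)
    raise ValueError(f"unknown reduction: {reduction!r}")


loss_none = reduce_loss(per_example, "none")
loss_sum = reduce_loss(per_example, "sum")
loss_mean = reduce_loss(per_example, "mean")

print("none:", loss_none)
print("sum: ", loss_sum)
print("mean:", loss_mean)
```

With `sum`, the loss (and therefore its gradient) is N times larger than with `mean`, so the same learning rate behaves differently under the two modes.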

Regression Losses

Mean Squared Error

Mean Squared Error, or MSE, is one of the most widely used loss functions for regression. It is defined as the average of the squared differences between predicted values and true targets:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2$$

Because the error term is squared, large residuals are penalized more heavily than small ones. This makes MSE useful when large prediction errors should be strongly discouraged. It is convex in the prediction and differentiable everywhere, which makes optimization straightforward. However, it is sensitive to outliers, since a single extreme residual can strongly affect the loss.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the squared residuals
mse = np.mean((y_true - y_pred) ** 2)
print("MSE:", mse)

Mean Absolute Error

Mean Absolute Error, or MAE, measures the average absolute difference between predictions and targets:

$$\mathrm{MAE} = \frac{1}{N} \sum_{i=1}^{N} |y_i - \hat{y}_i|$$

Unlike MSE, MAE penalizes errors linearly rather than quadratically. As a result, it is more robust to outliers. MAE is convex in the prediction, but it is not differentiable at zero residual, so optimization typically uses subgradients at that point.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

# Average of the absolute residuals
mae = np.mean(np.abs(y_true - y_pred))
print("MAE:", mae)
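To make the difference in outlier sensitivity concrete, the sketch below compares how much each loss grows when a single prediction is badly wrong. The outlier value is made up for illustration, and the helpers `mse` and `mae` are plain-Python stand-ins for the formulas above.

```python
def mse(y_true, y_pred):
    """Mean of squared residuals."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def mae(y_true, y_pred):
    """Mean of absolute residuals."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


y_true = [3.0, -0.5, 2.0, 7.0]
y_clean = [2.5, 0.0, 2.0, 8.0]
y_outlier = [2.5, 0.0, 2.0, 17.0]  # last prediction is off by 10 (illustrative)

# How many times larger does each loss become because of one outlier?
mse_ratio = mse(y_true, y_outlier) / mse(y_true, y_clean)
mae_ratio = mae(y_true, y_outlier) / mae(y_true, y_clean)

print("MSE grows by a factor of", mse_ratio)
print("MAE grows by a factor of", mae_ratio)
```

The quadratic penalty lets the single large residual dominate MSE, while MAE grows only in proportion to the size of the error.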

Huber Loss

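Huber loss combines the strengths of MSE and MAE by behaving quadratically for small errors and linearly for large ones. For a threshold δ>0 and residual r = y − ŷ, it is defined as:

$$L_\delta(r) = \begin{cases} \tfrac{1}{2} r^2, & |r| \le \delta \\ \delta\left(|r| - \tfrac{1}{2}\delta\right), & |r| > \delta \end{cases}$$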

This makes Huber loss a good choice when the data are mostly well behaved but may contain occasional outliers.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

error = y_pred - y_true
delta = 1.0

# Quadratic inside the threshold, linear outside it
huber = np.mean(
    np.where(
        np.abs(error) <= delta,
        0.5 * error**2,
        delta * (np.abs(error) - 0.5 * delta),
    )
)
print("Huber Loss:", huber)

Smooth L1 Loss

Smooth L1 loss is closely related to Huber loss and is commonly used in deep learning, especially in object detection and regression heads. It transitions from a squared penalty near zero to an absolute penalty beyond a threshold. It is differentiable everywhere and less sensitive to outliers than MSE.

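import torch
import torch.nn.functional as F

y_true = torch.tensor([3.0, -0.5, 2.0, 7.0])
y_pred = torch.tensor([2.5, 0.0, 2.0, 8.0])

# beta is the threshold between the quadratic and linear regimes
smooth_l1 = F.smooth_l1_loss(y_pred, y_true, beta=1.0)
print("Smooth L1 Loss:", smooth_l1.item())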
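The relationship to Huber loss can be checked numerically. The sketch below assumes the piecewise definitions documented for PyTorch's `SmoothL1Loss` and `HuberLoss`; `huber` and `smooth_l1` are hypothetical plain-Python re-implementations written for this illustration. Under those definitions, Smooth L1 with threshold `beta` equals Huber with `delta = beta`, divided by `beta`.

```python
def huber(r, delta=1.0):
    """Huber loss of a single residual r."""
    a = abs(r)
    return 0.5 * r * r if a <= delta else delta * (a - 0.5 * delta)


def smooth_l1(r, beta=1.0):
    """Smooth L1 loss of a single residual r (beta must be positive)."""
    a = abs(r)
    return 0.5 * r * r / beta if a < beta else a - 0.5 * beta


residuals = [-3.0, -0.5, 0.0, 1.0, 2.0, 4.0]  # illustrative values
beta = 2.0

for r in residuals:
    lhs = smooth_l1(r, beta)
    rhs = huber(r, delta=beta) / beta
    print(f"r={r:5.1f}  smooth_l1={lhs:.4f}  huber/beta={rhs:.4f}")
```

With `beta = 1.0` the two losses coincide exactly, which is why the names are often used interchangeably.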

Log-Cosh Loss

Log-cosh loss is a smooth alternative to MAE and is defined as

$$L = \frac{1}{N} \sum_{i=1}^{N} \log\big(\cosh(\hat{y}_i - y_i)\big)$$

Near zero residuals, it behaves like squared loss, while for large residuals it grows almost linearly. This gives it a good balance between smooth optimization and robustness to outliers.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

error = y_pred - y_true

# log(cosh(x)) is roughly x**2 / 2 near zero and |x| - log(2) for large |x|
logcosh = np.mean(np.log(np.cosh(error)))
print("Log-Cosh Loss:", logcosh)

Quantile Loss

Quantile loss, also called pinball loss, is used when the goal is to estimate a conditional quantile rather than a conditional mean. For a quantile level τ∈(0,1) and residual u = y − ŷ, it is defined as

$$\rho_\tau(u) = \begin{cases} \tau\, u, & u \ge 0 \\ (\tau - 1)\, u, & u < 0 \end{cases}$$

It penalizes overestimation and underestimation asymmetrically, making it useful in forecasting and uncertainty estimation.

import numpy as np

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

tau = 0.8

# Residual u = y - y_hat; positive u means the model under-predicted
u = y_true - y_pred

quantile_loss = np.mean(np.where(u >= 0, tau * u, (tau - 1) * u))
print("Quantile Loss:", quantile_loss)
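One way to see why this loss targets a quantile is to search for the constant prediction that minimizes it. The sketch below is a brute-force illustration on made-up data; `pinball` is a hypothetical helper, and τ = 0.75 is chosen here only so that the minimizer is unique on ten points.

```python
def pinball(y_true, q, tau):
    """Mean pinball loss when predicting the constant q for every target."""
    total = 0.0
    for y in y_true:
        u = y - q
        total += tau * u if u >= 0 else (tau - 1) * u
    return total / len(y_true)


data = [float(v) for v in range(1, 11)]  # 1.0 ... 10.0, illustrative targets
tau = 0.75

# Try every data point as the candidate constant prediction.
best_q = min(data, key=lambda q: pinball(data, q, tau))

print("best constant prediction:", best_q)
print("loss at that point:", pinball(data, best_q, tau))
```

The search lands on the smallest value that has at least 75% of the data at or below it, which is the empirical 0.75 quantile under the inverse-CDF convention.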

MAPE

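Mean Absolute Percentage Error, or MAPE, measures relative error and is defined as

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right|$$

The result is a fraction; it is often multiplied by 100 to express a percentage.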

It is useful when relative error matters more than absolute error, but it becomes unstable when target values are zero or very close to zero.

import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([90.0, 210.0, 290.0])

# Relative error per example, averaged (as a fraction, not a percentage)
mape = np.mean(np.abs((y_true - y_pred) / y_true))
print("MAPE:", mape)
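The instability near zero is easy to reproduce. In the sketch below, both batches have the same absolute error of 10 on every example, but one target is close to zero. The `eps` guard and the `mape` helper are assumptions made for this illustration, not part of any library.

```python
def mape(y_true, y_pred, eps=1e-8):
    """MAPE as a fraction; refuses targets whose magnitude is below eps."""
    if any(abs(t) < eps for t in y_true):
        raise ValueError("MAPE is undefined for targets at or near zero")
    return sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


# Well-scaled targets.
stable = mape([100.0, 200.0, 300.0], [90.0, 210.0, 290.0])

# Same absolute errors, but the last target is tiny (illustrative values).
unstable = mape([100.0, 200.0, 0.01], [90.0, 210.0, 10.01])

print("MAPE with well-scaled targets:", stable)
print("MAPE with a near-zero target: ", unstable)
```

A single near-zero target dominates the average, so MAPE is best reserved for data whose targets stay well away from zero.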

MSLE

Mean Squared Logarithmic Error, or MSLE, is defined as

$$\mathrm{MSLE} = \frac{1}{N} \sum_{i=1}^{N} \big( \log(1 + y_i) - \log(1 + \hat{y}_i) \big)^2$$

It is useful when relative differences matter and the targets are nonnegative.

import numpy as np

y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([90.0, 210.0, 290.0])

# log1p(x) computes log(1 + x), which is defined for x = 0
msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
print("MSLE:", msle)

Poisson Negative Log-Likelihood

Poisson negative log-likelihood is used for count data. For a rate parameter λ>0 and an observed count y, it is typically written as

$$\ell(y, \lambda) = \lambda - y \log \lambda + \log(y!)$$

In practice, the constant term may be omitted. This loss is appropriate when targets represent counts generated from a Poisson process.

import numpy as np

y_true = np.array([2.0, 0.0, 4.0])
lam = np.array([1.5, 0.5, 3.0])

# The log(y!) term does not depend on lam and is omitted here
poisson_nll = np.mean(lam - y_true * np.log(lam))
print("Poisson NLL:", poisson_nll)

Gaussian Negative Log-Likelihood

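Gaussian negative log-likelihood allows the model to predict both the mean μ and the variance σ² of the target distribution. A common form, omitting the additive constant ½ log(2π), is

$$\ell(y, \mu, \sigma^2) = \frac{1}{2} \left( \log \sigma^2 + \frac{(y - \mu)^2}{\sigma^2} \right)$$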

This is useful for heteroscedastic regression, where the noise level varies across inputs.

import numpy as np

y_true = np.array([0.0, 1.0])
mu = np.array([0.0, 1.5])
var = np.array([1.0, 0.25])

# Log-variance penalty plus variance-scaled squared error
gaussian_nll = np.mean(0.5 * (np.log(var) + (y_true - mu) ** 2 / var))
print("Gaussian NLL:", gaussian_nll)

Classification and Probabilistic Losses

Binary Cross-Entropy and Log Loss

Binary cross-entropy, or BCE, is used for binary classification. It compares a Bernoulli label y∈{0,1} with a predicted probability p∈(0,1):

$$\ell(y, p) = -\big[\, y \log p + (1 - y) \log(1 - p) \,\big]$$

In practice, many libraries prefer logits rather than probabilities and compute the loss in a numerically stable way. This avoids instability caused by applying sigmoid separately before the logarithm. BCE is convex in the logit for a fixed label and differentiable, but it is not robust to label noise because confidently wrong predictions can produce very large loss values. It is widely used for binary classification, and in multi-label classification it is applied independently to each label. A common pitfall is confusing probabilities with logits, which can silently degrade training.

import torch

logits = torch.tensor([2.0, -1.0, 0.0])
y_true = torch.tensor([1.0, 0.0, 1.0])

# Takes raw logits; the sigmoid is applied internally in a stable way
bce = torch.nn.BCEWithLogitsLoss()
loss = bce(logits, y_true)
print("BCEWithLogitsLoss:", loss.item())
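The instability of applying sigmoid before the logarithm can be reproduced in plain Python. The sketch below contrasts a naive implementation with a standard algebraically equivalent form computed directly from the logit. Both helper names are hypothetical, and the stable formula is shown as one common choice rather than as the exact code any library uses.

```python
import math


def bce_naive(z, y):
    """Sigmoid first, then log: breaks once the sigmoid saturates to 0 or 1."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def bce_stable(z, y):
    """Same loss from the logit: max(z, 0) - z*y + log(1 + exp(-|z|))."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))


# For moderate logits the two forms agree.
print("moderate logit:", bce_naive(2.0, 1.0), bce_stable(2.0, 1.0))

# For a confidently wrong prediction the naive form fails.
try:
    bce_naive(40.0, 0.0)  # sigmoid(40) rounds to exactly 1.0 in float64
    naive_failed = False
except ValueError:  # math.log(0.0) is a math domain error
    naive_failed = True

print("naive failed at z=40:", naive_failed)
print("stable at z=40:", bce_stable(40.0, 0.0))
```

The stable form never exponentiates a positive number, so it stays finite and close to the logit magnitude for confidently wrong predictions.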

Softmax Cross-Entropy for Multiclass Classification

Softmax cross-entropy is the standard loss for multiclass classification. For a class index y and logits vector z, it combines the softmax transformation with cross-entropy loss:

$$\ell(y, z) = -\log \frac{e^{z_y}}{\sum_{j} e^{z_j}} = -z_y + \log \sum_{j} e^{z_j}$$

This loss is convex in the logits and differentiable. Like BCE, it can heavily penalize confident wrong predictions and is not inherently robust to label noise. It is commonly used in standard multiclass classification and also in pixelwise classification tasks such as semantic segmentation. One important implementation detail is that many libraries, including PyTorch, expect integer class indices rather than one-hot targets unless soft-label variants are explicitly used.

import torch
import torch.nn.functional as F

logits = torch.tensor(
    [[2.0, 0.5, -1.0],
     [0.0, 1.0, 0.0]],
    dtype=torch.float32,
)

# Targets are integer class indices, not one-hot vectors
y_true = torch.tensor([0, 2], dtype=torch.long)

loss = F.cross_entropy(logits, y_true)
print("CrossEntropyLoss:", loss.item())
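The second form of the formula, log-sum-exp minus the true-class logit, is what makes a stable implementation possible. The following is a minimal plain-Python sketch of that idea; `cross_entropy` is a hypothetical helper for a single example, and the extreme logits are made up to show the effect of the shift.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example via the log-sum-exp trick."""
    m = max(logits)  # shift so the largest exponent is exactly 0
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[target]


batch_logits = [[2.0, 0.5, -1.0], [0.0, 1.0, 0.0]]
batch_targets = [0, 2]  # integer class indices

losses = [cross_entropy(z, y) for z, y in zip(batch_logits, batch_targets)]
mean_loss = sum(losses) / len(losses)
print("mean cross-entropy:", mean_loss)

# Logits this large would overflow a naive exp(); the shift avoids that.
extreme = cross_entropy([1000.0, 0.0, -1000.0], 0)
print("extreme logits:", extreme)
```

Subtracting the maximum logit leaves the softmax unchanged but guarantees that every exponent is at most zero.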

Label Smoothing Variant

Label smoothing is a regularized form of cross-entropy in which a one-hot target is replaced by a softened target distribution. Instead of assigning full probability mass to the correct class, a small portion is distributed across the remaining classes. This discourages overconfident predictions and can improve calibration.

The method remains differentiable and often improves generalization, especially in large-scale classification. However, too much smoothing can make the targets overly ambiguous and lead to underfitting.

import torch
import torch.nn.functional as F

logits = torch.tensor(
    [[2.0, 0.5, -1.0],
     [0.0, 1.0, 0.0]],
    dtype=torch.float32,
)

y_true = torch.tensor([0, 2], dtype=torch.long)

# label_smoothing=0.1 moves 10% of the target mass off the one-hot label
loss = F.cross_entropy(logits, y_true, label_smoothing=0.1)
print("CrossEntropyLoss with label smoothing:", loss.item())
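The softened target can also be built by hand. The sketch below assumes the convention in which the one-hot target is mixed with a uniform distribution over all K classes, which is how PyTorch documents its `label_smoothing` argument; spreading the mass over only the remaining classes is a slight variant. All helper names are hypothetical.

```python
import math


def log_softmax(logits):
    """Stable log-softmax for one example."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def smoothed_targets(true_class, num_classes, eps):
    """(1 - eps) on the one-hot vector plus eps spread uniformly over all classes."""
    return [
        (1.0 - eps) * (1.0 if k == true_class else 0.0) + eps / num_classes
        for k in range(num_classes)
    ]


def smoothed_cross_entropy(logits, true_class, eps):
    """Cross-entropy between the softened target and the softmax prediction."""
    log_p = log_softmax(logits)
    q = smoothed_targets(true_class, len(logits), eps)
    return -sum(qk * lp for qk, lp in zip(q, log_p))


logits = [2.0, 0.5, -1.0]
targets = smoothed_targets(0, 3, eps=0.1)
plain = smoothed_cross_entropy(logits, 0, eps=0.0)
smooth = smoothed_cross_entropy(logits, 0, eps=0.1)

print("softened target:", targets)
print("plain CE:", plain, " smoothed CE:", smooth)
```

For a confidently correct prediction the smoothed loss is larger than the plain one, which is the mechanism that discourages overconfidence.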

Margin Losses: Hinge Loss

Hinge loss is a classic margin-based loss used in support vector machines. For binary classification with label y∈{−1,+1} and score s, it is defined as

$$\ell(y, s) = \max(0,\; 1 - y\,s)$$

Hinge loss is convex in the score but not differentiable at the margin boundary. It produces zero loss for examples that are correctly classified with sufficient margin, which leads to sparse gradients. Unlike cross-entropy, hinge loss is not probabilistic and does not directly provide calibrated probabilities. It is useful when a max-margin property is desired.

import numpy as np

y_true = np.array([1.0, -1.0, 1.0])
scores = np.array([0.2, 0.4, 1.2])

# Zero loss once y * score exceeds the margin of 1
hinge_loss = np.mean(np.maximum(0, 1 - y_true * scores))
print("Hinge Loss:", hinge_loss)

KL Divergence

Kullback-Leibler divergence compares two probability distributions P and Q:

$$D_{\mathrm{KL}}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

It is nonnegative and becomes zero only when the two distributions are identical. KL divergence is not symmetric, so it is not a true metric. It is widely used in knowledge distillation, variational inference, and regularization of learned distributions toward a prior. In practice, PyTorch expects the input distribution in log-probability form, and using the wrong reduction can change the reported value. In particular, batchmean matches the mathematical KL definition more closely than mean.

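import torch
import torch.nn.functional as F

P = torch.tensor([[0.7, 0.2, 0.1]], dtype=torch.float32)
Q = torch.tensor([[0.6, 0.3, 0.1]], dtype=torch.float32)

# F.kl_div takes log-probabilities as input and probabilities as target,
# so this call computes KL(P || Q)
kl_batchmean = F.kl_div(Q.log(), P, reduction="batchmean")
print("KL Divergence (batchmean):", kl_batchmean.item())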

KL Divergence Reduction Pitfall

A common implementation issue with KL divergence is the choice of reduction. In PyTorch, reduction="mean" averages over every element (batch size times number of classes), so it scales the result differently from the true KL expression, while reduction="batchmean" sums over classes and divides by the batch size, which matches the standard definition.

import torch
import torch.nn.functional as F

P = torch.tensor([[0.7, 0.2, 0.1]], dtype=torch.float32)
Q = torch.tensor([[0.6, 0.3, 0.1]], dtype=torch.float32)

kl_batchmean = F.kl_div(Q.log(), P, reduction="batchmean")
kl_mean = F.kl_div(Q.log(), P, reduction="mean")

# With one row of three classes, "mean" is one third of "batchmean"
print("KL batchmean:", kl_batchmean.item())
print("KL mean:", kl_mean.item())

Variational Autoencoder ELBO

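The variational autoencoder, or VAE, is trained by maximizing the evidence lower bound, commonly called the ELBO:

$$\mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x \mid z)\big] - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p(z)\big)$$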

This objective has two parts. The reconstruction term encourages the model to explain the data well, while the KL term regularizes the approximate posterior toward the prior. The ELBO is not convex in neural network parameters, but it is differentiable under the reparameterization trick. It is widely used in generative modeling and probabilistic representation learning. In practice, many variants introduce a weight on the KL term, such as in beta-VAE.

import torch

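# Illustrative values; reconstruction_loss stands for the negative expected log-likelihood
reconstruction_loss = torch.tensor(12.5)
kl_term = torch.tensor(3.2)

# This sum is the negative ELBO, so minimizing it maximizes the ELBO
neg_elbo = reconstruction_loss + kl_term
print("VAE-style total loss:", neg_elbo.item())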
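When the approximate posterior is a diagonal Gaussian and the prior is a standard normal, the KL term has a well-known closed form. The sketch below assumes that setup, which the text above does not state explicitly, and adds an optional KL weight in the spirit of beta-VAE. `gaussian_kl` and `vae_loss` are hypothetical helpers and the numbers are illustrative.

```python
import math


def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions."""
    return -0.5 * sum(
        1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar)
    )


def vae_loss(reconstruction_loss, mu, logvar, beta=1.0):
    """Negative ELBO with an optional weight beta on the KL term."""
    return reconstruction_loss + beta * gaussian_kl(mu, logvar)


mu = [1.0, 0.0]  # illustrative encoder means
logvar = [0.0, 0.0]  # log-variances; 0.0 means unit variance

kl = gaussian_kl(mu, logvar)
print("KL term:", kl)
print("loss, beta=1:", vae_loss(12.5, mu, logvar))
print("loss, beta=4:", vae_loss(12.5, mu, logvar, beta=4.0))
```

Raising `beta` above 1 pushes the posterior harder toward the prior at the cost of reconstruction quality.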

Imbalance-Aware Losses

Class Weights

Class weighting is a common strategy for handling imbalanced datasets. Instead of treating all classes equally, higher loss weight is assigned to minority classes so that their errors contribute more strongly during training. In multiclass classification, weighted cross-entropy is often used:

$$\ell(y, z) = -w_y \log \frac{e^{z_y}}{\sum_{j} e^{z_j}}$$

where w_y is the weight for the true class. This approach is simple and effective when class frequencies differ substantially. However, excessively large weights can make optimization unstable.

import torch
import torch.nn.functional as F

logits = torch.tensor(
    [[2.0, 0.5, -1.0],
     [0.0, 1.0, 0.0],
     [0.2, -0.1, 1.5]],
    dtype=torch.float32,
)

y_true = torch.tensor([0, 1, 2], dtype=torch.long)

# One weight per class; larger values emphasize that class
class_weights = torch.tensor([1.0, 2.0, 3.0], dtype=torch.float32)

loss = F.cross_entropy(logits, y_true, weight=class_weights)
print("Weighted Cross-Entropy:", loss.item())

Positive Class Weight for Binary Loss

For binary or multi-label classification, many libraries provide a pos_weight parameter that increases the contribution of positive examples in binary cross-entropy. This is especially useful when positive labels are rare. In PyTorch, BCEWithLogitsLoss supports this directly.

This method is often preferred over naive resampling because it preserves all examples while adjusting the optimization signal. A common mistake is to confuse weight and pos_weight, since they affect the loss differently.

import torch

logits = torch.tensor([2.0, -1.0, 0.5], dtype=torch.float32)
y_true = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)

# pos_weight=3.0 makes each positive example count three times as much
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=torch.tensor([3.0]))
loss = criterion(logits, y_true)
print("BCEWithLogitsLoss with pos_weight:", loss.item())
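The difference between weight and pos_weight is easiest to see per example. The sketch below follows the per-example formula given in the PyTorch documentation for `BCEWithLogitsLoss`, where pos_weight multiplies only the positive term and weight multiplies the whole loss. The helpers are hypothetical plain-Python stand-ins, not the library implementation.

```python
import math


def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    return -(max(-z, 0.0) + math.log1p(math.exp(-abs(z))))


def bce_with_logits(z, y, weight=1.0, pos_weight=1.0):
    """Per-example BCE: pos_weight scales the positive term, weight scales everything."""
    positive_term = pos_weight * y * log_sigmoid(z)
    negative_term = (1.0 - y) * log_sigmoid(-z)  # log(1 - sigmoid(z))
    return -weight * (positive_term + negative_term)


# One positive and one negative example (illustrative logits).
pos_base = bce_with_logits(2.0, 1.0)
neg_base = bce_with_logits(-1.0, 0.0)

print("positive example, pos_weight=3:", bce_with_logits(2.0, 1.0, pos_weight=3.0))
print("negative example, pos_weight=3:", bce_with_logits(-1.0, 0.0, pos_weight=3.0))
print("negative example, weight=3:    ", bce_with_logits(-1.0, 0.0, weight=3.0))
```

Only `weight` changes the loss of the negative example, so the two parameters are not interchangeable when the goal is to emphasize rare positives.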

Focal Loss

Focal loss is designed to address class imbalance by down-weighting easy examples and focusing training on harder ones. For binary classification, it is commonly written as

$$\mathrm{FL}(p_t) = -\alpha\, (1 - p_t)^{\gamma} \log(p_t)$$

where p_t is the model probability assigned to the true class, α is a class-balancing factor, and γ controls how strongly easy examples are down-weighted. When γ=0, focal loss reduces to ordinary cross-entropy scaled by α.

Focal loss is widely used in dense object detection and highly imbalanced classification problems. Its main hyperparameters are α and γ, both of which can significantly affect training behavior.

import torch
import torch.nn.functional as F

logits = torch.tensor([2.0, -1.0, 0.5], dtype=torch.float32)
y_true = torch.tensor([1.0, 0.0, 1.0], dtype=torch.float32)

# Per-example BCE, which equals -log(p_t)
bce = F.binary_cross_entropy_with_logits(logits, y_true, reduction="none")

# Probability assigned to the true class
probs = torch.sigmoid(logits)
pt = torch.where(y_true == 1, probs, 1 - probs)

alpha = 0.25  # applied uniformly to all examples here
gamma = 2.0

focal_loss = (alpha * (1 - pt) ** gamma * bce).mean()
print("Focal Loss:", focal_loss.item())

Class-Balanced Reweighting

Class-balanced reweighting improves on simple inverse-frequency weighting by using the effective number of samples rather than raw counts. A common formula for the class weight is

$$w_c = \frac{1 - \beta}{1 - \beta^{n_c}}$$

where n_c is the number of samples in class c and β is a parameter close to 1. This gives smoother and often more stable reweighting than direct inverse counts.

This method is useful when class imbalance is severe but naive class weights would be too extreme. The main hyperparameter is β, which determines how strongly rare classes are emphasized.

import numpy as np

class_counts = np.array([1000, 100, 10], dtype=np.float64)
beta = 0.999

# Unnormalized effective number of samples per class: 1 - beta**n_c
effective_num = 1.0 - np.power(beta, class_counts)
class_weights = (1.0 - beta) / effective_num

# Normalize so the weights sum to the number of classes
class_weights = class_weights / class_weights.sum() * len(class_counts)
print("Class-Balanced Weights:", class_weights)

Segmentation and Detection Losses

Dice Loss

Dice loss is widely used in image segmentation, especially when the target region is small relative to the background. It is based on the Dice coefficient, which measures overlap between the predicted mask P and the ground-truth mask G:

$$\mathrm{Dice}(P, G) = \frac{2\,|P \cap G|}{|P| + |G|}$$

The corresponding loss is

$$L_{\mathrm{Dice}} = 1 - \mathrm{Dice}(P, G)$$

Dice loss directly optimizes overlap and is therefore well suited to imbalanced segmentation tasks. It is differentiable when soft predictions are used, but it can be sensitive to small denominators, so a smoothing constant ϵ is usually added.

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6  # smoothing constant to avoid division by zero

# Soft intersection: sum of elementwise products
intersection = torch.sum(y_pred * y_true)
dice = (2 * intersection + eps) / (torch.sum(y_pred) + torch.sum(y_true) + eps)

dice_loss = 1 - dice
print("Dice Loss:", dice_loss.item())

IoU Loss

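Intersection over Union, or IoU, also called the Jaccard index, is another overlap-based measure commonly used in segmentation and detection. For a predicted region P and a ground-truth region G, it is defined as

$$\mathrm{IoU}(P, G) = \frac{|P \cap G|}{|P \cup G|}$$

The loss form is

$$L_{\mathrm{IoU}} = 1 - \mathrm{IoU}(P, G)$$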

IoU loss is stricter than Dice loss because it penalizes disagreement more strongly. It is useful when accurate region overlap is the main objective. As with Dice loss, a small constant is added for stability.

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6  # smoothing constant for stability

intersection = torch.sum(y_pred * y_true)
union = torch.sum(y_pred) + torch.sum(y_true) - intersection

iou = (intersection + eps) / (union + eps)
iou_loss = 1 - iou
print("IoU Loss:", iou_loss.item())

Tversky Loss

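Tversky loss generalizes Dice and IoU style overlap losses by weighting false positives and false negatives differently. With TP, FP, and FN denoting true positives, false positives, and false negatives, the Tversky index is

$$\mathrm{TI} = \frac{TP}{TP + \alpha\, FP + \beta\, FN}$$

and the loss is

$$L_{\mathrm{Tversky}} = 1 - \mathrm{TI}$$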

This makes it especially useful in highly imbalanced segmentation problems, such as medical imaging, where missing a positive region may be much worse than including extra background. The choice of α and β controls this tradeoff.

import torch

y_true = torch.tensor([1, 1, 0, 0], dtype=torch.float32)
y_pred = torch.tensor([0.9, 0.8, 0.2, 0.1], dtype=torch.float32)

eps = 1e-6
alpha = 0.3  # weight on false positives
beta = 0.7   # weight on false negatives

# Soft counts computed from predicted probabilities
tp = torch.sum(y_pred * y_true)
fp = torch.sum(y_pred * (1 - y_true))
fn = torch.sum((1 - y_pred) * y_true)

tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
tversky_loss = 1 - tversky
print("Tversky Loss:", tversky_loss.item())
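The sense in which Tversky generalizes the other two overlap measures can be verified numerically: with soft counts, α = β = 0.5 gives the Dice coefficient and α = β = 1 gives IoU. The sketch below checks this in plain Python on the same toy masks; the helper names are hypothetical and the smoothing constant is left out because the denominators are nonzero here.

```python
def tversky_index(y_true, y_pred, alpha, beta):
    """Tversky index from soft true-positive, false-positive, false-negative counts."""
    tp = sum(p * t for p, t in zip(y_pred, y_true))
    fp = sum(p * (1 - t) for p, t in zip(y_pred, y_true))
    fn = sum((1 - p) * t for p, t in zip(y_pred, y_true))
    return tp / (tp + alpha * fp + beta * fn)


def dice(y_true, y_pred):
    """Soft Dice coefficient."""
    inter = sum(p * t for p, t in zip(y_pred, y_true))
    return 2 * inter / (sum(y_pred) + sum(y_true))


def iou(y_true, y_pred):
    """Soft intersection over union."""
    inter = sum(p * t for p, t in zip(y_pred, y_true))
    return inter / (sum(y_pred) + sum(y_true) - inter)


y_true = [1.0, 1.0, 0.0, 0.0]
y_pred = [0.9, 0.8, 0.2, 0.1]

print("Dice:", dice(y_true, y_pred), " Tversky(0.5, 0.5):", tversky_index(y_true, y_pred, 0.5, 0.5))
print("IoU: ", iou(y_true, y_pred), " Tversky(1, 1):    ", tversky_index(y_true, y_pred, 1.0, 1.0))
```

Choosing β larger than α, as in the example above with 0.7 and 0.3, moves away from both special cases toward penalizing missed positives more.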

Generalized IoU Loss

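Generalized IoU, or GIoU, is an extension of IoU designed for bounding-box regression in object detection. Standard IoU becomes zero when two boxes do not overlap, which gives no useful gradient. GIoU addresses this by incorporating the smallest enclosing box C of the predicted box A and the ground-truth box B:

$$\mathrm{GIoU} = \mathrm{IoU} - \frac{|C \setminus (A \cup B)|}{|C|}$$

The loss is

$$L_{\mathrm{GIoU}} = 1 - \mathrm{GIoU}$$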

GIoU is useful because it still provides a training signal even when predicted and true boxes do not overlap.

def box_area(box):
    # Boxes are [x1, y1, x2, y2]
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def intersection_area(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

pred_box = [1.0, 1.0, 3.0, 3.0]
true_box = [2.0, 2.0, 4.0, 4.0]

inter = intersection_area(pred_box, true_box)
area_pred = box_area(pred_box)
area_true = box_area(true_box)

union = area_pred + area_true - inter
iou = inter / union

# Smallest box enclosing both boxes
c_box = [
    min(pred_box[0], true_box[0]),
    min(pred_box[1], true_box[1]),
    max(pred_box[2], true_box[2]),
    max(pred_box[3], true_box[3]),
]

area_c = box_area(c_box)
giou = iou - (area_c - union) / area_c

giou_loss = 1 - giou
print("GIoU Loss:", giou_loss)
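The benefit for non-overlapping boxes can be shown with two made-up predictions that both miss the target entirely, one nearby and one far away. The following plain-Python sketch uses a hypothetical helper `iou_and_giou` and illustrative coordinates in `[x1, y1, x2, y2]` form.

```python
def box_area(box):
    """Area of an [x1, y1, x2, y2] box, zero if degenerate."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou_and_giou(a, b):
    """Return (IoU, GIoU) for two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = box_area(a) + box_area(b) - inter
    iou = inter / union
    enclosing = [min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3])]
    area_c = box_area(enclosing)
    return iou, iou - (area_c - union) / area_c


true_box = [0.0, 0.0, 1.0, 1.0]
near_box = [2.0, 0.0, 3.0, 1.0]  # misses the target by a small gap
far_box = [5.0, 0.0, 6.0, 1.0]   # misses the target by a large gap

iou_near, giou_near = iou_and_giou(near_box, true_box)
iou_far, giou_far = iou_and_giou(far_box, true_box)

print("IoU loss  near/far:", 1 - iou_near, 1 - iou_far)
print("GIoU loss near/far:", 1 - giou_near, 1 - giou_far)
```

The IoU loss is identical for both predictions, while the GIoU loss is smaller for the nearer box, so it still indicates which direction improves the prediction.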

Distance IoU Loss

Distance IoU, or DIoU, extends IoU by adding a penalty based on the distance between box centers. It is defined as

$$\mathrm{DIoU} = \mathrm{IoU} - \frac{\rho^2(b, b^{gt})}{c^2}$$

where ρ²(b, b^gt) is the squared distance between the centers of the predicted and ground-truth boxes, and c² is the squared diagonal length of the smallest enclosing box. The loss is

$$L_{\mathrm{DIoU}} = 1 - \mathrm{DIoU}$$

DIoU improves optimization by encouraging both overlap and spatial alignment. It is commonly used in bounding-box regression for object detection.

def box_center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def intersection_area(box1, box2):
    x1 = max(box1[0], box2[0])
    y1 = max(box1[1], box2[1])
    x2 = min(box1[2], box2[2])
    y2 = min(box1[3], box2[3])
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

pred_box = [1.0, 1.0, 3.0, 3.0]
true_box = [2.0, 2.0, 4.0, 4.0]

inter = intersection_area(pred_box, true_box)

area_pred = (pred_box[2] - pred_box[0]) * (pred_box[3] - pred_box[1])
area_true = (true_box[2] - true_box[0]) * (true_box[3] - true_box[1])

union = area_pred + area_true - inter
iou = inter / union

# Squared distance between box centers
cx1, cy1 = box_center(pred_box)
cx2, cy2 = box_center(true_box)
center_dist_sq = (cx1 - cx2) ** 2 + (cy1 - cy2) ** 2

# Squared diagonal of the smallest enclosing box
c_x1 = min(pred_box[0], true_box[0])
c_y1 = min(pred_box[1], true_box[1])
c_x2 = max(pred_box[2], true_box[2])
c_y2 = max(pred_box[3], true_box[3])
diag_sq = (c_x2 - c_x1) ** 2 + (c_y2 - c_y1) ** 2

diou = iou - center_dist_sq / diag_sq
diou_loss = 1 - diou
print("DIoU Loss:", diou_loss)

Representation Learning Losses

Contrastive Loss

Contrastive loss is used to learn embeddings by bringing similar samples closer together and pushing dissimilar samples farther apart. It is commonly used in Siamese networks. For a pair of embeddings with distance d and label y∈{0,1}, where y=1 indicates a similar pair, a common form is

$$L = y\, d^2 + (1 - y)\, \max(0,\; m - d)^2$$

where m is the margin. This loss encourages similar pairs to have small distance and dissimilar pairs to be separated by at least the margin. It is useful in face verification, signature matching, and metric learning.

import torch
import torch.nn.functional as F

z1 = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
z2 = torch.tensor([[1.5, 2.5]], dtype=torch.float32)

label = torch.tensor([1.0], dtype=torch.float32)  # 1 = similar, 0 = dissimilar

distance = F.pairwise_distance(z1, z2)
margin = 1.0

# Similar pairs are pulled together; dissimilar pairs are pushed past the margin
contrastive_loss = (
    label * distance.pow(2)
    + (1 - label) * torch.clamp(margin - distance, min=0).pow(2)
)
print("Contrastive Loss:", contrastive_loss.mean().item())

Triplet Loss

Triplet loss extends pairwise learning by using three examples: an anchor a, a positive sample p from the same class, and a negative sample n from a different class. The objective is to make the anchor closer to the positive than to the negative by at least a margin:

$$L = \max\big(0,\; d(a, p) - d(a, n) + m\big)$$

where d(⋅, ⋅) is a distance function and m is the margin. Triplet loss is widely used in face recognition, person re-identification, and retrieval tasks. Its success depends strongly on how informative triplets are selected during training.

import torch

anchor = torch.tensor([[1.0, 2.0]], dtype=torch.float32)
positive = torch.tensor([[1.1, 2.1]], dtype=torch.float32)
negative = torch.tensor([[3.0, 4.0]], dtype=torch.float32)

margin = 1.0

# p=2 selects Euclidean distance
triplet = torch.nn.TripletMarginLoss(margin=margin, p=2)
loss = triplet(anchor, positive, negative)
print("Triplet Loss:", loss.item())
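Why triplet selection matters can be seen by comparing an easy negative with a hard one. The sketch below is a plain-Python illustration with made-up embeddings; `triplet_loss` is a hypothetical helper using Euclidean distance, and the "hard" negative is simply one that lies close to the anchor.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=1.0):
    """max(0, d(a, p) - d(a, n) + margin) for a single triplet."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = [1.0, 2.0]
positive = [1.1, 2.1]
easy_negative = [3.0, 4.0]  # already far from the anchor
hard_negative = [1.5, 2.5]  # close to the anchor

easy_loss = triplet_loss(anchor, positive, easy_negative)
hard_loss = triplet_loss(anchor, positive, hard_negative)

print("loss with easy negative:", easy_loss)
print("loss with hard negative:", hard_loss)
```

The easy triplet already satisfies the margin and contributes zero loss and zero gradient, so a batch made only of such triplets teaches the model nothing.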

InfoNCE and NT-Xent Loss

InfoNCE is a contrastive objective widely used in self-supervised representation learning. It encourages an anchor embedding to be close to its positive pair while being far from other samples in the batch, which act as negatives. For an anchor z_i with positive z_i^+, a standard form is

$$L_i = -\log \frac{\exp\big(\mathrm{sim}(z_i, z_i^{+}) / \tau\big)}{\sum_{k} \exp\big(\mathrm{sim}(z_i, z_k) / \tau\big)}$$

where the sum runs over the positive and all negatives, sim is a similarity measure such as cosine similarity, and τ is a temperature parameter. NT-Xent is a normalized temperature-scaled variant commonly used in methods such as SimCLR. These losses are powerful because they learn rich representations without manual labels, but they depend strongly on batch composition, augmentation strategy, and temperature choice.

import torch
import torch.nn.functional as F

z_anchor = torch.tensor([[1.0, 0.0]], dtype=torch.float32)
z_positive = torch.tensor([[0.9, 0.1]], dtype=torch.float32)
z_negative1 = torch.tensor([[0.0, 1.0]], dtype=torch.float32)
z_negative2 = torch.tensor([[-1.0, 0.0]], dtype=torch.float32)

# Candidates: the positive first, then the negatives
embeddings = torch.cat([z_positive, z_negative1, z_negative2], dim=0)

# L2-normalize so the dot product is cosine similarity
z_anchor = F.normalize(z_anchor, dim=1)
embeddings = F.normalize(embeddings, dim=1)

similarities = torch.matmul(z_anchor, embeddings.T).squeeze(0)

temperature = 0.1
logits = similarities / temperature

labels = torch.tensor([0], dtype=torch.long)  # positive is first

loss = F.cross_entropy(logits.unsqueeze(0), labels)
print("InfoNCE / NT-Xent Loss:", loss.item())
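The dependence on temperature can be explored with a small plain-Python version of the same computation. This is a minimal sketch under the assumption of cosine similarity and a single anchor; `cosine` and `info_nce` are hypothetical helpers, and the embeddings mirror the illustrative values used above.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchor, positive, negatives, temperature):
    """InfoNCE for one anchor: cross-entropy with the positive as the correct class."""
    logits = [cosine(anchor, positive) / temperature]
    logits += [cosine(anchor, n) / temperature for n in negatives]
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return lse - logits[0]  # the positive sits at index 0


anchor = [1.0, 0.0]
positive = [0.9, 0.1]
negatives = [[0.0, 1.0], [-1.0, 0.0]]

loss_sharp = info_nce(anchor, positive, negatives, temperature=0.1)
loss_soft = info_nce(anchor, positive, negatives, temperature=1.0)

print("temperature 0.1:", loss_sharp)
print("temperature 1.0:", loss_soft)
```

A low temperature amplifies similarity gaps, so well-separated negatives contribute almost nothing, while a high temperature keeps every negative in play.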

Comparison Table and Practical Guidance

The table below summarizes key properties of commonly used loss functions. Here, convexity refers to convexity with respect to the model output, such as prediction or logit, for fixed targets, not convexity in neural network parameters. This distinction is important because most deep learning objectives are non-convex in parameters, even when the loss is convex in the output.

| Loss | Typical Task | Convex in Output | Differentiable | Robust to Outliers | Scale / Units |
|---|---|---|---|---|---|
| MSE | Regression | Yes | Yes | No | Squared target units |
| MAE | Regression | Yes | No (kink) | Yes | Target units |
| Huber | Regression | Yes | Yes | Yes (controlled by δ) | Target units |
| Smooth L1 | Regression / Detection | Yes | Yes | Yes | Target units |
| Log-cosh | Regression | Yes | Yes | Moderate | Target units |
| Pinball (Quantile) | Regression / Forecast | Yes | No (kink) | Yes | Target units |
| Poisson NLL | Count Regression | Yes (λ>0) | Yes | Not primary focus | Nats |
| Gaussian NLL | Uncertainty Regression | Yes (mean) | Yes | Not primary focus | Nats |
| BCE (logits) | Binary / Multilabel | Yes | Yes | Not applicable | Nats |
| Softmax Cross-Entropy | Multiclass | Yes | Yes | Not applicable | Nats |
| Hinge | Binary / SVM | Yes | No (kink) | Not applicable | Margin units |
| Focal Loss | Imbalanced Classification | Generally No | Yes | Not applicable | Nats |
| KL Divergence | Distillation / Variational | Context-dependent | Yes | Not applicable | Nats |
| Dice Loss | Segmentation | No | Almost (soft) | Not primary focus | Unitless |
| IoU Loss | Segmentation / Detection | No | Almost (soft) | Not primary focus | Unitless |
| Tversky Loss | Imbalanced Segmentation | No | Almost (soft) | Not primary focus | Unitless |
| GIoU | Box Regression | No | Piecewise | Not primary focus | Unitless |
| DIoU | Box Regression | No | Piecewise | Not primary focus | Unitless |
| Contrastive Loss | Metric Learning | No | Piecewise | Not primary focus | Distance units |
| Triplet Loss | Metric Learning | No | Piecewise | Not primary focus | Distance units |
| InfoNCE / NT-Xent | Contrastive Learning | No | Yes | Not primary focus | Nats |

Conclusion

Loss functions define how models measure error and learn during training. Different tasks—regression, classification, segmentation, detection, and representation learning—require different loss types. Choosing the right one depends on the problem, data distribution, and error sensitivity. Practical considerations like numerical stability, gradient scale, reduction methods, and class imbalance also matter. Understanding loss functions leads to better training and more informed model design decisions.

Frequently Asked Questions

Q1. What does a loss function do in machine learning?

A. It measures the difference between predictions and true values, guiding the model to improve during training.

Q2. How do I choose the right loss function?

A. It depends on the task, data distribution, and which errors you want to prioritize or penalize.

Q3. Why do reduction methods matter?

A. They affect gradient scale, influencing learning rate, stability, and overall training behavior.

Hi, I am Janvi, a passionate data science enthusiast currently working at Analytics Vidhya. My journey into the world of data began with a deep curiosity about how we can extract meaningful insights from complex datasets.
