# Unified Survey Of Anomalies, Novelty, Open Sets, And Outlier Detection

3 main points
✔️ Surveys the closely related problems of anomaly, novelty, open-set, and outlier detection from a unified perspective
✔️ Each field draws the boundary between normal and abnormal data differently, with corresponding variations in the methods used to separate them
✔️ The survey provides a comprehensive analysis and outlines future research questions

written by Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, Mohammad Sabokrou
(Submitted on 26 Oct 2021)
Comments: Published on arxiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

The images used in this article are from the paper, the introductory slides, or were created based on them.

## Introduction

Machine learning models commonly make the "closed set" assumption that test data is drawn from the same distribution as the training data (independent and identically distributed). In practice, however, a model can encounter all kinds of test inputs, including samples from classes it was never trained on. Unfortunately, models may assign misleadingly high confidence to test samples unlike anything they have seen. This has raised concerns about the reliability of classifiers, especially in safety-critical applications. Several research areas try to address the problem of identifying unknown, anomalous, or out-of-distribution data in an open-world setting. In particular, anomaly detection (AD), novelty detection (ND), one-class classification (OCC), out-of-distribution (OOD) detection, and open set recognition (OSR) have received significant attention because of their fundamental importance and practical relevance. Although they are used for similar tasks, their differences and connections are often overlooked.

Specifically, OSR trains the model on K classes of an N-class dataset, and at test time the model faces the N - K classes that were not seen during training. The objective of OSR is to assign correct labels to test samples from seen classes and to detect samples from unseen classes. Novelty detection and one-class classification are extreme cases of open set recognition in which K is 1. In the multiclass classification setting, the problem of OOD detection is essentially that of OSR: accurately classify in-distribution (ID) samples into the known categories, and detect OOD data that is semantically different and therefore should not be predicted by the model. However, OOD detection encompasses a wider range of learning tasks (e.g., multi-label classification, reinforcement learning) and solution spaces (e.g., density estimation), which the paper reviews comprehensively.

While the aforementioned domains assume access to an entirely normal training dataset, anomaly detection assumes that the training dataset is collected in a fully unsupervised manner, without any filtering, and thus may contain anomalous samples. Because anomalous events are rare, AD methods exploit this fact and filter them out during training to reach a final semantic space that fully captures normal features. Whereas the previous approaches are mostly used in object detection and image classification, this setup is more common in industrial defect detection tasks, where abnormal events rarely occur and normal samples share a common notion of normality. Fig. 1 shows a visual representation of the differences between these domains. Note that even though the domains are formulated differently, they have so much in common that the terms are often used interchangeably.

Anomaly detection is an important research area and several surveys have covered it, but they either treat each field independently or present very general anomaly detection concepts meant to cover all types of datasets. Instead, this survey describes the methodology of each field in detail. In this way, it builds bridges between the domains so that ideas propagate easily and inspire future research. For example, the idea of using outlier samples from different datasets to improve task-specific features is called Outlier Exposure or background modeling, depending on the field, and is very similar to semi-supervised anomaly detection. Even though the idea is shared, it is regarded as novel in each of the respective domains.

In summary, the main contributions of this paper are as follows:

(1) Clarify the relationship between research fields that have been examined separately despite being highly interconnected.

(2) Provide a comprehensive methodological analysis of prominent recent studies and clearly explain the reviewed methods in a theoretical and visual manner.

(3) Conduct comprehensive testing against existing baselines to provide a strong foundation for current and future research.

(4) Provide directions for future research and articulate the fundamentals that future methods will need, including fairness, adversarial robustness, privacy, data efficiency, and accountability.

## A General Perspective on Method Classification

Consider a dataset of training samples $(x_1, y_1), (x_2, y_2), \ldots$ drawn from a joint distribution $P_{X,Y}$, where $X$ and $Y$ are random variables on the input space $\mathcal{X} = \mathbb{R}^d$ and the label (output) space $\mathcal{Y}$, respectively. In AD and ND, the label space $\mathcal{Y}$ is a binary set indicating normal versus abnormal. At test time, given an input sample $x$, the model needs to estimate $P(Y = \text{normal/seen/in-class} \mid X = x)$ in the one-class setting. In OOD detection and OSR for multiclass classification, the label space may contain multiple semantic categories, so the model must additionally classify normal samples based on the posterior $P(Y = y \mid X = x)$. In AD, since the input samples may contain noise (anomalies) in addition to normal samples, the problem becomes a one-class classification problem with noisy labels, but the overall formulation of the detection task remains unchanged.

Two commonly known perspectives for modeling these conditional probabilities are generative and discriminative modeling. In the OOD detection and OSR settings, discriminative modeling may be easier because the labels of the training samples are available, but the lack of labels makes AD and ND (OCC) difficult. This is because the one-class classification problem has a trivial solution: map every input, whether normal or abnormal, to the given label $Y$, which minimizes the objective function as far as possible. This issue can be seen in approaches such as DSVDD, which, when trained for a large number of epochs, maps all inputs to a single point regardless of whether they are normal or abnormal.

However, several approaches modify the formulation of $P(Y \mid X)$ to solve this problem. They apply a set of affine transformations to the distribution of $X$, chosen so that the normalized distribution does not change. Then the sum $\sum_{i=1}^{|T|} P(T_i \mid T_i(X))$ is estimated: given the transformed input $T_i(X)$, it aggregates the probability that each transformation $T_i$ was the one applied to $X$, and it equals $|T| \, P(Y \mid X)$. This is similar to estimating $P(Y \mid X)$ directly, but it does not collapse, so it can be used instead of estimating a one-class conditional probability. This simple method avoids the collapse problem, but it makes the problem dependent on the choice of transformations, since the transformed inputs must overlap as little as possible to satisfy the constraint that the normalized distribution stays consistent. Therefore, as will be shown later, OSR methods can overcome their problems by employing AD approaches in combination with classification models. A similar approach has been followed in the OOD domain.
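The aggregated score can be sketched in a few lines. This is only an illustration: it assumes four 90-degree rotations as the transformation set, and `toy_predict_proba` is a hypothetical stand-in for a trained transformation classifier (it simply looks for the brightest image quadrant). Neither choice comes from the survey.

```python
import numpy as np

def rotations(image):
    """Return the image rotated counterclockwise by 0, 90, 180, and 270 degrees."""
    return [np.rot90(image, k) for k in range(4)]

def normality_score(image, predict_proba):
    """Sum over i of P(T_i | T_i(x)) for the four rotations T_i.

    `predict_proba` is a hypothetical callable mapping one image to a
    probability vector over the four transformation labels.
    """
    return float(sum(predict_proba(t_img)[i]
                     for i, t_img in enumerate(rotations(image))))

def toy_predict_proba(image, sharpness=10.0):
    """Toy classifier (assumption): normal images have a bright top-left
    quadrant, so the rotation is guessed from the brightest quadrant."""
    h, w = image.shape
    top, bottom = image[: h // 2], image[h // 2:]
    quadrant_means = np.array([
        top[:, : w // 2].mean(),     # bright top-left     -> rotation index 0
        bottom[:, : w // 2].mean(),  # bright bottom-left  -> rotation index 1
        bottom[:, w // 2:].mean(),   # bright bottom-right -> rotation index 2
        top[:, w // 2:].mean(),      # bright top-right    -> rotation index 3
    ])
    logits = sharpness * quadrant_means
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

normal = np.zeros((8, 8))
normal[:4, :4] = 1.0           # matches the pattern the toy classifier knows
flat = np.full((8, 8), 0.5)    # carries no orientation cue at all

print(normality_score(normal, toy_predict_proba))
print(normality_score(flat, toy_predict_proba))
```

For the normal pattern every rotation is recognized, so the score approaches |T| = 4. For the flat image the classifier can only guess, and the score falls to 1.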

In generative modeling, AE (autoencoder)-based, GAN (generative adversarial network)-based, and explicit density estimation methods such as autoregressive and flow-based models are used to model the data distribution. For AEs, there are two important assumptions. If the autoencoder has been trained only on normal training samples, then:

- It can reconstruct unseen normal test samples as accurately as training samples.

- It cannot reconstruct abnormal test samples as accurately as normal inputs.

However, recently proposed AE-based methods have shown that these assumptions do not always hold. For example, even if an AE can reconstruct a normal sample perfectly, a shift of only one pixel can make the reconstruction loss large.
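The one-pixel-shift effect is easy to reproduce with a pixel-level MSE. The sketch below uses a synthetic striped image as an assumed worst case; it illustrates the sensitivity of the loss and is not an experiment from the survey.

```python
import numpy as np

def mse(a, b):
    """Pixel-level mean squared error between two images."""
    return float(np.mean((a - b) ** 2))

# Vertical stripes one pixel wide: columns alternate between 0 and 1.
image = np.tile(np.array([0.0, 1.0]), (8, 4))   # shape (8, 8)

perfect = image.copy()                          # ideal reconstruction
shifted = np.roll(image, shift=1, axis=1)       # same content, moved by one pixel

print(mse(image, perfect))   # 0.0
print(mse(image, shifted))   # 1.0
```

The shifted output is semantically identical to the input, yet every pixel disagrees, so the reconstruction loss reaches its maximum.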

Similarly, another well-known model family, GANs, has been widely used for AD, ND, OCC, OSR, and OOD detection. When a GAN is trained on entirely normal training samples, it operates under the following assumptions:

- If the input is normal, there exists a latent vector whose generated output has a small discrepancy with the input.

- If the input is abnormal, there is no latent vector whose generated output has a small discrepancy with the input.

Here, the discrepancy can be defined by the pixel-level MSE loss between the generated image and the test-time input, or by a more complex function such as the layer-wise distance between the discriminator features of the generated image and those of the test-time input. Although GANs have been shown to capture semantic abstractions of a given training dataset, they are plagued by mode collapse, unstable training, and irreproducible results.

Finally, autoregressive and flow-based models can be used to approximate the data density explicitly and to detect abnormal samples based on the assigned likelihoods. Intuitively, normal samples should receive a higher likelihood than abnormal samples. As discussed below, however, these models can assign a higher likelihood to abnormal samples even though they never see such samples during training, which degrades AD, ND, OSR, and OOD detection performance. To solve this problem, several improvements have been proposed in the OOD domain that could also be used in OSR, AD, and ND, but their reliability needs further evaluation, since the usual test protocol for OOD detection may differ considerably from that of other domains such as AD and ND.

## Anomaly and novelty detection

Anomaly detection (AD) and novelty detection (ND) are used interchangeably in the literature, and few works discuss the differences between them. Anomaly detection has certain inherent challenges that contradict the assumption that the training data consists entirely of normal samples. For example, measurement noise is inevitable in physical experiments, so in an unsupervised training process the algorithm must automatically detect and focus on normal samples. This is not the case for novelty detection, where there are many applications in which a clean dataset can be provided easily with minimal supervision. Although the two areas have diverged over time, their names are still not used consistently in the literature.

Interest in anomaly detection dates back to 1969, when an anomaly/outlier was defined as "a sample that appears to deviate significantly from the other members of the sample in which it occurs", explicitly assuming the existence of a basic shared pattern followed by the majority of training samples. This definition leaves some ambiguities. For example, a criterion is needed for the notion of deviation, and the term "significantly" must be made quantitative. For this reason, significant effort has gone into making these concepts precise, both before and after the advent of deep learning. To find samples that deviate from the trend, an appropriate distance metric is needed. There is also the challenge of selecting a threshold that determines whether the deviation from the normal samples is significant.

### Robust deep autoencoder for anomaly detection

This work trains an autoencoder (AE) on a dataset containing both inliers and outliers. Outliers are detected and filtered out during training, under the assumption that inliers are significantly more frequent and share a common normal concept. In this way, the AE is effectively trained only on normal training samples and consequently cannot reconstruct abnormal test-time inputs well. To this end, the objective separates the training data into an inlier part and an outlier part, and the Alternating Direction Method of Multipliers (ADMM) is used to split it into two (or more) parts that are optimized in turn.

Here $E$ and $D$ are the encoder and decoder networks, respectively. $L_D$ is taken to be the inlier part of the training data $X$ and $S$ the outlier part. The optimization is not straightforward because $S$ and $\theta$ need to be optimized together. To address this problem, the alternating direction method of multipliers (ADMM) is used, which divides the objective into two (or more) parts. In the first step, $S$ is fixed and the optimization problem for the parameters $\theta$ is solved with $L_D = X - S$, so the objective becomes $\|L_D - D_\theta(E_\theta(L_D))\|_2$. In the second step, $L_D$ is set to the reconstruction produced by the trained AE, $S$ is set to $X - L_D$, and the optimization problem for the norm of $S$ is solved. Since the $L_1$ norm is not differentiable, a proximal operator is used as an approximation in each optimization step, applied elementwise as $\operatorname{prox}_{\lambda}(s_i) = \operatorname{sign}(s_i)\max(|s_i| - \lambda, 0)$.

Such a function is known as a shrinkage operator and is very common in $L_1$ optimization problems. The objective function with $\|S\|_1$ separates only unstructured noise, e.g., Gaussian noise on the training samples, from the normal content of the training dataset. To separate structured noise, such as samples that convey a completely different meaning from the majority of the training samples, the $L_{2,1}$ norm can be used as the optimization criterion instead.
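A shrinkage (soft-thresholding) operator for the $L_1$ term can be written in one line. The sketch below is the standard elementwise form; the function name and the threshold symbol `lam` are assumptions, not notation from the paper.

```python
import numpy as np

def soft_threshold(s, lam):
    """Elementwise proximal operator of lam * ||s||_1.

    Entries with magnitude below lam are set to zero; the rest are
    moved toward zero by lam.
    """
    return np.sign(s) * np.maximum(np.abs(s) - lam, 0.0)

residual = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(soft_threshold(residual, lam=1.0))
```

Small residuals are zeroed and therefore attributed to the inlier part, while large residuals survive (shrunk by `lam`) as entries of the outlier part $S$.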

In that case, the proximal operator is the blockwise soft-thresholding function [27]. At test time, the reconstruction error is used to reject anomalous inputs.
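The blockwise variant shrinks whole groups of entries together rather than individual entries. The following is a minimal sketch of the standard group soft-thresholding rule, assuming each row of `S` is one block (for example, one training sample); the grouping axis is an assumption.

```python
import numpy as np

def block_soft_threshold(S, lam, axis=1):
    """Proximal operator of lam * ||S||_{2,1} with blocks along `axis`.

    Each block is scaled by max(1 - lam / ||block||_2, 0), so blocks
    with a small L2 norm are removed entirely.
    """
    norms = np.linalg.norm(S, axis=axis, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return S * scale

S = np.array([[3.0, 4.0],     # row norm 5.0: kept, scaled by 0.8
              [0.3, 0.4]])    # row norm 0.5: zeroed out
print(block_soft_threshold(S, lam=1.0))
```

Because a whole row is either kept or discarded, this operator flags entire samples as outliers, which matches the structured-noise case.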

### Adversarially Learned One-Class Classifier for Novelty Detection (ALOCC)

Assuming that entirely normal training samples are given, the goal is to train a novelty detection model on them. First, a reconstructor (R) is trained as a denoising autoencoder (DAE) to (1) reduce the reconstruction loss and (2) fool a discriminator (D) in a GAN-based setting. This lets the DAE produce high-quality images instead of blurry outputs. The blur arises because the AE loss explicitly assumes an independent Gaussian distribution for each pixel, whereas the true distribution of pixels is usually multimodal, so the mean of each Gaussian has to settle between different modes. This leads to blurry images on complex datasets. To solve this problem, the AE can be trained in a GAN-based framework, which forces the mean of each Gaussian to capture only one mode of the corresponding true distribution. Furthermore, by using the discriminator output (D) instead of the pixel-level loss, normal samples that are not properly reconstructed can still be detected as normal. This significantly reduces the false positive rate (FPR) relative to a vanilla DAE.

As a result, the model produces higher-quality outputs while retaining the anomaly detection capability of an AE. Detection can also be based on D(R(X)), as described above. Fig. 2 shows the overall architecture of this work.

### One-class novelty detection using GANs with constrained latent representations (OC-GAN)

AEs trained on entirely normal training samples can reconstruct unseen anomalous inputs with even lower error. To solve this problem, this work makes the latent distribution of the encoder (EN(·)) resemble a uniform distribution in an adversarial manner. Similarly, the decoder (De(·)) is forced to reconstruct in-class outputs for any latent value sampled from the uniform distribution. The learning objective distributes normal features over the latent space so that the reconstructed output completely, or at least roughly, resembles the normal class for both normal and abnormal inputs. The method also applies another technique in the latent space, called informative negative sample mining, to actively look for regions that produce low-quality images. To do so, a classifier is trained to distinguish between the reconstructed outputs of the decoder and fake images.

### Latent Space Autoregression for Novelty Detection (LSA)

For novelty detection, this method proposes a concept called "surprise", which specifies the uniqueness of an input sample in the latent space. The more unique a sample is, the less likely it is in the latent space, and consequently the more likely it is to be an anomalous sample. This is especially beneficial when the training dataset contains many similar normal samples. For visually similar training samples, AEs usually learn to reconstruct their mean as the output in order to minimize the MSE. This results in blurred outputs and a larger reconstruction error for such inputs. Using the surprise loss together with the reconstruction error mitigates this problem. In addition, anomalous samples are usually more surprising, which increases their novelty score. The surprise score is learned with an autoregressive model in the latent space, as shown in Fig. 4. The autoregressive model (h) can be instantiated with different architectures, from LSTM and RNN networks to more complex ones. As with other AE-based methods, the reconstruction error is also optimized.

### Memory-Augmented Deep Autoencoder (Mem-AE) for Unsupervised Anomaly Detection

This method challenges the second assumption made when using AEs. It shows that an abnormal sample can be reconstructed perfectly even if the training dataset contains no abnormal samples. Intuitively, an AE does not necessarily learn features that uniquely describe normal samples; as a result, it may extract abnormal features from abnormal inputs and reconstruct them perfectly. It is therefore necessary to learn features that accurately reconstruct only normal samples. For this purpose, Mem-AE employs a memory that stores unique and sufficient features of the normal training samples. During training, the encoder implicitly plays the role of an address generator for the memory. The encoder generates an embedding, and the memory items similar to that embedding are combined. The combined embedding is passed to the decoder to produce the corresponding reconstructed output. Mem-AE also employs a sparse addressing technique that uses only a small number of memory items. The decoder is therefore restricted to reconstructing from a small number of addressed memory items, which requires the memory items to be used efficiently. Furthermore, the reconstruction error drives the memory to record prototypical patterns that are representative of normal inputs.
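The addressing step can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity followed by a softmax produces the addressing weights, and a plain hard threshold (`shrink`, a hypothetical parameter name) stands in for the sparse addressing described above. It is not the authors' implementation.

```python
import numpy as np

def memory_read(z, memory, shrink=0.2, eps=1e-12):
    """Combine memory items that are similar to the encoder embedding z.

    z:      embedding of shape (d,)
    memory: memory items of shape (n_items, d)
    Returns the combined embedding for the decoder and the sparse weights.
    Note: `shrink` must stay below the largest weight (at least 1/n_items),
    otherwise every weight is zeroed.
    """
    z_unit = z / (np.linalg.norm(z) + eps)
    m_unit = memory / (np.linalg.norm(memory, axis=1, keepdims=True) + eps)
    similarity = m_unit @ z_unit                  # cosine similarity per item

    weights = np.exp(similarity - similarity.max())
    weights /= weights.sum()                      # softmax addressing weights

    weights = np.where(weights > shrink, weights, 0.0)   # keep only a few items
    weights /= weights.sum() + eps                        # renormalize
    return weights @ memory, weights

memory = np.eye(4)                         # four toy prototypes of normal data
z = np.array([1.0, 0.0, 0.0, 0.0])         # embedding close to prototype 0
z_hat, w = memory_read(z, memory)
print(w)
print(z_hat)
```

With only one item surviving the threshold, the decoder input is exactly a stored normal prototype, regardless of what abnormal detail the encoder extracted.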

### Old Is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm

This method extends the idea of ALOCC, which is trained in a GAN-based setting and suffers from stability and convergence problems. On the one hand, overtraining ALOCC can confuse the discriminator D because the generated fake data becomes too realistic. On the other hand, undertraining makes the discriminator features less usable. To address this problem, the work proposes a two-stage training process. The first stage performs a training process similar to ALOCC.

### Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

This work investigates the benefits of training a supervised learning task in combination with self-supervised learning (SSL) to improve the robustness of the classifier to simple distribution shifts and its performance on OOD detection tasks. To this end, an auxiliary rotation prediction task is added to standard supervised classification. Robustness is measured against simple corruptions such as Gaussian noise, shot noise, blur, zoom, and fog. The results confirm that while the auxiliary SSL task does not improve classification accuracy, it does significantly improve the robustness and detection ability of the model. Furthermore, training the total loss function in an adversarially robust manner improves robust accuracy. The method is also tested in the ND setting using rotation prediction together with horizontal and vertical translation prediction, which is similar to GT and GOAD but simpler. In the multiclass classification setting, an auxiliary self-supervised learning objective is found to improve the maximum softmax probability detector. In addition, the work tries to make the output of the confidence layer uniform for background or outlier samples; as in Outlier Exposure, the outliers are selected from other accessible datasets.

### Unsupervised Out-of-Distribution Detection by Maximum Classifier Discrepancy

The method is based on a surprising fact: two classifiers trained with different random initializations can behave differently at their confidence layers on unseen test-time samples. Based on this, the work tries to increase the discrepancy for unseen samples and decrease it for seen samples. The discrepancy loss is the difference between the entropy of the last layer of the first classifier and that of the second classifier. This pushes the classifiers toward the same confidence scores for in-class inputs but a larger discrepancy for other inputs. Fig. 26 shows the overall architecture.

First, the two classifiers are trained on the in-class samples and encouraged to produce the same confidence scores. Next, an unlabeled dataset containing both OOD and in-class data is used to maximize the discrepancy on outliers while maintaining consistency on inliers.
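The discrepancy between the two classifiers can be computed directly from their softmax outputs. The sketch below follows the entropy-difference description given here; the function names are assumptions, and the training loop that maximizes the quantity on unlabeled data is omitted.

```python
import numpy as np

def entropy(probs, eps=1e-12):
    """Shannon entropy of a probability vector (natural log)."""
    return -np.sum(probs * np.log(probs + eps), axis=-1)

def discrepancy(probs_1, probs_2):
    """Entropy of the first classifier's output minus that of the second."""
    return entropy(probs_1) - entropy(probs_2)

agree = np.array([0.9, 0.05, 0.05])
uniform = np.array([1 / 3, 1 / 3, 1 / 3])
one_hot = np.array([1.0, 0.0, 0.0])

print(discrepancy(agree, agree))      # classifiers agree on an in-class input
print(discrepancy(uniform, one_hot))  # classifiers disagree on an unseen input
```

At test time, a large discrepancy between the two heads is taken as evidence that the input is out of distribution.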

### Why ReLU Networks Yield High-Confidence Predictions Far Away from the Training Data

This work proves that a ReLU network computes a piecewise affine function. On each polytope $Q(x)$ of the input space it can therefore be written as $f(x) = V^{l}x + a^{l}$.

Here $n_l$ and $L$ are the number of hidden units in the $l$-th layer and the total number of layers, respectively.

For $\alpha \to \infty$, the maximum softmax confidence at the scaled input $\alpha x$ tends to 1. This implies that a ReLU network has infinitely many inputs that yield high-confidence predictions. Note that, in practice, arbitrarily high confidence cannot be reached because the domain of the inputs is bounded.
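The scaling behaviour can be observed on a small random network. This sketch assumes an untrained one-hidden-layer ReLU network with Gaussian weights, which is an illustrative choice and not the setup of the paper; it only shows how the softmax confidence behaves as an input is scaled away from the origin.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(32, 5)), rng.normal(size=32)   # hidden layer
W2, b2 = rng.normal(size=(3, 32)), rng.normal(size=3)    # 3-class output layer

def max_softmax(x):
    """Maximum softmax probability of the toy ReLU network at input x."""
    hidden = np.maximum(W1 @ x + b1, 0.0)
    logits = W2 @ hidden + b2
    logits = logits - logits.max()            # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return float(probs.max())

x = rng.normal(size=5)
alphas = (1.0, 10.0, 100.0, 1e4)
confidences = [max_softmax(alpha * x) for alpha in alphas]
for alpha, conf in zip(alphas, confidences):
    print(alpha, conf)
```

Far from the origin the network is affine in `alpha`, so the gap between the largest logit and the others grows linearly and the softmax saturates.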

### Do Deep Generative Models Know What They Don't Know?

This paper uses likelihood ratios to alleviate the problem of OOD detection in generative models. The key idea is to model the background and foreground information separately. Intuitively, when semantically irrelevant information is added to the input distribution, the background information is assumed to be harmed less than the foreground information. Thus, two autoregressive models are trained, one on the noisy and one on the original input distribution, and their likelihood ratio is defined as Equation 75.

At test time, a threshold is applied to the likelihood ratio score.
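For reference, a likelihood ratio between two such models is usually written in log form as below. The symbols are assumptions introduced here: $p_{\theta}$ denotes the model trained on the original inputs and $p_{\theta_0}$ the background model trained on the noisy inputs.

$$
\mathrm{LLR}(x) = \log \frac{p_{\theta}(x)}{p_{\theta_0}(x)} = \log p_{\theta}(x) - \log p_{\theta_0}(x)
$$

Components of the likelihood that both models explain equally well, such as background statistics, cancel in the difference, so the score is driven mainly by the semantic foreground content.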

### Likelihood Ratios for Out-of-Distribution Detection

This paper employs likelihood ratios to alleviate the problem of OOD detection in generative models. The key idea is to model the background and foreground information separately. Intuitively, background information is assumed to be harmed less than foreground information when semantically irrelevant information is added to the input distribution.

### Generalized ODIN

As an extension of ODIN, this work proposes a specialized network for learning the temperature scaling and a strategy for selecting the perturbation magnitude. G-ODIN defines an explicit binary domain variable $d \in \{d_{in}, d_{out}\}$ that represents whether the input $x$ is an inlier (i.e., $x \sim p_{in}$). The posterior distribution can then be decomposed as $p(y \mid d_{in}, x) = \frac{p(y, d_{in} \mid x)}{p(d_{in} \mid x)}$. In this form, the reason for assigning overconfident scores to outliers becomes clearer: small values of both $p(y, d_{in} \mid x)$ and $p(d_{in} \mid x)$ can still produce a large value of $p(y \mid d_{in}, x)$. The two terms are therefore decomposed and estimated as $h_i(x)$ and $g(x)$ for $p(y, d_{in} \mid x)$ and $p(d_{in} \mid x)$, respectively, using different heads of a shared feature extractor network. Such a structure is called dividend/divisor, and the logit $f_i(x)$ of class $i$ can be written as $h_i(x) / g(x)$. The loss function is a simple cross-entropy, as in previous approaches. Note that the loss can be minimized either by increasing $h_i(x)$ or by decreasing $g(x)$. For example, if the data is not in a high-density region of the in-distribution, $h_i(x)$ may be small, so $g(x)$ is forced to be small to minimize the objective function. In the other case, $g(x)$ is encouraged to be large. The two heads thus approximate the roles of the distributions $p(y, d_{in} \mid x)$ and $p(d_{in} \mid x)$ described above. At test time, $\max_i h_i(x)$ or $g(x)$ is used. Fig. 27 gives an overview of the method.

### Resampling of background data for outlier-aware classification.

As mentioned earlier, for AD, ND, OSR, and OOD detection, some methods use background or outlier datasets to improve performance. However, the size of the auxiliary data set is important to avoid different types of bias. In this work, we propose a resampling technique to select an optimal number of training samples from an outlier dataset so that samples on the boundary play a more influential role in the optimization task. This work first provided an interesting probabilistic interpretation of the outlier exposure method. The loss function can be written as in Equation 78 where Lcls and Luni are shown in Equations 76 and 77, respectively.

### Input Complexity and Out-of-Distribution Detection with Likelihood-Based Generative Models

This paper further investigates the problem of generative models assigning high likelihood values to OOD samples. In particular, it finds a strong link between the complexity of an OOD sample and its likelihood value: the simpler the input, the higher the likelihood may be. This phenomenon is illustrated in Fig. 28. Another experiment supporting the claim starts from random noise and applies mean pooling at each step. To preserve the dimensionality, upscaling is performed after the pooling. Surprisingly, the simpler images to which more pooling has been applied achieve a higher likelihood. Motivated by this, the work proposes to detect OOD samples by considering the complexity of the input in combination with the likelihood value. Because the complexity of an input is difficult to compute, the paper instead uses a lossless compression algorithm to compute an upper bound. Given a set of inputs $x$ coded with the same bit depth, the normalized size $L(x)$ (in bits per dimension) of their compressed versions is used as the measure of complexity. Finally, the OOD score is defined as $S(x) = -\ell_M(x) - L(x)$, where $\ell_M(x)$ is the log-likelihood of $x$ under the generative model $M$.
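A complexity estimate of this kind can be obtained with any lossless compressor. The sketch below uses `zlib` from the Python standard library as an assumed stand-in for the compressor and subtracts the estimate from a negative log-likelihood given in bits per dimension. The likelihood value itself would come from a trained generative model and is passed in as a plain number here.

```python
import zlib
import numpy as np

def complexity_bits_per_dim(image_uint8):
    """Compressed size of the image in bits per dimension, used as L(x)."""
    raw = np.ascontiguousarray(image_uint8, dtype=np.uint8).tobytes()
    compressed = zlib.compress(raw, 9)
    return 8.0 * len(compressed) / image_uint8.size

def ood_score(neg_log2_likelihood_bpd, image_uint8):
    """Negative log-likelihood (bits/dim) minus the complexity estimate."""
    return neg_log2_likelihood_bpd - complexity_bits_per_dim(image_uint8)

rng = np.random.default_rng(0)
flat = np.zeros((32, 32, 3), dtype=np.uint8)                   # very simple input
noise = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)  # very complex input

print(complexity_bits_per_dim(flat))
print(complexity_bits_per_dim(noise))
```

Subtracting the complexity term offsets the high likelihood that simple inputs receive, so a simple image with a given likelihood is scored as more suspicious than a complex image with the same likelihood.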

### Energy-based out-of-distribution detection

This work proposes using energy scores derived from the logit outputs for OOD detection and shows that they are superior to softmax scores. An energy-based model maps each input $x$ to a single deterministic scalar called the energy. A set of energy values $E(x, y)$ can be turned into a density function $p(x)$ through the Gibbs distribution.
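For a classifier with logits $f(x)$, the energy is commonly computed as the negative log-sum-exp of the logits. The sketch below assumes that form with a temperature parameter; the function name is an assumption.

```python
import numpy as np

def energy_score(logits, temperature=1.0):
    """E(x) = -T * log(sum_i exp(f_i(x) / T)), computed stably."""
    z = np.asarray(logits, dtype=float) / temperature
    z_max = z.max(axis=-1, keepdims=True)
    log_sum_exp = z_max.squeeze(-1) + np.log(np.exp(z - z_max).sum(axis=-1))
    return -temperature * log_sum_exp

confident = np.array([10.0, 0.0, 0.0, 0.0])   # one dominant logit
uncertain = np.zeros(4)                        # no class stands out

print(energy_score(confident))
print(energy_score(uncertain))
```

Inputs with a dominant logit receive lower energy, so thresholding the negative energy separates in-distribution from OOD inputs without normalizing the logits through a softmax.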

### Likelihood Regret: Out-of-Distribution Detection Score for Variational Autoencoders

Previous work has shown that a VAE can reconstruct OOD samples perfectly, which makes them difficult to detect. The average test likelihood of a VAE varies over a much narrower range across datasets than that of PixelCNN or Glow, indicating that it is much harder for a VAE to distinguish OOD samples from inlier samples. The reason may lie in the different ways the input distribution is modeled. Autoregressive and flow-based methods model the input at the pixel level, whereas the bottleneck structure of a VAE forces the model to ignore some information.

To address this problem, a criterion called likelihood regret has been proposed. It measures the discrepancy between a model trained to maximize the average likelihood over a training dataset, for example a simple VAE, and a model that maximizes the likelihood of a single input image. The latter is referred to as the ideal model for that sample. Intuitively, for in-distribution inputs the difference in likelihood between the trained model and the ideal model should not be large. However, this is not the case for OOD inputs. A simple VAE is trained by maximizing the average evidence lower bound (ELBO) over the training set.

### Understanding Anomaly Detection with Deep Invertible Networks through Hierarchies of Distributions and Features

This work studies the problem of flow-based generative models for OOD detection. It notes that local features such as smooth local patches may dominate the likelihood. As a result, smoother datasets, such as SVHN, achieve higher likelihoods than less smooth datasets, such as CIFAR-10, regardless of the training dataset. Another interesting experiment shows that fully connected networks perform better than convolutional Glow networks when likelihood values are used to detect OOD samples. This also supports the existence of a relationship between local statistics such as continuity and likelihood values. Fig. 30 shows the similarity of the local statistics of various datasets, computed from the difference between a pixel value and the average of its 3 × 3 neighbors.

There is a strong Spearman correlation between the pseudo-likelihood based on these local statistics and the exact value of the likelihood. To deal with this problem, the following three steps are used:

- Train a generative network on a general image distribution such as 80 Million Tiny Images.

- Train another generative network on in-distribution images (e.g., CIFAR-10).

- Use the likelihood ratio of the two networks for OOD detection.

### Self-Supervised Learning for Generalizable Out-of-Distribution Detection

This work uses a self-supervised learning method to exploit information from an unlabeled outlier dataset and improve the OOD detection ability of an in-distribution classifier. To do so, the classifier is first trained on in-class training samples until the desired performance is achieved. Then, additional outputs (a set of k reject classes) are added to the last layer. Each training batch consists of ID data and a few outlier samples. The following loss functions are used

### SSD: A Unified Framework for Self-Supervised Outlier Detection

The idea of this study is very similar to GDFR: it incorporates SSL so that the in-class samples do not need to be labeled. This differs from several of the aforementioned methods, which need to solve a classification task. As a result, SSD can be used flexibly in a variety of settings, including ND, OSR, and OOD detection. The main idea is to employ contrastive learning to learn semantically meaningful features. After representation learning, k-means clustering is applied, and each cluster $m$ is summarized by its mean and covariance $(\mu_m, \Sigma_m)$ as an estimate of the class centers. Then, for each test-time sample, the Mahalanobis distance to the nearest class centroid is used as the OOD detection score.
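The scoring step can be sketched as follows, assuming the cluster means and covariances have already been estimated from the learned features. The function name and the toy clusters are assumptions for illustration.

```python
import numpy as np

def mahalanobis_ood_score(z, means, covariances):
    """Smallest squared Mahalanobis distance from feature z to any cluster."""
    distances = []
    for mu, cov in zip(means, covariances):
        diff = z - mu
        # Solve cov @ v = diff instead of forming the explicit inverse.
        distances.append(float(diff @ np.linalg.solve(cov, diff)))
    return min(distances)

means = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
covariances = [np.eye(2), np.eye(2)]

print(mahalanobis_ood_score(np.array([1.0, 0.0]), means, covariances))  # 1.0
print(mahalanobis_ood_score(np.array([5.0, 5.0]), means, covariances))  # 50.0
```

A feature close to any cluster gets a small score, while a feature lying between clusters is far from all of them and is flagged as OOD.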

### MOOD: Multi-level out-of-distribution detection

This study is the first to investigate the computational efficiency of OOD detection. Intuitively, some OOD samples can be detected using only low-level statistics, without the need for complex modeling. For this purpose, several intermediate classifiers are trained that operate at different depths of the trained network, as shown in Fig. 31. Finding the required exit depth requires an approximation of the complexity of the input. To deal with this problem, the number of bits used to encode the compressed image, $L(x)$, is used. The exit depth $I(x)$ is then determined by the complexity range to which the sample belongs.

### MOS: Towards scaling of out-of-distribution detection for large semantic spaces

MOS first shows that OOD detection performance can degrade significantly as the number of in-distribution classes increases. For example, the analysis shows that as the number of classes increases from 50 to 1,000 on ImageNet-1k, the average false positive rate (at a 95% true positive rate) of a typical baseline rises from 17.34% to 76.94%. To overcome this challenge, the key idea of MOS is to decompose the large semantic space into smaller groups of similar concepts, which simplifies the decision boundaries between known and unknown data. Specifically, MOS divides the C categories into K groups G1, G2, ..., GK. Grouping is based on the taxonomy of the label space if it is known; otherwise it is done by applying k-means to features extracted from the last layer of a pre-trained network, or by random grouping. A standard softmax is then computed separately within each group Gk.
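A per-group softmax normalizes each logit only against the other logits in its own group. The sketch below shows just that normalization step under assumptions: the groups are given as lists of class indices, the function name `group_softmax` is hypothetical, and any additional components of the full MOS method are not modeled here.

```python
import numpy as np

def group_softmax(logits, groups):
    """Apply a separate softmax inside each group of class indices."""
    probs = np.zeros_like(logits, dtype=float)
    for members in groups:
        z = logits[members]
        e = np.exp(z - z.max())  # subtract the max for numerical stability
        probs[members] = e / e.sum()
    return probs

# Toy usage: five classes split into two groups.
logits = np.array([1.0, 1.0, 0.0, 0.0, 0.0])
groups = [[0, 1], [2, 3, 4]]
probs = group_softmax(logits, groups)
```

Each group's probabilities sum to one on their own, so a class only competes with semantically related classes rather than with the full label space.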

### Can Multi-Label Classification Networks Know What They Don't Know?

This study investigates OOD detection in a multi-label classification setting, where each input sample may have more than one label. This makes the problem harder, because the joint distribution over labels is difficult to model. The work proposes the JointEnergy criterion, a simple and effective way to estimate OOD scores by aggregating per-label energy scores across labels. It also shows that JointEnergy can be interpreted mathematically in terms of the joint likelihood.
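As an illustration of the aggregation, the sketch below sums a per-label free-energy term log(1 + exp(f_y(x))) over all labels. The input is assumed to be a vector of per-label logits from a multi-label classifier, and the function name `joint_energy` is hypothetical.

```python
import numpy as np

def joint_energy(logits):
    """Sum over labels of log(1 + exp(logit)), computed stably."""
    # np.logaddexp(0, x) equals log(exp(0) + exp(x)) = log(1 + exp(x)).
    return float(np.sum(np.logaddexp(0.0, logits)))

# Toy usage: an input with two confidently present labels versus an
# input for which no label is active.
confident = joint_energy(np.array([6.0, 5.0, -4.0]))
uncertain = joint_energy(np.array([-4.0, -5.0, -4.0]))
```

Under this convention a higher score indicates in-distribution data, because evidence from several active labels accumulates instead of only the single strongest label being used.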

### On the importance of gradients for detecting wild distribution shifts

This work proposes a simple post hoc OOD detection method, GradNorm, which uses the vector norm of the gradients with respect to the weights, backpropagated from the KL divergence between the softmax output and a uniform probability distribution. GradNorm is generally higher for in-distribution (ID) data than for OOD data, so it can be used for OOD detection.
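For intuition, the score can be written in closed form when the last layer is linear. The sketch below makes several assumptions that are not stated in the text above: logits are `W @ h` with no bias, the softmax temperature is 1, and the L1 norm is used. With a uniform distribution u over C classes, KL(u || p) = -log C - (1/C) Σ_c log p_c, and its gradient with respect to W is the outer product of (p - u) and h.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def kl_uniform_to_softmax(logits):
    """KL(u || softmax(logits)) for a uniform u over the C classes."""
    p = softmax(logits)
    c = logits.shape[0]
    return float(-np.log(c) - np.mean(np.log(p)))

def gradnorm_score(W, h):
    """L1 norm of the gradient of the KL term with respect to W."""
    p = softmax(W @ h)
    u = np.full_like(p, 1.0 / p.shape[0])
    grad = np.outer(p - u, h)  # d KL / d W for logits = W @ h
    return float(np.abs(grad).sum())

# Toy usage: a peaked prediction gives a larger score than a flat one.
h = np.array([1.0, 0.0, 0.0])
peaked = gradnorm_score(5.0 * np.eye(3), h)
flat = gradnorm_score(np.zeros((3, 3)), h)
```

When the prediction is exactly uniform the gradient vanishes and the score is zero, which matches the observation that less confident (OOD-like) inputs tend to receive lower scores.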

## data set

### semantic-level data set

Below is a summary of the datasets that can be used to detect semantic anomalies. Semantic anomalies are anomalies in which the change in pixels leads to a change in semantic content. Datasets such as MNIST, Fashion-MNIST, SVHN, and COIL-100 are considered toy datasets. CIFAR-10, CIFAR-100, LSUN, and TinyImageNet are harder datasets with more variation in color, lighting, and background. Finally, Flowers and Birds are fine-grained semantic datasets, which makes the problem even more difficult.

### pixel-level data set

In these datasets, unseen samples, outliers, or anomalies have no semantic difference from the inliers. Instead, some part of the original image is defective, so the original meaning is still recoverable but has been damaged. Examples include MVTec AD, PCB, LaceAD, Retinal-OCT, CAMELYON16, Chest X-Rays, Species, and ImageNet-O.

### composite data set

These datasets are typically created from semantic-level datasets. However, the amount of pixel variation is controlled so that unseen, novel, or anomalous samples test different aspects of the trained model while preserving semantic information. For example, MNIST-C contains MNIST samples with various types of added noise, such as shot noise and impulse noise, which are random corruptions that may occur during the imaging process. These datasets can be used not only to test the robustness of a model but also to train models in AD settings instead of novelty detection or open set recognition. Given the lack of comprehensive research in the field of anomaly detection, these datasets can be very beneficial.

MNIST-C, ImageNet-C, and ImageNet-P are available.

## evaluation procedure

AUC-ROC is often used as an evaluation metric and does not require choosing a specific threshold. In contrast, FPR@TPR reports the FPR at a fixed TPR (for example, 95%). AUPR, the area under the precision-recall curve, is another metric that does not require a threshold.
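The two score-based metrics can be sketched directly from their definitions. The code assumes a convention that is not fixed by the text: ID samples are the positive class and a higher score means "more in-distribution". The function names are hypothetical, and the pairwise AUROC computation is chosen for clarity rather than speed.

```python
import numpy as np

def auroc(id_scores, ood_scores):
    """Probability that a random ID sample outscores a random OOD sample.

    Ties count as one half, which matches the area under the ROC curve.
    """
    a = np.asarray(id_scores, dtype=float)[:, None]
    b = np.asarray(ood_scores, dtype=float)[None, :]
    return float((a > b).mean() + 0.5 * (a == b).mean())

def fpr_at_tpr(id_scores, ood_scores, tpr=0.95):
    """Fraction of OOD samples accepted at the threshold that keeps
    at least `tpr` of the ID samples."""
    id_sorted = np.sort(np.asarray(id_scores, dtype=float))
    k = int(np.ceil(tpr * len(id_sorted)))  # number of ID samples to keep
    threshold = id_sorted[len(id_sorted) - k]
    return float((np.asarray(ood_scores, dtype=float) >= threshold).mean())

# Toy usage with made-up scores.
id_scores = [0.9, 0.8, 0.7, 0.6]
ood_scores = [0.1, 0.2, 0.3, 0.65]
area = auroc(id_scores, ood_scores)           # 15 of 16 pairs are ordered correctly
fpr95 = fpr_at_tpr(id_scores, ood_scores)     # threshold falls at 0.6
```

AUROC summarizes all thresholds at once, while FPR@TPR fixes one operating point, which is why the two can rank methods differently.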

Accuracy is usually used in OSR. The F-measure, or F-score, is the harmonic mean of precision and recall.

## Future challenges

### Baseline assessment and OOD detection evaluation protocol

There is room for improvement in the evaluation protocol for OOD detection. For example, we trained a mixture of three Gaussians on the CIFAR-10 dataset (as ID) and evaluated it against OOD datasets such as TinyImageNet (crop), TinyImageNet (resize), LSUN, LSUN (resize), and iSUN. The model is trained per channel at the pixel level; TABLE 1 shows the detection results on the different datasets. Despite its simplicity, the results are comparable to SOTA. LSUN in particular performs worse because most of its colors and textures are uniform, with little variation and structure. Similar to what has been observed with likelihood-based methods, LSUN sits "inside" CIFAR-10, with a similar mean but lower variance, and is therefore more likely under the wider distribution.

Evaluating on both near-distribution and far-distribution datasets also gives better insight into the performance of OOD detection baselines. For models trained on CIFAR-10, we use CIFAR-100 as the near-OOD dataset. Results are shown in TABLE 2, 3, and 5. As shown, none of the methods is good at detecting both near and far OOD samples, except for the OE approach, which uses an additional auxiliary dataset. The use of Mahalanobis distance improves the performance of most methods in detecting far OOD samples but degrades near-OOD detection. Moreover, the Mahalanobis distance is not a good choice because it may reduce performance even on some far OOD samples due to inaccurate Gaussian density estimation. Resizing or cropping the OOD dataset also changes performance significantly, indicating a reliance on low-level statistics; see, for example, the SVHN column in TABLE 5. This is consistent with recently reported shortcomings of the Mahalanobis distance.

One solution to this problem is to apply input preprocessing techniques, such as ODIN, to reduce the impact of first- and second-order statistics on the OOD scores. However, the extra forward and backward passes at test time increase the execution time. For some OOD datasets, methods such as ensembles and MC-Dropout may be slightly superior to other methods, but they still require multiple forward passes, which significantly increases the runtime. For example, MC-Dropout is reported to be 40 times slower than a simple MSP. In summary, we recommend that future work evaluate OOD detection on both near- and far-OOD datasets.

### AD Needs to Be Explored More

As mentioned earlier, AD and ND are not historically or fundamentally the same. A category of problems that is very important and practical in real-world applications is the one in which the training data cannot be easily cleaned and consequently contains various types of noise, such as label noise and data noise. This is the case for complex and dangerous systems such as modern nuclear power plants, military aircraft carriers, air traffic control, and other high-risk systems. The recently proposed ND methods need to be evaluated in AD settings using the proposed synthetic datasets, and new solutions need to be proposed. Since the openness score is usually high for AD detectors, they must achieve high recall at a low false alarm rate to be practical. Additionally, almost all AD or ND methods are evaluated in a one-vs-all setting. This yields a normal class with only a few distribution modes, which is not a good approximation of real scenarios. Therefore, evaluating AD or ND methods in a multiclass setting similar to the OSR domain, but with no access to the labels, will give a clearer perspective on the utility of SOTA methods.

### OSR Methods for Pixel Datasets

Almost all existing OSR methods are evaluated on semantic datasets. Since the class boundaries of such datasets are usually far apart, discriminative or generative methods can effectively model the differences between them. However, in many applications, such as chest X-ray datasets, the variation is subtle, and existing methods may perform poorly on such tasks. For example, a model may be trained on 14 known chest diseases. A new disease, such as COVID-19, may emerge as an unknown. In this case, the model needs to detect it as a new disease rather than classifying it into an existing disease category. Also, in many clinical applications where medical datasets are collected, images of diseases are usually more accessible than healthy images. Hence, an OSR model needs to learn disease images as the normal classes and detect healthy images as abnormal inputs.

TABLE 4 shows the performance of a simple MSP baseline on the MVTecAD dataset when several frequent failures are considered as normal classes. The goal in such a scenario is to detect and classify well-known failures while at the same time distinguishing rare failures as outliers that need to be treated differently. While this is a common and practical industrial setting, the baselines do not perform better than random, casting doubt on their generality for safety-critical applications. Recently, one paper has shown the effectiveness of placing a Gaussian prior on the penultimate layer of the classifier network, similar to some of the previous work, in tasks where the class distributions are very similar to each other, for example the Flowers or Birds datasets presented in the previous section. However, the setup described here is much more practical and much more difficult than that one, so more research is needed.

### Small sample size

Learning from small sample sizes is always difficult, but desirable. One way to tackle this problem is to leverage meta-learning algorithms to learn generalizable features that can be easily adapted to AD, ND, OSR, or OOD detection using a few training samples. One challenge of meta-learning is handling the distribution shift between the training and adaptation phases, which may lead to one-class meta-learning algorithms. Other approaches have considered generating synthetic OOD datasets to improve the few-shot classification of in-class samples. While the combination of meta-learning with AD, ND, OOD detection, and OSR has recently received a great deal of attention, several important aspects remain unexplored, including generalization to detecting UUCs using only a small number of KUCs and the convergence of meta-learning algorithms in a one-class setting.

### Model fairness and bias

Research on fairness has grown substantially in recent years. Models can become biased with respect to several sensitive variables during the training process. For example, one paper shows that for an attribute classification task on the CelebA dataset, the presence of an attribute is correlated with the gender of the person in the image, which is undesirable. Attributes such as gender in this example are referred to as protected variables. In the OOD detection literature, recent work has systematically investigated how spurious correlations in training sets affect OOD detection. The results suggest that as the correlation between spurious features and labels increases in the training set, OOD detection performance deteriorates significantly. For example, a model may exploit the spurious correlation between a water background and the label waterbird for prediction. As a result, models that rely on spurious features can produce high-confidence predictions for OOD inputs with the same background (i.e., water) but different semantic labels (e.g., boat).

There seems to be a fundamental tension between fairness and AD or ND. In fairness, the aim is to build unbiased models in which equality constraints between minority and majority samples hold, whereas the goal of AD models is to assign higher anomaly scores to rarely occurring events. To address this issue, a fairness-aware AD method has been proposed that uses labels for the protected variables as additional supervision during training. From another perspective, an important bias has been pointed out in semi-supervised anomaly detection methods such as DSAD. Suppose that DSAD is deployed by a law enforcement agency to find suspicious persons using surveillance cameras. Because some training samples are used as anomalies during training, the trained model may be biased towards detecting particular types of anomalies more than others. For example, if there were more males than females in the auxiliary anomaly training dataset, the boundary for flagging males as anomalous at test time may be looser than for females. This may also occur in classification settings such as OOD detection and OSR. One paper reports unfair bias with respect to several irrelevant protected variables in a classifier trained to detect chest diseases on a chest X-ray dataset. From the above, fairness and AD, ND, OSR, and OOD detection appear to be strongly related in several important applications where they are used.

### multimodal data set

In many cases, training datasets consist of multimodal training samples. For example, in a chest X-ray dataset, image labels are obtained automatically by applying NLP methods to the radiologists' reports. In these situations, joint training on the different modalities helps the model learn better semantic features. However, the model then needs to be robust in each modality. For example, in a visual question answering (VQA) task, we expect the model not to generate answers for out-of-distribution input text or images. Here we need to be aware of the correlations between the different modalities. Training AD, ND, OOD detection, or OSR models for the various modalities separately is likely to end in poor local minima. To address this issue, one work has investigated how well VQA models detect unseen test-time samples. However, more issues need to be investigated with this approach.

### Explainability Challenge

Explainable AI (XAI) has come to play a very important role in recently proposed deep network architectures, especially when they are used in safety-critical applications. Because of those critical applications, AD, OSR, ND, and OOD detection methods should be able to explain why the model makes the decisions it does. For example, if a person is identified as suspicious by a surveillance camera, there should be a good reason why the model made that decision. The issue of explainability can be approached in two ways. First, there must be an explanation for why a sample is normal, known, or in-distribution. Second, there must be an explanation for why a sample is abnormal, unknown, or out-of-distribution. There are various methods in the literature for explaining model decisions, such as Multi-KD, CutPaste, Grad-CAM, and SmoothGrad. However, these have only been used to explain normal, seen, or in-distribution samples, and their results are not sufficiently accurate for unseen or abnormal inputs. A VAE-based method has also been proposed that can give reasons for detecting an input sample as anomalous while also describing normal samples accurately. However, it does not work well on complex training datasets such as CIFAR-10, which indicates that further investigation is needed to mitigate the problem.

Another important explainability issue arises in one-class classification or ND approaches. Only one label is accessible during training, so Grad-CAM or SmoothGrad, which rely on the availability of fine-grained labels, can no longer be used. To address this issue, a fully convolutional architecture has been proposed in combination with a heatmap upsampling algorithm called receptive field upsampling. Starting from the latent vector of a sample, the effect of the applied convolution operators is reversed to find the important regions within the given input sample. However, explainable OCC models are still largely unexplored, and further research in this direction is needed.

### Multi-label OOD detection and large data sets

OOD detection for multiclass classification has been studied extensively, but the problem for multi-label networks is still under-explored. In the multi-label setting, multiple true labels must be recognized for each input. This is more difficult because the multi-label classification task has more complex class boundaries, and unseen behavior may occur in only a subset of an input sample's labels. The challenges of multi-label datasets can also be investigated in the anomaly segmentation task. Unlike classification, where the entire image can be reported as an anomalous input, here the specific anomalous parts need to be identified. Current methods have been evaluated primarily on small datasets such as CIFAR. It has been shown that approaches developed on the CIFAR benchmark may not translate effectively to the ImageNet benchmark, which has a large semantic space, highlighting the need to evaluate OOD detection in large real-world settings. Therefore, we recommend that future research be evaluated on ImageNet-based OOD detection benchmarks to test the limits of the developed methods.

### Data augmentation

One source of uncertainty in classifying known or normal training samples can be a lack of generalization performance. For example, rotating an image of a bird does not change its content, so it should still be recognized as a bird. Some of the works mentioned attempt to embed this ability into the model by designing various SSL objective functions. However, there is another way to do this: data augmentation. Data augmentation is a common technique for enriching training datasets, and several approaches use different augmentation techniques to improve generalization.

Another perspective is to generate unseen anomalous samples and use them to transform a one-class learning problem into a simple two-class classification task; in the OSR setting, other papers follow the same idea. These approaches can also be seen as working on the training data to enrich it for downstream detection tasks. From the above, it is clear that working on the data instead of the model can achieve very effective results, and the different trade-offs involved should be explored further in the future.

### Open World Recognition

In a controlled lab environment, it is sufficient to detect new, unknown, or out-of-distribution samples, but in a real operational system new categories need to be continuously detected and added to the recognition capabilities. This becomes even more challenging because such a system requires minimal downtime, even while learning. Open-world recognition therefore requires a few more steps: new classes need to be continuously detected, and the system needs to be updated to include these new classes in the multiclass open set recognition algorithm. These processes pose a variety of challenges, ranging from the scalability of current open set recognition algorithms to the design of new learning algorithms that avoid problems such as catastrophic forgetting in OSR classifiers. Moreover, all the aforementioned future directions can be reformulated in the open-world recognition setting. This means that the existing work on this subject needs to be revisited and investigated further.

### Vision Transformers in OOD Detection and OSR

Vision Transformers (ViTs) have recently been proposed as an alternative to CNNs and have shown excellent performance in a variety of applications such as object detection, medical image segmentation, and visual tracking. Similarly, several methods have recently reported the advantages of ViT in OOD detection, demonstrating its ability to detect near-OOD samples. For example, when ViT was trained on CIFAR-10 and tested with CIFAR-100 as the outlier dataset, it was reported to have a significant advantage over previous works. However, ViT is usually pre-trained on very large datasets such as ImageNet-22K, which overlap heavily with the training and test datasets, so the train-test mismatch no longer holds and the question becomes "how much do we remember from pre-training". In other words, ViT needs to be evaluated on datasets that do not intersect with the pre-trained knowledge. To address this issue, we evaluated ViT-B16 on SVHN and MNIST with six randomly selected classes considered normal and the remaining classes considered outliers or unseen. MSP is used to detect unknown samples, and as shown in TABLE 6, ViT-B16 pre-trained on ImageNet-22K is not as good as other baselines trained from scratch. All experiments are evaluated in a near-OOD detection setting and thus support the aforementioned deficiencies of ViT. From the above, a future direction of research could be to evaluate ViT in more controlled situations so that its actual gains can be measured more accurately. Indeed, the recent Species dataset collects examples that do not fall into any of the ImageNet-22K classes, which is a first step towards correcting this problem.
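The MSP baseline used in this comparison is simple enough to sketch. The example below is an illustration under assumptions: the logits are made up, the function names and the threshold of 0.5 are hypothetical, and in practice the threshold would be chosen on validation data.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability for each row of a logit matrix."""
    z = logits - logits.max(axis=1, keepdims=True)  # stabilize the exponent
    exp_z = np.exp(z)
    probs = exp_z / exp_z.sum(axis=1, keepdims=True)
    return probs.max(axis=1)

def flag_unknown(logits, threshold):
    """Mark samples whose MSP falls below the threshold as unknown."""
    return msp_score(logits) < threshold

# Toy usage with six known classes, as in the six-class split above.
logits = np.array([
    [10.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # confident prediction
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0],   # uniform prediction, MSP = 1/6
])
unknown = flag_unknown(logits, threshold=0.5)
```

Because MSP needs only one forward pass and no extra training, it is a common reference point when comparing pre-trained and from-scratch models.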

## summary

In many applications, it is not possible to model every class that may arise at test time, so scenarios that call for OOD detection, OSR, one-class learning (ND), and AD have become ubiquitous. Hence, in this paper, we have provided a comprehensive review of existing techniques, datasets, evaluation criteria, and future challenges. More importantly, we have analyzed and discussed the limitations of the approaches and pointed out promising research directions. We hope that this will help the research community to develop a broader, cross-disciplinary perspective.
