
Unified Survey Of Anomalies, Novelty, Open Sets, And Outlier Detection

Survey, Review

3 main points
✔️ Surveys the closely related problems of anomaly, novelty, open-set, and out-of-distribution detection in a unified way
✔️ Each field draws the boundary between normal and abnormal data differently, and the methods used to separate them vary accordingly
✔️ This survey provides a comprehensive analysis and outlines future research questions.

A Unified Survey on Anomaly, Novelty, Open-Set, and Out-of-Distribution Detection: Solutions and Future Challenges
written by Mohammadreza Salehi, Hossein Mirzaei, Dan Hendrycks, Yixuan Li, Mohammad Hossein Rohban, Mohammad Sabokrou
(Submitted on 26 Oct 2021)
Comments: Published on arXiv.

Subjects: Computer Vision and Pattern Recognition (cs.CV); Machine Learning (cs.LG)

code:  

The images used in this article are from the paper, the introductory slides, or were created based on them.

Introduction

In machine learning, it is common to make the "closed set" assumption that test data are drawn from the same distribution as the training data (i.e., are independent and identically distributed). In practice, however, all kinds of test inputs can be encountered, including samples from classes the classifier was never trained on. Unfortunately, models often assign misleadingly confident predictions to test samples they have never seen, which raises concerns about the reliability of classifiers, especially in safety-critical applications. In the literature, several research areas try to address the problem of identifying unknown/anomalous/out-of-distribution data in an open-world setting. In particular, anomaly detection (AD), novelty detection (ND), one-class classification (OCC), out-of-distribution (OOD) detection, and open set recognition (OSR) have received significant attention because of their fundamental importance and practical relevance. Although they address similar tasks, their differences and connections are often overlooked.

Specifically, in OSR the model is trained on K classes of an N-class training dataset; at test time it is confronted with the remaining N - K classes that were not seen during training. The objective is to assign correct labels to samples from seen classes while detecting samples from unseen ones. Novelty detection and one-class classification are the extreme case of open set recognition in which K is 1. In the multiclass setting, the canonical formulation of OOD detection resembles OSR: accurately classify in-distribution (ID) samples into known categories and detect OOD data that are semantically different and therefore should not be predicted by the model. However, OOD detection encompasses a wider range of learning tasks (e.g., multi-label classification, reinforcement learning) and solution spaces (e.g., density estimation), which the paper reviews comprehensively. While the aforementioned settings assume access to a completely normal training dataset, anomaly detection assumes that the training data are collected entirely without supervision or filtering and may therefore contain anomalous samples. Since anomalous events occur rarely, AD methods exploit this fact and filter such samples out during training so that the final semantic space captures only normal features. Previous methods have mostly been applied to object detection and image classification, but this setup is also common in industrial defect detection, where abnormal events are rare and normal samples share a common notion of normality. Note that, despite the differences in how these settings are formulated, they are often used in the same sense because they have much in common. Fig. 1 shows a visual representation of the differences between these domains.

Several surveys have been conducted on anomaly detection, an important research area, but they either focus on each field independently or present anomaly detection concepts so general that they cover all types of datasets. Instead, this survey provides a detailed description of the methodology in each of these fields, building bridges between the domains so that ideas propagate easily and inspire future research. For example, the idea of using outlier samples from other datasets to improve task-specific features is called Outlier Exposure or background modeling, and it is very similar to semi-supervised anomaly detection; even though the idea is shared, it is considered novel in each domain separately.

In summary, the main contributions of this paper are as follows:

(1) Clarify the relationship between different research fields that have been examined separately despite being highly interconnected.

(2) Provide a comprehensive methodological analysis of prominent recent studies and clearly explain the reviewed methods both theoretically and visually.

(3) Conduct comprehensive testing against existing baselines to provide a strong foundation for current and future research.

(4) Provide directions for future research and articulate the fundamentals necessary for the methods to be presented in the future, including fairness, adversarial robustness, privacy, data efficiency, and accountability.

A General Perspective on Method Classification

Consider a training set of samples $(x_1, y_1), (x_2, y_2), \dots$ drawn from a joint distribution $P_{X,Y}$, where X is a random variable on the input space $\mathcal{X} = R^d$ and Y is a random variable on the label (output) space $\mathcal{Y}$. In AD and ND, the label space is the binary set {normal, abnormal}. At test time, given an input sample x, the model must estimate P(Y = normal/seen/in-class | X = x) in the one-class setting; in OOD detection and in OSR for multiclass classification, the label space may contain multiple semantic categories, so P(Y = y | X = x) must be estimated as well. In AD, the input samples may contain noise (anomalies) in addition to normal samples, which turns the problem into one-class classification with noisy labels, but the overall formulation of the detection task remains unchanged. Two common perspectives for modeling the conditional probability are generative and discriminative modeling. In the OOD detection and OSR settings, discriminative modeling is comparatively easy because labels for the training samples are available, whereas the lack of labels makes AD and ND (OCC) difficult: the one-class classification problem has the trivial solution of mapping every input, whether normal or abnormal, to the single given label Y and thereby minimizing the objective function. This collapse is seen, for example, in DSVDD, which maps all inputs, normal or abnormal, to a single point when trained for a large number of epochs.

However, there are modified formulations of P(Y | X) that avoid this problem. A set of affine transformations $T = \{T_1, \dots, T_{|T|}\}$ that does not change the normal distribution too much is applied to X, and a model estimates the aggregated probability $\sum^{|T|}_{i=1} P(T_i \mid T_i(X))$, where each term is the probability that transformation $T_i$ was applied, given the transformed input $T_i(X)$; this sum plays the role of $|T| \cdot P(Y \mid X)$. This is similar to estimating P(Y | X) directly but does not collapse, so it can be used in place of estimating a single class-conditional probability. Although this simple approach avoids collapse, it depends on the choice of transformations: the transformed distributions must overlap as little as possible while leaving the normal distribution approximately unchanged. As shown later, OSR methods can overcome the collapse problem by combining this AD approach with a classification model, and a similar situation applies to the OOD domain.
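As a minimal sketch of this idea (an illustration, not code from the paper), a classifier trained to predict which transformation was applied can be turned into a normality score as follows; `transforms` and `predict_proba` are hypothetical placeholders for the transformation set and the trained classifier.

```python
import numpy as np

def transformation_score(x, transforms, predict_proba):
    """Aggregate P(T_i | T_i(x)) over a fixed set of transformations.

    x             : a single input sample (e.g., an image as a numpy array)
    transforms    : list of callables T_i applied to x
    predict_proba : trained classifier returning a probability vector over
                    the |T| transformation labels for a given input
    Returns an averaged normality score; low values suggest an anomaly.
    """
    score = 0.0
    for i, T in enumerate(transforms):
        probs = predict_proba(T(x))   # distribution over transformation labels
        score += probs[i]             # probability assigned to the correct label i
    return score / len(transforms)
```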

In generative modeling, AE (autoencoder)-based, GAN (generative adversarial network)-based, and explicit density estimation methods such as autoregressive and flow-based models are used to model the data distribution. When an autoencoder is trained only on normal training samples, two important assumptions are made:

- The autoencoder can reconstruct an unseen normal test sample as accurately as a training sample.

- Abnormal test-time samples cannot be reconstructed as accurately as normal inputs.

However, recently proposed AE-based methods have shown that these assumptions do not always hold. For example, even when an AE can reconstruct a normal sample perfectly, a shift of only one pixel can cause a large reconstruction loss.

Similarly, GANs, another well-known model family, have been widely used for AD, ND, OCC, OSR, and OOD detection. When a GAN is trained on completely normal training samples, it operates under the following assumptions:

- If the input is normal, there exists a latent vector whose generated image has only a small discrepancy with the input.

- If the input is abnormal, no latent vector can generate an image with a small discrepancy with the input.

Here, the discrepancy can be defined by the pixel-level MSE loss between the generated image and the test-time input, or by a more complex function such as the distance between the discriminator's intermediate features for the generated image and for the test-time input. Although GANs have proven capable of capturing the semantic abstractions of a given training dataset, they are plagued by mode collapse, unstable training, and non-reproducible results.

Finally, autoregressive and flow-based models can approximate the data density explicitly and detect abnormal samples based on the assigned likelihoods. Intuitively, normal samples should receive higher likelihoods than abnormal ones, but as discussed below, autoregressive models can assign higher likelihoods to abnormal samples even though they never see such samples during training, which degrades AD, ND, OSR, and OOD detection performance. Several improvements proposed in the OOD domain can also be used in OSR, AD, and ND, but their reliability requires further evaluation, considering that the common test protocol for OOD detection can differ considerably from that of domains such as AD and ND.

Anomaly and Novelty Detection

Anomaly detection (AD) and novelty detection (ND) are used interchangeably in the literature, but few works discuss the difference between them. In anomaly detection, certain inherent issues contradict the assumption that the training data consist of perfectly normal samples. For example, measurement noise is inevitable in physical experiments, so during unsupervised training the algorithm must automatically detect noisy samples and focus on the normal ones. This is not the case for novelty detection, where in many applications a clean dataset can be provided with minimal supervision. The two fields have drifted apart over time, but their names are still not used consistently in the literature.

Interest in anomaly detection dates back to 1969, when an anomaly/outlier was defined as "a sample that appears to deviate significantly from the other members of the sample in which it occurs", explicitly assuming the existence of an underlying shared pattern that the majority of training samples follow. This definition has some ambiguities: a criterion for the notion of deviation must be defined, and the term "significantly" must be made quantitative. For this reason, both before and after the advent of deep learning, significant effort has gone into making these concepts more precise. Finding samples that deviate from the trend requires an appropriate distance metric, and a threshold must also be selected to decide whether the deviation from the normal samples is significant.

Robust deep autoencoder for anomaly detection

The autoencoder (AE) is trained on a dataset containing both inliers and outliers. Outliers are detected and filtered out during training, under the assumption that inliers are significantly more frequent and share common normal concepts. In this way, the AE is effectively trained only on normal samples and consequently cannot reconstruct anomalous inputs well at test time. The training objective is formulated as follows,

where E and D are the encoder and decoder networks, respectively, $L_D$ is the inlier part of the training data X, and S is the outlier part. The above optimization is not easy to solve because S and θ must be optimized jointly. To address this, the Alternating Direction Method of Multipliers (ADMM) is used, which splits the objective into two (or more) parts. In the first step, S is fixed, $L_D$ is set to X - S, and the optimization problem for the parameters θ becomes minimizing $\| L_D - D_\theta(E_\theta(L_D)) \|_2$. Then $L_D$ is set to the reconstruction produced by the trained AE, S is set to $X - L_D$, and the optimization problem for the norm of S is solved. Since the L1 norm is not differentiable, a proximal operator is used as an approximation at each optimization step, as follows.

Such a function is known as a shrinkage operator and is very common in L1 optimization problems. The objective above, which uses $\| S \|_1$, separates only unstructured noise, e.g., Gaussian noise in the training samples, from the normal content of the training dataset. To separate structured noise, such as samples that convey a completely different meaning from the majority of the training samples, the $L_{2,1}$ optimization criterion can be applied as follows.

In that case, a proximal operator called the blockwise soft-thresholding function [27] is used. At test time, the reconstruction error is used to reject anomalous inputs.
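For illustration, here is a minimal numpy sketch of the elementwise shrinkage (soft-thresholding) operator for the L1 term and its blockwise (column-wise) variant for the L2,1 term; the threshold parameter and the column-wise grouping are assumptions of this sketch.

```python
import numpy as np

def soft_threshold(S, lam):
    """Elementwise shrinkage operator (proximal operator of lam * ||S||_1).

    Shrinks every entry toward zero by lam and zeroes out entries whose
    magnitude is below lam.
    """
    return np.sign(S) * np.maximum(np.abs(S) - lam, 0.0)

def block_soft_threshold(S, lam):
    """Blockwise (column-wise) shrinkage, the proximal operator of lam * ||S||_{2,1}.

    Each column is shrunk in proportion to its L2 norm; columns whose norm
    falls below lam are set to zero entirely, removing structured outliers.
    """
    norms = np.linalg.norm(S, axis=0, keepdims=True)
    scale = np.maximum(1.0 - lam / np.maximum(norms, 1e-12), 0.0)
    return S * scale
```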

Adversarially Learned One-Class Classifier for Novelty Detection (ALOCC)

Assuming that perfectly normal training samples are given, the aim is to train a novelty detection model on them. First, R is trained as a denoising autoencoder (DAE) to (1) reduce the reconstruction loss and (2) fool the discriminator in a GAN-based setting, which allows the DAE to produce high-quality images instead of blurry outputs. The blurriness arises because the AE loss implicitly assumes an independent Gaussian distribution for each pixel, while the true distribution of pixel values is usually multimodal, so each Gaussian mean settles between different modes; this leads to blurry images on complex datasets. Training the AE in a GAN-based framework forces each Gaussian mean to capture only one mode of the corresponding true distribution. Furthermore, by using the discriminator output D instead of the pixel-level loss, normal samples that are not reconstructed perfectly can still be detected as normal, which significantly reduces the false positive rate (FPR) compared with a vanilla DAE.

This allows the model to produce higher-quality outputs while retaining the AE's ability to detect anomalies. Detection can then be based on D(R(X)) as described above. Fig. 2 shows the overall architecture of this work.

One-class novelty detection using GANs with constrained latent representations (OC-GAN)

AEs trained on perfectly normal training samples can sometimes reconstruct unseen anomalous inputs with even lower error. To address this, the latent distribution of the encoder (En(·)) is forced, in an adversarial manner, to resemble a uniform distribution. Similarly, the decoder (De(·)) is forced to produce in-class outputs when latent values are sampled from that uniform distribution. The training objective spreads the normal features across the latent space so that the reconstructed output resembles the normal class, completely or at least roughly, for both normal and abnormal inputs. Another mechanism in the latent space, called informative negative sample mining, actively searches for regions that produce low-quality images; to this end, a classifier is trained to distinguish the decoder's reconstructed outputs from fake images.

Latent Space Autoregression for Novelty Detection (LSA)

For novelty detection, this method proposes a concept called "surprise", which quantifies the uniqueness of an input sample in the latent space. The more unique a sample is, the lower its likelihood in the latent space and, consequently, the more likely it is to be anomalous. This is especially beneficial when many normal training samples are visually similar to each other: for such samples, AEs are usually trained to reconstruct their mean as the output to minimize the MSE error, which results in blurry outputs and larger reconstruction errors for those inputs. Using the surprise loss together with the reconstruction error mitigates this problem. Anomalous samples are also usually more surprising, which raises their novelty score. The surprise score is learned with an autoregressive model in the latent space, as shown in Fig. 4; the autoregressive model (h) can be instantiated with different architectures, from LSTM and RNN networks to more complex ones. As in other AE-based methods, the reconstruction error is also optimized.

Memory-Augmented Deep Autoencoder (Mem-AE) for Unsupervised Anomaly Detection

This method challenges the second assumption of AEs by showing that abnormal samples can be reconstructed well even when the training dataset contains no abnormal samples at all. Intuitively, the AE does not learn features that uniquely describe normal samples, so it may extract abnormal features from abnormal inputs and reconstruct them accurately. It is therefore necessary to learn features that reconstruct only normal samples well. For this purpose, Mem-AE employs a memory that stores unique and sufficient features of the normal training samples. During training, the encoder implicitly acts as an address generator for the memory: it produces an embedding, the memory items most similar to that embedding are combined, and the combined embedding is passed to the decoder to produce the reconstructed output. Mem-AE also employs a sparse addressing technique that uses only a small number of memory items, so the decoder is restricted to reconstructing with a few items and must therefore use them efficiently. In addition, the reconstruction loss encourages the memory to record prototypical patterns that are representative of normal inputs.
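A simplified numpy sketch of the memory addressing step described above; the similarity measure, shrinkage threshold, and names are illustrative and differ from the actual Mem-AE implementation in detail.

```python
import numpy as np

def memory_read(z, memory, shrink_thresh=0.0025):
    """Retrieve a sparse combination of memory items for an encoder embedding z.

    z      : encoder embedding, shape (d,)
    memory : memory matrix, shape (num_items, d), one prototype per row
    Returns the combined embedding that would be passed to the decoder.
    """
    # cosine similarity between the query embedding and each memory item
    sim = memory @ z / (np.linalg.norm(memory, axis=1) * np.linalg.norm(z) + 1e-12)
    w = np.exp(sim) / np.exp(sim).sum()        # soft addressing weights
    # hard shrinkage: drop very small weights to enforce sparse addressing
    w = np.where(w > shrink_thresh, w, 0.0)
    w = w / (w.sum() + 1e-12)                  # re-normalize the remaining weights
    return w @ memory                          # sparse combination of prototypes
```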

Old is Gold: Redefining the Adversarially Learned One-Class Classifier Training Paradigm

This method extends the idea of ALOCC, which is trained in a GAN framework and therefore suffers from stability and convergence problems. On the one hand, overtraining ALOCC can confuse the discriminator D because the generated fake data become too realistic; on the other hand, undertraining makes the discriminator features less usable. To address this, a two-phase training process is proposed; the first phase is a training procedure similar to ALOCC.

As the first phase proceeds, a low-epoch generator $G^{old}$ is saved for later use in the second phase of training. In the second phase, the sample $\hat{X} = G(X)$ is treated as high-quality reconstructed data, while $\hat{X}_{low} = G^{old}(X)$ is treated as a low-quality sample. Pseudo-anomaly samples are then created as follows.

Adversarial Mirrored Autoencoder (AMA)

The overall architecture of AMA is similar to ALOCC. However, AMA challenges the first assumption of AEs: the $l_p$ norm is unsuitable for training AEs in the anomaly detection setting because it leads to blurry reconstructions and thereby increases the error on normal samples. To address this, AMA proposes minimizing the Wasserstein distance between the distributions $P_{X,X}$ and $P_{X,\hat{X}}$.

Unsupervised Anomaly Detection with Generative Adversarial Networks Leading to Marker Discovery (AnoGAN)

In this method, a GAN is trained on normal training samples, and at test time an optimization problem is solved to find the latent vector z that minimizes the discrepancy. Given a generated image and the input image, the discrepancy combines their pixel-level loss with the distance between the discriminator's features at different layers. Intuitively, a normal test-time sample can find a latent vector with a small discrepancy, whereas an anomalous one cannot. Fig. 8 shows the structure of the method, and Fig. 9 compares the architectures of AnoGAN and Efficient-GAN.

OC-SVM

Early AD methods used statistical approaches to detect anomalous inputs, such as comparing each sample to the mean of the training dataset, which imposes an implicit Gaussian assumption that does not generalize well. To reduce the number of assumptions and overcome these shortcomings of traditional statistical methods, OC-SVM was proposed. As the name suggests, OC-SVM is a one-class SVM that maximizes the distance of the training samples from the origin using a hyperplane that places the samples on one side and the origin on the other. Equation 19 shows the original form of OC-SVM, which tries to find a space in which all training samples lie on exactly one side; the larger the distance of the hyperplane from the origin, the better the solution to the optimization problem.


Deep One-Class Classification (DeepSVDD)

This method, an extension of SVDD, uses a deep network to find a space in which the training samples share common features and are compressed into a minimum-volume hypersphere surrounding them. The difference from traditional methods is that the kernel function φ is learned automatically by optimizing the network parameters W.
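A minimal numpy sketch of the simplified one-class objective and the corresponding test-time score; the fixed center (e.g., the mean of the initial embeddings) and the weight-decay coefficient are assumptions of this sketch, not values from the paper.

```python
import numpy as np

def deep_svdd_loss(embeddings, center, weights=None, weight_decay=1e-6):
    """Simplified Deep SVDD objective: mean squared distance to a fixed center.

    embeddings : array of shape (batch, d), network outputs phi_W(x)
    center     : fixed hypersphere center c of shape (d,); keeping it fixed
                 (rather than learned) helps avoid the trivial collapsed solution
    weights    : optional list of parameter arrays for the weight-decay term
    """
    dist = np.sum((embeddings - center) ** 2, axis=1)
    loss = dist.mean()
    if weights is not None:
        loss += weight_decay * sum(np.sum(w ** 2) for w in weights)
    return loss

def anomaly_score(embedding, center):
    """At test time, the squared distance to the center serves as the anomaly score."""
    return float(np.sum((embedding - center) ** 2))
```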

Deep Semi-Supervised Anomaly Detection

This is a semi-supervised version of DSVDD that assumes access to a limited number of labeled samples. The loss function is defined to minimize the distance of normal samples from a predefined hypersphere center, while the few labeled anomalous samples are pushed away from that center.

Deep Anomaly Detection Using Geometric Transformations (GT)

GT turns the one-class problem into a multi-class classification problem: it defines a set of geometric transformations of the data and trains a classifier to distinguish which transformation was applied, so the classifier is essentially trained in a self-supervised fashion. At test time, the different transformations are applied to the input and the sum of the corresponding Dirichlet probabilities is taken as the novelty score.

Effective End-to-End Unsupervised Outlier Detection via Inlier Priority of Discriminative Network

Similar to GT, this method employs a self-supervised learning (SSL) task to train the anomaly detector, except that a small number of outliers (anomalous samples) are present in the training dataset. Because of these contaminating samples, the objective score of anomalous samples is not always higher than that of inliers. To address this, it is shown that the magnitude and direction of the gradients at each step are strongly dominated by the inlier samples, so the loss is minimized preferentially for inliers and the network ends up assigning them higher SSL-task confidence, i.e., lower anomaly scores, than the outliers.

Classifier-based Anomaly Detection of General Data (GOAD)

This method is very similar to GT. However, instead of using a cross-entropy loss or learning a Dirichlet distribution over the final confidences, it finds the center of each transformation and minimizes the distance between each transformed sample and its corresponding center.

The idea can be seen as a combination of DSVDD and GT: GT's transformations are used, and a separate compact hypersphere is learned for each of them. At test time, each sample is transformed by the M different transformations, and the average probability of the correct transformation label is assigned as the anomaly score.

CSI: Novelty Detection via Contrastive Learning on Distributionally Shifted Instances

This method formulates novelty detection in a contrastive framework similar to SimCLR. The idea of contrastive learning is to learn an encoder that extracts the information needed to distinguish similar samples from the rest. Let x be the query, {x+} and {x-} the sets of positive and negative samples, z the output feature of the encoder and the additional projection layer for each input, $g_\phi(f_\theta(x))$, and sim(z, z') the cosine similarity. The contrastive loss is then defined as
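The equation itself appears only as an image in the original article; a standard form of this loss, as used in SimCLR-style methods with a temperature hyperparameter τ, is:

$$ \mathcal{L}_{con}\left(x, \{x^{+}\}, \{x^{-}\}\right) = -\frac{1}{|\{x^{+}\}|}\sum_{x'\in\{x^{+}\}} \log\frac{\exp\left(\mathrm{sim}(z(x), z(x'))/\tau\right)}{\sum_{x''\in\{x^{+}\}\cup\{x^{-}\}}\exp\left(\mathrm{sim}(z(x), z(x''))/\tau\right)} $$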

In contrastive learning, a set of negative samples needs to be defined. To this end, a set of transformations S that shift the distribution of the training samples is specified; applying these to each input creates the desired negative set. For example, rotations or patch permutations shift the distribution of the original input samples substantially, so they can be used as negatives.

Uninformed Students: Student-Teacher Anomaly Detection with Discriminative Latent Embeddings

In this method, a teacher network is trained with metric learning and knowledge distillation techniques to provide a semantic and discriminative feature space. The teacher T is obtained by first training a network $\hat{T}$ that embeds patch-sized images p into the metric space; a deterministic network transformation from $\hat{T}$ to T then enables fast and dense local feature extraction over the entire input image. To train $\hat{T}$, a large number of training patches p are obtained by randomly cropping an image database, e.g., ImageNet.

Self-Supervised Learning for Anomaly Detection and Localization (CutPaste)

This method designs a simple SSL task to capture local pixel-level regularities rather than global semantic-level regularities. While GT and GOAD apply transformations such as rotation, translation, and jittering, CutPaste cuts out a portion of the training input and pastes it at another location, and the network is trained to distinguish such corrupted samples from intact ones. Additional auxiliary tasks, such as cutout and scar, can be combined with the cut-paste operation. After training, a KDE or Gaussian density estimator is fit to the confidence scores of the normal training samples and used at test time. Because of the simplicity of the classification task, the method can easily overfit it.
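A minimal numpy sketch of the cut-and-paste augmentation itself; the patch-size and aspect-ratio ranges are illustrative choices, not the paper's settings.

```python
import numpy as np

def cutpaste(image, rng=None, patch_frac=(0.05, 0.15)):
    """Cut a random rectangular patch and paste it at another random location.

    image      : numpy array of shape (H, W, C)
    patch_frac : range of the patch area as a fraction of the image area
    Returns an augmented copy that serves as a synthetic 'corrupted' sample
    for the binary (intact vs. cut-pasted) classification task.
    """
    rng = np.random.default_rng() if rng is None else rng
    h, w = image.shape[:2]
    area = rng.uniform(*patch_frac) * h * w
    aspect = rng.uniform(0.3, 3.3)                       # random aspect ratio
    ph = int(np.clip(np.sqrt(area * aspect), 1, h - 1))  # patch height
    pw = int(np.clip(np.sqrt(area / aspect), 1, w - 1))  # patch width
    sy, sx = rng.integers(0, h - ph), rng.integers(0, w - pw)   # source corner
    dy, dx = rng.integers(0, h - ph), rng.integers(0, w - pw)   # destination corner
    out = image.copy()
    out[dy:dy + ph, dx:dx + pw] = image[sy:sy + ph, sx:sx + pw]
    return out
```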

Multi-resolution knowledge distillation for anomaly detection (Multi-KD)

Generative models are suitable for detecting pixel-level anomalies but may fail on complex semantic-level anomalies, whereas discriminative models are better suited to capturing semantics, and designing an SSL task that captures both semantics and syntax is not easy. To address this, Multi-KD uses knowledge distillation to make a simpler "cloner" network mimic the intermediate layers (intermediate knowledge) of a VGG network pre-trained on ImageNet. This yields a multi-resolution model of the normal training distribution that can detect anomalies at both the pixel and semantic levels at test time. Here, knowledge is defined as the length and direction of the intermediate activations of the pre-trained network. Since the cloner has a simpler but broadly similar architecture compared to the source, its knowledge matches the source's on normal training samples. At test time, the cloner can follow the source on normal samples but fails on abnormal ones, producing large discrepancies that are used for detection. Fig. 14 shows the overall architecture.

Open Set Recognition

Open-set recognition (OSR) receives more supervision than AD or ND. In this setting, K known classes are given at training time, while test samples may come from all N classes, i.e., the K known classes plus N - K unknown ones. The objective is to classify the known classes correctly while identifying samples from unknown classes. This setting applies, for example, when a normal dataset can be labeled or when a clean dataset free of anomalous samples can be collected. Because of the additional supervision, the data are commonly divided into four categories:

- Known known classes (KKC): known training samples that are given and labeled.

- Known unknown classes (KUC): training samples that are known not to belong to any known class, for example background images or images known not to belong to any known category. These are also given and labeled.

- Unknown known classes (UKC): known classes for which no samples are given in the training phase; such samples appear only at test time.

- Unknown unknown classes (UUC): completely unknown classes, for which neither samples nor information are given in the training phase; such samples appear only at test time.

Towards Open Set Deep Networks (OpenMax)

This method addresses the problem that classification models produce overconfident scores for unseen test-time samples. Because of the normalization in the softmax, two samples with very different logit scores can receive the same confidence distribution; instead of confidence scores, OpenMax therefore uses the logit scores, represented as activation vectors (AVs). The AV of each sample represents its score distribution over the classes, and the mean AV (MAV) is defined as the average of the AVs of all samples. For each input, the AV element corresponding to the ground-truth class is expected to be high, and so is its distance from the corresponding MAV element. Treating the distance between each AV element and the corresponding MAV element as a random variable, a correctly classified input is expected to have its largest distance at the ground-truth element; exceptions occur when several classes are strongly related to the correct one but are not correct, for example when leopard is the correct class and cheetah is the closest class to it.

Generative OpenMax (G-OpenMax) for Multiclass Open Set Classification

This method is similar to OpenMax, except that UUC samples are generated artificially with a GAN and OpenMax is fine-tuned on them, which removes the need to prepare a validation dataset.

Open Set Learning with Counterfactual Images

This method follows the idea of generating UUC samples, as in G-OpenMax. The generated inputs are similar to KKC samples but should not be assigned to the same classes; such inputs are called counterfactual images. Because they lie near the class boundaries, these samples are useful for approximating the actual UUC distribution.

Reducing Network Agnostophobia

In applications such as object detection, there is usually a class called background, and large numbers of such samples, for example retrieved from the Internet, can serve as background data for a particular task. In this work, background samples are employed as an auxiliary KUC distribution when training the classifier. Training defines a margin in which KUC samples have small feature magnitudes and KKC samples have large ones. In addition, the entropy of the confidence output is maximized for background samples, which increases the classifier's uncertainty on such inputs. The training uses a simple entropic open-set loss, which maximizes the entropy of the confidence scores, and an objectosphere loss, which minimizes the L2 norm of the final features for background samples. Fig. 18 shows the effect of each loss on the geometric position of each class's samples in the final layer.

Class-Conditional Autoencoder for Open Set Recognition (C2AE)

This method relies on the second assumption for using AEs, namely that abnormal test-time samples will not be reconstructed as well as normal samples; in OSR, unlike AD and ND, training labels are available and can enhance the capabilities of the AE. The AE is treated as a meta-recognition function, and its encoder serves as the classifier for the recognition task. Intuitively, the encoder should provide an embedding that both classifies the given samples correctly and allows the original input to be reconstructed. Furthermore, it is ensured that the encoder embedding cannot be easily transformed, e.g., by a linear transformation, in a way that would let the AE use the learned features to reconstruct anomalous or unseen inputs.

Deep Transfer Learning (DTL) for Multiple Class Novelty Detection

This method also follows the idea of using a background dataset (called the reference dataset). DTL addresses the shortcomings of the softmax loss in OSR by proposing a new loss function called the membership loss. Specifically, each activation score $f_i$ in the final layer is normalized to [0, 1] with a sigmoid function, and the normalized score is interpreted as the probability that the input image belongs to class i. Ideally, given a label y, $f_i(x)$ should be 1 when y = i and 0 otherwise.

Another mechanism for improving detection performance is based on "globally negative filters". Filters that provide evidence for a particular class are considered positive filters, and vice versa. For pre-trained neural networks, it has been shown that only a small fraction of the final feature maps are positively activated, and some filters are always activated negatively, indicating that they are not relevant to any of the known classes. By discarding inputs that activate globally negative filters, novel samples are less likely to produce high activation scores. To learn such filters for a domain-specific task, DTL trains two parallel networks with shared weights up to the last layer: the first solves the classification task on the reference dataset, while the second solves the domain-specific classification task combined with the membership loss. If the reference dataset and the domain-specific dataset do not share much information, each provides negative filters for the other; and since the reference dataset is composed of many different classes, these learned filters can be regarded as globally negative filters. Finally, the filters of the parallel network, combined with the confidence scores of the domain-specific classifier, are used for novelty detection. Fig. 19 shows the overall network architecture.

Classification-Reconstruction Learning for Open Set Recognition (CROSR)

This method is based on an idea similar to C2AE. In particular, CROSR uses an encoder network for classification and also generates a latent vector for the reconstruction task. Importantly, the latent vector z used for reconstruction and the final layer y used for classification are not shared, because the information loss in the final layer is too large to distinguish unknown from known samples.

Generative and Discriminative Feature Representation for Open Set Recognition (GDFR)

Similar to CROSR, this work combines discriminative and generative models during training. Discriminative approaches may lose features that are useful for distinguishing seen from unseen samples, and generative models can provide complementary information. Similar to GT, GDFR employs SSL to improve the discriminative features: a shared network performs both the classification task and an SSL task that predicts the geometric transformation applied to the input. Furthermore, a generative model such as an AE produces a reconstruction $\hat{x}$ for a given input x, and the input-reconstruction pair (x, $\hat{x}$) is passed to the discriminative network to perform the classification and SSL tasks. Discrepancies between $\hat{x}$ and x for unseen samples help the network detect them. Fig. 21 illustrates this technique.

Conditional Gaussian Distribution Learning for Open Set Recognition (CGDL)

The main idea of this work is very similar to CROSR, but CGDL uses a probabilistic ladder network based on variational encoding and decoding. During training, samples are passed to the encoder, which estimates µ and σ at each layer; these statistics serve as priors for the corresponding decoder layers. The final embedding z of the encoder's top layer is used for the joint classification task and the decoding process. The distribution of the encoder's final layer is forced to resemble class-conditional multivariate Gaussians $p^k_\theta(z) = N(z; \mu_k, I)$, where k indexes the known classes and $\mu_k$ is obtained from a fully connected layer that maps the one-hot encoding of the input's label into the latent space. Each decoder layer is a Gaussian distribution whose mean and variance priors are provided by the statistics of the corresponding encoder layer.

A Hybrid Model for Open Set Recognition

In this method, the classification network is trained jointly with a flow-based generative model. Pixel-level generative models may not produce discriminative results for unseen samples and are not robust to semantically irrelevant noise. To address this, the flow-based model is applied in the feature-representation space instead of the pixel space (see Fig. 23). Flow-based models are chosen for their ease of use and solid theoretical grounding. The training loss, a combination of a simple cross-entropy loss and the negative log-likelihood, trains the classifier and the flow-based model jointly. At test time, a threshold is applied to the likelihood of each input; if the input passes, the classifier's output is assigned as its in-class label.

Learning Open Set Network with Discriminative Reciprocal Points (RPL)

Similar to Mem-AE, this method uses the concept of prototype features. Compared with softmax and OpenMax, RPL helps the model fit the boundaries of the different classes better and reduces the open-space risk. Initially, a random reciprocal point is selected for each class; the positions of the reciprocal points and the weights of the classifier network are then adjusted to minimize the classification loss. In this way, the features of each class are positioned with respect to a particular reciprocal point, so that the set of reciprocal points yields the desired class boundaries. To further reduce the risk, each class's samples are forced to respect a margin with respect to the reciprocal points learned during training.

Class Anchor Clustering: A Loss for Distance-Based Open Set Recognition (CAC)

The idea of this method is similar to RPL and GOAD. CAC defines, for each class, an anchor vector of dimension N (the number of classes); in each vector, the element corresponding to the class label is 1 and the others are 0. During training, the logit vector of each training sample is pulled into a compact ball around the anchor vector of its true class and pushed far from the anchors of the other classes, so CAC can be described as a multi-class DSVDD.

Few-Shot Open Set Recognition Using Meta-Learning (PEELER)

This method combines the idea of meta-learning with open set recognition. Meta-learning, also referred to as learning to learn, learns general features that can easily be adapted to unseen tasks and is useful when data are scarce because of its ability to work in few-shot settings. In meta-iteration i, the meta-model h is initialized with the model produced in the previous meta-iteration. Given a meta-learning dataset $\{(S^s_i, T^s_i)\}^{N^s}_{i=1}$ with $N^s$ training problems, two steps are performed: first, an estimate of the optimal model h for the training set $S^s_i$ is produced; then the test set $T^s_i$ is used to update the model with an appropriate loss function L.

 

Learning Placeholders for Open Set Recognition (PROSER)

This method learns a classifier that leaves room between the target and non-target classes. A dummy classifier is added to the softmax layer of a model with a shared feature extractor and is forced to produce the second-largest value for correctly classified samples. When the classifier encounters a novel input, the dummy classifier produces a high value, because all known classes are then non-targets. The dummy classifier can thus be viewed as an instance-dependent threshold that fits all known classes well.

Counterfactual zero-shot and open set visual recognition

This approach generates anomalous samples in a counterfactual manner. As noted in the paper, most generative approaches, such as G-OpenMax, do not produce the desired fake samples: the generated samples do not resemble the actual distribution of unseen data. For this purpose, a β-VAE is used to disentangle the sample-attribute variable Z from the class-attribute variable Y. The loss function of the β-VAE is the same as that of a plain VAE except that the KL term is weighted by the coefficient β, which is very effective for learning disentangled sample attributes Z. With Y and Z separated, the method creates counterfactual samples by changing the class variable Y, which moves them far from the given input x, in contrast to samples generated by changing Z. To make the counterfactual samples faithful, a Wasserstein GAN loss is used for the discriminator D(X, Y), which verifies the correspondence between the generated counterfactual images and their assigned labels. The generated samples can then be used to improve the performance of any OSR method.

Out-Of-Distribution Detection

OOD detection aims to identify test-time samples that should not be predicted as any known class because they are semantically different from the categories in the training data. For example, since CIFAR-10 and CIFAR-100 have mutually exclusive classes, a model can be trained on CIFAR-10 (the in-distribution data) and evaluated with CIFAR-100 as the out-of-distribution dataset. In the multiclass setting, the OOD detection problem is similar to OSR: accurately classify samples from known classes and detect unknown ones. However, OOD detection encompasses a broader range of learning tasks (e.g., multi-label classification) and solution spaces (e.g., density estimation without classification). This section presents approaches that relax the constraints imposed by OSR and achieve strong performance.

A baseline for detecting out-of-distribution examples that are misclassified by neural networks

This study coined the term "out-of-distribution (OOD) detection" and showed how to evaluate OOD detectors for deep learning models. While previous anomaly detection work for deep classifiers often used low-quality or proprietary datasets, this study reused existing datasets as out-of-distribution data to make evaluation easier. It proposes using the maximum softmax probability (MSP), $\max_k p(y = k \mid x)$, to detect OOD samples: test samples with large MSP scores are treated as in-distribution (ID) rather than out-of-distribution (OOD). It also showed that a p(y | x) model is effective for detecting OOD samples and that a p(x) model is not always necessary. To date, this serves as a common baseline that is not easy to beat; the OSR literature has suggested further refinements of the softmax probabilities for detection.
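A minimal sketch of the MSP baseline; the logits are assumed to come from an already-trained classifier, and the threshold value is purely illustrative.

```python
import numpy as np

def msp_score(logits):
    """Maximum softmax probability for a batch of logits of shape (batch, K)."""
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.max(axis=1)                               # high => in-distribution

def is_ood(logits, threshold=0.9):
    """Flag samples whose MSP falls below the chosen threshold as OOD."""
    return msp_score(logits) < threshold
```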

Improving the Reliability of Out-of-Distribution Image Detection in Neural Networks (ODIN)

This work employs a technique called temperature scaling. Temperature scaling had already been used in other areas such as knowledge distillation, but the main novelty here is showing its usefulness in the OOD domain. With temperature scaling, the softmax score is computed as in Equation 59, and OOD samples are detected at test time by thresholding the maximum class probability. This simple approach, combined with a small controlled input perturbation, shows a significant improvement over the MSP baseline. ODIN adds a one-step gradient perturbation to the input in the direction that increases the maximum score; this affects in-distribution samples more strongly and therefore gives them a larger margin over OOD samples.
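A rough PyTorch-style sketch of this scheme (not the authors' code): temperature scaling plus a one-step input perturbation; `model`, `temperature`, and `epsilon` are placeholder names and values.

```python
import torch
import torch.nn.functional as F

def odin_score(model, x, temperature=1000.0, epsilon=0.0014):
    """ODIN-style score: temperature-scaled softmax after a small input perturbation.

    model : classifier returning logits; x : input batch.
    The perturbation nudges x in the direction that increases the maximum
    temperature-scaled softmax score, which widens the gap between
    in-distribution and OOD inputs.
    """
    x = x.clone().detach().requires_grad_(True)
    logits = model(x) / temperature
    # cross-entropy w.r.t. the predicted label; stepping against its gradient
    # increases the maximum softmax score
    loss = F.cross_entropy(logits, logits.argmax(dim=1))
    loss.backward()
    x_perturbed = x - epsilon * x.grad.sign()              # one-step perturbation
    with torch.no_grad():
        probs = F.softmax(model(x_perturbed) / temperature, dim=1)
    return probs.max(dim=1).values                          # high => in-distribution
```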

A Simple Integrated Framework for Detecting Out-of-Distribution Samples and Hostile Attacks

This method is inspired by Linear Discriminant Analysis (LDA), which assumes P(X = x | Y = y) is a multivariate Gaussian. To make P(Y = y | X = x) take a form close to the softmax, the final feature layer of the network is assumed to follow class-conditional Gaussian distributions. The mean vector of each class and a shared covariance are then simply estimated from the features, and a multivariate Gaussian is fit to them. To exploit this assumption, the Mahalanobis distance of the image at test time is used for classification instead of the softmax function.
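A minimal numpy sketch of this scoring scheme, assuming feature vectors have already been extracted from the trained network; the function names and the regularization constant are illustrative.

```python
import numpy as np

def fit_class_gaussians(features, labels, num_classes):
    """Estimate per-class means and a shared (tied) covariance from ID features."""
    d = features.shape[1]
    means = np.stack([features[labels == k].mean(axis=0) for k in range(num_classes)])
    centered = features - means[labels]
    cov = centered.T @ centered / features.shape[0]
    precision = np.linalg.pinv(cov + 1e-6 * np.eye(d))     # regularized inverse
    return means, precision

def mahalanobis_score(x_feat, means, precision):
    """Negative minimum Mahalanobis distance to any class mean (high => in-distribution)."""
    diffs = means - x_feat                                   # shape (K, d)
    dists = np.einsum('kd,de,ke->k', diffs, precision, diffs)
    return -dists.min()
```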

Estimating Predictive Uncertainty with Prior Networks (DPN)

This approach distinguishes three sources of uncertainty: (1) data uncertainty, (2) distributional uncertainty, and (3) model uncertainty, and discusses why decomposing uncertainty into these terms matters. For example, model uncertainty may arise because the model cannot approximate the given distribution well, whereas data uncertainty arises because similar classes inherently overlap: classifying different breeds of dogs involves more data uncertainty than a classification problem with completely separate classes. Distributional uncertainty is the component associated with the detection problems of AD, ND, OSR, and OOD.

During training, the Dirichlet prior network (DPN) is expected to produce a flat distribution over the simplex for OOD samples, indicating large uncertainty in the mapping from x to y. Some out-of-distribution data are used to minimize the KL divergence between Dir(µ | α) and a flat Dirichlet distribution, while in-class samples minimize the KL divergence between Dir(µ | α) and a sharp, sparse Dirichlet distribution whose target parameters are set in advance during training. At test time, various criteria such as the maximum probability, the entropy of the final layer, and distributional-uncertainty measures such as Equation 65 are used for OOD detection.

Confidence-Calibrated Classifiers for Detecting Out-of-Distribution Samples

This method maximizes the entropy of the confidence scores for OOD samples, and such samples are generated by training a GAN jointly with the classifier. As shown in Equation 66, the first term solves the classification task on in-distribution samples, and the second term uses a KL divergence to make the confidence distribution of the generated OOD samples uniform. The remaining terms train the GAN on the in-distribution samples; note that the GAN is forced to generate high-quality OOD samples that yield high uncertainty when passed to the classifier, so the generated samples lie on the boundary between the in-distribution and outlier distributions. The paper also shows that utilizing such boundary samples can significantly improve confidence calibration.

Deep anomaly detection by outlier exposure (OE)

In this method, Outlier Exposure (OE) is introduced and its usefulness is demonstrated in a variety of experiments. When applied to a classifier, the Outlier Exposure loss encourages the model to output a uniform softmax distribution on outliers. In general, the Outlier Exposure objective function is as follows.
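The objective itself is shown only as an image in the original article; for a K-class classifier, a commonly used instantiation combines the usual cross-entropy on in-distribution data with a cross-entropy-to-uniform term on the auxiliary outliers, weighted by a hyperparameter λ:

$$ \mathbb{E}_{(x,y)\sim D_{in}}\left[\mathcal{L}_{CE}(f(x), y)\right] + \lambda\,\mathbb{E}_{x'\sim D^{OE}_{out}}\left[\mathcal{L}_{CE}\left(f(x'), \mathcal{U}\{1,\dots,K\}\right)\right] $$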

To create $ D^{OE}_{out} $, data that differ from the training data need to be scraped, curated, or downloaded. Samples for $ D^{OE}_{out} $ are collected from existing, readily available datasets that may not be directly related to the task-specific objective; however, because they contain a wide range of variation, they can significantly improve performance.

Using Self-Supervised Learning Can Improve Model Robustness and Uncertainty

This work investigates the benefits of combining a supervised learning task with an SSL objective to improve the classifier's robustness to simple distribution shifts and its OOD detection ability. To this end, an auxiliary rotation-prediction task is added to plain supervised classification. Robustness is measured against simple corruptions such as Gaussian noise, shot noise, blurring, zooming, and fog. The results confirm that while the auxiliary SSL task does not improve classification accuracy, it significantly improves the robustness and detection ability of the model. Furthermore, training the total loss in an adversarially robust manner improves robust accuracy. The method is also tested in the ND setting using rotation prediction together with simpler horizontal and vertical translation predictions, similar to GT and GOAD but simpler, and in the multiclass classification setting, where the auxiliary self-supervised objective improves the maximum softmax probability detector. In addition, a uniform confidence distribution is encouraged over background and outlier samples; as in Outlier Exposure, the outliers are taken from other accessible datasets.

Unsupervised Out-of-Distribution Detection by Maximum Classifier Discrepancy

The method is based on the surprising fact that two classifiers trained from different random initializations behave differently at the confidence layer for unseen test-time samples. Based on this fact, the study tries to increase the discrepancy between the two classifiers for unseen samples and decrease it for seen ones. The discrepancy loss is the difference between the entropy of the last layer of the first classifier and that of the second classifier; this lets the classifiers produce the same confidence scores for in-distribution inputs but a larger discrepancy for other inputs. Fig. 26 shows the overall architecture.

First, the two classifiers are trained on in-distribution samples to produce the same confidence scores. Next, an unlabeled dataset containing both OOD and in-distribution data is used to maximize the discrepancy for outliers while maintaining consistency for inliers.

Why ReLU Networks Yield High-Confidence Predictions Far Away from the Training Data

This work proves that a ReLU network represents a piecewise affine function, so on each polytope Q(x) it can be written as $f(x) = V^{l}x + a^{l}$, where

$n_l$ and $L$ denote the number of hidden units in the $l$-th layer and the total number of layers, respectively.

As α → ∞ (scaling the input by α), the softmax confidence tends to 1. This implies that a ReLU network has infinitely many inputs that receive high-confidence predictions. Note that arbitrarily high-confidence predictions cannot actually occur in practice, because the domain of the inputs is restricted.

Do Deep Generative Models Know What They Don't Know?

This paper uses likelihood ratios to alleviate the OOD detection problem of generative models. The key idea is to model the background and foreground information separately. Intuitively, when semantically irrelevant information is added to the input distribution, the background information is affected less than the foreground information. Thus, two autoregressive models are trained, one on the original inputs and one on a noise-perturbed version of them, and their likelihood ratio is defined as in Equation 75.

During testing, a threshold method is used for the likelihood ratio scores.

Likelihood Ratios for Out-of-Distribution Detection

In this paper, we employ likelihood ratios to alleviate the problem of OOD detection in generative models. The key idea is to model the background and foreground information separately. Intuitively, we assume that background information is less harmful than foreground information when semantically irrelevant information is added to the input distribution.

Generalized ODIN

As an extension of ODIN, this work proposes a dedicated network for learning the temperature scaling and a strategy for choosing the perturbation magnitude. G-ODIN defines an explicit binary domain variable d ∈ {d_in, d_out} representing whether the input x is an inlier (i.e., x ∼ p_in). The posterior can then be decomposed as p(y | d_in, x) = p(y, d_in | x) / p(d_in | x). In this form, the reason for assigning overconfident scores to outliers becomes clearer: p(y | d_in, x) can be large precisely because both p(y, d_in | x) and p(d_in | x) are small. Therefore, the decomposition is mimicked by two heads, h_i(x) and g(x), on a shared feature extractor. This structure is called dividend/divisor, and the logit of class i is written as $f_i(x) = h_i(x) / g(x)$. The training loss is a simple cross-entropy, as in previous approaches; note that it can be reduced either by increasing h_i(x) or by decreasing g(x). For example, if the input lies in a low-density region of the in-distribution, h_i(x) may be small, so g(x) must also be small to minimize the objective; in other cases, g(x) is encouraged to be large. In this way, the two heads approximate the roles of the distributions p(y | d_in, x) and p(d_in | x) described above. At test time, max_i h_i(x) or g(x) is used as the score. Fig. 27 gives an overview of the method.

Resampling of background data for outlier-aware classification.

As mentioned earlier, some AD, ND, OSR, and OOD detection methods use background or outlier datasets to improve performance; however, the size of the auxiliary dataset matters, since it can introduce various kinds of bias. This work proposes a resampling technique that selects an optimal subset of training samples from the outlier dataset so that samples near the boundary play a more influential role in the optimization. The work first provides an interesting probabilistic interpretation of the outlier exposure method: the loss function can be written as in Equation 78, where Lcls and Luni are given in Equations 76 and 77, respectively.

Detecting input complexity and out-of-distribution with likelihood-based generative models.

This paper further investigates why generative models assign high likelihood values to OOD samples. In particular, it finds a strong link between the complexity of a sample and its likelihood: the simpler the input, the higher the likelihood tends to be. This phenomenon is illustrated in Fig. 28. Another experiment supporting the claim starts from random noise and repeatedly applies average pooling, with upscaling after each pooling step to preserve the dimensionality; surprisingly, the simpler images produced by more pooling steps achieve higher likelihoods. Motivated by this, the work proposes detecting OOD samples by considering the complexity of the input together with its likelihood. Since computing the true complexity of an input is hard, a lossless compression algorithm is used to compute an upper bound: given inputs x coded with the same bit depth, the normalized size L(x) (in bits per dimension) of their compressed versions is used as the complexity measure. Finally, the OOD score is defined by combining the model's negative log-likelihood with this complexity estimate.
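A rough numpy sketch of this idea (not the paper's exact implementation): the complexity L(x) is estimated with a generic lossless compressor, here zlib purely for illustration, and compared against the generative model's negative log-likelihood, both in bits per dimension; inputs that the model explains poorly relative to their complexity receive higher OOD scores.

```python
import zlib
import numpy as np

def complexity_bits_per_dim(x_uint8):
    """Estimate L(x): compressed size of the raw bytes in bits per dimension."""
    raw = np.ascontiguousarray(x_uint8, dtype=np.uint8).tobytes()
    compressed = zlib.compress(raw, level=9)                # lossless compression
    return 8.0 * len(compressed) / x_uint8.size

def complexity_adjusted_score(nll_bits_per_dim, x_uint8):
    """OOD score comparing the model's code length with a generic compressor's.

    nll_bits_per_dim : negative log-likelihood of x under the generative model,
                       expressed in bits per dimension.
    A large value means the model compresses x much worse than a generic
    compressor would, which is taken as evidence that x is out-of-distribution.
    """
    return nll_bits_per_dim - complexity_bits_per_dim(x_uint8)
```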

 
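A minimal sketch of the complexity estimate and the resulting score is given below. It uses PNG compression via Pillow as one possible lossless compressor (the paper considers several), and the log-likelihood ℓM(x) is assumed to come from a separately trained generative model; the function names are illustrative.

```python
import io
import numpy as np
from PIL import Image

def complexity_bits_per_dim(img_uint8: np.ndarray) -> float:
    """Upper-bound the complexity L(x) by the PNG-compressed size
    in bits per dimension (pixels x channels)."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="PNG", optimize=True)
    n_bits = 8 * buf.getbuffer().nbytes
    return n_bits / img_uint8.size

def ood_score(log_likelihood_bpd: float, img_uint8: np.ndarray) -> float:
    """S(x) = -l_M(x) - L(x); larger values indicate more likely OOD."""
    return -log_likelihood_bpd - complexity_bits_per_dim(img_uint8)
```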

Energy-based out-of-distribution detection

This work proposes the use of energy scores derived from the logit outputs for OOD detection and shows that they are superior to softmax scores. An energy-based model maps each input x to a single deterministic scalar value called the energy. A set of energy values E(x, y) can be transformed into a density function p(x) through the Gibbs distribution.
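A minimal sketch of the energy score computed from classifier logits is shown below; lower energy corresponds to higher density, so the negative energy can be thresholded for OOD detection.

```python
import torch

def energy_score(logits: torch.Tensor, T: float = 1.0) -> torch.Tensor:
    """E(x; f) = -T * log(sum_y exp(f_y(x) / T)). ID samples tend to have
    lower energy (higher -E) than OOD samples."""
    return -T * torch.logsumexp(logits / T, dim=1)

# Example: flag inputs whose negative energy falls below a validation threshold.
# is_ood = -energy_score(model(x)) < tau
```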

Likelihood Regret: Out-of-Distribution Detection Score for Variational Autoencoders

Previous work has shown that VAEs can reconstruct OOD samples almost perfectly, which makes those samples hard to detect. The average test likelihood of a VAE across different datasets lies in a much narrower range than that of PixelCNN or Glow, indicating that it is much more difficult for a VAE to distinguish OOD samples from inlier samples. The reason could lie in the different ways these models capture the input distribution: autoregressive and flow-based methods model the input at the pixel level, whereas the bottleneck structure of the VAE forces the model to discard some information.

To address this problem, a criterion called likelihood regret has been proposed. It measures the discrepancy between a model trained to maximize the average likelihood of a training dataset (for example, a simple VAE) and an "ideal" model that maximizes the likelihood of the single input image under consideration. Intuitively, for in-distribution inputs the likelihood gap between the trained model and the ideal model should be small, whereas for OOD inputs it is not. Training the simple VAE amounts to maximizing the average evidence lower bound (ELBO) over the training set, i.e., (θ̂, φ̂) = argmax over (θ, φ) of (1/n) Σi ELBO(xi; θ, φ).
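A rough sketch of the likelihood-regret computation is shown below, assuming a trained VAE object whose `elbo(x)` method returns the per-sample ELBO and whose `encoder` submodule can be fine-tuned; both names, as well as the step count and learning rate, are assumptions for illustration.

```python
import copy
import torch

def likelihood_regret(vae, x, steps=100, lr=1e-4):
    """LR(x) = ELBO of the input-optimized model minus ELBO of the trained
    model. A large regret suggests x is OOD."""
    with torch.no_grad():
        base_elbo = vae.elbo(x)

    # Fine-tune a copy of the encoder on this single input only.
    vae_opt = copy.deepcopy(vae)
    optimizer = torch.optim.Adam(vae_opt.encoder.parameters(), lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = -vae_opt.elbo(x).mean()   # maximize the ELBO for this one sample
        loss.backward()
        optimizer.step()

    with torch.no_grad():
        optimized_elbo = vae_opt.elbo(x)
    return optimized_elbo - base_elbo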

Understanding Anomaly Detection with Deep Invertible Networks through Hierarchies of Distributions and Features

In this work, the behavior of flow-based generative models in OOD detection is studied. The authors note that local features such as smooth local patches can dominate the likelihood. As a result, smoother datasets such as SVHN obtain higher likelihoods than less smooth datasets such as CIFAR-10, regardless of the training dataset. Another interesting experiment shows that fully connected networks perform better than convolutional Glow networks when likelihood values are used to detect OOD samples, which also supports the existence of a relationship between local statistics, such as smoothness, and the likelihood. Fig. 30 compares the local statistics of various datasets, computed as the difference between each pixel value and the average of its 3 × 3 neighbors.
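The local statistic behind Fig. 30 can be reproduced with a few lines of NumPy/SciPy, as sketched below (the exact neighborhood handling in the paper may differ slightly); these local statistics also underlie the pseudo-likelihood discussed next.

```python
import numpy as np
from scipy.ndimage import uniform_filter

def local_difference_stats(images: np.ndarray) -> np.ndarray:
    """For images of shape (N, H, W) in [0, 1], return the per-pixel difference
    between each pixel and the mean of its 3x3 neighborhood. Smoother datasets
    (e.g. SVHN) concentrate these differences near zero."""
    smoothed = uniform_filter(images, size=(1, 3, 3), mode="reflect")
    return (images - smoothed).ravel()

# A histogram of these values per dataset reproduces the comparison in Fig. 30.
```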

A strong Spearman correlation is observed between this pseudo-likelihood and the exact likelihood. To deal with the problem, the following three steps are used:

- train a generative network on a general image distribution such as 80 Million Tiny Images

- train another generative network on images drawn from the in-distribution dataset (e.g., CIFAR-10)

- use the likelihood ratio between the two models for OOD detection

Self-Supervised Learning for Generalizable Out-of-Distribution Detection

In this work, a self-supervised learning method is used to leverage information from an unlabeled outlier dataset to improve the OOD detection ability of an in-distribution classifier. To do so, the classifier is first trained on in-distribution training samples until the desired performance is achieved. Then, an additional output (a set of k reject classes) is added to the last layer. Each training batch consists of ID data and a few outlier samples. The following loss functions are used

SSD: A Unified Framework for Self-Supervised Outlier Detection

The idea of this study is very similar to GDFR: because an SSL method is employed, no in-distribution labels are needed. This differs from some of the aforementioned methods, which must solve a classification task. As a result, SSD can be used flexibly in a variety of settings, including ND, OSR, and OOD detection. The main idea is to employ contrastive learning to learn semantically meaningful features. After representation learning, k-means clustering is applied, and each cluster m is characterized by its mean and covariance (µm, Σm). Then, for each test-time sample, the Mahalanobis distance to the nearest cluster center is used as the OOD detection score: s(x) = min over m of (zx − µm)ᵀ Σm⁻¹ (zx − µm), where zx is the learned feature of x.
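A minimal sketch of the scoring stage is given below, assuming the contrastively learned feature extractor is already available; cluster parameters are estimated with k-means on the training features, and the score is the Mahalanobis distance to the nearest cluster. Regularization and hyperparameters are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_clusters(train_features: np.ndarray, k: int = 1):
    """Estimate (mu_m, Sigma_m^{-1}) for each of the k clusters found by
    k-means on the in-distribution training features."""
    labels = KMeans(n_clusters=k, n_init=10).fit_predict(train_features)
    params = []
    for m in range(k):
        feats = train_features[labels == m]
        mu = feats.mean(axis=0)
        cov = np.cov(feats, rowvar=False) + 1e-6 * np.eye(feats.shape[1])
        params.append((mu, np.linalg.inv(cov)))
    return params

def ssd_score(z: np.ndarray, params) -> float:
    """s(z) = min_m (z - mu_m)^T Sigma_m^{-1} (z - mu_m); higher means more OOD."""
    return min(float((z - mu) @ inv_cov @ (z - mu)) for mu, inv_cov in params)
```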

MOOD: Multi-level out-of-distribution detection

In this study, the computational-efficiency aspect of OOD detection is investigated. Intuitively, some OOD samples can be detected using only low-level statistics, without complex modeling. To exploit this, several intermediate classifiers operating at different depths of the trained network are trained, as shown in Fig. 31. Finding the required exit depth calls for an estimate of the input's complexity; to this end, the number of bits used to encode the compressed image, L(x), is used. The exit depth I(x) is then determined by the complexity range to which the sample belongs.
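A sketch of the exit-selection logic follows; PNG compression is used here as one possible encoder, and the thresholds partitioning the complexity range are hypothetical values for illustration.

```python
import io
import numpy as np
from PIL import Image

def compressed_bits(img_uint8: np.ndarray) -> int:
    """Number of bits of the PNG-compressed image, a proxy for complexity L(x)."""
    buf = io.BytesIO()
    Image.fromarray(img_uint8).save(buf, format="PNG", optimize=True)
    return 8 * buf.getbuffer().nbytes

def select_exit(img_uint8: np.ndarray, thresholds=(3000, 6000, 9000)) -> int:
    """Map the complexity estimate to one of len(thresholds)+1 exits:
    simple inputs leave the network early, complex inputs use deeper exits."""
    bits = compressed_bits(img_uint8)
    for i, t in enumerate(thresholds):
        if bits <= t:
            return i
    return len(thresholds)
```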

MOS: Towards scaling of out-of-distribution detection for large semantic spaces

MOS first revealed that OOD detection performance can decrease significantly as the number of in-distribution classes increases. For example, its analysis shows that as the number of classes grows from 50 to 1,000 on ImageNet-1k, the average false positive rate (at 95% true positive rate) of a typical baseline increases from 17.34% to 76.94%. To overcome this challenge, the key idea of MOS is to decompose the large semantic space into smaller groups of similar concepts, which simplifies the decision boundaries between known and unknown data. Specifically, MOS divides the C categories into K groups G1, G2, ..., GK. Grouping is done based on the taxonomy of the label space if it is known, by applying k-means to features extracted from the last layer of a pre-trained network, or by random grouping. The standard per-group softmax for group Gk is then p^k_c(x) = exp(f^k_c(x)) / Σ over c′ ∈ Gk of exp(f^k_{c′}(x)) for c ∈ Gk, where each group additionally contains an extra "others" category.
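A sketch of the group-wise softmax and the resulting score is given below. Following the paper's idea, each group is assumed to contain an extra "others" category as its last entry, and the minimum "others" probability over the groups is used as the OOD score (for an ID input at least one group recognizes the sample, so this minimum is small; for OOD inputs it stays large). The slice bookkeeping is illustrative.

```python
import torch
import torch.nn.functional as F

def group_softmax(logits: torch.Tensor, group_slices):
    """logits: (B, C_total), where the classes of group k (with its 'others'
    class as the last entry) occupy the slice group_slices[k].
    Returns a list of per-group probability tensors."""
    return [F.softmax(logits[:, s], dim=1) for s in group_slices]

def mos_ood_score(logits: torch.Tensor, group_slices) -> torch.Tensor:
    """Minimum 'others' probability over groups; higher means more likely OOD."""
    probs = group_softmax(logits, group_slices)
    others = torch.stack([p[:, -1] for p in probs], dim=1)  # (B, K)
    return others.min(dim=1).values
```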

Can Multi-Label Classification Networks Know What They Don't Know?

In this study, the capability of OOD detectors in a multi-label classification setting is investigated, where each input sample may carry more than one label. This makes the problem harder, since the joint distribution among labels must be modeled. The work proposes the JointEnergy criterion, a simple and effective way to estimate an OOD indicator score by aggregating per-label energy scores over all labels, and shows that JointEnergy can be mathematically interpreted in terms of the joint likelihood.
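A minimal sketch of the JointEnergy score for a multi-label network (one logit per label) is shown below; higher scores indicate in-distribution inputs.

```python
import torch
import torch.nn.functional as F

def joint_energy(logits: torch.Tensor) -> torch.Tensor:
    """JointEnergy(x) = sum_c log(1 + exp(f_c(x))), i.e. the sum of per-label
    (negative) energies. In-distribution inputs tend to score higher."""
    return F.softplus(logits).sum(dim=1)   # softplus(z) = log(1 + exp(z))
```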

On the importance of gradients for detecting wild distribution shifts

This work proposes a simple post-hoc OOD detection method, GradNorm, which uses the vector norm of the gradients with respect to the weights, obtained by backpropagating the KL divergence between the softmax output and a uniform probability distribution. GradNorm is generally higher for in-distribution (ID) data than for OOD data and can therefore be used for OOD detection. Specifically, the KL divergence is DKL(u ∥ softmax(f(x))) = (1/C) Σ over c of log((1/C) / pc(x)), where u is the uniform distribution over the C classes and pc(x) is the softmax probability of class c.
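A sketch of the GradNorm score is shown below: the KL divergence to the uniform distribution is backpropagated, and the L1 norm of the resulting gradient (here taken with respect to a final linear layer passed in explicitly) is used as the score; larger norms indicate in-distribution data. The interface is illustrative.

```python
import torch
import torch.nn.functional as F

def gradnorm_score(model: torch.nn.Module, x: torch.Tensor,
                   last_layer: torch.nn.Linear) -> float:
    """Backpropagate KL(uniform || softmax(f(x))) and return the L1 norm of
    the gradient w.r.t. the last layer's weights (x: a single input, batch 1)."""
    model.zero_grad()
    logits = model(x)
    num_classes = logits.shape[1]
    uniform = torch.full_like(logits, 1.0 / num_classes)
    loss = F.kl_div(F.log_softmax(logits, dim=1), uniform, reduction="batchmean")
    loss.backward()
    return last_layer.weight.grad.abs().sum().item()
```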

data set

semantic-level data set

Below is a summary of the datasets that can be used to detect semantic anomalies. Semantic anomalies are the kinds of anomalies where a change in a pixel leads to a change in semantic content. Datasets such as MNIST, Fashion-MNIST, SVHN, and COIL-100 are considered toy datasets. CIFAR-10, CIFAR-100, LSUN, and TinyImageNet are hard datasets with many variations in color, lighting, and background. Finally, Flowers and Birds are fine-grained semantic datasets, which makes the problem even more difficult.

pixel-level data set

In these datasets, unseen samples, outliers, or anomalies have no semantic difference from the inliers; instead, some parts of the original image are defective while the original semantic content remains recognizable. Examples include MVTec AD, PCB, LaceAD, Retinal-OCT, CAMELYON16, Chest X-Rays, Species, and ImageNet-O.

composite data set

These datasets are typically created from semantic-level datasets, but the amount of pixel-level variation is controlled so that the unseen, novel, or anomalous samples test different aspects of the trained model while preserving semantic content. For example, MNIST-C contains MNIST samples with various types of added noise, such as shot noise and impulse noise, which are random corruptions that may occur during the imaging process. These datasets can be used not only to test the robustness of a model but also to train models in AD settings instead of novelty detection or open-set recognition. Given the lack of comprehensive research on anomaly detection, these datasets can be very beneficial.

MNIST-C, ImageNet-C, and ImageNet-P are examples of such datasets.

evaluation procedure

AUC-ROC is often used as an evaluation metric because it summarizes performance across all thresholds rather than at a single one. FPR@TPR reports the false positive rate at a fixed true positive rate (for example, FPR at 95% TPR). AUPR, the area under the Precision-Recall curve, is another threshold-free metric.

Accuracy is usually used in OSR; the F-measure (F-score), the harmonic mean of precision and recall, is also commonly reported.
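For reference, these metrics can be computed from raw OOD scores with scikit-learn as sketched below (labels are assumed to be 1 for OOD and 0 for ID, and scores higher for OOD inputs).

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve

def ood_metrics(labels: np.ndarray, scores: np.ndarray, tpr_level: float = 0.95):
    """labels: 1 for OOD, 0 for ID; scores: higher = more OOD.
    Returns AUROC, AUPR, and FPR at the requested TPR (e.g. FPR@95TPR)."""
    auroc = roc_auc_score(labels, scores)
    aupr = average_precision_score(labels, scores)
    fpr, tpr, _ = roc_curve(labels, scores)
    fpr_at_tpr = float(fpr[np.argmax(tpr >= tpr_level)])
    return auroc, aupr, fpr_at_tpr
```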

Challenges for the Future

Baseline assessment and OOD detection evaluation protocol

There is room for improvement in the evaluation protocol for OOD detection. For example, we trained a mixture of three Gaussian distributions per channel at the pixel level on the CIFAR-10 dataset (as ID) and evaluated it against OOD datasets such as TinyImageNet (crop), TinyImageNet (resize), LSUN, LSUN (resize), and iSUN; a rough sketch of such a baseline is given at the end of this subsection. TABLE 1 shows the detection results on the different datasets. Despite its simplicity, the results are comparable to SOTA. LSUN in particular performs worse because most of its colors and textures are uniform, with little variation and structure: similar to what was observed with likelihood-based methods, LSUN lies "inside" CIFAR-10, with similar means but lower variance, and is therefore more likely under the broader distribution.

We also gain better insight into the performance of OOD detection baselines by evaluating them on both near-distribution and far-distribution datasets. For models trained on CIFAR-10, we use CIFAR-100 as the near-OOD dataset. Results are shown in TABLEs 2, 3, and 5. As shown, none of the methods detect both near and far OOD samples well, except for the OE approach, which uses an additional auxiliary dataset. The Mahalanobis distance improves most methods on far-OOD samples but degrades near-OOD detection; moreover, it can even hurt the detection of some far-OOD samples because of inaccurate Gaussian density estimation. Resizing or cropping the OOD dataset also changes performance significantly (see, for example, the SVHN column in TABLE 5), indicating a reliance on low-level statistics, which is consistent with recently reported shortcomings of the Mahalanobis distance.

One remedy is to apply input preprocessing techniques, such as those used in ODIN, to reduce the impact of first- and second-order statistics on the OOD score; however, the extra forward and backward passes at test time increase the execution time. Similarly, ensembles and MC-Dropout can be slightly superior to other methods on some OOD datasets, but they require multiple forward passes, which significantly increases the runtime; MC-Dropout, for example, is reported to be 40 times slower than a simple MSP. In summary, we recommend that future work evaluate OOD detection on both near- and far-OOD datasets.
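The kind of simple pixel-level Gaussian-mixture baseline mentioned above can be sketched as follows (per-channel three-component mixtures fitted with scikit-learn; the exact preprocessing used in the survey's experiment may differ).

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_pixel_gmms(train_images: np.ndarray, n_components: int = 3):
    """train_images: (N, H, W, 3) in [0, 1]. Fit one 1-D GMM per color channel
    over all pixel values of the in-distribution training set."""
    return [
        GaussianMixture(n_components=n_components).fit(
            train_images[..., c].reshape(-1, 1)
        )
        for c in range(train_images.shape[-1])
    ]

def gmm_log_likelihood(image: np.ndarray, gmms) -> float:
    """Average per-pixel log-likelihood of a single (H, W, 3) image;
    low values are treated as OOD."""
    return float(np.mean([
        gmms[c].score(image[..., c].reshape(-1, 1))
        for c in range(image.shape[-1])
    ]))
```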

AD Needs to Be Explored More

As mentioned earlier, AD and ND are not historically or fundamentally the same. A very important and practical category of real-world problems involves training data that cannot easily be cleaned and consequently contains various types of noise, such as label noise and data noise. This is the case for complex and dangerous systems such as modern nuclear power plants, military aircraft carriers, air traffic control, and other high-risk systems. Recently proposed ND methods need to be evaluated in AD settings using the proposed synthetic datasets, and new solutions need to be proposed. Since the openness score of AD detectors is usually high, recall must be high and the false alarm rate must be low for them to be practical. Additionally, almost all AD or ND methods are evaluated in a one-vs-all setting, which does not properly approximate real scenarios in which the normal class contains several distribution modes. Therefore, evaluating AD or ND methods in a multiclass setting similar to the OSR domain, but without access to the labels, would give a clearer perspective on the utility of SOTA methods.

OSR Methods for Pixel Datasets

Almost all OSR methods are evaluated on semantic datasets. Since the class boundaries of such datasets are usually far apart, discriminative or generative methods can effectively model the differences between them. However, in many applications, such as chest X-ray datasets, the variation is subtle, and existing methods may perform poorly. For example, a model may be trained on 14 known chest diseases, and a new disease such as COVID-19 may emerge as an unknown; the model would then need to detect it as a new disease rather than classify it into an existing category. Also, in many clinical applications where medical datasets are collected, disease images are usually more accessible than healthy ones. Hence, the OSR model may need to learn disease images as the normal classes and detect healthy images as abnormal inputs.

TABLE 4 shows the performance of a simple MSP baseline on the MVTecAD dataset when several frequent defect types are treated as the known (normal) classes. The goal in such a scenario is to detect and classify well-known defects while distinguishing rare defects as outliers that need to be treated differently. Although this is a common and practical industrial setting, the baseline does not perform better than random, casting doubt on its generality for safety-critical applications. Recently, a paper has shown the effectiveness of placing a Gaussian prior on the penultimate layer of the classifier network, similar to some of the previously mentioned work, in tasks where the class distributions are very similar to each other, for example the Flowers or Birds datasets presented in the previous section. However, this setup is much more practical and much harder than the previous one, so more research is needed.

Small sample size

Learning with small sample sizes is always difficult but desirable. One way to tackle this problem is to leverage meta-learning algorithms to learn generalizable features that can be adapted to AD, ND, OSR, or OOD detection using only a few training samples. One challenge is handling the distributional shift between the training and adaptation phases, which may call for meta-learning algorithms designed for the one-class setting. Other approaches generate synthetic OOD datasets to improve the few-shot classification of in-distribution samples. While the combination of meta-learning with AD, ND, OOD detection, and OSR has recently received a great deal of attention, several important aspects remain unexplored, including generalization to detecting UUCs using only a small number of KUCs, and the convergence of meta-learning algorithms in a one-class setting.

Adversarial Robustness

An imperceptible perturbation that is carefully designed to trick a deep learning model into making a false prediction is called an adversarial attack. Classifiers have been shown to be susceptible to such attacks, resulting in significant performance degradation at test time. It is therefore important that OOD detection, as well as OSR, AD, and ND, be robust against adversarial attacks. Recent studies in OSR, ND, and OOD detection have investigated the impact of adversarial attacks on these models, but more work is needed. For example, AD anomalies or OSR UUCs are not accessible during training, so it is not easy to obtain models that are robust to attacked anomalies or UUCs. The relationship between different defense approaches against adversarial attacks and novelty detection may also reveal important insights into the internal mechanisms of these models. For example, membership attacks attempt to infer whether an input sample was used during training, which can be seen as designing a novelty detector without requiring it to generalize to UUC samples. One paper also investigates the relationship between poisoning-attack detection and novelty detectors: poisoned examples intentionally added by an attacker to mount a backdoor attack can be treated as one type of "outlier" in the training dataset, and differential privacy is claimed to improve not only outlier and novelty detection but also the detection of backdoor attacks in ND models. From a completely different perspective, adversarially robust training can be used to make the learned feature space more semantic. This path has been employed in ARAE and Puzzle-AE to improve the performance of autoencoders in detecting unseen test-time samples, and the same holds for one-class learning methods, where robustness is shown to be beneficial in detecting novel samples. This direction needs further investigation; for example, unlike standard adversarial attacks in the classification task, attacks in AD or ND do not need to be imperceptible, and perceptible attacks might further improve detection performance.

Model fairness and bias

Research on fairness has grown substantially in recent years, as models can become biased toward sensitive variables during training. For example, one paper shows that in an attribute-classification task on the CelebA dataset, the presence of an attribute is correlated with the gender of the person in the image, which is undesirable. Attributes such as gender in this example are referred to as protected variables. In the OOD detection literature, recent work has systematically investigated how spurious correlations in the training set affect OOD detection; the results suggest that as the correlation between spurious features and labels increases, OOD detection performance deteriorates significantly. For example, a model may exploit the spurious correlation between a water background and the label "waterbird" for prediction; models relying on such spurious features can then produce confident predictions for OOD inputs with the same background (i.e., water) but different semantic labels (e.g., boat). There also seems to be a fundamental tension between fairness and AD or ND: fairness aims to build unbiased models in which equality constraints between minority and majority samples hold, whereas the goal of AD models is to assign higher anomaly scores to rarely occurring events. To address this issue, a fairness-aware AD method has been proposed that uses labels for the protected variables as additional supervision during training. From another perspective, an important bias can be introduced into semi-supervised anomaly detection methods such as DSAD. Suppose DSAD is deployed by a law enforcement agency to find suspicious persons with surveillance cameras. Because some training samples were used as anomaly samples during training, the model may be biased toward detecting particular types of anomalies more than others; if, for instance, there were more males than females in the auxiliary anomaly training data, the threshold for flagging males as anomalous at test time may be looser than for females. The same can occur in classification settings such as OOD detection and OSR. One paper reports unfair bias with respect to several irrelevant protected variables in a chest-disease classifier trained on a chest X-ray dataset. In short, fairness and AD, ND, OSR, and OOD detection appear to be strongly interrelated in several important applications.

multimodal data set

In many cases, training datasets consist of multimodal samples. For example, in a chest X-ray dataset, image labels may be detected automatically by applying NLP methods to the radiologist's report. In such situations, co-training on the different modalities helps the model learn better semantic features; however, the model then needs to be robust across modalities. For example, in a visual question answering task, we expect the model not to generate answers for out-of-distribution input text or images. Correlations between the modalities also need to be taken into account: training AD, ND, OOD detection, or OSR models on each modality separately ignores these correlations and can lead to suboptimal solutions. To address this issue, one work investigated the performance of VQA models in detecting OOD test-time samples, but more issues remain to be investigated in this direction.

Explainability Challenge

Explainable AI (XAI) has recently been found to play a very important role, especially when deep networks are used in safety-critical applications. Because of such critical applications, AD, OSR, ND, and OOD detection models should be able to explain why they make the decisions they do. For example, if a person is identified as suspicious by a surveillance camera, there should be a clear reason for that decision. The explainability problem can be posed in two ways: first, explaining why a sample is normal, known, or in-distribution; second, explaining why a sample is abnormal, unknown, or out-of-distribution. Various methods in the literature explain model decisions, such as Multi-KD, CutPaste, Grad-CAM, and SmoothGrad; however, these are mainly used to explain normal, seen, or in-distribution samples, and their results are not as accurate for unseen or abnormal inputs. VAE-based methods have also been proposed that can provide explanations, detecting anomalies in the input sample while also accurately describing the corresponding normal sample; however, they do not work well on complex training datasets such as CIFAR-10, indicating that further investigation is needed. Another important explainability issue arises in one-class classification or ND approaches: only one label is available during training, so Grad-CAM or SmoothGrad, which rely on the availability of fine-grained labels, can no longer be used. To address this, a fully convolutional architecture combined with a heatmap-upsampling algorithm called receptive field upsampling has been proposed: starting from the latent vectors of the samples, the effect of the applied convolution operators is reversed to find the important regions of a given input. Nevertheless, explainable OCC models are still largely unexplored, and further research in this direction is needed.

Multi-label OOD detection and large data sets

OOD detection for multiclass classification has been studied extensively, but the problem is still under-explored for multi-label networks, where multiple true labels must be recognized for each input. This is harder because multi-label classification has more complex class boundaries, and unseen behavior may appear in only a subset of an input sample's labels. The challenges of multi-label datasets can also be investigated in the anomaly segmentation task: unlike classification, where the entire image is reported as anomalous, specific anomalous regions must be localized. Current methods have been evaluated primarily on small datasets such as CIFAR, and it has been shown that approaches developed on the CIFAR benchmark may not translate effectively to the ImageNet benchmark with its large semantic space, highlighting the need to evaluate OOD detection in large-scale, real-world settings. We therefore recommend that future research be evaluated on ImageNet-based OOD detection benchmarks to test the limits of the developed methods.

data augmentation

One source of uncertainty in classifying known or normal training samples can be a lack of generalization. For example, rotating an image of a bird does not change its content; it still needs to be recognized as a bird. Some of the works mentioned above attempt to embed this property into the model by designing various SSL objective functions. Another way to achieve it is through data augmentation, a common technique for enriching training datasets, and several approaches use different augmentation techniques to improve generalization performance.

Another perspective is to generate unseen (anomalous) samples and use them to transform the one-class learning problem into a simple two-class classification task; in the OSR setting, other papers follow the same idea. These can also be seen as operations on the training dataset that enrich it for the subsequent detection task. From what has been said, it is clear that working on the data instead of the model can be very effective and should be explored further in future work.

Open World Recognition

In a controlled lab environment it may be sufficient to detect novel, unknown, or out-of-distribution samples, but a deployed system must continuously detect new categories and add them to its recognition capabilities. This becomes even more challenging because such a system requires minimal downtime, even while learning. Open-world recognition therefore requires a few additional steps: new classes must be detected continuously, and the system must be updated to include them in its multiclass open-set recognition algorithm. These processes pose a variety of challenges, from the scalability of current open-set recognition algorithms to the design of new learning algorithms that avoid problems such as catastrophic forgetting in OSR classifiers. Moreover, all of the aforementioned future directions can be reformulated within the open-world recognition problem, which means existing work on this subject needs to be revisited and investigated further.

Vision Transformers in OOD Detection and OSR

Vision Transformers (ViTs) have recently been proposed as an alternative to CNNs and have shown excellent performance in a variety of applications such as object detection, medical image segmentation, and visual tracking. Several methods have likewise reported the advantages of ViT in OOD detection, demonstrating its ability to detect near-OOD samples. For example, when a ViT was trained on CIFAR-10 and tested on CIFAR-100 as the inlier and outlier datasets, respectively, it was reported to have a significant advantage over previous works. However, since ViTs are usually pre-trained on very large datasets such as ImageNet-22K, which overlap heavily with the training and test datasets, the assumption of a train-test discrepancy no longer holds, and the question becomes "how much do we remember from pre-training?". In other words, ViT should be evaluated on datasets that do not intersect with the pre-training knowledge. To examine this, ViT-B16 was evaluated on SVHN and MNIST with six randomly selected classes treated as normal and the remaining classes treated as outliers or unseen. MSP is used to detect unknown samples, and as shown in TABLE 6, a ViT-B16 pre-trained on ImageNet-22K is not better than other baselines trained from scratch. All experiments use a near-OOD detection setting and thus support the aforementioned concern about ViT. Accordingly, a future research direction is to evaluate ViTs in more controlled situations so that their actual gains can be assessed more accurately. Indeed, the recent Species dataset collects examples that do not fall into any of the ImageNet-22K classes, which is a first step toward correcting this problem.

summary

In many applications it is not possible to model all classes that may arise at test time, and fields dealing with such scenarios, such as OOD detection, OSR, one-class learning (ND), and AD, have become ubiquitous. In this paper we have therefore provided a comprehensive review of existing techniques, datasets, evaluation criteria, and future challenges. More importantly, we have analyzed and discussed the limitations of the approaches and pointed out promising research directions. We hope this helps the research community develop a broader, cross-disciplinary perspective.
