NeurIPS 2020 Highlights
This is a summary of the conference, compiled by the Paper Digest Team on their Paper Digest platform.
The Paper Digest Team is a New York-based research group working on text analysis.
TITLE 
HIGHLIGHT 

1 
We adopt kernel distance and propose transform-sum-cat as an alternative to aggregate-transform to reflect the continuous similarity between the node neighborhoods in the neighborhood aggregation. 

2 
An Unsupervised Information-Theoretic Perceptual Quality Metric 
We combine recent advances in information-theoretic objective functions with a computational architecture informed by the physiology of the human visual system and unsupervised training on pairs of video frames, yielding our Perceptual Information Metric (PIM). 
3 
In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. 

4 
Benchmarking Deep Inverse Models over time, and the Neural-Adjoint method 
We consider the task of solving generic inverse problems, where one wishes to determine the hidden parameters of a natural system that will give rise to a particular set of measurements. 
5 
Off-Policy Evaluation and Learning for External Validity under a Covariate Shift 
In this paper, we derive the efficiency bound of OPE under a covariate shift. 
6 
In this work, instead of estimating the expected dependency, we focus on estimating pointwise dependency (PD), which quantitatively measures how likely two outcomes co-occur. 

7 
Fast and Flexible Temporal Point Processes with Triangular Maps 
By exploiting the recent developments in the field of normalizing flows, we design TriTPP – a new class of non-recurrent TPP models, where both sampling and likelihood computation can be done in parallel. 
8 
Backpropagating Linearly Improves Transferability of Adversarial Examples 
In this paper, we study the transferability of such examples, which lays the foundation of many black-box attacks on DNNs. 
9 
PyGlove: Symbolic Programming for Automated Machine Learning 
In this paper, we introduce a new way of programming AutoML based on symbolic programming. 
10 
Fourier Sparse Leverage Scores and Approximate Kernel Learning 
We prove new explicit upper bounds on the leverage scores of Fourier sparse functions under both the Gaussian and Laplace measures. 
11 
Improved Algorithms for Online Submodular Maximization via First-order Regret Bounds 
In this work, we give a general approach for improving regret bounds in online submodular maximization by exploiting “first-order” regret bounds for online linear optimization. 
12 
Synbols: Probing Learning Algorithms with Synthetic Datasets 
In this sense, we introduce Synbols — Synthetic Symbols — a tool for rapidly generating new datasets with a rich composition of latent features rendered in low resolution images. 
13 
Adversarially Robust Streaming Algorithms via Differential Privacy 
We establish a connection between adversarial robustness of streaming algorithms and the notion of differential privacy. 
14 
Trading Personalization for Accuracy: Data Debugging in Collaborative Filtering 
In this paper, we propose a data debugging framework to identify overly personalized ratings whose existence degrades the performance of a given collaborative filtering model. 
15 
This work proposes an autoregressive model with sublinear parallel time generation. 

16 
Improving Local Identifiability in Probabilistic Box Embeddings 
In this work we model the box parameters with min and max Gumbel distributions, which were chosen such that the space is still closed under the operation of intersection. 
17 
Permute-and-Flip: A new mechanism for differentially private selection 
In this work, we propose a new mechanism for this task based on a careful analysis of the privacy constraints. 
18 
Inspired by classical analysis techniques for partial observations of chaotic attractors, we introduce a general embedding technique for univariate and multivariate time series, consisting of an autoencoder trained with a novel latent-space loss function. 

19 
Reciprocal Adversarial Learning via Characteristic Functions 
We generalise this by comparing the distributions rather than their moments via a powerful tool, i.e., the characteristic function (CF), which uniquely and universally comprises all the information about a distribution. 
20 
Statistical Guarantees of Distributed Nearest Neighbor Classification 
Through majority voting, the distributed nearest neighbor classifier achieves the same rate of convergence as its oracle version in terms of the regret, up to a multiplicative constant that depends solely on the data dimension. 
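The voting scheme described above is easy to illustrate. Below is a minimal sketch (illustrative only, not the authors' code, and all names are hypothetical): training data is sharded across machines, each machine runs a plain k-nearest-neighbor classifier, and the final label is a majority vote over the per-machine predictions.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=5):
    # Plain k-nearest-neighbor majority vote for a single query point.
    dists = np.linalg.norm(X_train - x, axis=1)
    nearest_labels = y_train[np.argsort(dists)[:k]]
    return np.bincount(nearest_labels).argmax()

def distributed_knn_predict(X, y, x, n_machines=4, k=5, seed=0):
    # Shard the training data across machines, run kNN locally on each
    # shard, then take a majority vote over the per-machine predictions.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    votes = [knn_predict(X[shard], y[shard], x, k=k)
             for shard in np.array_split(idx, n_machines)]
    return np.bincount(votes).argmax()
```

The paper's result is about this aggregation step: the voted classifier matches the convergence rate of running kNN on the pooled data, up to a dimension-dependent constant.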
21 
We propose a new Stein self-repulsive dynamics for obtaining diversified samples from intractable unnormalized distributions. 

22 
In this paper, we study the statistical guarantees on the excess risk achieved by early-stopped unconstrained mirror descent algorithms applied to the unregularized empirical risk with the squared loss for linear models and kernel methods. 

23 
Algorithmic recourse under imperfect causal knowledge: a probabilistic approach 
To address this limitation, we propose two probabilistic approaches to select optimal actions that achieve recourse with high probability given limited causal knowledge (e.g., only the causal graph). 
24 
Quantitative Propagation of Chaos for SGD in Wide Neural Networks 
In this paper, we investigate the limiting behavior of a continuous-time counterpart of the Stochastic Gradient Descent (SGD) algorithm applied to two-layer overparameterized neural networks, as the number of neurons (i.e., the size of the hidden layer) $N \to +\infty$. 
25 
We present a causal view on the robustness of neural networks against input manipulations, which applies not only to traditional classification tasks but also to general measurement data. 

26 
Minimax Classification with 0-1 Loss and Performance Guarantees 
This paper presents minimax risk classifiers (MRCs) that do not rely on a choice of surrogate loss and family of rules. 
27 
How to Learn a Useful Critic? Model-based Action-Gradient-Estimator Policy Optimization 
In this paper, we propose MAGE, a model-based actor-critic algorithm, grounded in the theory of policy gradients, which explicitly learns the action-value gradient. 
28 
This paper introduces the problem of coresets for regression problems to panel data settings. 

29 
Learning Composable Energy Surrogates for PDE Order Reduction 
To address this, we leverage parametric modular structure to learn component-level surrogates, enabling cheaper high-fidelity simulation. 
30 
We create a computationally tractable learning algorithm for contextual bandits with continuous actions having unknown structure. 

31 
We present a flexible framework for learning predictive models that approximately satisfy the equalized odds notion of fairness. 

32 
Multi-Robot Collision Avoidance under Uncertainty with Probabilistic Safety Barrier Certificates 
This paper aims to propose a collision avoidance method that accounts for both measurement uncertainty and motion uncertainty. 
33 
In this paper, we prove that hard affine shape constraints on function derivatives can be encoded in kernel machines which represent one of the most flexible and powerful tools in machine learning and statistics. 

34 
A Closer Look at the Training Strategy for Modern Meta-Learning 
The support/query (S/Q) episodic training strategy has been widely used in modern meta-learning algorithms and is believed to improve their generalization ability to test environments. This paper conducts a theoretical investigation of this training strategy on generalization. 
35 
On the Value of Out-of-Distribution Testing: An Example of Goodhart's Law 
We provide short- and long-term solutions to avoid these pitfalls and realize the benefits of OOD evaluation. 
36 
We introduce a framework for inference in general state-space hidden Markov models (HMMs) under likelihood misspecification. 

37 
Deterministic Approximation for Submodular Maximization over a Matroid in Nearly Linear Time 
We study the problem of maximizing a non-monotone, non-negative submodular function subject to a matroid constraint. 
38 
Flows for simultaneous manifold learning and density estimation 
We introduce manifold-learning flows (ℳ-flows), a new class of generative models that simultaneously learn the data manifold as well as a tractable probability density on that manifold. 
39 
Simultaneous Preference and Metric Learning from Paired Comparisons 
In this paper, we consider the problem of learning an ideal point representation of a user’s preferences when the distance metric is an unknown Mahalanobis metric. 
40 
Efficient Variational Inference for Sparse Deep Learning with Theoretical Guarantee 
In this paper, we train sparse deep neural networks with a fully Bayesian treatment under spike-and-slab priors, and develop a set of computationally efficient variational inferences via continuous relaxation of the Bernoulli distribution. 
41 
Learning Manifold Implicitly via Explicit HeatKernel Learning 
In this paper, we propose the concept of implicit manifold learning, where manifold information is implicitly obtained by learning the associated heat kernel. 
42 
Deep Relational Topic Modeling via Graph Poisson Gamma Belief Network 
To better utilize the document network, we first propose graph Poisson factor analysis (GPFA) that constructs a probabilistic model for interconnected documents and also provides closed-form Gibbs sampling update equations, moving beyond sophisticated approximate assumptions of existing RTMs. 
43 
This paper presents one-bit supervision, a novel setting of learning from incomplete annotations, in the scenario of image classification. 

44 
In this paper, we provide new tools and analysis to address these fundamental questions. 

45 
In this paper, we introduce a novel technique for constrained submodular maximization, inspired by barrier functions in continuous optimization. 

46 
The proposed framework, termed Convolutional Neural Networks with Feedback (CNNF), introduces a generative feedback with latent variables to existing CNN architectures, where consistent predictions are made through alternating MAP inference under a Bayesian framework. 

47 
Learning to Extrapolate Knowledge: Transductive Few-shot Out-of-Graph Link Prediction 
Motivated by this challenge, we introduce a realistic problem of few-shot out-of-graph link prediction, where we not only predict the links between the seen and unseen nodes as in a conventional out-of-knowledge-graph link prediction task but also between the unseen nodes, with only few edges per node. 
48 
Exploiting weakly supervised visual patterns to learn from partial annotations 
Instead, in this paper, we exploit relationships among images and labels to derive more supervisory signal from the unannotated labels. 
49 
We consider the problem of lossy image compression with deep latent variable models. 

50 
In this work, we propose a novel concept of neuron merging applicable to both fully connected layers and convolution layers, which compensates for the information loss due to the pruned neurons/filters. 

51 
FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence 
In this paper we propose FixMatch, an algorithm that is a significant simplification of existing SSL methods. 
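The core of the simplification is a confidence-thresholded pseudo-labeling step: the model's prediction on a weakly augmented view of an unlabeled image is kept as a hard pseudo-label only when its confidence clears a threshold, and is then used as the training target for a strongly augmented view. A minimal sketch of just the selection step (function name and threshold are illustrative, not the authors' code):

```python
import numpy as np

def select_pseudo_labels(weak_probs, tau=0.95):
    # Keep an unlabeled example only when the model's class probability on
    # its weakly augmented view exceeds the confidence threshold tau.
    # Returns the indices of retained examples and their hard pseudo-labels,
    # which would then supervise predictions on strongly augmented views.
    weak_probs = np.asarray(weak_probs)
    confidence = weak_probs.max(axis=1)
    keep = np.flatnonzero(confidence >= tau)
    return keep, weak_probs[keep].argmax(axis=1)
```

Low-confidence examples simply contribute no unlabeled loss for that batch, which is what keeps the method stable without extra machinery.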
52 
Reinforcement Learning with Combinatorial Actions: An Application to Vehicle Routing 
We develop a framework for value-function-based deep reinforcement learning with a combinatorial action space, in which the action selection problem is explicitly formulated as a mixed-integer optimization problem. 
53 
Towards Playing Full MOBA Games with Deep Reinforcement Learning 
In this paper, we propose a MOBA AI learning paradigm that methodologically enables playing full MOBA games with deep reinforcement learning. 
54 
Rankmax: An Adaptive Projection Alternative to the Softmax Function 
In this work, we propose a method that adapts this parameter to individual training examples. 
55 
In this work we provide the first agnostic online boosting algorithm; that is, given a weak learner with only marginally-better-than-trivial regret guarantees, our algorithm boosts it to a strong learner with sublinear regret. 

56 
Causal Intervention for Weakly-Supervised Semantic Segmentation 
We present a causal inference framework to improve Weakly-Supervised Semantic Segmentation (WSSS). 
57 
To bridge this gap, we introduce belief propagation neural networks (BPNNs), a class of parameterized operators that operate on factor graphs and generalize Belief Propagation (BP). 

58 
Overparameterized Adversarial Training: An Analysis Overcoming the Curse of Dimensionality 
Our work proves convergence to low robust training loss for \emph{polynomial} width instead of exponential, under natural assumptions and with ReLU activations. 
59 
Post-training Iterative Hierarchical Data Augmentation for Deep Networks 
In this paper, we propose a new iterative hierarchical data augmentation (IHDA) method to fine-tune trained deep neural networks to improve their generalization performance. 
60 
We investigate whether post-hoc model explanations are effective for diagnosing model errors, i.e., model debugging. 

61 
In this paper we propose an algorithm inspired by the Median-of-Means (MOM). 

62 
Fairness without Demographics through Adversarially Reweighted Learning 
In this work we address this problem by proposing Adversarially Reweighted Learning (ARL). 
63 
Stochastic Latent Actor-Critic: Deep Reinforcement Learning with a Latent Variable Model 
In this work, we tackle these two problems separately, by explicitly learning latent representations that can accelerate reinforcement learning from images. 
64 
Ridge Rider: Finding Diverse Solutions by Following Eigenvectors of the Hessian 
In this paper, we present a different approach. Rather than following the gradient, which corresponds to a locally greedy direction, we instead follow the eigenvectors of the Hessian. 
65 
The route to chaos in routing games: When is price of anarchy too optimistic? 
We study MWU using the actual game costs without applying cost normalization to $[0,1]$. 
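The multiplicative weights update (MWU) at issue is the standard exponential-weights rule, applied here to raw game costs rather than costs rescaled to $[0,1]$. A generic sketch of one update step (illustrative only; the paper's analysis of the resulting dynamics is not reproduced here):

```python
import numpy as np

def mwu_step(weights, costs, eta=0.1):
    # One multiplicative-weights update: scale each strategy's weight down
    # exponentially in the (unnormalized) cost it just incurred.
    return weights * np.exp(-eta * np.asarray(costs, dtype=float))

def mixed_strategy(weights):
    # The played mixed strategy is the normalized weight vector.
    return weights / weights.sum()
```

Skipping the usual normalization of costs to $[0,1]$ changes the effective step size, which is exactly what the paper exploits to exhibit chaotic behavior.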
66 
Online Algorithm for Unsupervised Sequential Selection with Contextual Information 
In this paper, we study Contextual Unsupervised Sequential Selection (USS), a new variant of the stochastic contextual bandits problem where the loss of an arm cannot be inferred from the observed feedback. 
67 
This paper aims to improve the generalization of neural architectures via domain adaptation. 

68 
What went wrong and when? Instance-wise feature importance for time-series black-box models 
We propose FIT, a framework that evaluates the importance of observations for a multivariate time-series black-box model by quantifying the shift in the predictive distribution over time. 
69 
To close this gap, we propose \textit{\textbf{S}table \textbf{A}daptive \textbf{G}radient \textbf{D}escent} (\textsc{SAGD}) for non-convex optimization which leverages differential privacy to boost the generalization performance of adaptive gradient methods. 

70 
This paper is in the same vein: starting with a surrogate RL objective that involves smoothing in the trajectory space, we arrive at a new algorithm for learning guidance rewards. 

71 
Variance Reduction via Accelerated Dual Averaging for Finite-Sum Optimization 
In this paper, we introduce a simplified and unified method for finite-sum convex optimization, named \emph{Variance Reduction via Accelerated Dual Averaging (VRADA)}. 
72 
Tree! I am no Tree! I am a low dimensional Hyperbolic Embedding 
In this paper, we explore a new method for learning hyperbolic representations by taking a metricfirst approach. 
73 
Deep Structural Causal Models for Tractable Counterfactual Inference 
We formulate a general framework for building structural causal models (SCMs) with deep learning components. 
74 
A key contribution of our work is the encoding of the mesh and texture as 2D representations, which are semantically aligned and can be easily modeled by a 2D convolutional GAN. 

75 
A Statistical Framework for Low-bitwidth Training of Deep Neural Networks 
In this paper, we address this problem by presenting a statistical framework for analyzing FQT algorithms. 
76 
To resolve this limitation, we propose a simple and general network module called Set Refiner Network (SRN). 

77 
AutoSync: Learning to Synchronize for DataParallel Distributed Deep Learning 
In this paper, we develop a model- and resource-dependent representation for synchronization, which unifies multiple synchronization aspects ranging from architecture, message partitioning, placement scheme, to communication topology. 
78 
In this work we study how the learning of modular solutions can allow for effective generalization to both unseen and potentially differently distributed data. 

79 
We prove negative results in this regard, and show that for depth-$2$ networks, and many “natural” weight distributions such as the normal and the uniform distribution, most networks are hard to learn. 

80 
Based on the Hermitian matrix representation of digraphs, we present a nearly-linear time algorithm for digraph clustering, and further show that our proposed algorithm can be implemented in sublinear time under reasonable assumptions. 

81 
We propose a method that combines the advantages of both types of approaches, while addressing their limitations: we extend a primal-dual framework drawn from the graph-neural-network literature to triangle meshes, and define convolutions on two types of graphs constructed from an input mesh. 

82 
The Advantage of Conditional Meta-Learning for Biased Regularization and Fine Tuning 
We address this limitation by conditional meta-learning, inferring a conditioning function that maps a task's side information into a meta-parameter vector appropriate for the task at hand. 
83 
Watch out! Motion is Blurring the Vision of Your Deep Neural Networks 
We propose a novel adversarial attack method that can generate visually natural motion-blurred adversarial examples, named motion-based adversarial blur attack (ABBA). 
84 
In this paper, we consider the problem of computing the barycenter of a set of probability distributions under the Sinkhorn divergence. 

85 
We suggest a generic framework for computing sensitivities (and thus coresets) for a wide family of loss functions which we call near-convex functions. 

86 
We introduce a simple modification to standard deep ensembles training, through addition of a computationally tractable, randomised and untrainable function to each ensemble member, that enables a posterior interpretation in the infinite width limit. 

87 
Improved Schemes for Episodic Memory-based Lifelong Learning 
In this paper, we provide the first unified view of episodic memory-based approaches from an optimization perspective. 
88 
We propose an adaptive sampling algorithm for stochastically optimizing the {\em Conditional Value-at-Risk (CVaR)} of a loss distribution, which measures its performance on the $\alpha$ fraction of most difficult examples. 
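Concretely, the empirical CVaR at level $\alpha$ is just the average loss over the hardest $\alpha$-fraction of examples. A small sketch of that quantity (the paper's contribution, the adaptive sampling scheme for optimizing it, is not shown):

```python
import numpy as np

def cvar(losses, alpha=0.1):
    # Empirical Conditional Value-at-Risk: the mean loss over the
    # ceil(alpha * n) largest (i.e. most difficult) losses.
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]
    n_tail = max(1, int(np.ceil(alpha * len(losses))))
    return losses[:n_tail].mean()
```

At alpha = 1 this degenerates to the ordinary mean loss; smaller alpha focuses the objective on ever harder examples.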

89 
Deep Wiener Deconvolution: Wiener Meets Deep Learning for Image Deblurring 
We present a simple and effective approach for non-blind image deblurring, combining classical techniques and deep learning. 
90 
This paper introduces a new meta-learning approach that discovers an entire update rule which includes both ‘what to predict’ (e.g. value functions) and ‘how to learn from it’ (e.g. bootstrapping) by interacting with a set of environments. 

91 
The key contribution of this work addresses this scalability challenge via an efficient reduction of discrete integration to model counting. 

92 
To address this issue, we present a novel and general approach for blind video temporal consistency. 

93 
Simplify and Robustify Negative Sampling for Implicit Collaborative Filtering 
In this paper, we first provide a novel understanding of negative instances by empirically observing that only a few instances are potentially important for model learning, and false negatives tend to have stable predictions over many training iterations. 
94 
Model Selection for Production System via Automated Online Experiments 
We propose an automated online experimentation mechanism that can efficiently perform model selection from a large pool of models with a small number of online experiments. 
95 
On the Almost Sure Convergence of Stochastic Gradient Descent in Non-Convex Problems 
In this paper, we analyze the trajectories of stochastic gradient descent (SGD) with the aim of understanding their convergence properties in nonconvex problems. 
96 
Automatic Perturbation Analysis for Scalable Certified Robustness and Beyond 
In this paper, we develop an automatic framework to enable perturbation analysis on any neural network structures, by generalizing existing LiRPA algorithms such as CROWN to operate on general computational graphs. 
97 
Adaptation Properties Allow Identification of Optimized Neural Codes 
Here we solve an inverse problem: characterizing the objective and constraint functions that efficient codes appear to be optimal for, on the basis of how they adapt to different stimulus distributions. 
98 
Global Convergence and Variance Reduction for a Class of Nonconvex-Nonconcave Minimax Problems 
In this work, we show that for a subclass of nonconvex-nonconcave objectives satisfying a so-called two-sided Polyak-{\L}ojasiewicz inequality, the alternating gradient descent ascent (AGDA) algorithm converges globally at a linear rate and the stochastic AGDA achieves a sublinear rate. 
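The AGDA update itself is simple: a descent step in the min variable, then an ascent step in the max variable that already uses the updated min variable. A toy sketch on a convex-concave saddle problem (chosen only to make convergence easy to see; the paper's setting is the harder nonconvex-nonconcave PL case, and all names here are illustrative):

```python
def agda(grad_x, grad_y, x, y, lr=0.1, steps=100):
    # Alternating gradient descent ascent: descend in x, then ascend in y
    # using the freshly updated x (the alternation is the point).
    for _ in range(steps):
        x = x - lr * grad_x(x, y)
        y = y + lr * grad_y(x, y)
    return x, y

# Toy saddle objective f(x, y) = x**2 - y**2, with saddle point at (0, 0).
grad_x = lambda x, y: 2.0 * x
grad_y = lambda x, y: -2.0 * y
```

On this toy objective both coordinates contract toward the saddle point at a linear rate, mirroring (in a much easier setting) the linear-rate guarantee the paper proves under the two-sided PL condition.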
99 
Model-Based Multi-Agent RL in Zero-Sum Markov Games with Near-Optimal Sample Complexity 
In this paper, we aim to address the fundamental open question about the sample complexity of model-based MARL. 
100 
In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. 

101 
In this paper, we address OIM in the linear threshold (LT) model. 

102 
We develop a novel data-driven ensembling strategy for combining geophysical models using Bayesian Neural Networks, which infers spatiotemporally varying model weights and bias while accounting for heteroscedastic uncertainties in the observations. 

103 
Delving into the Cyclic Mechanism in Semi-supervised Video Object Segmentation 
In this paper, we attempt to incorporate the cyclic mechanism into the task of semi-supervised video object segmentation. 
104 
Asymmetric Shapley values: incorporating causal knowledge into model-agnostic explainability 
We introduce a less restrictive framework, Asymmetric Shapley values (ASVs), which are rigorously founded on a set of axioms, applicable to any AI system, and can flexibly incorporate any causal structure known to be respected by the data. 
105 
In this paper, we take an initial step toward an understanding of such hybrid deep architectures by showing that properties of the algorithm layers, such as convergence, stability and sensitivity, are intimately related to the approximation and generalization abilities of the end-to-end model. 

106 
Planning in Markov Decision Processes with Gap-Dependent Sample Complexity 
We propose MDP-GapE, a new trajectory-based Monte-Carlo Tree Search algorithm for planning in a Markov Decision Process in which transitions have a finite support. 
107 
Provably Good Batch Off-Policy Reinforcement Learning Without Great Exploration 
We show that using \emph{pessimistic value estimates} in the low-data regions in Bellman optimality and evaluation backup can yield more adaptive and stronger guarantees when the concentrability assumption does not hold. 
108 
Detection as Regression: Certified Object Detection with Median Smoothing 
This work is motivated by recent progress on certified classification by randomized smoothing. We start by presenting a reduction from object detection to a regression problem. 
109 
Contextual Reserve Price Optimization in Auctions via Mixed Integer Programming 
We study the problem of learning a linear model to set the reserve price in an auction, given contextual information, in order to maximize expected revenue from the seller side. 
110 
ExpandNets: Linear Overparameterization to Train Compact Convolutional Networks 
We introduce an approach to training a given compact network. 
111 
In this paper, we propose an encryption algorithm/architecture to compress quantized weights so as to achieve fractional numbers of bits per weight. 

112 
The Implications of Local Correlation on Learning Some Deep Functions 
We introduce a property of distributions, denoted “local correlation”, which requires that small patches of the input image and of intermediate layers of the target function are correlated to the target label. 
113 
Learning to search efficiently for causally nearoptimal treatments 
We formalize this problem as learning a policy for finding a nearoptimal treatment in a minimum number of trials using a causal inference framework. 
114 
A Game Theoretic Analysis of Additive Adversarial Attacks and Defenses 
In this paper, we propose a gametheoretic framework for studying attacks and defenses which exist in equilibrium. 
115 
Posterior Network: Uncertainty Estimation without OOD Samples via Density-Based Pseudo-Counts 
In this work we propose the Posterior Network (PostNet), which uses Normalizing Flows to predict an individual closed-form posterior distribution over predicted probabilities for any input sample. 
116 
In this work we construct the first quantum recurrent neural network (QRNN) with demonstrable performance on nontrivial tasks such as sequence learning and integer digit classification. 

117 
No-Regret Learning and Mixed Nash Equilibria: They Do Not Mix 
In this paper, we study the dynamics of follow the regularized leader (FTRL), arguably the most well-studied class of no-regret dynamics, and we establish a sweeping negative result showing that the notion of mixed Nash equilibrium is antithetical to no-regret learning. 
118 
A Unifying View of Optimism in Episodic Reinforcement Learning 
In this paper we provide a general framework for designing, analyzing and implementing such algorithms in the episodic reinforcement learning problem. 
119 
In this paper, we propose the first continuous optimization algorithms that achieve a constant factor approximation guarantee for the problem of monotone continuous submodular maximization subject to a linear constraint. 

120 
An Asymptotically Optimal Primal-Dual Incremental Algorithm for Contextual Linear Bandits 
In this paper, we follow recent approaches of deriving asymptotically optimal algorithms from problem-dependent regret lower bounds, and we introduce a novel algorithm improving over the state-of-the-art along multiple dimensions. 
121 
Assessing SATNet's Ability to Solve the Symbol Grounding Problem 
In this paper, we clarify SATNet’s capabilities by showing that in the absence of intermediate labels that identify individual Sudoku digit images with their logical representations, SATNet completely fails at visual Sudoku (0% test accuracy). 
122 
We investigate neural network representations from a probabilistic perspective. 

123 
On the Similarity between the Laplace and Neural Tangent Kernels 
Here we show that NTK for fully connected networks with ReLU activation is closely related to the standard Laplace kernel. 
124 
Here we describe an approach for compositional generalization that builds on causal ideas. 

125 
We introduce a general framework (HiPPO) for the online compression of continuous signals and discrete time series by projection onto polynomial bases. 
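In the batch (offline) analogue of this idea, projecting a windowed signal onto a polynomial basis is a least-squares fit of basis coefficients; HiPPO's contribution is maintaining such a projection online as new samples arrive. A sketch of only the offline version, using a Legendre basis (illustrative, not the authors' code):

```python
import numpy as np
from numpy.polynomial.legendre import Legendre

# A smooth test signal on [0, 1] and its degree-8 Legendre projection:
# the 9 fitted coefficients act as a compressed memory of the whole window.
t = np.linspace(0.0, 1.0, 200)
signal = np.tanh(4.0 * t - 2.0)
proj = Legendre.fit(t, signal, deg=8)
max_err = np.max(np.abs(proj(t) - signal))
```

Nine coefficients reconstruct the 200-sample window to small error, which is the sense in which projection onto a polynomial basis compresses the signal.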

126 
In this paper, we devise an Auto Learning Attention (AutoLA) method, which is the first attempt at automatic attention design. 

127 
We introduce Causal Structure Learning (CASTLE) regularization and propose to regularize a neural network by jointly learning the causal relationships between variables. 

128 
Long-Tailed Classification by Keeping the Good and Removing the Bad Momentum Causal Effect 
In this paper, we establish a causal inference framework, which not only unravels the whys of previous methods, but also derives a new principled solution. 
129 
We prove, however, that outcomes of the important Borda rule can be explained using O(m^2) steps, where m is the number of alternatives. 
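For reference, the Borda rule itself is simple: with m alternatives, each voter awards m-1 points to their top-ranked alternative, m-2 to the next, and so on down to 0; the alternative with the highest total wins. A minimal sketch (the paper's explanation-length result is about this rule, not this code):

```python
from collections import defaultdict

def borda_winner(rankings):
    # Each ranking lists alternatives from most to least preferred;
    # position i earns the alternative m - 1 - i points.
    scores = defaultdict(int)
    for ranking in rankings:
        m = len(ranking)
        for position, alt in enumerate(ranking):
            scores[alt] += m - 1 - position
    return max(scores, key=scores.get)
```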

130 
In this paper, we introduce ACNet, a novel differentiable neural network architecture that enforces structural properties and enables one to learn an important class of copulas: Archimedean copulas. 

131 
Re-Examining Linear Embeddings for High-Dimensional Bayesian Optimization 
In this paper, we identify several crucial issues and misconceptions about the use of linear embeddings for BO. 
132 
UnModNet: Learning to Unwrap a Modulo Image for High Dynamic Range Imaging 
In this paper, we reformulate the modulo image unwrapping problem into a series of binary labeling problems and propose a modulo edge-aware model, named UnModNet, to iteratively estimate the binary rollover masks of the modulo image for unwrapping. 
133 
Thunder: a Fast Coordinate Selection Solver for Sparse Learning 
In this paper, we propose a novel active incremental approach to further improve the efficiency of the solvers. 
134 
Neural Networks Fail to Learn Periodic Functions and How to Fix It 
As a fix of this problem, we propose a new activation, namely, $x + \sin^2(x)$, which achieves the desired periodic inductive bias to learn a periodic function while maintaining a favorable optimization property of the ReLU-based activations. 
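The proposed activation is a one-liner; a sketch, together with a check of its built-in periodic structure (the nonlinear part sin^2(x) repeats with period pi):

```python
import numpy as np

def snake(x):
    # x + sin^2(x): grows roughly linearly like a ReLU-family activation,
    # but carries a periodic component that helps extrapolate periodic signals.
    return x + np.sin(x) ** 2
```

Because sin^2 has period pi, shifting the input by pi shifts the output by exactly pi, which is the inductive bias the paper exploits.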
135 
In this paper, we show that imposing Gaussians to annotations hurts generalization performance. 

136 
In this paper, we propose a fully differentiable pipeline for estimating accurate dense correspondences between 3D point clouds. 

137 
Learning to Dispatch for Job Shop Scheduling via Deep Reinforcement Learning 
In this paper, we propose to automatically learn PDRs via an end-to-end deep reinforcement learning agent. 
138 
While prior evaluation papers focused mainly on the end result (showing that a defense was ineffective), this paper focuses on laying out the methodology and the approach necessary to perform an adaptive attack. 

139 
In this regard, we propose a novel Sinkhorn Natural Gradient (SiNG) algorithm which acts as a steepest descent method on the probability space endowed with the Sinkhorn divergence. 

140 
Online Sinkhorn: Optimal Transport distances from sample streams 
This paper introduces a new online estimator of entropyregularized OT distances between two such arbitrary distributions. 
141 
In this paper, we propose a representation living on a pseudo-Riemannian manifold of constant nonzero curvature. 

142 
We fill this gap by introducing efficient online algorithms (based on a single versatile master algorithm) each adapting to one of the following regularities: (i) local Lipschitzness of the competitor function, (ii) local metric dimension of the instance sequence, (iii) local performance of the predictor across different regions of the instance space. 

143 
Compositional Generalization via Neural-Symbolic Stack Machines 
To tackle this issue, we propose the Neural-Symbolic Stack Machine (NeSS). 
144 
Graphon Neural Networks and the Transferability of Graph Neural Networks 
In this paper we introduce graphon NNs as limit objects of GNNs and prove a bound on the difference between the output of a GNN and its limit graphon NN. 
145 
Unreasonable Effectiveness of Greedy Algorithms in Multi-Armed Bandit with Many Arms 
We study the structure of regret-minimizing policies in the {\em many-armed} Bayesian multi-armed bandit problem: in particular, with $k$ the number of arms and $T$ the time horizon, we consider the case where $k \geq \sqrt{T}$. 
146 
Gamma-Models: Generative Temporal Difference Learning for Infinite-Horizon Prediction 
We introduce the gamma-model, a predictive model of environment dynamics with an infinite, probabilistic horizon. 
147 
We present a probabilistic framework to automatically learn which layer(s) to use by learning the posterior distributions of layer selection. 

148 
Neural Mesh Flow: 3D Manifold Mesh Generation via Diffeomorphic Flows 
In this work, we propose NeuralMeshFlow (NMF) to generate two-manifold meshes for genus-0 shapes. 
149 
Statistical control for spatio-temporal MEG/EEG source imaging with desparsified multi-task Lasso 
To deal with this, we adapt the desparsified Lasso estimator (an estimator tailored for high-dimensional linear models that asymptotically follows a Gaussian distribution under sparsity and moderate feature correlation assumptions) to temporal data corrupted with autocorrelated noise. 
150 
A Scalable MIP-based Method for Learning Optimal Multivariate Decision Trees 
In this paper, we propose a novel MIP formulation, based on the 1-norm support vector machine model, to train a binary oblique ODT for classification problems. 
151 
We present a new system, EEV, for efficient and exact verification of BNNs. 

152 
In this paper, we propose a number of novel techniques and numerical representation formats that enable, for the very first time, the precision of training systems to be aggressively scaled from 8 bits to 4 bits. 

153 
Bridging the Gap between Sample-based and One-shot Neural Architecture Search with BONAS 
In this work, we propose BONAS (Bayesian Optimized Neural Architecture Search), a sample-based NAS framework which is accelerated using weight-sharing to evaluate multiple related architectures simultaneously. 
154 
Recently, a provocative claim was published that number sense spontaneously emerges in a deep neural network trained merely for visual object recognition. If true, this would have far-reaching significance for the fields of machine learning and cognitive science alike. In this paper, we prove the above claim to be, unfortunately, incorrect. 

155 
Outlier Robust Mean Estimation with Subgaussian Rates via Stability 
We study the problem of outlier-robust high-dimensional mean estimation under a bounded covariance assumption, and more broadly under bounded low-degree moment assumptions. 
156 
In this work, we introduce a self-supervised method that implicitly learns the visual relationships without relying on any ground-truth visual relationship annotations. 

157 
Information Theoretic Counterfactual Learning from Missing-Not-At-Random Feedback 
To circumvent the use of RCTs, we build an information-theoretic counterfactual variational information bottleneck (CVIB), as an alternative for debiasing learning without RCTs. 
158 
Prophet Attention: Predicting Attention with Future Attention 
In this paper, we propose the Prophet Attention, similar to the form of self-supervision. 
159 
Specifically, we train GPT-3, an autoregressive language model with 175 billion parameters, 10x more than any previous non-sparse language model, and test its performance in the few-shot setting. 

160 
In this work, we first demonstrate that the k'th margin bound is inadequate in explaining the performance of state-of-the-art gradient boosters. We then explain the shortcomings of the k'th margin bound and prove a stronger and more refined margin-based generalization bound that indeed succeeds in explaining the performance of modern gradient boosters. 

161 
To address these shortcomings, we propose a novel attribution prior, where the Fourier transform of input-level attribution scores is computed at training time, and high-frequency components of the Fourier spectrum are penalized. 

162 
MomentumRNN: Integrating Momentum into Recurrent Neural Networks 
We theoretically prove and numerically demonstrate that MomentumRNNs alleviate the vanishing gradient issue in training RNNs. 
163 
Marginal Utility for Planning in Continuous or Large Discrete Action Spaces 
In this paper we explore explicitly learning a candidate action generator by optimizing a novel objective, marginal utility. 
164 
In this work, we propose a projected Stein variational gradient descent (pSVGD) method to overcome this challenge by exploiting the fundamental property of intrinsic low dimensionality of the data-informed subspace stemming from the ill-posedness of such problems. 

165 
Minimax Lower Bounds for Transfer Learning with Linear and One-hidden Layer Neural Networks 
In this paper we develop a statistical minimax framework to characterize the fundamental limits of transfer learning in the context of regression with linear and one-hidden layer neural network models. 
166 
SE(3)-Transformers: 3D Roto-Translation Equivariant Attention Networks 
We introduce the SE(3)-Transformer, a variant of the self-attention module for 3D point clouds, which is equivariant under continuous 3D roto-translations. 
167 
On the equivalence of molecular graph convolution and molecular wave function with poor basis set 
In this study, we demonstrate that the linear combination of atomic orbitals (LCAO), an approximation introduced by Pauling and Lennard-Jones in the 1920s, corresponds to graph convolutional networks (GCNs) for molecules. 
168 
We study the impact of predictions in online Linear Quadratic Regulator control with both stochastic and adversarial disturbances in the dynamics. 

169 
Learning Affordance Landscapes for Interaction Exploration in 3D Environments 
We introduce a reinforcement learning approach for exploration for interaction, whereby an embodied agent autonomously discovers the affordance landscape of a new unmapped 3D environment (such as an unfamiliar kitchen). 
170 
We design a distributed learning algorithm that overcomes the informational bias players have towards maximizing the rewards of nearby players about whom they have more information. 

171 
Tight First- and Second-Order Regret Bounds for Adversarial Linear Bandits 
We propose novel algorithms with first- and second-order regret bounds for adversarial linear bandits. 
172 
Just Pick a Sign: Optimizing Deep Multitask Models with Gradient Sign Dropout 
We present Gradient Sign Dropout (GradDrop), a probabilistic masking procedure which samples gradients at an activation layer based on their level of consistency. 
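For intuition only, a simplified sign-consistency masking step in the spirit of this idea might look like the following sketch; the exact purity formula and sampling rule here are assumptions, not the authors' implementation.

```python
import numpy as np

def grad_drop(grads, rng):
    """Mask per-task gradients at a shared activation by sign consistency.

    grads: array of shape (num_tasks, dim). For each unit, a 'purity' score
    in [0, 1] measures how much the task gradients agree in sign; we then
    keep either the positive-sign or the negative-sign gradients, sampled
    with probability tied to that agreement. Illustrative sketch only.
    """
    total = grads.sum(axis=0)
    abs_total = np.abs(grads).sum(axis=0) + 1e-12
    purity = 0.5 * (1.0 + total / abs_total)       # 1.0 = all signs agree (+)
    keep_positive = rng.random(grads.shape[1]) < purity
    mask = np.where(keep_positive[None, :], grads > 0, grads < 0)
    return grads * mask
```

After masking, each unit backpropagates gradients of a single sign only, which is the consistency property the highlight refers to.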
173 
A Loss Function for Generative Neural Networks Based on Watson's Perceptual Model 
We propose such a loss function based on Watson’s perceptual model, which computes a weighted distance in frequency space and accounts for luminance and contrast masking. 
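The core of such a loss, stripped of the luminance and contrast masking terms the highlight mentions, is a per-frequency weighted distance. A minimal sketch (the function name and the FFT choice are assumptions for illustration):

```python
import numpy as np

def frequency_weighted_distance(x, y, weights):
    """Weighted squared distance between two images in frequency space.

    `weights` down- or up-weights individual frequencies, loosely mimicking
    the sensitivity weighting of a perceptual model. Illustrative only; a
    faithful version would add luminance and contrast masking.
    """
    X = np.fft.fft2(x)
    Y = np.fft.fft2(y)
    return float(np.sum(weights * np.abs(X - Y) ** 2))
```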
174 
Dynamic Fusion of Eye Movement Data and Verbal Narrations in Knowledge-rich Domains 
We propose to jointly analyze experts’ eye movements and verbal narrations to discover important and interpretable knowledge patterns to better understand their decisionmaking processes. 
175 
Scalable Multi-Agent Reinforcement Learning for Networked Systems with Average Reward 
In this paper, we identify a rich class of networked MARL problems where the model exhibits a local dependence structure that allows it to be solved in a scalable manner. 
176 
Koopman operator theory, a powerful framework for discovering the underlying dynamics of nonlinear dynamical systems, was recently shown to be intimately connected with neural network training. In this work, we take the first steps in making use of this connection. 

177 
SVGD as a kernelized Wasserstein gradient flow of the chi-squared divergence 
We introduce a new perspective on SVGD that instead views SVGD as the kernelized gradient flow of the chi-squared divergence. 
178 
In this work, we strike a better balance by considering a model that involves learning a representation while at the same time giving a precise generalization bound and a robustness certificate. 

179 
In this work, we learn such policies for an unknown distribution P using samples from P. 

180 
In this work, we investigate the role of two biologically plausible mechanisms in adversarial robustness. 

181 
For the specific problem of ReLU regression (equivalently, agnostically learning a ReLU), we show that any statistical-query algorithm with tolerance $n^{-(1/\epsilon)^b}$ must use at least $2^{n^c} \epsilon$ queries for some constants $b, c > 0$, where $n$ is the dimension and $\epsilon$ is the accuracy parameter. 

182 
This paper closes this gap for the first time: we propose an optimistic variant of the Nash Q-learning algorithm with sample complexity $\tilde{O}(SAB)$, and a new Nash V-learning algorithm with sample complexity $\tilde{O}(S(A+B))$. 

183 
We propose a novel learning framework based on neural mean-field dynamics for inference and estimation problems of diffusion on networks. 

184 
With this in mind, we offer a new interpretation for teacher-student training as amortized MAP estimation, such that teacher predictions enable instance-specific regularization. 

185 
In this paper we propose a new framework based on a "uniform localized convergence" principle. 

186 
Cross-lingual Retrieval for Iterative Self-Supervised Training 
In this work, we found that the cross-lingual alignment can be further improved by training seq2seq models on sentence pairs mined using their own encoder outputs. 
187 
In this paper, we build upon representative GNNs and introduce variants that challenge the need for locality-preserving representations, either using randomization or clustering on the complement graph. 

188 
Here we introduce Pointer Graph Networks (PGNs) which augment sets or graphs with additional inferred edges for improved model generalisation ability. 

189 
Gradient Regularized V-Learning for Dynamic Treatment Regimes 
In this paper, we introduce Gradient Regularized V-learning (GRV), a novel method for estimating the value function of a DTR. 
190 
Faster Wasserstein Distance Estimation with the Sinkhorn Divergence 
In this work, we propose instead to estimate it with the Sinkhorn divergence, which is also built on entropic regularization but includes debiasing terms. 
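The debiasing can be illustrated on discrete samples: the divergence subtracts the two self-transport terms from the entropic OT cost. A minimal sketch with a fixed iteration count and uniform weights (not the paper's estimator; names and constants are assumptions):

```python
import numpy as np

def sinkhorn_cost(x, y, eps=0.5, iters=300):
    # Entropic OT cost between uniform discrete measures on points x, y.
    C = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)  # squared dists
    K = np.exp(-C / eps)
    a = np.full(len(x), 1.0 / len(x))
    b = np.full(len(y), 1.0 / len(y))
    u, v = np.ones_like(a), np.ones_like(b)
    for _ in range(iters):                     # Sinkhorn scaling iterations
        u = a / (K @ v)
        v = b / (K.T @ u)
    P = u[:, None] * K * v[None, :]            # approximate transport plan
    return np.sum(P * C)

def sinkhorn_divergence(x, y, eps=0.5):
    # Debiased estimator: S(x, y) = OT(x, y) - (OT(x, x) + OT(y, y)) / 2
    return (sinkhorn_cost(x, y, eps)
            - 0.5 * sinkhorn_cost(x, x, eps)
            - 0.5 * sinkhorn_cost(y, y, eps))
```

The debiasing terms make the divergence vanish when the two samples coincide, which plain entropic OT does not.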
191 
We address the problem of credit assignment in reinforcement learning and explore fundamental questions regarding the way in which an agent can best use additional computation to propagate new information, by planning with internal models of the world to improve its predictions. 

192 
Robust Recursive Partitioning for Heterogeneous Treatment Effects with Uncertainty Quantification 
This paper develops a new method for subgroup analysis, R2P, that addresses all these weaknesses. 
193 
To alleviate this, we propose to directly minimize the divergence between neural recorded and model generated spike trains using spike train kernels. 

194 
Lower Bounds and Optimal Algorithms for Personalized Federated Learning 
In this work, we consider the optimization formulation of personalized federated learning recently introduced by Hanzely & Richtarik (2020) which was shown to give an alternative explanation to the workings of local SGD methods. 
195 
Black-Box Certification with Randomized Smoothing: A Functional Optimization Based Framework 
We propose a general framework of adversarial certification with non-Gaussian noise and for more general types of attacks, from a unified functional optimization perspective. 
196 
We present a deep imitation learning framework for robotic bimanual manipulation in a continuous state-action space. 

197 
Stationary Activations for Uncertainty Calibration in Deep Learning 
We introduce a new family of nonlinear neural network activation functions that mimic the properties induced by the widely-used Matérn family of kernels in Gaussian process (GP) models. 
198 
Ensemble Distillation for Robust Model Fusion in Federated Learning 
In this work we investigate more powerful and more flexible aggregation schemes for FL. 
199 
In this paper, we propose a fast, frequency-domain deep neural network called Falcon, for fast inference on encrypted data. 

200 
In this work, we focus on a classification problem and investigate the behavior of both non-calibrated and calibrated negative log-likelihood (CNLL) of a deep ensemble as a function of the ensemble size and the member network size. 

201 
Practical Quasi-Newton Methods for Training Deep Neural Networks 
We consider the development of practical stochastic quasi-Newton, and in particular Kronecker-factored block diagonal BFGS and L-BFGS methods, for training deep neural networks (DNNs). 
202 
Approximation Based Variance Reduction for Reparameterization Gradients 
In this work we present a control variate that is applicable for any reparameterizable distribution with known mean and covariance, e.g. Gaussians with any covariance structure. 
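To illustrate the general mechanism, here is a generic quadratic-approximation control variate for a 1D Gaussian with an assumed test function f(z) = z^4; this is a sketch of the idea, not the paper's construction:

```python
import numpy as np

# Reparameterization gradient of E_{z ~ N(mu, sigma^2)}[f(z)] w.r.t. mu.
# A quadratic approximation of f at mu has a known expected gradient
# (namely f'(mu)), so subtracting its sample gradient and adding back that
# closed-form mean keeps the estimator unbiased while reducing variance.

def f_prime(z):
    return 4.0 * z ** 3          # derivative of f(z) = z**4

def f_double_prime(z):
    return 12.0 * z ** 2

def grad_estimates(mu, sigma, eps):
    z = mu + sigma * eps
    plain = f_prime(z)           # vanilla reparameterization estimator
    # quadratic control variate: gradient of the Taylor approximation at mu
    approx_grad = f_prime(mu) + f_double_prime(mu) * (z - mu)
    cv = plain - approx_grad + f_prime(mu)   # add back E[approx_grad]
    return plain, cv
```

Both estimators target the same gradient; the control-variate version cancels the dominant linear fluctuation around mu.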
203 
Inference Stage Optimization for Cross-scenario 3D Human Pose Estimation 
In this work, we propose a novel framework, Inference Stage Optimization (ISO), for improving the generalizability of 3D pose models when source and target data come from different pose distributions. 
204 
Consistent feature selection for analytic deep neural networks 
In this work, we investigate the problem of feature selection for analytic deep networks. 
205 
Glance and Focus: a Dynamic Approach to Reducing Spatial Redundancy in Image Classification 
Inspired by the fact that not all regions in an image are taskrelevant, we propose a novel framework that performs efficient image classification by processing a sequence of relatively small inputs, which are strategically selected from the original image with reinforcement learning. 
206 
We introduce Transductive Information Maximization (TIM) for few-shot learning. 

207 
Inverse Reinforcement Learning from a Gradient-based Learner 
In this paper, we propose a new algorithm for this setting, in which the goal is to recover the reward function being optimized by an agent, given a sequence of policies produced during learning. 
208 
Bayesian Multi-type Mean Field Multi-agent Imitation Learning 
In this paper, we propose Bayesian multi-type mean field multi-agent imitation learning (BM3IL). 
209 
To provide a bridge between these two extremes, we propose Bayesian Robust Optimization for Imitation Learning (BROIL). 

210 
Multiview Neural Surface Reconstruction by Disentangling Geometry and Appearance 
In this work we address the challenging problem of multiview 3D surface reconstruction. 
211 
To overcome this problem, we introduce Riemannian continuous normalizing flows, a model which admits the parametrization of flexible probability measures on smooth manifolds by defining flows as the solution to ordinary differential equations. 

212 
Attention-Gated Brain Propagation: How the brain can implement reward-based error backpropagation 
We demonstrate a biologically plausible reinforcement learning scheme for deep networks with an arbitrary number of layers. 
213 
Asymptotic Guarantees for Generative Modeling Based on the Smooth Wasserstein Distance 
In this work, we conduct a thorough statistical study of the minimum smooth Wasserstein estimators (MSWEs), first proving the estimator’s measurability and asymptotic consistency. 
214 
In contrast, we show in this work that stochastic gradient descent on the l1 loss converges to the true parameter vector at a $\tilde{O}(1 / (1 - \eta)^2 n)$ rate which is independent of the values of the contaminated measurements. 
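A small simulation conveys the claim: because the sign-based l1 subgradient step ignores residual magnitudes, a constant fraction of wildly corrupted responses barely perturbs it. The step-size schedule and constants below are assumptions for this sketch, not the paper's setting.

```python
import numpy as np

# SGD on the l1 (absolute) loss for linear regression where a fraction
# eta of the responses is arbitrarily corrupted. The subgradient step is
#   w <- w - lr * sign(w . x - y) * x
# which ignores the size of the residual, so an outlier contributes only
# a bounded, roughly zero-mean perturbation rather than a huge pull.

rng = np.random.default_rng(0)
d, n, eta = 5, 20000, 0.2
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))
y = X @ w_true
corrupt = rng.random(n) < eta
y[corrupt] = 100.0                      # grossly corrupted measurements

w = np.zeros(d)
for t in range(n):                      # single pass over the stream
    lr = 0.1 / np.sqrt(t + 1)           # assumed decaying step size
    w -= lr * np.sign(X[t] @ w - y[t]) * X[t]
```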

215 
In this paper, we introduce the PRANK method, which satisfies these requirements. 

216 
Fighting Copycat Agents in Behavioral Cloning from Observation Histories 
To combat this "copycat problem", we propose an adversarial approach to learn a feature representation that removes excess information about the previous expert action nuisance correlate, while retaining the information necessary to predict the next action. 
217 
We analyze the convergence of single-pass, fixed step-size stochastic gradient descent on the least-squares risk under this model. 

218 
In this work, we propose a new perspective on conditional meta-learning via structured prediction. 

219 
Optimal Lottery Tickets via Subset Sum: Logarithmic Over-Parameterization is Sufficient 
In this work, we close the gap and offer an exponential improvement to the over-parameterization requirement for the existence of lottery tickets. 
220 
The Hateful Memes Challenge: Detecting Hate Speech in Multimodal Memes 
This work proposes a new challenge set for multimodal classification, focusing on detecting hate speech in multimodal memes. 
221 
This article suggests that deterministic Gradient Descent, which does not use any stochastic gradient approximation, can still exhibit stochastic behaviors. 

222 
It is an open question as to what specific experimental measurements would need to be made to determine whether any given learning rule is operative in a real biological system. In this work, we take a "virtual experimental" approach to this problem. 

223 
Optimal Approximation - Smoothness Tradeoffs for Soft-Max Functions 
Our goal is to identify the optimal approximation-smoothness tradeoffs for different measures of approximation and smoothness. 
224 
Weakly-Supervised Reinforcement Learning for Controllable Behavior 
In this work, we introduce a framework for using weak supervision to automatically disentangle this semantically meaningful subspace of tasks from the enormous space of nonsensical "chaff" tasks. 
225 
Improving PolicyConstrained Kidney Exchange via PreScreening 
We propose both a greedy heuristic and a Monte Carlo tree search, which outperforms previous approaches, using experiments on both synthetic data and real kidney exchange data from the United Network for Organ Sharing. 
226 
Learning abstract structure for drawing by efficient motor program induction 
We show that people spontaneously learn abstract drawing procedures that support generalization, and propose a model of how learners can discover these reusable drawing procedures. 
227 
This paper studies this fundamental problem in deep learning from a so-called "neural tangent kernel" perspective. 

228 
We present a novel algorithm for nonlinear instrumental variable (IV) regression, DualIV, which simplifies traditional twostage methods via a dual formulation. 

229 
Stochastic Gradient Descent in Correlated Settings: A Study on Gaussian Processes 
In this paper, we focus on the Gaussian process (GP) and take a step forward towards breaking the barrier by proving minibatch SGD converges to a critical point of the full loss function, and recovers model hyperparameters with rate $O(\frac{1}{K})$ up to a statistical error term depending on the minibatch size. 
230 
Thanks to it, we propose a novel FSL paradigm: Interventional Few-Shot Learning (IFSL). 

231 
Minimax Value Interval for Off-Policy Evaluation and Policy Optimization 
We study minimax methods for off-policy evaluation (OPE) using value functions and marginalized importance weights. 
232 
For this special setting, we propose an accelerated algorithm called biased SpiderBoost (BSpiderBoost) that matches the lower bound complexity. 

233 
This paper presents ShiftAddNet, whose main inspiration is drawn from a common practice in energy-efficient hardware implementation: multiplication can instead be performed with additions and logical bit-shifts. 

234 
Network-to-Network Translation with Conditional Invertible Neural Networks 
Therefore, we seek a model that can relate between different existing representations and propose to solve this task with a conditionally invertible network. 
235 
In this work, we initiate the study of a new paradigm in debiasing research, intra-processing, which sits between in-processing and post-processing methods. 

236 
This paper proposes two efficient algorithms for computing approximate second-order stationary points (SOSPs) of problems with generic smooth non-convex objective functions and generic linear constraints. 

237 
Model-based Policy Optimization with Unsupervised Model Adaptation 
In this paper, we investigate how to bridge the gap between real and simulated data due to inaccurate model estimation for better policy optimization. 
238 
Implicit Regularization and Convergence for Weight Normalization 
Here, we study the weight normalization (WN) method \cite{salimans2016weight} and a variant called reparametrized projected gradient descent (rPGD) for overparametrized least squares regression and some more general loss functions. 
239 
In this work, we present a computationally efficient BTD algorithm, namely Geometric Expansion for all-order Tensor Factorization (GETF), that sequentially identifies the rank-1 basis components for a tensor from a geometric perspective. 

240 
Here, we propose a meta-learning approach that obviates the need for this often suboptimal hand-selection. 

241 
A/B Testing in Dense Large-Scale Networks: Design and Inference 
In this paper, we present a novel strategy for accurately estimating the causal effects of a class of treatments in a dense large-scale network. 
242 
What Neural Networks Memorize and Why: Discovering the Long Tail via Influence Estimation 
In this work we design experiments to test the key ideas in this theory. 
243 
In this paper, we study one challenging issue in multiview data clustering. 

244 
Partial Optimal Transport with applications on Positive-Unlabeled Learning 
In this paper, we address the partial Wasserstein and Gromov-Wasserstein problems and propose exact algorithms to solve them. 
245 
In this paper, we focus on understanding the minimax statistical limits of IL in episodic Markov Decision Processes (MDPs). 

246 
In this work, we remove the most limiting assumptions of this previous work while providing significantly tighter bounds: the overparameterized network only needs a logarithmic factor (in all variables but depth) number of neurons per weight of the target subnetwork. 

247 
Hold me tight! Influence of discriminative features on deep network boundaries 
In this work, we borrow tools from the field of adversarial robustness, and propose a new perspective that relates dataset features to the distance of samples to the decision boundary. 
248 
Inspired by the above example, we consider a model in which the population $\mathcal{D}$ is a mixture of two possibly distinct subpopulations: a private subpopulation $\mathcal{D}_{\text{priv}}$ of private and sensitive data, and a public subpopulation $\mathcal{D}_{\text{pub}}$ of data with no privacy concerns. 

249 
In this paper, we investigate the weight loss landscape from a new perspective, and identify a clear correlation between the flatness of weight loss landscape and robust generalization gap. 

250 
Stateful Posted Pricing with Vanishing Regret via Dynamic Deterministic Markov Decision Processes 
In this paper, a rather general online problem called \emph{dynamic resource allocation with capacity constraints (DRACC)} is introduced and studied in the realm of posted price mechanisms. 
251 
In this paper, we propose a novel adversarial attack for unlabeled data, which makes the model confuse the instance-level identities of the perturbed data samples. 

252 
Normalizing Kalman Filters for Multivariate Time Series Analysis 
To this extent, we present a novel approach reconciling classical state space models with deep learning methods. 
253 
In this work, we show that it is possible to significantly improve summary quality by training a model to optimize for human preferences. 

254 
Fourier Spectrum Discrepancies in Deep Network Generated Images 
In this paper, we present an analysis of the highfrequency Fourier modes of real and deep network generated images and show that deep network generated images share an observable, systematic shortcoming in replicating the attributes of these highfrequency modes. 
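Analyses of this kind typically inspect an azimuthally averaged power spectrum of the image; here is a generic diagnostic sketch (the function name and normalization are assumptions, not the paper's exact pipeline):

```python
import numpy as np

def radial_spectrum(img):
    """Azimuthally averaged power spectrum of a 2D grayscale image.

    Bins the 2D power spectrum by integer distance from the DC component;
    the tail of the returned profile summarizes the high-frequency modes.
    """
    F = np.fft.fftshift(np.fft.fft2(img))
    power = np.abs(F) ** 2
    h, w = img.shape
    cy, cx = h // 2, w // 2
    yy, xx = np.indices((h, w))
    r = np.hypot(yy - cy, xx - cx).astype(int)
    counts = np.bincount(r.ravel())
    return np.bincount(r.ravel(), weights=power.ravel()) / counts
```

Comparing such profiles between real and generated images is one way to expose the systematic high-frequency discrepancies the highlight describes.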
255 
Specifically, we found that signal transformations, made by each layer of neurons on an input-driven spike signal, demodulate signal distortions introduced by preceding layers. 

256 
Learning Dynamic Belief Graphs to Generalize on Text-Based Games 
In this work, we investigate how an agent can plan and generalize in text-based games using graph-structured representations learned end-to-end from raw text. 
257 
Triple descent and the two kinds of overfitting: where & why do they appear? 
In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. 
258 
Multimodal Graph Networks for Compositional Generalization in Visual Question Answering 
In this paper, we propose to tackle this challenge by employing neural factor graphs to induce a tighter coupling between concepts in different modalities (e.g. images and text). 
259 
Learning Graph Structure With A Finite-State Automaton Layer 
In this work, we study the problem of learning to derive abstract relations from the intrinsic graph structure. 
260 
A Universal Approximation Theorem of Deep Neural Networks for Expressing Probability Distributions 
This paper studies the universal approximation property of deep neural networks for representing probability distributions. 
261 
Unsupervised object-centric video generation and decomposition in 3D 
We instead propose to model a video as the view seen while moving through a scene with multiple 3D objects and a 3D background. 
262 
Domain Generalization for Medical Imaging Classification with Linear-Dependency Regularization 
In this paper, we introduce a simple but effective approach to improve the generalization capability of deep neural networks in the field of medical imaging classification. 
263 
Multi-label classification: do Hamming loss and subset accuracy really conflict with each other? 
This paper provides an attempt to fill this gap by analyzing the learning guarantees of the corresponding learning algorithms on both SA and HL measures. 
264 
A Novel Automated Curriculum Strategy to Solve Hard Sokoban Planning Instances 
We present a novel {\em automated} curriculum approach that dynamically selects from a pool of unlabeled training instances of varying task complexity guided by our {\em difficulty quantum momentum} strategy. 
265 
In this work, we study the causal relations among German regions in terms of the spread of Covid-19 since the beginning of the pandemic, taking into account the restriction policies that were applied by the different federal states. 

266 
We find separation rates for testing multinomial or more general discrete distributions under the constraint of alpha-local differential privacy. 

267 
We empirically observe that the statistics of gradients of deep models change during the training. Motivated by this observation, we introduce two adaptive quantization schemes, ALQ and AMQ. 

268 
Focusing on a nonparametric setting, where the mean reward is an unknown function of a one-dimensional covariate, we propose an optimal strategy for this problem. 

269 
Removing Bias in Multimodal Classifiers: Regularization by Maximizing Functional Entropies 
To alleviate this shortcoming, we propose a novel regularization term based on the functional entropy. 
270 
Compact task representations as a normative model for higher-order brain activity 
More specifically, we focus on MDPs whose state is based on action-observation histories, and we show how to compress the state space such that unnecessary redundancy is eliminated, while task-relevant information is preserved. 
271 
Robust-Adaptive Control of Linear Systems: beyond Quadratic Costs 
We consider the problem of robust and adaptive model predictive control (MPC) of a linear system, with unknown parameters that are learned along the way (adaptive), in a critical setting where failures must be prevented (robust). 
272 
In this paper, we study the problem of allocating seed users to opposing campaigns: by drawing on the equal-time rule of political campaigning on traditional media, our goal is to allocate seed users to campaigners with the aim to maximize the expected number of users who are co-exposed to both campaigns. 

273 
In this paper, we show that building a geometry-preserving 3-dimensional latent space helps the network concurrently learn global shape regularities and local reasoning in the object coordinate space and, as a result, boosts performance. 

274 
Reinforcement Learning for Control with Multiple Frequencies 
In this paper, we formalize the problem of multiple control frequencies in RL and provide an efficient solution method. 
275 
Complex Dynamics in Simple Neural Networks: Understanding Gradient Flow in Phase Retrieval 
Here we focus on gradient flow dynamics for phase retrieval from random measurements. 
276 
Neural Message Passing for Multi-Relational Ordered and Recursive Hypergraphs 
In this work, we first unify existing MPNNs on different structures into the GMPNN (Generalised MPNN) framework. 
277 
In this paper, we present a unified view of the two methods and the first theoretical characterization of MLLS. 

278 
Optimal Private Median Estimation under Minimal Distributional Assumptions 
We study the fundamental task of estimating the median of an underlying distribution from a finite number of samples, under pure differential privacy constraints. 
279 
In this paper, we develop novel encoding and decoding mechanisms that simultaneously achieve optimal privacy and communication efficiency in various canonical settings. 

280 
Our main aim in this work is to explore the plausibility of such a transformation and to identify cues and components able to carry the association of sounds with visual events. 

281 
We present a new paradigm for Neural ODE algorithms, called ODEtoODE, where time-dependent parameters of the main flow evolve according to a matrix flow on the orthogonal group O(d). 

282 
This work provides the first theoretical analysis of self-distillation. 

283 
Coupling-based Invertible Neural Networks Are Universal Diffeomorphism Approximators 
Without universality, there could be a well-behaved invertible transformation that the CF-INN can never approximate, which would render the model class unreliable. We answer this question by showing a convenient criterion: a CF-INN is universal if its layers contain affine coupling and invertible linear functions as special cases. 
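To make the affine coupling building block concrete, here is a generic sketch of such a layer with placeholder conditioner functions (the names are assumptions); invertibility holds by construction because the first half of the input passes through unchanged:

```python
import numpy as np

def affine_coupling_forward(x, scale_net, shift_net):
    # Split the input; transform the second half conditioned on the first.
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = scale_net(x1), shift_net(x1)
    y2 = x2 * np.exp(s) + t          # elementwise affine map, always invertible
    return np.concatenate([x1, y2], axis=-1)

def affine_coupling_inverse(y, scale_net, shift_net):
    # y1 equals x1, so the same conditioner outputs can be recomputed exactly.
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = scale_net(y1), shift_net(y1)
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=-1)
```

Any functions of the first half (including deep networks) can serve as `scale_net` and `shift_net` without breaking invertibility, which is what makes the universality question about this layer class interesting.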
284 
Community detection using fast low-cardinality semidefinite programming 
In this paper, we propose a new class of low-cardinality algorithms that generalizes the local update to maximize a semidefinite relaxation derived from max-k-cut. 
285 
In this paper, we first model the annotation noise using a random variable with Gaussian distribution, and derive the pdf of the crowd density value for each spatial location in the image. We then approximate the joint distribution of the density values (i.e., the distribution of density maps) with a full covariance multivariate Gaussian density, and derive a low-rank approximation for tractable implementation. 

286 
We cast policy gradient methods as the repeated application of two operators: a policy improvement operator $\mathcal{I}$, which maps any policy $\pi$ to a better one $\mathcal{I}\pi$, and a projection operator $\mathcal{P}$, which finds the best approximation of $\mathcal{I}\pi$ in the set of realizable policies. 

287 
Demystifying Contrastive Self-Supervised Learning: Invariances, Augmentations and Dataset Biases 
Somewhat mysteriously, the recent gains in performance come from training instance classification models, treating each image and its augmented versions as samples of a single class. In this work, we first present quantitative experiments to demystify these gains. 
288 
In this paper, we provide an efficient approximation algorithm for finding the most likely configuration (MAP) of size $k$ for Determinantal Point Processes (DPP) in the online setting, where the data points arrive in an arbitrary order and the algorithm cannot discard the selected elements from its local memory. 

289 
Video Object Segmentation with Adaptive Feature Bank and Uncertain-Region Refinement 
This paper presents a new matching-based framework for semi-supervised video object segmentation (VOS). 
290 
Whereas reinforcement learning often focuses on the design of algorithms that enable artificial agents to efficiently learn new tasks, here we develop a modeling framework to directly infer the empirical learning rules that animals use to acquire new behaviors. 

291 
In this work, we propose a novel backdoor attack technique in which the triggers vary from input to input. 

292 
How hard is it to distinguish graphs with graph neural networks? 
This study derives hardness results for the classification variant of graph isomorphism in the message-passing model (MPNN). 
293 
Minimax Regret of Switching-Constrained Online Convex Optimization: No Phase Transition 
In this paper, we show that $T$-round switching-constrained OCO with fewer than $K$ switches has a minimax regret of $\Theta(\frac{T}{\sqrt{K}})$. 
294 
Dual Manifold Adversarial Robustness: Defense against Lp and non-Lp Adversarial Attacks 
To partially answer this question, we consider the scenario when the manifold information of the underlying data is available. 
295 
Cross-Scale Internal Graph Neural Network for Image Super-Resolution 
In this paper, we explore the cross-scale patch recurrence property of a natural image, i.e., similar patches tend to recur many times across different scales. 
296 
Unsupervised Representation Learning by Invariance Propagation 
In this paper, we propose Invariance Propagation to focus on learning representations invariant to categorylevel variations, which are provided by different instances from the same category. 
297 
In this paper, we restore the negative information in few-shot object detection by introducing a new negative- and positive-representative based metric learning framework and a new inference scheme with negative and positive representatives. 

298 
In this work, we identify another such aspect: we find that adversarially robust models, while less accurate, often perform better than their standard-trained counterparts when used for transfer learning. 

299 
Robust Correction of Sampling Bias using Cumulative Distribution Functions 
We present a new method for handling covariate shift using the empirical cumulative distribution function estimates of the target distribution by a rigorous generalization of a recent idea proposed by Vapnik and Izmailov. 
300 
Personalized Federated Learning with Theoretical Guarantees: A ModelAgnostic MetaLearning Approach 
In this paper, we study a personalized variant of federated learning, in which our goal is to find an initial shared model that current or new users can easily adapt to their local dataset by performing one or a few steps of gradient descent with respect to their own data. 
301 
Pixel-Level Cycle Association: A New Perspective for Domain Adaptive Semantic Segmentation 
In this paper, we propose to build the pixel-level cycle association between source and target pixel pairs and contrastively strengthen their connections to diminish the domain gap and make the features more discriminative. 
302 
In this paper, we develop specialized versions of these techniques for categorical and unordered response labels that, in addition to providing marginal coverage, are also fully adaptive to complex data distributions, in the sense that they perform favorably in terms of approximate conditional coverage compared to alternative methods. 

303 
Learning Global Transparent Models consistent with Local Contrastive Explanations 
In this work, we explore the question: Can we produce a transparent global model that is simultaneously accurate and consistent with the local (contrastive) explanations of the black-box model? 
304 
In this paper, we focus on the problem of approximating an arbitrary Bregman divergence from supervision, and we provide a wellprincipled approach to analyzing such approximations. 

305 
Diverse Image Captioning with Context-Object Split Latent Spaces 
To this end, we introduce a novel factorization of the latent space, termed context-object split, to model diversity in contextual descriptions across images and texts within the dataset. 
306 
Learning Disentangled Representations of Videos with Missing Data 
We present Disentangled Imputed Video autoEncoder (DIVE), a deep generative model that imputes and predicts future video frames in the presence of missing data. 
307 
Here we show that instead of equivariance, the more general concept of naturality is sufficient for a graph network to be well-defined, opening up a larger class of graph networks. 

308 
Continual Learning with Node-Importance based Adaptive Group Sparse Regularization 
We propose a novel regularization-based continual learning method, dubbed Adaptive Group Sparsity based Continual Learning (AGS-CL), using two group-sparsity-based penalties. 
309 
Towards Crowdsourced Training of Large Neural Networks using Decentralized Mixture-of-Experts 
In this work, we propose Learning@home: a novel neural network training paradigm designed to handle large amounts of poorly connected participants. 
310 
Incorporating the natural document-sentence-word structure into hierarchical Bayesian modeling, we propose convolutional Poisson gamma dynamical systems (PGDS) that introduce not only word-level probabilistic convolutions, but also sentence-level stochastic temporal transitions. 

311 
To test that hypothesis, we introduce an objective based on Deep InfoMax (DIM) which trains the agent to predict the future by maximizing the mutual information between its internal representation of successive timesteps. 

312 
We provide an answer to this question in the form of a structural characterization of ranking losses for which a suitable regression is consistent. 

313 
Distribution-free binary classification: prediction sets, confidence intervals and calibration 
We study three notions of uncertainty quantification (calibration, confidence intervals and prediction sets) for binary classification in the distribution-free setting, that is, without making any distributional assumptions on the data. 
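A minimal split-conformal sketch of one of these notions, prediction sets, for the binary case. The calibration scores here are synthetic stand-ins for a model's predicted probability of the true label, and this is the generic conformal construction, not necessarily the paper's exact procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha = 200, 0.1
# stand-in for a hypothetical model's predicted probability of the TRUE label
# on a held-out calibration set
cal_scores = rng.beta(5, 2, n)

# threshold: the floor(alpha*(n+1))-th smallest calibration score, so that
# P(score >= t) >= 1 - alpha under exchangeability
k = int(np.floor(alpha * (n + 1))) - 1
t = np.sort(cal_scores)[k]

def prediction_set(p1):
    # binary case: include each label whose predicted probability clears t;
    # p1 is the model's predicted P(y = 1) for the test point
    return [y for y, p in enumerate([1.0 - p1, p1]) if p >= t]
```

A confident test prediction yields a singleton set, while an uncertain one can contain both labels; marginal coverage holds without distributional assumptions.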
314 
Closing the Dequantization Gap: PixelCNN as a Single-Layer Flow 
In this paper, we introduce subset flows, a class of flows that can tractably transform finite volumes and thus allow exact computation of likelihoods for discrete data. 
315 
Sequence to Multi-Sequence Learning via Conditional Chain Mapping for Mixture Signals 
In this work, we focus on one-to-many sequence transduction problems, such as extracting multiple sequential sources from a mixture sequence. 
316 
Variance reduction for Random Coordinate Descent-Langevin Monte Carlo 
We show by a counterexample that blindly applying RCD does not achieve the goal in the most general setting. 
317 
Language as a Cognitive Tool to Imagine Goals in Curiosity-Driven Exploration 
We introduce IMAGINE, an intrinsically motivated deep reinforcement learning architecture that models this ability. 
318 
In this study, to reduce the total number of parameters, the embeddings for all words are represented by transforming a shared embedding. 

319 
Primal Dual Interpretation of the Proximal Stochastic Gradient Langevin Algorithm 
We consider the task of sampling with respect to a log-concave probability distribution. 
320 
How to Characterize The Landscape of Overparameterized Convolutional Neural Networks 
Specifically, we consider the loss landscape of an overparameterized convolutional neural network (CNN) in the continuous limit, where the numbers of channels/hidden nodes in the hidden layers go to infinity. 
321 
On the Tightness of Semidefinite Relaxations for Certifying Robustness to Adversarial Examples 
In this paper, we describe a geometric technique that determines whether this SDP certificate is exact, meaning whether it provides both a lower bound on the size of the smallest adversarial perturbation and a globally optimal perturbation that attains the lower bound. 
322 
In this paper, we introduce a discrete variant of the meta-learning framework. 

323 
Our study reveals the generality and flexibility of self-training with three additional insights: 1) stronger data augmentation and more labeled data further diminish the value of pre-training, 2) unlike pre-training, self-training is always helpful when using stronger data augmentation, in both low-data and high-data regimes, and 3) in the case that pre-training is helpful, self-training improves upon pre-training. 

324 
Unsupervised Sound Separation Using Mixture Invariant Training 
In this paper, we propose a completely unsupervised method, mixture invariant training (MixIT), that requires only single-channel acoustic mixtures. 
325 
Adaptive Discretization for Model-Based Reinforcement Learning 
We introduce the technique of adaptive discretization to design an efficient model-based episodic reinforcement learning algorithm in large (potentially continuous) state-action spaces. 
326 
CodeCMR: Cross-Modal Retrieval For Function-Level Binary Source Code Matching 
This paper proposes an end-to-end cross-modal retrieval network for binary source code matching, which achieves higher accuracy and requires less expert experience. 
327 
In this work, we take a closer look at this empirical phenomenon and try to understand when and how it occurs. 

328 
DAGs with No Fears: A Closer Look at Continuous Optimization for Learning Bayesian Networks 
Informed by the KKT conditions, a local search postprocessing algorithm is proposed and shown to substantially and universally improve the structural Hamming distance of all tested algorithms, typically by a factor of 2 or more. 
329 
OOD-MAML: Meta-Learning for Few-Shot Out-of-Distribution Detection and Classification 
We propose a few-shot learning method for detecting out-of-distribution (OOD) samples from classes that are unseen during training while classifying samples from seen classes using only a few labeled examples. 
330 
An Imitation from Observation Approach to Transfer Learning with Dynamics Mismatch 
In this paper, we show that one existing solution to this transfer problem, grounded action transformation, is closely related to the problem of imitation from observation (IfO): learning behaviors that mimic the observations of behavior demonstrations. 
331 
Taking inspiration from infants learning from their environment through play and interaction, we present a computational framework to discover objects and learn their physical properties along this paradigm of Learning from Interaction. 

332 
We present a novel approach to estimating discrete distributions with (potentially) infinite support in the total variation metric. 

333 
In this work we “open the box”, further developing the continuous-depth formulation with the aim of clarifying the influence of several design choices on the underlying dynamics. 

334 
In this paper, we approach the supervised GAN problem from a different perspective, one that is motivated by the philosophy of the famous Persian poet Rumi who said, "The art of knowing is knowing what to ignore." 

335 
Counterfactual Data Augmentation using Locally Factored Dynamics 
We propose an approach to inferring these structures given an object-oriented state representation, as well as a novel algorithm for Counterfactual Data Augmentation (CoDA). 
336 
Rethinking Learnable Tree Filter for Generic Feature Transform 
To relax the geometric constraint, we give the analysis by reformulating it as a Markov Random Field and introduce a learnable unary term. 
337 
Self-Supervised Relational Reasoning for Representation Learning 
In this work, we propose a novel self-supervised formulation of relational reasoning that allows a learner to bootstrap a signal from information implicit in unlabeled data. 
338 
Sufficient dimension reduction for classification using principal optimal transport direction 
To address this issue, we propose a novel estimation method of sufficient dimension reduction subspace (SDR subspace) using optimal transport. 
339 
In this paper, we focus on a family of Wasserstein distributionally robust support vector machine (DRSVM) problems and propose two novel epigraphical projection-based incremental algorithms to solve them. 

340 
Differentially Private Clustering: Tight Approximation Ratios 
For several basic clustering problems, including Euclidean DensestBall, 1-Cluster, k-means, and k-median, we give efficient differentially private algorithms that achieve essentially the same approximation ratios as those that can be obtained by any non-private algorithm, while incurring only small additive errors. 
341 
We provide valuable tools for the analysis of Louvain, but also for many other combinatorial algorithms. 

342 
Fairness with Overlapping Groups; a Probabilistic Perspective 
In algorithmically fair prediction problems, a standard goal is to ensure the equality of fairness metrics across multiple overlapping groups simultaneously. We reconsider this standard fair classification problem using a probabilistic population analysis, which, in turn, reveals the Bayes-optimal classifier. 
343 
AttendLight: Universal Attention-Based Reinforcement Learning Model for Traffic Signal Control 
We propose AttendLight, an end-to-end Reinforcement Learning (RL) algorithm for the problem of traffic signal control. 
344 
Thus, we propose to regard the discrete weights in an arbitrary quantized neural network as searchable variables, and utilize a differentiable method to search for them accurately. 

345 
To complement the upper bound, we introduce new techniques for establishing lower bounds on the performance of any algorithm for this problem. 

346 
From Predictions to Decisions: Using Lookahead Regularization 
For this, we introduce lookahead regularization which, by anticipating user actions, encourages predictive models to also induce actions that improve outcomes. 
347 
Sequential Bayesian Experimental Design with Variable Cost Structure 
We propose and demonstrate an algorithm that accounts for these variable costs in the refinement decision. 
348 
Predictive inference is free with the jackknife+-after-bootstrap 
In this paper, we propose the jackknife+-after-bootstrap (J+aB), a procedure for constructing a predictive interval, which uses only the available bootstrapped samples and their corresponding fitted models, and is therefore "free" in terms of the cost of model fitting. 
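A small runnable sketch of the J+aB construction on synthetic data: the interval is built from out-of-bag aggregated predictions and leave-one-out residuals, following the idea described above. The linear model, data, and constants are all illustrative choices, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)
n, B, alpha = 80, 30, 0.1
X = rng.uniform(-2, 2, n)
y = 2 * X + rng.normal(0, 0.5, n)      # true regression function: y = 2x

# One model per bootstrap sample; remember which points each bag contains.
bags, coefs = [], []
for _ in range(B):
    idx = rng.integers(0, n, n)
    bags.append(set(idx.tolist()))
    coefs.append(np.polyfit(X[idx], y[idx], 1))   # least-squares line fit

def agg_pred(x, exclude_i):
    # aggregate (mean) prediction over models whose bag excludes point i
    ps = [np.polyval(c, x) for c, bag in zip(coefs, bags) if exclude_i not in bag]
    return np.mean(ps) if ps else np.polyval(np.polyfit(X, y, 1), x)

# leave-one-out residuals from the out-of-bag aggregated models
resid = np.array([abs(y[i] - agg_pred(X[i], i)) for i in range(n)])

def jab_interval(x):
    # jackknife+ interval built from the already-fitted bootstrap models
    lo = np.sort([agg_pred(x, i) - resid[i] for i in range(n)])
    hi = np.sort([agg_pred(x, i) + resid[i] for i in range(n)])
    q = int(np.ceil((1 - alpha) * (n + 1)))
    return lo[n - q], hi[q - 1]
```

No model is refit at interval-construction time, which is the sense in which the procedure is "free".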
349 
We propose a doubly-robust procedure for learning counterfactual prediction models in this setting. 

350 
This paper proposes a novel instance-level test-time augmentation that efficiently selects suitable transformations for a test input. 

351 
In this paper, we show that the Softmax function, though used in most classification tasks, gives a biased gradient estimation under the long-tailed setup. 

352 
This paper presents an IRL framework called Bayesian optimization-IRL (BO-IRL) which identifies multiple solutions that are consistent with the expert demonstrations by efficiently exploring the reward function space. 

353 
MDP Homomorphic Networks: Group Symmetries in Reinforcement Learning 
This paper introduces MDP homomorphic networks for deep reinforcement learning. 
354 
How Can I Explain This to You? An Empirical Study of Deep Neural Network Explanation Methods 
We performed a cross-analysis Amazon Mechanical Turk study comparing the popular state-of-the-art explanation methods to empirically determine which are better in explaining model decisions. 
355 
In this work, we identify a set of conditions on the data under which such surrogate loss minimization algorithms provably learn the correct classifier. 

356 
Our core contribution stands in a very simple idea: adding the scaled logpolicy to the immediate reward. 
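In code, that one-line idea looks roughly as follows. The softmax temperature, scale, and clipping floor are illustrative constants in the spirit of the approach, not necessarily the paper's exact settings.

```python
import numpy as np

def log_softmax(x):
    # numerically stable log-softmax
    x = x - np.max(x)
    return x - np.log(np.sum(np.exp(x)))

def augmented_reward(r, q_values, action, alpha=0.9, tau=0.03, l0=-1.0):
    # log-probability of the taken action under a softmax(q / tau) policy
    logpi = log_softmax(np.asarray(q_values, dtype=float) / tau)[action]
    # immediate reward plus the scaled, clipped log-policy term
    return r + alpha * tau * max(logpi, l0)
```

Since log-probabilities are non-positive, the term acts as a penalty that vanishes for actions the current policy already favors; clipping at `l0` keeps it bounded for unlikely actions.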

357 
Object Goal Navigation using GoalOriented Semantic Exploration 
We propose a modular system called ‘Goal-Oriented Semantic Exploration’ which builds an episodic semantic map and uses it to explore the environment efficiently based on the goal object category. 
358 
Efficient semidefinite-programming-based inference for binary and multi-class MRFs 
In this paper, we propose an efficient method for computing the partition function or MAP estimate in a pairwise MRF by instead exploiting a recently proposed coordinate-descent-based fast semidefinite solver. 
359 
Funnel-Transformer: Filtering out Sequential Redundancy for Efficient Language Processing 
With this intuition, we propose Funnel-Transformer, which gradually compresses the sequence of hidden states to a shorter one and hence reduces the computation cost. 
360 
This paper learns and leverages such semantic cues for navigating to objects of interest in novel environments, by simply watching YouTube videos. 

361 
Heavy-tailed Representations, Text Polarity Classification & Data Augmentation 
In this paper, we develop a novel method to learn a heavy-tailed embedding with desirable regularity properties regarding the distributional tails, which allows us to analyze the points far away from the distribution bulk using the framework of multivariate extreme value theory. 
362 
We propose instead a simple and generic method that can be applied to a variety of losses and tasks without any change in the learning procedure. 

363 
CogMol: Target-Specific and Selective Drug Design for COVID-19 Using Deep Generative Models 
In this study, we propose an end-to-end framework, named CogMol (Controlled Generation of Molecules), for designing new drug-like small molecules targeting novel viral proteins with high affinity and off-target selectivity. 
364 
Memory Based Trajectory-conditioned Policies for Learning from Sparse Rewards 
In this work, instead of focusing on good experiences with limited diversity, we propose to learn a trajectory-conditioned policy to follow and expand diverse past trajectories from a memory buffer. 
365 
Liberty or Depth: Deep Bayesian Neural Nets Do Not Need Complex Weight Posterior Approximations 
We challenge the longstanding assumption that the mean-field approximation for variational inference in Bayesian neural networks is severely restrictive, and show this is not the case in deep networks. 
366 
Improving Sample Complexity Bounds for (Natural) Actor-Critic Algorithms 
In contrast, this paper characterizes the convergence rate and sample complexity of AC and NAC under Markovian sampling, with minibatch data for each iteration, and with the actor having a general policy class approximation. 
367 
We propose a remedy that encourages learned dynamics to be easier to solve. 

368 
Stability of Stochastic Gradient Descent on Nonsmooth Convex Losses 
Specifically, we provide sharp upper and lower bounds for several forms of SGD and full-batch GD on arbitrary Lipschitz nonsmooth convex losses. 
369 
Influence-Augmented Online Planning for Complex Environments 
In this work, we propose influence-augmented online planning, a principled method to transform a factored simulator of the entire environment into a local simulator that samples only the state variables that are most relevant to the observation and reward of the planning agent and captures the incoming influence from the rest of the environment using machine learning methods. 
370 
We present a series of new PAC-Bayes learning guarantees for randomized algorithms with sample-dependent priors. 

371 
Reward-rational (implicit) choice: A unifying formalism for reward learning 
Our key observation is that different types of behavior can be interpreted in a single unifying formalism: as a reward-rational choice that the human is making, often implicitly. 
372 
Probabilistic Time Series Forecasting with Shape and Temporal Diversity 
In this paper, we address this problem for nonstationary time series, which is very challenging yet crucially important. 
373 
Low Distortion Block-Resampling with Spatially Stochastic Networks 
We formalize and attack the problem of generating new images from old ones that are as diverse as possible, only allowing them to change without restrictions in certain parts of the image while remaining globally consistent. 
374 
Continual Deep Learning by Functional Regularisation of Memorable Past 
In this paper, we fix this issue by using a new functional-regularisation approach that utilises a few memorable past examples that are crucial to avoid forgetting. 
375 
Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning 
Here we propose and mathematically analyze a general class of structurerelated features, termed Distance Encoding (DE). 
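A minimal sketch of the idea behind such structure-related features: hop distances from every node to a target node set, attached as extra node features. Plain BFS here illustrates the general construction, not the paper's exact variants.

```python
from collections import deque

def bfs_dists(adj, src):
    # single-source shortest-path (hop) distances on an unweighted graph
    dist = {src: 0}
    q = deque([src])
    while q:
        u = q.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                q.append(v)
    return dist

def distance_encoding(adj, target_set):
    # one feature column per target node; -1 marks unreachable nodes
    cols = [bfs_dists(adj, t) for t in target_set]
    return [[col.get(v, -1) for col in cols] for v in range(len(adj))]

# 4-cycle 0-1-2-3-0 (toy graph)
adj = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
feats = distance_encoding(adj, [0])
```

Such distance features can separate nodes that plain message passing treats identically, which is the motivation for analyzing them as a class.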
376 
In this work, we propose a novel convolutional operator dubbed fast Fourier convolution (FFC), which has the main hallmarks of non-local receptive fields and cross-scale fusion within the convolutional unit. 

377 
In this paper, we propose View-Agnostic Dense Representation (VADeR) for unsupervised learning of dense representations. 

378 
In this work, we propose a framework to improve the certified safety region for these smoothed classifiers without changing the underlying smoothing scheme. 

379 
Learning Structured Distributions From Untrusted Batches: Faster and Simpler 
In this paper, we find an appealing way to synthesize the techniques of [JO19] and [CLM19] to give the best of both worlds: an algorithm which runs in polynomial time and can exploit structure in the underlying distribution to achieve sublinear sample complexity. 
380 
This leads us to introduce a novel objective for training hierarchical VQ-VAEs. 

381 
Diversity can be Transferred: Output Diversification for White- and Black-box Attacks 
To improve the efficiency of these attacks, we propose Output Diversified Sampling (ODS), a novel sampling strategy that attempts to maximize diversity in the target model’s outputs among the generated samples. 
382 
POLY-HOOT: Monte-Carlo Planning in Continuous Space MDPs with Non-Asymptotic Analysis 
In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. 
383 
We propose a new paradigm for assistance by instead increasing the human’s ability to control their environment, and formalize this approach by augmenting reinforcement learning with human empowerment. 

384 
Variational Policy Gradient Method for Reinforcement Learning with General Utilities 
In this paper, we consider policy optimization in Markov Decision Problems, where the objective is a general utility function of the state-action occupancy measure, which subsumes several of the aforementioned examples as special cases. 
385 
Reverse-engineering recurrent neural network solutions to a hierarchical inference task for mice 
We study how recurrent neural networks (RNNs) solve a hierarchical inference task involving two latent variables and disparate timescales separated by 1-2 orders of magnitude. 
386 
Temporal Positive-unlabeled Learning for Biomedical Hypothesis Generation via Risk Estimation 
We propose a variational inference model to estimate the positive prior, and incorporate it in the learning of node pair embeddings, which are then used for link prediction. 
387 
Efficient Low Rank Gaussian Variational Inference for Neural Networks 
By using a new form of the reparametrization trick, we derive a computationally efficient algorithm for performing VI with a Gaussian family with a low-rank plus diagonal covariance structure. 
388 
In this paper, we focus on conducting iterative methods like DP-SGD in the setting of federated learning (FL) wherein the data is distributed among many devices (clients). 

389 
Probabilistic Circuits for Variational Inference in Discrete Graphical Models 
In this paper, we propose a new approach that leverages the tractability of probabilistic circuit models, such as Sum-Product Networks (SPNs), to compute ELBO gradients exactly (without sampling) for a certain class of densities. 
390 
Your Classifier can Secretly Suffice MultiSource Domain Adaptation 
In this work, we present a different perspective to MSDA wherein deep models are observed to implicitly align the domains under label supervision. 
391 
Labelling unlabelled videos from scratch with multimodal selfsupervision 
In this work, we a) show that unsupervised labelling of a video dataset does not come for free from strong feature encoders and b) propose a novel clustering method that allows pseudolabelling of a video dataset without any human annotations, by leveraging the natural correspondence between audio and visual modalities. 
392 
A Non-Asymptotic Analysis for Stein Variational Gradient Descent 
In this paper, we provide a novel finite time analysis for the SVGD algorithm. 
393 
Robust Meta-learning for Mixed Linear Regression with Small Batches 
We introduce a spectral approach that is simultaneously robust under both scenarios. 
394 
Bayesian Deep Learning and a Probabilistic Perspective of Generalization 
We show that deep ensembles provide an effective mechanism for approximate Bayesian marginalization, and propose a related approach that further improves the predictive distribution by marginalizing within basins of attraction, without significant overhead. 
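The marginalization view can be illustrated in a couple of lines: the ensemble's predictive distribution is a uniform mixture of the member predictives, approximating the Bayesian model average. The member probabilities below are toy numbers, not from the paper.

```python
import numpy as np

def ensemble_predictive(prob_list):
    # Bayesian model average under a uniform weight over members:
    # the predictive is a mixture of the member predictive distributions
    return np.mean(np.asarray(prob_list), axis=0)

# three members' class-probability outputs for one input (toy numbers)
members = [np.array([0.9, 0.1]), np.array([0.6, 0.4]), np.array([0.3, 0.7])]
p = ensemble_predictive(members)
```

Averaging the probabilities (rather than, say, the logits) is what makes the ensemble output a proper mixture over hypotheses.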
395 
Unsupervised Learning of Object Landmarks via SelfTraining Correspondence 
This paper addresses the problem of unsupervised discovery of object landmarks. 
396 
Randomized tests for high-dimensional regression: A more efficient and powerful solution 
In this paper, we answer this question in the affirmative by leveraging the random projection techniques, and propose a testing procedure that blends the classical $F$test with a random projection step. 
397 
Learning Representations from Audio-Visual Spatial Alignment 
We introduce a novel self-supervised pretext task for learning representations from audio-visual content. 
398 
Generative View Synthesis: From Single-view Semantics to Novel-view Images 
We propose to push the envelope further, and introduce Generative View Synthesis (GVS) that can synthesize multiple photorealistic views of a scene given a single semantic map. 
399 
Towards More Practical Adversarial Attacks on Graph Neural Networks 
Therefore, we propose a greedy procedure to correct the importance score that takes into account the diminishing-return pattern. 
400 
Thus, instead of naively sharing parameters across tasks, we introduce an explicit modularization technique on policy representation to alleviate this optimization issue. 
