AI Machine Learning & Data Science Research Tech Blog

ICLR 2020 | A Look at Three Interesting Papers on the Robustness of Neural Networks

A look at three ICLR papers to look at that fall under the topic of robustness


Today’s state-of-the-art neural networks can achieve a high accuracy, which makes them a natural candidate for many complicated applications. This sounds attractive, and ongoing research has continued to improve the accuracy of neural networks on different tasks. However, robustness problems arise because of the missing guarantee that such performance would generalize well on permutations and unseen data.

The robustness of neural networks is crucially important in industrial product development, ranging from manufacturing automation to autonomous driving. In these fields, robustness is a must-have property in industry standards such as safe operations. This means that properties such as robustness against adversarial attacks, out of distribution detection, and uncertainty measurement should not be neglected, and must, therefore, be discussed.

My interests as an engineer in industry lie mainly on cutting-edge solutions that focus on the generalization ability and robustness of neural networks. These qualities directly affect how plausible it will be to deploy neural networks on a wide range of real-life applications, and how to design effective architectures.

The International Conference on Learning Representations (ICLR) is well-known for its contribution to learning theory. Much of the research selected for ICLR 2020 oral presentations, especially convolutional neural network (CNN) related work, focuses on providing theoretical support for neural network development. I have selected three papers to look at that fall under the topic of robustness:

  1. Your Classifier is Secretly an Energy Based Model and You Should Treat It Like One
  2. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks
  3. On Robustness of Neural Ordinary Differential Equations

Paper 1: Your Classifier is Secretly an Energy Based Model and You Should Treat It Like One

Paper URL:

This is a very interesting paper, aiming at closing the gap between discriminative and generative models. Machine learning models can be classified into discriminative models and generative ones. Discriminative models learn the relationship between the unknown dependent variable(s), commonly referred to as y, and the known variable(s), commonly referred to as x. During training, the model learns the conditional probability distribution, e.g. P(y|x), based on probability theory. Discriminative models include logistic regression, support vector machine, and of course, some convolutional neural networks — the most popular models.

There are reasons for the popularity of discriminative models. Firstly, the current trend is to collect a big amount of data that is, hopefully, representative enough, then use relatively simple but generally applicable optimization algorithms such as gradient descent to solve a machine learning problem. Training discriminative models is very much aligned with this trend. The problems of interests often fall into the category of discriminative problems, like classification problems. However, discriminative models provide point prediction without any information of uncertainty, which leaves the robustness of the model difficult or even impossible to estimate.

Generative models, on the other hand, learn to model a joint distribution, P(x, y), which deals with missing/unstructured data, weaker supervision, uncertainty estimation, etc. Yet the advantages come with a cost. Generative models are much harder to compute and to train, and often require tailored optimization algorithms. More importantly, so far the performance of discriminative models is still far beyond that of generative models.

One might argue that, for example, the logit values of a neural network classifier can be seen as a mean of confidence or uncertainty score. This would be a nice property to have, but it is not obtained by neural networks. Put in a simplified and intuitive way, uncertainty score measures the prediction consistency of the model when tested repetitively. For example, when a person is asked a question that they do not know ten times, they could give a different answer each time. In this example, the uncertainty score is the consistency of the ten answers. It could seem falsely decisive if one only looks at one of the answers. Also because of this reason, most of the generative models, if not all, are highly dependent on sampling.

This paper starts from a common neural network classifier architecture, as shown in the left side of the figure below, which consists of a few convolutional blocks and uses a softmax activation in the output layer to generate the possibility of each classification. In essence, as a discriminative model, the neural network is modelling the conditional probability distribution p(y|x). To deploy the generative modelling approach, the paper seeks a way to transfer the distribution p(y|x) into joint probability distribution p(x, y), which is realized by using energy-based models (EBMs).

Figure: Illustration of the proposed JEM (Joint Energy-based Model) method.
Source: Grathwohl, W.; Wang, K.-C.; Jacobsen, J.-H.; Duvenaud, D.; Norouzi, M. and Swersky, K. (2019). Your Classifier is Secretly an Energy-based Model and You Should Treat It Like One. ICLR.

In energy-based modelling, energy refers to an unnormalized probability scalar, and an energy function E_θ(x) describes any arbitrary function that maps each input point x to a scalar.

The probability density p(x) can be modelled as:


where Z(θ) is the integral of􏰄 exp(−E_θ(x)) over x and is called the normalizing constant (with respect to x) or partition function.

Given the chain rule, the relationship between the joint distribution and the conditional distribution is known:
p(x, y) = p(x)*p(y|x).

The marginal distribution p(x) is calculated as the integral of p(x, y) over y if continuous, or expectation of p(x, y) over y if discrete.

For a generative model, the goal is to model the joint distribution p(x, y) while the conditional distribution p(y|x) is given by the categorical distribution under classification case. What is the link to put the two pieces together?

The paper first carefully analyzed the categorical distribution, which is parametrized bypassing logits through softmax transfer function:



where f_θ(x)[y] refers to the logit that corresponds to the y^th class label.

If joint distribution p(x, y) is defined as:


where Z(θ) leaves undefined and E_θ(x, y) = −fθ(x)[y].

Then the unnormalized marginal distribution p(x) can be calculated as the expectation over y:


Comparing with equation (1) it is known that exp(-E_θ(x)) = \sum_y exp(fθ(x)[y]), taking the log and reversing the sign on both sides of the equation:


which is known as the LogSumExp of the logits, independent of the classifier.

If taking the division of p(x, y) and p(x), the normalizing constant Z(θ) will cancel out and leave the original definition of a categorical distribution defined in (2).

The paper also notes that, unlike the original parametrization, the order of the logits has an impact on p_θ(x) in this case. Thus, the extra degree of freedom left out in the original parametrization is used here to define the extra distributions of interest. The authors refer to their work as a Joint Energy-based Model (JEM).

To compute the joint distribution p(x, y), one can compute the product of p(y|x) and p(x), which boils down to sum in its log form:


The authors used the standard cross-entropy to compute p(y|x) and Stochastic Gradient Langevin Dynamics (SGLD) for log p(x).

It is expected that, with the generative modelling properties, JEM has advantages of uncertainty quantification, out of distribution detection, and improved robustness. The authors trained JEM on CIFAR10, SVHN, and CIFAR100, as well as other hybrid models, standalone generative and discriminative models. The experimental results show that JEM has similar accuracy (92.9%) as Wide-ResNet (95.8%) as well as similar inception scores (JEM 8.76 vs NCSN 8.91) and Frechet Inception Distance (JEM 38.4 vs IGEBM 37.9), which are used as a proximation of normalized likelihood. It seems that the JEM indeed retains the properties of both.

The next step is to test whether JEM benefits from the properties described above or not. The first investigation focuses on uncertainty quantification, which is achieved through error calibration and sounds especially attractive. As in the example phrased at the beginning of this article, a high confidence score predicted by the classifier does not necessarily mean the model is more certain, i.e. when the neural network predicts the probability of the image belongs to class A is 0.8, it is not more confident than a prediction of probability 0.4. The authors attribute this to “the networks are less calibrated, meaning their estimate of p(y|x) does not reflect the true misclassification rate.” Yet with JEM, this calibration is achieved.

As shown below, the authors reported the confidence against the corresponding accuracy and the Expected Calibration Error (ECE).

Figure: model calibration on CIFAR 100.
Source: Grathwohl, W.; Wang, K.-C.; Jacobsen, J.-H.; Duvenaud, D.; Norouzi, M. and Swersky, K. (2019). Your Classifier is Secretly an Energy-Based Model and You Should Treat It Like One. ICLR.

It can be seen that, for a baseline network, there is no clear relationship between the confidence and the accuracy, and as we have explained many times, the ECE is also quite high. While for JEM, when the confidence score is high, it is indeed much more accurate. Real-life applications desire such a property, which provides a simple and cheap uncertainty estimation to improve the reliability of the products, let alone its intuitiveness. Though checking more results included in the appendix the calibration is not always as perfect as the figure above shows, the general trend is maintained, which is still an advantage.

The next topic falls into out-of-distribution (OOD) detection, which also has an important implication in real-life applications. For example, consider an application that makes use of a neural network trained on ten different types of fruit. Such an application cannot restrict users to employ it exclusively on these 10 types of fruit, and so could falsely classify a fruit that it was not trained on into one of the 10 classes it knows. OOD detection can be used to classify such a fruit as unseen.

OOD detection can be treated as a binary problem that scores each input. The score of in-distribution input should be higher than that of OOD input. The figure below shows the OOD detection by scoring with a maximum prediction probability of the predictive distribution. The paper compared his approach, JEM, with another generative model, Glow.

Figure: Histograms for OOD detection. Green corresponds to the score on in-distribution data (CIFAR10), while red corresponds to the score on the OOD dataset.
Source: Grathwohl, W.; Wang, K.-C.; Jacobsen, J.-H.; Duvenaud, D.; Norouzi, M. and Swersky, K. (2019). Your Classifier is Secretly an Energy-Based Model and You Should Treat It Like One. ICLR.

Overall, JEM performs better than Glow as the log p(x) under the JEM column — the green and red histogram — is much more distinguishable. Yet there is still a large overlap between the two distributions. It is still a question whether this is sufficient to achieve OOD detection. Besides, there exist many different scoring measures, of which a large part are likelihood-based. From the results of the paper, different models can perform better based on different scoring mechanisms — so it is not always necessarily JEM that leads, although it still performs well in general.

Finally, the experiment falls into the focus of this article — robustness. The authors ran adversarial attacks that deployed white-box projected gradient descent (PGD) attacks, which gave attackers access to the model weights and consequently gradients on trained CIFAR10 models. Three JEMs were tested, with 0, 1, and 10 steps of sampling seeded at the input.

Figure: Histograms for OOD detection. Green corresponds to the score on in-distribution data (CIFAR10), while red corresponds to the score on the OOD dataset.
Source: Grathwohl, W.; Wang, K.-C.; Jacobsen, J.-H.; Duvenaud, D.; Norouzi, M. and Swersky, K. (2019). Your Classifier is Secretly an Energy-Based Model and You Should Treat It Like One. ICLR.

Models trained adversarially achieved the best performance in this robustness test. The JEM does improve the model robustness to some extent. It is clear here that increasing the number of refinement steps can increase robustness, but I am curious about how much speed loss or computation cost it takes to improve robustness? Based on the above figure, the marginal improvement of the robustness on adverbial attacks seems to be limited.

This paper finds a smart way to combine the discriminative and generative models instead of separating them and, thus, is able to enjoy the benefits of both. The parametrization is neat and clean. Some applications, such as error calibration, are especially interesting.

I have a few more questions about the optimization — as the authors mentioned, the sampling is quite hard — would this narrow the application of JEM? Or in other words, is there a significant performance drop when then dimensionality increases? I am also wondering how other generative models, like Bayesian neural networks, perform compared with JEM. Of course, it is unlikely to cover a compressive range of generative models as there are simply too many.

The code of this paper has been released on GitHub (, and both the paper and code are interesting to go through.

Paper 2: Gradient Descent Maximizes the Margin of Homogeneous Neural Networks

Paper URL:

The second paper also investigates the robustness of neural networks from the perspective of the optimization algorithm, based on existing studies of logistics regression and neural networks.

Instead of proposing a new algorithm, this approach focuses on the current workflow: most of the neural networks are trained with gradient descent algorithms or variants. One question is whether these algorithms are biased towards solutions with good generalization performance on the test set?

To answer this question, the paper first studies the margin in classical machine learning algorithms. Taking a support vector machine (SVM) algorithm as an example, consider a linearly separable dataset that consists of only two classes, as blue and red points show in the figure below.

Figure: a linearly separable dataset consists of only 2 classes (blue and red points) and possible decision boundaries
Source: courses slides of C19 Machine Learning, link:

The black lines in the figure indicate decision boundaries. For the example given here there exist infinite decision boundaries that can completely separate the two classes, and the question is how to decide which one is better. For the decision boundaries shown in the first row, a large distance to the class with red points is retained, but a small permutation in the class of blue points can result in a false classification as the decision boundary is so close to this class.

The decision boundary shown in the figure in row 2 column 1 is so close to both classes that permutation on the points around the decision will cause false classification. This situation can be characterized by the L2-distance from the training sample to the decision boundary, which the paper refers to as L2-robustness. A large distance between the decision boundary and the training point can bear stronger perturbations.

Typically, the decision boundary can be calculated by seeking the largest margin, which is a hyperplane that describes the separation between the two classes. Based on SVM, the margin becomes the largest when the decision boundary is perpendicular to the hyperplane formed by the support vectors.

Figure: SVM searches for w that maximizes the margin
Source: courses slides of C19 Machine Learning, link:

It is natural to adapt the concept of margin in the context of neural networks. However, unlike linear models, where the margin is solely defined by the decision boundary for non-linear neural networks, the margin q_n(θ) for a single data point in (x_n, y_n) is defined as:

q_n(θ) := y_n Φ(θ; x_n),

where Φ(θ; x_n) stands for the output of the neural network.

In this case, the margin is dependent on the function Φ that the neural network learned from the training set. The paper argues, though, that if a neural network is locally Lipschitz then the normalized margin divided by a Lipschitz constant is the lower bound of the L2robustness. A function that satisfies Lipschitz is bounded by Lipschitz constant on how fast it can change, as the Lipschitz constant is larger than the absolute value of the slope of the line connecting any pair of points on the graph of this function. According to the paper, previous studies have shown that “the output of almost every neural network admits a chain rule (as long as the neural network is composed by definable pieces in an o-minimal structure, e.g., ReLU, sigmoid, LeakyReLU).”

For the analysis on gradient flow, only local Lipschitz is required. While for analysis on gradient descent, C^2 smoothness is required, which is a stronger assumption because it requires continuity on first and second derivatives.

There are a few other important assumptions that the authors have made — homogeneity, exponential-type loss function, and separability — to achieve their performance.

The first assumption is that the neural networks to be investigated are (positively) homogeneous neural networks, which is considered to be true if there exists an order L, which is a positive number, such that neural network output Φ(θ; x) satisfies the following condition:


for all θ and x, where Φ stands for the neural network, θ stands for the network parameters and x stands for input. When bias term is removed, fully connected neural networks and CNNs that use ReLU or LeakyReLU as activations are homogeneous, and the order L equals to the number of layers.

The assumption of Exponential-type Loss Function requires the loss function to have some certain exponential tail but not a strictly exponential loss, including popular exponential loss, logistic loss, and cross-entropy loss.

The neural network should be able to separate training data during training, as a neural network can reach 100% accuracy on training data. The authors explain that it is not required that the training loss be zero, but rather only below a threshold, after time t_0, such that the accuracy is 100%. This assumption is needed to analyze the margin given its definition. As modern neural networks are often over-parameterized, the paper argues it should not be too hard to satisfy this requirement. The analysis is focusing on the training dynamics of the network after t_0.

Without too many details on the reasoning, the main findings are as listed below:

  1. It is possible to define a smoothed version of a normalized margin, and it can be proved that this margin is non-decreasing during training for both gradient flow and gradient descent.
  2. A constrained margin maximization problem is constructed and it is shown that the optimization, both gradient flow, and gradient descent, passes through an approximate KKT point after a certain amount of time. However, the KKT conditions for highly non-linear neural networks are only first-order necessary and not sufficient for global optimality. Convergence to a KKT point does not lead to a normalized margin that is globally optimal. The paper also clarifies that it is not realistic to prove stronger convergence results without adding further assumptions since gradient descent is a first-order optimization method.
  3. Practically, the paper indicates that training with more time can enlarge the normalized margin and thus improve the (L2-) robustness of the neural network.

The paper presents an empirical test with the result plotted in the figure below. It is shown that the normalized margin increases with the training epochs. But when training with a fixed time step, the normalized margin increases very slowly(≈ 1.8 × 10^{−4} after 10000 epochs), as shown in figure (a). By instead applying the SGD algorithm with a loss-based learning rate scheduler, the normalized margin increases much faster (≈ 1.2 × 10^{−3} after 10000 epochs). The loss-based learning rate scheduler is proposed by the authors to find the maximum possible learning rate at each epoch given the current training loss.

Figure: Training loss and normalized margin during training a CNNs, with and without bias on MNIST. The optimizer is SGD. (a) with a fixed learning rate 0.01, (b) with loss-based learning rate scheduler.
source: Lyu, K. and Li, J. (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. ICLR.

Let’s further examine the training process with the loss-based learning rate scheduler. The table below lists several snapshots noted as model-1, model-2, etc., and their performance. The training accuracy of all five models is 100%. This is expected since the paper only focuses on the training stage that the neural network has perfectly fitted on the training data. The normalized margin increases when the training proceeds, but the test accuracy drops slightly.

Table: Statistics of the CNN without bias after training for a different number of epochs. Model-1 to model-4 are the snapshots taken when the loss decreases below 10^{−10}, 10^{−15}, 10^{−20}, 10^{−120}, respectively. Model-5 is the snapshot when the loss is 10^{−882}.
source: Lyu, K. and Li, J. (2020). Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. ICLR.

What is a bit confusing to me is the apparent contradiction between the two observations — training longer can improve the robustness and training longer might result in overfitting. It should be noted, though, that training longer does not necessarily lead to overfitting. However, for a neural network that achieves 100% training accuracy on a complicated real-world dataset, more training will increase its risk of overfitting. Thus, this contradictory conclusion seems a bit confusing.

Also, the relationship between the margin and the L2-robustness remains unclear to me. Allow me to quote a comment from a reviewer to explain: “As the paper argues if we have an upper bound on the Lipschitz constant w.r.t. norm, then we get a lower bound on required adversarial perturbations for any training point. However, this does not mean that training longer is necessarily better because by doing so, we might end up with a larger Lipschitz constant (even after normalizing). So even if the ‘margin’ is larger, the actual adversarial perturbations (in norm) allowed might get smaller.”

Another issue in this paper is that some of its assumptions are too strong, like the separability and the C^2 smoothness. The paper, though, argues that many neural network models with commonly used losses naturally satisfy them. There are other assumed preliminaries, such as bias being not included, which are not easily met in practice.

The contribution of this paper is not negligible. It formulates an optimization problem without assumption of the convergence on the direction of parameters and losses. This paper also provides some theoretical guidelines on the choice of optimization algorithms. The code for conducting the experiments has been published on

Paper 3: On Robustness of Neural Ordinary Differential Equations

Paper URL:

The idea of this paper is similar to the first paper. It also improves the robustness with minimal adaptation on current neural network architecture. This paper focuses on Neural Ordinary Differentiable Equation (ODE) networks. The ODENets are developed according to the observation that some neural networks, such as residual networks and recurrent neural networks, approximate nonlinear functions by a sequence of transformations through a hidden state h:


where t stands for the time step and f(h_t, θ_t) are trainable layers that learn hidden unit dynamics.

Usually this model is based on a discrete fixed time step. One intuitive idea is to parameterize the continuous dynamics of hidden units by adding more layers and taking smaller steps. In the limit, this forms an ODE that describes the relationship between input and output as:


where f_θ refers to trainable layers and z refers to the d-dimensional state of the neural ODE, which is equivalent to the h_t described in the equation above. The input state/initial state of the neural ODE is specified by z_in. The output z_out is related to the state at time point R.

This makes ODENets similar to generative models, which suggests they could improve robustness. The paper illustrates a network architecture that contains 3 stages. The first stage is a feature extractor made from a series of convolutional layers. The neural ODE that performs non-linear representation mapping forms the second stage. And the last stage consists of a fully-connected classifier that outputs the final prediction. A feature extractor is necessary to perform dimensionality reduction because ODENets are dimension-preserving mappings.

Figure: the architecture the ODENet used in the paper.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

The paper evaluated the robustness of the ODENets with inputs of various types of perturbations, such as random Gaussian perturbations, fast gradient sign method (FGSM) adversarial attack, and projected gradient descent (PGD) adversarial attack. The PGD attack is also called I-FGSM, where the I stands for iterative.

A CNN model based on the ResNet architecture was trained and tested as a baseline model for comparison. Both the ODENets and the CNN model were trained with two strategies: training with only original non-perturbed images, and training with original images together with their perturbed versions. Training datasets include MNIST, SVHN, and a subset of the ImageNet dataset, which the paper refers to as ImgNet10.

The evaluation results of models trained with only original images are listed below. Each model was trained three times with different random seeds. The paper uses the mean classification accuracies (%) and standard deviations (mean ± std) to indicate robustness performance. The Gaussian noise is generated with zero mean and three different standard deviations, also listed in the table below. The adversarial attack used is FGSM, with 3 different l_∞-norm ε as shown in the table.

Table: Robustness test on ODENet and benchmark CNN, trained with only original images.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

It is obvious that ODENet outperforms CNN in every test. The difference in classification accuracy under adversarial attack is especially large. However, the performance of ODENet is not very competitive in comparison with the baseline model. The performance of both models trained on ImgNet10 dropped the most under the attack. One drawback is that the standard deviation is only calculated on three repetitions, which is not very statistically trustworthy.

The ODENet continues to outperform CNN when models are trained together with Gaussian noise, as shown in the table below. The standard deviation of the added Gaussian noise is randomly chosen from {50, 75, 100} on the MNIST dataset, {15, 25, 35} on the SVHN dataset, and {10, 15, 25} on the ImgNet10. During the robustness test, Gaussian noise is drawn from a distribution with zero mean and standard deviation 100, both FGSM and PGD adversarial examples are used, with different l_∞-norm ε as listed in the table below.

Table: Robustness test on ODENet and benchmark CNN, trained with original images together with their perturbed versions.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

Adding Gaussian noise during training improves the robustness of both models. The PGD attack is more powerful than previous attacks, and under this attack the largest performance drop is observed. This is especially true when looking at the CNN performance on MNIST under the PGD-0.3 attack. CNN completely failed to distinguish the adversarial examples and resulted in 0% accuracy on all 3 tests. ODENet, on the other hand, performs better but is not entirely satisfying either.

On the other hand, it is observed empirically that ODENet is more robust, but how to explain this observation? The paper argues that this is because, in the neural ODE, a small change on the feature map will not lead to a large deviation from the original output associated with the feature map.

To verify this idea, the paper demonstrated that ODE integral curves do not intersect. For example, consider a simple model that returns a scalar as its state. Assume there exists a few initial states z_1 (0), z_2 (0) and z_3 (0) such that the system starts from A_1 = (0, z_1 (0)), A_2 = (0,z_2(0)), and A_3 = (0,z_3(0)). Suppose A1 is bounded by A2 and A3, as shown in the figure below. Then the non-intersect theorem will ensure that the integral curve z_1(t) is always sandwiched between the integral curves z_2(t) and z_3(t).

Therefore, it is possible to find another starting point \tilde{A} that is in the neighbourhood of A_1. The distance between \tilde{A} and A_1 is smaller than both the distance between A_1 and A_2 and between A_1 and A_3, of A_1. This means that the integral curve \tilde{z}_1 (t) would also be bounded. In the figure below, it is sandwiched between two integral curves starting from A1 and A3.

Figure: No integral curves intersect.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

It is important to note here that the function f must be continuous in time t and globally Lipschitz continuous in state z. Otherwise one can easily find situations when this theorem does not hold. The paper has also implicitly assumed that feature maps will not change drastically under a small perturbation on inputs. This, in turn, requires that regularization techniques, such as weight decay, are used.

Another question relates to the bound of the distance between the integral curves. If this distance is not bounded or very large, then the robustness is not guaranteed even if the theorem holds.

In further analysis, the paper also acknowledges that the key to improving the robustness of ODENets is to control the difference between neighbouring integral curves. They propose to achieve this by (1) removing the time dependence constraint of the dynamics and (2) imposing a certain steady-state constraint. More specifically, the dependency on time step t shown in f_θ (z(t), t) should be removed, which results in a re-parametrized neural ODE:


Define a set M_1 that includes the points that satisfy {(z_1(t), t) | t ∈ [0, T], || z_1(t) − z_1(0) || ≤ ε}, where z_1(t) is a solution of the re-parametrized neural ODE and ε is a small positive value. This set M_1 describes the points that are inside the ε-neighbourhood of z_1 (0) as well as on the curve of z_1(t) during time interval [0, T]. Then for element (z_1 (T ′ ), T ′ ) that belongs to set M_1, let \tilde{z}_1 (t) be a solution of the re-parametrized neural ODE equation whose initial state is \tilde{z}_1 (0) = z_1(T′). The element (z_1 (T ′ ), T ′ ) enjoys the time-invariant property, namely:


This property enables the possibility to bound the difference between neighbouring integral curves, providing that 􏰃\tilde{z}_1 (0) is regarded as a slightly perturbed version of z1(0):


where all norms are l_2 norms and |f_θ| refers to the element-wise absolute operation of a vector-valued function f_θ. Hence, by explicitly including the final term shown in the equation above, called steady-state loss, in the training loss, the robustness becomes one of the training objectives:


where N refers to training data size and the initial state of z_i(t) equals to the feature of the ith sample. The training aims at finding a small L_sss such that the robustness against small permutations is achieved. This model is called Time-invariant steady neural ODE (TisODE).

Similar tests were done on TisODE, with results shown in the table below. At the end of the training, the steady-state loss L_ss dropped to 0.1. Models were trained with both original images and images with Gaussian noise.

Table: Robustness test on ODENet, TisODE, and benchmark CNN, trained with original images and Gaussian noise.
Source: Yan, H.; Du, J.; Tan, Y. F. V.; Feng, J. (2020). On Robustness of Neural Ordinary Differential Equations. ICLR.

Indeed, the robustness of the TisODE is improved, especially under the FGSM attack. Yet this improvement still seems to be limited under a PGD attack. For example, the accuracy only increased by 0.2% when combating PGD-0.3 on the MNIST dataset.

It is an intuitive idea to test models trained with adversarial examples. The paper tested it on the MNIST dataset. Models were trained with the FGSM attack (ε=0.3). The results are shown in the table below.

ACC (%)\sigma = 100FGSM 0.3FGSM 0.5PGD 0.3
Table: Robustness test on ODENet, TisODE, and benchmark CNN, trained with original images and adversarial examples.
Source: The authors’ comments on

Since the models are trained with adversarial examples, they are much more robust under the FGSM 0.3 attack than under Gaussian noise permutation, which was not observed in previous experiments. Performance drops significantly when applying other attacking methods, especially with PGD 0.3. TisODE though, does show a satisfying result. Using adversarial training is still the most effective way to improve robustness. However, improved robustness does not necessarily generalize well under other attacks.

Overall, the idea of ODE itself is very interesting. And it seems especially suited for generative problems, such as time-series analysis. Although the reasoning on ODENet’s robustness is not so rigorous, the creation of TisODE is innovative. A big advantage of this work is that the proposed model can be regarded as a drop-in module that can be easily combined with other approaches to improve the robustness of a trained neural network.


Given the empirical results discussed in papers 1 and 3, we see that neither proposes a more effective way to improve the robustness of neural networks — the state-of-the-art method is still adversarial training. Although some of the ‘side-products’, such as error calibration from JEMm are very interesting approaches to provide additional information regarding robustness.

Overall, there is not yet a unified or effective theoretic explanation on robustness and how to guide model training. We have experienced this ‘saturation’ in many different research fields in the past two years. For example, network architecture has not witnessed a revolutionary change since the proposal of ResNet. On the one hand, this suggests that deep learning techniques are becoming mature and provide more possibilities for industrial development. On the other hand, it seems innovative and impactful work is being seen less. Luckily, we are in a field where industrial engineers can also contribute to state-of-the-art research.

In my opinion, we, the practitioners, should also ask ourselves questions — how could these theoretical findings guide our practical work? And how could our empirical observations provide new insights to research?

Author: YuanYuan Li | Editor: H4O & Michael Sarazen

0 comments on “ICLR 2020 | A Look at Three Interesting Papers on the Robustness of Neural Networks

Leave a Reply

Your email address will not be published. Required fields are marked *

%d bloggers like this: