What frustrates me about Bayesian NNs is that talking about "priors" doesn't make nearly as much sense as it does in a regression context. A prior over parameter weights has no interpretation in the way that a prior over a regression coefficient, or even a spline smoothness, does. What you really want -- and what natural intelligence probably has -- are priors over aspects of the world.
Francois Chollet's paper on measuring intelligence was really informative for me on this front; the "priors" you should have about the world are not half-cauchys over certain hyperparameters or whatever, but priors about agent-ness, object-ness, goal-oriented-ness, and so on. How to encode that in a network...well, that's the real trick, right?
I agree that priors over aspects of the world would be more useful, but I don't think that they're important in making natural intelligence powerful. In my experience, the important thing is to make your prior really broad, but containing all kinds of different hypotheses with different kinds of rich structure.
I claim that knowing a priori about things like agents and objects just doesn't save you all that much data, as long as you have the imagination to consider all structures at least that complex.
Bayesian Neural Networks just seem like a failed approach, unfortunately.
For one, Bayesian inference and UQ fundamentally depend on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?). Add to that the fact that the Bayesian inference is very much approximate, and you should see the trouble.
If you want UQ, 'frequentist nonparametric' approaches like Conformal Prediction and Calibration/Multi-Calibration methods seem to work quite well (especially when combined with the standard ML machinery of taking a log-likelihood as your loss), and do not suffer from any of the issues above while also giving you formal guarantees of correctness. They are a strict improvement over Bayesian NNs, IMO.
The Conformal Prediction advocates (especially a certain prominent Twitter account) tend to rehash old frequentist-vs-bayesian arguments with more heated rhetoric than strictly necessary. That fight has been going on for almost a century now. The Bayesian counterargument (in caricature form) would be that MLE frequentists just choose an arbitrary (flat) prior, and that penalty hyperparameters (common in NNs) are a de facto prior. The formal guarantees only have bite in the asymptotic setting or require convoluted statements about probabilities over repeated experiments; and asymptotically, the choice of prior doesn't matter anyway.
(I'm a moderate who uses both approaches, seeing them as part of a general hierarchical modeling method, which means I get mocked by both sides for lack of purity).
Bayesians are losing ground at the moment because their computational methods haven't been advanced as fast by the GPU revolution for reasons having to do with difficulty in parallelization, but there's serious practical work (especially using JAX) to catch up, and the whole normalizing flow literature might just get us past the limitations of MCMC for hard problems.
But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Calibration is also a pretty magical way to improve just about any estimator. It's cheap to do and it works (although hard to guarantee anything with that in the general case...)
And don't forget quantile regression penalties! Awkward to apply in the NN setting, but an easy and effective way to do UQ in XGBoost world.
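For anyone unfamiliar with the "quantile regression penalty" mentioned above: it usually means the pinball (quantile) loss. Here is a minimal NumPy sketch; the function name `pinball_loss` and the toy data are illustrative assumptions, not taken from any particular library.

```python
import numpy as np

def pinball_loss(y_true, y_pred, q):
    """Mean pinball (quantile) loss for a target quantile q in (0, 1)."""
    diff = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    # Under-prediction (diff > 0) costs q per unit; over-prediction costs (1 - q).
    return float(np.mean(np.maximum(q * diff, (q - 1.0) * diff)))

# Toy check: the constant that minimises the loss sits near the empirical q-quantile.
rng = np.random.default_rng(0)
y = rng.normal(size=2000)
grid = np.linspace(-3.0, 3.0, 601)
best = grid[np.argmin([pinball_loss(y, c, 0.9) for c in grid])]
```

Training one model at q = 0.05 and another at q = 0.95 gives a rough 90% interval, with none of the finite-sample guarantees discussed elsewhere in the thread.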
Yeah, I know the account you are talking about, it really is a bit over the top. It's a shame, I've met a bunch of people who mentioned that they were actually turned away from Conformal Prediction due to them.
> But having said that, Conformal Prediction works as advertised for UQ as a wrapper on any point estimating model. If you've got the data for it - and in the ML setting you do - and you don't care about things like missing data imputation, error in inputs, non-iid spatio-temporal and hierarchical structures, mixtures of models, evidence decay, unbalanced data where small-data islands coexist with big data - all the complicated situations where Bayesian methods just automatically work and other methods require elaborate workarounds, yup, use Conformal Prediction.
Many of these things can actually work really well with Conformal Prediction, but the algorithms require extensions (much like if you are doing Bayesian inference, you also need to update your model accordingly!). They generally end up being some form of reweighting to compensate for the distribution shifts (excluding the Online Conformal Prediction literature, which is another beast entirely). Also worth noting: if you have iid data, then Conformal Prediction is remarkably data-efficient; as few as 20 samples are enough for it to start working for 95% predictive intervals, and with 50 samples (and with almost surely unique conformity scores) it's going to match 95% coverage fairly tightly.
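To put rough numbers on the sample-efficiency point, here is a small sketch of the standard marginal coverage calculation for split conformal prediction. It assumes exchangeable data and almost surely distinct scores; the function name `split_conformal_coverage` is made up for illustration.

```python
import math

def split_conformal_coverage(n_cal, alpha=0.05):
    """Marginal coverage of split conformal with n_cal calibration points.

    Assumes exchangeable data and almost surely distinct scores.
    Returns None when n_cal is too small to give a finite threshold.
    """
    # Rank of the calibration score that is used as the threshold.
    k = math.ceil((n_cal + 1) * (1 - alpha))
    if k > n_cal:
        return None  # threshold would be infinite: only the trivial set is valid
    return k / (n_cal + 1)

for n in (10, 20, 50, 1000):
    print(n, split_conformal_coverage(n))
```

Under these assumptions the coverage never drops below 1 - alpha, and it overshoots by less than 1/(n_cal + 1), which is about 0.02 at 50 calibration samples.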
Priors on parameters are not an issue. In models at scale, priors are just some computationally convenient shrinkage, and what works is found empirically and canonized into practice; projecting prior knowledge of the problem at hand onto parameter priors does not really happen except in some vague sense ("I think most predictors are irrelevant, so make it sparse by Cauchy/horseshoe/whatever").
The important thing in bayesian (statistical, ML) modelling in general is the ability to gain flexibility and build model structures that would otherwise be hard or impossible: latent states, hierarchies, etc.
In bayesian NNs the main advantages would be uncertainty quantification (UQ), finding good optima, and partly avoiding overfitting. These do apply in some cases of simple NNs.
Mostly however, especially with larger conventional models (not speaking of normalizing flows and such here), using explicit bayes is not feasible. Instead, people use approximate point estimates with tricks:
(1) UQ has been taken care of by post-calibration. (2) Stochastic gradient actually searches for large posterior masses like a variational approximation would do, so it is kind of bayes. (3) And those priors: using dropout is commonplace, it has a bayesian interpretation, and L2 regularization aka gaussian priors are frequent too.
So bayes is there in practice, just not in a neat, pure form but as a collection of practical hacks.
> For one, Bayesian inference and UQ fundamentally depend on the choice of the prior, but this is rarely discussed in the Bayesian NN literature and practice, and is further compounded by how fundamentally hard to interpret and choose these priors are (what is the intuition behind a NN's parameters?).
I agree that, computationally, it is hard to justify the use of Bayesian methods on large-scale neural networks when stochastic gradient descent (and friends) is so damn efficient and effective.
On the other hand, the fact that there's a dependence on (subjective) priors is hardly a fair critique: non-Bayesian training of neural networks also depends on the use of (subjective) loss functions with (subjective) regularization terms (in fact, it can be shown that, mathematically, using a prior in MAP estimation is precisely equivalent to adding regularization to a loss function). Non-Bayesian training of neural networks is not "a failed approach" just because someone can arbitrarily choose L1 regularization (i.e., a Laplace prior) over L2 regularization (i.e., a Gaussian prior).
Furthermore, we do have some intuition over NN parameters (particularly when inputs and outputs are properly scaled): a value of 10^15 should be less likely than a value of 0. Note that, in Bayesian practice, people often use weakly-informative priors (see, e.g., http://www.stat.columbia.edu/~gelman/presentations/weakprior...) to encode such intuitive statements while ensuring that (for all practical purposes) the data will effectively overwhelm the prior (again, this is equivalent to adding a minimal amount of regularization to a loss function, to make a problem well-posed when e.g. you have more parameters than data points).
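The prior/regularizer equivalence referred to above can be written out in a few lines. This is a sketch for the point-estimate (MAP) case only, and the notation (data D, parameters θ, prior scales σ and b, penalty weight λ) is assumed here rather than taken from the thread.

```latex
\begin{align*}
\hat{\theta}_{\mathrm{MAP}}
  &= \arg\max_{\theta}\; \bigl[\log p(D \mid \theta) + \log p(\theta)\bigr] \\
  &= \arg\min_{\theta}\; \bigl[\underbrace{-\log p(D \mid \theta)}_{\text{loss}}
     \;\underbrace{-\,\log p(\theta)}_{\text{regularizer}}\bigr] \\[4pt]
\theta \sim \mathcal{N}(0, \sigma^2 I)
  &\;\Rightarrow\; -\log p(\theta) = \frac{1}{2\sigma^2}\lVert\theta\rVert_2^2 + \text{const}
  \quad (\text{L2 with } \lambda = 1/(2\sigma^2)) \\
\theta_i \overset{\text{iid}}{\sim} \mathrm{Laplace}(0, b)
  &\;\Rightarrow\; -\log p(\theta) = \frac{1}{b}\lVert\theta\rVert_1 + \text{const}
  \quad (\text{L1 with } \lambda = 1/b)
\end{align*}
```

A wide prior (large σ or b) corresponds to a small λ, which is the "minimal amount of regularization" reading of a weakly-informative prior.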
I agree that Bayesian neural networks haven't been worth it in practice for many applications, but I think the main problem is that it's usually better to spend your compute training a single set of weights for a larger model, rather than doing approximate inference over weights in a smaller model. The exception is probably scientific applications where you mostly know the model, but then you don't really need a neural net anymore.
Choosing a prior is hard, but I'd say it's about as hard as choosing an architecture - if all else fails, you can do a brute force search, and you even have the marginal likelihood to guide you. I don't think it's the main reason why people don't use BNNs much.
I disagree with one conceptual point; if you are truly Bayesian you don’t “choose” a prior, by definition you “already have” a prior that you are updating with data to get to a posterior.
Conformal learning is relatively new to me. Tell me if I'm getting any of this wrong: Conformal learning is a frequentist approach that uses a calibration set to determine how unusual a prediction is.
It seems like the main time they aren't a strict improvement over bayesian methods is when it is difficult to define your calibration set? I know this scenario isn't so commonplace, but I'm working in a scenario where I quickly looked at conformal learning and wasn't sure if it is applicable.
That's a particular form of Conformal Prediction, called Split Conformal Prediction. Incidentally, it's also one of the best ones (i.e., most extensible, strongest guarantees, easiest to implement, remarkably sample-efficient).
Making a calibration set is pretty easy, it's just a data split (just like the train/test split). The hardest part (which is still fairly easy) is creating a 'conformity score', which is a function that receives the input and a candidate output and scores how well this candidate output 'conforms' to the input. This is where an underlying ML model can come in handy: it can, itself, estimate this! Split Conformal Prediction then does a fairly simple quantile calculation on these scores (or some variant thereof) to then form the set prediction.
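As a rough illustration of the recipe described above, here is a minimal split conformal sketch for regression. Everything in it is an assumption made for the example: the absolute residual is used as the score (a nonconformity score, i.e. the negation of a conformity score), and the "fitted model" `predict` and the synthetic data are stand-ins.

```python
import numpy as np

def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.05):
    """Split conformal prediction intervals using absolute-residual scores."""
    # Nonconformity score on held-out calibration data: larger means worse fit.
    scores = np.abs(y_cal - predict(X_cal))
    n = scores.shape[0]
    # Finite-sample-corrected rank of the calibration score used as threshold.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    if k > n:
        raise ValueError("calibration set too small for this alpha")
    q_hat = np.sort(scores)[k - 1]
    center = predict(X_new)
    return center - q_hat, center + q_hat

# Toy usage with a hypothetical already-fitted model and synthetic data.
rng = np.random.default_rng(0)
predict = lambda X: 2.0 * X
X_cal, X_test = rng.uniform(-1, 1, 1000), rng.uniform(-1, 1, 5000)
y_cal = 2.0 * X_cal + rng.normal(scale=0.5, size=1000)
y_test = 2.0 * X_test + rng.normal(scale=0.5, size=5000)
lower, upper = split_conformal_interval(predict, X_cal, y_cal, X_test)
coverage = np.mean((y_test >= lower) & (y_test <= upper))
```

The intervals here have constant width; scores that normalise the residual by a predicted scale, or that are built from quantile regressors, give adaptive widths with the same calibration step.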
In a sense, you could use Bayesian NNs to produce a conformity score. But that doesn't seem to be much better than just using e.g. the model's logits for your conformity score. Theory-wise, Conformal Prediction methods have a number of favorable guarantees that Bayesian models (and especially Bayesian NNs) generally don't, and in practice we've seen that conditional on the model giving calibrated outputs (which is guaranteed for Conformal Prediction, but not for Bayesian NNs), Conformal Prediction predicted sets seem to be tighter than the Bayesian NN ones.
I’m not an expert in BNNs but the prior does not need to be justified in terms of each parameter. Bayesian analysis frequently uses hyperparameters to set the overall tightness or looseness of the parameters (a la Minnesota priors in the econometric literature for example). This would be a similar regularisation intuition as, eg, L1 and L2 regularisation in traditional NN training. This is of course just one example.
Author here! What a surprise. This was an abandoned project from 2019, that we never linked or advertised anywhere as far as I know. Anyways, happy to answer questions.
Why was this not picked up for further research (if that's the case)? I know that OATML did quite a lot of work on this front as well, and it seems the direction is still being worked on. I'd like to get your 2 cents on this approach.
I like Bayesian inference for few-parameter models where I have solid grounds for choosing my priors. For neural networks, I like to ask people "what's your prior for ReLU versus LeakyReLU versus sigmoid?" and I've never gotten a convincing answer.
I agree choosing priors is hard, but choosing ReLU versus LeakyReLU versus sigmoid seems like a problem with using neural nets in general, not Bayesian neural nets in particular. Am I misunderstanding?
Ah, Kolmogorov Arnold Networks. Perhaps the only model I have ever tried that managed to fairly often get AUCs below 0.5 in my tabular ML benchmarks. It even managed to get a frankly disturbing 0.33, where pretty much any other method (including linear regression, IIRC) would get >=0.99!
KANs have learnable activations based on splines parameterized on few variables. You can specify a prior over those variables, effectively establishing a prior over your activation function.
Mixture density networks are quite interesting if you want probabilistic estimates from a neural network. Here, your model learns to output an array of Gaussian distribution coefficients, plus mixture weights.
These weights are specific to individual observations, and are trained to maximise likelihood.
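For concreteness, here is a sketch of the loss such a network would be trained on: the negative log-likelihood of each target under its own Gaussian mixture. It is plain NumPy with no network attached; the names `mdn_nll`, `logits`, `means` and `log_stds` are illustrative assumptions about how the output head is parameterised.

```python
import numpy as np

def _logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = np.max(a, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(a - m), axis=axis))

def mdn_nll(y, logits, means, log_stds):
    """Mean negative log-likelihood of y under per-observation Gaussian mixtures.

    y has shape (N,); logits, means and log_stds have shape (N, K), i.e. one
    set of mixture parameters per observation, as a network head would emit.
    """
    # Log of the softmax mixture weights, computed per observation.
    log_w = logits - _logsumexp(logits, axis=1)[:, None]
    # Log density of y under each Gaussian component.
    z = (y[:, None] - means) / np.exp(log_stds)
    log_comp = -0.5 * z**2 - log_stds - 0.5 * np.log(2.0 * np.pi)
    # Mixture log-likelihood, averaged over observations and negated.
    return float(-np.mean(_logsumexp(log_w + log_comp, axis=1)))
```

In an autodiff framework the same expression would be written with that framework's array ops and minimised by gradient descent; predicting log standard deviations keeps the scales positive without constraints.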
This approach characterizes a different type of uncertainty than BNNs do, and the approaches can be combined. The BNN tracks uncertainty about parameters in the NN, and mixture density nets track the noise distribution _conditional on knowing the parameters_.
BNNs were an attractive choice in scenarios where the data is expensive to collect, like actual physical experiments. But boosting and other tree-based regression methods give you similar performance with a more straightforward framework for limited tabular data.
Good point. We wrote this pre-double descent, and a massively overparameterized model would make a nice addition to the tutorial as a baseline. However, if you want a rich predictive distribution, it might still make sense to use a Bayesian NN.