Hi, this is the author of the books being discussed in this thread. I wanted to respond to some of the comments.
it_does_follow said "there is almost no mention of understanding parameter variance". I do discuss both Bayesian and frequentist measures of uncertainty in secs 4.6 and 4.7 of my intro book (http://probml.github.io/book1). (Ironically, I included Cramér-Rao in an earlier draft, but omitted it from the final version due to space.)
dxbydt said "how do you compute the variance of the sample variance". I discuss ways to estimate the (posterior) variance of a variance parameter in sec 3.3.3 of my advanced book (https://probml.github.io/pml-book/book2). I also discuss hierarchical Bayes, shrinkage, etc.
However in ML (and especially DL) the models we use are usually unidentifiable, and the parameters have no meaning, so nobody cares about quantifying their uncertainty. Instead the focus is on predictive uncertainty (of observable outcomes, not latent parameters). I discuss this in more detail in the advanced book, as well as related topics like distribution shift, causality, etc.
Concerning the comments on epub, etc. My book is written in LaTeX and compiled to PDF. This is the current standard for technical publications and is what MIT Press uses.
Concerning the comments on other books. The Elements of Statistical Learning (Hastie, Tibshirani, Friedman) and Pattern Recognition and ML (Bishop) are both great books, but are rather dated, and quite narrow in scope compared to my two-volume collection.
Kevin Murphy has done an incredible service to the ML (and Stats) community by producing such an encyclopedic work of contemporary views on ML. These books are a much-needed update of the now dated-feeling "The Elements of Statistical Learning" and the logical continuation of Bishop's nearly perfect "Pattern Recognition and Machine Learning".
One thing I do find a bit surprising is that in the nearly 2,000 pages spanned by these two books there is almost no mention of understanding parameter variance. I get that in machine learning we typically don't care, but this is such an essential part of basic statistics that I'm surprised it's not covered at all.
The closest we get is the Inference section, which is mostly interested in prediction variance. It's also surprising that in neither the section on the Laplace approximation nor the one on Fisher information does anyone call out the Cramér-Rao lower bound, which seems like a vital piece of information regarding uncertainty estimates.
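For reference, the bound says that under the usual regularity conditions, any unbiased estimator of a scalar parameter satisfies

    \mathrm{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}, \quad \text{where} \quad
    I(\theta) = -\mathbb{E}\left[ \frac{\partial^2}{\partial \theta^2} \log p(x \mid \theta) \right]

so the Fisher information directly bounds how well one can ever pin a parameter down.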
This is of course a minor critique, since virtually no ML books touch on these topics; it's just unfortunate that in a volume this massive we still see ML ignoring what is arguably the most useful part of what statistics has to offer machine learning.
Do you really expect this situation to ever change? The communities are vastly different in their goals despite some minor overlap in their theoretical foundations. Suppose you take an rnorm(100) sample and find its variance. Then you ask the crowd for the mean and variance of that sample variance. If your crowd is 100 professional statisticians with a degree in statistics, you should get the right answer at least 90% of the time. If instead you have 100 ML professionals with some sort of a degree in cs/vision/nlp, fewer than 10% would know how to go about computing the variance of the sample variance, let alone what distribution it belongs to. The worst case is 100 self-taught Valley bros: not only will you get the wrong answer 100% of the time, they'll pile on you for gatekeeping and for computing useless statistical quantities by hand when you should be focused on the latest and greatest libraries in numpy that will magically do all these sorts of things if you invoke the right API. As a statistician, I feel quite sad. But classical stats has no place in what passes for ML these days.
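For the record, the classical answer: for a normal sample, (n-1)S^2/sigma^2 follows a chi-squared distribution with n-1 degrees of freedom, so E[S^2] = sigma^2 and Var(S^2) = 2 sigma^4/(n-1). A quick numpy check (ironic, I know, but it makes the point concrete):

    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma2, trials = 100, 1.0, 100_000

    # Many replications of rnorm(100), i.e. N(0, 1) samples of size 100.
    x = rng.normal(0.0, np.sqrt(sigma2), size=(trials, n))
    s2 = x.var(axis=1, ddof=1)  # unbiased sample variance S^2

    # Classical result: (n-1) * S^2 / sigma^2 ~ chi^2_{n-1}, hence
    # E[S^2] = sigma^2 and Var(S^2) = 2 * sigma^4 / (n - 1).
    print(s2.mean())                           # ~ 1.0
    print(s2.var(), 2 * sigma2**2 / (n - 1))   # both ~ 0.0202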
Folks can't Rao-Blackwellize for shit, how can you expect a Fisher information matrix from them?
I think Bishop et al.'s WIP book Model-Based Machine Learning[0] is a nice step in the right direction. Honestly, the most important thing missing from ML that stats has is the idea that your model is a model of something: how you construct a problem mathematically says something about how you believe the world works. Then we can ask all sorts of detailed questions like "how good is this model and what does it tell me?"
I'm not sure this will ever dominate. As much as I love Bayesian approaches, I sort of feel there is a push to make them ever more byzantine, recreating all of the original critiques of where frequentist stats went wrong. So essentially we're just seeing a different orthodoxy dominate thinking, with all of the same trappings of the previous orthodoxy.
Wait, what's the problem with people not knowing things they don't need to know? This just comes across as being bitter that self-taught people exist, or that other people are somehow encroaching on your field.
I think your comment does what the OP complains about, regarding gatekeeping etc.
I don't know about OP, whose comment I find a little harsh, but personally I'm always a bit frustrated and a bit despairing when I realise how poor the background of the average machine learning researcher today is, i.e. of researchers of my generation. Sometimes it's like nothing matters other than the chance that Google or Facebook will want to hire someone with a certain skillset, and any knowledge that isn't absolutely essential to acquiring that skillset is irrelevant.
Who said "Those who do not know their history are doomed to repeat it"? In research that means being oblivious of the trials and tribulations of previous generations of researchers and then falling down the same pits that they did. See for example how deep learning models today are criticised for being "brittle", a criticism that was last levelled against expert systems, and for similar, although superficially different, reasons. Why can't we ever learn?
> I think your comment does what the OP complains about, regarding gatekeeping etc.
Oh absolutely, that's how I intended it. I don't think that preemptively calling out people's reaction gives the parent comment a pass on gatekeeping.
Your concern about poor background... it's only a problem for people who are jumping into things without the prerequisite background and aren't learning fast enough. But modern deep learning is much more empirical: there are a few building blocks, and people are trying out different things to see how they perform. I don't get why we need to look down on people for not knowing things that they don't need to know. If there were some magic that came from knowing much more statistics, then the researchers who do would be outperforming the rest of the field by a lot, but I don't think that's the case.
That certainly is the case. Not for statistics specifically, but all the people at the top of the field, Bengio, LeCun, Schmidhuber, Hinton, and so on, have deep backgrounds in computer science, maths, psychology, statistics, physics, AI, etc. You don't get to make progress in a field as saturated as deep learning when all you know how to do is throw stuff at the wall and see what sticks.
I never said anything about needing to look down on anyone. Where did that come from?
My concern is that without a solid background in AI no innovation can happen, because innovation means doing something entirely new, and one cannot figure out what "entirely new" means without knowing what has been done before. The people who "are trying out different things to see how they perform", as you say, are forced to do that because that's all you can do when you don't understand what you're doing.
To get the prediction variance in a Bayesian treatment, you integrate over the posterior of the parameters - surely computing or approximating the posterior counts as considering parameter variance?
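In symbols, the posterior predictive is

    p(y \mid x, \mathcal{D}) = \int p(y \mid x, \theta) \, p(\theta \mid \mathcal{D}) \, d\theta

so the spread of p(\theta \mid \mathcal{D}), i.e. the parameter uncertainty, feeds directly into the prediction variance.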
Of course it does. You can put hyperpriors on the priors, and hyper-hyperpriors on the hyperpriors, but the regress has to stop somewhere. What is your point?
I'm not sure I entirely follow your comment; I was merely pointing out that reckoning with parameter uncertainty by "computing or approximating the posterior...", as you said, is not always applicable in probabilistic ML.
Yes, but that's true of all statistics. You have to make some assumptions to get off the ground. If you estimate parameter variance the frequentist way, you also make assumptions about the parameter distribution.
No, this is expressly untrue. In the frequentist paradigm parameters are fixed but unknown; they are not random variables and have no implicit probability distribution associated with them.
An estimator (of a parameter) is a random variable, as it is a function of random variables; however, its distribution depends only on the data distribution, and there is no other implicit distribution on which it depends.
For instance, the maximum likelihood estimator of the mean of a normal distribution is itself normally distributed, but this does not imply that the mean parameter has a normal prior; it has no prior, as it is a fixed quantity.
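A quick simulation makes the distinction concrete (a sketch in numpy; note that only the data are random, the parameter mu never varies):

    import numpy as np

    rng = np.random.default_rng(1)
    mu, sigma, n, trials = 5.0, 2.0, 50, 100_000

    # Repeated sampling from N(mu, sigma^2); mu is a fixed constant throughout.
    # The MLE of the mean is the sample mean, whose sampling distribution is
    # N(mu, sigma^2 / n) -- a property of the estimator, not a prior on mu.
    xbar = rng.normal(mu, sigma, size=(trials, n)).mean(axis=1)
    print(xbar.mean())               # ~ 5.0
    print(xbar.var(), sigma**2 / n)  # both ~ 0.08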
Do you think this book is useful for someone just looking to get more into statistics and probability, sans machine learning? How would I go about that?
Currently I have lined up: Math for Programmers (No Starch Press), Practical Statistics for Data Scientists (O'Reilly, the crab book), and Discovering Statistics Using R.
Basically I'm trying to follow the theory from "Statistical Consequences of Fat Tails" by NNT.
Bourbaki student M. Talagrand has some work on approximate independence. If I were trying to do something along the lines of Probabilistic Machine Learning: Advanced Topics, I would look
(1) carefully at the now classic
L. Breiman et al., Classification and Regression Trees (CART),
and
(2) at the classic Markov limiting results, e.g., as in
E. Çinlar, Introduction to Stochastic Processes,
at least to be sure we are not missing something relevant and powerful,
(3) at some of the work on sufficient statistics, of course, first via the classic Halmos and Savage paper and then at the interesting more recent work in
Robert J. Serfling,
Approximation Theorems of Mathematical Statistics,
and then for the most promising
(4) very carefully at Talagrand.
(1) and (2) are old but a careful look along with more recent work may yield some directions for progress.
What Serfling develops is a bit amazing.
Then don't expect the Talagrand material to be trivial.
For clarification, Murphy's first book is Machine Learning: A Probabilistic Perspective; this is his newest, two-volume book, Probabilistic Machine Learning, which is broken down into two parts: an Introduction (published March 1, 2022) and Advanced Topics (expected to be published in 2023, but with a draft preview available now).
To answer your question: this book is even more complete and a bit improved over the first book. I don't believe there's anything in Machine Learning that isn't well covered in, or correctly omitted from, Probabilistic Machine Learning. This also has the benefit of a few more years of rethinking these topics. So between the existing Murphy books, Probabilistic Machine Learning: An Introduction is probably the one you should have.
Why this over Bishop (which I'm not sure is the right call)? While on the surface they are very similar (very mathematical overviews of ML from a probability-focused perspective), they function as very different books. Murphy is much more of a reference to contemporary ML. If you want to understand how most leading researchers think about and understand ML, and want something covering the mathematical underpinnings, this is the book you need as a reference.
Bishop is a much more opinionated book, in that Bishop isn't just listing out all possible ways of thinking about a problem but really building out a specific view of how probability relates to machine learning. If I'm going to sit down and read a book, it's going to be Bishop, because he has a much stronger voice as an author and thinker. However, Bishop's book is now more than 10 years old and misses out on nearly all of the major progress we've seen in deep learning. That's a lot to be missing, and it won't be rectified in Bishop's perpetual WIP book [0].
A better comparison is not Murphy to Murphy or Murphy to Bishop, but Murphy to Hastie et al. The Elements of Statistical Learning was for many years the standard reference for advanced ML, especially during the brief time when GBDT and random forests were the hot thing (which they still are to an extent in some communities). I really enjoy EoSL, but it does have a very "Stanford statistics" feel to the intuitions (which I find even more aggressively frequentist than your average frequentist). Murphy is really the contemporary computer science/Bayesian understanding of ML that has dominated the top research teams for the last few years. It feels much more modern and should be the replacement reference text for most people.
I read TESL during my Master's and remember being very confused by the way it described decision tree learning. I had been pleased with myself for having a strong grip on decision tree learning before reading TESL, and came away thoroughly confused after reading about trees in it.
Eyeballing the relevant chapter again (9.2), I think that may have been because it introduces decision tree learning with CART (the algorithm), whereas I was more familiar with ID3 and C4.5. Perhaps it's simpler to describe CART as TESL does, but decision trees are a propositional logic "model" (in truth, a theory), and for me that is the natural way to describe them. I also get the feeling that Quinlan's work is sidelined a little, perhaps because he was coming from a more classical AI background and that's pooh-poohed in statistical learning circles. If so, that's a bit of a shame and a bit of an omission. Machine learning is not just statistics and it's not just AI; it's a little bit of both, and one needs at least some background in both subjects to understand what's really going on. But perhaps it's the data mining/data science angle that I find a bit one-sided.
Sorry to digress. I'm so excited when people discuss actual textbooks on HN.
I’m in agreement with much of your post.
The Elements of Statistical Learning played its role quite well years ago but a fresher take is needed.
Thanks for the response.
Echoing others, thank you for writing this (as someone doing an applied math masters and digging into ML - I have used ESL for a class but not the others you mention)
No, this is the second volume of "Probabilistic Machine Learning", the first volume of which was just published this week. The two-volume set can be seen as a complete rewrite/replacement for "Machine Learning: A Probabilistic Perspective".
ePUB is notoriously bad at displaying mathematics. It also takes away the author's control of the page layout. To me there is nothing more satisfying than a well-crafted PDF.
One nice upside to having tex source is that you can set the page size to match e.g., a phone screen. Reading standard pdf textbooks and papers on a phone isn’t very fun.
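Something like this in the preamble does the trick (a minimal sketch using the standard geometry package; the dimensions are just a guess at a phone-ish aspect ratio):

    % phone-sized pages, readable without zooming or panning
    \usepackage[paperwidth=9cm, paperheight=16cm, margin=5mm]{geometry}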
I used to do this for reading arxiv preprints, but the script I wrote was kind of brittle and it doesn’t really work out with figures anyway
Honestly if the scientific community moved to something that could be interactive and reflowable I would be so happy
I'm fairly certain that PDF was generated using LaTeX, everyone in academia uses it. Besides, it's not fair to complain about formatting in a very early draft.