I studied nonparametric Bayes for a while, but never quite understood why it's called "nonparametric," or what the Dirichlet process really is. Today I came across a wonderful article that walks through all of it in a clear, accessible way.
Why "nonparametric": traditional clustering requires the number of clusters to be fixed at the start, whereas "nonparametric" means new clusters keep emerging as more and more data arrives.
Better still, the article builds up the connection between the Chinese restaurant problem and the Dirichlet process.
Imagine you’re a budding chef. A data-curious one, of course, so you start by taking a set of foods (pizza, salad, spaghetti, etc.) and ask 10 friends how much of each they ate in the past day.
Your goal: to find natural groups of foodies, so that you can better cater to each cluster’s tastes. For example, your fratboy friends might love pizza, your anime friends might love soba and sushi, your hipster friends probably dig tofu, and so on.
So how can you use the data you’ve gathered to discover different kinds of groups?
One way is to use a standard clustering algorithm like k-means or Gaussian mixture modeling. The problem is that these both assume a fixed number of clusters, which they need to be told to find. There are methods for selecting the number of clusters to learn, but the problem is a more fundamental one: most real-world data simply doesn’t have a fixed number of clusters.
That is, suppose we’ve asked 10 of our friends what they ate in the past day, and we want to find groups of eating preferences. There’s really an infinite number of foodie types (carnivore, vegan, snacker, Italian, healthy, fast food, heavy eaters, light eaters, and so on), but with only 10 friends, we simply don’t have enough data to detect them all. (Indeed, we’re limited to 10 clusters!) So whereas k-means starts with the incorrect assumption that there’s a fixed, finite number of clusters our points come from, no matter how much data we feed it, what we’d really like is a method that posits an infinite number of hidden clusters which naturally arise as we ask more friends about their food habits. (For example, with only 2 data points, we might not be able to tell the difference between vegans and vegetarians, but with 200 data points, we probably could.)
Luckily for us, this is precisely the purview of nonparametric Bayes.*
*Nonparametric Bayes refers to a class of techniques that allow some parameters to change with the data. In our case, for example, instead of fixing the number of clusters to be discovered, we allow it to grow as more data comes in.
Let’s describe a generative model for finding clusters in any set of data. We assume an infinite set of latent groups, where each group is described by some set of parameters. For example, each group could be a Gaussian with a specified mean μi and standard deviation σi, and these group parameters themselves are assumed to come from some base distribution G0. Data is then generated in the following manner: first, select one of the (infinitely many) groups; then, sample a point from that group’s distribution.
(Note the resemblance to a standard finite mixture model, except that the number of groups here is unbounded.)
For example, suppose we ask 10 friends how many calories of pizza, salad, and rice they ate yesterday. Our groups could be a pizza-lover Gaussian (centered on a high number of pizza calories), an Asian-food Gaussian (centered on a high number of rice calories), a salad-lover Gaussian, and so on.
When deciding what to eat when she woke up yesterday, Alice could have thought girl, I’m in the mood for pizza and her food consumption yesterday would have been a sample from the pizza Gaussian. Similarly, Bob could have spent the day in Chinatown, thereby sampling from the Asian Gaussian for his day’s meals. And so on.
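To make the story concrete, here’s a minimal R sketch of this generative process. The two groups and their means and standard deviations are made up for illustration; in the full model, the group parameters would themselves be draws from the base distribution G0.

```r
# Two illustrative groups, each a Gaussian over (pizza, salad, rice) calories.
# These parameters are stand-ins for draws from the base distribution G0.
set.seed(42)
groups <- list(
  pizza = list(mean = c(1200, 100, 100), sd = c(200, 50, 50)),
  asian = list(mean = c(100, 200, 900),  sd = c(50, 80, 150))
)

# Generate one friend's day of eating: pick a group, then sample that
# friend's (pizza, salad, rice) calories from the group's Gaussian.
sample_friend <- function() {
  g <- groups[[sample(length(groups), 1)]]
  rnorm(3, mean = g$mean, sd = g$sd)
}

t(replicate(10, sample_friend()))  # 10 friends, one row of calories each
```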
The big question, then, is: how do we assign each friend to a group?
One way to assign friends to groups is to use a Chinese Restaurant Process. This works as follows: imagine a restaurant with an infinite number of tables, where all your friends went to eat yesterday. The first customer to arrive sits at the first table. Each subsequent customer either joins an occupied table, with probability proportional to the number of people already sitting there, or opens a new table, with probability proportional to a dispersion parameter α. That is, if n customers are already seated, customer n+1 joins table k with probability nk/(n+α) (where nk is the number of people at table k), and opens a new table with probability α/(n+α).
Note a couple things: tables satisfy a rich-get-richer property (the more people already sit at a table, the more likely new customers are to join it), and the dispersion parameter α controls how often new tables get opened, i.e., how many distinct groups we expect to see.
(Also notice the resemblance between table selection probabilities and a Dirichlet distribution…)
Just to summarize, given n data points, the Chinese Restaurant Process specifies a distribution over partitions (table assignments) of these points. We can also generate parameters for each partition/table from a base distribution G0 (for example, each table could represent a Gaussian whose mean and standard deviation are sampled from G0 ), though to be clear, this is not part of the CRP itself.
Since code makes everything better, let’s simulate a CRP.
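Here’s a minimal R sketch of the simulation (the function and parameter names are my own, not the original snippet’s):

```r
# Seat num_customers customers one at a time. Customer n joins an occupied
# table with probability proportional to the number of people already there,
# or opens a new table with probability proportional to alpha.
crp <- function(num_customers, alpha) {
  table_of <- integer(num_customers)
  table_of[1] <- 1                   # the first customer sits at table 1
  num_tables <- 1
  for (n in seq_len(num_customers)[-1]) {
    counts <- tabulate(table_of[1:(n - 1)], nbins = num_tables)
    # table k with prob counts[k]/(n-1+alpha); a new table with prob
    # alpha/(n-1+alpha) -- sample() normalizes the weights for us
    choice <- sample(num_tables + 1, 1, prob = c(counts, alpha))
    if (choice > num_tables) num_tables <- num_tables + 1
    table_of[n] <- choice
  }
  table_of
}

crp(num_customers = 20, alpha = 1)   # one table assignment per customer
```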
Run it a few times with different values of α, and notice that as we increase α, so too does the number of distinct tables increase.
Another method for assigning friends to groups is to follow the Polya Urn Model. This is basically the same model as the Chinese Restaurant Process, just with a different metaphor: start with an empty urn. At each step, either draw a brand-new color from the base distribution G0, with probability α/(α+n) (where n is the number of balls currently in the urn), and drop a ball of that color into the urn; or, with probability n/(α+n), pick a ball at random from the urn and drop both it and a new ball of the same color back in.
Note the connection between this process and the CRP: balls correspond to people (i.e., data points), colors correspond to table assignments (i.e., clusters), alpha is again a dispersion parameter (put differently, a prior), colors satisfy a rich-get-richer property (since colors with many balls are more likely to get drawn), and so on. (Again, as with the table selection probabilities above, there’s also a connection between this urn model and the Dirichlet distribution…)
To be precise, the difference between the CRP and the Polya Urn Model is that the CRP specifies only a distribution over partitions (i.e., table assignments), but doesn’t assign parameters to each group, whereas the Polya Urn Model does both.
Again, here’s some code for simulating a Polya Urn Model:
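A minimal R sketch of the urn (names are my own; base_color stands in for the base distribution G0):

```r
# Fill an urn with num_balls balls. At each step, either draw a brand-new
# color from the base distribution (with probability alpha / (alpha + n)),
# or copy the color of a randomly drawn existing ball (rich-get-richer).
polya_urn <- function(num_balls, alpha, base_color = runif) {
  urn <- numeric(0)
  for (n in seq_len(num_balls)) {
    if (runif(1) < alpha / (alpha + length(urn))) {
      urn <- c(urn, base_color(1))                    # new color from G0
    } else {
      urn <- c(urn, urn[sample.int(length(urn), 1)])  # copy an existing ball
    }
  }
  urn
}

polya_urn(10, alpha = 1)   # base: uniform on the unit interval
# density of colors under a unit-normal base (cf. the plots discussed below):
# plot(density(polya_urn(1000, alpha = 5, base_color = rnorm)))
```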
With a uniform distribution over the unit interval as the base color distribution, each run fills the urn with a different mix of colors. Switching to a unit normal as the base color distribution, we can look at density plots of the colors in the urn for varying values of alpha.
Notice that as alpha increases (i.e., we sample more new ball colors from our base; i.e., as we place more weight on our prior), the colors in the urn tend to a unit normal (our base color distribution).
Plotting the points generated by the urn for varying values of alpha, notice that the points clump together in fewer clusters for low values of alpha, but become more dispersed as alpha increases.
Imagine running either the Chinese Restaurant Process or the Polya Urn Model without stopping. For each group i, this gives the proportion wi of points that fall into group i.
So instead of running the CRP or Polya Urn model to figure out these proportions, can we simply generate them directly?
This is exactly what the Stick-Breaking Process does. Take a stick of length 1. Sample β1 from a Beta(1, α) distribution, and break off a fraction w1 = β1 of the stick; this is the weight of the first group. Now sample β2 from Beta(1, α) and break off a fraction w2 = β2(1 − β1) of the remaining stick; this is the weight of the second group. Keep breaking off pieces of whatever remains forever, giving weights w1, w2, w3, … that sum to 1.
Thus, the Stick-Breaking process is simply the CRP or Polya Urn Model from a different point of view. For example, assigning customers to table 1 according to the Chinese Restaurant Process is equivalent to assigning customers to table 1 with probability w1 .
Here’s some R code for simulating a Stick-Breaking process:
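A minimal sketch, truncated at a fixed number of sticks (the ideal process never stops breaking):

```r
# Break a stick of length 1 into num_weights pieces: the k-th piece is a
# Beta(1, alpha) fraction of whatever stick remains after the first k-1 breaks.
stick_breaking <- function(num_weights, alpha) {
  betas <- rbeta(num_weights, 1, alpha)    # beta_k ~ Beta(1, alpha)
  remaining <- c(1, cumprod(1 - betas))[1:num_weights]
  betas * remaining                        # w_k = beta_k * prod_{j<k} (1 - beta_j)
}

round(stick_breaking(10, alpha = 1), 3)   # most weight on the first few sticks
round(stick_breaking(10, alpha = 5), 3)   # weight spread more evenly
```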
Notice that for low values of alpha, the stick weights are concentrated on the first few weights (meaning our data points are concentrated on a few clusters), while the weights become more evenly dispersed as we increase alpha (meaning we posit more clusters in our data points).
Suppose we run a Polya Urn Model several times, where we sample colors from a base distribution G0 . Each run produces a distribution of colors in the urn (say, 5% blue balls, 3% red balls, 2% pink balls, etc.), and the distribution will be different each time (for example, 5% blue balls in run 1, but 1% blue balls in run 2).
For example, consider again the density plots described above, where I generated samples from a Polya Urn Model with the standard unit normal as the base distribution. Each run of the Polya Urn Model produces a slightly different distribution, though each is “centered” in some fashion around the standard Gaussian used as the base. In other words, the Polya Urn Model gives us a distribution over distributions (we get a distribution of ball colors, and this distribution of colors changes each time) – and so we finally get to the Dirichlet Process.
Formally, given a base distribution G0 and a dispersion parameter α, a sample from the Dirichlet Process DP(G0, α) is a distribution G ∼ DP(G0, α). This sample G can be thought of as the distribution of colors in a single simulation of the Polya Urn Model; sampling from G gives us the balls in the urn.
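To make this concrete in code, one way to (approximately) materialize a single draw G from DP(G0, α) is to pair truncated stick-breaking weights with atoms drawn from G0, and then sample ball colors from G. A sketch, reusing the stick_breaking function above:

```r
# Approximate one draw G ~ DP(G0, alpha) as a discrete distribution:
# atoms (colors) come from the base G0, weights from truncated stick-breaking.
sample_dp <- function(alpha, base = rnorm, truncation = 1000) {
  weights <- stick_breaking(truncation, alpha)
  atoms <- base(truncation)
  # the returned function samples from G, i.e., draws balls from one urn run
  # (sample() renormalizes the slightly-truncated weights)
  function(n) sample(atoms, n, replace = TRUE, prob = weights)
}

G <- sample_dp(alpha = 5, base = rnorm)
G(10)   # ten "balls" drawn from this particular sample G
```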
So here’s the connection between the Chinese Restaurant Process, the Polya Urn Model, the Stick-Breaking Process, and the Dirichlet Process: sampling a table assignment from the CRP is like drawing a ball from the urn, which is like picking a stick segment with probability equal to its length. The CRP describes only the partition of points into groups; the Polya Urn and Stick-Breaking views additionally attach parameters (colors) to those groups; and the distribution over group parameters that a single urn run produces is precisely a single sample G from the Dirichlet Process.
Let’s summarize what we’ve discussed so far.
We have a bunch of data points pi that we want to cluster, and we’ve described four essentially equivalent generative models that allow us to describe how each cluster and point could have arisen.
In the Chinese Restaurant Process: we generate a table assignment (i.e., a cluster assignment) for each point by seating customers according to a CRP with dispersion α; we generate table parameters by sampling from the base distribution G0 for each distinct table; and each point is then sampled from the distribution of its assigned table.
In the Polya Urn Model: we generate a color (i.e., a set of cluster parameters) for each point by drawing balls from a Polya urn with base G0 and dispersion α (colors need not be distinct); each point is then sampled from the distribution parameterized by its ball’s color.
In the Stick-Breaking Process: we generate group probabilities (stick lengths) w1, w2, … using dispersion parameter α; we generate group parameters by sampling from G0 for each group; we assign each point to group i with probability wi; and each point is then sampled from the distribution of its assigned group.
In the Dirichlet Process: we sample a distribution G ∼ DP(G0, α); we generate group parameters for each point by sampling from G (again, parameters need not be distinct); and each point is then sampled from the distribution parameterized by its group parameters.
Also, remember that each model naturally allows the number of clusters to grow as more points come in.
So we’ve described a generative model that allows us to calculate the probability of any particular set of group assignments to data points, but we haven’t described how to actually learn a good set of group assignments.
Let’s briefly do this now. Very roughly, the Gibbs sampling approach works as follows: start by assigning every point to a random cluster (or all points to a single cluster). Then repeatedly pick a point, remove it from its current cluster, and reassign it, either to an existing cluster with probability proportional to the number of points already in that cluster times the likelihood of the point under that cluster, or to a brand-new cluster with probability proportional to α times the likelihood of the point under parameters freshly drawn from G0. After enough iterations, the cluster assignments approximate a sample from the posterior over clusterings.
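Here’s a rough sketch of such a sampler in R for a one-dimensional mixture of Gaussians, collapsing out the cluster means under a normal base distribution. The likelihood, priors, and names are my own illustrative choices, not the setup used in the experiments below:

```r
# Collapsed Gibbs sampling for a 1-D DP mixture of Gaussians with known
# within-cluster sd `sigma`; cluster means are integrated out under a
# N(mu0, tau^2) base distribution.
dp_gibbs <- function(x, alpha = 1, sigma = 1, mu0 = 0, tau = 3, iters = 200) {
  n <- length(x)
  z <- rep(1L, n)                        # start with everyone in one cluster
  for (it in seq_len(iters)) {
    for (i in seq_len(n)) {
      z_others <- z[-i]
      labels <- sort(unique(z_others))
      # posterior predictive density of x[i] under each existing cluster
      pred <- sapply(labels, function(k) {
        y <- x[-i][z_others == k]
        v <- 1 / (1 / tau^2 + length(y) / sigma^2)   # posterior var of mean
        m <- v * (mu0 / tau^2 + sum(y) / sigma^2)    # posterior mean of mean
        dnorm(x[i], m, sqrt(v + sigma^2))
      })
      sizes <- tabulate(match(z_others, labels))
      new_pred <- dnorm(x[i], mu0, sqrt(tau^2 + sigma^2))
      w <- c(sizes * pred, alpha * new_pred)         # CRP prior x likelihood
      pick <- sample(length(w), 1, prob = w)
      z[i] <- if (pick <= length(labels)) labels[pick] else max(z) + 1L
    }
  }
  z
}

x <- c(rnorm(30, mean = -4), rnorm(30, mean = 4))    # two well-separated groups
table(dp_gibbs(x))                                   # sizes of inferred clusters
```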
For more details, Philip Resnik and Eric Hardisty’s “Gibbs Sampling for the Uninitiated” is a friendly, more general description of Gibbs sampling (plus an application to naive Bayes).
Finally, let’s show an application of the Dirichlet Process Mixture. Unfortunately, I didn’t have a data set of people’s food habits on hand, so instead I took a dataset of McDonald’s foods and nutrition facts.
After normalizing each item to have an equal number of calories, and representing each item as a vector of (total fat, cholesterol, sodium, dietary fiber, sugars, protein, vitamin A, vitamin C, calcium, iron, calories from fat, saturated fat, trans fat, carbohydrates), I ran a Dirichlet process Gaussian mixture model to cluster McDonald’s menu items based on nutritional value.
First, how does the number of clusters inferred by the Dirichlet Process mixture vary as we feed in more (randomly ordered) points?
As expected, the Dirichlet Process model discovers more and more clusters as more and more food items arrive. (And indeed, the number of clusters appears to grow logarithmically, which can in fact be proved: the expected number of CRP tables after n points grows like α log n.)
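As a quick sanity check: the chance that customer i opens a new table in a CRP is α/(α + i − 1), so the expected number of tables after n customers is the sum of these terms, which grows like α log n. Reusing the crp sketch from earlier:

```r
alpha <- 1
n <- 1000
sum(alpha / (alpha + 0:(n - 1)))          # exact expected number of tables
alpha * log(n)                            # the logarithmic approximation
mean(replicate(50, max(crp(n, alpha))))   # average table count over 50 runs
```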
How many clusters does the mixture model infer from the entire dataset? Running the Gibbs sampler several times, we find that the number of clusters tends to be around 11:
Let’s dive into one of these clusterings.
Cluster 1 (Desserts)
Looking at a sample of foods from the first cluster, we find a lot of desserts and dessert-y drinks:
We can also look at the nutritional profile of some foods from this cluster (after normalizing each nutrition dimension to have mean 0 and standard deviation 1):
We see that foods in this cluster tend to be high in trans fat and low in vitamins, protein, fiber, and sodium.
Cluster 2 (Sauces)
Here’s a sample from the second cluster, which contains a lot of sauces:
And looking at the nutritional profile of points in this cluster, we see that it’s heavy in sodium and fat:
Cluster 3 (Burgers, Crispy Foods, High-Cholesterol)
The third cluster is very burgery:
It’s also high in fat and sodium, and low in carbs and sugar:
Cluster 4 (Creamy Sauces)
Interestingly, even though we already found a cluster of sauces above, we discover another one as well. These sauces appear to be much more cream-based:
Nutritionally, these sauces are higher in calories from fat, and much lower in sodium:
Cluster 5 (Salads)
Here’s a salad cluster. A lot of salads also appeared in the third cluster (along with hamburgers and McMuffins), but that’s because those salads all contained crispy chicken. The salads in this cluster are either free of crispy chicken or have their chicken grilled instead:
This is reflected in the higher content of iron, vitamin A, and fiber:
Cluster 6 (More Sauces)
Again, we find another cluster of sauces:
These are still high in sodium, but much lower in fat compared to the other sauce clusters:
Cluster 7 (Fruit and Maple Oatmeal)
Amusingly, fruit and maple oatmeal is in a cluster by itself:
Cluster 8 (Sugary Drinks)
We also get a cluster of sugary drinks:
In addition to high sugar content, this cluster is also high in carbohydrates and calcium, and low in fat.
Cluster 9 (Breakfast Foods)
Here’s a cluster of high-cholesterol breakfast foods:
Cluster 10 (Coffee Drinks)
We find a group of coffee drinks next:
These are much higher in calcium and protein, and lower in sugar, than the other drink cluster above:
Cluster 11 (Apples)
Here’s a cluster of apples:
Vitamin C, check.
And finally, here’s an overview of all the clusters at once (using a different clustering run):
And that’s it.