----------------------------------------------
S13j. Priors and entropy in probability theory
----------------------------------------------

For a probability distribution on a finite set of alternatives, given by probabilities p_n summing to 1, the Shannon entropy is defined by S = - sum_n p_n log_2 p_n.

The main use of the entropy concept is the maximum entropy principle, used to define various interesting ensembles by maximizing the entropy subject to constraints defined by known expectation values <f> = sum_n p_n f(n) for certain key observables f.

If the number of alternatives is infinite, this formula must be appropriately generalized. In the literature, one finds various possibilities, the most common being, for random vectors with probability density p(x), the absolute entropy S = - k_B integral dx p(x) log p(x), with the Boltzmann constant k_B and the Lebesgue measure dx. The value of the Boltzmann constant k_B is conventional and has no effect on the use of entropy in applications. There is also the relative entropy S = - k_B integral dx p(x) log (p(x)/p_0(x)), which involves an arbitrary positive function p_0(x). If p_0(x) is a probability density then the relative entropy is nonpositive, since integral dx p(x) log (p(x)/p_0(x)) >= 0 by Gibbs' inequality.

For a probability distribution over an _arbitrary_ sigma algebra of events, the absolute entropy makes no sense since there is no distinguished measure and hence no meaningful absolute probability density. One needs to assume a measure to be able to define a probability density (namely as the Radon-Nikodym derivative, assuming it exists). This measure is called the prior; it is often improper, i.e., not normalizable to a probability density. Once one has specified a prior dmu, <f> = integral dmu(x) rho(x) f(x) defines the density rho(x), and then S(rho) = <-k_B log rho(x)> defines the entropy with respect to this prior. Note that the condition for rho to define a probability density is integral dmu(x) rho(x) = <1> = 1.

In many cases, symmetry considerations suggest a unique natural prior. For random variables on a locally compact homogeneous space (such as the real line, the circle, n-dimensional space, or the n-dimensional sphere), the conventional measure is the invariant Haar measure.

In particular, for the probability theory of finitely many alternatives, it is conventional to consider the symmetric group on the set of alternatives and to take as (proper) prior the uniform measure, giving <f> = sum_x rho(x) f(x). The density rho(x) then agrees with the probability p_x, and the corresponding entropy is the Shannon entropy if one takes k_B = 1/log 2.

For random variables whose support is R or R^n, the conventional symmetry group is the translation group, and the corresponding (improper) prior is the Lebesgue measure. In this case one obtains the absolute entropy given above. But one could also take as prior a noninvariant measure dmu(x) = dx p_0(x); then the density becomes rho(x) = p(x)/p_0(x), and one arrives at the relative entropy.
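As a small illustration of these definitions in the finite case, here is a Python sketch (not part of the original text; the function name entropy_wrt_prior and the example distribution are made up for this illustration). It computes the entropy of a discrete distribution with respect to a prior: with the uniform counting measure and k_B = 1/log 2 it reproduces the Shannon entropy in bits, while with a proper prior it gives the (nonpositive) relative entropy.

  # Sketch: entropy of a discrete distribution p with respect to a prior mu,
  #   S = -k_B * sum_x p(x) log( p(x)/mu(x) ),
  # where rho(x) = p(x)/mu(x) is the density of p with respect to mu.
  import numpy as np

  def entropy_wrt_prior(p, mu=None, k_B=1.0):
      """Entropy of the probability vector p relative to the prior measure mu.

      mu=None means the uniform counting measure; the result is then the
      Shannon entropy -k_B * sum_x p(x) log p(x) (in bits for k_B = 1/log 2).
      """
      p = np.asarray(p, dtype=float)
      mu = np.ones_like(p) if mu is None else np.asarray(mu, dtype=float)
      rho = p / mu                # density rho(x) = p(x)/mu(x)
      mask = p > 0                # convention: 0 log 0 = 0
      return -k_B * np.sum(p[mask] * np.log(rho[mask]))

  p = np.array([0.5, 0.25, 0.125, 0.125])       # example distribution
  bits = 1 / np.log(2)                          # k_B = 1/log 2
  print(entropy_wrt_prior(p, k_B=bits))         # Shannon entropy: 1.75 bits
  print(entropy_wrt_prior(p, mu=np.full(4, 0.25), k_B=bits))  # relative entropy: -0.25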
If there is no natural transitive symmetry group, there is no natural prior, and one has to make other useful choices. In particular, this is the case for random natural numbers.

Choice A. Treating the natural numbers as a limiting case of the finite intervals [0:N] suggests using the measure with integral dmu(x) phi(x) = sum_n phi(n) as (improper) prior, making <f> = sum_n rho(n) f(n) the definition of the density; in this case, p_n = rho(n) is the probability of getting n.

Choice B. Statistical mechanics suggests using instead as (proper) prior a measure with integral dmu(x) phi(x) = sum_n h^n phi(n)/n!, where h is Planck's constant, making <f> = sum_n rho(n) h^n f(n)/n! the definition of the density; in this case, p_n = h^n rho(n)/n! is the probability of getting n.

The maximum entropy ensemble defined by given expectations depends on the prior chosen. In particular, if the mean of a random natural number is given, choice A leads to a geometric distribution, while choice B leads to a Poisson distribution (a numerical illustration is sketched at the end of this section). The latter is the one relevant for statistical mechanics. Indeed, choice B is the prior needed in the statistical mechanics of systems with an indefinite number n of particles to get the 'correct Boltzmann counting' in the grand canonical ensemble. With choice A, the maximum entropy solution is unrelated to the distributions arising in statistical mechanics. Thus while the geometric distribution has greater Shannon entropy than the Poisson distribution with the same mean, this is irrelevant for classical physics. In statistical physics with an indeterminate number of particles, only the relative entropy corresponding to choice B is meaningful. (In the quantum physics of systems with discrete spectrum, however, the microcanonical ensemble is the right prior, and then Shannon's entropy is the correct one.)

The identification of 'information' with 'Shannon entropy' is dubious for situations with infinitely many alternatives. Shannon assumes in his analysis that, in the absence of knowledge, all alternatives are equally likely, which makes no sense in the infinite case and may even be debated in the finite case.

(One of the problems of a subjective, Bayesian approach to probability is that one always needs a prior before information theoretic arguments make sense. If there is doubt about the prior, the results become doubtful, too. Since information theory in statistical mechanics works out correctly _only_ if one uses the right prior (choice B) and the right knowledge (expectations of the additive conserved quantities in the equilibrium case), both the prior and the knowledge are objectively determined. But this is strange for a subjective approach such as the information theoretic one, and casts doubt on the relevance of information theory in the foundations.)
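To make the dependence on the prior concrete, here is a small numerical check in Python (a sketch only, not from the original text; the truncation N_MAX, the prescribed mean MEAN, the helper rel_entropy, and the choice h = 1 are assumptions made for this illustration). For candidate distributions with the same mean, the geometric distribution has the larger entropy relative to the choice-A prior, while the Poisson distribution has the larger entropy relative to the choice-B prior.

  # Sketch: entropy relative to the two priors for random natural numbers,
  #   choice A: mu_n = 1         (counting measure)
  #   choice B: mu_n = h^n / n!  (here with h = 1)
  # S = -sum_n p_n log( p_n / mu_n ), with the sums truncated at N_MAX.
  import numpy as np
  from scipy.stats import geom, poisson
  from scipy.special import gammaln

  N_MAX = 200      # truncation of the infinite sums (tails are negligible here)
  MEAN = 3.0       # prescribed expectation <n>
  H = 1.0          # value of h used for the choice-B prior in this illustration
  n = np.arange(N_MAX)

  def rel_entropy(p, log_mu):
      """S = -sum_n p_n log(p_n / mu_n), with the prior given via log mu_n."""
      mask = p > 0
      return -np.sum(p[mask] * (np.log(p[mask]) - log_mu[mask]))

  log_mu_A = np.zeros(N_MAX)                    # choice A: mu_n = 1
  log_mu_B = n * np.log(H) - gammaln(n + 1)     # choice B: mu_n = h^n / n!

  # two candidate distributions on n = 0, 1, 2, ... with the same mean
  p_geom = geom.pmf(n + 1, 1.0 / (MEAN + 1.0))  # geometric, mean = MEAN
  p_pois = poisson.pmf(n, MEAN)                 # Poisson, mean = MEAN

  print("choice A:", rel_entropy(p_geom, log_mu_A), ">", rel_entropy(p_pois, log_mu_A))
  print("choice B:", rel_entropy(p_pois, log_mu_B), ">", rel_entropy(p_geom, log_mu_B))

The outcome reflects the Lagrange multiplier calculation: maximizing -sum_n p_n log(p_n/mu_n) subject to sum_n p_n = 1 and sum_n n p_n = <n> gives p_n proportional to mu_n e^(-lambda n), which is a geometric distribution for choice A and a Poisson distribution for choice B.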