----------------------------------------------
S13j. Priors and entropy in probability theory
----------------------------------------------

For a probability distribution on a finite set of alternatives, given by probabilities p_n summing to 1, the Shannon entropy is defined by S = - sum_n p_n log_2 p_n.

The main use of the entropy concept is the maximum entropy principle, used to define various interesting ensembles by maximizing the entropy subject to constraints defined by known expectation values <f> = sum_n p_n f(n) for certain key observables f.

If the number of alternatives is infinite, this formula must be appropriately generalized. In the literature, one finds various possibilities, the most common being, for random vectors with probability density p(x), the absolute entropy S = - k_B integral dx p(x) log p(x), with the Boltzmann constant k_B and the Lebesgue measure dx. The value of the Boltzmann constant k_B is conventional and has no effect on the use of entropy in applications. There is also the relative entropy S = - k_B integral dx p(x) log (p(x)/p_0(x)), which involves an arbitrary positive function p_0(x). If p_0(x) is a probability density then the relative entropy is nonpositive, since integral dx p(x) log (p(x)/p_0(x)) >= 0 by Gibbs' inequality.

For a probability distribution over an _arbitrary_ sigma algebra of events, the absolute entropy makes no sense since there is no distinguished measure and hence no meaningful absolute probability density. One needs to assume a measure to be able to define a probability density (namely as the Radon-Nikodym derivative, assuming it exists). This measure is called the prior; it is often improper, i.e., not normalizable to a probability density. Once one has specified a prior dmu, <f> = integral dmu(x) rho(x) f(x) defines the density rho(x), and then S(rho) = <-k_B log rho(x)> defines the entropy with respect to this prior. Note that the condition for rho to define a probability density is integral dmu(x) rho(x) = <1> = 1.

In many cases, symmetry considerations suggest a unique natural prior. For random variables on a locally compact homogeneous space (such as the real line, the circle, n-dimensional space, or the n-dimensional sphere), the conventional measure is the invariant Haar measure.

In particular, for the probability theory of finitely many alternatives, it is conventional to consider the symmetric group on the set of alternatives and to take as (proper) prior the uniform measure, giving <f> = sum_x rho(x) f(x). The density rho(x) then agrees with the probability p_x, and the corresponding entropy is the Shannon entropy if one takes k_B = 1/log 2.

For random variables whose support is R or R^n, the conventional symmetry group is the translation group, and the corresponding (improper) prior is the Lebesgue measure. In this case one obtains the absolute entropy given above. But one could also take as prior a noninvariant measure dmu(x) = dx p_0(x); then the density becomes rho(x) = p(x)/p_0(x), and one arrives at the relative entropy.
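As a small illustration of these definitions in the finite case, here is a Python sketch (not part of the original text; the function name entropy_wrt_prior and the example distribution are made up for this illustration). It computes the entropy of a discrete distribution with respect to a prior: with the uniform counting measure and k_B = 1/log 2 it reproduces the Shannon entropy in bits, while with a proper prior it gives the (nonpositive) relative entropy.

  # Sketch: entropy of a discrete distribution p with respect to a prior mu,
  #   S = -k_B * sum_x p(x) log( p(x)/mu(x) ),
  # where rho(x) = p(x)/mu(x) is the density of p with respect to mu.
  import numpy as np

  def entropy_wrt_prior(p, mu=None, k_B=1.0):
      """Entropy of the probability vector p relative to the prior measure mu.

      mu=None means the uniform counting measure; the result is then the
      Shannon entropy -k_B * sum_x p(x) log p(x) (in bits for k_B = 1/log 2).
      """
      p = np.asarray(p, dtype=float)
      mu = np.ones_like(p) if mu is None else np.asarray(mu, dtype=float)
      rho = p / mu                # density rho(x) = p(x)/mu(x)
      mask = p > 0                # convention: 0 log 0 = 0
      return -k_B * np.sum(p[mask] * np.log(rho[mask]))

  p = np.array([0.5, 0.25, 0.125, 0.125])       # example distribution
  bits = 1 / np.log(2)                          # k_B = 1/log 2
  print(entropy_wrt_prior(p, k_B=bits))         # Shannon entropy: 1.75 bits
  print(entropy_wrt_prior(p, mu=np.full(4, 0.25), k_B=bits))  # relative entropy: -0.25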
If there is no natural transitive symmetry group, there is no natural prior, and one has to make other useful choices. In particular, this is the case for random natural numbers.

Choice A. Treating the natural numbers as a limiting case of the finite intervals [0:N] suggests using the measure with integral dmu(x) phi(x) = sum_n phi(n) as (improper) prior, making <f> = sum_n rho(n) f(n) the definition of the density; in this case, p_n = rho(n) is the probability of getting n.

Choice B. Statistical mechanics suggests using instead as (proper) prior a measure with integral dmu(x) phi(x) = sum_n h^n phi(n)/n!, where h is Planck's constant, making <f> = sum_n rho(n) h^n f(n)/n! the definition of the density; in this case, p_n = h^n rho(n)/n! is the probability of getting n.

The maximum entropy ensemble defined by given expectations depends on the prior chosen. In particular, if the mean of a random natural number is given, choice A leads to a geometric distribution, while choice B leads to a Poisson distribution (a numerical illustration is sketched at the end of this section). The latter is the one relevant for statistical mechanics. Indeed, choice B is the prior needed in the statistical mechanics of systems with an indefinite number n of particles to get the 'correct Boltzmann counting' in the grand canonical ensemble. With choice A, the maximum entropy solution is unrelated to the distributions arising in statistical mechanics. Thus while the geometric distribution has greater Shannon entropy than the Poisson distribution with the same mean, this is irrelevant for classical physics. In statistical physics with an indeterminate number of particles, only the relative entropy corresponding to choice B is meaningful. (In the quantum physics of systems with discrete spectrum, however, the microcanonical ensemble is the right prior, and then Shannon's entropy is the correct one.)

The identification of 'information' with 'Shannon entropy' is dubious for situations with infinitely many alternatives. Shannon assumes in his analysis that, in the absence of knowledge, all alternatives are equally likely, which makes no sense in the infinite case and may even be debated in the finite case.

(One of the problems of a subjective, Bayesian approach to probability is that one always needs a prior before information theoretic arguments make sense. If there is doubt about the prior, the results become doubtful, too. Since information theory in statistical mechanics works out correctly _only_ if one uses the right prior (choice B) and the right knowledge (expectations of the additive conserved quantities in the equilibrium case), both the prior and the knowledge are objectively determined. But this is strange for a subjective approach such as the information theoretic one, and casts doubt on the relevance of information theory in the foundations.)
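To make the dependence on the prior concrete, here is a small numerical check in Python (a sketch only, not from the original text; the truncation N_MAX, the prescribed mean MEAN, the helper rel_entropy, and the choice h = 1 are assumptions made for this illustration). For candidate distributions with the same mean, the geometric distribution has the larger entropy relative to the choice-A prior, while the Poisson distribution has the larger entropy relative to the choice-B prior.

  # Sketch: entropy relative to the two priors for random natural numbers,
  #   choice A: mu_n = 1         (counting measure)
  #   choice B: mu_n = h^n / n!  (here with h = 1)
  # S = -sum_n p_n log( p_n / mu_n ), with the sums truncated at N_MAX.
  import numpy as np
  from scipy.stats import geom, poisson
  from scipy.special import gammaln

  N_MAX = 200      # truncation of the infinite sums (tails are negligible here)
  MEAN = 3.0       # prescribed expectation <n>
  H = 1.0          # value of h used for the choice-B prior in this illustration
  n = np.arange(N_MAX)

  def rel_entropy(p, log_mu):
      """S = -sum_n p_n log(p_n / mu_n), with the prior given via log mu_n."""
      mask = p > 0
      return -np.sum(p[mask] * (np.log(p[mask]) - log_mu[mask]))

  log_mu_A = np.zeros(N_MAX)                    # choice A: mu_n = 1
  log_mu_B = n * np.log(H) - gammaln(n + 1)     # choice B: mu_n = h^n / n!

  # two candidate distributions on n = 0, 1, 2, ... with the same mean
  p_geom = geom.pmf(n + 1, 1.0 / (MEAN + 1.0))  # geometric, mean = MEAN
  p_pois = poisson.pmf(n, MEAN)                 # Poisson, mean = MEAN

  print("choice A:", rel_entropy(p_geom, log_mu_A), ">", rel_entropy(p_pois, log_mu_A))
  print("choice B:", rel_entropy(p_pois, log_mu_B), ">", rel_entropy(p_geom, log_mu_B))

The outcome reflects the Lagrange multiplier calculation: maximizing -sum_n p_n log(p_n/mu_n) subject to sum_n p_n = 1 and sum_n n p_n = <n> gives p_n proportional to mu_n e^(-lambda n), which is a geometric distribution for choice A and a Poisson distribution for choice B.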