True or false? Log-improbability is Shannon information

True or false? If $p$ is the probability of an event, then the Shannon information of the event is $-\!\log_2 p$ bits.

I’m quite interested in knowing what you believe, and why you believe it, even if you cannot justify your belief formally.

Formal version. Let $(\Omega, 2^\Omega, P)$ be a discrete probability space with $P(\Omega) = 1,$ and let event $E$ be an arbitrary subset of $\Omega.$ Is it the case that in Shannon’s mathematical theory of communication, the self-information of the event is equal to $-\!\log_2 P(E)$ bits?

84 thoughts on “True or false? Log-improbability is Shannon information”

  1. Can you define “Shannon information of the event”? I mean, what are you calling the “information of the event”?

    If, say, a particular explosion on the Sun tomorrow is an event having probability p of occurring, are you asking us to opine on the claim that the Shannon information connected with that explosion is -log_2 p bits?

    Or what ARE you asking exactly?

  2. Hi walto. I think by event he means something like the toss of a coin or the toss of a die where we can assign a probability to the expected outcome and then say that we have received an “amount of information” associated with the probability of the event/outcome.

  3. walto: Or what ARE you asking exactly?

    I am asking a matter-of-fact question about the definition of Shannon self-information, which I shall not provide.

    Let’s say you identify a finite set of possible outcomes of some experiment, where the term experiment is broadly construed. Somehow a probability is associated with each of the possible outcomes. The interpretation of probability is unimportant here. All that matters is that (a) the probabilities of the possible outcomes are non-negative numbers summing to 1 and (b) for every subset $E$ of the possible outcomes, the probability of $E$ is equal to the sum of the probabilities of the possible outcomes in $E.$ That is,

    $$P(E) = P(x_1) + P(x_2) + \cdots + P(x_n) \geq 0$$

    for all subsets $E = \{x_1, x_2, \dotsc, x_n\}$ of possible outcomes, with $P(E) = 1$ in the case that $E$ contains all possible outcomes. Does it make sense, within Shannon’s mathematical theory of communication, to regard $-\!\log_2 P(E)$ as a measure of information in the event that $E$ occurs (meaning that the actual outcome of the experiment is one of the possible outcomes contained by $E$)?

    Everyone loves to talk about this, and I am asking, “What ARE you talking about, exactly?”
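
    As a concrete illustration (my own example, assuming a fair six-sided die): the event “the roll is even” is $E = \{2, 4, 6\},$ and

    $$P(E) = P(2) + P(4) + P(6) = \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{2}.$$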

  4. My answer is false. But it’s probably wrong.

    My answer is based on my belief that the Shannon measure of information is calculated with respect to a probability distribution. So I’d need to know the probability distribution.

  5. Mung:
    Hi walto. I think by event he means something like the toss of a coin or the toss of a die where we can assign a probability to the expected outcome and then say that we have received an “amount of information” associated with the probability of the event/outcome.

    We usually say that the sample space is (possible outcomes are) $\Omega = \{H, T\}$ when a coin is tossed. Then there are $2^{|\Omega|} = 2^2 = 4$ elements in

    $$2^\Omega = \{\emptyset, \{H\}, \{T\}, \Omega\},$$

    the set of all subsets of the sample space. Each of the four subsets of the sample space is an event. Assuming that the two possible outcomes are equiprobable,

    [latex]\begin{align*}
    P(\emptyset) &= 0 \\
    P(\{H\}) &= .5 \\
    P(\{T\}) &= .5 \\
    P(\Omega) &= 1
    \end{align*}[/latex]
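
    So, under the $-\!\log_2 p$ convention, the event $\{H\}$ would be assigned $-\!\log_2 .5 = 1$ bit, the certain event $\Omega$ would be assigned $-\!\log_2 1 = 0$ bits, and the impossible event $\emptyset$ would be assigned an infinite value.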

  6. I think this article supports my view:

    https://en.wikipedia.org/wiki/Entropy_(information_theory)

    In a more technical sense, there are reasons (explained below) to define information as the negative of the logarithm of the probability distribution. The probability distribution of the events, coupled with the information amount of every event, forms a random variable whose expected value is the average amount of information, or entropy, generated by this distribution. Units of entropy are the shannon, nat, or hartley, depending on the base of the logarithm used to define it, though the shannon is commonly referred to as a bit.

  7. Tom English: I am asking a matter-of-fact question about the definition of Shannon self-information, which I shall not provide.

    Let’s say you identify a finite set of possible outcomes of some experiment, where the term experiment is broadly construed. Somehow a probability is associated with each of the possible outcomes. The interpretation of probability is unimportant here. All that matters is that (a) the probabilities of the possible outcomes are non-negative numbers summing to 1 and (b) for every subset $E$ of the possible outcomes, the probability of $E$ is equal to the sum of the probabilities of the possible outcomes in $E.$ That is,

    $$P(E) = P(x_1) + P(x_2) + \cdots + P(x_n) \geq 0$$

    for all subsets $E = \{x_1, x_2, \dotsc, x_n\}$ of possible outcomes, with $P(E) = 1$ in the case that $E$ contains all possible outcomes. Does it make sense, within Shannon’s mathematical theory of communication, to regard $-\!\log_2 P(E)$ as a measure of information in the event that $E$ occurs (meaning that the actual outcome of the experiment is one of the possible outcomes contained by $E$)?

    Everyone loves to talk about this, and I am asking, “What ARE you talking about, exactly?”

    OK, then I’ll answer No. I think that the suggestion involves a misinterpretation of the term “information.”

  8. walto: I think that the suggestion involves a misinterpretation of the term “information.”

    Well, I prefer the term specific entropy to self-information. Shannon entropy is the expectation of specific entropy.

  9. False. This is the Shannon self-information (surprisal). Shannon Information is the expected value of this quantity for the population (probability distribution).

  10. Tomato Addict,

    Precisely the rare event I had hoped would occur has occurred!

    What if the experiment is a roll of an ideal, six-sided die, with $\Omega = \{1, 2, \dotsc, 6\}$, and I specify a sigma algebra of $\{\emptyset, \{1, 3, 5\}, \{2, 4, 6\}, \Omega\}$ instead of $2^\Omega$?

  11. Tom English,
    Nothing changes, except the event is now an odd/even roll of the die, instead of heads/tails. The individual outcomes {1}, {2}, {3}, etc., have greater self-information, but the aggregate event is the same under this definition.

    Here’s a leading question for you, because I think I know where you might be headed with this: Dembski’s CSI measures Shannon Information? (true/false)

  12. Tom English: You know that the Shannon entropy is 1 bit for a fair coin toss. How do you get that result?

    C:\projects>irb
    irb(main):001:0> [0.5, 0.5].inject(0) {|sum, px| sum + px * Math.log2(1.0/px)}
    => 1.0

    I’d love to see that in plain English and in mathematical notation. I can say what I think is going on but am not absolutely certain.
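
    As far as I can tell, the one-liner is evaluating $\sum_x p(x) \log_2 \frac{1}{p(x)}$ with $p(H) = p(T) = 0.5$:

    $$H = \tfrac{1}{2} \log_2 2 + \tfrac{1}{2} \log_2 2 = \tfrac{1}{2} + \tfrac{1}{2} = 1 \text{ bit}.$$

    In words: each of the two equiprobable outcomes contributes half a bit to the average.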

  13. Tomato Addict: Here’s a leading question for you, because I think I know where you might be headed with this:: Dembski’s CSI measures Shannon Information? (true/false)

    You do know. There’s also log-improbability of the “target” event in the active information measures of Dembski and Marks, and of Montañez. I’m hoping to heap insult upon injury this evening. From my perspective, there’s nothing more IDiotic in ID than the “information” yimmer-yammer. It’s not a sort of IDiocy that many people will appreciate, however.

    Here is my favorite resource (not yet published): Gavin E. Crooks, “On Measures of Entropy and Information.” ID runs afoul of a requirement in the second paragraph.

  14. Mung,

    My very first Ruby script (does the “code” tag work in comments?):


#!/usr/bin/ruby

# Entropy of n equiprobable possibilities: log2(n) bits.
n = ARGV[0].to_i
print "Number of equiprobable possibilities: ", n, "\n"
p = [1.0/n] * n   # n copies of the probability 1/n
print Math.log2(n)
print " = "
print p.inject(0) {|sum, px| sum + px * Math.log2(1.0/px)}
print "\n"
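
    A hypothetical run, assuming the script is saved as entropy.rb (the filename is my own):

    C:\projects>ruby entropy.rb 8
    Number of equiprobable possibilities: 8
    3.0 = 3.0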

  15. Tom English: My very first Ruby script (does the “code” tag work in comments?)

    No. 🙁

    If I want to indent I just use html: ampersand nbsp semicolon. I do that twice if I want to indent two spaces.

    ETA: print will not add a newline. If you want a newline use puts.

    ETA ETA: not that there’s anything wrong with the way your code is written. It will work that way.

    print "\n" works too

  16. Mung,
    In plain speak, your calculation (Shannon Entropy) is the average bandwidth (bits) needed to convey a signal from sender to receiver without loss. Individual messages might need more bits (or fewer) than the average; that’s the self-information.
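
    A rough Ruby sketch of that distinction, with made-up message probabilities (the names here are mine):

    p = { 'a' => 0.5, 'b' => 0.25, 'c' => 0.25 }     # hypothetical message probabilities

    # Self-information (surprisal) of each individual message, in bits.
    p.each { |msg, px| puts "#{msg}: #{-Math.log2(px)} bits" }

    # Probability-weighted average of those surprisals: the Shannon entropy.
    puts p.values.sum { |px| -px * Math.log2(px) }   # => 1.5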

  17. p = [(1.0 / n).round(3)] * n
    puts p.inspect

    Don’t think I’ve ever seen this done before. Multiplying an array. Cool.

    Number of equiprobable possibilities: 8
    [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]

  18. Tomato Addict:
    Mung,
    In plain speak, your calculation (Shannon Entropy) is the average bandwidth (bits) needed to convey a signal from sender to receiver without loss. Individual messages might need more bits (or fewer) than the average; that’s the self-information.

    Every possible outcome must correspond to exactly one message, or Shannon’s rationale is no good. It’s OK for different outcomes to correspond to a single message, but NOT-OK for a single outcome to correspond to two or more messages. For instance, if a die comes up 2, the sender does not have a choice of whether to transmit a message indicating the occurrence of the event $\{2\}$ or to transmit a message indicating the occurrence of the event $\{2, 4, 6\}.$ If there is a message for the bigger event, then there cannot be a message for the smaller event. Put technically, the transmitted events must be blocks in a partition of the sample space. There must be a one-to-one correspondence of events and messages. The set $2^\Omega$ is NOT a partition of the sample space $\Omega.$ The quantity $$-\sum_{E \subseteq \Omega} P(E) \log_2 P(E)$$ is NOT Shannon entropy.

    The entropy of a discrete probability distribution is usually expressed in terms of a probability mass function $p(x)$ mapping possible outcomes $x$ to probabilities, not a probability distribution function $P(E)$ mapping events $E$ to probabilities. Students see something like $$H(p) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$ bits as the definition of Shannon entropy. There’s no mention at all of events, and it registers on few people that the sum is over an exhaustive collection of nonempty and mutually disjoint events.

    For a roll of a die, with equiprobable outcomes in the sample space $\{1, 2, \dotsc, 6\},$ and with a partition $$\pi = \{ \{1\}, \{2\}, \{3\}, \{4\}, \{5\}, \{6\} \}$$ of the sample space (each possible outcome belongs to exactly one of the 6 blocks in the partition), the entropy is $$-\sum_{E \in \pi} P(E) \log_2 P(E) = \log_2 6 \approx 2.585~\text{bits.}$$ If we go instead with the partition $$\pi^\prime = \{ \{1, 2, 3\}, \{4, 5, 6\} \},$$ then each of the 6 possible outcomes belongs to exactly one of 2 events (only one event or the other, not both, occurs when the die is rolled), and each of the two events is of probability .5. The entropy is now $$-\sum_{E \in \pi^\prime} P(E) \log_2 P(E) = \log_2 2 = 1~\text{bit.}$$
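
    A quick Ruby check of those two sums (prob and entropy are helper names I’ve introduced here):

    # Probability of an event: its share of the 6 equiprobable outcomes.
    def prob(event)
      event.size / 6.0
    end

    # Entropy of a partition: the sum over its blocks E of -P(E) log2 P(E).
    def entropy(partition)
      partition.sum { |e| -prob(e) * Math.log2(prob(e)) }
    end

    puts entropy([[1], [2], [3], [4], [5], [6]])   # ≈ 2.585 bits (log2 6)
    puts entropy([[1, 2, 3], [4, 5, 6]])           # => 1.0 bit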

  19. Short answer: depends.

    Shannon Information (entropy, uncertainty, whatever)

    $$I = -p(x_0)\log_2 p(x_0) - p(x_1)\log_2 p(x_1) - p(x_2)\log_2 p(x_2) - \cdots - p(x_n)\log_2 p(x_n)$$

    my guess is that only under the assumption of equiprobability of outcomes (aka microstates) does it resolve to answering your question with “yes”; other than that, it looks like a mess, so probably in general “no”.

    Measures of computer RAM information assume, for the sake of argument, a priori equiprobability when we say a machine has 1 Gig of RAM, even though we know that in practice the outcomes (microstates) are not actually equiprobable on infinite time lines. So “yes” provisionally for many practical purposes, and I suppose “no” in the strict sense.

  20. Tomato Addict, I know this has been hashed over before, but I still don’t see this as a problem for Dembski’s measure. He describes M*N*phi_S(T)*P(T|H) as a number, not as a probability. What it is is an upper bound on a relevant probability, and that’s all it needs to be for his argument to proceed. You could argue that taking its logarithm and calling that “information” is only valid if it’s an actual probability, but I think that’s being unnecessarily parochial about what “information” means.

    (I’d argue that there is a problem here, but it’s much subtler. That number is an upper bound on the total probability of an event matching a specification T_other, where phi_S(T_other) <= phi_S(T) and P(T_other|H) <= P(T|H), but what he really needs is an upper bound on the total probability of an event matching a specification with phi_S(T_other)*P(T_other|H) <= phi_S(T)*P(T|H). Under reasonable assumptions (except for unbounded description length), this total probability turns out to be 1. If he'd required descriptions to be in a self-delimiting code, and based phi on the number of code sequences rather than valid descriptions, he wouldn't have this problem.)

  21. Neil Rickert: If the event is the value of a symbol in a communication stream, then it is true. Otherwise it is nonsense.

    How does this work for the binary symbols ‘0’ and ‘1’ in a communications stream?

    What are their values?

  22. Mung: How does this work for the binary symbols ‘0’ and ‘1’ in a communications stream?

    What are their values?

    ‘0’ and ‘1’ are the two possible values for the symbol at that place in the communication stream.

  23. Hi Gordon,
    Dembski refers to his 2005 paper (in a 2012 publication), saying that “… specification has been defined in a mathematically rigorous manner …”.
    Information Theory is a mathematical science, and it is more than fair to hold him to accepted standards of rigor. That’s not being parochial, that’s the essential peer review Dembski avoided by publishing in a theological journal.

    Dembski equivocates, describing M*N*phi_S(T)*P(T|H) as a number, an upper bound, and a probability. He describes Log2 of this quantity as an Information Measure, which requires it to be a probability.
    The upper bound on any probability is 1.0, but in an example he calculates a *negative* value for CSI, and therefore M*N*phi_S(T)*P(T|H) > 1.0, so it is neither a probability nor an Information Measure.

    Further, M*N*phi_S(T)*P(T|H) is the WRONG probability. Dembski approximates the probability for the event T occurring exactly once, when it should be the probability of T occurring AT LEAST once. The result is that even when the upper bound is less than 1.0, it isn’t high enough, and therefore not an upper bound.

    I grant that when CSI is applied to something that IS complex and specified, it will behave approximately as an Information Measure, but it is invalid for anything that is not Complex and Specified. This makes it useless for its intended purpose.

    The interpretation of CSI is exactly opposite of how Information (Shannon or Kolmogorov) is defined. CSI is maximized when the sequence T is perfectly orderly (highly compressible=minimal Kolmogorov Info). This is confusing at least. Very high values of CSI are incompatible with biological systems, which must contain some minimal level of Kolmogorov Information, or they would not be complex enough to be living things.

    CSI has the amusing property that for any sequence T at least 500 bits long that does not exhibit CSI, doubling the sequence by repeating it (T’ = TT) guarantees that the new sequence exhibits CSI. This happens because P[T’|H] becomes much smaller, but Phi_S(T’) is only slightly larger.

    I wrote my blog post because these were mathematical/statistical flaws that no one else seemed to have pointed out, but they are not the strongest criticisms. See the papers by Elsberry and Shallit (2011), and Devine (2014) for more gory mathematical evisceration of CSI.

  24. Again Gordon, I’m still trying to suss out your more subtle criticism.

    “… this total probability turns out to be 1.”

    Which probability? I don’t follow.

  25. Tomato Addict:
    My own criticism of Dembski is that he doesn’t take Log2 of a probability, but Log2 of the expectation of a binomial random variable. Even corrected, CSI only measures the compressibility of a sequence.
    http://dreadtomatoaddiction.blogspot.com/2016/02/deconstructing-dembski-2005.html

    I’ve published on the 2005 version of specified complexity. So I’m not raining on your parade alone when I say that the Evolutionary Informatics Lab seems to have committed to algorithmic specified complexity, which does have a clear interpretation as the log-ratio of two probability measures. The EIL has rebranded specified complexity as a measure of meaningful information. I’ve long had a visual example of how meaningless their meaning is, and hope to post it soon. (But I have a very bad history of realizing such hopes.)

  26. The 2005 version of Specified Complexity also required that one have the probability of producing points in the target, $P({\bf T} | H),$ by all mechanisms of evolutionary biology. This was different from the 2002 definition of CSI, which only required that one compute that probability by “chance” mechanisms, such as mutation. (Dembski has since claimed that the 2002 version was actually talking about the same thing, but back in 2002 Dembski claimed to have a Law of Conservation of Complex Specified Information which, if relevant, would prevent one from originating CSI by natural selection).

    The Law turned out to be formulated in a way that was not relevant. Little has been heard of it since.

    It is also notable that in 2002 specified complexity involved a choice of target that could be “cashed out” in terms of viability, or some other favorable properties of organisms (see page 148 of No Free Lunch). The switch to only using algorithmic specified complexity is a narrowing-down of the concept, and that has not been discussed. It is also a bit silly since it would accord a party-balloon a much higher “complexity” than a hummingbird.

  27. Joe Felsenstein: It is also a bit silly since it would accord a party-balloon a much higher “complexity” than a hummingbird.

    It depends. What is the discrete space of possible outcomes that includes party balloons and hummingbirds? What is the model assigning “natural” probabilities to all outcomes in that space? What is the descriptive system? I can set things up to support your claim or to refute your claim. That is actually a big problem for Team EIL.

  28. Neil Rickert: ‘0’ and ‘1’ are the two possible values for the symbol at that place in the communication stream.

    This seems a dubious statement to me just as a semantic point. I don’t see how these are two values rather than two alternatives. There is no value inherent in either alternative, is there? It seems that because the symbols being used are numbers, you are saying they are values, but it could be red or blue. Would that still be a value? Or up or down. Or yes and no. Or fish and soup.

    Or are you saying the one and zero actually exist? I think a switch either allows passage or it doesn’t; there is no inherent value in that.

  29. Tomato Addict:
    Hi Gordon,
    Dembski refers to his 2005 paper (in a 2012 publication), saying that “… specification has been defined in a mathematically rigorous manner …”.
    Information Theory is a mathematical science, and it is more than fair to hold him to accepted standards of rigor. That’s not being parochial, that’s the essential peer review Dembski avoided by publishing in a theological journal.

    Dembski equivocates, describing M*N*phi_S(T)*P(T|H) as a number, an upper bound, and a probability. He describes Log2 of this quantity as an Information Measure, which requires it to be a probability.
    The upper bound on any probability is 1.0, but in an example he calculates a *negative* value for CSI, and therefore M*N*phi_S(T)*P(T|H) > 1.0, so it is neither a probability nor an Information Measure.

    Further, M*N*phi_S(T)*P(T|H) is the WRONG probability. Dembski approximates the probability for the event T occurring exactly once, when it should be the probability of T occurring AT LEAST once. The result is that even when the upper bound is less than 1.0, it isn’t high enough, and therefore not an upper bound.

    I grant that when CSI is applied to something that IS complex and specified, it will behave approximately as an Information Measure, but it is invalid for anything that is not Complex and Specified. This makes it useless for its intended purpose.

    The interpretation of CSI is exactly opposite of how Information (Shannon or Kolmogorov) is defined. CSI is maximized when the sequence T is perfectly orderly (highly compressible=minimal Kolmogorov Info). This is confusing at least. Very high values of CSI are incompatible with biological systems, which must contain some minimal level of Kolmogorov Information, or they would not be complex enough to be living things.

    CSI has the amusing property that for any sequence T at least 500 bits long that does not exhibit CSI, doubling the sequence by repeating it (T’ = TT) guarantees that the new sequence exhibits CSI. This happens because P[T’|H] becomes much smaller, but Phi_S(T’) is only slightly larger.

    I wrote my blog post because these were mathematical/statistical flaws that no one else seemed to have pointed out, but they are not the strongest criticisms. See the papers by Elsberry and Shallit (2011), and Devine (2014) for more gory mathematical evisceration of CSI.

    Thanks. This is very informative.

  30. phoodoo: This seems a dubious statement to me just as a semantic point. I don’t see how these are two values rather than two alternatives.

    If there is a point there, you failed to make it.

    Information, in the sense in which I was using it, is part of a communication technology. The specifications of the technology determine what counts as a symbol or symbol value. Whether you call them {0,1} or {red,blue} or {fish, soup} is pretty much beside the point. I think you are confusing the information in the communication stream, with the information in human talk about that information stream. The human talk is part of a different communication stream.

  31. phoodoo,

    Shannon’s theory was primarily a theory of communication. It is technology. The values that matter are defined in the specifications for the particular technology.

    You are confusing it with other meanings of “value”.

  32. Neil Rickert,

    You are free to offer a mathematical definition of value that does not infer a quantity. So far I have not seen you do so. Perhaps it is your own private definition.

  33. phoodoo:
    Neil Rickert,

    You are free to offer a mathematical definition of value that does not infer a quantity. So far I have not seen you do so. Perhaps it is your own private definition.

    Math is not just about numbers. We refer to $y=f(x)$ as the value of function $f$ at $x,$ no matter what kind of object $y$ is. We might model the flip of a coin with a random variable $X$ taking values in the set $\{H, T\},$ where $H$ stands for heads and $T$ stands for tails. This is standard usage.

  34. Here’s the probability distribution for a loaded die, along with a binary encoding of each outcome [in brackets].

    p(1) = 1/16 [0011]
    p(2) = 1/16 [0010]
    p(3) = 1/4 [01]
    p(4) = 1/8 [000]
    p(5) = 1/2 [1]
    p(6) = 0 [none for an impossible outcome]

    The length of the codeword for outcome $n$ in $\{1, 2, \dotsc, 6\}$ is $-\!\log_2 p(n).$ No codeword is a proper prefix of any other. The expected length of the codeword for the outcome of a roll of the die is the Shannon entropy of the distribution $p,$

    [latex]\begin{align*}
    H(p) &= -\frac{1}{16} \log_2 \frac{1}{16}
    – \frac{1}{16} \log_2 \frac{1}{16}
    – \frac{1}{4} \log_2 \frac{1}{4}
    – \frac{1}{8} \log_2 \frac{1}{8}
    – \frac{1}{2} \log_2 \frac{1}{2}
    – 0 \log_2 0 \\
    &= -\frac{1}{16} (-4)
    – \frac{1}{16} (-4)
    – \frac{1}{4} (-2)
    – \frac{1}{8} ( -3)
    – \frac{1}{2} ( -1)
    – 0 \\
    &= \frac{1}{4} + \frac{1}{4} + \frac{1}{2} + \frac{3}{8} + \frac{1}{2} + 0 \\
    &= 1\frac{7}{8} \\
    &= 1.875 \text{~bits},
    \end{align*}[/latex]
    where $0 \log_2 0 = 0$ by definition. For instance, the probability is 1/16 that the outcome will be 1 and the codeword (message) will be of length 4. The message contains 1.875 binary symbols on average. Shannon proved that the entropy is the minimum average number of symbols transmitted.
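
    For anyone who would like to check the arithmetic, a quick Ruby sketch, with the codeword lengths read off the list above:

    probs   = [1.0/16, 1.0/16, 1.0/4, 1.0/8, 1.0/2]   # outcome 6 has probability 0 and no codeword
    lengths = [4, 4, 2, 3, 1]                         # codeword lengths, each equal to -log2 p(n)

    entropy         = probs.sum { |px| -px * Math.log2(px) }
    expected_length = probs.zip(lengths).sum { |px, len| px * len }

    puts entropy           # => 1.875
    puts expected_length   # => 1.875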

  35. Mung:
    phoodoo, there must be some value there otherwise bits per symbol would make no sense.

    Bread is not a value. A loaf of bread is a value.

    Light is not a value. The amount of light is a value.

  36. Mung: If the question in the OP was about Hartley rather than Shannon would the answer be yes (eta: true)?

    The Hartley function takes a finite set as its argument. It’s not defined in terms of probability. So there’s really no way to turn the OP into a question about the Hartley function.
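
    For what it’s worth, the Hartley measure of a finite set $A$ is $H_0(A) = \log_2 |A|$ bits, with no probabilities in sight. For the six-sided die, $H_0(\{1, 2, \dotsc, 6\}) = \log_2 6 \approx 2.585$ bits, which agrees with the Shannon entropy only when the outcomes are equiprobable.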

  37. Tom English,

    The tail of a coin is also not a value Tom. The head of a coin is not a value.

    The number of times it lands facing up is however. To understand math, one also needs to understand English. Or at least some language that tells you what math represents.

  38. phoodoo: The tail of a coin is also not a value Tom. The head of a coin is not a value.

    The number of times it lands facing up is however. To understand math, one also needs to understand English. Or at least some language that tells you what math represents.

    A coin does not have a tail, phoodoo. A coin does not have a head. Most coins have images of heads stamped into them. Some nickels have images of buffalo tails stamped into them. But coins don’t have heads and tails. So it’s imperative that you stop speaking of the outcomes of coin-flipping as heads and tails. It confuses everyone terribly when you call things what they’re really not, even though everyone agrees to call them that.
