Paleontological systematic study as a communication system
Shannon (1948) treated information as the reduction of uncertainty and
argued that the semantic meaning of a message is irrelevant to the
problem of communicating it. A typical communication system can be
divided into five parts: the information source, encoder, channel (with
noise), decoder, and destination (Fig. 1a). Shannon (1948, p. 379) stated that
“The fundamental problem of communication is that of reproducing
at one point either exactly or approximately a message selected at
another point”. Paleontological systematic studies share many
similarities with a communication system (Fig. 1b) and focus on
reconstructing the evolutionary history of extinct organisms either
exactly or approximately. Most communication systems, such as
telephone, email, and instant messaging apps, communicate in the
spatial domain, whereas paleontological systematic studies
communicate in the temporal domain. The original organisms can be
treated as the information source, fossils discovered as the received
message, and preservation environments as noisy channels. The encoder in
Fig. 1a encodes the original message into signals, for example encoding
“we found a dinosaur skull” into Morse code, and the decoder does the
reverse. In paleontology, a widely used encoder is the morphological
character matrix, which encodes each operational taxonomic unit (OTU) as
a series of character states. Most organisms are never preserved as
fossils, that is, their information is never transmitted through the
preservation-environment channel, and the fossils that are discovered
are more or less incomplete and distorted.
The fundamental problem of paleontological studies is reconstructing, at
present, either exactly or approximately, organisms that lived in
another age. Therefore, two questions need to be asked: how much
information was in an organism or taxon, and how much of that
information can be preserved?
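The encoding step described above can be illustrated with a minimal sketch. It assumes a toy character matrix and models the preservation environment as a simple erasure channel in which each character state is lost independently with a fixed probability. The OTU names, character names, loss probability, and the functions `preserve` and `completeness` are all hypothetical and chosen only for illustration. Real preservation also distorts states rather than only deleting them, which this sketch does not model.

```python
import random

# Hypothetical characters and OTUs, for illustration only.
CHARACTERS = ["skull_crest", "tooth_serration", "tail_club", "feathers"]
SOURCE_MATRIX = {
    "OTU_A": [1, 0, 0, 1],
    "OTU_B": [0, 1, 0, 1],
    "OTU_C": [0, 0, 1, 0],
}
MISSING = "?"  # conventional marker for an unobserved character state


def preserve(states, p_loss, rng):
    """Pass one OTU's character states through a lossy 'preservation channel'.

    Each state is independently replaced by '?' with probability p_loss
    (an assumed erasure model, not a claim about real taphonomy).
    """
    return [MISSING if rng.random() < p_loss else s for s in states]


def completeness(states):
    """Fraction of character states that survived the channel."""
    return sum(s != MISSING for s in states) / len(states)


rng = random.Random(0)  # fixed seed so the run is repeatable
received = {otu: preserve(states, 0.4, rng) for otu, states in SOURCE_MATRIX.items()}
for otu, states in received.items():
    print(otu, states, f"completeness={completeness(states):.2f}")
```

In this toy setting, the source matrix plays the role of the encoded message and `received` plays the role of the fossils as discovered; comparing the two gives a crude sense of how much of the encoded information survived.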
To efficiently transmit information, both source coding and channel
coding are essential in communication engineering and their differences
are listed in Table 1. Source coding focuses on minimizing the cost of
encoding the original messages. For example, Morse code uses codes of
different lengths for different letters: E, the most frequent letter in
English, has the shortest code, a single dot, whereas rarer letters such
as X, Y, and Z have longer codes. This maximizes the information entropy
of each code symbol, where entropy measures how informative a variable
is. On the other hand, channel coding is designed to resist noise in the
channel (here, the preservation environment).
The simplest, though inefficient, example of channel coding is the
repetition code. Suppose an information source randomly sends 0 and 1
through a noisy channel that has a 30% chance of flipping each bit, so
that any 0 or 1 received has a 70% chance of being correct. To resist such noise,
the encoder repeats each bit three times, turning “0” into “000” and
“1” into “111”. Under the maximum-likelihood decoding principle,
“000”, “100”, “010”, and “001” are decoded as “0” and all other
triplets as “1”, so the decoded message has a 78.4% chance of being
correct ($0.7^3 + 3 \times 0.7^2 \times 0.3 = 0.784$),
which is better than sending each bit only once. However, the repetition
code is usually regarded as inefficient: in this example the encoding
triples the transmission cost but improves accuracy by only 8.4
percentage points (from 70% to 78.4%).
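The two ideas in this paragraph can be checked numerically with a short sketch. The first part compares the entropy of a toy four-symbol source with the average length of a variable-length code and of a fixed-length code; the symbol probabilities and code lengths are hypothetical and are not actual Morse code statistics. The second part computes the accuracy of majority-vote decoding for a repetition code, assuming that each bit is flipped independently. The function names are illustrative.

```python
from math import comb, log2


def entropy_bits(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    return -sum(p * log2(p) for p in probs if p > 0)


def expected_length(probs, lengths):
    """Average number of code symbols per source symbol."""
    return sum(p * n for p, n in zip(probs, lengths))


def repetition_accuracy(p_correct, n):
    """Probability that majority voting over n copies recovers the bit.

    Assumes n is odd and that each copy is received correctly with
    probability p_correct, independently of the others.
    """
    return sum(
        comb(n, k) * p_correct**k * (1 - p_correct) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Source coding: hypothetical source where frequent symbols get short codes.
probs = [0.5, 0.25, 0.125, 0.125]
variable_lengths = [1, 2, 3, 3]  # e.g. codewords 0, 10, 110, 111
fixed_lengths = [2, 2, 2, 2]     # e.g. codewords 00, 01, 10, 11
print("entropy (bits):", entropy_bits(probs))
print("variable-length code:", expected_length(probs, variable_lengths))
print("fixed-length code:", expected_length(probs, fixed_lengths))

# Channel coding: the example from the text (70% per-bit accuracy, 3 copies).
print(f"uncoded: {repetition_accuracy(0.7, 1):.3f}")
print(f"3 copies: {repetition_accuracy(0.7, 3):.3f}")
```

Under these assumptions, the variable-length code reaches the source entropy of 1.75 bits per symbol whereas the fixed-length code needs 2, and the three-fold repetition code yields 0.784, matching the value derived in the text.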
Table 1. Comparison between source coding and channel coding