Paleontological systematic study as a communication system

Shannon (1948) treated information as a reduction of uncertainty and noted that the semantic meaning of a message is irrelevant to the problem of communicating it. A typical communication system can be divided into five parts: the information source, the encoder, the channel (with noise), the decoder, and the destination (Fig. 1a). Shannon (1948, p. 379) stated that “The fundamental problem of communication is that of reproducing at one point either exactly or approximately a message selected at another point”. Paleontological systematic studies share many similarities with a communication system (Fig. 1b): they aim to reconstruct the evolutionary history of extinct organisms either exactly or approximately. Most communication systems, such as telephone, email, and instant messaging apps, communicate across space, whereas paleontological systematic studies communicate across time. The original organisms can be treated as the information source, the discovered fossils as the received message, and the preservation environments as noisy channels. The encoder in Fig. 1a converts the original message into signals, for example by encoding “we found a dinosaur skull” into Morse code, and the decoder does the reverse. In paleontology, a widely used encoder is the morphological character matrix, which encodes each operational taxonomic unit (OTU) as a series of character states. Most organisms are never preserved as fossils, that is, their information fails to pass through the channel of the preservation environment, and the fossils that are discovered are more or less incomplete and distorted. The fundamental problem of paleontological studies is therefore to reconstruct at present, either exactly or approximately, organisms that lived in another age. Two questions follow: how much information was in an organism or taxon, and how much of that information can be preserved?
To transmit information efficiently, communication engineering relies on both source coding and channel coding; their differences are listed in Table 1. Source coding focuses on minimizing the cost of encoding the original messages. For example, Morse code assigns codes of different lengths to different letters: E, the most frequent letter in English, has the shortest code, a single dot, whereas rarer letters such as X, Y, and Z have longer codes, thereby increasing the information entropy, which measures how informative a variable is, carried by each code symbol. Channel coding, on the other hand, is designed to resist noise in the channel, which here corresponds to the preservation environment. The simplest, although inefficient, example of channel coding is the repetition code. Suppose an information source sends 0s and 1s at random through a noisy channel that flips each transmitted bit with a probability of 30%, so that any received 0 or 1 has a 70% chance of being correct. To resist this noise, the encoder repeats each bit three times, turning “0” into “000” and “1” into “111”. Under the maximum likelihood decoding principle, “000”, “100”, “010”, and “001” are decoded as “0” and all other received words as “1”, so each decoded bit has a 78.4% chance of being correct ($0.7^3 + 3 \times 0.7^2 \times 0.3 = 0.784$), which is better than sending each bit once. However, the repetition code is usually regarded as inefficient: in this example the encoding triples the transmission cost but improves accuracy by only 8.4 percentage points (from 70% to 78.4%).
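The repetition-code arithmetic can be checked with a short script. The following is a minimal Python sketch rather than part of the original study: the function names `encode`, `decode`, and `success_probability` are hypothetical, and the sketch assumes that each bit is flipped independently and that the number of repetitions is odd, so that the majority vote never ties.

```python
from itertools import product


def encode(bit, n=3):
    """Repeat a single bit n times, e.g. 0 -> (0, 0, 0)."""
    return (bit,) * n


def decode(word):
    """Majority vote, which is maximum likelihood decoding
    when the flip probability is below 0.5 (assumes odd length)."""
    return 1 if sum(word) > len(word) / 2 else 0


def success_probability(flip_prob, n=3):
    """Exact probability that one bit is decoded correctly after
    n-fold repetition, assuming independent bit flips."""
    sent = encode(0, n)  # by symmetry, sending 1 gives the same result
    total = 0.0
    # Enumerate every possible pattern of flipped positions.
    for flips in product((0, 1), repeat=n):
        received = tuple(s ^ f for s, f in zip(sent, flips))
        k = sum(flips)  # number of flipped bits in this pattern
        pattern_prob = flip_prob ** k * (1 - flip_prob) ** (n - k)
        if decode(received) == 0:
            total += pattern_prob
    return total


if __name__ == "__main__":
    print(round(success_probability(0.3, 1), 3))  # single transmission
    print(round(success_probability(0.3, 3), 3))  # three-fold repetition
```

With a flip probability of 0.3, the two printed values correspond to the 70% and 78.4% figures in the text. Calling `success_probability` with larger odd values of `n` shows how slowly accuracy grows relative to the added transmission cost.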
Table 1. Comparison between source coding and channel coding