Protein sequence networks
Representative protein sequences were selected by clustering with CD-HIT, which reduced the sample size and thus the computational effort of pairwise sequence alignment. Pairwise sequence identity and similarity were calculated with the Needleman-Wunsch algorithm as implemented in EMBOSS (version 6.6.0), using the default gap opening and gap extension penalties of 10 and 0.5, respectively, and the BLOSUM62 substitution matrix [24,25].
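As an illustration of how a pairwise identity value follows from a global alignment, the sketch below implements a simplified Needleman-Wunsch alignment in Python. It is not the EMBOSS implementation: it assumes a plain match/mismatch score and a linear gap penalty instead of BLOSUM62 with affine gap penalties, and it assumes that identity is reported relative to the full alignment length, including gap columns. The function names and scoring values are hypothetical and were chosen for illustration only.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align a and b; return the two gapped strings.

    Simplified scoring (assumption): fixed match/mismatch scores and a
    linear gap penalty, not BLOSUM62 with affine gaps.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + s,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + s:
                out_a.append(a[i - 1])
                out_b.append(b[j - 1])
                i -= 1
                j -= 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aln_a, aln_b):
    """Identical, non-gap columns as a percentage of the alignment length."""
    matches = sum(1 for x, y in zip(aln_a, aln_b) if x == y and x != "-")
    return 100.0 * matches / len(aln_a)


aligned = needleman_wunsch("ACGT", "ACT")
print(aligned, percent_identity(*aligned))
```

A similarity value could be derived in the same way by counting columns whose substitution score is positive rather than columns with identical residues; that requires a substitution matrix and is omitted here.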
Collections of protein sequences were represented as protein sequence networks, in which sequences are nodes connected by edges (lines). Edges were weighted by pairwise sequence identity or similarity, and a threshold on the edge weight was chosen to select the subset of edges retained in the network. Protein sequence networks were visualized in Cytoscape (version 3.8.2) with the prefuse force-directed layout algorithm, taking the edge weights into account [26]: nodes connected by edges of higher sequence identity or similarity were preferentially placed closer to each other. The Python NetworkX package (version 1.11) was used to store the metadata of protein sequence networks in GraphML format, available for download at https://doi.org/10.18419/darus-205427.
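The thresholding and export steps can be sketched as follows. This is a minimal illustration that uses only the Python standard library as a stand-in for the NetworkX GraphML writer; the sequence labels, the weights, the threshold of 70, and the choice to keep edges at or above the threshold are all assumptions made for the example, not values from this work.

```python
import xml.etree.ElementTree as ET

GRAPHML_NS = "http://graphml.graphdrawing.org/xmlns"


def filter_edges(pairwise, threshold):
    """Keep only pairs whose weight is at or above the threshold (assumption)."""
    return {pair: w for pair, w in pairwise.items() if w >= threshold}


def to_graphml(nodes, edges):
    """Serialize an undirected, edge-weighted network as a GraphML string."""
    root = ET.Element("graphml", xmlns=GRAPHML_NS)
    # Declare the edge attribute that carries the identity or similarity value.
    ET.SubElement(root, "key", {"id": "w", "for": "edge",
                                "attr.name": "weight", "attr.type": "double"})
    graph = ET.SubElement(root, "graph", edgedefault="undirected")
    for name in nodes:
        ET.SubElement(graph, "node", id=name)
    for (u, v), weight in sorted(edges.items()):
        edge = ET.SubElement(graph, "edge", source=u, target=v)
        ET.SubElement(edge, "data", key="w").text = str(weight)
    return ET.tostring(root, encoding="unicode")


# Hypothetical pairwise identities (percent) between three sequences.
pairwise = {("A", "B"): 95.0, ("A", "C"): 40.0, ("B", "C"): 72.5}
kept = filter_edges(pairwise, threshold=70.0)
graphml = to_graphml(["A", "B", "C"], kept)
print(graphml)
```

All nodes are written regardless of the threshold, so a sequence without any retained edge appears as an isolated node. With NetworkX, the equivalent export would typically go through a weighted `Graph` object and `write_graphml`.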