Collecting and analyzing sequences
Sequences of the DNA encoding the abyssomicin (Aby, Verrucosispora maris AB-18-032, JF752342.1), ajudazol (Aju, Chondromyces crocatus, AM946600.1), akaeolide (Aka, Streptomyces sp. NBRC 109706, BBOM01000011.1), althiomycin (Alm, Myxococcus xanthus, FR831800.1), ambruticin (Amb, Sorangium cellulosum, DQ897667.1), amphotericin (Amp, Streptomyces nodosus, AF357202), anatoxin (Ana, Oscillatoria sp. PCC 6506, FJ477836.1), annimycin (Ann, Streptomyces calvus, KF683117.1), ansamitocin (Asm, Actinosynnema pretiosum subsp. pretiosum, KY4899977.1), apoptolidin (Apo, Nocardiopsis sp. FU40, JF819834.1), aurafuron (Auf, Stigmatella aurantiaca DW4/3-1, AM850130.1), aureothin (Aur, Streptomyces thioluteus, AJ575648.1), avermectin (Ave, Streptomyces avermitilis, AB032367.1), bafilomycin (Baf, Streptomyces lohii, GU390405.1), BE-14106 (Bec, Streptomyces sp. DSM 21069, FJ872523.1), bengamide (Ben, Myxococcus virescens, KP143770.1), borrelidin (Bor, Streptomyces parvulus, AJ580915.1), calcimycin (Cal, Streptomyces chartreusis NRRL 3882, HM452329.1), candicidin (Fsc, Streptomyces sp. FR-008, AY310323.2), chalcomycin (Chm, Streptomyces bikiniensis, AY509120.1), chaxamycin (Cxm, Streptomyces leeuwenhoekii, LN831790.1), chlorizidine (Clz, Streptomyces sp. CNH-287, KF585133.1), chlorothricin (Chl, Streptomyces antibioticus, DQ116941.2), chondramide (Cmd, Chondromyces crocatus, AM179409.1), chondrochloren (Cnd, Chondromyces crocatus, AM988861.1), coelimycin (Cpk, Streptomyces coelicolor A3(2), AL645882.2), concanamycin (Con, Streptomyces neyagawaensis, DQ149987.1), conglobatin (Cng, Streptomyces conglobatus, LN849060.1), cremimycin (Cmi, Streptomyces sp. MJ635-86F5, AB818354.1), crocacin (Cro, Chondromyces crocatus, FN547928.1), cryptophycin (Crp, Nostoc sp. ATCC 53789, ER159954.1), curacin (Cur, Moorea producens 3L, HQ696500.1), cyclizidine (Cyc, Streptomyces sp. NCIB 11649, KT327068.1), cylindrospermopsin (Cyr, Cylindrospermopsis raciborskii AWT205, EU140798.1), cystothiazole (Cta, Cystobacter fuscus, AY834753.1), divergolide (Div, Streptomyces sp. HKI0576, HF563079.1), DKxanthene (Dkx, Stigmatella aurantiaca, BN001209.1), E-837 (E837, Streptomyces aculeolatus, DQ292520.1), ebelactone (Ebe, Streptomyces aburaviensis, KC894072.1), ECO-02301 (Eco, Streptomyces aizunensis, AY899214.1), elaiophylin (Ela, Streptomyces sp. ICBB 9297, GP697151.1), epothilone (Epo, Sorangium cellulosum, GU063811.1), erythromycin (Ery, Saccharopolyspora erythraea, AM420283.1), FD-891 (Gfs, Streptomyces graminofaciens, AB469193.1), filipin (Pte, Streptomyces avermitilis MA-4680, BA000030.3), FK520 (Fkb, Streptomyces hygroscopicus subsp. ascomyceticus, AF235504.1), fostriecin (Fos, Streptomyces pulveraceus, HQ434551.1), geldanamycin (Gdm, Streptomyces hygroscopicus NRRL 3602, AY179507.1), gephyronic acid (Gph, Cystobacter violaceum Cb vi76, KF479198.1), guadinomine (Gdn, Streptomyces sp. K01-0509, JX545234.1), gulmirecin (Gul, Pyxidicoccus fallax, KM361622.1), halstoctacosanolide (Hls, Streptomyces halstedii, AB241068.1), hectochlorin (Hct, Lyngbya majuscula, AY974560.1), herbimycin (Hbm, Streptomyces hygroscopicus, AY947889.1), herboxidiene (Her, Streptomyces chromofuscus, JN671974.1), hitachimycin (Hit, Streptomyces scabrisporus, LC008143.1), hygrocin (Hgc, Streptomyces sp. LZ35, JX504844.1), incednine (Idn, Streptomyces sp. ML694-90F3, AB767280.1), indanomycin (Idm, Streptomyces antibioticus, FJ545274.1), jamaicamide (Jam, Lyngbya majuscula, AY522504.1), jerangolid (Jer, Polyangium cellulosum, DQ897668.1), kendomycin (Ken, Streptomyces violaceoruber, AM992894.1), kijanimicin (Kij, Actinomadura kijaniata, EU301739.1), lankamycin (Lkm, Streptomyces rochei, AB088224.2), lasalocid (Lsd, Streptomyces lasaliensis, AB449340.1), leupyrrin (Leu, Sorangium cellulosum, HM639990.1), lipomycin (Lip, Streptomyces aureofaciens, DQ176871.1), lobophorin (Lbp, Streptomyces sp. SCSIO 01127, KC013978.1), lobosamide (Lob, Micromonospora sp. RL09-050-HVF-A, KT209587.1), lorneic acid (Lor, Streptomyces sp. NBRC 109706, BBOM01000004.1), macbecin (Mbc, Actinosynnema pretiosum, EU827593.1), maklamicin (Mak, Micromonospora sp. GMKU326, LC021382.1), meilingmycin (Mei, Streptomyces nanchangensis, FJ952082.1), melithiazol (Mel, Melittangium lichenicola, AJ557546.1), meridamycin (Mer, Streptomyces sp. NRRL 30748, DQ351275.1), microcystin (Mcy, Planktothrix agardhii NIVA-CYA 126/8, AJ441056.1), microsclerodermin (Msc, Jahnella sp. MSr9139, KF657739.1), ML-449 (Mla, Streptomyces sp. MP39-85, FJ372525.1), monensin (Mon, Streptomyces cinnamonensis, AF440781.1), mycinamicin (Myc, Micromonospora griseorubida, AB089954.2), mycolactone (Mls, Mycobacterium ulcerans Agy99, BX649209.1), myxalamid (Mxa, Stigmatella aurantiaca, AF319998.1), myxothiazol (Mta, Stigmatella aurantiaca DW4/3-1, AF188287.1), nanchangmycin (Nan, Streptomyces nanchangensis, AF521085.1), nannocystin (Ncy, Nannocystis sp. MB1016, KT067736.1), naphthomycin (Nat, Streptomyces sp. CS, GQ452266.1), neoaureothin (Nor, Streptomyces orinoci, AM778535.1), niddamycin (Nid, Streptomyces caelestis, AF016585.1), nigericin (Nig, Streptomyces violaceusniger, DQ354110.1), nocardiopsin (Nsn, Nocardiopsis sp. CMB-M0232, KP339942.1), oligomycin (Olm, Streptomyces avermitilis, AB070940.1), nystatin (Nys, Streptomyces noursei ATCC 11455, AF263912.1), pellasoren (Pel, Sorangium cellulosum, HE616533.1), phenylnannolone (Phn, Nannocystis pusilla, KF739396.1), phoslactomycin (Plm, Streptomyces sp. HK803, AY354515.1), piericidin (Pie, Streptomyces piomogenus, HQ840721.1), pimaricin (Pim, Streptomyces natalensis, AJ278573.1), pikromycin (Pik, Streptomyces venezuelae, AF079138.1), pladienolide (Pld, Streptomyces platensis, AB435553.1), puwainaphycin (Puw, Cylindrospermum alatosporum CCALA 988, KM078884.1), pyoluteorin (Plt, Pseudomonas protegens Pf-5, AF081920.3), quartromicin (Qmn, Amycolatopsis orientalis, JF970188.1), rapamycin (Rap, Streptomyces rapamycinicus NRRL 5491, X86780.1), reveromycin (Rev, Streptomyces sp. SN-593, AB568601.1), rifamycin (Rif, Amycolatopsis mediterranei S699, AF040570.3), rubradirin (Rub, Streptomyces achromogenes subsp. rubradiris, AJ871581.1), salinilactam (Slm, Salinispora tropica CNB-440, CP000667.1), salinomycin (Sln, Streptomyces albus, JN033543.1), sanglifehrin (Sfa, Streptomyces flaveolus, FJ809786.1), soraphen (Sor, Sorangium cellulosum, U24241.2), spinosyn (Spn, Saccharopolyspora spinosa, AY007564.1), spirangien (Spi, Sorangium cellulosum, AM407731.1), stambomycin (Sta, Streptomyces ambofaciens ATCC 23877, AM238664.2), stigmatellin (Sti, Stigmatella aurantiaca Sg a15, AJ421825.1), streptazone (Stz, Streptomyces sp. MSC090213JE08, LC051217.1), streptolydigin (Slg, Streptomyces lydicus, FN433113.1), tautomycetin (Ttn, Streptomyces sp. MSC090213JE08, LC061217.1), tautomycin (Ttm, Streptomyces spiroverticillatus, EF990140.1), tetrocarcin (Tca, Micromonospora chalcea, EU443633.1), tetronasin (Tsn, Streptomyces longisporoflavus, FJ462704.1), tetronomycin (Tmn, Streptomyces sp. NRRL 11266, AB193609.1), thuggacin (Tga, Sorangium cellulosum, GQ981380.1), tiacumicin (Tia, Dactylosporangium aurantiacum subsp. hamdenensis, HQ011923.1), tirandamycin (Tam, Streptomyces sp. 307-9, GU385216.1), tubulysin (Tub, Cystobacter sp. SBCb004, GU002154.1), tylosin (Tyl, Streptomyces fradiae, U78289.1), versipelostatin (Vst, Streptomyces versipellis, LC006086.1), vicenistatin (Vin, Streptomyces halstedii, AB086653.1), and zwittermicin (Zma, Bacillus cereus, FJ430564.1) assembly lines were primarily obtained from MIBiG[12].
A FASTA file of the DNA encoding the extension modules, along with the corresponding amino acid sequences, was generated; the “GTNAH” motif near the end of the KS domain was used as the module boundary. Each module was named by the polypeptide containing its AT, its position within that polypeptide, and its type (e.g., the module EryA1_3b has its AT in the first polypeptide of the erythromycin PKS, in the third position, and is a β-module). Modules were divided into eight categories: α-modules without DD (n=62), α-modules with DD (n=9), β-modules without DD (n=168), β-modules with DD (n=111), γ-modules without DD (n=258), γ-modules with DD (n=183), δ-modules without DD (n=73), and δ-modules with DD (n=85). Modules encoded on two polypeptides connected through a DD were treated as one sequence (five C’s represent the C-terminus of the first polypeptide and five N’s represent the N-terminus of the second polypeptide). The biosynthetic model for each PKS helped determine which polypeptides contain the upstream and downstream portions of these split modules[1] (Supplementary Data Files 1-9).
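The module bookkeeping described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build the Supplementary Data Files. The function names are hypothetical, the one-letter suffixes other than "b" (β) are assumed by analogy with the EryA1_3b example, placing the boundary immediately after the motif is an assumption, and exact matching of "GTNAH" ignores any sequence variation in the motif.

```python
import re

# Assumed suffix-to-type mapping; only "b" (beta) is shown in the text.
MODULE_TYPES = {"a": "alpha", "b": "beta", "g": "gamma", "d": "delta"}


def parse_module_name(name):
    """Split a name such as 'EryA1_3b' into (polypeptide, position, type)."""
    match = re.match(r"^(.+)_(\d+)([abgd])$", name)
    if match is None:
        raise ValueError(f"unrecognized module name: {name!r}")
    polypeptide, position, suffix = match.groups()
    return polypeptide, int(position), MODULE_TYPES[suffix]


def split_at_motif(protein, motif="GTNAH"):
    """Cut a polypeptide after each exact occurrence of the boundary motif."""
    segments = []
    start = 0
    for hit in re.finditer(re.escape(motif), protein):
        segments.append(protein[start:hit.end()])
        start = hit.end()
    if start < len(protein):
        segments.append(protein[start:])  # remainder after the last motif
    return segments


def join_split_module(upstream, downstream):
    """Join the two halves of a split module with the CCCCC/NNNNN marker."""
    return upstream + "CCCCC" + "NNNNN" + downstream


print(parse_module_name("EryA1_3b"))         # ('EryA1', 3, 'beta')
print(split_at_motif("AAAGTNAHBBBGTNAHCC"))  # ['AAAGTNAH', 'BBBGTNAH', 'CC']
```

The toy strings are placeholders, not real PKS sequences.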
The Tandem Repeats Finder server was employed to detect tandem repeats at the DNA level [advanced mode with alignment parameters (match, mismatch, indels): 2, 5, 7; minimum alignment score: 50; maximum period size: 50; maximum tandem repeat array size (bp, millions): 2][13]. These DNA repeats as well as the corresponding amino acids were made lowercase and sequentially highlighted yellow, green, and cyan (Supplementary Data File 1).
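The lowercasing step can be sketched as follows, taking the repeat coordinates as given (for example, transcribed from the Tandem Repeats Finder output). This is an illustrative sketch under stated assumptions: the function name is hypothetical, coordinates are 0-based and half-open, the DNA is assumed to start in frame at the first codon, and any residue whose codon overlaps a repeat is lowercased. Color highlighting is not modeled.

```python
def lowercase_repeats(dna, protein, repeats):
    """Lowercase repeat spans in the DNA and the residues they encode.

    repeats: list of (start, end) DNA coordinates, 0-based, half-open.
    """
    dna_chars = list(dna.upper())
    protein_chars = list(protein.upper())
    for start, end in repeats:
        for i in range(start, end):
            dna_chars[i] = dna_chars[i].lower()
        # Codons overlapped by the repeat (assumes frame 0).
        first_codon = start // 3
        last_codon = (end - 1) // 3
        for j in range(first_codon, min(last_codon + 1, len(protein_chars))):
            protein_chars[j] = protein_chars[j].lower()
    return "".join(dna_chars), "".join(protein_chars)


# Toy example: a GCC repeat (three copies) encoding three alanines.
dna, protein = lowercase_repeats("ATGGCCGCCGCCTAA", "MAAA", [(3, 12)])
print(dna)      # ATGgccgccgccTAA
print(protein)  # Maaa
```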
While this automated detection was sufficient for repetitive sequences located between domains, curation was necessary within domains to catalog repetitive sequences that alter the composition and/or length of known loops. Thus, repetitive sequences located in regions that correspond to structured elements were made uppercase. Likewise, when it was unclear whether a repetitive sequence was located in a loop (e.g., in the poorly conserved regions of KRs from γ/δ-modules), it was made uppercase. To help determine whether a repetitive sequence is in a known loop, multiple sequence alignments were generated with the program SEAVIEW (using the Clustal Omega algorithm) and ESPript with the aid of known structures [AT region (EryAT4, PDB 2QO3), KS (EryKS3, PDB 2QO3), KR region of β-modules (SpnKR4, PDB 4IMP), KR of γ/δ-modules (SpnKR3, PDB 3SLK), DH of γ/δ-modules, ER of δ-modules (SpnER3, PDB 3SLK), and ACP (MycACP8 from MlsB, PDB 6H0Q)] (Supplementary Figures 1-11) [14-16]. All lowercase repetitive sequences, along with their period size and repeat number, were tabulated (Tables I-II).
Inverted repeats were detected using the EMBOSS palindrome server (minimum length of palindrome: 10; maximum length of palindrome: 100; gap between repeated regions: 50; mismatches allowed: 0)[17]. Repeats were highlighted in magenta or red. Those separated by more than 10 bases were italicized to indicate a lower likelihood of being biologically significant (Supplementary Data File 1).
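For readers who want to reproduce the flavor of this search without the server, the following is a simplified, brute-force stand-in. It is not the EMBOSS palindrome algorithm and may report hits differently. It looks for perfect inverted repeats (no mismatches) with arms of 10 to 100 bases and at most 50 bases between the arms, and flags hits whose arms are separated by more than 10 bases, mirroring the italicization rule. The function name and the output fields are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def find_inverted_repeats(dna, min_len=10, max_len=100, max_gap=50):
    """Report perfect inverted repeats as dicts with 0-based, half-open arms."""
    dna = dna.upper()
    n = len(dna)
    hits = []
    for a in range(n):                    # a: last base of the left arm
        for gap in range(max_gap + 1):
            b = a + 1 + gap               # b: first base of the right arm
            if b >= n:
                break
            # If the arms could grow inward, a smaller gap reports this hit.
            if gap >= 2 and COMPLEMENT.get(dna[a + 1]) == dna[b - 1]:
                continue
            k = 0                         # arm length found so far
            while (k < max_len and a - k >= 0 and b + k < n
                   and COMPLEMENT.get(dna[a - k]) == dna[b + k]):
                k += 1
            if k >= min_len:
                hits.append({
                    "left": (a - k + 1, a + 1),
                    "right": (b, b + k),
                    "gap": gap,
                    "italic": gap > 10,   # separated by more than 10 bases
                })
    return hits


# Toy example: a 10-base arm, a 3-base spacer, and its reverse complement.
print(find_inverted_repeats("AAAACCCCGG" + "TTT" + "CCGGGGTTTT"))
```

The nested loops scale poorly with sequence length, so this is only suitable for short test sequences.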