1 INTRODUCTION
Several decades of biochemical research have resulted in the elucidation of the functions of a large number of proteins. Yet, we are still far from a comprehensive understanding of the role of all proteins of any single organism. This is illustrated by the artificial minimal organism Mycoplasma mycoides JCVI-syn3A. With only 452 protein-coding genes, this organism currently defines the lower limit of the protein complement for an independently viable bacterium (Hutchison et al., 2016; Breuer et al., 2019). Yet, about one third of these proteins still have no known function, indicating major gaps in our knowledge even for the simplest cells (Hutchison et al., 2016; Pedreira et al., 2022a). These gaps are caused by a focus on a limited set of intensively studied proteins, for which more and more knowledge accumulates. In contrast, the biological function of a significant number of proteins remains poorly understood or entirely unknown.
Recently, it has been proposed to close this knowledge gap by launching the Understudied Proteins Initiative (Kustatscher et al., 2022a, b). However, the functional analysis of unknown proteins can be highly laborious and poorly rewarding. This is exemplified by a European/Japanese initiative to functionally identify all unknown proteins of the Gram-positive model bacterium Bacillus subtilis after the completion of the genome sequence. Although tremendous human and financial resources went into this project, functions were identified for no more than a handful of previously unknown proteins. This outcome resulted from the repeated application of a defined set of phenotypic analyses under a defined set of conditions. Typically, these phenotypes and conditions had been studied intensively before, so that little new knowledge could be expected. In conclusion, at least a minimal amount of molecular/functional annotation is required as the starting point to identify the function of so far unknown proteins. This minimal knowledge could cover expression over a wide range of conditions, the similarity of phenotypes to known phenotypes, or the association of unknown proteins with proteins of known function, with RNA, or with other biomolecules. Finally, the identification of gaps in our knowledge and the goal-driven investigation of the corresponding functions may help to unravel the roles of poorly studied proteins. Another prerequisite for closing the annotation gap is the integration of all available information in intuitively accessible databases.
We are interested in the model organism B. subtilis. As a model organism for differentiation and a workhorse in biotechnology, B. subtilis is one of the most intensively studied organisms. However, the function of about 25% of the proteins (about 1,000 proteins) encoded by the B. subtilis genome is still unknown or only very poorly understood (Pedreira et al., 2022). In this review, we give an overview of the strategies used to obtain an initial minimal annotation for so far unknown proteins, we present a set of highly expressed unknown proteins that should be studied with highest priority, and we define and discuss fields of research that still have many open questions, i.e., RNA-binding proteins, amino acid transport, and the control of metabolic homeostasis.