1 INTRODUCTION
With the broad implementation of NGS technologies in the life sciences,
genomics and transcriptomics sequencing data are generated at an
unprecedented rate (Breese & Liu, 2013; Jung et al., 2019; Jung et al.,
2020). Rapid progress in NGS technologies has delivered massive volumes of
high-throughput sequencing data to address questions across many research
fields, enabling a new era of genomic research (Jung et al., 2019; Jung et
al., 2020). At the same time, this advance has brought enormous challenges
in data analysis: efficient, standardized and consistent analysis is
fundamental to maintaining reproducibility, especially for biologists
(Breese & Liu, 2013; Jung et al., 2019; Jung et al., 2020). However, many of the
available tools for NGS data analysis require advanced computational
experience (e.g. various programming/scripting languages) and expensive
infrastructure (adequate HPC facilities and Cloud computing), and lack
GUIs, making them inaccessible to many researchers and cumbersome even
for experienced biologists. Thus, the development of user-friendly
standalone software for NGS data should accelerate the pace of research
for scientists who have limited computer and bioinformatics experience.
NGS data processing often involves consecutive steps of trimming
(including quality check), assembling, mapping, manipulating, converting
and processing large files. The FASTA (Pearson & Lipman, 1988) and FASTQ
(Cock et al., 2010) file formats are generated by most NGS platforms,
and further formats such as SAM/BAM (Li et al., 2009), BED (Kent et al.,
2002), GFF/GTF (Pertea & Pertea, 2020), and VCF (Danecek et al., 2011) can
be derived from FASTA and FASTQ files depending on the required analysis. The
FASTA file, based on simple text, is the most basic format for reporting
a sequence and is accepted by almost all sequence analysis programs.
Each record starts with a header line beginning with “>”, followed by the
sequence name and an optional description of the sequence, and then the
sequence itself (nucleic acids or amino acids) on the following lines. The
FASTQ file, a text-based format for storing both a biological sequence
(usually a nucleotide sequence) and its corresponding quality scores, is
the most widely used format for the output of NGS sequencers. Each record
requires at least four lines: a header line starting with “@” followed by
the sequence identifier, the sequence, a separator line starting with “+”
(optionally repeating the identifier), and the quality scores.
Conveniently, FASTQ files can also be converted to FASTA files, the most
commonly used file format for NGS data that enables direct sequencing of
target genes. Many available
tools, such as easySEARCH (Kim et al., 2012), BlasterJS (Blanco-Míguez et
al., 2018), BlastGUI (Du et al., 2020), Sequenceserver (Priyam et al.,
2019), orfipy (Singh & Wurtele, 2021), and Samtools and BCFtools (Danecek
et al., 2021), as well as easyfm, have unsurprisingly focused on
manipulating (analysing, collecting, organising, interpreting, and
presenting data in meaningful ways) the FASTA file format to generate
biologically relevant insights.
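The record layouts described above can be illustrated with a minimal Python sketch. It is not taken from easyfm; the function names read_fasta and read_fastq are hypothetical, and the FASTQ reader assumes the common case of exactly four lines per record (no wrapped sequence or quality lines).

```python
from io import StringIO
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for each FASTA record."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                name, _, description = header.partition(" ")
                yield name, description, "".join(chunks)
            header = line[1:]  # drop the leading ">"
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        name, _, description = header.partition(" ")
        yield name, description, "".join(chunks)


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line FASTQ record.

    Assumption: records are not line-wrapped, so each one is exactly four lines.
    """
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.strip()
        if not header:
            continue  # ignore blank lines between records
        sequence = handle.readline().strip()
        separator = handle.readline().strip()
        quality = handle.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


if __name__ == "__main__":
    demo = StringIO(">gene1 example description\nACGT\nTTGA\n")
    for name, description, sequence in read_fasta(demo):
        print(name, description, sequence)
```

Both readers are generators, so records are produced one at a time and a large file does not need to be held in memory.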
Over the last decade, many HPC and Cloud-based NGS command-line programs
or web-based platforms have wrapped popular high-level analysis and
visualisation tools in an intuitive and appealing interface (Baker et
al., 2020). Galaxy (homepage: https://galaxyproject.org, main
public server: https://usegalaxy.org, Australia:
https://usegalaxy.org.au/) in particular has been successful in
establishing itself as an analytics hub and an e-learning platform for
scientists worldwide, aiming to produce accessible, reproducible and
collaborative biological analyses (Afgan et al., 2018; Serano-Solano et
al., 2021). Even with the huge achievements made in many analytical
software packages and pipelines, further improvements in user-friendly
standalone software are still required to facilitate the rapid discovery
of meaningful sequences in very large data sets for novice users. To
augment the functionality of existing tools and make NGS file
manipulation more user-friendly and convenient, easyfm enables end-to-end
file filtering, extraction and conversion (FASTQ to FASTA) with a simple
mouse click on desktops.
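As an illustration of what a FASTQ to FASTA conversion with filtering involves, the following is a minimal Python sketch under stated assumptions. It is not easyfm's actual implementation: the function name fastq_to_fasta and the min_length filter are hypothetical, and the input is assumed to use four-line FASTQ records.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO, min_length: int = 0) -> int:
    """Write FASTQ records as FASTA records and return the number written.

    Assumptions (illustrative only): each FASTQ record is exactly four
    lines, and min_length is a hypothetical filter on sequence length.
    """
    written = 0
    while True:
        header = fastq.readline()
        if not header:
            break  # end of input
        header = header.strip()
        if not header:
            continue  # ignore blank lines between records
        sequence = fastq.readline().strip()
        separator = fastq.readline().strip()
        quality = fastq.readline().strip()
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        if len(sequence) < min_length:
            continue  # filtered out: read is too short
        # FASTA keeps the identifier and sequence; quality scores are dropped.
        fasta.write(f">{header[1:]}\n{sequence}\n")
        written += 1
    return written


if __name__ == "__main__":
    source = io.StringIO("@r1 sample\nACGTAC\n+\nIIIIII\n@r2\nAC\n+r2\nII\n")
    target = io.StringIO()
    count = fastq_to_fasta(source, target, min_length=3)
    print(count)
    print(target.getvalue(), end="")
```

The conversion is lossy by design: the quality line has no counterpart in FASTA, so the original FASTQ file should be kept if the scores are needed later.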
easyfm, implemented in Python 3.7+, was developed with four
work modules (Basic Local Alignment Search Tool [BLAST], BLAST-Like
Alignment Tool [BLAT], Open Reading Frames [ORF], and File
Manipulation) and a secondary window (Project Folder, Help and Log).
Together, these modules and secondary window cover different aspects of
NGS data analysis (mainly focusing on FASTA files), including
post-processing, filtering, format conversion, and generating results.
The functionality of each module is described in the Results and
Discussion section to provide an easy-to-follow parallel comparison.
easyfm is a GUI-based, lightweight but powerful, free and open-source
desktop software for querying/manipulating NGS data sources and
generating various outcomes. Since anyone can use it anywhere to analyse
data and find target sequences easily without any coding, HPC and/or
internet/web-server connection, we hope that easyfm will prove useful in
a wide range of bioinformatics applications in the life sciences,
including as teaching/learning material in the classroom.