Linux (and OSX) commands for working with FASTA files
When working on the genomics or molecular marker side of bioinformatics, you will very often be faced with DNA (or RNA) sequence files in FASTA format. FASTA files can store a single sequence or multiple sequences. To access individual sequences or measure some property of the sequence data, such as length, the files usually need some form of manipulation.
This post gives a set of Unix commands for some common manipulations of FASTA files. These commands should work using the Bash shell under most popular distributions of Linux (I use Ubuntu and CentOS). They will also work in the OSX terminal and *SHOULD* work in Windows using software such as Cygwin.
The manipulations and metric measurements I cover in this article deal with:
- Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
- Joining multiple FASTA files into a single, multi-sequence FASTA file
- Listing the sequence headers in a FASTA file
- Counting the number of sequence entities in a FASTA file
- Determining the length of the sequence in a FASTA file
BEFORE WE BEGIN
Where you see <inputFile> or <outputFile> (or a similar name in angled brackets), replace this with your input file of choice or the name of the output file you wish to create, respectively.
The FASTA Format

To give a super brief description, FASTA format was originally the ASCII file format used for sequence information by the application of the same name. Some time has passed in the bioinformatics world, and FASTA-formatted files are now used by a variety of bioinformatics packages; the format is the de facto standard for storing sequence information in text files.
The FASTA format itself is very simple: a file consists of one or more sequence entries, each introduced by a free-text header line that starts with the chevron ‘>’ character and ends with a newline ‘\n’ character. The sequence itself follows on the lines after the header.
e.g. for DNA sequence:
>sequence 1
ACCGTACGATACGATCGCATCGCTGACTCG
ACTTACGACGACGCANNNNACATCGATCGA
ACACTCAGCA
>sequence 2
CACGCATTATCATCGATCCTCAGCTCATCGA
ATACGTACCACAACTCGCATCTCAGTCAGAC
ACTCGTACGCTACGTACGCATGCATCAGATC
ATCCTATGCATGCATCGTACGCTAGACTCGA
ATCGATCGCATGCATACGTACGCAT
NOTE: The sequence itself may be broken across several lines by newline characters; these should be stripped when using the sequence data.
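As a rough illustration of the note above, here is one way to strip the newlines inside each sequence while leaving the header lines alone. This is only a sketch: the file name sample.fasta and its contents are made up for the example, and it runs in a scratch directory so nothing of yours is touched.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file with sequences wrapped over several lines
cat > sample.fasta <<'EOF'
>sequence 1
ACCGTACG
ATAC
>sequence 2
CACG
CATT
EOF

# Print headers as they are; glue the sequence lines under each header into one line
linear=$(awk '
    /^>/ { if (seq != "") print seq; seq = ""; print; next }
         { seq = seq $0 }
    END  { if (seq != "") print seq }
' sample.fasta)

printf '%s\n' "$linear"
```

Each entry comes out as exactly two lines (header, then the full sequence), which is often easier to feed into other line-based tools.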
Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
This command will create as many files as there are member sequences. The files are written to the current working directory and are incrementally numbered with a .fasta extension (e.g. for an input file with 5 member sequences, such as the Arabidopsis genome, it will output files 1.fasta to 5.fasta).
awk '/^>/{if (f) close(f); f=++d".fasta"} {print > f}' <inputFile>
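If numbered files are not what you want, a variation on the same idea is to name each output file after the first word of its header. The sketch below assumes those first words are unique and safe to use as file names; genome.fa and its contents are invented for the example.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical two-sequence input file
cat > genome.fa <<'EOF'
>chr1 first chromosome
ACGT
ACG
>chr2 second chromosome
TTGA
EOF

# $1 is the first word of the header (">chr1"); substr() drops the leading ">".
# close() releases the previous file so large inputs do not run out of file handles.
awk '/^>/ { if (f) close(f); f = substr($1, 2) ".fasta" }
     f    { print > f }' genome.fa

ls
```

This should leave chr1.fasta and chr2.fasta next to genome.fa. If two headers share the same first word, the later entry overwrites the earlier one, so check your headers first.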
Joining multiple FASTA files into a single, multi-sequence FASTA file
This is the reverse of the above and we will assume a few things. Firstly, you want to combine all FASTA files in the current directory and, secondly, they all have the same extension (.fasta). Adapt to your needs if this is not the case!
cat *.fasta > <outputFile>
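One thing to watch when joining: if any input file lacks a final newline, cat will glue its last sequence line onto the next file's header. A possible workaround, sketched below with made-up file names, is to let awk print every line, since it ends each line it prints with a newline. The output is given a different extension (.fa) here so that it is not picked up by the *.fasta pattern on a second run.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Three hypothetical single-sequence files; b.fasta has no trailing newline
printf '>a\nACGT\n' > a.fasta
printf '>b\nTTGA'   > b.fasta
printf '>c\nGGCC\n' > c.fasta

# "awk 1" prints every input line followed by a newline
awk 1 *.fasta > combined.fa

cat combined.fa
```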
Listing the sequence headers in a FASTA file
grep ">" <inputFile>
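Often only the sequence identifier is wanted rather than the whole header. Assuming the identifier is the first space-separated word after the chevron (a common convention, but not something the format guarantees), something like the following would list just the identifiers. The sample file is made up for the example.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file
cat > sample.fasta <<'EOF'
>seq1 Arabidopsis example
ACGTACGT
>seq2 another example
TTGACC
EOF

# Keep header lines, drop the leading ">", keep only the first word
ids=$(grep '^>' sample.fasta | sed 's/^>//' | cut -d ' ' -f 1)

printf '%s\n' "$ids"
```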
Counting the number of sequence entities in a FASTA file
grep ">" <inputFile> | wc -l
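As a small variation, grep can do the counting itself with its -c option, which saves the pipe to wc. Anchoring the pattern with ^ restricts the count to lines that actually start with the chevron. The sample file below is invented for the example.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file with two entries
cat > sample.fasta <<'EOF'
>seq1 Arabidopsis example
ACGTACGT
>seq2 another example
TTGACC
EOF

# -c prints the number of matching lines instead of the lines themselves
count=$(grep -c '^>' sample.fasta)

echo "$count"
```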
Determining the length of the sequence in a FASTA file
This method gives the TOTAL sequence length of a FASTA file: if your FASTA file has a number of sequence entries, it returns the sum of the lengths of all the entries. To get the length of individual entries you would first need to split the file into individual entries, or do it programmatically, either using a homegrown method or a bioinformatics library such as BioPerl.
grep -v ">" <inputFile> | tr -d '[:space:]' | wc -c
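For per-entry lengths, a homegrown method does not have to be very long. The awk sketch below is one possibility, not the only way to do it: it prints each header (without the chevron) followed by a tab and the length of that entry. The sample file is made up for the example.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file with two wrapped sequences
cat > sample.fasta <<'EOF'
>sequence 1
ACCGTACG
ATAC
>sequence 2
CACG
CATT
EOF

# On each header: report the previous entry, then start a new running total.
# On each sequence line: strip stray spaces, tabs and carriage returns, then add its length.
lengths=$(awk '
    /^>/ { if (name != "") print name "\t" len; name = substr($0, 2); len = 0; next }
         { gsub(/[ \t\r]/, ""); len += length($0) }
    END  { if (name != "") print name "\t" len }
' sample.fasta)

printf '%s\n' "$lengths"
```

For this sample the two entries come out as 12 and 8 bases, which sum to the total of 20 that the grep, tr and wc pipeline reports for the same file.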
These are a few useful commands for performing some common and simple FASTA file manipulations without needing to resort to programmatic methods. It may be worthwhile defining an alias or simple Bash script wrapper for the above commands, allowing you to type something like: fastaLength fastafile.fasta at the command line.
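A wrapper along those lines might look like the sketch below. It uses a shell function rather than an alias, because the file name has to be handed to the first command in the pipeline and an alias can only append arguments at the end. The name fastaLength is taken from the example above; the sample file is made up.

```shell
# Work in a scratch directory
workdir=$(mktemp -d)
cd "$workdir"

# Hypothetical sample file
cat > sample.fasta <<'EOF'
>sequence 1
ACCGTACG
ATAC
>sequence 2
CACG
CATT
EOF

# Total sequence length of the FASTA file given as the first argument
fastaLength() {
    grep -v '>' "$1" | tr -d '[:space:]' | wc -c
}

fastaLength sample.fasta
```

Putting the function definition in your ~/.bashrc would make it available in every new terminal session.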


Chris is a shining example of how, in a world where abortion is not as easily obtained as a leg of fried chicken, an ounce of prevention can be worth more than its weight in gold. He is currently completing a PhD in Bioinformatics which he hopes may one day help him get out of a speeding ticket.