Linux (and OSX) commands for working with FASTA files
When working with the genomics or molecular marker side of BIoinformatics, Bioinformaticians are faced (very) often with DNA (or RNA) sequence files in FASTA format. FASTA files can store a single sequence, or multiple sequences. To be able to access individual sequences or measure some metric of the sequence data, such as length, some form of manipulation of the files is usually required.
This post gives a set of unix commands to perform some common manipulations required of FASTA files. These commands should work using the BASH shell under most popular distributions of linux (I use Ubuntu and CentOS). They will also work in the OSX terminal and *SHOULD* work in Windows using software such as Cygwin.
The manipulations and metric measurements I cover in this article deal with:
- Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
- Joining multiple FASTA files into a single, multi-sequence FASTA file
- List the sequence headers in a FASTA file
- Counting the number of sequence entities in a FASTA file
- Determining the length of the sequence in a FASTA file
BEFORE WE BEGIN
Where you see or (or a similar name in angled brackets), replace this with your input file of choice or the name of the output file you wish to create respectively.
The FASTA Format

To give a super brief description, FASTA format was the ASCII file format used for sequence information for the application of the same name. Some time in bioinformatics world passed and now FASTA formatted files are used by a variety of Bioinformatics packages and is the de facto standard for storing sequence information in text files.
The FASTA format itself is very simple: A file can consist of one or more sequence elements, each headed by a free text header starting with the chevron ‘>’ character and ending with a newline ‘\n’ character.
e.g. for DNA sequence:
>sequence 1 ACCGTACGATACGATCGCATCGCTGACTCG ACTTACGACGACGCANNNNACATCGATCGA ACACTCAGCA >sequence 2 CACGCATTATCATCGATCCTCAGCTCATCGA ATACGTACCACAACTCGCATCTCAGTCAGAC ACTCGTACGCTACGTACGCATGCATCAGATC ATCCTATGCATGCATCGTACGCTAGACTCGA ATCGATCGCATGCATACGTACGCAT
NOTE: The sequence itself may have newline characters throughout the sequence – these should be
stripped when using the sequence data.
Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
This command will create as many files as there are member sequences in the same directory as the source file,
incrementally numbered with a .fasta extension. (e.g. for an input file with 5 member sequences, such as the Arabidopsis genome, it will output files 1.fasta to 5.fasta.
awk '/^>/{f=++d".fasta"} {print > f}' <inputFile>
Joining multiple FASTA files into a single, multi-sequence FASTA file
This is the reverse of the above and we will assume a few things. Firstly, you want to combine all fasta files in the current directory and, secondly, they all have the same extension (.fasta). Adapt to your needs if this is not the case!
cat *.fasta > <outputFile>
List the sequence headers in a FASTA file
grep ">" <inputFile>
Counting the number of sequence entities in a FASTA file
grep ">" <inputFile> | wc -l
Determining the length of the sequence in a FASTA file
This method will give the TOTAL sequence length of a FASTA file. This means that if your FASTA file has a number of sequence entries, it will return the sum of the length of each sequence entry. To get the length of individual entries you would first need to split the file into individual entries, or do it programatically: either using a homegrown method or a Bioinformatics library such as BioPerl.
grep -v ">" <inputFile> | tr -d [:space:] | wc -c
These are a few useful commands for performing some common and simple FASTA file manipulations without needing to resort to programatic methods. It may be worthwhile defining an alias or simple bashscript wrapper for the above commands, allowing you to type something like: fastaLength fastafile.fasta at the command line.

Chris is a shining example of how, in a world where abortion is not as easily obtained as a leg of fried chicken, an ounce of prevention can be worth more than its weight in gold. He is currently completing a PhD in Bioinformatics which he hopes may one day help him get out of a speeding ticket.
By Lex on Feb 24, 2010 | Reply
Hey Chris! I’m just getting started with Bioinformatics. That was very helpful! Thanks!
By Chris on Feb 25, 2010 | Reply
Hi Lex, glad the post was useful, I appreciate the comment. What sort of bioinformatics work are you getting into? At any rate, enjoy the world of Bioinformatics!
By SiewFen on Mar 11, 2010 | Reply
hi
i am looking about grep sequences from multifasta format and i saw ur page,
just would like to ask if i would like to grep a sequence from a multifasta format. what should i do?
eg;
file1
001
003
file2
>001
fdjsklfslkfjd
>002
afdsskljfskl
>003
dfshafhsjk
thanks
By tux25 on Apr 14, 2010 | Reply
Great, but I am looking for a line command or script (bash, perl) for cut a sequence in the fasta sequence file (for example amino acids between 52 and 987).
Someone could help me ?
By capemaster on Aug 19, 2010 | Reply
Mhh, nice script, but I get this error.
awk: 18.fasta makes too many open files
Any Idea?
(using mac os x 10.6.4)
By capemaster on Aug 19, 2010 | Reply
fixed
awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’
Bye!
PS I was in San Diego at the PAG too!
By capemaster on Aug 19, 2010 | Reply
Ops… I cut and paste the wrong script! please delete the comment above.
By yifang on Sep 3, 2010 | Reply
How to use the sequence name of each fasta entry as the individual fiel name?
>seq1
ATCGATCGATAGCAT
>seq2
ATCGATGTTGGGGGG
seq1.fasta
>seq1
ATCGATCGATAGCAT
seq2.fasta
>seq2
ATCGATGTTGGGGGG
By Michele Foley on Dec 24, 2010 | Reply
fixed awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’ Bye! PS I was in San Diego at the PAG too!
By Stacy Gregory on Dec 24, 2010 | Reply
fixed awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’ Bye! PS I was in San Diego at the PAG too!
By Ædrián Lozano on Mar 12, 2011 | Reply
Hi Chris. Can I combine these lines with some commands like “for” or “while” and use each sequence entry properties for filterning?
Like extracting to a new fasta file exclusivly those sequences that begin with an especific 18-20 nucleotide sequence.
By xgenseq on Apr 21, 2011 | Reply
Another way to count number of sequences in a fasta file.
grep -c “>”
Cheers
xgenseq
By Matrimoniale Romania on Jun 1, 2011 | Reply
After reading this I thought it was quite informative. I appreciate you taking the time and effort to put this post. Once again I find myself spending way to much time both reading and commenting on blogs .
By Deandre Crowston on Jun 8, 2011 | Reply
Do you have any videos on youtube about this?
By dating tips on Jun 18, 2011 | Reply
Thank you, I have been hunting for details about this topic for ages and yours is the best I’ve discovered so far.
By sanibel island resorts on Sep 13, 2011 | Reply
Hey, I needed to ask you some thing. Is this a wordpress site? We are contemplating changing our website from Blogger to wordpress, ya think this can be doable? Additionally did you produce this particular theme yourself some how? Cheers for the assistance!
By Laura Spoor on Oct 31, 2011 | Reply
Hi There!
Very helpful website, I have FASTA sequence with some lowercase letters (low quality, I tried to filter genome assembly using samtools for read depth but it still called the base, just as lowercase!). Does anyone know if there is there a way I can convert the lowercase letters to ‘N’?
Many Thanks!
By Chiropractor Phoenix on Jan 22, 2012 | Reply
Terrific paintings! That is the kind of info that are supposed to be shared around the internet. Shame on the seek engines for now not positioning this put up higher! Come on over and talk over with my web site . Thanks