Linux (and OSX) commands for working with FASTA files

02 Sep 2009
19 Comments »

When working with the genomics or molecular marker side of BIoinformatics, Bioinformaticians are faced (very) often with DNA (or RNA) sequence files in FASTA format. FASTA files can store a single sequence, or multiple sequences. To be able to access individual sequences or measure some metric of the sequence data, such as length, some form of manipulation of the files is usually required.

This post gives a set of unix commands to perform some common manipulations required of FASTA files. These commands should work using the BASH shell under most popular distributions of linux (I use Ubuntu and CentOS). They will also work in the OSX terminal and *SHOULD* work in Windows using software such as Cygwin.

The manipulations and metric measurements I cover in this article deal with:

  1. Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
  2. Joining multiple FASTA files into a single, multi-sequence FASTA file
  3. List the sequence headers in a FASTA file
  4. Counting the number of sequence entities in a FASTA file
  5. Determining the length of the sequence in a FASTA file

BEFORE WE BEGIN
Where you see or (or a similar name in angled brackets), replace this with your input file of choice or the name of the output file you wish to create respectively.

The FASTA Format

sourced from http://www.nmpdr.org/

To give a super brief description, FASTA format was the ASCII file format used for sequence information for the application of the same name. Some time in bioinformatics world passed and now FASTA formatted files are used by a variety of Bioinformatics packages and is the de facto standard for storing sequence information in text files.

The FASTA format itself is very simple: A file can consist of one or more sequence elements, each headed by a free text header starting with the chevron ‘>’ character and ending with a newline ‘\n’ character.

e.g. for DNA sequence:

>sequence 1
ACCGTACGATACGATCGCATCGCTGACTCG
ACTTACGACGACGCANNNNACATCGATCGA
ACACTCAGCA
>sequence 2
CACGCATTATCATCGATCCTCAGCTCATCGA
ATACGTACCACAACTCGCATCTCAGTCAGAC
ACTCGTACGCTACGTACGCATGCATCAGATC
ATCCTATGCATGCATCGTACGCTAGACTCGA
ATCGATCGCATGCATACGTACGCAT

NOTE: The sequence itself may have newline characters throughout the sequence – these should be
stripped when using the sequence data.

Splitting a FASTA file of multiple sequences into FASTA files of individual sequences

This command will create as many files as there are member sequences in the same directory as the source file,
incrementally numbered with a .fasta extension. (e.g. for an input file with 5 member sequences, such as the Arabidopsis genome, it will output files 1.fasta to 5.fasta.

awk '/^>/{f=++d".fasta"} {print > f}' <inputFile>

Joining multiple FASTA files into a single, multi-sequence FASTA file

This is the reverse of the above and we will assume a few things. Firstly, you want to combine all fasta files in the current directory and, secondly, they all have the same extension (.fasta). Adapt to your needs if this is not the case!

cat *.fasta > <outputFile>

List the sequence headers in a FASTA file

grep ">" <inputFile>

Counting the number of sequence entities in a FASTA file

grep ">" <inputFile> | wc -l

Determining the length of the sequence in a FASTA file

This method will give the TOTAL sequence length of a FASTA file. This means that if your FASTA file has a number of sequence entries, it will return the sum of the length of each sequence entry. To get the length of individual entries you would first need to split the file into individual entries, or do it programatically: either using a homegrown method or a Bioinformatics library such as BioPerl.

grep -v ">" <inputFile> | tr -d [:space:] | wc -c

These are a few useful commands for performing some common and simple FASTA file manipulations without needing to resort to programatic methods. It may be worthwhile defining an alias or simple bashscript wrapper for the above commands, allowing you to type something like: fastaLength fastafile.fasta at the command line.


19 Responses to “Linux (and OSX) commands for working with FASTA files”

  1. By Lex on Feb 24, 2010 | Reply

    Hey Chris! I’m just getting started with Bioinformatics. That was very helpful! Thanks!


  2. By Chris on Feb 25, 2010 | Reply

    Hi Lex, glad the post was useful, I appreciate the comment. What sort of bioinformatics work are you getting into? At any rate, enjoy the world of Bioinformatics! :)


  3. By SiewFen on Mar 11, 2010 | Reply

    hi
    i am looking about grep sequences from multifasta format and i saw ur page,
    just would like to ask if i would like to grep a sequence from a multifasta format. what should i do?

    eg;

    file1
    001
    003

    file2
    >001
    fdjsklfslkfjd
    >002
    afdsskljfskl
    >003
    dfshafhsjk

    thanks


  4. By tux25 on Apr 14, 2010 | Reply

    Great, but I am looking for a line command or script (bash, perl) for cut a sequence in the fasta sequence file (for example amino acids between 52 and 987).

    Someone could help me ?


  5. By capemaster on Aug 19, 2010 | Reply

    Mhh, nice script, but I get this error.

    awk: 18.fasta makes too many open files

    Any Idea?

    (using mac os x 10.6.4)


  6. By capemaster on Aug 19, 2010 | Reply

    fixed

    awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’

    Bye!

    PS I was in San Diego at the PAG too!


  7. By capemaster on Aug 19, 2010 | Reply

    Ops… I cut and paste the wrong script! please delete the comment above.


  8. By yifang on Sep 3, 2010 | Reply

    How to use the sequence name of each fasta entry as the individual fiel name?

    >seq1
    ATCGATCGATAGCAT
    >seq2
    ATCGATGTTGGGGGG

    seq1.fasta
    >seq1
    ATCGATCGATAGCAT

    seq2.fasta

    >seq2
    ATCGATGTTGGGGGG


  9. By Michele Foley on Dec 24, 2010 | Reply

    fixed awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’ Bye! PS I was in San Diego at the PAG too!


  10. By Stacy Gregory on Dec 24, 2010 | Reply

    fixed awk ‘/^>/{f=++d”.fasta”} {print > f} {close f}’ Bye! PS I was in San Diego at the PAG too!


  11. By Ædrián Lozano on Mar 12, 2011 | Reply

    Hi Chris. Can I combine these lines with some commands like “for” or “while” and use each sequence entry properties for filterning?
    Like extracting to a new fasta file exclusivly those sequences that begin with an especific 18-20 nucleotide sequence.


  12. By xgenseq on Apr 21, 2011 | Reply

    Another way to count number of sequences in a fasta file.
    grep -c “>”

    Cheers
    xgenseq


  13. By Matrimoniale Romania on Jun 1, 2011 | Reply

    After reading this I thought it was quite informative. I appreciate you taking the time and effort to put this post. Once again I find myself spending way to much time both reading and commenting on blogs .


  14. By Deandre Crowston on Jun 8, 2011 | Reply

    Do you have any videos on youtube about this?


  15. By dating tips on Jun 18, 2011 | Reply

    Thank you, I have been hunting for details about this topic for ages and yours is the best I’ve discovered so far.


  16. By sanibel island resorts on Sep 13, 2011 | Reply

    Hey, I needed to ask you some thing. Is this a wordpress site? We are contemplating changing our website from Blogger to wordpress, ya think this can be doable? Additionally did you produce this particular theme yourself some how? Cheers for the assistance!


  17. By Laura Spoor on Oct 31, 2011 | Reply

    Hi There!
    Very helpful website, I have FASTA sequence with some lowercase letters (low quality, I tried to filter genome assembly using samtools for read depth but it still called the base, just as lowercase!). Does anyone know if there is there a way I can convert the lowercase letters to ‘N’?
    Many Thanks!


  18. By Chiropractor Phoenix on Jan 22, 2012 | Reply

    Terrific paintings! That is the kind of info that are supposed to be shared around the internet. Shame on the seek engines for now not positioning this put up higher! Come on over and talk over with my web site . Thanks


  1. 1 Trackback(s)

  2. Oct 18, 2011: SEO

Post a Comment