CMap3D featured on the GMOD website, and at PAG18

22 Jan 2010

One of my applications from my PhD, CMap3D, has been featured on the GMOD project’s news page. It even has its own GMOD wiki page.

Cmap3D screenshot

CMap3D is a 3D visualisation tool for comparative genetic/physical maps. It works by interfacing with data from an existing CMap comparative mapping installation. CMap is a GMOD application, so by association CMap3D is becoming linked to GMOD, with the mid- to long-term goal of making CMap3D part of the GMOD project.

I returned last weekend from the annual Plant and Animal Genome conference, held in San Diego. One of the interactions I enjoyed most that week was meeting a number of people involved in the GMOD project. GMOD (Generic Model Organism Database) is an open-source project that strives to collect a number of interacting bioinformatics tools focused on visualising and managing data relating to the genomes of any and all organisms.

And on a final CMap3D note, I also had a Bioinformatics paper on CMap3D accepted earlier this year.

Linux (and OSX) commands for working with FASTA files

02 Sep 2009

When working on the genomics or molecular marker side of bioinformatics, bioinformaticians are (very) often faced with DNA (or RNA) sequence files in FASTA format. FASTA files can store a single sequence or multiple sequences. To access individual sequences, or to measure some metric of the sequence data such as length, some form of manipulation of the files is usually required.

This post gives a set of Unix commands to perform some common manipulations of FASTA files. These commands should work in the Bash shell under most popular distributions of Linux (I use Ubuntu and CentOS). They will also work in the OS X terminal and *SHOULD* work in Windows using software such as Cygwin.

The manipulations and metric measurements I cover in this article deal with:

  1. Splitting a FASTA file of multiple sequences into FASTA files of individual sequences
  2. Joining multiple FASTA files into a single, multi-sequence FASTA file
  3. Listing the sequence headers in a FASTA file
  4. Counting the number of sequence entities in a FASTA file
  5. Determining the length of the sequence in a FASTA file

BEFORE WE BEGIN
Where you see <inputFile> or <outputFile> (or a similar name in angled brackets), replace the whole placeholder, brackets included, with your input file of choice or the name of the output file you wish to create, respectively.

The FASTA Format

sourced from http://www.nmpdr.org/

To give a super brief description, FASTA format was the ASCII file format used for sequence information by the application of the same name. Some time passed in the bioinformatics world, and FASTA-formatted files are now used by a variety of bioinformatics packages; the format is the de facto standard for storing sequence information in text files.

The FASTA format itself is very simple: a file can consist of one or more sequence elements, each headed by a free-text header line starting with the chevron ‘>’ character and ending with a newline ‘\n’ character. The lines that follow, up to the next header or the end of the file, hold the sequence.

e.g. for DNA sequence:

>sequence 1
ACCGTACGATACGATCGCATCGCTGACTCG
ACTTACGACGACGCANNNNACATCGATCGA
ACACTCAGCA
>sequence 2
CACGCATTATCATCGATCCTCAGCTCATCGA
ATACGTACCACAACTCGCATCTCAGTCAGAC
ACTCGTACGCTACGTACGCATGCATCAGATC
ATCCTATGCATGCATCGTACGCTAGACTCGA
ATCGATCGCATGCATACGTACGCAT

NOTE: The sequence itself may have newline characters throughout the sequence; these should be stripped when using the sequence data.
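
If you want each sequence on a single line before doing anything else with it, an awk sketch along the following lines should do the stripping. The scratch directory and the file names (example.fasta, unwrapped.fasta) are made up for this illustration, and the input is a toy file rather than real data.

```shell
# Toy input in a scratch directory (file names are illustrative only)
tmpdir=$(mktemp -d)
printf '>sequence 1\nACCGTACG\nACTTACGA\n>sequence 2\nCACGCATT\nATAC\n' > "$tmpdir/example.fasta"

# Print each header as-is and join the sequence lines that follow it
awk '/^>/ { if (seq != "") print seq; seq = ""; print; next }
     { seq = seq $0 }
     END { if (seq != "") print seq }' "$tmpdir/example.fasta" > "$tmpdir/unwrapped.fasta"

cat "$tmpdir/unwrapped.fasta"
```

Each record in the output is one header line followed by one sequence line. Note that this sketch does not remove Windows-style carriage returns; files that have passed through Windows may need those stripped first.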

Splitting a FASTA file of multiple sequences into FASTA files of individual sequences

This command will create as many files as there are member sequences, incrementally numbered and with a .fasta extension. The files are written to the current working directory, i.e. the directory you run the command from, which is not necessarily where the source file lives. For example, for an input file with 5 member sequences, such as the Arabidopsis genome, it will output files 1.fasta to 5.fasta.

awk '/^>/{if (f) close(f); f=++d".fasta"} f{print > f}' <inputFile>
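
A quick way to convince yourself the split did what you expected is to compare the number of headers in the source file with the number of files produced. The following is only a sketch on a toy file in a scratch directory; the names tmpdir and example.fasta are my own placeholders. The awk program here calls close() on each finished file, which should keep it within the open-file limit on inputs with many sequences.

```shell
# Toy input in a scratch directory (file names are illustrative only)
tmpdir=$(mktemp -d)
printf '>sequence 1\nACCG\nTACG\n>sequence 2\nCACG\n' > "$tmpdir/example.fasta"

# Run the split from inside the scratch directory so the numbered files land there
(
    cd "$tmpdir" || exit 1
    awk '/^>/ { if (f) close(f); f = ++d ".fasta" } f { print > f }' example.fasta
)

# Compare the header count with the number of numbered output files
headers=$(grep -c '^>' "$tmpdir/example.fasta")
files=$(ls "$tmpdir" | grep -c '^[0-9][0-9]*\.fasta$')
echo "headers: $headers, files: $files"
```

If the two numbers differ, the most likely culprits are numbered .fasta files left over from an earlier run in the same directory.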

Joining multiple FASTA files into a single, multi-sequence FASTA file

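This is the reverse of the above, and we will assume a few things: firstly, that you want to combine all FASTA files in the current directory and, secondly, that they all have the same extension (.fasta). Adapt to your needs if this is not the case! Give the output file a different extension, or write it to another directory, so that it is not itself matched by *.fasta if you run the command again.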

cat *.fasta > <outputFile>

Listing the sequence headers in a FASTA file

grep "^>" <inputFile>

Counting the number of sequence entities in a FASTA file

grep -c "^>" <inputFile>

Determining the length of the sequence in a FASTA file

This method will give the TOTAL sequence length of a FASTA file. This means that if your FASTA file has a number of sequence entries, it will return the sum of the lengths of all sequence entries. To get the length of individual entries you would first need to split the file into individual entries, or do it programmatically, using either a homegrown method or a bioinformatics library such as BioPerl.

grep -v "^>" <inputFile> | tr -d '[:space:]' | wc -c
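
As an example of the homegrown route, per-sequence lengths can be had from awk without splitting the file first. This is a minimal sketch on a toy file, not a replacement for a proper library; the names tmpdir, example.fasta and lengths.tsv are placeholders of mine.

```shell
# Toy input in a scratch directory (file names are illustrative only)
tmpdir=$(mktemp -d)
printf '>sequence 1\nACCGTACG\nACTTACGA\n>sequence 2\nCACGCATT\nATAC\n' > "$tmpdir/example.fasta"

# For each record, print the header text (without '>') and the summed sequence length
awk '/^>/ { if (seen) print name "\t" len; seen = 1; name = substr($0, 2); len = 0; next }
     { gsub(/[ \t\r]/, ""); len += length($0) }
     END { if (seen) print name "\t" len }' "$tmpdir/example.fasta" > "$tmpdir/lengths.tsv"

cat "$tmpdir/lengths.tsv"
```

The output is one tab-separated line per sequence: the header text, then the length with spaces, tabs and carriage returns ignored.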

These are a few useful commands for performing some common and simple FASTA file manipulations without needing to resort to programmatic methods. It may be worthwhile defining an alias or simple Bash script wrapper for the above commands, allowing you to type something like: fastaLength fastafile.fasta at the command line.
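
One way to set up such a wrapper is a shell function in your ~/.bashrc. The sketch below wraps the total-length command; the usage message and the readability check are my own additions, so adjust to taste.

```shell
# Hypothetical helper: total sequence length of a FASTA file.
# Put this in ~/.bashrc to have it available in every new shell.
fastaLength() {
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: fastaLength <inputFile>" >&2
        return 1
    fi
    # Drop header lines, strip all whitespace, count what is left
    grep -v '^>' "$1" | tr -d '[:space:]' | wc -c
}
```

After opening a new terminal (or running source ~/.bashrc), fastaLength fastafile.fasta should print the total length. On OS X the number may be padded with leading spaces, which is just how wc formats its output there.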