Data visualization is a very important part of data analysis today, and it isn’t just bells and whistles, especially given the growing amount of data being made available to us. Data mining, the skill behind data science, was named by LinkedIn as the top skill getting individuals hired in 2014, and data scientists are amongst the highest paid professionals today.
So far I’ve only mentioned data science in the context of business and marketing, but data visualization in science is just as important. Probably the best anecdote for explaining its importance is Anscombe’s quartet. You’ve probably seen a variation of this somewhere on the Internet or in a presentation, and so here it is again! The idea behind this set of four graphs is to illustrate the importance of visualising data before drawing any conclusions from it. Each graph clearly shows a different set of data points, yet the common summary statistics fail to tell the data sets apart. The mean of the $x$ values is identical ($\bar{x} = 9$), as is the mean of the $y$ values ($\bar{y} \approx 7.50$). The same goes for the variance of their respective $x$ and $y$ values. For each graph, the statistical correlation between $x$ and $y$ is 0.816. Finally, the linear regression line for each graph is identical, given by $y = 3.00 + 0.500x$.
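If you’d like to check those numbers yourself, here is a minimal sketch in plain Python. The data values are the ones from Anscombe’s published quartet, typed in by hand; the helper names (`mean`, `summarise`, `SUMMARIES`) are my own for illustration and not part of any library.

```python
from math import sqrt

# Anscombe's quartet; data sets I to III share the same x values.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96,
                 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10,
                  6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
                   6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04,
            5.25, 12.50, 5.56, 7.91, 6.89]),
}


def mean(values):
    """Arithmetic mean (float division so it also works on Python 2)."""
    return sum(values) / float(len(values))


def summarise(xs, ys):
    """Return the summary statistics discussed above for one data set."""
    n = len(xs)
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx  # least-squares regression slope
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": sxx / (n - 1),   # sample variance
        "var_y": syy / (n - 1),
        "r": sxy / sqrt(sxx * syy),  # Pearson correlation
        "slope": slope,
        "intercept": my - slope * mx,
    }


SUMMARIES = dict((name, summarise(xs, ys))
                 for name, (xs, ys) in QUARTET.items())

if __name__ == "__main__":
    for name in sorted(SUMMARIES):
        s = SUMMARIES[name]
        print("{0}: mean_y={1:.2f} r={2:.3f} y={3:.2f}+{4:.3f}x".format(
            name, s["mean_y"], s["r"], s["intercept"], s["slope"]))
```

The four summaries agree to two or three decimal places, which is exactly why you only see the difference once the points are plotted.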
Now that I’ve shown the importance of data visualisation, I’d like to share with you a simple tool I wrote in Python that may come in handy for those of you working in the field of biology. The tool itself is rather simple: you provide the script with a FASTA formatted file containing a DNA sequence, and it visualises codon repeats on a graph. The script works by comparing each codon with the successive codon, and plotting a peak or bar every time the successive codon is identical to the current one.
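The comparison itself can be sketched in a few lines. This is only an illustration of the idea, not the code from the actual script: the function names are hypothetical, and I’m assuming the sequence is read in frame from its first base, that trailing bases which don’t fill a codon are ignored, and that codons are numbered from 1.

```python
def split_codons(sequence):
    """Split a DNA string into codons, dropping any incomplete tail."""
    seq = sequence.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def find_successive_repeats(sequence):
    """Return (codon_number, codon) for each codon equal to the next one.

    codon_number is 1-based and refers to the *current* codon, i.e. the
    first codon of each identical pair (an assumption of this sketch).
    """
    codons = split_codons(sequence)
    return [
        (i + 1, codons[i])
        for i in range(len(codons) - 1)
        if codons[i] == codons[i + 1]
    ]


if __name__ == "__main__":
    # ATG CAG CAG CAG TAA: codons 2 and 3 are each followed by a copy.
    print(find_successive_repeats("ATGCAGCAGCAGTAA"))
```

Note that a run of three identical codons produces two hits, because each codon is compared only with its immediate successor.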
What’s especially cool about this tool, though, is that it uses a powerful new data visualisation tool called Plot.ly. Plot.ly allows you to create interactive charts and graphs, and host them online to share with other people. Where it outdoes Excel (pun not intended) is not only in the interactivity of the plots, but also in the ability to produce these graphs from programming languages such as Python, MATLAB, Node.js, and R. This means you could even plot data in real time from sensors on a Raspberry Pi, such as a temperature sensor, and create a live feed of the temperature in your room! In reality, however, Plot.ly is a competitor to another Python graphing package, matplotlib. Matplotlib has become one of the most widely used plotting packages for Python, and is in fact used by many scientists for producing publication-quality figures.
Successive codon repeats visualization tool
This tool works by comparing each codon in your specified DNA sequence with the successive codon. If the successive codon is identical to the current codon being analysed, the program will plot a bar / peak on the graph (that is, if $c_i = c_{i+1}$, where $c_i$ is the $i$-th codon). This peak / bar will also be accompanied by an annotation of the exact codon which is repeating at this location. If you use the matplotlib plotting package (use of this package is explained below), the annotation will also have the codon number placed after a comma. This functionality is omitted when using the Plot.ly plotting package, as the interactive nature of this package allows you to hover over the bar and see the precise number on the x-axis.
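To give a rough idea of how such an annotated bar chart could be drawn with matplotlib, here is a hedged sketch. It is not the plotting code from the script: `build_annotations` and `plot_repeats` are hypothetical names, the input is assumed to be a list of `(codon_number, codon)` pairs, and the bar height of 1 and the label rotation are arbitrary choices of mine.

```python
def build_annotations(repeats, with_number=True):
    """Build one label per repeat, e.g. 'CAG, 2' or just 'CAG'.

    with_number=True mimics the matplotlib labels described above;
    with_number=False mimics the shorter Plot.ly labels.
    """
    if with_number:
        return ["{0}, {1}".format(codon, number) for number, codon in repeats]
    return [codon for _, codon in repeats]


def plot_repeats(repeats, total_codons):
    """Draw one bar per repeat at its codon number, with a text label."""
    # Imported here so the label helper works without matplotlib installed.
    import matplotlib.pyplot as plt

    positions = [number for number, _ in repeats]
    labels = build_annotations(repeats, with_number=True)

    fig, ax = plt.subplots()
    ax.bar(positions, [1] * len(positions), width=1.0)
    for x, label in zip(positions, labels):
        # Place each label just above the top of its bar.
        ax.annotate(label, xy=(x, 1), ha="center", va="bottom", rotation=90)
    ax.set_xlim(0, total_codons + 1)
    ax.set_ylim(0, 1.5)
    ax.set_xlabel("Codon number")
    ax.set_yticks([])
    plt.show()


if __name__ == "__main__":
    example = [(2, "CAG"), (3, "CAG")]  # made-up input for illustration
    print(build_annotations(example))
    plot_repeats(example, total_codons=5)
```

Keeping the label logic separate from the drawing code means the same repeat list could feed either plotting package.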
Using the tool
- Python 2 (2.7.6 or newer)
- Python 2 is installed as default on Mac OS X and Linux, but you’ll need to install it independently on Windows.
- Matplotlib (either or)
- Plot.ly (either or)
This tool can in fact use matplotlib as well as Plot.ly, so depending on which package you prefer, you’ll need to install one of them (or both if you prefer). The simplest way to install Python packages is using pip through the terminal. Find out how to install pip for Windows, Mac or Linux here or here. Once pip is installed, you can use the command `pip install matplotlib` or `pip install plotly` in your favourite terminal / command prompt.
You can download the script for this tool from my GitHub page here.
Once you’ve installed all the dependencies, place the Python script in the same folder as your FASTA file of choice. To run the tool, open a terminal or command prompt in that folder and type `python codon_repeats.py f n p`. The letters `f`, `n` and `p` are arguments which you must fill in as follows:
| Argument | Description |
|---|---|
| f | FASTA formatted file name with extension, e.g. RPRD2.fasta |
| n | ID of the sequence in the FASTA file; the first sequence is 0, the next is 1, etc. |
| p | Plotting package: 'plotly' for Plot.ly or 'matplotlib' for matplotlib. |
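For the curious, here is a rough sketch of how three arguments like these could be handled and how the `n`-th sequence could be pulled out of a FASTA file using only the standard library. The function names are hypothetical and this is not the script’s actual code; I’m also assuming a plain FASTA layout where each record starts with a `>` header line.

```python
import sys


def parse_cli(args):
    """Turn ['file.fasta', '0', 'plotly'] into (path, index, package)."""
    if len(args) != 3:
        raise ValueError("expected three arguments: f n p")
    path, index, package = args[0], int(args[1]), args[2].lower()
    if package not in ("plotly", "matplotlib"):
        raise ValueError("p must be 'plotly' or 'matplotlib'")
    return path, index, package


def read_fasta(path):
    """Return a list of (header, sequence) tuples in file order."""
    records, header, chunks = [], None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                # A new header closes the previous record, if any.
                if header is not None:
                    records.append((header, "".join(chunks)))
                header, chunks = line[1:], []
            elif header is not None:
                chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def select_sequence(path, n):
    """Pick the n-th record, counting from 0 as in the table above."""
    records = read_fasta(path)
    if not 0 <= n < len(records):
        raise IndexError("sequence ID {0} not found in {1}".format(n, path))
    return records[n]


if __name__ == "__main__":
    fasta_path, seq_id, plot_package = parse_cli(sys.argv[1:])
    name, dna = select_sequence(fasta_path, seq_id)
    print("{0}: {1} bases, plotting with {2}".format(
        name, len(dna), plot_package))
```

Sequence lines are joined before any analysis, since FASTA files usually wrap long sequences over many lines.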
If you’re using the Plot.ly package to visualise your repeats, you’ll need to make one small modification to the code. Line 118 of `codon_repeats.py` contains the following code:
`py.sign_in('<<USERNAME>>', '<<API KEY>>')`
Replace the <<USERNAME>> and <<API KEY>> with your credentials, which you can obtain by signing up for an account at http://plot.ly/.
If you’ve done everything correctly, your web browser should open a web page with your graph displayed there, fully annotated! Here’s an example of the kind of plot you can expect:
I hope this tool can be of some use to you, and if there are any problems, drop a line in the comments section below! I’d like to leave you with a great TED talk about data visualisation, hopefully offering you some inspiration!