Problem: Many experimental biologists must edit large files by hand, or ask
programmers for help, in order to explore or manipulate their data. Some may
even give up because they don't have the tools to perform operations
that are very simple to describe.
Vision: The Scriptome will empower experimental biologists
to manipulate and explore their data, on their own, with minimal training.
Users: The primary target users are biologists with little or no
programming experience, who may be only occasional users. Biologists who
are currently learning to program may also find The Scriptome useful.
Finally, experienced programmers can use it as a "cookbook" rather than
writing their own scripts.
Scope: The Scriptome will provide tools to filter or format existing
data, rather than performing complex data analysis (like Spotfire or
Rosetta Resolver) or generating new data (like BLAST).
Biologists' questions are expected to be diverse and to change over time.
(The following is from the abstract for a poster accepted for
presentation at ISMB 2005, http://iscb.org/ismb2005, which expresses
the above ideas in greater detail.)
Motivation: Computational biology tools generate large files in a wide array of formats. Although analyzing
the data in these files or reformatting them for other tools may be trivial for programmers, it often presents a
real barrier for the majority of biologists who are not programmers. Existing approaches are demanding:
building universal graphical user interfaces requires a large investment of programming time; teaching
biologists to program requires substantial training. Furthermore, given the rate at which the underlying data
and data structure change, such efforts are not likely to abolish the need for text manipulation.
We present here the Scriptome project, designed to allow non-programmers to perform simple
computational tasks such as formatting, filtering and low-level analysis of biological data. It combines the
bioinformaticists' experience and skills with the biologists' way of thinking. Bioinformaticists create tiny,
"atomic" tools, each performing simple operations on commonly used biological data structures. Biologists
assemble "protocols" that use these tools, in an approach similar to constructing experimental protocols in
Results: We have developed a prototype of the Scriptome, released as a searchable web site containing
single-line Perl scripts that perform common tasks. To use the Scriptome, a biologist cuts and pastes these
scripts onto the UNIX command line. We chose to use the command line rather than developing a dedicated
interface, because it allows new tools to be built faster, obviates complicated installation and dependencies,
and lowers the biologists' learning barrier. A significant component of the web site is documentation
(including usage notes, sample protocols, and links to related tools), which helps biologists choose and use
the tools effectively. In addition to the documentation, we offer a training session, which is organized
around solving representative biological problems by building protocols.
A typical example involves searching two BLAST output files, each with hundreds of hits, to create a
non-redundant list of genes for further study. The problem is broken down into a protocol with these steps:
- Select top N BLAST hits from a BLAST report. (Do once for each of the two BLAST output files.)
- Combine the two gene lists, keeping only one copy of duplicated genes.
- Get the descriptions of the genes from the FASTA formatted sequence database.
The biologist now selects a Scriptome tool for each step,
pasting the Perl code onto a UNIX command line (and editing it where necessary).
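
The three-step protocol above can be sketched in code to show how small, single-purpose tools chain together. The following is only an illustrative Python sketch, not the Scriptome's own Perl one-liners. It assumes tab-delimited BLAST output in which the second column holds the hit ID and hits are already listed best-first, and FASTA headers of the form ">ID description". The function names (top_hits, merge_unique, fasta_descriptions) and the sample data are hypothetical.

```python
def top_hits(blast_lines, n):
    """Step 1: return the hit IDs from the first n data lines of a BLAST table."""
    ids = []
    for line in blast_lines:
        if len(ids) >= n:
            break
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.rstrip("\n").split("\t")
        ids.append(fields[1])  # assumption: column 2 holds the hit ID
    return ids


def merge_unique(*gene_lists):
    """Step 2: combine gene lists, keeping the first copy of each gene."""
    seen = set()
    merged = []
    for genes in gene_lists:
        for gene in genes:
            if gene not in seen:
                seen.add(gene)
                merged.append(gene)
    return merged


def fasta_descriptions(fasta_lines, wanted):
    """Step 3: map each wanted ID to the description on its FASTA header line."""
    wanted = set(wanted)
    found = {}
    for line in fasta_lines:
        if line.startswith(">"):
            # assumption: header looks like ">ID free-text description"
            seq_id, _, description = line[1:].strip().partition(" ")
            if seq_id in wanted:
                found[seq_id] = description
    return found


# Made-up sample data standing in for two BLAST reports and a FASTA database.
blast_a = ["q1\tgeneA\t98.0\n", "q1\tgeneB\t95.0\n", "q1\tgeneC\t90.0\n"]
blast_b = ["q2\tgeneB\t99.0\n", "q2\tgeneD\t97.0\n", "q2\tgeneE\t80.0\n"]
fasta = [
    ">geneA kinase\n", "ATG\n",
    ">geneB transporter\n", "CCC\n",
    ">geneD hypothetical protein\n", "GGG\n",
]

genes = merge_unique(top_hits(blast_a, 2), top_hits(blast_b, 2))
descriptions = fasta_descriptions(fasta, genes)
for gene in genes:
    print(gene, descriptions.get(gene, ""), sep="\t")
```

Each function corresponds to one step of the protocol, so a step can be swapped or reused in a different protocol without touching the others, which is the idea behind the "atomic" tools described in this abstract.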
Conclusions: The Scriptome offers an innovative approach to bridging the gap between the ways bioinformaticists
and biologists manipulate data. It allows biologists, relying on their existing skills and with minimal training, to
solve, on their own, data formatting and analysis problems that might otherwise be prohibitively difficult.
Amir Karger, Christopher Botka, and Eitan Rubin
Bauer Center for Genomics Research