Sciencescape is building an online map of science to organize all published research in real time. Through data visualizations and social functionality, Sciencescape aims to make it quicker and easier for users to filter through millions of scientific papers, find the most important work done by any author, team or institution, and keep up with research.
MaRS Market Intelligence spoke with Sam Molyneux, Co-founder of Sciencescape.
What problem is Sciencescape trying to solve?
My background is in cancer genomics and I work with a lab at the Ontario Cancer Institute at Princess Margaret Hospital in Toronto. When you’re involved in research, one of the key challenges is to understand the field you are in: who the players are, what they are working on, what questions have been answered and what questions remain open. This looks deceptively simple, so most people turn to a search engine like PubMed or Google Scholar.
And what is wrong with the search engines?
Say I work on breast cancer and enter this term into a search engine. What I will find are thousands and thousands (sometimes millions) of results. From there, I would typically print out a hundred pages of search results and scan the titles. It sounds crazy, but this is what people do. When I started my PhD, I began asking myself, “Is there not a better way?”
Search engines work well in the consumer space because, if you are looking for a restaurant, for example, you need the one result that best matches your query: the needle in the haystack. You’re not interested in the entire field of restaurants. In science, by contrast, you need to understand the whole haystack and identify key works with very little prior knowledge. You need a discovery engine.
Just how much data related to these publications is there?
Right now we’re focused on biomedicine, a domain with 22 million papers in existence and 2,000 to 4,000 more being added every day. Those 22 million papers represent about six million authors and 25,000 journals. Beyond this, there is the quantitative data associated with each publication, such as citation counts.
Sciencescape’s site is also collecting its own data, namely how many people look at an abstract and interact with the publication. We plot this data over time and geography so that people can see what is being read right now, what was popular last week or even last month, and where.
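To make the idea of "what is being read right now, last week or last month" concrete, here is a minimal sketch of counting abstract views inside a time window. It is not Sciencescape's implementation: the function name `popular_papers`, the `(paper_id, timestamp)` event format, the reference date and the sample events are all assumptions made up for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta


def popular_papers(events, now, window, top_n=3):
    """Return up to top_n (paper_id, views) pairs for views within `window` before `now`."""
    cutoff = now - window
    counts = Counter(paper_id for paper_id, ts in events if cutoff <= ts <= now)
    return counts.most_common(top_n)


# Hypothetical view events: (paper_id, time the abstract was viewed).
now = datetime(2013, 1, 31, 12, 0)  # arbitrary reference time for the example
events = [
    ("paper-A", now - timedelta(hours=1)),
    ("paper-A", now - timedelta(days=3)),
    ("paper-B", now - timedelta(days=2)),
    ("paper-B", now - timedelta(days=2, hours=1)),
    ("paper-B", now - timedelta(days=5)),
    ("paper-B", now - timedelta(days=20)),
    ("paper-C", now - timedelta(days=25)),
]

print("last day:  ", popular_papers(events, now, timedelta(days=1)))
print("last week: ", popular_papers(events, now, timedelta(days=7)))
print("last month:", popular_papers(events, now, timedelta(days=30)))
```

The same events give different rankings depending on the window, which is the effect the interview describes. A geographic breakdown would follow the same pattern with a location field added to each event.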
Have you encountered any challenges with getting this data?
The only challenge has been with citation counts. As it stands today, we would have to pay a large amount to access the full citation data set, as only 10% of it is made available as open-access data. This is a big area of controversy in the field of research. However, through our own analysis, we have found that the open-access citation set is nearly perfectly correlated with the full citation set. So the relative ranking of papers on our site is highly accurate.
Sciencescape is probably not the best tool for users to find all of the journal articles that have cited their article, since one is currently limited to the open-access citation count data. The tool, however, is really useful for learning how your paper ranks in your field throughout history, or for finding the most important work done by an author or research team.
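The claim that a partial citation set preserves the relative standing of papers can be checked with a rank correlation. The interview does not say which statistic Sciencescape used, so the sketch below assumes Spearman's rank correlation, and every citation count in it is invented for illustration rather than taken from real data.

```python
import math


def average_ranks(values):
    """Rank values from 1..n (ascending), giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average 1-based rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical counts for five papers: full set vs. a roughly 10% open subset.
full_counts = [1000, 500, 200, 80, 10]
open_same_order = [104, 47, 22, 9, 1]   # subset keeps the same ordering
open_one_swap = [104, 47, 9, 22, 1]     # subset swaps the 3rd and 4th papers

print(spearman(full_counts, open_same_order))
print(spearman(full_counts, open_one_swap))
```

A value of 1 means the subset ranks the papers exactly as the full set does, and the value falls below 1 as orderings diverge. On real data the same check would be run over all papers that appear in both sets.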
How did you decide which visualizations to use?
If you look at what other people have done with citation data, you’ll see some incredibly complex things. One example is a clickstream map of science, which visualizes clickthrough data for millions of papers. It is a vast network, which looks like a giant hairball! You may be able to pull out large macro patterns that are interesting, but as a daily-use tool, there are few simple questions that you can answer with it.
So, instead of getting more complex and fancier, we went as simple as possible. We are focused on the visualization of citations over time and over space; nothing more. But the power of those visualizations is immense.
What challenges have you encountered as a startup?
Scaling up the site has been a key challenge for us. When we first launched our beta site, we had 30,000 pages. Our beta-test users kept telling us that in order for the site to be relevant to their work, it had to hold entire data sets. So we scaled from 30,000 pages to 45 million pages. Re-indexing the database took us two weeks! We’ve since started using algorithms that can re-index the entire site in just a couple of days or less. These are the same algorithms used by Craigslist and other very large sites.
As a reference point, Google estimates the entire indexable web to be 45 billion pages. At 45 million pages, Sciencescape will account for about 0.1% of it at launch.
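The 0.1% figure follows directly from the two page counts quoted in the interview, 45 million and 45 billion. A short check, where the helper name `share_of_web` is hypothetical:

```python
def share_of_web(site_pages, indexable_pages):
    """Fraction of the indexable web that a site's pages represent."""
    return site_pages / indexable_pages


sciencescape_pages = 45_000_000        # 45 million pages, as quoted above
indexable_web_pages = 45_000_000_000   # 45 billion pages, as quoted above

share = share_of_web(sciencescape_pages, indexable_web_pages)
print(f"{share:.1%}")  # prints 0.1%
```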
What impact are you hoping Sciencescape will make globally on the science industry?
We believe Sciencescape can accelerate the work scientists are doing by enabling them to see what work is being done in their field, what work has been done and what is happening on the leading edge. If scientists can find relevant and important information more quickly, they can design more novel experiments and make the next big advance.
We also see our technology expanding into other industries. Examples include the legal and technology markets, where Sciencescape could help organize all the patent literature in existence today.
And finally, what are some of your favourite visualizations?
I’m fairly partial to genomics visualizations, such as those created by Circos. I find that the chromosome-based heat maps and clusters, which integrate diverse datasets, enable the discovery of patterns and connections.