White Paper

An Indexing Approach To ChemInformatics

Indexing is a well-understood and broadly applied approach to data integration. Google search is probably the best-known example. Independent of any underlying data structure, indexing services like Google “ingest” text and make the contents available for extremely fast search, with links to the underlying source systems. Scoring systems and relevancy metrics make indexing an ever more powerful tool for retrieving the most relevant search results within seconds.

The use of modern NoSQL-type technologies has been largely driven by the desire for one key characteristic: horizontally scalable indexing. Because search and retrieval can be spread across an essentially unlimited number of worker nodes, it has become possible to build systems at a scale that was previously not even contemplated. This is an essential capability for cloud-scale informatics tools.
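As a rough illustration of what spreading search across worker nodes means, the following minimal sketch queries several index shards in parallel and merges their partial results. It assumes each shard is just an in-memory dictionary and that a document's score is a simple term count; the names `search_shard` and `scatter_gather` are hypothetical and do not correspond to any particular product.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, term):
    """Score the documents of one shard; the score is the term frequency (an assumption)."""
    hits = []
    for doc_id, text in shard.items():
        count = text.lower().split().count(term)
        if count > 0:
            hits.append((count, doc_id))
    return hits


def scatter_gather(shards, term, k=3):
    """Send the query to every shard in parallel, then merge the top-k hits."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = pool.map(lambda shard: search_shard(shard, term), shards)
        merged = [hit for part in partials for hit in part]
    # Each hit is (score, doc_id); the highest scores come first.
    return heapq.nlargest(k, merged)
```

In this sketch, adding capacity means adding shards: each worker only searches its own slice of the data, and only the small merged hit list has to be handled centrally.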

Licensing models for most indexing systems are also attractive and aimed at cloud-scale computing. Many indexing systems are open source, and even where they are licensed, license fees for cloud deployments across hundreds of compute nodes tend to be much cheaper and more flexible than those of their relational database counterparts. While this is a guideline more than a rule, it certainly holds in comparison with Oracle platforms.

However, indexing technologies run afoul of certain fundamental challenges. Take this search task for example: “Find all X where X is joined to Y and attributes of X are in a certain range and attributes of Y are in a certain range”. While a relational database would (generally) be able to return that result quickly and precisely, indexing technologies often cannot. Even in this simple example, most indexing systems underperform or even fail, either because they cannot handle the quantitative precision of the attribute search or because the “joining” is cumbersome to implement and substantially erodes the benefits of the entire approach in terms of both scalability and effort.
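For reference, the search task above is straightforward to express relationally. The sketch below uses Python's built-in `sqlite3` module with two hypothetical tables, `x` and `y`, each carrying a single numeric attribute `attr`; the table and column names are assumptions made only for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a small in-memory database with hypothetical tables x and y."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE x (id INTEGER PRIMARY KEY, attr REAL);
        CREATE TABLE y (id INTEGER PRIMARY KEY,
                        x_id INTEGER REFERENCES x(id),
                        attr REAL);
    """)
    conn.executemany("INSERT INTO x VALUES (?, ?)",
                     [(1, 5.0), (2, 50.0), (3, 7.5)])
    conn.executemany("INSERT INTO y VALUES (?, ?, ?)",
                     [(10, 1, 0.2), (11, 2, 0.3), (12, 3, 9.0), (13, 3, 0.1)])
    return conn


def find_joined_in_range(conn, x_range, y_range):
    """Return ids of X rows joined to at least one Y row, with both attributes in range."""
    query = """
        SELECT DISTINCT x.id
        FROM x
        JOIN y ON y.x_id = x.id
        WHERE x.attr BETWEEN ? AND ?
          AND y.attr BETWEEN ? AND ?
        ORDER BY x.id
    """
    rows = conn.execute(query, (*x_range, *y_range)).fetchall()
    return [row[0] for row in rows]
```

With the demo data, `find_joined_in_range(build_demo_db(), (0, 10), (0, 1))` returns `[1, 3]`: row 2 fails the range on X, and row 3 qualifies through only one of its two joined Y rows.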

Another fundamental challenge has to do with extensibility of search. In mature relational database platforms, APIs exist to embed new object classes within the RDBMS engine. 3D spatial searching is the best out-of-the-box example of object extensibility in both the Oracle and Microsoft SQL Server platforms. In our industry, the best example is the capability to embed chemical structure and sequence searching within the Oracle RDBMS using the Oracle cartridge API. Until now, this kind of object extensibility has not been available within indexing systems to the same degree. As companies seek to transition to cloud computing, the gap becomes more important, because it is typically not cost effective, or even technically possible, to move these scientific search capabilities directly into the cloud.

Bringing Index-Based Search to Scientific Computing Applications
PerkinElmer Signals™ Lead Discovery represents a new kind of solution that brings the benefits of index-based search to scientific computing applications while at the same time maintaining the benefits of hyper-scalability and ease of implementation.

There are three highly innovative aspects to PerkinElmer Signals™ Lead Discovery that make this possible:

  1. Capture search constraints and search attributes: PerkinElmer Informatics has developed a patent-pending algorithm to enable chemical structure search within an Apache Lucene-based indexing system (Lucene is the underlying technology for most indexing frameworks). This means that within a single query, it is possible to capture chemical-structure search constraints along with other search attributes. The net result is extraordinarily fast (near-real-time) structure search ideally suited for cloud computing applications.
  2. Suitable for chemical and biological activity data: PerkinElmer Signals™ Lead Discovery is designed specifically with structure-activity analysis as its main purpose. In addition to its unique structure search capabilities, PerkinElmer Signals™ Lead Discovery includes capabilities to shape and annotate biological activity data into a hyper-scalable structure that enables precise quantitative search seamlessly integrated with structure search. For example, it is possible to execute queries like the following with extreme speed: “retrieve all assay results for compounds containing a certain substructure where the activity in one of the assays is less than 15 nM.”
  3. No new programming skills needed: The architecture of the underlying REST APIs supports a SQL interface and complex joining operations. This enables rapid, agile application development, such as drill-down capabilities to details of underlying data. Programming teams do not have to learn an entirely new query syntax, nor do they have to anticipate all the joining operations required by their new system ahead of time, which is exactly the kind of trouble “schema after” indexing systems are designed to avoid in the first place.
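To make the combined query in item 2 concrete, here is a toy sketch of a single query that applies a structure constraint and a quantitative activity constraint together. It is not the patent-pending algorithm and does not perform real substructure matching: compounds are represented by precomputed "fragment keys", a hypothetical stand-in, and the class `ToySarIndex` and its methods are invented purely for illustration.

```python
from collections import defaultdict


class ToySarIndex:
    """Toy in-memory index over fragment keys and assay results (illustrative only)."""

    def __init__(self):
        # fragment key -> set of compound ids (an inverted index)
        self.fragment_postings = defaultdict(set)
        # compound id -> list of (assay name, activity in nM)
        self.results_by_compound = defaultdict(list)

    def add_compound(self, compound_id, fragment_keys):
        for key in fragment_keys:
            self.fragment_postings[key].add(compound_id)

    def add_result(self, compound_id, assay, activity_nm):
        self.results_by_compound[compound_id].append((assay, activity_nm))

    def query(self, fragment_key, max_activity_nm):
        """Return all assay results for compounds that contain the fragment
        and have activity below max_activity_nm in at least one assay."""
        hits = []
        for cid in sorted(self.fragment_postings.get(fragment_key, ())):
            results = self.results_by_compound.get(cid, [])
            if any(activity < max_activity_nm for _, activity in results):
                hits.extend((cid, assay, activity) for assay, activity in results)
        return hits


def build_demo_index():
    """Populate the toy index with made-up compounds and activities."""
    index = ToySarIndex()
    index.add_compound("C1", {"benzene", "amide"})
    index.add_compound("C2", {"benzene"})
    index.add_compound("C3", {"pyridine"})
    index.add_result("C1", "kinaseA", 12.0)
    index.add_result("C1", "kinaseB", 300.0)
    index.add_result("C2", "kinaseA", 40.0)
    index.add_result("C3", "kinaseA", 5.0)
    return index
```

With this made-up data, `build_demo_index().query("benzene", 15.0)` returns both assay results for C1 and nothing for C2 or C3: C2 contains the fragment but is not active enough, and C3 is active but lacks the fragment.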

These factors, taken together, make PerkinElmer Signals™ Lead Discovery the first true SAR application to provide the full benefits of cloud-scale index-based search to scientific computing applications.

Additional Background
PerkinElmer is committed to the notion that technical partnerships are essential in this era as it is impossible to develop all required capabilities on one’s own. As such, we are committed to both commercial strategic partnerships and the inclusion of open source technology throughout our platform. PerkinElmer Signals™ Lead Discovery is itself based on technology from two strategic partnerships: TIBCO Spotfire® and the Attivio® Active Intelligence Engine.

The primary graphical interface of PerkinElmer Signals™ Lead Discovery is TIBCO Spotfire®. PerkinElmer Signals™ Lead Discovery supplies the search backend that Spotfire lacks, marrying cloud-scale search to Spotfire’s inherent strength: allowing scientists to find insights quickly. TIBCO Spotfire customers will be able to easily integrate PerkinElmer Signals™ Lead Discovery into their IT landscape using the system’s REST APIs.

PerkinElmer Signals™ Lead Discovery provides an extraordinarily rich data integration backend by virtue of the Attivio Active Intelligence Engine (AIE). Attivio’s founders were the original developers of FAST, the industry-leading search platform. Microsoft purchased FAST to make Bing as performant and scalable as Google. The FAST founders moved on to solve the inherent challenges of Lucene-based search: numeric search and joining. The product of this work is the Attivio Active Intelligence Engine, which merges in a single system the best attributes of Lucene and graph-based search. From a system architecture standpoint, AIE is the only platform supporting the full capabilities of Lucene search along with detailed data modeling support and content analytics.

In many respects “standing on the shoulders of giants”, PerkinElmer Signals™ Lead Discovery is a modern application for today’s new science.