The ChEMBL database is one of a number of public databases that contain bioactivity data on small molecule compounds curated from diverse sources. Incoming compounds are typically not standardised according to consistent rules. In order to maintain the quality of the final database and to easily compare and integrate data on the same compound from different sources, it is necessary for the chemical structures in the database to be appropriately standardised.
Results: A chemical curation pipeline has been developed using the open source toolkit RDKit. It comprises three components: a Checker to test the validity of chemical structures and flag any serious errors; a Standardizer, which formats compounds according to defined rules and conventions; and a GetParent component that removes any salts and solvents from the compound to create its parent. This pipeline has been applied to the latest version of the ChEMBL database as well as uncurated datasets from other sources to test the robustness of the process and to identify common issues in database molecular structures.
Conclusion: All the components of the structure pipeline have been made freely available for other researchers to use and adapt. The code is available in a GitHub repository and it can also be accessed via the ChEMBL Beaker webservices. The pipeline has been used successfully to standardise the nearly 2 million compounds in the ChEMBL database, and the compound validity checker has been used to identify compounds with the most serious issues so that they can be prioritised for manual curation.
The ChEMBL database is a freely available bioactivity database containing close to 2.5 million compound records on nearly 2 million unique chemical structures. The compound structures and associated bioactivity data are extracted on a regular basis, primarily from the medicinal chemistry literature. A growing number of researchers are also depositing experimental data directly in order to make these available in the public domain. Furthermore, ChEMBL contains a set of manually curated marketed drugs and clinical candidates as well as selected bioactivity data from other public databases such as BindingDB and PubChem. Bioactivity data on the same compound from all ChEMBL sources (scientific articles, deposited datasets and curated drug sources) are aggregated according to chemical structure. Compounds may be physically tested in bioassays as the so-called parent molecule or as one of a number of different salt forms. Scientists commonly wish to aggregate the data on these different forms on the basis of the common underlying parent structure, and so it is necessary to link these various forms of the “same” underlying parent molecule. In order to facilitate the use of the database, a key objective of the ChEMBL compound curation process is to standardise the chemical structures stored in the database and to assign a unique identifier to each distinct chemical structure regardless of the source. It is worth noting that there are over 5000 unique compounds in the ChEMBL database with data from ten or more different sources, and four compounds (doxorubicin, ciprofloxacin, chloroquine and paclitaxel) each with data from over 1000 sources. For each ChEMBL release, more than 50,000 new structures are added to the database, which makes manual curation and standardisation of the chemical structures impracticable. Hence an automated procedure is required.
The chemical structures submitted to the ChEMBL database are generally received as molfiles but can also be in SMILES format. There is no universal standard for these formats and the challenges of converting between chemical structure formats are well documented. Even the simple process of loading molecules into and out of different cheminformatics packages can subtly change the structure, particularly if it was not well drawn in the first place. Chemical structures from the primary scientific literature are mostly manually drawn from the structural information in the papers prior to loading into ChEMBL. These structures are often represented in the publication as Markush structures with different R-groups shown in SAR (structure–activity relationship) tables. Compounds may also be shown with charges on acidic or basic groups, to indicate the form in which they are likely to interact with amino acid residues in a binding pocket. In some articles the compounds synthesised and reported are isotopically labelled. The ChEMBL compound curation procedure therefore needs to process molecules represented in all these ways (and more) to determine which compounds are the same. Examples of these situations can be found in recent articles where data was extracted for the ChEMBL database. A standardised V2000 molfile was chosen as the primary chemical structure representation in the database.
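To make the idea of aggregating data under a common parent structure more concrete, the following is a minimal sketch in plain Python. It is not the ChEMBL pipeline: the real GetParent component works on molfiles with RDKit and its own salt and solvent definitions, whereas this toy splits dot-separated SMILES strings and compares them literally, with no canonicalisation or charge handling. The names `COUNTER_IONS`, `naive_parent` and `aggregate_by_parent`, the short counter-ion list and the example records are all assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical, deliberately tiny list of counter-ion/solvent fragments (as SMILES).
# The real GetParent component relies on its own curated definitions.
COUNTER_IONS = {"Cl", "[Cl-]", "Br", "[Na+]", "[K+]", "O"}


def naive_parent(smiles: str) -> str:
    """Drop listed counter-ion/solvent fragments from a dot-separated SMILES.

    String-level toy only: no canonicalisation, no charge neutralisation.
    """
    fragments = smiles.split(".")
    kept = [frag for frag in fragments if frag not in COUNTER_IONS]
    # If every fragment is a listed counter-ion (e.g. NaCl), return the input unchanged.
    return ".".join(kept) if kept else smiles


def aggregate_by_parent(records):
    """Group (source, smiles, value) records under their naive parent string."""
    grouped = defaultdict(list)
    for source, smiles, value in records:
        grouped[naive_parent(smiles)].append((source, value))
    return dict(grouped)


# Made-up records: the same amine reported as free base and as hydrochloride.
records = [
    ("article_1", "CCN", 6.1),
    ("deposited_set_2", "CCN.Cl", 6.3),
    ("article_3", "c1ccccc1O", 4.9),
]
by_parent = aggregate_by_parent(records)
```

In this sketch the free base and its hydrochloride salt end up under the same key, so their measurements can be read side by side. The approach only works when the remaining fragment is written identically in every record, which is exactly the kind of inconsistency that a standardisation step before parent generation is meant to remove.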