Showing posts with label open access. Show all posts

Sunday, 5 June 2022

Bringing AI out of the black box

Love it or hate it, artificial intelligence* (AI) is becoming embedded into all aspects of life. From self-driving cars to analysing legal cases, deciding which social media posts you see or making healthcare more personalised, the applications of AI are endless.

AI has been used for many years in research, but as its practical applications grow, researchers have a responsibility to ensure that their methods are transparent and can be scrutinised.

“In the fields of genomics and bioinformatics, publishing the code and the data associated with a study is commonplace, but other life science disciplines lag behind,” says Alvis Brazma, Functional Genomics Senior Team Leader at EMBL’s European Bioinformatics Institute (EMBL-EBI). “This can be problematic, especially in the case of research focusing on human health and disease. Publishing the details and code of the methods used is important not just for scientific value, but also to ensure the method can be tested and any problems identified.”

AI-powered clinical research

The past few years have seen a shift in healthcare research towards AI and deep learning methods. The applications include personalised medicine, identifying drug targets, accelerating clinical testing, and making predictions about the risk of developing a certain disease, or about its severity or outcome.

“Cancer is one of the many diseases that artificial intelligence and deep learning can shed light on,” says Moritz Gerstung, Research Group Leader at EMBL-EBI. “AI can help process enormous amounts of data faster than ever before, which can refine diagnosis, prognosis, and treatment. There’s significant potential for such applications, but the algorithms and underlying data have to be as transparent as possible. This is necessary to fully scrutinise the performance of AI algorithms and their implications, in collaboration with clinical researchers and also patient advocacy groups.” 

AI also has the potential to improve clinical trial design, reducing the time and cost involved in research and development. In some cases, AI is already being used in hospitals, to complement the work of healthcare professionals such as radiologists or histopathologists. 

Code, results, and transparency

While many AI-powered studies produce fascinating or encouraging results, a lack of detail regarding the methods and algorithms used is a common problem. Not being able to test the algorithm undermines the scientific value of the research. Scientific progress relies on the ability of independent researchers to scrutinise and reproduce the results of a study, and to build on those results. Without access to the code or the data, this becomes impossible. More worryingly, this kind of opacity could lead to unfounded and potentially harmful clinical trials.

When the method or the code underlying a study is not well documented, the study itself is difficult to validate. Textual descriptions of AI or deep learning models are not enough. By providing open access to the actual computer code used to train a model and arrive at its final set of parameters, researchers are enabling others to reuse the model. They’re allowing peers to test it – sometimes even to break it – to show its worth.

Testing plays a huge part in the development of any new technology, from mobile phones to cars, so why would research compromise on this essential step when it comes to AI?

How to open your code

There are many platforms where researchers can share their code, including GitHub, GitLab, and Bitbucket. In addition, they can use package managers, which are software tools that automate the installation and configuration of computer programs on a machine’s operating system. Researchers can control the software environment with the package manager Conda, or with container and virtualisation systems such as Code Ocean or Docker. This control is essential for large-scale machine learning applications.
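
As a rough illustration of what "controlling the software environment" means in practice, the sketch below records the interpreter version, platform, and installed package versions so they can be published alongside the code. It is a minimal sketch using only the Python standard library; the function name `snapshot_environment` is hypothetical, and it is not a substitute for a Conda environment file or a Docker image.

```python
import json
import platform
import sys
from importlib import metadata


def snapshot_environment():
    """Collect the interpreter version, platform, and installed package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.metadata.get("Version", "unknown")
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


# Print the snapshot as JSON so it can be saved next to the analysis code.
print(json.dumps(snapshot_environment(), indent=2))
```

A snapshot like this only documents the environment; recreating it still relies on a tool such as Conda or Docker.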

Platforms such as TensorFlow Hub, ModelHub, or ModelDepot allow sharing of deep learning models. Using these resources improves transparency and can speed up model development, validation, and clinical implementation.

Another common challenge for researchers is that data, especially human data, can’t always be shared, due to privacy concerns. The restrictions and protocols in place ensure safe and secure sharing of sensitive data, but they can also be problematic for research reproducibility. Despite these challenges, sharing raw data has become more common in the biomedical literature, growing from 1% in the early 2000s to 20% in 2018.

When data can’t be shared, one solution is for authors to create small artificial examples or use public datasets to show how new data should be processed to train the model and to generate predictions.
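
A small artificial example of this kind might look like the sketch below. Everything in it is made up for illustration: the columns (age, a biomarker level, a binary outcome), the rule that generates outcomes, and the simple logistic regression standing in for a real model. The point is only to show the full path from raw records to preprocessing, training, and prediction without touching restricted data.

```python
import math
import random


def make_synthetic_patients(n, seed=0):
    """Generate artificial records that mimic the shape of restricted data.

    Columns (hypothetical): age in years, a biomarker level, a binary outcome.
    No real patient data is involved.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        age = rng.uniform(30, 80)
        marker = rng.gauss(1.0, 0.5)
        # Made-up rule: outcome probability rises with age and marker level.
        p = 1 / (1 + math.exp(-(0.05 * (age - 55) + 1.5 * (marker - 1.0))))
        rows.append((age, marker, 1 if rng.random() < p else 0))
    return rows


def preprocess(rows):
    """Centre and scale features; real data would need the same treatment."""
    return [((age - 55) / 25, marker - 1.0, y) for age, marker, y in rows]


def predict(params, x1, x2):
    """Predicted outcome probability for one preprocessed record."""
    w1, w2, b = params
    return 1 / (1 + math.exp(-(w1 * x1 + w2 * x2 + b)))


def train_logistic(data, epochs=200, lr=0.1):
    """Fit a two-feature logistic regression by batch gradient descent."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for x1, x2, y in data:
            err = predict((w1, w2, b), x1, x2) - y
            g1 += err * x1
            g2 += err * x2
            gb += err
        w1 -= lr * g1 / n
        w2 -= lr * g2 / n
        b -= lr * gb / n
    return w1, w2, b


data = preprocess(make_synthetic_patients(500, seed=42))
params = train_logistic(data)
print("fitted parameters:", params)
print("example prediction:", predict(params, 0.5, 0.2))
```

Because the random seed is fixed, anyone running the script gets the same synthetic records and the same fitted parameters, which is the property a reader needs to check a pipeline end to end.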

“The data and the code behind a publication are almost as important as the results, so sharing them is crucial,” says Jo McEntyre, Associate Director of EMBL-EBI Services. “They put science in context, demonstrate rigour, and allow others to build on the hard work the authors have already done.”

In the majority of cases, there are ways of improving the transparency of AI models. While this does require additional effort – and often creative thinking on the part of the authors – it’s crucial if they want their method to have impact beyond the publication. AI-powered research needs to be reproducible if it’s going to be truly useful in healthcare or in other aspects of life.


*This article uses AI as an umbrella term for a suite of technologies, including algorithms, deep learning, and neural networks.

Friday, 3 June 2022

Europe PMC: Harnessing the power of text mining to accelerate life sciences research

 How text mining collaborations benefit our research, data resources, and the wider scientific community

Text mining is the process of analysing vast amounts of textual material to extract meaningful concepts, relationships, and trends using machine learning approaches. It enables researchers to rapidly find new and hidden information in text-based sources. When these techniques are applied to scientific publications, it becomes possible to uncover new meaning and hidden patterns that would otherwise take years to manually curate. 

Tackling data challenges and ensuring that we are able to exploit large datasets to their full potential for life science research is a key part of the Data Sciences Plans within EMBL’s Molecules to Ecosystems Programme. This includes developing and experimenting with new technologies and machine learning approaches. For example, these methods are used in a variety of projects to extract new information from publications. This includes mining and extraction of gene–disease associations for drug discovery, enriching our services with metagenomics data, and providing information to the wider text mining community to help others train their own machine learning algorithms. 

What is Europe PMC?

Europe PMC is EMBL-EBI’s open science platform for life science publications. It’s available to anyone, anywhere for free. With Europe PMC, scientists can search and read over 40 million publications, preprints, and other documents enriched with links to supporting data, protocols, etc.

Mining for gene–disease associations

Text mining approaches are hugely beneficial for improving the way we identify novel drug targets. A vast amount of information on gene–disease associations and associated drug targets already exists online, hidden within millions of scientific publications. Manually sorting through these texts would take decades. However, using text mining to search the literature allows data to be accessed and analysed for more rapid drug discovery. 

In collaboration with Open Targets, researchers at Europe PMC are doing just this by creating a pipeline that maximises information extraction from the literature using named entity recognition (NER) models. NER is a widely used natural language processing approach for identifying real-world objects, such as people, locations, and times, within text. The Europe PMC team uses this approach to identify genes, proteins, diseases, chemicals, and other biomedical concepts in life science literature. These bioNERs form the basis for identifying gene–disease associations from the literature for Open Targets.

What are NER models?

NER models are a form of natural language processing (NLP) – a type of machine learning method which allows computers to analyse text rather than computer code. In this case, the natural language being detected consists of disease and gene terms found within life science literature.


“For our machine learning algorithms to work effectively we needed to train them with high-quality data,” said Shyamasree Saha, Machine Learning and Text Mining Scientist at EMBL-EBI. “At Europe PMC, we developed a gold standard dataset for genes, proteins, diseases, and organisms. We are using BioBERT, a domain-specific language model pre-trained on large biomedical corpora, and fine-tuning the model for the NER task using our gold standard dataset. The model replaces our old dictionary-based NER approach and significantly improves entity association identification accuracy.”
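
To make the idea of entity tagging concrete, here is a toy version of the older dictionary-based style of NER mentioned in the quote above. It is a minimal sketch, not Europe PMC's implementation and not BioBERT: the four-term dictionary, the labels, and the function names are all illustrative assumptions.

```python
import re

# Tiny illustrative dictionary; a real one would hold many thousands of terms.
DICTIONARY = {
    "BRCA1": "gene",
    "TP53": "gene",
    "breast cancer": "disease",
    "cystic fibrosis": "disease",
}


def build_pattern(dictionary):
    """Compile one regex that matches any dictionary term as a whole word."""
    # Longest terms first, so longer terms win when terms overlap.
    terms = sorted(dictionary, key=len, reverse=True)
    alternatives = "|".join(re.escape(term) for term in terms)
    return re.compile(r"\b(" + alternatives + r")\b", re.IGNORECASE)


def tag_entities(text, dictionary=DICTIONARY):
    """Return (start, end, matched_text, entity_type) for each dictionary hit."""
    lookup = {term.lower(): label for term, label in dictionary.items()}
    pattern = build_pattern(dictionary)
    return [
        (m.start(), m.end(), m.group(0), lookup[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


sentence = "Mutations in BRCA1 are associated with breast cancer."
for hit in tag_entities(sentence):
    print(hit)
```

A lookup like this only finds terms that appear in the dictionary exactly as written, so it misses spelling variants and cannot use context to resolve ambiguous names. Those are the kinds of gaps a trained language model is meant to close.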

Learn more about how NER is being used to develop the Open Targets Platform.

Generating metadata descriptions

Metadata – the information that describes where, when, and how specific data are obtained – enriches the scientific value of genomic sequencing data and makes data FAIR (Findable, Accessible, Interoperable, and Reusable). However, these metadata are frequently missing from databases or contain poor-quality descriptions, meaning they cannot be used to interpret the data. For metagenomics – the direct analysis of genomes contained within an environmental sample – metadata are vital for increasing data reuse and improving interpretation.

Researchers from Europe PMC and EMBL-EBI’s metagenomics data resource MGnify have found a solution to this challenge by automatically extracting relevant metadata key terms straight from the literature. This is done using a machine learning framework to mine a wide range of metagenomics studies found in publications stored within the Europe PMC database. The project is called Enriching MEtagenomics Results using Artificial intelligence and Literature Data (EMERALD).

“One of the major limitations when comparing datasets is the lack of contextual metadata relating to a sample,” said Lorna Richardson, Coordinator for MGnify at EMBL-EBI. “To address this, we partnered with Europe PMC to automatically extract relevant metadata terms from publications, improving the range and depth of metadata available to our users. This metadata includes terms relating to the sequencing platform used, extraction kits, primers, the environment of the sample, and much more, which will help researchers get the most out of the data stored in MGnify.”

Find out more about how the EMERALD project is benefiting MGnify users.

Annotations for the text mining community

Finally, the Europe PMC database itself is helping to advance the field of text mining by simplifying the way its users can find and access data from scientific literature. One of the tools available within Europe PMC is the annotation tool. This allows users developing their own text mining algorithms to quickly extract relevant terms and use them to develop their own text mining pipelines.

The annotations within this tool are collected by both Europe PMC and the wider text mining community, and they include biological terms such as disease names, chemicals, and proteins. The annotation terms available for each article are located in the tools menu within Europe PMC and can also be accessed programmatically using the annotations API.
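
Programmatic access could look roughly like the sketch below. The base URL and the query parameter names (`articleIds`, `type`, `format`) are assumptions about the Europe PMC Annotations API and should be checked against its official documentation. The article identifier is a placeholder in `SOURCE:ID` form, and the helper names are hypothetical. Only the URL builder runs here; `fetch_annotations` needs network access and is left for the reader to call.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed endpoint for the Europe PMC Annotations API; confirm in the official docs.
BASE_URL = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"


def build_annotations_url(article_ids, annotation_type=None, fmt="JSON"):
    """Build a request URL for the annotations of one or more articles."""
    params = {"articleIds": ",".join(article_ids), "format": fmt}
    if annotation_type:
        # Assumed filter, e.g. "Diseases"; omit it to request every type.
        params["type"] = annotation_type
    return BASE_URL + "?" + urlencode(params)


def fetch_annotations(article_ids, annotation_type=None, timeout=30):
    """Download and decode the annotations (requires network access)."""
    url = build_annotations_url(article_ids, annotation_type)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Placeholder identifier; replace it with a real article ID before fetching.
print(build_annotations_url(["MED:28585529"], "Diseases"))
```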

“We have close to 1.6 billion annotations available to help our users locate entities in the full text and abstracts of articles stored in Europe PMC,” said Aravind Venkatesan, Senior Data Scientist at EMBL-EBI. “These are available through the Europe PMC annotations tool, which supports scientists and database curators in their literature research by making it easy to find the relevant annotation terms they need to train their text mining models. This will help advance a range of research fields and also accelerate the field of text mining itself.”

Text mining can benefit many research areas by increasing the rate at which we unlock untapped information already present in the millions of life science articles published online. Here we have shown how EMBL-EBI scientists have harnessed text mining to accelerate fields including drug discovery and metagenomics research. But it doesn’t stop there; the same approach can be applied across a vast range of other fields. Text mining to advance the life sciences is still a young field, but it is an exciting one to be a part of right now.
