
Sunday, 5 June 2022

New clues to a 500-year-old mystery about the human heart

 Scientists show that muscular structures first described by Leonardo da Vinci are essential for heart function

Researchers have investigated the function of a complex mesh of muscle fibres that line the inner surface of the heart. The study, published in the journal Nature, sheds light on questions asked by Leonardo da Vinci 500 years ago, and shows how the shape of these muscles impacts heart performance and heart failure.

This project included collaborators at EMBL’s European Bioinformatics Institute (EMBL-EBI), Cold Spring Harbor Laboratory, the MRC London Institute of Medical Sciences, Heidelberg University, and the Politecnico di Milano.

In humans, the heart is the first functional organ to develop and starts beating spontaneously only four weeks after conception. Early in development, the heart grows an intricate network of muscle fibres – called trabeculae – that form geometric patterns on the heart’s inner surface. These are thought to help oxygenate the developing heart, but their function in adults has remained an unsolved puzzle since the 16th century.

To understand the roles and development of trabeculae, an international team of researchers used artificial intelligence to analyse 25 000 magnetic resonance imaging (MRI) scans of the heart, along with associated heart morphology and genetic data. The study reveals how trabeculae work and develop, and how their shape can influence heart disease. UK Biobank has made the study data openly available.

Solutions to da Vinci’s biological enigma

Leonardo da Vinci was the first to sketch trabeculae and their snowflake-like fractal patterns in the 16th century. He speculated that they warm the blood as it flows through the heart, but their true importance has not been recognised until now.

“Our findings answer very old questions in basic human biology. As large-scale genetic analyses and artificial intelligence progress, we’re rebooting our understanding of physiology at an unprecedented scale,” says Ewan Birney, Deputy Director General of EMBL.

The research suggests that the rough surface of the heart ventricles allows blood to flow more efficiently during each heartbeat, just like the dimples on a golf ball reduce air resistance and help the ball travel further.

The study also highlights six regions in human DNA that affect how the fractal patterns in these muscle fibres develop. Intriguingly, the researchers found that two of these regions also regulate branching of nerve cells, suggesting a similar mechanism may be at work in the developing brain.

“Our work significantly advanced our understanding of the importance of myocardial trabeculae,” explains Hannah Meyer, Principal Investigator at Cold Spring Harbor Laboratory. “Perhaps even more importantly, we also showed the value of a truly multidisciplinary team of researchers. Only the combination of genetics, clinical research, and bioengineering led us to discover the unexpected role of myocardial trabeculae in the function of the adult heart.”

Trabeculae and the risk of heart failure

The researchers discovered that the shape of trabeculae affects the performance of the heart, suggesting a potential link to heart disease. To confirm this, they analysed genetic data from 50 000 patients and found that different fractal patterns in these muscle fibres affected the risk of developing heart failure.

Further research on trabeculae may help scientists better understand how common heart diseases develop and explore new approaches to treatment.

“Leonardo da Vinci sketched these intricate muscles inside the heart 500 years ago, and it’s only now that we’re beginning to understand how important they are to human health. This work offers an exciting new direction for research into heart failure, which affects the lives of nearly 1 million people in the UK,” says Declan O’Regan, Clinical Scientist and Consultant Radiologist at the MRC London Institute of Medical Sciences.

Solving the protein structure puzzle

 Proteins are beautiful molecular structures and understanding what they look like has been a goal for scientists for more than half a century. After years of arduous work and frustratingly slow progress, a game-changing artificial intelligence method is poised to disrupt the field.

We call proteins the building blocks of life because they make up all living things, from the smallest virus or bacterium to plants, animals, and humans. But, in reality, proteins don’t look anything like blocks. They are beautifully complex structures and every single one of them is unique. Their shape, also called a structure, is linked to their function, which means their shape determines what they do. For example, haemoglobin transports oxygen around the body, while insulin maintains the delicate balance of sugar within the blood.

Simple question, complex answer

Studying protein structure means you’re faced with a very simple question that requires a very complex answer. A protein is a string of small organic molecules called amino acids, connected in a chain, a bit like beads on a string. This chain of amino acids spontaneously folds up to create a unique and beautiful structure. The simple question is: what does the structure look like?
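The “beads on a string” picture maps naturally onto a simple data representation: a protein is just a string over the 20-letter amino acid alphabet. A minimal sketch in Python (the peptide shown is an invented fragment, not a real protein):

```python
from collections import Counter

# The 20 standard amino acids, one letter each
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")

def composition(sequence: str) -> dict:
    """Count how often each amino acid 'bead' occurs in the chain."""
    sequence = sequence.upper()
    unknown = set(sequence) - AMINO_ACIDS
    if unknown:
        raise ValueError(f"Not standard amino acids: {unknown}")
    return dict(Counter(sequence))

# A short invented peptide: the 1D string is trivial to store and share;
# predicting the folded 3D structure it adopts is the hard question.
peptide = "MKTAYIAKQR"
print(composition(peptide))
```

The contrast is the point: the sequence fits in a one-line string, while the structure it folds into took decades of experimental work to determine.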

This problem has been around now for at least 50 years and, after many failed attempts, I came to believe that the only way to make progress was to gather more data to make better predictions. I was proven wrong.

Figuring out the structure of just one protein can take years of experimental work, using expensive equipment and incredibly complex methodology. One method is X-ray crystallography, which blasts crystalline molecules with an X-ray beam. This beam diffracts in many directions and, by measuring the angles and intensities, crystallographers can produce a 3D picture of the density of electrons within the crystal. This reveals the structure of complex biological molecules, including proteins. One of the difficulties of the method is obtaining the crystals, and sadly this method simply hasn’t worked for some proteins.

Experimental meets computational

Luckily, there is an incredibly active and tenacious community of scientists who have dedicated their lives to predicting protein structures – that is, how the chain folds – from amino acid sequences. All newly determined structures are stored in the Protein Data Bank (and its European node, PDBe) and are freely available for anyone in the world to look at.

In the mid 1990s, the need to coordinate efforts and assess progress became clearer than ever, so the community embarked on a worldwide experiment, called Critical Assessment of protein Structure Prediction (CASP). Every two years the organisers launch the challenge of predicting the structure of several proteins. The objective is to test and independently assess new computational methods for structure prediction. These methods use computers, not lab experiments, to predict protein structure. The methods, now increasingly powered by artificial intelligence (AI), had been improving over the past few years, but a solution still seemed a long way off.

That is until this week, when – during the latest CASP conference – the assessors announced that one team, DeepMind’s AlphaFold, had put forward an AI system that achieved unparalleled levels of accuracy. This approach built on our extensive knowledge of protein structures obtained in the lab over the past 60 years. But this was the first time a computational model was deemed to be competitive with experimental methods. And something that would have taken years of experimental work can now be deduced within just days using a new type of neural network.

Why does it matter?

There are millions of proteins that make up the living world, but we only know the structures of a tiny number of them. In fact, we only have experimental structures (or even partial structures) for 10% of the 20 000 proteins that make up the human body. A powerful AI model could unveil the structures of the other 90%. This is important not just because it improves our understanding of human biology, health, and disease, but also because in the longer term it would offer avenues of research, for example designing new drugs.

Most existing drugs are designed using 3D structures, but they currently target only about a quarter of human proteins. AlphaFold could help unlock more proteins as potential drug targets and open up new approaches to therapies. Furthermore, easily predicting the structure of viruses can help us understand their biology and the diseases they cause. Finally, there may be significant opportunities to understand and treat neglected tropical diseases, where research is currently under-resourced.

The potential goes beyond human health. Understanding plant and animal proteins (as well as their genomes) could help us improve crop yields or breeding procedures. This would hold significant potential for feeding a growing population.

Finally, at a more scientific level, being able to predict structure from sequence is the first real step towards protein design: building proteins that fulfil a specific function. From protein therapeutics to biofuels or enzymes that eat plastic, the possibilities are endless. 

A fine time for protein science

Understanding proteins is a bit like putting together a large 3D jigsaw puzzle in a dark room. You know what some of the pieces look like and you can sometimes match a few together in clusters, but it’s incredibly arduous and a complete solution is rarely found. A fast and accessible method for determining the whole structure in the computer solves the puzzle automatically.

As a lover of everything protein, the most exciting thing for me is that this breakthrough is not an end, but a whole new beginning, bringing with it electrifying opportunities and follow-on questions. The structures allow us to understand better how the proteins function and, in turn, this could enable us to fine-tune this function for the benefit of people and the planet. Just like the Human Genome Project facilitated the birth of new scientific disciplines, such as genomics, solving the protein structure question could bring about new and exciting fields of research. One thing is for sure, it’s a fine time to be a protein scientist!

Saturday, 4 June 2022

Learning from Deep Learning: the inspirational story behind AlphaFold

 AlphaFold is an Artificial Intelligence (AI) tool that predicts protein structure from sequence. In July 2021, DeepMind made the AlphaFold database public and freely accessible to users worldwide. By the beginning of 2022, the curated database had grown to contain one million protein structure predictions, and the ability to predict protein structures from amino acid sequences was named scientific breakthrough of the year 2021.

Scientists have used structural biology approaches to reveal, probe, and manipulate protein structures for many decades. With the launch of the AlphaFold Protein Structure Database, researchers have also begun to explore how the AI tool can drive new approaches to understanding protein structure and function.

Following the successful collaboration on the AlphaFold database, a delegation from DeepMind visited EMBL Heidelberg to learn more about EMBL research and services, and to discuss with scientists potential future directions in the application of AI in the life sciences.

During the visit, DeepMind founder and CEO Demis Hassabis, and AlphaFold team lead John Jumper, explained their work to develop AlphaFold as a new deep learning-based system. From the first steps towards developing the methodology and underlying machine learning ideas to the immense implications for molecular biological research now and in the future, Hassabis and Jumper shared their story and the inspiration behind AlphaFold.

“This is an incredibly exciting new era in digital biology, and AI is a powerful tool for accelerating scientific discovery. Our partnership with EMBL on the AlphaFold Protein Structure Database has been a wonderful and extraordinarily fruitful collaboration, and we have enjoyed this opportunity to brainstorm future ideas together,” Hassabis said.

Ewan Birney, EMBL Deputy Director General and EMBL-EBI Director, added: “The cooperation with DeepMind on AlphaFold has really been successful, and we want to extend that collaborative spirit between EMBL and DeepMind to more parts of EMBL. Artificial intelligence is a game changer and will fast-track biological discoveries in various ways in the years to come.”

 “It was a great pleasure to host the DeepMind team,” said EMBL Director General Edith Heard. “EMBL is proud to have helped ensure the AlphaFold tool was made freely available, and we are looking forward to the next developments in AI-assisted research.”


Friday, 3 June 2022

Deep learning models help predict protein function

 Deep learning models can improve protein annotations and have helped expand the Pfam database

Our protein family database – Pfam – is used by a diverse range of researchers across the globe. Open access to the protein family data stored in Pfam has helped experimental biologists understand protein function, aided structural biologists’ insights into protein structure, given computational biologists rapid access to protein sequence information, and let evolutionary biologists trace the origins of proteins.

Pfam gives researchers access to vital protein annotations, structures, and multiple sequence alignments. It is a resource widely used to classify protein sequences into phylogenies and identify domains – functional regions – to provide insights into protein function.

With help from new deep learning models, Pfam has increased the protein sequence annotation and function data available within the database by unprecedented amounts. Research published in the journal Nature Biotechnology demonstrates how deep learning methods developed by Google Research could be trained using data from Pfam to accurately annotate many previously undescribed protein domains, shedding light on potential protein function. This new data has expanded the database to such an extent that it would have taken several years to achieve the same result manually.

Deep learning and protein function

“Initially I was rather sceptical about using deep learning to reproduce the protein families within Pfam. Then I started collaborating more closely with Lucy Colwell and her team at Google Research and my scepticism quickly changed to excitement for the potential of these methods to improve our ability to classify sequences into domains and families,” said Alex Bateman, Senior Team Leader of Protein Sequence Resources at EMBL-EBI. “These models exceed my expectations. They’re not just copying the data already in Pfam, they’re able to learn from the data and find new information that is yet to be discovered. What this gives us is the ability to expand the Pfam collection and potentially that of other resources using these same deep learning methods.”

By combining deep learning models with existing methods to add new data into Pfam, the researchers were able to expand the database by almost 10%. This exceeds all expansion efforts made to the database over the last decade. The deep learning methods were also able to predict the function for 360 human proteins that had no previous annotation data available in Pfam.

Expanding Pfam

Additional protein family predictions generated by the Google Research team’s neural networks – algorithms that look for underlying structure in the sequences of protein domains and families – were used to create a supplement to Pfam called Pfam-N, where N stands for network. Pfam-N adds a further 6.8 million protein sequences to the Pfam database.

“We’re also now building on these established deep learning methods to expand the information in the database even further,” said Bateman. “We’re changing the way the existing deep learning model works so that we can call multiple protein domains at once. This new update to the database should be ready very soon.”

“My personal view is that there’s still a lot of scope to improve the deep learning models we’re currently using,” Bateman added. “We’re in the early days of this and I’m very hopeful for what it will mean for the future classification of protein families. This may even be something that will get solved in the next five years.”

Find out more

Find out more about Pfam’s collaboration with Google Research and get a detailed introduction to Pfam-N in this Xfam blog post.

Funding

This work is funded by the Wellcome Trust as part of a Biomedical Resources grant awarded to the Pfam database.


Source article(s)

Europe PMC: Harnessing the power of text mining to accelerate life sciences research

 How text mining collaborations benefit our research, data resources, and the wider scientific community

Text mining is the process of analysing vast amounts of textual material to extract meaningful concepts, relationships, and trends using machine learning approaches. It enables researchers to rapidly find new and hidden information in text-based sources. When these techniques are applied to scientific publications, it becomes possible to uncover new meaning and hidden patterns that would otherwise take years to manually curate. 
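At its simplest, the idea can be illustrated without any machine learning at all: scan a collection of texts and count mentions of terms from a known vocabulary. A toy sketch (the abstracts and vocabulary below are illustrative; real text mining pipelines use trained models rather than simple lookup):

```python
import re
from collections import Counter

def term_frequencies(documents, vocabulary):
    """Count how often each known term appears across a collection of texts.
    A toy stand-in for real text mining, which uses machine learning
    rather than exact lookup."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[A-Za-z0-9-]+", text.lower())
        counts.update(w for w in words if w in vocabulary)
    return counts

abstracts = [
    "BRCA1 mutations are associated with breast cancer.",
    "Expression of BRCA1 was reduced in tumour samples.",
]
vocab = {"brca1", "cancer", "tumour"}
print(term_frequencies(abstracts, vocab).most_common())
```

Scaled from two sentences to millions of publications, even counts like these start to reveal trends; the machine learning approaches described below go much further, extracting relationships between the concepts rather than just their frequencies.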

Tackling data challenges and ensuring that we are able to exploit large datasets to their full potential for life science research is a key part of the Data Sciences Plans within EMBL’s Molecules to Ecosystems Programme. This includes developing and experimenting with new technologies and machine learning approaches. For example, these methods are used in a variety of projects to extract new information from publications. This includes mining and extraction of gene–disease associations for drug discovery, enriching our services with metagenomics data, and providing information to the wider text mining community to help others train their own machine learning algorithms. 

What is Europe PMC?

Europe PMC is EMBL-EBI’s open science platform for life science publications. It’s available to anyone, anywhere for free. With Europe PMC, scientists can search and read over 40 million publications, preprints, and other documents enriched with links to supporting data, protocols, etc.

Mining for gene–disease associations

Text mining approaches are hugely beneficial for improving the way we identify novel drug targets. A vast amount of information on gene–disease associations and associated drug targets already exists online, hidden within millions of scientific publications. Manually sorting through these texts would take decades. However, using text mining to search the literature allows data to be accessed and analysed for more rapid drug discovery. 

In collaboration with Open Targets, researchers at Europe PMC are doing just this by creating a pipeline that maximises literature information extraction using named entity recognition (NER) models. NER is a widely used natural language processing approach for identifying real-world objects, such as people, locations, and times, within text. The Europe PMC team uses this approach to identify genes, proteins, diseases, chemicals, and other biomedical concepts in the life science literature. These bioNERs form the basis of gene–disease association identification from literature for Open Targets.

What are NER models?

NER models are a form of natural language processing (NLP) – a type of machine learning method which allows computers to analyse text rather than computer code. In this case, the natural language being detected consists of disease and gene terms found within life science literature.


“For our machine learning algorithms to work effectively we needed to train them with high-quality data,” said Shyamasree Saha, Machine Learning and Text Mining Scientist at EMBL-EBI. “At Europe PMC, we developed a gold standard dataset for genes, proteins, diseases, and organisms. We are using BioBERT, a domain-specific language model pre-trained on a large biomedical corpus, and fine-tuning the model for the NER task using our gold standard dataset. The model replaces our old dictionary-based NER approach and significantly improves entity association identification accuracy.”
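The “dictionary-based NER approach” the quote mentions replacing can be sketched in a few lines: scan the text for exact matches against curated term lists. A toy version (the term lists here are invented stand-ins, not Europe PMC’s real dictionaries) makes its limitation obvious, since it cannot handle unseen spellings or use surrounding context the way a fine-tuned model like BioBERT can:

```python
# Toy dictionary-based NER: exact lookup of word n-grams against curated
# term lists. Model-based NER improves on this because it generalises to
# terms and spellings not present in any dictionary.
GENE_TERMS = {"brca1", "tp53"}
DISEASE_TERMS = {"breast cancer", "lymphoma"}

def dictionary_ner(text):
    """Return (entity, label) pairs found by scanning word n-grams."""
    words = text.lower().replace(".", "").split()
    found = []
    for n in (2, 1):  # try two-word terms before single words
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in GENE_TERMS:
                found.append((phrase, "GENE"))
            elif phrase in DISEASE_TERMS:
                found.append((phrase, "DISEASE"))
    return found

print(dictionary_ner("BRCA1 is linked to breast cancer."))
```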

Learn more about how NER is being used to develop the Open Targets Platform.

Generating metadata descriptions

Metadata – the information that describes where, when, and how specific data are obtained – enriches the scientific value of genomic sequencing data and makes data FAIR (Findable, Accessible, Interoperable, and Reusable). However, these metadata are frequently missing from databases or contain poor quality descriptions, meaning they cannot be used to interpret the data. For metagenomics – the direct analysis of genomes contained within an environmental sample – the use of metadata is of vital importance to increase data reuse and improve interpretation.

Researchers from Europe PMC and EMBL-EBI’s metagenomics data resource MGnify have found a solution to this challenge by automatically extracting relevant metadata key terms straight from the literature. This is done using a machine learning framework to mine a wide range of metagenomics studies found in publications stored within the Europe PMC database. The project is called Enriching MEtagenomics Results using Artificial intelligence and Literature Data (EMERALD).

“One of the major limitations when comparing datasets is the lack of contextual metadata relating to a sample,” said Lorna Richardson, Coordinator for MGnify at EMBL-EBI. “To address this, we partnered with Europe PMC to automatically extract relevant metadata terms from publications, improving the range and depth of metadata available to our users. This metadata includes terms relating to the sequencing platform used, extraction kits, primers, the environment of the sample, and much more, which will help researchers get the most out of the data stored in MGnify.”

Find out more about how the EMERALD project is benefiting MGnify users

Annotations for the text mining community

Finally, the Europe PMC database itself is helping to advance the field of text mining by simplifying the way its users can find and access data from scientific literature. One of the tools available within Europe PMC is the annotation tool. This allows users developing their own text mining algorithms to quickly extract relevant terms and use them to develop their own text mining pipelines.

The annotations within this tool are collected by both Europe PMC and the wider text mining community, and they include biological terms such as disease names, chemicals, and proteins. The annotation terms available for each article are located in the tools menu within Europe PMC and can also be accessed programmatically using the annotations API.
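Programmatic access is a matter of building a query URL against the annotations API. A minimal sketch of constructing such a request (the endpoint path, parameter names, and example article ID follow the public API documentation at the time of writing; check the current Europe PMC API docs before relying on them):

```python
from urllib.parse import urlencode

# Europe PMC annotations API endpoint, as documented at the time of writing.
BASE = "https://www.ebi.ac.uk/europepmc/annotations_api/annotationsByArticleIds"

def annotations_url(article_id, annotation_type="Gene_Proteins"):
    """Build the URL to fetch one annotation type for a single article."""
    params = {
        "articleIds": article_id,   # e.g. "PMC:3558905"
        "type": annotation_type,    # e.g. "Gene_Proteins", "Diseases"
        "format": "JSON",
    }
    return f"{BASE}?{urlencode(params)}"

print(annotations_url("PMC:3558905"))
```

Fetching the resulting URL (for example with `urllib.request`) returns the annotation spans for that article as JSON, ready to feed into a text mining pipeline.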

“We have close to 1.6 billion annotations available to help our users locate entities in the full text and abstracts of articles stored in Europe PMC,” said Aravind Venkatesan, Senior Data Scientist at EMBL-EBI. “These are available through the Europe PMC annotations tool, which supports scientists and database curators in their literature research by making it easy to find the relevant annotation terms they need to train their text mining models. This will help advance a range of research fields and also accelerate the field of text mining itself.”

Text mining is a tool which can benefit many research areas by increasing the rate at which we can unlock uncharted information already present in the millions of life science articles published online. Here we have shown how EMBL-EBI scientists have been able to harness the power of text mining to accelerate fields including drug discovery and metagenomics research. But it doesn’t stop there; this same approach can be used to leverage a vast range of fields with endless possibilities. Text mining to advance the life sciences is still a young field, but it is an exciting one to be a part of right now. 

A machine learning approach for allocating gene function

 Researchers are making the most of machine learning methods to speed up genome annotation pipelines

EMBL’s European Bioinformatics Institute (EMBL-EBI) stores vast amounts of biological data and our researchers have expert knowledge of what these data are and how best to curate them. This makes EMBL-EBI well equipped to solve biological problems using machine learning – an artificial intelligence (AI) approach requiring extensive input of high-quality data to rapidly generate results. 

One project initiated in this way came from within the Ensembl team, who are using machine learning to help allocate a function to different genes in their newly annotated genomes at an unprecedented rate.

Adding gene function to genome annotations 

Annotating a genome means identifying and mapping the locations and structures of genes and other genomic features. Having access to genome annotation gives researchers information about the location of a gene but assigning a potential function requires additional work and experimental evidence. 

“We can start to extrapolate the function of a gene by looking at related genes in other species, but this can be costly, both computationally and in terms of human effort to manually curate the results,” said Fergal Martin, Eukaryotic Annotation Team Leader at EMBL-EBI. “This led us to try a machine learning approach to streamline this process of allocating potential gene function to the genes in our new species annotations deployed through Ensembl Rapid Release. Currently we are focusing on vertebrate species, but we want to extend the approach across eukaryotes.”

Machine learning to allocate gene function 

The HUGO Gene Nomenclature Committee (HGNC) and the Vertebrate Gene Nomenclature Committee (VGNC) teams at EMBL-EBI work hard to manually assign gene symbols – a short-form abbreviation for a particular gene usually with an associated function – to a variety of genomes. While the manual assignment efforts continue to be streamlined and cover a growing number of vertebrate species, many species in Ensembl have the majority of their gene symbols assigned through automated methods. 

Historically, gene symbols have been assigned through building gene trees, which describe the evolutionary relationships between genes both across and within species. This approach is computationally costly, especially with the recent rapid growth in the number of sequenced vertebrate genomes. Fergal and his team wanted to see if they could assign gene symbols, and thereby infer function, through a machine learning approach.

“We trained a neural network by feeding it roughly three and a half million protein sequences from a variety of different vertebrate species from Ensembl,” said Fergal. “For these sequences, we already had existing gene symbols with associated functions. The end result is that we have built a classifier that can replicate the existing assignments with around 94-97 percent accuracy, depending on the species. Crucially, it takes less than a minute to generate assignments and confidence values for a vertebrate gene set.”
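Ensembl’s actual classifier is a neural network trained on millions of sequences. As a rough illustration of the underlying idea only – turn a protein sequence into a fixed set of features, then assign the label of the most similar known profile – here is a toy k-mer-based nearest-profile classifier (all sequences and gene symbols below are invented):

```python
from collections import Counter

def kmer_vector(seq, k=2):
    """Feature map: counts of overlapping k-mers in the sequence."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def similarity(a, b):
    """Shared k-mer count between two profiles (a crude similarity score)."""
    return sum(min(a[kmer], b[kmer]) for kmer in a)

def classify(seq, labelled):
    """Assign the gene symbol whose training profile shares the most k-mers."""
    query = kmer_vector(seq)
    return max(labelled, key=lambda sym: similarity(query, labelled[sym]))

# Invented training sequences with known (made-up) symbols
training = {
    "GENEA": kmer_vector("MKTAYIAKQRMKTAYIAKQR"),
    "GENEB": kmer_vector("GGSSLPGGSSLPGGSSLP"),
}
print(classify("MKTAYIAKQQ", training))  # query resembles GENEA's sequence
```

A real system replaces the hand-written similarity with learned weights and also outputs a confidence value per assignment, which is what makes the sub-minute, 94–97% accurate assignments described above possible.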

Why use a machine learning approach?  

Using machine learning is saving the team a huge amount of computing time and the system is a lot less complex in terms of implementation than the existing approach that it replicates. Therefore, the team is looking at deploying it to the larger community. 

“While the system is not a replacement for the very high-quality manual assignments produced by teams like HGNC and VGNC, this approach could potentially be useful to curators as an additional tool to help manually validate assignments. It’s also something that individual users could use in assessing their own annotations,” said Fergal. 

The benefits of machine learning 

“As technology advances, scientists are increasingly using machine learning to answer biological questions. One example is the protein structure predictions from AlphaFold. We didn’t have a highly accurate algorithmic approach to figure out how to fold proteins, but deep learning is helping to solve this complex biological mystery,” said Fergal.

“That’s different from what we are trying to do,” he added. “AlphaFold is an example of solving a problem where we didn’t understand all the rules and variables of the system we were trying to model. What we’re doing here is replicating a system that we do understand, but which requires a lot of computing power to run. It’s exciting that deep learning approaches can provide such valuable solutions to challenges across the life sciences.”

Going forward, there is huge potential for using machine learning methods like this, both within Ensembl and across the organisation to benefit other data resources. Machine learning approaches can reduce both computational time and complexity. Large-scale genomics projects such as the Darwin Tree of Life (DToL) and the Earth BioGenome Project (EBP) will also benefit greatly from these approaches, as the new species annotations created for them can be deployed faster at a high standard.

“If we are to annotate the genomes of all species on Earth, we need to think of where we can make computational savings,” said Fergal. “There’s an incredible wealth of both in-house knowledge and high quality training data at EMBL-EBI and it’s really exciting to think about how machine learning could not only improve the quality of our data but also drastically reduce the associated computational cost and environmental footprint.” 

This machine learning approach will be rolled out as part of Ensembl Rapid Release, Ensembl’s lightweight genome browser designed to allow fast access to the latest genome annotations for a large number of species.
