How Data-sharing Technologies Bolster AI Progress in Medical Research

by Andrii Buvailo



While artificial intelligence (AI) has already proved groundbreaking in many industries (robotics, finance, surveillance, cybersecurity, and self-driving cars, to name just a few), the pharmaceutical industry has yet to enjoy a full-scale AI-driven transformation. Some companies have managed to demonstrate the power of AI for drug discovery and basic biology research, including Moderna (accelerated discovery of mRNA vaccines), Insilico Medicine (accelerated small molecule discovery, 8 drug candidates in 2 years, including novel targets), Recursion Pharmaceuticals (a diverse preclinical/clinical pipeline of drug candidates enabled by AI and robotic labs), and DeepMind (major advances in solving protein folding and predicting 3D structures of large protein complexes using AI). Pretty much every company in pharma and biotech, big or small, is “experimenting” with AI technologies, but the industry as a whole is still far from what we might call “AI-centric” or “AI-first”, unlike, for example, internet technologies and software. A major reason is the lack of quality data needed to train large-scale deep learning models that generalize well.

Image credit: Olemedia iStock


This might seem surprising, as pharmaceutical research generates enormous amounts of data daily. But when you consider the degree of secrecy and protectionism that competing pharmaceutical giants impose on their research, and the ever-growing push by governments and regulatory bodies towards protecting personal and especially medical data, it becomes clear that most of this data is not actually available to AI practitioners. Valuable data is dispersed across thousands of research and medical organizations, hidden behind their firewalls: decades of screening, testing and validation research, decades of clinical trials, and enormous amounts of patient medical data held in local hospitals, private EHRs, and so on. Access to such data for AI training purposes could not only improve the ability to model biology and understand disease mechanisms, but also yield more robust biomarkers, better match patients with relevant treatment options in clinical trials, and offer more robust and better validated high-throughput diagnostic tools (e.g. analysis of radiology images using AI). Because of the data shortage, medical AI models are often trained on poorly diversified data (e.g. patients from only one geography), which leads to biased models and poor real-world performance.

Data generation is important, but just as important are technologies that allow access to such data in a manner that is, on the one hand, feasible for building machine learning pipelines and, on the other hand, compliant with the regulatory and commercial secrecy requirements typically applied to handling sensitive data. This becomes even more challenging with real-world, real-time data access requirements, where models have to respond and adjust to real-life events “on the fly” and output relevant predictions quickly.

Indeed, according to insights from the World Economic Forum 2021, 76% of executives across industries believe new ways of collaborating with ecosystem partners, third-party organizations, and even competitors, are essential to innovation in the “era of big data”. Experts also predict that secure data-sharing capabilities could better position organizations to monetize their own data as well.

In a recent report, Deloitte predicts that by the end of 2023 a significant number of healthcare organizations will be exploring ways to accomplish business goals through AI-driven analysis of data provided by other organizations via specialized secure data-sharing mechanisms that preserve data privacy and commercial secrets.


Sharing is caring

One of the game-changing technologies that can enable AI practitioners to train AI models on much larger and more diverse datasets is federated learning.



Federated learning, a collaborative form of machine learning introduced by Google AI in 2017, distributes the training process among many users. Instead of gathering all data from all users in one centralized location to train models, federated learning trains AI models on local devices in large batches, then sends those learnings (model updates, not raw data) back to a global model, so that data never needs to leave any particular device.
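To make the idea concrete, here is a minimal sketch of federated averaging on a toy linear model. It is an illustration only, not the method of any product mentioned in this article: the synthetic data, the function names (`make_site`, `local_update`, `federated_average`, `train_federated`) and the hyperparameters are all assumptions chosen for the example. Each "site" trains on its own private data, and only model weights travel to the aggregation step.

```python
import random

TRUE_W, TRUE_B = 3.0, 1.0  # synthetic ground truth, used only to generate toy data


def make_site(n, x_low, x_high, rng):
    """Create one site's private dataset: y = 3x + 1 plus a little noise."""
    data = []
    for _ in range(n):
        x = rng.uniform(x_low, x_high)
        data.append((x, TRUE_W * x + TRUE_B + rng.gauss(0.0, 0.01)))
    return data


def local_update(weights, data, lr=0.3, epochs=5):
    """Run a few gradient-descent steps on one site's data, starting from the global weights."""
    w, b = weights
    n = len(data)
    for _ in range(epochs):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return (w, b)


def federated_average(updates, sizes):
    """Combine site models, weighting each by the number of samples it was trained on."""
    total = sum(sizes)
    w = sum(u[0] * n for u, n in zip(updates, sizes)) / total
    b = sum(u[1] * n for u, n in zip(updates, sizes)) / total
    return (w, b)


def train_federated(sites, rounds=200):
    """Each round: send the global weights out, train locally, average the returned weights."""
    global_weights = (0.0, 0.0)
    sizes = [len(data) for data in sites]
    for _ in range(rounds):
        # Only weights leave a site; the raw (x, y) records are never pooled.
        updates = [local_update(global_weights, data) for data in sites]
        global_weights = federated_average(updates, sizes)
    return global_weights


rng = random.Random(0)
# Three sites with deliberately different data ranges (a crude stand-in for non-uniform populations).
sites = [
    make_site(40, 0.0, 0.5, rng),
    make_site(60, 0.5, 1.0, rng),
    make_site(50, 0.0, 1.0, rng),
]
w, b = train_federated(sites)
print(f"global model: w={w:.3f}, b={b:.3f}")
```

Real deployments add much more on top of this loop (secure aggregation, client sampling, handling of stragglers), none of which is shown here.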

There is a great article in Scientific Reports (a Nature Portfolio journal), “Federated learning in medicine: facilitating multi-institutional collaborations without sharing patient data”, which summarizes the key concepts of this data management paradigm. This is what the authors say: “Federated learning is a novel paradigm for data-private multi-institutional collaborations, where model-learning leverages all available data without sharing data between institutions, by distributing the model-training to the data-owners and aggregating their results. We show that federated learning among 10 institutions results in models reaching 99% of the model quality achieved with centralized data, and evaluate generalizability on data from institutions outside the federation”. Pretty impressive, isn’t it?


A case of Owkin, and the largest AI-driven clinical trial to date

Owkin, a French-American healthcare unicorn that applies artificial intelligence (AI) and federated learning frameworks to drug discovery, has recently announced a strategic collaboration with Bristol Myers Squibb (BMS) to use AI to enhance BMS’s clinical trials for its cardiovascular drugs, in what Owkin claims to be the first-ever use of AI to enhance drug trials at this scale.

An important aspect of the deal is that BMS will be able to leverage data from a network of academic medical research centers with which Owkin has partnered.

According to Owkin, the company works with a network of leading hospitals across the world and has equipped 18 leading academic research centers with its federated learning technology. It is also part of eight consortia of leading biopharma and research partners, and works closely with 160 key opinion leaders from oncology, cardiology and other areas. The key point is that federated learning, as applied by Owkin, allows data to be analyzed by machine learning models without being sent off-site, so while the AI is efficiently trained on all the data, the data itself is not compromised in terms of privacy or ownership.


MELLODDY project

Since 2019, Owkin’s federated learning platform has served as the infrastructure for a rather unique collaborative initiative in the pharma space: the MELLODDY (Machine Learning Ledger Orchestration for Drug Discovery) project.

There, 10 pharmaceutical companies came together to battle-test a new decentralized, data-private machine learning approach to collaborative research. Another important component of the MELLODDY project is the blockchain technology at the heart of Owkin’s machine learning platform.

A blockchain is a growing, immutable list of records of transactional data, a sort of distributed ledger for maintaining permanent and tamper-proof information. It can serve as a decentralized database managed by computers belonging to a peer-to-peer (P2P) network. Each computer in the network stores a copy of the ledger, which prevents a single point of failure (SPOF). All copies are updated and validated simultaneously.
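The tamper-evidence property described above comes from chaining records by hash. The sketch below illustrates only that one idea; it is not how Owkin’s platform or MELLODDY is implemented, it leaves out the P2P network and consensus entirely, and the `Ledger` class and the payload fields are hypothetical names invented for the example.

```python
import copy
import hashlib
import json

GENESIS_HASH = "0" * 64  # placeholder "previous hash" for the first block


def block_hash(index, prev_hash, payload):
    """Hash a block's contents together with the hash of the block before it."""
    body = json.dumps(
        {"index": index, "prev_hash": prev_hash, "payload": payload},
        sort_keys=True,
    )
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


class Ledger:
    """An append-only list of records in which each block commits to its predecessor."""

    def __init__(self):
        self.blocks = []

    def append(self, payload):
        index = len(self.blocks)
        prev_hash = self.blocks[-1]["hash"] if self.blocks else GENESIS_HASH
        block = {
            "index": index,
            "prev_hash": prev_hash,
            "payload": payload,
            "hash": block_hash(index, prev_hash, payload),
        }
        self.blocks.append(block)
        return block

    def is_valid(self):
        """Recompute every hash; any edit to an earlier record breaks the chain."""
        prev_hash = GENESIS_HASH
        for index, block in enumerate(self.blocks):
            if block["index"] != index or block["prev_hash"] != prev_hash:
                return False
            if block["hash"] != block_hash(index, prev_hash, block["payload"]):
                return False
            prev_hash = block["hash"]
        return True


ledger = Ledger()
ledger.append({"event": "model_update", "site": "partner_1", "round": 1})
ledger.append({"event": "model_update", "site": "partner_2", "round": 1})
ledger.append({"event": "aggregate", "round": 1})

# A peer holding a copy can detect that someone rewrote history.
tampered = copy.deepcopy(ledger)
tampered.blocks[0]["payload"]["site"] = "someone_else"
print(ledger.is_valid(), tampered.is_valid())
```

In a real network, peers would additionally have to agree on which chain is authoritative, which is the job of a consensus protocol and is out of scope here.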


“Neutral Zones”

One interesting project addressing the challenge of protecting sensitive information and intellectual property while sharing data for research purposes is Data Stations, from the University of Chicago. Data Stations is a new architecture that offers a “neutral zone” where data is shared but sealed: users cannot see, access, or download the original datasets, and can view only a broad catalog of what data is available. Users query the collected data with “data-unaware task capsules” (for example, asking about the effectiveness of treatments across patient demographics), and the Data Station automatically does the rest: finding the right data, combining it or using it to train the necessary AI models, and providing the user with an answer without disclosing the underlying raw data.

Another example of the “neutral zone” paradigm for data sharing comes from St. Jude Children’s Research Hospital, Adaptive Biotechnologies, and Answer ALS, which are working to develop targeted care interventions.
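The sealed “neutral zone” pattern described above for Data Stations can be sketched in a few lines. This is a toy illustration, not the actual Data Stations design or API: the `DataStation` class, its `catalog` and `run_mean` methods, the column names, and the rule of suppressing small groups are all assumptions made for the example. The point is only that the caller names what they want, sees a catalog and an aggregate answer, and never receives rows.

```python
from collections import defaultdict
from statistics import mean


class DataStation:
    """Holds contributed datasets privately and answers only aggregate questions."""

    def __init__(self, min_group_size=3):
        self._datasets = {}  # owner name -> list of row dicts, never returned to callers
        self._min_group_size = min_group_size  # assumed safeguard against tiny groups

    def register(self, owner, rows):
        """A data owner contributes rows; they stay sealed inside the station."""
        self._datasets[owner] = list(rows)

    def catalog(self):
        """Expose only coarse metadata: which columns exist and how many rows."""
        return {
            owner: {
                "columns": sorted({key for row in rows for key in row}),
                "n_rows": len(rows),
            }
            for owner, rows in self._datasets.items()
        }

    def run_mean(self, value_column, group_column):
        """Data-unaware task: the caller names columns, not datasets or locations."""
        groups = defaultdict(list)
        for rows in self._datasets.values():
            for row in rows:
                if value_column in row and group_column in row:
                    groups[row[group_column]].append(row[value_column])
        # Groups that are too small are withheld rather than reported.
        return {
            key: round(mean(values), 3)
            for key, values in groups.items()
            if len(values) >= self._min_group_size
        }


station = DataStation(min_group_size=3)
station.register("hospital_a", [
    {"age_group": "under_50", "responded": 1},
    {"age_group": "under_50", "responded": 0},
    {"age_group": "under_50", "responded": 1},
    {"age_group": "over_50", "responded": 0},
])
station.register("hospital_b", [
    {"age_group": "under_50", "responded": 1},
    {"age_group": "over_50", "responded": 0},
    {"age_group": "over_50", "responded": 1},
    {"age_group": "unknown", "responded": 1},
])

print(station.catalog())
print(station.run_mean("responded", "age_group"))
```

Neither hospital alone has enough "over_50" patients to report on, but the combined, sealed data does, which is the kind of benefit the neutral-zone approach is after.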

Federated Learning “on the edge”

The proliferation of real-time technologies such as VR, AR, and self-driving cars has catalyzed the emergence of a new architecture for data processing: edge computing.

The traditional model of cloud computing is unsuitable for applications that demand low latency. In contrast, edge computing is primarily concerned with transmitting data among devices at the “edge”, closer to where user applications are located, rather than to a centralized server. An “edge device” is usually a resource-constrained device (e.g. a mobile phone) that is geographically close to the nearest edge server, which in turn has sufficient computing resources and high bandwidth. When the edge server requires more computing power, it connects to the corresponding cloud server. In such an architecture, latency is dramatically reduced because data does not need to travel as far, and bandwidth improves significantly because the user no longer shares a single traffic lane to transfer their data. Edge computing may be a great paradigm for training AI models in a federated way, using data from digital health gadgets, mobile phones, or medical devices at hospitals and other locations.
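One way to picture federated training on such a device, edge server, cloud hierarchy is two-tier aggregation: each edge server averages the updates from its nearby devices, and the cloud then averages the edge summaries. The sketch below is an assumption about how this could be arranged, not a description of any specific system; the function names and the numbers are made up for illustration.

```python
def weighted_average(updates):
    """updates: list of (weight_vector, sample_count). Returns (average_vector, total_count)."""
    total = sum(count for _, count in updates)
    dim = len(updates[0][0])
    average = [
        sum(vector[i] * count for vector, count in updates) / total
        for i in range(dim)
    ]
    return average, total


def edge_aggregate(device_updates):
    """An edge server summarizes its nearby devices into one (vector, count) pair."""
    return weighted_average(device_updates)


def cloud_aggregate(edge_summaries):
    """The cloud combines the edge summaries into the global model."""
    global_vector, _ = weighted_average(edge_summaries)
    return global_vector


# Hypothetical model updates from devices attached to two edge servers.
edge_1_devices = [([1.0, 2.0], 10), ([3.0, 4.0], 30)]
edge_2_devices = [([5.0, 6.0], 60)]

hierarchical = cloud_aggregate([
    edge_aggregate(edge_1_devices),
    edge_aggregate(edge_2_devices),
])
flat, _ = weighted_average(edge_1_devices + edge_2_devices)
print(hierarchical, flat)
```

Because each tier carries its sample count forward, the two-tier result matches what a single flat average over all devices would give, while only one summary per edge server has to travel to the cloud.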


Image credit: Qi Xia et al., High-Confidence Computing, Vol. 1, Issue 1, 2021


The Future of Data Sharing in the Biomedical Field

The extraordinary potential of federated learning to tackle many headaches of data acquisition for AI-driven research projects is obvious, although a lot of work remains to be done. The method is gaining attention just as the National Institutes of Health (NIH) is pushing to implement a new rule that will require its grant recipients to submit Data Management and Sharing Plans, an effort to address the scientific reproducibility crisis. The NIH is also promoting the use of data repositories that meet the FAIR principles for scientific data: Findable, Accessible, Interoperable and Reusable.

Initiatives such as this one by the NIH are meant to ensure that researchers from both public and private organizations will soon be in a far better position to locate and collaborate on datasets, and to browse and search for them in a catalog-like format.

The progress of AI-driven drug discovery and clinical research fundamentally depends on data being available in sufficient amounts and of sufficient quality. So over the next several years, we will probably see a wave of biomedical startups building robust data management infrastructures, offering data repositories with novel capabilities, and providing tools for machine learning practitioners of all kinds.

Topics: Emerging Technologies   
