DMER Group Blog

MR-KG: A Knowledge Graph of Mendelian Randomization Evidence Powered by Large Language Models

2026-01-10T00:00:00.000Z

📌 Background

Mendelian randomization (MR) is a powerful causal inference method that uses genetic variants as natural experiments to assess causal relationships between putative risk factors and disease outcomes. MR studies are increasingly abundant, but synthesising evidence across them remains challenging due to heterogeneity in reporting, traits examined, and the structure of the published literature.

To address this, Liu, Burton, Gatua, Hemani & Gaunt (2025) introduce MR-KG — a knowledge graph of MR evidence automatically extracted from published studies using large language models (LLMs).

Liu et al. "MR-KG: A knowledge graph of Mendelian randomization evidence powered by large language models". 2025, medRxiv DOI:10.64898/2025.12.14.25342218

🧠 What Is MR-KG?

MR-KG is a structured, machine-readable network of results from Mendelian randomization studies. Instead of manually curating every causal estimate, this project uses state-of-the-art large language models to:

Extract structured data (e.g., exposures, outcomes, effect estimates) from scientific text
Link entities such as traits, genetic instruments, and study metadata
Standardise the relationships so they can be interrogated at scale

This makes MR evidence navigable by computational systems for downstream analysis, search, and reasoning.

🛠️ How It Works

The MR-KG pipeline has three major components:

LLM-based Extraction — abstracts of published MR studies are processed through large language models to pull out structured triples (e.g., [trait → outcome → causal effect]).
Graph Construction and Storage — extracted results are normalised into a consistent schema and stored as a graph where nodes represent entities (traits, studies, variants) and edges represent relationships (e.g., causal evidence).
Interactive Access — a live web interface and API (e.g., via https://epigraphdb.org/mr-kg) allow users and programs to query and explore the graph.
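As a rough illustration of the graph-construction step, the sketch below stores extracted results as edges keyed by a normalised (exposure, outcome) pair and flags pairs where studies disagree on direction. The `MREvidence` schema, the field names, and the toy records are all hypothetical and are not taken from MR-KG itself.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MREvidence:
    """One extracted MR result (hypothetical schema, for illustration only)."""
    study_id: str
    exposure: str
    outcome: str
    direction: str  # "positive", "negative" or "null"


def build_graph(records):
    """Index evidence edges by a case-normalised (exposure, outcome) pair."""
    graph = defaultdict(list)
    for rec in records:
        graph[(rec.exposure.lower(), rec.outcome.lower())].append(rec)
    return graph


def conflicting_pairs(graph):
    """Return pairs whose studies report more than one non-null direction."""
    conflicts = []
    for pair, edges in graph.items():
        directions = {e.direction for e in edges if e.direction != "null"}
        if len(directions) > 1:
            conflicts.append(pair)
    return sorted(conflicts)


# Invented placeholder records, not real extracted evidence.
records = [
    MREvidence("study-1", "Exposure A", "Outcome X", "positive"),
    MREvidence("study-2", "exposure a", "outcome x", "positive"),
    MREvidence("study-3", "Exposure B", "Outcome Y", "positive"),
    MREvidence("study-4", "Exposure B", "Outcome Y", "negative"),
]
graph = build_graph(records)
print(conflicting_pairs(graph))
```

Lower-casing trait names stands in for real trait normalisation, which in practice would need ontology mapping rather than string matching.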

The repository also integrates supplementary tools for quality control and similarity analyses between studies or traits.

🔍 Key Features

Automated MR evidence extraction reduces the need for manual curation.
Knowledge graph format represents complex relationships and enables sophisticated queries.
LLM-powered processing allows extraction from a wide range of publication styles and formats.
API & web frontend for interactive use by researchers and software clients.

💡 Why It Matters

MR-KG addresses a key bottleneck in genetic epidemiology: while MR generates important causal insights, the pace and volume of publication make it hard to synthesise evidence consistently.

A structured knowledge graph enables researchers to:

Rapidly identify all MR evidence linking specific exposures and outcomes.
Detect overlapping or conflicting causal findings.
Integrate MR evidence with other biomedical resources for multidimensional analysis.

This could accelerate causally informed hypothesis generation and help triangulate evidence across studies.

🚀 Getting Started

The MR-KG project’s web interface and API are publicly available at https://epigraphdb.org/mr-kg/, allowing other teams to:

Explore the current graph and extracted evidence
Build analytical tools on top of the graph
Contribute improvements to extraction models or schema

🧪 A Note on Preprints

This work is currently a preprint, meaning it has not yet been peer-reviewed. Preprints should be interpreted as early reports of research findings, and while valuable for rapid dissemination, they are preliminary.

Genetics as a side‑effect detective for antipsychotic medicines

2025-07-30T00:00:00.000Z

Side‑effects are one of the main reasons people stop taking antipsychotic medicines — even when the drugs are helping with symptoms. But when someone reports “I’ve gained weight” or “my blood pressure has changed”, it’s often hard to know whether the drug truly caused it, which biological target is responsible, and whether that target is the one we wanted to hit in the first place.

In work led by Andrew Elmore, published in PLOS Genetics, we combine pharmacology (what receptors a drug binds) with human genetics (natural experiments) to map side‑effects back to specific receptors.

The basic idea

Antipsychotics don’t just bind a single receptor. They bind many — and some of those bindings are “on‑target” (part of how the drug works), while others are “off‑target” (biological collateral damage).

We built a framework that brings together:

Drug–receptor binding affinities (how strongly each drug binds each receptor)
Reported side‑effects from a large reference database
Genetic instruments for receptor/gene activity (eQTLs)
GWAS traits that can stand in for side‑effects
Mendelian randomization (MR) + genetic colocalisation to strengthen causal interpretation

The output is a simple summary: a side‑effect score for each Drug × Receptor × Trait combination, which we can then aggregate to compare drugs and mechanisms.
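The aggregation step can be pictured with a small sketch. The paper's scoring formula is not reproduced in this post, so the per-combination scores below are invented placeholders, and simple summation is an assumption about how scores might be combined.

```python
from collections import defaultdict

# Toy rows: (drug, receptor, trait, score). All names and values are invented.
scores = [
    ("drug_a", "receptor_1", "trait_x", 0.50),
    ("drug_a", "receptor_2", "trait_x", 0.25),
    ("drug_a", "receptor_1", "trait_y", 0.25),
    ("drug_b", "receptor_1", "trait_x", 0.125),
]


def aggregate(rows, key_index):
    """Sum scores over all rows sharing the value at key_index (0=drug, 1=receptor, 2=trait)."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key_index]] += row[3]
    return dict(totals)


by_drug = aggregate(scores, 0)          # cumulative profile per drug
by_receptor = aggregate(scores, 1)      # contribution per receptor
print(by_drug, by_receptor)
```

Keeping the table in long form (one row per Drug × Receptor × Trait) means the same function serves for comparing drugs, receptors, or traits.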

What we analysed

We focused on six commonly prescribed antipsychotics (including clozapine, olanzapine and risperidone). Across these drugs we identified 68 receptors with evidence of binding, and started from 165 reported side‑effects — of which 80 could be genetically proxied using available GWAS in OpenGWAS.

What we found

A few results stood out:

We identified 36 side‑effects that look likely to be caused by drug action through 30 receptors.
The bulk of evidence pointed to off‑target mechanisms.
Clozapine showed the largest cumulative side‑effect profile and the largest number of scored side‑effects in this framework.

Three concrete examples

The paper walks through three side‑effects in detail (and these nicely illustrate how the approach can generate mechanistic hypotheses).

1) Neutropenia (dangerously low white blood cell counts)

Clozapine is clinically linked to rare but serious neutropenia. In our genetic scoring, the signal for neutropenia was strongest for clozapine and suggested contributions from targets including GABRA1 and HTR1B.

2) Weight gain

Weight gain is a major concern for patients and a common reason for discontinuation. Our results again highlighted clozapine (and olanzapine) and suggested that differences in binding to targets such as CHRM3 and HRH1 could help explain why these drugs tend to have larger weight‑gain profiles than others.

3) Blood pressure effects

Clinical evidence on antipsychotics and blood pressure can be mixed, and we saw that different receptors imply different directions of effect. In the scoring, blood‑pressure signals were strongly influenced by HRH1, consistent with this receptor being a plausible driver for some blood‑pressure changes.

Why this matters

What we think is important about this work is the transferability:

It provides a framework to triage side‑effects early (even before large trials, when we have target information and genetics).
It can help separate “likely causal” side‑effects from those that may be coincidental, comorbidity‑driven, or reporting artefacts.
It gives mechanistic handles: if a side‑effect seems driven by an off‑target receptor, that receptor becomes a candidate to avoid in future drug design.

This won’t replace clinical pharmacovigilance. But it could help researchers ask better questions — and focus lab, trial, and monitoring efforts where the biology is most convincing.

Caveats we should keep in mind

A few limitations are worth emphasising:

Some receptor families (notably parts of the GABA system) have incomplete binding‑affinity data, which affects how confidently we can compare magnitudes across drugs.
The genetic resources used are predominantly European‑ancestry, so we should be careful about generalising across populations.
Pharmacokinetics (dose, tissue penetration, blood–brain barrier) and real‑world reporting can complicate “genetics → drug” translation.

M-PreSS: a transparent, open-source approach to study screening in systematic reviews

2025-06-30T00:00:00.000Z

Overview

Screening thousands of titles and abstracts is often the single biggest bottleneck in a systematic review workflow. In this new medRxiv pre-print, we describe M-PreSS: a model pre-training approach that aims to make screening faster without relying on closed, black-box systems.

The key idea is to start from an open biomedical language model (BlueBERT) and fine-tune it for screening using a Siamese neural network setup, so that the resulting model can generalise across different review topics rather than needing a brand-new model each time.

Xu et al. "M-PreSS: A Model Pre-training Approach for Study Screening in Systematic Reviews". 2025, medRxiv DOI:10.1101/2025.04.08.25325463

What we did

In M-PreSS, we fine-tuned BlueBERT to produce representations of study records (titles/abstracts) that can be used to score relevance for screening decisions. We then evaluated several training strategies in seven COVID-19 systematic reviews, focusing on whether a model trained on some topics could transfer to another topic.
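To make the idea of scoring relevance from representations concrete, here is a minimal sketch that ranks records by cosine similarity to a topic vector. The use of cosine similarity, the three-dimensional toy vectors, and the function names are assumptions for illustration; the preprint should be consulted for how M-PreSS actually compares embeddings.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_records(topic_vec, record_vecs):
    """Return (record_id, score) pairs sorted from most to least similar to the topic."""
    scored = [(rid, cosine_similarity(topic_vec, vec)) for rid, vec in record_vecs.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Toy embeddings; real ones would come from the fine-tuned language model.
topic = [1.0, 0.0, 1.0]
record_vecs = {
    "rec_1": [1.0, 0.0, 1.0],
    "rec_2": [0.0, 1.0, 0.0],
    "rec_3": [1.0, 1.0, 0.0],
}
ranking = rank_records(topic, record_vecs)
print(ranking)
```

A screening decision would then come from thresholding these scores or reading records in ranked order.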

Two practical variations explored in the preprint are:

Enriching the “topic definition” used for training by adding explicit study selection criteria (the kind you would normally write in a protocol).
Training on more related review topics, to encourage broader generalisation.

What we found

Across the seven COVID-19 reviews, the approach showed good cross-topic performance:

Average recall/sensitivity was reported as 0.86 (range 0.67–1.00).
Average false positive rate was 6.48% (range 1.38%–11.41%).
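For readers less familiar with these metrics, the sketch below computes recall and false positive rate from screening counts. The counts are invented for illustration and are not taken from the preprint.

```python
def recall(tp, fn):
    """Share of truly relevant records that the model flags as relevant."""
    return tp / (tp + fn)


def false_positive_rate(fp, tn):
    """Share of truly irrelevant records that the model wrongly flags as relevant."""
    return fp / (fp + tn)


# Invented counts for one hypothetical review.
tp, fn = 18, 2      # relevant records: found vs missed
fp, tn = 50, 950    # irrelevant records: wrongly flagged vs correctly excluded

review_recall = recall(tp, fn)
review_fpr = false_positive_rate(fp, tn)
records_to_read = tp + fp  # what a reviewer would still screen by hand
print(review_recall, review_fpr, records_to_read)
```

In screening, recall is usually the metric to protect, since a missed relevant study cannot be recovered later, while the false positive rate drives the remaining manual workload.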

Two additional findings are especially relevant if you are thinking about deploying screening models in real review pipelines:

Adding study selection criteria into the topic definition improved precision–recall performance (PRAUC) by 2.74%.
Adding more related topics during training increased performance by 15.82%.

We also report that, in the COVID-19 topics we compared against, this fine-tuned open model can outperform ChatGPT/GPT-4 in two out of three previously reported screening settings, while using substantially fewer computational resources.

Why this matters

From our perspective, this work lands in a useful “sweet spot”:

Transparent and reproducible: the underlying model is open, and the training approach can be documented and rerun.
Generalises across topics: rather than building a bespoke model from scratch for every review.
Practical levers to improve performance: especially the finding that writing selection criteria in a structured way can directly help the model.

That combination is important if we want screening automation to be something review teams can actually trust, maintain, and update over time.

Limitations and next steps

A couple of things we will be considering as we continue to work in this space:

Beyond COVID-19: the evaluation focuses on COVID-19 reviews, so it will be interesting to see how well the approach transfers to other domains (e.g. nutrition, cancer epidemiology, environmental exposures).
Human-in-the-loop integration: the biggest real-world gains often come from pairing models with active learning, prioritisation, and clear stopping rules—how M-PreSS plugs into those workflows will matter.

🧪 A Note on Preprints

Integrating Mendelian randomization and literature mining to map breast cancer risk factors

2025-05-31T00:00:00.000Z

Breast cancer research spans epidemiology, molecular biology, clinical trials, and a vast and rapidly growing literature. One challenge is triangulating across these evidence types: when different sources point in the same direction, we can be more confident we are seeing something causal rather than correlational.

In a paper led by Marina Vabistsevits published in the Journal of Biomedical Informatics, we show how to bring two complementary sources together:

Mendelian randomization (MR) evidence generated at scale using MR-EvE (“Everything-vs-Everything”), and
Literature-mined relationships stored in EpiGraphDB, our biomedical knowledge graph.

Why combine MR with literature mining?

MR can help prioritise likely causal risk factors, but it does not automatically tell us how an exposure influences disease. Meanwhile, the biomedical literature is full of mechanistic clues—but it is too large to read manually, and individual papers can be hard to weigh.

Our aim was to use MR for efficient hypothesis generation, and then use literature-mined links to suggest plausible intermediates/mediators, before returning to genetics again for validation.

What we did

We started with MR-EvE estimates to screen many traits against breast cancer outcomes, looking for candidate risk factors and possible mediators. We then integrated these MR results with literature-mined “triples” (subject–predicate–object statements extracted from papers) in EpiGraphDB, using an approach based on overlapping “literature spaces” between a risk factor trait and breast cancer.
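One simple way to picture overlapping literature spaces is the classic literature-based discovery pattern: terms that appear downstream of the risk factor in one set of triples and upstream of the outcome in another. The sketch below assumes that pattern, and the triples are invented placeholders; the paper's actual overlap procedure may differ.

```python
def candidate_intermediates(risk_triples, outcome_triples, risk_factor, outcome):
    """Terms that are objects of risk-factor triples and subjects of outcome triples."""
    downstream = {obj for subj, _, obj in risk_triples if subj == risk_factor}
    upstream = {subj for subj, _, obj in outcome_triples if obj == outcome}
    return sorted(downstream & upstream)


# Invented (subject, predicate, object) triples for illustration only.
risk_triples = [
    ("risk_factor", "AFFECTS", "term_a"),
    ("risk_factor", "STIMULATES", "term_b"),
    ("other_trait", "AFFECTS", "term_c"),
]
outcome_triples = [
    ("term_a", "ASSOCIATED_WITH", "outcome"),
    ("term_c", "CAUSES", "outcome"),
    ("term_b", "AFFECTS", "unrelated_disease"),
]

intermediates = candidate_intermediates(risk_triples, outcome_triples, "risk_factor", "outcome")
print(intermediates)
```

Only `term_a` links both sides here, so it would be the candidate carried forward for genetic follow-up.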

Finally, for literature-based discovery (LBD) candidates, we used two-step MR to check whether a proposed intermediate sat on a plausible causal path from risk factor → intermediate → breast cancer.
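A common way to summarise a two-step MR path is the product of coefficients: multiply the risk factor → intermediate estimate by the intermediate → outcome estimate, with a first-order delta-method standard error. The sketch below shows that calculation under the assumption that the two estimates are independent; the input values are invented, and the paper may summarise its two-step results differently.

```python
import math


def indirect_effect(beta_xm, se_xm, beta_my, se_my):
    """Product-of-coefficients indirect effect with a first-order delta-method SE."""
    estimate = beta_xm * beta_my
    se = math.sqrt(beta_xm ** 2 * se_my ** 2 + beta_my ** 2 * se_xm ** 2)
    return estimate, se


# Invented estimates: step 1 (risk factor -> intermediate), step 2 (intermediate -> outcome).
estimate, se = indirect_effect(beta_xm=0.5, se_xm=0.1, beta_my=0.2, se_my=0.05)
lower, upper = estimate - 1.96 * se, estimate + 1.96 * se
print(estimate, se, (lower, upper))
```

If either step is close to null, the product shrinks towards zero, which is what makes the two-step check a useful filter for literature-derived candidates.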

What we found

Using this pipeline, we identified 129 lifestyle risk factors and molecular traits with evidence of an effect on breast cancer (including both established and potentially novel signals). We also made the MR results explorable via an R/Shiny app for interactive browsing and hypothesis generation.

To show how the integration works in practice, the paper walks through two case studies:

Childhood body size, where combining MR and literature helps explore downstream intermediates that might connect early-life adiposity to later breast cancer risk.
HDL-cholesterol, where the literature-mined links provide mechanistic hypotheses that can then be followed up using genetics-based mediation checks.

Why this matters

This is not about replacing careful study design or detailed mechanistic work. The point is to make it easier to navigate the space of plausible hypotheses, and to prioritise follow-up work with a clearer view of (a) what looks causal and (b) what the literature suggests about potential pathways.

More broadly, it’s a demonstration of what we think knowledge graphs are good at: connecting evidence across study types and helping us ask better questions, faster.

Try it yourself

Paper: Integrating Mendelian randomization and literature-mined evidence for breast cancer risk factors (Journal of Biomedical Informatics, 2025).
Interactive MR heatmaps app: https://mvab.shinyapps.io/MR_heatmaps/
EpiGraphDB platform: https://epigraphdb.org

If you use the app or EpiGraphDB in your work and have ideas for additional features, do get in touch — we’re always keen to hear how people are using these resources.

Dissecting blood pressure and BMI: a pathway- and tissue-partitioned Mendelian randomization comparison

2025-05-30T00:00:00.000Z

Complex traits like blood pressure (BP) and body mass index (BMI) are highly polygenic: hundreds of associated variants can be used as instruments in Mendelian randomization (MR). But those variants don’t all “mean the same thing” biologically—some may act through kidney physiology, others through vasculature, neurobiology, metabolism, and so on. If we can separate instruments into interpretable biological subsets, we can start asking questions like:

Which component of BP is most responsible for coronary heart disease risk?
Are BMI → atrial fibrillation effects more “metabolic” or more “neuro-behavioural”?

Work led by Genevieve Leyden and Maria Sobczyk, now published in Genome Medicine, sets out to do exactly this by comparing two ways of partitioning genetic instruments before running MR.

Two complementary ways to split instruments

The paper evaluates two strategies side-by-side:

1) Pathway-partitioned instruments from Mendelian disease phenotypes

We introduce a new approach that groups genome-wide significant BP/BMI variants by whether they lie near Mendelian disease genes enriched for particular symptom categories (via MendelVar).

For blood pressure, variants are partitioned into renal vs vessel (vasculature) pathways.
For BMI, variants are partitioned into metabolic vs mental health pathways.

The idea is pragmatic: Mendelian disorders provide a clinically grounded “phenotype dictionary” that can help annotate the biology sitting underneath GWAS loci.

2) Tissue-partitioned instruments from genetic colocalization

We compare the above to an existing, colocalization-derived approach previously published by Genevieve that partitions variants based on evidence that a GWAS signal overlaps an eQTL signal in specific tissues.

For blood pressure, variants are assigned to kidney vs artery tissues.
For BMI, variants are assigned to brain vs adipose tissues.

These two partitioning schemes are not expected to be identical—one is anchored in clinical phenotype enrichment, the other in molecular regulatory context—but comparing them can highlight where biology is robust versus where interpretation needs caution.
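Whatever the partitioning scheme, the downstream step is the same: run MR separately within each instrument subset. The sketch below does this with a fixed-effect inverse-variance-weighted (IVW) estimator, a standard MR method chosen here for illustration. The partition labels and summary statistics are invented, and the paper's analyses may use additional estimators.

```python
import math
from collections import defaultdict


def ivw_estimate(instruments):
    """Fixed-effect IVW estimate from (beta_exposure, beta_outcome, se_outcome) tuples."""
    numerator = sum(bx * by / se ** 2 for bx, by, se in instruments)
    denominator = sum(bx ** 2 / se ** 2 for bx, _, se in instruments)
    return numerator / denominator, math.sqrt(1.0 / denominator)


def partitioned_ivw(variants):
    """Group variants by their 'partition' label and run IVW within each group."""
    groups = defaultdict(list)
    for v in variants:
        groups[v["partition"]].append((v["bx"], v["by"], v["se"]))
    return {name: ivw_estimate(rows) for name, rows in groups.items()}


# Invented summary statistics; 'partition' would come from pathway or tissue annotation.
variants = [
    {"partition": "partition_a", "bx": 0.1, "by": 0.02, "se": 0.01},
    {"partition": "partition_a", "bx": 0.2, "by": 0.04, "se": 0.01},
    {"partition": "partition_b", "bx": 0.1, "by": 0.05, "se": 0.01},
    {"partition": "partition_b", "bx": 0.2, "by": 0.10, "se": 0.01},
]
results = partitioned_ivw(variants)
print(results)
```

Comparing the per-partition estimates, with their standard errors, is what allows statements about which component appears to drive an exposure–outcome effect.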

What did we find?

Our headline result is that different partitions can suggest different “drivers” of the same exposure–outcome relationship.

Blood pressure: vessel pathways stand out for heart disease, but tissue signals are more nuanced

Using the pathway-partitioned approach, the systolic BP → heart disease effect appeared more strongly driven by vessel (vasculature) instruments than renal instruments.

However, in the tissue-partitioned analyses, the corresponding comparison suggested a stronger effect attributed to kidney than artery tissue, consistent with BP acting through multiple intertwined mechanisms rather than a single clean partition.

Across outcomes, we also report consistent evidence that:

Vessel (pathway) and artery (tissue) instruments dominate the negative directional effect of diastolic BP on left ventricular stroke volume, and the positive directional effect of systolic BP on type 2 diabetes.

BMI and atrial fibrillation: metabolic vs brain components

When focusing on BMI → atrial fibrillation, we found the causal effect was predominantly driven by:

metabolic-pathway instruments (relative to mental health pathway instruments), and
brain-tissue instruments (relative to adipose tissue instruments).

That contrast is interesting in itself: it suggests that “metabolic” and “brain” are not mutually exclusive stories, and may instead reflect overlapping causal routes (e.g. appetite regulation and energy balance).

Why this matters for MR interpretation

A common workflow in MR is to build a single instrument set and estimate an “average” causal effect. This paper is a reminder that, for many complex traits, that average may be combining multiple biological components.

Partitioning instruments can help us:

identify potential pleiotropic pathways (e.g. if only one partition shows an effect),
generate mechanism-focused hypotheses (which component of BP/BMI is most relevant?),
prioritize follow-up (e.g. deeper locus-to-gene work in the partition driving an effect), and
spot interpretability pitfalls when different annotation schemes tell different stories.

However, we emphasize that partitioned effect-size differences need robust validation (and can arise by chance without strong supporting evidence), so this should be viewed as a hypothesis-generating framework rather than a final answer.

Our take

From our perspective, this is a neat example of how we can bring in external biological knowledge—here, Mendelian disease phenotype enrichment—to get more out of standard GWAS-derived instruments, and to complement more “molecular” approaches like colocalization.

If you are doing MR on a complex exposure and you suspect heterogeneous mechanisms, it’s worth thinking about instrument partitioning early: it can make your results easier to interpret, and it can help decide what’s worth chasing downstream.

Paper

Leyden GM, Sobczyk MK, Richardson TG, Gaunt TR. Distinct pathway-based effects of blood pressure and body mass index on cardiovascular traits: comparison of novel Mendelian randomization approaches. Genome Medicine (2025) 17:54. DOI: 10.1186/s13073-025-01472-2. PubMed: https://pubmed.ncbi.nlm.nih.gov/40375348/. Full text: https://pmc.ncbi.nlm.nih.gov/articles/PMC12079859/

CanDrivR-CS: cancer-specific machine learning to separate recurrent from rare missense variants

2024-09-30T00:00:00.000Z

Overview

Cancer genomes contain huge numbers of mutations, but only a subset are functionally important. One simple clue is recurrence: if the same missense variant shows up repeatedly across patients with the same cancer type, that can suggest positive selection for growth advantage. At the same time, rare variants can still matter (for example, if they emerge under treatment as resistance mechanisms).

In work led by Amy Francis, we introduce CanDrivR-CS, a framework that trains cancer-type-specific machine-learning models to distinguish recurrent from rare somatic missense variants. It’s a useful reminder that “one-size-fits-all” predictors can miss disease-context signals, and that relatively interpretable models can still surface mechanistic hypotheses.

What we did

We curated missense variant data from the International Cancer Genome Consortium (ICGC) and trained a suite of gradient boosting classifiers, one per cancer type, alongside a baseline pan-cancer model. The goal was not to label variants as “pathogenic” in the clinical sense, but to learn patterns that separate variants from two cancer-relevant frequency regimes: those that recur across samples versus those that appear rarely.

A practical detail was our evaluation setup: we report leave-one-group-out cross-validation (LOGO-CV), which is designed to test generalisation when a meaningful group (e.g. a gene or cohort) is held out at training time.
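The splitting logic behind LOGO-CV is short enough to sketch directly. The group labels below are hypothetical; which grouping variable the paper uses is not specified in this post. scikit-learn offers the same behaviour through `sklearn.model_selection.LeaveOneGroupOut`.

```python
def leave_one_group_out(groups):
    """Yield (held_out_group, train_indices, test_indices), one split per distinct group."""
    for held_out in sorted(set(groups)):
        train = [i for i, g in enumerate(groups) if g != held_out]
        test = [i for i, g in enumerate(groups) if g == held_out]
        yield held_out, train, test


# One label per variant; the labels are placeholders for a real grouping variable.
groups = ["group_a", "group_a", "group_b", "group_c", "group_c"]
splits = list(leave_one_group_out(groups))
for held_out, train, test in splits:
    print(held_out, train, test)
```

Because every variant from the held-out group is absent from training, the score reflects performance on genuinely unseen groups rather than on near-duplicates of training examples.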

Key results

Cancer-type-specific models outperformed the pan-cancer baseline, with LOGO-CV F1 scores reaching 0.90 for skin cutaneous melanoma (CanDrivR-SKCM) and 0.89 for skin adenocarcinoma (CanDrivR-SKCA), versus 0.792 for the baseline model.
DNA-shape properties consistently ranked among the most informative features across cancer types. We report that recurrent missense variants were enriched in regions associated with DNA bends and rolls, raising the possibility that local structural context contributes to mutational hotspots (for example via replication or repair dynamics).

Why this matters

From a translational perspective, separating “common” from “rare” somatic variants is not the whole driver/passenger story — but it is a useful lens:

It can help prioritise variants for follow-up in cancer-type-specific settings (where selection pressures differ).
It provides an interpretable way to test whether adding new feature classes (like DNA-shape) improves discrimination.
It highlights the value of open, reusable pipelines for variant feature engineering and modelling.

Resources

Preprint: https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1
Code and data: https://github.com/amyfrancis97/CanDrivR-CS

Paper

Francis A, Campbell C, Gaunt T. CanDrivR-CS: A Cancer-Specific Machine Learning Framework for Distinguishing Recurrent and Rare Variants. bioRxiv (posted Sep 23, 2024). https://www.biorxiv.org/content/10.1101/2024.09.19.613896v1

DrivR-Base: a feature extraction toolkit for variant effect prediction

2024-04-30T00:00:00.000Z

Understanding which genetic variants are likely to be functional (and which are probably benign) is a cornerstone of modern human genetics. Over the last decade, variant-effect predictors have become increasingly sophisticated — but behind every model sits the same practical headache: assembling a sensible set of features (annotations) for millions of variants from dozens of databases.

In a 2024 Bioinformatics paper led by Amy Francis, we introduce DrivR-Base, a reproducible, Dockerised toolkit that turns this feature-extraction step into something you can run and re-run with far less pain.

What problem is DrivR-Base trying to solve?

Most variant-effect prediction methods are “integrative”: they combine signals about a variant’s genomic context (e.g. conservation), regulatory annotations (e.g. ENCODE peaks), and protein-level consequences (e.g. amino-acid change, structure). The data exist — but pulling them together is often:

time-consuming (lots of sources, formats, and edge cases),
hard to reproduce (different software versions and dependencies), and
risky (you can spend weeks extracting features that later turn out not to help your model).

DrivR-Base’s core idea is simple: provide a single, consistent pipeline that extracts a broad set of annotations for all possible SNVs in GRCh38, so we can spend more time modelling and less time wrangling.

What is DrivR-Base?

DrivR-Base is a feature extraction toolkit for human single nucleotide variants (SNVs) in the GRCh38 genome build. It produces a table where each row is a variant and columns are feature values drawn from multiple sources (genome- and protein-level). It’s packaged for Docker, which helps make installs and runs repeatable across machines and over time.

The paper highlights a few motivating use-cases beyond “classic” pathogenicity prediction, including haploinsufficiency prediction and feature sets that could feed into drug repurposing workflows.

What features does it extract?

DrivR-Base groups its outputs into ten feature groups, spanning sequence context, regulatory genomics, and protein structure.

Conservation and mappability
PhyloP/PhastCons conservation scores across multiple alignments, plus Umap/Bismap mappability (useful for flagging regions prone to sequencing ambiguity).
Variant Effect Predictor (VEP) annotations
Transcript consequences (one-hot encoded), predicted amino acids (wild-type vs mutant), and distances to transcripts when multiple are affected.
Dinucleotide properties (DiProDB)
Thermodynamic and conformational properties for dinucleotide contexts around the variant, captured under wild-type and mutant configurations.
DNA shape (DNAShapeR)
Local structural properties like minor groove width, helix twist, propeller twist, roll, and electrostatic potential in a configurable window around the SNV.
GC content and CpG metrics
GC fraction, CpG counts, and observed/expected CpG across multiple window sizes.
Kernel-based sequence similarity (spectrum kernels)
K-mer based comparisons between wild-type and mutant sequence windows as a compact way to encode “sequence disruption”.
Amino-acid substitution matrices
Substitution rates from common matrices (e.g. BLOSUM, PAM, JTT variants) for non-synonymous variants.
Amino-acid properties
Hundreds of amino-acid descriptors (e.g. hydrophobicity, polarity, flexibility) for wild-type and mutant residues.
ENCODE-derived regulatory features
Peaks and signal summaries across multiple assay types (TF ChIP-seq, histone marks, DNase/ATAC, eCLIP, etc.). Note: the authors report this step can require substantial local storage (on the order of ~160GB) because it downloads large ENCODE datasets.
Protein structure features from AlphaFold (and PDB)
For coding variants, mapping to protein positions enables extraction of AlphaFold structural information (e.g. atom coordinates and conformation-type encodings).
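Two of the simpler feature groups, GC/CpG metrics and spectrum kernels, can be sketched in a few lines. This is not DrivR-Base's code: the observed/expected CpG formula used here (CpG count × window length / (C count × G count)) is one common definition, and the function names and example sequences are assumptions.

```python
from collections import Counter


def gc_cpg_metrics(seq):
    """GC fraction, CpG count and observed/expected CpG ratio for one sequence window."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = sum(1 for i in range(n - 1) if seq[i:i + 2] == "CG")
    gc_fraction = (c + g) / n if n else 0.0
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return {"gc_fraction": gc_fraction, "cpg_count": cpg, "cpg_obs_exp": obs_exp}


def spectrum_kernel(seq_a, seq_b, k=3):
    """Dot product of k-mer count vectors for two sequences."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return sum(counts_a[kmer] * counts_b[kmer] for kmer in counts_a)


# Toy wild-type and mutant windows differing at a single position.
wild_type = "ACGTACGT"
mutant = "ACGAACGT"
metrics = gc_cpg_metrics("ACGTCGAT")
similarity = spectrum_kernel(wild_type, mutant)
self_similarity = spectrum_kernel(wild_type, wild_type)
print(metrics, similarity, self_similarity)
```

The gap between the self-similarity and the wild-type/mutant similarity gives a compact measure of how much the substitution disrupts the local k-mer content.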

Why this matters for our work

A lot of what we do in the DMER team sits at the interface of genetic evidence and downstream biology — and variant-level annotations are often the glue. Even when our end goal isn’t “variant pathogenicity prediction”, having a robust, standardised way to pull out features can help with:

building or benchmarking new predictors (and understanding why they behave as they do),
prioritising variants for experimental follow-up, and
reusing the same feature definitions across projects to avoid “feature drift”.

Just as importantly, DrivR-Base makes it easier to ask the boring-but-essential questions early, like: Which feature groups are actually informative for my prediction task? That can save a lot of iteration time.

Getting started

DrivR-Base is distributed via GitHub with Docker instructions. The paper and repository are the best places to start:

Paper (open access via PubMed Central): https://pmc.ncbi.nlm.nih.gov/articles/PMC11057939/
Code: https://github.com/amyfrancis97/DrivR-Base

Reference

Francis A, Campbell C, Gaunt TR. DrivR-Base: a feature extraction toolkit for variant effect prediction model construction. Bioinformatics (2024).

Pilot analysis on BioRxiv and MedRxiv full text data to facilitate comprehensive data mining on biomedical literature

2023-08-21T00:00:00.000Z

Overview

The BioRxiv and MedRxiv preprint servers are vital infrastructure for the biomedical research community. They also provide a rich and comprehensive resource for mining the biomedical literature to investigate research trends, interests, and novel findings. In previous work we have mined BioRxiv and MedRxiv extensively to extract structured literature knowledge into EpiGraphDB[1] and to derive research claims from recent preprints that can be triangulated with other evidence in ASQ[2].

Supported by the Elizabeth Blackwell Institute Rapid Research Funding Call, in this project we have acquired the full text data archives for BioRxiv and MedRxiv preprints from 2013 to May 2023, and we have also conducted some exploratory analysis on the data archives.

Full text archives of BioRxiv / MedRxiv preprints

Scraping preprint text from the web is time-consuming and error-prone. However, BioRxiv and MedRxiv provide archive data on the preprints for text and data mining, hosted on Amazon S3 as Requester Pays buckets. Full text archives from BioRxiv and MedRxiv are stored in two S3 buckets, biorxiv-src-monthly and medrxiv-src-monthly respectively, where the organization structure is roughly as follows (a high-level description is also available on the TDM pages):

    biorxiv-src-monthly
    ├── Back_Content
    └── Current_Content
        ├── April_2019
        ├── April_2020
        │   ├── 0002415e-6e79-1014-bad3-d7b11ff8718c.meca
        │   ├── 0008729e-7222-1014-9e12-b08e9cbb4568.meca
        │   ├── 00114dfb-6f21-1014-b58f-80aa2e0e89bd.meca
        │   ├── 00114e62-6cf9-1014-aedd-beed0b185e0b.meca
        │   ├── 0025be66-6dea-1014-8d12-849fadd63f55.meca
        │   ├── ....
        │   ├── ....
        ├── April_2021
        ├── April_2022
        ├── April_2023
        ├── August_2019
        ├── August_2020
        ├── August_2021
        ├── August_2022
        ├── December_2018
        ├── December_2019
        ├── December_2020
        ├── December_2021
        ├── December_2022
        ├── ...
        ├── ...
    medrxiv-src-monthly
    ├── Back_Content
    └── Current_Content

Each .meca file is an archive in .zip format which contains individual files associated with one preprint submission. For example the following structure corresponds to files associated with this preprint (Shrestha et al., 2022, Knowledge, Attitude and Practice (KAP) study on COVID-19 among the general population of Nepal). Specifically the file ./content/22279527.xml is the “full text / manuscript” file in .xml format.

    0a2ef310-6c04-1014-8ee5-ac250845df11.meca
    ├── content
    │   ├── 22279527.pdf
    │   ├── 22279527v1_tbl1.tif
    │   ├── 22279527v1_tbl2.tif
    │   ├── 22279527v1_tbl3.tif
    │   ├── 22279527v1_tbl4.tif
    │   ├── 22279527v1_tbl5a.tif
    │   ├── 22279527v1_tbl5.tif
    │   ├── 22279527v1_tbl6a.tif
    │   ├── 22279527v1_tbl6b.tif
    │   ├── 22279527v1_tbl6.tif
    │   └── 22279527.xml
    ├── directives.xml
    ├── manifest.xml
    ├── mimetype
    └── transfer.xml

How do we know which .meca archive filename corresponds to which preprint DOI? Unfortunately we haven’t found a way to know this without actually opening the .meca file and parsing the full text XML. In addition, we have identified several rounds of changes to the storage structure and to the text format (e.g. how key information such as the submission date is represented in XML tags), which means that it takes effort to systematically and robustly curate metadata as well as manuscript sections (e.g. abstracts, methods, results, conclusions). We will need to curate this information as separate intermediate datasets for our follow-up research projects.
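A minimal sketch of that open-and-parse step is shown below, using only the Python standard library. It assumes the manuscript sits under content/ as in the listing above and that the DOI is stored in a JATS-style `<article-id pub-id-type="doi">` element; given the format changes described, real archives may need additional fallbacks. The demo archive and its DOI are invented.

```python
import io
import xml.etree.ElementTree as ET
import zipfile


def find_manuscript_xml(meca):
    """Return the name of the first content/*.xml member of an open archive, or None."""
    for name in meca.namelist():
        if name.startswith("content/") and name.endswith(".xml"):
            return name
    return None


def extract_doi(meca_file):
    """Open a .meca (zip) archive from a path or file object and return its DOI, or None."""
    with zipfile.ZipFile(meca_file) as meca:
        xml_name = find_manuscript_xml(meca)
        if xml_name is None:
            return None
        root = ET.fromstring(meca.read(xml_name))
    # Assumption: JATS-style tagging; older archives may represent the DOI differently.
    for elem in root.iter("article-id"):
        if elem.get("pub-id-type") == "doi":
            return (elem.text or "").strip()
    return None


# Build a tiny in-memory archive with an invented DOI to exercise the function.
demo_xml = (
    '<article><front><article-meta>'
    '<article-id pub-id-type="doi">10.0000/example</article-id>'
    '</article-meta></front></article>'
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("manifest.xml", "<manifest/>")
    archive.writestr("content/000001.xml", demo_xml)
buffer.seek(0)
demo_doi = extract_doi(buffer)
print(demo_doi)
```

Running this over every archive once and saving the filename-to-DOI mapping would give the kind of intermediate lookup dataset described above.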

Some exploratory results

As part of our ongoing efforts to parse and curate the preprint information from the raw archives, here we discuss some of the results gathered so far, which will inform future projects.

Trends and distributions

Figures 1-3 show the trends and distributions of the preprints. MedRxiv was separated from BioRxiv in 2019 to host preprints related to clinical and medical topics, and both have now become vital infrastructure for the scientific community in communicating findings rapidly. The COVID-19 pandemic also contributed substantially to the growth in researchers presenting their findings as preprints before the peer-review process is complete. Bioinformatics, Cancer Biology, Cell Biology, Evolutionary Biology, Microbiology and Neuroscience are among the most populous categories in BioRxiv, whereas Epidemiology, Infectious Diseases (except HIV/AIDS), and Public and Global Health are among the most populous ones in MedRxiv.

Figure 1: Total number of monthly preprint submissions. We count the archives by their submission month and preprint source (biorxiv / medrxiv). The red line corresponds to the combined submissions from both sources. The black dashed line corresponds to February 2020, when COVID-19 started to become a pandemic, and the red dashed line corresponds to May 2020, around which the monthly submissions peaked.

Figure 2: Total number of monthly BioRxiv submissions by category. We count the archives by their submission month and category, where we consolidate the monthly categories below the top-15 most populous ones into an “Others” category.

Figure 3: Total number of monthly MedRxiv submissions by category. We count the archives by their submission month and category, where we consolidate the monthly categories below the top-15 most populous ones into an “Others” category.
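The counting behind Figures 2 and 3 can be sketched in a few lines. The version below assumes each record is a (month, category) pair and that the top-15 cut-off is applied within each month; both the record layout and the per-month reading of the cut-off are assumptions, and the function name is made up for illustration.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def monthly_counts_by_category(
    records: Iterable[Tuple[str, str]], top_n: int = 15
) -> Dict[str, Counter]:
    """Count records per month and category, folding categories outside
    each month's top_n into a single "Others" bucket."""
    per_month = defaultdict(Counter)
    for month, category in records:
        per_month[month][category] += 1

    consolidated = {}
    for month, counts in per_month.items():
        kept = Counter(dict(counts.most_common(top_n)))
        others = sum(counts.values()) - sum(kept.values())
        if others:
            kept["Others"] = others
        consolidated[month] = kept
    return consolidated


if __name__ == "__main__":
    # Made-up records, with top_n lowered so the "Others" bucket is visible
    demo = (
        [("2020-05", "Epidemiology")] * 3
        + [("2020-05", "Neuroscience")] * 2
        + [("2020-05", "Genetics")]
    )
    print(monthly_counts_by_category(demo, top_n=2))
```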

Topic analysis

To further demonstrate the distribution of research topics, we carried out a simple topic analysis, using the BERTopic library to embed and model all (over 300,000) article titles and derive topics corresponding to research themes. Figure 4 shows the trends of the dynamic topics, where we visually identify a topic related to the transmission of the SARS-CoV-2 virus (“sarscov2”, “transmission”, “seroprevalence”, etc.) dominating over others during the early phase (2020-2021) of the pandemic, followed by its gradual decline. In addition, a topic related to vaccines (“vaccination”, “vaccine hesitancy”, etc.) can also be seen to be prevalent during 2021. Figure 5 then further demonstrates how each topic relates to others in the topic clusters.

Figure 4: Trends of topics from the dynamic topic modelling analysis. Dynamic topics are modelled by associating the submission month of the preprint with the topics represented from the article titles.

Figure 5: Topic clusters. We randomly selected a sample of 100,000 documents (article titles) to generate the topic model, and then used the UMAP dimensionality reduction method to visualize the clustering of documents. Among all the generated topics, we highlight those associated with the COVID-19 pandemic, as well as, for comparison, a few others related to health data science and genetic epidemiology.

Summary

The preliminary results above demonstrate how much can be mined from the BioRxiv / MedRxiv preprints. We are excited at the prospect of research projects on text mining biomedical studies involving the full historical archive of biomedical research (as proxied by the preprints), not just on public health and epidemiology topics, but also on cross-disciplinary ones, such as socio-economic factors influencing the eventual publication of a preprint, or effective modelling of research trends via network or clustering analysis methods. If you are interested in collaborating, please reach out to Yi Liu (yi6240.liu[at]bristol.ac.uk) and Tom Gaunt (tom.gaunt[at]bristol.ac.uk).

Code availability

Our code for data access and research analysis has been published as a GitHub repository at https://github.com/MRCIEU/biorxiv-medrxiv-tdm, which we will continue to update, as our current results are just a first step towards bigger analysis projects.

References

[1] Yi Liu, Benjamin Elsworth, et al., Tom Gaunt, EpiGraphDB: a database and data mining platform for health data science, Bioinformatics, Volume 37, Issue 9, May 2021, Pages 1304–1311, https://doi.org/10.1093/bioinformatics/btaa961

[2] Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries medRxiv 2022.04.12.22273803; doi: https://doi.org/10.1101/2022.04.12.22273803

Proteome-wide Mendelian randomization in global biobank to identify multi-ancestry drug targets

2022-11-01T00:00:00.000Z

Overview

Genetic studies have been very biased towards populations of European ancestry in western Europe and the United States of America, and this has led to a significant bias in the application of Mendelian randomization (MR) to identify intervention targets. In this project we worked with a leading international genetics consortium, the Global Biobank Meta-analysis Initiative (GBMI) to evaluate the differences in predicted drug target effects between African and European ancestry populations.

What we did

In this paper (published in Cell Genomics), Huiling and Chris carried out a multi-ancestry proteome-wide MR analysis using cross-population data from GBMI. We estimated the causal effects of 1545 proteins on eight diseases in African (n=32,658) and European (n=1,219,993) ancestry populations. We found 45 protein-disease pairs with MR and colocalization evidence of causality in European ancestry, and 7 protein-disease pairs with evidence of causality in African ancestry. The difference in sample size (and, consequently, statistical power) almost certainly explains the large difference in number of causal effects detected. Interestingly, only 2 protein-disease pairs showed MR evidence in both ancestries.

Paper

‘Proteome-wide Mendelian randomization in global biobank meta-analysis reveals multi-ancestry drug targets for common diseases’ by Huiling Zhao, Humaira Rasheed, Therese Haugdahl Nøst, Yoonsu Cho, Yi Liu, Laxmi Bhatta, Arjun Bhattacharya; Global Biobank Meta-analysis Initiative; Gibran Hemani, George Davey Smith, Ben Michael Brumpton, Wei Zhou, Benjamin M Neale, Tom R Gaunt and Jie Zheng in Cell Genom. 2022 Nov 9;2(11):100195. doi: 10.1016/j.xgen.2022.100195.

New funding: NIHR Bristol Biomedical Research Centre

2022-10-14T00:00:00.000Z

“The National Institute for Health and Care Research Bristol Biomedical Research Centre (NIHR Bristol BRC) has been awarded nearly £12 million of new funding for the next five years. The funding has been awarded to University Hospitals Bristol and Weston NHS Foundation Trust by the NIHR, with the University of Bristol a major partner.” - Press release

See the NIHR Bristol Biomedical Research Centre website for more details on the overall BRC research portfolio.

Links to the Data Mining Epidemiological Relationships programme

The NIHR BRC Translational Data Science theme is co-led by Profs Tom Gaunt and John Macleod. The theme aims to translate research in the MRC IEU, in particular in the following two workstreams:

Prioritizing interventions

The first workstream will use genetic evidence to prioritise interventions building on some of our work in the use of molecular QTL Mendelian randomization and genetic colocalization and our collaborations with pharmaceutical partners. This theme will be co-led by Lavinia Paternoster and Gibran Hemani, with Tom Gaunt, Kate Tilling and George Davey Smith.

Mendelian randomization (MR) is a ground-breaking gene-based approach pioneered in Bristol by our Medical Director George Davey Smith. This approach doesn’t involve giving people a particular treatment. Instead, it uses natural variation in our genes as a proxy for a modifiable factor, in order to estimate the effect of that factor on disease outcomes. It also allows us to explore how different populations are affected using existing datasets from around the world.

MR is now routinely used to decide which targets to focus on for medical and public health intervention. However, it has mainly been used for disease prevention rather than treatment. To address this, we will apply our new MR methods to genetic datasets to identify potential treatment targets.

The use of MR has also mainly focused on white European populations. We will work with our large population-based study collaborators, including Global Biobank Meta-analysis Initiative and Born in Bradford, to address this. This will allow us to predict ancestry-specific effects for existing and new drugs, and to prioritise interventions for a range of ethnic groups.

We are working with our other themes, including mental health and diet and physical activity, to apply our MR approaches in their research.

Omics for prediction and prognosis

The second workstream will use omics for prediction and prognosis, building on some of Caroline Relton’s work in the MRC IEU on molecular epidemiology, and in particular the use of epigenetics for prediction. The workstream will be co-led by Paul Yousefi, Matthew Suderman and Caroline Relton.

In this workstream we use large, complex molecular (‘omics’) datasets to identify biomarkers to predict who will get a disease and how it will progress.

We use machine learning to identify, optimise and validate these molecular biomarkers. We then combine them with data from health records, cohort studies and trials to develop disease prediction tools for use in a range of settings.

Our biomarker identification work will support other NIHR Bristol BRC themes, including respiratory and mental health.

Triangulating evidence in health sciences with Annotated Semantic Queries

2022-04-16T00:00:00.000Z

Update: The ASQ work has now been published in Bioinformatics.

Yi Liu, Tom R Gaunt, Triangulating evidence in health sciences with Annotated Semantic Queries, Bioinformatics, Volume 40, Issue 9, September 2024, btae519, https://doi.org/10.1093/bioinformatics/btae519

Overview

Integrating information from data sources representing different study designs has the potential to strengthen evidence in population health research. However, this concept of evidence “triangulation” presents a number of challenges for systematically identifying and integrating relevant information.

In this medRxiv preprint we present ASQ (Annotated Semantic Queries), a natural language query interface to the integrated biomedical entities and epidemiological evidence in EpiGraphDB. ASQ enables users to extract “claims” from a piece of unstructured text, and then investigate the evidence that could support or contradict the claims, or offer additional information relevant to the query.

The ASQ approach has the potential to support the rapid review of pre-prints, grant applications, conference abstracts and articles submitted for peer review. ASQ implements strategies to harmonize biomedical entities in different taxonomies and evidence from different sources, to facilitate evidence triangulation and interpretation.

ASQ is openly available at https://asq.epigraphdb.org.

What we did

The ASQ platform was designed as a natural language interface to the EpiGraphDB biomedical knowledge graph. ASQ considers two primary evidence groups in EpiGraphDB:

Triple and literature evidence, comprising semantic triples derived from the biomedical literature
Association evidence, comprising results from genetic correlation, polygenic risk score association and Mendelian randomization

The user interface accepts free text entry (e.g. the abstract of a pre-print or journal article, the summary of a grant application, etc). We then use SemRep [1] to derive “claim triples” (Subject-PREDICATE-Object). The user then selects a triple of interest for analysis.

ASQ maps biomedical entities in the Subject and Object parts of the claim triple to biomedical entities in EpiGraphDB. The system then retrieves evidence from the two evidence categories (above) that link the Subject and Object.
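The flow described above (claim triple, entity mapping, evidence retrieval) can be illustrated with a toy sketch. This is not the ASQ implementation: the lookup tables, identifiers and the example claim below are all made up, and exact string matching stands in for the real entity harmonization step.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ClaimTriple:
    subject: str
    predicate: str
    object: str


# Hypothetical entity index standing in for entity harmonization
ENTITY_INDEX: Dict[str, str] = {
    "obesity": "entity:obesity",
    "asthma": "entity:asthma",
}

# Hypothetical evidence store keyed by (subject entity, object entity)
EVIDENCE: Dict[Tuple[str, str], List[str]] = {
    ("entity:obesity", "entity:asthma"): [
        "triple and literature evidence",
        "association evidence",
    ],
}


def retrieve_evidence(claim: ClaimTriple) -> List[str]:
    """Map the subject and object to entities, then look up linking evidence."""
    subject_id = ENTITY_INDEX.get(claim.subject.strip().lower())
    object_id = ENTITY_INDEX.get(claim.object.strip().lower())
    if subject_id is None or object_id is None:
        return []  # one of the terms could not be mapped to an entity
    return EVIDENCE.get((subject_id, object_id), [])


if __name__ == "__main__":
    claim = ClaimTriple("Obesity", "CAUSES", "Asthma")  # made-up claim
    print(retrieve_evidence(claim))
```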

Figure 1 provides an overview of the architecture of the platform.

Figure 1: Overall architecture of the EpiGraphDB-ASQ platform. Overall architecture design of the EpiGraphDB-ASQ platform and its associated components in the EpiGraphDB ecosystem. Left: EpiGraphDB’s biomedical entities (in the form of graph nodes) from different taxonomies are encoded into vector representations, which allows for fast information retrieval against the query of interest. Epidemiological evidence (in the form of graph edges) is incorporated into ASQ as harmonized evidence groups. Right: Internal processing workflow of the EpiGraphDB-ASQ platform in three stages: the claim parsing stage, the entity harmonization stage, and the evidence retrieval stage.

Paper

‘Triangulating evidence in health sciences with Annotated Semantic Queries’ by Yi Liu and Tom Gaunt in medRxiv.

Code availability

Source code for the ASQ platform and relevant analysis scripts can be found at https://github.com/mrcieu/epigraphdb-asq. A tutorial on programmatically accessing the ASQ platform can be found in this Jupyter notebook: https://github.com/MRCIEU/epigraphdb-asq/blob/master/analysis/notebooks/programmatic-access.ipynb.

References

[1] Kilicoglu, H., Rosemblat, G., Fiszman, M. & Shin, D. Broad-coverage biomedical relation extraction with SemRep. BMC bioinformatics 21, 1–28 (2020).

Systematic comparison of Mendelian randomization studies and randomized controlled trials using electronic databases

2022-04-16T00:00:00.000Z

Overview

Mendelian Randomization (MR) uses genetic instrumental variables to make causal inferences. Whilst sometimes referred to as “nature’s randomized trial”, it has distinct assumptions that make comparisons between the results of MR studies with those of actual randomized controlled trials (RCTs) invaluable.

In this medRxiv pre-print we mined the ClinicalTrials.gov, PubMed and EpiGraphDB databases and carried out a series of 26 manual literature comparisons among 54 MR and 77 RCT publications.

What we did

We downloaded all results for all trials within ClinicalTrials.gov, filtering them as illustrated in the figure. We used EpiGraphDB to collect information about drug-target associations and semantic triples associated with selected MR and RCT publications based on a comprehensive search of PubMed. We then mapped MR studies to the corresponding RCTs to evaluate consistency and disagreement between the results.

We found that only 11% of completed RCTs identified in ClinicalTrials.gov submitted their results to the database. Similarly low coverage was revealed for Semantic Medline (SemMedDB) semantic triples derived from MR and RCT publications – 25% and 12%, respectively.

Among intervention types that can be mimicked by MR, only trials of pharmaceutical interventions could be automatically matched to MR results due to insufficient annotation with MeSH ontology. A manual survey of the literature highlighted the potential for triangulation across a number of exposure/outcome pairs if similar challenges can be addressed. We conclude that careful triangulation of MR with RCT evidence should involve consideration of similarity of phenotypes across study designs, intervention intensity and duration, study population demography and health status, comparator group, intervention goal and quality of evidence.

Paper

‘Systematic comparison of Mendelian randomization studies and randomized controlled trials using electronic databases’ by Maria K. Sobczyk, George Davey Smith and Tom R. Gaunt in medRxiv.

Code and data availability

Code used to carry out the analysis is available on GitHub: https://github.com/marynias/mr-rct. ClinicalTrials.gov data was accessed via AACT: https://aact.ctti-clinicaltrials.org/snapshots and the analysed data subset is available in Supplementary Datasets 1 & 2. pQTL and eQTL MR analysis results are available via EpiGraphDB: https://epigraphdb.org/xqtl

Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets

2022-03-17T00:00:00.000Z

Overview

Molecular quantitative trait loci (molQTL), which can provide functional evidence on the mechanisms underlying phenotype-genotype associations, are increasingly used in drug target validation and safety assessment. In particular, protein abundance QTLs (pQTLs) and gene expression QTLs (eQTLs) are the most commonly used for this purpose. However, questions remain on how to best consolidate results from pQTLs and eQTLs for target validation.

What we did

In this bioRxiv pre-print we combined blood cell-derived eQTLs and plasma-derived pQTLs to form QTL pairs representing each gene and its product. We performed a series of enrichment analyses to identify features of QTL pairs that provide consistent evidence for drug targets based on the concordance of the direction of effect of the pQTL and eQTL. We repeated these analyses using eQTLs derived in 49 tissues.

We found that 25-30% of blood-cell derived QTL pairs have discordant effects. The difference in tissues of origin for molecular markers contributes to, but is not likely a major source of, this observed discordance. Finally, druggable genes were as likely to have discordant QTL pairs as concordant.
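To make the notion of concordance concrete, the sketch below labels a QTL pair as concordant when the eQTL and pQTL effects share the same sign, and computes the discordant fraction over a set of pairs. It assumes both effect estimates have already been aligned to the same effect allele; the class, field names and numbers are hypothetical and are not taken from the paper's code or data.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class QTLPair:
    gene: str
    eqtl_beta: float  # effect on gene expression (same effect allele as pQTL)
    pqtl_beta: float  # effect on protein abundance


def is_concordant(pair: QTLPair) -> bool:
    """True when both effects are non-zero and point in the same direction."""
    return pair.eqtl_beta * pair.pqtl_beta > 0


def discordant_fraction(pairs: List[QTLPair]) -> float:
    """Fraction of pairs whose eQTL and pQTL effects do not share a sign."""
    if not pairs:
        raise ValueError("at least one QTL pair is required")
    discordant = sum(1 for pair in pairs if not is_concordant(pair))
    return discordant / len(pairs)


if __name__ == "__main__":
    # Made-up effect sizes for illustration only
    demo = [
        QTLPair("GENE_A", 0.2, 0.1),
        QTLPair("GENE_B", -0.3, -0.5),
        QTLPair("GENE_C", 0.4, -0.1),
        QTLPair("GENE_D", 0.1, 0.3),
    ]
    print(discordant_fraction(demo))
```

Note that a pair with an effect of exactly zero is counted as not concordant here; a real analysis would more likely exclude such pairs or use a significance threshold.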

Our analyses suggest combining and consolidating evidence from pQTLs and eQTLs for drug target validation is crucial and should be done whenever possible, as many potential drug targets show discordance between the two molecular phenotypes that could be misleading if only one is considered. We also encourage investigating QTL tissue-specificity in target validation applications to help identify reasons for discordance and emphasise that concordance and discordance of QTL pairs across tissues are both informative in target validation.

Figure 1: Results for the enrichment analysis using the Drug-Gene Interaction database (DGIdb) for druggability-related terms. The bars show the percentage of concordant or discordant QTL pairs which were enriched for a given term. P values were calculated using Fisher’s exact test, and unadjusted and FDR-adjusted P values are shown for those terms which reached at least nominal significance.

Paper

‘Evaluating the potential benefits and pitfalls of combining protein and expression quantitative trait loci in evidencing drug targets’ by Jamie W Robinson, Thomas Battram, Denis A Baird, Philip C Haycock, Jie Zheng, Gibran Hemani, Chia-Yen Chen and Tom R Gaunt in bioRxiv.

Senior Research Associate / Research Fellow in Health Data Science

2022-01-12T00:00:00.000Z

The role:

We are seeking a talented postdoctoral scientist with expertise in biomedical data integration and analysis, data mining and causal inference. As the successful candidate you will join a vibrant interdisciplinary research environment in the MRC Integrative Epidemiology Unit, working within a programme that applies data mining approaches to epidemiological research questions (www.biocompute.org.uk). The post holder will be appointed at either Senior Research Associate (grade J) or Research Fellow (grade K) depending on their level of experience. If successful, you will have the opportunity to develop your own research portfolio within the programme, contribute to teaching and postgraduate training and will be supported in your career progression. Closing date: 6th Feb 2022

Online applications

What will you be doing?

The role will be focused on systematic analysis of knowledge graphs such as EpiGraphDB (epigraphdb.org) to identify risk factors, mechanistic pathways and potential intervention targets for human disease. You will be encouraged to co-develop new project ideas that align with the research objectives of the MRC IEU, in particular those focused on data mining, knowledge extraction from the literature, evidence triangulation and knowledge graph analysis.

You should apply if:

You have a strong computational and analytical background, with experience in programming, data science and the application of these skills in health, biomedical or biological research.
Ideally you will be familiar with knowledge graphs and natural language processing.
You will also have experience of applied statistical analysis and/or machine learning and will have a good understanding of causal inference using graphical causal models.

Further information and how to apply

Please read the job description for further information on the role and criteria. We ask you to provide evidence for how you meet these criteria in your application. If you would like more information please don’t hesitate to get in touch: Prof Tom Gaunt (Email)

You should apply using the online application system linked here through the green button below. (Applications that are not submitted to the online system by end of 31st October 2021 can’t be considered)

Online applications

Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease

2021-10-20T00:00:00.000Z

Overview

Most Mendelian randomization (MR) studies focus on European populations because of the wealth of genome-wide association study (GWAS) datasets available from European ancestry population samples, in contrast to other populations. However, new GWAS summary datasets from studies such as Biobank Japan, China Kadoorie Biobank and the Japan Kidney Biobank enable us to run ancestry-specific MR analyses to compare causal effects of risk factors across populations. This approach is important in the use of MR to inform public health priorities and interventions in other populations and sub-populations that have historically been under-represented in research.

What we did

In this paper in the International Journal of Epidemiology we used MR to estimate the causal effects of 45 risk factors in Europeans and 17 risk factors in East Asians (based on available genetic data) on chronic kidney disease (CKD) (Figure 1). CKD was defined by either clinical diagnosis or an estimated glomerular filtration rate (eGFR) of less than 60 ml/min per 1.73 m².

In collaboration with an international team of researchers representing a number of different population studies we were able to analyse samples from:

51,672 CKD cases and 958,102 controls of European ancestry: CKDGen, UK Biobank and HUNT
13,093 CKD cases and 238,118 controls of East Asian ancestry: Biobank Japan, China Kadoorie Biobank and Japan-Kidney-Biobank/ToMMo

In European ancestry samples we found evidence of causality for: body mass index (BMI), hypertension, systolic blood pressure, high-density lipoprotein cholesterol, apolipoprotein A-I, lipoprotein(a), type 2 diabetes (T2D) and nephrolithiasis.

In East Asians we were only able to analyse a subset of risk factors (due to availability of exposure genetic data for MR), but found evidence of causal effects for: BMI, T2D and nephrolithiasis.

Interestingly, in two independent analyses we observed evidence of a causal effect of hypertension on CKD risk in Europeans, but little evidence of an effect in East Asians.

Figure 1: Study design for the trans-ethnic MR study of CKD.

Paper

‘Trans-ethnic Mendelian-randomization study reveals causal relationships between cardiometabolic factors and chronic kidney disease’ by Jie Zheng, Yuemiao Zhang, Humaira Rasheed, Venexia Walker, Yuka Sugawara, Jiachen Li, Yue Leng, Benjamin Elsworth, Robyn E Wootton, Si Fang, Qian Yang, Stephen Burgess, Philip C Haycock, Maria Carolina Borges, Yoonsu Cho, Rebecca Carnegie, Amy Howell, Jamie Robinson, Laurent F Thomas, Ben Michael Brumpton, Kristian Hveem, Stein Hallan, Nora Franceschini, Andrew P Morris, Anna Köttgen, Cristian Pattaro, Matthias Wuttke, Masayuki Yamamoto, Naoki Kashihara, Masato Akiyama, Masahiro Kanai, Koichi Matsuda, Yoichiro Kamatani, Yukinori Okada, Robin Walters, Iona Y Millwood, Zhengming Chen, George Davey Smith, Sean Barbour, Canqing Yu, Bjorn Olav Asvold, Hong Zhang and Tom R Gaunt in International Journal of Epidemiology (2021).

Data availability

The GWAS summary statistics for CKD and eGFR that were generated using UK Biobank and CKDGen data are available from the MRC-IEU OpenGWAS database https://gwas.mrcieu.ac.uk/ and CKDGen website http://ckdgen.imbi.uni-freiburg.de/, respectively. Other datasets are available on request to the originating study.

EpiGraphDB platform version 1.0

2021-03-22T00:00:00.000Z

EpiGraphDB version 1.0

The EpiGraphDB platform has been updated with a new major release (version 1.0). This is the first release since version 0.3 in 2020 (what a year!) as well as since the publication of the journal article in Bioinformatics. We believe the underlying integration pipeline, data structure and architecture for the EpiGraphDB platform have now progressed to a sufficiently stable state, and we are pleased to announce this major release as version 1.0!

In the following sections we highlight a few key new features and changes in this update. For more detailed and technical changes, please visit the changelog in the platform documentation.

New and overhauled data sources

ClinVar

ClinVar is a public archive of reports of genetic variants and interpretations of their clinical relevance to disease. The variants are submitted by clinical testing laboratories, research laboratories, expert panels and other groups.

We import ClinVar data (extracted on 2021-01-12) as gene-disease associations, available as [GENE_TO_DISEASE] relationship in EpiGraphDB. The sources of information for the gene-disease relationship include OMIM, GeneReviews, and a limited amount of curation by NCBI staff.

Mapping between EBI GWAS Catalog GWAS traits and EFO terms

To complement existing semantic mapping between (Gwas) traits and (Efo) ontology terms (GWAS_NLP_EFO) we have added the official mapping from EBI GWAS Catalog (available as “ebi-a” studies in OpenGWAS) and EFO terms. Such mapping is available as [GWAS_EFO_EBI] in EpiGraphDB.
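As an example of how the new mapping might be queried, the sketch below composes a Cypher query string for a given relationship type. The node labels and relationship name come from the text above; the direction of the relationship (Gwas to Efo) and the helper function itself are assumptions, and the snippet only builds the query text without sending it anywhere.

```python
import re

# Only plain identifiers are allowed, since labels cannot be parameterised
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def relationship_query(
    source_label: str, rel_type: str, target_label: str, limit: int = 10
) -> str:
    """Compose a Cypher MATCH query for one relationship type."""
    for name in (source_label, rel_type, target_label):
        if not _IDENTIFIER.match(name):
            raise ValueError(f"Not a plain identifier: {name!r}")
    return (
        f"MATCH (s:{source_label})-[r:{rel_type}]->(t:{target_label}) "
        f"RETURN s, r, t LIMIT {int(limit)}"
    )


if __name__ == "__main__":
    # Assumed direction: (Gwas)-[GWAS_EFO_EBI]->(Efo)
    print(relationship_query("Gwas", "GWAS_EFO_EBI", "Efo", limit=5))
```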

MR-EvE

We have incorporated the latest MR-EvE evidence into EpiGraphDB. The MR-EvE evidence is represented as [MR_EVE_MR] in EpiGraphDB. With this update, [MR_EVE_MR] evidence has increased from 583,619 records to 25,804,945 records (for further details visit the metadata and metrics). For further examples regarding the MR-EvE evidence, take a look at the MR view on the EpiGraphDB WebUI and the confounder view as well as the underlying API endpoints in the EpiGraphDB API.

Reactome

The Reactome data source has been overhauled and simplified. We now make use of the protein and pathway data sets available to download here.

Literature derived evidence

In addition to the newer version of SemMedDB (semmedVER42_R) we used SemRep to create semantic triples from the MedRxiv and BioRxiv titles and abstracts. This resulted in renaming the literature nodes and relationships in the graph, e.g. instead of (SemmedTriple) we now have (LiteratureTriple), and instead of (SemmedTerm) we now have (LiteratureTerm), each with a _source property (see changelog). Relationships between the new nodes are named after the data source, e.g. [SEMMEDDB_OBJ], [BIORXIV_OBJ] and [MEDRXIV_OBJ] in place of [SEM_OBJ].
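For anyone with saved queries written against the old names, the renaming above can be applied mechanically. The sketch below is a hypothetical helper, not part of EpiGraphDB: it rewrites the old node labels to the new ones and assumes that an old [SEM_OBJ] relationship should become [SEMMEDDB_OBJ], i.e. that the query was targeting SemMedDB-derived triples. The example query and its relationship direction are also illustrative.

```python
import re

# Old name -> new name, following the renaming described above.
# Mapping SEM_OBJ to SEMMEDDB_OBJ is an assumption (SemMedDB-derived data).
RENAMES = {
    "SemmedTriple": "LiteratureTriple",
    "SemmedTerm": "LiteratureTerm",
    "SEM_OBJ": "SEMMEDDB_OBJ",
}

_PATTERN = re.compile(r"\b(" + "|".join(map(re.escape, RENAMES)) + r")\b")


def migrate_query(query: str) -> str:
    """Replace pre-1.0 literature node and relationship names in a Cypher query."""
    return _PATTERN.sub(lambda match: RENAMES[match.group(1)], query)


if __name__ == "__main__":
    old = "MATCH (t:SemmedTriple)-[:SEM_OBJ]->(o:SemmedTerm) RETURN t, o"
    print(migrate_query(old))
```

Queries that should also cover preprint-derived triples would need [BIORXIV_OBJ] or [MEDRXIV_OBJ] added by hand.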

Codebase

We have refactored our entire graph build pipeline to improve transparency, reliability and robustness. For this we use the neo4j-build-pipeline which uses defined schemas and tests to ensure the graph is consistent and clean. More details on this can be found in the following blog posts https://www.biocompute.org.uk/post/neo4j_data_integration/ and https://neo4j.com/blog/neo4j-data-integration-pipeline-using-snakemake-and-docker/.

In addition, the source code for the Graph, WebUI and API is now hosted on GitHub under the MRCIEU organisation. We plan to write a separate blog post on the technologies behind the EpiGraphDB platform in the near future.

For further information on the software side of the EpiGraphDB project (as well as other software projects developed in the MRC IEU) please visit MRC IEU’s GitHub Pages.

EpiGraphDB can be accessed and interacted with in the following ways:

The interactive Web UI
The API
Example Jupyter notebooks
The R package

For further details on the EpiGraphDB research project please read our journal article published in Bioinformatics as well as the platform documentation.

Please get in touch with the team via GitHub issues, on Twitter, or by email.

EpiGraphDB team

MendelVar: gene prioritization at GWAS loci using phenotypic enrichment of Mendelian disease genes

2021-01-01T00:00:00.000Z

Overview

Gene prioritization at human GWAS loci is challenging due to linkage-disequilibrium and long-range gene regulatory mechanisms. However, identifying the causal gene is crucial to enable identification of potential drug targets and better understanding of molecular mechanisms. Mapping GWAS traits to known phenotypically relevant Mendelian disease genes near a locus is a promising approach to gene prioritization. MendelVar is a novel web-based platform to support gene prioritization using data from Mendelian disease genes, variants identified in clinical genetics and data from disease ontologies.

The platform

MendelVar, presented in this Bioinformatics paper, provides a quick overview of the possible impact of Mendelian disease-related genes on a user’s complex phenotype of interest. It returns the details of all known broadly defined Mendelian diseases and their causal genes found in the custom genomic intervals, as well as overlapping pathogenic rare mutations responsible for Mendelian disease. Enrichment of Disease Ontology and Human Phenotype Ontology terms among the Mendelian genes gives the researcher an overview of any shared features with their trait of interest, e.g. in terms of anatomy.

Data sources

MendelVar uses all the confirmed gene-disease relationships featured in OMIM and complements it with three more specialist data sources for Mendelian disease: Orphanet (a database centred on rare, typically monogenic disease), expertly curated gene panels used for diagnostics from Genomics England PanelApp and results from the on-going Deciphering Developmental Disorders Study (made available in DECIPHER). MendelVar includes short disease descriptions sourced from OMIM, Orphanet, Uniprot and DO. In addition, MendelVar cross-references input genomic intervals against ClinVar.

Usage

MendelVar accepts a user-defined list of genomic intervals or a list of top SNPs. Top SNPs can be used to create genomic intervals in two ways in MendelVar: using pre-set basepair flanks or via generation of individual LD-based boundaries around each top SNP, identified either through dbSNP rsIDs or positional coordinates. The LD-based intervals can also optionally be extended to the nearest recombination hotspot. There is full support for hg19/GRCh37 and hg38/GRCh38 human genome builds for input. Six 1000 Genomes populations (EUR, CEU, AFR, AMR, EAS, SAS) are supported via LDlink for LD-based genomic interval generation.
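The simpler of the two interval routes, fixed basepair flanks around each top SNP, can be sketched as follows. This is an illustration of the idea rather than MendelVar's own code: the function and class names are made up, and 1-based inclusive coordinates are assumed.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 1-based, inclusive (assumed convention)
    end: int


def flank_intervals(
    snps: List[Tuple[str, int]], left: int, right: int
) -> List[Interval]:
    """Build one interval per SNP from fixed left and right basepair flanks."""
    intervals = []
    for chrom, position in snps:
        start = max(1, position - left)  # do not run off the chromosome start
        intervals.append(Interval(chrom, start, position + right))
    return intervals


if __name__ == "__main__":
    # Made-up positions for illustration
    top_snps = [("1", 1_000_000), ("2", 100)]
    for interval in flank_intervals(top_snps, left=500, right=500):
        print(interval)
```

The sketch does not clip the interval end to the chromosome length, which would require build-specific chromosome sizes.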

The web application returns a compressed file containing the results of the enrichment and annotation analyses. An online tutorial provides clear documentation on how to use the platform and how to interpret the results.

See Figure 1 for an illustration of the different user routes through MendelVar.

Figure 1: A flowchart demonstrating three possible user routes through MendelVar: (a) left: MendelVar generates fixed genomic intervals using preset left and right flanks against a user-submitted list of genomic positions; (b) centre: MendelVar generates flexible genomic intervals using the LD pattern in the region around each user-submitted position/variant rsID; (c) right: MendelVar accepts user-submitted genomic intervals. The genomic intervals generated or obtained from the user are subsequently intersected with coordinates for genes and variants known to cause Mendelian disease. Ontology terms associated with Mendelian disease in HPO and DO are propagated to causal genes and are tested for enrichment among target genes in the input genomic intervals. MendelVar also provides an option for enrichment testing with Gene Ontology and biological pathway databases.

Paper

‘MendelVar: gene prioritization at GWAS loci using phenotypic enrichment of Mendelian disease genes’ by M K Sobczyk, T R Gaunt and L Paternoster in Bioinformatics (2021).

Neo4J data integration pipeline

2020-11-17T00:00:00.000Z

Background

We’ve been using Neo4j for around five years in a variety of projects, sometimes as the main database (MELODI) and sometimes as part of a larger platform (OpenGWAS). We find creating queries with Cypher intuitive and query performance to be good. However, the integration of data into a graph is still a challenge, especially when combining large amounts of data from a variety of sources. Our latest project, EpiGraphDB, uses data from over 20 independent sources, most of which require cleaning and QC before they can be incorporated. In addition, each build of the graph needs to contain information on the versions of the data, the schema of the graph and so on.

Most tutorials and guides focus on post-graph analytics, not on how the graph was created. Often the process of bringing all the data together is overlooked or assumed to be straightforward. We are keen to provide access and transparency to the entire process. We designed this pipeline to help with our own projects, but believe it could be of use to others too.

Our data integration pipeline aims to create a working graph from raw data, whilst running checks on each data set and automating the build process. These checks include:

- Data profiling reports with pandas-profiling to help understand any issues with a data set
- Comparing each node and relationship property against a defined schema
- Merging overlapping node data into single node files

The data are formatted for use with the neo4j-import tool as this keeps build time for large graphs reasonable. By creating this pipeline, we can provide complete provenance of a project, from raw data to finished graph.
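As a rough illustration of the schema comparison check, the sketch below validates node records against a small schema held as a Python dictionary. The schema layout, the `check_rows` function and the example `Gene` node are assumptions made for this example; the format the pipeline actually uses is described in the repository documentation.

```python
# Hypothetical schema: node label -> {property name: expected Python type}.
SCHEMA = {
    "Gene": {"id": str, "name": str, "chr": str},
}


def check_rows(label, rows, schema):
    """Compare each row's properties against the schema for `label`.

    Returns a list of human-readable problems; an empty list means
    every row has exactly the expected properties with the right types.
    """
    expected = schema[label]
    problems = []
    for i, row in enumerate(rows):
        # Properties the schema requires but the row lacks.
        for prop in sorted(expected.keys() - row.keys()):
            problems.append(f"row {i}: missing property '{prop}'")
        # Properties present in the row but unknown to the schema.
        for prop in sorted(row.keys() - expected.keys()):
            problems.append(f"row {i}: unexpected property '{prop}'")
        # Type check for properties that are present and expected.
        for prop, typ in expected.items():
            if prop in row and not isinstance(row[prop], typ):
                got = type(row[prop]).__name__
                problems.append(
                    f"row {i}: property '{prop}' should be {typ.__name__}, got {got}"
                )
    return problems


rows = [
    {"id": "g1", "name": "A", "chr": "1"},
    {"id": "g2", "name": 7},  # missing 'chr', wrong type for 'name'
]
for problem in check_rows("Gene", rows, SCHEMA):
    print(problem)
```

Collecting every problem rather than stopping at the first one suits a build pipeline, since one report can then cover a whole data set.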

The pipeline

The code and documentation for the pipeline are here – https://github.com/elswob/neo4j-build-pipeline

Below is a figure representing how this might fit into a production environment, with the pipeline running on a development server and shared data on a storage server.

Setup

The project comes with a set of test data that can quickly be used to demonstrate the pipeline and create a basic graph. This requires only a few steps:

## clone the repo (use https if necessary)
git clone git@github.com:elswob/neo4j-build-pipeline.git
cd neo4j-build-pipeline
 
## create the conda environment
conda env create -f environment.yml
conda activate neo4j_build
 
## create a basic environment variable file for test data
## works ok for this test, but needs modifying for real use
cp example.env .env
 
## run the pipeline
snakemake -r all --cores 4

For a new project, the steps to create a graph from scratch are detailed here and proceed as follows:

Create a set of source data.
- These can be local to the graph or on an external server
- Scripts that created them should be added to the code base
Set up a local instance of the pipeline
Create a graph schema
Create processing scripts to read in raw data and modify to match schema
Test the build steps of individual or all data files and visualise data summary
Run the pipeline
1. Raw data are checked against schema and processed to produce clean node and relationship CSV and header files
2. Overlapping node data are merged
3. Neo4j graph is created using neo4j-admin import
4. Constraints and indices are added
5. Clean data are copied back to specified location
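Steps 1 to 3 of the pipeline run can be sketched in a few lines of Python: merge node records that share an identifier, then write the separate header and data files that neo4j-admin import reads. This is a simplified illustration and not the pipeline's own code. The function names, the "first value wins" merge policy and the choice to type every property as `string` are assumptions; the `:ID` header field follows the neo4j-admin import header convention.

```python
import csv
import io


def merge_nodes(*sources):
    """Merge node records that share the same 'id' across sources.

    Assumed policy for this sketch: the first value seen for a
    property is kept, and later sources only fill in missing ones.
    """
    merged = {}
    for source in sources:
        for node in source:
            record = merged.setdefault(node["id"], {})
            for key, value in node.items():
                record.setdefault(key, value)
    return list(merged.values())


def write_import_files(nodes, properties, header_out, data_out):
    """Write a header file and a data file for the import tool.

    The header names the ID column as 'id:ID' and types every other
    property as string (a simplification for this example).
    """
    header = ["id:ID"] + [f"{prop}:string" for prop in properties]
    header_out.write(",".join(header) + "\n")
    writer = csv.writer(data_out, lineterminator="\n")
    for node in nodes:
        writer.writerow([node["id"]] + [node.get(p, "") for p in properties])


# Two toy sources describing overlapping gene nodes.
source_a = [{"id": "g1", "name": "BRCA1"}]
source_b = [{"id": "g1", "chr": "17"}, {"id": "g2", "name": "TP53", "chr": "17"}]

nodes = merge_nodes(source_a, source_b)
header_file, data_file = io.StringIO(), io.StringIO()
write_import_files(nodes, ["name", "chr"], header_file, data_file)
print(header_file.getvalue(), end="")  # id:ID,name:string,chr:string
print(data_file.getvalue(), end="")
```

In a real build the two outputs would be files on disk passed to the import command, and relationship files would use `:START_ID`, `:END_ID` and `:TYPE` columns in the same way.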

Future plans

We think the work we have done here may be of interest to others. If anyone would like to get involved in this project we would love to collaborate and work together towards refining and publishing the method. Comments also welcome.

Code: https://github.com/elswob/neo4j-build-pipeline
Email: ben.elsworth@bristol.ac.uk
Twitter: @elswob

Reducing drug development costs

2020-11-08T00:00:00.000Z

Overview

This short animation explains how we use Mendelian randomization and colocalization to help prioritise drug targets. One of our aims in both programme 4 of the MRC IEU and the Integrative Cancer Epidemiology Programme is to integrate such prioritizations with other data to help inform drug development.

Video

About the animation

The animation is based on recent work by Dr Jie (Chris) Zheng, Vice-Chancellor's Fellow based in programme 4 of the MRC IEU. He recently published an innovative Mendelian randomization and colocalization study of plasma protein levels in Nature Genetics, which demonstrated how genetic data can be used to support drug target prioritisation by identifying the causal effects of proteins on diseases.

Using a set of genetic epidemiology approaches, including Mendelian randomization and genetic colocalization, we built a causal network of 1002 plasma proteins on 225 human diseases. In doing so, we identified 111 putatively causal effects of 65 proteins on 52 diseases, covering a wide range of disease areas. The results of this study are accessible via EpiGraphDB.

Visualising Brexit’s Impact on Food Safety in Britain

2020-10-06T00:00:00.000Z

Written by Marina Vabistsevits and Oliver Lloyd, researchers on PhD studentships linked to the “Data Mining Epidemiological Relationships” programme at the MRC IEU. Follow us on Twitter – @marina_vab, @PlotThiggins

Leaving the EU presents many unique challenges to Britain, among which is the crucial task of maintaining our high levels of food safety. As a submission to the Jean Golding Institute’s data visualisation competition, we briefly investigated the impacts that Brexit may have on British food supplies. The dataset used in this analysis was made available by the Food Standards Agency (FSA) as the focus of the competition, and all code used is freely available in our GitHub repository.

The Need for Information Recompense

In the first part of the analysis, we explored cases where food imported to Britain led to an alert being raised. The two biggest sources for such alerts were Britain’s internal alert systems (largely the FSA), and the EU’s Rapid Alert System for Food and Feed (RASFF).

Since Britain is on course to lose access to RASFF-supplied information once Brexit is finalised in early 2021, we created the visualisation below as a comparison of the FSA and the RASFF in terms of both the number of alerts raised and the corresponding food’s origin country for each alert.

Figure: Alerts from the EU Alert System

The arcs show the countries of origin of imports that raised alerts, and the yellow-red density map shows the recorded hazard alert frequency from those origins. Interactive versions of the two map instances can be found by following these links: RASFF, UK internal alerts.

Figure: Alerts from the UK Alert System

If the UK does indeed lose access to the RASFF, the loss of food hazards information about our own imports will be tremendous. The burden then falls on the FSA to develop and extend their alert system (which currently focuses very little on internationally supplied food) to bridge this information gap and ensure food safety for globally imported goods. As of the time of writing we are unsure what steps are being taken by the FSA, or the government at large, to address this issue.

Post-Brexit Shifts in Food Hazard Threats

As an extension of this work, we turned our attention to tariffs and the effect they might have on whom Britain chooses to import from. Upon leaving the EU the UK will have to negotiate new trade deals with both EU and non-EU countries. Since the cost of EU-produced food is expected to rise for Britain after Brexit, we may indeed see Britain importing more from outside of the union, which would naturally bring a shift in the make-up of food hazards that our alert systems will need to detect. Anticipating this shift will allow us to better mitigate the accompanying risk if it does begin to materialise.

To this end, we explored the differences in food hazard threats posed by EU vs non-EU suppliers of Britain’s largest class of imported food: fruits and vegetables. The plot below shows the relative change in frequency for each category of food hazard in the case that Britain switched from 100% EU imports of fruit and vegetables to 100% non-EU. The hazard categories that are likely to increase in non-EU imports are highlighted in red. Please note that this is the most extreme case possible and is unlikely to unfold to this extent in reality; this plot is therefore presented as a guide to the different food threats posed by EU vs non-EU imports.

Figure: Hazard alerts for fruits and vegetables: EU vs non-EU imports
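For readers who want to reproduce this kind of comparison, the sketch below shows one plausible way to compute a relative change per hazard category. It assumes that "frequency" means each category's share of all alerts within a supplier group, which may differ from the exact definition used for the plot. The category names and counts are invented toy values, not figures from the FSA dataset.

```python
def category_shares(counts):
    """Convert raw alert counts per category into shares of the total."""
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def relative_change(eu_counts, non_eu_counts):
    """Relative change in each category's share, moving from EU to non-EU.

    A value of 0.5 means the category's share is 50% higher among
    non-EU alerts; negative values mean it is lower. Categories with no
    EU alerts are skipped because the ratio is undefined for them.
    """
    eu = category_shares(eu_counts)
    non_eu = category_shares(non_eu_counts)
    return {
        category: (non_eu.get(category, 0.0) - share) / share
        for category, share in eu.items()
        if share > 0
    }


# Invented toy counts, for illustration only.
eu_alerts = {"hazard A": 20, "hazard B": 60, "hazard C": 20}
non_eu_alerts = {"hazard A": 50, "hazard B": 30, "hazard C": 20}

for category, change in relative_change(eu_alerts, non_eu_alerts).items():
    print(f"{category}: {change:+.0%}")
```

Categories with a positive value are the ones that would be highlighted as likely to increase under a switch to non-EU imports.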

Our full submission ‘Too Much Tooty in the Fruity: Keeping Food Safe in a Post-Brexit Britain’ can be found here, and includes a further breakdown of some of the categories of hazards displayed in the chart above. This work was awarded one of two joint runner-up prizes of the competition, tied with Angharad Stell’s Shiny app: ‘From a data space to knowledge discovery’. The winner of the competition was Robert Eyre, who produced this impressive visualization dashboard using D3. The Jean Golding Institute are hosting a showcase event on the 18th November, where all competition entries will be presented.

We would like to thank the JGI for hosting the competition, and our PhD supervisors, Prof. Tom Gaunt and Dr. Ben Elsworth, for encouraging us to enter.
